Incident Report · AI Systems · 2022–2025

Why Hands Broke

Five compounding failures drove diffusion model hand generation to a ~30% success rate. Recovery took 30 months and four distinct architectural interventions, reaching ~90% by 2025. ^[18]

Sparse Training Data · CLIP Counting Blindness · VAE Resolution · U-Net Locality · Mode Interpolation

30% 2022 baseline

~90% 2025 success rate

5 root causes

6 milestones

21 sources

AI-generated malformed hands — the inciting problem. Source: PetaPixel

Root Cause Analysis · 5 causes · all compounding

📊 Critical

Sparse Training Data

Faces dominate internet photos — hands are small, peripheral, or obscured. LAION-5B captions almost never describe hand anatomy. "Man smiling in park" gives zero signal about finger count or pose. ^[1] ^[4]

MJ v5 hand-prioritized data + DALL-E 3 recaptioning (2023)

🔢 High

CLIP Counting Blindness

CLIP operates on continuous embeddings and has no discrete mechanism for cardinality. "Five fingers" produces a fuzzy vector, not a countable constraint. Early Stable Diffusion releases routinely generated six or more fingers. ^[7] ^[8]

Hand1000 gesture fusion, richer caption alignment (2024)

🔬 High

VAE Resolution Bottleneck

A 512×512 image is compressed 8× per side into a 64×64 latent, so four fingers fit into ~5–8 latent pixels. The VAE decoder cannot reconstruct what was never encoded, and fused blobs are the direct result. ^[15]

SDXL 1024×1024 native: ~40–60 pixels per finger (Jul 2023)
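The arithmetic behind this bottleneck can be sketched in a few lines, assuming the 8× per-side VAE downsampling stated above. The 40-pixel hand width is a hypothetical example value chosen for illustration, not a figure from the sources.

```python
def latent_side(image_side: int, downsample: int = 8) -> int:
    """Side length of the latent grid for a square image."""
    return image_side // downsample


def latent_cells_across(region_px: int, downsample: int = 8) -> float:
    """Number of latent cells spanning a region that is region_px pixels wide."""
    return region_px / downsample


# SD 1.x: a 512x512 image maps to a 64x64 latent.
print(latent_side(512))          # 64
# SDXL: a 1024x1024 image maps to a 128x128 latent.
print(latent_side(1024))         # 128

# Hypothetical hand that is 40 px wide in a 512 px frame:
print(latent_cells_across(40))   # 5.0 latent cells shared by all fingers
# Same framing at 1024 px doubles the hand to 80 px:
print(latent_cells_across(80))   # 10.0 latent cells
```

Doubling the side length doubles the latent cells across a hand and quadruples the cells covering its area, which is why native resolution matters more than post-hoc upscaling here.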

🏗️ High

U-Net Locality

U-Net's limited receptive field treats each finger region independently. Global constraints — all fingers attach to the same palm — are lost across successive downsampling layers. ^[12]

FLUX.1 MMDiT: full global attention every layer (Aug 2024)
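To make the locality argument concrete, the sketch below computes the receptive field of a stack of convolutions using the standard recurrence. The layer list is a hypothetical pure-convolution encoder, not the actual Stable Diffusion U-Net configuration, which also interleaves attention blocks.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of stacked conv layers.

    layers: list of (kernel_size, stride) tuples, applied in order.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) * jump
        jump *= stride              # striding spreads later layers further apart
    return rf


# Three 3x3 stride-1 convolutions see a 7x7 window.
print(receptive_field([(3, 1)] * 3))   # 7

# Hypothetical stage: two 3x3 convs, then a stride-2 3x3 downsample, repeated 3 times.
stage = [(3, 1), (3, 1), (3, 2)]
print(receptive_field(stage * 3))      # 43
```

Under these assumptions, nine convolutional layers still cover only a 43-pixel window, so a constraint linking a fingertip to a distant palm has to survive several downsampling steps before any single unit can see both.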

🌀 Medium

Mode Interpolation

Reverse diffusion interpolates between incompatible hand-pose modes, producing impossible joints. Counterintuitively, increasing sampling steps from 25 to 100 worsens counting accuracy: the extra refinement amplifies the error instead of correcting it. ^[7]

Pixel-level segmentation masks: −23% hallucination rate (2024)
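A one-dimensional toy model illustrates why interpolating between modes is harmful. It assumes a hypothetical joint-flexion angle with two valid poses (open hand and fist) modeled as an equal-weight Gaussian mixture; the angles and spread are illustrative values, not measurements.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def mixture_density(x, modes, std):
    """Equal-weight mixture of Gaussians centered on each mode."""
    return sum(gaussian_pdf(x, m, std) for m in modes) / len(modes)


# Hypothetical flexion angles in degrees: open hand near 10, fist near 90.
modes = [10.0, 90.0]
std = 8.0

midpoint = sum(modes) / len(modes)                 # 50.0, the interpolated pose
at_mode = mixture_density(modes[0], modes, std)
at_midpoint = mixture_density(midpoint, modes, std)

print(midpoint)
print(at_midpoint / at_mode)   # far below 1: the blend is a pose the data almost never shows
```

The interpolated angle sits in a region with almost no probability mass, which is the toy analogue of a joint bent in a way no real hand exhibits.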

Remediation Timeline · 6 milestones · 30 months

2022
Baseline

SD 1.4/1.5 · Midjourney v3/v4

Baseline

All five failure modes active simultaneously. 512×512 resolution, U-Net backbone, LAION-5B captions with near-zero hand annotation. Malformed hands are the norm — six fingers, merged digits, backwards joints appear in the majority of outputs. ^[1] ^[2]

~30%

Mar
2023

Midjourney v5

Data Curation

Explicitly prioritized training images with clearly visible hands; deprioritized obscured or peripheral examples. New neural architecture trained on Google Cloud's AI supercluster for ~5 months. The most impactful single step: the acceptable-generation rate more than doubled, from ~30% to ~65%. ^[14] ^[19]

~65%
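Midjourney has not published its curation pipeline, so the following is only a sketch of what "prioritizing" visible-hand images could look like as weighted sampling. The `hand_visibility` score, the `floor` parameter, and the records themselves are all hypothetical.

```python
import random


def sampling_weights(records, floor=0.1):
    """Weight each image by its hand-visibility score, with a minimum floor
    so that images without visible hands are deprioritized, not dropped."""
    return [max(floor, r["hand_visibility"]) for r in records]


# Hypothetical dataset records; scores would come from some hand detector.
records = [
    {"id": "clear_hands", "hand_visibility": 0.9},
    {"id": "no_hands", "hand_visibility": 0.0},
    {"id": "partial_hands", "hand_visibility": 0.5},
]

weights = sampling_weights(records)
print(weights)   # [0.9, 0.1, 0.5]

rng = random.Random(0)   # fixed seed for a repeatable draw
batch = rng.choices(records, weights=weights, k=1000)

counts = {r["id"]: 0 for r in records}
for r in batch:
    counts[r["id"]] += 1
print(counts)
```

Keeping a nonzero floor is a design choice in this sketch: it shifts the training mix toward legible hands without removing the rest of the distribution.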

Jul
2023

Stable Diffusion XL 1.0

Resolution + Dual Encoders

Native 1024×1024 (4× the pixel count of SD 1.5). At this scale, each finger spans ~40–60 pixels, enough for the VAE to encode and recover distinct digits. Dual text encoders (OpenCLIP ViT-bigG + CLIP ViT-L) run in parallel. The U-Net grew from 860M to 2.6B parameters. ^[11] ^[20]

~70%

Dec
2023

Midjourney v6 · DALL-E 3

Full Retrains + Caption Quality

MJ v6 was trained from scratch over nine months; it is not a fine-tune of v5. DALL-E 3 attacked captions directly: a specialized image captioner generated long, highly descriptive synthetic captions covering body-part details previously absent from LAION, and training blended 95% synthetic captions with 5% original human-written captions. ^[16] ^[10]

~75–80%
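The 95/5 caption blend described above can be sketched as a per-sample coin flip. The function name, the seed, and both example captions are hypothetical; only the 95% ratio comes from the text.

```python
import random


def pick_caption(original, synthetic, rng, synthetic_ratio=0.95):
    """Return the synthetic caption with probability synthetic_ratio,
    otherwise fall back to the original human-written caption."""
    return synthetic if rng.random() < synthetic_ratio else original


rng = random.Random(0)   # fixed seed for a repeatable draw

original = "Man smiling in park"
# Hypothetical descriptive recaption of the same image.
synthetic = (
    "A man smiling in a park, his right hand raised in a wave "
    "with all five fingers spread and the palm facing the camera"
)

draws = [pick_caption(original, synthetic, rng) for _ in range(10_000)]
synthetic_share = draws.count(synthetic) / len(draws)
print(round(synthetic_share, 2))
```

Retaining a small share of original captions keeps the model from fitting only to the captioner's writing style, so short user prompts remain in distribution.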

Aug
2024

FLUX.1 [dev] — Black Forest Labs

Transformer Architecture

Replaced U-Net with a Multimodal Diffusion Transformer (MMDiT). Transformers compute attention across the entire image in every layer — a finger at position A is explicitly related to the palm and every other finger simultaneously. The architectural fix most directly targeting U-Net locality. "The most anatomically correct humans, correct finger count in the vast majority of generations." ^[12] ^[13]

~85%
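The global-attention property can be shown with a minimal single-head self-attention sketch. It assumes identity query/key projections and hypothetical 2-D patch features; a real MMDiT block uses learned projections, multiple heads, and joint text-image tokens.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(tokens):
    """Self-attention weights with identity projections (query = key = token)."""
    scale = 1.0 / math.sqrt(len(tokens[0]))
    weights = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    return weights


# Hypothetical features for six patches: one palm patch plus five finger patches.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4], [0.5, 0.5]]
w = attention_weights(tokens)

# Every patch assigns nonzero weight to every other patch in a single layer.
print(all(x > 0 for row in w for x in row))               # True
print(all(abs(sum(row) - 1.0) < 1e-9 for row in w))       # True
```

The weight matrix is dense, so each finger patch is related to the palm and to every other finger in one step, with no dependence on stacking layers to grow a receptive field.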

Apr
2025

Midjourney v7

Ground-Up Rebuild

CEO David Holz: "a totally different architecture" from v6. Third-party analysis: ~40% reduction in anatomical errors. Six-fingered hands moved from "common" to "occasional." ^[21] ^[17] ^[18]

~85–90%
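The timeline figures can be tabulated to compare the size of each step. This sketch uses the success-rate estimates listed above and, as an assumption, takes the midpoint wherever a range is given.

```python
# (milestone, low %, high %) taken from the timeline entries above.
milestones = [
    ("2022 baseline", 30, 30),
    ("Midjourney v5", 65, 65),
    ("Stable Diffusion XL 1.0", 70, 70),
    ("Midjourney v6 / DALL-E 3", 75, 80),
    ("FLUX.1 [dev]", 85, 85),
    ("Midjourney v7", 85, 90),
]


def midpoint(low, high):
    """Midpoint of a range; an assumption used to compare ranges with point estimates."""
    return (low + high) / 2


gains = []
previous = midpoint(*milestones[0][1:])
for name, low, high in milestones[1:]:
    current = midpoint(low, high)
    gains.append((name, current - previous))
    previous = current

for name, gain in gains:
    print(f"{name}: +{gain} points")

largest = max(gains, key=lambda g: g[1])
print(largest)   # ('Midjourney v5', 35.0)
```

On these numbers, the data-curation step accounts for more than half of the total improvement, with each later intervention adding single-digit gains.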

Community Parallel Track · No model update required

HandRefiner · ControlNet-based pipeline

ControlNet ⭐ 33.9k — depth-conditioned inpainting ^[9]

While model teams rebuilt architectures, the community built a parallel fix. ControlNet adds structural control signals — depth maps, pose maps, edge maps — without retraining the base model. HandRefiner extends this: Mesh Graphormer reconstructs 3D hand geometry, renders a depth map, then depth-conditioned ControlNet inpaints only the hand region. ^[5]
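The "inpaint only the hand region" step can be illustrated with a masked composite. This is a simplification: real inpainting applies the mask during denoising in latent space, and neither the mesh reconstruction nor ControlNet is reproduced here. The function names, the bounding box, and the tiny integer "images" are hypothetical stand-ins.

```python
def bbox_mask(height, width, box):
    """Binary mask that is 1 inside box = (top, left, bottom, right), else 0."""
    top, left, bottom, right = box
    return [
        [1 if top <= y < bottom and left <= x < right else 0 for x in range(width)]
        for y in range(height)
    ]


def composite(original, refined, mask):
    """Take refined pixels where the mask is 1 and original pixels elsewhere."""
    return [
        [r if m else o for o, r, m in zip(orow, rrow, mrow)]
        for orow, rrow, mrow in zip(original, refined, mask)
    ]


# 4x4 stand-ins: 0 = original pixel, 9 = pixel from the depth-conditioned pass.
original = [[0] * 4 for _ in range(4)]
refined = [[9] * 4 for _ in range(4)]

# Hypothetical hand bounding box derived from the reconstructed mesh.
mask = bbox_mask(4, 4, (1, 1, 3, 3))
result = composite(original, refined, mask)

for row in result:
    print(row)
# [0, 0, 0, 0]
# [0, 9, 9, 0]
# [0, 9, 9, 0]
# [0, 0, 0, 0]
```

Everything outside the mask is left untouched, which is why this approach can repair hands without altering the rest of a generated image.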

Key discovery: control strength ~0.55 is a "phase transition" point — below it, the signal adjusts structure; above it, texture degrades. This enabled training on synthetic depth data without texture artifacts. Segmentation-mask constraints (Joint Diffusion Model) brought a further ~23% reduction in counting hallucinations and ~84% drop in non-counting failures. ^[7] ^[5]
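Control strength acts as a scalar on the residuals that ControlNet adds to the base model's features. The sketch below shows that scaling with plain lists standing in for feature tensors; the function name and the numeric values are hypothetical, and only the 0.55 default comes from the text.

```python
import math


def inject_control(hidden, control_residual, strength=0.55):
    """Add the control branch's residual to the base features, scaled by strength."""
    return [h + strength * c for h, c in zip(hidden, control_residual)]


# Hypothetical base features and control residuals for three channels.
hidden = [1.0, 2.0, 3.0]
residual = [2.0, -2.0, 0.0]

# Strength 0 leaves the base model's features unchanged.
print(inject_control(hidden, residual, strength=0.0))   # [1.0, 2.0, 3.0]

# At the reported 0.55, the control signal shifts features by 55% of the residual.
blended = inject_control(hidden, residual)
print(blended)
```

Because the scaling is linear, the reported transition near 0.55 is a property of how the base model responds to the injected features, not of the injection itself.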

70%

user preference in blind survey · 50 participants

10 FID points

improvement on hand-only evaluation (lower FID is better)

0.55

optimal control strength — phase transition point

Status 2025–2026 · 4 open · 5 resolved

Remaining Open Issues

Open · Hard

Hand-Object Interactions

Gripping a textured object requires physical-contact reasoning beyond 2D statistics. ^[6]

Open · Hard

Multi-Hand Scenes

Two hands interacting multiply the 3D projection-ambiguity problem. No reliable solution yet.

Open · Moderate

Unusual Pose Angles

Wide-spread fingers from a low angle, or partial fists, hit the edges of training distribution.

Open · Moderate

In-Context Data Mismatch

Isolated-hand datasets don't generalize when hands hold objects in natural environments. ^[6]

2025 state of AI-generated hands — before/after comparison. Source: Vertu

Why Hands Specifically

Why generative AI gets hands, fingers, and teeth wrong — Decrypt

Hands aren't merely difficult — they expose every diffusion failure mode simultaneously. 27 degrees of freedom across 16 joints ^[5] make any finger-angle error immediately obvious. Constant occlusion in training images teaches the model "thumbs are sometimes absent" rather than "thumbs are hidden." ^[2] And the uncanny valley effect means we notice instantly — we evolved to read hands for social signaling. ^[4]
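The 27-degrees-of-freedom figure can be reproduced with one commonly used kinematic breakdown. The per-part split below is an assumption drawn from that convention (four per finger, five for the thumb, six for wrist rotation and translation); the source states only the total.

```python
# Assumed kinematic convention, not taken from the cited source:
# each finger has 3 flexion/extension DoF plus 1 abduction/adduction DoF,
# the thumb has 5, and the wrist contributes 6 (3 rotation + 3 translation).
HAND_DOF = {
    "index": 4,
    "middle": 4,
    "ring": 4,
    "little": 4,
    "thumb": 5,
    "wrist": 6,
}

total_dof = sum(HAND_DOF.values())
print(total_dof)   # 27
```

With this many independent parameters projected onto a 2D image, small errors in any one of them produce a visibly wrong pose.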

"These are 2D image generators that have absolutely no concept of the three-dimensional geometry of something like a hand." — Prof. Peter Bentley, UCL ^[3]
