Incident Report · AI Systems · 2022–2025

Why Hands Broke

Five compounding failures held diffusion-model hand generation to a ~30% success rate in 2022. Recovery took 30 months and four distinct architectural interventions, reaching ~90% by 2025. [18]

Sparse Training Data · CLIP Counting Blindness · VAE Resolution · U-Net Locality · Mode Interpolation
30% 2022 baseline
~90% 2025 success rate
5 root causes
6 milestones
21 sources
AI-generated malformed hands — the inciting problem. Source: PetaPixel
Root Cause Analysis 5 causes · all compounding
📊 Critical
Sparse Training Data
Faces dominate internet photos — hands are small, peripheral, or obscured. LAION-5B captions almost never describe hand anatomy. "Man smiling in park" gives zero signal about finger count or pose. [1] [4]
MJ v5 hand-prioritized data + DALL-E 3 recaptioning (2023)
🔢 High
CLIP Counting Blindness
CLIP operates on continuous embeddings — no discrete cardinality mechanism. "Five fingers" produces a fuzzy vector, not a countable constraint. Early SD routinely generated ≥6 fingers. [7] [8]
Hand1000 gesture fusion, richer caption alignment (2024)
🔬 High
VAE Resolution Bottleneck
A 512×512 image is compressed 8× per side into a 64×64 latent. Four fingers fit into ~5–8 latent pixels. The VAE decoder cannot reconstruct what was never encoded, so fused blobs are the direct output. [15]
SDXL 1024×1024 native: ~40–60 pixels per finger (Jul 2023)
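The arithmetic behind this bottleneck can be sketched in a few lines. This is a minimal illustration that assumes the 8× per-side VAE downsampling described above; the function names and the 2% finger width are hypothetical values chosen only for the example, not measurements from the cited sources.

```python
def latent_side(image_side: int, downsample: int = 8) -> int:
    """Side length of the latent grid for a square image."""
    return image_side // downsample


def latent_cells_across(fraction_of_width: float, image_side: int, downsample: int = 8) -> float:
    """Number of latent cells spanned by a feature covering `fraction_of_width` of the image."""
    return fraction_of_width * image_side / downsample


# Hypothetical: a finger about 2% of the image width (assumption for illustration only).
finger_fraction = 0.02

for side in (512, 1024):
    grid = latent_side(side)
    cells = latent_cells_across(finger_fraction, side)
    print(f"{side}x{side} image -> {grid}x{grid} latent, finger ~{cells:.2f} cells wide")
```

Under these assumptions the same finger spans roughly 1.3 latent cells at 512×512 and roughly 2.6 at 1024×1024, which is why doubling the native resolution gives the encoder more room to keep adjacent digits separate.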
🏗️ High
U-Net Locality
U-Net's limited receptive field treats each finger region independently. Global constraints — all fingers attach to the same palm — are lost across successive downsampling layers. [12]
FLUX.1 MMDiT: full global attention every layer (Aug 2024)
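The locality argument can be made concrete with the standard receptive-field recurrence for stacked convolutions. This is a sketch for a convolution-only stack, which simplifies a real U-Net (Stable Diffusion's U-Net also contains attention blocks); the layer list is hypothetical.

```python
from typing import Iterable, Tuple


def receptive_field(layers: Iterable[Tuple[int, int]]) -> int:
    """Receptive field, in input pixels, of a stack of (kernel_size, stride) conv layers."""
    rf, jump = 1, 1  # jump = spacing between adjacent outputs, measured in input pixels
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Hypothetical encoder: 3x3 convs alternating stride 1 and stride 2 (three downsamples).
stack = [(3, 1), (3, 2), (3, 1), (3, 2), (3, 1), (3, 2)]
print(receptive_field(stack))
```

For this toy stack a single output unit sees a 29-pixel window, narrower than a 64×64 latent, so two fingers far apart in the latent never share a unit until deeper layers. A global attention layer, by contrast, relates every position to every other in one step.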
🌀 Medium
Mode Interpolation
Reverse diffusion interpolates between incompatible hand-pose modes, producing impossible joints. Counterintuitively, increasing sampling steps from 25 to 100 worsens counting accuracy: more refinement amplifies the error rather than correcting it. [7]
Pixel-level segmentation masks: −23% hallucination rate (2024)
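A one-dimensional toy makes the interpolation problem visible. The sketch below assumes a hypothetical pose parameter with two valid modes and shows that the midpoint between them carries almost no probability mass, so an output that averages the two modes is an implausible pose. The mode locations and spread are illustrative values, not taken from the cited work.

```python
import math


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def mixture_pdf(x: float, modes=(-2.0, 2.0), std: float = 0.5) -> float:
    """Equal-weight mixture of Gaussians centred on each valid pose mode."""
    return sum(gaussian_pdf(x, m, std) for m in modes) / len(modes)


at_mode = mixture_pdf(2.0)
at_midpoint = mixture_pdf(0.0)  # the "interpolated" pose halfway between the modes

print(f"density at a mode:   {at_mode:.4f}")
print(f"density at midpoint: {at_midpoint:.6f}")
print(f"mode / midpoint:     {at_mode / at_midpoint:.0f}x")
```

The midpoint is orders of magnitude less likely than either mode in this toy distribution, which is the sense in which a blend of two valid hand poses can itself be invalid.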
Remediation Timeline 6 milestones · 30 months
2022
Baseline
SD 1.4/1.5 · Midjourney v3/v4
Baseline
All five failure modes active simultaneously. 512×512 resolution, U-Net backbone, LAION-5B captions with near-zero hand annotation. Malformed hands are the norm — six fingers, merged digits, backwards joints appear in the majority of outputs. [1] [2]
~30%
Mar
2023
Midjourney v5
Data Curation
Explicitly prioritized training images with clearly visible hands; deprioritized obscured or peripheral examples. New neural architecture trained on Google Cloud's AI supercluster for ~5 months. The most impactful single step: the acceptable-generation rate more than doubled, from ~30% to ~65%. [14] [19]
~65%
Jul
2023
Stable Diffusion XL 1.0
Resolution + Dual Encoders
Native 1024×1024 (4× the pixel count of SD 1.5). At this scale, each finger gets ~40–60 latent pixels — enough for the VAE to encode and recover distinct digits. Dual text encoders (OpenCLIP ViT-bigG + CLIP ViT-L in parallel). U-Net grew 860M → 2.6B parameters. [11] [20]
~70%
Dec
2023
Midjourney v6 · DALL-E 3
Full Retrains + Caption Quality
MJ v6: trained from scratch over nine months, not a fine-tune of v5. DALL-E 3 attacked captions directly: a specialized captioner generated long, highly descriptive synthetic captions (training mix: 95% synthetic, 5% ground-truth captions) covering body-part details previously absent from LAION. [16] [10]
~75–80%
Aug
2024
FLUX.1 [dev] — Black Forest Labs
Transformer Architecture
Replaced U-Net with a Multimodal Diffusion Transformer (MMDiT). Transformers compute attention across the entire image in every layer — a finger at position A is explicitly related to the palm and every other finger simultaneously. The architectural fix most directly targeting U-Net locality. "The most anatomically correct humans, correct finger count in the vast majority of generations." [12] [13]
~85%
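What "attention across the entire image" means can be sketched with single-head self-attention in NumPy. This is a generic scaled dot-product attention layer over image patch tokens, not FLUX.1's actual MMDiT block; the token count, embedding width, and random weights are placeholders.

```python
import numpy as np


def self_attention(tokens: np.ndarray, wq: np.ndarray, wk: np.ndarray, wv: np.ndarray):
    """Single-head scaled dot-product self-attention over a (num_tokens, dim) array."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all tokens
    return weights @ v, weights


rng = np.random.default_rng(0)
num_tokens, dim = 16, 8  # e.g. a 4x4 grid of patches (illustrative sizes)
tokens = rng.normal(size=(num_tokens, dim))
wq, wk, wv = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

out, weights = self_attention(tokens, wq, wk, wv)
print(weights.shape)              # one weight per (query patch, key patch) pair
print(bool((weights > 0).all()))  # whether every patch draws on every other patch
```

The weight matrix has one entry for every pair of patches, so a fingertip token and a palm token exchange information inside a single layer rather than only after several downsampling stages.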
Apr
2025
Midjourney v7
Ground-Up Rebuild
CEO David Holz: "a totally different architecture" from v6. Third-party analysis: ~40% reduction in anatomical errors. Six-fingered hands moved from "common" to "occasional." [21] [17] [18]
~85–90%
Community Parallel Track No model update required
HandRefiner · ControlNet-based pipeline
ControlNet ⭐ 33.9k — depth-conditioned inpainting [9]
While model teams rebuilt architectures, the community built a parallel fix. ControlNet adds structural control signals — depth maps, pose maps, edge maps — without retraining the base model. HandRefiner extends this: Mesh Graphormer reconstructs 3D hand geometry, renders a depth map, then depth-conditioned ControlNet inpaints only the hand region. [5]

Key discovery: control strength ~0.55 is a "phase transition" point — below it, the signal adjusts structure; above it, texture degrades. This enabled training on synthetic depth data without texture artifacts. Segmentation-mask constraints (Joint Diffusion Model) brought a further ~23% reduction in counting hallucinations and ~84% drop in non-counting failures. [7] [5]
70%
user preference in blind survey · 50 participants
+10 FID
improvement on hand-only evaluation (lower FID is better)
0.55
optimal control strength — phase transition point
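The "inpaints only the hand region" step comes down to a masked composite: pixels inside the hand mask come from the repaired image, and everything else is copied from the original. The sketch below shows only that compositing step in NumPy, with a hypothetical bounding box and random arrays standing in for real images. It is not HandRefiner's code, and the mesh reconstruction and ControlNet inference are left out.

```python
import numpy as np


def box_mask(height: int, width: int, box: tuple) -> np.ndarray:
    """Binary mask of shape (H, W, 1) that is 1 inside box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    mask = np.zeros((height, width, 1), dtype=np.float32)
    mask[top:bottom, left:right] = 1.0
    return mask


def composite(original: np.ndarray, repaired: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep `original` outside the mask and take `repaired` inside it."""
    return mask * repaired + (1.0 - mask) * original


rng = np.random.default_rng(0)
original = rng.random((64, 64, 3), dtype=np.float32)  # stand-in for the generated image
repaired = rng.random((64, 64, 3), dtype=np.float32)  # stand-in for the inpainting output
hand_box = (40, 10, 60, 30)                            # hypothetical hand location

result = composite(original, repaired, box_mask(64, 64, hand_box))
```

Restricting the change to the mask is what lets this track fix hands without a model update: the rest of the image is untouched, so composition and style chosen by the base model survive.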
Status 2025–2026 4 open · 5 resolved
Remaining Open Issues
Open · Hard
Hand-Object Interactions
Gripping a textured object requires physical-contact reasoning beyond 2D statistics. [6]
Open · Hard
Multi-Hand Scenes
Two hands interacting multiply the 3D projection-ambiguity problem. No reliable solution yet.
Open · Moderate
Unusual Pose Angles
Wide-spread fingers from a low angle, or partial fists, hit the edges of training distribution.
Open · Moderate
In-Context Data Mismatch
Isolated-hand datasets don't generalize when hands hold objects in natural environments. [6]
Practical recommendation: Simple natural poses now generate reliably on FLUX.1 and MJ v7. For complex grip shots or multi-hand compositions, ControlNet conditioning or iterative inpainting remains the most reliable path until the next architectural leap.
2025 state of AI-generated hands — before/after comparison. Source: Vertu
Why Hands Specifically
Why generative AI gets hands, fingers, and teeth wrong — Decrypt
Hands aren't merely difficult — they expose every diffusion failure mode simultaneously. 27 degrees of freedom across 16 joints [5] make any finger-angle error immediately obvious. Constant occlusion in training images teaches the model "thumbs are sometimes absent" rather than "thumbs are hidden." [2] And the uncanny valley effect means we notice instantly — we evolved to read hands for social signaling. [4]
"These are 2D image generators that have absolutely no concept of the three-dimensional geometry of something like a hand." — Prof. Peter Bentley, UCL [3]