While model teams rebuilt architectures, the community built a parallel fix.
ControlNet adds structural control signals — depth maps,
pose maps, edge maps — without retraining the base model.
HandRefiner extends this: Mesh Graphormer reconstructs 3D hand geometry,
renders a depth map, then depth-conditioned ControlNet inpaints only the hand region.
[5]
Key discovery: control strength
~0.55 is a "phase transition" point —
below it, the signal adjusts structure; above it, texture degrades.
This enabled training on synthetic depth data without texture artifacts.
Segmentation-mask constraints (Joint Diffusion Model) brought a further
~23% reduction in counting hallucinations and ~84% drop in non-counting failures.
[7]
[5]