Sketch-based caricature synthesis suffers from a fundamental failure mode: when identity
and shape conditions are combined in diffusion models, they create destructive interference
that causes inevitable collapse toward either bland portraits or unrecognizable distortions.
We identify the root cause as condition signal contamination – competing probability
distributions in the denoising trajectory that make balanced generation impossible.
We present CaricHarmony, the first training-free method that explicitly resolves this
contamination through parallel uncontaminated diffusion paths. During inference, we maintain
three paths: $\mathcal{P}_{i}$ (pure identity), $\mathcal{P}_{s}$ (pure shape), and
$\mathcal{P}_{i+s}$ (harmonized output). Novel energy functions operating on cross-attention
features provide gradient guidance that steers $\mathcal{P}_{i+s}$ toward optimal balance:
$\mathcal{E}_{\text{shape}}$ ensures sketch fidelity through layout and semantic alignment, while
$\mathcal{E}_{\text{id}}$ employs token-level correspondence matching robust to extreme distortions.
Unlike DemoCaricature, which requires 70 seconds of per-identity fine-tuning, or CaricatureBooth,
which is constrained to Bezier curves, CaricHarmony accepts any sketch format and generates
an image in under 16 seconds. Experiments demonstrate state-of-the-art performance: a shape
CLIP score of 0.8615 (vs. 0.8450) at a comparable identity consistency score, and an overall
user preference score of 7.81 (vs. 6.06). Our method reframes the ID-shape
conflict as conditioning signal contamination in diffusion models, enabling greater
creative control while preserving recognizability.
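
The three-path guidance idea can be illustrated with a toy sketch. This is not the CaricHarmony implementation: the latents are plain NumPy vectors, `toy_denoise` stands in for a diffusion denoising step, and the energy is a simple squared distance on latents rather than on cross-attention features. All function names, weights, and step sizes below are hypothetical and chosen only to show how gradients of two energies can steer a third path between two uncontaminated ones.

```python
import numpy as np

def toy_denoise(x, target, rate=0.1):
    """Stand-in for one denoising step (assumption): move x a fraction toward target."""
    return x + rate * (target - x)

def energy(x, ref):
    """Toy energy (assumption): half squared distance between two latents."""
    return 0.5 * float(np.sum((x - ref) ** 2))

def energy_grad(x, ref):
    """Gradient of the toy energy with respect to x."""
    return x - ref

def guided_sampling(id_target, shape_target, steps=50,
                    w_id=0.5, w_shape=0.5, step_size=0.2, seed=0):
    """Run three parallel toy paths; the harmonized path receives energy guidance."""
    rng = np.random.default_rng(seed)
    x0 = rng.standard_normal(id_target.shape)
    # All three paths start from the same noise.
    x_i, x_s, x_h = x0.copy(), x0.copy(), x0.copy()
    combined = 0.5 * (id_target + shape_target)  # toy "both conditions" target
    for _ in range(steps):
        x_i = toy_denoise(x_i, id_target)      # pure identity path
        x_s = toy_denoise(x_s, shape_target)   # pure shape path
        x_h = toy_denoise(x_h, combined)       # harmonized path, unguided step
        # Guidance: descend the weighted sum of the two energies.
        grad = w_id * energy_grad(x_h, x_i) + w_shape * energy_grad(x_h, x_s)
        x_h = x_h - step_size * grad
    return x_i, x_s, x_h
```

With equal weights, the harmonized latent in this toy setting settles midway between the two reference paths; changing `w_id` or `w_shape` shifts it toward one of them.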