CaricHarmony: Contrastive Diffusion Paths for Identity-Preserving Caricature Synthesis

Abstract

Sketch-based caricature synthesis suffers from a fundamental failure mode: when identity and shape conditions are combined in a diffusion model, they interfere destructively, and generation collapses toward either bland portraits or unrecognizable distortions. We identify the root cause as condition signal contamination: the two conditions induce competing probability distributions along the denoising trajectory, which prevents balanced generation.

We present CaricHarmony, the first training-free method that explicitly resolves this contamination through parallel uncontaminated diffusion paths. During inference, we maintain three paths: 𝒫^i (pure identity), 𝒫^s (pure shape), and 𝒫^{i+s} (harmonized output). Novel energy functions operating on cross-attention features provide gradient guidance that steers 𝒫^{i+s} toward optimal balance: ℰ_shape ensures sketch fidelity through layout and semantic alignment, while ℰ_id employs token-level correspondence matching robust to extreme distortions.

Unlike DemoCaricature, which requires 70 seconds of per-identity fine-tuning, or CaricatureBooth, which is constrained to Bézier curves, CaricHarmony accepts any sketch format and generates in under 16 seconds. Experiments demonstrate state-of-the-art performance: a shape CLIP score of 0.8615 (vs. 0.8450) at a comparable identity consistency score, and an overall user preference score of 7.81 (vs. 6.06). Our method reconceptualizes the ID-shape conflict as conditioning signal contamination in diffusion models, enabling fine-grained creative control while preserving recognition.

Method

The core of CaricHarmony lies in resolving the ID-shape conflict through three parallel denoising trajectories. Instead of mixing contaminated signals, we maintain two separate reference paths (𝒫^i, conditioned only on identity, and 𝒫^s, conditioned only on shape) alongside the main path 𝒫^{i+s}, which is conditioned on both.
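The three-path idea can be illustrated with a toy numerical sketch. This is not the CaricHarmony implementation: `toy_denoise`, the averaged joint condition, the half-and-half split of the latent into "shape" and "identity" dimensions, and the step sizes are all hypothetical stand-ins chosen so that the example runs without a diffusion model. It only shows the control flow: three latents start from the same noise, the two reference paths are denoised independently, and the main path receives an extra gradient step that pulls it toward each reference on the features that reference is responsible for.

```python
import numpy as np

def toy_denoise(x, cond, rate=0.2):
    # Hypothetical stand-in for one conditioned denoising step:
    # move the latent a fixed fraction of the way toward its condition.
    return x + rate * (cond - x)

def run_three_paths(cond_id, cond_shape, num_steps=50, eta=0.5, seed=0, guided=True):
    rng = np.random.default_rng(seed)
    x_init = rng.standard_normal(cond_id.shape)
    # All three paths share the same initial noise.
    x_i, x_s, x_is = x_init.copy(), x_init.copy(), x_init.copy()

    # Toy model of a "contaminated" joint condition (assumption: a plain average).
    cond_both = 0.5 * (cond_id + cond_shape)

    # Assumption: first half of the latent carries shape, second half identity.
    half = cond_id.size // 2
    shape_mask = np.zeros_like(cond_id)
    shape_mask[:half] = 1.0
    id_mask = 1.0 - shape_mask

    for _ in range(num_steps):
        x_i = toy_denoise(x_i, cond_id)       # reference path, identity only
        x_s = toy_denoise(x_s, cond_shape)    # reference path, shape only
        x_is = toy_denoise(x_is, cond_both)   # main path, both conditions
        if guided:
            # Gradient of 0.5*||mask*(x_is - ref)||^2 for each reference path.
            grad = shape_mask * (x_is - x_s) + id_mask * (x_is - x_i)
            x_is = x_is - eta * grad
    return x_i, x_s, x_is
```

In this toy setting, running with `guided=False` leaves the main path at the averaged condition, while the guided run ends closer to the shape reference on the shape dimensions and closer to the identity reference on the identity dimensions.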

Specialized energy functions operate on intermediate cross-attention features to provide per-step gradient guidance. Shape alignment (ℰ_shape) ensures sketch fidelity through layout alignment (ℰ_layout) and semantic consistency (ℰ_sem). ID alignment (ℰ_id) employs novel token-level correspondence matching that adapts to arbitrary shape distortions. Timestep-constrained guidance activates each energy function at appropriate denoising stages for optimal coarse-to-fine generation.
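Timestep-constrained guidance can be sketched as a simple gating rule over the per-energy gradients. The window boundaries and weights below are hypothetical placeholders, not values from the paper; the ordering (layout first, identity last) is only one plausible reading of "coarse-to-fine". The gradients themselves are assumed to be computed elsewhere and passed in as arrays.

```python
import numpy as np

# Hypothetical activation windows over denoising progress in [0, 1).
# These numbers are illustrative assumptions, not the paper's settings.
WINDOWS = {"layout": (0.0, 0.4), "sem": (0.2, 0.7), "id": (0.4, 1.0)}
WEIGHTS = {"layout": 1.0, "sem": 1.0, "id": 1.0}

def active_energies(step, num_steps, windows=WINDOWS):
    """Names of the energies whose window contains the current progress."""
    progress = step / num_steps
    return [name for name, (start, end) in windows.items() if start <= progress < end]

def gated_guidance(step, num_steps, grads, weights=WEIGHTS, windows=WINDOWS):
    """Weighted sum of the gradients of the currently active energies.

    `grads` maps an energy name to its precomputed gradient array;
    the layout and semantic terms together play the role of E_shape.
    """
    total = np.zeros_like(next(iter(grads.values())))
    for name in active_energies(step, num_steps, windows):
        total = total + weights[name] * grads[name]
    return total
```

With these placeholder windows, early steps apply only the layout term, middle steps blend in the semantic and identity terms, and late steps apply the identity term alone.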


Inference pipeline of CaricHarmony. The output path 𝒫^{i+s} is jointly guided by two contrastive paths 𝒫^s and 𝒫^i. Shape alignment through ℰ_shape comprises ℰ_layout and ℰ_sem. ID alignment through ℰ_id uses token-level correspondence matching between cross-attention outputs.
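One way to picture token-level correspondence matching is as a nearest-neighbor search in feature space. The sketch below is an assumption about the general shape of such an energy, not the paper's definition of ℰ_id: the function name, the use of cosine similarity, and the "one minus best similarity" loss are all illustrative choices. Each token from the output path is matched to its most similar token from the identity path, regardless of spatial position, which is why a matching of this kind would not be disturbed when the sketch moves facial features around.

```python
import numpy as np

def token_correspondence_energy(feat_out, feat_id, eps=1e-8):
    """Hypothetical identity energy between two sets of token features.

    feat_out: (N, D) tokens from the output path.
    feat_id:  (M, D) tokens from the identity reference path.
    Returns the mean (1 - cosine similarity) over best matches, and the
    index of the matched identity token for each output token.
    """
    a = feat_out / (np.linalg.norm(feat_out, axis=1, keepdims=True) + eps)
    b = feat_id / (np.linalg.norm(feat_id, axis=1, keepdims=True) + eps)
    sim = a @ b.T                                   # (N, M) cosine similarities
    match = sim.argmax(axis=1)                      # best identity token per output token
    best = sim[np.arange(sim.shape[0]), match]
    return float(np.mean(1.0 - best)), match
```

Because matches are found by similarity rather than by position, shuffling the spatial order of the identity tokens leaves the energy unchanged in this sketch.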

Feature Comparison

Methods	Training-free	Controllable generation	Free-form conditioning
StyleCariGAN	✗	✗	✓
WarpGAN	✗	✗	✓
AutoToon	✗	✗	✓
DemoCaricature	✗	✓	✓
CaricatureBooth	✗	✓	✗
Ours	✓	✓	✓

CaricHarmony is the only method that is training-free, offers controllable generation, and accepts free-form sketch conditioning.

High-quality caricatures generated by CaricHarmony. Left: Results generated with the same identity and different sketches. Right: Results generated with the same sketch and different identities and specified styles.
