⚡Diffusion model caching: TeaCache

Intro

If you’ve been following along, we’ve already covered vLLM-Omni and how diffusion models work. But here’s the dirty secret of diffusion models: they don’t run a single expensive computation, they run it many times per generation.
50 steps means 50 full forward passes through a multi-billion-parameter transformer. That’s a lot of GPU hours burned at scale for low throughput and slow user experience. Not exactly a profitable inference business. And the worst part? Everything is recomputed, even when almost nothing has changed. That’s where diffusion caching comes in.


Today, we’re exploring TeaCache, a deceptively simple insight that has become a de facto standard acceleration technique for diffusion pipelines, enabling up to 3× faster inference.

By the end, you’ll understand not just that TeaCache works, but when and why you can trust it with your inference.

I. Diffusion Refresher — What Matters for TeaCache

New to diffusion? Check out our diffusion models deep dive here for the full and detailed explanation. But before we crack open the TeaCache engine, we have to understand the chassis it’s bolted onto:

1. What is Reverse Diffusion? 🖼️

Think of it as un-blurring (denoising) a photo step by step. The forward process (adding noise) is used in training; the reverse process (removing noise) is what runs at inference:

  1. Start with noise (random static)
  2. Gradually refine it over many steps
  3. End up with a clear image/video

💡Each step makes tiny improvements to get closer to the final result.

2. Timesteps? 🧮

Timesteps are the step numbers of the denoising process. A countdown that tells the model exactly where it is in the journey from noise to image. In most models, we count backwards:

T = 100 Pure, heavy noise (the very beginning).
T = 50 Halfway there — shapes are emerging, noise is thinning.
T = 1 Almost finished — fine details and texture refinement.
T = 0 Final result (sharp image) — the denoising loop is complete.
💡 Each step is a full forward pass through the transformer, which is why generation is slow.

3. Text Embeddings 💬

Text prompts are encoded into embeddings that guide the denoising process at every step. Your prompt conditions the model throughout the entire generation.

4. Timestep embedding

A timestep embedding is a vector representation of the timestep that tells the model where it is in the denoising journey. At each timestep, it’s encoded alongside the image (latent) and text, then injected into every transformer block as a scale & shift modulation signal. This means the model behaves differently at each timestep.

Timestep embedding acts as a remote control

Key Concepts — Timestep & Modulation:

  • Timestep (t) → where we are in the denoising process
  • Timestep embedding → embedding vector of a timestep inside the model
  • Modulation → scale & shift applied to noisy input → modulated input
  • Result → same network, different behavior at each timestep
  • TeaCache measures how much Timestep Embedding Modulated Noisy Input changes between timesteps.
timestep → embedding → (scale & shift) → modulated input → transformer blocks → model output
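
To make the scale & shift idea concrete, here is a minimal pure-Python sketch. It assumes an adaLN-style modulation, x * (1 + scale) + shift, which is a common choice in diffusion transformers but may differ per model. The function name and the toy numbers are made up for illustration; real models derive scale and shift from the timestep embedding through learned layers.

```python
def modulate(x, scale, shift):
    """Apply an adaLN-style modulation element-wise: x * (1 + scale) + shift."""
    return [xi * (1.0 + s) + b for xi, s, b in zip(x, scale, shift)]


# Toy "noisy input" and two hypothetical (scale, shift) pairs, standing in for
# what a model would derive from the timestep embedding at two timesteps.
noisy_input = [0.5, -1.0, 2.0]
early = modulate(noisy_input, scale=[0.5, 0.5, 0.5], shift=[0.1, 0.1, 0.1])
late = modulate(noisy_input, scale=[0.0, 0.0, 0.0], shift=[0.0, 0.0, 0.0])

print(early)  # same input, modulated for one timestep
print(late)   # zero scale and shift leave the input unchanged
```

The same input produces different modulated values depending on the timestep-derived signal, and that modulated value is what TeaCache compares between steps.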

5. The Denoising Loop

The denoising loop is where all the work happens, and where TeaCache operates (skips redundant computations).

| Stage | What Happens | Time (%) | TeaCache |
|---|---|---|---|
| Pre-processing | Text encoding and prompt preparation before the iterative loop begins. | 5 – 10% | Inactive |
| Denoising Loop | Iterative refinement (50 → 0). This is where TeaCache lives. | 70 – 85% | Active |
| VAE Decoding | After all timesteps are processed, the VAE converts latents into the final video or image file. | 5 – 15% | Inactive |
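
A back-of-the-envelope way to read this breakdown: only the denoising loop gets faster, so the end-to-end speedup is bounded by the time spent outside it. The sketch below applies an Amdahl's-law style estimate. The function name and the example skip ratio are illustrative assumptions, and the estimate optimistically treats a cached step as free.

```python
def end_to_end_speedup(loop_fraction, skipped_fraction):
    """Estimate overall speedup when only the denoising loop is accelerated.

    loop_fraction: share of total time spent in the denoising loop (0..1).
    skipped_fraction: share of denoising steps served from cache (0..1).
    Assumes a cached step costs nothing, which is optimistic.
    """
    remaining = (1.0 - loop_fraction) + loop_fraction * (1.0 - skipped_fraction)
    return 1.0 / remaining


# Hypothetical case: the loop is 80% of the run and half of the steps hit the cache.
print(round(end_to_end_speedup(0.80, 0.50), 2))
```

Even with every second step skipped, the pre-processing and VAE stages keep the overall gain well below 2× in this estimate.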

II. What Is TeaCache⚡?

📄 Paper: “Timestep Embedding Tells: It’s Time to Cache for Video Diffusion Model”, Feng Liu et al., 2025

70–85% of every generation is spent inside transformer blocks. TeaCache is a training-free optimization designed to cut this “GPU Tax” by skipping redundant computation between similar denoising steps.

Model-agnostic by design, TeaCache works across Image, Video, and Audio diffusion models. The same core logic accelerates Wan2.1, CogVideoX, FLUX, and more, requiring only model-specific calibration and no retraining. TeaCache has now been integrated into engines like vLLM-Omni and ComfyUI.

📛 Traditional Diffusion Inference: lengthy generation time, computational redundancy (similar steps are recalculated), resource intensive, high energy consumption, limits real-time apps.
⚡ TeaCache Solution: analyzes timestep similarity, caches and reuses intermediate results, accelerates the inference process (⏱️ 1.3x–2.7x speedup).

Core Benefits

  • Blazing Speed: 1.3x–2.7x faster inference with minimal quality degradation
  • ✅ No retraining required: it’s “plug-and-play” and works on existing pretrained models (Wan2.1, FLUX, etc.)
  • ✅ Minimal memory overhead: caches residuals, not full states, and the cached residual is overwritten on every cache miss
  • ✅ Configurable trade-off: full control to adjust rel_l1_thresh to prioritize speed or quality
  • ✅ Broad applicability: video, image, and audio diffusion models
  • ✅ Simple, non-invasive implementation: monkey-patching of the forward pass, no model changes

Which Models Actually Use This?

| Modality | Task | Models |
|---|---|---|
| Video | Text-to-Video | Wan2.1, Cosmos, CogVideoX1.5, LTX-Video, Mochi, HunyuanVideo |
| Video | Image-to-Video | Wan2.1, Cosmos, CogVideoX1.5, ConsisID |
| Image | Text-to-Image | FLUX, Lumina-T2X, Lumina2, HiDream-I1 |
| Audio | Text-to-Audio | TangoFlux |

TeaCache works across these modalities because the polynomial coefficients are tuned separately for each model.

TeaCache vs. LMCache — Not the Same Thing

While both are caching technologies, they are fundamentally different systems designed for different types of AI.
LMCache remembers long conversations (KVCache). TeaCache skips redundant math between similar denoising steps.

| Dimension | 🟠LMCache | ⚡TeaCache |
|---|---|---|
| Target | Autoregressive LLMs | Diffusion transformers |
| What’s cached | KV pairs (grows with sequence) | Residuals (fixed-size latents) |
| Access pattern | Sequential token-by-token | Iterative refinement across timesteps |
| Cache across generations | Reusable indefinitely | One per generation, discarded after |
| Integration point | Attention layers | Diffusion forward pass |

Community Ecosystem

TeaCache ComfyUI nodes

TeaCache has achieved substantial community adoption across four categories:

  • ⚙️ Model Frameworks: FramePack, FastVideo, EasyAnimate, Ruyi-Models, ConsisID
  • 🖥️ UI Integrations: ComfyUI plugins for various models
  • 🚀 Engine Support: SD.Next, DiffSynth Studio, vLLM-Omni
  • Parallelism Tools: TeaCache-xDiT for multi-GPU inference

III. How TeaCache Works

1. Input Preparation Steps

Before the forward pass, three inputs are prepared: the noisy latent, the timestep embedding, and the text embedding. This preparation also produces TeaCache’s key signal: the noisy input modulated by the timestep embedding, which is what TeaCache compares between steps to decide whether the cache can be reused.

💡TeaCache insight:

When timestep embeddings between consecutive steps are similar, the outputs of the transformer computations will also be similar. Once quantified, this similarity is used by TeaCache to selectively skip expensive computations.

2. What is L1 Distance?

L1 distance is simply a metric for how different two things are. Think of comparing two photos: if they’re very similar, the L1 distance is small; if they’re very different, the distance is large. TeaCache uses a relative version of this to detect how much the timestep embedding changed between consecutive steps.

The Relative L1 Distance: formula

Current timestep embedding, previous modulated input → absolute difference → mean absolute diff ÷ mean absolute value → relative L1 distance → polynomial rescaling → add to accumulated distance

The subtraction compares the current timestep embedding signal with the previous timestep’s modulated input.

Relative L1 Distance = (t_emb - previous_modulated_input).abs().mean() / previous_modulated_input.abs().mean()
  1. Absolute Difference: (t_emb - previous_modulated_input).abs()
  2. Mean of Difference: .mean()
  3. Normalize by previous magnitude: / previous_modulated_input.abs().mean()
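
The three steps above can be sketched in plain Python on flat lists of floats. This is only an illustration of the arithmetic: real implementations run the same expression on tensors, and the function name here is hypothetical.

```python
def relative_l1_distance(current, previous):
    """mean(|current - previous|) / mean(|previous|) over flat lists of floats."""
    if len(current) != len(previous):
        raise ValueError("inputs must have the same length")
    mean_abs_diff = sum(abs(c - p) for c, p in zip(current, previous)) / len(previous)
    mean_abs_prev = sum(abs(p) for p in previous) / len(previous)
    return mean_abs_diff / mean_abs_prev


previous_modulated_input = [1.0, -2.0, 3.0, -4.0]
t_emb = [1.1, -2.1, 3.1, -4.1]  # every element moved by 0.1
print(relative_l1_distance(t_emb, previous_modulated_input))  # close to 0.04
```

Dividing by the previous magnitude is what makes the measure relative: a shift of 0.1 counts for less when the values themselves are large.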

Polynomial Rescaling: the Calibrator

The raw L1 distance varies wildly across models (e.g., 0.001 vs 1.0). TeaCache applies model-specific polynomial rescaling to normalize it into a ~[0, 1] scale, so that a single threshold scale supports reliable caching decisions across every model.

Figure: raw L1 distance space → polynomial rescaling → normalized space. Raw ranges: CogVideoX-2B [0.001, 0.1], CogVideoX-5B [0.0001, 0.01], Latte [0.01, 1.0]. After the model-specific polynomial function is applied, all models fall in ~[0, 1] with comparable magnitudes.
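
As a rough sketch of what the rescaling does, the snippet below evaluates a polynomial with Horner's rule, using the same highest-degree-first coefficient order as numpy.poly1d. The coefficients are the CogVideoX values quoted in this article's setup snippet; the function name and the sample raw distances are illustrative assumptions.

```python
def rescale(distance, coefficients):
    """Evaluate a polynomial with Horner's rule.

    Coefficients are ordered from highest degree to constant term,
    matching the convention of numpy.poly1d.
    """
    result = 0.0
    for c in coefficients:
        result = result * distance + c
    return result


# Coefficients quoted for CogVideoX in the setup snippet of this article.
cogvideox_coefficients = [-1.54e3, 8.43e2, -1.34e2, 7.97, -5.23e-2]

for raw in (0.01, 0.03, 0.05):
    print(raw, rescale(raw, cogvideox_coefficients))
```

Because the coefficients are fitted per model, the same raw distance can map to very different rescaled values from one model to the next.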

IV. The Caching Decision Loop

The Accumulated L1 Distance

Rather than comparing a single step’s distance against the threshold, TeaCache accumulates the rescaled distances across consecutive steps, allowing small changes to build up before triggering a recomputation.

The rel_l1 Threshold

The rel_l1_thresh parameter controls when recomputation is forced. When the accumulated distance exceeds it, TeaCache resets and runs a full forward pass.

Example Scenario: Threshold (δ) = 0.20
Step 1: L1 = 0.05 ➔ Accumulated L1 = 0.05 Cache Hit
Step 2: L1 = 0.04 ➔ Accumulated L1 = 0.09 Cache Hit
Step 3: L1 = 0.06 ➔ Accumulated L1 = 0.15 Cache Hit
Step 4: L1 = 0.07 ➔ Accumulated L1 = 0.22 Cache Miss (Reset)

Caching logic

  1. Modulate (The Setup): Compute the modulated input based on the current timestep.
  2. Measure (Compare Steps): Calculate the relative L1 distance between the current and the previous modulated input.
  3. Rescale (Adjust Score): Apply the model-specific polynomial to map the distance onto the ~[0, 1] scale.
  4. Accumulate (Track Error): Update the accumulated l1_distance since last reset.
  5. Decide (Cache/No_cache): Compare the accumulated total against the rel_l1_thresh parameter.
    If below the threshold (Cache Hit): Input changes are minimal. Reuse the cached residual (x + delta).
    If at or above the threshold (Cache Miss): Compute the full forward pass, update the cache, and reset accumulated_l1.
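
The accumulate-and-compare rule can be simulated in a few lines. This sketch assumes the distances are already rescaled and leaves out the first/last-step guardrails; the function name is hypothetical. It replays the example scenario with threshold 0.20.

```python
def simulate_teacache(rescaled_distances, rel_l1_thresh):
    """Return 'hit' or 'miss' per step, following the accumulate-and-compare rule."""
    decisions = []
    accumulated = 0.0
    for distance in rescaled_distances:
        accumulated += distance
        if accumulated < rel_l1_thresh:
            decisions.append("hit")   # reuse the cached residual
        else:
            decisions.append("miss")  # full forward pass, refresh the cache
            accumulated = 0.0         # reset the running total
    return decisions


# The example scenario from the text: threshold 0.20.
decisions = simulate_teacache([0.05, 0.04, 0.06, 0.07], rel_l1_thresh=0.20)
print(decisions)  # ['hit', 'hit', 'hit', 'miss']
```

Raising rel_l1_thresh lets more small changes pile up before a recompute, which is where the speed comes from and also where quality can start to drift.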

V. TeaCache Workflow (Complete Forward Pass)

The Guardrails: Retention Steps

TeaCache implements one crucial safety net to ensure the video quality doesn’t degrade. It uses Retention Steps:

  • Initial Steps: The early steps are always calculated fully to establish a baseline (var: ret_steps).
  • Final Steps: The last steps are always calculated fully to ensure the final result is sharp (var: cutoff_steps).
  • Middle Steps: Only the middle steps are subject to the threshold caching logic.
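
A small sketch of how the three windows partition a run. It assumes ret_steps is a count of protected initial steps and cutoff_steps is the index from which the final steps are protected; the helper name and the example values are hypothetical, and individual integrations may express these as fractions instead.

```python
def cacheable_steps(num_steps, ret_steps, cutoff_steps):
    """Return the 0-based step indices where the threshold logic may apply."""
    return [s for s in range(num_steps) if ret_steps <= s < cutoff_steps]


# Hypothetical 50-step run: first 5 and last 2 steps are always computed fully.
middle = cacheable_steps(num_steps=50, ret_steps=5, cutoff_steps=48)
print(middle[0], middle[-1], len(middle))  # 5 47 43
```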

DiT Forward Pass

Now when you hit “generate” and the denoising starts, here is the exact sequence of events playing out under the hood:

%%{init: {'theme': 'base', 'flowchart': {'htmlLabels': true}, 'themeVariables': { 'background': 'transparent', 'lineColor': '#888888', 'fontFamily': '-apple-system, BlinkMacSystemFont, sans-serif'}}}%%
flowchart TD
    %% Custom Styling to match the article's aesthetic
    classDef default fill:#ffffff,stroke:#d8d2ea,stroke-width:1.5px,color:#444,rx:6px,ry:6px;
    classDef decision fill:#f0f7ff,stroke:#1971c2,stroke-width:1.5px,color:#0366d6,rx:6px,ry:6px;
    classDef hit fill:#f4fce3,stroke:#2b8a3e,stroke-width:1.5px,color:#2b8a3e,rx:6px,ry:6px;
    classDef miss fill:#fff5f5,stroke:#c92a2a,stroke-width:1.5px,color:#c92a2a,rx:6px,ry:6px;
    classDef endpoint fill:#2C3E50,stroke:#2C3E50,stroke-width:1.5px,color:#ffffff,rx:6px,ry:6px;

    A["DiT Forward Pass"] --> B[Extract Timestep Embedding]
    B --> C[Compute Modulated Input]
    C --> D[teacache_forward]
    
    D --> E{"First or Last step?"}:::decision
    
    E -->|Yes| F[Always calculate fully]
    E -->|No| G[Calculate relative L1 distance]
    
    G --> H[Apply polynomial rescaling]
    H --> I[Update accumulated distance]
    
    I --> J{"Accumulated Distance < <b>rel_l1_thresh</b>"}:::decision
    
    %% The Green Cache Hit Path
    J -->|"Yes (Similar to previous)"| K["Reuse previous_residual<br/>x = x + prev_residual"]:::hit
    
    %% The Red Cache Miss Path
    J -->|"No (Different enough)"| L["Calculate full forward pass<br/>x = block(x, **kwargs)"]:::miss
    L --> M["Store new residual<br/>residual = x - ori_x"]:::miss
    M --> N["Reset accumulated distance<br/>Accu_L1 = 0"]:::miss
    
    %% Convergence
    F --> O[Proceed to next step]
    K --> O
    N --> O
    
    %% The Loop
    O --> P{"Is it the last step?"}:::decision
    P -->|No, loop again| B
    
    %% The End
    P -->|Yes| Q[VAE Decode latents to Video/Image]:::endpoint
    Q --> R([Return final media]):::endpoint

Classifier-Free Guidance (CFG)

Classifier-Free Guidance (CFG) improves prompt alignment by running the model twice per step:
1. Conditional pass (with prompt)
2. Unconditional pass (without prompt)

Since CFG runs two passes, TeaCache keeps separate residual caches for each path (conditional/unconditional):

| Cache Type | Used For | What It Tracks |
|---|---|---|
| Even Steps | Conditional passes | Accumulated distance, residuals |
| Odd Steps | Unconditional passes | Separate accumulated distance, residuals |
View CFG caching Workflow
flowchart LR
    subgraph subGraph0["Diffusion Step i"]
        STEP[Step i]
    end

    subgraph subGraph1["CFG Dual Forward Passes"]
        COND["Conditional Forward<br/>cnt % 2 == 0<br/>(even)"]
        UNCOND["Unconditional Forward<br/>cnt % 2 == 1<br/>(odd)"]
    end

    subgraph subGraph2["Separate State Tracking"]
        EVEN_STATE["accumulated_rel_l1_distance_even<br/>previous_e0_even<br/>previous_residual_even"]
        ODD_STATE["accumulated_rel_l1_distance_odd<br/>previous_e0_odd<br/>previous_residual_odd"]
    end

    subgraph subGraph3["CFG Combination"]
        COMBINE["noise_pred = uncond + guide_scale * (cond - uncond)"]
    end

    STEP --> COND
    STEP --> UNCOND
    COND --> EVEN_STATE
    UNCOND --> ODD_STATE
    EVEN_STATE --> COMBINE
    ODD_STATE --> COMBINE
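
The even/odd split can be sketched with two independent state objects selected by call parity, followed by the CFG combination formula from the diagram. Class, field, and function names are illustrative, not the names used by any particular implementation.

```python
class BranchState:
    """Per-branch TeaCache bookkeeping (field names are illustrative)."""

    def __init__(self):
        self.accumulated_distance = 0.0
        self.previous_residual = None


# Index 0: conditional (even calls), index 1: unconditional (odd calls).
states = [BranchState(), BranchState()]


def state_for_call(cnt):
    """Pick the state by call parity, mirroring the cnt % 2 rule."""
    return states[cnt % 2]


def cfg_combine(cond, uncond, guide_scale):
    """noise_pred = uncond + guide_scale * (cond - uncond), element-wise."""
    return [u + guide_scale * (c - u) for c, u in zip(cond, uncond)]


state_for_call(0).accumulated_distance += 0.05  # conditional pass of step 0
state_for_call(1).accumulated_distance += 0.02  # unconditional pass of step 0
print(states[0].accumulated_distance, states[1].accumulated_distance)
print(cfg_combine(cond=[1.0, 2.0], uncond=[0.5, 1.0], guide_scale=2.0))
```

Keeping the two branches apart matters because the conditional and unconditional passes see different inputs, so a residual cached for one is not a valid shortcut for the other.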

VI. Integration & Implementation

TeaCache uses monkey-patching — injecting caching logic directly into existing models without touching the original source code. Three steps: define the logic, set the config, overwrite the forward pass.

The full integration requires just a few lines. No model modifications, no retraining:

# Full TeaCache setup for CogVideoX
pipe.transformer.__class__.enable_teacache = True
pipe.transformer.__class__.rel_l1_thresh = 0.15  # Balance speed/quality
pipe.transformer.__class__.cnt = 0
pipe.transformer.__class__.num_steps = num_inference_steps
pipe.transformer.__class__.accumulated_rel_l1_distance = 0
pipe.transformer.__class__.previous_modulated_input = None
pipe.transformer.__class__.previous_residual = None
pipe.transformer.__class__.coefficients = [-1.54e3, 8.43e2, -1.34e2, 7.97, -5.23e-2]
pipe.transformer.__class__.forward = teacache_forward

You can find the full code per model in the TeaCache repo (e.g. TeaCache4CogVideoX1.5).

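# STEP 1: Define the replacement forward pass
import numpy as np

def teacache_forward(self, hidden_states, timestep, **kwargs):
    # Timestep embedding used as the caching signal. How it is computed is
    # model-specific; the attribute names below are illustrative.
    emb = self.time_embedding(self.time_proj(timestep))
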
    # Guardrail: Always calculate the first and last steps fully
    if self.cnt == 0 or self.cnt == self.num_steps - 1:
        should_calc = True
        self.accumulated_rel_l1_distance = 0
    else: 
        # Calculate distance and apply polynomial rescaling
        rescale_func = np.poly1d(self.coefficients)
        distance = ((emb - self.previous_modulated_input).abs().mean() 
                   / self.previous_modulated_input.abs().mean()).item()
        self.accumulated_rel_l1_distance += rescale_func(distance)
        
        # Threshold decision: evaluate if we should compute or cache
        should_calc = self.accumulated_rel_l1_distance >= self.rel_l1_thresh

    # Execute full pass or use cache
    if should_calc:
        ori_x = hidden_states.clone()  # Save original input
        
        # Run the heavy transformer blocks
        for block in self.blocks:  
            hidden_states = block(hidden_states, **kwargs)  
            
        # Cache the mathematical difference (residual)
        self.previous_residual = hidden_states - ori_x
        self.accumulated_rel_l1_distance = 0  # Reset the running tab
    else:
        # CACHE HIT: Skip transformer blocks and apply the saved residual
        # (out-of-place add, so the caller's tensor is not mutated)
        hidden_states = hidden_states + self.previous_residual

    # Update state variables for the next timestep
    self.previous_modulated_input = emb
    self.cnt += 1
    if self.cnt == self.num_steps:
        self.cnt = 0  # Reset so the next generation starts from step 0
    
    return hidden_states
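
To see the monkey-patching pattern in isolation, here is a toy version with no ML dependencies. ToyTransformer and cached_forward are stand-ins invented for this sketch, and the cache is deliberately naive (it always reuses the last output); only the three-step pattern of defining the logic, setting config on the class, and overwriting forward mirrors the approach described above.

```python
class ToyTransformer:
    """Stand-in for a diffusion transformer; not a real model."""

    def forward(self, x):
        return x * 2  # pretend this is the expensive computation


def cached_forward(self, x):
    """Replacement forward: reuse the last output when caching is enabled."""
    if self.enable_cache and self.cached_output is not None:
        return self.cached_output
    self.cached_output = self._original_forward(x)
    return self.cached_output


# 1) keep a handle on the original, 2) set config on the class, 3) overwrite forward.
ToyTransformer._original_forward = ToyTransformer.forward
ToyTransformer.enable_cache = True
ToyTransformer.cached_output = None
ToyTransformer.forward = cached_forward

model = ToyTransformer()
print(model.forward(3))  # computed by the original forward
print(model.forward(4))  # served from the toy cache
```

Note that patching the class affects every instance of it, which is convenient for a single pipeline but worth keeping in mind if several models share a process.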

Multi-GPU Parallelism (xDiT)

TeaCache-xDiT implements multi-GPU parallelization of TeaCache within the xDiT framework. Each GPU maintains its own caching decisions, coordinated through a synchronization layer before the distributed transformer model:

View Multi-GPU TeaCache-xDiT
flowchart LR
    subgraph Multi["Multi-GPU TeaCache-xDiT"]
        G0[GPU 0] --> C[TeaCache Logic]
        G1[GPU 1] --> C
        GN[GPU N] --> C
        C --> S[Sync Decisions]
        S --> M[Distributed Model]
    end
      

VII. Production Serving with vLLM-Omni

Best news? vLLM-Omni has TeaCache natively built in as a hook-based cache backend, and you can enable it with a single flag.

Enable TeaCache by setting cache_backend to "tea_cache":

from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

# Simple configuration - model_type is automatically extracted from pipeline.__class__.__name__
omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="tea_cache",
    cache_config={
        "rel_l1_thresh": 0.2  # Optional, defaults to 0.2
    }
)
outputs = omni.generate(
    "A cat sitting on a windowsill",
    OmniDiffusionSamplingParams(
        num_inference_steps=50,
    ),
)

Enable TeaCache for online serving by passing --cache-backend tea_cache when starting the server:

vllm serve Qwen/Qwen-Image --omni --port 8091 \
  --cache-backend tea_cache \
  --cache-config '{"rel_l1_thresh": 0.2}'
export DIFFUSION_CACHE_BACKEND=tea_cache

Performance Tuning:
Start with rel_l1_thresh=0.2 and adjust based on your needs: raise it for more speed, lower it for better quality.

Supported Architectures
| Architecture | Model Family | HF Model |
|---|---|---|
| QwenImagePipeline | Qwen-Image | Qwen/Qwen-Image |
| QwenImageEditPipeline | Qwen-Image-Edit | Qwen/Qwen-Image-Edit |
| QwenImageEditPlusPipeline | Qwen-Image-Edit-2509 | Qwen/Qwen-Image-Edit-2509 |
| QwenImageLayeredPipeline | Qwen-Image-Layered | Qwen/Qwen-Image-Layered |
| BagelForConditionalGen | BAGEL (DiT-only) | ByteDance-Seed/BAGEL-7B-MoT |

VIII. Performance & Evals

TeaCache is a speed–quality dial for accelerating diffusion inference across video, image, and audio models.

The Quality–Speed Trade-off

The core tradeoff is simple: higher threshold = faster speed, lower quality. The curve below shows where the cliff is:

CogVideoX1.5/ConsisID speed x quality ratio

Note:
Image models consistently use higher thresholds (0.17-0.5) than video models (0.05-0.3), which are more sensitive to caching error.

The Threshold Cheat Sheet

Baseline ranges for Video, Image, and Audio diffusion models.

| Modality | Conservative | Balanced | Aggressive | Speedup |
|---|---|---|---|---|
| Video | 0.05 – 0.1 | 0.15 – 0.2 | 0.25 – 0.3 | 1.5× – 2.1× |
| Image | 0.1 – 0.2 | 0.3 – 0.5 | 0.6 – 1.0+ | 1.3× – 2.5× |
| Audio | 0.05 | 0.1 | 0.2 – 0.4 | 1.2× – 1.8× |

Recommended settings for Production, Iteration, and Rapid Preview workflows.

| Use Case | Threshold | Speedup | Quality Impact | Best For |
|---|---|---|---|---|
| Production | 0.1 – 0.15 | 1.3× – 1.6× | Minimal | Client-facing apps, final renders, premium content. |
| Content Workflows | 0.2 – 0.25 | 1.8× – 2.2× | Slight | Artist iteration, concept development, A/B testing. |
| Rapid Prototyping | 0.3 – 0.5 | 2.1× – 2.7× | Noticeable | Quick previews, parameter exploration, dev testing. |
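
One way to carry these ranges into configuration is a small lookup helper. The dictionary keys, the function, and the aggressiveness knob are all hypothetical conveniences for this sketch; only the numeric ranges come from the use-case recommendations above.

```python
# Threshold ranges copied from the use-case recommendations; keys are illustrative.
THRESHOLDS = {
    "production": (0.10, 0.15),
    "content_workflows": (0.20, 0.25),
    "rapid_prototyping": (0.30, 0.50),
}


def pick_threshold(use_case, aggressiveness=0.5):
    """Interpolate inside the range: 0.0 = low end, 1.0 = high end."""
    if not 0.0 <= aggressiveness <= 1.0:
        raise ValueError("aggressiveness must be between 0 and 1")
    low, high = THRESHOLDS[use_case]
    return low + aggressiveness * (high - low)


cache_config = {"rel_l1_thresh": pick_threshold("production", 0.0)}
print(cache_config)  # {'rel_l1_thresh': 0.1}
```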

Recommended configs for FLUX, Wan, Hunyuan, and more.

| Model Family | rel_l1_thresh | ret_steps (fraction) | cutoff_steps (fraction) | Max Speedup |
|---|---|---|---|---|
| FLUX.1 (Std/PuLID) | 0.40 | 0.0 | 1.0 | ~2.0× |
| Lumina-Image-2.0 | 0.38 | 0.2 | 1.0 | ~1.7× |
| HiDream-I1 (Full) | 0.35 | 0.1 | 1.0 | ~2.0× |
| Wan2.1 T2V (1.3B) | 0.08 – 0.15 | 0.0 – 0.1 | 1.0 | ~1.6× – 2.2× |
| Wan2.1 T2V (14B) | 0.20 | 0.0 | 1.0 | ~1.8× |
| Wan2.1 I2V (14B-720P) | 0.30 | 0.1 | 1.0 | ~2.0× |
| Wan2.1 T2V (14B) Ret-Mode | 0.20 | 0.1 | 1.0 | ~2.1× |
| HunyuanVideo (13B) | 0.15 | 0.0 | 1.0 | ~1.9× |
| CogVideoX 1.5 (5B) | 0.30 | 0.0 | 1.0 | ~2.0× |
| LTX-Video | 0.06 | 0.0 | 1.0 | ~1.7× |

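start_percent → where caching starts; the steps before it are the fully computed baseline steps (var: ret_steps).
end_percent → where caching ends; the steps after it are the fully computed final steps (var: cutoff_steps).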

💡
Creator tip

Use aggressive thresholds for rapid iteration. Once you find the right composition, replay without the cache for a pixel-perfect render.

Testing & Evaluation

All speedups are measured against the model’s native (uncached) inference time. Quality is evaluated using:

  • VBench → temporal quality (video consistency)
  • PSNR / SSIM → pixel-level similarity to uncached output
  • LPIPS → perceptual similarity
  • MOS → human visual quality (when applicable)
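
Of these metrics, PSNR is the simplest to reproduce by hand. The sketch below uses the standard definition, 10 * log10(MAX^2 / MSE), on flat lists of pixel values in [0, 1]. The sample values are made up, and a real evaluation would run over full frames with an image library.

```python
import math


def psnr(reference, candidate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat images."""
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same length")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical outputs
    return 10.0 * math.log10(max_value ** 2 / mse)


uncached = [0.0, 0.5, 1.0, 0.25]
cached = [0.0, 0.5, 0.9, 0.25]  # one pixel differs by 0.1
print(round(psnr(uncached, cached), 2))
```

Higher is better here: the closer the cached output is to the uncached one, the smaller the mean squared error and the larger the PSNR.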

For full scripts and setup, see the TeaCache evaluation repository.

Conclusion

That’s a wrap — TeaCache, end to end.
If you made it here, bravo. You now understand what most people miss when they first hear about diffusion caching: it’s not magic, it’s not a KV cache, and it’s not skipping steps. It’s measuring how much the model’s input changes between steps and avoiding recomputation when it barely does. With the explosion of image and video generation, caching is becoming vital for cost-effective production diffusion systems.

Thanks again for reading this piece and both Part 1 and Part 2!

