
Intro
If you’ve been following along, we’ve already covered vLLM-Omni and how diffusion models work. But here’s the dirty secret of diffusion models: they don’t run a single expensive computation, they run it many times per generation.
50 steps means 50 full forward passes through a multi-billion-parameter transformer. That’s a lot of GPU hours burned at scale for low throughput and slow user experience. Not exactly a profitable inference business. And the worst part? Everything is recomputed, even when almost nothing has changed. That’s where diffusion caching comes in.
Today, we’re exploring TeaCache, a deceptively simple insight that has become a de facto standard acceleration technique for diffusion pipelines, enabling up to 3× faster inference.
I. Diffusion Refresher — What Matters for TeaCache
New to diffusion? Check out our diffusion models deep dive here for the full and detailed explanation. But before we crack open the TeaCache engine, we have to understand the chassis it’s bolted onto:
1. What is Reverse Diffusion? 🖼️
Think of it as un-blurring (denoising) a photo step by step. Forward diffusion adds noise and is used during training; reverse diffusion removes it and is what runs at inference:

- Start with noise (random static)
- Gradually refine it over many steps
- End up with a clear image/video
💡Each step makes tiny improvements to get closer to the final result.
2. Timesteps? 🧮
Timesteps are the step numbers of the denoising process: a countdown that tells the model exactly where it is in the journey from noise to image. In most models, we count backwards, from the noisiest step down to the cleanest.
3. Text Embeddings 💬
Text prompts are encoded into embeddings that guide the denoising process at every step. Your prompt conditions the model throughout the entire generation.
4. Timestep embedding
A timestep embedding is a vector representation of the timestep that tells the model where it is in the denoising journey. At each timestep, it’s encoded alongside the image latent and the text, then injected into every transformer block as a scale & shift modulation signal. This means the model behaves differently at each timestep.
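To make the scale & shift idea concrete, here is a minimal NumPy sketch. It assumes a sinusoidal timestep embedding and two small random projection matrices (`w_scale`, `w_shift`); these names and sizes are illustrative only, and real DiT models use learned MLPs with model-specific shapes.

```python
import numpy as np

def sinusoidal_embedding(t: float, dim: int = 8) -> np.ndarray:
    """Map a scalar timestep to a vector of sines and cosines."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def modulate(x: np.ndarray, t_emb: np.ndarray,
             w_scale: np.ndarray, w_shift: np.ndarray) -> np.ndarray:
    """Apply a timestep-dependent scale and shift to the input tokens."""
    scale = t_emb @ w_scale  # shape: (hidden,)
    shift = t_emb @ w_shift  # shape: (hidden,)
    return x * (1.0 + scale) + shift

rng = np.random.default_rng(0)
dim, hidden = 8, 4
# Hypothetical stand-ins for the learned projection weights
w_scale = rng.normal(size=(dim, hidden)) * 0.1
w_shift = rng.normal(size=(dim, hidden)) * 0.1
x = rng.normal(size=(3, hidden))  # three tokens of a noisy latent

# Same input, different timesteps: the modulated input differs
mod_a = modulate(x, sinusoidal_embedding(900.0, dim), w_scale, w_shift)
mod_c = modulate(x, sinusoidal_embedding(100.0, dim), w_scale, w_shift)
print(np.abs(mod_a - mod_c).mean())
```

The same `x` produces different modulated inputs at different timesteps, which is the quantity TeaCache later compares between consecutive steps.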

Key Concepts — Timestep & Modulation:
- Timestep (t) → where we are in the denoising process
- Timestep embedding → embedding vector of a timestep inside the model
- Modulation → scale & shift applied to noisy input → modulated input
- Result → same network, different behavior at each timestep
- TeaCache measures how much Timestep Embedding Modulated Noisy Input changes between timesteps.
timestep → embedding → (scale & shift) → modulated input → transformer blocks → model output
5. The Denoising Loop
The denoising loop is where all the work happens, and where TeaCache operates (skips redundant computations).
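As a rough picture of why the loop is expensive, the toy sketch below counts how many times the model is called. `toy_denoiser` is a made-up stand-in for the transformer, and the update rule is deliberately simplistic; real schedulers are more involved.

```python
import numpy as np

def toy_denoiser(latent: np.ndarray, t: int) -> np.ndarray:
    """Hypothetical stand-in for the transformer: predicts noise to remove."""
    return latent * 0.1

def denoise(latent: np.ndarray, num_steps: int):
    """Run the reverse loop and count full forward passes."""
    calls = 0
    for t in reversed(range(num_steps)):  # count backwards
        noise_pred = toy_denoiser(latent, t)  # one full forward pass per step
        calls += 1
        latent = latent - noise_pred
    return latent, calls

start = np.ones(4)
final, calls = denoise(start, num_steps=50)
print(calls)
```

Every iteration pays for a full model call, whether or not the input changed much since the previous step. That per-step cost is what TeaCache targets.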
II. What Is TeaCache⚡?
📄 Paper: Timestep Embedding Tells: It’s Time to Cache for Video Diffusion Model (Feng Liu et al., 2025)
70–85% of every generation is spent inside transformer blocks. TeaCache is a training-free optimization designed to cut this “GPU Tax” by skipping redundant computation between similar denoising steps.

Model-agnostic by design, TeaCache works across image, video, and audio diffusion models. The same core logic accelerates Wan2.1, CogVideoX, FLUX, and more, requiring only model-specific calibration and no retraining. TeaCache has now been integrated into engines like vLLM-Omni and ComfyUI.
Core Benefits
- ✅ Blazing speed: 1.3×–2.7× faster inference with minimal quality degradation
- ✅ No retraining required: it’s “plug-and-play” and works on existing pretrained models (Wan2.1, FLUX, etc.)
- ✅ Minimal memory overhead: caches residuals, not full states, and the cached residual is overwritten on every cache miss
- ✅ Configurable trade-off: adjust `rel_l1_thresh` to prioritize speed or quality
- ✅ Broad applicability: video, image, and audio diffusion models
- ✅ Simple, non-invasive implementation: monkey-patching of the forward pass, no model changes
Which Models Actually Use This?
| Modality | Task | Models |
|---|---|---|
| Video | Text-to-Video | Wan2.1, Cosmos, CogVideoX1.5, LTX-Video, Mochi, HunyuanVideo |
| Video | Image-to-Video | Wan2.1, Cosmos, CogVideoX1.5, ConsisID |
| Image | Text-to-Image | FLUX, Lumina-T2X, Lumina2, HiDream-I1 |
| Audio | Text-to-Audio | TangoFlux |
TeaCache applies across modalities because its polynomial coefficients are tuned separately for each model.
TeaCache vs. LMCache — Not the Same Thing
While both are caching technologies, they are fundamentally different systems designed for different types of AI.
LMCache remembers long conversations (KVCache). TeaCache skips redundant math between similar denoising steps.
| Dimension | 🟠LMCache | ⚡TeaCache |
|---|---|---|
| Target | Autoregressive LLMs | Diffusion transformers |
| What’s cached | KV pairs (grows with sequence) | Residuals (fixed-size latents) |
| Access pattern | Sequential token-by-token | Iterative refinement across timesteps |
| Cache across generations | Reusable indefinitely | One per generation, discarded after |
| Integration point | Attention layers | Diffusion forward pass |
Community Ecosystem

TeaCache has achieved substantial community adoption across four categories:
- ⚙️ Model Frameworks: FramePack, FastVideo, EasyAnimate, Ruyi-Models, ConsisID
- 🖥️ UI Integrations: ComfyUI plugins for various models
- 🚀 Engine Support: SD.Next, DiffSynth Studio, vLLM-Omni
- ⚡ Parallelism Tools: TeaCache-xDiT for multi-GPU inference
III. How TeaCache Works
1. Input Preparation Steps
Before the forward pass, three inputs are prepared: the noisy latent, the timestep embedding, and the text embedding. This step also produces TeaCache’s key signal, the timestep-embedding-modulated input, which is used to decide whether the cache can be reused.

💡TeaCache insight:
When timestep embeddings between consecutive steps are similar, the outputs of transformer computations will also be similar. Once quantified, this similarity is used by TeaCache to selectively skip expensive computations.
2. What is L1 Distance?
L1 distance is simply a metric for how different two things are. Think of comparing two photos: if they’re very similar, the L1 distance is small; if they’re very different, the distance is large. TeaCache uses a relative version of this to detect how much the timestep embedding changed between consecutive steps.
The Relative L1 Distance: Formula
The subtraction compares the current timestep embedding signal (`t_emb`) with the previous timestep’s modulated input:
- Absolute difference: `(t_emb - previous_modulated_input).abs()`
- Mean of the difference: `.mean()`
- Normalize by the previous magnitude: `/ previous_modulated_input.abs().mean()`

Put together: `rel_l1 = mean(|current - previous|) / mean(|previous|)`
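The three pieces above combine into a single expression. Here is a small NumPy sketch of it; the function name `relative_l1` is ours, and the real implementations operate on PyTorch tensors rather than NumPy arrays.

```python
import numpy as np

def relative_l1(current: np.ndarray, previous: np.ndarray) -> float:
    """Mean absolute change, normalized by the previous signal's magnitude."""
    return float(np.abs(current - previous).mean() / np.abs(previous).mean())

previous = np.array([1.0, -2.0, 3.0, -4.0])
current = previous * 1.1  # every element moved by 10% of its magnitude

print(relative_l1(current, previous))   # close to 0.1
print(relative_l1(previous, previous))  # identical inputs give 0.0
```

Normalizing by the previous magnitude is what makes the number comparable across steps: a 10% change scores about 0.1 whether the underlying values are tiny or huge.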
Polynomial Rescaling: the Calibrator
The raw L1 distance varies wildly across models (e.g., 0.001 vs 1.0). TeaCache applies model-specific polynomial rescaling to normalize it into a ~[0, 1] scale: one common scale that enables reliable caching decisions across every model.
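As an illustration, the rescaling step can be written with `np.poly1d`, which takes coefficients from the highest degree down. The coefficients below are the rounded CogVideoX values quoted later in this article; the helper name `rescale` is ours.

```python
import numpy as np

# Rounded CogVideoX coefficients as listed in this article (highest degree first)
COEFFICIENTS = [-1.54e3, 8.43e2, -1.34e2, 7.97, -5.23e-2]

def rescale(distance: float, coefficients=COEFFICIENTS) -> float:
    """Evaluate the calibration polynomial at a raw relative L1 distance."""
    return float(np.poly1d(coefficients)(distance))

for raw in (0.01, 0.05, 0.10):
    print(raw, rescale(raw))
```

Because the polynomial is fitted per model, the same raw distance can map to quite different rescaled values on another architecture, which is why coefficients cannot be shared between models.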
IV. The Caching Decision Loop
The Accumulated L1 Distance
Rather than comparing a single step’s distance against the threshold, TeaCache accumulates the rescaled distances across consecutive steps, allowing small changes to build up before triggering a recomputation.
The rel_l1 Threshold
The rel_l1_thresh parameter controls when recomputation is forced. When the accumulated distance exceeds it, TeaCache resets and runs a full forward pass.
Caching logic
- Modulate (The Setup): Compute the modulated input based on the current timestep.
- Measure (Compare Steps): Calculate the relative L1 distance (`current - previous_input`).
- Rescale (Adjust Score): Apply polynomial rescaling to bring the `l1_distance` into ~[0, 1].
- Accumulate (Track Error): Update the accumulated `l1_distance` since the last reset.
- Decide (Cache/No_cache): Compare the accumulated total against the `rel_l1_thresh` parameter.
  - ✅ If < threshold (Cache Hit): Input changes are minimal. Reuse the cached residual (`x + delta`).
  - ❌ If ≥ threshold (Cache Miss): Compute the full forward pass, update the cache, and reset `accumulated_l1`.
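The accumulate-and-decide part of this logic can be sketched in a few lines of plain Python. The function name `plan_steps` and the input distances are made up for illustration; the sketch only tracks the decision, not the actual tensors.

```python
def plan_steps(rescaled_distances, rel_l1_thresh):
    """Return one boolean per step: True = full compute, False = cache hit."""
    decisions = []
    accumulated = 0.0
    for i, distance in enumerate(rescaled_distances):
        if i == 0:
            compute = True  # nothing cached yet, must compute
        else:
            accumulated += distance
            compute = accumulated >= rel_l1_thresh
        if compute:
            accumulated = 0.0  # reset the running total after a full pass
        decisions.append(compute)
    return decisions

# Synthetic rescaled distances: each step changes by 0.06
distances = [0.0, 0.06, 0.06, 0.06, 0.06, 0.06, 0.06]
decisions = plan_steps(distances, rel_l1_thresh=0.15)
print(decisions)
```

With these numbers, two consecutive small changes (0.06, then 0.12 accumulated) stay under the 0.15 threshold and are served from cache, while the third pushes the total to 0.18 and forces a recompute.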
V. TeaCache Workflow (Complete Forward Pass)
The Guardrails: Retention Steps
TeaCache implements one crucial safety net to ensure the video quality doesn’t degrade. It uses Retention Steps:

- Initial Steps: The early steps are always calculated fully to establish a baseline (var: `ret_steps`).
- Final Steps: The last steps are always calculated fully to ensure the final result is sharp (var: `cutoff_steps`).
- Middle Steps: Only the middle steps are subject to the threshold caching logic.
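A guardrail like this boils down to a range check on the step counter. The sketch below assumes `ret_steps` and `cutoff_steps` are absolute step indices and that the comparison is `step < ret_steps or step >= cutoff_steps`; treat the exact semantics as an assumption, since they vary between model integrations.

```python
def must_compute(step: int, ret_steps: int, cutoff_steps: int) -> bool:
    """True when the step falls in the always-computed head or tail."""
    return step < ret_steps or step >= cutoff_steps

num_steps = 10
# Assumed example values: protect the first 2 steps and the last step
protected = [s for s in range(num_steps)
             if must_compute(s, ret_steps=2, cutoff_steps=9)]
print(protected)
```

Only the steps outside `protected` ever reach the threshold comparison.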
DiT Forward Pass
Now when you hit “generate” and the denoising starts, here is the exact sequence of events playing out under the hood:
```mermaid
%%{init: {'theme': 'base', 'flowchart': {'htmlLabels': true}, 'themeVariables': { 'background': 'transparent', 'lineColor': '#888888', 'fontFamily': '-apple-system, BlinkMacSystemFont, sans-serif'}}}%%
flowchart TD
    %% Custom Styling to match the article's aesthetic
    classDef default fill:#ffffff,stroke:#d8d2ea,stroke-width:1.5px,color:#444,rx:6px,ry:6px;
    classDef decision fill:#f0f7ff,stroke:#1971c2,stroke-width:1.5px,color:#0366d6,rx:6px,ry:6px;
    classDef hit fill:#f4fce3,stroke:#2b8a3e,stroke-width:1.5px,color:#2b8a3e,rx:6px,ry:6px;
    classDef miss fill:#fff5f5,stroke:#c92a2a,stroke-width:1.5px,color:#c92a2a,rx:6px,ry:6px;
    classDef endpoint fill:#2C3E50,stroke:#2C3E50,stroke-width:1.5px,color:#ffffff,rx:6px,ry:6px;
    A["DiT Forward Pass"] --> B[Extract Timestep Embedding]
    B --> C[Compute Modulated Input]
    C --> D[teacache_forward]
    D --> E{"First or Last step?"}:::decision
    E -->|Yes| F[Always calculate fully]
    E -->|No| G[Calculate relative L1 distance]
    G --> H[Apply polynomial rescaling]
    H --> I[Update accumulated distance]
    I --> J{"Accumulated Distance < <b>rel_l1_thresh</b>"}:::decision
    %% The Green Cache Hit Path
    J -->|"Yes (Similar to previous)"| K["Reuse previous_residual<br/>x = x + prev_residual"]:::hit
    %% The Red Cache Miss Path
    J -->|"No (Different enough)"| L["Calculate full forward pass<br/>x = block(x, **kwargs)"]:::miss
    L --> M["Store new residual<br/>residual = x - ori_x"]:::miss
    M --> N["Reset accumulated distance<br/>Accu_L1 = 0"]:::miss
    %% Convergence
    F --> O[Proceed to next step]
    K --> O
    N --> O
    %% The Loop
    O --> P{"Is it the last step?"}:::decision
    P -->|No, loop again| B
    %% The End
    P -->|Yes| Q[VAE Decode latents to Video/Image]:::endpoint
    Q --> R([Return final media]):::endpoint
```
Classifier-Free Guidance (CFG)
Classifier-Free Guidance (CFG) improves prompt alignment by running the model twice per step:
1. Conditional pass (with prompt)
2. Unconditional pass (without prompt)
Since CFG runs two passes, TeaCache keeps a separate residual cache for each path (conditional and unconditional).
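One way to picture the per-branch caches is a small state object keyed by branch name. This is a hypothetical sketch, not the layout used by any particular integration: `BranchCache`, `full_pass`, `cached_pass`, and the toy block functions are all illustrative.

```python
from dataclasses import dataclass
from typing import Dict, Optional

import numpy as np

@dataclass
class BranchCache:
    """Caching state for one CFG branch."""
    accumulated_distance: float = 0.0
    previous_residual: Optional[np.ndarray] = None

caches: Dict[str, BranchCache] = {"cond": BranchCache(), "uncond": BranchCache()}

def full_pass(branch: str, x: np.ndarray, block_fn) -> np.ndarray:
    """Cache miss: run the blocks and store this branch's residual."""
    out = block_fn(x)
    caches[branch].previous_residual = out - x
    caches[branch].accumulated_distance = 0.0
    return out

def cached_pass(branch: str, x: np.ndarray) -> np.ndarray:
    """Cache hit: skip the blocks and re-apply this branch's residual."""
    return x + caches[branch].previous_residual

def cfg_combine(cond: np.ndarray, uncond: np.ndarray, guidance_scale: float) -> np.ndarray:
    """Standard CFG mix of the two predictions."""
    return uncond + guidance_scale * (cond - uncond)

x = np.ones(4)
cond_out = full_pass("cond", x, lambda h: h * 2.0)      # toy conditional blocks
uncond_out = full_pass("uncond", x, lambda h: h * 0.5)  # toy unconditional blocks
guided = cfg_combine(cached_pass("cond", x), cached_pass("uncond", x), guidance_scale=5.0)
```

Keeping the two residuals apart matters because the conditional and unconditional paths produce different outputs; sharing one cache would apply the wrong delta to one of them.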
VI. Integration & Implementation
TeaCache uses monkey-patching — injecting caching logic directly into existing models without touching the original source code. Three steps: define the logic, set the config, overwrite the forward pass.
The full integration requires just a few lines. No model modifications, no retraining:
```python
# Full TeaCache setup for CogVideoX
# (pipe and num_inference_steps come from your existing pipeline setup;
#  teacache_forward is the replacement forward pass defined in STEP 1 below)
pipe.transformer.__class__.enable_teacache = True
pipe.transformer.__class__.rel_l1_thresh = 0.15  # Balance speed/quality
pipe.transformer.__class__.cnt = 0
pipe.transformer.__class__.num_steps = num_inference_steps
pipe.transformer.__class__.accumulated_rel_l1_distance = 0
pipe.transformer.__class__.previous_modulated_input = None
pipe.transformer.__class__.previous_residual = None
pipe.transformer.__class__.coefficients = [-1.54e3, 8.43e2, -1.34e2, 7.97, -5.23e-2]
pipe.transformer.__class__.forward = teacache_forward
```
You can find the full code per model in the TeaCache repo (e.g. TeaCache4CogVideoX1.5).
```python
import numpy as np

# STEP 1: Define the replacement forward pass (simplified sketch)
def teacache_forward(self, hidden_states, timestep, **kwargs):
    # Model-specific: build the timestep embedding used as the caching signal.
    # `self.time_embedding` is a placeholder for however the patched model does it.
    emb = self.time_embedding(timestep)

    # Guardrail: always calculate the first and last steps fully
    if self.cnt == 0 or self.cnt == self.num_steps - 1:
        should_calc = True
        self.accumulated_rel_l1_distance = 0
    else:
        # Calculate distance and apply polynomial rescaling
        rescale_func = np.poly1d(self.coefficients)
        distance = ((emb - self.previous_modulated_input).abs().mean()
                    / self.previous_modulated_input.abs().mean()).item()
        self.accumulated_rel_l1_distance += rescale_func(distance)
        # Threshold decision: evaluate if we should compute or cache
        should_calc = self.accumulated_rel_l1_distance >= self.rel_l1_thresh

    # Execute full pass or use cache
    if should_calc:
        ori_x = hidden_states.clone()  # Save original input
        # Run the heavy transformer blocks
        for block in self.blocks:
            hidden_states = block(hidden_states, **kwargs)
        # Cache the mathematical difference (residual)
        self.previous_residual = hidden_states - ori_x
        self.accumulated_rel_l1_distance = 0  # Reset the running tab
    else:
        # CACHE HIT: skip transformer blocks and apply the saved residual
        hidden_states = hidden_states + self.previous_residual

    # Update state variables for the next timestep
    self.previous_modulated_input = emb
    self.cnt += 1
    if self.cnt == self.num_steps:
        self.cnt = 0  # Start fresh on the next generation
    return hidden_states
```
Multi-GPU Parallelism (xDiT)
TeaCache-xDiT implements multi-GPU parallelization of TeaCache within the xDiT framework. Each GPU maintains its own caching decisions, coordinated through a synchronization layer before the distributed transformer model.
VII. Production Serving with vLLM-Omni
Best news? vLLM-Omni has TeaCache natively built in as a hook-based cache backend, so you can enable it with a single flag.
You can enable TeaCache by setting cache_backend to "tea_cache":
```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

# Simple configuration - model_type is automatically extracted from pipeline.__class__.__name__
omni = Omni(
    model="Qwen/Qwen-Image",
    cache_backend="tea_cache",
    cache_config={
        "rel_l1_thresh": 0.2  # Optional, defaults to 0.2
    }
)
outputs = omni.generate(
    "A cat sitting on a windowsill",
    OmniDiffusionSamplingParams(
        num_inference_steps=50,
    ),
)
```
Enable TeaCache for online serving by passing --cache-backend tea_cache when starting the server:
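```bash
vllm serve Qwen/Qwen-Image --omni --port 8091 \
  --cache-backend tea_cache \
  --cache-config '{"rel_l1_thresh": 0.2}'
```
The cache backend can also be set with an environment variable:
```bash
export DIFFUSION_CACHE_BACKEND=tea_cache
```
Performance Tuning:
Start with rel_l1_thresh=0.2 and adjust based on your needs: raise it for more speed, lower it for better quality.
Supported pipelines and their Hugging Face model IDs: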
| Architecture | Model Family | HF Model |
|---|---|---|
| QwenImagePipeline | Qwen-Image | Qwen/Qwen-Image |
| QwenImageEditPipeline | Qwen-Image-Edit | Qwen/Qwen-Image-Edit |
| QwenImageEditPlusPipeline | Qwen-Image-Edit-2509 | Qwen/Qwen-Image-Edit-2509 |
| QwenImageLayeredPipeline | Qwen-Image-Layered | Qwen/Qwen-Image-Layered |
| BagelForConditionalGen | BAGEL (DiT-only) | ByteDance-Seed/BAGEL-7B-MoT |
VIII. Performance & Evals
TeaCache is a speed–quality dial for accelerating diffusion inference across video, image, and audio models.
The Quality–Speed Trade-off
The core tradeoff is simple: higher threshold = faster speed, lower quality. The curve below shows where the cliff is:

Note:
Image models consistently use higher thresholds (0.17–0.5) than video models (0.05–0.3), which are more sensitive to caching.
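To see why a higher threshold means fewer full passes, here is a toy sweep over synthetic data. The distances are invented (a constant 0.0625 per step), so the resulting counts only illustrate the trend; they are not measured speedups, and real gains are lower because cached steps still cost something.

```python
def count_full_passes(rescaled_distances, rel_l1_thresh):
    """Count steps that run the full model; first and last always do."""
    last = len(rescaled_distances) - 1
    accumulated = 0.0
    full = 0
    for i, distance in enumerate(rescaled_distances):
        if i == 0 or i == last:
            compute = True
        else:
            accumulated += distance
            compute = accumulated >= rel_l1_thresh
        if compute:
            accumulated = 0.0
            full += 1
    return full

num_steps = 50
distances = [0.0625] * num_steps  # synthetic: identical change at every step

results = {thresh: count_full_passes(distances, thresh)
           for thresh in (0.125, 0.25, 0.5)}
for thresh, full in results.items():
    print(f"thresh={thresh}: {full}/{num_steps} full passes")
```

Each doubling of the threshold roughly halves the number of full passes in this toy setup. The quality cost shows up because more steps reuse a residual that is increasingly out of date.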
The Threshold Cheat Sheet
Baseline ranges for Video, Image, and Audio diffusion models.
| Modality | Conservative | Balanced | Aggressive | Speedup |
|---|---|---|---|---|
| Video | 0.05 – 0.1 | 0.15 – 0.2 | 0.25 – 0.3 | 1.5× – 2.1× |
| Image | 0.1 – 0.2 | 0.3 – 0.5 | 0.6 – 1.0+ | 1.3× – 2.5× |
| Audio | 0.05 | 0.1 | 0.2 – 0.4 | 1.2× – 1.8× |
Recommended settings for Production, Iteration, and Rapid Preview workflows.
| Use Case | Threshold | Speedup | Quality Impact | Best For |
|---|---|---|---|---|
| Production | 0.1 – 0.15 | 1.3× – 1.6× | Minimal | Client-facing apps, final renders, premium content. |
| Content Workflows | 0.2 – 0.25 | 1.8× – 2.2× | Slight | Artist iteration, concept development, A/B testing. |
| Rapid Prototyping | 0.3 – 0.5 | 2.1× – 2.7× | Noticeable | Quick previews, parameter exploration, dev testing. |
Recommended configs for FLUX, Wan, Hunyuan, and more.
| Model Family | rel_l1_thresh | ret_steps (fraction) | cutoff_steps (fraction) | Max Speedup |
|---|---|---|---|---|
| FLUX.1 (Std/PuLID) | 0.40 | 0.0 | 1.0 | ~2.0× |
| Lumina-Image-2.0 | 0.38 | 0.2 | 1.0 | ~1.7× |
| HiDream-I1 (Full) | 0.35 | 0.1 | 1.0 | ~2.0× |
| Wan2.1 T2V (1.3B) | 0.08 – 0.15 | 0.0 – 0.1 | 1.0 | ~1.6× – 2.2× |
| Wan2.1 T2V (14B) | 0.20 | 0.0 | 1.0 | ~1.8× |
| Wan2.1 I2V (14B-720P) | 0.30 | 0.1 | 1.0 | ~2.0× |
| Wan2.1 T2V (14B) Ret-Mode | 0.20 | 0.1 | 1.0 | ~2.1× |
| HunyuanVideo (13B) | 0.15 | 0.0 | 1.0 | ~1.9× |
| CogVideoX 1.5 (5B) | 0.30 | 0.0 | 1.0 | ~2.0× |
| LTX-Video | 0.06 | 0.0 | 1.0 | ~1.7× |
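start_percent → fraction of the schedule at which the always-computed baseline steps end (var: ret_steps).
end_percent → fraction of the schedule at which the always-computed final steps begin (var: cutoff_steps).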
Use aggressive thresholds for rapid iteration. Once you find the right composition, replay without the cache for a pixel-perfect render.
Testing & Evaluation
All speedups are measured against the model’s native (uncached) inference time. Quality is evaluated using:
- VBench → temporal quality (video consistency)
- PSNR / SSIM → pixel-level similarity to uncached output
- LPIPS → perceptual similarity
- MOS → human visual quality (when applicable)
For full scripts and setup, see the TeaCache evaluation repository.
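For a quick sanity check before running a full evaluation suite, PSNR against the uncached output is easy to compute by hand. The sketch below is a plain NumPy implementation for 8-bit images; the function name and the sample arrays are illustrative.

```python
import numpy as np

def psnr(reference: np.ndarray, test: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    diff = reference.astype(np.float64) - test.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

uncached = np.full((4, 4), 128, dtype=np.uint8)  # stand-in for the baseline frame
cached = uncached.copy()
cached[0, 0] = 138                               # one pixel off by 10

print(psnr(uncached, cached))
```

Casting to float before subtracting avoids unsigned-integer wraparound, which would otherwise silently corrupt the error for `uint8` images.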
Conclusion
That’s a wrap — TeaCache, end to end.
If you made it here, bravo. You now understand what most people miss when they first hear about diffusion caching: it’s not magic, it’s not a KV cache, and it’s not skipping steps. It’s measuring how much the model changes between steps and avoiding recomputation when it doesn’t. With the explosion of image and video generation, caching is becoming vital for cost effective production diffusion systems.
Thanks again for reading this piece and both Part 1 and Part 2!
