Diffusion Models 101: From Noise to Pixels

Intro

Today, most of us have used nano banana, Midjourney, Kling AI, Luma, or Sora to generate silly videos or catchy images for socials. But what do they have in common? They all rely on Diffusion Models as their core engine, even the brand-new Seedance. While many of these are proprietary, the open-source world has exploded with high-quality models that allow anyone, from hobbyists on ComfyUI to enterprises on vLLM-omni, to run their own multimodal platforms.

Following our vllm-omni deep dive, this blog explores how these models actually work in practice.

💡 More logic, less mathing

My goal isn’t to bury you in deep math, but to demonstrate the logic behind the basics. I’ll use simplified annotations to ensure that even without a PhD, you’ll walk away understanding the “Big Picture” of AI media generation.

This is my attempt to translate the popular publications into plain logic. You can learn the math side of it in this amazing video by Welsh Labs.

I. What is Diffusion in AI?

Every block of stone has a statue inside it and it is the task of the sculptor to discover it.
Michelangelo

In physics, diffusion is what happens when you drop ink into water or leave perfume in a room: it spreads, it fills the space, it doesn’t reverse. Structure dissolves into chaos, and entropy wins. Diffusion models build on this natural phenomenon by learning how to run it backwards: from sharp image to noise (training), and back again (inference).

Hence the two-act story: first you destroy, then you learn to create. The magic isn’t in the noise, but in teaching a model to find order within it. Precisely what Michelangelo’s metaphor captures: forward diffusion hides the statue inside the marble, reverse diffusion carves it free, one step at a time, guided by your prompt. You can’t have the second without the first.

1. Forward Diffusion (Training)

During training, real images are gradually corrupted by adding tiny amounts of random noise over hundreds of steps, until the original image is pure noise. Each step creates a data point: (image, how noisy it is, what noise was added). Text descriptions are encoded and paired with their corresponding images to build the full training corpus.

Credit: Welsh labs
  • Start: A real image or video paired with its text description (e.g., “a golden retriever on a beach”).
  • Destroy: Add a tiny amount of noise. Record the result. Repeat many times, until nothing remains but static.
  • Finish: Pure noise (random pixels).
  • Training loss is typically MSE (Mean Squared Error) between the predicted noise ε and the true ε.
💡 The model studies millions of these cycles until it knows, for any noisy image, what noise was added and when; a minimal sketch of one training step follows below.
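To make that concrete, here is a minimal PyTorch-style sketch of one training step. The `model(noisy, t, text_emb)` noise predictor and the linear beta schedule are illustrative assumptions, not any specific model's internals; real pipelines wrap this in schedulers.

```python
import torch
import torch.nn.functional as F

# Assumption: `model(noisy, t, text_emb)` is a hypothetical network that predicts the added noise.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)                  # simple linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, clean_latent, text_emb):
    batch = clean_latent.shape[0]
    t = torch.randint(0, T, (batch,))                  # random point on the noise schedule
    noise = torch.randn_like(clean_latent)             # the ε we are about to add
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    # Corrupt the clean latent in closed form: x_t = sqrt(a_bar)*x_0 + sqrt(1 - a_bar)*ε
    noisy = a_bar.sqrt() * clean_latent + (1 - a_bar).sqrt() * noise
    noise_pred = model(noisy, t, text_emb)             # model guesses what was added
    return F.mse_loss(noise_pred, noise)               # MSE between predicted ε and true ε
```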

2. Reverse Diffusion (Inference)

Diffusion generation is like un-blurring a photo step by step. You aren’t “drawing” from scratch; you’re sculpting an image out of a block of marble made of static, guided by your prompt.

Credit: Welsh labs
  • Start: Pure noise (random pixels).
  • Denoise (refine): Gradually remove “noise” over many steps, guided by text embeddings (your prompt).
  • Finish: A clear, high-fidelity image or video.
💡 Each step makes tiny improvements to get closer to the target image or video requested by the prompt; the loop sketch below shows the overall shape.
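At a very high level, inference is just a loop over timesteps. A hedged sketch, assuming a hypothetical trained `model` and a `solver_step` function that applies one of the update rules covered in Section III:

```python
import torch

@torch.no_grad()
def generate(model, solver_step, text_emb, shape=(1, 4, 64, 64), steps=50):
    x = torch.randn(shape)                          # Start: pure noise in latent space
    for t in torch.linspace(999, 0, steps).long():  # Denoise: walk the timesteps down
        pred = model(x, t, text_emb)                # noise (or velocity) prediction, prompt-guided
        x = solver_step(x, pred, t)                 # one small cleanup step (see Section III)
    return x                                        # Finish: hand the latent to the VAE decoder
```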

3. Latent Space (VAE): Diffusion’s ZIP File

Processing a 4K video frame-by-frame is a computational suicide mission. But in 2022, the High-Resolution Image Synthesis with Latent Diffusion Models paper — which gave birth to Stable Diffusion — changed the game with the VAE (Variational Autoencoder): a high-powered zipper for visual data.

Encoder | “Zips” a high-res image into a tiny compressed representation called the Latent Space — 8× to 64× smaller.
Workspace | The denoising loop runs here. We go from pixels to math concepts: the image is now an idea.
Decoder | Once the denoising loop is over, the VAE “unzips” the latents back into the full-resolution image or video.
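A minimal sketch of that zip/unzip round trip using the diffusers AutoencoderKL class; the checkpoint ID is an assumption, and any Stable-Diffusion-style VAE behaves the same way.

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")  # assumed checkpoint

image = torch.randn(1, 3, 512, 512)  # stand-in for a real RGB image scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()  # "zip": (1, 4, 64, 64), 8x smaller per side
    # ... the denoising loop runs here, on `latents`, never on raw pixels ...
    decoded = vae.decode(latents).sample              # "unzip": back to (1, 3, 512, 512)
```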

II. Inside a Diffusion Step

Every concept below lives inside one place: the “Denoising Loop”, which runs hundreds of times per generation and is where the GPU cycles are spent. Before diving into diffusion paths, we must understand the loop first.

1. Timesteps

Timesteps are simply step numbers in the denoising process, a counter that tells the model exactly where it is in the journey from noise to image. The model uses this number to calibrate how aggressively to denoise: early steps make big structural changes, late steps refine fine details.

T = 1000 | Pure, heavy noise (the very beginning).
T = 500 | Halfway there — shapes are emerging, noise is thinning.
T = 1 | Almost finished — fine details and texture refinement.
T = 0 | Final result (sharp image) — the denoising loop is complete.

2. Text Embeddings — The “Brain”

Your prompt (e.g., “A cat in a space suit”) is turned into vectors called embeddings that encode meaning and guide the denoising process. So the model doesn’t just make a random pretty picture—it makes yours.

3. Timestep Embeddings — The “GPS”

At every step, a Timestep Embedding (the GPS signal) is generated. It helps the model decide what noise to remove by signaling how noisy the current state is in the noise-to-image journey. The timestep is encoded into its own embedding alongside the image and text and injected into every transformer block (transformer layer) as a modulation signal.

Timestep embedding acts as a remote control

In the diagram, the Timestep Embedding modulates the Noisy Input through two operations before anything else runs:
Shift (+) — adjusts the bias of the data (addition).
Scale (×) — adjusts the “volume” or strength of the features (multiplication).
This re-wires each transformer block at every step — step 1000 focuses on global shapes, step 10 on fine textures.

Input:
Timestep tensor "t" (the “denoising progress bar”, e.g. step 500).
Output:
Timestep Embedding "t_emb" (the step number, e.g. 500, translated into a “time-aware” vector).
t_emb  = TimestepProjection(t)               ← encode the step number
output = Scale(t_emb) · x  +  Shift(t_emb)   ← modulate the noisy input
💡 where x = the current noisy latent (the Noisy Input from the previous step); a runnable sketch follows below
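Here is a runnable sketch of that pseudocode: a sinusoidal timestep projection followed by scale/shift (AdaLN-style) modulation. Layer sizes, names, and the MLP shape are illustrative assumptions, not a specific model's architecture.

```python
import math
import torch
import torch.nn as nn

class TimestepModulation(nn.Module):
    """Sinusoidal timestep encoding + scale/shift modulation of the noisy input."""
    def __init__(self, dim=512):
        super().__init__()
        self.dim = dim
        # Small MLP that turns t_emb into per-feature scale and shift vectors
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, 2 * dim))

    def timestep_projection(self, t):
        # Encode the raw step number into a "time-aware" vector
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
        angles = t.float()[:, None] * freqs[None, :]
        return torch.cat([angles.sin(), angles.cos()], dim=-1)

    def forward(self, x, t):
        t_emb = self.timestep_projection(t)                  # t_emb = TimestepProjection(t)
        scale, shift = self.proj(t_emb).chunk(2, dim=-1)
        # output = Scale(t_emb) * x + Shift(t_emb)
        return x * (1 + scale[:, None, :]) + shift[:, None, :]

mod = TimestepModulation(dim=512)
x = torch.randn(2, 16, 512)          # (batch, latent tokens, features): the noisy input
t = torch.tensor([500, 10])          # two different points on the progress bar
x_mod = mod(x, t)                    # same x, re-wired differently per timestep
```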

4. Vector Fields

At each denoising step, the model outputs a vector — a direction and magnitude pointing toward cleaner data. Across the entire latent space, these vectors form a field: a map of which way to move from any noisy state to reach a coherent image. This is the navigation system behind the carving described in Section I.

The text embedding steers which region of that field you’re navigating toward. The timestep tells you how far you’ve come.

5. Unconditional vs. Conditional (prompt)

Credit: Welsh labs

Without a prompt, the model generates freely — no destination, no instruction, meaning you get a totally random image or video. When a text prompt is added, the model computes a directed prediction, steering toward the concept your words describe. The difference between those two predictions is what makes guidance possible.

  • f(x,t) — the unconditional prediction, drifting toward the generic “NOT CATS” region.
  • f(x,t, cat) — the conditional prediction, pointing toward the CATS cluster.
  • The green arrow is the difference between them. No scale applied yet — just two directions and their gap.

6. Classifier-Free Guidance (CFG)

CFG amplifies that difference. It takes the green arrow and scales it by a guidance weight w — pushing the result harder toward your prompt and further from the generic baseline. Think of it as a creativity vs. control knob: low values give the model freedom, high values enforce the prompt literally.

Credit: Welsh labs

CFG Formula

\(
\epsilon_{guided} = \underbrace{f(x,t)}_{\text{unconditional prediction}} + \underbrace{w}_{\text{scale}} \cdot \underbrace{\color{green}{[f(x,t,\text{cat}) - f(x,t)]}}_{\text{prompt direction}}
\)
Term | What it does
f(x,t) | Unconditional prediction — no prompt, no bias
f(x,t, prompt) | Conditional prediction — steered by your text
f(x,t, prompt) − f(x,t) | The green arrow — direction toward your concept
w / α (guidance scale) | How hard to push in that direction (low: creative, high: strict)
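In code, CFG is just two forward passes and a weighted blend. A minimal sketch with the same hypothetical noise-prediction `model` as the earlier snippets; `w` around 7–8 is a common default in image models.

```python
def cfg_predict(model, x, t, text_emb, null_emb, w=7.5):
    # Two passes through the same model: without and with the prompt
    eps_uncond = model(x, t, null_emb)   # f(x, t): no destination
    eps_cond = model(x, t, text_emb)     # f(x, t, prompt): steered by your text
    # Move along the "green arrow", scaled by the guidance weight w
    return eps_uncond + w * (eps_cond - eps_uncond)
```

In real pipelines the two passes are usually batched together, which is why CFG roughly doubles the compute per denoising step.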

7. Negative Prompts

Credit: Welsh labs

CFG uses an unconditional prediction as its baseline — steering away from “nothing”. The Wan model takes it further by using negative prompts, replacing that neutral baseline with something explicit: a description of what you don’t want. The model now steers away from a blacklisted concept rather than just away from nothing.

Negative Prompt Formula

\(
\epsilon_{guided} = \color{green}{f(x,t, \text{prompt})} + \underbrace{\alpha}_{\text{scale}} \cdot [\color{green}{f(x,t, \text{prompt})} - \color{red}{f(x,t, \text{neg prompt})}]
\)
Term | What it does
f(x,t, prompt) | Conditional prediction — what you want
f(x,t, neg prompt) | The red vector — what you’re actively fleeing (“blurry”, “extra limbs”…)
f(x,t, prompt) − f(x,t, neg prompt) | The green arrow — the gap between want and don’t-want
α / w (guidance scale) | How hard to push away from the negative
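The negative-prompt variant only changes the baseline. A short sketch mirroring the formula above, with the same hypothetical `model`:

```python
def negative_prompt_predict(model, x, t, text_emb, neg_emb, alpha=7.5):
    eps_pos = model(x, t, text_emb)   # what you want
    eps_neg = model(x, t, neg_emb)    # what you are fleeing ("blurry", "extra limbs", ...)
    # Push away from the negative, toward the positive
    return eps_pos + alpha * (eps_pos - eps_neg)
```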

III. Diffusion Paths (Solvers)

Every diffusion model starts with pure noise and ends with a coherent image. The question that divided researchers for years: what’s the best path between the two? Three generations of solvers answered that question very differently.

📖 Key Symbols
x_t | Noisy latent at step t
x_{t-1} | Denoised latent at the next step — the output of each reverse step
ε | Noise
ε_θ | Predicted noise (or equivalent)
v_t | Velocity field (flow models)
Solver | How you step through time — also called a sampler

1. DDPM (Denoising Diffusion Probabilistic Models)

📄 Paper Denoising Diffusion Probabilistic Models Ho et al., 2020

DDPM is like walking home drunk—random steps and small kicks, but you get there thanks to constant correction.

How it works:
The original diffusion solver. At every step, the model removes some noise, then immediately adds a small random kick back in. Like a drunk stroller stumbling home: moving in the right direction, but shuffling sideways at every step. It eventually reaches the destination, but the path is chaotic and expensive — requiring at least ~100 steps. The randomness isn’t a bug — different stumbles = different images from the same prompt. But it also means you can’t skip steps.

DDPM Formula — SDE (Stochastic Differential Equation)

\(
x_{t-1} = x_t + \underbrace{a_t \epsilon_\theta}_{\text{Correction}} + \underbrace{\color{orange}{\eta w_t}}_{\text{Noise}}
\)

Term | What it does
x_t | Current noisy state
a_t | Noise scaling factor — controls how much correction to apply at this step (reconstruction)
ε_θ | The predicted noise — what the model thinks should be removed
η | Noise strength — how loud the random stumble is
w_t | Random noise sample — the actual stumble injected at this step
x_{t-1} | Next step = correction + stumble

DDPM — Full Step Walkthrough

┌───────────────────────────────────────────┐
│  ⚙️ STEP T — STOCHASTIC / SDE             │
└───────────────────────────────────────────┘
Input:  x_t    (Current Noisy State)
        emb_t  (Timestep Projection)

┌─────────────────────────────────────────┐
│  Modulation Layer (emb_t)               │
│  x_t ──► Shift & Scale ──► x̃_t          │
│  (prepares x_t for the DiT blocks)      │
└─────────────────────────────────────────┘
          ↓ emb_t re-injected at every block
┌─────────────────────────────────────────┐
│  INSIDE THE MODEL (DiT):                │
│                                         │
│  1. Predict the noise (ε_t) in x_t      │
│  2. Calculate the “Clean” direction     │
│                                         │
│  for block in blocks:                   │
│      hidden = block(x̃_t, emb_t)         │   <<<--- (re-modulates at each layer)
└─────────────────────────────────────────┘
          ↓
┌─────────────────────────────────────────┐
│  THE REVERSE STEP                       │
│  x_{t-1} = Denoise(x_t, ε_t) + ηw_t     │
│                              ↑          │
│          The “Stochastic” Kick          │
└─────────────────────────────────────────┘
💡 Why add noise back?

The model was trained to reverse small noisy steps, so the DDPM sampler mirrors them at inference — the reverse process is stochastic by design. The random kicks also keep generations diverse: without them, every run from the same starting noise and prompt converges to the same output.
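A simplified DDPM update in code — the textbook coefficients plus the stochastic kick. This is a sketch reusing the `betas` / `alphas_cumprod` schedule from the training snippet earlier, not a drop-in sampler.

```python
import torch

def ddpm_step(model, x_t, t, text_emb, betas, alphas_cumprod):
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    a_bar_t = alphas_cumprod[t]
    eps = model(x_t, t, text_emb)                                    # predicted noise ε_θ
    # Correction: the denoised mean (the a_t · ε_θ term, written out)
    mean = (x_t - beta_t / (1 - a_bar_t).sqrt() * eps) / alpha_t.sqrt()
    # Stochastic kick: fresh random noise η·w_t, skipped only at the very last step
    kick = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + beta_t.sqrt() * kick
```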

2. DDIM (Denoising Diffusion Implicit Models)

DDIM is like skiing back home — a predictable slide, faster and with smooth curves, minus the random kicks.

📄 Paper Denoising Diffusion Implicit Models Song et al., 2021

How it works:
Song et al. asked one question: what if we removed the random stumble entirely? DDIM replaced the stochastic process with a deterministic ODE — the model carves a smooth, predictable curve toward the destination. No random kicks. Same starting noise + same prompt = same image every time. And because the path is smooth, you can skip steps (larger leaps) — reducing hundreds of steps to ~50 without noticeable quality loss.

DDIM Formula — ODE (Ordinary Differential Equation)

\(
\underbrace{x_{t-1} = x_t + b_t \epsilon_\theta}_{\text{Reverse Diffusion Step}}
\)
Term | What it does
x_t | Current noisy state
b_t | Velocity scaling factor — controls the step size along the smooth path (focus: direction)
ε_θ | The predicted noise — same as DDPM, but used deterministically. No random kick.
x_{t-1} | Next step = pure correction only. Deterministic.
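A matching sketch of the deterministic (η = 0) DDIM update; `t_prev` can be many steps earlier than `t`, which is exactly what makes step-skipping work. Same hypothetical `model` and schedule as before.

```python
def ddim_step(model, x_t, t, t_prev, text_emb, alphas_cumprod):
    a_bar_t, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    eps = model(x_t, t, text_emb)                                    # same ε_θ as DDPM
    # Estimate the clean latent implied by the current noise prediction
    x0_pred = (x_t - (1 - a_bar_t).sqrt() * eps) / a_bar_t.sqrt()
    # Step deterministically toward that estimate; no random kick, so large jumps are safe
    return a_bar_prev.sqrt() * x0_pred + (1 - a_bar_prev).sqrt() * eps
```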

3. Flow Matching

Flow Matching is like riding a bullet train — straightening the path for maximum speed, predicting velocity instead of noise.

📄 Paper Flow Matching for Generative Modeling Lipman et al., 2023

How it works:
DDIM smoothed the path; Flow Matching turns that curve into a straight line. Instead of predicting the noise to remove at each step, the model learns to predict a velocity vector — the direction pointing straight from noise to image. No randomness, no curves, just the straightest possible path at a consistent speed. Fewer steps, less compute, better quality. This is the solver powering Wan 2.1, Hunyuan, and FLUX today — and a perfect fit for diffusion caching.

Flow Matching Formula (ODE)

\(
x_{t-1} = x_t - \underbrace{\color{green}{\Delta t}}_{\text{Step Size}} \cdot \underbrace{\color{blue}{v(x,t)}}_{\text{Velocity}}
\)
Term | What it does
x_t | Current noisy state
v_t or v(x_t, t) | Velocity — direction pointing straight toward the clean image at step t
Δt | Step size — how far to move along the straight line per step
x_{t-1} | Next step = current position − velocity × step size. No randomness. No curves.

Flow Matching — Full Step Walkthrough

┌─────────────────────────────────────────────┐
│  ⚙️ STEP T — DETERMINISTIC / ODE / FLOW     │
└─────────────────────────────────────────────┘
Input:  x_t       (Noisy Latent)
        emb_t     (Timestep Projection)
        text_emb  (T5/CLIP Brain)

┌─────────────────────────────────────────┐
│  Modulation Layer (emb_t)               │
│  x_t ──► Shift & Scale ──► x̃_t          │
│  (prepares x_t for the DiT blocks)      │
└─────────────────────────────────────────┘
          ↓ emb_t re-injected at every block
┌─────────────────────────────────────────┐
│  INSIDE THE TRANSFORMER (DiT):          │
│                                         │
│  hidden = x̃_t                           │
│  for block in blocks:                   │
│      hidden = block(hidden, text_emb,   │
│                     emb_t)              │
│                                         │
│  return hidden as v_t (Velocity Vector) │
└─────────────────────────────────────────┘
          ↓
┌─────────────────────────────────────────┐
│  ODE INTEGRATION                        │
│  x_{t-1} = x_t − (Δt · v_t)             │
└─────────────────────────────────────────┘
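The walkthrough above collapses into a few lines of code — a minimal Euler-integration sketch with a hypothetical velocity-predicting `model` (shapes and step count are illustrative):

```python
import torch

@torch.no_grad()
def flow_matching_sample(model, text_emb, shape=(1, 16, 32, 32), steps=30):
    x = torch.randn(shape)                 # t = 1: pure noise
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt                   # walk time from 1 (noise) toward 0 (image)
        v = model(x, t, text_emb)          # velocity pointing toward the clean latent
        x = x - dt * v                     # x_{t-Δt} = x_t − Δt · v(x, t): one straight-line step
    return x
```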

Solvers Comparison Table

Solver | Full Formula | Predictability | Speed | Path
DDPM | x_{t-1} = x_t + a_t ε_θ + η w_t | Low (hard to skip steps) | 🐌 Very Slow | 🪶 Jagged
DDIM | x_{t-1} = x_t + b_t ε_θ | High (easier to skip steps) | 🏎️ Fast | ↩️ Smooth Curve
Flow Matching (new) | x_{t-1} = x_t − Δt · v_t | Extreme (supports caching) | ✈️ Ultra Fast | 📏 Straight Line

IV. The Architecture Evolution 🚀

Every diffusion model needs two things: 1. something to generate, and 2. something to understand what you asked for. So far we covered the generation side. This section covers the language-understanding side, and how it evolved from basic to genuinely intelligent.

1. CLIP (OpenAI, 2021) — Giving the Model “Eyes”

CLIP encodes image and text into shared embedding space. It has no “hands” to draw; it only has “eyes” to see.

📄 Paper Learning Transferable Visual Models From Natural Language Supervision Radford et al., 2021
Credit: Welsh labs

How it works:
Before CLIP, images and text lived in separate worlds. CLIP changed that by training on 400 million image-text pairs, learning to map both into a shared embedding space. A photo of a cat and the words “a cat” end up together in that space. A photo of a dog ends up far away. This gave diffusion models their first real “eyes”, a way to understand what a prompt means visually. The embedding from CLIP is what gets fed into the denoising loop as the conditioning signal.

💡 CLIP alone isn’t here to generate images. It just tells the diffusion model what it is looking at (Conditioning Signal).

2. Imagen (Google, 2022) — Giving the Model a “Brain”

Meet Google’s Imagen, the discovery that replaced the model’s eyes (VLM) with a brain (LLM).

📄 Paper Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding Saharia et al., 2022

How it works:
CLIP struggled with complex language (nuanced prompts, spatial relationships, etc.). So the Google team asked: what if instead we used an LLM already trained on the complexity of human text? The answer was T5, an LLM trained on text alone, and it revealed something surprising: a model that had never seen an image still produced far better prompt understanding than CLIP. The “brain” didn’t need to see to understand.

💡 The language model already knows what a complex prompt means; it just needs a diffusion model to act on it.

V. The Jump to Video

Images are a single point in time. Video is a sequence of them, and keeping that sequence coherent, consistent, and physically plausible is a completely different problem. These two papers defined how diffusion crossed that threshold.

1. VideoPoet (Google, 2023) — The “LLM for Video”

📄 Paper VideoPoet: A Large Language Model for Zero-Shot Video Generation Kondratyuk et al., 2023
Multiple inputs for multiple video-generation tasks

How it works:
VideoPoet treated video frames the same way a language model treats words, as tokens in a sequence. By tokenizing video into discrete units and training a transformer to predict the next token, it showed that the same architecture powering text generation could handle temporal coherence in video. No special video-specific architecture needed. Just a big enough transformer and the right tokenization.

💡 If you can predict the next word, you can predict the next frame.

2. Lumiere (Google, 2024) — Space-Time Diffusion

Google took Imagen and ditched the jitter by adding temporal layers that generate the entire video at once.

📄 Paper Lumiere: A Space-Time Diffusion Model for Video Generation Bar-Tal et al., 2024

How it works:
Frame-by-frame video generation’s problem is flicker: the frames are sharp but unaware of each other. Lumiere solved this with a Space-Time U-Net that generates all of the video’s frames as a single 3D block (Height × Width × Time), treating time as just another dimension. It also enables features like consistent stylization and inpainting.

Temporal consistency in generated videos

It uses Temporal Downsampling: reading the “compressed” motion blueprint of the entire clip simultaneously before filling in the details. The model knows where the ball lands at frame 80 before frame 1 is finished.

💡 Consistent stylization — e.g. apply a Van Gogh reference and the brushstrokes hold steady across every frame.

Conclusion

That’s a wrap — five sections, zero PhD required. If you made it here, congratulations 🏅.
You now understand what most practitioners take ages to piece together: why noise is added before it’s removed, why the path shape matters, why your prompt becomes a vector, and why Google had to inflate an image model just to make video stop flickering.
This was my attempt to translate the papers so you don’t have to read them. The logic was always there — it just needed a drunk stroller and a bullet train to make it click.

Next up: Diffusion caching — TeaCache and CacheDiT. Stay Tuned
