What is vLLM-Omni? A Beginner's Intro

Intro

Any-to-any multimodal models that combine text, images, video, and audio are pushing AI forward, but their architectures mix autoregressive LLMs with diffusion transformers, which makes efficient serving very difficult. Today's systems, such as OpenAI's ChatGPT (text) and Sora (video), run as separate engines with no unified any-to-any pipeline. vLLM-Omni addresses exactly that with a fully disaggregated serving system designed for interconnected, multi-component models.

I. What is vLLM-Omni?

vLLM-Omni is a purpose-built extension of the vLLM core that adds support for multimodal and non-autoregressive models while preserving full compatibility with vLLM's scheduling and KV cache optimizations. That means image generation, text-to-speech, and even video generation can all be served through this new flavor of the engine.

Design intent

The secret sauce is serving both text generation, which relies on autoregressive (AR) models, and multimodal generation, which relies on diffusion transformer (DiT) models, all in one heterogeneous pipeline (multimedia input and output).

Two In One Architecture: AR vs. DiT

| Feature | Autoregressive (AR) | Diffusion Transformer (DiT) |
|---|---|---|
| Core Examples | Qwen3-Omni, Qwen2.5-Omni | Wan 2.2, FLUX.2, BAGEL |
| Mechanism | Token-by-token (Thinker/Talker) | Step-based denoising (MoE experts) |
| Bottleneck | Decode: memory bound (KV cache) | Compute bound (high-throughput generation) |
| NPU Efficiency | High (memory-efficient decoding) | Ultra-high (dedicated matrix engines) |

How does it extend vLLM?

  • Intercepting the CLI --omni switch
    • The CLI checks for the --omni flag; if it is absent, it calls vLLM's main directly, so the CLI behaves exactly like vLLM.
    • If it is present, it loads vLLM-Omni's serve module and dispatches accordingly.
  • Runtime patching of vLLM symbols at import
    • Replaces vLLM classes with their Omni equivalents at import time
    • Ensures all vLLM internals use the Omni-enhanced classes
  • Inheriting key vLLM classes to add omni-specific capabilities
    • Adds omni-specific configuration and connectors, including an LLMEngine like vLLM's
    • AsyncOmniLLM extends vLLM's AsyncLLM for async, per-stage execution
    • OmniEngineArgs extends vLLM's EngineArgs with omni fields like stage_id and model_stage
    • Schedulers and workers also inherit from vLLM bases, e.g., OmniARScheduler extends VLLMScheduler
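
The three mechanisms above can be pictured with a toy sketch. This is not vLLM-Omni's actual code: `dispatch`, `patch_symbols`, the `upstream` namespace, and the stub `Scheduler` classes are hypothetical names used only to illustrate flag interception, import-time symbol replacement, and subclassing.

```python
import types


def dispatch(argv, omni_main, vllm_main):
    """Route to the omni entry point only when --omni is present."""
    if "--omni" in argv:
        # Assumption of this sketch: the flag is dropped before the
        # remaining arguments are handed to the omni entry point.
        rest = [a for a in argv if a != "--omni"]
        return omni_main(rest)
    return vllm_main(list(argv))


class Scheduler:
    """Stand-in for an upstream scheduler class."""

    def schedule(self, requests):
        return sorted(requests)


class OmniScheduler(Scheduler):
    """Subclass that adds stage awareness on top of the base behavior."""

    def __init__(self, stage_id=0):
        self.stage_id = stage_id

    def schedule(self, requests):
        return [(self.stage_id, r) for r in super().schedule(requests)]


def patch_symbols(module, replacements):
    """Swap attributes on a module so later lookups get the omni classes."""
    for name, new_obj in replacements.items():
        setattr(module, name, new_obj)


# Stand-in for an upstream module whose symbols get replaced at import.
upstream = types.SimpleNamespace(Scheduler=Scheduler)
patch_symbols(upstream, {"Scheduler": OmniScheduler})

print(dispatch(["serve", "m", "--omni"],
               lambda a: ("omni", a), lambda a: ("vllm", a)))
print(upstream.Scheduler(stage_id=1).schedule(["b", "a"]))
```

The point of the patching step is that any code that later looks up `upstream.Scheduler` gets the subclass without being modified itself.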

II. Architecture Overview 🚀

  • Modality Encoders: The Thinker stage uses a separate encoder for each input modality (vision and audio encoders).
  • LLM AR (Autoregressive Language Model): The core language model, which generates text tokens sequentially.
  • Modality Generator: Uses a different generator depending on the output modality (text-to-image, TTS, image-to-video, …).

Omni-Modality Serving Architecture

This is an advanced dispatching and orchestration stack for multimodal AI workloads. The autoregressive (AR) module handles all text-based workloads, as in vLLM, while the diffusion module takes care of the other modalities.

  • OmniRouter (Intelligence): Provides an intelligent router for dispatching omni-modality requests, ensuring each request is sent to the optimal processing unit based on modality and load.
  • EntryPoints (API layer): Defines the APIs for offline/online serving (APIServer, Omni/AsyncOmni) and provides the OmniStage abstraction to handle the different AR/DiT execution stages.
  • AR (Autoregressive): Inherited from vLLM (KV cache management, PagedAttention) but adapted for omni-modality.
  • Diffusion (Generative): Natively implemented and optimized with specialized acceleration components to handle high-throughput image and video generation tasks.
  • Model/Layer/ops (Disaggregation): Parallelism, quantization, attention, …
  • OmniConnector (Disaggregation): Supports full disaggregation based on the E/P/D/G framework: Encoding, Processing, Decoding, and Generation across different stages of the pipeline.

Natively Disaggregated Serving

vLLM-Omni runs each model stage in its own process or device, and the stages are connected via OmniConnector. This enables heterogeneous pipelines (AR + DiT) and dynamic resource allocation across stages.
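
To make the idea concrete, here is a minimal sketch of stages that run independently and only talk through a connector. It is a toy model, not vLLM-Omni code: threads stand in for separate processes/devices, `queue.Queue` stands in for OmniConnector, and `run_stage`, `run_pipeline`, `ar_stage`, and `dit_stage` are hypothetical names.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a stage to shut down


def run_stage(fn, inbox, outbox):
    """Pull items from inbox, apply fn, push results to outbox."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # propagate shutdown downstream
            return
        outbox.put(fn(item))


def run_pipeline(stage_fns, inputs):
    """Chain stages with queues; each stage runs in its own thread."""
    queues = [queue.Queue() for _ in range(len(stage_fns) + 1)]
    workers = [
        threading.Thread(target=run_stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stage_fns)
    ]
    for w in workers:
        w.start()
    for item in inputs:
        queues[0].put(item)
    queues[0].put(_STOP)

    results = []
    while True:
        out = queues[-1].get()
        if out is _STOP:
            break
        results.append(out)
    for w in workers:
        w.join()
    return results


def ar_stage(prompt):
    """Toy 'AR' stage: turn a prompt into tokens."""
    return prompt.split()


def dit_stage(tokens):
    """Toy 'DiT' stage: consume the tokens produced upstream."""
    return f"<image conditioned on {len(tokens)} tokens>"


print(run_pipeline([ar_stage, dit_stage], ["a cup of coffee", "a red car"]))
```

Because each stage only sees its inbox and outbox, a stage can be moved to different hardware or given more workers without touching its neighbors, which is the property disaggregation is after.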


AR vs. Diffusion Module Design

Below is the core architecture of the AR and diffusion modules that make omni-modality work in vLLM-Omni.

Omni-Modality Pipeline Stages

The Thinker, Talker, and Code2Wav stages are defined in the Qwen-Omni stage configs; together they produce audio output, ending with waveform decoding.

| Stage | Role | Function | Outputs / Data Flow |
|---|---|---|---|
| Thinker | Understanding | Multimodal understanding stage. Processes text, image, video, or audio inputs to generate reasoning states. | Generates text tokens and hidden state embeddings (e.g., layers 0 and 24) passed to the Talker stage. |
| Talker | Codec logic | Speech synthesis stage. Converts the Thinker's text and hidden embeddings into multi-layer RVQ codec codes. | Produces 16 layers of RVQ codes passed downstream to the Code2Wav decoder. |
| Code2Wav | Waveform | Audio decoding stage. Transforms the RVQ codes into a high-fidelity audible waveform. | Final 24 kHz audio waveform tensor returned as the terminal response to the client. |
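
The hand-offs between the three stages can be sketched as typed data flowing through three functions. Only the 16 RVQ layers and the 24 kHz sample rate come from the description above; the tokenizer, the embeddings, the code values, and `SAMPLES_PER_FRAME` are placeholders invented for this illustration, and all function and class names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

NUM_RVQ_LAYERS = 16        # from the stage description above
SAMPLE_RATE_HZ = 24_000    # from the stage description above
SAMPLES_PER_FRAME = 1_920  # hypothetical: one codec frame = 80 ms of audio


@dataclass
class ThinkerOutput:
    text_tokens: List[int]
    hidden_states: List[List[float]]  # one vector per token


@dataclass
class TalkerOutput:
    rvq_codes: List[List[int]]  # shape: [NUM_RVQ_LAYERS][num_frames]


def thinker(prompt: str) -> ThinkerOutput:
    tokens = [ord(c) % 256 for c in prompt]  # placeholder tokenizer
    hidden = [[t / 255.0] for t in tokens]   # placeholder embeddings
    return ThinkerOutput(tokens, hidden)


def talker(t: ThinkerOutput) -> TalkerOutput:
    # Placeholder: one codec frame per text token, arbitrary code values.
    codes = [[(tok + layer) % 1024 for tok in t.text_tokens]
             for layer in range(NUM_RVQ_LAYERS)]
    return TalkerOutput(codes)


def code2wav(t: TalkerOutput) -> List[float]:
    num_frames = len(t.rvq_codes[0])
    # Silent placeholder waveform of the right length.
    return [0.0] * (num_frames * SAMPLES_PER_FRAME)


wave = code2wav(talker(thinker("hello")))
print(len(wave) / SAMPLE_RATE_HZ, "seconds of audio")
```

What matters here is the shape of each hand-off: tokens plus hidden states go in, layered codec codes come out, and only the last stage produces samples a client can play.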

Figure: Omni-Modal Request Workflow

Diffusion models

Diffusion models use a single diffusion stage and expose image/video outputs directly, without intermediate stages.

Other models: When adding a new model, you define its stages according to the model's architecture (there could be one or more).

🧠 Model Support

vLLM-Omni supports 20+ popular omni and diffusion model architectures (and growing rapidly), including Stable Diffusion 3.5.

Supported Models Matrix (NVIDIA / AMD)

Omni-Modality Serving Components & Support Matrix
| Models | Architecture | Example HF Models / Identifiers |
|---|---|---|
| Qwen3-Omni | Qwen3OmniMoeForConditionalGeneration | Qwen/Qwen3-Omni-30B-A3B-Instruct |
| Qwen2.5-Omni | Qwen2_5OmniForConditionalGeneration | Qwen/Qwen2.5-Omni-7B, Qwen/Qwen2.5-Omni-3B |
| BAGEL (DiT-only) | BagelForConditionalGeneration | ByteDance-Seed/BAGEL-7B-MoT |
| Qwen-Image (2512) | QwenImagePipeline | Qwen/Qwen-Image, Qwen/Qwen-Image-2512 |
| Qwen-Image-Edit (2509) | QwenImageEditPlusPipeline | Qwen/Qwen-Image-Edit-2509 |
| Wan2.2 (T2V / TI2V) | WanPipeline | Wan-AI/Wan2.2-T2V-A14B-Diffusers, Wan2.2-TI2V-5B |
| Wan2.2-I2V | WanImageToVideoPipeline | Wan-AI/Wan2.2-I2V-A14B-Diffusers |
| FLUX.2-klein | Flux2KleinPipeline | black-forest-labs/FLUX.2-klein-9B |
| Stable-Audio-Open | StableAudioPipeline | stabilityai/stable-audio-open-1.0 |
| Qwen3-TTS (12Hz) | Qwen3TTSForConditionalGeneration | Qwen3-TTS-12Hz-1.7B-VoiceDesign, Base, CustomVoice |
| Ovis-Image | OvisImagePipeline | OvisAI/Ovis-Image |
| LongCat-Image | LongcatImagePipeline | meituan-longcat/LongCat-Image |

List of Supported Models for NPU (i.e., Huawei Ascend Atlas)

| Models | Architecture | Example HF Models (NPU compatible) |
|---|---|---|
| Qwen3-Omni | Qwen3OmniMoeForConditionalGeneration | Qwen/Qwen3-Omni-30B-A3B-Instruct |
| Qwen2.5-Omni | Qwen2_5OmniForConditionalGeneration | Qwen/Qwen2.5-Omni-7B, Qwen2.5-Omni-3B |
| Qwen-Image / 2512 | QwenImagePipeline | Qwen/Qwen-Image, Qwen/Qwen-Image-2512 |
| Qwen-Image-Edit (2509/2511) | QwenImageEditPlusPipeline | Qwen/Qwen-Image-Edit-2511, Qwen-Image-Edit-2509 |
| Z-Image | ZImagePipeline | Tongyi-MAI/Z-Image-Turbo |
| FLUX.2-klein | Flux2KleinPipeline | black-forest-labs/FLUX.2-klein-9B |
| Qwen3-TTS-12Hz | Qwen3TTSForConditionalGeneration | Qwen3-TTS-12Hz-1.7B-VoiceDesign, Base, CustomVoice |
| LongCat-Image | LongcatImagePipeline | meituan-longcat/LongCat-Image |

⚡ Diffusion stage acceleration

  • TeaCache: Caches transformer computations when consecutive timesteps are similar, giving a speedup of ~1.5x–2.0x
  • Cache-DiT: Speeds up diffusion transformers via DBCache (dual block), TaylorSeer (forecasting), and SCM
  • Ulysses-SP: Splits the sequence dimension and uses all-to-all communication so attention heads can be processed in parallel
  • Ring-Attention: Splits the sequence dimension and circulates KV blocks in a ring topology, accumulating attention across the sequence shards
  • Other: Parallelism, quantization, fused ops, timestep distillation
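
The caching idea in the first bullet can be illustrated with a toy residual cache: when the input to an expensive block has barely changed since the last real computation, reuse the previous residual instead of running the block again. This is a simplified sketch of the general principle, not TeaCache's actual algorithm or vLLM-Omni's implementation; `ResidualCache`, `rel_l1`, `slow_block`, and the threshold value are all hypothetical.

```python
def rel_l1(a, b):
    """Relative L1 distance between two equal-length vectors."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(y) for y in b) or 1.0
    return num / den


class ResidualCache:
    """Reuse the last residual while the input drifts only slightly."""

    def __init__(self, block, threshold):
        self.block = block        # the expensive transformer call
        self.threshold = threshold
        self.prev_input = None
        self.residual = None
        self.accumulated = 0.0    # drift since the last real computation
        self.calls = 0            # how many times block really ran

    def __call__(self, x):
        if self.prev_input is not None:
            self.accumulated += rel_l1(x, self.prev_input)
        self.prev_input = list(x)
        if self.residual is not None and self.accumulated < self.threshold:
            # Cheap path: apply the cached residual instead of running block.
            return [xi + ri for xi, ri in zip(x, self.residual)]
        out = self.block(x)
        self.calls += 1
        self.residual = [oi - xi for oi, xi in zip(out, x)]
        self.accumulated = 0.0
        return out


def slow_block(x):
    """Stand-in for one denoising step of a diffusion transformer."""
    return [0.9 * v for v in x]


cached = ResidualCache(slow_block, threshold=0.05)
for t in range(10):
    cached([1.0 + 0.01 * t, 2.0])  # inputs that change slowly per step
print("block ran", cached.calls, "times over 10 steps")
```

The threshold is the quality/speed knob: a larger value skips more steps but lets the cached residual drift further from what the block would have produced.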

Interface design

  • Offline Inference
    • The Omni class provides a Python interface for offline batched inference
  • Online Inference
    • Similar to vLLM, vLLM-Omni also provides a FastAPI-based server for online serving. 

vLLM-Omni Endpoints

  • Models are served through OpenAI-compatible endpoints such as /v1/chat/completions and /v1/images/generations.
  • /v1/audio/transcriptions and /v1/audio/translations: Audio transcription/translation
  • /v1/audio/speech: Text-to-speech (TTS) generation for supported models (e.g., Qwen3-TTS)
  • /v1/messages: Anthropic-compatible messages (if supported_tasks includes "generate")

III. Online Inference Examples

1. Text-to-Image

# 1) Serve
vllm serve Qwen/Qwen-Image --omni --port 8091

# 2) Generate, then decode the base64 payload into a PNG
curl -s http://localhost:8091/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a cup of coffee on the table",
    "model": "Qwen/Qwen-Image",
    "n": 1,
    "size": "1024x1024"
  }' | jq -r '.data[0].b64_json' | base64 -d > coffee.png
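
For readers who prefer Python over curl, the same request can be sketched with the standard library only. The endpoint, port, and JSON fields are taken from the command above; the helper names `build_image_request`, `first_image_bytes`, and `generate` are made up for this example, and `generate` needs the server from step 1 to be running.

```python
import base64
import json
import urllib.request

BASE_URL = "http://localhost:8091"  # same host/port as the serve command


def build_image_request(prompt, model="Qwen/Qwen-Image",
                        size="1024x1024", n=1):
    """Build the POST request that the curl command sends."""
    payload = {"prompt": prompt, "model": model, "n": n, "size": size}
    return urllib.request.Request(
        f"{BASE_URL}/v1/images/generations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_image_bytes(response_body):
    """Decode data[0].b64_json, like the `jq | base64 -d` step."""
    doc = json.loads(response_body)
    return base64.b64decode(doc["data"][0]["b64_json"])


def generate(prompt, out_path="coffee.png"):
    """Send the request and write the decoded image to disk."""
    with urllib.request.urlopen(build_image_request(prompt)) as resp:
        image = first_image_bytes(resp.read())
    with open(out_path, "wb") as f:
        f.write(image)
    return out_path
```

Calling `generate("a cup of coffee on the table")` against a running server should leave the image in coffee.png, mirroring the shell pipeline.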

2. Image-to-Image

# 1) Serve
vllm serve Qwen/Qwen-Image-Edit --omni --port 8091

# 2) Generate (edit); the input image is passed as a base64 data URL
curl -s http://localhost:8091/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "turn this cat to a dog",
    "model": "Qwen/Qwen-Image-Edit",
    "n": 1,
    "size": "1024x1024",
    "image": "data:image/png;base64,..."
  }' | jq -r '.data[0].b64_json' | base64 -d > edited.png
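
The "image" field above is a data URL whose base64 part is elided. A small helper like the following can produce such a string from a local file; `to_data_url` and `file_to_data_url` are hypothetical names for this sketch, and the MIME type is guessed from the file extension.

```python
import base64
import mimetypes


def to_data_url(data: bytes, filename: str) -> str:
    """Encode raw bytes as a data URL, guessing the MIME type from the name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"  # fallback for unknown extensions
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def file_to_data_url(path: str) -> str:
    """Read a file from disk and return it as a data URL."""
    with open(path, "rb") as f:
        return to_data_url(f.read(), path)


print(to_data_url(b"abc", "cat.png"))
```

The same helper can be used wherever a request expects inline media as a data URL, such as the video input in the multimodal example further down.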

3. Text-to-Speech (TTS)

# 1) Serve
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091

# 2) Generate speech and save the audio response
curl -s http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-Omni-7B",
    "input": "Hello, this is a test.",
    "voice": "Default"
  }' --output speech.wav
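
A quick way to sanity-check the saved file is to read its header with Python's built-in wave module. This sketch assumes the server returned a PCM WAV container (the source only shows the file being saved as speech.wav); `wav_info` and `write_silence` are hypothetical helpers, and `write_silence` exists only so the check can be tried without a server.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) of a PCM WAV file."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        frames = w.getnframes()
        return rate, w.getnchannels(), frames / rate


def write_silence(path, seconds=1.0, rate=24_000):
    """Write a mono 16-bit silent WAV, handy for testing without a server."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
```

Running `wav_info("speech.wav")` on the generated file reports its sample rate, which can be compared with the 24 kHz output described for the Code2Wav stage. If the file is not PCM WAV, `wave.open` raises `wave.Error`, which is itself a useful signal about the response format.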

4. Qwen3-Omni (Multimodal: video input → text + audio output)

# 1) Serve
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091

# 2) Generate (video input → text + audio output)
curl -s http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": [
          {
            "type": "text",
            "text": "You are Qwen, a virtual Cyborg developed by the Qwen Team, capable of perceiving auditory and visual inputs, as well as generating text and speech."
          }
        ]
      },
      {
        "role": "user",
        "content": [
          {"type": "video_url", "video_url": {"url": "data:video/mp4;base64,..."}},
          {"type": "text", "text": "Describe the video briefly."}
        ]
      }
    ],
    "extra_body": {
      "sampling_params_list": [
        {"temperature": 0.4, "max_tokens": 2048},
        {"temperature": 0.9, "max_tokens": 4096},
        {"temperature": 0.0, "max_tokens": 65536}
      ]
    }
  }' | jq -r '.choices[0].message.content' > response.json

Coming Up Next

That wraps up our beginner's intro to vLLM-Omni, where we explored the key features that make vLLM-Omni a fast, efficient, and production-ready multimodal serving engine. Hopefully this breakdown made it a little easier to grasp, as the project is still growing at a restless pace.
In Part 2, we'll walk through diffusion models from noise to pixel, covering the research papers from 2021 to 2024!

Stay tuned for Part 2 ⚡
