
Intro
Last year, I dove deep into Ollama inference: I ended up building and speaking about Ollama Kubernetes deployments, along with rich documentation in my ollama_lab repo and a quantization article. This year, Cloudthrill's focus is vLLM inference, which is a next-level beast from a model-serving standpoint. Exploring multiple inference options is time-intensive, but we believe it is essential for recommending the right fit to clients.
💡In this series, we aim to provide a solid foundation in vLLM's core concepts to help you understand how it works and why it's emerging as the de facto choice for LLM deployments.
While authored independently, this series benefited from the LMCache team's supportive presence and openness to guide.
So here’s to more insights in the world of LLM inference🫶🏻
What is vLLM 🚀
vLLM (Virtual Large Language Model) is an open-source library—originally developed in the Sky Computing Lab, UC Berkeley—designed to efficiently serve LLMs. It’s built to solve one of the biggest challenges in inference: “serving massive models efficiently at scale with high throughput and low latency“. Think of vLLM as a high-performance engine that makes your favorite LLMs run faster and more efficiently when deployed in production environments. Ollama’s big brother.
Why vLLM over Ollama?
If you’re deploying LLMs in production, you should seriously consider vLLM over Ollama because:

- Higher Throughput: Serve more requests per second with the same hardware
- Lower Latency: Faster response times for users compared to other serving engines
- Scalability: Scales from a single instance to distributed deployments with zero code changes
- Resource & Cost Efficiency: Smarter GPU usage = lower costs
- API Key Authentication: Built-in support for securing the API (Ollama doesn't offer this)
- Production-Ready: Includes monitoring, routing, and other production-grade features
- Longer Sequences: Better able to handle long input/output sequences
- Cloud Standard: The go-to model-serving engine across cloud providers (AWS, Azure, OCI, etc.)
🧠What is a KV Cache?
The Key-Value (KV) cache is a memory structure that stores intermediate attention states (keys and values) so they don't have to be recomputed for every generated token. One of the key performance optimizations in vLLM is how it handles the KV cache to speed up inference. I wrote a whole blog about it here ➡️ KV_cache explained ✍🏻
How vLLM optimizes the KV Cache
As sequences get longer, the KV cache grows, consuming more VRAM and creating memory fragmentation issues.
vLLM addresses this KV cache challenge with its innovative architecture.

KV cache optimization examples (a hedged code example follows this list):
- KV Cache Offloading: Moves KV caches from GPU memory to CPU when they're not immediately needed.
- Intelligent Routing: The router can direct requests to maximize KV cache reuse.
- PagedAttention: Reduces memory fragmentation, much like an OS pages virtual memory.
- Chunked Prefill & Automatic Prefix Caching: see the details in our KV_cache explained post.
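As a quick illustration, several of these optimizations can be toggled through engine arguments. This is a minimal sketch using the offline LLM class (covered later in this post); argument names reflect recent vLLM releases and may differ in yours, so check your version's docs before copying:

from vllm import LLM

# Hedged sketch: these engine arguments exist in recent vLLM releases,
# but names and defaults can change between versions.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,     # reuse KV blocks across requests sharing a prompt prefix
    enable_chunked_prefill=True,    # split long prefills into chunks interleaved with decode
    swap_space=8,                   # GiB of CPU RAM for swapping out preempted KV blocks
)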
Core Architecture of vLLM (V0)
vLLM’s architecture consists of several key components working together:

1. The Engine: Orchestrating Everything
At the heart of vLLM is the LLMEngine class, which coordinates all aspects of the inference process (a minimal usage sketch follows the list below):

- Processes input requests (prefill).
- Manages the scheduling of these requests.
- Coordinates model execution.
- Handles output processing.
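To make this concrete, here is a minimal, hedged sketch of driving the LLMEngine directly; most users go through the higher-level interfaces described next, and exact signatures can vary slightly across vLLM versions:

from vllm import EngineArgs, LLMEngine, SamplingParams

# Build the engine from engine arguments (model, memory, parallelism settings).
engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-125m"))

# Submit a request: the engine handles prefill, scheduling, and decoding internally.
engine.add_request("request-0", "San Francisco is a", SamplingParams(max_tokens=16))

# Each step() runs one scheduling + model-execution iteration.
while engine.has_unfinished_requests():
    for output in engine.step():
        if output.finished:
            print(output.outputs[0].text)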
⚖️Offline and Online inference
There are two ways to interact with LLMs in vLLM:


A. Offline inference (LLM class)
This is the client interface used for synchronous, batched processing of prompts in offline scenarios (see the example after this list):
- For cases where you want to process a batch of prompts at once.
- Native, in-process inference via the LLM class, with no separate inference server.
- For workloads without real-time requirements (e.g., no multi-turn conversation).
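A minimal offline example using the LLM class (the model name here is just an illustration; any Hugging Face model vLLM supports will do):

from vllm import LLM, SamplingParams

llm = LLM(model="NousResearch/Meta-Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.0, max_tokens=32)

# Process a whole batch of prompts in one call; no server involved.
outputs = llm.generate(["San Francisco is a", "The capital of France is"], sampling)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)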
B. Online inference (AsyncLLMEngine)
This is the client interface that provides an asynchronous API for online serving:
- For real-time, interactive use cases like chatbots.
- Handles multiple concurrent requests and streams outputs to clients.
- API integration: an OpenAI-compatible API server, launched via the vllm serve CLI command.
Examples
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
Alternatively, launching the API server via the Python module:
$ python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B
Client (REST)
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3-8B",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
Client (Python, OpenAI SDK)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
)
print(completion.choices[0].message)
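Since online serving is about real-time interaction, here is a hedged sketch of streaming tokens back with the same OpenAI SDK (the server launched above must already be running):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

# stream=True returns chunks as the model generates, instead of one final message.
stream = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Tell me a short joke."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)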
2. Scheduling layer
The scheduler determines which requests to process in each iteration, managing resources and request priorities.

It makes decisions about:
- Which sequences to run (prefill vs. decode)
- Block allocation and swapping
- Batching and prioritization
Components
- Sequence Manager: Tracks and prioritizes sequences and manages sequence groups for batched processing
  - ➡️States: running, waiting, swapped
- Block Space Manager: Allocates KV cache blocks to sequences and handles swapping blocks between GPU and CPU
- Continuous Batching: Interleaves prefill (prompt processing) and decode (generation) operations (see the simplified sketch after this list)
- Scheduler Outputs: Contains metadata about the sequence groups to process and specifies the memory operations to perform (blocks to swap in/out/copy)
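To illustrate the idea (this is not vLLM's actual implementation), here is a deliberately simplified, hypothetical sketch of a continuous-batching loop: new requests are admitted while KV memory allows, and finished sequences free their blocks every iteration:

from collections import deque

# Toy illustration only; names and logic are hypothetical, not vLLM classes.
waiting, running = deque(["req-A", "req-B", "req-C"]), []
free_blocks = 4          # pretend KV cache capacity, in blocks
blocks_per_request = 2

def step():
    """One scheduler iteration: admit what fits, then run one decode step for the batch."""
    global free_blocks
    while waiting and free_blocks >= blocks_per_request:     # admit new requests (prefill)
        req = waiting.popleft()
        free_blocks -= blocks_per_request
        running.append({"id": req, "tokens_left": 3})
    for seq in list(running):                                # one decode token per sequence
        seq["tokens_left"] -= 1
        if seq["tokens_left"] == 0:                          # finished: release its KV blocks
            running.remove(seq)
            free_blocks += blocks_per_request

while waiting or running:
    step()
print("all requests served, free blocks:", free_blocks)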

3. Execution layer

The execution layer handles model inference, sitting between the scheduler (which chooses tasks) and memory management (which manages the KV cache), ensuring efficient LLM execution. It consists of several key components:
3.1. Model Executor
The central component that manages model execution, locally or across distributed GPUs. It is responsible for:
- Initializing the model weights.
- Executing forward passes through the model.
- Managing distributed execution across multiple GPUs/TPUs.
- Handling KV cache initialization.
- Supporting multiple backends (Ray, Multiprocessing, Uniprocess).
- Coordinating workers for model execution across devices.
3.2. Workers
The Worker executes a partition of the model on a single GPU (or part of a GPU) and is responsible for:
- Initializing the model and GPU resources.
- Managing the KV cache through the CacheEngine.
- Executing model forward passes via the ModelRunner.
- Processing sampling operations.
3.3. Model Runner
Each worker delegates to a model runner, which handles the execution of the LLM on GPU/TPU hardware (a simplified sketch of the executor → worker → model runner chain follows this list). It's responsible for:
- Model Execution: takes prepared model inputs and produces outputs.
- Input Preparation: building token inputs and attention metadata.
- Managing hardware resources such as the KV cache.
- Handling optimizations: capturing CUDA graphs, mixed precision, kernel fusion.
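Purely as an illustration of the delegation chain (class and method names below are hypothetical, not vLLM's actual APIs):

# Hypothetical sketch of the execution layer's delegation chain.
class ToyModelRunner:
    def execute(self, batch):
        # In vLLM this is where the forward pass, CUDA graphs, etc. happen.
        return [f"logits for {item}" for item in batch]

class ToyWorker:
    def __init__(self):
        self.kv_cache = {}                # managed via a CacheEngine in real vLLM
        self.runner = ToyModelRunner()
    def run(self, batch):
        return self.runner.execute(batch)

class ToyExecutor:
    def __init__(self, num_workers=2):
        self.workers = [ToyWorker() for _ in range(num_workers)]  # one per GPU partition
    def execute_model(self, batch):
        # Real executors broadcast the batch to all tensor-parallel workers.
        return [w.run(batch) for w in self.workers][0]

print(ToyExecutor().execute_model(["seq-0", "seq-1"]))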
4. Memory management layer
This layer optimizes the usage of limited GPU memory to support efficient & concurrent inference. It’s responsible for:

- KV Cache Management: Manages key-value cache blocks for transformer attention.
- Block Management: Allocates, deallocates, and swaps memory blocks.
- Memory Profiling: Analyzes memory usage to determine optimal allocation.
- Memory Optimization: Implements techniques like FP8 KV cache to reduce the memory footprint.
- LMCache Integration: Provides KV cache offloading capabilities.
PagedAttention: The Secret Sauce
PagedAttention is the core innovation that efficiently manages the attention key-value memory (a toy block-table sketch follows the list below).
- Treats KV cache as non-contiguous memory blocks.
- Maps logical sequence positions to physical memory locations.
- Enables more efficient memory utilization.
- Reduces memory fragmentation.
- Supports dynamic sequence lengths.
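A toy sketch of the block-table idea behind PagedAttention (purely illustrative, with made-up numbers; vLLM's real block manager is far more involved):

BLOCK_SIZE = 4   # tokens per KV block (vLLM commonly uses 16)

# Physical KV blocks live in one big pool; free ones are tracked by index.
free_blocks = list(range(8))
block_tables = {}   # sequence id -> list of physical block indices (its "page table")

def append_token(seq_id, position):
    """Map a logical token position to a physical block, allocating on demand."""
    table = block_tables.setdefault(seq_id, [])
    if position // BLOCK_SIZE >= len(table):        # current block full -> grab a new one
        table.append(free_blocks.pop(0))
    block = table[position // BLOCK_SIZE]
    return block, position % BLOCK_SIZE             # (physical block, offset inside it)

# Two sequences grow independently; their logical positions map to scattered physical blocks.
for pos in range(6):
    append_token("seq-A", pos)
    append_token("seq-B", pos)
print(block_tables)   # e.g. {'seq-A': [0, 2], 'seq-B': [1, 3]}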
vLLM Key Components Summary (V0)
Component | Purpose | Key Classes |
---|---|---|
Engine | Core orchestration | LLMEngine, AsyncLLMEngine, LLM |
Worker | Model execution | Worker, ModelRunner, GPUModelRunner |
Scheduler | Request scheduling | Scheduler |
Memory Management | KV cache optimization | CacheEngine, BlockSpaceManager |
Configuration | System settings | VllmConfig, ModelConfig, CacheConfig |
Input Processing | Request handling | InputProcessor, Processor |
Output Processing | Response generation | OutputProcessor |
Distributed | Multi-GPU/node execution | ExecutorBase, RayDistributedExecutor |
vLLM Configuration Basics
vLLM uses the VllmConfig class as its main configuration hub, letting users control:
- Model Specification: Which model to serve and from where to load it.
- Resource Allocation: How many GPUs, CPU cores, and memory to allocate.
- Inference Parameters: Settings like chunk size, data type, and maximum sequence length.
- CacheConfig: Manages KV cache memory settings.
- ParallelConfig: Handles distributed execution settings.
- SchedulerConfig: Controls request scheduling behavior.
- SpeculativeConfig: Controls speculative decoding features.
A basic configuration (here in YAML form, as used for Kubernetes deployments) typically includes:
modelSpec:
  - name: "llama3"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct"
    requestGPU: 1
    vllmConfig:
      maxModelLen: 16384
      dtype: "bfloat16"
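For reference, roughly the same settings map onto engine arguments when instantiating vLLM directly in Python (a hedged sketch; argument names follow recent releases):

from vllm import LLM

# Mirrors the YAML above: model source, context length, dtype, and one GPU.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=16384,
    dtype="bfloat16",
    tensor_parallel_size=1,
)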
Beyond VllmConfig, vLLM also reads a set of environment variables for low-level control (a short example follows the list). These are defined in vllm/envs.py and include:
- VLLM_USE_V1: Controls whether to use the V1 architecture (default is True).
- VLLM_TARGET_DEVICE: Specifies the target device (cuda, rocm, neuron, cpu).
- VLLM_CACHE_ROOT: Sets the root directory for vLLM cache files.
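A hedged sketch of setting these from Python; values here are illustrative only, and the variables should be set before vLLM is imported so that envs.py picks them up:

import os

os.environ["VLLM_CACHE_ROOT"] = "/data/vllm-cache"   # where vLLM stores its cache files
os.environ["VLLM_USE_V1"] = "1"                      # opt into the V1 engine explicitly

from vllm import LLM  # import after the variables are set

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")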
🚀Coming Up Next (Pt 2)
In the next post of the series, we'll dive deeper into vLLM performance optimization techniques and features such as PagedAttention, attention backends (FlashAttention/FlashInfer), speculative decoding, chunked prefill, disaggregated prefill, and more.
Stay tuned for Part 2: “vLLM Performance Optimization Features”!
🙋🏻♀️If you like this content, please subscribe to our blog newsletter ❤️.