
Intro
In Part 2 of our vLLM for Beginners series, we explored performance features like PagedAttention, attention backends, and prefill/decode optimization. In this final part, we’ll shift from theory to practice, covering how to deploy vLLM across different environments, from source builds to Docker containers (K8s deployment will be covered separately).
💡In this series, we aim to provide a solid foundation of vLLM core concepts to help you understand how it works and why it’s emerging as a de facto choice for LLM deployments.
While authored independently, this series benefited from the LMCache team’s supportive presence and openness to guide.
Translating LLM jargon for tech enthusiasts🫶🏻
I. System Requirements
Before installing vLLM, ensure your system meets these requirements:
Requirement | Details |
---|---|
OS | Linux🐧 |
Python Version | 3.9, 3.10, 3.11, or 3.12 |
Hardware Support | NVIDIA GPUs, AMD GPUs (ROCm), Intel CPUs/GPUs (xPU), PowerPC CPUs, AWS Neuron, Google TPUs, Intel Gaudi (HPU), or CPU-only |
CUDA Version | 11.8+ (for NVIDIA GPUs) |
ROCm Version | 5.6+ (for AMD GPUs) |
Disk Space | ~1GB for installation |
Memory | Dependent on model size and quantization (8-16GB+ recommended). CPU-only mode uses system DRAM. |
Under the hood, vLLM speaks to your GPU via optimized kernels, leveraging features like `torch.compile` to enable LLM computations.
Platform Support Matrix
vLLM supports a different level of optimization for each hardware platform; these are configured automatically:
Feature | NVIDIA CUDA | AMD ROCm | CPU | Intel XPU | TPU | HPU | AWS Neuron |
---|---|---|---|---|---|---|---|
PagedAttention | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Tensor Parallelism | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Pipeline Parallelism | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
FlashAttention | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
Quantization | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Custom MoE Kernels | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
Multi-modal Support | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Model Support
vLLM supports many model architectures across various tasks, prioritizing popular models and relying on community support for others. Supported types include:
- Generative models: text generation and multi-turn chat tasks (60+ models: DeepSeek, Qwen, Gemma, etc.).
- Pooling models: extract embeddings, classifications, or scores from text inputs.
- Multimodal models: text, image, audio, video.
Tasks: If a model supports more than one task, you can select one via the `--task <task>` argument (see the example after the table below).
Model Category | Task |
---|---|
Generative Models | `generate` |
Pooling Models | `embed`, `classify`, `score`, `reward` |
Multimodal Models | `transcription` |
Speculative Models | `draft` |
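For instance, a pooling model can be served with an explicit task. A minimal sketch (the model name below is only an illustrative embedding model, not something prescribed by vLLM):
# Serve an embedding model and explicitly select the pooling task
vllm serve BAAI/bge-base-en-v1.5 --task embed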
Model Registry
vLLM uses a model registry that maps models to their implementations. When you specify a model, vLLM looks it up in the registry based on three attributes:
- Architecture name (e.g., `"LlamaForCausalLM"`)
- Module name (e.g., `"llama"`)
- Implementation class (e.g., `"LlamaForCausalLM"`)
💡The mappings for all models are defined in `registry.py`; see the snippet below 👇🏼:
_TEXT_GENERATION_MODELS = {
    # 👉🏼 Architecture -> ("module name", "implementation class") 👈🏼
    "LlamaForCausalLM": ("llama", "LlamaForCausalLM"),
    "MistralForCausalLM": ("llama", "LlamaForCausalLM"),
    "PhiForCausalLM": ("phi", "PhiForCausalLM"),
    # ... many more mappings
}
...
# All the model type mappings are combined into a unified registry:
_VLLM_MODELS = {
    **_TEXT_GENERATION_MODELS,
    **_EMBEDDING_MODELS,
    **_CROSS_ENCODER_MODELS,
    **_MULTIMODAL_MODELS,
    **_SPECULATIVE_DECODING_MODELS,
    **_TRANSFORMERS_MODELS,
}
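To see which architectures your installed vLLM build actually registers, the registry is exposed as `ModelRegistry`. A quick check (assuming vLLM is already installed in the active environment):
# Print every architecture name known to your vLLM installation
python -c "from vllm import ModelRegistry; print(ModelRegistry.get_supported_archs())"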
II. vLLM Deployment Options🚀
There are myriad flavors of and ways to install vLLM. We’ll explore the ones most likely to suit your use case (GPU/CPU).
- Build from source – Low level installation
- Python installation – Basic installation
- Docker-based deployment – Simplest way to get started with containerized environments
- Kubernetes – Enterprise-grade deployment for scalable prod environments (will be covered in another post)
💡Each option has its own advantages depending on your requirements for scalability, ease of use, and performance.
Performance Considerations
When deploying vLLM, consider these performance factors:
- GPU Memory: Larger models require more GPU memory. Use `--gpu-memory-utilization` to control memory usage.
- Tensor Parallelism: For large models, distribute across multiple GPUs with `--tensor-parallel-size`.
- Quantization: Reduce memory requirements with `--dtype` options like `half` or `bfloat16`.
- Swap Space: Enable CPU offloading with `--swap-space` for handling more concurrent requests (see the combined example after this list).
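These flags can be combined on a single `vllm serve` invocation. The values below are assumptions for illustration only; tune them to your own model and hardware:
# Hypothetical tuning example: cap GPU memory usage at 90%, split the model across 2 GPUs,
# run in bfloat16, and reserve 4 GiB of CPU swap space per GPU for the KV cache
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --swap-space 4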
1. Install on GPU
Environment Prereqs 🎯
1. 🧰 Install NVIDIA Drivers (Ubuntu)
2. 📦 Download & install NVIDIA CUDA toolkit (Ubuntu)
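Once the driver and CUDA toolkit are installed, a quick sanity check (standard Ubuntu setup assumed) confirms the GPU is visible before going further:
# Check driver version and GPU visibility
nvidia-smi
# Check the installed CUDA toolkit version
nvcc --version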
A. Build Wheel from scratch
This is the lowest-level installation, as you build vLLM from source. There are two ways to do it:
1. Python-only build (without compilation): reuses precompiled C++/CUDA binaries via `VLLM_USE_PRECOMPILED=1`
git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install .
2. Full build (compile everything), which can take a while
git clone https://github.com/vllm-project/vllm.git
cd vllm
export MAX_JOBS=6 # limit parallel processes
pip install --editable .
💡With pip’s `--editable` flag, changes you make to the code will be reflected when you run vLLM.
B. Install Python vLLM package
Environment
A. Using `uv` (faster installs than pip)
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create and enter a project directory
mkdir -p ~/ai/vllm
cd ~/ai/vllm
uv venv --python 3.12 --seed
# Activate with:
source .venv/bin/activate
B. Using Conda
# Install Miniconda
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
# Activate with:
source ~/miniconda3/bin/activate
# Create and activate a new environment (update the python version to your own)
conda create -n myenv python=3.12 -y
conda activate myenv
Install vLLM
Install vLLM along with its necessary dependencies, ensuring automatic GPU backend selection (`--torch-backend=auto`):
# Last stable version with auto cuda support
uv pip install vllm --torch-backend=auto
# last dev version with specific cuda 12.6
uv pip install vllm --torch-backend=cu126 --extra-index-url https://wheels.vllm.ai/nightly
# specific stable version
uv pip install vllm==0.8.5
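To confirm the installation succeeded (a minimal sanity check, assuming the virtual environment from above is active):
# Print the installed vLLM version
python -c "import vllm; print(vllm.__version__)"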
C. Docker-Based Deployment
Using Pre-built Docker Images
vLLM offers official Docker images that come with all dependencies pre-installed, making it easy to get started:
# Run the vLLM OpenAI-compatible server with a Mistral model
docker run --runtime nvidia \
--gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<your-token>" \
-p 8000:8000 --ipc=host \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-v0.1
This command:
- Mounts your local HuggingFace cache to avoid re-downloading models
- Exposes the API on port `8000`
- Uses host IPC for faster tensor parallelism
- Sets the model to serve
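Once the container is up, you can verify it from the host with a simple completion request (the prompt and token budget here are arbitrary):
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-7B-v0.1", "prompt": "San Francisco is a", "max_tokens": 16}'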
D. Kubernetes
For production environments, vLLM can be deployed on Kubernetes through several methods, such as:
- K8s native Manifests
- vllm-project/production-stack (Official subproject)
- KServe
- kubernetes-sigs/lws
- meta-llama/llama-stack
- substratusai/kubeai
- vllm-project/aibrix
🚀Advanced Configuration
For more complex deployments, you can configure:
- Multiple model replicas for load balancing
- Different resource allocations per model
- Custom routing strategies
- KV cache offloading with LMCache
The official vLLM production-stack implementation on Kubernetes will be covered in our next blog post. Stay tuned🎯.
2. Install On CPU
This was tricky; we even submitted PR #19156 to improve the missing instructions (CPU support is rather recent).
A. Build CPU wheel from source
As there are no pre-built CPU wheels, you must build one from source (set-up-using-python, build.inc).
Prerequisites
- Compiler: `gcc/g++ >= 12.3.0`
- Instruction Set Architecture (ISA): `AVX512` (recommended)
1. Install dependencies
sudo apt-get update -y
sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-dev
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
2. Clone the vLLM repo
git clone --branch v0.8.5 https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source
3. Install required Python packages
uv pip install --upgrade pip
uv pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy
uv pip install -v -r requirements/cpu.txt --index-url https://download.pytorch.org/whl/cpu
4. Build vLLM CPU wheel:
# KV cache size in GiB
export VLLM_CPU_KVCACHE_SPACE=2
# Bind inference threads to the first 4 CPU cores ('0-3'). Check available cores using: lscpu -e
export VLLM_CPU_OMP_THREADS_BIND=0-3
# Build the wheel using python build (recommended)
VLLM_TARGET_DEVICE=cpu CMAKE_DISABLE_FIND_PACKAGE_CUDA=ON python -m build --wheel --no-isolation
💡Also possible using `uv` (fastest option). This is the same as the python build, but through `uv`:
VLLM_TARGET_DEVICE=cpu CMAKE_DISABLE_FIND_PACKAGE_CUDA=ON uv build --wheel
5. Install the wheel (non-editable)
uv pip install dist/*.whl
💡The `--no-isolation` flag used above builds against the current environment instead of creating an isolated build environment (pip’s equivalent flag is `--no-build-isolation`).
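As a quick smoke test of the CPU build (a minimal sketch; the environment variables exported in step 4 are assumed to still be set), start the OpenAI-compatible server with a small model:
# Serve a small model on CPU to verify the freshly built wheel
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype bfloat16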
B. Install CPU Docker image
Two options are available:
1. Pre-built CPU images can be pulled from https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo
docker pull public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.8.5.post1
# Run the container
# VLLM_CPU_KVCACHE_SPACE=2 allocates 2 GiB for the KV cache on CPU
# VLLM_CPU_OMP_THREADS_BIND=0-3 binds inference to the first 4 CPU cores
docker run --rm \
--privileged=true \
--shm-size=4g \
-p 8000:8000 \
-e VLLM_CPU_KVCACHE_SPACE=2 \
-e VLLM_CPU_OMP_THREADS_BIND=0-3 \
public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.8.5.post1 \
--model=TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--dtype=bfloat16
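After the container starts, a quick check that the server is answering on the mapped port:
# List the models served by the container
curl http://localhost:8000/v1/models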
2. Build image from source
git clone --branch v0.8.5 https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source
docker build -f docker/Dockerfile.cpu --tag vllm-cpu-env --target vllm-openai .
# download the model
huggingface-cli login
huggingface-cli repo download meta-llama/Llama-3.2-1B-Instruct --local-dir ./llama3
# Launching the OpenAI-compatible server
# VLLM_CPU_KVCACHE_SPACE sets the KV cache space (GiB); VLLM_CPU_OMP_THREADS_BIND sets the CPU cores used for inference
docker run --rm \
--privileged=true \
--shm-size=4g \
-p 8000:8000 \
-v "$(pwd)/llama3:/models/llama3" \
-e VLLM_CPU_KVCACHE_SPACE=1 \
-e VLLM_CPU_OMP_THREADS_BIND=0-3 \
vllm-cpu-env \
--model=/models/llama3 \
--dtype=bfloat16
## Add --api-key supersecretkey to require a key from clients to access the model
# To let the container access the host's shared memory, use either --ipc=host or the --shm-size flag (as above)
III. Running vLLM server (Offline + Online)
vLLM provides offline and online (OpenAI-like) server endpoints. We’ll use the TinyLlama model in the following examples.
1. Offline Inference Server:
- Batch processing large numbers of prompts
- One-time inference jobs integrated into data processing pipelines
- Scenarios where you don’t need a persistent API server
Batch processing
Here we load the model and process a batch of prompts directly using the Python API.
from vllm import LLM, SamplingParams
# Initialize the model
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Define sampling parameters
sampling_params = SamplingParams(
temperature=0.7,
max_tokens=256,
top_p=0.95
)
# Prompt batch
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# perform the inference
outputs = llm.generate(prompts, sampling_params)
# Process the outputs
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Answer
Processed prompts: 100%|█████| 4/4 [00:40<00:00, 10.22s/it, est. speed input: 2 toks/s, output: 21.86 toks/s]
Prompt: 'Hello, my name is', Generated text: ' [Your Name] and I am a current student at [School Name]...
...
Prompt: 'The president of the United States is', Generated text: ' an important and influential leader in the world,
...
Prompt: 'The capital of France is', Generated text: ' Paris. The national language of France is French. The national
... # down to the last prompt ..
2. Online Inference Server (OpenAI-Compatible)
vLLM’s OpenAI-compatible server allows you to serve models with an API matching OpenAI‘s interface, making it easy to integrate with existing applications. It’s ideal for real-time applications and chatbots.
Running the Server
You can run the server using Docker or directly using the CLI on your machine:
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype auto
You can optionally add the `--api-key token-abc123` flag to require clients to send that key.
You can also use the Python entrypoint module:
python -m vllm.entrypoints.openai.api_server --model=TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype bfloat16
Supported APIs
The server supports multiple API endpoints, slightly different from Ollama’s (see the curl examples after this list):
- Completions API ➡️ `/v1/completions`
- Chat Completions API ➡️ `/v1/chat/completions`
- Embeddings API ➡️ `/v1/embeddings`
- Custom APIs like the Tokenizer API ➡️ `/tokenize`, `/detokenize`
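A quick way to exercise these endpoints from the command line (a sketch; the model name and API key must match whatever the server was started with):
# Chat Completions request against the local vLLM server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer token-abc123" \
  -d '{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'
# Tokenizer API: inspect how a prompt is tokenized
curl http://localhost:8000/tokenize \
  -H "Content-Type: application/json" \
  -d '{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "prompt": "Hello vLLM"}'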
Client Integration
Once the server is running, you can use the OpenAI Python client to interact with it (see open_ai_vllm_chat.py):
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",  # your vLLM server URL
api_key="token-abc123",
)
chat_response = client.chat.completions.create(
model="NousResearch/Meta-Llama-3-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world cup in 2022 ?"},
]
)
print("Chat response:",chat_response.choices[0].message)
Answer
$ python open_ai_vllm_chat.py
Chat response: ChatCompletionMessage(content='The 2022 FIFA World Cup, also called the 2022 FIFA World Cup,
took place from Nov. 20, 2022, to Dec. 18, 2022, in Qatar.
Argentina won the FIFA World Cup in a sensational Super Cup final against France on Dec. 18, 2022.', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[], reasoning_content=None)
Conclusion & Final Thoughts
vLLM offers flexible deployment options from simple Docker containers to production-grade Kubernetes deployments.
In this series, we’ve covered vLLM’s fundamentals, architecture, and now its deployment options. With these tools in hand, you’re ready to start serving LLMs efficiently for your applications using vLLM!
Next, we’ll dive deeper into production-stack deployment scenarios on Kubernetes.
🙋🏻♀️If you like this content please subscribe to our blog newsletter ❤️.