vLLM for beginners: Deployment Options (Part III)


Intro

In Part 2 of our vLLM for beginners series, we explored performance features like PagedAttention, attention backends, and prefill/decode optimization. In this final part, we’ll shift from theory to practice, covering how to deploy vLLM across different environments, from source builds to Docker containers (Kubernetes deployment will be covered separately).

💡In this series, we aim to provide a solid foundation of vLLM core concepts to help you understand how it works and why it’s emerging as a de facto choice for LLM deployments.

Acknowledgment:
While authored independently, this series benefited from the LMCache team’s supportive presence and openness to provide guidance.

Translating LLM jargon for tech enthusiasts🫶🏻

I. System Requirements

Before installing vLLM, ensure your system meets these requirements:

  • OS: Linux🐧
  • Python version: 3.9, 3.10, 3.11, or 3.12
  • Hardware support: NVIDIA GPUs, AMD GPUs (ROCm), Intel CPUs/GPUs (XPU), PowerPC CPUs, AWS Neuron, Google TPUs, Intel Gaudi (HPU), or CPU-only
  • CUDA version: 11.8+ (for NVIDIA GPUs)
  • ROCm version: 5.6+ (for AMD GPUs)
  • Disk space: ~1 GB for installation
  • Memory: depends on model size and quantization (8-16 GB+ recommended); CPU-only deployments use DRAM instead
💡vLLM uses PyTorch as the interface to your GPU: it runs LLM computations through optimized kernels and leverages features like torch.compile.
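Before installing anything, a few standard commands can confirm your machine meets these requirements (shown for an NVIDIA/Ubuntu box; skip whatever doesn’t apply to your platform):

# Python version (3.9-3.12 is supported)
python3 --version

# NVIDIA driver and the highest CUDA version it supports
nvidia-smi

# CUDA toolkit version, if the toolkit is already installed
nvcc --version

# Available memory and disk space
free -h
df -h .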

Platform Support Matrix

vLLM supports different levels of optimization for each hardware platform, and these are configured automatically. The support matrix covers PagedAttention, tensor parallelism, pipeline parallelism, FlashAttention, quantization, custom MoE kernels, and multi-modal support across the NVIDIA CUDA, AMD ROCm, CPU, Intel XPU, TPU, HPU, and AWS Neuron backends, with coverage varying by platform.

Model Support

vLLM supports a wide range of model architectures across different tasks, prioritizing popular models and relying on community support for others. Supported types include:

  1. Generative models: text generation and multi-turn chat tasks (60+ models: DeepSeek, Qwen, Gemma, etc.).
  2. Pooling models: extract embeddings, classifications, or scores from text inputs.
  3. Multimodal models: text, image, audio, and video inputs.
Tasks: if a model supports more than one task, you can select one via the --task <task> argument (see the example below):
  • Generative models ➡️ generate
  • Pooling models ➡️ embed, classify, score, reward
  • Multimodal models ➡️ transcription
  • Speculative models ➡️ draft
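For instance, with the CLI you pass --task when serving a pooling model for embeddings. The model names below are just illustrative picks, not requirements:

# Generative model: the default task is generate
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0

# Pooling model explicitly served for embeddings
vllm serve BAAI/bge-base-en-v1.5 --task embed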

Model Registry

vLLM uses a model registry that maps models to their implementations. When you specify a model, vLLM looks it up in the registry based on 3 attributes:

  1. Architecture name (e.g., "LlamaForCausalLM")
  2. Module name (e.g., “llama“)
  3. Implementation class (e.g., “LlamaForCausalLM“)
💡The mappings for all models are defined in registry.py; see the snippet 👇🏼:
_TEXT_GENERATION_MODELS = {
    # 👉🏼 Architecture name -> ("module name", "implementation class") 👈🏼
    "LlamaForCausalLM": ("llama", "LlamaForCausalLM"), 
    "MistralForCausalLM": ("llama", "LlamaForCausalLM"),
    "PhiForCausalLM": ("phi", "PhiForCausalLM"),
    # ... many more mappings
}
...
## All the model type mappings are combined into a unified registry:
_VLLM_MODELS = {
    **_TEXT_GENERATION_MODELS,
    **_EMBEDDING_MODELS,
    **_CROSS_ENCODER_MODELS,
    **_MULTIMODAL_MODELS,
    **_SPECULATIVE_DECODING_MODELS,
    **_TRANSFORMERS_MODELS,
}

New models: you can either add them to the vLLM codebase or run them through the Transformers backend by following the custom-model steps (see the sketch below).
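As a rough sketch, recent vLLM versions also expose a --model-impl flag that forces the Transformers backend for an architecture without a native vLLM implementation; check vllm serve --help on your version before relying on it, and treat the placeholder model name as hypothetical:

# Fall back to the Transformers backend for a model not (yet) in the registry
vllm serve <org/your-custom-model> --model-impl transformers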

II. vLLM Deployment Options🚀

There are many ways to install vLLM. We’ll explore the ones most likely to suit your use case (GPU/CPU).

  1. Build from source – Low level installation
  2. Python installation – Basic installation
  3. Docker-based deployment – Simplest way to get started with containerized environments
  4. Kubernetes – Enterprise-grade deployment for scalable prod environments (will be covered in another post)

💡Each option has its own advantages depending on your requirements for scalability, ease of use, and performance.

Performance Considerations

When deploying vLLM, consider these performance factors:

  • GPU Memory: Larger models require more GPU memory. Use --gpu-memory-utilization to control memory usage.
  • Tensor Parallelism: For large models, distribute across multiple GPUs with --tensor-parallel-size.
  • Precision/Quantization: Reduce memory requirements with lower-precision --dtype options like half or bfloat16, or by serving quantized model weights.
  • Swap Space: Enable CPU offloading with --swap-space to handle more concurrent requests (see the combined example below).
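As an illustration, these knobs are plain CLI flags that can be combined in a single command; the values below are arbitrary examples, not tuned recommendations:

# --gpu-memory-utilization : fraction of each GPU's memory vLLM may use
# --tensor-parallel-size   : shard the model across 2 GPUs
# --dtype                  : run in bfloat16 precision
# --swap-space             : GiB of CPU swap space per GPU for preempted requests
vllm serve mistralai/Mistral-7B-v0.1 \
    --gpu-memory-utilization 0.90 \
    --tensor-parallel-size 2 \
    --dtype bfloat16 \
    --swap-space 4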

1. Install on GPU

Environment Prereqs 🎯

Note: for this section we’ll use references from a GitHub wiki I wrote covering the necessary steps for vLLM setups.

1. 🧰 Install NVIDIA Drivers (Ubuntu)
2. 📦 Download & install NVIDIA CUDA toolkit (Ubuntu)

A. Build Wheel from scratch

This is the lowest-level installation, since you build vLLM from source. There are two ways to do it:

1. Python-only build (without compilation), which reuses already-compiled C++/CUDA libraries

git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install .

2. Full build (compile everything), which can take a while

git clone https://github.com/vllm-project/vllm.git
cd vllm
export MAX_JOBS=6  # limit parallel processes
pip install --editable .
💡Environment variables: You can find the full list of environment variables to configure the system here.
💡With pip’s --editable flag, changes you make to the code will be reflected when you run vLLM.

B. Install Python vLLM package

Environment

A. Using uv (faster install than pip)

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a project directory and a virtual environment
mkdir -p ~/ai/vllm
cd ~/ai/vllm
uv venv --python 3.12 --seed

# Activate with:
source .venv/bin/activate
B. Using Conda
# Download and install Miniconda.
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh

# Activate with:
source ~/miniconda3/bin/activate

# Create and activate a new environment (adjust the Python version to your needs)
conda create -n myenv python=3.12 -y
conda activate myenv

Install vLLM

Install vLLM along with its necessary dependencies; --torch-backend=auto picks the right PyTorch backend for your GPU automatically.

# Latest stable version with automatic CUDA backend selection
uv pip install vllm --torch-backend=auto

# Latest nightly build pinned to CUDA 12.6
uv pip install vllm --torch-backend=cu126 --extra-index-url https://wheels.vllm.ai/nightly

# A specific stable version
uv pip install vllm==0.8.5
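A quick sanity check from the activated environment confirms the install worked:

python -c "import vllm; print(vllm.__version__)"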

C. Docker-Based Deployment

Using Pre-built Docker Images

vLLM offers official Docker images that come with all dependencies pre-installed, making it easy to get started:

# Run the vLLM OpenAI-compatible server with a Mistral model  
docker run --runtime nvidia \
           --gpus all \
           -v ~/.cache/huggingface:/root/.cache/huggingface \
           --env "HUGGING_FACE_HUB_TOKEN=<your-token>" \
           -p 8000:8000 \
           --ipc=host \
           vllm/vllm-openai:latest \
           --model mistralai/Mistral-7B-v0.1

This command:

  • Mounts your local HuggingFace cache to avoid re-downloading models
  • Exposes API on port 8000
  • Uses the host’s IPC namespace (--ipc=host) so vLLM worker processes have enough shared memory, which matters for tensor parallelism
  • Sets the model to serve
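Once the container is up, a quick check from the host confirms it is serving (assuming the port mapping above; /health and /v1/models are standard endpoints of the vLLM OpenAI-compatible server):

# Liveness probe
curl http://localhost:8000/health

# List the model(s) the server exposes
curl http://localhost:8000/v1/models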

D. Kubernetes

For production environments, vLLM can be deployed on Kubernetes through several methods, including the official vLLM production-stack.

🚀Advanced Configuration

For more complex deployments, you can configure:

  • Multiple model replicas for load balancing
  • Different resource allocations per model
  • Custom routing strategies
  • KV cache offloading with LMCache
💡Note: We’ll stop here and skip the Kubernetes part for now.
The official vLLM production-stack implementation on Kubernetes will be covered in our next blog post. Stay tuned🎯.

2. Install On CPU

This was tricky; we even submitted PR #19156 to improve the missing instructions (CPU support is fairly recent).

A. Build CPU wheel from source

As there are no pre-built CPU wheels, you must build vLLM from source.

Prerequisites

  • Compiler: gcc/g++ >= 12.3.0
  • Instruction Set Architecture (ISA): AVX512 (recommended)

1. Install dependencies

sudo apt-get update  -y
sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-dev
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12

2. Clone the vLLM repo

git clone --branch v0.8.5 https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source

3. Install required Python packages

uv pip install --upgrade pip
uv pip install "cmake>=3.26" wheel packaging ninja "setuptools-scm>=8" numpy
uv pip install -v -r requirements/cpu.txt --index-url https://download.pytorch.org/whl/cpu

4. Build vLLM CPU wheel:

# KV cache size in GiB
export VLLM_CPU_KVCACHE_SPACE=2
# Example: to bind to the first 4 CPU cores, use '0-3'. Check available cores using: lscpu -e
export VLLM_CPU_OMP_THREADS_BIND=0-3
# Build the wheel using python build (recommended)
VLLM_TARGET_DEVICE=cpu CMAKE_DISABLE_FIND_PACKAGE_CUDA=ON python -m build --wheel --no-isolation
💡Also possible using uv (fastest option)

This is the same as python build but through uv

VLLM_TARGET_DEVICE=cpu CMAKE_DISABLE_FIND_PACKAGE_CUDA=ON  uv build --wheel

5. Install the wheel (non-editable)

uv pip install dist/*.whl
💡Environment variables: You can find the full list of environment variables to configure the system here.
💡--no-build-isolation: builds using the current environment’s packages instead of creating an isolated build environment.

B. Install CPU Docker image

Two options are available:

1. Pre-built CPU images can be pulled from https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo

docker pull public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.8.5.post1

# Run the container
# VLLM_CPU_KVCACHE_SPACE=2      -> allocates 2 GiB for the KV cache on CPU
# VLLM_CPU_OMP_THREADS_BIND=0-3 -> binds inference to the first 4 CPU cores
docker run --rm \
           --privileged=true \
           --shm-size=4g \
           -p 8000:8000 \
           -e VLLM_CPU_KVCACHE_SPACE=2 \
           -e VLLM_CPU_OMP_THREADS_BIND=0-3 \
           public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.8.5.post1 \
           --model=TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
           --dtype=bfloat16

2. Build image from source

git clone --branch v0.8.5 https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source
docker build -f docker/Dockerfile.cpu --tag vllm-cpu-env --target vllm-openai .

# Download the model
huggingface-cli login
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct --local-dir ./llama3

# Launching OpenAI server 
# VLLM_CPU_KVCACHE_SPACE=1      -> KV cache space in GiB
# VLLM_CPU_OMP_THREADS_BIND=0-3 -> CPU cores used for inference
docker run --rm \
           --privileged=true \
           --shm-size=4g \
           -p 8000:8000 \
           -v "$(pwd)/llama3:/models/llama3" \
           -e VLLM_CPU_KVCACHE_SPACE=1 \
           -e VLLM_CPU_OMP_THREADS_BIND=0-3 \
           vllm-cpu-env \
           --model=/models/llama3 \
           --dtype=bfloat16
         # --api-key supersecretkey   (require clients to send a key to access the model)

# Note: pass either --ipc=host or a large enough --shm-size so vLLM workers have sufficient shared memory.
The following models don’t require a Hugging Face login: TinyLlama/TinyLlama-1.1B-Chat-v1.0, mistralai/Mistral-7B-Instruct-v0.1
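Whichever CPU image you launched, you can sanity-check it from the host with a small chat request; the model field must match what you passed to the container (e.g. /models/llama3 for the source-built image):

curl http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
           "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
           "messages": [{"role": "user", "content": "Say hello in one sentence."}],
           "max_tokens": 64
         }'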

III. Running vLLM server (Offline + Online)

vLLM provides offline (Python API) and online (OpenAI-compatible) serving modes. We’ll use a TinyLlama model in the following examples.

1. Offline Inference Server:

  • Batch processing large numbers of prompts
  • One-time inference jobs or integration into data processing pipelines
  • Scenarios where you don’t need a persistent API server

Batch processing

Here we load the model and process a batch of prompts directly using the Python API.

from vllm import LLM, SamplingParams  
  
# Initialize the model  
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  
  
# Define sampling parameters  
sampling_params = SamplingParams(  
    temperature=0.7,  
    max_tokens=256,
    top_p=0.95  
)  
  
# Prompt batch   
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
] 
# perform the inference
outputs = llm.generate(prompts, sampling_params)
  
# Process the outputs  
for output in outputs:  
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Answer

Processed prompts: 100%|█████| 4/4 [00:40<00:00, 10.22s/it, est. speed input: 2 toks/s, output: 21.86 toks/s]
Prompt: 'Hello, my name is', Generated text: ' [Your Name] and I am a current student at [School Name]...
...
Prompt: 'The president of the United States is', Generated text: ' an important and influential leader in the world, 
...
Prompt: 'The capital of France is', Generated text: ' Paris. The national language of France is French. The national
... # down to the last prompt ..

2. Online Inference Server (OpenAI-Compatible)

vLLM’s OpenAI-compatible server lets you serve models behind an API that matches OpenAI‘s interface, making it easy to integrate with existing applications. It’s ideal for real-time applications and chatbots.

Running the Server

You can run the server using Docker or directly using the CLI on your machine:

vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --dtype auto 
Note: you can enforce API-key access with the --api-key token-abc123 flag.

You can also use the Python entrypoint module directly:

python -m vllm.entrypoints.openai.api_server --model=TinyLlama/TinyLlama-1.1B-Chat-v1.0  --dtype bfloat16

Supported APIs

The server supports multiple API endpoints, slightly different from Ollama’s:

  • Completions API ➡️ /v1/completions
  • Chat Completions API ➡️ /v1/chat/completions
  • Embeddings API ➡️ /v1/embeddings
  • Custom APIs like the Tokenizer API ➡️ /tokenize, /detokenize (examples below)
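A couple of quick curl calls against these endpoints (the /tokenize payload below reflects a recent vLLM version and may vary slightly between releases):

# Completions API
curl http://localhost:8000/v1/completions \
     -H "Content-Type: application/json" \
     -d '{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "prompt": "The capital of France is", "max_tokens": 16}'

# Tokenizer API
curl http://localhost:8000/tokenize \
     -H "Content-Type: application/json" \
     -d '{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "prompt": "Hello vLLM"}'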

Client Integration

Once the server is running, you can use the OpenAI Python client to interact with it (see open_ai_vllm_chat.py):

from openai import OpenAI  
  
client = OpenAI(  
    base_url="http://localhost:8000/v1",  # your vLLM server URL
    api_key="token-abc123",  
)  
  
chat_response = client.chat.completions.create(  
  model="NousResearch/Meta-Llama-3-8B-Instruct",  
  messages=[  
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world cup in 2022 ?"},
  ]  
)  
  
print("Chat response:",chat_response.choices[0].message)

Answer

$ python open_ai_vllm_chat.py
Chat response: ChatCompletionMessage(content='The 2022 FIFA World Cup, also called the 2022 FIFA World Cup,
took place from Nov. 20, 2022, to Dec. 18, 2022, in Qatar. 
Argentina won the FIFA World Cup in a sensational Super Cup final against France on Dec. 18, 2022.', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[], reasoning_content=None)

Conclusion, final thoughts

vLLM offers flexible deployment options from simple Docker containers to production-grade Kubernetes deployments.
In this series, we’ve covered vLLM’s fundamentals, architecture, and now deployment options. With these tools in hand, you’re ready to start serving LLMs efficiently for your applications using vLLM!

Next, we’ll dive deeper into production-stack deployment scenarios on Kubernetes.
