vLLM Production Stack: LLM Inference for Enterprises (Part 1)

Intro

If you’ve played with vLLM locally, you already know how fast it can crank out tokens. But the minute you try to serve real traffic (multiple models, thousands of concurrent chats), you hit the same pain points the community keeps reporting:

| ⚠️ Pain point | What you really want |
|---|---|
| High GPU bill | Smarter routing + cache reuse |
| K8s YAML sprawl | One-command Helm install |
| Slow recovery when a pod dies | Automatic request migration |
| Long tail latencies | KV-cache pre-fetch & compression |

Topic of the Day

💡 vLLM Production Stack tackles all of the above with a community-maintained layer that wraps vanilla vLLM and adds a Python-native router, an LMCache-powered KV-cache network, autoscaling hooks, and Grafana dashboards, all deployable from a single Helm chart. Let’s dive into it!✍🏻

Acknowledgment:
While authored independently, this series benefited from the LMCache team’s supportive presence and openness to guide.

What is the vLLM Production Stack?

The Production Stack is an open-source, Kubernetes-native, enterprise-ready inference setup for vLLM, maintained by the LMCache team, that lets you scale from a single node to a multi-cluster deployment without changing application code. Here’s a quick look at its key components:

  • Serving Engine: The vLLM core component that runs the different LLMs
  • Request Router: Directs requests to the appropriate backend based on routing keys or session IDs
  • Observability Stack: Monitors metrics through Prometheus and Grafana (e.g., time to first token (TTFT), latency)
  • Persistent Storage: Stores model weights and other required data
  • KV Cache Offloading: Offloads cache from GPU to CPU or disk, enabling larger contexts and higher cache hit rates

Production-stack benefits

| Pillar | Capabilities |
|---|---|
| User-Friendly | One-command Helm install, CSP-agnostic (AWS, Azure, GCP); plug-and-play observability (Prometheus + Grafana) |
| Better Performance | Smart router keeps chats on cache-warm GPUs; distributed KV-cache network slashes latency and GPU RAM use |
| Production-Ready | LoRA hot-swap for rapid fine-tune rollout; autoscaling and fault tolerance baked in |

Why another “stack”?

Unlike KServe, KubeAI, or AIBrix, which run vanilla vLLM, the Production Stack is the only official, production-tuned implementation from the vLLM project, and its LMCache-powered KV-cache offloading is one of its key performance features.

Note: The Production Stack isn’t a core part of vLLM, but a sub-project aimed at enterprise deployments.

Request Flow

The diagram below walks through the entire request lifecycle in the vLLM Production Stack: the router accepts the call, discovers and selects the best engine, forwards the request for inference, and streams the response back to the client.

Production-stack Architecture

The architecture follows cloud-native principles and leverages K8s for orchestration, scaling, and service discovery.

Under the hood are three main layers:

  • Core Layer: Everything you need to serve traffic, such as the router and the vLLM engines with KV cache and LoRA.
  • Integration Layer: K8s/Ray adapters slot the core into any orchestrator. ✅ No refactoring or migration needed.
  • Infrastructure Layer: Runs GPU/CPU pools on K8s, Ray, or bare metal. ✅ Fully platform-agnostic.

I. Serving Engine

The serving engine is the core component that runs the vLLM instances that process inference requests.

Each serving engine deployment:

  • Runs in a vLLM container and mounts a persistent volume (/data)
  • Loads a model from Hugging Face or local storage
  • Exposes endpoints:
    • API endpoint (port 8000)
    • /health endpoint for monitoring
  • Includes multiple configuration options, such as:

| Component | Configuration |
|---|---|
| Model Execution | modelURL, dtype precision (e.g., bfloat16), maxModelLen, tensorParallelSize, enablePrefixCaching, enableChunkedPrefill |
| LMCache Integration (optional) | KV-cache offloading, if enabled: lmCacheConfig, cpuOffloadingBufferSize, diskOffloadingBufferSize |
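
To make this concrete, here is a rough, purely illustrative Python sketch of how chart-style values like these typically translate into the flags a vLLM engine is launched with. The value names mirror the table above rather than the chart’s authoritative schema, and the real translation is done by the Helm chart’s templates.

# Illustrative only: build a vLLM serve command from chart-style values.
# The real mapping is handled by the Helm chart's deployment templates.
values = {
    "modelURL": "meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model
    "dtype": "bfloat16",
    "maxModelLen": 8192,
    "tensorParallelSize": 2,
    "enablePrefixCaching": True,
    "enableChunkedPrefill": True,
}

cmd = ["vllm", "serve", values["modelURL"],
       "--dtype", values["dtype"],
       "--max-model-len", str(values["maxModelLen"]),
       "--tensor-parallel-size", str(values["tensorParallelSize"])]
if values["enablePrefixCaching"]:
    cmd.append("--enable-prefix-caching")
if values["enableChunkedPrefill"]:
    cmd.append("--enable-chunked-prefill")
print(" ".join(cmd))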

II. The Router

The router is responsible for directing requests to the appropriate serving engine based on configurable routing logic (a minimal sketch follows the list below). The router:

  • Discovers engines via K8s API or static backend list
  • Routes requests (round-robin or session-sticky)
  • Exports performance metrics
  • Exposes an OpenAI-compatible proxy endpoint
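
As a toy illustration of what such an OpenAI-compatible proxy looks like (this is not the production-stack router itself, the backend URLs are hypothetical, and streaming is omitted for brevity):

# Toy OpenAI-compatible proxy: round-robin forwarding to vLLM engines.
# Not the production-stack router; backend URLs are hypothetical.
import itertools

import httpx
from fastapi import FastAPI, Request

app = FastAPI()
BACKENDS = ["http://vllm-engine-0:8000", "http://vllm-engine-1:8000"]
round_robin = itertools.cycle(BACKENDS)  # simplest possible routing policy

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    backend = next(round_robin)
    payload = await request.json()
    async with httpx.AsyncClient(timeout=None) as client:
        resp = await client.post(f"{backend}/v1/chat/completions", json=payload)
    # Pass the engine's JSON response straight back to the client.
    return resp.json()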

1. Routing logic

The router’s smart routing implements two levels of logic:

  1. Model routing filters the available backends down to those serving the requested model.
  2. Instance routing uses routing algorithms and live metrics to pick the best vLLM instance (replica) for each request.
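
As a minimal conceptual sketch (not the router’s actual code), the two stages boil down to a filter followed by a pick based on live metrics:

# Conceptual two-stage routing: filter by model, then pick the best replica.
from dataclasses import dataclass

@dataclass
class Backend:
    url: str
    model: str
    num_running_requests: int  # live metric reported by the engine

def route(requested_model: str, backends: list[Backend]) -> Backend:
    # 1. Model routing: keep only backends that serve the requested model.
    candidates = [b for b in backends if b.model == requested_model]
    if not candidates:
        raise ValueError(f"No backend serves model {requested_model!r}")
    # 2. Instance routing: pick the least-loaded replica (one possible policy).
    return min(candidates, key=lambda b: b.num_running_requests)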

2. Instance routing algorithms

The main algorithms are round-robin, session stickiness (session-bound), and prefix-aware load balancing (WIP).

A. Instance Affinity – Session Stickiness

Session stickiness hashes session IDs to keep traffic on the same backend; if there’s no ID, it falls back to QPS-based routing.
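
A hedged sketch of the idea (function and field names are illustrative, not the router’s API):

# Illustrative session-sticky routing with a QPS-based fallback.
import hashlib

def pick_backend(session_id: str | None, backend_urls: list[str],
                 qps_per_backend: dict[str, float]) -> str:
    if session_id:
        # Hash the session ID so the same session always lands on the same replica.
        digest = hashlib.sha256(session_id.encode()).hexdigest()
        return backend_urls[int(digest, 16) % len(backend_urls)]
    # No session ID: fall back to the replica currently serving the lowest QPS.
    return min(backend_urls, key=lambda url: qps_per_backend.get(url, 0.0))

Note that plain hash-modulo stickiness reshuffles sessions when the replica count changes; consistent hashing is the usual remedy, but that detail is beyond this sketch.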

B. Instance Affinity – Prefix-aware

This feature is still in progress (see PR239) and relies on a chunk-based hash trie built from each request body.
The workflow operates as follows:

1️⃣. Chunk the request
2️⃣. Hash each chunk
3️⃣. Find the longest prefix match in the trie
4️⃣. Update the trie with the new hashes.
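
Since the feature is still in flight, the snippet below is only a conceptual illustration of the chunk-hash-trie idea, not PR239’s implementation; the chunk size and trie layout are assumptions.

# Conceptual chunk-based hash trie for prefix-aware routing (illustrative only).
import hashlib

CHUNK_SIZE = 128  # characters per chunk; the real design may chunk tokens instead

def chunk_hashes(body: str) -> list[str]:
    chunks = [body[i:i + CHUNK_SIZE] for i in range(0, len(body), CHUNK_SIZE)]
    return [hashlib.sha256(c.encode()).hexdigest() for c in chunks]

class HashTrie:
    def __init__(self) -> None:
        self.root: dict = {}

    def longest_prefix(self, hashes: list[str]) -> tuple[int, str | None]:
        """Return (number of matched chunks, backend that owns the longest match)."""
        node, matched, backend = self.root, 0, None
        for h in hashes:
            if h not in node:
                break
            node = node[h]
            matched += 1
            backend = node.get("_backend", backend)
        return matched, backend

    def insert(self, hashes: list[str], backend: str) -> None:
        node = self.root
        for h in hashes:
            node = node.setdefault(h, {})
            node["_backend"] = backend  # remember which replica cached this prefix

# Routing then prefers the backend returned by longest_prefix(), since it already
# holds the KV cache for the shared prefix, and updates the trie afterwards.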

3. Fault tolerance (request migration)

When a vLLM instance dies, request migration shifts existing sessions to a new instance with their KV caches intact.

🧰 Other advanced routing features

✍🏻 Request-rewrite module:
🔹 Adds Personally Identifiable Information (PII) protection before the request hits the model.
🔎 Routing extensions:
🔹 Semantic caching caches responses based on the semantic similarity of requests.
🔹 KV-cache-aware routing makes smarter instance picks based on cached tokenized prompts.

Note: Explore the full 2025 Production Stack Feature Roadmap here.

III. KV-Cache Network

As explained in our previous post, the KV cache speeds up inference whether it is served locally or remotely.

KV Cache Offloading: LMCache uses optimized CUDA kernels to copy all the KV-cache blocks at the same time.

LMCache’s KV-cache network exposes a unified sharing interface and serves as the backend for all cache optimizations, so every instance can reuse the same cache while speeding up KV-cache processing.

1. KV Cache Delivery- Pooling Mode

In pooling mode, KV-cache data is stored in and retrieved from a remote store such as Redis through the LMCache interface.
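
Conceptually, and only as a rough sketch rather than LMCache’s actual API, pooling mode boils down to keying serialized KV-cache blocks by a prefix hash in a shared store; the store address below is hypothetical.

# Rough conceptual sketch of pooling mode (not LMCache's real interface).
import hashlib

import redis  # pip install redis

r = redis.Redis(host="kv-pool-redis", port=6379)  # hypothetical shared store

def cache_key(prompt_prefix: str) -> str:
    return "kv:" + hashlib.sha256(prompt_prefix.encode()).hexdigest()

def put_kv(prompt_prefix: str, kv_blob: bytes) -> None:
    # Any engine can publish the serialized KV blocks for a prompt prefix.
    r.set(cache_key(prompt_prefix), kv_blob)

def get_kv(prompt_prefix: str) -> bytes | None:
    # Any other engine can fetch and reuse them instead of recomputing the prefill.
    return r.get(cache_key(prompt_prefix))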

2. KV Cache Delivery- Peer-to-Peer Mode

Here, data is stored locally and shared directly between instances, while Redis acts only as a lookup server that stores metadata.

Extensible interface: the LMCache interface makes it easy to integrate any object store, which can transparently reuse its functions.

Other LMCache advanced optimizations

1️⃣ KV cache blending

When chats have different input documents, CacheBlend concatenates their KV caches into one (useful in RAG setups).

2️⃣KV cache compression

Uses the CacheGen algorithm to compress a KV cache into a more compact bitstream representation for faster transfer.

3️⃣ KV cache translation

Translation allows KV-cache sharing between different LLMs (e.g., Llama 3 🔀 Mixtral), which is ideal for AI agents.

4️⃣ Update KV cache

LMCache can even offline-update KV entries, so the next time the cache is reused the LLM returns smarter answers.

5️⃣ KV Cache Prefetching

Prefetching lets the router pull remote caches to the engines before request execution, keeping GPUs busy instead of idle.

IV. The observability stack🔎

The observability stack provides monitoring capabilities for the vLLM Production Stack through Prometheus & Grafana.

All configuration details can be found here. The observability stack:

  • Collects metrics from the router and serving engines
  • Provides custom metrics through the Prometheus Adapter
  • Monitors key model performance indicators and feeds them to the Grafana dashboard in real time, including:
| vLLM Grafana Metric | Description |
|---|---|
| Available vLLM Instances | Displays the number of healthy instances. |
| Request Latency Distribution | Visualizes end-to-end request latency. |
| Time-to-First-Token (TTFT) Distribution | Monitors response times for token generation. |
| Number of Running Requests | Tracks the number of active requests per instance. |
| Number of Pending Requests | Tracks requests waiting to be processed. |
| GPU KV Usage Percent | Monitors GPU KV cache usage. |
| GPU KV Cache Hit Rate | Displays the hit rate for the GPU KV cache. |
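
As an example of pulling one of these KPIs programmatically, the sketch below queries Prometheus’ HTTP API for the p95 time-to-first-token over the last five minutes. The Prometheus address is a placeholder, and the metric name follows vLLM’s exported vllm:time_to_first_token_seconds histogram, so it may vary with your vLLM version and scrape configuration.

# Sketch: query Prometheus for the p95 TTFT over the last 5 minutes.
import requests

PROM_URL = "http://prometheus-server.monitoring.svc:9090"  # placeholder address
QUERY = ("histogram_quantile(0.95, "
         "sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))")

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])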

V. One-click Helm install

To top it off, the vLLM Production Stack ships as a one-command Helm chart, turning a slim values.yaml into a fully wired K8s deployment: pods, services, autoscalers, and dashboards included.

To get started with the vLLM Production Stack, you’ll need:

  • A running Kubernetes cluster with GPU support
  • Helm installed on your machine
  • kubectl installed and configured to access your cluster

helm install gives you the router, vLLM engines, LMCache, autoscaler, ServiceMonitor objects, and Grafana dashboards out of the box.

$ git clone https://github.com/vllm-project/production-stack.git
$ cd production-stack/

1️⃣ # Add the chart repo
$ helm repo add vllm https://vllm-project.github.io/production-stack

2️⃣ # Spin up a minimal stack (the values file ships with the repo)
$ helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml

  • Expose the router locally and list the served models:
# local port forwarding
$ kubectl port-forward svc/vllm-router-service 30080:80
$ curl http://localhost:30080/models | jq .data[].id
"TinyLlama/TinyLlama-1.1B-Chat-v1.0"
Customizations: GPU type, model path, LoRA adapters, and the LMCache backend can all be tuned by editing values.yaml or using CRDs.

Benchmarks: Show me the numbers

A public benchmark with 80 concurrent users chatting with a Llama-2-8B replica (2× A100 80GB) shows that the Production Stack yields lower latency and higher generation speed.

Note: Full benchmark scripts live in the repo under benchmarks/multi-round-qa
vLLM benchmark on an NVIDIA T4 in EKS with the TinyLlama model.

TL;DR

If you need vLLM performance and SRE-grade reliability, the production stack saves you weeks of YAML tweaking.

  • Drop it into any GPU-enabled K8s cluster
  • Point your OpenAI-compatible client at the router
  • Get back to shipping features instead of herding pods.

🚀Coming Up Next

Now that we’ve seen how the vLLM Production Stack turns vanilla model serving into an enterprise-grade platform, the next blog post will dive into the Helm chart configuration, walking through the key settings in values.yaml and exploring deployment recipes ranging from a minimal setup to multi-cloud roll-outs (EKS, AKS, GKE).

Stay tuned for Part 2🫶🏻
