vLLM production-stack: Deployment in the cloud (part 2)

Intro

In the previous post, we explored how the vLLM Production-Stack upgrades the vanilla vLLM engine into an enterprise-grade platform. This time, we’ll crack open the Helm chart, decoding the key knobs in values.yaml and showing deployment recipes that span from a minimal install to full cloud setups.

Acknowledgment:
While authored independently, this series benefited from the LMCache team’s support and openness to guidance.

Prerequisites

I. Production-stack Helm Chart

Template & Resource Flow

Resource generation follows a systematic process that transforms the configuration values in values.yaml into deployed K8s resources.

Core Configuration Structure

The Helm chart is organized around several specification blocks, which include the following key components:

1️⃣. Serving Engine Spec – The core vLLM inference engine
2️⃣. Router Spec – Load balancer/router for distributing requests
3️⃣. Cache Server Spec – Caching layer for improved performance
4️⃣. Shared Storage – Persistent storage for models and data
5️⃣. LoRA Controller (Optional) — Out of scope

Template System Overview

The template files render the values into a full fleet of resources: engine deployments, the router, the cache server, shared-storage PVCs, and more.

Each template file, its purpose, and the key resources it generates:

  • deployment-vllm-multi.yaml: creates serving engine deployments (Deployment, ConfigMap for chat templates)
  • service-vllm-multi.yaml: exposes the serving engines (Service resources for each model)
  • deployment-router.yaml: creates the router deployment (Router Deployment)
  • service-router.yaml: exposes the router service (Router Service, optional Ingress)
  • secrets.yaml: manages sensitive data (Secret for tokens and API keys)
  • pvc.yaml: creates storage claims (PersistentVolumeClaim for models)
Note: The Production-Stack configuration is defined in values.yaml, which is the foundation for all deployment scenarios.
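
To see exactly which manifests a given values file produces, you can render the chart locally before installing. This is a sketch that assumes the chart repo has already been added as vllm (as shown in the deployment recipes later); the release name my-stack is illustrative:

# Render all templates with your values file and inspect the output
helm template my-stack vllm/vllm-stack -f values.yaml > rendered.yaml

# Or render a single template, e.g. the router deployment
helm template my-stack vllm/vllm-stack -f values.yaml \
  --show-only templates/deployment-router.yaml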

1. Serving Engine

⦿ servingEngineSpec

This section defines how your LLM models will be deployed and managed across your Kubernetes cluster.

Essential Fields:
  • enableEngine: controls whether serving engines are deployed (example: true)
  • containerPort: default container port (example: 8000)
  • servicePort: default service port (example: 80)
  • labels: custom labels for engine deployments (example: {key: value})
  • configs: environment variables from a ConfigMap (example: {key: value})
  • securityContext: pod-level security context configuration (example: {})
  • tolerations: node scheduling constraints for pods (example: [])
Note: All the fields and their descriptions can be found in our vLLM-lab GitHub repo README.

📦Examples :

servingEngineSpec:
  strategy: # Strategy Configuration
    type: Recreate  # vs RollingUpdate
  enableEngine: true
  containerPort: 8000
  servicePort: 80
  labels:
    environment: "production"
    release: "v1.0" 
  runtimeClassName: ""
servingEngineSpec:
  # Health checks
  livenessProbe:
    initialDelaySeconds: 30
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3
  # For large models requiring extended startup time    
  startupProbe:
    initialDelaySeconds: 60
    periodSeconds: 30
    failureThreshold: 120 # Allow up to 1 hour for startup 
  modelSpec:
      # ... model configuration
servingEngineSpec:  
  enableEngine: true  
    
  # Labels for service discovery  
  labels:  
    app: "vllm-engine"  
    version: "v1"    
    
  # Network configuration  
  containerPort: 8000  
  servicePort: 8000  
    
  # Security context  
  containerSecurityContext:  
    runAsNonRoot: false  # default value  
    runAsUser: 1000  
    allowPrivilegeEscalation: false  
      
  securityContext:  
    fsGroup: 1000  
    runAsUser: 1000  
    
  # Scheduling  
  runtimeClassName: "nvidia"  
  schedulerName: "default-scheduler"  
  tolerations:  
    - key: "nvidia.com/gpu"  
      operator: "Equal"  
      value: "true"  
      effect: "NoSchedule"  
    
  # Pod disruption budget (string value)
  maxUnavailablePodDisruptionBudget: "1"  
    
  # Additional configurations  
  configs:  
    CUDA_VISIBLE_DEVICES: "0"  
    VLLM_WORKER_MULTIPROC_METHOD: "spawn"  
      
  # API key supplied via an existing secret reference
  vllmApiKey:  
    secretName: "vllm-api-secret"  
    secretKey: "api-key"  
    
  # Model specifications
  modelSpec:
    - name: "your-model"
      repository: "vllm/vllm-openai"  # Container image set per model
      tag: "v0.4.2"
      modelURL: "your-model-url"  
      replicaCount: 1  
      requestCPU: 4  
      requestMemory: "16Gi"  
      requestGPU: 1

⦿ ModelSpec Array

Each entry in the modelSpec array represents a distinct model deployment with its own configuration, including vLLM settings.

Essential Fields:

The individual resource fields (e.g. requestGPU) are transformed into a resources specification by the Helm templates, as sketched after the table below.

  • name: unique identifier for the model deployment (example: "llama3")
  • repository: container image repository (example: "vllm/vllm-openai")
  • tag: container image tag (example: "latest")
  • modelURL: Hugging Face model identifier or local path (example: "meta-llama/Llama-3.1-8B-Instruct")
  • replicaCount: number of pod replicas (example: 1)
  • requestCPU: CPU cores requested per replica (example: 10)
  • requestMemory: memory requested per replica (example: "16Gi")
  • requestGPU: GPU units requested per replica (example: 1)
Note: You can find the full list of modelSpec fields, such as persistent volume claims and more, here.
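
For illustration, the request fields above map roughly to a standard Kubernetes resources block in the rendered Deployment. The exact output depends on the chart version, so treat this as an approximation rather than the literal template output:

# Approximate container resources rendered from requestCPU/requestMemory/requestGPU
resources:
  requests:
    cpu: 10
    memory: "16Gi"
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1   # GPU requests and limits are typically set to the same value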

📦Examples :

servingEngineSpec:
  strategy:
    type: Recreate
  modelSpec:    
  - name: "chat-model"
    repository: "lmcache/vllm-openai"
    tag: "latest"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct"
    replicaCount: 2
    requestCPU: 8
    requestMemory: "24Gi"
    requestGPU: 1
    pvcStorage: "50Gi"
    pvcAccessMode:
      - ReadWriteOnce  
    hf_token: <YOUR_HF_TOKEN>  
💡Note: If you have replicaCount: 3 and requestGPU: 1, you are requesting 3 separate GPUs in total (one per replica).
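
If you want to sanity-check the total GPU footprint after installing, a rough approach is to dump the pods’ resource requests. The label selector below is an assumption based on the servingEngineSpec.labels example earlier; adjust it to whatever labels your deployment actually carries:

# Show the resource requests of the serving pods (2 replicas x 1 GPU => 2 GPUs total here)
kubectl get pods -l environment=production -o yaml | grep -B 2 -A 4 "requests:"
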
modelSpec:
  - name: "example"
    # ... other fields
    pvcStorage: "100Gi"
    pvcAccessMode: ["ReadWriteOnce"]
    storageClass: "fast-ssd"
    pvcMatchLabels:
      model: "llama"
servingEngineSpec:  
  modelSpec:  
    - name: "custom-fine-tuned-model"  
      repository: "vllm/vllm-openai"  # Required field  
      tag: "latest"                   # Required field  
      modelURL: "/shared-storage/models/custom-model"  
      replicaCount: 1  
      requestCPU: 8  
      requestMemory: "32Gi"  
      requestGPU: 2  
      vllmConfig:  
        maxModelLen: 4096  
        tensorParallelSize: 2  
        extraArgs:   
          - "--gpu-memory-utilization=0.9"  
          - "--max-num-batched-tokens=8192"  
          - "--max-num-seqs=256"  
          - "--enable-prefix-caching"  
          - "--disable-log-stats"  
          - "--trust-remote-code"  # hugging face source
        
    - name: "multi-modal-model"  
      repository: "vllm/vllm-openai"  
      tag: "latest"  
      modelURL: "/shared-storage/models/llava-v1.5-7b"  
      replicaCount: 1  
      requestCPU: 4  
      requestMemory: "16Gi"  
      requestGPU: 1  
      vllmConfig:  
        maxModelLen: 4096  
        tensorParallelSize: 1  
        extraArgs:  
          - "--limit-mm-per-prompt=image=4"  
          - "--trust-remote-code"  
      
    - name: "embedding-model"  
      repository: "lmcache/vllm-openai"  
      tag: "latest"  
      modelURL: "BAAI/bge-large-en-v1.5"  
      replicaCount: 1  
      requestCPU: 4  
      requestMemory: "16Gi"  
      requestGPU: 1     
      env:  
        - name: HUGGING_FACE_HUB_TOKEN  
          valueFrom:  
            secretKeyRef:  
              name: huggingface-credentials  
              key: HUGGING_FACE_HUB_TOKEN  
        - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING  
          value: "True"
💡Note: The modelSpec array supports enterprise deployments that require serving multiple models simultaneously.

⦿ vLLMConfig

vllmConfig is a child block of modelSpec with vLLM-specific configurations like context size and data type (e.g. FP16).

Essential Fields:
  • maxModelLen: maximum sequence length (example: 4096)
  • dtype: model data type precision (example: "bfloat16")
  • enableChunkedPrefill: enable chunked prefill optimization (example: false)
  • enablePrefixCaching: enable KV cache prefix caching (example: false)
  • gpuMemoryUtilization: GPU memory utilization ratio (example: 0.9)
  • tensorParallelSize: tensor parallelism degree (example: 1)
  • extraArgs: additional vLLM engine arguments (example: [])

📦Examples :

servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "llama3"
    repository: "vllm/vllm-openai"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct"
    replicaCount: 2
    requestGPU: 2   # Matches tensorParallelSize below
    # ... Rest of the modelSpec params
    vllmConfig:
      enablePrefixCaching: true
      enableChunkedPrefill: false
      maxModelLen: 1024
      dtype: "bfloat16"
      tensorParallelSize: 2
      maxNumSeqs: 32
      gpuMemoryUtilization: 0.80
      extraArgs: ["--disable-log-requests", "--trust-remote-code"]
    hf_token: <YOUR_HF_TOKEN>
    shmSize: "6Gi"   # only needed when tensor parallelism is enabled
💡Note: gpuMemoryUtilization applies per replica, not across all replicas at once (e.g. 3 replicas => 0.80 of each of the 3 GPUs).

⦿ LMCacheConfig

lmcacheConfig is a modelSpec sub-block that enables KV cache offloading to CPU memory or disk.

Essential Fields:
  • enabled: whether to enable LMCache for KV offloading (example: true)
  • cpuOffloadingBufferSize: the CPU offloading buffer size for LMCache (example: "30")
  • diskOffloadingBufferSize: the disk offloading buffer size for LMCache (example: "")

📦Examples :

servingEngineSpec: 
  modelSpec:
  - name: "llama3"
    # ... other fields
    lmcacheConfig:  
      enabled: true  
      cpuOffloadingBufferSize: "20"
    env:  
      - name: LMCACHE_LOG_LEVEL  
        value: "DEBUG"  
lmcacheConfig:  
  enabled: true  
  cpuOffloadingBufferSize: "60"  
  enableController: true  
  instanceId: "default1"  
  controllerPort: "9000"  
  workerPort: 8001
lmcacheConfig:  
  cudaVisibleDevices: "0"  
  enabled: true  
  kvRole: "kv_producer"  
  enableNixl: true  
  nixlRole: "sender"  
  nixlPeerHost: "vllm-llama-decode-engine-service"  
  nixlPeerPort: "55555"  
  nixlBufferSize: "1073741824"  # 1GB  
  nixlBufferDevice: "cuda"  
  nixlEnableGc: true  
  enablePD: true  
  cpuOffloadingBufferSize: 0
Fields such as enableController, instanceId, and kvRole are used for specialized scenarios like KV-aware routing and disaggregated prefill.

⦿ Security and Authentication

For secure deployments, the stack supports API key and Hugging Face token authentication.

Essential Fields:

  • vllmApiKey: API key auth for model endpoints (example: "vllm_xxxxx" or a secret reference)
  • hf_token: Hugging Face authentication token (example: "hf_xxxxx" or a secret reference)
  • serviceAccountName: RBAC integration through K8s service accounts (example: "vllm-service-account")
  • networkPolicies: K8s-native network security (network policy configuration)

📦Examples :

The vllmApiKey field supports two formats for API key specification:

# Direct string (stored in generated secret)
servingEngineSpec:
  vllmApiKey: "vllm_secret_key_here"

# Reference to existing secret
servingEngineSpec:
  vllmApiKey:
    secretName: "my-secret"
    secretKey: "api-key"

The hf_token field offers similar dual-format support for Hugging Face authentication:

# Per-model token configuration
modelSpec:
  - name: "gated-model"
    hf_token: "hf_token_here"
    # or
    hf_token:
      secretName: "hf-secrets"
      secretKey: "token"
Tokens can also be injected directly as environment variables:

env:
  - name: HUGGING_FACE_HUB_TOKEN  
    valueFrom:  
      secretKeyRef:  
        name: huggingface-credentials  
        key: HUGGING_FACE_HUB_TOKEN  
  - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING  
    value: "True"

2. Router

The routerSpec section configures the vLLM router, handling request routing and load balancing across serving engines.

Essential Fields:
  • enableRouter: enable router deployment (example: true)
  • serviceDiscovery: service discovery mode for the router, "k8s" or "static" (example: "k8s")
  • repository: router container image repository (example: "lmcache/lmstack-router")
  • tag: container image tag (example: "latest")
  • replicaCount: number of router replicas (example: 1)
  • routingLogic: routing algorithm (example: "roundrobin")
  • sessionKey: session header key for session routing (example: "x-user-id")
  • resources: resource requests and limits (example: {})
  • vllmApiKey: fallback API key for securing the vLLM models (secret reference)

📦Examples :

routerSpec:
  engineScrapeInterval: 15
  requestStatsWindow: 60
  enableRouter: true
  repository: "lmcache/lmstack-router"
  tag: "latest"
  replicaCount: 1
  routingLogic: "rundrobin"  # default   
  serviceDiscovery: "k8s"
  # Auto-discovers services in same namespace using servingEngineSpec.labels selector  
  # serviceDiscovery: "static"
  # staticBackends: "http://model1:8000,http://model2:8000"
  # staticModels: "llama-7b,mistral-7b"  
  containerPort: 8000
  servicePort: 80  
  # OPTIONAL
  vllmApiKey:
    secretName: "my-secret"
    secretKey: "api-key"
💡routerSpec.vllmApiKey is a fallback API key configuration that’s only used when the serving engine is disabled.
routerSpec:  
  repository: "lmcache/lmstack-router"  
  tag: "kvaware-latest"  
  routingLogic: "session"  
  sessionKey: "X-Session-ID"
  extraArgs:  
    - "--log-level"  
    - "info"  
  resources: {} 
A disaggregated prefill setup pairs prefill and decode engines with a matching router configuration:

servingEngineSpec:
  modelSpec:
    - name: "llama-prefill"
      # ... model configuration
      lmcacheConfig:
        enabled: true
        kvRole: "kv_producer"
        enablePD: true
        
    - name: "llama-decode"
      # ... model configuration  
      lmcacheConfig:
        enabled: true
        kvRole: "kv_consumer"
        enablePD: true
routerSpec:
  routingLogic: "disaggregated_prefill"
  extraArgs:
    - "--prefill-model-labels"
    - "llama-prefill"
    - "--decode-model-labels"
    - "llama-decode"       

3. Cache Server

cacheServerSpec is used for configuring remote shared KV cache storage using LMCache.

Essential Fields:
  • replicaCount: number of cache server replicas (example: 1)
  • containerPort: container port for the cache server (example: 8080)
  • servicePort: service port for external access (example: 81)
  • serde: serializer/deserializer type (example: "naive")
  • repository: container image repository (example: "lmcache/vllm-openai")
  • tag: container image tag (example: "latest")
  • resources: resource requests and limits (CPU/memory specifications)
  • labels: custom labels for the cache server deployment (example: {"environment": "cacheserver"})
  • nodeSelectorTerms: node selection criteria for scheduling (node selector configuration)

📦Examples :

cacheserverSpec:
  replicaCount: 1
  containerPort: 8080
  servicePort: 81
  serde: "naive"    # serialization/deserialization

  repository: "lmcache/vllm-openai"
  tag: "latest"
  resources: {}
  labels:  
    environment: "cacheserver"  
    release: "cacheserver"   

4. Shared Storage

This block configures shared storage across multiple models for efficient model weight management.

Essential Fields:
  • enabled: enable shared storage creation (example: true)
  • size: storage capacity for the PersistentVolume (example: "100Gi")
  • storageClass: Kubernetes storage class name (example: "standard")
  • accessModes: volume access modes for multi-pod access (example: ["ReadWriteMany"])
  • hostPath: local host path for single-node development (example: "/data/shared")
  • nfs.server: NFS server hostname for production clusters (example: "nfs-server.example.com")
  • nfs.path: NFS export path on the server (example: "/exports/vllm-shared")

📦Examples :

sharedStorage:  
  enabled: true  
  size: "100Gi"   
  storageClass: "standard"  
  nfs:  
    server: "nfs-server.example.com"  
    path: "/exports/vllm-shared"  
  accessModes:  
    - ReadWriteMany
sharedStorage:  
  enabled: true  
  size: "100Gi"  
  storageClass: "standard"  
  hostPath: "/data/shared"  
  accessModes:  
    - ReadWriteMany
Note: Shared storage is created when either the LoRA controller is enabled or sharedStorage.enabled is set to true.

II. Deployment Recipes

The Production-Stack tutorials folder contains 20+ deployment scenarios, covering everything from spinning up a Minikube environment to multi-GPU parallelism. Below are deployment recipes ranging from a minimal setup to cloud-based roll-outs.

Note: We assume the chart repository has already been added using the following command:

helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update
Note: Currently, the production-stack does not have a dedicated chart repository available on ArtifactHub.

1️⃣. Minimal Local Deployment (CPU)

For most beginners, a dev environment means 0 GPUs, and that’s fine. You can easily start with a minimal, CPU-only vLLM stack right on your laptop’s K8s using the official pre-built vLLM CPU image from AWS ECR.

# cpu-tinyllama-values.yaml  
servingEngineSpec:  
  enableEngine: true  
  runtimeClassName: ""  # Override the default "nvidia" runtime  
  containerSecurityContext:  
    privileged: true  
  modelSpec:  
  - name: "tinyllama-cpu"  
    repository: "public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo"  
    tag: "v0.8.5.post1"  
    modelURL: "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  
    replicaCount: 1  
    requestCPU: 3  
    requestMemory: "3Gi"  
    requestGPU: 0  # CPU Only
    limitCPU: "4"  
    limitMemory: "6Gi"  
    pvcStorage: "10Gi"  
    # device: "cpu"  # Undocumented field; the CPU device is set via extraArgs below
    vllmConfig:  
      dtype: "bfloat16"  
      extraArgs:  
        - "--device"  
        - "cpu"   
    env:  
      - name: VLLM_CPU_KVCACHE_SPACE  
        value: "1"  
      - name: VLLM_CPU_OMP_THREADS_BIND  
        value: "0-2"  
      - name: HUGGING_FACE_HUB_TOKEN  # Optional for TinyLlama
        valueFrom:  
          secretKeyRef:  
            name: hf-token-secret  
            key: token    
routerSpec:  
  enableRouter: true  
  routingLogic: "roundrobin"
helm install vllm-cpu vllm/vllm-stack -f cpu-tinyllama-values.yaml
⬆️ This gives you: one vLLM engine + router on your local K8s. Great for API play-testing and template tweaks.
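
To poke at the API once the pods are Running, you can port-forward the router service and hit the OpenAI-compatible endpoints. The service name and port below are assumptions based on the release name; run kubectl get svc to find the actual name in your cluster:

# Forward the router service to localhost (verify the service name with kubectl get svc)
kubectl port-forward svc/vllm-cpu-router-service 30080:80

# List the models served behind the router
curl http://localhost:30080/v1/models

# Send a test completion request
curl http://localhost:30080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "prompt": "Hello,", "max_tokens": 16}'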

2️⃣. AWS EKS Deployment

For AWS deployments, the deployment process leverages AWS-specific features like EFS for shared storage and IAM roles for service authentication. Key considerations for AWS deployments are:

  • GPU Node Groups: Configure EKS with GPU-enabled instances (g4 (T4), p3 (V100), p4 (A100) families)
  • Storage Integration: Use EFS or EBS for persistent model storage
  • Network Configuration: Proper VPC and security group setup for multi-AZ deployments
# eks-values.yaml
servingEngineSpec:
  runtimeClassName: nvidia
  modelSpec:
  - name: llama3
    modelURL: meta-llama/Meta-Llama-3-8B-Instruct
    requestGPU: 1
    requestGPUType: "nvidia.com/gpu"
    replicaCount: 3
    pvcStorage: 100Gi

routerSpec:
  serviceType: LoadBalancer        # creates an AWS NLB

sharedStorage:
  enabled: true
  size: 200Gi
  storageClass: efs-sc             # EFS CSI driver  
helm install vllm-eks vllm/vllm-stack -f eks-values.yaml
Alternatively, the repository ships an end-to-end setup script:

git clone https://github.com/vllm-project/production-stack.git
cd deployment_on_cloud/aws  
bash entry_point.sh YOUR_AWS_REGION production_stack_specification.yaml  # YAML_PATH

AWS Tutorial link (requires AWS)

  • This script:
    • Creates an EKS cluster with GPU nodes
    • Sets up EFS for shared storage
    • Configures the EFS CSI driver
    • Deploys the vLLM stack via Helm
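
Once the Helm release is up (via either path), a quick sanity check is to confirm the pods are Running and grab the NLB hostname from the router’s LoadBalancer service. Service names vary with the release name, so list them first:

# Confirm engine and router pods are running
kubectl get pods

# Find the router's LoadBalancer service and its external NLB hostname
kubectl get svc

# Query the OpenAI-compatible endpoint through the NLB (hostname is illustrative)
curl http://<nlb-hostname>/v1/models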

3️⃣. Google Cloud GKE Deployment

GKE deployments benefit from Google Cloud’s managed Kubernetes service and integrated GPU support.
The GKE deployment includes:

  • Filestore integration: Mounts a shared Filestore volume so every engine replica can load the same weights.
  • Router exposure: Use external GCP load balancer for public access (or turn on Ingress if you prefer).
# gke-values.yaml
servingEngineSpec:
  runtimeClassName: nvidia          # GKE GPU runtime class
  modelSpec:
    - name: mistral
      modelURL: mistralai/Mistral-7B-Instruct-v0.2
      replicaCount: 3
      requestGPU: 1
      requestGPUType: "nvidia.com/gpu"
      pvcStorage: 100Gi

routerSpec:
  serviceType: LoadBalancer         # Creates a regional external LB
  ingress:
    enabled: false                  # Optional: keep off if using LB
# ---
sharedStorage:
  enabled: true
  size: 200Gi
  # Filestore CSI storage class created by: 
  # gcloud filestore instances create ...
  storageClass: filestore-csi      # Or your own Filestore CSI class
helm install vllm-gke vllm/vllm-stack -f gke-values.yaml
Alternatively, the end-to-end setup script for GCP:

git clone https://github.com/vllm-project/production-stack.git
cd deployment_on_cloud/gcp
bash entry_point.sh YOUR_GCP_REGION production_stack_specification_basic.yaml

GCP Tutorial link (requires gcloud)

  • This script:
    • Creates a GKE cluster with GPU nodes (n2d-standard-8) and Managed Prometheus
    • Configures the Horizontal Pod Autoscaling and HTTP Load Balancing add-ons
    • Automatically handles GKE persistent disk provisioning
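
If you manage Filestore yourself rather than relying on the script, the instance behind the filestore-csi storage class can be created roughly as follows (instance name, zone, tier, share name, and network are placeholders; check the gcloud docs for your setup):

gcloud filestore instances create vllm-models \
  --zone=us-central1-a \
  --tier=BASIC_HDD \
  --file-share=name=vllm_share,capacity=1TB \
  --network=name=default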

4️⃣. Azure AKS Deployment

Azure deployments leverage AKS with specific configurations for enterprise workloads.
Azure-specific features include:

  • Azure Files Integration: Shared storage across multiple pods
  • Managed Identity: Secure access to Azure resources
  • Azure Monitor: Comprehensive observability integration
# aks-values.yaml
servingEngineSpec:
  runtimeClassName: nvidia
  modelSpec:
  - name: phi3
    modelURL: microsoft/phi-3-mini-4k-instruct
    requestGPU: 1
    replicaCount: 2

routerSpec:
  ingress:
    enabled: true
    className: nginx

sharedStorage:
  enabled: true
  storageClass: azurefile
  size: 200Gi
helm install vllm-aks vllm/vllm-stack -f aks-values.yaml
Alternatively, the end-to-end setup script for Azure:

git clone https://github.com/vllm-project/production-stack.git
cd deployment_on_cloud/azure
bash entry_point.sh setup EXAMPLE_YAML_PATH

Azure Tutorial link (requires Azure CLI)

  • This script:
    • Creates an AKS cluster with GPU nodes
    • Configures Azure Files
    • Integrates with Azure Monitor
    • Deploys the vLLM inference stack
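
If you prefer to provision the GPU capacity yourself instead of using the script, adding a GPU node pool to an existing AKS cluster looks roughly like this (resource group, cluster name, and VM size are placeholders):

az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-aks \
  --name gpunp \
  --node-count 1 \
  --node-vm-size Standard_NC6s_v3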

🛡️Deployment Best Practices

🧮Resource Planning

  • GPU Sizing: Match GPU memory to model requirements
  • CPU Allocation: Ensure sufficient CPU for preprocessing
  • Memory Planning: Account for model weights and KV cache

🪢High Availability

  • Multi-Zone Deployment: Distribute across availability zones
  • Health Checks: Configure appropriate startup and readiness probes
  • Backup Strategies: Implement model weight backup procedures

⚡Performance Optimization

  • Request Routing: Configure session stickiness for cache efficiency
  • Batch Processing: Optimize batch sizes for throughput
  • Model Quantization: Use appropriate data types (bfloat16, int8)
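
As a rough illustration of the routing and precision knobs above, a values fragment combining session-sticky routing with bfloat16 precision might look like this (a sketch only; all fields are documented earlier in this post, and the numbers should be tuned for your hardware):

routerSpec:
  routingLogic: "session"       # Session stickiness improves KV cache hit rates
  sessionKey: "x-user-id"

servingEngineSpec:
  modelSpec:
    - name: "llama3"
      # ... other model fields
      vllmConfig:
        dtype: "bfloat16"       # Lower-precision weights reduce GPU memory pressure
        maxNumSeqs: 64          # Batch-size knob; tune for throughput vs latency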

🚀Conclusion

We’ve just seen how to leverage the Helm chart to deploy the vLLM Production-Stack in both local and cloud-based K8s clusters. By now, we’ve covered the architecture along with varied deployment options. The vLLM Production-Stack makes it easy to:

  • Choose the right vLLM configuration for your use case
  • Understand and implement performance tuning options
  • Enable advanced features like caching and quantization
  • Configure distributed inference properly

With this material in hand, you’re now ready to efficiently scale LLMs using vLLM production stack in Kubernetes!
