
Intro
Deploying vLLM manually is fine for a lab, but running it in production means dealing with Kubernetes, autoscaling, GPU orchestration, and observability. That’s where the vLLM Production Stack comes in – a Terraform-based blueprint that delivers production-ready LLM serving with enterprise-grade foundations.
In this post, we’ll deploy it on Amazon EKS, covering everything from network architecture and GPU provisioning to observability, for both GPU and CPU inference (see our PR).
💡You can find our code in the official repo ➡️ production-stack-tutorials-terraform or from our vllm-lab-repo .
📂 Project Structure
./
├── main.tf
├── network.tf
├── storage.tf
├── provider.tf
├── variables.tf
├── output.tf
├── cluster-tools.tf
├── datasources.tf
├── iam_role.tf
├── vllm-production-stack.tf
├── env-vars.template
├── terraform.tfvars.template
├── modules/
│   └── llm-stack/
│       └── helm/
│           ├── cpu/
│           └── gpu/
├── config/
│   ├── calico-values.tpl
│   └── kubeconfig.tpl
└── README.md          # ← you are here

🧰 Prerequisites
Before you begin, ensure you have the following:
| Tool | Version-tested | Purpose |
|---|---|---|
| Terraform | ≥ 1.5.7 | Infrastructure provisioning |
| AWS CLI v2 | ≥ 2.16 | AWS authentication |
| kubectl | ≥ 1.30 | Kubernetes management |
| helm | ≥ 3.14 | Used for Helm chart debugging |
| jq | latest | JSON parsing (optional) |
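If the tools are already installed, here is a quick version sanity check (a minimal sketch using each CLI's standard version flag):

```bash
# Confirm prerequisite versions before deploying
terraform version
aws --version
kubectl version --client
helm version --short
jq --version
```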
Follow the steps below to install the tools (expand)👇🏼
# Install tools
sudo apt update && sudo apt install -y jq curl unzip gpg
wget -qO- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install -y terraform
curl -s "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && unzip -q awscliv2.zip && sudo ./aws/install && rm -rf aws awscliv2.zip
curl -sLO "https://dl.k8s.io/release/$(curl -Ls https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" && sudo install kubectl /usr/local/bin/ && rm kubectl
curl -s https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg >/dev/null && echo "deb [signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm.list && sudo apt update && sudo apt install -y helm

- Configure AWS profile
aws configure --profile myprofile
export AWS_PROFILE=myprofile # ← If null, Terraform exec auth will use the default profile

What’s in the stack?📦
This Terraform stack delivers a production-ready vLLM serving environment on Amazon EKS, supporting both CPU and GPU inference, with operational best practices from the AWS Integration and Automation (aws-ia) blueprints baked in.
It’s designed for real-world production workloads with:
✅ Enterprise-grade infrastructure: Built on AWS blueprint patterns.
✅ Flexible compute: Switch between CPU and GPU inference with a single flag.
✅ Operational excellence: Prometheus, Grafana, and CloudWatch integration out of the box.
✅ Security-first: IAM roles, secrets management, and network segmentation.
✅ Scalability: Auto-scaling node groups and efficient CNI with Calico overlay.
✅ Production hardening: Load balancers, ingress controllers, and persistent storage.
Architecture Overview

Deployment layers – The stack provisions infrastructure in logical layers that adapt based on your hardware choice:
| Layer | Component | CPU Mode | GPU Mode |
|---|---|---|---|
| Infrastructure | VPC + EKS + Calico CNI | ✅ Always | ✅ Always |
| Add-ons | EBS, ALB, Prometheus | ✅ Always | ✅ Always |
| vLLM Stack | Secrets + Helm chart | ✅ CPU nodes | ✅ GPU nodes + NVIDIA operator |
| Networking | Load balancer + Ingress | ✅ ALB | ✅ ALB |
1. Networking Foundation
The stack creates a production-grade network topology:
- Custom `/16` VPC with 3 public + 3 private subnets
- Public/private subnet architecture for workload isolation
- Calico overlay CNI: Supports 110+ pods per node (vs. 17 with AWS VPC CNI)
- Single NAT Gateway for cost optimization
- AWS Load Balancer Controller for ingress management
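Once the cluster is up, you can sanity-check the Calico pod-density claim above; a minimal sketch (it assumes your kubeconfig already points at the new cluster):

```bash
# Allocatable pods per node; with the Calico overlay this should far exceed the VPC CNI limit
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODS:.status.allocatable.pods
```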
2. EKS Cluster
An EKS control plane (v1.30) with two managed node group types:
| Node Pool | Instance Type | Purpose |
|---|---|---|
| CPU Pool (default) | `t3a.large` (2 vCPU / 8 GiB) | Control plane & CPU inference |
| GPU Pool (optional) | `g5.xlarge` (1× NVIDIA A10G) | GPU-accelerated inference |
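After the cluster is provisioned, a quick way to see which pool each node belongs to (a sketch using the standard instance-type label):

```bash
# t3a.large nodes = CPU pool, g5.xlarge nodes = GPU pool (when enabled)
kubectl get nodes -L node.kubernetes.io/instance-type
```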
3. Essential Add-ons
Production-ready add-ons via terraform-aws-eks-blueprints-addons:
| Category | Add-on |
|---|---|
| CNI | Calico overlay (replaces VPC CNI) |
| Storage | EBS CSI (block) + EFS CSI (shared) |
| Ingress | AWS Load Balancer Controller (ALB/NLB) |
| Core | CoreDNS, kube-proxy, Metrics Server |
| Observability | kube-prometheus-stack, CloudWatch (Optional) |
| Security | cert-manager, External Secrets |
| GPU (Optional) | NVIDIA device plugin or GPU Operator |
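To confirm the add-ons landed, you can list the Helm releases and their pods; a minimal sketch (release and namespace names may vary slightly with chart versions):

```bash
# Helm releases installed by the blueprints add-ons module
helm list -A
# Core add-on pods (CoreDNS, kube-proxy, metrics-server, EBS CSI, ALB controller)
kubectl get pods -n kube-system
# Calico, monitoring, and cert-manager components
kubectl get pods -A | grep -Ei 'calico|tigera|prometheus|cert-manager'
```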
4. vLLM Production Stack
The heart of the deployment: a production-ready model-serving layer:
✅ Model: TinyLlama-1.1B (default, fully customizable)
✅ Load balancing: Round-robin router service across replicas
✅ Secrets: Hugging Face token stored as Kubernetes Secret
✅ Storage: Init container with persistent model caching at `/data/models/`
✅ Monitoring: Prometheus metrics endpoint for observability
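A minimal sketch to verify these pieces after deployment (it assumes the chart installs into the vllm namespace, as used later in this post, and that the secret is named hf-token-secret):

```bash
# Router and model-serving pods plus their services
kubectl -n vllm get pods,svc
# Hugging Face token stored as a Kubernetes Secret
kubectl -n vllm get secret hf-token-secret
# Persistent volume claim backing the /data/models/ cache
kubectl -n vllm get pvc
```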
5. Hardware Flexibility: CPU vs GPU
You can deploy the vLLM Production Stack on either CPU or GPU using the `inference_hardware` variable:
# Deploy on CPU (default)
export TF_VAR_inference_hardware="cpu"
# Or deploy on GPU
export TF_VAR_inference_hardware="gpu"

🖥️ AWS GPU Instance Types Available
Available GPU instances (T4 · L4 · V100 · A10G · A100)
| GPU | EC2 Instance | vCPUs | Memory (GiB) | GPUs | GPU Memory (GiB) | Best For |
|---|---|---|---|---|---|---|
| NVIDIA Tesla T4 | g4dn.xlarge | 4 | 16 | 1 | 16 | Small inference |
| | g4dn.2xlarge | 8 | 32 | 1 | 16 | Medium inference |
| | g4dn.4xlarge | 16 | 64 | 1 | 16 | Large inference |
| | g4dn.12xlarge | 48 | 192 | 4 | 64 | Multi-GPU inference |
| NVIDIA L4 | g6.xlarge | 4 | 16 | 1 | 24 | Cost-effective inference |
| | g6.2xlarge | 8 | 32 | 1 | 24 | Balanced inference workloads |
| | g6.4xlarge | 16 | 64 | 1 | 24 | Large-scale inference |
| NVIDIA Tesla V100 | p3.2xlarge | 8 | 61 | 1 | 16 | Training & inference |
| | p3.8xlarge | 32 | 244 | 4 | 64 | Multi-GPU training |
| | p3.16xlarge | 64 | 488 | 8 | 128 | Large-scale training |
| NVIDIA A100 | p4d.24xlarge | 96 | 1,152 | 8 | 320 | Large-scale AI training |
| NVIDIA A10G | g5.xlarge | 4 | 16 | 1 | 24 | General GPU workloads |
| | g5.2xlarge | 8 | 32 | 1 | 24 | Medium GPU workloads |
| | g5.4xlarge | 16 | 64 | 1 | 24 | Large GPU workloads |
| | g5.8xlarge | 32 | 128 | 1 | 24 | Large-scale inference |
| | g5.12xlarge | 48 | 192 | 4 | 96 | Multi-GPU training |
| | g5.24xlarge | 96 | 384 | 4 | 96 | Ultra-large-scale training |
| | g5.48xlarge | 192 | 768 | 8 | 192 | Extreme-scale training |
Getting started
The deployment automatically provisions only the required infrastructure based on your hardware selection.
| Phase | Component | Action | Condition |
|---|---|---|---|
| 1. Infrastructure | VPC | Provision VPC with 3 public + 3 private subnets | Always |
| | EKS | Deploy v1.30 cluster + CPU node group (t3a.large) | Always |
| | CNI | Remove aws-node, install Calico overlay (VXLAN) | Always |
| | Add-ons | Deploy EBS CSI, ALB controller, kube-prometheus | Always |
| 2. vLLM Stack | HF secret | Create `hf-token-secret` for Hugging Face | `enable_vllm = true` |
| | CPU Deployment | Deploy vLLM on existing CPU nodes | `inference_hardware = "cpu"` |
| | GPU Infrastructure | Provision GPU node group (g5.xlarge) | `inference_hardware = "gpu"` |
| | GPU Operator | Deploy NVIDIA operator/plugin | `inference_hardware = "gpu"` |
| | GPU Deployment | Deploy vLLM on GPU nodes with scheduling | `inference_hardware = "gpu"` |
| 3. Networking | Load Balancer | Configure ALB and ingress for external access | `enable_vllm = true` |
| 4. Model Storage | Model cache | Load model via init container into `/data/models` | `enable_vllm = true` |
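As a quick illustration of how these conditions combine, here is a minimal GPU-mode flag set (a sketch based on the variables in env-vars.template; the full walkthrough follows in the deployment steps below):

```bash
# Minimal GPU-mode deployment flags (sketch)
export TF_VAR_enable_vllm="true"          # phase 2: deploy the vLLM stack
export TF_VAR_inference_hardware="gpu"    # provision the GPU node group + NVIDIA operator/plugin
export TF_VAR_hf_token="<your-hf-token>"  # stored as the hf-token-secret Kubernetes Secret
terraform init && terraform apply
```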
🔵 Deployment Steps
1️⃣Clone the repository
The vLLM EKS deployment code is located under the tutorials/terraform/eks directory (see below):
🌍 Repo: https://github.com/vllm-project/production-stack
This repo is also a one-stop shop for other Terraform deployments🚀
./tutorials/terraform/eks

- Navigate to the production-stack repo and the terraform/eks tutorial folder
$ git clone https://github.com/vllm-project/production-stack
📂..
$ cd production-stack/tutorials/terraform/eks/

2️⃣ Set Up Environment Variables
Use an env-vars file to export your TF_VAR variables, or use terraform.tfvars. Replace the placeholders with your values:
cp env-vars.template env-vars
vim env-vars # Set HF token and customize deployment options
source env-vars

Usage examples
- Option 1: Through Environment Variables
# Copy and customize
$ cp env-vars.template env-vars
$ vi env-vars
################################################################################
# AWS Credentials and Region
################################################################################
export TF_VAR_aws_profile="" # default: "" - Fill your AWS Profile name (e.g., default, cloudthrill)
export TF_VAR_region="us-east-2" # default: "us-east-2" - AWS Region for your deployment
################################################################################
# EKS Cluster Configuration
################################################################################
# ☸️ EKS cluster basics
export TF_VAR_cluster_name="vllm-eks-prod" # default: "vllm-eks-prod"
export TF_VAR_cluster_version="1.30" # default: "1.30" - Kubernetes cluster version
################################################################################
# 🤖 NVIDIA setup selector
# • plugin -> device-plugin only
# • operator_no_driver -> GPU Operator (driver disabled)
# • operator_custom -> GPU Operator with your YAML
################################################################################
export TF_VAR_nvidia_setup="plugin" # default: "plugin"
################################################################################
# 🧠 LLM Inference Configuration
################################################################################
export TF_VAR_enable_vllm="true" # default: "false" - Set to "true" to deploy vLLM
export TF_VAR_hf_token="" # default: "" - Hugging Face token for model download (if needed)
export TF_VAR_inference_hardware="gpu" # default: "cpu" - "cpu" or "gpu"
################################################################################
# Paths to Helm chart values templates for vLLM.
# These paths are relative to the root of your Terraform project.
export TF_VAR_gpu_vllm_helm_config="./modules/llm-stack/helm/gpu/gpu-tinyllama-light-ingress.tpl" # default: ""
export TF_VAR_cpu_vllm_helm_config="./modules/llm-stack/helm/cpu/cpu-tinyllama-light-ingress.tpl" # default: ""
################################################################################
# ⚙️ Node-group sizing
################################################################################
# CPU pool (always present)
export TF_VAR_cpu_node_min_size="1" # default: 1
export TF_VAR_cpu_node_max_size="3" # default: 3
export TF_VAR_cpu_node_desired_size="2" # default: 2
# GPU pool (ignored unless inference_hardware = "gpu")
export TF_VAR_gpu_node_min_size="1" # default: 1
export TF_VAR_gpu_node_max_size="1" # default: 1
export TF_VAR_gpu_node_desired_size="1" # default: 1
...snip
$ source env-vars

- Option 2: Through Terraform Variables
# Copy and customize
$ cp terraform.tfvars.template terraform.tfvars
$ vim terraform.tfvars

- Load the variables into your shell: before running Terraform, source the env-vars file:
$ source env-vars

3️⃣ Run Terraform deployment:
You can now safely run terraform plan and apply. The deployment creates about 100 resources in total, including a local kubeconfig.
terraform init
terraform plan
terraform apply

Full Plan
Plan: 100 to add, 0 to change, 0 to destroy.
Changes to Outputs:
+ Stack_Info = "Built with ❤️ by @Cloudthrill"
+ cluster_endpoint = (known after apply)
+ cluster_name = "vllm-eks"
+ cluster_public_subnets_info = (known after apply)
+ cluster_subnets_info = (known after apply)
+ configure_kubectl = (sensitive value)
+ cpu_node_instance_type = [
+ "t3.xlarge",
]
+ gpu_node_instance_type = [
+ "g6.2xlarge",
]
+ grafana_forward_cmd = "kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-prometheus-stack"
+ private_subnets = [
+ (known after apply),
+ (known after apply),
+ (known after apply),
]
+ public_subnets = [
+ (known after apply),
+ (known after apply),
+ (known after apply),
]
+ vllm_api_url = "http://k8s-vllm-vllmgpui-92bfb93f13-1398084155.us-east-2.elb.amazonaws.com/v1"
+ vllm_ingress_hostname = "k8s-vllm-vllmgpui-92bfb93f13-1398084155.us-east-2.elb.amazonaws.com"
+ vpc_cidr = (known after apply)
+ vpc_id = (known after apply)

After the deployment, you can interact with the cluster using kubectl by running the command below:
export KUBECONFIG=$PWD/kubeconfig

4️⃣ Observability (Grafana Login)
Upon deployment, you can access the Grafana dashboards using port forwarding. URL → http://localhost:3000
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-prometheus-stack

- Run the below command to fetch the password
kubectl get secret -n kube-prometheus-stack kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 --decode
- Username: admin
- Password: output of the kubectl command above
Automatic vLLM Dashboard
In this stack, the vLLM dashboard and service monitor are automatically configured for Grafana.
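You can confirm this wiring with a quick check (a sketch; it assumes the kube-prometheus-stack CRDs are installed, which this stack enables by default):

```bash
# ServiceMonitor scraping the vLLM metrics endpoint
kubectl get servicemonitors -A | grep -i vllm
# Grafana dashboard ConfigMaps picked up by the kube-prometheus-stack sidecar
kubectl get configmaps -n kube-prometheus-stack | grep -i dashboard
```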

For benchmarking vLLM Production Stack performance, check the multi-round QA tutorial.
5️⃣ Destroying the Infrastructure 🚧
To delete everything, run the command below (note: sometimes you need to run it twice, as the load balancer can be slow to delete):
terraform destroy -auto-approve

🛠️Configuration knobs
This stack provides extensive customization options to tailor your deployment:
| Variable | Default | Description |
|---|---|---|
| `region` | `us-east-2` | AWS Region |
| `pod_cidr` | `192.168.0.0/16` | Calico Pod overlay network |
| `inference_hardware` | `cpu` | Inference hardware and node pools (`cpu` or `gpu`) |
| `enable_efs_csi_driver` | `true` | Shared storage |
| `enable_vllm` | `true` | Deploy the vLLM Production Stack |
| `hf_token` | «secret» | HF model download token |
| `enable_prometheus` | `true` | Prometheus/Grafana stack |
| `cluster_version` | `1.30` | Kubernetes version |
| `nvidia_setup` | `plugin` | GPU setup mode (plugin/operator) |
📓This is just a subset. For the full list of 20+ configurable variables, consult the configuration template : env-vars.template
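Any of these knobs can be overridden just like the earlier examples, either in terraform.tfvars or as TF_VAR_ environment variables; for example (a sketch with illustrative values):

```bash
# Example overrides (illustrative values)
export TF_VAR_pod_cidr="192.168.0.0/16"
export TF_VAR_enable_efs_csi_driver="false"
export TF_VAR_nvidia_setup="operator_no_driver"
terraform plan
```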
🧪 Quick Test
1️⃣ Router Endpoint and API URL
1.1 Router Endpoint through port forwarding: run the following command:
kubectl -n vllm port-forward svc/vllm-gpu-router-service 30080:80

1.2 Extracting the Router URL via AWS ALB Ingress
If the AWS LB Controller is enabled (enable_lb_ctl=true), the endpoint URL is shown in the vllm_ingress_hostname output:
$ kubectl get ingress -n vllm -o json| jq -r .items[0].status.loadBalancer.ingress[].hostname
k8s-vllm-vllmingr-983dc8fd68-161738753.us-east-2.elb.amazonaws.com
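The same values are exposed as Terraform outputs (seen in the plan above), so you can also read them without kubectl:

```bash
# Read the endpoint straight from the Terraform outputs
terraform output vllm_ingress_hostname
terraform output vllm_api_url
```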
2️⃣ List models
# Case 1 : Port forwarding
export vllm_api_url=http://localhost:30080/v1
# Case 2 : AWS ALB Ingress enabled
export vllm_api_url=http://k8s-vllm-vllmingr-983dc8fd68-161738753.us-east-2.elb.amazonaws.com/v1
# ---- check models
curl -s ${vllm_api_url}/models | jq .
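The /v1/models response follows the OpenAI-compatible schema served by the vLLM router, so you can pull out just the model IDs; a small sketch:

```bash
# Extract only the model IDs from the OpenAI-compatible response
curl -s ${vllm_api_url}/models | jq -r '.data[].id'
```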
3️⃣ Completion (works with both the ingress and port-forwarding URLs)
curl ${vllm_api_url}/completions -H "Content-Type: application/json" -d '{
"model": "/data/models/tinyllama",
"prompt": "Toronto is a",
"max_tokens": 20,
"temperature": 0
}'| jq .choices[].text
# Example output:
"city that is known for its vibrant nightlife, and there are plenty of bars and clubs"
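If the model you serve ships a chat template (e.g., a chat-tuned TinyLlama variant rather than the base model), the OpenAI-compatible chat endpoint works the same way; a hedged sketch:

```bash
# Chat completion (sketch): only works if the served model has a chat template
curl ${vllm_api_url}/chat/completions -H "Content-Type: application/json" -d '{
  "model": "/data/models/tinyllama",
  "messages": [{"role": "user", "content": "Give me one fun fact about Toronto."}],
  "max_tokens": 50,
  "temperature": 0
}' | jq -r '.choices[0].message.content'
```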
4️⃣ vLLM model service
kubectl -n vllm get svc

🎯Troubleshooting:
Since v2.5, the AWS LB Controller adds a MutatingWebhook that intercepts every Service of type LoadBalancer. As a result, Services can time out waiting for the webhook to become available:
** no endpoints available for service "aws-load-balancer-webhook-service"
➡️ Solution: We turned off the webhook, as we don’t use serviceType: LoadBalancer here.
# in your blueprints-addons block
aws_load_balancer_controller = {
enable_service_mutator_webhook = false # turns off the webhook
}

🫧 Cleanup Notes
In rare cases, you may need to manually clean up some AWS resources while running terraform destroy.
Here are the most common scenarios:
1️⃣ Load balancer blocking public subnets/igw deletion
If AWS LB Controller is enabled (enable_lb_ctl=true), you may hit VPC deletion issues due to an LB dependency.
🧹Run the cleanup commands below👇🏼:
export PROFILE=<profile_name>   # e.g. default
export region=<region>          # e.g. us-east-2
# 1. Clean up load balancer
alb_name=$(aws elbv2 describe-load-balancers --query "LoadBalancers[*].LoadBalancerName" --output text --region $region --profile $PROFILE)
alb_arn=$(aws elbv2 describe-load-balancers --names $alb_name --query 'LoadBalancers[0].LoadBalancerArn' \
--output text --region $region --profile $PROFILE)
# delete :
aws elbv2 delete-load-balancer --load-balancer-arn "$alb_arn" --region $region --profile $PROFILE

Re-run terraform destroy:
terraform destroy

💡 Alternatively, set enable_lb_ctl to false (see variables.tf).
2️⃣ vllm namespace
If the AWS LB Controller is enabled (enable_lb_ctl=true), the vLLM namespace can get stuck in the “Terminating” state, and you might need to patch some finalizers.
🧹Run the cleanup commands below👇🏼:
# Remove finalizers from AWS resources
RESOURCE_NAME=$(kubectl get targetgroupbinding.elbv2.k8s.aws -n vllm -o jsonpath='{.items[0].metadata.name}')
##
kubectl patch targetgroupbinding.elbv2.k8s.aws $RESOURCE_NAME -n vllm --type=merge -p '{"metadata":{"finalizers":[]}}'
## -- the delete might not be needed
kubectl delete targetgroupbinding.elbv2.k8s.aws $RESOURCE_NAME -n vllm --ignore-not-found=true
INGRESS_NAME=$(kubectl get ingress -n vllm -o jsonpath='{.items[0].metadata.name}')
kubectl patch ingress $INGRESS_NAME -n vllm --type=merge -p '{"metadata":{"finalizers":[]}}'

Re-run terraform destroy:
terraform destroy

💡 Alternatively, set enable_lb_ctl to false (see variables.tf).
3️⃣ Calico Cleanup Jobs
If you encounter job conflicts during Calico removal (e.g., jobs.batch "tigera-operator-uninstall" already exists):
🧹Run the cleanup commands below👇🏼:
# use the following commands to delete the jobs manually first:
kubectl -n tigera-operator delete job tigera-operator-uninstall --ignore-not-found=true
kubectl -n tigera-operator delete job tigera-operator-delete-crds --ignore-not-found=true
kubectl delete ns tigera-operator --ignore-not-found=true

Re-run terraform destroy:
terraform destroy
4️⃣ Clean up associated security groups
If the AWS LB Controller is enabled (enable_lb_ctl=true), you might need to delete orphaned (non-default) security groups before the subnets can be destroyed:
🧹Run the cleanup commands below👇🏼:
VPC_ID=$(aws ec2 describe-vpcs --query 'Vpcs[?Tags[?Key==`Name` && Value==`vllm-vpc`]].VpcId' --output text --profile $PROFILE)
# Deletion
aws ec2 describe-security-groups --filters Name=vpc-id,Values=${VPC_ID} --query "SecurityGroups[?starts_with(GroupName, 'k8s-') || contains(GroupName, 'vllm')].GroupId" --output text --profile ${PROFILE} | tr -s '[:space:]' '\n' | xargs -r -I{} aws ec2 delete-security-group --group-id {} --profile ${PROFILE}

Re-run terraform destroy:
terraform destroy

💡 Alternatively, set enable_lb_ctl to false (see variables.tf).
Conclusion
This deployment gives you a solid foundation for production LLM serving on AWS. Here are a few ways to extend it:
- Custom models: Swap TinyLlama for Llama 3, Mixtral, or your fine-tuned models
- Auto-scaling: Configure HPA and cluster autoscaling for dynamic workloads
- CI/CD: Automate with GitOps (ArgoCD, FluxCD)
- Multi-region: Deploy across multiple AWS regions for HA
- LMCache: Leverage KV-cache offloading using the LMCache server
Next Steps 🚀
- In the next post, we’re taking this stack to Azure AKS with a production-critical addition: automated SSL/TLS certificate management.
📚 Additional Resources
- vLLM Documentation
- terraform-aws-eks
- EKS Blueprints
- Calico Documentation
- AWS Load Balancer Controller

Run AI Your Way — In Your Cloud
Want full control over your AI backend? The CloudThrill vLLM Private Inference POC is still open, but not forever.
📢 Secure your spot (only a few left), **apply now**!
Run AI assistants, RAG, or internal models on an AI backend **privately in your cloud**:
✅ No external APIs
✅ No vendor lock-in
✅ Total data control
**Your infra. Your models. Your rules…**
🙋🏻♀️ If you like this content, please subscribe to our blog newsletter ❤️.