vLLM Production Stack on Amazon EKS with Terraform🧑🏼‍🚀

Intro

Deploying vLLM manually is fine for a lab, but running it in production means dealing with Kubernetes, autoscaling, GPU orchestration, and observability. That’s where the vLLM Production Stack comes in – a Terraform-based blueprint that delivers production-ready LLM serving with enterprise-grade foundations.

In this post, we’ll deploy it on Amazon EKS, covering everything from network architecture and GPU provisioning to observability, for both GPU and CPU inference (see our PR).

💡You can find our code in the official repo ➡️ production-stack-tutorials-terraform or in our vllm-lab-repo.

This is part of CloudThrill‘s ongoing contribution to the vLLM Production Stack project, extending Terraform deployment patterns across AWS, Azure, GCP, Oracle OCI, and Nebius.

📂 Project Structure

./
├── main.tf
├── network.tf
├── storage.tf
├── provider.tf
├── variables.tf
├── output.tf
├── cluster-tools.tf
├── datasources.tf
├── iam_role.tf
├── vllm-production-stack.tf
├── env-vars.template
├── terraform.tfvars.template
├── modules/
│   └── llm-stack/
│       └── helm/
│           ├── cpu/
│           └── gpu/
├── config/
│   ├── calico-values.tpl
│   └── kubeconfig.tpl
└── README.md                          # ← you are here

🧰Prerequisites

Before you begin, ensure you have the following:

| Tool | Version tested | Purpose |
|------|----------------|---------|
| Terraform | ≥ 1.5.7 | Infrastructure provisioning |
| AWS CLI v2 | ≥ 2.16 | AWS authentication |
| kubectl | ≥ 1.30 | Kubernetes management |
| helm | ≥ 3.14 | Helm chart debugging |
| jq | latest | JSON parsing (optional) |
Follow the steps below to install the tools 👇🏼
# Install tools
sudo apt update && sudo apt install -y jq curl unzip gpg
wget -qO- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install -y terraform
curl -s "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && unzip -q awscliv2.zip && sudo ./aws/install && rm -rf aws awscliv2.zip
curl -sLO "https://dl.k8s.io/release/$(curl -Ls https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" && sudo install kubectl /usr/local/bin/ && rm kubectl
curl -s https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg >/dev/null && echo "deb [signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm.list && sudo apt update && sudo apt install -y helm
  • Configure AWS profile
aws configure --profile myprofile
export AWS_PROFILE=myprofile        # ← if unset, Terraform exec auth falls back to the default profile
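
  • Verify the credentials before running Terraform (a quick sanity check, using the `myprofile` name from above):
aws sts get-caller-identity --profile myprofile   # confirms the identity Terraform will use
aws configure get region --profile myprofile      # confirms the default region for that profile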

What’s in the stack?📦

This Terraform stack delivers a production-ready vLLM serving environment on Amazon EKS, supporting both CPU and GPU inference, with operational best practices from the AWS Integration and Automation (aws-ia) blueprints.

It’s designed for real-world production workloads with:
✅ Enterprise-grade infrastructure: Built on AWS blueprint patterns.
✅ Flexible compute: Switch between CPU and GPU inference with a single flag.
✅ Operational excellence: Prometheus, Grafana, and CloudWatch integration out of the box.
✅ Security-first: IAM roles, secrets management, and network segmentation.
✅ Scalability: Auto-scaling node groups and efficient CNI with Calico overlay.
✅ Production hardening: Load balancers, ingress controllers, and persistent storage.

Architecture Overview

Deployment layers – The stack provisions infrastructure in logical layers that adapt based on your hardware choice:

| Layer | Component | CPU Mode | GPU Mode |
|-------|-----------|----------|----------|
| Infrastructure | VPC + EKS + Calico CNI | ✅ Always | ✅ Always |
| Add-ons | EBS, ALB, Prometheus | ✅ Always | ✅ Always |
| vLLM Stack | Secrets + Helm chart | ✅ CPU nodes | ✅ GPU nodes + NVIDIA operator |
| Networking | Load balancer + Ingress | ✅ ALB | ✅ ALB |

1. Networking Foundation

The stack creates a production-grade network topology:

  • Custom `/16` VPC (with 3 public + 3 private subnets)
  • Public/private subnet architecture for workload isolation
  • Calico overlay CNI: Supports 110+ pods per node (vs. 17 with AWS VPC CNI)
  • Single NAT Gateway for cost optimization
  • AWS Load Balancer Controller for ingress management
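
Once the cluster is up, you can sanity-check the overlay and the per-node pod capacity; a minimal check, assuming Calico was installed through the Tigera operator as this stack does:
# Calico components should report Available, and allocatable pods should be around 110 per node
kubectl get tigerastatus
kubectl get nodes -o custom-columns='NODE:.metadata.name,PODS:.status.allocatable.pods'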

2. EKS Cluster

An EKS control plane (v1.30) with two managed node group types:

| Node Pool | Instance Type | Purpose |
|-----------|---------------|---------|
| CPU Pool (default) | `t3a.large` (2 vCPU / 8 GiB) | Cluster add-ons & CPU inference |
| GPU Pool (optional) | `g5.xlarge` (1× NVIDIA A10G) | GPU-accelerated inference |
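
After provisioning, both pools show up as regular EKS managed node groups. A quick way to see which pool a node belongs to (standard kubectl, assuming only the default node labels):
# Show the instance type per node; the GPU column stays empty for CPU-only nodes
kubectl get nodes -L node.kubernetes.io/instance-type
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu'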

3. Essential Add-ons

Production-ready add-ons via terraform-aws-eks-blueprints-addons:

| Category | Add-on |
|----------|--------|
| CNI | Calico overlay (replaces VPC CNI) |
| Storage | EBS CSI (block) + EFS CSI (shared) |
| Ingress | AWS Load Balancer Controller (ALB/NLB) |
| Core | CoreDNS, kube-proxy, Metrics Server |
| Observability | kube-prometheus-stack, CloudWatch (optional) |
| Security | cert-manager, External Secrets |
| GPU (optional) | NVIDIA device plugin or GPU Operator |
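
Since these add-ons are installed as Helm releases by the blueprints module, a quick way to confirm they all landed (release names can vary slightly between addon versions):
# List the add-on Helm releases and spot-check their pods
helm list -A
kubectl get pods -A | grep -Ei 'calico|ebs-csi|load-balancer|prometheus'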

4. vLLM Production Stack

The heart of the deployment: a production-ready model-serving layer with:

✅ Model: TinyLlama-1.1B (default, fully customizable)
✅ Load balancing: Round-robin router service across replicas
✅ Secrets: Hugging Face token stored as Kubernetes Secret
✅ Storage: Init container with persistent model caching at `/data/models/`
✅ Monitoring: Prometheus metrics endpoint for observability
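
With enable_vllm = true, a quick health check of the serving layer looks like this (assuming the vllm namespace used throughout this post):
# Router, serving pods, ingress and the Hugging Face token secret
kubectl -n vllm get pods,svc,ingress
kubectl -n vllm get secret hf-token-secret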

5. Hardware Flexibility: CPU vs GPU

You can deploy the vLLM Production Stack on either CPU or GPU nodes using the `inference_hardware` variable:

# Deploy on CPU (default)
export TF_VAR_inference_hardware="cpu"

# Or deploy on GPU
export TF_VAR_inference_hardware="gpu"

🖥️ AWS GPU Instance Types Available

Available GPU instances (T4 · L4 · V100 · A10G · A100)
| GPU | AWS EC2 Instance | vCPUs | Memory (GiB) | GPUs | GPU Memory (GiB) | Best For |
|-----|------------------|-------|--------------|------|------------------|----------|
| NVIDIA Tesla T4 | g4dn.xlarge | 4 | 16 | 1 | 16 | Small inference |
| | g4dn.2xlarge | 8 | 32 | 1 | 16 | Medium inference |
| | g4dn.4xlarge | 16 | 64 | 1 | 16 | Large inference |
| | g4dn.12xlarge | 48 | 192 | 4 | 64 | Multi-GPU inference |
| NVIDIA L4 | g6.xlarge | 4 | 16 | 1 | 24 | Cost-effective inference |
| | g6.2xlarge | 8 | 32 | 1 | 24 | Balanced inference workloads |
| | g6.4xlarge | 16 | 64 | 1 | 24 | Large-scale inference |
| NVIDIA Tesla V100 | p3.2xlarge | 8 | 61 | 1 | 16 | Training & inference |
| | p3.8xlarge | 32 | 244 | 4 | 64 | Multi-GPU training |
| | p3.16xlarge | 64 | 488 | 8 | 128 | Large-scale training |
| NVIDIA A100 | p4d.24xlarge | 96 | 1,152 | 8 | 320 | Large-scale AI training |
| NVIDIA A10G | g5.xlarge | 4 | 16 | 1 | 24 | General GPU workloads |
| | g5.2xlarge | 8 | 32 | 1 | 24 | Medium GPU workloads |
| | g5.4xlarge | 16 | 64 | 1 | 24 | Large GPU workloads |
| | g5.8xlarge | 32 | 128 | 1 | 24 | Large-scale inference |
| | g5.12xlarge | 48 | 192 | 4 | 96 | Multi-GPU training |
| | g5.24xlarge | 96 | 384 | 4 | 96 | Ultra-large-scale training |
| | g5.48xlarge | 192 | 768 | 8 | 192 | Extreme-scale training |


Note: Check the full list of AWS GPU instance offerings here.
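
Before picking an instance type, it's worth confirming it is actually offered in your target region and AZs; a standard AWS CLI query (swap the type and region as needed):
aws ec2 describe-instance-type-offerings --location-type availability-zone \
  --filters Name=instance-type,Values=g5.xlarge \
  --query 'InstanceTypeOfferings[].Location' --output table --region us-east-2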

Getting started

The deployment automatically provisions only the required infrastructure based on your hardware selection.

| Phase | Component | Action | Condition |
|-------|-----------|--------|-----------|
| 1. Infrastructure | VPC | Provision VPC with 3 public + 3 private subnets | Always |
| | EKS | Deploy v1.30 cluster + CPU node group (t3a.large) | Always |
| | CNI | Remove aws-node, install Calico overlay (VXLAN) | Always |
| | Add-ons | Deploy EBS CSI, ALB controller, kube-prometheus | Always |
| 2. vLLM Stack | HF secret | Create `hf-token-secret` for Hugging Face | enable_vllm = true |
| | CPU deployment | Deploy vLLM on existing CPU nodes | inference_hardware = "cpu" |
| | GPU infrastructure | Provision GPU node group (g5.xlarge) | inference_hardware = "gpu" |
| | GPU Operator | Deploy NVIDIA operator/plugin | inference_hardware = "gpu" |
| | GPU deployment | Deploy vLLM on GPU nodes with GPU node scheduling | inference_hardware = "gpu" |
| 3. Networking | Load balancer | Configure ALB and ingress for external access | enable_vllm = true |
| 4. Model storage | Model cache | Load model locally via init container into `/data/models` | enable_vllm = true |

🔵 Deployment Steps

1️⃣Clone the repository

The vLLM EKS deployment lives under the tutorials/terraform/eks directory (see below):

🌍 Repo: https://github.com/vllm-project/production-stack
This repo is also a one-stop shop for other Terraform deployments 🚀

./tutorials/terraform/eks

  • Clone the production-stack repository and navigate to the terraform eks tutorial folder
 $ git clone https://github.com/vllm-project/production-stack
 $ cd production-stack/tutorials/terraform/eks/

2️⃣ Set Up Environment Variables

Use an env-vars file to export your TF_VAR variables, or use terraform.tfvars. Replace placeholders with your values:

cp env-vars.template env-vars
vim env-vars  # Set HF token and customize deployment options
source env-vars

Usage examples

  • Option 1: Through Environment Variables
# Copy and customize
$ cp env-vars.template env-vars
$ vi env-vars
################################################################################
# AWS Credentials and Region
################################################################################
export TF_VAR_aws_profile=""     # default: "" - Fill your AWS Profile name (e.g., default, cloudthrill)
export TF_VAR_region="us-east-2" # default: "us-east-2" - AWS Region for your deployment
################################################################################
# EKS Cluster Configuration
################################################################################
# ☸️ EKS cluster basics
export TF_VAR_cluster_name="vllm-eks-prod" # default: "vllm-eks-prod"
export TF_VAR_cluster_version="1.30"       # default: "1.30" - Kubernetes cluster version
 ################################################################################
 # 🤖 NVIDIA setup selector
 #   • plugin           -> device-plugin only
 #   • operator_no_driver -> GPU Operator (driver disabled)
 #   • operator_custom  -> GPU Operator with your YAML
 ################################################################################
 export TF_VAR_nvidia_setup="plugin" # default: "plugin"
 ################################################################################
 # 🧠 LLM Inference Configuration
 ################################################################################
 export TF_VAR_enable_vllm="true"         # default: "false" - Set to "true" to deploy vLLM
 export TF_VAR_hf_token=""                # default: "" - Hugging Face token for model download (if needed)
 export TF_VAR_inference_hardware="gpu"   # default: "cpu" - "cpu" or "gpu"
 ################################################################################
 # Paths to Helm chart values templates for vLLM.
 # These paths are relative to the root of your Terraform project.
 export TF_VAR_gpu_vllm_helm_config="./modules/llm-stack/helm/gpu/gpu-tinyllama-light-ingress.tpl" # default: ""
 export TF_VAR_cpu_vllm_helm_config="./modules/llm-stack/helm/cpu/cpu-tinyllama-light-ingress.tpl" # default: ""
 ################################################################################
 # ⚙️ Node-group sizing
 ################################################################################
 # CPU pool (always present)
 export TF_VAR_cpu_node_min_size="1"     # default: 1
 export TF_VAR_cpu_node_max_size="3"     # default: 3
 export TF_VAR_cpu_node_desired_size="2" # default: 2
 # GPU pool (ignored unless inference_hardware = "gpu")
 export TF_VAR_gpu_node_min_size="1"     # default: 1
 export TF_VAR_gpu_node_max_size="1"     # default: 1
 export TF_VAR_gpu_node_desired_size="1" # default: 1
 ...snip
 $ source env-vars
  • Option 2: Through Terraform Variables
 # Copy and customize
 $ cp terraform.tfvars.template terraform.tfvars
 $ vim terraform.tfvars
  • Load the variables into your shell: before running Terraform, source the env-vars file:
$ source env-vars

3️⃣ Run Terraform deployment:

You can now safely run terraform plan and apply. The deployment creates about 100 resources in total, including a local kubeconfig file.

terraform init
terraform plan
terraform apply
Full Plan
Plan: 100 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + Stack_Info                  = "Built with ❤️ by @Cloudthrill"
  + cluster_endpoint            = (known after apply)
  + cluster_name                = "vllm-eks"
  + cluster_public_subnets_info = (known after apply)
  + cluster_subnets_info        = (known after apply)
  + configure_kubectl           = (sensitive value)
  + cpu_node_instance_type      = [
      + "t3.xlarge",
    ]
  + gpu_node_instance_type      = [
      + "g6.2xlarge",
    ]
  + grafana_forward_cmd         = "kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-prometheus-stack"
  + private_subnets             = [
      + (known after apply),
      + (known after apply),
      + (known after apply),
    ]
  + public_subnets              = [
      + (known after apply),
      + (known after apply),
      + (known after apply),
    ]
  + vllm_api_url                = "http://k8s-vllm-vllmgpui-92bfb93f13-1398084155.us-east-2.elb.amazonaws.com/v1"
  + vllm_ingress_hostname       = "k8s-vllm-vllmgpui-92bfb93f13-1398084155.us-east-2.elb.amazonaws.com"
  + vpc_cidr                    = (known after apply)
  + vpc_id                      = (known after apply)

After the deployment completes, you can interact with the cluster using kubectl once you export the generated kubeconfig:

export KUBECONFIG=$PWD/kubeconfig
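
If you are unsure where the kubeconfig came from, the (sensitive) configure_kubectl output typically holds the kubectl configuration command; then confirm the cluster responds:
terraform output configure_kubectl   # reveal the stored kubectl configuration command
kubectl get nodes                    # nodes should report Ready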

4️⃣ Observability (Grafana Login)

Upon deployment, you can access the Grafana dashboards using port forwarding. URL → http://localhost:3000

kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-prometheus-stack
  • Run the command below to fetch the admin password
kubectl get secret -n kube-prometheus-stack kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 --decode
  • Username: admin
  • Password: output of the kubectl command above

Automatic vLLM Dashboard

In this stack, the vLLM dashboard and ServiceMonitor are configured automatically for Grafana.
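
To confirm Prometheus is actually scraping the router, you can check that the ServiceMonitor exists and that the monitoring pods are healthy (a quick check; object names may differ slightly per chart version):
kubectl get servicemonitors -A | grep -i vllm
kubectl -n kube-prometheus-stack get pods   # Prometheus and Grafana should be Running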

To benchmark vLLM Production Stack performance, check the multi-round QA tutorial.

5️⃣ Destroying the Infrastructure 🚧

To delete everything, run the command below (note: you may sometimes need to run it twice, as the load balancer can be slow to delete).

terraform destroy -auto-approve



🛠️Configuration knobs

This stack provides extensive customization options to tailor your deployment:

| Variable | Default | Description |
|----------|---------|-------------|
| region | us-east-2 | AWS region |
| pod_cidr | 192.168.0.0/16 | Calico pod overlay network |
| inference_hardware | cpu | Inference hardware and node pools (cpu or gpu) |
| enable_efs_csi_driver | true | Shared storage (EFS CSI) |
| enable_vllm | true | Deploy the vLLM Production Stack |
| hf_token | «secret» | Hugging Face model download token |
| enable_prometheus | true | kube-prometheus-stack (Prometheus + Grafana) |
| cluster_version | 1.30 | Kubernetes version |
| nvidia_setup | plugin | GPU setup mode (plugin/operator) |

📓 This is just a subset. For the full list of 20+ configurable variables, consult the configuration template: env-vars.template

🧪 Quick Test

1️⃣ Router Endpoint and API URL

1.1 Router endpoint via port forwarding: run the following command:

kubectl -n vllm port-forward svc/vllm-gpu-router-service 30080:80

1.2 Extracting the router URL via AWS ALB ingress
If the AWS LB Controller is enabled (enable_lb_ctl=true), the endpoint URL is exposed in the vllm_ingress_hostname output:

$ kubectl get ingress -n vllm -o json| jq -r .items[0].status.loadBalancer.ingress[].hostname
k8s-vllm-vllmingr-983dc8fd68-161738753.us-east-2.elb.amazonaws.com


2️⃣ List models

# Case 1 : Port forwarding
export vllm_api_url=http://localhost:30080/v1
# Case 2 : AWS ALB Ingress enabled
export vllm_api_url=http://k8s-vllm-vllmingr-983dc8fd68-161738753.us-east-2.elb.amazonaws.com/v1

# ---- check models
curl -s ${vllm_api_url}/models | jq .
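
The response follows the OpenAI /v1/models schema, so the served model name (used in the completion calls below) can be extracted directly:
# Print just the model id, e.g. /data/models/tinyllama
curl -s ${vllm_api_url}/models | jq -r '.data[].id'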


3️⃣ Completion (applicable to both ingress and port-forwarding URLs)

curl ${vllm_api_url}/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/data/models/tinyllama",
        "prompt": "Toronto is a",
        "max_tokens": 20,
        "temperature": 0
    }' | jq .choices[].text

# Output:
"city that is known for its vibrant nightlife, and there are plenty of bars and clubs"
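
The router also exposes the OpenAI-compatible chat endpoint. A minimal sketch against the same model (assuming its chat template is available, as with the default TinyLlama chat checkpoint):
curl ${vllm_api_url}/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/data/models/tinyllama",
        "messages": [{"role": "user", "content": "Name one famous landmark in Toronto."}],
        "max_tokens": 30,
        "temperature": 0
    }' | jq -r '.choices[].message.content'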


4️⃣ vLLM model service

kubectl -n vllm get svc

🎯Troubleshooting:

From v2.5 onward, the LB controller adds a MutatingWebhook that intercepts every Service of type LoadBalancer. As a result, Services can time out waiting for the webhook to become available:

** no endpoints available for service "aws-load-balancer-webhook-service"

 ➡️Solution: We turned off the webhook as we don’t use serviceType: LoadBalancer here.

# in your blueprints-addons block
aws_load_balancer_controller = {
 enable_service_mutator_webhook = false   # turns off the webhook
}
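
After applying the change, you can confirm the Service mutator webhook is gone while the controller itself stays healthy (with the default addon settings the deployment lands in kube-system; adjust if you changed the namespace):
kubectl get mutatingwebhookconfigurations | grep -i aws-load-balancer
kubectl -n kube-system get deploy aws-load-balancer-controller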

🫧 Cleanup Notes

In rare cases, you may need to manually clean up some AWS resources while running terraform destroy.
Here are the most common scenarios:

Note: These manual steps are only needed if terraform destroy encounters specific dependency issues.

1️⃣ Load balancer blocking public subnets/igw deletion

If AWS LB Controller is enabled (enable_lb_ctl=true), you may hit VPC deletion issues due to an LB dependency.

🧹Run the cleanup commands below👇🏼:
export PROFILE=<profile_name>   # e.g. default
export region=<region>          # e.g. us-east-2

# 1. Clean up load balancer
alb_name=$(aws elbv2 describe-load-balancers --query "LoadBalancers[*].LoadBalancerName" \
  --output text --region $region --profile $PROFILE)

alb_arn=$(aws elbv2 describe-load-balancers --names  $alb_name --query 'LoadBalancers[0].LoadBalancerArn' \
  --output text --region $region --profile $PROFILE)

# delete :
aws elbv2 delete-load-balancer --load-balancer-arn "$alb_arn" --region $region --profile $PROFILE

Re-Run terraform destroy

terraform destroy
Note: Alternatively, you can disable AWS Load Balancer Controller creation altogether by setting the variable enable_lb_ctl to false (see variables.tf).

2️⃣ vllm namespace

If the AWS LB Controller is enabled (enable_lb_ctl=true), the vllm namespace can get stuck in a “Terminating” state; you might need to patch the finalizers.

🧹Run the cleanup commands below👇🏼:
# Remove finalizers from AWS resources
RESOURCE_NAME=$(kubectl get targetgroupbinding.elbv2.k8s.aws -n vllm -o jsonpath='{.items[0].metadata.name}')
##
kubectl patch targetgroupbinding.elbv2.k8s.aws $RESOURCE_NAME -n vllm --type=merge -p '{"metadata":{"finalizers":[]}}'
## -- the delete might not be needed
kubectl delete targetgroupbinding.elbv2.k8s.aws $RESOURCE_NAME -n vllm --ignore-not-found=true
INGRESS_NAME=$(kubectl get ingress -n vllm -o jsonpath='{.items[0].metadata.name}')
kubectl patch ingress $INGRESS_NAME -n vllm --type=merge -p '{"metadata":{"finalizers":[]}}'

Re-Run terraform destroy

terraform destroy
Note: Alternatively, you can disable AWS Load Balancer Controller creation altogether by setting the variable enable_lb_ctl to false (see variables.tf).

3️⃣ Calico Cleanup Jobs

If you encounter job conflicts during Calico removal (e.g. jobs.batch “tigera-operator-uninstall” already exists):

🧹Run the cleanup commands below👇🏼:
# use the following commands to delete the jobs manually first:
kubectl -n tigera-operator delete job tigera-operator-uninstall --ignore-not-found=true
kubectl -n tigera-operator delete job tigera-operator-delete-crds --ignore-not-found=true
kubectl delete ns tigera-operator --ignore-not-found=true

Re-Run terraform destroy

terraform destroy

4️⃣ Clean up associated security groups 

If the AWS LB Controller is enabled (enable_lb_ctl=true), you might need to delete orphan (non-default) security groups before the subnets can be destroyed:

🧹Run the cleanup commands below👇🏼:
VPC_ID=$(aws ec2 describe-vpcs --query 'Vpcs[?Tags[?Key==`Name` && Value==`vllm-vpc`]].VpcId' --output text --profile $PROFILE)
# Deletion
aws ec2 describe-security-groups --filters Name=vpc-id,Values=${VPC_ID} \
  --query "SecurityGroups[?starts_with(GroupName, 'k8s-') || contains(GroupName, 'vllm')].GroupId" \
  --output text --profile ${PROFILE} | tr -s '[:space:]' '\n' | \
  xargs -r -I{} aws ec2 delete-security-group --group-id {} --profile ${PROFILE}

Re-Run terraform destroy

terraform destroy
Note: Alternatively, you can disable AWS Load Balancer Controller creation altogether by setting the variable enable_lb_ctl to false (see variables.tf).

Conclusion

This deployment gives you a solid foundation for production LLM serving on AWS. Here are a few ways to extend it:

  • Custom models: Swap TinyLlama for Llama 3, Mixtral, or your own fine-tuned models (see the example below)
  • Auto-scaling: Configure HPA and cluster autoscaling for dynamic workloads
  • CI/CD: Automate with GitOps (ArgoCD, FluxCD)
  • Multi-region: Deploy across multiple AWS regions for HA
  • LMCache: Leverage KV-cache offloading using the LMCache server
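
For example, swapping the model is mostly a matter of pointing the stack at your own Helm values template and re-applying; the new file name below is hypothetical, copy one of the shipped templates as a starting point:
# Copy a shipped template, edit the model/repository fields, then point Terraform at it
cp modules/llm-stack/helm/gpu/gpu-tinyllama-light-ingress.tpl \
   modules/llm-stack/helm/gpu/gpu-llama3-ingress.tpl   # hypothetical new file
export TF_VAR_gpu_vllm_helm_config="./modules/llm-stack/helm/gpu/gpu-llama3-ingress.tpl"
terraform apply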

Next Steps 🚀

  • In the next post, we’re taking this stack to Azure AKS with a production-critical addition: automated SSL/TLS certificate management.

