
Intro
Deploying vLLM manually is fine for a lab, but running it in production means dealing with Kubernetes, autoscaling, GPU orchestration, and observability. That’s where the vLLM Production Stack comes in – a Terraform-based blueprint that delivers production-ready LLM serving with enterprise-grade foundations.
In this post, we’ll deploy it on Amazon EKS, covering everything from network architecture and GPU provisioning to observability, for both GPU and CPU inference (see our PR).
💡You can find our code in the official repo ➡️ production-stack-tutorials-terraform or from our vllm-lab-repo .
📂 Project Structure
./
├── main.tf
├── network.tf
├── storage.tf
├── provider.tf
├── variables.tf
├── output.tf
├── cluster-tools.tf
├── datasources.tf
├── iam_role.tf
├── vllm-production-stack.tf
├── env-vars.template
├── terraform.tfvars.template
├── modules/
│   └── llm-stack/
│       └── helm/
│           ├── cpu/
│           └── gpu/
├── config/
│   ├── calico-values.tpl
│   └── kubeconfig.tpl
└── README.md          # ← you are here

🧰 Prerequisites
Before you begin, ensure you have the following:
| Tool | Version-tested | Purpose |
|---|---|---|
| Terraform | ≥ 1.5.7 | Infrastructure provisioning |
| AWS CLI v2 | ≥ 2.16 | AWS authentication |
| kubectl | ≥ 1.30 | Kubernetes management |
| helm | ≥ 3.14 | Used for Helm chart debugging |
| jq | latest | JSON parsing (optional) |
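If the tools are already installed, here is a quick version sanity check (a minimal sketch using each CLI's standard version flag):

```bash
# Confirm prerequisite versions before deploying
terraform version
aws --version
kubectl version --client
helm version --short
jq --version
```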
Follow the steps below to install the tools (expand)👇🏼
# Install tools
sudo apt update && sudo apt install -y jq curl unzip gpg
wget -qO- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install -y terraform
curl -s "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && unzip -q awscliv2.zip && sudo ./aws/install && rm -rf aws awscliv2.zip
curl -sLO "https://dl.k8s.io/release/$(curl -Ls https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" && sudo install kubectl /usr/local/bin/ && rm kubectl
curl -s https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg >/dev/null && echo "deb [signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm.list && sudo apt update && sudo apt install -y helm

- Configure AWS profile
aws configure --profile myprofile
export AWS_PROFILE=myprofile # ← If null, Terraform exec auth will use the default profile

What’s in the stack?📦
This Terraform stack delivers a production-ready vLLM serving environment on Amazon EKS, supporting both CPU and GPU inference, with operational best practices from the AWS Integration and Automation (aws-ia) blueprints baked in.
It’s designed for real-world production workloads with:
✅ Enterprise-grade infrastructure: Built on AWS blueprint patterns.
✅ Flexible compute: Switch between CPU and GPU inference with a single flag.
✅ Operational excellence: Prometheus, Grafana, and CloudWatch integration out of the box.
✅ Security-first: IAM roles, secrets management, and network segmentation.
✅ Scalability: Auto-scaling node groups and efficient CNI with Calico overlay.
✅ Production hardening: Load balancers, ingress controllers, and persistent storage.
Architecture Overview

Deployment layers – The stack provisions infrastructure in logical layers that adapt based on your hardware choice:
| Layer | Component | CPU Mode | GPU Mode |
|---|---|---|---|
| Infrastructure | VPC + EKS + Calico CNI | ✅ Always | ✅ Always |
| Add-ons | EBS, ALB, Prometheus | ✅ Always | ✅ Always |
| vLLM Stack | Secrets + Helm chart | ✅ CPU nodes | ✅ GPU nodes + NVIDIA operator |
| Networking | Load balancer + Ingress | ✅ ALB | ✅ ALB |
1. Networking Foundation
The stack creates a production-grade network topology:
- Custom `/16` VPC with 3 public + 3 private subnets
- Public/private subnet architecture for workload isolation
- Calico overlay CNI: Supports 110+ pods per node (vs. 17 with AWS VPC CNI)
- Single NAT Gateway for cost optimization
- AWS Load Balancer Controller for ingress management
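Once the cluster is up, you can sanity-check the Calico pod-density claim above; a minimal sketch (it assumes your kubeconfig already points at the new cluster):

```bash
# Allocatable pods per node; with the Calico overlay this should far exceed the VPC CNI limit
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODS:.status.allocatable.pods
```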
2. EKS Cluster
An EKS control plane (v1.30) with two managed node group types:
| Node Pool | Instance Type | Purpose |
|---|---|---|
| CPU Pool (default) | `t3a.large` (2 vCPU / 8 GiB) | Control plane & CPU inference |
| GPU Pool (optional) | `g5.xlarge` (1× NVIDIA A10G) | GPU-accelerated inference |
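After the cluster is provisioned, a quick way to see which pool each node belongs to (a sketch using the standard instance-type label):

```bash
# t3a.large nodes = CPU pool, g5.xlarge nodes = GPU pool (when enabled)
kubectl get nodes -L node.kubernetes.io/instance-type
```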
3. Essential Add-ons
Production-ready add-ons via terraform-aws-eks-blueprints-addons:
| Category | Add-on |
|---|---|
| CNI | Calico overlay (replaces VPC CNI) |
| Storage | EBS CSI (block) + EFS CSI (shared) |
| Ingress | AWS Load Balancer Controller (ALB/NLB) |
| Core | CoreDNS, kube-proxy, Metrics Server |
| Observability | kube-prometheus-stack, CloudWatch (Optional) |
| Security | cert-manager, External Secrets |
| GPU (Optional) | NVIDIA device plugin or GPU Operator |
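To confirm the add-ons landed, you can list the Helm releases and their pods; a minimal sketch (release and namespace names may vary slightly with chart versions):

```bash
# Helm releases installed by the blueprints add-ons module
helm list -A
# Core add-on pods (CoreDNS, kube-proxy, metrics-server, EBS CSI, ALB controller)
kubectl get pods -n kube-system
# Calico, monitoring, and cert-manager components
kubectl get pods -A | grep -Ei 'calico|tigera|prometheus|cert-manager'
```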
4. vLLM Production Stack
The heart of the deployment: a production-ready model-serving layer:
✅ Model: TinyLlama-1.1B (default, fully customizable)
✅ Load balancing: Round-robin router service across replicas
✅ Secrets: Hugging Face token stored as Kubernetes Secret
✅ Storage: Init container with persistent model caching at `/data/models/`
✅ Monitoring: Prometheus metrics endpoint for observability
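A minimal sketch to verify these pieces after deployment (it assumes the chart installs into the vllm namespace, as used later in this post, and that the secret is named hf-token-secret):

```bash
# Router and model-serving pods plus their services
kubectl -n vllm get pods,svc
# Hugging Face token stored as a Kubernetes Secret
kubectl -n vllm get secret hf-token-secret
# Persistent volume claim backing the /data/models/ cache
kubectl -n vllm get pvc
```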
5. Hardware Flexibility: CPU vs GPU
You can deploy the vLLM Production Stack on either CPU or GPU using the `inference_hardware` variable:
# Deploy on CPU (default)
export TF_VAR_inference_hardware="cpu"
# Or deploy on GPU
export TF_VAR_inference_hardware="gpu"

🖥️ AWS GPU Instance Types Available
Available GPU instances (T4 · L4 · V100 · A10G · A100)
| GPU | EC2 Instance | vCPUs | Memory (GiB) | GPUs | GPU Memory (GiB) | Best For |
|---|---|---|---|---|---|---|
| NVIDIA Tesla T4 | g4dn.xlarge | 4 | 16 | 1 | 16 | Small inference |
| | g4dn.2xlarge | 8 | 32 | 1 | 16 | Medium inference |
| | g4dn.4xlarge | 16 | 64 | 1 | 16 | Large inference |
| | g4dn.12xlarge | 48 | 192 | 4 | 64 | Multi-GPU inference |
| NVIDIA L4 | g6.xlarge | 4 | 16 | 1 | 24 | Cost-effective inference |
| | g6.2xlarge | 8 | 32 | 1 | 24 | Balanced inference workloads |
| | g6.4xlarge | 16 | 64 | 1 | 24 | Large-scale inference |
| NVIDIA Tesla V100 | p3.2xlarge | 8 | 61 | 1 | 16 | Training & inference |
| | p3.8xlarge | 32 | 244 | 4 | 64 | Multi-GPU training |
| | p3.16xlarge | 64 | 488 | 8 | 128 | Large-scale training |
| NVIDIA A100 | p4d.24xlarge | 96 | 1,152 | 8 | 320 | Large-scale AI training |
| NVIDIA A10G | g5.xlarge | 4 | 16 | 1 | 24 | General GPU workloads |
| | g5.2xlarge | 8 | 32 | 1 | 24 | Medium GPU workloads |
| | g5.4xlarge | 16 | 64 | 1 | 24 | Large GPU workloads |
| | g5.8xlarge | 32 | 128 | 1 | 24 | Large-scale inference |
| | g5.12xlarge | 48 | 192 | 4 | 96 | Multi-GPU training |
| | g5.24xlarge | 96 | 384 | 4 | 96 | Ultra-large-scale training |
| | g5.48xlarge | 192 | 768 | 8 | 192 | Extreme-scale training |
Getting started
The deployment automatically provisions only the required infrastructure based on your hardware selection.
| Phase | Component | Action | Condition |
|---|---|---|---|
| 1. Infrastructure | VPC | Provision VPC with 3 public + 3 private subnets | Always |
| | EKS | Deploy v1.30 cluster + CPU node group (t3a.large) | Always |
| | CNI | Remove aws-node, install Calico overlay (VXLAN) | Always |
| | Add-ons | Deploy EBS CSI, ALB controller, kube-prometheus | Always |
| 2. vLLM Stack | HF secret | Create `hf-token-secret` for Hugging Face | `enable_vllm = true` |
| | CPU Deployment | Deploy vLLM on existing CPU nodes | `inference_hardware = "cpu"` |
| | GPU Infrastructure | Provision GPU node group (g5.xlarge) | `inference_hardware = "gpu"` |
| | GPU Operator | Deploy NVIDIA operator/plugin | `inference_hardware = "gpu"` |
| | GPU Deployment | Deploy vLLM on GPU nodes with scheduling | `inference_hardware = "gpu"` |
| 3. Networking | Load Balancer | Configure ALB and ingress for external access | `enable_vllm = true` |
| 4. Model Storage | Model cache | Load model via init container into `/data/models` | `enable_vllm = true` |
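As a quick illustration of how these conditions combine, here is a minimal GPU-mode flag set (a sketch based on the variables in env-vars.template; the full walkthrough follows in the deployment steps below):

```bash
# Minimal GPU-mode deployment flags (sketch)
export TF_VAR_enable_vllm="true"          # phase 2: deploy the vLLM stack
export TF_VAR_inference_hardware="gpu"    # provision the GPU node group + NVIDIA operator/plugin
export TF_VAR_hf_token="<your-hf-token>"  # stored as the hf-token-secret Kubernetes Secret
terraform init && terraform apply
```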
🔵 Deployment Steps
1️⃣Clone the repository
The vLLM EKS deployment code is located under the tutorials/terraform/eks directory (see below):
🌍 Repo: https://github.com/vllm-project/production-stack
This repo is also a one-stop shop for other Terraform deployments🚀
./tutorials/terraform/eks

- Navigate to the production-stack repo and the terraform/eks tutorial folder
$ git clone https://github.com/vllm-project/production-stack
📂..
$ cd production-stack/tutorials/terraform/eks/

2️⃣ Set Up Environment Variables
Use an env-vars file to export your TF_VAR variables, or use terraform.tfvars. Replace the placeholders with your values:
cp env-vars.template env-vars
vim env-vars # Set HF token and customize deployment options
source env-vars

Usage examples
- Option 1: Through Environment Variables
# Copy and customize
$ cp env-vars.template env-vars
$ vi env-vars
################################################################################
# AWS Credentials and Region
################################################################################
export TF_VAR_aws_profile="" # default: "" - Fill your AWS Profile name (e.g., default, cloudthrill)
export TF_VAR_region="us-east-2" # default: "us-east-2" - AWS Region for your deployment
################################################################################
# EKS Cluster Configuration
################################################################################
# ☸️ EKS cluster basics
export TF_VAR_cluster_name="vllm-eks-prod" # default: "vllm-eks-prod"
export TF_VAR_cluster_version="1.30" # default: "1.30" - Kubernetes cluster version
################################################################################
# 🤖 NVIDIA setup selector
# • plugin -> device-plugin only
# • operator_no_driver -> GPU Operator (driver disabled)
# • operator_custom -> GPU Operator with your YAML
################################################################################
export TF_VAR_nvidia_setup="plugin" # default: "plugin"
################################################################################
# 🧠 LLM Inference Configuration
################################################################################
export TF_VAR_enable_vllm="true" # default: "false" - Set to "true" to deploy vLLM
export TF_VAR_hf_token="" # default: "" - Hugging Face token for model download (if needed)
export TF_VAR_inference_hardware="gpu" # default: "cpu" - "cpu" or "gpu"
################################################################################
# Paths to Helm chart values templates for vLLM.
# These paths are relative to the root of your Terraform project.
export TF_VAR_gpu_vllm_helm_config="./modules/llm-stack/helm/gpu/gpu-tinyllama-light-ingress.tpl" # default: ""
export TF_VAR_cpu_vllm_helm_config="./modules/llm-stack/helm/cpu/cpu-tinyllama-light-ingress.tpl" # default: ""
################################################################################
# ⚙️ Node-group sizing
################################################################################
# CPU pool (always present)
export TF_VAR_cpu_node_min_size="1" # default: 1
export TF_VAR_cpu_node_max_size="3" # default: 3
export TF_VAR_cpu_node_desired_size="2" # default: 2
# GPU pool (ignored unless inference_hardware = "gpu")
export TF_VAR_gpu_node_min_size="1" # default: 1
export TF_VAR_gpu_node_max_size="1" # default: 1
export TF_VAR_gpu_node_desired_size="1" # default: 1
...snip
$ source env-vars

- Option 2: Through Terraform Variables
# Copy and customize
$ cp terraform.tfvars.template terraform.tfvars
$ vim terraform.tfvars

- Load the variables into your shell: before running Terraform, source the env-vars file:
$ source env-vars

3️⃣ Run Terraform deployment:
You can now safely run terraform plan and apply. The deployment creates about 100 resources in total, including a local kubeconfig.
terraform init
terraform plan
terraform apply

Full Plan
Plan: 100 to add, 0 to change, 0 to destroy.
Changes to Outputs:
+ Stack_Info = "Built with ❤️ by @Cloudthrill"
+ cluster_endpoint = (known after apply)
+ cluster_name = "vllm-eks"
+ cluster_public_subnets_info = (known after apply)
+ cluster_subnets_info = (known after apply)
+ configure_kubectl = (sensitive value)
+ cpu_node_instance_type = [
+ "t3.xlarge",
]
+ gpu_node_instance_type = [
+ "g6.2xlarge",
]
+ grafana_forward_cmd = "kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-prometheus-stack"
+ private_subnets = [
+ (known after apply),
+ (known after apply),
+ (known after apply),
]
+ public_subnets = [
+ (known after apply),
+ (known after apply),
+ (known after apply),
]
+ vllm_api_url = "http://k8s-vllm-vllmgpui-92bfb93f13-1398084155.us-east-2.elb.amazonaws.com/v1"
+ vllm_ingress_hostname = "k8s-vllm-vllmgpui-92bfb93f13-1398084155.us-east-2.elb.amazonaws.com"
+ vpc_cidr = (known after apply)
+ vpc_id = (known after apply)

After the deployment, you can interact with the cluster using kubectl by running the command below:
export KUBECONFIG=$PWD/kubeconfig

4️⃣ Observability (Grafana Login)
Upon deployment, you can access the Grafana dashboards using port forwarding. URL → http://localhost:3000
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-prometheus-stack

- Run the below command to fetch the password
kubectl get secret -n kube-prometheus-stack kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 --decode
- Username: admin
- Password: output of the kubectl command above
Automatic vLLM Dashboard
In this stack, the vLLM dashboard and service monitor are automatically configured for Grafana.
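You can confirm this wiring with a quick check (a sketch; it assumes the kube-prometheus-stack CRDs are installed, which this stack enables by default):

```bash
# ServiceMonitor scraping the vLLM metrics endpoint
kubectl get servicemonitors -A | grep -i vllm
# Grafana dashboard ConfigMaps picked up by the kube-prometheus-stack sidecar
kubectl get configmaps -n kube-prometheus-stack | grep -i dashboard
```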

For benchmarking vLLM Production Stack performance, check the multi-round QA tutorial.
5️⃣ Destroying the Infrastructure 🚧
To delete everything, run the command below (note: sometimes you need to run it twice, as the load balancer can be slow to delete):
terraform destroy -auto-approve

🛠️Configuration knobs
This stack provides extensive customization options to tailor your deployment:
| Variable | Default | Description |
|---|---|---|
| `region` | `us-east-2` | AWS Region |
| `pod_cidr` | `192.168.0.0/16` | Calico Pod overlay network |
| `inference_hardware` | `cpu` | Inference hardware and node pools (`cpu` or `gpu`) |
| `enable_efs_csi_driver` | `true` | Shared storage |
| `enable_vllm` | `true` | Deploy the vLLM Production Stack |
| `hf_token` | «secret» | HF model download token |
| `enable_prometheus` | `true` | Prometheus/Grafana stack |
| `cluster_version` | `1.30` | Kubernetes version |
| `nvidia_setup` | `plugin` | GPU setup mode (plugin/operator) |
📓This is just a subset. For the full list of 20+ configurable variables, consult the configuration template : env-vars.template
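Any of these knobs can be overridden just like the earlier examples, either in terraform.tfvars or as TF_VAR_ environment variables; for example (a sketch with illustrative values):

```bash
# Example overrides (illustrative values)
export TF_VAR_pod_cidr="192.168.0.0/16"
export TF_VAR_enable_efs_csi_driver="false"
export TF_VAR_nvidia_setup="operator_no_driver"
terraform plan
```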
🧪 Quick Test
1️⃣ Router Endpoint and API URL
1.1 Router Endpoint through port forwarding: run the following command:
kubectl -n vllm port-forward svc/vllm-gpu-router-service 30080:80

1.2 Extracting the Router URL via AWS ALB Ingress
If the AWS LB Controller is enabled (enable_lb_ctl=true), the endpoint URL is shown in the vllm_ingress_hostname output:
$ kubectl get ingress -n vllm -o json| jq -r .items[0].status.loadBalancer.ingress[].hostname
k8s-vllm-vllmingr-983dc8fd68-161738753.us-east-2.elb.amazonaws.com
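The same values are exposed as Terraform outputs (seen in the plan above), so you can also read them without kubectl:

```bash
# Read the endpoint straight from the Terraform outputs
terraform output vllm_ingress_hostname
terraform output vllm_api_url
```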
2️⃣ List models
# Case 1 : Port forwarding
export vllm_api_url=http://localhost:30080/v1
# Case 2 : AWS ALB Ingress enabled
export vllm_api_url=http://k8s-vllm-vllmingr-983dc8fd68-161738753.us-east-2.elb.amazonaws.com/v1
# ---- check models
curl -s ${vllm_api_url}/models | jq .
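The /v1/models response follows the OpenAI-compatible schema served by the vLLM router, so you can pull out just the model IDs; a small sketch:

```bash
# Extract only the model IDs from the OpenAI-compatible response
curl -s ${vllm_api_url}/models | jq -r '.data[].id'
```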
3️⃣ Completion (works with both the ingress and port-forwarding URLs)
curl ${vllm_api_url}/completions -H "Content-Type: application/json" -d '{
"model": "/data/models/tinyllama",
"prompt": "Toronto is a",
"max_tokens": 20,
"temperature": 0
}'| jq .choices[].text
# Example output:
"city that is known for its vibrant nightlife, and there are plenty of bars and clubs"
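If the model you serve ships a chat template (e.g., a chat-tuned TinyLlama variant rather than the base model), the OpenAI-compatible chat endpoint works the same way; a hedged sketch:

```bash
# Chat completion (sketch): only works if the served model has a chat template
curl ${vllm_api_url}/chat/completions -H "Content-Type: application/json" -d '{
  "model": "/data/models/tinyllama",
  "messages": [{"role": "user", "content": "Give me one fun fact about Toronto."}],
  "max_tokens": 50,
  "temperature": 0
}' | jq -r '.choices[0].message.content'
```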
4️⃣ vLLM model service
kubectl -n vllm get svc

🎯Troubleshooting:
Since v2.5, the AWS LB Controller adds a MutatingWebhook that intercepts every Service of type LoadBalancer. As a result, Services can time out waiting for the webhook to become available:
** no endpoints available for service "aws-load-balancer-webhook-service"
➡️ Solution: We turned off the webhook, as we don’t use serviceType: LoadBalancer here.
# in your blueprints-addons block
aws_load_balancer_controller = {
enable_service_mutator_webhook = false # turns off the webhook
}

🫧 Cleanup Notes
In rare cases, you may need to manually clean up some AWS resources while running terraform destroy.
Here are the most common scenarios:
1️⃣ Load balancer blocking public subnets/igw deletion
If AWS LB Controller is enabled (enable_lb_ctl=true), you may hit VPC deletion issues due to an LB dependency.
🧹Run the cleanup commands below👇🏼:
export PROFILE=<profile_name>   # e.g. default
export region=<region>          # e.g. us-east-2
# 1. Clean up load balancer
alb_name=$(aws elbv2 describe-load-balancers --query "LoadBalancers[*].LoadBalancerName" --output text --region $region --profile $PROFILE)
alb_arn=$(aws elbv2 describe-load-balancers --names $alb_name --query 'LoadBalancers[0].LoadBalancerArn' \
--output text --region $region --profile $PROFILE)
# delete :
aws elbv2 delete-load-balancer --load-balancer-arn "$alb_arn" --region $region --profile $PROFILE

Re-run terraform destroy:
terraform destroy

💡 Alternatively, set enable_lb_ctl to false (see variables.tf).
2️⃣ vllm namespace
If the AWS LB Controller is enabled (enable_lb_ctl=true), the vLLM namespace can get stuck in the “Terminating” state, and you might need to patch some finalizers.
🧹Run the cleanup commands below👇🏼:
# Remove finalizers from AWS resources
RESOURCE_NAME=$(kubectl get targetgroupbinding.elbv2.k8s.aws -n vllm -o jsonpath='{.items[0].metadata.name}')
##
kubectl patch targetgroupbinding.elbv2.k8s.aws $RESOURCE_NAME -n vllm --type=merge -p '{"metadata":{"finalizers":[]}}'
## -- the delete might not be needed
kubectl delete targetgroupbinding.elbv2.k8s.aws $RESOURCE_NAME -n vllm --ignore-not-found=true
INGRESS_NAME=$(kubectl get ingress -n vllm -o jsonpath='{.items[0].metadata.name}')
kubectl patch ingress $INGRESS_NAME -n vllm --type=merge -p '{"metadata":{"finalizers":[]}}'

Re-run terraform destroy:
terraform destroy

💡 Alternatively, set enable_lb_ctl to false (see variables.tf).
3️⃣ Calico Cleanup Jobs
If you encounter job conflicts during Calico removal (e.g., jobs.batch "tigera-operator-uninstall" already exists):
🧹Run the cleanup commands below👇🏼:
# use the following commands to delete the jobs manually first:
kubectl -n tigera-operator delete job tigera-operator-uninstall --ignore-not-found=true
kubectl -n tigera-operator delete job tigera-operator-delete-crds --ignore-not-found=true
kubectl delete ns tigera-operator --ignore-not-found=true

Re-run terraform destroy:
terraform destroy
4️⃣ Clean up associated security groups
If the AWS LB Controller is enabled (enable_lb_ctl=true), you might need to delete orphaned (non-default) security groups before the subnets can be destroyed:
🧹Run the cleanup commands below👇🏼:
VPC_ID=$(aws ec2 describe-vpcs --query 'Vpcs[?Tags[?Key==`Name` && Value==`vllm-vpc`]].VpcId' --output text --profile $PROFILE)
# Deletion
aws ec2 describe-security-groups --filters Name=vpc-id,Values=${VPC_ID} --query "SecurityGroups[?starts_with(GroupName, 'k8s-') || contains(GroupName, 'vllm')].GroupId" --output text --profile ${PROFILE} | tr -s '[:space:]' '\n' | xargs -r -I{} aws ec2 delete-security-group --group-id {} --profile ${PROFILE}

Re-run terraform destroy:
terraform destroy

💡 Alternatively, set enable_lb_ctl to false (see variables.tf).
Conclusion
This deployment gives you a solid foundation for production LLM serving on AWS. Here are a few ways to extend it:
- Custom models: Swap TinyLlama for Llama 3, Mixtral, or your fine-tuned models
- Auto-scaling: Configure HPA and cluster autoscaling for dynamic workloads
- CI/CD: Automate with GitOps (ArgoCD, FluxCD)
- Multi-region: Deploy across multiple AWS regions for HA
- LMCache: Leverage KV-cache offloading using the LMCache server
Next Steps 🚀
- In the next post, we’re taking this stack to Azure AKS with a production-critical addition: automated SSL/TLS certificate management.
📚 Additional Resources
- vLLM Documentation
- terraform-aws-eks
- EKS Blueprints
- Calico Documentation
- AWS Load Balancer Controller

Run AI Your Way — In Your Cloud
Want full control over your AI backend? The CloudThrill vLLM Private Inference POC is still open, but not forever.
📢 Secure your spot (only a few left), **apply now**!
Run AI assistants, RAG, or internal models on an AI backend **privately in your cloud**:
✅ No external APIs
✅ No vendor lock-in
✅ Total data control
**Your infra. Your models. Your rules…**
🙋🏻♀️ If you like this content, please subscribe to our blog newsletter ❤️.