
Intro
The vLLM Production Stack is designed to work across any cloud provider with Kubernetes. After covering AWS EKS, today we’re deploying vLLM production-stack on Azure AKS with the same Terraform approach.
This guide shows you how to deploy the same production-ready LLM serving environment on Azure, with Azure-specific optimizations. We’ll cover network architecture, certificate automation (using Let’s Encrypt), GPU provisioning, and observability for both CPU and GPU inference, all using Infrastructure as Code.
💡You can find our code in the CloudThrill repo ➡️ production-stack-terraform.
📂 Project Structure
./
├── main.tf
├── network.tf
├── provider.tf
├── variables.tf
├── output.tf
├── cluster-tools.tf
├── datasources.tf
├── vllm-production-stack.tf
├── env-vars.template
├── terraform.tfvars.template
├── modules/
│   ├── avm-res-cs-managedcluster/   # Azure Verified Module for AKS
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── terraform.tf
│   │   ├── locals.tf
│   │   ├── main.diagnostic.tf
│   │   ├── main.nodepool.tf
│   │   ├── main.privateendpoint.tf
│   │   ├── main.telemetry.tf
│   │   └── modules/
│   │       └── nodepool/
│   ├── az-networking/               # Azure Networking Module
│   │   └── vnet/
│   │       ├── main.tf
│   │       ├── variables.tf
│   │       ├── outputs.tf
│   │       ├── terraform.tf
│   │       ├── data.tf
│   │       ├── locals.tf
│   │       ├── main.interfaces.tf
│   │       ├── main.peering.tf
│   │       ├── main.subnet.tf
│   │       ├── main.telemetry.tf
│   │       ├── main.virtual.network.tf
│   │       └── modules/
│   │           ├── peering/
│   │           └── subnet/
│   └── llm-stack/                   # vLLM Helm Charts
│       └── helm/
│           ├── cpu/
│           │   └── cpu-tinyllama-light-ingress-azure.tpl
│           └── gpu/
│               ├── gpu-operator-values.yaml
│               └── gpu-tinyllama-light-ingress-azure.tpl
├── config/
│   ├── helm/
│   │   └── kube-prome-stack.yaml
│   ├── manifests/
│   │   └── letsencrypt-issuer.yaml
│   ├── kubeconfig.tpl
│   └── vllm-dashboard.json
└── README.md                        # ← you are here
🧰 Prerequisites
Before you begin, ensure you have the following:
| Tool | Version | Notes |
|---|---|---|
| Terraform | ≥ 1.9 | Tested on 1.9+ |
| Azure CLI | ≥ 2.50 | For authentication |
| kubectl | ≥ 1.32 | Within ±1 minor version of the control plane |
| jq | optional | JSON helper |
Follow the steps below to install the tools (expand) 👇🏼
# Install tools
sudo apt update && sudo apt install -y jq curl unzip gpg
# Terraform
wget -qO- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install -y terraform
# Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
# kubectl
curl -sLO "https://dl.k8s.io/release/$(curl -Ls https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install kubectl /usr/local/bin/ && rm kubectl
- Configure Azure credentials
# Login to Azure
az login
# Set subscription (if you have multiple)
az account set --subscription "YOUR_SUBSCRIPTION_ID"
# Verify
az account show
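Before moving on, it can help to confirm the installed tool versions line up with the prerequisites table above. A minimal sketch using the standard version flags of each CLI:
# Verify tool versions against the prerequisites table
terraform version
az version
kubectl version --client
jq --version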
What’s in the stack? 📦
This Terraform stack delivers a production-ready vLLM serving environment on Azure AKS, supporting both CPU and GPU inference, with operational best practices embedded in the Terraform Azure Verified Resource Modules.
It’s designed for real-world production workloads with:
✅ Enterprise-grade infrastructure: Built on Azure Verified Modules (avm-res-containerservice-managedcluster)
✅ Flexible compute: Switch between CPU and GPU inference with a single flag.
✅ Operational excellence: Prometheus, Grafana, and optional Azure Monitor.
✅ Security-first: Managed identities, Azure Key-Vault integration, Cilium network policies.
✅ Scalability: Cluster-autoscaler, user node-pools, spot VM support.
✅ Secure endpoints: HTTPS-only model serving through NGINX Ingress + Azure LB + Let’s Encrypt certificates
Architecture Overview

Deployment layers – The stack provisions infrastructure in logical layers that adapt based on your hardware choice:
| Layer | Component | CPU Mode | GPU Mode |
|---|---|---|---|
| Infrastructure | VNet + AKS + Azure CNI Cilium | ✅ Always deployed | ✅ Always deployed |
| Add-ons | Azure Disk CSI, NGINX Ingress, Prometheus | ✅ Always deployed | ✅ Always deployed |
| vLLM Stack | Secrets + Helm chart | ✅ Deploy on CPU nodes | ✅ + GPU nodes + NVIDIA operator |
| Networking | Load Balancer + Ingress + TLS + Let’s Encrypt | ✅ NGINX + cert-manager | ✅ NGINX + cert-manager |
1. Networking Foundation
The stack creates a production-grade network topology:
- Custom `/16` VNet with 3 subnets (system, GPU, AppGateway)
- Azure CNI Overlay with Cilium network policy (high pod density)
- NGINX Ingress Controller with Azure Load Balancer
- Automated SSL/TLS certificates via cert-manager + Let’s Encrypt
- Network security with Cilium policies
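Once the stack is applied, you can eyeball this topology from the Azure CLI. A hedged sketch: `vllm-aks-rg` is the resource group used by the az commands in the troubleshooting section below, and `<vnet-name>` is a placeholder for the VNet the stack creates (check the Terraform outputs or the portal for the exact name):
# List the VNet created by the stack and its subnets
az network vnet list -g vllm-aks-rg -o table
az network vnet subnet list -g vllm-aks-rg --vnet-name <vnet-name> -o table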
2. AKS Cluster
A v1.32 control plane with the following managed node pools:
| Pool | Instance | Purpose |
|---|---|---|
| system (default) | Standard_D4s_v4 (4 vCPU / 16 GiB) | System services & CPU inference |
| cpu-pool | Standard_D4s_v4 (4 vCPU / 16 GiB) | CPU inference workloads |
| gpu-pool (optional) | Standard_NC4as_T4_v3 (1 × NVIDIA T4) | GPU inference |
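After the cluster comes up, you can confirm the pools from kubectl. A small sketch: `agentpool` is the standard AKS node-pool label, and the gpu-pool nodes only appear when `inference_hardware = "gpu"`:
# Show nodes with their AKS node pool label as an extra column
kubectl get nodes -L agentpool -o wide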
3. Essential Add-ons
Core AKS add-ons via Azure Verified Modules:
| Category | Add-on |
|---|---|
| CNI | Azure CNI Overlay with Cilium network policy |
| Storage | Azure Disk CSI (block) Azure Files CSI (shared, optional) |
| Ingress | NGINX Ingress Controller with Azure LB |
| SSL/TLS | cert-manager + Let’s Encrypt ClusterIssuer |
| Core | CoreDNS, kube-proxy, Metrics Server |
| Observability | kube-prometheus-stack, Grafana |
| GPU (optional) | NVIDIA GPU Operator |
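A quick way to verify the add-ons landed, assuming the default namespaces: cert-manager and kube-prometheus-stack are confirmed later in this guide, while ingress-nginx is the chart’s default namespace and may differ if you customized it:
# Core add-on workloads should all be Running
kubectl get pods -n ingress-nginx
kubectl get pods -n cert-manager
kubectl get pods -n kube-prometheus-stack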
4. vLLM Production Stack
The heart of the deployment: production-ready model serving with:
✅ Model: TinyLlama-1.1B (default, fully customizable)
✅ Load balancing: Round-robin router service across replicas
✅ Secrets: Hugging Face token stored as Kubernetes Secret
✅ Storage: Init container with persistent model caching at `/data/models/`
✅ Monitoring: Prometheus metrics endpoint for observability
✅ Default Helm charts: cpu-tinyllama-light-ingress-azure | gpu-tinyllama-light-ingress-azure
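To confirm the token secret and model cache exist after deployment, a hedged sketch (the `vllm` namespace matches the kubectl commands in the Quick Test section; exact secret and PVC names depend on the Helm chart values):
# Hugging Face token secret and the persistent model cache
kubectl get secrets -n vllm
kubectl get pvc -n vllm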
5. Hardware Flexibility: CPU vs GPU
You can choose to deploy the vLLM Production Stack on either CPU or GPU using the inference_hardware parameter:
# Deploy on CPU (default)
export TF_VAR_inference_hardware=cpu
# Or deploy on GPU
export TF_VAR_inference_hardware=gpu
🖥️ Azure GPU Instance Types Available
Available GPU VM sizes (T4 · V100 · A100)
| Azure VM Size | vCPUs | Memory (GiB) | GPUs | GPU Memory (GiB) | Best For |
|---|---|---|---|---|---|
| NVIDIA Tesla T4 | | | | | |
| Standard_NC4as_T4_v3 | 4 | 28 | 1 | 16 | Cost-effective inference |
| Standard_NC8as_T4_v3 | 8 | 56 | 1 | 16 | Medium inference |
| Standard_NC16as_T4_v3 | 16 | 110 | 1 | 16 | Large inference |
| NVIDIA Tesla V100 | | | | | |
| Standard_NC6s_v3 | 6 | 112 | 1 | 16 | Training & inference |
| Standard_NC12s_v3 | 12 | 224 | 2 | 32 | Multi-GPU training |
| Standard_NC24s_v3 | 24 | 448 | 4 | 64 | Large-scale training |
| NVIDIA A100 | | | | | |
| Standard_ND96asr_v4 | 96 | 900 | 8 | 320 | Large-scale AI training |
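Before picking a GPU size, it is worth checking that your subscription actually has quota for the corresponding VM family in your target region. A minimal sketch using the Azure CLI (the grep pattern simply narrows the output to the NC/ND families listed above):
# Show current usage vs. limits for GPU VM families in eastus
az vm list-usage --location eastus -o table | grep -Ei "NC|ND"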
Getting started
The deployment automatically provisions only the required infrastructure based on your hardware selection.
| Phase | Component | Action | Condition |
|---|---|---|---|
| 1. Infrastructure | VNet | Create VNet with subnets | Always |
| | AKS | Deploy v1.32 cluster + system & CPU node pools | Always |
| | CNI | Enable Azure CNI with Cilium | Always |
| | Add-ons | Install Azure Disk CSI, NGINX Ingress | Always |
| 2. SSL/TLS | cert-manager | Install cert-manager | Always |
| | ClusterIssuer | Create Let’s Encrypt production ClusterIssuer | Always |
| 3. vLLM Stack | HF secret | Create Hugging Face token secret | enable_vllm = true |
| | CPU Deployment | Deploy vLLM on CPU nodes | inference_hardware = "cpu" |
| | GPU Infrastructure | Provision GPU node pool | inference_hardware = "gpu" |
| | GPU Operator | Install NVIDIA GPU Operator | inference_hardware = "gpu" |
| | Helm chart | Deploy TinyLlama-1.1B with HTTPS ingress | enable_vllm = true |
| 4. Observability | Prometheus/Grafana | Deploy stack + vLLM dashboard | Always |
🔵 Deployment Steps
1️⃣Clone the repository
The vLLM AKS deployment is located under the vllm-production-stack-terraform/aks directory:
- Navigate to the production-stack-terraform repository and the Terraform AKS tutorial folder:
$ git clone https://github.com/CloudThrill/vllm-production-stack-terraform
$ cd vllm-production-stack-terraform/aks/
2️⃣ Set Up Environment Variables
Use an env-vars file to export your TF_VARs, or use terraform.tfvars. Replace placeholders with your values:
cp env-vars.template env-vars
vim env-vars
# Set HF token and customize deployment options
source env-vars
Usage examples
- Option 1: Through Environment Variables
# Copy and customise
cp env-vars.template env-vars
vi env-vars
################################################################################
# Azure Credentials & Location
################################################################################
export TF_VAR_subscription_id="" # ← your Azure subscription id
export TF_VAR_tenant_id="" # ← your Azure tenant id
export TF_VAR_location="eastus" # Azure region
################################################################################
# AKS Cluster Basics
################################################################################
export TF_VAR_cluster_name="vllm-aks"
export TF_VAR_cluster_version="1.32"
################################################################################
# LLM Inference
################################################################################
export TF_VAR_enable_vllm="true" # deploy vLLM stack
export TF_VAR_hf_token="" # ← Hugging-Face token
export TF_VAR_inference_hardware="gpu" # "cpu" or "gpu"
export TF_VAR_letsencrypt_email="your@email.com" # change me
################################################################################
# Node pools (same sizing defaults as AKS)
################################################################################
# CPU pool (always present)
export TF_VAR_cpu_node_min_size="1"
export TF_VAR_cpu_node_max_size="2"
# GPU pool (only if inference_hardware="gpu")
export TF_VAR_gpu_node_min_size="1"
export TF_VAR_gpu_node_max_size="1"
################################################################################
# (optional) Paths to Helm value templates
################################################################################
export TF_VAR_cpu_vllm_helm_config="modules/llm-stack/helm/cpu/cpu-tinyllama-light-ingress-azure.tpl"
export TF_VAR_gpu_vllm_helm_config="modules/llm-stack/helm/gpu/gpu-tinyllama-light-ingress-azure.tpl"
# load vars
source env-vars
- Option 2: Through Terraform Variables
# Copy and customize
$ cp terraform.tfvars.template terraform.tfvars
$ vim terraform.tfvars
- Load the Variables into Your Shell: before running Terraform, source the env-vars file:
$ source env-vars
3️⃣ Run Terraform deployment:
You can now safely run Terraform plan and apply. The plan below lists everything that will be created, including a local kubeconfig.
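Before kicking off the plan, you can optionally confirm the TF_VAR_* variables are exported without echoing any secret values. A small sketch that only prints variable names:
# List exported Terraform variables, hiding their values
env | grep -o 'TF_VAR_[A-Za-z_]*' | sort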
terraform init
terraform plan
terraform apply
Full Plan
Plan: 24 to add, 0 to change, 0 to destroy.
Changes to Outputs:
+ Stack_Info = "Built with ❤️ by @Cloudthrill"
+ aks_deployment_info = (sensitive value)
+ aks_kubelet_identity_id = (known after apply)
+ aks_name = "vllm-aks"
+ aks_oidc_issuer_url = "https://eastus.oic.prod-aks.azure.com/*"
+ aks_resource_id = (known after apply)
+ gpu_operator_status = {
+ deployed = "true"
+ name = "gpu-operator"
+ namespace = "gpu-operator"
+ version = "v25.3.1"
}
+ grafana_url = "https://grafana.14b9d539.sslip.io"
+ vllm_api_url = "https://vllm-api.14b9d539.sslip.io/v1"
After the deployment you should be able to interact with the cluster using kubectl:
export KUBECONFIG=$PWD/kubeconfig
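With the kubeconfig exported, a quick post-deploy check (a sketch; the `vllm` namespace matches the Quick Test commands further down, and GPU nodes only show up in GPU mode):
# Nodes from the system/CPU (and optional GPU) pools should be Ready
kubectl get nodes
# vLLM router and engine pods should be Running
kubectl get pods -n vllm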
4️⃣ Observability (Grafana Login)
You can access Grafana dashboards using the grafana_url output or port forwarding (e.g. http://localhost:3000).
# Get Grafana HTTPS URL (already printed by Terraform), e.g. https://grafana.xxxxx.nip.io
terraform output -raw grafana_url
# Or port forward
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-prometheus-stack
- Run the command below to fetch the admin password:
kubectl get secret -n kube-prometheus-stack kube-prometheus-stack-grafana -o jsonpath={.data.admin-password} | base64 -d
- Username: admin
- Password: retrieved via the kubectl command above
Automatic vLLM Dashboard
In this stack, the vLLM dashboard and service monitor are automatically configured for Grafana.

For benchmarking vLLM Production Stack performance, check the multi-round QA tutorial.
5️⃣ Destroying the Infrastructure 🚧
To delete everything, just run the command below (note: sometimes you need to run it twice, as the load balancer can be slow to delete):
terraform destroy -auto-approve
# Destroy complete! Resources: 24 destroyed.
🛠️ Configuration knobs
This stack provides extensive customization options to tailor your deployment:
| Variable | Default | Description |
|---|---|---|
| location | eastus | Azure region |
| cluster_version | 1.32 | Kubernetes version |
| inference_hardware | cpu | cpu or gpu |
| pod_cidr | 10.244.0.0/16 | Pod overlay network |
| enable_vllm | false | Deploy vLLM stack |
| hf_token | «secret» | HF model download token |
| enable_prometheus | true | Prometheus-Grafana stack |
| enable_cert_manager | true | cert-manager for TLS |
| letsencrypt_email | admin@example.com | Email for Let’s Encrypt |
📓 This is just a subset. For the full list of 20+ configurable variables, consult the configuration template: env-vars.template
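As an example of flipping one of these knobs without editing files, Terraform also accepts -var overrides at plan/apply time (a sketch; the variable names are the ones from the table above):
# Override region and hardware for a one-off GPU deployment
terraform apply -var="location=westeurope" -var="inference_hardware=gpu"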
🧪 Quick Test
1️⃣ Router Endpoint and API URL
1.1 Router Endpoint: for port forwarding, run the following command:
# Case 1 : Port forwarding
kubectl -n vllm port-forward svc/vllm-gpu-router-service 30080:80
export vllm_api_url=http://localhost:30080/v1
1.2 Extracting the Router URL via the NGINX ingress
The endpoint URL can be found in the vllm_api_url output :
# Case 2 : Extract from Terraform output
export vllm_api_url=$(terraform output -raw vllm_api_url)
# Example output:
# https://vllm.a1b2c3d4.nip.io/v1
2️⃣ List models
# check models
curl -s ${vllm_api_url}/models | jq .
3️⃣ Completion (applicable to both ingress and port-forwarding URLs)
curl ${vllm_api_url}/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/data/models/tinyllama",
"prompt": "Azure is a",
"max_tokens": 20,
"temperature": 0
}' | jq .choices[].text
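If you want to watch tokens arrive incrementally, the same OpenAI-compatible /completions endpoint also accepts "stream": true. A sketch; -N disables curl buffering so the server-sent events print as they come (no jq, since the chunks are not a single JSON document):
curl -N ${vllm_api_url}/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/tinyllama",
    "prompt": "Azure is a",
    "max_tokens": 20,
    "temperature": 0,
    "stream": true
  }'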
4️⃣ vLLM model service
kubectl -n vllm get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
vllm-gpu-router-service ClusterIP 10.96.174.35 <none> 80/TCP,9000/TCP 29m
vllm-gpu-tinyllama-gpu-engine-service ClusterIP 10.96.226.142 <none> 80/TCP,55555/TCP,9999/TCP 29m
🎯 Troubleshooting:
Certificate Not Issuing
Debug steps when the certificate status shows Pending or Ready: False
# Check certificate status
kubectl describe certificate -n vllm
# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager --tail=100
# Check HTTP-01 challenge
kubectl get challenge -n vllm
- Symptom
# Message:
Failed to create new order: acme: urn:ietf:params:acme:error:rateLimited: Error creating new order :: too many certificates already issued for: nip.io: see letsencrypt.org/docs/rate-limits
- Fix: switch nip.io to sslip.io in the ingress host of the vLLM Helm charts cpu-tinyllama-light-ingress-azure | gpu-tinyllama-light-ingress-azure
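To see which hostname the ingress is actually requesting a certificate for, and whether cert-manager has created the ACME order, a hedged sketch using standard cert-manager resources:
# Hostnames served by the vLLM ingress
kubectl get ingress -n vllm
# cert-manager objects backing the certificate
kubectl get certificaterequests,orders -n vllm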
Useful Azure CLI Debugging Commands
# Check AKS cluster status
az aks show -g vllm-aks-rg -n vllm-aks
# Check node pools
az aks nodepool list -g vllm-aks-rg --cluster-name vllm-aks -o table
Conclusion
After exploring EKS, this deployment gives you a solid foundation for production LLM serving on Azure and an ideal starting point to extend further.
Next Steps 🚀
- In the next post, we’re taking this stack to Google GKE. Stay Tuned!
📚 Additional Resources
- vLLM Documentation
- vLLM Production stack documentation
- Azure AKS Best Practices
- Azure Verified Modules
- Cilium Documentation
- terraform-azurerm-avm-res-containerservice-managedcluster

Run AI Your Way — In Your Cloud
Want full control over your AI backend? The CloudThrill VLLM Private Inference POC is still open — but not forever.
📢 Secure your spot (only a few left), 𝗔𝗽𝗽𝗹𝘆 𝗻𝗼𝘄!
Run AI assistants, RAG, or internal models on an AI backend 𝗽𝗿𝗶𝘃𝗮𝘁𝗲𝗹𝘆 𝗶𝗻 𝘆𝗼𝘂𝗿 𝗰𝗹𝗼𝘂𝗱 –
✅ No external APIs
✅ No vendor lock-in
✅ Total data control
𝗬𝗼𝘂𝗿 𝗶𝗻𝗳𝗿𝗮. 𝗬𝗼𝘂𝗿 𝗺𝗼𝗱𝗲𝗹𝘀. 𝗬𝗼𝘂𝗿 𝗿𝘂𝗹𝗲𝘀…
🙋🏻♀️ If you like this content, please subscribe to our blog newsletter ❤️.