vLLM Production Stack on Azure AKS with Terraform🧑🏼‍🚀

Intro

The vLLM Production Stack is designed to work across any cloud provider with Kubernetes. After covering AWS EKS, today we’re deploying vLLM production-stack on Azure AKS with the same Terraform approach.

This guide shows you how to deploy the same production-ready LLM serving environment on Azure, with Azure-specific optimizations. We’ll cover network architecture, certificate automation (using Let’s Encrypt), GPU provisioning, and observability for both CPU and GPU inference, all using Infrastructure as Code.

💡You can find our code in the CloudThrill repo ➡️ production-stack-terraform.

This is part of CloudThrill‘s ongoing contribution to the vLLM Production Stack project, extending Terraform deployment patterns across AWS, Azure, GCP, Oracle OCI, and Nebius.

📂 Project Structure

./
├── main.tf
├── network.tf
├── provider.tf
├── variables.tf
├── output.tf
├── cluster-tools.tf
├── datasources.tf
├── vllm-production-stack.tf
├── env-vars.template
├── terraform.tfvars.template
├── modules/
│   ├── avm-res-cs-managedcluster/      # Azure Verified Module for AKS
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── terraform.tf
│   │   ├── locals.tf
│   │   ├── main.diagnostic.tf
│   │   ├── main.nodepool.tf
│   │   ├── main.privateendpoint.tf
│   │   ├── main.telemetry.tf
│   │   └── modules/
│   │       └── nodepool/
│   ├── az-networking/                  # Azure Networking Module
│   │   └── vnet/
│   │       ├── main.tf
│   │       ├── variables.tf
│   │       ├── outputs.tf
│   │       ├── terraform.tf
│   │       ├── data.tf
│   │       ├── locals.tf
│   │       ├── main.interfaces.tf
│   │       ├── main.peering.tf
│   │       ├── main.subnet.tf
│   │       ├── main.telemetry.tf
│   │       ├── main.virtual.network.tf
│   │       └── modules/
│   │           ├── peering/
│   │           └── subnet/
│   └── llm-stack/                      # vLLM Helm Charts
│       └── helm/
│           ├── cpu/
│           │   └── cpu-tinyllama-light-ingress-azure.tpl
│           └── gpu/
│               ├── gpu-operator-values.yaml
│               └── gpu-tinyllama-light-ingress-azure.tpl
├── config/
│   ├── helm/
│   │   └── kube-prome-stack.yaml
│   ├── manifests/
│   │   └── letsencrypt-issuer.yaml
│   ├── kubeconfig.tpl
│   └── vllm-dashboard.json
└── README.md                            # ← you are here

🧰Prerequisites

Before you begin, ensure you have the following:

| Tool | Version | Notes |
|---|---|---|
| Terraform | ≥ 1.9 | Tested on 1.9+ |
| Azure CLI | ≥ 2.50 | For authentication |
| kubectl | ≥ 1.32 | ±1 of control plane |
| jq | optional | JSON helper |

Follow the steps below to install the tools (expand)👇🏼
# Install tools
sudo apt update && sudo apt install -y jq curl unzip gpg

  # Terraform
wget -qO- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install -y terraform

  # Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash

# kubectl
curl -sLO "https://dl.k8s.io/release/$(curl -Ls https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install kubectl /usr/local/bin/ && rm kubectl
  • Configure Azure credentials
# Login to Azure
az login

# Set subscription (if you have multiple)
az account set --subscription "YOUR_SUBSCRIPTION_ID"

# Verify
az account show

What’s in the stack?📦

This Terraform stack delivers a production-ready vLLM serving environment on Azure AKS, supporting both CPU and GPU inference, with operational best practices embedded through the Terraform Azure Verified Modules.

It’s designed for real-world production workloads with:
✅ Enterprise-grade infrastructure: Built on Azure Verified Modules (avm-res-containerservice-managedcluster)
✅ Flexible compute: Switch between CPU and GPU inference with a single flag.
✅ Operational excellence: Prometheus, Grafana, and optional Azure Monitor.
✅ Security-first: Managed identities, Azure Key Vault integration, Cilium network policies.
✅ Scalability: Cluster-autoscaler, user node-pools, spot VM support.
✅ Secure endpoints: HTTPS-only model serving through NGINX Ingress + Azure LB + Let’s Encrypt certificates

Architecture Overview

Deployment layers – The stack provisions infrastructure in logical layers that adapt based on your hardware choice:

| Layer | Component | CPU Mode | GPU Mode |
|---|---|---|---|
| Infrastructure | VNet + AKS + Azure CNI Cilium | ✅ Always deployed | ✅ Always deployed |
| Add-ons | Azure Disk CSI, NGINX Ingress, Prometheus | ✅ Always deployed | ✅ Always deployed |
| vLLM Stack | Secrets + Helm chart | ✅ Deploy on CPU nodes | ✅ + GPU nodes + NVIDIA operator |
| Networking | Load Balancer + Ingress + TLS + Let’s Encrypt | ✅ NGINX + cert-manager | ✅ NGINX + cert-manager |

1. Networking Foundation

The stack creates a production-grade network topology:

  • Custom `/16` VNet with 3 subnets (system, GPU, AppGateway)
  • Azure CNI Overlay with Cilium network policy (high pod density)
  • NGINX Ingress Controller with Azure Load Balancer
  • Automated SSL/TLS certificates via cert-manager + Let’s Encrypt
  • Network security with Cilium policies
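After terraform apply, you can sanity-check the topology with the Azure CLI. A minimal sketch, assuming the resource group created by the stack is vllm-aks-rg (the name used later in this guide); substitute your own names:

# List the VNet created by the stack
az network vnet list -g vllm-aks-rg -o table

# Inspect its subnets (replace <vnet-name> with the name printed above)
az network vnet subnet list -g vllm-aks-rg --vnet-name <vnet-name> -o table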

2. AKS Cluster

A v1.32 control plane with the following managed node pools:

| Pool | Instance | Purpose |
|---|---|---|
| system (default) | Standard_D4s_v4 (4 vCPU / 16 GiB) | System services & CPU inference |
| cpu-pool | Standard_D4s_v4 (4 vCPU / 16 GiB) | CPU inference workloads |
| gpu-pool (optional) | Standard_NC4as_T4_v3 (1 × NVIDIA T4) | GPU inference |
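Once the cluster is up, you can check which pool each node belongs to through the AKS agentpool label. A quick verification sketch:

# Show nodes with their node-pool membership and VM size
kubectl get nodes -L agentpool,node.kubernetes.io/instance-type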

3. Essential Add-ons

Core AKS add-ons via Azure Verified Modules:

| Category | Add-on |
|---|---|
| CNI | Azure CNI Overlay with Cilium network policy |
| Storage | Azure Disk CSI (block), Azure Files CSI (shared, optional) |
| Ingress | NGINX Ingress Controller with Azure LB |
| SSL/TLS | cert-manager + Let’s Encrypt ClusterIssuer |
| Core | CoreDNS, kube-proxy, Metrics Server |
| Observability | kube-prometheus-stack, Grafana |
| GPU (optional) | NVIDIA GPU Operator |
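To confirm the add-ons landed, you can list their pods. The namespaces below are the usual chart defaults assumed by this sketch; adjust if you customized them:

# Ingress controller, cert-manager and monitoring pods
kubectl get pods -n ingress-nginx
kubectl get pods -n cert-manager
kubectl get pods -n kube-prometheus-stack

# Storage classes backed by the Azure CSI drivers
kubectl get storageclass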

4. vLLM Production Stack

The heart of the deployment is a production-ready model serving stack:

✅ Model: TinyLlama-1.1B (default, fully customizable)
✅ Load balancing: Round-robin router service across replicas
✅ Secrets: Hugging Face token stored as a Kubernetes Secret
✅ Storage: Init container with persistent model caching at `/data/models/`
✅ Monitoring: Prometheus metrics endpoint for observability
✅ Default Helm charts: cpu-tinyllama-light-ingress-azure | gpu-tinyllama-light-ingress-azure
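Once deployed, the stack lives in the vllm namespace (the same one used in the Quick Test section below). A quick look at its pieces:

# Router and engine pods, HF token secret, and the model-cache volume
kubectl -n vllm get pods
kubectl -n vllm get secrets
kubectl -n vllm get pvc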

5. Hardware Flexibility: CPU vs GPU

You can choose to deploy the vLLM production stack on either CPU or GPU using the inference_hardware parameter:

# Deploy on CPU (default)
export TF_VAR_inference_hardware=cpu
# Or deploy on GPU
export TF_VAR_inference_hardware=gpu

🖥️ Azure GPU Instance Types Available

Available GPU instances (T4 · V100 · A100)
| Azure VM Size | vCPUs | Memory (GiB) | GPUs | GPU Memory (GiB) | Best For |
|---|---|---|---|---|---|
| NVIDIA Tesla T4 | | | | | |
| Standard_NC4as_T4_v3 | 4 | 28 | 1 | 16 | Cost-effective inference |
| Standard_NC8as_T4_v3 | 8 | 56 | 1 | 16 | Medium inference |
| Standard_NC16as_T4_v3 | 16 | 110 | 1 | 16 | Large inference |
| NVIDIA Tesla V100 | | | | | |
| Standard_NC6s_v3 | 6 | 112 | 1 | 16 | Training & inference |
| Standard_NC12s_v3 | 12 | 224 | 2 | 32 | Multi-GPU training |
| Standard_NC24s_v3 | 24 | 448 | 4 | 64 | Large-scale training |
| NVIDIA A100 | | | | | |
| Standard_ND96asr_v4 | 96 | 900 | 8 | 320 | Large-scale AI training |

Note: Check the full list of Azure GPU instance offerings here
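GPU sizes are quota-gated per region, so it is worth verifying availability and your vCPU quota before switching to GPU mode. A hedged example (the grep patterns are illustrative; quota family names vary per subscription):

# GPU-capable sizes available in your region
az vm list-sizes --location eastus -o table | grep -E "Standard_NC|Standard_ND"

# Regional vCPU quota for the T4 family
az vm list-usage --location eastus -o table | grep -i "T4"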

Getting started

The deployment automatically provisions only the required infrastructure based on your hardware selection.

| Phase | Component | Action | Condition |
|---|---|---|---|
| 1. Infrastructure | VNet | Create VNet with subnets | Always |
| | AKS | Deploy v1.32 cluster + system & CPU node pools | Always |
| | CNI | Enable Azure CNI with Cilium | Always |
| | Add-ons | Install Azure Disk CSI, NGINX Ingress | Always |
| 2. SSL/TLS | cert-manager | Install cert-manager | Always |
| | ClusterIssuer | Create Let’s Encrypt production ClusterIssuer | Always |
| 3. vLLM Stack | HF secret | Create Hugging Face token secret | enable_vllm = true |
| | CPU Deployment | Deploy vLLM on CPU nodes | inference_hardware = "cpu" |
| | GPU Infrastructure | Provision GPU node pool | inference_hardware = "gpu" |
| | GPU Operator | Install NVIDIA GPU Operator | inference_hardware = "gpu" |
| | Helm chart | Deploy TinyLlama-1.1B with HTTPS ingress | enable_vllm = true |
| 4. Observability | Prometheus/Grafana | Deploy stack + vLLM dashboard | Always |

🔵 Deployment Steps

1️⃣Clone the repository

The vLLM AKS deployment is located under the vllm-production-stack-terraform/aks directory:

  • Clone the repository and navigate to the AKS tutorial folder:
$ git clone https://github.com/CloudThrill/vllm-production-stack-terraform
$ cd vllm-production-stack-terraform/aks/

2️⃣ Set Up Environment Variables

Use an env-vars file to export your TF_VAR_* variables, or use terraform.tfvars. Replace placeholders with your values:

cp env-vars.template env-vars
vim env-vars    # Set HF token and customize deployment options
source env-vars

Usage examples

  • Option 1: Through Environment Variables
# Copy and customise
cp env-vars.template env-vars
vi env-vars

################################################################################
# Azure Credentials & Location
################################################################################
export TF_VAR_subscription_id=""        # ← your Azure subscription id
export TF_VAR_tenant_id=""              # ← your Azure tenant id
export TF_VAR_location="eastus"         # Azure region

################################################################################
# AKS Cluster Basics
################################################################################
export TF_VAR_cluster_name="vllm-aks"
export TF_VAR_cluster_version="1.32"

################################################################################
# LLM Inference
################################################################################
export TF_VAR_enable_vllm="true"        # deploy vLLM stack
export TF_VAR_hf_token=""               # ← Hugging-Face token
export TF_VAR_inference_hardware="gpu"  # "cpu" or "gpu"
export TF_VAR_letsencrypt_email="your@email.com"  # change me

################################################################################
# Node pools (same sizing defaults as AKS)
################################################################################
# CPU pool (always present)
export TF_VAR_cpu_node_min_size="1"
export TF_VAR_cpu_node_max_size="2" 

# GPU pool (only if inference_hardware="gpu")
export TF_VAR_gpu_node_min_size="1"
export TF_VAR_gpu_node_max_size="1" 

################################################################################
# (optional) Paths to Helm value templates
################################################################################
export TF_VAR_cpu_vllm_helm_config="modules/llm-stack/helm/cpu/cpu-tinyllama-light-ingress-azure.tpl"
export TF_VAR_gpu_vllm_helm_config="modules/llm-stack/helm/gpu/gpu-tinyllama-light-ingress-azure.tpl"

# load vars
source env-vars
  • Option 2: Through Terraform Variables (a hypothetical tfvars sketch follows this list)
 # Copy and customize
 $ cp terraform.tfvars.template terraform.tfvars
 $ vim terraform.tfvars
  • Load the variables into your shell: before running Terraform, source the env-vars file:
$ source env-vars
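For Option 2, a hypothetical terraform.tfvars sketch could look like the following; the variable names mirror the TF_VAR_ exports above and all values are placeholders:

# Write a minimal terraform.tfvars (illustrative values, adjust to your environment)
cat > terraform.tfvars <<'EOF'
subscription_id    = "00000000-0000-0000-0000-000000000000"
tenant_id          = "00000000-0000-0000-0000-000000000000"
location           = "eastus"
cluster_name       = "vllm-aks"
cluster_version    = "1.32"
enable_vllm        = "true"
inference_hardware = "cpu"            # "cpu" or "gpu"
hf_token           = "hf_xxxxxxxx"    # Hugging Face token
letsencrypt_email  = "your@email.com"
EOF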

3️⃣ Run Terraform deployment:

You can now safely run terraform plan and apply. The deployment creates around 24 resources in total, including a local kubeconfig file.

terraform init
terraform plan
terraform apply

Full Plan
Plan: 24 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + Stack_Info                               = "Built with ❤️ by @Cloudthrill"
  + aks_deployment_info                      = (sensitive value)
  + aks_kubelet_identity_id                  = (known after apply)
  + aks_name                                 = "vllm-aks"
  + aks_oidc_issuer_url                      = "https://eastus.oic.prod-aks.azure.com/*"
  + aks_resource_id                          = (known after apply)
  + gpu_operator_status                      = {
      + deployed  = "true"
      + name      = "gpu-operator"
      + namespace = "gpu-operator"
      + version   = "v25.3.1"
    }
  + grafana_url                              = "https://grafana.14b9d539.sslip.io"
  + vllm_api_url                             = "https://vllm-api.14b9d539.sslip.io/v1"

After the deployment you should be able to interact with the cluster using kubectl:

export KUBECONFIG=$PWD/kubeconfig
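A quick sanity check that the kubeconfig works and the node pools registered:

kubectl get nodes -o wide
kubectl get pods -A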

4️⃣ Observability (Grafana Login)

You can access the Grafana dashboards using the grafana_url output or port forwarding (e.g. http://localhost:3000).

# Get Grafana HTTPS URL (already printed by Terraform), e.g. https://grafana.xxxxx.sslip.io
terraform output -raw grafana_url 
# Or port forward
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-prometheus-stack
  • Run the command below to fetch the admin password
kubectl get secret -n kube-prometheus-stack kube-prometheus-stack-grafana -o jsonpath={.data.admin-password} | base64 -d
  • Username: admin
  • Password: retrieved via the kubectl command above

Automatic vLLM Dashboard

In this stack, the vLLM dashboard and service monitor are automatically configured for Grafana.

For benchmarking vLLM Production Stack performance, check the multi-round QA tutorial.

5️⃣ Destroying the Infrastructure 🚧

To delete everything, just run the command below (note: you may need to run it twice, as the Azure Load Balancer can be slow to release).

terraform destroy -auto-approve
# Destroy complete! Resources: 24 destroyed.
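If the destroy times out on the load balancer, re-running it usually completes the cleanup. You can also check for leftovers in the AKS-managed node resource group, whose default name follows the MC_<resource-group>_<cluster>_<region> convention (shown here with this stack’s defaults):

# Re-run if the first pass hangs on the Azure Load Balancer
terraform destroy -auto-approve

# Check for leftover load balancers or public IPs
az network lb list -g MC_vllm-aks-rg_vllm-aks_eastus -o table
az network public-ip list -g MC_vllm-aks-rg_vllm-aks_eastus -o table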



🛠️Configuration knobs

This stack provides extensive customization options to tailor your deployment:

| Variable | Default | Description |
|---|---|---|
| location | eastus | Azure region |
| cluster_version | 1.32 | Kubernetes version |
| inference_hardware | cpu | cpu or gpu |
| pod_cidr | 10.244.0.0/16 | Pod overlay network |
| enable_vllm | false | Deploy vLLM stack |
| hf_token | «secret» | HF model download token |
| enable_prometheus | true | Prometheus-Grafana stack |
| enable_cert_manager | true | cert-manager for TLS |
| letsencrypt_email | admin@example.com | Email for Let’s Encrypt |

📓This is just a subset. For the full list of 20+ configurable variables, consult the configuration template: env-vars.template
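As an example, overriding a few of these knobs without touching any file (values are illustrative):

export TF_VAR_location="westeurope"
export TF_VAR_inference_hardware="gpu"
export TF_VAR_enable_prometheus="true"
terraform apply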

🧪 Quick Test

1️⃣ Router Endpoint and API URL

1.1 Router endpoint through port forwarding: run the following command:

# Case 1 : Port forwarding
kubectl -n vllm port-forward svc/vllm-gpu-router-service 30080:80
export vllm_api_url=http://localhost:30080/v1

1.2 Extracting the router URL via the NGINX ingress
The endpoint URL can be found in the vllm_api_url output:

# Case 2 : Extract from Terraform output 
export vllm_api_url=$(terraform output -raw vllm_api_url)

# Example output:
# https://vllm.a1b2c3d4.nip.io/v1


2️⃣ List models

# check models
curl -s ${vllm_api_url}/models | jq .


3️⃣ Completion (applicable to both ingress and port-forwarding URLs)

curl ${vllm_api_url}/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/tinyllama",
    "prompt": "Azure is a",
    "max_tokens": 20,
    "temperature": 0
  }' | jq .choices[].text
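Since the router exposes an OpenAI-compatible API, a chat-style request against the same base URL should also work. A sketch, assuming the /chat/completions route is served for this model:

curl ${vllm_api_url}/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/tinyllama",
    "messages": [{"role": "user", "content": "Name one Azure service."}],
    "max_tokens": 30,
    "temperature": 0
  }' | jq .choices[].message.content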


4️⃣ vLLM model service

kubectl -n vllm get svc

NAME                                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                     AGE
vllm-gpu-router-service                 ClusterIP   10.96.174.35    <none>        80/TCP,9000/TCP             29m
vllm-gpu-tinyllama-gpu-engine-service   ClusterIP   10.96.226.142   <none>        80/TCP,55555/TCP,9999/TCP   29m

🎯Troubleshooting:

Certificate Not Issuing

Debug: STATUS: Pending or False

# Check certificate status
kubectl describe certificate -n vllm

# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager --tail=100

# Check HTTP-01 challenge
kubectl get challenge -n vllm

  • Symptom:
# Message:
Failed to create new order: acme: urn:ietf:params:acme:error:rateLimited: Error creating new order :: too many certificates already issued for: nip.io: see letsencrypt.org/docs/rate-limits
  • Fix: change nip.io to sslip.io in the ingress host of the vLLM Helm charts:
cpu-tinyllama-light-ingress-azure | gpu-tinyllama-light-ingress-azure

Useful Azure CLI Debugging Commands

# Check AKS cluster status
az aks show -g vllm-aks-rg -n vllm-aks

# Check node pools
az aks nodepool list -g vllm-aks-rg --cluster-name vllm-aks -o table
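For GPU deployments, it also helps to confirm that the NVIDIA GPU Operator is healthy and that the GPU is advertised to the scheduler (the gpu-operator namespace matches the Terraform output shown earlier):

# GPU Operator pods
kubectl get pods -n gpu-operator

# Nodes exposing the nvidia.com/gpu resource
kubectl describe nodes | grep -i "nvidia.com/gpu"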

Conclusion

After exploring AWS EKS, this deployment gives you a solid foundation for production LLM serving on Azure AKS, and an ideal starting point to extend further.

Next Steps 🚀

  • In the next post, we’re taking this stack to Google GKE. Stay Tuned!

📚 Additional Resources


Run AI Your Way — In Your Cloud


Run AI assistants, RAG, or internal models on an AI backend 𝗽𝗿𝗶𝘃𝗮𝘁𝗲𝗹𝘆 𝗶𝗻 𝘆𝗼𝘂𝗿 𝗰𝗹𝗼𝘂𝗱 –
✅ No external APIs
✅ No vendor lock-in
✅ Total data control

𝗬𝗼𝘂𝗿 𝗶𝗻𝗳𝗿𝗮. 𝗬𝗼𝘂𝗿 𝗺𝗼𝗱𝗲𝗹𝘀. 𝗬𝗼𝘂𝗿 𝗿𝘂𝗹𝗲𝘀…

🙋🏻‍♀️If you like this content, please subscribe to our blog newsletter ❤️.

👋🏻Want to chat about your challenges?
We’d love to hear from you! 
