vLLM Production Stack on Azure AKS with Terraform🧑🏼‍🚀

Intro

The vLLM Production Stack is designed to work across any cloud provider with Kubernetes. After covering AWS EKS, today we’re deploying vLLM production-stack on Azure AKS with the same Terraform approach.

This guide shows you how to deploy the same production-ready LLM serving environment on Azure, with Azure-specific optimizations. We’ll cover network architecture, certificate automation (using Let’s Encrypt), GPU provisioning, and observability for both CPU and GPU inference, all using Infrastructure as Code.

💡You can find our code in the CloudThrill repo ➡️ production-stack-terraform.

This is part of CloudThrill‘s ongoing contribution to the vLLM Production Stack project, extending Terraform deployment patterns across AWS, Azure, GCP, Oracle OCI, and Nebius.

📂 Project Structure

./
├── main.tf
├── network.tf
├── provider.tf
├── variables.tf
├── output.tf
├── cluster-tools.tf
├── datasources.tf
├── vllm-production-stack.tf
├── env-vars.template
├── terraform.tfvars.template
├── modules/
│   ├── avm-res-cs-managedcluster/      # Azure Verified Module for AKS
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── terraform.tf
│   │   ├── locals.tf
│   │   ├── main.diagnostic.tf
│   │   ├── main.nodepool.tf
│   │   ├── main.privateendpoint.tf
│   │   ├── main.telemetry.tf
│   │   └── modules/
│   │       └── nodepool/
│   ├── az-networking/                  # Azure Networking Module
│   │   └── vnet/
│   │       ├── main.tf
│   │       ├── variables.tf
│   │       ├── outputs.tf
│   │       ├── terraform.tf
│   │       ├── data.tf
│   │       ├── locals.tf
│   │       ├── main.interfaces.tf
│   │       ├── main.peering.tf
│   │       ├── main.subnet.tf
│   │       ├── main.telemetry.tf
│   │       ├── main.virtual.network.tf
│   │       └── modules/
│   │           ├── peering/
│   │           └── subnet/
│   └── llm-stack/                      # vLLM Helm Charts
│       └── helm/
│           ├── cpu/
│           │   └── cpu-tinyllama-light-ingress-azure.tpl
│           └── gpu/
│               ├── gpu-operator-values.yaml
│               └── gpu-tinyllama-light-ingress-azure.tpl
├── config/
│   ├── helm/
│   │   └── kube-prome-stack.yaml
│   ├── manifests/
│   │   └── letsencrypt-issuer.yaml
│   ├── kubeconfig.tpl
│   └── vllm-dashboard.json
└── README.md                            # ← you are here

🧰Prerequisites

Before you begin, ensure you have the following:

| Tool | Version | Notes |
|---|---|---|
| Terraform | ≥ 1.9 | Tested on 1.9+ |
| Azure CLI | ≥ 2.50 | For authentication |
| kubectl | ≥ 1.32 | ±1 of control plane |
| jq | optional | JSON helper |

Follow the steps below to install the tools (expand)👇🏼
# Install tools
sudo apt update && sudo apt install -y jq curl unzip gpg

  # Terraform
wget -qO- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install -y terraform

  # Azure CLI
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash

# kubectl
curl -sLO "https://dl.k8s.io/release/$(curl -Ls https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install kubectl /usr/local/bin/ && rm kubectl
  • Configure Azure credentials
# Login to Azure
az login

# Set subscription (if you have multiple)
az account set --subscription "YOUR_SUBSCRIPTION_ID"

# Verify
az account show

What’s in the stack?📦

This Terraform stack delivers a production-ready vLLM serving environment on Azure AKS, supporting both CPU and GPU inference, with operational best practices embedded through the Terraform Azure Verified Modules.

It’s designed for real-world production workloads with:
✅ Enterprise-grade infrastructure: Built on Azure Verified Modules (avm-res-containerservice-managedcluster)
✅ Flexible compute: Switch between CPU and GPU inference with a single flag.
✅ Operational excellence: Prometheus, Grafana, and optional Azure Monitor.
✅ Security-first: Managed identities, Azure Key Vault integration, Cilium network policies.
✅ Scalability: Cluster-autoscaler, user node-pools, spot VM support.
✅ Secure endpoints: HTTPS-only model serving through NGINX Ingress + Azure LB + Let’s Encrypt certificates

Architecture Overview

Deployment layers – The stack provisions infrastructure in logical layers that adapt based on your hardware choice:

| Layer | Component | CPU Mode | GPU Mode |
|---|---|---|---|
| Infrastructure | VNet + AKS + Azure CNI Cilium | ✅ Always deployed | ✅ Always deployed |
| Add-ons | Azure Disk CSI, NGINX Ingress, Prometheus | ✅ Always deployed | ✅ Always deployed |
| vLLM Stack | Secrets + Helm chart | ✅ Deploy on CPU nodes | ✅ + GPU nodes + NVIDIA operator |
| Networking | Load Balancer + Ingress + TLS + Let’s Encrypt | ✅ NGINX + cert-manager | ✅ NGINX + cert-manager |

1. Networking Foundation

The stack creates a production-grade network topology:

  • Custom `/16` VNet with 3 subnets (system, GPU, AppGateway)
  • Azure CNI Overlay with Cilium network policy (high pod density)
  • NGINX Ingress Controller with Azure Load Balancer
  • Automated SSL/TLS certificates via cert-manager + Let’s Encrypt
  • Network security with Cilium policies
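After terraform apply, you can sanity-check the topology with the Azure CLI. A minimal sketch, assuming the resource group created by the stack is vllm-aks-rg (the name used later in this guide); substitute your own names:

# List the VNet created by the stack
az network vnet list -g vllm-aks-rg -o table

# Inspect its subnets (replace <vnet-name> with the name printed above)
az network vnet subnet list -g vllm-aks-rg --vnet-name <vnet-name> -o table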

2. AKS Cluster

A v1.32 control plane with the following managed node pools:

| Pool | Instance | Purpose |
|---|---|---|
| system (default) | Standard_D4s_v4 (4 vCPU / 16 GiB) | System services & CPU inference |
| cpu-pool | Standard_D4s_v4 (4 vCPU / 16 GiB) | CPU inference workloads |
| gpu-pool (optional) | Standard_NC4as_T4_v3 (1 × NVIDIA T4) | GPU inference |
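Once the cluster is up, you can check which pool each node belongs to through the AKS agentpool label. A quick verification sketch:

# Show nodes with their node-pool membership and VM size
kubectl get nodes -L agentpool,node.kubernetes.io/instance-type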

3. Essential Add-ons

Core AKS add-ons via Azure Verified Modules:

| Category | Add-on |
|---|---|
| CNI | Azure CNI Overlay with Cilium network policy |
| Storage | Azure Disk CSI (block), Azure Files CSI (shared, optional) |
| Ingress | NGINX Ingress Controller with Azure LB |
| SSL/TLS | cert-manager + Let’s Encrypt ClusterIssuer |
| Core | CoreDNS, kube-proxy, Metrics Server |
| Observability | kube-prometheus-stack, Grafana |
| GPU (optional) | NVIDIA GPU Operator |
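To confirm the add-ons landed, you can list their pods. The namespaces below are the usual chart defaults assumed by this sketch; adjust if you customized them:

# Ingress controller, cert-manager and monitoring pods
kubectl get pods -n ingress-nginx
kubectl get pods -n cert-manager
kubectl get pods -n kube-prometheus-stack

# Storage classes backed by the Azure CSI drivers
kubectl get storageclass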

4. vLLM Production Stack

The heart of the deployment is a production-ready model serving stack:

✅ Model: TinyLlama-1.1B (default, fully customizable)
✅ Load balancing: Round-robin router service across replicas
✅ Secrets: Hugging Face token stored as a Kubernetes Secret
✅ Storage: Init container with persistent model caching at `/data/models/`
✅ Monitoring: Prometheus metrics endpoint for observability
✅ Default Helm charts: cpu-tinyllama-light-ingress-azure | gpu-tinyllama-light-ingress-azure
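Once deployed, the stack lives in the vllm namespace (the same one used in the Quick Test section below). A quick look at its pieces:

# Router and engine pods, HF token secret, and the model-cache volume
kubectl -n vllm get pods
kubectl -n vllm get secrets
kubectl -n vllm get pvc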

5. Hardware Flexibility: CPU vs GPU

You can choose to deploy the vLLM production stack on either CPU or GPU using the inference_hardware parameter:

# Deploy on CPU (default)
export TF_VAR_inference_hardware=cpu
# Or deploy on GPU
export TF_VAR_inference_hardware=gpu

🖥️ Azure GPU Instance Types Available

Available GPU instances (T4 · V100 · A100)
| Azure VM Size | vCPUs | Memory (GiB) | GPUs | GPU Memory (GiB) | Best For |
|---|---|---|---|---|---|
| NVIDIA Tesla T4 | | | | | |
| Standard_NC4as_T4_v3 | 4 | 28 | 1 | 16 | Cost-effective inference |
| Standard_NC8as_T4_v3 | 8 | 56 | 1 | 16 | Medium inference |
| Standard_NC16as_T4_v3 | 16 | 110 | 1 | 16 | Large inference |
| NVIDIA Tesla V100 | | | | | |
| Standard_NC6s_v3 | 6 | 112 | 1 | 16 | Training & inference |
| Standard_NC12s_v3 | 12 | 224 | 2 | 32 | Multi-GPU training |
| Standard_NC24s_v3 | 24 | 448 | 4 | 64 | Large-scale training |
| NVIDIA A100 | | | | | |
| Standard_ND96asr_v4 | 96 | 900 | 8 | 320 | Large-scale AI training |

Note: Check the full list of Azure GPU instance offerings here
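GPU sizes are quota-gated per region, so it is worth verifying availability and your vCPU quota before switching to GPU mode. A hedged example (the grep patterns are illustrative; quota family names vary per subscription):

# GPU-capable sizes available in your region
az vm list-sizes --location eastus -o table | grep -E "Standard_NC|Standard_ND"

# Regional vCPU quota for the T4 family
az vm list-usage --location eastus -o table | grep -i "T4"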

Getting started

The deployment automatically provisions only the required infrastructure based on your hardware selection.

| Phase | Component | Action | Condition |
|---|---|---|---|
| 1. Infrastructure | VNet | Create VNet with subnets | Always |
| | AKS | Deploy v1.32 cluster + system & CPU node pools | Always |
| | CNI | Enable Azure CNI with Cilium | Always |
| | Add-ons | Install Azure Disk CSI, NGINX Ingress | Always |
| 2. SSL/TLS | cert-manager | Install cert-manager | Always |
| | ClusterIssuer | Create Let’s Encrypt production ClusterIssuer | Always |
| 3. vLLM Stack | HF secret | Create Hugging Face token secret | enable_vllm = true |
| | CPU Deployment | Deploy vLLM on CPU nodes | inference_hardware = "cpu" |
| | GPU Infrastructure | Provision GPU node pool | inference_hardware = "gpu" |
| | GPU Operator | Install NVIDIA GPU Operator | inference_hardware = "gpu" |
| | Helm chart | Deploy TinyLlama-1.1B with HTTPS ingress | enable_vllm = true |
| 4. Observability | Prometheus/Grafana | Deploy stack + vLLM dashboard | Always |

🔵 Deployment Steps

1️⃣Clone the repository

The vLLM AKS deployment is located under the vllm-production-stack-terraform/aks directory:

  • Clone the repository and navigate to the AKS tutorial folder:
$ git clone https://github.com/CloudThrill/vllm-production-stack-terraform
$ cd vllm-production-stack-terraform/aks/

2️⃣ Set Up Environment Variables

Use an env-vars file to export your TF_VAR_* variables, or use terraform.tfvars. Replace placeholders with your values:

cp env-vars.template env-vars
vim env-vars    # Set HF token and customize deployment options
source env-vars

Usage examples

  • Option 1: Through Environment Variables
# Copy and customise
cp env-vars.template env-vars
vi env-vars

################################################################################
# Azure Credentials & Location
################################################################################
export TF_VAR_subscription_id=""        # ← your Azure subscription id
export TF_VAR_tenant_id=""              # ← your Azure tenant id
export TF_VAR_location="eastus"         # Azure region

################################################################################
# AKS Cluster Basics
################################################################################
export TF_VAR_cluster_name="vllm-aks"
export TF_VAR_cluster_version="1.32"

################################################################################
# LLM Inference
################################################################################
export TF_VAR_enable_vllm="true"        # deploy vLLM stack
export TF_VAR_hf_token=""               # ← Hugging-Face token
export TF_VAR_inference_hardware="gpu"  # "cpu" or "gpu"
export TF_VAR_letsencrypt_email="your@email.com"  # change me

################################################################################
# Node pools (same sizing defaults as AKS)
################################################################################
# CPU pool (always present)
export TF_VAR_cpu_node_min_size="1"
export TF_VAR_cpu_node_max_size="2" 

# GPU pool (only if inference_hardware="gpu")
export TF_VAR_gpu_node_min_size="1"
export TF_VAR_gpu_node_max_size="1" 

################################################################################
# (optional) Paths to Helm value templates
################################################################################
export TF_VAR_cpu_vllm_helm_config="modules/llm-stack/helm/cpu/cpu-tinyllama-light-ingress-azure.tpl"
export TF_VAR_gpu_vllm_helm_config="modules/llm-stack/helm/gpu/gpu-tinyllama-light-ingress-azure.tpl"

# load vars
source env-vars
  • Option 2: Through Terraform Variables (a hypothetical tfvars sketch follows this list)
 # Copy and customize
 $ cp terraform.tfvars.template terraform.tfvars
 $ vim terraform.tfvars
  • Load the variables into your shell: before running Terraform, source the env-vars file:
$ source env-vars
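For Option 2, a hypothetical terraform.tfvars sketch could look like the following; the variable names mirror the TF_VAR_ exports above and all values are placeholders:

# Write a minimal terraform.tfvars (illustrative values, adjust to your environment)
cat > terraform.tfvars <<'EOF'
subscription_id    = "00000000-0000-0000-0000-000000000000"
tenant_id          = "00000000-0000-0000-0000-000000000000"
location           = "eastus"
cluster_name       = "vllm-aks"
cluster_version    = "1.32"
enable_vllm        = "true"
inference_hardware = "cpu"            # "cpu" or "gpu"
hf_token           = "hf_xxxxxxxx"    # Hugging Face token
letsencrypt_email  = "your@email.com"
EOF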

3️⃣ Run Terraform deployment:

You can now safely run terraform plan and apply. The deployment creates around 24 resources in total, including a local kubeconfig file.

terraform init
terraform plan
terraform apply

Full Plan
Plan: 24 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + Stack_Info                               = "Built with ❤️ by @Cloudthrill"
  + aks_deployment_info                      = (sensitive value)
  + aks_kubelet_identity_id                  = (known after apply)
  + aks_name                                 = "vllm-aks"
  + aks_oidc_issuer_url                      = "https://eastus.oic.prod-aks.azure.com/*"
  + aks_resource_id                          = (known after apply)
  + gpu_operator_status                      = {
      + deployed  = "true"
      + name      = "gpu-operator"
      + namespace = "gpu-operator"
      + version   = "v25.3.1"
    }
  + grafana_url                              = "https://grafana.14b9d539.sslip.io"
  + vllm_api_url                             = "https://vllm-api.14b9d539.sslip.io/v1"

After the deployment you should be able to interact with the cluster using kubectl:

export KUBECONFIG=$PWD/kubeconfig
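A quick sanity check that the kubeconfig works and the node pools registered:

kubectl get nodes -o wide
kubectl get pods -A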

4️⃣ Observability (Grafana Login)

You can access the Grafana dashboards using the grafana_url output or port forwarding (e.g. http://localhost:3000).

# Get Grafana HTTPS URL (already printed by Terraform), e.g. https://grafana.xxxxx.sslip.io
terraform output -raw grafana_url 
# Or port forward
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-prometheus-stack
  • Run the command below to fetch the admin password
kubectl get secret -n kube-prometheus-stack kube-prometheus-stack-grafana -o jsonpath={.data.admin-password} | base64 -d
  • Username: admin
  • Password: retrieved via the kubectl command above

Automatic vLLM Dashboard

In this stack, the vLLM dashboard and service monitor are automatically configured for Grafana.

For benchmarking vLLM Production Stack performance, check the multi-round QA tutorial.

5️⃣ Destroying the Infrastructure 🚧

To delete everything, just run the command below (note: you may need to run it twice, as the Azure Load Balancer can be slow to release).

terraform destroy -auto-approve
# Destroy complete! Resources: 24 destroyed.
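If the destroy times out on the load balancer, re-running it usually completes the cleanup. You can also check for leftovers in the AKS-managed node resource group, whose default name follows the MC_<resource-group>_<cluster>_<region> convention (shown here with this stack’s defaults):

# Re-run if the first pass hangs on the Azure Load Balancer
terraform destroy -auto-approve

# Check for leftover load balancers or public IPs
az network lb list -g MC_vllm-aks-rg_vllm-aks_eastus -o table
az network public-ip list -g MC_vllm-aks-rg_vllm-aks_eastus -o table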



🛠️Configuration knobs

This stack provides extensive customization options to tailor your deployment:

| Variable | Default | Description |
|---|---|---|
| location | eastus | Azure region |
| cluster_version | 1.32 | Kubernetes version |
| inference_hardware | cpu | cpu or gpu |
| pod_cidr | 10.244.0.0/16 | Pod overlay network |
| enable_vllm | false | Deploy vLLM stack |
| hf_token | «secret» | HF model download token |
| enable_prometheus | true | Prometheus-Grafana stack |
| enable_cert_manager | true | cert-manager for TLS |
| letsencrypt_email | admin@example.com | Email for Let’s Encrypt |

📓This is just a subset. For the full list of 20+ configurable variables, consult the configuration template: env-vars.template
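As an example, overriding a few of these knobs without touching any file (values are illustrative):

export TF_VAR_location="westeurope"
export TF_VAR_inference_hardware="gpu"
export TF_VAR_enable_prometheus="true"
terraform apply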

🧪 Quick Test

1️⃣ Router Endpoint and API URL

1.1 Router endpoint through port forwarding: run the following command:

# Case 1 : Port forwarding
kubectl -n vllm port-forward svc/vllm-gpu-router-service 30080:80
export vllm_api_url=http://localhost:30080/v1

1.2 Extracting the router URL via the NGINX ingress
The endpoint URL can be found in the vllm_api_url output:

# Case 2 : Extract from Terraform output 
export vllm_api_url=$(terraform output -raw vllm_api_url)

# Example output:
# https://vllm.a1b2c3d4.nip.io/v1


2️⃣ List models

# check models
curl -s ${vllm_api_url}/models | jq .


3️⃣ Completion (applicable to both ingress and port-forwarding URLs)

curl ${vllm_api_url}/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/tinyllama",
    "prompt": "Azure is a",
    "max_tokens": 20,
    "temperature": 0
  }' | jq .choices[].text
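Since the router exposes an OpenAI-compatible API, a chat-style request against the same base URL should also work. A sketch, assuming the /chat/completions route is served for this model:

curl ${vllm_api_url}/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/tinyllama",
    "messages": [{"role": "user", "content": "Name one Azure service."}],
    "max_tokens": 30,
    "temperature": 0
  }' | jq .choices[].message.content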


4️⃣ vLLM model service

kubectl -n vllm get svc

NAME                                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                     AGE
vllm-gpu-router-service                 ClusterIP   10.96.174.35    <none>        80/TCP,9000/TCP             29m
vllm-gpu-tinyllama-gpu-engine-service   ClusterIP   10.96.226.142   <none>        80/TCP,55555/TCP,9999/TCP   29m

🎯Troubleshooting:

Certificate Not Issuing

Debug: STATUS: Pending or False

# Check certificate status
kubectl describe certificate -n vllm

# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager --tail=100

# Check HTTP-01 challenge
kubectl get challenge -n vllm

  • Symptom:
# Message:
Failed to create new order: acme: urn:ietf:params:acme:error:rateLimited: Error creating new order :: too many certificates already issued for: nip.io: see letsencrypt.org/docs/rate-limits
  • Fix: change nip.io to sslip.io in the ingress host of the vLLM Helm charts:
cpu-tinyllama-light-ingress-azure | gpu-tinyllama-light-ingress-azure

Useful Azure CLI Debugging Commands

# Check AKS cluster status
az aks show -g vllm-aks-rg -n vllm-aks

# Check node pools
az aks nodepool list -g vllm-aks-rg --cluster-name vllm-aks -o table
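For GPU deployments, it also helps to confirm that the NVIDIA GPU Operator is healthy and that the GPU is advertised to the scheduler (the gpu-operator namespace matches the Terraform output shown earlier):

# GPU Operator pods
kubectl get pods -n gpu-operator

# Nodes exposing the nvidia.com/gpu resource
kubectl describe nodes | grep -i "nvidia.com/gpu"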

Conclusion

After exploring AWS EKS, this deployment gives you a solid foundation for production LLM serving on Azure AKS, and an ideal starting point to extend further.

Next Steps 🚀

  • In the next post, we’re taking this stack to Google GKE. Stay Tuned!

📚 Additional Resources


Run AI Your Way — In Your Cloud


Run AI assistants, RAG, or internal models on an AI backend 𝗽𝗿𝗶𝘃𝗮𝘁𝗲𝗹𝘆 𝗶𝗻 𝘆𝗼𝘂𝗿 𝗰𝗹𝗼𝘂𝗱 –
✅ No external APIs
✅ No vendor lock-in
✅ Total data control

𝗬𝗼𝘂𝗿 𝗶𝗻𝗳𝗿𝗮. 𝗬𝗼𝘂𝗿 𝗺𝗼𝗱𝗲𝗹𝘀. 𝗬𝗼𝘂𝗿 𝗿𝘂𝗹𝗲𝘀…

🙋🏻‍♀️If you like this content, please subscribe to our blog newsletter ❤️.

👋🏻Want to chat about your challenges?
We’d love to hear from you! 
