
Intro
The vLLM Production Stack is designed to work across any cloud provider with Kubernetes. After covering AWS EKS, Azure AKS, and Google Cloud GKE implementations, today we’re deploying vLLM production-stack on Nebius Managed Kubernetes (MK8s) with the same Terraform approach.
Nebius AI Cloud is purpose-built for AI/ML workloads, offering cutting-edge GPU options from NVIDIA L40S to B200 with pre-baked drivers, VPC-Cilium networking, and up to 40% cheaper GPU compute. Read more in our Nebius Intro.
This guide shows you how to deploy a production-ready LLM serving environment on Nebius AI Cloud, including automated Let’s Encrypt certificates, GPU provisioning, and comprehensive observability, all via Infrastructure as Code.
💡You can find our code in the CloudThrill repo ➡️ production-stack-terraform.
📂 Project Structure
./nebius/
├── main.tf # MK8s cluster configuration
├── network.tf # VPC, subnets, IP pools
├── provider.tf # Nebius + Helm + kubectl providers
├── variables.tf # All input variables
├── output.tf # HTTPS endpoints, stack details
├── cluster-tools.tf # cert-manager, NGINX, Prometheus
├── data_sources.tf # Ingress data sources
├── vllm-production-stack.tf # vLLM Helm release
├── env-vars.template # Quick environment variable setup
├── terraform.tfvars.template # Terraform variables template
├── config
│   ├── helm
│   │   └── kube-prome-stack.yaml   # Prometheus + Grafana values
│   ├── kubeconfig.tpl              # Local kubeconfig template
│   ├── llm-stack
│   │   └── helm
│   │       ├── cpu
│   │       │   └── cpu-tinyllama-light-ingress-nebius.tpl
│   │       └── gpu
│   │           ├── gpu-operator-values.yaml
│   │           └── gpu-tinyllama-light-ingress-nebius.tpl
│   ├── manifests
│   │   └── letsencrypt-issuer.yaml # Let's Encrypt ClusterIssuer
│   └── vllm-dashboard.json         # Pre-built vLLM Grafana dashboard
└── README.md                       # ← you are here
🧰 Prerequisites
Before you begin, ensure you have the following:
| Tool | Version | Notes |
|---|---|---|
| Terraform | ≥ 1.5.7 | tested on 1.9+ |
| nebius CLI | 0.12.109 | profile / authentication |
| kubectl | ≥ 1.30 | within ±1 minor version of the control plane |
| helm | ≥ 3.14 | used by helm_release |
| jq | optional | JSON helper |
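If you want to confirm which of these tools are already present (and at which versions) before installing anything, a quick pre-flight sketch looks like this (the nebius version subcommand is an assumption; check nebius --help if your CLI release names it differently):
# Pre-flight: confirm local tool versions against the prerequisites table
terraform version | head -1     # expect >= 1.5.7
kubectl version --client        # expect >= 1.30
helm version --short            # expect >= 3.14
jq --version                    # optional JSON helper
nebius version                  # subcommand assumed; see `nebius --help` if it differs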
Follow the steps below to install the tools (expand) 👇🏼
# Install tools
sudo apt-get install jq
curl -sSL https://storage.eu-north1.nebius.cloud/cli/install.sh | bash
###### Auto completion
nebius completion bash > ~/.nebius/completion.bash.inc
echo 'if [ -f ~/.nebius/completion.bash.inc ]; then source ~/.nebius/completion.bash.inc; fi' >> ~/.bashrc
source ~/.bashrc
- Configure Nebius CLI profile
$ nebius profile create
profile name: my-profile
Set api endpoint: api.nebius.cloud
Set federation endpoint: auth.nebius.com
# Opens browser for authentication
✔ Profile "my-profile" configured and activated
What’s in the stack? 📦
This Terraform stack delivers a production-ready vLLM serving environment on Nebius AI Cloud, supporting GPU inference with operational best practices built into Nebius Managed Kubernetes.
It’s designed for real-world production workloads with:
✅ GPU-first architecture: Purpose-built for AI/ML with L40S, H100, H200, and B200 GPUs
✅ Pre-baked GPU drivers: No manual driver installation or GPU operator needed
✅ VPC-Cilium networking: eBPF-based networking with Hubble observability
✅ Lightning-fast deployment: Complete stack in ~21 minutes
✅ Secure endpoints: HTTPS-only model serving with NGINX Ingress + Nebius Load Balancer + Let’s Encrypt
Architecture Overview

Deployment layers – The stack provisions infrastructure in logical layers that adapt based on your hardware choice:
| Layer | Component | Deployment Time |
|---|---|---|
| Infrastructure | VPC + Subnet + Managed K8s (MK8S) | ~4 min 03 s |
| Add-ons | cert-manager, NGINX Ingress, kube-prometheus-stack | ~12 min 57 s |
| GPU Nodes | Auto-scaling L40S | ~1 min 56 s |
| vLLM Production Stack | Model server + router + autoscaling layers | ~12 min 49 s |
| Total | End-to-end | ~20 min 41 s |
1. 🛜Networking Foundation
The stack creates a production-grade network topology:
- Single /16 private IP pool (10.20.0.0/16) shared for nodes + pods
- Additional /16 service-CIDR pool (10.96.0.0/16) carved from the same parent pool
- One private subnet per AZ (derived from the pools) – no public subnets, no NAT Gateway
- Native VPC-Cilium CNI (overlay) – VXLAN/Geneve encapsulation, eBPF datapath, Hubble observability
- NGINX Ingress Controller exposed via Nebius Load Balancer
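Once the cluster is up, the CIDR layout and CNI listed above can be sanity-checked with a couple of commands (a minimal sketch; the k8s-app=cilium label selector is an assumption based on common Cilium deployments):
# Cilium agents should be running on every node (label selector assumed)
kubectl -n kube-system get pods -l k8s-app=cilium -o wide
# Pod IPs should sit inside 10.20.0.0/16 and Service ClusterIPs inside 10.96.0.0/16
kubectl get pods -A -o wide | head
kubectl get svc -A | head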
2. ☸️MK8S Cluster
A control plane (v1.30) with two managed node-group types:
| Pool | Instance | Purpose |
|---|---|---|
| cpu-pool | cpu-d3 (8 vCPU / 32 GiB) | Core Kubernetes workload |
| gpu-pool | gpu-l40s-d (8 vCPU / 64 GiB + 1 × L40S) | GPU inference workload |
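After terraform apply, you can confirm both node groups joined the cluster with a sketch like the one below (the nvidia.com/gpu resource name assumes the standard NVIDIA device plugin exposed by the pre-baked drivers):
# List nodes from the cpu-pool and gpu-pool
kubectl get nodes -o wide
# GPU nodes should advertise an allocatable nvidia.com/gpu resource (name assumed)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'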
3. 📦Essential Add-ons
Core Nebius MK8s add-ons can be installed from the catalog, and GPU drivers come pre-baked on the GPU nodes.
| Category | Component | Description |
|---|---|---|
| CNI | VPC-Cilium | eBPF datapath with Hubble observability |
| Storage | Compute-CSI | Block storage for persistent volumes |
| Ingress | NGINX Ingress | Nebius Load Balancer integration |
| SSL/TLS | cert-manager + Let’s Encrypt | Automated certificate management + free SSL |
| Observability | Prometheus + Grafana + vLLM Dashboard | Complete monitoring stack with GPU & model metrics |
| Core | CoreDNS, Metrics Server | Built-in Kubernetes services |
| GPU (optional) | Pre-baked Drivers | NVIDIA drivers included—no GPU operator needed |
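A quick way to verify these add-ons landed is to list their pods; the namespaces below follow the chart defaults used in this stack (kube-prometheus-stack and cert-manager are confirmed later in this guide, ingress-nginx is assumed):
kubectl get pods -n cert-manager
kubectl get pods -n ingress-nginx          # namespace assumed from the NGINX Ingress chart default
kubectl get pods -n kube-prometheus-stack
kubectl get clusterissuer                  # the Let's Encrypt ClusterIssuer from config/manifests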
4. 🧠vLLM Production Stack
The heart of the deployment is a production-ready model-serving layer:
✅ Model: TinyLlama-1.1B (default, fully customizable)
✅ Load balancing: Round-robin router service across replicas
✅ Secrets: Hugging Face token stored as Kubernetes Secret
✅ Storage: Init container with persistent model caching at `/data/models/`
✅ Monitoring: Prometheus metrics endpoint for observability
✅ HTTPS router endpoints: Automatic TLS with Let’s Encrypt certificates
✅ Default Helm charts: gpu-tinyllama-light-ingress
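Once enable_vllm = true has been applied, the pieces above can be inspected directly in the vllm namespace (a hedged sketch; hf-token-secret is the Secret name used by this stack):
# Router, engine pods and the Hugging Face token secret live in the vllm namespace
kubectl -n vllm get pods,svc,secret
kubectl -n vllm get secret hf-token-secret
# Ingress backing the HTTPS router endpoint (TLS via Let's Encrypt)
kubectl -n vllm get ingress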
🖥️ Nebius GPU Platforms Available
Available GPU platforms (L40S · H100 · H200 · B200)
| Platform | GPU | vCPUs | RAM (GiB) | Region | Use-case |
|---|---|---|---|---|---|
| gpu-b200-sxm | 8 × B200 NVL72 | 160 | 1792 | us-central1 | Frontier training |
| gpu-h200-sxm | 8 × H200 NVLink | 128 | 1600 | eu-n/w/us | Large-scale training |
| gpu-h100-sxm | 1-8 × H100 NVLink | 16-128 | 200-1600 | eu-north1 | High-perf training |
| gpu-l40s-a | 1 × L40S PCIe (Intel) | 8-40 | 32-160 | eu-north1 | Cost-effective inference |
| gpu-l40s-d | 1 × L40S PCIe (AMD) | 16-192 | 96-1152 | eu-north1 | Cost-effective inference |
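If you want to see which of these platforms your project can actually provision, the same Nebius CLI command used later in the troubleshooting section works here too:
# View available GPU platforms in your project
nebius compute platform list --parent-id <project-id>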
Getting started
The deployment automatically provisions only the required infrastructure based on your hardware selection.
| Phase | Component | Action | Condition |
|---|---|---|---|
| 1. Infra | IP pool + subnet | Single private pool (10.20.0.0/16) + service-CIDR (10.96.0.0/16) | Always |
| | MK8S cluster | Deploy managed control plane + CPU node group | Always |
| | GPU node group | Auto-scaling L40S/H100/H200/B200 (1-8 nodes) | Always |
| 2. Add-ons | Ingress + TLS | NGINX controller + cert-manager (Let’s Encrypt) | Always |
| 3. Observability | Prometheus + Grafana | GPU & vLLM dashboards | Always |
| 4. vLLM Stack | HF token secret | Create hf-token-secret for Hugging Face | enable_vllm = true |
| | vLLM Helm release | TinyLlama-1.1B model, GPU scheduling, init-container download | enable_vllm = true |
| | ServiceMonitor | Scrape /metrics endpoint + Dashboard | enable_vllm = true |
| | HTTPS endpoint | https://vllm-api.<ip>.sslip.io (nip.io optional) | enable_vllm = true |
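Because the vLLM layer is gated on enable_vllm, one optional way to stage the rollout (a sketch, not required) is to bring up the infrastructure, add-ons, and observability first, then flip the model stack on afterwards:
# Phases 1-3 only: infra, add-ons, observability
terraform apply -var="enable_vllm=false"
# Phase 4: deploy the vLLM production stack on top
terraform apply -var="enable_vllm=true"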
🔵 Deployment Steps
1️⃣Clone the repository
The vLLM Nebius MK8s deployment build is located under the vllm-production-stack-terraform/nebius directory:
- Clone the production-stack-terraform repository and navigate to the Nebius tutorial folder
$ git clone https://github.com/CloudThrill/vllm-production-stack-terraform
$ cd vllm-production-stack-terraform/nebius/
2️⃣ Set Up Environment Variables
Use an env-vars file to export your TF_VAR_* variables, or use terraform.tfvars. Replace placeholders with your values:
cp env-vars.template env-vars
vim env-vars
# Set HF token and customize deployment options
source env-vars
Usage examples
- Option 1: Through Environment Variables
# Copy and customize
$ cp env-vars.template env-vars
$ vi env-vars
################################################################################
# Nebius Project Credentials and Region
################################################################################
export TF_VAR_neb_project_id="" # (required) - Fill your Nebius Project ID <==
export TF_VAR_neb_profile="my_nebius_profile" # (Required) replace with your Nebius <==
################################################################################
# Nebius Cluster Configuration
################################################################################
# ☸️ Nebius cluster basics
export TF_VAR_cluster_name="vllm-neb-gpu" # default: "vllm-neb-gpu"
export TF_VAR_cluster_version="1.30" # default: "1.30" - Kubernetes cluster version
################################################################################
# Cluster / Networking
################################################################################
export TF_VAR_vpc_name="vllm-vpc"
export TF_VAR_vpc_cidr="10.20.0.0/16"
export TF_VAR_service_cidr="10.96.0.0/16"
export TF_VAR_letsencrypt_email="your-email@email.com" # Change me
################################################################################
# 🧠 vLLM Inference Configuration
################################################################################
export TF_VAR_enable_vllm="true" # default false (required) <==
export TF_VAR_hf_token="" # Hugging Face token (sensitive) (required) <==
export TF_VAR_gpu_vllm_helm_config="config/llm-stack/helm/gpu/gpu-tinyllama-light-ingress-nebius.tpl"
################################################################################
# ⚙️ GPU / Nodegroup settings
################################################################################
export TF_VAR_gpu_node_min="0"
export TF_VAR_gpu_node_max="3"
export TF_VAR_gpu_platform="gpu-l40s-d"
.snip
$ source env-vars
- Option 2: Through Terraform Variables
# Copy and customize
$ cp terraform.tfvars.template terraform.tfvars
$ vim terraform.tfvars
- Load the variables into your shell: before running Terraform, source the env-vars file:
$ source env-vars
3️⃣ Run Terraform deployment:
You can now safely run terraform plan and apply. The plan creates 16 resources in total, including the local kubeconfig.
terraform init
terraform plan
terraform apply
Full Plan
Plan: 16 to add, 0 to change, 0 to destroy.
Changes to Outputs:
Stack_Info = "Built with ❤️ by @Cloudthrill"
cluster_endpoint = "private-only"
cluster_id = "mk8scluster-****"
cpu_node = "vllm-neb-gpu-cpu"
cpu_node_platform = "cpu-d3"
cpu_node_preset = "8vcpu-32gb"
gpu_node = "vllm-neb-gpu-gpu"
gpu_node_gpu_settings = {
"drivers_preset" = "cuda12"
}
gpu_node_platform = "gpu-l40s-d"
gpu_node_preset = "1gpu-16vcpu-96gb"
gpu_node_scaling = "[1 x , Max 2]"
gpu_nodegroup_id = "mk8snodegroup-*****"
kubeconfig_cmd = "nebius mk8s cluster get-credentials mk8scluster-***** --external"
project_id = "project-****"
subnet_cidr = {
"pools" = tolist([
{
"cidrs" = tolist([
{
"cidr" = "10.20.0.0/16"
"max_mask_length" = 32
"state" = "AVAILABLE"
},
{
"cidr" = "10.96.0.0/16"
"max_mask_length" = 32
"state" = "AVAILABLE"
},
])
},
])
"use_network_pools" = false
}
subnet_id = "vpcsubnet-*******"
vpc_id = "vpcnetwork-****"
vpc_name = "vllm-neb-gpu-network"
grafana_url = "https://grafana.c3f20d3d.nip.io"
vllm_api_url = "https://vllm-api.c3f20d3d.nip.io/v1"
success_message = "VPC and subnet created successfully! Profile authentication is working."
After the deployment you should be able to interact with the cluster using kubectl:
export KUBECONFIG=$PWD/kubeconfig
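With the kubeconfig exported, a couple of quick commands confirm the cluster is reachable before moving on to Grafana:
kubectl get nodes -o wide
kubectl get pods -A | head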
4️⃣ Observability (Grafana Login)
You can access Grafana dashboards using the grafana_url output or port forwarding (i.e., http://localhost:3000).
# Get Grafana HTTPS URL (already printed by Terraform) i.e https://grafana.xxxxx.nip.io
terraform output -raw grafana_url
# Or port forward
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-prometheus-stack
- Run the below command to fetch the password
kubectl get secret -n kube-prometheus-stack kube-prometheus-stack-grafana -o jsonpath={.data.admin-password} | base64 -d
- Username: admin
- Password: retrieved via the kubectl command above
Automatic vLLM Dashboard
In this stack, the vLLM dashboard and service monitor are automatically configured for Grafana.

For benchmarking vLLM Production Stack performance, check the multi-round QA tutorial.
5️⃣ Destroying the Infrastructure 🚧
To delete everything, just run the command below (note: you may need to run it twice, as the load balancer can be slow to terminate).
terraform destroy -auto-approve
# Destroy complete! Resources: 16 destroyed.
🛠️ Configuration knobs
This stack provides extensive customization options to tailor your deployment:
| Variable | Default | Description |
|---|---|---|
| neb_project_id | — (required) | Nebius project ID for deployment |
| cluster_name | vllm-neb-gpu | Kubernetes cluster name |
| k8s_version | 1.30 | Kubernetes version |
| public_endpoint | true | Enable external API access |
| gpu_platform | gpu-l40s-d | GPU instance type (L40S) |
| gpu_node_min | 0 | Minimum GPU nodes |
| gpu_node_max | 3 | Maximum GPU nodes |
| enable_vllm | true | Deploy the vLLM stack |
| hf_token | — | Hugging Face token for model pulls |
| grafana_admin_password | — | Admin password for observability stack |
| letsencrypt_email | info@example.com | Email for TLS certificates (example.com is banned) |
| gpu_vllm_helm_config | config/…gpu-tinyllama-light-ingress-nebius.tpl | Helm values file used for GPU deployment |
📓 This is just a subset. For the full list of configurable variables, consult the configuration template: env-vars.template
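As an example of overriding these knobs without editing any file, variables can also be passed straight on the command line (a sketch using values from the GPU platform table above):
# One-off apply with a different GPU platform and node ceiling
terraform apply \
  -var="gpu_platform=gpu-h100-sxm" \
  -var="gpu_node_max=2"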
🧪 Quick Test
1️⃣ Router Endpoint and API URL
1.1 Router endpoint via port forwarding: run the following command:
# Case 1 : Port forwarding
kubectl -n vllm port-forward svc/vllm-gpu-router-service 30080:80
export vllm_api_url=http://localhost:30080/v1
1.2 Extracting the Router URL via the NGINX ingress
The endpoint URL can be found in the vllm_api_url output :
# Case 2 : Extract from Terraform output
export vllm_api_url=$(terraform output -raw vllm_api_url)
# Example output:
# https://vllm.a1b2c3d4.nip.io/v1
2️⃣ List models
# check models
curl -s ${vllm_api_url}/models | jq .
3️⃣ Completion (applicable to both ingress and port-forwarding URLs)
curl ${vllm_api_url}/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/data/models/tinyllama",
"prompt": "Nebius is a",
"max_tokens": 20,
"temperature": 0
}' | jq .choices[].text
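Since the router exposes an OpenAI-compatible API, a chat-style request should also work (a hedged sketch: it assumes the served TinyLlama variant ships a chat template; otherwise stick to /completions above):
curl ${vllm_api_url}/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/tinyllama",
    "messages": [{"role": "user", "content": "What is Nebius AI Cloud?"}],
    "max_tokens": 30,
    "temperature": 0
  }' | jq .choices[].message.content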
4️⃣ vLLM model service
kubectl -n vllm get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
vllm-gpu-router-service ClusterIP 10.96.174.35 <none> 80/TCP,9000/TCP 29m
vllm-gpu-tinyllama-gpu-engine-service ClusterIP 10.96.226.142 <none> 80/TCP,55555/TCP,9999/TCP 29m
🎯 Troubleshooting:
Certificate Not Issuing
Debug: STATUS: Pending or False
# Check certificate status
kubectl describe certificate -n vllm
# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager --tail=100
# Check HTTP-01 challenge
kubectl get challenge -n vllm
- Symptom
# Message:
Failed to create new order: acme: urn:ietf:params:acme:error:rateLimited: Error creating new order :: too many certificates already issued for: nip.io: see letsencrypt.org/docs/rate-limits
- Fix: switch nip.io to sslip.io in the ingress host of the vLLM Helm chart gpu-tinyllama-light-ingress-nebius.tpl
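One hedged way to apply that fix in place (assuming the hostname suffix only appears in the GPU values template) is:
# Swap the wildcard-DNS suffix in the ingress host, then re-apply
sed -i 's/nip\.io/sslip.io/g' config/llm-stack/helm/gpu/gpu-tinyllama-light-ingress-nebius.tpl
terraform apply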
Useful Nebius CLI Debugging Commands
# Check MK8s cluster status
nebius mk8s cluster list --parent-id <project-id>
nebius mk8s cluster get <cluster-id>
# List node groups
nebius mk8s node-group list --parent-id <cluster-id>
# Check GPU node group details
nebius mk8s node-group get <node-group-id>
# View available GPU platforms
nebius compute platform list --parent-id <project-id>
# Get kubeconfig
nebius mk8s cluster get-credentials <cluster-id> --external --kubeconfig <path>
Conclusion
After exploring the EKS, AKS, and GKE implementations of the vLLM production-stack, you’ve now successfully deployed a production-ready vLLM serving environment on Nebius AI Cloud! Congratulations 🎉
Are you a Cloud Provider not listed in this series?
We’d love to feature your platform! Reach out on LinkedIn to discuss how we can build and document your integration 🤗.
📚 Additional Resources
- vLLM Documentation
- vLLM Production stack documentation
- vLLM Production Stack (repo)
- Nebius Cloud Docs
- Nebius Terraform Provider
- Nebius MK8s
- Nebius Networking Requirements

Run AI Your Way — In Your Cloud
Want full control over your AI backend? The CloudThrill VLLM Private Inference POC is still open — but not forever.
📢 Secure your spot (only a few left), 𝗔𝗽𝗽𝗹𝘆 𝗻𝗼𝘄!
Run AI assistants, RAG, or internal models on an AI backend 𝗽𝗿𝗶𝘃𝗮𝘁𝗲𝗹𝘆 𝗶𝗻 𝘆𝗼𝘂𝗿 𝗰𝗹𝗼𝘂𝗱 –
✅ No external APIs
✅ No vendor lock-in
✅ Total data control
𝗬𝗼𝘂𝗿 𝗶𝗻𝗳𝗿𝗮. 𝗬𝗼𝘂𝗿 𝗺𝗼𝗱𝗲𝗹𝘀. 𝗬𝗼𝘂𝗿 𝗿𝘂𝗹𝗲𝘀…
🙋🏻♀️ If you like this content, please subscribe to our blog newsletter ❤️.