vLLM Production Stack on Nebius K8s with Terraform🧑🏼‍🚀

Intro

The vLLM Production Stack is designed to work across any cloud provider with Kubernetes. After covering AWS EKS, Azure AKS, and Google Cloud GKE implementations, today we’re deploying vLLM production-stack on Nebius Managed Kubernetes (MK8s) with the same Terraform approach.

Nebius AI Cloud is purpose-built for AI/ML workloads, offering cutting-edge GPU options from NVIDIA L40 to B200 with pre-baked drivers, VPC-Cilium networking, and 40% cheaper GPU compute. Read more on our Nebius Intro.

This guide shows you how to deploy a production-ready LLM serving environment on Nebius AI Cloud, including automated Let’s Encrypt certificates, GPU provisioning, and comprehensive observability, all via Infrastructure as Code.

💡You can find our code in the CloudThrill repo ➡️ production-stack-terraform.

This is part of CloudThrill‘s ongoing contribution to the vLLM Production Stack project, extending Terraform deployment patterns across AWS, Azure, GCP, Oracle OCI, and Nebius.

📂 Project Structure

./nebius/
├── main.tf                          # MK8s cluster configuration
├── network.tf                       # VPC, subnets, IP pools
├── provider.tf                      # Nebius + Helm + kubectl providers
├── variables.tf                     # All input variables
├── output.tf                        # HTTPS endpoints, stack details
├── cluster-tools.tf                 # cert-manager, NGINX, Prometheus
├── data_sources.tf                  # Ingress data sources
├── vllm-production-stack.tf         # vLLM Helm release
├── env-vars.template                # Quick environment variable setup
├── terraform.tfvars.template        # Terraform variables template
├── config
│   ├── helm
│   │   └── kube-prome-stack.yaml    # Prometheus + Grafana values
│   ├── kubeconfig.tpl               # Local kubeconfig template
│   ├── llm-stack
│   │   └── helm
│   │       ├── cpu
│   │       │   └── cpu-tinyllama-light-ingress-nebius.tpl
│   │       └── gpu
│   │           ├── gpu-operator-values.yaml
│   │           └── gpu-tinyllama-light-ingress-nebius.tpl
│   ├── manifests
│   │   └── letsencrypt-issuer.yaml  # Let's Encrypt ClusterIssuer
│   └── vllm-dashboard.json          # Pre-built vLLM Grafana dashboard
└── README.md                        # ← you are here

🧰Prerequisites

Before you begin, ensure you have the following:

| Tool | Version | Notes |
|---|---|---|
| Terraform | ≥ 1.5.7 | tested on 1.9+ |
| nebius CLI | 0.12.109 | profile / authentication |
| kubectl | ≥ 1.30 | within ±1 minor version of the control plane |
| helm | ≥ 3.14 | used by helm_release |
| jq | — | optional JSON helper |
Follow the steps below to install the tools (expand)👇🏼
# Install tools
sudo apt-get install jq
curl -sSL https://storage.eu-north1.nebius.cloud/cli/install.sh | bash

###### Auto completion
nebius completion bash > ~/.nebius/completion.bash.inc
echo 'if [ -f ~/.nebius/completion.bash.inc ]; then source ~/.nebius/completion.bash.inc; fi' >> ~/.bashrc
source ~/.bashrc
  • Configure Nebius CLI profile
$ nebius profile create
profile name: my-profile
Set api endpoint: api.nebius.cloud
Set federation endpoint: auth.nebius.com

# Opens browser for authentication
Profile "my-profile" configured and activated

What’s in the stack?📦

This Terraform stack delivers a production-ready vLLM serving environment on Nebius AI Cloud supporting GPU inference with operational best practices embedded in Nebius Managed Kubernetes.

It’s designed for real-world production workloads with:
  • GPU-first architecture: Purpose-built for AI/ML with L40S, H100, H200, and B200 GPUs
  • Pre-baked GPU drivers: No manual driver installation or GPU operator needed
  • VPC-Cilium networking: eBPF-based networking with Hubble observability
  • Lightning-fast deployment: Complete stack in ~21 minutes
  • Secure endpoints: HTTPS-only model serving with NGINX Ingress + Nebius Load Balancer + Let’s Encrypt

Architecture Overview

Deployment layers – The stack provisions infrastructure in logical layers that adapt based on your hardware choice:

| Layer | Component | Deployment Time |
|---|---|---|
| Infrastructure | VPC + Subnet + Managed K8s (MK8S) | ~4 min 03 s |
| Add-ons | cert-manager, NGINX Ingress, kube-prometheus-stack | ~12 min 57 s |
| GPU Nodes | Auto-scaling L40S | ~1 min 56 s |
| vLLM Production Stack | Model server + router + autoscaling layers | ~12 min 49 s |
| Total | End-to-end | ~20 min 41 s |

1. 🛜Networking Foundation

The stack creates a production-grade network topology:

  • Single /16 private IP pool (10.20.0.0/16) shared for nodes + pods
  • Additional /16 service-CIDR pool (10.96.0.0/16) carved from the same parent pool
  • One private subnet per AZ (derived from the pools) – no public subnets, no NAT Gateway
  • Native VPC-Cilium CNI (overlay) – VXLAN/Geneve encapsulation, eBPF datapath, Hubble observability
  • NGINX Ingress Controller exposed via Nebius Load Balancer
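
Once the cluster is up, a quick way to confirm the Cilium datapath is healthy; the kube-system namespace and the k8s-app=cilium label are the usual defaults for managed Cilium, so adjust if Nebius places them elsewhere:

# Cilium agents run as a DaemonSet; every node should show one in Running state
kubectl -n kube-system get pods -l k8s-app=cilium -o wide

# Hubble components (if exposed) usually live in the same namespace
kubectl -n kube-system get pods | grep -i hubble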

2. ☸️MK8S Cluster

A control plane running Kubernetes v1.30 with two managed node-group types:

| Pool | Instance | Purpose |
|---|---|---|
| cpu-pool | cpu-d3 (8 vCPU / 32 GiB) | Core Kubernetes workloads |
| gpu-pool | gpu-l40s-d (8 vCPU / 64 GiB + 1 × L40S) | GPU inference workloads |
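
Once both node groups have joined, you can confirm the GPU pool advertises its accelerator as an allocatable resource (uses jq from the prerequisites):

# Print each node name and the number of GPUs it exposes ("0" for CPU nodes)
kubectl get nodes -o json \
  | jq -r '.items[] | [.metadata.name, (.status.allocatable["nvidia.com/gpu"] // "0")] | @tsv'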

3. 📦Essential Add-ons

Core Nebius MK8s add-ons can be installed from the catalog, and GPU drivers come pre-baked on the GPU nodes.

| Category | Component | Description |
|---|---|---|
| CNI | VPC-Cilium | eBPF datapath with Hubble observability |
| Storage | Compute-CSI | Block storage for persistent volumes |
| Ingress | NGINX Ingress | Nebius Load Balancer integration |
| SSL/TLS | cert-manager + Let’s Encrypt | Automated certificate management + free SSL |
| Observability | Prometheus + Grafana + vLLM Dashboard | Complete monitoring stack with GPU & model metrics |
| Core | CoreDNS, Metrics Server | Built-in Kubernetes services |
| GPU (optional) | Pre-baked drivers | NVIDIA drivers included, no GPU operator needed |
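
A quick way to confirm the add-ons landed; the cert-manager and kube-prometheus-stack namespaces are the ones used later in this guide, while ingress-nginx is assumed from the default chart naming:

kubectl get pods -n cert-manager
kubectl get pods -n ingress-nginx            # namespace assumed from the default ingress-nginx chart
kubectl get pods -n kube-prometheus-stack
kubectl get clusterissuer                    # the Let's Encrypt issuer should report READY=True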

4. 🧠vLLM Production Stack

The heart of the deployment: a production-ready model-serving layer:

✅ Model: TinyLlama-1.1B (default, fully customizable)
✅ Load balancing: Round-robin router service across replicas
✅ Secrets: Hugging Face token stored as Kubernetes Secret
✅ Storage: Init container with persistent model caching at `/data/models/`
✅ Monitoring: Prometheus metrics endpoint for observability
✅ HTTPS router endpoints: Automatic TLS with Let’s Encrypt certificates
✅ Default Helm chart: gpu-tinyllama-light-ingress
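
Terraform creates the Hugging Face secret automatically when enable_vllm = true. If you ever need to recreate it by hand, a minimal sketch looks like this; the secret name hf-token-secret and the vllm namespace come from this stack, while the key name token is an assumption, so check the chart values before relying on it:

# Recreate the Hugging Face token secret manually (key name assumed)
kubectl -n vllm create secret generic hf-token-secret \
  --from-literal=token="${TF_VAR_hf_token}"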

🖥️ Nebius GPU Instance Types Available

Available GPU instances (L40S · H100 · H200 · B200)
| Platform | GPU | vCPUs | RAM (GiB) | Region | Use-case |
|---|---|---|---|---|---|
| gpu-b200-sxm | 8 × B200 NVL72 | 160 | 1792 | us-central1 | Frontier training |
| gpu-h200-sxm | 8 × H200 NVLink | 128 | 1600 | eu-n/w/us | Large-scale training |
| gpu-h100-sxm | 1-8 × H100 NVLink | 16-128 | 200-1600 | eu-north1 | High-perf training |
| gpu-l40s-a | 1 × L40S PCIe (Intel) | 8-40 | 32-160 | eu-north1 | Cost-effective inference |
| gpu-l40s-d | 1 × L40S PCIe (AMD) | 16-192 | 96-1152 | eu-north1 | Cost-effective inference |
Note: Check the full list of Nebius GPU instance offerings in our previous blog post.

Getting started

The deployment automatically provisions only the required infrastructure based on your hardware selection.

| Phase | Component | Action | Condition |
|---|---|---|---|
| 1. Infra | IP pool + subnet | Single private pool (10.20.0.0/16) + service-CIDR (10.96.0.0/16) | Always |
| | MK8S cluster | Deploy managed control plane + CPU node group | Always |
| | GPU node group | Auto-scaling L40S/H100/H200/B200 (1-8 nodes) | Always |
| 2. Add-ons | Ingress + TLS | NGINX controller + cert-manager (Let’s Encrypt) | Always |
| 3. Observability | Prometheus + Grafana | GPU & vLLM dashboards | Always |
| 4. vLLM Stack | HF token secret | Create hf-token-secret for Hugging Face | enable_vllm = true |
| | vLLM Helm release | TinyLlama-1.1B model, GPU scheduling, init-container download | enable_vllm = true |
| | ServiceMonitor | Scrape /metrics endpoint + Dashboard | enable_vllm = true |
| | HTTPS endpoint | https://vllm-api.<ip>.sslip.io (nip.io optional) | enable_vllm = true |

🔵 Deployment Steps

1️⃣Clone the repository

The vLLM Nebius MK8s deployment lives under the vllm-production-stack-terraform/nebius directory:

  • Clone the repository and move into the Nebius tutorial folder
$ git clone https://github.com/CloudThrill/vllm-production-stack-terraform
$ cd vllm-production-stack-terraform/nebius/

2️⃣ Set Up Environment Variables

Use an env-vars file to export your TF_VAR_* variables, or use terraform.tfvars. Replace the placeholders with your values:

cp env-vars.template env-vars
vim env-vars   # Set HF token and customize deployment options
source env-vars

Usage examples

  • Option 1: Through Environment Variables
# Copy and customize
$ cp env-vars.template env-vars
$ vi env-vars
################################################################################
# Nebius Project Credentials and Region
################################################################################
export TF_VAR_neb_project_id=""  # (required) - Fill your Nebius Project ID         <==
export TF_VAR_neb_profile="my_nebius_profile" # (Required) replace with your Nebius <==
################################################################################
# Nebius Cluster Configuration
################################################################################
# ☸️ Nebius cluster basics
export TF_VAR_cluster_name="vllm-neb-gpu"  # default: "vllm-neb-gpu"
export TF_VAR_cluster_version="1.30"       # default: "1.30" - Kubernetes cluster version
################################################################################
# Cluster / Networking
################################################################################
export TF_VAR_vpc_name="vllm-vpc"
export TF_VAR_vpc_cidr="10.20.0.0/16"
export TF_VAR_service_cidr="10.96.0.0/16"
export TF_VAR_letsencrypt_email="your-email@email.com"  # Change me
################################################################################
#  🧠 vLLM  Inference Configuration
################################################################################ 
export TF_VAR_enable_vllm="true"   # default false (required)                    <==        
export TF_VAR_hf_token=""   # Hugging Face token (sensitive) (required)          <==      
export TF_VAR_gpu_vllm_helm_config="config/llm-stack/helm/gpu/gpu-tinyllama-light-ingress-nebius.tpl"
################################################################################
# ⚙️ GPU / Nodegroup settings
################################################################################
export TF_VAR_gpu_node_min="0"
export TF_VAR_gpu_node_max="3"
export TF_VAR_gpu_platform="gpu-l40s-d"
.snip
$ source env-vars
  • Option 2: Through Terraform Variables
 # Copy and customize
 $ cp terraform.tfvars.template terraform.tfvars
 $ vim terraform.tfvars
  • Load the Variables into Your Shell Before running Terraform, source the env-vars file:
$ source env-vars
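
Before running Terraform, a quick check that the required variables actually made it into the shell can save a failed plan:

# All stack variables share the TF_VAR_ prefix
env | grep '^TF_VAR_' | sort
# Fail fast if the two required values are missing
: "${TF_VAR_neb_project_id:?set your Nebius project ID}"
: "${TF_VAR_hf_token:?set your Hugging Face token}"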

3️⃣ Run Terraform deployment:

You can now safely run terraform plan and apply. The deployment creates 16 resources in total, including a local kubeconfig.

terraform init
terraform plan
terraform apply
Full Plan
Plan: 16 to add, 0 to change, 0 to destroy.

Changes to Outputs:
 Stack_Info = "Built with ❤️ by @Cloudthrill"
 cluster_endpoint = "private-only"
 cluster_id = "mk8scluster-****"
 cpu_node = "vllm-neb-gpu-cpu"
 cpu_node_platform = "cpu-d3"
 cpu_node_preset = "8vcpu-32gb"
 gpu_node = "vllm-neb-gpu-gpu"
 gpu_node_gpu_settings = {
  "drivers_preset" = "cuda12"
  }
 gpu_node_platform = "gpu-l40s-d"
 gpu_node_preset = "1gpu-16vcpu-96gb"
 gpu_node_scaling = "[1 x , Max 2]"
 gpu_nodegroup_id = "mk8snodegroup-*****"
 kubeconfig_cmd = "nebius mk8s cluster get-credentials mk8scluster-***** --external"
 project_id = "project-****"
 subnet_cidr = {
   "pools" = tolist([
     {
       "cidrs" = tolist([
         {
           "cidr" = "10.20.0.0/16"
           "max_mask_length" = 32
           "state" = "AVAILABLE"
         },
         {
           "cidr" = "10.96.0.0/16"
           "max_mask_length" = 32
           "state" = "AVAILABLE"
         },
       ])
     },
   ])
   "use_network_pools" = false
 }
 subnet_id = "vpcsubnet-*******"
 vpc_id = "vpcnetwork-****"
 vpc_name = "vllm-neb-gpu-network"
 grafana_url = "https://grafana.c3f20d3d.nip.io"
 vllm_api_url = "https://vllm-api.c3f20d3d.nip.io/v1"
 success_message = "VPC and subnet created successfully! Profile authentication is working."

After the deployment you should be able to interact with the cluster using kubectl:

export KUBECONFIG=$PWD/kubeconfig
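
A couple of quick smoke tests confirm the kubeconfig works and the cluster is healthy:

kubectl get nodes -o wide
kubectl get pods -A | grep -Ev 'Running|Completed'   # anything listed is still starting or unhealthy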

4️⃣ Observability (Grafana Login)

You can access the Grafana dashboards using the grafana_url output or via port forwarding (i.e. http://localhost:3000).

# Get Grafana HTTPS URL (already printed by Terraform) i.e https://grafana.xxxxx.nip.io
terraform output -raw grafana_url 
# Or port forward
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-prometheus-stack
  • Run the command below to fetch the admin password
kubectl get secret -n kube-prometheus-stack kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d
  • Username: admin
  • Password: output of the kubectl command above

Automatic vLLM Dashboard

In this stack, the vLLM dashboard and service monitor are automatically configured for Grafana.

To benchmark vLLM Production Stack performance, check the multi-round QA tutorial.

5️⃣ Destroying the Infrastructure 🚧

To delete everything, just run the command below (Note: sometimes you need to run it twice, as the load balancer can be slow to release; see the workaround sketch after the command)

terraform destroy -auto-approve
# Destroy complete! Resources: 16 destroyed.
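
If the destroy keeps hanging on the load balancer, one workaround is to delete the ingress controller Service first so the Nebius Load Balancer it owns is released, then re-run the destroy. The Service name and namespace below assume the default ingress-nginx chart naming, so adjust to your release:

# Release the Nebius Load Balancer by removing the Service that owns it
kubectl delete svc -n ingress-nginx ingress-nginx-controller --wait=true
# Then re-run the destroy
terraform destroy -auto-approve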



🛠️Configuration knobs

This stack provides extensive customization options to tailor your deployment:

| Variable | Default | Description |
|---|---|---|
| neb_project_id | — | (required) Nebius project ID for deployment |
| cluster_name | vllm-neb-gpu | Kubernetes cluster name |
| k8s_version | 1.30 | Kubernetes version |
| public_endpoint | true | Enable external API access |
| gpu_platform | gpu-l40s-d | GPU instance type (L40S) |
| gpu_node_min | 0 | Minimum GPU nodes |
| gpu_node_max | 3 | Maximum GPU nodes |
| enable_vllm | true | Deploy the vLLM stack |
| hf_token | — | Hugging Face token for model pulls |
| grafana_admin_password | — | Admin password for observability stack |
| letsencrypt_email | info@example.com | Email for TLS certificates (example.com is banned) |
| gpu_vllm_helm_config | config/…gpu-tinyllama-light-ingress-nebius.tpl | Helm values file used for GPU deployment |

📓This is just a subset. For the full list of configurable variables, consult the configuration template : env-vars.template
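
You can also override any of these knobs at apply time without touching env-vars or terraform.tfvars, for example:

# Override individual variables on the command line
terraform apply \
  -var="gpu_platform=gpu-h100-sxm" \
  -var="gpu_node_max=2" \
  -var="letsencrypt_email=you@yourdomain.com"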

🧪 Quick Test

1️⃣ Router Endpoint and API URL

1.1 To reach the router endpoint through port forwarding, run the following commands:

# Case 1 : Port forwarding
kubectl -n vllm port-forward svc/vllm-gpu-router-service 30080:80
export vllm_api_url=http://localhost:30080/v1

1.2 Extracting the router URL via the NGINX ingress
The endpoint URL can be found in the vllm_api_url output:

# Case 2 : Extract from Terraform output 
export vllm_api_url=$(terraform output -raw vllm_api_url)
# Example output:
# https://vllm-api.a1b2c3d4.nip.io/v1


2️⃣ List models

# check models
curl -s ${vllm_api_url}/models | jq .


3️⃣ Completion (applicable to both ingress and port-forwarding URLs)

curl ${vllm_api_url}/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/tinyllama",
    "prompt": "Nebius is a",
    "max_tokens": 20,
    "temperature": 0
  }' | jq .choices[].text
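
The router also exposes the OpenAI-compatible chat completions endpoint. Whether it responds usefully depends on the served model shipping a chat template, so treat this as an optional extra check rather than part of the documented test flow:

# Chat completions against the same router endpoint
curl ${vllm_api_url}/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/tinyllama",
    "messages": [{"role": "user", "content": "What is Nebius AI Cloud?"}],
    "max_tokens": 40
  }' | jq -r '.choices[].message.content'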


4️⃣ vLLM model service

kubectl -n vllm get svc
NAME                                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                     AGE
vllm-gpu-router-service                 ClusterIP   10.96.174.35    <none>        80/TCP,9000/TCP             29m
vllm-gpu-tinyllama-gpu-engine-service   ClusterIP   10.96.226.142   <none>        80/TCP,55555/TCP,9999/TCP   29m
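
If the services are present but requests fail, the router and engine pods are the next place to look; substitute the actual pod name from the listing, since pod names are generated per release:

kubectl -n vllm get pods -o wide
# Tail the router logs (replace <router-pod-name> with the pod from the listing above)
kubectl -n vllm logs <router-pod-name> --tail=50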

🎯Troubleshooting:

Certificate Not Issuing

Debug when the certificate STATUS shows Pending or False:

# Check certificate status
kubectl describe certificate -n vllm
# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager --tail=100
# Check HTTP-01 challenge
kubectl get challenge -n vllm

  • Symptom
# Message:
Failed to create new order: acme: urn:ietf:params:acme:error:rateLimited: Error creating new order :: too many certificates already issued for: nip.io: see letsencrypt.org/docs/rate-limits
  • Fix: change nip.io to sslip.io in the ingress host of the vLLM Helm chart values (gpu-tinyllama-light-ingress-nebius.tpl)
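
Once a certificate is issued, you can confirm the endpoint serves a valid Let's Encrypt certificate straight from the shell (assumes openssl is installed locally):

# Extract the hostname from the Terraform output and inspect the served certificate
HOST=$(terraform output -raw vllm_api_url | sed -E 's#https://([^/]+).*#\1#')
echo | openssl s_client -connect "${HOST}:443" -servername "${HOST}" 2>/dev/null \
  | openssl x509 -noout -issuer -enddate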

Useful Nebius CLI Debugging Commands

# Check MK8s cluster status
nebius mk8s cluster list --parent-id <project-id>
nebius mk8s cluster get <cluster-id>

# List node groups
nebius mk8s node-group list --parent-id <cluster-id>

# Check GPU node group details
nebius mk8s node-group get <node-group-id> 

# View available GPU platforms
nebius compute platform list --parent-id <project-id>

# Get kubeconfig
nebius mk8s cluster get-credentials <cluster-id> --external  --kubeconfig <path>

Conclusion

After exploring the EKS, AKS, and GKE implementations of the vLLM production-stack, you’ve now successfully deployed a production-ready vLLM serving environment on Nebius AI Cloud! Congratulations 🎉

Are you a Cloud Provider not listed in this series?

We’d love to feature your platform! Reach out on LinkedIn to discuss how we can build and document your integration 🤗.

📚 Additional Resources


Run AI Your Way — In Your Cloud


Run AI assistants, RAG, or internal models on an AI backend 𝗽𝗿𝗶𝘃𝗮𝘁𝗲𝗹𝘆 𝗶𝗻 𝘆𝗼𝘂𝗿 𝗰𝗹𝗼𝘂𝗱 –
✅ No external APIs
✅ No vendor lock-in
✅ Total data control

𝗬𝗼𝘂𝗿 𝗶𝗻𝗳𝗿𝗮. 𝗬𝗼𝘂𝗿 𝗺𝗼𝗱𝗲𝗹𝘀. 𝗬𝗼𝘂𝗿 𝗿𝘂𝗹𝗲𝘀…

🙋🏻‍♀️If you like this content please subscribe to our blog newsletter ❤️.

👋🏻Want to chat about your challenges?
We’d love to hear from you! 
