vLLM Production Stack on Nebius K8s with Terraform🧑🏼‍🚀

Intro

The vLLM Production Stack is designed to work across any cloud provider with Kubernetes. After covering AWS EKS, Azure AKS, and Google Cloud GKE implementations, today we’re deploying vLLM production-stack on Nebius Managed Kubernetes (MK8s) with the same Terraform approach.

Nebius AI Cloud is purpose-built for AI/ML workloads, offering cutting-edge GPU options from NVIDIA L40 to B200 with pre-baked drivers, VPC-Cilium networking, and 40% cheaper GPU compute. Read more on our Nebius Intro.

This guide shows you how to deploy a production-ready LLM serving environment on Nebius AI Cloud, including automated Let’s Encrypt certificates, GPU provisioning, and comprehensive observability, all via Infrastructure as Code.

💡You can find our code in the CloudThrill repo ➡️ production-stack-terraform.

This is part of CloudThrill‘s ongoing contribution to the vLLM Production Stack project, extending Terraform deployment patterns across AWS, Azure, GCP, Oracle OCI, and Nebius.

📂 Project Structure

./nebius/
├── main.tf                          # MK8s cluster configuration
├── network.tf                       # VPC, subnets, IP pools
├── provider.tf                      # Nebius + Helm + kubectl providers
├── variables.tf                     # All input variables
├── output.tf                        # HTTPS endpoints, stack details
├── cluster-tools.tf                 # cert-manager, NGINX, Prometheus
├── data_sources.tf                  # Ingress data sources
├── vllm-production-stack.tf         # vLLM Helm release
├── env-vars.template                # Quick environment variable setup
├── terraform.tfvars.template        # Terraform variables template
├── config
│   ├── helm
│   │   └── kube-prome-stack.yaml    # Prometheus + Grafana values
│   ├── kubeconfig.tpl               # Local kubeconfig template
│   ├── llm-stack
│   │   └── helm
│   │       ├── cpu
│   │       │   └── cpu-tinyllama-light-ingress-nebius.tpl
│   │       └── gpu
│   │           ├── gpu-operator-values.yaml
│   │           └── gpu-tinyllama-light-ingress-nebius.tpl
│   ├── manifests
│   │   └── letsencrypt-issuer.yaml  # Let's Encrypt ClusterIssuer
│   └── vllm-dashboard.json          # Pre-built vLLM Grafana dashboard
└── README.md                        # ← you are here

🧰Prerequisites

Before you begin, ensure you have the following:

| Tool | Version | Notes |
|---|---|---|
| Terraform | ≥ 1.5.7 | tested on 1.9+ |
| nebius CLI | 0.12.109 | profile / authentication |
| kubectl | ≥ 1.30 | within ±1 minor version of the control plane |
| helm | ≥ 3.14 | used by helm_release |
| jq | — | optional JSON helper |
Follow the steps below to install the tools (expand)👇🏼
# Install tools
sudo apt-get install jq
curl -sSL https://storage.eu-north1.nebius.cloud/cli/install.sh | bash

###### Auto completion
nebius completion bash > ~/.nebius/completion.bash.inc
echo 'if [ -f ~/.nebius/completion.bash.inc ]; then source ~/.nebius/completion.bash.inc; fi' >> ~/.bashrc
source ~/.bashrc
  • Configure Nebius CLI profile
$ nebius profile create
profile name: my-profile
Set api endpoint: api.nebius.cloud
Set federation endpoint: auth.nebius.com

# Opens browser for authentication
Profile "my-profile" configured and activated

What’s in the stack?📦

This Terraform stack delivers a production-ready vLLM serving environment on Nebius AI Cloud supporting GPU inference with operational best practices embedded in Nebius Managed Kubernetes.

It’s designed for real-world production workloads with:
  • GPU-first architecture: Purpose-built for AI/ML with L40S, H100, H200, and B200 GPUs
  • Pre-baked GPU drivers: No manual driver installation or GPU operator needed
  • VPC-Cilium networking: eBPF-based networking with Hubble observability
  • Lightning-fast deployment: Complete stack in ~21 minutes
  • Secure endpoints: HTTPS-only model serving with NGINX Ingress + Nebius Load Balancer + Let’s Encrypt

Architecture Overview

Deployment layers – The stack provisions infrastructure in logical layers that adapt based on your hardware choice:

| Layer | Component | Deployment Time |
|---|---|---|
| Infrastructure | VPC + Subnet + Managed K8s (MK8S) | ~4 min 03 s |
| Add-ons | cert-manager, NGINX Ingress, kube-prometheus-stack | ~12 min 57 s |
| GPU Nodes | Auto-scaling L40S | ~1 min 56 s |
| vLLM Production Stack | Model server + router + autoscaling layers | ~12 min 49 s |
| Total | End-to-end | ~20 min 41 s |

1. 🛜Networking Foundation

The stack creates a production-grade network topology:

  • Single /16 private IP pool (10.20.0.0/16) shared for nodes + pods
  • Additional /16 service-CIDR pool (10.96.0.0/16) carved from the same parent pool
  • One private subnet per AZ (derived from the pools) – no public subnets, no NAT Gateway
  • Native VPC-Cilium CNI (overlay) – VXLAN/Geneve encapsulation, eBPF datapath, Hubble observability
  • NGINX Ingress Controller exposed via Nebius Load Balancer
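
Once the cluster is up, a quick way to confirm the Cilium datapath is healthy; the kube-system namespace and the k8s-app=cilium label are the usual defaults for managed Cilium, so adjust if Nebius places them elsewhere:

# Cilium agents run as a DaemonSet; every node should show one in Running state
kubectl -n kube-system get pods -l k8s-app=cilium -o wide

# Hubble components (if exposed) usually live in the same namespace
kubectl -n kube-system get pods | grep -i hubble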

2. ☸️MK8S Cluster

A control plane running Kubernetes v1.30 with two managed node-group types:

| Pool | Instance | Purpose |
|---|---|---|
| cpu-pool | cpu-d3 (8 vCPU / 32 GiB) | Core Kubernetes workloads |
| gpu-pool | gpu-l40s-d (8 vCPU / 64 GiB + 1 × L40S) | GPU inference workloads |
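
Once both node groups have joined, you can confirm the GPU pool advertises its accelerator as an allocatable resource (uses jq from the prerequisites):

# Print each node name and the number of GPUs it exposes ("0" for CPU nodes)
kubectl get nodes -o json \
  | jq -r '.items[] | [.metadata.name, (.status.allocatable["nvidia.com/gpu"] // "0")] | @tsv'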

3. 📦Essential Add-ons

Core Nebius MK8s add-ons can be installed from the catalog, and GPU drivers come pre-baked on the GPU nodes.

| Category | Component | Description |
|---|---|---|
| CNI | VPC-Cilium | eBPF datapath with Hubble observability |
| Storage | Compute-CSI | Block storage for persistent volumes |
| Ingress | NGINX Ingress | Nebius Load Balancer integration |
| SSL/TLS | cert-manager + Let’s Encrypt | Automated certificate management + free SSL |
| Observability | Prometheus + Grafana + vLLM Dashboard | Complete monitoring stack with GPU & model metrics |
| Core | CoreDNS, Metrics Server | Built-in Kubernetes services |
| GPU (optional) | Pre-baked drivers | NVIDIA drivers included, no GPU operator needed |
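
A quick way to confirm the add-ons landed; the cert-manager and kube-prometheus-stack namespaces are the ones used later in this guide, while ingress-nginx is assumed from the default chart naming:

kubectl get pods -n cert-manager
kubectl get pods -n ingress-nginx            # namespace assumed from the default ingress-nginx chart
kubectl get pods -n kube-prometheus-stack
kubectl get clusterissuer                    # the Let's Encrypt issuer should report READY=True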

4. 🧠vLLM Production Stack

The heart of the deployment: a production-ready model-serving layer:

✅ Model: TinyLlama-1.1B (default, fully customizable)
✅ Load balancing: Round-robin router service across replicas
✅ Secrets: Hugging Face token stored as Kubernetes Secret
✅ Storage: Init container with persistent model caching at `/data/models/`
✅ Monitoring: Prometheus metrics endpoint for observability
✅ HTTPS router endpoints: Automatic TLS with Let’s Encrypt certificates
✅ Default Helm chart: gpu-tinyllama-light-ingress
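
Terraform creates the Hugging Face secret automatically when enable_vllm = true. If you ever need to recreate it by hand, a minimal sketch looks like this; the secret name hf-token-secret and the vllm namespace come from this stack, while the key name token is an assumption, so check the chart values before relying on it:

# Recreate the Hugging Face token secret manually (key name assumed)
kubectl -n vllm create secret generic hf-token-secret \
  --from-literal=token="${TF_VAR_hf_token}"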

🖥️ Nebius GPU Instance Types Available

Available GPU instances (L40S · H100 · H200 · B200)
| Platform | GPU | vCPUs | RAM (GiB) | Region | Use-case |
|---|---|---|---|---|---|
| gpu-b200-sxm | 8 × B200 NVL72 | 160 | 1792 | us-central1 | Frontier training |
| gpu-h200-sxm | 8 × H200 NVLink | 128 | 1600 | eu-n/w/us | Large-scale training |
| gpu-h100-sxm | 1-8 × H100 NVLink | 16-128 | 200-1600 | eu-north1 | High-perf training |
| gpu-l40s-a | 1 × L40S PCIe (Intel) | 8-40 | 32-160 | eu-north1 | Cost-effective inference |
| gpu-l40s-d | 1 × L40S PCIe (AMD) | 16-192 | 96-1152 | eu-north1 | Cost-effective inference |
Note: Check the full list of Nebius GPU instance offerings in our previous blog post.

Getting started

The deployment automatically provisions only the required infrastructure based on your hardware selection.

| Phase | Component | Action | Condition |
|---|---|---|---|
| 1. Infra | IP pool + subnet | Single private pool (10.20.0.0/16) + service-CIDR (10.96.0.0/16) | Always |
| | MK8S cluster | Deploy managed control plane + CPU node group | Always |
| | GPU node group | Auto-scaling L40S/H100/H200/B200 (1-8 nodes) | Always |
| 2. Add-ons | Ingress + TLS | NGINX controller + cert-manager (Let’s Encrypt) | Always |
| 3. Observability | Prometheus + Grafana | GPU & vLLM dashboards | Always |
| 4. vLLM Stack | HF token secret | Create hf-token-secret for Hugging Face | enable_vllm = true |
| | vLLM Helm release | TinyLlama-1.1B model, GPU scheduling, init-container download | enable_vllm = true |
| | ServiceMonitor | Scrape /metrics endpoint + Dashboard | enable_vllm = true |
| | HTTPS endpoint | https://vllm-api.<ip>.sslip.io (nip.io optional) | enable_vllm = true |

🔵 Deployment Steps

1️⃣Clone the repository

The vLLM Nebius MK8s deployment lives under the vllm-production-stack-terraform/nebius directory:

  • Clone the repository and move into the Nebius tutorial folder
$ git clone https://github.com/CloudThrill/vllm-production-stack-terraform
$ cd vllm-production-stack-terraform/nebius/

2️⃣ Set Up Environment Variables

Use an env-vars file to export your TF_VAR_* variables, or use terraform.tfvars. Replace the placeholders with your values:

cp env-vars.template env-vars
vim env-vars   # Set HF token and customize deployment options
source env-vars

Usage examples

  • Option 1: Through Environment Variables
# Copy and customize
$ cp env-vars.template env-vars
$ vi env-vars
################################################################################
# Nebius Project Credentials and Region
################################################################################
export TF_VAR_neb_project_id=""  # (required) - Fill your Nebius Project ID         <==
export TF_VAR_neb_profile="my_nebius_profile" # (Required) replace with your Nebius <==
################################################################################
# Nebius Cluster Configuration
################################################################################
# ☸️ Nebius cluster basics
export TF_VAR_cluster_name="vllm-neb-gpu"  # default: "vllm-neb-gpu"
export TF_VAR_cluster_version="1.30"       # default: "1.30" - Kubernetes cluster version
################################################################################
# Cluster / Networking
################################################################################
export TF_VAR_vpc_name="vllm-vpc"
export TF_VAR_vpc_cidr="10.20.0.0/16"
export TF_VAR_service_cidr="10.96.0.0/16"
export TF_VAR_letsencrypt_email="your-email@email.com"  # Change me
################################################################################
#  🧠 vLLM  Inference Configuration
################################################################################ 
export TF_VAR_enable_vllm="true"   # default false (required)                    <==        
export TF_VAR_hf_token=""   # Hugging Face token (sensitive) (required)          <==      
export TF_VAR_gpu_vllm_helm_config="config/llm-stack/helm/gpu/gpu-tinyllama-light-ingress-nebius.tpl"
################################################################################
# ⚙️ GPU / Nodegroup settings
################################################################################
export TF_VAR_gpu_node_min="0"
export TF_VAR_gpu_node_max="3"
export TF_VAR_gpu_platform="gpu-l40s-d"
.snip
$ source env-vars
  • Option 2: Through Terraform Variables
 # Copy and customize
 $ cp terraform.tfvars.template terraform.tfvars
 $ vim terraform.tfvars
  • Load the Variables into Your Shell Before running Terraform, source the env-vars file:
$ source env-vars
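
Before running Terraform, a quick check that the required variables actually made it into the shell can save a failed plan:

# All stack variables share the TF_VAR_ prefix
env | grep '^TF_VAR_' | sort
# Fail fast if the two required values are missing
: "${TF_VAR_neb_project_id:?set your Nebius project ID}"
: "${TF_VAR_hf_token:?set your Hugging Face token}"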

3️⃣ Run Terraform deployment:

You can now safely run terraform plan and apply. The deployment creates 16 resources in total, including a local kubeconfig.

terraform init
terraform plan
terraform apply
Full Plan
Plan: 16 to add, 0 to change, 0 to destroy.

Changes to Outputs:
 Stack_Info = "Built with ❤️ by @Cloudthrill"
 cluster_endpoint = "private-only"
 cluster_id = "mk8scluster-****"
 cpu_node = "vllm-neb-gpu-cpu"
 cpu_node_platform = "cpu-d3"
 cpu_node_preset = "8vcpu-32gb"
 gpu_node = "vllm-neb-gpu-gpu"
 gpu_node_gpu_settings = {
  "drivers_preset" = "cuda12"
  }
 gpu_node_platform = "gpu-l40s-d"
 gpu_node_preset = "1gpu-16vcpu-96gb"
 gpu_node_scaling = "[1 x , Max 2]"
 gpu_nodegroup_id = "mk8snodegroup-*****"
 kubeconfig_cmd = "nebius mk8s cluster get-credentials mk8scluster-***** --external"
 project_id = "project-****"
 subnet_cidr = {
   "pools" = tolist([
     {
       "cidrs" = tolist([
         {
           "cidr" = "10.20.0.0/16"
           "max_mask_length" = 32
           "state" = "AVAILABLE"
         },
         {
           "cidr" = "10.96.0.0/16"
           "max_mask_length" = 32
           "state" = "AVAILABLE"
         },
       ])
     },
   ])
   "use_network_pools" = false
 }
 subnet_id = "vpcsubnet-*******"
 vpc_id = "vpcnetwork-****"
 vpc_name = "vllm-neb-gpu-network"
 grafana_url = "https://grafana.c3f20d3d.nip.io"
 vllm_api_url = "https://vllm-api.c3f20d3d.nip.io/v1"
 success_message = "VPC and subnet created successfully! Profile authentication is working."

After the deployment you should be able to interact with the cluster using kubectl:

export KUBECONFIG=$PWD/kubeconfig
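
A couple of quick smoke tests confirm the kubeconfig works and the cluster is healthy:

kubectl get nodes -o wide
kubectl get pods -A | grep -Ev 'Running|Completed'   # anything listed is still starting or unhealthy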

4️⃣ Observability (Grafana Login)

You can access the Grafana dashboards using the grafana_url output or via port forwarding (i.e. http://localhost:3000).

# Get Grafana HTTPS URL (already printed by Terraform) i.e https://grafana.xxxxx.nip.io
terraform output -raw grafana_url 
# Or port forward
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-prometheus-stack
  • Run the command below to fetch the admin password
kubectl get secret -n kube-prometheus-stack kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d
  • Username: admin
  • Password: output of the kubectl command above

Automatic vLLM Dashboard

In this stack, the vLLM dashboard and service monitor are automatically configured for Grafana.

To benchmark vLLM Production Stack performance, check the multi-round QA tutorial.

5️⃣ Destroying the Infrastructure 🚧

To delete everything, just run the command below (Note: sometimes you need to run it twice, as the load balancer can be slow to release; see the workaround sketch after the command)

terraform destroy -auto-approve
# Destroy complete! Resources: 16 destroyed.
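
If the destroy keeps hanging on the load balancer, one workaround is to delete the ingress controller Service first so the Nebius Load Balancer it owns is released, then re-run the destroy. The Service name and namespace below assume the default ingress-nginx chart naming, so adjust to your release:

# Release the Nebius Load Balancer by removing the Service that owns it
kubectl delete svc -n ingress-nginx ingress-nginx-controller --wait=true
# Then re-run the destroy
terraform destroy -auto-approve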



🛠️Configuration knobs

This stack provides extensive customization options to tailor your deployment:

| Variable | Default | Description |
|---|---|---|
| neb_project_id | — | (required) Nebius project ID for deployment |
| cluster_name | vllm-neb-gpu | Kubernetes cluster name |
| k8s_version | 1.30 | Kubernetes version |
| public_endpoint | true | Enable external API access |
| gpu_platform | gpu-l40s-d | GPU instance type (L40S) |
| gpu_node_min | 0 | Minimum GPU nodes |
| gpu_node_max | 3 | Maximum GPU nodes |
| enable_vllm | true | Deploy the vLLM stack |
| hf_token | — | Hugging Face token for model pulls |
| grafana_admin_password | — | Admin password for observability stack |
| letsencrypt_email | info@example.com | Email for TLS certificates (example.com is banned) |
| gpu_vllm_helm_config | config/…gpu-tinyllama-light-ingress-nebius.tpl | Helm values file used for GPU deployment |

📓This is just a subset. For the full list of configurable variables, consult the configuration template : env-vars.template
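
You can also override any of these knobs at apply time without touching env-vars or terraform.tfvars, for example:

# Override individual variables on the command line
terraform apply \
  -var="gpu_platform=gpu-h100-sxm" \
  -var="gpu_node_max=2" \
  -var="letsencrypt_email=you@yourdomain.com"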

🧪 Quick Test

1️⃣ Router Endpoint and API URL

1.1 To reach the router endpoint through port forwarding, run the following commands:

# Case 1 : Port forwarding
kubectl -n vllm port-forward svc/vllm-gpu-router-service 30080:80
export vllm_api_url=http://localhost:30080/v1

1.2 Extracting the router URL via the NGINX ingress
The endpoint URL can be found in the vllm_api_url output:

# Case 2 : Extract from Terraform output 
export vllm_api_url=$(terraform output -raw vllm_api_url)
# Example output:
# https://vllm-api.a1b2c3d4.nip.io/v1


2️⃣ List models

# check models
curl -s ${vllm_api_url}/models | jq .


3️⃣ Completion (applicable to both ingress and port-forwarding URLs)

curl ${vllm_api_url}/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/tinyllama",
    "prompt": "Nebius is a",
    "max_tokens": 20,
    "temperature": 0
  }' | jq .choices[].text
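
The router also exposes the OpenAI-compatible chat completions endpoint. Whether it responds usefully depends on the served model shipping a chat template, so treat this as an optional extra check rather than part of the documented test flow:

# Chat completions against the same router endpoint
curl ${vllm_api_url}/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/tinyllama",
    "messages": [{"role": "user", "content": "What is Nebius AI Cloud?"}],
    "max_tokens": 40
  }' | jq -r '.choices[].message.content'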


4️⃣ vLLM model service

kubectl -n vllm get svc
NAME                                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                     AGE
vllm-gpu-router-service                 ClusterIP   10.96.174.35    <none>        80/TCP,9000/TCP             29m
vllm-gpu-tinyllama-gpu-engine-service   ClusterIP   10.96.226.142   <none>        80/TCP,55555/TCP,9999/TCP   29m
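
If the services are present but requests fail, the router and engine pods are the next place to look; substitute the actual pod name from the listing, since pod names are generated per release:

kubectl -n vllm get pods -o wide
# Tail the router logs (replace <router-pod-name> with the pod from the listing above)
kubectl -n vllm logs <router-pod-name> --tail=50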

🎯Troubleshooting:

Certificate Not Issuing

Debug when the certificate STATUS shows Pending or False:

# Check certificate status
kubectl describe certificate -n vllm
# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager --tail=100
# Check HTTP-01 challenge
kubectl get challenge -n vllm

  • Symptom
# Message:
Failed to create new order: acme: urn:ietf:params:acme:error:rateLimited: Error creating new order :: too many certificates already issued for: nip.io: see letsencrypt.org/docs/rate-limits
  • Fix: change nip.io to sslip.io in the ingress host of the vLLM Helm chart values (gpu-tinyllama-light-ingress-nebius.tpl)
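
Once a certificate is issued, you can confirm the endpoint serves a valid Let's Encrypt certificate straight from the shell (assumes openssl is installed locally):

# Extract the hostname from the Terraform output and inspect the served certificate
HOST=$(terraform output -raw vllm_api_url | sed -E 's#https://([^/]+).*#\1#')
echo | openssl s_client -connect "${HOST}:443" -servername "${HOST}" 2>/dev/null \
  | openssl x509 -noout -issuer -enddate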

Useful Nebius CLI Debugging Commands

# Check MK8s cluster status
nebius mk8s cluster list --parent-id <project-id>
nebius mk8s cluster get <cluster-id>

# List node groups
nebius mk8s node-group list --parent-id <cluster-id>

# Check GPU node group details
nebius mk8s node-group get <node-group-id> 

# View available GPU platforms
nebius compute platform list --parent-id <project-id>

# Get kubeconfig
nebius mk8s cluster get-credentials <cluster-id> --external  --kubeconfig <path>

Conclusion

After exploring the EKS, AKS, and GKE implementations of the vLLM production-stack, you’ve now successfully deployed a production-ready vLLM serving environment on Nebius AI Cloud! Congratulations 🎉

Are you a Cloud Provider not listed in this series?

We’d love to feature your platform! Reach out on LinkedIn to discuss how we can build and document your integration 🤗.

📚 Additional Resources


Run AI Your Way — In Your Cloud


Run AI assistants, RAG, or internal models on an AI backend 𝗽𝗿𝗶𝘃𝗮𝘁𝗲𝗹𝘆 𝗶𝗻 𝘆𝗼𝘂𝗿 𝗰𝗹𝗼𝘂𝗱 –
✅ No external APIs
✅ No vendor lock-in
✅ Total data control

𝗬𝗼𝘂𝗿 𝗶𝗻𝗳𝗿𝗮. 𝗬𝗼𝘂𝗿 𝗺𝗼𝗱𝗲𝗹𝘀. 𝗬𝗼𝘂𝗿 𝗿𝘂𝗹𝗲𝘀…

🙋🏻‍♀️If you like this content please subscribe to our blog newsletter ❤️.

👋🏻Want to chat about your challenges?
We’d love to hear from you! 
