
Intro
The vLLM Production Stack is designed to run on any Kubernetes-based infrastructure. After covering AWS, Azure, Google Cloud, and Nebius MK8s implementations, today we’re deploying the vLLM production-stack on CoreWeave Kubernetes (CKS) with the same Terraform framework.
CoreWeave is one of the hottest NeoClouds, built on the idea that GenAI workloads don’t need virtualization; they need direct access to hardware. Read more in our Inside CoreWeave blog.
This guide shows you how to automate the deployment of a production-ready vLLM serving environment on CoreWeave Cloud, with multi-model support, HTTPS endpoints, and full LLM observability (Grafana).
💡You can find our code in the CloudThrill repo ➡️ production-stack-terraform.
📂 Project Structure
./
├── cluster-tools.tf # Cert-manager, Traefik, Monitoring, Metrics-server
├── main.tf # CKS Cluster & NodePool logic
├── network.tf # CoreWeave VPC & IP Prefixes (Pod/Svc/LB CIDRs)
├── output.tf # Unified Stack Dashboard output
├── provider.tf # CoreWeave, Helm, & Kubectl provider config
├── variables.tf # Configurable knobs & defaults
├── vllm-production-stack.tf # vLLM Helm release & logic
├── env-vars.template # Environment variable boilerplate
├── terraform.tfvars.template # Terraform variable boilerplate
├── config/
│ ├── helm/
│ │ └── kube-prome-stack.yaml
│ ├── llm-stack/
│ │ └── helm/
│ │ └── gpu/
│ │ ├── gpu-gpt-oss-20-cw.tpl # Open Source models config
│ │ ├── gpu-gpt-qwn-gem-glm-cw.tpl # Qwen/Gemma/GLM variants
│ │ └── gpu-llama-light-ingress-cw.tpl
│ ├── manifests/
│ │ ├── audit-policy.yaml
│ │ ├── letsencrypt-issuer-prod.yaml # Let's Encrypt config
│ │ ├── letsencrypt-issuer-stage.yaml # Staging Let's Encrypt config
│ │ ├── nodepool-cpu.yaml # CPU NodePool CRD manifest
│ │ └── nodepool-gpu.yaml # GPU NodePool CRD manifest
│ ├── vllm-dashboard-oci.json # vLLM per model Inference Observability
│ └── vllm-dashboard.json # vLLM and KV-cache observability
└── README.md # ← You are here
🧰Prerequisites
Before you begin, ensure you have the following:
| Tool | Version | Notes |
|---|---|---|
| Terraform | ≥ 1.5.7 | tested on 1.9+ |
| CoreWeave CLI (cwic) | latest | Primary CLI for CKS interaction and Kubeconfig generation |
| CoreWeave provider | 0.10.1 | Native provider for CKS clusters and NodePool orchestration. |
| kubectl | ≥ 1.31 | ± 1 of control-plane version |
| jq | optional | JSON helper |
Follow the steps below to install the tools (expand) 👇🏼
# 1. Download and extract the binary
curl -fsSL https://github.com/coreweave/cwic/releases/latest/download/cwic_$(uname)_$(uname -m).tar.gz | tar zxf - cwic && mv cwic $HOME/.local/bin/
# Configure the CoreWeave CLI profile
# 2. Authenticate (Interactive)
cwic auth login
# OR Authenticate using a Token
cwic auth login <YOUR_TOKEN> --name "Production"
# 3. Verify Identity
cwic auth whoami

Learn more about cwic commands in our coreweave-blog-post.
What’s in the stack?📦
This Terraform stack delivers a production-ready vLLM inference platform on CoreWeave CKS with:
✅ GPU-native Kubernetes scheduling (Node Pools)
✅ Pre-baked GPU drivers: CoreWeave manages drivers + runtime
✅ Traefik ingress with automatic TLS (cert-manager + Let’s Encrypt)
✅ Shared persistent storage (VAST / RWX PVCs)
✅ Multi-model vLLM deployment (per-model GPU isolation)
✅ Prometheus + Grafana observability (vLLM dashboards included)
Architecture Overview

Deployment layers:
The stack provisions infrastructure in logical layers that adapt based on your hardware choice.
Example: gpt-oss-20b | Total end-to-end deployment takes approximately 40 minutes.
| Layer | Component | Deployment Time |
|---|---|---|
| Infrastructure | Custom VPC/Subnets (Pod/Service/LB CIDRs) + CKS control plane | ~4 min |
| Add-ons | cert-manager, Metrics Server, Traefik ingress, kube-prometheus-stack | ~12 min 57 s |
| CPU Node Pool (Bare metal) | 1 Intel/AMD bare-metal node | ~19 min |
| GPU Node Pool (Bare metal) | 1 (8×H100) bare-metal node, provisioned once the CPU node pool is stable | ~15 min |
| vLLM Production Stack | Model server + router + autoscaling layers | ~12 min |
| Total | End-to-end | ~40-44 min |
Model Options: There are three different vLLM deployment charts:
- gpu-gpt-oss-20-cw.tpl | GPT-OSS flagship LLM collection
- gpu-gpt-qwn-gem-glm-cw.tpl | Qwen / Gemma / GLM variants
- gpu-llama-light-ingress-cw.tpl | tiny-llama (lightweight ingress)
(For DeepSeek-V3, read 🐳 here.)
⚙️ Provisioning Highlights
- ✅ One-Click Deployment: 100% automated vLLM stack with zero manual intervention
- ✅ Zero Dependencies: No pre-existing cluster/kubeconfig required – built from scratch with just a user token
- ✅ Full Add-on Suite: Traefik, cert-manager, metrics-server, Let’s Encrypt Issuers, Prometheus, and Grafana
- ✅ Smart GPU Mapping: Friendly names (H100, B200) → actual CoreWeave instance IDs with AZ validation logic
- ✅ Production SSL: Auto-provisioned HTTPS for vLLM & Grafana via:
  - vLLM: https://<vllm_prefix>.<org_id>-<cluster_name>.coreweave.app
  - Grafana: https://grafana.<org_id>-<cluster_name>.coreweave.app

🏁🛑 Race Condition Guards
To avoid flaky Terraform applies, the stack includes explicit gates before moving to the next layer:
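For illustration only (the repo's actual gating code may differ): in Terraform, such a gate is commonly built with the hashicorp/time provider, making downstream add-ons depend on a time_sleep that itself depends on the node-pool manifest. The resource addresses below are hypothetical stand-ins.
# Hypothetical sketch of a layer gate, not the repo's exact code.
resource "time_sleep" "wait_for_cpu_pool" {
  depends_on      = [kubectl_manifest.nodepool_cpu] # gate on the CPU NodePool manifest
  create_duration = "120s"                          # let nodes register before add-ons start
}

resource "helm_release" "traefik" {
  name       = "traefik"
  namespace  = "traefik"
  repository = "https://traefik.github.io/charts"
  chart      = "traefik"
  depends_on = [time_sleep.wait_for_cpu_pool] # add-ons only deploy after the gate
}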
1. 🛜Networking Foundation
The stack creates a production-grade network topology:
| Feature | Configuration | Details |
|---|---|---|
| Load Balancer CIDR | 10.20.0.0/22 | Dedicated prefix for ingress endpoints. |
| Pod CIDR | 10.244.0.0/16 | Massive IP space for high-density GPU scaling. |
| Service CIDR | 10.96.0.0/16 | Internal cluster-IP orchestration. |
| CNI | Native Cilium CNI | eBPF-powered policy enforcement, Overlay (VXLAN), and Hubble flow observability. |
| Ingress | Traefik Controller | Exposed via CoreWeave Load Balancer. |
| SSL Certification | Let’s Encrypt ClusterIssuer | Issues TLS certificates for the vLLM and Grafana ingress endpoints. |
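Once deployed, you can sanity-check this layer with standard kubectl commands; the namespace, service, and issuer names below match the Terraform imports shown later in this guide.
# Traefik's LoadBalancer service should get an external IP from the 10.20.0.0/22 LB prefix
kubectl get svc -n traefik traefik -o wide
# The Let's Encrypt ClusterIssuer should report Ready=True
kubectl get clusterissuer letsencrypt-prod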
2. ☸️CKS Cluster
A control plane (v1.35) with two managed node pools. (On CKS, node pools can only be created via NodePool manifests.)
| Pool | Instance | Purpose |
|---|---|---|
| cpu-pool | cd-gp-i64-erapids (64 vCPU / 512 GiB) | Intel Emerald Rapids for core Kubernetes workloads |
| gpu-pool (based on quota) | gd-8xh100ib-i128 (8× H100 80 GB / 128 vCPU) | GPU inference workloads |
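For reference, here is a trimmed sketch of what config/manifests/nodepool-gpu.yaml might look like. The apiVersion and kind come from the Terraform import path used later in this guide; the spec field names are assumptions to verify against the CoreWeave NodePool CRD reference.
# Illustrative sketch: spec field names are assumptions, not copied from the repo.
apiVersion: compute.coreweave.com/v1alpha1
kind: NodePool
metadata:
  name: gpu-pool
spec:
  instanceType: gd-8xh100ib-i128   # 8x H100 bare-metal instance
  targetNodes: 1                   # matches the stack output: target=1
  minNodes: 1
  maxNodes: 2
  autoscaling: true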
3. 📦Essential Add-ons
Core CKS add-ons are pre-optimized for AI workloads. The stack deploys the following K8s add-ons (after the CPU node is ready):
| Category | Add-on | Notes |
|---|---|---|
| Ingress/LB | Traefik Ingress | Integrated with CoreWeave LB |
| Observability | kube-prometheus-stack / metrics-server | Includes GPU-specific DCGM metrics |
| Security | cert-manager | Let’s Encrypt HTTP-01 automation |
| GPU | Pre-baked NVIDIA drivers | No separate GPU operator required |
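Once the add-ons are up, you can confirm them quickly; the release names and namespaces below match the Terraform imports in the Troubleshooting section.
# All four add-on releases should show STATUS=deployed
helm list -A
# e.g., traefik (ns: traefik), cert-manager (ns: cert-manager),
#       metrics-server (ns: kube-system), kube-prometheus-stack (ns: kube-prometheus-stack)
kubectl get pods -n cert-manager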
4. 🧠vLLM Production Stack
The heart of the deployment is a production-ready model-serving layer:
✅ Model: (Default) Single GPT-OSS-20B model replica. Also available: trinity-mini, gemma-3, qwen3-next-80b.
✅ Load balancing: Round-robin router service across replicas.
✅ LMCache: enabled, with KV-cache CPU offloading.
✅ Storage: Init container with persistent model caching at `/data/models/` using VAST Data.
✅ Monitoring: Prometheus metrics with two Grafana vLLM dashboards for observability.
✅ HTTPS router endpoints: Automatic TLS with Let’s Encrypt certificates.
✅ vLLM Helm values: gpu-gpt-oss-20-cw.tpl. Find more here.
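For orientation, here is a heavily trimmed sketch of the kind of values a template like gpu-gpt-oss-20-cw.tpl feeds into the vLLM production-stack chart. The structure follows the upstream chart's servingEngineSpec/routerSpec layout, but treat the exact fields and numbers as illustrative rather than the repo's actual file.
# Illustrative values sketch: check the repo's .tpl for the real contents.
servingEngineSpec:
  modelSpec:
    - name: "gpt-oss-20b"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "openai/gpt-oss-20b"
      replicaCount: 1
      requestGPU: 1
      lmcacheConfig:
        enabled: true                  # KV-cache CPU offloading (see above)
        cpuOffloadingBufferSize: "20"  # CPU RAM buffer for offloaded KV cache
routerSpec:
  serviceType: ClusterIP               # exposed via the Traefik ingress + TLS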
🖥️ CoreWeave GPU Instance Types Available
From high-density Blackwell clusters to cost-optimized L40S inference nodes. View the full CoreWeave GPU Catalog.
Available GPU instances
| GPU Instance Model | GPU Count | VRAM (GB) | vCPUs | RAM (GB) | Price/h |
|---|---|---|---|---|---|
| NVIDIA GB300 NVL72 | 1 (Rack) | 20,736 | 2,592 | 18,432 | Contact Sales |
| NVIDIA GB200 NVL72 | 1 (Rack) | 13,824 | 2,592 | 18,432 | $42.00* |
| NVIDIA B200 SXM | 8 | 1,536 | 128 | 2,048 | $68.80 |
| RTX 6000 Blackwell | 8 | 768 | 128 | 1,024 | $20.00 |
| NVIDIA HGX H100 | 8 | 640 | 128 | 2,048 | $49.24 |
| NVIDIA HGX H200 | 8 | 1,128 | 128 | 2,048 | $50.44 |
| NVIDIA GH200 | 1 | 96 | 72 | 480 | $6.50 |
| NVIDIA L40 | 8 | 384 | 128 | 1,024 | $10.00 |
| NVIDIA L40S | 8 | 384 | 128 | 1,024 | $18.00 |
| NVIDIA A100 (80GB) | 8 | 640 | 128 | 2,048 | $21.60 |
*Estimated entry price for reserved capacity.
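The "Smart GPU Mapping" highlighted earlier translates these friendly names into CoreWeave instance IDs. Conceptually it can be pictured as the Terraform map below; only the H100 entry is confirmed by this guide's output, the rest are placeholders to look up in the catalog.
# Conceptual sketch: only the H100 mapping is taken from this guide's output.
locals {
  gpu_instance_map = {
    "H100" = "gd-8xh100ib-i128" # 8x H100 InfiniBand, 128 vCPU
    # "A100" = "<instance-id>"  # look up in the CoreWeave GPU catalog
    # "L40S" = "<instance-id>"
  }
  gpu_instance_id = local.gpu_instance_map[var.gpu_instance_type]
}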
🛠️ Configuration Knobs
The stack supports more than 20 configurable options across networking, node pools, observability, and vLLM tuning.
| Variable | Default | Description |
|---|---|---|
| cw_token | — (required) | CoreWeave API token |
| org_id | — (required) | CoreWeave Organization ID (cwic auth whoami) |
| region | US-EAST-06 | Deployment region (e.g., US-EAST-06, ORD1) |
| zone | US-EAST-06A | Specific CoreWeave Availability Zone |
| cluster_name | vllm-cw-prod | Name of the CKS managed cluster |
| k8s_version | 1.34 | Kubernetes version (e.g., 1.34, 1.35) |
| enable_nodepool_gpu | true | Enable/disable the external GPU node pool |
| public_endpoint | true | Enable/disable external API access |
| cpu_instance_id | cd-gp-i64-erapids | Bare-metal CPU type (e.g., Turin, Emerald Rapids) |
| gpu_instance_type | H100 | Bare-metal GPU type (H100, A100, L40S, etc.) |
| enable_vllm | true | Deploy the vLLM engine and request router |
| hf_token | «secret» | Hugging Face token for model downloads |
| grafana_admin_password | admin1234 | Temporary admin password for monitoring dashboards (change it) |
| letsencrypt_email | — (required) | Email for SSL/TLS certificate registration |
📋 Complete Configuration Options
The full list of variables can be found here. There are two ways to customize their values:
- Environment variables: env-vars.template
- Terraform variables: terraform.tfvars.template
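For example, a minimal terraform.tfvars built from the variables documented above might look like this (all values are placeholders):
# Minimal example: placeholder values only.
cw_token          = "<YOUR_TOKEN>"
org_id            = "<YOUR_ORG_ID>"
region            = "US-EAST-06"
zone              = "US-EAST-06A"
cluster_name      = "vllm-cw-prod"
gpu_instance_type = "H100"
hf_token          = "<HF_TOKEN>"
letsencrypt_email = "you@example.com"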
🔵 Quick start
1️⃣Clone the repository
The vLLM CoreWeave CKS template is located in the official vllm-production-stack repo or the CloudThrill repo:
- Navigate to the production-stack-terraform directory, then into the CoreWeave tutorial folder:
$ git clone https://github.com/CloudThrill/vllm-production-stack-terraform
$ cd vllm-production-stack-terraform/coreweave/cks-vllm

2️⃣ Set Up Environment Variables
Use an env-vars file to export your TF_VAR_* variables, or use terraform.tfvars. Replace the placeholders with your values:
# Copy and customize
$ cp env-vars.template env-vars
$ vi env-vars
$ source env-vars

Usage examples
- Option 1: Through Environment Variables
################################################################################
# 🔐 CORE PROVIDER CREDENTIALS AND REGION
################################################################################
export TF_VAR_cw_token="<YOUR_TOKEN>" # (required) - CoreWeave API token
export TF_VAR_org_id="<YOUR_ORG_ID>" # (required) - CoreWeave Organization ID
export TF_VAR_region="US-EAST-06" # Deployment region
export TF_VAR_zone="US-EAST-06A" # Specific availability zone
################################################################################
# 🧠 vLLM Inference Configuration
################################################################################
export TF_VAR_cluster_name="vllm-cw-prod" # default: "vllm-cw-prod"
export TF_VAR_enable_vllm="true"
export TF_VAR_vllm_host_prefix="vllm"
export TF_VAR_hf_token="<HF_TOKEN>" # Hugging Face token (sensitive)
export TF_VAR_gpu_vllm_helm_config="config/llm-stack/helm/gpu/gpu-gpt-oss-20-cw.tpl"
################################################################################
# ⚙️ GPU / Nodegroup Settings
################################################################################
export TF_VAR_enable_nodepool_gpu="true"
export TF_VAR_gpu_instance_type="H100" # GPU platform (H100, L40S, A100)
export TF_VAR_cpu_instance_id="cd-gp-i64-erapids" # Bare-metal CPU pool
export TF_VAR_gpu_node_target="1" # Number of GPU nodes
$ source env-vars

- Option 2: Through Terraform Variables
# Copy and customize
$ cp terraform.tfvars.template terraform.tfvars
$ vim terraform.tfvars

3️⃣ Run Terraform deployment:
You can now safely run terraform plan & apply. This deploys 19 resources in total, including a local kubeconfig.
terraform init
terraform plan
terraform apply

🔵 Full terraform output (vllm_host_prefix="vllm" and cluster_name=vllm-cw-prod)
Apply complete! Resources: 19 added, 0 changed, 0 destroyed.
Outputs:
vllm_stack_summary = <<EOT
✅ CoreWeave CKS cluster deployed successfully!
🚀 VLLM PRODUCTION STACK ON COREWEAVE 🚀
-----------------------------------------------------------
ORG ID : myorg
CLUSTER : vllm-cw-prod
ENDPOINT : https://<myorg>-2160f14f.k8s.us-east-06a.coreweave.com
VPC : vllm-vpc (US-EAST-06A)
NETWORKING : lb-cidr: 10.20.0.0/22 | pod-cidr: 10.244.0.0/16 | service-cidr: 10.96.0.0/16
🖥️ NODEPOOL INFRASTRUCTURE
-----------------------------------------------------------
CPU POOL [cd-gp-i64-erapids] : cpu-pool
GPU POOL [gd-8xh100ib-i128] : gpu-pool
CPU SCALING : [target=1, min=1, max=2, autoscaling=true]
GPU SCALING : [target=1, min=1, max=2, autoscaling=true]
VLLM CONFIG : ./config/llm-stack/helm/gpu/gpu-gpt-oss-20-cw.tpl
🌐 ACCESS ENDPOINTS
-----------------------------------------------------------
VLLM API : https://vllm.<myorg>-vllm-cw-prod.coreweave.app/v1 <<👈🏻------- * vllm_host_prefix="vllm"
GRAFANA : https://grafana.<myorg>-vllm-cw-prod.coreweave.app
🛠️ QUICK START COMMANDS
-----------------------------------------------------------
1. Set Context : export KUBECONFIG="./kubeconfig"
2. Test Model : curl -k "https://vllm.<myorg>-vllm-cw-prod.coreweave.app/v1/models"
EOT
- After the deployment, you should be able to interact with the cluster using kubectl:
export KUBECONFIG=$PWD/kubeconfig

4️⃣🧪 Test
1️⃣ Router Endpoint and API URL
1.1 The deployment provides the following vLLM API endpoint (vllm_api_url output):
terraform output vllm_stack_summary | grep "VLLM API"
https://vllm.<myorg>-<myclustername>.coreweave.app/v1
1.2 Set it as an environment variable:
export VLLM_API_URL="https://vllm.<myorg>-vllm-cw-prod.coreweave.app/v1"
2️⃣ List models
# check models
curl -k "${VLLM_API_URL}/models" | jq .

Example output:
{
"object": "list",
"data": [
{"id": "gpt-oss-120b", "object": "model", "created": 1739145600},
{"id": "gemma-3-27b-vision", "object": "model", "created": 1739145600},
{"id": "qwen3-next-80b", "object": "model", "created": 1739145600},
{"id": "trinity-mini", "object": "model", "created": 1739145600}
]
}
3️⃣ Completion request for GPT-OSS-120B (Reasoning):
curl -k "${VLLM_API_URL}/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-oss-120b",
"prompt": "Toronto is a",
"max_tokens": 50,
"temperature": 0.7
}' | jq .choices[].text
# Output:
"city that is known for its vibrant nightlife, and there are plenty of bars and clubs"

Trinity Mini (Agentic/Tool Calling):
curl -k "${VLLM_API_URL}/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "trinity-mini",
"messages": [{"role": "user", "content": "Explain tensor parallelism in 2 sentences"}],
"max_tokens": 100,
"temperature": 0.7
}' | jq .choices[].message.content
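Because the router exposes an OpenAI-compatible API, streaming also works out of the box; for example, reusing the model above:
# Stream tokens as server-sent events (OpenAI-compatible "stream": true)
curl -k -N "${VLLM_API_URL}/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "trinity-mini",
    "messages": [{"role": "user", "content": "Say hi in one word"}],
    "stream": true
  }'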
4️⃣ Browser WebUI (Optional)
You can also use the Page Assist Chrome extension as a lightweight chat UI to test your vLLM endpoint.

Setup: Settings → OpenAI Compatible API → Add provider → Base URL: https://vllm.<myorg>-vllm-cw-prod.coreweave.app/v1

5️⃣ 🔬 Observability (Grafana)
You can access the Grafana dashboards using the grafana_url output (see the example below):
terraform output vllm_stack_summary | grep "GRAFANA"
https://grafana.<myorg>-vllm-cw-prod.coreweave.app
- Username: admin
- Password: use the command below
kubectl get secret -n kube-prometheus-stack kube-prometheus-stack-grafana -o jsonpath={.data.admin-password} | base64 -d
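If DNS or the ingress is not ready yet, you can reach Grafana through a port-forward instead; the service name below assumes the standard kube-prometheus-stack chart naming.
# Assumes the default kube-prometheus-stack Grafana service name
kubectl port-forward -n kube-prometheus-stack svc/kube-prometheus-stack-grafana 3000:80
# then browse to http://localhost:3000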
Automatic vLLM Dashboards
1. vLLM/LMCache dashboard:
Latency, TTFT, ITL, QPS, and serving-engine load (KV-cache usage) across all vLLM instances.

2. Model-based inference dashboard: this one is even more granular.
Per model => GPU utilization / throughput (tokens/sec) / prefill vs. decode metrics / prompt and output lengths.

6️⃣Destroying the Infrastructure 🚧
To delete everything, just run:
terraform destroy -auto-approve
# Destroy complete! Resources: 19 destroyed.

🎯 Troubleshooting
1. Let’s Encrypt rate limit
Let’s Encrypt limits duplicate certificate issuance to 5 per week for the exact same set of domain names.
If you hit the limit after repeated deployments, change the vLLM hostname prefix:
export TF_VAR_vllm_host_prefix="vllm-random" # CHANGE ME
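Before rotating the prefix, you can inspect the certificate state with cert-manager's own resources (the issuer name matches the manifest in config/manifests/):
# Check certificate and ACME issuer status (cert-manager CRDs)
kubectl get certificate -A
kubectl describe clusterissuer letsencrypt-prod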
2. Terraform resources in use after a second apply
In rare cases, terraform refresh misses a resource and Terraform tries to recreate it during the second apply.
You may need to re-import the resource into the state file.
Fix => run terraform import to re-fetch the resources:
# 🌐 Import resource examples
# Import GPU nodepool
terraform import 'kubectl_manifest.nodepool_gpu["gpu"]' "compute.coreweave.com/v1alpha1//NodePool//gpu-pool//"
##### networking
terraform import 'helm_release.traefik' traefik/traefik
terraform import 'kubectl_manifest.letsencrypt_issuer["letsencrypt"]' 'cert-manager.io/v1//ClusterIssuer//letsencrypt-prod//'
## cluster addons
terraform import 'helm_release.metrics_server["metrics_server"]' kube-system/metrics-server
terraform import 'helm_release.kube_prometheus_stack["kube_prometheus_stack"]' kube-prometheus-stack/kube-prometheus-stack
terraform import 'helm_release.cert_manager["cert-manager"]' cert-manager/cert-manager
#### vLLM
terraform import 'kubectl_manifest.vllm_service_monitor["vllm_monitor"]' "monitoring.coreos.com/v1//ServiceMonitor//vllm-monitor//"
#### vLLM dashboards
terraform import 'kubernetes_config_map.vllm_dashboard["vllm_dashboard"]' kube-prometheus-stack/vllm-dashboard
terraform import 'kubernetes_config_map.vllm_dashboard["vllm_dashboard_oci"]' kube-prometheus-stack/vllm-dashboard-ociUseful CoreWeave CLI Debugging Commands
You can explore the full list of cwic commands on our coreweave-blog-post
# List/describe nodes
cwic nodepool list
NAME INSTANCE TYPE TARGET QUEUED INPROGRESS CURRENT PENDING CONFIG STAGED NODES REQUIRING
cpu-pool cd-gp-i64-erapids 1 0 0 1 false false 0/1
gpu-pool gd-8xh100ib-i128 1 0 0 1 false false 0/1
### list nodes from gpu-pool nodegroup
cwic nodepool node get gpu-pool
NAME IP TYPE RESERVED NODEPOOL READY ACTIVE VERSION STATE
x55a80i 10.x.x.x gd-8xh100ib-i128 xx gpu-pool true true 2.31.0 production
# cwic node <action> <node-name>
cwic node get <node-name>
cwic node describe <node-name>
# Example
cwic node describe x55a80i

Conclusion
You’ve now successfully deployed a production-ready vLLM serving environment on CoreWeave Cloud in a single Terraform apply. The deployment enforces deterministic sequencing, GPU isolation, and full observability, making it ready for real workloads, not just demos. Next, we will share another deployment walkthrough: DeepSeek-V3.2 distributed inference across multiple nodes. Stay tuned!
Note
We would like to extend our appreciation for the support we received throughout this effort.
📚 Additional Resources
- vLLM Documentation
- vLLM Production stack documentation
- vLLM Production Stack(repo)
- CoreWeave CKS Docs
- CoreWeave terraform provider
🙋🏻♀️ If you like this content, please subscribe to our blog newsletter ❤️.