
Intro
Welcome back to the Terraform vLLM Production Stack series! After covering AWS EKS and Azure AKS, today we’re deploying the vLLM production-stack on Google Cloud GKE with the same Terraform approach.
This guide shows you how to deploy a production-ready LLM serving environment on Google Cloud, with GCP-specific optimizations including Dataplane V2 (Cilium eBPF), VPC-native networking, and pre-installed GPU drivers.
We’ll cover network architecture, certificate automation (using Let’s Encrypt), GPU provisioning, and comprehensive observability for both CPU and GPU inference, all using Infrastructure as Code.
💡You can find our code in the CloudThrill repo ➡️ production-stack-terraform.
📂 Project Structure
./
├── main.tf # Project + GKE cluster
├── network.tf # VPC, subnet, NAT, reserved IPs
├── provider.tf # GCP + Helm + kubectl providers
├── variables.tf # All input variables
├── output.tf # HTTPS endpoints, kubeconfig
├── cluster-tools.tf # cert-manager, prometheus, GPU operator
├── datasources.tf # Ingress data sources
├── vllm-production-stack.tf # vLLM Helm release, BackendConfig, dashboards
├── env-vars.template # Quick env exporter
├── terraform.tfvars.template # Same as HCL
├── modules/
│ ├── private-cluster-update-variant/ # GKE cluster (upstream)
│ │ ├── cluster.tf
│ │ ├── dns.tf
│ │ ├── firewall.tf
│ │ ├── main.tf
│ │ ├── networks.tf
│ │ ├── outputs.tf
│ │ ├── variables.tf
│ │ └── versions.tf
│ ├── google-network/ # VPC + subnet + NAT (upstream)
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── outputs.tf
│ │ ├── versions.tf
│ │ └── modules/
│ │ ├── vpc/
│ │ ├── subnets/
│ │ ├── routes/
│ │ └── firewall-rules/
│ ├── google-project-factory/ # Optional new project (upstream)
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── outputs.tf
│ │ └── modules/
│ │ ├── core_project_factory/
│ │ └── project_services/
│ └── llm-stack/ # vLLM Helm value templates
│ └── helm/
│ ├── cpu/
│ │ └── cpu-tinyllama-light-ingress-gcp.tpl
│ └── gpu/
│ ├── gpu-operator-values.yaml
│ └── gpu-tinyllama-light-ingress-gcp.tpl
├── config/
│ ├── helm/
│ │ └── kube-prome-stack.yaml # Grafana + Prometheus values
│ ├── manifests/
│ │ └── letsencrypt-issuer.yaml # Let's Encrypt ClusterIssuer
│ ├── kubeconfig.tpl # Local kubeconfig
│ └── vllm-dashboard.json # Grafana vLLM dashboard
└── README.md # ← you are here
🧰Prerequisites
Before you begin, ensure you have the following:
| Tool | Version | Notes |
|---|---|---|
| Terraform | ≥ 1.3 | Tested on 1.10+ |
| gcloud CLI | ≥ 450.0 | For authentication, tested on 529.0 |
| kubectl | ≥ 1.9.x | Kubernetes CLI |
| Terraform Provider for GCP | ≥ 6.41+ | hashicorp/google provider |
| jq | optional | JSON helper |
Follow the steps below to install the tools (expand)👇🏼
# Install tools
sudo apt update && sudo apt install -y jq curl unzip gpg
# Terraform
wget -qO- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install -y terraform
# Google Cloud SDK
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init
# kubectl
curl -sLO "https://dl.k8s.io/release/$(curl -Ls https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install kubectl /usr/local/bin/ && rm kubectl
- Configure Google Cloud
# Login to GCP
gcloud auth login
gcloud auth application-default login
# Set project
gcloud config set project YOUR_PROJECT_ID
# Enable required APIs
gcloud services enable container.googleapis.com compute.googleapis.com servicenetworking.googleapis.com
# Verify
gcloud config list
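If you prefer to keep API enablement in code (the upstream project-factory module does this when create_project = true), a minimal Terraform sketch could look like the following; the resource and local names are illustrative, not the repo’s actual ones:

```hcl
# Illustrative sketch only: enable the required GCP APIs from Terraform instead of gcloud.
# The API list mirrors the gcloud command above; resource and local names are assumptions.
locals {
  required_apis = [
    "container.googleapis.com",
    "compute.googleapis.com",
    "servicenetworking.googleapis.com",
  ]
}

resource "google_project_service" "required" {
  for_each = toset(local.required_apis)

  project            = var.project_id
  service            = each.value
  disable_on_destroy = false # keep the APIs enabled when the stack is destroyed
}
```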
What’s in the stack?📦
This Terraform stack delivers a production-ready vLLM serving environment on Google Cloud GKE, supporting both CPU and GPU inference with operational best practices embedded in the Google Terraform Kubernetes Engine Modules.
It’s designed for real-world production workloads with:
✅ Enterprise-grade infrastructure: Built on the private-cluster-update-variant module.
✅ Flexible compute: Switch between CPU and GPU inference with a single flag.
✅ Operational excellence: Prometheus, Grafana and Dataplane V2 (Cilium).
✅ Scalability: Cluster-autoscaler and Pre-installed GPU drivers.
✅ Secure endpoints: HTTPS model serving through GKE Native Ingress + Google Cloud Load Balancer + Let’s Encrypt
Architecture Overview

Deployment layers – The stack provisions infrastructure in logical layers that adapt based on your hardware choice:
| Layer | Component | CPU Mode | GPU Mode |
|---|---|---|---|
| Infrastructure | VPC + GKE + Dataplane V2 | Always deployed | Always deployed |
| Add-ons | Persistent Disk CSI, cert-manager, Prometheus | Always deployed | Always deployed |
| vLLM Stack | Secrets + Helm chart | Deploy on CPU nodes | + GPU nodes (drivers pre-installed) |
| Networking | Reserved IPs + Ingress + TLS + Let’s Encrypt | GKE native ingress + cert-manager | GKE native ingress + cert-manager |
1. Networking Foundation
The stack creates a production-grade network topology (a minimal subnet sketch follows the list):
- Custom VPC with 3 IP ranges (nodes, pods, services)
- VPC-native networking with Dataplane V2 (Cilium eBPF)
- Private GKE cluster with Cloud NAT for outbound traffic
- Automated SSL/TLS certificates via cert-manager + Let’s Encrypt
- Network security with Dataplane V2 network policies
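For illustration, here is a minimal sketch of what the VPC-native layout looks like in plain Terraform; in the repo this is driven by the google-network module in network.tf, so the names and CIDRs below are examples only:

```hcl
# Illustrative sketch of the VPC-native layout; the repo drives this through the
# google-network module in network.tf, so names and CIDRs below are examples only.
resource "google_compute_network" "vpc" {
  name                    = "vllm-vpc"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "gke" {
  name          = "vllm-gke-subnet"
  network       = google_compute_network.vpc.id
  region        = var.region
  ip_cidr_range = "10.10.0.0/20" # primary range for the nodes

  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.244.0.0/16" # matches the pod_cidr variable default
  }

  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.32.0.0/20"
  }

  private_ip_google_access = true # private nodes can still reach Google APIs
}
```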
2. GKE Cluster
Private cluster v1.30 (zonal or regional) with managed, auto-scaling node pools (a simplified sketch of the pool definitions follows the table):
| Pool | Machine Type | Purpose |
|---|---|---|
| cpu-pool | n2-standard-4 (4 vCPU / 16 GiB) | System services & CPU inference |
| gpu-pool (optional) | g2-standard-4 + L4 (1 × NVIDIA L4) | GPU inference |
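As a reference, here is a simplified sketch of the two pools expressed as standalone google_container_node_pool resources; the repo actually declares them through the upstream module’s node_pools input, so resource names and sizes below are illustrative:

```hcl
# Simplified sketch of the two node pools; in the repo they are declared through the
# upstream module's node_pools input, so resource names and sizes here are illustrative.
resource "google_container_node_pool" "cpu" {
  name    = "cpu-pool"
  cluster = google_container_cluster.vllm.id # assumes a cluster resource defined elsewhere

  autoscaling {
    min_node_count = 1
    max_node_count = 3
  }

  node_config {
    machine_type = "n2-standard-4" # 4 vCPU / 16 GiB
  }
}

resource "google_container_node_pool" "gpu" {
  count   = var.inference_hardware == "gpu" ? 1 : 0 # only created in GPU mode
  name    = "gpu-pool"
  cluster = google_container_cluster.vllm.id

  autoscaling {
    min_node_count = 0
    max_node_count = 2
  }

  node_config {
    machine_type = "g2-standard-4"

    guest_accelerator {
      type  = "nvidia-l4"
      count = 1

      gpu_driver_installation_config {
        gpu_driver_version = "LATEST" # GKE pre-installs the NVIDIA driver
      }
    }
  }
}
```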
3. Essential Add-ons
Core GKE add-ons via terraform-google-kubernetes-engine private-cluster-update-variant (a cert-manager sketch follows the table):
| Category | Add-on |
|---|---|
| CNI | VPC-native networking with Dataplane V2 (Cilium eBPF) |
| Storage | Persistent Disk CSI (block, pre-installed) · Filestore CSI (shared, optional) |
| Ingress | GKE Native Ingress with reserved external IPs |
| SSL/TLS | cert-manager + Let’s Encrypt ClusterIssuer |
| Core | CoreDNS, kube-proxy (eBPF mode), Metrics Server (built-in) |
| Observability | kube-prometheus-stack, Grafana |
| GPU (optional) | NVIDIA GPU drivers (pre-installed by GKE via gpu_driver_version) |
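As a rough sketch of the SSL/TLS layer (assuming the Helm and kubectl providers declared in provider.tf), cert-manager plus the Let’s Encrypt ClusterIssuer could be wired like this; the resource names and the templated manifest variable are assumptions, not the repo’s exact code:

```hcl
# Rough sketch of the SSL/TLS layer using the Helm and kubectl providers from provider.tf;
# resource names and the templated manifest variable are assumptions, not the repo's exact code.
resource "helm_release" "cert_manager" {
  name             = "cert-manager"
  repository       = "https://charts.jetstack.io"
  chart            = "cert-manager"
  namespace        = "cert-manager"
  create_namespace = true

  set {
    name  = "installCRDs"
    value = "true"
  }
}

resource "kubectl_manifest" "letsencrypt_issuer" {
  # Renders config/manifests/letsencrypt-issuer.yaml with the letsencrypt_email variable.
  yaml_body = templatefile("${path.module}/config/manifests/letsencrypt-issuer.yaml", {
    letsencrypt_email = var.letsencrypt_email
  })

  depends_on = [helm_release.cert_manager]
}
```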
4. vLLM Production Stack
The heart of the deployment is production-ready model serving (a minimal Helm-release sketch follows the checklist):
✅ Model: TinyLlama-1.1B (default, fully customizable)
✅ Load balancing: Round-robin router service across replicas
✅ Secrets: Hugging Face token stored as Kubernetes Secret
✅ Storage: Init container with persistent model caching at `/data/models/`
✅ Monitoring: Prometheus metrics endpoint for observability
✅ HTTPS endpoints: Automated TLS via Let’s Encrypt with reserved IPs
✅ Default Helm charts: cpu-tinyllama-light-ingress-gcp | gpu-tinyllama-light-ingress-gcp
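Here is a minimal sketch of how this layer fits together, assuming the upstream vLLM production-stack Helm chart and the value templates listed above; resource names and the template variable are illustrative:

```hcl
# Hedged sketch of the vLLM layer: HF token Secret plus the production-stack Helm release
# rendered from the CPU or GPU values template; names and the template variable are illustrative.
resource "kubernetes_secret" "hf_token" {
  metadata {
    name      = "huggingface-credentials"
    namespace = "vllm" # assumes the vllm namespace already exists
  }

  data = {
    HF_TOKEN = var.hf_token
  }
}

resource "helm_release" "vllm_stack" {
  count      = var.enable_vllm ? 1 : 0
  name       = "vllm"
  repository = "https://vllm-project.github.io/production-stack" # upstream chart repo (assumed)
  chart      = "vllm-stack"
  namespace  = "vllm"

  values = [
    templatefile(
      var.inference_hardware == "gpu" ? var.gpu_vllm_helm_config : var.cpu_vllm_helm_config,
      { hf_token_secret = kubernetes_secret.hf_token.metadata[0].name } # hypothetical template var
    )
  ]
}
```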
5. Hardware Flexibility: CPU vs GPU
You can deploy the vLLM production stack on either CPU or GPU using the inference_hardware parameter (a sketch of how the flag is wired follows the example):
# Deploy on CPU (default)
export TF_VAR_inference_hardware=cpu
# Or deploy on GPU
export TF_VAR_inference_hardware=gpu
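Under the hood, the flag simply gates GPU-only resources and picks the matching Helm values template. A hedged sketch of that wiring (variable validation and local names are assumptions):

```hcl
# Illustrative sketch of how the flag is validated and consumed (local names are assumptions).
variable "inference_hardware" {
  type    = string
  default = "cpu"

  validation {
    condition     = contains(["cpu", "gpu"], var.inference_hardware)
    error_message = "inference_hardware must be \"cpu\" or \"gpu\"."
  }
}

locals {
  gpu_enabled     = var.inference_hardware == "gpu" # gates the GPU node pool
  vllm_values_tpl = local.gpu_enabled ? var.gpu_vllm_helm_config : var.cpu_vllm_helm_config
}
```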
🖥️ GKE GPU Instance Types Available
Available GPU instances (L4 · T4 · V100 · A100)
| Machine Type + GPU | vCPUs | Memory (GiB) | GPUs | GPU Memory (GiB) | Best For |
|---|---|---|---|---|---|
| NVIDIA L4 | | | | | |
| g2-standard-4 + L4 | 4 | 16 | 1 | 24 | Cost-effective inference |
| g2-standard-8 + L4 | 8 | 32 | 1 | 24 | Medium inference |
| g2-standard-16 + 2×L4 | 16 | 64 | 2 | 48 | Multi-GPU inference |
| NVIDIA Tesla T4 | | | | | |
| n1-standard-4 + T4 | 4 | 15 | 1 | 16 | Legacy inference workloads |
| n1-standard-8 + T4 | 8 | 30 | 1 | 16 | Medium inference |
| NVIDIA Tesla V100 | | | | | |
| n1-standard-8 + V100 | 8 | 30 | 1 | 16 | Training & inference |
| n1-standard-16 + 2×V100 | 16 | 60 | 2 | 32 | Multi-GPU training |
| NVIDIA A100 (40GB) | | | | | |
| a2-highgpu-1g | 12 | 85 | 1 | 40 | High-performance inference |
| a2-highgpu-2g | 24 | 170 | 2 | 80 | Multi-GPU inference |
| a2-highgpu-4g | 48 | 340 | 4 | 160 | Large-scale training |
Getting started
The deployment automatically provisions only the required infrastructure based on your hardware selection.
| Phase | Component | Action | Condition |
|---|---|---|---|
| 1. Project | GCP Project | Create new project (optional) | create_project = true |
| | API Services | Enable required GCP APIs | Always |
| 2. Infrastructure | VPC | Create VPC with subnets + secondary ranges | Always |
| | NAT Gateway | Configure Cloud NAT for private nodes | Always |
| | Reserved IPs | Create external IPs for ingress | Always |
| | GKE | Deploy v1.30 private cluster + CPU node pool | Always |
| | Dataplane V2 | Enable eBPF-based networking (Cilium) | Always |
| 3. SSL/TLS | cert-manager | Install cert-manager | Always |
| | ClusterIssuer | Create Let’s Encrypt ClusterIssuer | Always |
| 4. vLLM Stack | HF secret | Create Hugging Face token secret | enable_vllm = true |
| | CPU Deployment | Deploy vLLM on CPU nodes | inference_hardware = "cpu" |
| | GPU Infrastructure | Provision GPU node pool | inference_hardware = "gpu" |
| | GPU Drivers | Pre-installed by GKE | inference_hardware = "gpu" |
| | Helm chart | Deploy TinyLlama-1.1B with HTTPS ingress | enable_vllm = true |
| 5. Observability | Prometheus/Grafana | Deploy stack + vLLM dashboard | Always |
🔵 Deployment Steps
1️⃣Clone the repository
The vLLM GKE deployment build is located under the vllm-production-stack-terraform/gke directory:
- Clone the production-stack-terraform repository and navigate to the Terraform GKE tutorial folder
$ git clone https://github.com/CloudThrill/vllm-production-stack-terraform
📂..
$ cd vllm-production-stack-terraform/gke/
2️⃣ Set Up Environment Variables
Use an env-vars file to export your TF_VARs, or use terraform.tfvars. Replace placeholders with your values:
cp env-vars.template env-vars
vim env-vars
# Set HF token and customize deployment options
source env-vars
Usage examples
- Option 1: Through Environment Variables
# Copy and customize
cp env-vars.template env-vars
vi env-vars
################################################################################
# GCP Credentials & Location
################################################################################
export TF_VAR_project_id="your-gcp-project-id" # ← your GCP project ID
export TF_VAR_region="us-central1" # GCP region
################################################################################
# Project & Network Configuration
################################################################################
export TF_VAR_create_project="false" # Use existing project
export TF_VAR_create_vpc="true" # Create new VPC
################################################################################
# LLM Inference
################################################################################
export TF_VAR_enable_vllm=true # Deploy vLLM stack
export TF_VAR_hf_token="hf_your_token_here" # ← Hugging Face token
export TF_VAR_inference_hardware="gpu" # "cpu" or "gpu"
export TF_VAR_letsencrypt_email="your-email@email.com" # Change me
################################################################################
# Storage & Observability
################################################################################
export TF_VAR_enable_disk_csi_driver="true" # Persistent Disk CSI
export TF_VAR_enable_file_csi_driver="false" # Filestore CSI (optional)
export TF_VAR_grafana_admin_password="admin1234" # Grafana admin password
################################################################################
# Paths to Helm chart values templates (relative to repo root)
################################################################################
export TF_VAR_cpu_vllm_helm_config="modules/llm-stack/helm/cpu/cpu-tinyllama-light-ingress-gcp.tpl"
export TF_VAR_gpu_vllm_helm_config="modules/llm-stack/helm/gpu/gpu-tinyllama-light-ingress-gcp.tpl"
# Load vars
source env-vars
- Option 2: Through Terraform Variables (a terraform.tfvars sketch follows)
# Copy and customize
$ cp terraform.tfvars.template terraform.tfvars
$ vim terraform.tfvars
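A terraform.tfvars sketch mirroring the env-vars example above (all values are placeholders):

```hcl
# terraform.tfvars sketch mirroring the env-vars example above (all values are placeholders)
project_id         = "your-gcp-project-id"
region             = "us-central1"
create_project     = false
create_vpc         = true

enable_vllm        = true
hf_token           = "hf_your_token_here"
inference_hardware = "gpu" # "cpu" or "gpu"
letsencrypt_email  = "your-email@email.com"

enable_disk_csi_driver = true
enable_file_csi_driver = false
grafana_admin_password = "admin1234"
```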
- Load the variables into your shell: before running Terraform, source the env-vars file:
$ source env-vars
3️⃣ Make sure your gcloud authentication is fresh:
Run the command below if Google’s “Re-authentication for privileged access” (RAPT) session has expired:
gcloud auth application-default login --no-launch-browser
4️⃣ Run Terraform deployment:
You can now safely run terraform plan and apply. The deployment creates 42 resources in total, including a local kubeconfig.
terraform init
terraform plan
terraform apply
After the deployment, you should be able to interact with the cluster using kubectl:
export KUBECONFIG=$PWD/kubeconfig
Full Plan
Apply complete! Resources: 42 added, 0 changed, 0 destroyed.
Outputs:
Stack_Info = "Built with ❤️ by @Cloudthrill"
LB-IP-Address = "136.110.184.34"
gke_cluster_id = "xx"
gke_deployment_info = <sensitive>
gke_endpoint = <sensitive>
gke_location = "us-east1-c"
gke_master_version = "1.32.9-gke.1207000"
gke_name = "vllm-gke"
gke_region = "us-east1"
gpu_driver_status = tomap({
"deployed" = "true"
"method" = "GKE automatic installation"
"reason" = "GKE handles GPU drivers automatically via gpu_driver_version parameter"
"version" = "LATEST"
})
grafana_url = "https://grafana.886eb822.sslip.io"
vllm_api_url = "https://vllm-api.22241bb6.sslip.io/v1"
5️⃣ Observability (Grafana Login)
You can access the Grafana dashboards using the grafana_url output or port forwarding (e.g. http://localhost:3000).
# Case 1: Get the Grafana HTTPS URL (already printed by Terraform), e.g. https://grafana.xxxxx.sslip.io
terraform output -raw grafana_url
# Case 2: Or port-forward
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-prometheus-stack
- Run the command below to fetch the password:
kubectl get secret -n kube-prometheus-stack kube-prometheus-stack-grafana -o jsonpath={.data.admin-password} | base64 -d
- Username: admin
- Password: output of the kubectl command above
Automatic vLLM Dashboard
In this stack, the vLLM dashboard and service monitor are automatically configured for Grafana.
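Under the hood this relies on the kube-prometheus-stack Grafana sidecar, which auto-loads any ConfigMap labeled grafana_dashboard. A hedged sketch of how config/vllm-dashboard.json could be shipped that way (resource name and namespace are assumptions):

```hcl
# Hedged sketch: ship config/vllm-dashboard.json as a ConfigMap that the kube-prometheus-stack
# Grafana sidecar auto-loads via the grafana_dashboard label (resource name/namespace assumed).
resource "kubernetes_config_map" "vllm_dashboard" {
  metadata {
    name      = "vllm-dashboard"
    namespace = "kube-prometheus-stack"
    labels = {
      grafana_dashboard = "1"
    }
  }

  data = {
    "vllm-dashboard.json" = file("${path.module}/config/vllm-dashboard.json")
  }
}
```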

For benchmarking vLLM Production Stack performance, check the multi-round QA tutorial.
6️⃣ Destroying the Infrastructure 🚧
To delete everything, just run the command below (note: sometimes you need to run it twice, as the load balancer can be slow to delete).
terraform destroy -auto-approve
🛠️Configuration knobs
This stack provides extensive customization options to tailor your deployment:
| Variable | Default | Description |
|---|---|---|
| project_id | required | GCP project ID |
| region | us-central1 | GCP region |
| cluster_version | 1.30 | Kubernetes version |
| inference_hardware | cpu | cpu or gpu |
| pod_cidr | 10.244.0.0/16 | Pod IP range |
| enable_vllm | false | Deploy vLLM stack |
| hf_token | «secret» | HF model download token |
| enable_disk_csi_driver | true | Enable Persistent Disk CSI driver for block storage |
| enable_file_csi_driver | false | Enable Filestore CSI driver for shared file storage |
| create_project | false | Whether to create the GCP project (true) or use an existing one (false) |
| letsencrypt_email | admin@example.com | Email for Let’s Encrypt |
📓This is just a subset. For the full list of 20+ configurable variables, consult the configuration template: env-vars.template
🧪 Quick Test
1️⃣ Router Endpoint and API URL
1.1 To reach the router endpoint through port forwarding, run the following command:
# Case 1 : Port forwarding
kubectl -n vllm port-forward svc/vllm-gpu-router-service 30080:80
export vllm_api_url=http://localhost:30080/v1
1.2 Extracting the Router URL via the HTTPS ingress
The endpoint URL can be found in the vllm_api_url output:
# Case 2 : Extract from Terraform output
export vllm_api_url=$(terraform output -raw vllm_api_url)
# Example output:
# https://vllm-api.a1b2c3d4.sslip.io/v1
2️⃣ List models
# check models
curl -s ${vllm_api_url}/models | jq .
3️⃣ Completion (applicable for both ingress and port-forwarding URLs)
curl ${vllm_api_url}/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/tinyllama",
    "prompt": "Google Cloud is a",
    "max_tokens": 20,
    "temperature": 0
  }' | jq .choices[].text
4️⃣ vLLM model service
kubectl -n vllm get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
vllm-gpu-router-service ClusterIP 10.96.104.134 <none> 80/TCP,9000/TCP 115m
vllm-gpu-tinyllama-gpu-engine-service ClusterIP 10.96.143.115 <none> 80/TCP,55555/TCP,9999/TCP 115m
🎯Troubleshooting:
Certificate Not Issuing
Debug: STATUS: Pending or False
# Check certificate status
kubectl describe certificate -n vllm
# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager --tail=100
# Check HTTP-01 challenge
kubectl get challenge -n vllm
- Symptom
# Message:
Failed to create new order: acme: urn:ietf:params:acme:error:rateLimited: Error creating new order :: too many certificates already issued for: nip.io: see letsencrypt.org/docs/rate-limits
- Fix: switch nip.io to sslip.io in the ingress host of the vLLM Helm charts cpu-tinyllama-light-ingress-gcp | gpu-tinyllama-light-ingress-gcp
Useful GCloud Debugging Commands
# Check GKE cluster status
gcloud container clusters describe vllm-gke --region us-central1
# Check node pools
gcloud container node-pools list --cluster vllm-gke --region us-central1
# Check Cloud NAT status
gcloud compute routers get-nat-mapping-info vllm-vpc-router --region us-central1
Conclusion
After exploring the EKS and AKS based vLLM implementations, this deployment gives you a solid foundation for production LLM serving on Google Cloud and an ideal starting point to extend further.
Next Steps 🚀
- In the next post, we’re taking this stack to Nebius Managed K8s. Stay Tuned!
📚 Additional Resources
- vLLM Documentation
- vLLM Production stack documentation
- GCP GPU Documentation
- Google Terraform Kubernetes Engine Modules
- terraform-registry-for-gke-modules
- Dataplane V2 Documentation

Run AI Your Way — In Your Cloud
Want full control over your AI backend? The CloudThrill vLLM Private Inference POC is still open — but not forever.
📢 Secure your spot (only a few left), 𝗔𝗽𝗽𝗹𝘆 𝗻𝗼𝘄!
Run AI assistants, RAG, or internal models on an AI backend 𝗽𝗿𝗶𝘃𝗮𝘁𝗲𝗹𝘆 𝗶𝗻 𝘆𝗼𝘂𝗿 𝗰𝗹𝗼𝘂𝗱 –
✅ No external APIs
✅ No vendor lock-in
✅ Total data control
𝗬𝗼𝘂𝗿 𝗶𝗻𝗳𝗿𝗮. 𝗬𝗼𝘂𝗿 𝗺𝗼𝗱𝗲𝗹𝘀. 𝗬𝗼𝘂𝗿 𝗿𝘂𝗹𝗲𝘀…
🙋🏻♀️If you like this content, please subscribe to our blog newsletter ❤️.