vLLM Production Stack on GCP GKE with Terraform🧑🏼‍🚀

Intro

Welcome back to the Terraform vLLM Production Stack series! After covering AWS EKS and Azure AKS, today we’re deploying the vLLM production stack on Google Cloud GKE with the same Terraform approach.
This guide shows you how to deploy a production-ready LLM serving environment on Google Cloud, with GCP-specific optimizations including Dataplane V2 (Cilium eBPF), VPC-native networking, and pre-installed GPU drivers.

We’ll cover network architecture, certificate automation (using Let’s Encrypt), GPU provisioning, and comprehensive observability for both CPU and GPU inference, all using Infrastructure as Code.

💡You can find our code in the CloudThrill repo ➡️ production-stack-terraform.

This is part of CloudThrill‘s ongoing contribution to the vLLM Production Stack project, extending Terraform deployment patterns across AWS, Azure, GCP, Oracle OCI, and Nebius.

📂 Project Structure

./
├── main.tf                          # Project + GKE cluster
├── network.tf                       # VPC, subnet, NAT, reserved IPs
├── provider.tf                      # GCP + Helm + kubectl providers
├── variables.tf                     # All input variables
├── output.tf                        # HTTPS endpoints, kubeconfig
├── cluster-tools.tf                 # cert-manager, prometheus, GPU operator
├── datasources.tf                   # Ingress data sources
├── vllm-production-stack.tf         # vLLM Helm release, BackendConfig, dashboards
├── env-vars.template                # Quick env exporter
├── terraform.tfvars.template        # Same variables in HCL format
├── modules/
   ├── private-cluster-update-variant/  # GKE cluster (upstream)
      ├── cluster.tf
      ├── dns.tf
      ├── firewall.tf
      ├── main.tf
      ├── networks.tf
      ├── outputs.tf
      ├── variables.tf
      └── versions.tf
   ├── google-network/                  # VPC + subnet + NAT (upstream)
      ├── main.tf
      ├── variables.tf
      ├── outputs.tf
      ├── versions.tf
      └── modules/
          ├── vpc/
          ├── subnets/
          ├── routes/
          └── firewall-rules/
   ├── google-project-factory/          # Optional new project (upstream)
      ├── main.tf
      ├── variables.tf
      ├── outputs.tf
      └── modules/
          ├── core_project_factory/
          └── project_services/
   └── llm-stack/                       # vLLM Helm value templates
       └── helm/
           ├── cpu/
              └── cpu-tinyllama-light-ingress-gcp.tpl
           └── gpu/
               ├── gpu-operator-values.yaml
               └── gpu-tinyllama-light-ingress-gcp.tpl
├── config/
   ├── helm/
      └── kube-prome-stack.yaml        # Grafana + Prometheus values
   ├── manifests/
      └── letsencrypt-issuer.yaml      # Let's Encrypt ClusterIssuer
   ├── kubeconfig.tpl                   # Local kubeconfig
   └── vllm-dashboard.json              # Grafana vLLM dashboard
└── README.md                            # ← you are here

🧰Prerequisites

Before you begin, ensure you have the following:

| Tool | Version | Notes |
|------|---------|-------|
| Terraform | ≥ 1.3 | Tested on 1.10+ |
| gcloud CLI | ≥ 450.0 | For authentication, tested on 529.0 |
| kubectl | ≥ 1.9.x | Kubernetes CLI |
| Terraform Provider for GCP | ≥ 6.41 | hashicorp/google provider |
| jq | optional | JSON helper |
Follow the steps below to install the tools (expand)👇🏼
# Install tools
sudo apt update && sudo apt install -y jq curl unzip gpg

  # Terraform
wget -qO- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install -y terraform

# Google Cloud SDK
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init

# kubectl
curl -sLO "https://dl.k8s.io/release/$(curl -Ls https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install kubectl /usr/local/bin/ && rm kubectl
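
Optionally, run a quick sanity check to confirm the tools are installed and on your PATH:

# Verify installed tool versions
terraform version
gcloud --version
kubectl version --client
jq --version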
  • Configure Google Cloud
# Login to GCP
gcloud auth login
gcloud auth application-default login

# Set project
gcloud config set project YOUR_PROJECT_ID

# Enable required APIs
gcloud services enable container.googleapis.com compute.googleapis.com servicenetworking.googleapis.com

# Verify
gcloud config list

What’s in the stack?📦

This Terraform stack delivers a production-ready vLLM serving environment on Google Cloud GKE, supporting both CPU and GPU inference with operational best practices embedded in the Google Terraform Kubernetes Engine modules.

It’s designed for real-world production workloads with:
✅ Enterprise-grade infrastructure: Built on the private-cluster-update-variant module.
✅ Flexible compute: Switch between CPU and GPU inference with a single flag.
✅ Operational excellence: Prometheus, Grafana and Dataplane V2 (Cilium).
✅ Scalability: Cluster autoscaler and pre-installed GPU drivers.
✅ Secure endpoints: HTTPS model serving through GKE Native Ingress + Google Cloud Load Balancer + Let’s Encrypt.

Architecture Overview

Deployment layers – The stack provisions infrastructure in logical layers that adapt based on your hardware choice:

| Layer | Component | CPU Mode | GPU Mode |
|-------|-----------|----------|----------|
| Infrastructure | VPC + GKE + Dataplane V2 | Always deployed | Always deployed |
| Add-ons | Persistent Disk CSI, cert-manager, Prometheus | Always deployed | Always deployed |
| vLLM Stack | Secrets + Helm chart | Deployed on CPU nodes | Deployed on GPU nodes (drivers pre-installed) |
| Networking | Reserved IPs + Ingress + TLS + Let’s Encrypt | GKE native ingress + cert-manager | GKE native ingress + cert-manager |

1. Networking Foundation

The stack creates a production-grade network topology (a quick CLI check follows the list):

  • Custom VPC with 3 IP ranges (nodes, pods, services)
  • VPC-native networking with Dataplane V2 (Cilium eBPF)
  • Private GKE cluster with Cloud NAT for outbound traffic
  • Automated SSL/TLS certificates via cert-manager + Let’s Encrypt
  • Network security with Dataplane V2 network policies
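
Once the stack is up, you can verify the VPC-native wiring from the CLI. The subnet name below is a placeholder for illustration (use the one from your variables); the vllm-vpc-router name matches the router used later in the troubleshooting section:

# Inspect the cluster subnet and its secondary (pod/service) ranges
# NOTE: "vllm-subnet" is a placeholder name; replace it with your subnet
gcloud compute networks subnets describe vllm-subnet \
  --region us-central1 \
  --format="yaml(ipCidrRange, secondaryIpRanges)"

# Confirm Cloud NAT is attached to the VPC router
gcloud compute routers nats list --router vllm-vpc-router --region us-central1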

2. GKE Cluster

Private cluster (v1.30, zonal or regional) with managed, auto-scaling node pools; a quick node-label check follows the table.

| Pool | Machine Type | Purpose |
|------|--------------|---------|
| cpu-pool | n2-standard-4 (4 vCPU / 16 GiB) | System services & CPU inference |
| gpu-pool (optional) | g2-standard-4 + L4 (1 × NVIDIA L4) | GPU inference |
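
Once the node pools are up, the standard GKE node labels show which pool a node belongs to and whether it exposes a GPU:

# Node pool and accelerator labels on each node
kubectl get nodes -L cloud.google.com/gke-nodepool -L cloud.google.com/gke-accelerator

# Node pools as seen by GCP (cluster name and region from your variables)
gcloud container node-pools list --cluster vllm-gke --region us-central1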

3. Essential Add-ons

Core GKE add-ons installed via the terraform-google-kubernetes-engine private-cluster-update-variant module (a quick verification check follows the table):

| Category | Add-on |
|----------|--------|
| CNI | VPC-native networking with Dataplane V2 (Cilium eBPF) |
| Storage | Persistent Disk CSI (block, pre-installed); Filestore CSI (shared, optional) |
| Ingress | GKE Native Ingress with reserved external IPs |
| SSL/TLS | cert-manager + Let’s Encrypt ClusterIssuer |
| Core | CoreDNS, kube-proxy (eBPF mode), Metrics Server (built-in) |
| Observability | kube-prometheus-stack, Grafana |
| GPU (optional) | NVIDIA GPU drivers (pre-installed by GKE via gpu_driver_version) |
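
To confirm the add-ons landed, list their workloads. The namespaces below match the Helm releases used elsewhere in this guide (cert-manager and kube-prometheus-stack); adjust them if you renamed the releases:

# cert-manager, monitoring stack, and GKE system add-ons
kubectl get pods -n cert-manager
kubectl get pods -n kube-prometheus-stack
kubectl get pods -n kube-system | grep -E "csi|dns|metrics"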

4. vLLM Production Stack

The heart of the deployment: a production-ready model serving layer with:

✅ Model: TinyLlama-1.1B (default, fully customizable)
✅ Load balancing: Round-robin router service across replicas
✅ Secrets: Hugging Face token stored as Kubernetes Secret
✅ Storage: Init container with persistent model caching at `/data/models/`
✅ Monitoring: Prometheus metrics endpoint for observability
✅ HTTPS endpoints: Automated TLS via Let’s Encrypt with reserved IPs
✅ Default Helm charts: cpu-tinyllama-light-ingress-gcp | gpu-tinyllama-light-ingress-gcp

5. Hardware Flexibility: CPU vs GPU

You can deploy the vLLM production stack on either CPU or GPU using the inference_hardware parameter:

# Deploy on CPU (default)
export TF_VAR_inference_hardware=cpu
# Or deploy on GPU
export TF_VAR_inference_hardware=gpu

🖥️ GKE GPU Instance Types Available

Available GPU instances (L4 · T4 · V100 · A100)
| Machine Type + GPU | vCPUs | Memory (GiB) | GPUs | GPU Memory (GiB) | Best For |
|---|---|---|---|---|---|
| NVIDIA L4 | | | | | |
| g2-standard-4 + L4 | 4 | 16 | 1 | 24 | Cost-effective inference |
| g2-standard-8 + L4 | 8 | 32 | 1 | 24 | Medium inference |
| g2-standard-16 + 2×L4 | 16 | 64 | 2 | 48 | Multi-GPU inference |
| NVIDIA Tesla T4 | | | | | |
| n1-standard-4 + T4 | 4 | 15 | 1 | 16 | Legacy inference workloads |
| n1-standard-8 + T4 | 8 | 30 | 1 | 16 | Medium inference |
| NVIDIA Tesla V100 | | | | | |
| n1-standard-8 + V100 | 8 | 30 | 1 | 16 | Training & inference |
| n1-standard-16 + 2×V100 | 16 | 60 | 2 | 32 | Multi-GPU training |
| NVIDIA A100 (40GB) | | | | | |
| a2-highgpu-1g | 12 | 85 | 1 | 40 | High-performance inference |
| a2-highgpu-2g | 24 | 170 | 2 | 80 | Multi-GPU inference |
| a2-highgpu-4g | 48 | 340 | 4 | 160 | Large-scale training |
Note: Check the full list of Google Cloud GPU instance offerings here.
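
GPU availability varies by region and zone, so before settling on a machine type it’s worth checking what accelerators (and quota) your zone actually offers:

# List GPU accelerator types available in a given zone (adjust to your zone)
gcloud compute accelerator-types list --filter="zone:us-central1-a"

# Check the GPU-related quotas for your region (requires jq)
gcloud compute regions describe us-central1 --format=json | jq '.quotas[] | select(.metric | test("GPU"))'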

Getting started

The deployment automatically provisions only the required infrastructure based on your hardware selection.

| Phase | Component | Action | Condition |
|-------|-----------|--------|-----------|
| 1. Project | GCP Project | Create new project (optional) | create_project = true |
| | API Services | Enable required GCP APIs | Always |
| 2. Infrastructure | VPC | Create VPC with subnets + secondary ranges | Always |
| | NAT Gateway | Configure Cloud NAT for private nodes | Always |
| | Reserved IPs | Create external IPs for ingress | Always |
| | GKE | Deploy v1.30 private cluster + CPU node pool | Always |
| | Dataplane V2 | Enable eBPF-based networking (Cilium) | Always |
| 3. SSL/TLS | cert-manager | Install cert-manager | Always |
| | ClusterIssuer | Create Let’s Encrypt ClusterIssuer | Always |
| 4. vLLM Stack | HF secret | Create Hugging Face token secret | enable_vllm = true |
| | CPU Deployment | Deploy vLLM on CPU nodes | inference_hardware = "cpu" |
| | GPU Infrastructure | Provision GPU node pool | inference_hardware = "gpu" |
| | GPU Drivers | Pre-installed by GKE | inference_hardware = "gpu" |
| | Helm chart | Deploy TinyLlama-1.1B with HTTPS ingress | enable_vllm = true |
| 5. Observability | Prometheus/Grafana | Deploy stack + vLLM dashboard | Always |

🔵 Deployment Steps

1️⃣Clone the repository

The vLLM GKE deployment lives under the vllm-production-stack-terraform/gke directory:

  • Navigate to the production-stack-terraform repository and the Terraform GKE tutorial folder:
$ git clone https://github.com/CloudThrill/vllm-production-stack-terraform
$ cd vllm-production-stack-terraform/gke/

2️⃣ Set Up Environment Variables

Use an env-vars file to export your TF_VAR_* variables, or use terraform.tfvars. Replace the placeholders with your values:

cp env-vars.template env-vars
vim env-vars
# Set HF token and customize deployment options
source env-vars

Usage examples

  • Option 1: Through Environment Variables
# Copy and customize
cp env-vars.template env-vars
vi env-vars

################################################################################
# GCP Credentials & Location
################################################################################
export TF_VAR_project_id="your-gcp-project-id"        # ← your GCP project ID
export TF_VAR_region="us-central1"                    # GCP region

################################################################################
# Project & Network Configuration
################################################################################
export TF_VAR_create_project="false"                  # Use existing project
export TF_VAR_create_vpc="true"                       # Create new VPC

################################################################################
# LLM Inference
################################################################################
export TF_VAR_enable_vllm=true                        # Deploy vLLM stack
export TF_VAR_hf_token="hf_your_token_here"          # ← Hugging Face token
export TF_VAR_inference_hardware="gpu"                # "cpu" or "gpu"
export TF_VAR_letsencrypt_email="your-email@email.com"  # Change me

################################################################################
# Storage & Observability
################################################################################
export TF_VAR_enable_disk_csi_driver="true"           # Persistent Disk CSI
export TF_VAR_enable_file_csi_driver="false"          # Filestore CSI (optional)
export TF_VAR_grafana_admin_password="admin1234"      # Grafana admin password

################################################################################
#  Paths to Helm chart values templates (relative to repo root)
################################################################################
export TF_VAR_cpu_vllm_helm_config="modules/llm-stack/helm/cpu/cpu-tinyllama-light-ingress-gcp.tpl"
export TF_VAR_gpu_vllm_helm_config="modules/llm-stack/helm/gpu/gpu-tinyllama-light-ingress-gcp.tpl"

# Load vars
source env-vars
  • Option 2: Through Terraform Variables
 # Copy and customize
 $ cp terraform.tfvars.example terraform.tfvars
 $ vim terraform.tfvars
  • Load the variables into your shell: before running Terraform, source the env-vars file:
$ source env-vars

3️⃣ Make sure your gcloud authentication is fresh:

Run the command below if Google’s “Re-authentication for privileged access” (RAPT) session has expired:

gcloud auth application-default login --no-launch-browser

4️⃣ Run the Terraform deployment:

You can now safely run terraform plan and apply. The deployment creates 42 resources in total, including a local kubeconfig.

terraform init
terraform plan
terraform apply

After the deployment you should be able to interact with the cluster using kubectl:

export KUBECONFIG=$PWD/kubeconfig
Full apply output 👇🏼
Apply complete! Resources: 42 added, 0 changed, 0 destroyed.

Outputs:

Stack_Info = "Built with ❤️ by @Cloudthrill"
LB-IP-Address = "136.110.184.34"
gke_cluster_id = "xx"
gke_deployment_info = <sensitive>
gke_endpoint = <sensitive>
gke_location = "us-east1-c"
gke_master_version = "1.32.9-gke.1207000"
gke_name = "vllm-gke"
gke_region = "us-east1"
gpu_driver_status = tomap({
  "deployed" = "true"
  "method" = "GKE automatic installation"
  "reason" = "GKE handles GPU drivers automatically via gpu_driver_version parameter"
  "version" = "LATEST"
})
grafana_url = "https://grafana.886eb822.sslip.io"
vllm_api_url = "https://vllm-api.22241bb6.sslip.io/v1"
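
Before moving on, a couple of quick checks confirm the cluster and the vLLM release are healthy (the vllm namespace is the default used by this stack):

export KUBECONFIG=$PWD/kubeconfig
kubectl get nodes -o wide
kubectl -n vllm get pods
kubectl -n vllm get ingress,certificate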

5️⃣ Observability (Grafana Login)

You can access the Grafana dashboards using the grafana_url output or port forwarding (i.e. http://localhost:3000).

# Case 1: Get the Grafana HTTPS URL (already printed by Terraform), e.g. https://grafana.xxxxx.nip.io
terraform output -raw grafana_url
# Case 2: Or port-forward
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-prometheus-stack
  • Run the command below to fetch the admin password
kubectl get secret -n kube-prometheus-stack kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d
  • Username: admin
  • Password: output of the kubectl command above

Automatic vLLM Dashboard

In this stack, the vLLM dashboard and service monitor are automatically configured for Grafana.

For benchmarking vLLM Production Stack performance, check the multi-round QA tutorial.

6️⃣ Destroying the Infrastructure 🚧

To delete everything, just run the command below (Note: sometimes you need to run it twice, as the ingress-created load balancer can be slow to delete).

terraform destroy -auto-approve
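
If the destroy fails on networking resources still in use, it is usually the ingress-created load balancer or the reserved IPs; you can check for leftovers before re-running the destroy:

# Look for orphaned load-balancer pieces and reserved IPs left behind by the ingress
gcloud compute forwarding-rules list
gcloud compute target-https-proxies list
gcloud compute addresses list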



🛠️Configuration knobs

This stack provides extensive customization options to tailor your deployment:

| Variable | Default | Description |
|----------|---------|-------------|
| project_id | required | GCP project ID |
| region | us-central1 | GCP region |
| cluster_version | 1.30 | Kubernetes version |
| inference_hardware | cpu | cpu or gpu |
| pod_cidr | 10.244.0.0/16 | Pod IP range |
| enable_vllm | false | Deploy vLLM stack |
| hf_token | «secret» | HF model download token |
| enable_disk_csi_driver | true | Enable Persistent Disk CSI driver for block storage |
| enable_file_csi_driver | false | Enable Filestore CSI driver for shared file storage |
| create_project | false | Whether to create the GCP project (true) or use an existing one (false) |
| letsencrypt_email | admin@example.com | Email for Let’s Encrypt |

📓This is just a subset. For the full list of 20+ configurable variables, consult the configuration template: env-vars.template.
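
Any of these knobs can also be overridden ad hoc on the Terraform command line instead of editing env-vars or terraform.tfvars, for example:

# Example: one-off GPU deployment in another region without touching your var files
terraform apply \
  -var="inference_hardware=gpu" \
  -var="region=us-east1" \
  -var="enable_vllm=true"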

🧪 Quick Test

1️⃣ Router Endpoint and API URL

1.1 Router endpoint through port forwarding: run the following command:

# Case 1 : Port forwarding
kubectl -n vllm port-forward svc/vllm-gpu-router-service 30080:80
export vllm_api_url=http://localhost:30080/v1

1.2 Extracting the router URL from the HTTPS ingress
The endpoint URL can be found in the vllm_api_url Terraform output:

# Case 2 : Extract from Terraform output 
export vllm_api_url=$(terraform output -raw vllm_api_url)
# Example output:
# https://vllm.a1b2c3d4.nip.io/v1


2️⃣ List models

# check models
curl -s ${vllm_api_url}/models | jq .


3️⃣ Completion (works with both the ingress and port-forwarding URLs)

curl ${vllm_api_url}/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/tinyllama",
    "prompt": "Google Cloud is a",
    "max_tokens": 20,
    "temperature": 0
  }' | jq .choices[].text
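
If the deployed model ships a chat template (chat-tuned TinyLlama variants do), the router also exposes the OpenAI-compatible chat completions endpoint; this request is an optional extra check, not part of the default test flow:

curl ${vllm_api_url}/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/tinyllama",
    "messages": [{"role": "user", "content": "What is Google Cloud GKE?"}],
    "max_tokens": 40,
    "temperature": 0
  }' | jq '.choices[].message.content'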


4️⃣ vLLM model service

kubectl -n vllm get svc
NAME                                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                     AGE
vllm-gpu-router-service                 ClusterIP   10.96.104.134   <none>        80/TCP,9000/TCP             115m
vllm-gpu-tinyllama-gpu-engine-service   ClusterIP   10.96.143.115   <none>        80/TCP,55555/TCP,9999/TCP   115m

🎯Troubleshooting:

Certificate Not Issuing

Debug when the certificate STATUS shows Pending or False:

# Check certificate status
kubectl describe certificate -n vllm
# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager --tail=100
# Check HTTP-01 challenge
kubectl get challenge -n vllm

  • Symptom
# Message: 
Failed to create new order: acme: urn:ietf:params:acme:error:rateLimited: Error creating new order :: too many certificates already issued for: nip.io: see letsencrypt.org/docs/rate-limits
Fix: Change nip.io to sslip.io in the ingress host of the vLLM Helm chart values:
cpu-tinyllama-light-ingress-gcp | gpu-tinyllama-light-ingress-gcp
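
After updating the host, confirm the issuer is ready and the certificate gets re-issued (the ClusterIssuer comes from config/manifests/letsencrypt-issuer.yaml; its exact name may differ in your setup):

# ClusterIssuer readiness and certificate re-issuance
kubectl get clusterissuer
kubectl -n vllm get certificate,order,challenge
kubectl -n vllm describe ingress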

Useful gcloud Debugging Commands

# Check GKE cluster status
gcloud container clusters describe vllm-gke --region us-central1

# Check node pools
gcloud container node-pools list --cluster vllm-gke --region us-central1

# Check Cloud NAT status
gcloud compute routers get-nat-mapping-info vllm-vpc-router --region us-central1

Conclusion

After exploring EKS- and AKS-based vLLM deployments, you now have a solid foundation for production LLM serving on Google Cloud, and an ideal starting point to extend further.

Next Steps 🚀

  • In the next post, we’re taking this stack to Nebius Managed K8s. Stay Tuned!

📚 Additional Resources


Run AI Your Way — In Your Cloud


Run AI assistants, RAG, or internal models on an AI backend privately in your cloud –
✅ No external APIs
✅ No vendor lock-in
✅ Total data control

Your infra. Your models. Your rules…

🙋🏻‍♀️If you like this content please subscribe to our blog newsletter ❤️.

👋🏻Want to chat about your challenges?
We’d love to hear from you! 
