
Intro
Welcome back to the Terraform vLLM Production Stack series! After covering AWS EKS and Azure AKS, today we’re deploying the vLLM production-stack on Google Cloud GKE with the same Terraform approach.
This guide shows you how to deploy a production-ready LLM serving environment on Google Cloud, with GCP-specific optimizations including Dataplane V2 (Cilium eBPF), VPC-native networking, and pre-installed GPU drivers.
We’ll cover network architecture, certificate automation (using Let’s Encrypt), GPU provisioning, and comprehensive observability for both CPU and GPU inference, all using Infrastructure as Code.
💡You can find our code in the CloudThrill repo ➡️ production-stack-terraform.
📂 Project Structure
./
├── main.tf # Project + GKE cluster
├── network.tf # VPC, subnet, NAT, reserved IPs
├── provider.tf # GCP + Helm + kubectl providers
├── variables.tf # All input variables
├── output.tf # HTTPS endpoints, kubeconfig
├── cluster-tools.tf # cert-manager, prometheus, GPU operator
├── datasources.tf # Ingress data sources
├── vllm-production-stack.tf # vLLM Helm release, BackendConfig, dashboards
├── env-vars.template # Quick env exporter
├── terraform.tfvars.template # Same as HCL
├── modules/
│ ├── private-cluster-update-variant/ # GKE cluster (upstream)
│ │ ├── cluster.tf
│ │ ├── dns.tf
│ │ ├── firewall.tf
│ │ ├── main.tf
│ │ ├── networks.tf
│ │ ├── outputs.tf
│ │ ├── variables.tf
│ │ └── versions.tf
│ ├── google-network/ # VPC + subnet + NAT (upstream)
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── outputs.tf
│ │ ├── versions.tf
│ │ └── modules/
│ │ ├── vpc/
│ │ ├── subnets/
│ │ ├── routes/
│ │ └── firewall-rules/
│ ├── google-project-factory/ # Optional new project (upstream)
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── outputs.tf
│ │ └── modules/
│ │ ├── core_project_factory/
│ │ └── project_services/
│ └── llm-stack/ # vLLM Helm value templates
│ └── helm/
│ ├── cpu/
│ │ └── cpu-tinyllama-light-ingress-gcp.tpl
│ └── gpu/
│ ├── gpu-operator-values.yaml
│ └── gpu-tinyllama-light-ingress-gcp.tpl
├── config/
│ ├── helm/
│ │ └── kube-prome-stack.yaml # Grafana + Prometheus values
│ ├── manifests/
│ │ └── letsencrypt-issuer.yaml # Let's Encrypt ClusterIssuer
│ ├── kubeconfig.tpl # Local kubeconfig
│ └── vllm-dashboard.json # Grafana vLLM dashboard
└── README.md # ← you are here
🧰Prerequisites
Before you begin, ensure you have the following:
| Tool | Version | Notes |
|---|---|---|
| Terraform | ≥ 1.3 | Tested on 1.10+ |
| gcloud CLI | ≥ 450.0 | For authentication, tested on 529.0 |
| kubectl | ≥ 1.9.x | Kubernetes CLI |
| Terraform Provider for GCP | ≥ 6.41+ | hashicorp/google provider |
| jq | optional | JSON helper |
Follow the steps below to install the tools (expand)👇🏼
# Install tools
sudo apt update && sudo apt install -y jq curl unzip gpg
# Terraform
wget -qO- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install -y terraform
# Google Cloud SDK
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init
# kubectl
curl -sLO "https://dl.k8s.io/release/$(curl -Ls https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install kubectl /usr/local/bin/ && rm kubectl
- Configure Google Cloud
# Login to GCP
gcloud auth login
gcloud auth application-default login
# Set project
gcloud config set project YOUR_PROJECT_ID
# Enable required APIs
gcloud services enable container.googleapis.com compute.googleapis.com servicenetworking.googleapis.com
# Verify
gcloud config list
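If you prefer to keep API enablement in code (the upstream project-factory module does this when create_project = true), a minimal Terraform sketch could look like the following; the resource and local names are illustrative, not the repo’s actual ones:

```hcl
# Illustrative sketch only: enable the required GCP APIs from Terraform instead of gcloud.
# The API list mirrors the gcloud command above; resource and local names are assumptions.
locals {
  required_apis = [
    "container.googleapis.com",
    "compute.googleapis.com",
    "servicenetworking.googleapis.com",
  ]
}

resource "google_project_service" "required" {
  for_each = toset(local.required_apis)

  project            = var.project_id
  service            = each.value
  disable_on_destroy = false # keep the APIs enabled when the stack is destroyed
}
```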
What’s in the stack?📦
This Terraform stack delivers a production-ready vLLM serving environment on Google Cloud GKE, supporting both CPU and GPU inference with operational best practices embedded in the Google Terraform Kubernetes Engine Modules.
It’s designed for real-world production workloads with:
✅ Enterprise-grade infrastructure: Built on the private-cluster-update-variant module.
✅ Flexible compute: Switch between CPU and GPU inference with a single flag.
✅ Operational excellence: Prometheus, Grafana and Dataplane V2 (Cilium).
✅ Scalability: Cluster-autoscaler and Pre-installed GPU drivers.
✅ Secure endpoints: HTTPS model serving through GKE Native Ingress + Google Cloud Load Balancer + Let’s Encrypt
Architecture Overview

Deployment layers – The stack provisions infrastructure in logical layers that adapt based on your hardware choice:
| Layer | Component | CPU Mode | GPU Mode |
|---|---|---|---|
| Infrastructure | VPC + GKE + Dataplane V2 | Always deployed | Always deployed |
| Add-ons | Persistent Disk CSI, cert-manager, Prometheus | Always deployed | Always deployed |
| vLLM Stack | Secrets + Helm chart | Deploy on CPU nodes | + GPU nodes (drivers pre-installed) |
| Networking | Reserved IPs + Ingress + TLS + Let’s Encrypt | GKE native ingress + cert-manager | GKE native ingress + cert-manager |
1. Networking Foundation
The stack creates a production-grade network topology (a minimal subnet sketch follows the list):
- Custom VPC with 3 IP ranges (nodes, pods, services)
- VPC-native networking with Dataplane V2 (Cilium eBPF)
- Private GKE cluster with Cloud NAT for outbound traffic
- Automated SSL/TLS certificates via cert-manager + Let’s Encrypt
- Network security with Dataplane V2 network policies
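For illustration, here is a minimal sketch of what the VPC-native layout looks like in plain Terraform; in the repo this is driven by the google-network module in network.tf, so the names and CIDRs below are examples only:

```hcl
# Illustrative sketch of the VPC-native layout; the repo drives this through the
# google-network module in network.tf, so names and CIDRs below are examples only.
resource "google_compute_network" "vpc" {
  name                    = "vllm-vpc"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "gke" {
  name          = "vllm-gke-subnet"
  network       = google_compute_network.vpc.id
  region        = var.region
  ip_cidr_range = "10.10.0.0/20" # primary range for the nodes

  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.244.0.0/16" # matches the pod_cidr variable default
  }

  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.32.0.0/20"
  }

  private_ip_google_access = true # private nodes can still reach Google APIs
}
```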
2. GKE Cluster
Private cluster v1.30 (zonal or regional) with managed, auto-scaling node pools (a simplified sketch of the pool definitions follows the table):
| Pool | Machine Type | Purpose |
|---|---|---|
| cpu-pool | n2-standard-4 (4 vCPU / 16 GiB) | System services & CPU inference |
| gpu-pool (optional) | g2-standard-4 + L4 (1 × NVIDIA L4) | GPU inference |
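As a reference, here is a simplified sketch of the two pools expressed as standalone google_container_node_pool resources; the repo actually declares them through the upstream module’s node_pools input, so resource names and sizes below are illustrative:

```hcl
# Simplified sketch of the two node pools; in the repo they are declared through the
# upstream module's node_pools input, so resource names and sizes here are illustrative.
resource "google_container_node_pool" "cpu" {
  name    = "cpu-pool"
  cluster = google_container_cluster.vllm.id # assumes a cluster resource defined elsewhere

  autoscaling {
    min_node_count = 1
    max_node_count = 3
  }

  node_config {
    machine_type = "n2-standard-4" # 4 vCPU / 16 GiB
  }
}

resource "google_container_node_pool" "gpu" {
  count   = var.inference_hardware == "gpu" ? 1 : 0 # only created in GPU mode
  name    = "gpu-pool"
  cluster = google_container_cluster.vllm.id

  autoscaling {
    min_node_count = 0
    max_node_count = 2
  }

  node_config {
    machine_type = "g2-standard-4"

    guest_accelerator {
      type  = "nvidia-l4"
      count = 1

      gpu_driver_installation_config {
        gpu_driver_version = "LATEST" # GKE pre-installs the NVIDIA driver
      }
    }
  }
}
```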
3. Essential Add-ons
Core GKE add-ons via terraform-google-kubernetes-engine private-cluster-update-variant (a cert-manager sketch follows the table):
| Category | Add-on |
|---|---|
| CNI | VPC-native networking with Dataplane V2 (Cilium eBPF) |
| Storage | Persistent Disk CSI (block, pre-installed) · Filestore CSI (shared, optional) |
| Ingress | GKE Native Ingress with reserved external IPs |
| SSL/TLS | cert-manager + Let’s Encrypt ClusterIssuer |
| Core | CoreDNS, kube-proxy (eBPF mode), Metrics Server (built-in) |
| Observability | kube-prometheus-stack, Grafana |
| GPU (optional) | NVIDIA GPU drivers (pre-installed by GKE via gpu_driver_version) |
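As a rough sketch of the SSL/TLS layer (assuming the Helm and kubectl providers declared in provider.tf), cert-manager plus the Let’s Encrypt ClusterIssuer could be wired like this; the resource names and the templated manifest variable are assumptions, not the repo’s exact code:

```hcl
# Rough sketch of the SSL/TLS layer using the Helm and kubectl providers from provider.tf;
# resource names and the templated manifest variable are assumptions, not the repo's exact code.
resource "helm_release" "cert_manager" {
  name             = "cert-manager"
  repository       = "https://charts.jetstack.io"
  chart            = "cert-manager"
  namespace        = "cert-manager"
  create_namespace = true

  set {
    name  = "installCRDs"
    value = "true"
  }
}

resource "kubectl_manifest" "letsencrypt_issuer" {
  # Renders config/manifests/letsencrypt-issuer.yaml with the letsencrypt_email variable.
  yaml_body = templatefile("${path.module}/config/manifests/letsencrypt-issuer.yaml", {
    letsencrypt_email = var.letsencrypt_email
  })

  depends_on = [helm_release.cert_manager]
}
```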
4. vLLM Production Stack
The heart of the deployment is production-ready model serving (a minimal Helm-release sketch follows the checklist):
✅ Model: TinyLlama-1.1B (default, fully customizable)
✅ Load balancing: Round-robin router service across replicas
✅ Secrets: Hugging Face token stored as Kubernetes Secret
✅ Storage: Init container with persistent model caching at `/data/models/`
✅ Monitoring: Prometheus metrics endpoint for observability
✅ HTTPS endpoints: Automated TLS via Let’s Encrypt with reserved IPs
✅ Default Helm charts: cpu-tinyllama-light-ingress-gcp | gpu-tinyllama-light-ingress-gcp
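Here is a minimal sketch of how this layer fits together, assuming the upstream vLLM production-stack Helm chart and the value templates listed above; resource names and the template variable are illustrative:

```hcl
# Hedged sketch of the vLLM layer: HF token Secret plus the production-stack Helm release
# rendered from the CPU or GPU values template; names and the template variable are illustrative.
resource "kubernetes_secret" "hf_token" {
  metadata {
    name      = "huggingface-credentials"
    namespace = "vllm" # assumes the vllm namespace already exists
  }

  data = {
    HF_TOKEN = var.hf_token
  }
}

resource "helm_release" "vllm_stack" {
  count      = var.enable_vllm ? 1 : 0
  name       = "vllm"
  repository = "https://vllm-project.github.io/production-stack" # upstream chart repo (assumed)
  chart      = "vllm-stack"
  namespace  = "vllm"

  values = [
    templatefile(
      var.inference_hardware == "gpu" ? var.gpu_vllm_helm_config : var.cpu_vllm_helm_config,
      { hf_token_secret = kubernetes_secret.hf_token.metadata[0].name } # hypothetical template var
    )
  ]
}
```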
5. Hardware Flexibility: CPU vs GPU
You can deploy the vLLM production stack on either CPU or GPU using the inference_hardware parameter (a sketch of how the flag is wired follows the example):
# Deploy on CPU (default)
export TF_VAR_inference_hardware=cpu
# Or deploy on GPU
export TF_VAR_inference_hardware=gpu
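Under the hood, the flag simply gates GPU-only resources and picks the matching Helm values template. A hedged sketch of that wiring (variable validation and local names are assumptions):

```hcl
# Illustrative sketch of how the flag is validated and consumed (local names are assumptions).
variable "inference_hardware" {
  type    = string
  default = "cpu"

  validation {
    condition     = contains(["cpu", "gpu"], var.inference_hardware)
    error_message = "inference_hardware must be \"cpu\" or \"gpu\"."
  }
}

locals {
  gpu_enabled     = var.inference_hardware == "gpu" # gates the GPU node pool
  vllm_values_tpl = local.gpu_enabled ? var.gpu_vllm_helm_config : var.cpu_vllm_helm_config
}
```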
🖥️ GKE GPU Instance Types Available
Available GPU instances (L4 · T4 · V100 · A100)
| Machine Type + GPU | vCPUs | Memory (GiB) | GPUs | GPU Memory (GiB) | Best For |
|---|---|---|---|---|---|
| NVIDIA L4 | | | | | |
| g2-standard-4 + L4 | 4 | 16 | 1 | 24 | Cost-effective inference |
| g2-standard-8 + L4 | 8 | 32 | 1 | 24 | Medium inference |
| g2-standard-16 + 2×L4 | 16 | 64 | 2 | 48 | Multi-GPU inference |
| NVIDIA Tesla T4 | | | | | |
| n1-standard-4 + T4 | 4 | 15 | 1 | 16 | Legacy inference workloads |
| n1-standard-8 + T4 | 8 | 30 | 1 | 16 | Medium inference |
| NVIDIA Tesla V100 | | | | | |
| n1-standard-8 + V100 | 8 | 30 | 1 | 16 | Training & inference |
| n1-standard-16 + 2×V100 | 16 | 60 | 2 | 32 | Multi-GPU training |
| NVIDIA A100 (40GB) | | | | | |
| a2-highgpu-1g | 12 | 85 | 1 | 40 | High-performance inference |
| a2-highgpu-2g | 24 | 170 | 2 | 80 | Multi-GPU inference |
| a2-highgpu-4g | 48 | 340 | 4 | 160 | Large-scale training |
Getting started
The deployment automatically provisions only the required infrastructure based on your hardware selection.
| Phase | Component | Action | Condition |
|---|---|---|---|
| 1. Project | GCP Project | Create new project (optional) | create_project = true |
| | API Services | Enable required GCP APIs | Always |
| 2. Infrastructure | VPC | Create VPC with subnets + secondary ranges | Always |
| | NAT Gateway | Configure Cloud NAT for private nodes | Always |
| | Reserved IPs | Create external IPs for ingress | Always |
| | GKE | Deploy v1.30 private cluster + CPU node pool | Always |
| | Dataplane V2 | Enable eBPF-based networking (Cilium) | Always |
| 3. SSL/TLS | cert-manager | Install cert-manager | Always |
| | ClusterIssuer | Create Let’s Encrypt ClusterIssuer | Always |
| 4. vLLM Stack | HF secret | Create Hugging Face token secret | enable_vllm = true |
| | CPU Deployment | Deploy vLLM on CPU nodes | inference_hardware = "cpu" |
| | GPU Infrastructure | Provision GPU node pool | inference_hardware = "gpu" |
| | GPU Drivers | Pre-installed by GKE | inference_hardware = "gpu" |
| | Helm chart | Deploy TinyLlama-1.1B with HTTPS ingress | enable_vllm = true |
| 5. Observability | Prometheus/Grafana | Deploy stack + vLLM dashboard | Always |
🔵 Deployment Steps
1️⃣Clone the repository
The vLLM GKE deployment build is located under the vllm-production-stack-terraform/gke directory:
- Clone the production-stack-terraform repository and navigate to the Terraform GKE tutorial folder
$ git clone https://github.com/CloudThrill/vllm-production-stack-terraform
📂..
$ cd vllm-production-stack-terraform/gke/
2️⃣ Set Up Environment Variables
Use an env-vars file to export your TF_VARs, or use terraform.tfvars. Replace placeholders with your values:
cp env-vars.template env-vars
vim env-vars
# Set HF token and customize deployment options
source env-vars
Usage examples
- Option 1: Through Environment Variables
# Copy and customize
cp env-vars.template env-vars
vi env-vars
################################################################################
# GCP Credentials & Location
################################################################################
export TF_VAR_project_id="your-gcp-project-id" # ← your GCP project ID
export TF_VAR_region="us-central1" # GCP region
################################################################################
# Project & Network Configuration
################################################################################
export TF_VAR_create_project="false" # Use existing project
export TF_VAR_create_vpc="true" # Create new VPC
################################################################################
# LLM Inference
################################################################################
export TF_VAR_enable_vllm=true # Deploy vLLM stack
export TF_VAR_hf_token="hf_your_token_here" # ← Hugging Face token
export TF_VAR_inference_hardware="gpu" # "cpu" or "gpu"
export TF_VAR_letsencrypt_email="your-email@email.com" # Change me
################################################################################
# Storage & Observability
################################################################################
export TF_VAR_enable_disk_csi_driver="true" # Persistent Disk CSI
export TF_VAR_enable_file_csi_driver="false" # Filestore CSI (optional)
export TF_VAR_grafana_admin_password="admin1234" # Grafana admin password
################################################################################
# Paths to Helm chart values templates (relative to repo root)
################################################################################
export TF_VAR_cpu_vllm_helm_config="modules/llm-stack/helm/cpu/cpu-tinyllama-light-ingress-gcp.tpl"
export TF_VAR_gpu_vllm_helm_config="modules/llm-stack/helm/gpu/gpu-tinyllama-light-ingress-gcp.tpl"
# Load vars
source env-vars
- Option 2: Through Terraform Variables (a terraform.tfvars sketch follows)
# Copy and customize
$ cp terraform.tfvars.template terraform.tfvars
$ vim terraform.tfvars
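A terraform.tfvars sketch mirroring the env-vars example above (all values are placeholders):

```hcl
# terraform.tfvars sketch mirroring the env-vars example above (all values are placeholders)
project_id         = "your-gcp-project-id"
region             = "us-central1"
create_project     = false
create_vpc         = true

enable_vllm        = true
hf_token           = "hf_your_token_here"
inference_hardware = "gpu" # "cpu" or "gpu"
letsencrypt_email  = "your-email@email.com"

enable_disk_csi_driver = true
enable_file_csi_driver = false
grafana_admin_password = "admin1234"
```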
- Load the variables into your shell: before running Terraform, source the env-vars file:
$ source env-vars
3️⃣ Make sure your gcloud authentication is fresh:
Run the command below if Google’s “Re-authentication for privileged access” (RAPT) session has expired:
gcloud auth application-default login --no-launch-browser
4️⃣ Run Terraform deployment:
You can now safely run terraform plan and apply. The deployment creates 42 resources in total, including a local kubeconfig.
terraform init
terraform plan
terraform apply
After the deployment, you should be able to interact with the cluster using kubectl:
export KUBECONFIG=$PWD/kubeconfig
Full Plan
Apply complete! Resources: 42 added, 0 changed, 0 destroyed.
Outputs:
Stack_Info = "Built with ❤️ by @Cloudthrill"
LB-IP-Address = "136.110.184.34"
gke_cluster_id = "xx"
gke_deployment_info = <sensitive>
gke_endpoint = <sensitive>
gke_location = "us-east1-c"
gke_master_version = "1.32.9-gke.1207000"
gke_name = "vllm-gke"
gke_region = "us-east1"
gpu_driver_status = tomap({
"deployed" = "true"
"method" = "GKE automatic installation"
"reason" = "GKE handles GPU drivers automatically via gpu_driver_version parameter"
"version" = "LATEST"
})
grafana_url = "https://grafana.886eb822.sslip.io"
vllm_api_url = "https://vllm-api.22241bb6.sslip.io/v1"
5️⃣ Observability (Grafana Login)
You can access the Grafana dashboards using the grafana_url output or port forwarding (e.g. http://localhost:3000).
# Case 1: Get the Grafana HTTPS URL (already printed by Terraform), e.g. https://grafana.xxxxx.sslip.io
terraform output -raw grafana_url
# Case 2: Or port-forward
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-prometheus-stack
- Run the command below to fetch the password:
kubectl get secret -n kube-prometheus-stack kube-prometheus-stack-grafana -o jsonpath={.data.admin-password} | base64 -d
- Username: admin
- Password: output of the kubectl command above
Automatic vLLM Dashboard
In this stack, the vLLM dashboard and service monitor are automatically configured for Grafana.
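Under the hood this relies on the kube-prometheus-stack Grafana sidecar, which auto-loads any ConfigMap labeled grafana_dashboard. A hedged sketch of how config/vllm-dashboard.json could be shipped that way (resource name and namespace are assumptions):

```hcl
# Hedged sketch: ship config/vllm-dashboard.json as a ConfigMap that the kube-prometheus-stack
# Grafana sidecar auto-loads via the grafana_dashboard label (resource name/namespace assumed).
resource "kubernetes_config_map" "vllm_dashboard" {
  metadata {
    name      = "vllm-dashboard"
    namespace = "kube-prometheus-stack"
    labels = {
      grafana_dashboard = "1"
    }
  }

  data = {
    "vllm-dashboard.json" = file("${path.module}/config/vllm-dashboard.json")
  }
}
```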

For benchmarking vLLM Production Stack performance, check the multi-round QA tutorial.
6️⃣ Destroying the Infrastructure 🚧
To delete everything, just run the command below (note: sometimes you need to run it twice, as the load balancer can be slow to delete).
terraform destroy -auto-approve
🛠️Configuration knobs
This stack provides extensive customization options to tailor your deployment:
| Variable | Default | Description |
|---|---|---|
| project_id | required | GCP project ID |
| region | us-central1 | GCP region |
| cluster_version | 1.30 | Kubernetes version |
| inference_hardware | cpu | cpu or gpu |
| pod_cidr | 10.244.0.0/16 | Pod IP range |
| enable_vllm | false | Deploy vLLM stack |
| hf_token | «secret» | HF model download token |
| enable_disk_csi_driver | true | Enable Persistent Disk CSI driver for block storage |
| enable_file_csi_driver | false | Enable Filestore CSI driver for shared file storage |
| create_project | false | Whether to create the GCP project (true) or use an existing one (false) |
| letsencrypt_email | admin@example.com | Email for Let’s Encrypt |
📓This is just a subset. For the full list of 20+ configurable variables, consult the configuration template: env-vars.template
🧪 Quick Test
1️⃣ Router Endpoint and API URL
1.1 To reach the router endpoint through port forwarding, run the following command:
# Case 1 : Port forwarding
kubectl -n vllm port-forward svc/vllm-gpu-router-service 30080:80
export vllm_api_url=http://localhost:30080/v1
1.2 Extracting the Router URL via the HTTPS ingress
The endpoint URL can be found in the vllm_api_url output:
# Case 2 : Extract from Terraform output
export vllm_api_url=$(terraform output -raw vllm_api_url)
# Example output:
# https://vllm-api.a1b2c3d4.sslip.io/v1
2️⃣ List models
# check models
curl -s ${vllm_api_url}/models | jq .
3️⃣ Completion (applicable for both ingress and port-forwarding URLs)
curl ${vllm_api_url}/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/data/models/tinyllama",
    "prompt": "Google Cloud is a",
    "max_tokens": 20,
    "temperature": 0
  }' | jq .choices[].text
4️⃣ vLLM model service
kubectl -n vllm get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
vllm-gpu-router-service ClusterIP 10.96.104.134 <none> 80/TCP,9000/TCP 115m
vllm-gpu-tinyllama-gpu-engine-service ClusterIP 10.96.143.115 <none> 80/TCP,55555/TCP,9999/TCP 115m
🎯Troubleshooting:
Certificate Not Issuing
Debug: STATUS: Pending or False
# Check certificate status
kubectl describe certificate -n vllm
# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager --tail=100
# Check HTTP-01 challenge
kubectl get challenge -n vllm
- Symptom
# Message:
Failed to create new order: acme: urn:ietf:params:acme:error:rateLimited: Error creating new order :: too many certificates already issued for: nip.io: see letsencrypt.org/docs/rate-limits
- Fix: switch nip.io to sslip.io in the ingress host of the vLLM Helm charts cpu-tinyllama-light-ingress-gcp | gpu-tinyllama-light-ingress-gcp
Useful GCloud Debugging Commands
# Check GKE cluster status
gcloud container clusters describe vllm-gke --region us-central1
# Check node pools
gcloud container node-pools list --cluster vllm-gke --region us-central1
# Check Cloud NAT status
gcloud compute routers get-nat-mapping-info vllm-vpc-router --region us-central1
Conclusion
After exploring the EKS and AKS based vLLM implementations, this deployment gives you a solid foundation for production LLM serving on Google Cloud and an ideal starting point to extend further.
Next Steps 🚀
- In the next post, we’re taking this stack to Nebius Managed K8s. Stay Tuned!
📚 Additional Resources
- vLLM Documentation
- vLLM Production stack documentation
- GCP GPU Documentation
- Google Terraform Kubernetes Engine Modules
- terraform-registry-for-gke-modules
- Dataplane V2 Documentation

Run AI Your Way — In Your Cloud
Want full control over your AI backend? The CloudThrill vLLM Private Inference POC is still open — but not forever.
📢 Secure your spot (only a few left), 𝗔𝗽𝗽𝗹𝘆 𝗻𝗼𝘄!
Run AI assistants, RAG, or internal models on an AI backend 𝗽𝗿𝗶𝘃𝗮𝘁𝗲𝗹𝘆 𝗶𝗻 𝘆𝗼𝘂𝗿 𝗰𝗹𝗼𝘂𝗱 –
✅ No external APIs
✅ No vendor lock-in
✅ Total data control
𝗬𝗼𝘂𝗿 𝗶𝗻𝗳𝗿𝗮. 𝗬𝗼𝘂𝗿 𝗺𝗼𝗱𝗲𝗹𝘀. 𝗬𝗼𝘂𝗿 𝗿𝘂𝗹𝗲𝘀…
🙋🏻♀️If you like this content, please subscribe to our blog newsletter ❤️.