Intro

When scaling AI models like DeepSeek or Qwen on Amazon EKS, engineering teams obsess over GPU utilization while quietly bleeding money on storage bloat. Because standard EBS volumes force a 1:1 replica-to-disk ratio, scaling a single 70 GB model to 20 pods doesn’t cost 70 GB; it forces you to provision 1.4 TB of redundant EBS storage.

But there’s a smarter way: shift the LLM storage tier from EBS to S3 with the Mountpoint for Amazon S3 CSI driver, and mount model weights directly into your vLLM pods as a shared ReadOnlyMany volume. This eliminates duplicate storage, centralizes your model registry, speeds up pod scaling (weights stream directly from S3 → GPU), and can cut your inference storage bill by up to 95%.

Storage should scale with models, not replicas…

Today, we’ll build that architecture with vLLM on Amazon EKS and show why S3 is often the best storage tier for the job.

I. The Storage Problem

1.1 The “EBS Storage Tax”: Why Scaling vLLM is Broken

Every EBS volume is ReadWriteOnce and locked to a single node. This creates three operational penalties:

The Problem	EBS Reality	S3 Mountpoint Fix
Duplicate Storage	20 replicas = you pay for 20 copies of the same model (e.g., 70 GB each)	Shared storage: 1 copy for all replica pods
Cold Start Delays	Serial loading: HuggingFace → EBS → GPU VRAM	Single-hop: S3 (AWS backbone) → GPU VRAM
Throughput Tax	Over-paying for write-latency/IOPS that read-only weights never use	5-50+ GB/s included; no surcharge

💡S3 Mountpoint eliminates all three penalties so compute scales with replicas, and storage scales with models.

1.2 EBS vs. S3 Mountpoint: Cost Savings Simulator

To see how bad this gets in practice, the cost simulator below compares what EBS and S3 charge for the same set of models:

Interactive LLM storage calculator (EBS vs S3 Storage Cost Calculator), example scenario:

Inputs: 3 unique models, 70 GB average model size, 10 replicas per model
Storage avoided: 1,890 GB
Cost saved vs. gp3: 97.1%
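The arithmetic behind that scenario is simple enough to sketch by hand. The snippet below is a minimal sketch, not the calculator's actual code: the function name `storage_costs` is hypothetical, and the two prices ($0.08/GB-month for gp3, $0.023/GB-month for S3 Standard) are assumed list prices that you should replace with the rates for your region.

```python
# Assumed list prices in $/GB-month; adjust for your region and storage class.
GP3_PER_GB = 0.08
S3_STANDARD_PER_GB = 0.023


def storage_costs(models: int, model_gb: float, replicas: int,
                  ebs_price: float = GP3_PER_GB,
                  s3_price: float = S3_STANDARD_PER_GB) -> dict:
    """Compare per-replica EBS volumes against one shared S3 copy per model."""
    ebs_gb = models * model_gb * replicas   # every replica gets its own volume
    s3_gb = models * model_gb               # one copy per model, shared by all pods
    ebs_cost = ebs_gb * ebs_price
    s3_cost = s3_gb * s3_price
    return {
        "ebs_provisioned_gb": ebs_gb,
        "s3_stored_gb": s3_gb,
        "storage_avoided_gb": ebs_gb - s3_gb,
        "monthly_savings": round(ebs_cost - s3_cost, 2),
        "pct_saved": round((ebs_cost - s3_cost) / ebs_cost * 100, 1),
    }


# Same inputs as the scenario above: 3 models, 70 GB each, 10 replicas per model.
result = storage_costs(models=3, model_gb=70, replicas=10)
print(result)
```

Notice that the S3 side never multiplies by the replica count, which is the whole point: storage grows with the number of models only. S3 request charges are left out of this sketch.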

1.3 Why Not Just Use EFS?

AWS EFS might seem like a good idea: shared storage, no duplication. But its throughput pricing is a serious problem, as the comparison below shows 👇🏻.

Why EFS Doesn’t Help You:

EFS charges separately for throughput because it’s designed for frequent random I/O. Model weights are large sequential reads done once at pod startup. You’re paying for features you don’t need.

The Cost Reality S3 vs EFS/EBS:

Storage Class	Cost Model	Throughput	1 GB/s Cost	Total Monthly
EBS gp3	Pay per volume + extra for higher throughput	~125 MB/s baseline	Included	$80.00
EFS	Pay for storage + throughput/IOPS	~50-500 MB/s	$6,000	$6,030.00
S3 Mountpoint	Pay per TB stored + requests	5-50+ GB/s*	Included	$2.30

*S3 throughput depends on instance size and parallel read requests.

EBS has a scaling problem, while EFS has a throughput pricing problem.

When to use each:

Use Case	Best Choice
Inference (frozen weights)	S3 Mountpoint
Training (active checkpoints)	EFS

II. Implementation (Try It Yourself)

This is part of our ongoing contribution to the vLLM Production Stack, extending vLLM deployments across clouds.

2.1 📂 Project Structure

💡You can find the code in our official repo ➡️ cloudthrill-vllm-production-stack-terraform.

./
├── main.tf
├── network.tf
├── storage.tf     <<-- S3 Mount integration
├── provider.tf
├── variables.tf
├── output.tf
├── cluster-tools.tf
├── datasources.tf
├── iam_role.tf
├── vllm-production-stack.tf
├── env-vars.template
├── terraform.tfvars.template
├── modules/
│   └── llm-stack
│       └── helm/
│           ├── cpu/
│           └── gpu/
│               └── gpu-tinyllama-light-ingress-s3.tpl  # <<-- our vLLM chart using S3-mount
├── config/
│   ├── calico-values.tpl
│   └── kubeconfig.tpl
└── README.md

2.2 🧰Prerequisites

Before you begin, ensure you have the following:

Tool	Version-tested	Purpose
Terraform	≥ 1.5.7	Infrastructure provisioning
AWS CLI v2	≥ 2.16	AWS authentication
kubectl	≥ 1.30	Kubernetes management
helm	≥ 3.14	Used for Helm chart debugging
jq	latest	JSON parsing (optional)

Follow the steps below to install the tools (expand) 👇🏼

# Install tools
sudo apt update && sudo apt install -y jq curl unzip gpg
wget -qO- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install -y terraform
curl -s "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && unzip -q awscliv2.zip && sudo ./aws/install && rm -rf aws awscliv2.zip
curl -sLO "https://dl.k8s.io/release/$(curl -Ls https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" && sudo install kubectl /usr/local/bin/ && rm kubectl
curl -s https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg >/dev/null && echo "deb [signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm.list && sudo apt update && sudo apt install -y helm

Configure AWS profile

aws configure --profile myprofile
export AWS_PROFILE=myprofile    # ← If unset, Terraform exec auth will use the default profile

2.3 Core Infrastructure Components 📦

Architecture Overview

Deployment layers – The stack provisions infrastructure in logical layers that adapt based on your hardware choice:

Phase	Component	Action	Condition
1. Infrastructure	VPC	Provision VPC with 3 public + 3 private subnets	Always
	EKS	Deploy v1.30 cluster + CPU system node group	Always
	CNI	Remove `aws-node` and install Calico overlay (VXLAN)	Always
	Add-ons	Deploy ALB controller, kube-prometheus, and core EKS add-ons	Always
2. LLM Storage	S3 Bucket	Create model bucket and bootstrap weights from Hugging Face if missing	`enable_s3_model_storage = true`
	S3 CSI Driver	Install Mountpoint S3 CSI Driver and attach S3 IAM role	`enable_s3_csi_driver = true`
	S3 PV/PVC	Create prefix-scoped PV/PVC targeting the selected model path	`enable_s3_model_storage = true`
3. vLLM Stack	HF Secret	Create `hf-token-secret` for gated model access / bootstrap	`enable_vllm = true`
	GPU Infrastructure	Provision GPU node group	`inference_hardware = "gpu"`
	GPU Operator	Deploy NVIDIA plugin / operator	`inference_hardware = "gpu"`
	Application	Deploy validated TinyLlama-1.1B Helm chart to `vllm` namespace	`enable_vllm = true`
4. Networking	Load Balancer (Optional)	Configure ALB and ingress for external access	`enable_lb_ctl = true`

III. S3 vLLM Mountpoint Walkthrough

This S3-optimized stack delivers three key features in the current test setup (a single GPU node with one NVIDIA L4, 24 GB VRAM):

Key Feature	What It Does
S3-Native Streaming	Mountpoint CSI streams weights directly from S3 → GPU VRAM (no EBS overhead, no duplication).
Multi-Replica GPU Sharing	Bypasses the K8s hardware lock and partitions VRAM, allowing multiple replicas per GPU.
Automated Bootstrapping	Auto-downloads models from HuggingFace to S3 with idempotency checks.

The walkthrough below covers three parts in order: ☸️ S3 CSI Driver & IAM, 🗄️ S3 Storage Provisioning, and ⚙️ Automated S3 Bootstrap.

Mountpoint CSI exposes S3 as EKS Persistent Volumes, with IRSA granting the CSI Driver scoped read-only bucket access.

View CSI driver snippet →

# cluster-tools.tf snippet
# 1. Deploy CSI Driver with IRSA & GPU Tolerations
module "eks_addons" {
  source = ".."
  # ...
  eks_addons = {
    helm_releases = var.enable_s3_csi_driver ? {
      aws-mountpoint-s3-csi-driver = {
        namespace  = "kube-system"
        chart      = "aws-mountpoint-s3-csi-driver"
        repository = "https://awslabs.github.io/mountpoint-s3-csi-driver"

        values = [yamlencode({
          node = {
            serviceAccount = {
              annotations = {
                "eks.amazonaws.com/role-arn" = aws_iam_role.s3_csi_driver[0].arn
              }
            }
            tolerations = [{
              key      = "nvidia.com/gpu"
              operator = "Exists"
              effect   = "NoSchedule"
            }]
          }
        })]
      }
    } : {}
    # ...
  }
}

#--------------------------------------------------------------
# IRSA role and scoped read-only policy for the Mountpoint S3 CSI driver
#--------------------------------------------------------------
# 2. The IAM Role for the Service Account (IRSA)
resource "aws_iam_role" "s3_csi_driver" {
  count              = var.enable_s3_csi_driver ? 1 : 0
  name_prefix        = "${module.eks.cluster_name}-s3-csi-driver-"
  assume_role_policy = data.aws_iam_policy_document.s3_csi_driver_assume_role.json
}

# 3. Scoped Read-Only Policy
resource "aws_iam_policy" "s3_csi_driver_readonly" {
  count       = var.enable_s3_csi_driver ? 1 : 0 # needed for the [0] index used below
  name_prefix = "${module.eks.cluster_name}-s3-csi-read-"
  policy      = data.aws_iam_policy_document.s3_csi_driver_readonly.json
}

# 4. Attaching Scoped Access to the CSI Role
resource "aws_iam_role_policy_attachment" "s3_csi_driver_readonly" {
  count      = var.enable_s3_csi_driver ? 1 : 0
  role       = aws_iam_role.s3_csi_driver[0].name
  policy_arn = aws_iam_policy.s3_csi_driver_readonly[0].arn
}

💡Note the IRSA role annotation on the driver's service account, and the GPU toleration that lets the CSI node pods run on tainted GPU nodes.
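The snippet above references a read-only policy document without showing it. As a rough illustration of what "scoped read-only" can mean for a Mountpoint mount, here is a minimal sketch that builds such a policy as JSON. The function name `readonly_policy`, the bucket name, and the prefix are hypothetical, and the repo's actual policy document may differ; the sketch only assumes that a read-only mount needs `s3:ListBucket` on the bucket and `s3:GetObject` on the objects under the prefix.

```python
import json


def readonly_policy(bucket: str, prefix: str) -> dict:
    """Build a read-only IAM policy limited to one bucket and one key prefix."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted at bucket level so the mount can enumerate keys.
                "Sid": "ListModelBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Reads are limited to objects under the model prefix.
                "Sid": "ReadModelObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{bucket_arn}/{prefix}/*"],
            },
        ],
    }


# Hypothetical bucket and prefix, for illustration only.
policy = readonly_policy("vllm-123456", "models")
print(json.dumps(policy, indent=2))
```

No write or delete actions appear anywhere in the document, so even a misconfigured pod cannot modify the model registry through this role.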

View 📄 Complete implementation code → cluster-tools.tf

Storage provisioning creates an S3 bucket and provisions the Kubernetes PV/PVC resources that mount specific S3 prefixes.

View storage provisioning snippet →

# storage.tf snippet

# 1. The Global Model Registry
resource "aws_s3_bucket" "vllm_models" {
  count         = var.enable_s3_model_storage && var.create_s3_bucket ? 1 : 0
  bucket        = var.s3_bucket
  force_destroy = true
}

# 2. The Kubernetes Bridge (PV)
resource "kubernetes_persistent_volume" "s3_models" {
  count = var.enable_vllm && var.enable_s3_model_storage ? 1 : 0
  spec {
    access_modes = ["ReadOnlyMany"]
    # ...
    mount_options = [
      "region ${var.region}",
      "prefix ${local.model_s3_paths["tiny"]}/"
    ]
  }
}

# 3. The Pod-Facing Claim (PVC)
resource "kubernetes_persistent_volume_claim" "s3_models" {
  count = var.enable_vllm && var.enable_s3_model_storage ? 1 : 0
  # ...
  spec {
    access_modes       = ["ReadOnlyMany"]
    storage_class_name = ""
    volume_name        = kubernetes_persistent_volume.s3_models[0].metadata[0].name
    # ...
  }
}
# --snip

💡The PV mount options set both the region and the S3 prefix to mount, which limits the volume to that model's weights within the bucket.
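The Terraform snippet elides the CSI part of the PV spec. To make the shape of a static Mountpoint volume more concrete, here is a minimal sketch that assembles an equivalent PersistentVolume manifest as JSON (which `kubectl apply -f` also accepts). The helper name, PV name, bucket, and prefix are hypothetical; the driver name `s3.csi.aws.com` and the `bucketName` volume attribute follow the Mountpoint CSI driver's static provisioning example, and the capacity value is required by Kubernetes but ignored by the driver.

```python
import json


def s3_persistent_volume(name: str, bucket: str, region: str, prefix: str) -> dict:
    """Assemble a static PersistentVolume manifest backed by an S3 prefix."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": "1200Gi"},  # required field, ignored by the driver
            "accessModes": ["ReadOnlyMany"],    # many pods, no writers
            "storageClassName": "",             # empty string = static provisioning
            "mountOptions": [
                f"region {region}",
                f"prefix {prefix.rstrip('/')}/",  # Mountpoint expects a trailing slash
            ],
            "csi": {
                "driver": "s3.csi.aws.com",
                "volumeHandle": f"{name}-handle",  # must be unique per PV
                "volumeAttributes": {"bucketName": bucket},
            },
        },
    }


# Hypothetical names, for illustration only.
pv = s3_persistent_volume("vllm-s3-pv", "vllm-123456", "us-east-2", "models/tiny-llama")
print(json.dumps(pv, indent=2))
```

Because the prefix is baked into the PV, adding a second model would mean adding a second PV/PVC pair with its own prefix rather than widening this one.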

View 📄Complete implementation code → storage.tf

The bootstrap script checks whether the model already exists in S3 before deployment. If the prefix is empty or incomplete, it downloads the model from Hugging Face and syncs it to S3.

View S3 bootstrap snippet →

# storage.tf snippet...
echo "Checking if model exists in s3://$BUCKET/$PREFIX/"
# ...
if [ "$HAS_CONFIG" -eq 1 ] && [ "$SIZE" -gt "$MIN_SIZE_BYTES" ]; then
  echo "Model already exists in S3. Skipping bootstrap."
  exit 0
fi

echo "Model not found or incomplete. Downloading from Hugging Face: $HF_MODEL"
hf download "$HF_MODEL" --local-dir "$TMP_DIR"
#...snip

Step	Action	Benefit
HuggingFace Download	Auto-downloads model if S3 prefix `vllm/models/$MyModel` is empty	Zero manual setup
Idempotency Checks	Validates `config.json` + minimum S3 sub-directory size	Prevents duplicate uploads
Dependency Sequencing	Helm waits for bootstrap completion	Eliminates race conditions
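The idempotency check boils down to two conditions: `config.json` is present under the prefix, and the total size under the prefix exceeds a minimum. Here is a minimal sketch of that decision as a pure function, so it can be tried without touching AWS. The function name and the sample listing are hypothetical; in the real script the key/size pairs would come from an S3 listing.

```python
from typing import Iterable, Tuple


def model_is_bootstrapped(objects: Iterable[Tuple[str, int]],
                          prefix: str,
                          min_size_bytes: int) -> bool:
    """Return True when the prefix holds config.json and enough bytes of weights."""
    prefix = prefix.rstrip("/") + "/"
    has_config = False
    total_size = 0
    for key, size in objects:
        if not key.startswith(prefix):
            continue  # ignore other models stored in the same bucket
        total_size += size
        if key == prefix + "config.json":
            has_config = True
    # Mirrors the shell test: HAS_CONFIG == 1 and SIZE > MIN_SIZE_BYTES
    return has_config and total_size > min_size_bytes


# Hypothetical listing: (object key, size in bytes)
listing = [
    ("models/tiny-llama/config.json", 600),
    ("models/tiny-llama/model.safetensors", 2_200_000_000),
    ("models/llama-3/config.json", 700),
]

print(model_is_bootstrapped(listing, "models/tiny-llama", 1_000_000_000))
print(model_is_bootstrapped(listing, "models/llama-3", 1_000_000_000))
```

Requiring both conditions guards against a half-finished upload: a prefix that has `config.json` but no weights (like the second case) is treated as incomplete and gets re-synced.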

View 📄Complete implementation code → storage.tf

Final S3 storage layout:

S3 bucket (vllm)     # bucket name must be globally unique, e.g., vllm-123456
    └── models/
        ├── tiny-llama/
        ├── llama-3/
        └── qwen-3/

EKS
  └── Mountpoint CSI
        └── Mount s3://vllm/models → /models

vLLM pods
  └── modelURL = /models/tiny-llama

3.2 Multi-Replica GPU Sharing

To run 2 replicas on a single NVIDIA L4, we drop the requestGPU setting, which bypasses Kubernetes’ exclusive GPU device lock, and let vLLM manage its VRAM share directly.

🔴View full storage provisioning code →

modelSpec:
  - name: "tinyllama-gpu"
    replicaCount: 2
    # requestGPU: 1  # REMOVED - bypasses K8s device lock
    
    nodeSelectorTerms:
      - matchExpressions:
        - key: workload-type
          operator: "In"
          values: ["gpu"]
    
    vllmConfig:
      extraArgs:
        - "--gpu-memory-utilization=0.4"  # 0.4 × 2 = 80% VRAM usage
        
    extraVolumes:
      - name: s3-model-storage
        persistentVolumeClaim:
          claimName: vllm-s3-claim
    
    extraVolumeMounts:
      - name: s3-model-storage
        mountPath: /models/tiny-llama
        readOnly: true

nodeSelectorTerms pins the pods to GPU nodes, and the --gpu-memory-utilization flag caps the fraction of VRAM each pod may use.
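Since Kubernetes no longer arbitrates the GPU in this setup, it is worth checking the VRAM budget by hand before raising the replica count. The helper below is a minimal sketch of that arithmetic; the function name `vram_plan` is hypothetical, and it only assumes that each replica takes up to `utilization` of the card's total VRAM.

```python
def vram_plan(total_vram_gb: float, replicas: int, utilization: float) -> dict:
    """Check that `replicas` pods, each capped at `utilization`, fit on one GPU."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    combined = replicas * utilization
    if combined > 1:
        raise ValueError(
            f"{replicas} replicas x {utilization} = {combined:.2f} exceeds the GPU"
        )
    return {
        "per_replica_gb": round(total_vram_gb * utilization, 2),
        "combined_fraction": round(combined, 2),
        "headroom_gb": round(total_vram_gb * (1 - combined), 2),
    }


# The tested setup: one 24 GB L4, 2 replicas, --gpu-memory-utilization=0.4
plan = vram_plan(total_vram_gb=24, replicas=2, utilization=0.4)
print(plan)
```

With these inputs each replica gets about 9.6 GB and roughly 4.8 GB stays free; a third replica at 0.4 would be rejected, because nothing on the Kubernetes side would stop it from being scheduled onto the same card.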

IV. 🔵Getting started

The following configuration was selected to validate the S3-native streaming and multi-replica GPU sharing logic:

Feature	Configuration Details
✅ Model	TinyLlama-1.1B (Default, customizable via Helm)
✅ vLLM Load Balancing	Round-robin router service across replicas
✅ Storage	S3-Mount PVC mapped to `/models/<mymodel>`
✅ Monitoring	Prometheus metrics enabled for observability

4.1 Deployment Steps

1️⃣Clone the repository

The vLLM EKS-S3 deployment build is located under /aws/eks-s3-mount (see below):

 $ git clone https://github.com/CloudThrill/vllm-production-stack-terraform
 $ cd vllm-production-stack-terraform/aws/eks-s3-mount/

2️⃣ Set Up Environment Variables

Use an env-vars file to export your TF_VAR_* variables, or use terraform.tfvars. Replace placeholders with your values:

cp env-vars.template env-vars
vim env-vars  # Set HF token and customize deployment options
source env-vars

🛠️Configuration knobs

This stack provides extensive customization options to tailor your deployment:

Variable	Tested Value	Effect
`inference_hardware`	`"gpu"`	Required to provision GPU-optimized node groups.
`gpu_node_instance_types`	`'["g6.2xlarge"]'`	Selects NVIDIA L4 instances (24GB VRAM) for partitioning.
`enable_s3_model_storage`	`true`	Enables the S3 back-end logic for weight delivery.
`enable_s3_csi_driver`	`true`	Deploys the Mountpoint for Amazon S3 CSI driver.
`s3_bucket`	`"vllm-unique-id"`	Target S3 bucket for the model registry.
`huggingface_model_id`	`"TinyLlama/TinyLlama-1.1B-Chat-v1.0"`	The specific model source for automated sync.
`hf_token`	`"your-token"`	Auth token for private or gated HF models.

📓This is just a subset. For the full list of 20+ configurable variables, consult the configuration template: env-vars.template

Usage examples

# Copy and customize
$ cp env-vars.template env-vars
$ vi env-vars
################################################################################
 # ☸️ EKS cluster basics
################################################################################
export TF_VAR_cluster_name="vllm-eks-prod" # default: "vllm-eks-prod"
export TF_VAR_cluster_version="1.32"       # default: "1.30" - Kubernetes cluster version
export TF_VAR_gpu_node_instance_types='["g6.2xlarge"]'
################################################################################
 # 💽 S3 Model Storage 
################################################################################
export TF_VAR_enable_s3_csi_driver=true
export TF_VAR_enable_s3_model_storage=true
export TF_VAR_create_s3_bucket=true
export TF_VAR_s3_bucket="vllm-cloudthrill"    # CHANGE ME (must be globally unique, e.g., vllm-1234)
export TF_VAR_s3_models_prefix="models"
export TF_VAR_s3_csi_driver_version="1.10.0"
export TF_VAR_huggingface_model_id="TinyLlama/TinyLlama-1.1B-Chat-v1.0" # required for this lab
################################################################################
 # 🧠 LLM Inference Configuration
################################################################################
export TF_VAR_enable_vllm="true"         # default: "false" - Set to "true" to deploy vLLM
export TF_VAR_hf_token=""                # default: "" - Hugging Face token for model download (if needed)
export TF_VAR_inference_hardware="gpu"   # must be "gpu"
# Paths to VLLM Helm chart values templates.
# DO NOT Change below for this lab
# export TF_VAR_gpu_vllm_helm_config="./modules/llm-stack/helm/gpu/gpu-tinyllama-light-ingress-3.tpl" 
################################################################################
 # ⚙️ Node-group sizing
################################################################################
# CPU pool (always present)
export TF_VAR_cpu_node_min_size="1"     # default: 1
export TF_VAR_cpu_node_max_size="3"     # default: 3
export TF_VAR_cpu_node_desired_size="2" # default: 2
# GPU pool (ignored unless inference_hardware = "gpu")
export TF_VAR_gpu_node_min_size="1"     # default: 1
export TF_VAR_gpu_node_max_size="1"     # default: 1
export TF_VAR_gpu_node_desired_size="1" # default: 1
# ...snip
 $ source env-vars

Make sure to load the variables into your shell by sourcing the env-vars file (the last command above) before running Terraform.

3️⃣ Run Terraform deployment:

You can now run Terraform plan and apply, which will deploy 110 resources in total, including the shared S3-mount LLM storage:

terraform init
terraform plan
terraform apply

View Full output summary containing your INFRA & S3 Storage info, along with API endpoints.

Apply complete! Resources: 110 added, 0 changed, 0 destroyed.

Outputs:

aws_vllm_stack_summary = <<EOT

✅ AWS EKS Cluster deployed successfully!

🚀 VLLM PRODUCTION STACK ON AWS EKS 🚀
-----------------------------------------------------------
REGION             : us-east-2
AVAILABILITY ZONES : us-east-2a, us-east-2b, us-east-2c
API ENDPOINT       : https://XXXXXXXXXX.gr7.us-east-2.eks.amazonaws.com
VPC ID             : vpc-09a8ebe863defea50 (10.20.0.0/16)

🖥️  INFRASTRUCTURE & STORAGE
-----------------------------------------------------------
CPU NODES         : [t3.xlarge]
GPU NODES         : [g6.2xlarge]
S3 MODEL BUCKET   : vllm-cloudthrill
S3 CSI ROLE       : arn:aws:iam::xxxxxxxxxxx:role/vllm-eks-prod-s3-csi-driver-xxxx

🧠  MODEL CONFIGURATION
-----------------------------------------------------------
HF SOURCE ID      : TinyLlama/TinyLlama-1.1B-Chat-v1.0
API MODEL URL     : /models/tiny-llama

🌐 ACCESS ENDPOINTS
-----------------------------------------------------------
VLLM API URL      : Disabled
GRAFANA FORWARD   : kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-p.

🛠️  QUICK START COMMANDS
-----------------------------------------------------------
1. Set Context    : export KUBECONFIG="./kubeconfig"
2. Test API       : curl -k "<VLLM_API_URL>/v1/models"

Built with ❤️ by @Cloudthrill

KUBECONFIG: After the deployment you should be able to interact with the cluster using kubectl:

export KUBECONFIG=$PWD/kubeconfig

4️⃣ Observability (Grafana)

Upon deployment, you can access Grafana dashboards using port forwarding. URL → “http://localhost:3000”

kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-prometheus-stack

# Run the below command to fetch the password
kubectl get secret -n kube-prometheus-stack kube-prometheus-stack-grafana \
-o jsonpath="{.data.admin-password}" | base64 --decode

Automatic vLLM Dashboard

The vLLM dashboard and service monitor are automatically configured for Grafana. See VLLM Dashboard

V. Testing & Validation

5.1 Shared S3-Mount Inference

1️⃣ Export Router API Endpoint

# Case 1: Port forwarding (run the port-forward in a separate terminal, since it blocks)
kubectl -n vllm port-forward svc/vllm-gpu-router-service 30080:80
export vllm_api_url=http://localhost:30080/v1

2️⃣ List models

# ---- check models
curl -s ${vllm_api_url}/models | jq .

3️⃣ Generate Round-Robin inference Workload

# 2. Send a barrage of concurrent prompts to test the round-robin distribution
# (-I already implies one input per command, so -n 1 is dropped to avoid the xargs conflict warning)
seq 1 100 | xargs -P 25 -I {} curl -s -X POST "$vllm_api_url/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/tiny-llama",
    "prompt": "Explain the architecture of Kubernetes and how it schedules pods in detail:",
    "max_tokens": 150
  }' \
  -o /dev/null \
  -w "✅ Request: {} | Status: %{http_code} | Time: %{time_total}s\n"

4️⃣ Observe the inference in Action
Check the engine logs via stern to confirm that inference runs through both pods using a single weight storage:

# Watch the engine logs to see both pods responding
stern tinyllama-gpu -n vllm --tail 100 --no-follow --include 'POST|Engine' \
--exclude 'launcher|200 OK|health|metrics' --color always

🔎View the monitoring output from both vllm pods

+ vllm-gpu-tinyllama-gpu-deployment-vllm-x-pod1 › vllm
+ vllm-gpu-tinyllama-gpu-deployment-vllm-x-pod2 › vllm
vllm-gpu-tinyllama-gpu-deployment-vllm-x-pod1 vllm INFO 04-08 01:24:55 [loggers.py:111] Engine 000: Avg prompt throughput: 77.4 tokens/s, Avg generation throughput: 558.8 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
vllm-gpu-tinyllama-gpu-deployment-vllm-x-pod1 vllm INFO 04-08 01:26:05 [loggers.py:111] Engine 000: Avg prompt throughput: 12.6 tokens/s, Avg generation throughput: 191.2 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
vllm-gpu-tinyllama-gpu-deployment-vllm-x-pod2 vllm INFO 04-08 01:26:08 [loggers.py:111] Engine 000: Avg prompt throughput: 72.0 tokens/s, Avg generation throughput: 583.1 tokens/s, Running: 11 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.9%, Prefix cache hit rate: 0.0%
vllm-gpu-tinyllama-gpu-deployment-vllm-x-pod2 vllm INFO 04-08 01:41:28 [loggers.py:111] Engine 000: Avg prompt throughput: 82.8 tokens/s, Avg generation throughput: 567.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.5%, Prefix cache hit rate: 0.0%
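Reading these lines by eye gets tedious once you have more than two replicas. As a convenience, here is a minimal sketch that tallies the engine log lines per pod. The regex is an assumption based only on the line format shown above (it is not an official vLLM log schema), and the sample lines are abbreviated copies of that output.

```python
import re
from collections import defaultdict

# Abbreviated copies of the engine log lines shown above.
LOG_LINES = [
    "+ vllm-gpu-tinyllama-gpu-deployment-vllm-x-pod1 › vllm",
    "vllm-gpu-tinyllama-gpu-deployment-vllm-x-pod1 vllm INFO 04-08 01:24:55 [loggers.py:111] Engine 000: Avg prompt throughput: 77.4 tokens/s, Avg generation throughput: 558.8 tokens/s, Running: 7 reqs, Waiting: 0 reqs",
    "vllm-gpu-tinyllama-gpu-deployment-vllm-x-pod1 vllm INFO 04-08 01:26:05 [loggers.py:111] Engine 000: Avg prompt throughput: 12.6 tokens/s, Avg generation throughput: 191.2 tokens/s, Running: 0 reqs, Waiting: 0 reqs",
    "vllm-gpu-tinyllama-gpu-deployment-vllm-x-pod2 vllm INFO 04-08 01:26:08 [loggers.py:111] Engine 000: Avg prompt throughput: 72.0 tokens/s, Avg generation throughput: 583.1 tokens/s, Running: 11 reqs, Waiting: 0 reqs",
    "vllm-gpu-tinyllama-gpu-deployment-vllm-x-pod2 vllm INFO 04-08 01:41:28 [loggers.py:111] Engine 000: Avg prompt throughput: 82.8 tokens/s, Avg generation throughput: 567.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs",
]

# Assumed line shape: "<pod> vllm INFO ... Avg generation throughput: X tokens/s, Running: N reqs"
PATTERN = re.compile(
    r"^(?P<pod>\S+) vllm INFO .*?"
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs"
)


def summarize(lines):
    """Count engine samples per pod and track peak load and generation throughput."""
    stats = defaultdict(lambda: {"samples": 0, "peak_running": 0, "peak_gen_tps": 0.0})
    for line in lines:
        match = PATTERN.match(line)
        if match is None:
            continue  # skip stern banner lines and anything that is not an engine stat
        pod_stats = stats[match["pod"]]
        pod_stats["samples"] += 1
        pod_stats["peak_running"] = max(pod_stats["peak_running"], int(match["running"]))
        pod_stats["peak_gen_tps"] = max(pod_stats["peak_gen_tps"], float(match["gen"]))
    return dict(stats)


summary = summarize(LOG_LINES)
for pod, pod_stats in sorted(summary.items()):
    print(pod, pod_stats)
```

If both pods show non-zero running requests during the test, the router is spreading traffic across replicas that read from the same S3-backed weights.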

To benchmark vLLM Production Stack performance, check the multi-round QA tutorial.

5.2 Destroying the Infrastructure 🚧

To delete everything, run the command below. Note: you may need to run it twice, because the load balancer can be slow to delete.

terraform destroy -auto-approve
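
If you would rather not babysit the second run, the "run it twice" advice can be scripted with a small retry wrapper. This is a minimal sketch under stated assumptions: `retry` is a hypothetical helper (not part of the repository), and the attempt count and delay in the usage line are illustrative values, not tuned recommendations.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Runs CMD until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while :; do
    # Success: stop immediately and report it to the caller.
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s..." >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Usage (from the Terraform directory):
#   retry 2 30 terraform destroy -auto-approve
```

The wrapper only re-runs the command; if the second attempt also fails, check the cleanup notes below before retrying again.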

🫧 Cleanup Notes
If you hit job conflicts during Calico removal (e.g. * jobs.batch already exists), run the command below:

# Delete the leftover uninstall job manually first:
kubectl -n tigera-operator delete job tigera-operator-uninstall --ignore-not-found=true

Note: For the most common issues, see the Troubleshooting section.

Conclusion

Frozen model weights are dead weight, so why keep duplicating them per replica? Traditional EBS-backed vLLM deployments force storage to scale with compute, quietly bleeding money on redundant copies. Today we’ve seen how S3 Mountpoint breaks that pattern by scaling storage with models, not replicas, while streaming weights directly from S3 → GPU, cutting inference storage costs by up to 95% without sacrificing performance.
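
The "storage scales with models, not replicas" claim is easy to sanity-check with a few lines of arithmetic. Here is a minimal sketch using the simulator defaults from earlier in the post (3 models, 70 GB each, 10 replicas per model) and assumed list prices of $0.08/GB-month for gp3 and $0.023/GB-month for S3 Standard; check current pricing for your region. `storage_savings` is a hypothetical helper, and the sketch ignores S3 request charges and any gp3 IOPS/throughput add-ons.

```shell
# storage_savings MODELS MODEL_GB REPLICAS EBS_PRICE_PER_GB S3_PRICE_PER_GB
# Compares per-replica EBS volumes against a single shared S3 copy per model.
storage_savings() {
  awk -v m="$1" -v gb="$2" -v r="$3" -v ebs="$4" -v s3="$5" 'BEGIN {
    ebs_gb   = m * gb * r    # EBS: one full copy per replica
    s3_gb    = m * gb        # S3: one copy per model, shared by all replicas
    ebs_cost = ebs_gb * ebs  # monthly storage cost in USD
    s3_cost  = s3_gb * s3
    printf "ebs_gb=%d s3_gb=%d avoided_gb=%d\n", ebs_gb, s3_gb, ebs_gb - s3_gb
    printf "ebs_usd=%.2f s3_usd=%.2f saved_pct=%.1f\n", ebs_cost, s3_cost, 100 * (ebs_cost - s3_cost) / ebs_cost
  }'
}

# Simulator defaults with the assumed prices above.
storage_savings 3 70 10 0.08 0.023
```

With these inputs it reports 1,890 GB of storage avoided and roughly 97.1% saved versus gp3, in line with the simulator. Because the S3 side never multiplies by the replica count, the savings percentage grows as you add replicas.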

This isn’t just an AWS trick. The pattern is cloud-agnostic with equivalent object-storage CSI/FUSE drivers across cloud providers:

Azure: Blob Storage CSI driver
GCP: Cloud Storage FUSE CSI driver

For ultra-premium platforms where every millisecond matters, specialized storage layers like WEKA, VAST Data, or Alluxio may still justify the premium. But for most early production inference, object storage is the sweet spot: stop scaling your storage bill with every replica.

📚 Additional Resources

Run AI Your Way — In Your Cloud

Want full control over your AI backend? The CloudThrill VLLM Private Inference POC is still open — but not forever.

📢 Secure your spot (only a few left), 𝗔𝗽𝗽𝗹𝘆 𝗻𝗼𝘄!

Run AI assistants, RAG, or internal models on an AI backend 𝗽𝗿𝗶𝘃𝗮𝘁𝗲𝗹𝘆 𝗶𝗻 𝘆𝗼𝘂𝗿 𝗰𝗹𝗼𝘂𝗱 –
✅ No external APIs
✅ No vendor lock-in
✅ Total data control

Claim Your FREE VLLM POC

𝗬𝗼𝘂𝗿 𝗶𝗻𝗳𝗿𝗮. 𝗬𝗼𝘂𝗿 𝗺𝗼𝗱𝗲𝗹𝘀. 𝗬𝗼𝘂𝗿 𝗿𝘂𝗹𝗲𝘀…

vLLM on EKS: Cut LLM Storage Costs by 95% with S3 Mountpoint
