
Intro
When scaling AI models like DeepSeek or Qwen on Amazon EKS, engineering teams obsess over GPU utilization while quietly bleeding money on storage bloat. Because standard EBS volumes force a 1:1 replica-to-disk ratio, scaling a single 70GB model to 20 pods doesn’t cost 70GB. It forces you to provision 1.4 terabytes of redundant EBS storage.
But there’s a smarter way: shift the LLM storage tier from EBS to the Mountpoint for Amazon S3 CSI driver, and mount model weights directly into your vLLM pods as a shared ReadOnlyMany volume. This eliminates duplicate storage, centralizes your model registry, speeds up pod scaling (weights stream directly from S3 → GPU), and cuts your inference storage bill by up to 95%.
Today, we’ll build that architecture with vLLM on Amazon EKS and show why S3 is often the best storage tier for the job.
I. The Storage Problem
1.1 The “EBS storage Tax”: Why Scaling vLLM is Broken
In Kubernetes, every EBS-backed volume is ReadWriteOnce and locked to a single node. This creates three operational penalties:
💡S3 Mountpoint eliminates all three penalties, so compute scales with replicas and storage scales with models.
1.2 EBS vs. S3 Mountpoint: Cost Savings Simulator
To see how bad this gets in practice, the cost simulator below compares how much money the EBS approach wastes:
[Interactive widget: AWS Storage Cost Visualizer. Compares EBS scaling vs. S3 Mountpoint for LLM deployments by storage class, provisioned storage, and monthly cost.]
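The comparison behind the simulator can be approximated with a few lines of Python. This is a minimal sketch, not the widget's actual logic: the two per-GB prices are assumed list prices (gp3 and S3 Standard) and should be replaced with current pricing for your region, and request, throughput, and IOPS charges are ignored.

```python
from dataclasses import dataclass

# Assumed list prices in USD per GB-month. Check current AWS pricing for your region.
EBS_GP3_PRICE = 0.08       # assumption: gp3 storage only, no extra IOPS/throughput
S3_STANDARD_PRICE = 0.023  # assumption: S3 Standard, request costs excluded


@dataclass
class StorageEstimate:
    storage_class: str
    provisioned_gb: float
    monthly_cost: float


def compare_storage(model_gb: float, replicas: int) -> list[StorageEstimate]:
    """EBS needs one full copy of the model per replica; S3 stores a single copy."""
    ebs_gb = model_gb * replicas
    s3_gb = model_gb
    return [
        StorageEstimate("EBS (gp3)", ebs_gb, round(ebs_gb * EBS_GP3_PRICE, 2)),
        StorageEstimate("S3 Mountpoint", s3_gb, round(s3_gb * S3_STANDARD_PRICE, 2)),
    ]


if __name__ == "__main__":
    # The intro's example: one 70GB model scaled to 20 pods.
    print("| Storage Class | Provisioned Storage | Monthly Cost |")
    print("|---|---|---|")
    for row in compare_storage(model_gb=70, replicas=20):
        print(f"| {row.storage_class} | {row.provisioned_gb:.0f} GB | ${row.monthly_cost:.2f} |")
```

For the intro's example, the provisioned storage comes out at 1,400 GB on EBS against 70 GB on S3. The dollar figures depend entirely on the assumed prices.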
1.3 Why Not Just Use EFS?
Amazon EFS might seem like a good idea: shared storage, no duplication. But it has an insane throughput pricing problem. You need THIS MUCH cash to afford EFS throughput 👇🏻.

Why EFS Doesn’t Help You:
EFS charges separately for throughput because it’s designed for frequent random I/O. Model weights are large sequential reads done once at pod startup. You’re paying for features you don’t need.
The Cost Reality S3 vs EFS/EBS:
- EBS has a scaling problem while EFS has a throughput pricing problem
When to use each:
II. Implementation (Try It Yourself)
2.1 📂 Project Structure
💡You can find the code in our official repo ➡️ cloudthrill-vllm-production-stack-terraform.
./
├── main.tf
├── network.tf
├── storage.tf <<-- S3 Mount integration
├── provider.tf
├── variables.tf
├── output.tf
├── cluster-tools.tf
├── datasources.tf
├── iam_role.tf
├── vllm-production-stack.tf
├── env-vars.template
├── terraform.tfvars.template
├── modules/
│   └── llm-stack/
│       └── helm/
│           ├── cpu/
│           └── gpu/
│               └── gpu-tinyllama-light-ingress-s3.tpl # <<-- our vLLM chart using S3 mount
├── config/
│   ├── calico-values.tpl
│   └── kubeconfig.tpl
└── README.md
2.2 🧰Prerequisites
Before you begin, ensure you have the following:
| Tool | Version-tested | Purpose |
|---|---|---|
| Terraform | ≥ 1.5.7 | Infrastructure provisioning |
| AWS CLI v2 | ≥ 2.16 | AWS authentication |
| kubectl | ≥ 1.30 | Kubernetes management |
| helm | ≥ 3.14 | Used for Helm chart debugging |
| jq | latest | JSON parsing (optional) |
Follow the steps below to install the tools 👇🏼
# Install tools
sudo apt update && sudo apt install -y jq curl unzip gpg
wget -qO- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install -y terraform
curl -s "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && unzip -q awscliv2.zip && sudo ./aws/install && rm -rf aws awscliv2.zip
curl -sLO "https://dl.k8s.io/release/$(curl -Ls https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl" && sudo install kubectl /usr/local/bin/ && rm kubectl
curl -s https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg >/dev/null && echo "deb [signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm.list && sudo apt update && sudo apt install -y helm
Configure AWS profile
aws configure --profile myprofile
export AWS_PROFILE=myprofile # ← If null, Terraform exec auth will use the default profile
2.3 Core Infrastructure Components 📦
Architecture Overview

Deployment layers – The stack provisions infrastructure in logical layers that adapt based on your hardware choice:
| Phase | Component | Action | Condition |
|---|---|---|---|
| 1. Infrastructure | VPC | Provision VPC with 3 public + 3 private subnets | Always |
| | EKS | Deploy v1.30 cluster + CPU system node group | Always |
| | CNI | Remove aws-node and install Calico overlay (VXLAN) | Always |
| | Add-ons | Deploy ALB controller, kube-prometheus, and core EKS add-ons | Always |
| 2. LLM Storage | S3 Bucket | Create model bucket and bootstrap weights from Hugging Face if missing | enable_s3_model_storage = true |
| | S3 CSI Driver | Install Mountpoint S3 CSI Driver and attach S3 IAM role | enable_s3_csi_driver = true |
| | S3 PV/PVC | Create prefix-scoped PV/PVC targeting the selected model path | enable_s3_model_storage = true |
| 3. vLLM Stack | HF Secret | Create hf-token-secret for gated model access / bootstrap | enable_vllm = true |
| | GPU Infrastructure | Provision GPU node group | inference_hardware = "gpu" |
| | GPU Operator | Deploy NVIDIA plugin / operator | inference_hardware = "gpu" |
| | Application | Deploy validated TinyLlama-1.1B Helm chart to vllm namespace | enable_vllm = true |
| 4. Networking | Load Balancer (Optional) | Configure ALB and ingress for external access | enable_lb_ctl = true |
III. S3 vLLM Mountpoint Walkthrough
This S3-optimized stack delivers three key features in the current test setup (a single GPU node with one NVIDIA L4, 24GB):
| Key Feature | What It Does |
|---|---|
| S3-Native Streaming | Mountpoint CSI streams weights directly from S3 → GPU VRAM (no EBS overhead, no duplication). |
| Multi-Replica GPU Sharing | Bypasses K8s hardware lock and partitions VRAM, allowing multi-replicas per GPU. |
| Automated Bootstrapping | Auto-downloads models from HuggingFace to S3 with idempotency checks. |
Mountpoint CSI exposes S3 as EKS Persistent Volumes, with IRSA granting the CSI Driver scoped read-only bucket access.
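To make "scoped read-only bucket access" concrete, here is a minimal Python sketch that builds such a policy document. It is an illustration under assumptions, not the policy the stack actually ships: the helper name `readonly_policy` is hypothetical, the bucket and prefix values are the examples used elsewhere in this post, and the statement layout (ListBucket on the bucket, GetObject on the prefix) is the usual shape for a read-only mount.

```python
import json


def readonly_policy(bucket: str, prefix: str) -> dict:
    """Build a read-only IAM policy scoped to a single bucket prefix (hypothetical helper)."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket, restricted to keys under the prefix.
                "Sid": "ListModelPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Reads are granted on objects under the prefix only. No write actions.
                "Sid": "ReadModelObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # Example values taken from this post's env-vars sample; replace with your own.
    print(json.dumps(readonly_policy("vllm-cloudthrill", "models"), indent=2))
```

Because the policy contains no `s3:PutObject` or `s3:DeleteObject`, a pod mounting the volume cannot modify the shared weights even if the mount were not marked read-only.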
View CSI driver snippet →
# cluster-tools.tf snippet
# 1. Deploy CSI Driver with IRSA & GPU Tolerations
module "eks_addons" {
  source = ".."
  # ...
  eks_addons = {
    helm_releases = var.enable_s3_csi_driver ? {
      aws-mountpoint-s3-csi-driver = {
        namespace  = "kube-system"
        chart      = "aws-mountpoint-s3-csi-driver"
        repository = "https://awslabs.github.io/mountpoint-s3-csi-driver"
        values = [yamlencode({
          node = {
            # IRSA: bind the node service account to the S3 CSI IAM role
            serviceAccount = {
              annotations = {
                "eks.amazonaws.com/role-arn" = aws_iam_role.s3_csi_driver[0].arn
              }
            }
            # Let the CSI node pods run on tainted GPU nodes
            tolerations = [{
              key      = "nvidia.com/gpu"
              operator = "Exists"
              effect   = "NoSchedule"
            }]
          }
        })]
      }
    } : {}
    # ...
  }
}
#--------------------------------------------------------------
# Static PV/PVC for Mountpoint S3 CSI
# Mount only the models/ prefix from the bucket
# --- storage.tf snippet
#--------------------------------------------------------------
# 2. The IAM Role for the Service Account (IRSA)
resource "aws_iam_role" "s3_csi_driver" {
  count              = var.enable_s3_csi_driver ? 1 : 0
  name_prefix        = "${module.eks.cluster_name}-s3-csi-driver-"
  assume_role_policy = data.aws_iam_policy_document.s3_csi_driver_assume_role.json
}
# 3. Scoped Read-Only Policy
resource "aws_iam_policy" "s3_csi_driver_readonly" {
  count       = var.enable_s3_csi_driver ? 1 : 0 # required: the attachment below indexes [0]
  name_prefix = "${module.eks.cluster_name}-s3-csi-read-"
  policy      = data.aws_iam_policy_document.s3_csi_driver_readonly.json
}
# 4. Attaching Scoped Access to the CSI Role
resource "aws_iam_role_policy_attachment" "s3_csi_driver_readonly" {
  count      = var.enable_s3_csi_driver ? 1 : 0
  role       = aws_iam_role.s3_csi_driver[0].name
  policy_arn = aws_iam_policy.s3_csi_driver_readonly[0].arn
}
The storage layer creates an S3 bucket and provisions Kubernetes PV/PVC resources that mount the specific S3 prefixes.
View storage provisioning snippet →
# storage.tf snippet
# 1. The Global Model Registry
resource "aws_s3_bucket" "vllm_models" {
  count         = var.enable_s3_model_storage && var.create_s3_bucket ? 1 : 0
  bucket        = var.s3_bucket
  force_destroy = true
}
# 2. The Kubernetes Bridge (PV)
resource "kubernetes_persistent_volume" "s3_models" {
  count = var.enable_vllm && var.enable_s3_model_storage ? 1 : 0
  spec {
    access_modes = ["ReadOnlyMany"]
    # ...
    mount_options = [
      "region ${var.region}",
      "prefix ${local.model_s3_paths["tiny"]}/"
    ]
  }
}
# 3. The Pod-Facing Claim (PVC)
resource "kubernetes_persistent_volume_claim" "s3_models" {
  count = var.enable_vllm && var.enable_s3_model_storage ? 1 : 0
  # ...
  spec {
    access_modes       = ["ReadOnlyMany"]
    storage_class_name = ""
    volume_name        = kubernetes_persistent_volume.s3_models[0].metadata[0].name
    # ...
  }
}
# --snip
The bootstrap script checks whether the model already exists in S3 before deployment. If the prefix is empty, it downloads the model from Hugging Face and syncs it to S3.
View S3 bootstrap snippet →
# storage.tf snippet...
echo "Checking if model exists in s3://$BUCKET/$PREFIX/"
# ...
if [ "$HAS_CONFIG" -eq 1 ] && [ "$SIZE" -gt "$MIN_SIZE_BYTES" ]; then
  echo "Model already exists in S3. Skipping bootstrap."
  exit 0
fi
echo "Model not found or incomplete. Downloading from Hugging Face: $HF_MODEL"
hf download "$HF_MODEL" --local-dir "$TMP_DIR"
# ...snip
| Step | Action | Benefit |
|---|---|---|
| HuggingFace Download | Auto-downloads model if S3 prefix vllm/models/$MyModel is empty | Zero manual setup |
| Idempotency Checks | Validates config.json + minimum S3 sub-directory size | Prevents duplicate uploads |
| Dependency Sequencing | Helm waits for bootstrap completion | Eliminates race conditions |
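The idempotency check in the table can be expressed as a small pure function. This is a sketch of the same decision rule, not the script the stack runs: `needs_bootstrap` and its arguments are hypothetical names, and the object listing is passed in as a plain dict so the logic can be exercised without AWS credentials.

```python
def needs_bootstrap(objects: dict[str, int], prefix: str, min_size_bytes: int) -> bool:
    """Return True when the model under `prefix` is missing or looks incomplete.

    `objects` maps S3 keys to sizes in bytes, e.g. built from a bucket listing.
    Mirrors the shell check: skip only if config.json exists AND the total
    size under the prefix exceeds the minimum.
    """
    prefix = prefix.rstrip("/") + "/"
    under_prefix = {key: size for key, size in objects.items() if key.startswith(prefix)}
    has_config = f"{prefix}config.json" in under_prefix
    total_size = sum(under_prefix.values())
    return not (has_config and total_size > min_size_bytes)


if __name__ == "__main__":
    # Hypothetical listing: a config file plus one weights shard.
    listing = {
        "models/tiny-llama/config.json": 600,
        "models/tiny-llama/model.safetensors": 2_200_000_000,
    }
    print(needs_bootstrap(listing, "models/tiny-llama", min_size_bytes=1_000_000_000))
```

Checking both conditions matters: a `config.json` on its own would also be present after an upload that was interrupted before the weights finished syncing.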
Final S3 storage layout:
S3 bucket (vllm) # bucket name must be globally unique, e.g. vllm-123456
└── models/
    ├── tiny-llama/
    ├── llama-3/
    └── qwen-3/
EKS
└── Mountpoint CSI
    └── Mount s3://vllm/models → /models
vLLM pods
└── modelURL = /models/tiny-llama
3.2 Multi-Replica GPU Sharing
To run 2 replicas on a single NVIDIA L4, we drop the requestGPU setting, which bypasses Kubernetes’ exclusive device lock, and let vLLM manage VRAM directly.
🔴View Helm values snippet →
modelSpec:
  - name: "tinyllama-gpu"
    replicaCount: 2
    # requestGPU: 1 # REMOVED - bypasses K8s device lock
    nodeSelectorTerms:
      - matchExpressions:
          - key: workload-type
            operator: "In"
            values: ["gpu"]
    vllmConfig:
      extraArgs:
        - "--gpu-memory-utilization=0.4" # 0.4 × 2 = 80% VRAM usage
    extraVolumes:
      - name: s3-model-storage
        persistentVolumeClaim:
          claimName: vllm-s3-claim
    extraVolumeMounts:
      - name: s3-model-storage
        mountPath: /models/tiny-llama
        readOnly: true
nodeSelectorTerms pins pods to GPU nodes, and the --gpu-memory-utilization flag controls the VRAM allocation per pod.
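The 0.4 × 2 = 80% split can be generalized to other replica counts. The sketch below is an illustration only: the function names and the 20% headroom default are assumptions chosen to reproduce the value used in this post, and it works in whole percentage points to avoid floating-point drift.

```python
def gpu_memory_utilization(replicas: int, headroom_pct: int = 20) -> float:
    """Per-replica value for --gpu-memory-utilization when sharing one GPU.

    `headroom_pct` is the share of VRAM left unallocated (assumed 20 here,
    matching the 80% total used in this post). The result is rounded down
    so the replicas never exceed the budget in total.
    """
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    if not 0 <= headroom_pct < 100:
        raise ValueError("headroom_pct must be in [0, 100)")
    per_replica_pct = (100 - headroom_pct) // replicas
    return per_replica_pct / 100


def vram_per_replica_gb(total_vram_gb: float, replicas: int, headroom_pct: int = 20) -> float:
    """VRAM budget per replica in GB, rounded to one decimal."""
    return round(total_vram_gb * gpu_memory_utilization(replicas, headroom_pct), 1)


if __name__ == "__main__":
    # The setup from this post: 2 replicas on a single 24GB L4.
    print(gpu_memory_utilization(2))   # 0.4
    print(vram_per_replica_gb(24, 2))  # 9.6
```

Each vLLM process sizes its weights and KV cache within its own fraction, so the sum across replicas has to stay below 1.0 with some room left for CUDA context overhead.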
IV. 🔵Getting started
The following configuration was selected to validate the S3-native streaming and multi-replica GPU sharing logic:
| Feature | Configuration Details |
|---|---|
| ✅ Model | TinyLlama-1.1B (Default, customizable via Helm) |
| ✅ vLLM Load Balancing | Round-robin router service across replicas |
| ✅ Storage | S3-Mount PVC mapped to /models/<mymodel> |
| ✅ Monitoring | Prometheus metrics enabled for observability |
4.1 Deployment Steps
1️⃣Clone the repository
The vLLM EKS-S3 deployment build is located under aws/eks-s3-mount (see below):
$ git clone https://github.com/CloudThrill/vllm-production-stack-terraform
$ cd vllm-production-stack-terraform/aws/eks-s3-mount/
2️⃣ Set Up Environment Variables
Use an env-vars file to export your TF_VARS, or use terraform.tfvars. Replace placeholders with your values:
cp env-vars.template env-vars
vim env-vars # Set HF token and customize deployment options
source env-vars
🛠️Configuration knobs
This stack provides extensive customization options to tailor your deployment:
| Variable | Tested Value | Effect |
|---|---|---|
| inference_hardware | "gpu" | Required to provision GPU-optimized node groups. |
| gpu_node_instance_types | '["g6.2xlarge"]' | Selects NVIDIA L4 instances (24GB VRAM) for partitioning. |
| enable_s3_model_storage | true | Enables the S3 back-end logic for weight delivery. |
| enable_s3_csi_driver | true | Deploys the Mountpoint for Amazon S3 CSI driver. |
| s3_bucket | "vllm-unique-id" | Target S3 bucket for the model registry. |
| huggingface_model_id | "TinyLlama/TinyLlama-1.1B-Chat-v1.0" | The specific model source for automated sync. |
| hf_token | "your-token" | Auth token for private or gated HF models. |
📓This is just a subset. For the full list of 20+ configurable variables, consult the configuration template : env-vars.template
Usage examples
# Copy and customize
$ cp env-vars.template env-vars
$ vi env-vars
################################################################################
# ☸️ EKS cluster basics
################################################################################
export TF_VAR_cluster_name="vllm-eks-prod" # default: "vllm-eks-prod"
export TF_VAR_cluster_version="1.32" # default: "1.30" - Kubernetes cluster version
export TF_VAR_gpu_node_instance_types='["g6.2xlarge"]'
################################################################################
# 💽 S3 Model Storage
################################################################################
export TF_VAR_enable_s3_csi_driver=true
export TF_VAR_enable_s3_model_storage=true
export TF_VAR_create_s3_bucket=true
export TF_VAR_s3_bucket="vllm-cloudthrill" # CHANGE ME (must be unique globally i.e vllm-1234)
export TF_VAR_s3_models_prefix="models"
export TF_VAR_s3_csi_driver_version="1.10.0"
export TF_VAR_huggingface_model_id="TinyLlama/TinyLlama-1.1B-Chat-v1.0" # required for this lab
################################################################################
# 🧠 LLM Inference Configuration
################################################################################
export TF_VAR_enable_vllm="true" # default: "false" - Set to "true" to deploy vLLM
export TF_VAR_hf_token="" # default: "" - Hugging Face token for model download (if needed)
export TF_VAR_inference_hardware="gpu" # must be "gpu"
# Paths to VLLM Helm chart values templates.
# DO NOT Change below for this lab
# export TF_VAR_gpu_vllm_helm_config="./modules/llm-stack/helm/gpu/gpu-tinyllama-light-ingress-3.tpl"
################################################################################
# ⚙️ Node-group sizing
################################################################################
# CPU pool (always present)
export TF_VAR_cpu_node_min_size="1" # default: 1
export TF_VAR_cpu_node_max_size="3" # default: 3
export TF_VAR_cpu_node_desired_size="2" # default: 2
# GPU pool (ignored unless inference_hardware = "gpu")
export TF_VAR_gpu_node_min_size="1" # default: 1
export TF_VAR_gpu_node_max_size="1" # default: 1
export TF_VAR_gpu_node_desired_size="1" # default: 1
# ...snip
Make sure to load the variables into your shell before running Terraform by sourcing the env-vars file:
$ source env-vars
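Terraform picks up any environment variable named `TF_VAR_<name>` as the input variable `<name>`, so a forgotten `source env-vars` silently falls back to defaults. A small preflight check can catch that before `terraform plan`. This is a hypothetical helper, not part of the repo, and the list of required variables is an assumption based on the values this lab marks as required.

```python
import os
from typing import Mapping, Sequence

# Assumption: the variables this lab needs to be set explicitly.
REQUIRED_TF_VARS = (
    "TF_VAR_s3_bucket",
    "TF_VAR_huggingface_model_id",
    "TF_VAR_inference_hardware",
    "TF_VAR_enable_s3_csi_driver",
    "TF_VAR_enable_s3_model_storage",
)


def missing_tf_vars(env: Mapping[str, str],
                    required: Sequence[str] = REQUIRED_TF_VARS) -> list[str]:
    """Return the required TF_VAR_* names that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_tf_vars(os.environ)
    if missing:
        print("Missing variables (did you run `source env-vars`?):")
        for name in missing:
            print(f"  - {name}")
    else:
        print("All required TF_VAR_* variables are set.")
```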
3️⃣ Run Terraform deployment:
You can now run Terraform plan & apply which will deploy 110 resources in total, including shared S3-mount LLM storage:
terraform init
terraform plan
terraform apply
View Full output summary containing your INFRA & S3 Storage info, along with API endpoints.
Apply complete! Resources: 110 added, 0 changed, 0 destroyed.
Outputs:
aws_vllm_stack_summary = <<EOT
✅ AWS EKS Cluster deployed successfully!
🚀 VLLM PRODUCTION STACK ON AWS EKS 🚀
-----------------------------------------------------------
REGION : us-east-2
AVAILABILITY ZONES : us-east-2a, us-east-2b, us-east-2c
API ENDPOINT : https://XXXXXXXXXX.gr7.us-east-2.eks.amazonaws.com
VPC ID : vpc-09a8ebe863defea50 (10.20.0.0/16)
🖥️ INFRASTRUCTURE & STORAGE
-----------------------------------------------------------
CPU NODES : [t3.xlarge]
GPU NODES : [g6.2xlarge]
S3 MODEL BUCKET : vllm-cloudthrill
S3 CSI ROLE : arn:aws:iam::xxxxxxxxxxx:role/vllm-eks-prod-s3-csi-driver-xxxx
🧠 MODEL CONFIGURATION
-----------------------------------------------------------
HF SOURCE ID : TinyLlama/TinyLlama-1.1B-Chat-v1.0
API MODEL URL : /models/tiny-llama
🌐 ACCESS ENDPOINTS
-----------------------------------------------------------
VLLM API URL : Disabled
GRAFANA FORWARD : kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-p.
🛠️ QUICK START COMMANDS
-----------------------------------------------------------
1. Set Context : export KUBECONFIG="./kubeconfig"
2. Test API : curl -k "<VLLM_API_URL>/v1/models"
Built with ❤️ by @Cloudthrill
KUBECONFIG: After the deployment, you should be able to interact with the cluster using kubectl:
export KUBECONFIG=$PWD/kubeconfig
4️⃣ Observability (Grafana)
Upon deployment, you can access the Grafana dashboards using port forwarding. URL → "http://localhost:3000"
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n kube-prometheus-stack
# Run the below command to fetch the password
kubectl get secret -n kube-prometheus-stack kube-prometheus-stack-grafana \
  -o jsonpath="{.data.admin-password}" | base64 --decode
Automatic vLLM Dashboard
The vLLM dashboard and service monitor are automatically configured for Grafana. See VLLM Dashboard

V. Testing & Validation
5.1 Shared S3-Mount Inference
1️⃣ Export Router API Endpoint
# Case 1 : Port forwarding (run the port-forward in a separate terminal, it blocks)
kubectl -n vllm port-forward svc/vllm-gpu-router-service 30080:80
export vllm_api_url=http://localhost:30080/v1
2️⃣ List models
# ---- check models
curl -s ${vllm_api_url}/models | jq .
3️⃣ Generate Round-Robin inference Workload
# Send a barrage of concurrent prompts to test the round-robin distribution
# (-I {} already runs one curl per input line; -P 25 keeps 25 in flight)
seq 1 100 | xargs -P 25 -I {} curl -s -X POST "$vllm_api_url/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/models/tiny-llama",
    "prompt": "Explain the architecture of Kubernetes and how it schedules pods in detail:",
    "max_tokens": 150
  }' \
  -o /dev/null \
  -w "✅ Request: {} | Status: %{http_code} | Time: %{time_total}s\n"
4️⃣ Observe the inference in Action
Check the engine logs with stern to confirm that inference runs through both pods while sharing a single copy of the weights:
# Watch the engine logs to see both pods responding
stern tinyllama-gpu -n vllm --tail 100 --no-follow --include 'POST|Engine' \
  --exclude 'launcher|200 OK|health|metrics' --color always
🔎View the monitoring output from both vllm pods
+ vllm-gpu-tinyllama-gpu-deployment-vllm-x-pod1 › vllm
+ vllm-gpu-tinyllama-gpu-deployment-vllm-x-pod2 › vllm
vllm-gpu-tinyllama-gpu-deployment-vllm-x-pod1 vllm INFO 04-08 01:24:55 [loggers.py:111] Engine 000: Avg prompt throughput: 77.4 tokens/s, Avg generation throughput: 558.8 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
vllm-gpu-tinyllama-gpu-deployment-vllm-x-pod1 vllm INFO 04-08 01:26:05 [loggers.py:111] Engine 000: Avg prompt throughput: 12.6 tokens/s, Avg generation throughput: 191.2 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
vllm-gpu-tinyllama-gpu-deployment-vllm-x-pod2 vllm INFO 04-08 01:26:08 [loggers.py:111] Engine 000: Avg prompt throughput: 72.0 tokens/s, Avg generation throughput: 583.1 tokens/s, Running: 11 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.9%, Prefix cache hit rate: 0.0%
vllm-gpu-tinyllama-gpu-deployment-vllm-x-pod2 vllm INFO 04-08 01:41:28 [loggers.py:111] Engine 000: Avg prompt throughput: 82.8 tokens/s, Avg generation throughput: 567.0 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.5%, Prefix cache hit rate: 0.0%
For benchmarking vLLM Production Stack performance, check the multi-round QA tutorial.
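To confirm at a glance that both replicas took traffic, the engine log lines above can be tallied per pod. This is a minimal sketch that assumes the plain-text line format shown in this post (pod name, container name, then the vLLM stats message) and output captured without ANSI color codes. The function name `summarize` is hypothetical.

```python
import re
from collections import defaultdict

# Matches lines like:
#   <pod> vllm INFO ... Avg generation throughput: 558.8 tokens/s, Running: 7 reqs, ...
LINE_RE = re.compile(
    r"^(?P<pod>\S+)\s+vllm\s+INFO.*?"
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs"
)


def summarize(lines):
    """Per pod: number of stats samples, peak generation tokens/s, peak running requests."""
    stats = defaultdict(lambda: {"samples": 0, "peak_gen_tps": 0.0, "peak_running": 0})
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip stern headers and unrelated log lines
        entry = stats[match["pod"]]
        entry["samples"] += 1
        entry["peak_gen_tps"] = max(entry["peak_gen_tps"], float(match["gen"]))
        entry["peak_running"] = max(entry["peak_running"], int(match["running"]))
    return dict(stats)


if __name__ == "__main__":
    import sys
    # Usage: stern ... --color never | python summarize_logs.py
    for pod, entry in summarize(sys.stdin).items():
        print(pod, entry)
```

If only one pod shows up in the summary, the router is not distributing requests, or the second replica never became ready.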
5.2 Destroying the Infrastructure 🚧
To delete everything, just run the command below (note: you sometimes need to run it twice because the load balancer can be slow to delete):
terraform destroy -auto-approve
🫧 Cleanup Notes
If you encounter job conflicts during Calico removal (e.g. * jobs.batch already exists), delete the jobs manually first:
# use the following commands to delete the jobs manually first:
kubectl -n tigera-operator delete job tigera-operator-uninstall --ignore-not-found=true
Conclusion
Frozen model weights are dead weight, so why keep duplicating them per replica? Traditional EBS-backed vLLM deployments force storage to scale with compute, quietly bleeding money on redundant storage. Today we’ve seen how S3 Mountpoint breaks that pattern by scaling storage with models, not replicas, while streaming weights directly from S3 → GPU, cutting inference storage costs by up to 95% without sacrificing performance.
This isn’t just an AWS trick. The pattern is cloud-agnostic with equivalent object-storage CSI/FUSE drivers across cloud providers:
- Azure: Blob Storage CSI driver
- GCP: Cloud Storage FUSE CSI driver
For ultra-premium platforms where every millisecond matters, specialized storage layers like WEKA, VAST Data, or Alluxio may still justify the premium. But for most early production inference, object storage is the sweet spot. Stop scaling your bill.