NVIDIA’s Context Memory Storage (CMX): The KV Cache War

Intro A few weeks ago I wrote about why Intel Optane Persistent Memory was the ideal technology for LLM KV-cache offloading: it offered near-DRAM latency and was natively non-volatile. In other words, it behaved like memory but survived reboots. I also explained why CXL wasn’t quite its performance equivalent, due to higher latency and lack of persistence. But recently …

What is vLLM-Omni? Beginners Intro

Intro Any-to-any multimodal models combining text, images, video, and audio are advancing AI, but their complex architectures, which mix autoregressive LLMs and diffusion transformers, make efficient serving very difficult. Current systems like OpenAI’s ChatGPT (text) and Sora (video) run as separate engines, lacking unified any-to-any pipelines. vLLM-Omni addresses exactly that with a fully disaggregated serving system …

vLLM production-stack: Deployment in the cloud (Part 2)

Intro In the previous post, we explored how the vLLM Production-Stack upgrades the vanilla vLLM engine to an enterprise-grade platform. This time, we’ll crack open the Helm chart, decoding the key knobs in values.yaml and showing deployment recipes that span from a minimal install to full cloud setups. Acknowledgment: While authored independently, this series benefited from …

vLLM Production Stack on Nebius K8s with Terraform 🧑🏼‍🚀

Intro The vLLM Production Stack is designed to work across any cloud provider with Kubernetes. After covering AWS EKS, Azure AKS, and Google Cloud GKE implementations, today we’re deploying vLLM production-stack on Nebius Managed Kubernetes (MK8s) with the same Terraform approach. Nebius AI Cloud is purpose-built for AI/ML workloads, offering cutting-edge GPU options from NVIDIA …

vLLM Production Stack on GCP GKE with Terraform 🧑🏼‍🚀

Intro Welcome back to the Terraform vLLM Production Stack series! After covering AWS EKS and Azure AKS, today we’re deploying vLLM production-stack on Google Cloud GKE with the same Terraform approach. This guide shows you how to deploy a production-ready LLM serving environment on Google Cloud, with GCP-specific optimizations including Dataplane V2 (Cilium eBPF), VPC-native …