NVIDIA’s Context Memory Storage (CMX): The KV Cache War

Intro

A few weeks ago I wrote about why Intel Optane Persistent Memory was the ideal technology for LLM KV-cache offloading: near-DRAM latency and native non-volatility. In other words, it behaved like memory but survived reboots. I also explained why CXL wasn’t quite its equivalent, due to higher latency and lack of persistence.

But recently other big players like NVIDIA have started to take a crack at this challenge. The hardware implementation of NVIDIA's solution was called Inference Context Memory Storage (ICMS) and has since been renamed the CMX platform.

I. KV Cache: The New Inference Bottleneck

Every time an LLM processes a prompt, it saves its internal thinking process (Key & Value tensors) in a super-fast scratchpad called the KV Cache. Reusing that saved state instead of recomputing it shortens the wait before the first word appears in your chat output (Time to First Token, TTFT).

💡If you or your agent ask similar questions or use RAG with repetitive prompts, the cached “context” is reused. Your LLM skips recomputing the shared part of the prompt and starts generating the first word of its response much sooner.

KV Cache is explained further in our simplified blog post: ✍🏼 kv_cache Explained like I’m 5

The Bottleneck:

  • GPU is Already Full: The GPU’s ultra-fast memory (HBM) is already largely consumed by the model weights.
  • Cache Overload: On top of that, the AI’s working memory (KV Cache) grows with every token of context.
  • Limits Scale: Larger contexts and more users multiply the memory pressure, making KV Cache the bottleneck.

🟥 The new inference wall isn’t compute. It’s Context (KV Cache).
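To put rough numbers on that wall, here is a back-of-the-envelope sketch of how KV Cache size scales. The formula is the standard one (one Key and one Value vector per layer, per KV head, per token); the model shape below is a hypothetical 70B-class configuration with grouped-query attention and 16-bit values, chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Bytes of KV cache for one sequence.

    Each token stores a Key and a Value vector (hence the factor 2)
    for every layer and every KV head.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


# Hypothetical model shape (assumption, not taken from any vendor spec).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, num_tokens=1)
one_session = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, num_tokens=128_000)

print(f"per token:          {per_token / 1024:.0f} KiB")
print(f"128k-token session: {one_session / 2**30:.1f} GiB")
print(f"32 such sessions:   {32 * one_session / 2**30:.0f} GiB")
```

Under these assumptions a single long-context session already needs tens of GiB, and the total grows in proportion to both context length and the number of concurrent users.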

KV Cache Offloading

The answer to the KV Cache scaling challenge in LLM inference is KV Cache Offloading: moving inactive, older parts of a huge KV Cache out of expensive GPU memory into cheaper, larger storage such as CPU RAM, SSD, or S3. When the data is needed again, it’s swapped back onto the GPU.

The tiers consist of:

  • G1 (GPU HBM): for hot, latency-critical KV used in active generation
  • G2 (System DRAM): for staging and buffering KV off HBM
  • G3 (Local SSDs/NVMe): for warm KV that is reused over shorter timescales
  • G4 (Shared Storage), including cache DBs (Redis): for cold artifacts, history, and durable/non-critical results
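The tier flow above can be sketched as a small multi-level cache. This is a toy model, not how any particular engine implements it: the `Tier` and `TieredKVCache` classes, the LRU policy, and the tiny capacities are all assumptions made for illustration. Entries evicted from a fast tier cascade down to the next one, and a hit in a slow tier promotes the entry back to G1.

```python
from collections import OrderedDict


class Tier:
    """One cache tier with LRU eviction and a fixed entry capacity."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        """Store an entry; return the evicted (key, value) pair, or None."""
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            return self.entries.popitem(last=False)  # least recently used
        return None


class TieredKVCache:
    def __init__(self, tiers):
        self.tiers = tiers  # ordered fastest (G1) to slowest (G4)

    def put(self, key, value, level=0):
        """Insert at `level`; evictions cascade down to slower tiers."""
        while level < len(self.tiers):
            evicted = self.tiers[level].put(key, value)
            if evicted is None:
                return
            key, value = evicted
            level += 1
        # Anything evicted from the last tier is dropped.

    def get(self, key):
        """Return (value, tier_name); promote to the fastest tier on a hit."""
        for level, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                if level > 0:
                    del tier.entries[key]
                    self.put(key, value)
                return value, tier.name
        return None, None


cache = TieredKVCache([Tier("G1", 2), Tier("G2", 2), Tier("G3", 4), Tier("G4", 8)])
for i in range(5):
    cache.put(f"session-{i}", f"kv-{i}")

print(cache.get("session-0"))  # found in a slower tier, then promoted
print(cache.get("session-0"))  # now served from G1
```

A real system moves tensor blocks rather than strings and sizes tiers in bytes, but the promote-on-hit, demote-on-evict flow is the same idea.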

The Offload Trap

The downside is the target medium itself, as the success of offloading relies entirely on its speed and persistence:

  • DRAM (fast but expensive): great speed, but high cost and lack of persistence make it impractical for large caches.
  • SSD/NVMe (cheap but slow): great capacity, but higher latency erodes the Time-to-First-Token (TTFT) benefit.

Note: This is exactly where Optane once fit perfectly, and where NVIDIA is now trying something new.
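One way to see the trap is to compare the time to reload a cached context against the time to simply recompute it. The sketch below is a simplified model: reload time is first-byte latency plus transfer time, recompute time is prompt length divided by prefill throughput. Every number in it (latencies, bandwidths, prefill speed, 320 KiB of KV per token) is an illustrative assumption, not a measurement.

```python
def reload_seconds(kv_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to bring an offloaded KV cache back: latency plus transfer time."""
    return latency_s + kv_bytes / bandwidth_bytes_per_s


def recompute_seconds(num_tokens, prefill_tokens_per_s):
    """Time to rebuild the same KV cache by re-running prefill on the prompt."""
    return num_tokens / prefill_tokens_per_s


def offload_pays_off(kv_bytes, num_tokens, latency_s, bandwidth_bytes_per_s,
                     prefill_tokens_per_s):
    """True when reloading the cache beats recomputing it."""
    return (reload_seconds(kv_bytes, latency_s, bandwidth_bytes_per_s)
            < recompute_seconds(num_tokens, prefill_tokens_per_s))


# Illustrative assumptions only.
TOKENS = 32_768
KV_BYTES = TOKENS * 320 * 1024   # assumed 320 KiB of KV per token
PREFILL_TOKENS_PER_S = 10_000    # assumed prefill throughput

MEDIA = {                        # name: (latency in s, bandwidth in bytes/s)
    "CPU DRAM": (1e-6, 25e9),
    "local NVMe": (100e-6, 5e9),
    "remote object store": (20e-3, 1e9),
}

recompute = recompute_seconds(TOKENS, PREFILL_TOKENS_PER_S)
for name, (latency, bandwidth) in MEDIA.items():
    reload = reload_seconds(KV_BYTES, latency, bandwidth)
    wins = offload_pays_off(KV_BYTES, TOKENS, latency, bandwidth, PREFILL_TOKENS_PER_S)
    verdict = "reload wins" if wins else "recompute wins"
    print(f"{name:20s} reload {reload:6.2f}s vs recompute {recompute:.2f}s -> {verdict}")
```

With these made-up numbers the slowest medium loses to recomputation, which is the point: offloading only helps TTFT while the target tier stays fast enough.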

II. Software solutions (LMCache)

LMCache is the primary vendor-neutral alternative for KV cache management. The open-source project, developed at the University of Chicago, provides hierarchical KV cache storage and sharing capabilities that work across multiple hardware platforms including NVIDIA GPUs, AMD MI300X accelerators, and Intel Gaudi 3 processors.

LMCache integrates with the vLLM and SGLang inference engines to enable KV cache offloading to CPU RAM, local storage, and network storage such as S3 or Redis. The system supports prefix reuse across queries, compression (CacheGen), CacheBlend, and prefill-decode disaggregation for cross-engine cache transfer.
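To make prefix reuse concrete, here is a minimal sketch of chunk-level prefix matching. It is not LMCache's actual API or key format: `chunk_keys`, `cached_prefix_tokens`, the 4-token chunk size, and the SHA-256 chaining are all hypothetical choices for illustration. The idea is that each chunk's key depends on everything before it, so two prompts share cache entries exactly as far as their token prefixes match.

```python
import hashlib

CHUNK_TOKENS = 4  # tiny chunk size for the demo; real systems use far larger chunks


def chunk_keys(token_ids, chunk_tokens=CHUNK_TOKENS):
    """Chained hash keys, one per full chunk of the token sequence."""
    keys = []
    parent = b""
    full = len(token_ids) - len(token_ids) % chunk_tokens  # ignore the partial tail
    for start in range(0, full, chunk_tokens):
        chunk = list(token_ids[start:start + chunk_tokens])
        digest = hashlib.sha256(parent + repr(chunk).encode()).digest()
        keys.append(digest.hex())
        parent = digest  # the next key depends on the whole prefix so far
    return keys


def cached_prefix_tokens(token_ids, store, chunk_tokens=CHUNK_TOKENS):
    """Number of leading tokens whose KV chunks are already in `store`."""
    hits = 0
    for key in chunk_keys(token_ids, chunk_tokens):
        if key not in store:
            break
        hits += 1
    return hits * chunk_tokens


system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]   # shared by every request
question_a = [9, 10, 11, 12]
question_b = [20, 21, 22, 23]

# Pretend the KV chunks of the first request were offloaded to a cache store.
store = set(chunk_keys(system_prompt + question_a))

reusable = cached_prefix_tokens(system_prompt + question_b, store)
print(f"{reusable} of {len(system_prompt + question_b)} tokens can skip prefill")
```

Only the tokens after the shared prefix need fresh prefill, which is where the TTFT savings come from.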

Platform & storage agnostic

Unlike NVIDIA’s CMX, which requires BlueField-4 DPUs and Spectrum-X Ethernet, LMCache operates over standard TCP/IP or RDMA networking and works with commodity storage infrastructure. The approach prioritizes ecosystem compatibility over specialized hardware acceleration.

Note: You can find more LMCache features in our previous post about the vLLM production stack (KVCache network).

III. Hardware Solutions

What Is NVIDIA CMX?

CMX (formerly Inference Context Memory Storage) is NVIDIA’s purpose-built hardware platform for storing and serving KV cache context at rack scale. It is powered by BlueField-4 DPUs and sits as a G3.5 tier in NVIDIA’s tiered KV cache architecture, bridging the gap between in-pod local SSDs (G3) and off-pod shared storage (G4) over RDMA networking.

GTC Nvidia expo floor

Think of it as a network-attached SSD storage array that is AI-inference-aware. Instead of just storing data, it understands context and orchestrates pre-staging so that relevant KV cache data is ready before the GPU asks for it.
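The pre-staging idea can be illustrated in a few lines of host-side code. This is only a conceptual sketch and says nothing about how BlueField-4 or DOCA implement it: the `Prestager` class, its `hint`/`acquire` methods, and the `slow_fetch` stand-in are hypothetical. A fetch starts as soon as a request is queued, so that by the time the GPU is ready most of the wait has already happened in the background.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class Prestager:
    """Starts KV fetches early so they overlap with scheduling delay."""

    def __init__(self, fetch, workers=4):
        self._fetch = fetch
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._inflight = {}

    def hint(self, session_id):
        """Call when a request is queued: begin fetching in the background."""
        if session_id not in self._inflight:
            self._inflight[session_id] = self._pool.submit(self._fetch, session_id)

    def acquire(self, session_id):
        """Call when the GPU is ready: wait only for the remaining fetch time."""
        self.hint(session_id)  # falls back to an on-demand fetch if never hinted
        return self._inflight.pop(session_id).result()

    def close(self):
        self._pool.shutdown(wait=True)


def slow_fetch(session_id):
    time.sleep(0.05)  # stand-in for reading KV blocks from a slower tier
    return f"kv-blocks-for-{session_id}"


stager = Prestager(slow_fetch)
stager.hint("session-42")   # the request enters the queue
time.sleep(0.05)            # the scheduler is busy with other work meanwhile

start = time.perf_counter()
blocks = stager.acquire("session-42")
waited_ms = (time.perf_counter() - start) * 1000
print(f"{blocks} (waited {waited_ms:.1f} ms at acquire time)")
stager.close()
```

The design choice worth noting is that `acquire` still works without a prior hint; pre-staging is an optimization on top of on-demand fetching, not a replacement for it.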

KV Cache Offloading Workflow in CMX

Key building blocks:

  • BlueField-4 DPUs: 64-core Arm processors managing NVMe context storage
  • 800 Gb/s networking: RDMA over Spectrum-X Ethernet
  • PCIe Gen6 x16 host interface
  • DOCA storage frameworks: software layer orchestrating context pre-staging
  • Up to 16 TB of context memory per GPU, and up to 150 TB per BlueField-4 DPU

This is the G3.5 tier in NVIDIA’s Tiered KV Flow architecture, shipping with the Vera Rubin platform in H2 2026.

NVIDIA CMX Data sheet:

⚡ https://resources.nvidia.com/bluefield-4-dpu-datasheet
📺 Explore GTC related sessions.

What Was Intel Optane?

Intel Optane Persistent Memory, based on 3D XPoint technology, bridged the gap between volatile DRAM and slower storage. Think cheaper, higher-capacity RAM (up to 512 GB per stick) that persists across reboots. For LLM serving, that last part matters: with standard DRAM, a server reboot wipes the entire KV cache; with Optane, it survives.

With Optane, an LLM server farm could recover near-instantly after maintenance or failure. No costly recomputation of every active user’s context. A resilience challenge the AI industry still hasn’t fully solved.

AMD: More Memory

AMD’s answer to the KV cache problem isn’t a new storage tier, it’s more memory. The AMD MI455X packs 432GB of HBM4 with 19.6 TB/s of bandwidth per GPU, large enough to keep model weights and a substantial KV cache on-accelerator without offloading. The trade-off: higher per-GPU cost, no cross-instance cache sharing, and it still relies on LMCache or vLLM’s PagedAttention when context spills over. Not a purpose-built context storage layer like CMX.

IV. NVIDIA CMX vs. Intel Optane

KV cache is large, frequently accessed, and expensive to keep in GPU memory. What you really want is something that behaves close to memory, but persists like storage. This is why CMX and Intel Optane were picked for this comparison.

1. Architecture and Positioning

Optane Persistent Memory

  • 3D XPoint memory in DDR4 DIMM slots
  • Acts as a direct memory-tier extension (RAM sticks)
  • Around 100 to 300 nanoseconds latency
  • Byte addressable
  • Native persistence (survives reboots)

NVIDIA CMX

  • Flash based NVMe Context storage tier
  • G3.5: Sits between local SSD tier and shared storage tier
  • Managed by BlueField 4 DPUs
  • Network attached and rack scale
  • Disaggregated virtual storage
  • Limited to inferior GPU sparse attention mechanisms

Optane extended the memory bus, while CMX extends the network fabric.

2. Scale

Optane

  • Up to 512 GB per DIMM
  • Limited by memory slots
  • A few terabytes per server at most

NVIDIA CMX

  • Up to 16 TB of context memory per GPU
  • Up to 150 TB behind each BlueField 4 DPU
  • Designed for cluster level deployment
  • Leverages NIXL and Dynamo for advanced sharing across AI nodes

Optane scaled vertically per server, while CMX scales horizontally across racks.
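To translate those capacities into something tangible, the sketch below counts how many long-context sessions fit in each. The capacities come from the figures above; the per-token KV footprint (320 KiB) and the 128k-token session length are assumptions for illustration, and real values depend on the model and its precision.

```python
def sessions_that_fit(capacity_bytes, tokens_per_session, kv_bytes_per_token):
    """Whole sessions whose KV cache fits in the given capacity."""
    return capacity_bytes // (tokens_per_session * kv_bytes_per_token)


KV_BYTES_PER_TOKEN = 320 * 1024   # assumption: depends on the model
TOKENS_PER_SESSION = 128_000      # assumption: one long-context session
GB, TB = 10**9, 10**12

capacities = {
    "Optane, one 512 GB DIMM": 512 * GB,
    "CMX, 16 TB per GPU": 16 * TB,
    "CMX, 150 TB per BlueField-4 DPU": 150 * TB,
}

for name, capacity in capacities.items():
    count = sessions_that_fit(capacity, TOKENS_PER_SESSION, KV_BYTES_PER_TOKEN)
    print(f"{name:34s} ~{count} sessions")
```

The absolute counts shift with the assumptions, but the ratio between the tiers does not: it is set by the capacities alone.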

3. Performance Claims

Optane

  • Near DRAM latency (~100–300 ns)
  • Byte addressable

NVIDIA CMX

  • ~Microsecond latency (DPU-orchestrated prestaging)
  • Claims 5x higher tokens per second versus traditional storage
  • Claims 5x better power efficiency

CMX is still storage-backed, so it does not reach memory-class latency. It compensates with orchestration and parallelism.

4. Persistence Model

Optane

  • Native non volatile memory
  • Survives reboots
  • True persistent KV cache potential

NVIDIA CMX

  • Context treated as reusable but non durable
  • Intelligent prestaging and orchestration
  • Persistence is operational, not architectural

Optane was persistent by physics, while CMX is persistent by system design.

5. Availability

  • Intel Optane: discontinued ☹️
  • NVIDIA CMX: shipping H2 2026

CMX vs. Optane vs. CXL: Full Comparison

| Feature | Intel Optane PMem (3D XPoint) | CXL-Attached DRAM (Type 3) | NVIDIA CMX (BlueField-4) |
| Underlying Media | Proprietary non-volatile (3D XPoint) | Standard volatile DRAM (DDR4/DDR5) | Flash-based NVMe (128 GB LPDDR5 + DPU orchestration) |
| Persistence | Native persistence (data survives power-off) | Volatile (data is lost on power-off) | Quasi-persistent (reusable context, non-durable) |
| Latency | Very low (approx. 100-300 ns), close to DRAM | Low to moderate (approx. 170-400 ns), higher than local DRAM | Moderate (~microseconds with 64-core Arm DPU prestaging) |
| Bandwidth | High (DDR4-equivalent, 20-30 GB/s per module) | Very high (PCIe 5 equivalent, up to 64 GB/s) | Very high (800 Gb/s network, PCIe Gen6 x16 host interface) |
| Protocol | Proprietary IMC/DDR bus slot | Open standard (runs over PCIe physical layer) | NVMe-oF + RDMA (Ethernet/InfiniBand, 800G capable) |
| Primary Value | Low-latency persistence + capacity | Massive capacity expansion + memory pooling | Massive scale (16 TB/GPU, 150 TB/DPU) + AI-native orchestration |

The CMX Catch

CMX is tightly coupled to the NVIDIA ecosystem. If your stack is AMD-based, you cannot take advantage of it. The ideal architecture would be a CPU-linked, vendor-agnostic context tier usable across heterogeneous GPU environments. That still does not exist.

Takeaways

Optane was technically ahead of its time. It could have solved the KV cache problem at the memory tier, before LLM workloads made that problem mainstream. But it was trapped inside Intel’s shrinking ecosystem. NVIDIA CMX takes the same core insight: fast, persistent-like storage for ephemeral inference context. The difference is execution at scale. Instead of extending the CPU memory bus, NVIDIA extends the GPU cluster fabric with DPUs, RDMA, Spectrum-X Ethernet, and DOCA storage frameworks. Architecturally, it is less elegant than Optane, but operationally it is more scalable.

The irony? Intel killed Optane in 2022, just months before the LLM boom that would’ve made it essential.

The next frontier is not just smarter models. It is smarter memory.
