
Intro
The world of LLMs is dominated by one expensive bottleneck: GPU memory. It directly limits how many models you can fit and how fast they can generate text, especially in multi-turn conversations or when processing long contexts.
The solution is KV Cache Offloading (e.g., with LMCache). One technology was perfectly suited to supercharge it: Intel Optane Persistent Memory (PMem). Optane could have been the ideal bridge between fast, small GPU memory and slower, larger traditional storage. But alas, Intel decided otherwise, making this a clear missed opportunity for LLM inference efficiency at scale. Let’s explain why...
I. What’s Behind the KV Cache Problem?

The KV Cache’s Purpose:
In simple terms, your LLM remembers what it just “thought“. Every time an LLM processes a prompt, it saves its internal state (the Key and Value tensors from every attention layer) in a super-fast scratchpad called the KV Cache. Instead of recomputing that state for every previous token at each step, the model reuses it, which is what keeps generation fast and shortens how quickly the first word appears in your chat output (Time to First Token, or TTFT).
💡 If you ask similar questions or use RAG with repetitive prompts, the cached “context” is reused: your LLM skips the redundant prefill work and jumps straight to generating the first word of its response.
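To make the mechanism concrete, here is a minimal single-head decode loop in plain NumPy (toy dimensions, random weights, no batching); it is an illustration of the caching idea, not how vLLM or LMCache actually lay out their caches. Each step computes K and V only for the new token and reuses everything already in the cache:

```python
import numpy as np

d = 64                                   # toy head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(scale=0.02, size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []                # the KV cache: one K and one V row per past token

def decode_step(x):
    """Attend the new token embedding `x` (shape [d]) over all cached tokens."""
    q = x @ Wq
    k_cache.append(x @ Wk)               # only the NEW token's K/V are computed...
    v_cache.append(x @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)   # ...older rows are reused as-is
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                   # attention output for the new token

# Prefill: the prompt's K/V land in the cache once; reusing this across turns
# (or across similar RAG prompts) is what shortens Time to First Token.
for token_embedding in rng.normal(size=(16, d)):
    decode_step(token_embedding)

print(f"cached tokens: {len(k_cache)}")  # 16, ready to be reused on the next request
```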
The Bottleneck:
- GPU is Already Full: The GPU’s ultra-fast memory (HBM) is already largely consumed by the model’s weights.
- Cache Overload: On top of that, the model’s working memory (the KV cache) grows with every token of context and every concurrent request.
- Limits Scale: The KV cache becomes the binding constraint, dictating the maximum context length and the number of concurrent users (see the sizing sketch below).
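A quick back-of-envelope calculation shows how fast that working memory grows. The sketch below assumes Llama-2-7B-style shapes (32 layers, 32 KV heads, head dimension 128, fp16); models using grouped-query attention shrink this, but the trend is the same:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """KV cache size for ONE sequence: a K and a V vector per layer, head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Assumed Llama-2-7B-style shapes: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
per_4k_request = kv_cache_bytes(32, 32, 128, seq_len=4096)

print(f"~{per_token / 1024:.0f} KiB per token")                   # ~512 KiB
print(f"~{per_4k_request / 2**30:.1f} GiB per 4k-token request")  # ~2.0 GiB
# Just 40 concurrent 4k-token sessions need ~80 GiB, an entire A100's HBM,
# before a single byte of model weights is counted.
```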
🚨 The fix: KV Cache Offloading
KV Cache Offloading moves inactive, older parts of the huge KV cache off the expensive GPU and into cheaper, larger storage such as CPU RAM, SSDs, or S3. When the data is needed again, it is quickly swapped back onto the GPU.
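Mechanically, offloading is just a tiered copy. Below is a minimal PyTorch sketch of the swap-out/swap-in idea using pinned host memory and asynchronous copies; it is a toy illustration of the technique, not LMCache’s (or vLLM’s) actual implementation, and the block shape is arbitrary:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
use_pinned = torch.cuda.is_available()       # page-locked host RAM enables async DMA

class KVOffloader:
    """Toy two-tier cache: hot KV blocks stay on the GPU, cold ones park in CPU RAM."""

    def __init__(self):
        self.gpu_blocks = {}                 # block_id -> tensor on the GPU (hot)
        self.cpu_blocks = {}                 # block_id -> tensor in host RAM (cold)

    def offload(self, block_id):
        """Swap an inactive KV block out of GPU HBM into host memory."""
        block = self.gpu_blocks.pop(block_id)
        host = torch.empty(block.shape, dtype=block.dtype,
                           device="cpu", pin_memory=use_pinned)
        host.copy_(block, non_blocking=True)             # async copy over PCIe
        self.cpu_blocks[block_id] = host

    def fetch(self, block_id):
        """Swap a block back onto the GPU when its session becomes active again."""
        if block_id not in self.gpu_blocks:
            host = self.cpu_blocks.pop(block_id)
            # A real engine would synchronize on a CUDA event before reading this
            # and would track blocks at page granularity, not one blob per session.
            self.gpu_blocks[block_id] = host.to(device, non_blocking=True)
        return self.gpu_blocks[block_id]

# Usage: park an idle conversation's KV block, then restore it on the next turn.
cache = KVOffloader()
cache.gpu_blocks["session-42"] = torch.randn(2, 32, 1024, 128,
                                             dtype=torch.float16, device=device)
cache.offload("session-42")
kv = cache.fetch("session-42")
print(kv.shape, kv.device)
```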
The Offload Trap
The catch is the target medium itself: the success of offloading depends entirely on its speed and persistence.
- DRAM (fast but expensive): great speed, but its high cost and lack of persistence make it impractical for large caches.
- SSD/NVMe (cheap but slow): great capacity, but its higher latency erodes the Time-to-First-Token (TTFT) benefit (see the back-of-envelope numbers below).
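The arithmetic below makes the trade-off concrete. The bandwidth figures are rough assumptions (around 25 GB/s effective for host-to-GPU copies over PCIe 4.0 x16, around 7 GB/s sequential read for a Gen4 NVMe SSD), not benchmarks, but the order of magnitude is what matters:

```python
# Back-of-envelope: time to pull one long-context session's ~2 GiB KV cache
# (see the sizing example above) back onto the GPU from each offload tier.
CACHE_GIB = 2.0
GIB_TO_GB = 1.073741824

tiers_gbps = {
    # Assumed sustained bandwidths in GB/s (ballpark figures, not benchmarks).
    "CPU DRAM over PCIe 4.0 x16": 25.0,
    "NVMe SSD (Gen4, seq. read)": 7.0,
}

for tier, bandwidth in tiers_gbps.items():
    reload_ms = CACHE_GIB * GIB_TO_GB / bandwidth * 1000
    print(f"{tier:28s}: ~{reload_ms:4.0f} ms added to TTFT")

# Roughly ~86 ms from DRAM versus ~307 ms from SSD, before any queueing or
# filesystem overhead. That gap is exactly the TTFT benefit offloading is
# supposed to protect, which is why the offload medium matters so much.
```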
II. Why Was Optane PMem the “Perfect” Solution?

Intel Optane Persistent Memory, based on 3D XPoint technology, created a brand new tier in the memory/storage hierarchy, perfectly bridging the gap between volatile DRAM and slower storage. Imagine having cheaper, higher-capacity RAM (up to 512 GB per module) that persists across reboots.

Optane PMem: The Perfect Bridge
As you can see below, Optane uniquely solved the “impossible” trade-off between speed, capacity, and persistence:
| Characteristic | GPU HBM (Fastest) | CPU DRAM | Optane PMem 💡 | SSD/NVMe (Slowest) |
|---|---|---|---|---|
| Speed (Latency) | Very Low (nanoseconds) | Very Low (nanoseconds) | Low (Closer to DRAM) | Higher (microseconds) |
| Capacity | Very Low (e.g., 80 GB) | Low-Medium (up to ~2 TB, at high cost) | Very High (up to 512 GB per DIMM, multi-TB per socket) | Very High |
| Cost | Highest | High | Medium-High | Lowest |
| Persistence | No (Volatile) | No (Volatile) | YES | YES |
| Interface/Access | Direct to GPU | DDR4/DIMM Slot | DDR4/DIMM Slot | PCIe Bus |
The comparison above shows that Optane PMem wasn’t just slightly better; it solved three major problems for large-scale LLM serving at once:
- Speed (Low Latency): Its near-DRAM speed finally made KV cache offloading a viable option compared to slower storage.
- Scale (High Capacity): Terabytes of cost-effective memory, enough to preserve the caches of thousands of concurrent user sessions.
🔥 Native Persistence (The “Killer Feature” for Serving)
Persistence is an operational game-changer for LLM serving infrastructure. With volatile HBM/DRAM, a server reboot means the entire cache is lost. But because PMem survives reboots, an entire LLM server farm could recover near-instantly after maintenance or a failure, avoiding the costly, time-consuming recomputation of every active user’s context. This kind of high-speed resilience is a challenge the AI industry still struggles to solve.
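To illustrate what native persistence would have looked like in practice: with PMem configured in App Direct mode and exposed as a DAX filesystem, KV blocks could simply be memory-mapped files that outlive a process restart or a reboot. The sketch below is a minimal illustration under that assumption; the /mnt/pmem mount point and the file layout are hypothetical, and a production system would also need proper flush ordering and metadata.

```python
import numpy as np
from pathlib import Path

# Hypothetical DAX-mounted PMem filesystem (App Direct mode). On a machine
# without PMem this degrades to an ordinary file-backed mmap on disk.
PMEM_DIR = Path("/mnt/pmem/kv_store")
PMEM_DIR.mkdir(parents=True, exist_ok=True)

def persist_kv_block(session_id: str, kv_block: np.ndarray) -> None:
    """Place a KV block in persistent memory via mmap so it survives restarts."""
    mm = np.memmap(PMEM_DIR / f"{session_id}.kv", dtype=kv_block.dtype,
                   mode="w+", shape=kv_block.shape)
    mm[:] = kv_block                     # stores go through the memory mapping
    mm.flush()                           # make the data durable before returning

def load_kv_block(session_id: str, shape, dtype=np.float16) -> np.ndarray:
    """Re-attach a persisted KV block after a restart instead of recomputing prefill."""
    return np.memmap(PMEM_DIR / f"{session_id}.kv", dtype=dtype, mode="r", shape=shape)

# A rebooted server would map each active session's block back in near-instantly,
# rather than re-running prefill over every user's full context.
persist_kv_block("session-42", np.zeros((2, 32, 1024, 128), dtype=np.float16))
restored = load_kv_block("session-42", shape=(2, 32, 1024, 128))
print(restored.shape)
```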
III. The $559 Million Failure That Killed Optane
Despite its technical superiority (it even powered Oracle’s Exadata engineered systems), Optane was killed by Intel in 2022 because it failed as a business, with a $559 million inventory write-down, just before ChatGPT and the LLM boom took off. The decision came down to:
- Micron’s Exit: The writing was on the wall after its partner, Micron, terminated the joint venture in 2018.
- Niche Market vs. Cost: Optane struggled in the broader market, and rival SSDs were deemed “good enough”.
- The Rise of CXL: A competing tech, Compute Express Link, offered a more open alternative.
⚔️ CXL vs. Optane PMem

CXL (Compute Express Link) is an open industry standard that lets servers expand, pool, and share standard, cheaper DRAM (even older DDR4/DDR3) across multiple machines over the PCIe physical layer. It is supported by all major platforms (Intel, AMD, Arm, etc.) and could solve the massive memory-scaling problem faced by large AI models today.
🤔 Now let’s see whether CXL is “on par” with Optane at solving our KV cache problem using a different technology.
| Feature | Intel Optane PMem (3D XPoint) | CXL-Attached DRAM (Type 3) |
|---|---|---|
| Underlying Media | Proprietary Non-Volatile (3D XPoint) | Standard Volatile DRAM (DDR4/DDR5) |
| Persistence | Native Persistence (Data survives power-off) | Volatile (Data is lost on power-off) |
| Latency | Very Low (Approx. 100-300 ns) – Close to DRAM | Low-to-Moderate (Approx. 170-400 ns) – Higher than local DRAM |
| Bandwidth | High (lower than DRAM; roughly 20-30 GB/s aggregate per socket) | Very High (PCIe 5.0 x16 equivalent, up to ~64 GB/s) |
| Protocol | Proprietary (DDR-T over the DDR4 memory bus) | Open Standard (runs over the PCIe physical layer) |
| Primary Value | Low-Latency Persistence + Capacity | Massive Capacity Expansion + Memory Pooling |
CXL Is Not the Answer
While CXL takes a different approach, focusing on flexible pooling and memory expansion with cheaper DRAM connected over PCIe, it introduces major caveats compared to Optane:
- ❌ Higher Latency: CXL-attached memory is slower than Optane PMem (and local DRAM), which hurts TTFT.
- ❌ No Persistence: CXL-attached memory is volatile DRAM, so the entire KV cache is still lost on reboot.
- ❌ Nvidia Support: Nvidia favors its own high-bandwidth interconnects (NVLink, NVLink-C2C) over open standards like CXL over PCIe.
Takeaways
The tragedy is the timing: just as the LLM/generative AI boom began, creating a market perfectly suited to exploit Optane’s unique combination of capacity, speed, and persistence, Intel made the business decision to abandon the memory technology (much like its exit from the DRAM business in ’86). We are now left solving the critical KV cache bottleneck with less-than-ideal options, which makes Optane PMem the great “what if” story of AI infrastructure.

Run AI Your Way – In Your Cloud
Want full control over your AI backend? The CloudThrill vLLM Private Inference POC is still open, but not forever.
📢 Secure your spot (only a few left): **Apply now!**
Run AI assistants, RAG, or internal models on an AI backend **privately in your cloud**:
✅ No external APIs
✅ No vendor lock-in
✅ Total data control
**Your infra. Your models. Your rules…**
🙋‍♂️ If you like this content, please subscribe to our blog newsletter ❤️.