
Intro
The world of LLMs is dominated by one expensive bottleneck: GPU memory. It directly limits how many models you can fit and how fast they can generate text, especially in multi-turn conversations or when processing long contexts.
The solution is KV Cache Offloading (e.g., with LMCache). One technology was perfectly suited to supercharge it: Intel Optane Persistent Memory (PMem). Optane could have been the ideal bridge between fast, small GPU memory and slower, larger traditional storage. But alas, Intel decided otherwise, making this a clear missed opportunity for LLM inference efficiency at scale. Let’s explain why...
I. What’s Behind the KV Cache Problem?

The KV Cache’s Purpose:
In simple terms, your LLM remembers what it just “thought“. Every time an LLM processes a prompt, it saves its internal state (the Key and Value tensors from every attention layer) in a super-fast scratchpad called the KV Cache. Instead of recomputing that state for every previous token at each step, the model reuses it, which is what keeps generation fast and shortens how quickly the first word appears in your chat output (Time to First Token, or TTFT).
💡 If you ask similar questions or use RAG with repetitive prompts, the cached “context” is reused: your LLM skips the redundant prefill work and jumps straight to generating the first word of its response.
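To make the mechanism concrete, here is a minimal single-head decode loop in plain NumPy (toy dimensions, random weights, no batching); it is an illustration of the caching idea, not how vLLM or LMCache actually lay out their caches. Each step computes K and V only for the new token and reuses everything already in the cache:

```python
import numpy as np

d = 64                                   # toy head dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(scale=0.02, size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []                # the KV cache: one K and one V row per past token

def decode_step(x):
    """Attend the new token embedding `x` (shape [d]) over all cached tokens."""
    q = x @ Wq
    k_cache.append(x @ Wk)               # only the NEW token's K/V are computed...
    v_cache.append(x @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)   # ...older rows are reused as-is
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                   # attention output for the new token

# Prefill: the prompt's K/V land in the cache once; reusing this across turns
# (or across similar RAG prompts) is what shortens Time to First Token.
for token_embedding in rng.normal(size=(16, d)):
    decode_step(token_embedding)

print(f"cached tokens: {len(k_cache)}")  # 16, ready to be reused on the next request
```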
The Bottleneck:
- GPU is Already Full: The GPU’s ultra-fast memory (HBM) is already largely consumed by the model’s weights.
- Cache Overload: On top of that, the model’s working memory (the KV cache) grows with every token of context and every concurrent request.
- Limits Scale: The KV cache becomes the binding constraint, dictating the maximum context length and the number of concurrent users (see the sizing sketch below).
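A quick back-of-envelope calculation shows how fast that working memory grows. The sketch below assumes Llama-2-7B-style shapes (32 layers, 32 KV heads, head dimension 128, fp16); models using grouped-query attention shrink this, but the trend is the same:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """KV cache size for ONE sequence: a K and a V vector per layer, head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Assumed Llama-2-7B-style shapes: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
per_4k_request = kv_cache_bytes(32, 32, 128, seq_len=4096)

print(f"~{per_token / 1024:.0f} KiB per token")                   # ~512 KiB
print(f"~{per_4k_request / 2**30:.1f} GiB per 4k-token request")  # ~2.0 GiB
# Just 40 concurrent 4k-token sessions need ~80 GiB, an entire A100's HBM,
# before a single byte of model weights is counted.
```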
🚨 The fix: KV Cache Offloading
KV Cache Offloading moves inactive, older parts of the huge KV cache off the expensive GPU and into cheaper, larger storage such as CPU RAM, SSDs, or S3. When the data is needed again, it is quickly swapped back onto the GPU.
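Mechanically, offloading is just a tiered copy. Below is a minimal PyTorch sketch of the swap-out/swap-in idea using pinned host memory and asynchronous copies; it is a toy illustration of the technique, not LMCache’s (or vLLM’s) actual implementation, and the block shape is arbitrary:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
use_pinned = torch.cuda.is_available()       # page-locked host RAM enables async DMA

class KVOffloader:
    """Toy two-tier cache: hot KV blocks stay on the GPU, cold ones park in CPU RAM."""

    def __init__(self):
        self.gpu_blocks = {}                 # block_id -> tensor on the GPU (hot)
        self.cpu_blocks = {}                 # block_id -> tensor in host RAM (cold)

    def offload(self, block_id):
        """Swap an inactive KV block out of GPU HBM into host memory."""
        block = self.gpu_blocks.pop(block_id)
        host = torch.empty(block.shape, dtype=block.dtype,
                           device="cpu", pin_memory=use_pinned)
        host.copy_(block, non_blocking=True)             # async copy over PCIe
        self.cpu_blocks[block_id] = host

    def fetch(self, block_id):
        """Swap a block back onto the GPU when its session becomes active again."""
        if block_id not in self.gpu_blocks:
            host = self.cpu_blocks.pop(block_id)
            # A real engine would synchronize on a CUDA event before reading this
            # and would track blocks at page granularity, not one blob per session.
            self.gpu_blocks[block_id] = host.to(device, non_blocking=True)
        return self.gpu_blocks[block_id]

# Usage: park an idle conversation's KV block, then restore it on the next turn.
cache = KVOffloader()
cache.gpu_blocks["session-42"] = torch.randn(2, 32, 1024, 128,
                                             dtype=torch.float16, device=device)
cache.offload("session-42")
kv = cache.fetch("session-42")
print(kv.shape, kv.device)
```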
The Offload Trap
The catch is the target medium itself: the success of offloading depends entirely on its speed and persistence.
- DRAM (fast but expensive): great speed, but its high cost and lack of persistence make it impractical for large caches.
- SSD/NVMe (cheap but slow): great capacity, but its higher latency erodes the Time-to-First-Token (TTFT) benefit (see the back-of-envelope numbers below).
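The arithmetic below makes the trade-off concrete. The bandwidth figures are rough assumptions (around 25 GB/s effective for host-to-GPU copies over PCIe 4.0 x16, around 7 GB/s sequential read for a Gen4 NVMe SSD), not benchmarks, but the order of magnitude is what matters:

```python
# Back-of-envelope: time to pull one long-context session's ~2 GiB KV cache
# (see the sizing example above) back onto the GPU from each offload tier.
CACHE_GIB = 2.0
GIB_TO_GB = 1.073741824

tiers_gbps = {
    # Assumed sustained bandwidths in GB/s (ballpark figures, not benchmarks).
    "CPU DRAM over PCIe 4.0 x16": 25.0,
    "NVMe SSD (Gen4, seq. read)": 7.0,
}

for tier, bandwidth in tiers_gbps.items():
    reload_ms = CACHE_GIB * GIB_TO_GB / bandwidth * 1000
    print(f"{tier:28s}: ~{reload_ms:4.0f} ms added to TTFT")

# Roughly ~86 ms from DRAM versus ~307 ms from SSD, before any queueing or
# filesystem overhead. That gap is exactly the TTFT benefit offloading is
# supposed to protect, which is why the offload medium matters so much.
```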
II. Why Was Optane PMem the “Perfect” Solution?

Intel Optane Persistent Memory, based on 3D XPoint technology, created a brand new tier in the memory/storage hierarchy, perfectly bridging the gap between volatile DRAM and slower storage. Imagine having cheaper, higher-capacity RAM (up to 512 GB per module) that persists across reboots.

Optane PMem: The Perfect Bridge
As you can see below, Optane uniquely solved the “impossible” trade-off between speed, capacity, and persistence:
| Characteristic | GPU HBM (Fastest) | CPU DRAM | Optane PMem 💡 | SSD/NVMe (Slowest) |
|---|---|---|---|---|
| Speed (Latency) | Very Low (nanoseconds) | Very Low (nanoseconds) | Low (Closer to DRAM) | Higher (microseconds) |
| Capacity | Very Low (e.g., 80 GB) | Low-Medium (up to ~2 TB, at high cost) | Very High (up to 512 GB per DIMM, multi-TB per socket) | Very High |
| Cost | Highest | High | Medium-High | Lowest |
| Persistence | No (Volatile) | No (Volatile) | YES | YES |
| Interface/Access | Direct to GPU | DDR4/DIMM Slot | DDR4/DIMM Slot | PCIe Bus |
The comparison above shows that Optane PMem wasn’t just slightly better; it solved three major problems for large-scale LLM serving at once:
- Speed (Low Latency): Its near-DRAM speed finally made KV cache offloading a viable option compared to slower storage.
- Scale (High Capacity): Terabytes of cost-effective memory, enough to preserve the caches of thousands of concurrent user sessions.
🔥 Native Persistence (The “Killer Feature” for Serving)
Persistence is an operational game-changer for LLM serving infrastructure. With volatile HBM/DRAM, a server reboot means the entire cache is lost. But because PMem survives reboots, an entire LLM server farm could recover near-instantly after maintenance or a failure, avoiding the costly, time-consuming recomputation of every active user’s context. This kind of high-speed resilience is a challenge the AI industry still struggles to solve.
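To illustrate what native persistence would have looked like in practice: with PMem configured in App Direct mode and exposed as a DAX filesystem, KV blocks could simply be memory-mapped files that outlive a process restart or a reboot. The sketch below is a minimal illustration under that assumption; the /mnt/pmem mount point and the file layout are hypothetical, and a production system would also need proper flush ordering and metadata.

```python
import numpy as np
from pathlib import Path

# Hypothetical DAX-mounted PMem filesystem (App Direct mode). On a machine
# without PMem this degrades to an ordinary file-backed mmap on disk.
PMEM_DIR = Path("/mnt/pmem/kv_store")
PMEM_DIR.mkdir(parents=True, exist_ok=True)

def persist_kv_block(session_id: str, kv_block: np.ndarray) -> None:
    """Place a KV block in persistent memory via mmap so it survives restarts."""
    mm = np.memmap(PMEM_DIR / f"{session_id}.kv", dtype=kv_block.dtype,
                   mode="w+", shape=kv_block.shape)
    mm[:] = kv_block                     # stores go through the memory mapping
    mm.flush()                           # make the data durable before returning

def load_kv_block(session_id: str, shape, dtype=np.float16) -> np.ndarray:
    """Re-attach a persisted KV block after a restart instead of recomputing prefill."""
    return np.memmap(PMEM_DIR / f"{session_id}.kv", dtype=dtype, mode="r", shape=shape)

# A rebooted server would map each active session's block back in near-instantly,
# rather than re-running prefill over every user's full context.
persist_kv_block("session-42", np.zeros((2, 32, 1024, 128), dtype=np.float16))
restored = load_kv_block("session-42", shape=(2, 32, 1024, 128))
print(restored.shape)
```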
III. The $559 Million Failure That Killed Optane
Despite its technical superiority (it even powered Oracle’s Exadata engineered systems), Optane was killed by Intel in 2022 because it failed as a business, with a $559 million inventory write-down, just before ChatGPT and the LLM boom took off. The decision came down to:
- Micron’s Exit: The writing was on the wall after its partner, Micron, terminated the joint venture in 2018.
- Niche Market vs. Cost: Optane struggled in the broader market, and rival SSDs were deemed “good enough”.
- The Rise of CXL: A competing tech, Compute Express Link, offered a more open alternative.
⚔️ CXL vs. Optane PMem

CXL (Compute Express Link) is an open industry standard that lets servers expand, pool, and share standard, cheaper DRAM (even older DDR4/DDR3) across multiple machines over the PCIe physical layer. It is supported by all major platforms (Intel, AMD, Arm, etc.) and could solve the massive memory-scaling problem faced by large AI models today.
🤔 Now let’s see whether CXL is “on par” with Optane at solving our KV cache problem using a different technology.
| Feature | Intel Optane PMem (3D XPoint) | CXL-Attached DRAM (Type 3) |
|---|---|---|
| Underlying Media | Proprietary Non-Volatile (3D XPoint) | Standard Volatile DRAM (DDR4/DDR5) |
| Persistence | Native Persistence (Data survives power-off) | Volatile (Data is lost on power-off) |
| Latency | Very Low (Approx. 100-300 ns) – Close to DRAM | Low-to-Moderate (Approx. 170-400 ns) – Higher than local DRAM |
| Bandwidth | High (lower than DRAM; roughly 20-30 GB/s aggregate per socket) | Very High (PCIe 5.0 x16 equivalent, up to ~64 GB/s) |
| Protocol | Proprietary (DDR-T over the DDR4 memory bus) | Open Standard (runs over the PCIe physical layer) |
| Primary Value | Low-Latency Persistence + Capacity | Massive Capacity Expansion + Memory Pooling |
CXL Is Not the Answer
While CXL takes a different approach, focusing on flexible pooling and memory expansion with cheaper DRAM connected over PCIe, it introduces major caveats compared to Optane:
- ❌ Higher Latency: CXL-attached memory is slower than Optane PMem (and local DRAM), which hurts TTFT.
- ❌ No Persistence: CXL-attached memory is volatile DRAM, so the entire KV cache is still lost on reboot.
- ❌ Nvidia Support: Nvidia favors its own high-bandwidth interconnects (NVLink, NVLink-C2C) over open standards like CXL over PCIe.
Takeaways
The tragedy is the timing: just as the LLM/generative AI boom began, creating a market perfectly suited to exploit Optane’s unique combination of capacity, speed, and persistence, Intel made the business decision to abandon the memory technology (much like its exit from the DRAM business in ’86). We are now left solving the critical KV cache bottleneck with less-than-ideal options, which makes Optane PMem the great “what if” story of AI infrastructure.

Run AI Your Way – In Your Cloud
Want full control over your AI backend? The CloudThrill vLLM Private Inference POC is still open, but not forever.
📢 Secure your spot (only a few left): **Apply now!**
Run AI assistants, RAG, or internal models on an AI backend **privately in your cloud**:
✅ No external APIs
✅ No vendor lock-in
✅ Total data control
**Your infra. Your models. Your rules…**
🙋‍♂️ If you like this content, please subscribe to our blog newsletter ❤️.