What Is Context Memory? The New Infrastructure Layer Powering AI Inference

TL;DR

Context memory is the infrastructure layer that stores and serves an AI model’s working memory (its KV cache) during inference. NVIDIA formalized it as a distinct market category in January 2026 with NVIDIA® CMX™, powered by the NVIDIA BlueField®-4 storage processor. It’s the missing tier between GPU compute and traditional storage, and it’s about to change how every inference stack gets built.

Why “Context Memory” Exists Now

LLMs have always had working memory. Every token a model generates depends on the key-value cache — the mathematical representation of everything that came before in the conversation. That’s not new. What’s new is the scale.

Twelve months ago, inference meant chat sessions. An 8,000-token conversation. A quick prompt-and-response. The KV cache fit comfortably in GPU high-bandwidth memory, and nobody thought much about it.

Then agents arrived. Not hypothetical agents — production agent swarms running hundreds of turns, consuming million-token contexts, operating 24/7. As WEKA Chief AI Officer Val Bercovici framed it at Xcelerated Compute in March 2026: we have to get past the notion of chat as the mental model for this industry. Coding agents, knowledge agents, security agents — they’re where all the action is, and they consume tokens at a rate that makes chat look like a rounding error.

The math breaks fast. A single 100,000-token session can expand to roughly 50 gigabytes of KV cache in GPU memory, depending on model architecture and precision. Multiply that by concurrent users on a multi-tenant inference server, add trillion-parameter model weights that consume most of the HBM before a single user connects, and you’re out of memory before you begin. The context memory problem isn’t a future concern. It’s the defining constraint of inference right now.
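To see where a figure of that size can come from, a rough sizing sketch helps. The KV cache holds one key and one value vector per layer, per KV head, per token, so it grows linearly with context length. The model shape below (64 layers, 16 KV heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen for illustration, not a description of any specific model.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for one session, in bytes."""
    # Factor of 2: every layer stores a key tensor and a value tensor.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * tokens

# Hypothetical model shape (an assumption for illustration only).
session = kv_cache_bytes(tokens=100_000, layers=64, kv_heads=16, head_dim=128)
print(f"{session / 1e9:.1f} GB per 100,000-token session")  # 52.4 GB
```

Under these assumptions each token costs 512 KiB of cache, so a handful of concurrent long sessions is enough to exhaust whatever HBM the model weights leave free.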

NVIDIA recognized this by formalizing context memory as a new infrastructure category at CES in January 2026, announcing the BlueField-4-powered CMX (Context Memory eXtension) platform. This wasn’t a product refresh. It was category creation: an acknowledgment that the gap between GPU compute and traditional storage needs its own infrastructure layer. Where KV cache offloading was once a niche optimization technique, it’s now the foundation of an entire infrastructure tier.

→ Understand the data structure context memory serves: What Is KV Cache?

→ Start from the beginning: How AI Inference Actually Works

The Inference Memory Hierarchy, Formalized

Before 2026, the memory tiers around the GPU were an informal engineering reality: something framework developers dealt with in code but nobody named. NVIDIA’s Dynamo inference framework changed that by formalizing a four-tier inference memory hierarchy:

G1: HBM (High Bandwidth Memory). Co-packaged with the GPU. Fastest tier: 3.35–4.8 TB/s on Hopper-generation GPUs (H100, H200), about 8 TB/s on Blackwell B200, and 22 TB/s on Vera Rubin’s HBM4. This is where attention algorithms execute. It’s also the scarcest and most expensive tier: 192 GB on B200, 288 GB on Rubin. Model weights alone can consume most of it.

G2: CPU DRAM. Server system memory. Cheaper, larger (512 GB to 2 TB per node), but roughly 50x slower than HBM. The first overflow tier when GPU memory fills up.

G3: Local NVMe. SSDs on the GPU server itself. 4–30 TB per node, another order of magnitude cheaper per gigabyte. Traditionally treated as “just storage,” but that’s exactly what’s changing.

G4: Network storage. Remote storage accessible over the data center network. Effectively unlimited capacity, but historically too slow for inference workloads.

Context memory lives across G2, G3, and G4 — extending effective GPU memory by orders of magnitude. The critical insight: each tier is 10–100x larger than the one above it. If your software can deliver memory-class performance from NVMe (G3) and network storage (G4), you don’t just extend memory. You transform the economics of inference.
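The overflow behavior across tiers can be sketched as a simple placement loop: try the fastest tier first and fall through to the next one when it is full. This is a minimal illustration, not the Dynamo KV block manager’s actual logic. The `Tier` class, the `place_block` helper, and the 60 GB of free HBM are hypothetical; the G2 and G3 capacities reuse the low end of the per-node ranges above.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    capacity_gb: float   # illustrative per-node capacity
    used_gb: float = 0.0

    def has_room(self, size_gb):
        return self.used_gb + size_gb <= self.capacity_gb

def place_block(tiers, size_gb):
    """Place a KV block in the fastest tier that still has room."""
    for tier in tiers:  # ordered fastest (G1) to slowest (G4)
        if tier.has_room(size_gb):
            tier.used_gb += size_gb
            return tier.name
    raise MemoryError("no tier can hold this block")

# Assumption: only 60 GB of HBM is left after model weights are loaded.
tiers = [
    Tier("G1-HBM", 60),
    Tier("G2-DRAM", 512),
    Tier("G3-NVMe", 4_000),
    Tier("G4-network", float("inf")),
]

# Twelve sessions of 50 GB each arrive one after another.
placements = [place_block(tiers, 50) for _ in range(12)]
print(placements)
```

With these numbers only the first session lands in HBM, the next ten fill DRAM, and the twelfth spills to NVMe. A real system would also demote cold blocks and promote hot ones rather than only placing new arrivals.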

NVIDIA’s Vera Rubin SuperPod makes the scale concrete: 1,152 Rubin GPUs across 40 racks, each GPU with 288 GB of HBM4. That’s 331,776 GB, or roughly 332 terabytes (about 324 TiB), of G1 memory. But behind those GPUs sit BlueField-4 DPUs, each backing 150 terabytes of context memory storage via CMX. The context memory tier dwarfs the GPU memory tier by design.

The CMX platform delivers 5x higher tokens per second and 5x better power efficiency compared to traditional storage approaches. Those aren’t incremental gains. They’re the difference between inference that scales and inference that hits a wall at 40 concurrent users.

→ Why that wall exists: The AI Memory Wall

The DRAM Ceiling

The instinctive response to HBM scarcity is to spill overflow into CPU DRAM. On paper, it bridges the gap. In practice, it just defers the problem.

DRAM is trapped on a single server. It can’t share cached context across a GPU cluster. Every inference instance is its own isolated memory island — if a request lands on an instance that hasn’t seen that conversation before, the system starts cold, even if another instance cached that exact context seconds ago. When demand spikes and new GPU instances spin up, they start empty. Every request pays full processing cost until the cache warms. And at $5–10 per GB — 25–50x more expensive than flash storage — scaling DRAM to the capacity these workloads demand isn’t just architecturally constrained. It’s not economically viable.
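The cost gap is easy to make concrete. The sketch below prices one petabyte of raw capacity using the $5–10 per GB DRAM range and the 25–50x ratio quoted above. The implied flash prices are derived from that ratio, and pairing the low price with the high ratio (and vice versa) is an assumption made here to bound the range.

```python
PETABYTE_GB = 1_000_000  # decimal petabyte expressed in gigabytes

def tier_cost_usd(capacity_gb, usd_per_gb):
    """Raw media cost for a given capacity at a given price per GB."""
    return capacity_gb * usd_per_gb

# DRAM price range quoted above: $5-10 per GB.
dram_low = tier_cost_usd(PETABYTE_GB, 5)
dram_high = tier_cost_usd(PETABYTE_GB, 10)

# Flash range implied by "25-50x more expensive" (the pairing of ratios
# with prices is an assumption made for this sketch).
flash_low = dram_low / 50
flash_high = dram_high / 25

print(f"DRAM:  ${dram_low:,.0f} to ${dram_high:,.0f} per PB")
print(f"Flash: ${flash_low:,.0f} to ${flash_high:,.0f} per PB")
```

Media cost is only part of the picture, since DRAM at this scale also means buying the servers to hold it, but the order-of-magnitude difference is the point.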

What inference actually requires is a memory tier that delivers DRAM-class speed at flash economics and petabyte capacity, shared across the entire cluster.

Agentic AI makes this gap critical rather than merely expensive. Production agentic workloads — the sessions generated by tools like Claude Code, Cursor, and Codex — reveal the scale of the problem. A typical session carries 90,000 tokens of context. At the high end, sessions exceed 865,000 tokens and run thousands of turns, with just 3.9 seconds between each one. That context has to stay hot and accessible in near-real time, or response times degrade immediately.

No single GPU’s memory can hold that kind of context across concurrent workloads. Every piece of context evicted from memory is context the system rebuilds from scratch — costing you twice: once in latency, once in dollars. That’s the DRAM ceiling. Every agentic workload compounds it.

Legacy storage and general-purpose infrastructure were never designed for workloads that continuously reuse and build on prior context. The gap is architectural — and it’s what defines the inference infrastructure category.

Three Networks, Not Two

Traditional GPU clusters have two networks: a front-end network connecting users to the cluster and a back-end network connecting GPUs to each other for distributed computation. Context memory introduces a third.

Val identified this architectural shift in a recent webinar with Steve McDowell of NAND Research: there are going to be three networks in these NVIDIA clusters soon. Front-end, back-end, and context memory network. The context memory network is a dedicated high-speed fabric connecting GPUs to their extended memory tier — NVMe and network storage delivering KV cache at memory-class latencies.

This isn’t just a networking change. It’s an architectural paradigm shift. The context memory network needs storage that performs like memory, not storage that performs like storage. Microsecond latency. Hundreds of gigabytes per second of sustained throughput. And it has to maintain that performance under the chaotic, mixed I/O patterns of thousands of concurrent inference sessions.

The three-network architecture also enables a concept Val calls “token warehousing” — prefill your KV cache once, warehouse it persistently, and decode from it indefinitely. Instead of redundantly recomputing prefill every five minutes or every hour, you produce your tokens at the factory once and warehouse them at memory speeds. That’s how you convert negative unit economics on token generation into positive ones.
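In code, the warehousing idea reduces to a persistent cache keyed by the context itself: run prefill only on a miss, and serve every later request from the stored result. The sketch below is a toy model under stated assumptions. `TokenWarehouse` and `fake_prefill` are hypothetical names, an in-process dictionary stands in for a shared persistent tier, and a string stands in for the real KV tensors.

```python
import hashlib

class TokenWarehouse:
    """Toy prefix cache: prefill once, reuse the stored KV afterwards."""

    def __init__(self, prefill_fn):
        self._prefill_fn = prefill_fn  # stand-in for expensive GPU prefill
        self._store = {}               # stand-in for a persistent shared tier
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(token_ids):
        raw = ",".join(map(str, token_ids)).encode()
        return hashlib.sha256(raw).hexdigest()

    def get_kv(self, token_ids):
        key = self._key(token_ids)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._prefill_fn(token_ids)
        return self._store[key]

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

prefill_calls = []

def fake_prefill(token_ids):
    # Records each call so the amount of recomputation is visible.
    prefill_calls.append(len(token_ids))
    return f"kv-for-{len(token_ids)}-tokens"

warehouse = TokenWarehouse(fake_prefill)
shared_context = list(range(1000))
for _ in range(4):
    warehouse.get_kv(shared_context)

print(len(prefill_calls), warehouse.hit_rate())  # 1 0.75
```

Four requests against the same context trigger a single prefill. Production systems typically hash fixed-size token blocks rather than whole contexts, so that sessions sharing only a prefix can still reuse part of the cache.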

→ Why prefill and decode need different infrastructure: Prefill and Decode

What This Means for Infrastructure Decisions

If you’re evaluating inference infrastructure in 2026, context memory storage changes your criteria in three concrete ways.

Your Storage Must Present as Memory, Not Files. 

G2 and G3 performance — microsecond latency, hundreds of GB/s bandwidth — is the new baseline for inference storage. If your storage vendor talks in milliseconds, they’re a generation behind. The NVIDIA Dynamo KV block manager and NIXL transfer library already implement the software interfaces for memory-tier storage. Your infrastructure needs to meet them.

Software-Defined Wins. 

Context memory storage can’t be a hardware appliance bolted onto the side of your cluster. It needs to run on the NVMe drives already in your GPU servers (G3), extend seamlessly to network storage (G4), and adapt as your inference workload mix changes. WEKA’s Augmented Memory Grid™ delivers exactly this, providing G1.5-, G2-, and G3-class performance across NVMe and network storage over the context memory network. It’s persistent (surviving restarts and rebalances), petabyte-scale, and achieves 75–99% KV cache hit rates that eliminate redundant prefill operations.

Plan for Three Networks, Not Two. 

Your next GPU cluster design should include a dedicated context memory network alongside the traditional front-end and back-end networks. Its bandwidth and latency requirements are distinct from both, and trying to share either existing network will create contention that defeats the purpose.

NVIDIA CMX with BlueField-4 arrives in the second half of 2026. But context memory storage isn’t a future problem. Inference providers are sizing exabyte-scale caches today. WEKA has been shipping augmented memory capabilities for over a year — with production deployments on Oracle Cloud, CoreWeave, and Firmus demonstrating 4–5x more tokens from the same hardware through software alone.

The inference memory hierarchy is real. The context memory network is coming. The question isn’t whether your infrastructure needs a context memory tier. It’s whether you’ll build it before your competitors do.

→ Practical techniques to implement today: Inference Optimization

→ Know your numbers: How to Measure AI Inference 

Go deeper: Read What Is KV Cache? to understand the data structure that context memory serves — and why it’s the bottleneck everyone is finally talking about.

Build your AI infrastructure without bottlenecks. Download the Buyer’s Guide to AI Storage for a full evaluation framework.

Frequently Asked Questions

What is context memory in AI? 

Context memory is the infrastructure layer that stores and serves an AI model’s working memory, specifically the KV (key-value) cache, during inference. Every token a large language model generates depends on the KV cache, which holds the mathematical representation of all prior tokens in the conversation. As models scale to trillions of parameters and agent workloads run million-token contexts, KV cache demands far exceed what GPU high-bandwidth memory can hold. Context memory extends that capacity across CPU DRAM, NVMe, and network storage, delivering memory-class performance from lower-cost tiers.

How is context memory different from GPU memory?

GPU memory (HBM) is co-packaged with the processor and delivers the highest bandwidth — up to 22 TB/s on NVIDIA’s Vera Rubin architecture. But it’s scarce: 80–288 GB per GPU depending on generation. Context memory spans the tiers below HBM — CPU DRAM (G2), local NVMe (G3), and network storage (G4) — each 10–100x larger than the one above. The key difference: context memory storage uses software to deliver memory-like performance from these larger, cheaper tiers, so KV cache can overflow gracefully instead of forcing recomputation or dropping sessions.

What is NVIDIA CMX?

CMX (Context Memory eXtension) is NVIDIA’s platform for context memory storage, announced at CES in January 2026 and powered by the BlueField-4 DPU. CMX creates an optimized memory tier between GPU HBM and traditional networked storage, delivering 5x higher tokens per second and 5x better power efficiency compared to conventional storage approaches. It integrates with NVIDIA’s Dynamo inference framework and the KV block manager to manage KV cache placement across the G1–G4 memory hierarchy. BlueField-4 with CMX is expected to ship in the second half of 2026.

What is the G1, G2, G3, G4 memory hierarchy? 

NVIDIA’s Dynamo inference framework formalized GPU cluster memory into four tiers. G1 is HBM on the GPU itself — fastest and most expensive. G2 is CPU DRAM on the server — cheaper, larger, roughly 50x slower. G3 is local NVMe SSDs — another order of magnitude cheaper per gigabyte. G4 is network-attached storage — effectively unlimited capacity but traditionally the slowest. Context memory solutions like WEKA®’s Augmented Memory Grid™ blur these boundaries, delivering G2 and even G1.5 performance from G3 and G4 hardware.

What is token warehousing? 

Token warehousing is the practice of computing KV cache once during the prefill phase and persisting it for repeated decode operations — rather than recomputing prefill every time a session resumes or a shared context is reused. Think of it as the difference between manufacturing a product and shipping it directly to the consumer each time (expensive, wasteful) versus manufacturing once, warehousing it, and delivering from inventory (efficient, scalable). Token warehousing requires a persistent context memory tier that can store and serve KV cache at memory speeds.

Can I implement context memory today, or do I need to wait for BlueField-4? 

You don’t need to wait. WEKA has been shipping augmented memory capabilities for over a year, with production deployments demonstrating 4–5x more tokens from the same hardware through software alone. The Augmented Memory Grid runs on existing GPU servers — no new hardware required — and delivers persistent, petabyte-scale KV cache storage with 75–99% cache hit rates. When BlueField-4 CMX arrives in H2 2026, it will complement and accelerate these capabilities, but the context memory tier is available now for teams that need it.