No Memory, No Margin

The AI industry spent 2023 obsessing over prompts. It spent 2024 obsessing over context windows. In late 2025 and now well into 2026, the conversation has shifted again. This time, it's about memory. Specifically, the gap between what the industry calls “logical” KV caching and what's actually happening in production. This gap is costing companies millions, degrading model quality, and quietly handcuffing the agentic AI era before it has fully arrived.
During a recent episode of the Fabricated Knowledge podcast, Doug O’Laughlin, president of independent research and analysis firm SemiAnalysis, interviewed WEKA’s Val Bercovici to discuss what’s really going on and what you can do about it. Read on for highlights and grab the link to the full podcast at the end of this article.
What is driving trillion-token AI consumption in 2026?
AI token consumption in some cases has crossed into the trillions, territory that would have seemed unimaginable only 18 months ago. And it isn't experimental usage; this is in production, with volumes climbing fast. “We're seeing trillion-token consumption by AI startups now as a routine thing, daily sometimes,” Val noted.
At this consumption level, even small inefficiencies can mushroom into enormous costs quickly. To put this in context, Anthropic's most recent Opus pricing for context windows above 200,000 tokens in Fast Mode (which prioritizes latency and is currently in research preview) sits at $225 per million output tokens. Uncached input tokens cost $60 per million. But cached reads drop the price to $6 per million — a 10x reduction.
At scale, the financial delta between well-managed caching strategies and those that aren’t optimized isn't a rounding error; it's the difference between a sustainable business and one that is burning substantial yet avoidable amounts of cash on infrastructure. “If it was all uncached and you're paying the raw price, you're talking insane numbers,” Doug explained.
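To make the delta concrete, here is a minimal back-of-the-envelope sketch using the two input prices quoted above. The function name, the one-trillion-tokens-per-day volume, and the hit rates are illustrative assumptions, and the model deliberately ignores cache-write premiums and output-token charges.

```python
# Prices per million input tokens, taken from the figures quoted in the article.
UNCACHED_PER_M = 60.0
CACHED_READ_PER_M = 6.0


def daily_input_cost(tokens_per_day: float, physical_hit_rate: float) -> float:
    """Dollar cost of one day's input tokens at a given physical cache hit rate.

    Simplification (assumption): cache-write premiums and output tokens are ignored.
    """
    if not 0.0 <= physical_hit_rate <= 1.0:
        raise ValueError("physical_hit_rate must be between 0 and 1")
    millions = tokens_per_day / 1_000_000
    cached = millions * physical_hit_rate * CACHED_READ_PER_M
    uncached = millions * (1.0 - physical_hit_rate) * UNCACHED_PER_M
    return cached + uncached


if __name__ == "__main__":
    tokens = 1e12  # hypothetical volume: one trillion input tokens per day
    for rate in (0.0, 0.70, 0.90, 0.99):
        cost = daily_input_cost(tokens, rate)
        print(f"hit rate {rate:.0%}: ${cost:,.0f} per day")
```

Under these assumptions, a trillion input tokens a day costs about $60 million fully uncached, about $22.2 million at a 70% physical hit rate, and about $11.4 million at 90%.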
What is the difference between logical and physical KV caching?
This is the core insight most teams miss.
Logical KV cache utilization is what you observe in your traces: the theoretical reuse of tokens across turns, subtasks, and agent calls. In multi-agent workflows where dozens of parallel subtasks share the same system prompts, codebases, and toolsets, logical cache hit rates can reach 90% or even 99%. The efficiency is genuinely impressive.
On the other hand, physical KV cache utilization is what your inference provider can actually deliver. Here’s where the picture changes dramatically.
The big giveaway is in the pricing tiers for cache write windows. In 2026, these have started to vary in terms of how long a window will stay active and what the premium is for longer sessions, but there’s always a cap, which exists for a specific reason: KV cache is being stored in DRAM, and DRAM is finite. As Val explains, “If you can't offer more than an hour, there's no cache offloading beyond DRAM. You're doing HBM, maybe DRAM pooling, maybe CXL, but that's basically musical chairs around very finite resources. We're still talking about 1 to 2 TB, at most, of DRAM per node.”
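A rough sizing sketch shows why 1 to 2 TB per node runs out quickly. The model shape below (80 layers, 8 KV heads, head dimension 128, FP16 cache values) is a hypothetical example, not a description of any specific model, and the calculation assumes the whole DRAM budget is available for KV cache, which real nodes will not achieve.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """KV cache bytes for one token: a key and a value vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def sessions_that_fit(dram_bytes: int, context_tokens: int, per_token_bytes: int) -> int:
    """How many full-length contexts fit in DRAM if nothing else uses it (assumption)."""
    return dram_bytes // (context_tokens * per_token_bytes)


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)
    context = 200_000  # tokens per cached session
    for dram_tb in (1, 2):
        dram = dram_tb * 10**12
        fits = sessions_that_fit(dram, context, per_token)
        print(f"{dram_tb} TB DRAM holds about {fits} cached 200k-token sessions")
```

With this hypothetical shape, each token needs 327,680 bytes of KV cache, so a 200,000-token context occupies roughly 65.5 GB and a 2 TB node holds about 30 of them.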
What happens in practice? An orchestration agent dispatches 50 parallel subtasks. They don't all complete simultaneously. Some hit a five-minute gap between calls. The cache is evicted. The next subtask starts from scratch, triggering a full — and expensive — prefill all over again. The logical cache hit rate might have looked great, but the physical cache hit rate was far lower, and you paid for the full prefill anyway.
Val illustrated the downstream effect: “With billions of tokens per day and agent swarms, you're evicting all the time. You're certainly evicting within five minutes.”
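The gap between the two hit rates can be reproduced with a small simulation. This is a sketch under stated assumptions: a single cache with a fixed time-to-live that resets on every access, no capacity-driven eviction, and a made-up list of call timestamps. The function name and trace format are hypothetical.

```python
def hit_rates(requests, ttl_seconds):
    """Return (logical_hit_rate, physical_hit_rate) for a time-sorted trace.

    requests: list of (timestamp_seconds, prefix_id) tuples.
    Logical hit: the prefix has been seen before at any time.
    Physical hit: the prefix was last touched within ttl_seconds.
    """
    last_seen = {}
    logical = physical = total = 0
    for ts, prefix in requests:
        total += 1
        if prefix in last_seen:
            logical += 1
            if ts - last_seen[prefix] <= ttl_seconds:
                physical += 1
        # A hit refreshes the window; a miss rewrites the entry.
        last_seen[prefix] = ts
    if total == 0:
        return 0.0, 0.0
    return logical / total, physical / total


if __name__ == "__main__":
    # Hypothetical subtasks sharing one system prompt, with uneven gaps between calls.
    times = [0, 60, 120, 500, 560, 900, 960, 1400, 1460, 1520]
    trace = [(t, "shared_system_prompt") for t in times]
    logical, physical = hit_rates(trace, ttl_seconds=300)
    print(f"logical hit rate:  {logical:.0%}")
    print(f"physical hit rate: {physical:.0%}")
```

In this made-up trace, nine of ten calls reuse the prefix, so the logical hit rate is 90%, but three of those calls arrive after a gap longer than five minutes, so the physical hit rate is 60%.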
Why DRAM is the bottleneck in AI inference memory hierarchies
To understand why this problem is structural and not just a configuration issue, it helps to understand the memory hierarchy in modern inference infrastructure.
HBM (High-Bandwidth Memory) on the GPU is where attention computation actually happens. Flash Attention algorithms don't operate directly from DRAM; they move data into HBM first, and from HBM into SRAM, before computation occurs. DRAM sits between persistent storage and HBM as an intermediary. Val put it plainly: “DRAM is a staging area. It's a green room. And the green room is overloaded right now.”
Currently, this green room is continuously overbooked. Providers are constantly refilling DRAM from compute, cycling tokens through HBM over and over. This burns energy, hammers accelerators, and is the underlying reason cache windows above one hour don't exist yet: there's nowhere durable to hold this cache without offloading it to NVMe storage, and traditional NVMe hasn't been fast enough to keep pace with the feed rate DRAM requires.
The implication is significant: the one-hour maximum cache window isn't a business decision but an infrastructure ceiling.
How does GPU memory scarcity cause AI model quantization at inference?
There's a less-discussed consequence to this memory crunch: model quantization.
When inference providers receive a new model, they evaluate it at full precision, and influencers and benchmarkers then test it at full precision too. By the time most enterprise users access it, it's been quantized, often to FP8 or lower, because there simply isn't enough memory to serve it at FP16 at scale while also maintaining throughput. The model that posted the benchmark scores isn't quite the model in production.
This is addressable. “It's not a compute problem at inference time; it's a memory problem,” Val said. “If you reduce the amount of redundant prefills, you now have massive savings in accelerator compute time. You can reallocate those GPUs.” More parallel decode capacity means lower latency, higher throughput, and crucially, the ability to keep higher-precision weights in memory. The memory wall and the quantization problem are the same problem.
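The memory pressure behind that tradeoff is simple arithmetic. The sketch below assumes a hypothetical 70-billion-parameter model and counts weights only, ignoring KV cache, activations, and runtime overhead.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}


def weight_gb(params_billions: float, precision: str) -> float:
    """Weight memory in GB: billions of parameters times bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    params = 70  # hypothetical model size, in billions of parameters
    for precision in ("fp16", "fp8"):
        print(f"{precision}: {weight_gb(params, precision):.0f} GB of weights")
```

For this hypothetical model, FP16 weights take 140 GB and FP8 weights take 70 GB, which is the kind of headroom a memory-constrained provider is trading quality for.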
How to optimize KV cache hit rates in production AI inference workloads
First, instrument your actual physical cache hit rate, not just the logical one. Doug’s own audit found the reality more nuanced than expected: “It averages around the 70s. There are definitely sessions where you can see heavy usage of the same tokens — boom, 90%. But when you start a new session, you can have much lower rates.”
Second, architect your agent workflows around cache window boundaries. If you're running multi-agent swarms with subtasks that may have variable completion times, design your orchestration to ensure high-frequency cache refreshes on shared prefixes — system prompts, codebases, tool definitions — before the five-minute eviction window closes. Pre-purchasing one-hour cache write tiers where available is often worth the premium when operating at high concurrency.
Third, watch OpenRouter as a leading indicator. Right now, very few open-model providers offer cache reads as a product, let alone cache write tiers. Val frames the signal clearly: “When those two columns fill up, we're going to see an explosion of performance and efficiency and higher-quality tokens, because people won't have to make that tradeoff.”
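The first two recommendations can be sketched in a few lines. The usage field names below are modeled on the token counters Anthropic's Messages API reports (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`); other providers name these differently, so treat them as assumptions. The record layout, function names, and 30-second safety margin are hypothetical, and the refresh helper assumes the provider resets the cache window on every read, which should be verified against your provider's documentation.

```python
from collections import defaultdict


def physical_hit_rate(usages):
    """Share of input tokens served from cache, based on provider-reported counters."""
    read = written = uncached = 0
    for u in usages:
        read += u.get("cache_read_input_tokens", 0)
        written += u.get("cache_creation_input_tokens", 0)
        uncached += u.get("input_tokens", 0)
    total = read + written + uncached
    return read / total if total else 0.0


def hit_rate_by_session(records):
    """Group hypothetical {'session_id', 'usage'} records and score each session."""
    by_session = defaultdict(list)
    for r in records:
        by_session[r["session_id"]].append(r["usage"])
    return {sid: physical_hit_rate(us) for sid, us in by_session.items()}


def prefixes_due_for_refresh(last_touched, now, ttl_seconds=300.0, margin_seconds=30.0):
    """Return shared prefixes that are still cached but expire within the margin."""
    due = []
    for prefix_id, touched in last_touched.items():
        remaining = ttl_seconds - (now - touched)
        if 0 < remaining <= margin_seconds:
            due.append(prefix_id)
    return sorted(due)


if __name__ == "__main__":
    records = [
        {"session_id": "a", "usage": {"input_tokens": 1000,
                                      "cache_creation_input_tokens": 9000,
                                      "cache_read_input_tokens": 0}},
        {"session_id": "a", "usage": {"input_tokens": 1000,
                                      "cache_creation_input_tokens": 0,
                                      "cache_read_input_tokens": 9000}},
    ]
    print(hit_rate_by_session(records))
    last_touched = {"system_prompt": 0.0, "codebase": 100.0, "tools": 280.0}
    print(prefixes_due_for_refresh(last_touched, now=280.0))
```

Scoring per session rather than in aggregate matters because, as Doug's audit suggests, a healthy average can hide new sessions with much lower rates. Whether a keep-alive refresh pays for itself depends on your provider's cache read and write pricing.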
What is the future of KV cache offloading and inference memory architecture?
The memory wall is not a temporary constraint; it is the defining infrastructure challenge of the agentic era. Doug summarized it like this: “If 2023-2024 was about prompting, context management is the new ‘prompting’.”
KV cache offloading from DRAM to fast NVMe, combined with smarter orchestration and models that become natively memory-aware, will determine which inference providers can deliver both performance and margin at scale. The companies that understand their actual caching economics today — not the theoretical maximum, but the physical reality — will have a significant advantage as token volumes continue their steep climb.
Memory is the new context. And the green room can't stay overbooked forever.
Be sure to listen to Doug and Val’s full conversation for even more insights.


