What Is KV Cache? The Hidden Engine Behind Every AI Response

Achieving real-time inference at scale is now the central challenge of AI infrastructure. But the real bottleneck isn't GPU compute. It's something most teams haven't even looked at yet: the key-value (KV) cache. So what exactly is it, and why does it matter?

KV cache is the working memory of AI inference. It lets the model remember what it has already processed, eliminating the need to recalculate everything from scratch. This allows it to build on past conversations to deliver faster answers and a more natural user experience.

But KV cache isn’t unlimited. As demands on it explode, the GPU memory where KV cache lives can’t keep up, creating a bottleneck that limits inference. Understanding how KV cache works is key to knowing how to extend it – and making your AI infrastructure inference-ready.

KV Cache, Explained Simply

KV caching is a technique used to speed up inference. As an LLM generates a response, the KV cache is where it stores and quickly retrieves intermediate computations. This allows the model to remember what it has previously processed so that it doesn't have to spend additional time or compute recalculating those results from scratch.

KV cache gets its name from the two sets of numbers the model computes for every token it processes: keys, which act as identifiers the model matches against, and values, which hold the actual content. Because LLMs are autoregressive, meaning they predict each token in a sequence based on everything that came before, the model can simply read the cached keys and values as it predicts the next token at each step, then add that token's entries to the cache.

For example, say you've asked the model to complete the sentence, "That's one small step for…" After it predicts "man," it can't predict the next token without knowing the attention scores (a measure of relationship and relevance) between the new token and every previous token. Without KV cache, this requires redundant recomputation at every step, a process that can make responses feel impossibly slow.

  1. That’s one small step for…
  2. That’s one small step for man…
  3. That’s one small step for man, one…
  4. That’s one small step for man, one giant…
  5. That’s one small step for man, one giant leap…

This illustrates why KV cache is essential: without it, every new token requires recalculating every previous relationship in the sequence.

However, with KV cache, the LLM can quickly pull the stored keys and values from memory and compute only the new token's attention as it predicts each next token.
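To make this concrete, here is a minimal sketch of a single attention step that reuses a KV cache. The dimensions, weights, and function names are illustrative assumptions, not taken from any particular framework:

```python
import torch

d_model = 64                         # toy embedding size (illustrative only)
W_q = torch.randn(d_model, d_model)  # query, key, and value projection weights
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

k_cache, v_cache = [], []            # the KV cache: one key and one value per processed token

def attend_next_token(x_new):
    """Run attention for one new token, reusing every cached key and value."""
    q = x_new @ W_q                          # only the new token needs a query
    k_cache.append(x_new @ W_k)              # cache this token's key...
    v_cache.append(x_new @ W_v)              # ...and value, so they are never recomputed
    K = torch.stack(k_cache)                 # read everything seen so far from the cache
    V = torch.stack(v_cache)
    scores = torch.softmax(q @ K.T / d_model**0.5, dim=-1)   # attention scores vs. past tokens
    return scores @ V                        # weighted mix of cached values

# Each decode step adds one key/value pair instead of re-encoding the whole sequence.
for token_embedding in torch.randn(5, d_model):
    output = attend_next_token(token_embedding)
```

Without the two cache lists, every step would have to re-project keys and values for the entire sequence, which is exactly the redundant work the cache eliminates.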

KV Cache Is the Bottleneck Nobody Talks About

Although KV cache helps accelerate inference and reduce compute, this comes at the cost of memory. That’s because, as sequence lengths grow and context windows increase, so do the memory demands of KV cache. The result can turn KV cache into a stubborn bottleneck.

GPU memory is already stretched thin across AI infrastructure. Memory bandwidth is increasing at only about half the rate of GPU compute power – and that gap is growing – while a global memory shortage makes capacity harder to add. One reason memory is so constrained is that the weights of large models can run into the hundreds of billions of parameters; the largest models, such as those from OpenAI or Anthropic, can exceed a trillion. At these sizes, the weights by themselves can take up nearly all the high-bandwidth memory (HBM) on most GPUs.

The problem compounds when multiple tenants share the same inference infrastructure, putting further stress on GPU memory. Each request adds to the KV cache, quickly exhausting available HBM long before compute is saturated. When this happens, inference slows down as the model is forced to fall back on slower memory and rerun calculations. This is the "memory wall": even when there is more than enough GPU infrastructure, the lack of available memory for the model's KV cache keeps inference from running smoothly.
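A quick back-of-the-envelope calculation shows how fast the cache grows. The sketch below uses the standard sizing formula (2 × layers × KV heads × head dimension × bytes per value, per token); the model configuration is a hypothetical 70B-class example, not a measurement of any specific deployment:

```python
# Hypothetical 70B-class model configuration (illustrative numbers only)
layers        = 80     # transformer layers
kv_heads      = 8      # key/value heads (grouped-query attention)
head_dim      = 128    # dimension per head
bytes_per_val = 2      # FP16 storage

# 2x accounts for storing both keys and values at every layer
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val   # ~0.33 MB per token

def kv_cache_gb(context_tokens, concurrent_sessions):
    return bytes_per_token * context_tokens * concurrent_sessions / 1e9

print(kv_cache_gb(8_000, 1))      # ~2.6 GB for a single 8K-token session
print(kv_cache_gb(128_000, 32))   # ~1,342 GB for 32 long-context sessions -- far beyond one GPU's HBM
```

Even with hypothetical numbers, the shape of the problem is clear: cache size scales linearly with context length and concurrency, while HBM capacity stays fixed.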

The Memory Hierarchy: Where KV Cache Actually Lives

Does KV cache require HBM to function? While it works better when it’s stored within faster memory tiers, it does not need to live in HBM. In fact, understanding where KV cache goes after HBM fills up can help you learn how to optimize it.

Ideally, KV cache will overflow down a hierarchy of memory tiers, from fastest to slowest. NVIDIA has defined a four-level framework for this:

  • G1 (HBM/VRAM): This is the very fastest tier. It delivers cache at the nanosecond level and is ideal for critical KVs in active use. 
  • G2 (CPU DRAM): The second-fastest tier delivers cache within 10 to 100 nanoseconds. This makes it well-suited for staging and spillover KV.
  • G3 (local NVMe): This tier delivers cache in the microsecond range, which can be sufficient for "warm KV" that doesn't need nanosecond access.
  • G4 (network storage): The slowest tier works at the millisecond level, which means it should only be used for “cold KV” that is very rarely accessed.

The goal of dividing KV cache across this memory hierarchy is to move beyond context windows (the amount of working memory, as measured by the maximum number of tokens an AI model can process at once) and into "context memory": a larger system that works together to store, serve, and optimize an AI model's working memory during inference.

The critical question then becomes how to build an infrastructure that extends KV cache into G2 and G3 tiers while delivering memory-like performance.
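As a thought experiment, a tiered cache manager might place each KV block based on how recently it was used. The thresholds below are made-up illustrations that mirror the G1–G4 latency classes above, not tuned values from any real system:

```python
def place_kv_block(seconds_since_last_use: float) -> str:
    """Toy placement policy: pick a memory tier for a KV block by recency of use."""
    if seconds_since_last_use < 1:        # in active use
        return "G1_HBM"                   # nanosecond-class access
    if seconds_since_last_use < 30:       # likely to be reused very soon
        return "G2_DRAM"                  # staging and spillover
    if seconds_since_last_use < 600:      # warm: idle, but worth keeping close by
        return "G3_NVME"                  # microsecond-class access
    return "G4_NETWORK"                   # cold: rarely accessed

print(place_kv_block(0.2))   # -> G1_HBM
print(place_kv_block(120))   # -> G3_NVME
```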

KV Cache Optimization Techniques

The process of optimizing KV cache involves either reducing its size or reallocating its contents without compromising performance. There are several ways to do this.

Prefix Caching

For prompts that share the same prefix, this technique reuses stored cache states instead of recomputing them. It's a simple strategy that can nevertheless produce cache hit rates of 87% or more. Here's how it works:

  1. After receiving an initial prompt, the model builds its KV cache during prefill.
  2. The KV cache is stored and used to generate the output tokens during decode.
  3. When a new request is made with a matching prefix, the model can utilize the stored prefix cache and skip to the new part of the prompt.

Note that this technique only works with prefixes that are exactly identical, making it ideal for production workloads with repeated prompt structures.
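Here is a minimal sketch of the lookup logic, assuming a plain in-memory dictionary keyed on exact token-ID prefixes. The compute_kv function is a hypothetical stand-in for the model's prefill step; production systems such as vLLM and SGLang manage this at the block level rather than per request:

```python
prefix_cache = {}   # maps a tuple of token IDs -> the KV state computed for that prefix

def get_kv(token_ids, compute_kv):
    """Reuse the longest exactly matching cached prefix and prefill only the remainder."""
    for cut in range(len(token_ids), 0, -1):          # try the longest prefix first
        prefix = tuple(token_ids[:cut])
        if prefix in prefix_cache:
            cached = prefix_cache[prefix]
            new_kv = compute_kv(token_ids[cut:], past=cached)   # prefill just the new tail
            break
    else:
        new_kv = compute_kv(token_ids, past=None)               # no hit: full prefill
    prefix_cache[tuple(token_ids)] = new_kv                     # store for the next request
    return new_kv

# Toy usage with a stand-in compute function
kv = get_kv([1, 2, 3, 4], compute_kv=lambda tail, past: (past, tuple(tail)))
```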

Quantization

This technique allows you to reduce the memory footprint of your KV cache by 2x or even 4x. It does this by converting the KV cache into a slightly less precise format.

The keys and values in the cache are simply tensors of numbers. Quantization takes advantage of this by rounding or shortening each numerical value so that it fits within a more compact format. Although this means some loss of precision and accuracy, the tradeoff can be managed by carefully selecting the target numerical format (such as FP8 or INT8).

For example, converting an FP16 cache to FP8 reduces its size by 50% with only minimal impact on quality. This makes quantization an essential strategy for long-context inference.
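As an illustration of the arithmetic, here is a rough sketch of per-tensor quantization of a cached tensor to 8-bit integers. Inference frameworks implement FP8/INT8 KV cache quantization natively and far more carefully; this only shows the basic idea:

```python
import torch

def quantize_kv(kv: torch.Tensor):
    """Quantize a cached key/value tensor to INT8 with a single per-tensor scale."""
    scale = kv.float().abs().max() / 127.0                       # largest magnitude maps to 127
    kv_int8 = torch.round(kv.float() / scale).clamp(-127, 127).to(torch.int8)
    return kv_int8, scale                                        # 1 byte per value, plus one scale

def dequantize_kv(kv_int8: torch.Tensor, scale: torch.Tensor):
    return (kv_int8.float() * scale).to(torch.float16)           # approximate values at read time

kv = torch.randn(1024, 128, dtype=torch.float16)                 # a toy cached tensor
kv_q, scale = quantize_kv(kv)
print(kv.element_size(), "->", kv_q.element_size(), "bytes per value")   # 2 -> 1: a 2x reduction
```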

Offloading

Not every token stored in the KV cache is needed at any given moment. Offloading takes advantage of this by moving inactive parts of the cache to slower storage tiers in order to free up GPU memory for fresh data.

Offloading is especially useful when deploying LLMs with long context windows, multiple users, or intermittent sessions that make keeping large parts of the KV cache in GPU memory wasteful. It also pairs well with tiered memory frameworks that let you distribute KV cache across different types of storage. By moving unused cache into slower tiers such as NVMe or network storage, you can recall it on demand while keeping GPU memory free for the requests that need it right now.
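A toy sketch of the idea, using two Python dictionaries to stand in for GPU memory and a slower tier. Real offloading engines move paged KV blocks and overlap transfers with compute; the names and capacity here are hypothetical:

```python
from collections import OrderedDict

GPU_BUDGET = 4                       # toy capacity: sessions that fit in "GPU memory"
gpu_tier = OrderedDict()             # hot KV, ordered by recency of use
slow_tier = {}                       # stand-in for DRAM / NVMe / network storage

def touch_session(session_id, kv_state=None):
    """Fetch a session's KV, promoting it from the slow tier and evicting the least recent."""
    if session_id in gpu_tier:
        gpu_tier.move_to_end(session_id)                # mark as most recently used
    else:
        state = slow_tier.pop(session_id, kv_state)     # promote from slow tier (or admit new)
        if len(gpu_tier) >= GPU_BUDGET:
            evicted_id, evicted_kv = gpu_tier.popitem(last=False)   # offload the coldest session
            slow_tier[evicted_id] = evicted_kv
        gpu_tier[session_id] = state
    return gpu_tier[session_id]

# "e" evicts "a" to the slow tier; touching "a" again promotes it back and evicts "b"
for sid in ["a", "b", "c", "d", "e", "a"]:
    touch_session(sid, kv_state=f"kv-for-{sid}")
```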

Disaggregated Serving

A fundamental challenge of inference is that its two phases come with very different resource requirements: the prefill phase is compute-heavy, while the decode phase is memory-heavy. Disaggregated serving addresses this by running the two phases on separate GPUs.

With prefill and decode separated, you can accelerate both phases by optimizing each GPU for its individual task. Meanwhile, specialized inference software manages the transfer of KV cache between the two GPUs, so the entire architecture becomes more efficient, scalable, and faster even as demands on it grow.
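Conceptually, the handoff looks like the sketch below: one worker runs the compute-heavy prefill and publishes the resulting KV cache, and another consumes it for the memory-heavy decode. This is a simplified, hypothetical outline, not the API of any particular serving framework:

```python
import queue

kv_handoff = queue.Queue()   # stands in for the KV-transfer path between the two GPU pools

def prefill_worker(request_id, prompt_tokens):
    """Compute-heavy phase: build the KV cache for the full prompt (stubbed here)."""
    kv_cache = {"request": request_id, "prompt_len": len(prompt_tokens)}   # placeholder KV blob
    kv_handoff.put((request_id, kv_cache))                                 # hand off to the decode pool

def decode_worker():
    """Memory-heavy phase: stream output tokens using the received KV cache (stubbed here)."""
    request_id, kv_cache = kv_handoff.get()
    return f"decoding {request_id} with KV for {kv_cache['prompt_len']} prompt tokens"

prefill_worker("req-1", prompt_tokens=list(range(512)))
print(decode_worker())
```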

What KV Cache Means for How You Build

The emerging importance of KV cache (along with the memory that it depends on) to maintaining persistent real-time inference is changing how organizations architect for AI.

Most notably, storage cannot just function as a place to park files. Instead, it needs to become an extension of system memory. As KV cache footprints increase and demands on memory grow, HBM and DRAM cannot keep pace. Optimization techniques can help ease this pressure, but even they have limits as multi-tenant usage increases, sequence lengths grow, and AI models continue to scale. That makes it necessary to use a hierarchy of memory tiers that includes traditional storage.

NVIDIA's memory tier framework shows that this is possible, but the KV cache bottleneck persists unless access latency stays in the microsecond (or shorter) range. Software-defined memory extension has emerged as a solution to this problem because it transforms NVMe flash arrays into a high-performance memory layer for KV cache. Tokens can be recalled directly into GPU memory with microsecond latency, enabling persistent context across sessions along with continuous, real-time inference.

Memory is now the defining constraint of inference – and traditional infrastructure is not built to solve this challenge. It’s time for organizations to step up and start treating KV cache like the central bottleneck it’s become.

Get more insights into the biggest pain points enterprises face with memory in this webinar with S&P Global and WEKA's Chief AI Officer: Breaking Down the Memory Wall in AI Infrastructure.

Want to learn how AI storage impacts inference workloads? Download the Buyer’s Guide to AI Storage for a full evaluation framework.

Frequently Asked Questions

What does KV cache actually do?

KV cache stores the key-value attention states computed during the prefill phase of inference so the model doesn’t need to recompute them for every new token it generates. In plain terms, it’s the short-term memory of an AI. When you’re having a long conversation with a chatbot, it doesn’t want to re-read everything you’ve already said from scratch just to come up with the next word. KV cache stores those past “thoughts” so the AI can stay fast and focused as the conversation grows.

Why is everyone calling it a “bottleneck”?

KV cache size scales linearly with sequence length, model depth, and batch size — and it competes directly with model weights for limited GPU HBM capacity. That’s a problem because AI models are getting smarter and prompts are getting longer, but GPU memory isn’t growing nearly as fast. The KV cache eventually fills up all the available space on the chip. When that happens, the AI hits a memory wall — it either slows to a crawl or simply forgets the beginning of your conversation.

How does “quantization” help save memory?

Quantization reduces the numerical precision of cached key-value tensors — typically from FP16 to FP8 or INT8 — cutting memory footprint by 2–4x with less than 0.1% quality degradation on most benchmarks. Think of it like saving a photo as a high-quality JPEG instead of a massive RAW file: you lose a tiny bit of mathematical precision, but you gain a massive amount of space, allowing the AI to handle much longer tasks.

What’s the difference between “warm” and “cold” cache?

Warm cache refers to KV states held in GPU HBM or system DRAM with nanosecond-scale access latency; cold cache refers to KV states persisted to NVMe or network storage with microsecond-to-millisecond access latency. The goal of a good AI infrastructure setup is to keep the most frequently accessed states warm so the user never notices a delay — and to tier the rest intelligently rather than evicting it entirely.

Can different users share the same cache?

Yes — through a technique called prefix caching, which stores the KV states for shared prompt prefixes (such as system instructions or a common document) and reuses them across requests without recomputation. If 1,000 people all ask a bot about the same 50-page PDF, the system only needs to “read” and cache that PDF once. Everyone pulls from that same starting point, which saves a massive amount of compute. SGLang’s RadixAttention achieves up to 87% cache hit rates for shared-prefix workloads.

How does WEKA fix the KV cache problem?

WEKA’s Augmented Memory Grid extends effective GPU memory by presenting NVMe and network storage as a unified, software-defined memory tier — delivering up to 1,000x more capacity than HBM alone with microsecond-class latency. Standard infrastructure usually breaks when the GPU runs out of room: you either evict cache (losing context) or add more GPUs (burning budget). AMG treats your storage like an extension of the GPU’s own memory, keeping KV states warm and accessible across the full memory hierarchy — HBM, DRAM, local NVMe, and network storage — without forcing the model to choose between context length and performance.