Prefill and Decode: A Technical Guide to the Two Phases of Inference

Delivering sustainable AI inference means balancing the competing demands of its two phases: prefill and decode. The problem is that most infrastructures don’t.
Prefill is the compute-heavy phase of LLM inference that happens immediately after a prompt is entered. It’s when the AI model first processes the input and generates the key-value (KV) cache it will reference when it begins producing the output. Decode is the phase that follows. Because it relies on the previously generated KV cache to build output tokens one at a time, it is memory-intensive. Both phases are essential to providing seamless, real-time inference, but they need very different things from infrastructure.
This is the core challenge of inference optimization. Designing for just one phase at the expense of the other will ultimately increase latency, reduce throughput, and hurt the user experience. Here’s what you need to know about these two phases – and how you can accommodate both in your inference system design.
Prefill — Processing the Prompt
From the perspective of a user, prefill is the wait that happens from when a prompt is first entered to when an output begins generating. This wait can feel near instantaneous or can drag on forever. It all depends on what is happening behind the scenes.
Prefill is a compute-intensive process that can often use the full extent of the GPU. It begins by processing the input the user gave the AI, then converts it into the keys and values that make up the KV cache. Instead of processing one token at a time, prefill speeds up this step by processing the entire input simultaneously. This requires the GPU to perform a large number of matrix multiplications at once, which is exactly the kind of parallel work GPUs are built for.
Reading and processing the entire prompt (as well as any system instructions and additional context) in one go achieves a high degree of GPU utilization, but the cost also grows quickly as the prompt gets longer. This is because, in addition to producing the keys and values stored in the KV cache, the model calculates attention scores: the relation of each token to every other token. This is how the model understands context and relationships. It’s also why prefill uses so much compute.
For example, consider a 1,000-token prompt. Because attention compares every token with every other token, it requires N × N computations, or 1,000,000 attention scores. This grows quadratically, not linearly, as input lengths grow. A 4,000-token prompt doesn’t require 4,000,000 attention scores – it requires 16,000,000.
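The arithmetic above can be checked with a few lines of Python. This is only a sketch of the N × N count described in the text; the function name is illustrative, and real models repeat this work per attention head and per layer (and causal masking skips roughly half of the pairs), so treat the numbers as a scaling illustration, not a cost model.

```python
def attention_score_count(prompt_tokens: int) -> int:
    """Token-to-token attention scores for a full N x N comparison."""
    return prompt_tokens * prompt_tokens


for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens -> {attention_score_count(n):>12,} scores")
```

Doubling the prompt length quadruples the count, which is why long prompts hit prefill so much harder than short ones.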
So as prompt length grows, so does the computational burden on the GPU and the length of time it takes for the prefill phase to complete. The speed and efficiency of prefill is measured using time-to-first-token (TTFT), or the amount of time it takes for the model to generate the first token. Ideally, TTFT should be measured in milliseconds rather than seconds. For a model to feel interactive, TTFT should be <500ms. For it to feel real-time, it should be <200ms.
Shaving off these crucial milliseconds matters, which is why the best AI infrastructures require massive sustained throughput and parallel I/O. When it comes to the user experience, the smallest delay can feel huge.
Decode — Generating the Response
Decode is the more visible phase of inference. It’s when the AI model moves from processing the input to using these calculations to generate the output the user will see.
The most notable difference between decode and prefill is that decode cannot parallelize across the tokens of a single response. Instead, it must generate each token one at a time. This is because every token it generates depends on the ones that came before it. With each step in this sequence, the model must read the entire KV cache the prefill built (plus the entries added for tokens generated so far) in order to understand the full context and create the next token. Since the KV cache lives in GPU memory (HBM), decoding is memory bound: data transfer speed, not compute power, determines latency.
Decode efficiency depends directly on how quickly data can be read from memory. The faster those reads, the more fluid inference feels. The metric used to measure decode speed is tokens per second (TPS). At least 50 TPS per user is needed for real-time output. Any lower and the user will experience noticeable gaps or lags in output. Inter-token latency (ITL), the gap between consecutive tokens, typically stays at 11–21ms across production workloads, but P99 tail latency is what breaks agentic chains at scale.
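To get a feel for why reading the KV cache dominates decode, here is a back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the 2 TB/s memory bandwidth are hypothetical values chosen for illustration, not figures from any specific model or GPU. The estimate also ignores the model weights, which must be read on every step as well.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size for one sequence.

    The leading 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


def min_step_seconds(bytes_read: int, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step if the only cost were reading bytes_read."""
    return bytes_read / bandwidth_bytes_per_s


# Hypothetical model shape and context length (assumptions, see lead-in).
cache = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096)
print(f"KV cache: {cache / 2**20:.0f} MiB")  # KV cache: 512 MiB

# Hypothetical 2 TB/s of memory bandwidth.
floor = min_step_seconds(cache, 2e12)
print(f"Per-token floor from KV reads alone: {floor * 1e3:.3f} ms")
```

The cache grows linearly with context length and with the number of concurrent sequences, so the per-token floor rises with both, no matter how much compute is available.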
Where is compute in all this? Since decode depends so much on GPU memory (such as HBM and DRAM), the cores that power compute largely sit idle. Between prefill and decode, arithmetic intensity drops from 200–400 operations per byte to 60–80, while GPU utilization falls to 20–40%.
In contrast, memory can fill up quickly during decode, forcing the system to rely on lower-tier NVMe or network storage and turning memory into a significant bottleneck for inference. This is a key reason why optimizing for high memory bandwidth is crucial.
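One way to see how arithmetic intensity translates into idle cores is a simple roofline estimate: achievable throughput is capped either by peak compute or by memory bandwidth multiplied by operations per byte, whichever is lower. The sketch below uses intensities from the ranges quoted above (300 for prefill, 70 for decode) and a hypothetical accelerator with 400 TFLOP/s of peak compute and 2 TB/s of memory bandwidth; those hardware numbers are assumptions for illustration only.

```python
def attainable_tflops(intensity_ops_per_byte: float,
                      peak_tflops: float,
                      bandwidth_tb_per_s: float) -> float:
    """Roofline estimate: capped by compute or by memory traffic."""
    memory_ceiling = intensity_ops_per_byte * bandwidth_tb_per_s
    return min(peak_tflops, memory_ceiling)


PEAK_TFLOPS = 400.0    # hypothetical accelerator peak compute
BANDWIDTH_TB_S = 2.0   # hypothetical memory bandwidth

for phase, intensity in (("prefill", 300), ("decode", 70)):
    achieved = attainable_tflops(intensity, PEAK_TFLOPS, BANDWIDTH_TB_S)
    bound = "compute" if achieved == PEAK_TFLOPS else "memory"
    print(f"{phase}: {achieved:.0f} TFLOP/s "
          f"({achieved / PEAK_TFLOPS:.0%} of peak, {bound}-bound)")
```

Under these assumptions prefill can saturate the compute units, while decode tops out at roughly a third of peak because memory cannot feed the cores any faster.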
Why This Matters for Your Inference Stack
In one sense, satisfying the demands of prefill and decode may feel impossible. Prefill wants as much compute power as possible. Decode wants extremely low memory latency. However, knowing how these two phases function will inform the way you architect your infrastructure.
For instance, traditional approaches to inference architecture, such as Network Attached Storage (NAS), make no distinction between prefill and decode, running both through the same storage infrastructure. And with only one request coming in at a time, this approach can work just fine. However, the reality is that concurrent users are often making many simultaneous requests, which can put different resources (compute, memory bandwidth, I/O queue) in competition. Due to the different needs of prefill and decode, this means that delays stack up as requests contend for the same resources.
Instead, infrastructure designed to support AI inference must be able to handle the competing demands of prefill and decode together. This is where parallel file systems (PFS) can help. Rather than routing every user request through a single I/O queue, PFS decouples and distributes data and metadata operations (such as file lookups) across multiple nodes. This ensures that, for example, a latency-sensitive decode request doesn’t get stuck behind a bandwidth-hungry prefill request – or vice versa.
Another way to think of this is like a train in which all the carriages are like the user requests within an AI system. With NAS, the entire train only has one door for the passengers to enter or exit, quickly creating bottlenecks when too many try to go through. But with PFS, there are doors on every carriage, making it easy for passengers to come and go as needed, and ensuring inference flows smoothly.
Beyond Parallel File Systems: Disaggregated Serving and Chunked Prefill
A parallel file system isn’t the only strategy available when architecting inference systems. Disaggregation and chunked prefill offer even further opportunities for optimization.
Disaggregated Serving
This technique – also called disaggregated inference – directly addresses the competing demands of prefill and decode. It applies the same idea behind PFS, keeping different kinds of work from queuing behind one another, to the GPUs themselves: the compute-heavy tasks of prefill are separated entirely from the memory-bound work of decode.
Disaggregation works by first placing prefill and decode on independent GPU pools. Each of these pools is then optimized for the specific workload that it serves: the prefill pools can be tuned to deliver low TTFT, while the decode pools can be designed for efficient TPS. As a result, both pools can serve requests and scale independently as needed without interfering with each other’s workload, delivering improved concurrency and throughput.
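The handoff between the two pools can be sketched as a toy simulation. Everything here (the `Request`, `PrefillPool`, and `DecodePool` names and their methods) is hypothetical and exists only to show the control flow; it is not how any production framework is implemented. In a real system, the KV cache built by the prefill pool has to be physically transferred to the decode pool over an interconnect.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    prompt_tokens: int
    max_new_tokens: int
    kv_ready: bool = False
    generated: int = 0


class PrefillPool:
    """Stand-in for GPUs tuned for low TTFT."""

    def run(self, req: Request) -> Request:
        req.kv_ready = True  # stands in for building the KV cache
        return req


class DecodePool:
    """Stand-in for GPUs tuned for steady token generation."""

    def __init__(self) -> None:
        self.active: deque[Request] = deque()

    def admit(self, req: Request) -> None:
        if not req.kv_ready:
            raise ValueError("decode needs a KV cache from prefill first")
        self.active.append(req)

    def step(self) -> list[str]:
        """One decode iteration: each active request emits one token."""
        finished = []
        for _ in range(len(self.active)):
            req = self.active.popleft()
            req.generated += 1
            if req.generated >= req.max_new_tokens:
                finished.append(req.request_id)
            else:
                self.active.append(req)
        return finished


prefill, decode = PrefillPool(), DecodePool()
for req in (Request("a", 1_000, 2), Request("b", 4_000, 3)):
    decode.admit(prefill.run(req))

print(decode.step())  # []
print(decode.step())  # ['a']
print(decode.step())  # ['b']
```

Because the pools share nothing but the handoff, either one can be scaled up or retuned without touching the other.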
Due to its effectiveness, disaggregated serving has become a widely used technique for delivering real-time, scalable inference. Perplexity, Meta, LinkedIn, and Mistral all run it in production, while NVIDIA even built its Dynamo framework around it. The approach was formalized by the DistServe paper (OSDI 2024), which demonstrated 7.4x more requests or 12.6x tighter SLO adherence compared to colocated serving.
Chunked Prefill
Instead of processing the entire input at once during the prefill phase, this technique breaks the input into smaller, less compute-intensive chunks. This limits how long a single long prompt can monopolize the GPU while its attention scores are calculated, which improves inference for everyone sharing that GPU.
Each chunk is processed in turn and its KV pairs are cached. Alongside each chunk, the system can run decode steps for requests that are already generating output. This pattern continues sequentially until the entire input has been processed. Although this can slightly increase TTFT for the long prompt itself, requests that are already decoding no longer stall behind it, so tokens are delivered more steadily overall.
By breaking prefill into smaller chunks and processing them alongside decode phases in steps, this technique makes it possible to maximize resource efficiency.
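A minimal sketch of the scheduling idea, assuming a fixed chunk size and a fixed number of requests that are already decoding. The function names and the 4,096-token chunk size are illustrative choices, not defaults from any particular serving engine.

```python
def chunk_prompt(prompt_tokens: int, chunk_size: int) -> list[int]:
    """Split a prompt into chunk lengths of at most chunk_size tokens."""
    full, rest = divmod(prompt_tokens, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


def schedule(prompt_tokens: int, chunk_size: int,
             active_decodes: int) -> list[dict[str, int]]:
    """Each step batches one prefill chunk with one decode token
    for every request that is already generating output."""
    return [
        {"prefill_tokens": chunk, "decode_tokens": active_decodes}
        for chunk in chunk_prompt(prompt_tokens, chunk_size)
    ]


print(chunk_prompt(10_000, 4_096))  # [4096, 4096, 1808]

for step in schedule(10_000, 4_096, active_decodes=8):
    print(step)
```

Without chunking, the eight in-flight requests in this example would wait for the full 10,000-token prefill to finish. With chunking, they each receive a token at every step.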
Measuring What Matters
Deploying inference isn’t the finish line. Models evolve, scale, and change — and your infrastructure needs to keep up. This means continuously measuring the performance of your AI model and the infrastructure it depends on. But how should you do that?
TTFT (Time to First Token)
This metric tells you how long it takes for the model to digest the input, run its calculations, and generate its first token. It’s the gold standard for measuring the efficiency of the prefill phase.
The lower the TTFT, the better the user experience. In general, a good TTFT will be 200ms or lower. However, this metric will be affected by multiple factors, such as prompt length, model size, concurrency, and GPU utilization. All of these should be considered when trying to optimize for this metric.
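Measuring TTFT comes down to timing the gap between sending a request and receiving the first streamed token. The sketch below uses a hypothetical `fake_token_stream` generator as a stand-in for a streaming model client; swap in whatever iterator your own client returns.

```python
import time
from typing import Iterable, Iterator


def fake_token_stream(prefill_s: float, itl_s: float,
                      n_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a streaming client: waits prefill_s
    before the first token, then itl_s between later tokens."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i > 0:
            time.sleep(itl_s)
        yield f"tok{i}"


def measure_ttft(stream: Iterable[str]) -> float:
    """Seconds from starting to consume the stream until the first token."""
    start = time.perf_counter()
    for _ in stream:
        return time.perf_counter() - start
    raise ValueError("stream produced no tokens")


ttft = measure_ttft(fake_token_stream(prefill_s=0.05, itl_s=0.01, n_tokens=5))
print(f"TTFT: {ttft * 1000:.0f} ms")
```

Measured from the client side like this, TTFT includes network and queueing time as well as prefill, which is what the user actually experiences.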
TPS (Tokens Per Second)
This tells you how quickly a model generates output tokens. Because this is what users directly experience, this metric is the preferred method of measuring the decode phase.
A high TPS translates into an experience that feels like real-time typing (or often even faster). About 50 TPS per user is the level at which output feels fluid. A low TPS (anything below 20) feels slow and unnatural. TPS can be measured either per user or for the system as a whole – and the gap between them is a useful indicator of decode performance under concurrency.
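The difference between per-user and system-wide TPS is easy to see with a small calculation. The session data below is made up for illustration, and the sketch assumes all sessions ran concurrently over the same time window.

```python
def per_user_tps(sessions: list[tuple[int, float]]) -> list[float]:
    """TPS for each session, given (output_tokens, decode_seconds) pairs."""
    return [tokens / seconds for tokens, seconds in sessions]


def aggregate_tps(sessions: list[tuple[int, float]]) -> float:
    """System-wide TPS, assuming all sessions overlap in one window."""
    total_tokens = sum(tokens for tokens, _ in sessions)
    window = max(seconds for _, seconds in sessions)
    return total_tokens / window


# Hypothetical concurrent sessions: (output tokens, decode seconds).
sessions = [(500, 10.0), (300, 10.0), (200, 10.0)]

print(per_user_tps(sessions))   # [50.0, 30.0, 20.0]
print(aggregate_tps(sessions))  # 100.0
```

Here the system as a whole looks healthy at 100 TPS, yet only one of the three users is getting the 50 TPS needed for fluid output.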
P99 Tail Latency
This captures near-worst-case decode performance. It tells you that 99 percent of requests experience a latency of X or better, while the remaining one percent are worse. It’s a helpful way to measure outlier performance.
For example, if a model has a P99 latency of 300ms, this means that out of every 100 requests, about 99 will respond within 300ms, while the slowest one will take longer. Although this may sound minor, at scale this can affect thousands of users. Because of this, taking this tail metric into consideration is important.
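A percentile can be computed from recorded latencies in a few lines. This sketch uses the nearest-rank method, which is one of several common percentile definitions, and the sample latencies are invented for illustration.

```python
def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not values or not 1 <= p <= 100:
        raise ValueError("need at least one value and 1 <= p <= 100")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p% of n
    return ordered[rank - 1]


# Hypothetical latencies in ms: 98 fast requests and two slow outliers.
latencies_ms = [20.0] * 98 + [300.0, 900.0]

print(percentile(latencies_ms, 50))  # 20.0
print(percentile(latencies_ms, 99))  # 300.0
print(max(latencies_ms))             # 900.0
```

The median hides both outliers, while P99 surfaces the 300ms request. The single slowest request (900ms) only shows up in the maximum, which is why P99 is best read as a tail metric, not an absolute worst case.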
How to Benchmark: Real Workloads vs. Synthetic
These metrics are only as valuable as the data they rely on. Without a sufficient data pool to draw from, they can actually be misleading. This is why it is important to consider not only which metrics to use, but how they are put into practice.
This typically comes down to a question of using synthetic workloads or real ones. Both have their purposes. Synthetic (or artificial) workloads can be a great way to test specific hardware components or isolate certain algorithms. Because they can be standardized and repeated as needed, they provide an easy way to compare the performance of different equipment, architectures, and settings.
But when it comes to seeing how your infrastructure actually performs in realistic scenarios, there is no comparison to using real-life workloads. However, this comes with a few caveats. In order to get the most accurate results possible, use actual workloads taken from real customers, not just what you perceive to be a realistic workload. Also test with mixed prompt lengths in order to measure a range of different scenarios. Include concurrent users to measure how that changes performance. And do this all under sustained loads of at least 72 hours.
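As a starting point, a replay harness for recorded prompts might look like the sketch below. The `send` callable and `fake_send` stand-in are hypothetical; in practice `send` would call your inference endpoint, and `recorded` would hold real customer prompts of mixed lengths. A production benchmark would also need to run for far longer and preserve the original request timing.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def replay(prompts: list[str], send: Callable[[str], None],
           concurrency: int) -> list[float]:
    """Replay prompts with a fixed number of concurrent workers.

    Returns per-request latency in seconds, in the order of prompts.
    """
    def timed(prompt: str) -> float:
        start = time.perf_counter()
        send(prompt)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, prompts))


def fake_send(prompt: str) -> None:
    """Hypothetical endpoint: latency grows with prompt length."""
    time.sleep(0.001 * len(prompt.split()))


# Stand-ins for recorded prompts of mixed lengths.
recorded = [
    "short prompt",
    "a much longer prompt " * 5,
    "medium length prompt here",
]

for prompt, latency in zip(recorded, replay(recorded, fake_send, concurrency=2)):
    print(f"{len(prompt.split()):>3} words -> {latency * 1000:.1f} ms")
```

Raising `concurrency` while replaying the same trace is a simple way to see how latency changes as more users arrive at once.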
Benchmarking strategically like this is how you know whether your AI infrastructure is actually keeping up — before your users tell you it isn’t.
Frequently Asked Questions
What is the difference between prefill and decode in LLM inference?
Prefill is the compute-bound phase where the model processes the entire input prompt in parallel and generates the KV cache. Decode is the memory-bound phase where the model generates output tokens one at a time, reading from the KV cache at each step. Prefill demands massive parallel throughput; decode demands ultra-low-latency memory reads. Optimizing for one at the expense of the other is the most common mistake in inference system design.
Why is decoding memory bound?
During decode, the model must read the entire KV cache from GPU memory (HBM) for every single token it generates. Data transfer speed — not compute power — determines latency. Arithmetic intensity drops to 60–80 operations per byte, and GPU utilization falls to 20–40%. This is why memory bandwidth is the critical bottleneck in the decode phase.
What is disaggregated serving (DistServe)?
Disaggregated serving separates prefill and decode onto independent GPU pools, each optimized for its specific workload. The DistServe paper (OSDI 2024) demonstrated this approach can serve 7.4x more requests or achieve 12.6x tighter SLO adherence. It’s now in production at Perplexity, Meta, LinkedIn, and Mistral, and NVIDIA built its Dynamo framework around the concept.
What is chunked prefill, and how does it improve inference?
Chunked prefill (introduced by the Sarathi paper) breaks large input prompts into smaller segments instead of processing them all at once. Each chunk is processed and its KV pairs cached, allowing decode steps to interleave with prefill. This can deliver up to 6.9x throughput improvement and prevents long prompts from starving decode operations of GPU resources.
What is a good time-to-first-token (TTFT)?
For interactive applications, TTFT should be under 500ms. For real-time experiences (chat, voice, agentic workflows), aim for under 200ms. Groq achieves sub-200ms TTFT, and infrastructure optimizations like persistent KV cache reuse can reduce TTFT dramatically — WEKA’s Augmented Memory Grid™ has demonstrated up to 41x improvement.
How many tokens per second (TPS) does inference need to feel real-time?
A minimum of 50 TPS per user is the threshold for fluid, interactive inference. Below 20 TPS, users perceive noticeable gaps and lag. Measure both per-user and aggregate system TPS — the gap between them reveals decode performance under concurrency.
Why can’t NAS handle inference workloads at scale?
NAS routes all I/O through a single metadata path. When hundreds of concurrent inference requests compete for the same queue — some bandwidth-hungry (prefill), some latency-sensitive (decode) — delays stack up. NAS metadata lookups add 1–10ms per operation. Parallel file systems distribute both data and metadata across multiple nodes, eliminating this contention.
What is inter-token latency (ITL) and why does it matter?
ITL measures the time between consecutive output tokens during decode. Typical production ITL is 11–21ms, but P99 tail latency matters more. If your median is fast but your P99 is 10x worse, agentic chains and multi-step workflows will break at scale.