Every autoregressive LLM — GPT, Llama, DeepSeek, you name it — processes requests in two fundamentally different phases. Understanding these phases, and why they should run on separate hardware, is the key insight behind prefill-decode (PD) disaggregation in SGLang.
The Two Phases of LLM Inference
Prefill (Prompt Processing)
When a request arrives, the model takes the entire input prompt and runs one large parallel forward pass across all tokens. This produces the initial KV cache for the full context.
Prefill is compute-hungry — dominated by matrix multiplications, high FLOPS, a short but intense burst. The goal is fast Time-to-First-Token (TTFT) so the user isn’t staring at a blank screen. It benefits from big batches and high tensor parallelism (TP) to saturate the GPU’s compute units.
Decode (Token-by-Token Generation)
After prefill completes, the model generates one new token at a time, conditioning on everything that came before. Each step appends to the KV cache and produces the next token.
Decode is memory-bandwidth-bound — each step is mostly fetching the growing KV cache from HBM for attention, not performing heavy computation. But generation can run for hundreds or thousands of steps. The goal is high tokens-per-second (TPS) and consistent Time-Per-Output-Token (TPOT). It thrives on concurrency, continuous batching, paged attention, and prefix caching.
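The shape of the two phases can be sketched in a few lines of Python. This is a toy model — no real attention or sampling, just placeholder strings — but it shows the structural contrast: one batched pass over the whole prompt, then a loop that reads and extends the KV cache one entry per step.

```python
# Toy sketch of the two inference phases: no real model, just the control flow.

def prefill(prompt_tokens):
    # One parallel pass over the entire prompt builds the initial KV cache.
    return [f"kv({tok})" for tok in prompt_tokens]

def decode(kv_cache, max_new_tokens):
    # One token per step; each step reads the (growing) cache and appends to it.
    output = []
    for step in range(max_new_tokens):
        context_len = len(kv_cache)        # attention reads all of this each step
        new_tok = f"t{step}"               # placeholder for real sampling
        kv_cache.append(f"kv({new_tok})")  # cache grows by exactly one entry
        output.append(new_tok)
    return output

kv = prefill(["Write", "a", "haiku"])
print(decode(kv, 3))  # -> ['t0', 't1', 't2']
print(len(kv))        # -> 6: three prompt entries plus three generated
```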
These two phases have completely different computational profiles. Prefill wants raw compute. Decode wants memory bandwidth. Running them on the same hardware means compromising on both.
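A back-of-envelope calculation makes the asymmetry concrete. The figures below (7B parameters, fp16, 2048-token prompt) are illustrative assumptions, and the ~2 FLOPs-per-parameter-per-token rule of thumb ignores attention and KV-cache traffic — which only makes decode look worse:

```python
# Rough arithmetic intensity (FLOPs per byte of weights fetched from HBM)
# for each phase, assuming a 7B-parameter model in fp16. Illustrative only.

PARAMS = 7e9
BYTES_PER_PARAM = 2  # fp16

def forward_flops(tokens):
    # Rule of thumb: ~2 FLOPs per parameter per token in a forward pass.
    return 2 * PARAMS * tokens

# Prefill: a 2048-token prompt reuses each weight 2048 times per fetch.
prefill_intensity = forward_flops(2048) / (PARAMS * BYTES_PER_PARAM)

# Decode: each step handles a single token, so every weight is fetched
# from HBM for just 2 FLOPs (KV-cache reads add further bandwidth cost).
decode_intensity = forward_flops(1) / (PARAMS * BYTES_PER_PARAM)

print(prefill_intensity)  # -> 2048.0 FLOPs/byte: compute-bound
print(decode_intensity)   # -> 1.0 FLOPs/byte: memory-bandwidth-bound
```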
Unified Mode: The Classic Approach
In unified mode (SGLang’s default), a single runtime instance handles both prefill and decode for whatever requests land on it.
This works fine for experiments, small models, or low request rates. But under real production load with mixed traffic, the problems compound:
- Prefill starves decode: A new long-prompt request arrives, prefill takes over the GPUs, and any ongoing decode generations get stalled. Users see jittery output speeds — tokens flowing smoothly, then stuttering whenever a new prefill kicks in.
- Decode blocks prefill: Queued decode work holds GPU resources, preventing fresh requests from starting their prefill phase. TTFT degrades under load.
- Scaling is coarse: Your only knob is replicating the entire runtime, so compute-heavy prefill and memory-heavy decode stay coupled in the same instances. You can't scale the two phases independently based on traffic patterns.
The fundamental issue is interference between two workloads that want different things from the hardware.
PD Disaggregation: Splitting the Phases
PD disaggregation physically separates prefill and decode into dedicated instance pools:
- Prefill engines: Dedicated instances that only process prompts. Optimized for compute throughput — often configured with higher TP to parallelize the heavy matrix multiplications.
- Decode instances: Dedicated instances that only generate tokens. Optimized for memory bandwidth and concurrency — typically lower TP per replica but more replicas for parallel generation.
- Router: A cache-aware load balancer (SGLang’s router) that receives incoming requests, forwards prompts to prefill engines, and hands off to decode instances for generation.
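Concretely, a minimal single-host PD setup might be launched along these lines. The flag names follow SGLang's PD disaggregation documentation at the time of writing, but treat this as a sketch — verify the exact flags against your SGLang and sglang-router versions, and substitute your own model path:

```shell
# Prefill engine (compute-optimized node)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode prefill \
  --port 30000

# Decode instance (bandwidth-optimized node)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode decode \
  --port 30001

# Router (no GPU needed)
python -m sglang_router.launch_router \
  --pd-disaggregation \
  --prefill http://127.0.0.1:30000 \
  --decode http://127.0.0.1:30001
```

In a real deployment each server would run on its own node with a KV transfer backend configured between them.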
The flow looks like this:
- Request arrives at the router
- Router forwards the prompt to a prefill engine
- Prefill engine processes the prompt and produces the KV cache
- KV cache is transferred to a decode instance (via a high-bandwidth transfer backend such as Mooncake or NVSHMEM, typically running over RDMA)
- Decode instance generates tokens autoregressively using the received KV cache
The KV handoff happens only once — decode continues generating without ever touching the prompt again.
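The five steps above can be walked through in a toy, GPU-free simulation. The class and method names here are illustrative stand-ins, not SGLang's actual internals; the point is that the handoff happens exactly once and decode never re-reads the prompt:

```python
# Toy walkthrough of the PD flow; names are illustrative, not SGLang's API.

class PrefillEngine:
    def process(self, prompt):
        # Steps 2-3: one parallel pass over the prompt yields the full KV cache.
        return [f"kv({tok})" for tok in prompt]

class DecodeInstance:
    def __init__(self):
        self.kv_cache = None

    def receive_kv(self, kv_cache):
        # Step 4: the one-time handoff (RDMA/Mooncake in a real deployment).
        self.kv_cache = list(kv_cache)

    def generate(self, n):
        # Step 5: autoregressive generation; the prompt itself is never re-read.
        tokens = []
        for step in range(n):
            tok = f"t{step}"                    # placeholder for sampling
            self.kv_cache.append(f"kv({tok})")  # cache keeps growing locally
            tokens.append(tok)
        return tokens

def route(prompt, n, prefill_pool, decode_pool):
    # Step 1: the router picks one instance from each pool (trivially here).
    kv = prefill_pool[0].process(prompt)
    decoder = decode_pool[0]
    decoder.receive_kv(kv)
    return decoder.generate(n)

print(route(["hi", "there"], 2, [PrefillEngine()], [DecodeInstance()]))
# -> ['t0', 't1']
```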
Why This Matters
No Interference
Prefill no longer blocks decode. Decode no longer starves incoming prompts. Each phase runs on hardware dedicated to its workload profile without contention.
Independent Scaling
You can over-provision prefill capacity for snappy TTFT on prompt-heavy bursts (think: long-context RAG, code completion with large contexts). Independently, you can scale decode massively — often many single-GPU or low-TP instances — for high TPS on long generations (think: chat, agents, streaming responses).
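Independent scaling reduces to simple capacity arithmetic per pool. Every throughput figure below is a made-up placeholder — in practice you would substitute measured numbers from your own load tests:

```python
import math

# Traffic demand (invented placeholder values).
prompt_tokens_per_s = 400_000  # prefill-side demand
gen_tokens_per_s = 50_000      # decode-side demand

# Per-replica capacity (invented placeholder values).
prefill_capacity = 100_000     # prompt tokens/s one prefill engine sustains
decode_capacity = 5_000        # generated tokens/s one decode replica sustains

# Each pool is sized on its own; resizing one never touches the other.
n_prefill = math.ceil(prompt_tokens_per_s / prefill_capacity)
n_decode = math.ceil(gen_tokens_per_s / decode_capacity)
print(n_prefill, n_decode)  # -> 4 10
```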
Resource Efficiency
The router typically needs zero GPUs. Prefill engines can run on compute-dense hardware with high TP. Decode instances can lean on bandwidth-optimized setups. Each component uses exactly the type of resource it needs.
Measured Gains
Real-world deployments report 2-3x throughput gains on mixed workloads, with dramatically tighter P99 TTFT and TPOT. Behavior under load becomes much more stable and predictable — no more jitter from phase interference.
When to Use PD Disaggregation
PD disaggregation pays off when you’re serving anything beyond toy traffic:
- Chat applications: Mixed long prompts and streaming decode, where TTFT and TPOT both matter
- Agent workflows: Rapid prompt-response cycles where prefill latency directly impacts agent loop speed
- RAG pipelines: Long retrieved contexts that make prefill expensive, followed by relatively short generations
- Code completion: Large context windows with fast response requirements
- Long-context workloads: Documents, summaries, and analysis where prompt processing dominates
For experimentation, small models, or low QPS, unified mode is perfectly fine. But once traffic gets real, unified mode starts leaving performance on the table.