SGLang Diffusion: Why Serving Diffusion Models Is Nothing Like LLM Serving

If you’ve been running LLM inference at scale, you probably think you know what model serving looks like. A request comes in, the model chews through prefill, then spits out tokens one by one until it hits a stop condition. KV cache grows linearly, you batch requests together, maybe do some clever scheduling. That’s the game.

Then you try to serve a diffusion model and realize the rules are completely different.

SGLang — the inference engine that’s become the de facto standard for LLM serving — recently shipped SGLang Diffusion, extending its reach into image and video generation. And the architectural decisions they made tell you a lot about why these two workloads are fundamentally different beasts.

The LLM Serving Mental Model

Let’s start with what we know. LLM inference is autoregressive. The model generates one token, feeds it back in, generates the next. It’s inherently sequential at the token level. The entire optimization game revolves around this constraint:

  • Prefill processes the input prompt in parallel — that’s the easy part.
  • Decode generates tokens one at a time, and each step depends on the previous one.
  • KV cache stores the attention state so you don’t recompute everything for each new token.
  • Batching lets you pack multiple requests together because they all go through the same decode step, just with different KV caches.

The scheduler’s job is straightforward in principle: figure out which requests can be batched together, manage memory for their KV caches, and keep the GPU fed. Tools like continuous batching, PagedAttention, and RadixAttention (SGLang’s own innovation) all exist to squeeze more efficiency out of this sequential generation loop.

The compute pattern is predictable. Each decode step is roughly the same cost. Memory grows linearly with sequence length. You can reason about throughput and latency in relatively straightforward terms.

Diffusion Models Flip the Script

Now forget all of that.

A diffusion model doesn’t generate content sequentially. It starts with pure noise and iteratively refines it across multiple denoising steps until you get a coherent image or video. Every step processes the entire output at once. There’s no concept of “generating the next pixel” the way there’s “generating the next token”.
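Contrast that with a toy denoising loop. The "denoiser" here is a stub (a real DiT step runs a full transformer over the latent); what matters is the structure: a fixed number of passes, each touching the entire latent, with nothing to stream out until the end.

```python
# Toy denoising loop (illustrative; real DiT steps run a full transformer).
# The whole latent is updated at every step -- no "next pixel" ordering.
def toy_denoiser(latent, step, num_steps):
    # Stand-in for the model's noise prediction: shrink every value toward 0.
    return [x * 0.5 for x in latent]

def diffuse(noise, num_steps=4):
    latent = list(noise)
    for step in range(num_steps):          # fixed number of heavy passes
        latent = toy_denoiser(latent, step, num_steps)
    return latent                          # full result, delivered all at once
```

Note the inversion: the LLM loop has a variable number of cheap steps with a data-dependent stop; this loop has a fixed number of expensive steps known before the request even starts.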

This changes everything about how you serve these models.

The compute pattern is completely different. Instead of a long sequence of lightweight decode steps, you have a fixed number of heavy denoising passes. Each pass runs the full transformer (or DiT — Diffusion Transformer) over the entire latent representation. For a video model like Wan 2.2, that latent could represent hundreds of frames. The compute per request is massive and mostly front-loaded.

There’s no KV cache equivalent. In LLM serving, the KV cache is the thing you’re constantly managing — allocating, evicting, sharing across requests with common prefixes. Diffusion models don’t have this concept. The state between denoising steps is the noisy latent itself, and there’s no incremental caching to exploit in the same way. (Though Cache-DiT, which SGLang integrates, does introduce a clever form of caching by skipping redundant transformer block computations across denoising steps.)
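The Cache-DiT idea can be sketched roughly as follows. This is a hedged illustration of the general technique (reuse a block's cached output when its input barely changed between steps), not Cache-DiT's actual API; the class name, change metric, and threshold are all assumptions.

```python
# Hedged sketch of the Cache-DiT idea: if a block's input barely changed
# between denoising steps, reuse the cached output instead of recomputing.
# The name, change metric, and threshold here are illustrative assumptions.
class CachedBlock:
    def __init__(self, fn, threshold=0.05):
        self.fn = fn
        self.threshold = threshold
        self.last_in = None
        self.last_out = None

    def __call__(self, x):
        if self.last_in is not None:
            # Cheap change metric between this step's input and the last one.
            delta = max(abs(a - b) for a, b in zip(x, self.last_in))
            if delta < self.threshold:
                return self.last_out      # skip the expensive computation
        self.last_in = list(x)
        self.last_out = self.fn(x)
        return self.last_out
```

This is caching across *time steps* of one request rather than across *requests with shared prefixes*, which is why it looks nothing like RadixAttention.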

Batching dynamics are different. With LLMs, requests naturally batch because each decode step is small and uniform. With diffusion models, a single request can saturate multiple GPUs. You’re not trying to pack hundreds of requests into a batch — you’re trying to parallelize a single request across devices using techniques like Unified Sequence Parallelism (USP), CFG parallelism, and tensor parallelism.

The pipeline is multi-stage. An LLM is basically one model doing one thing (next token prediction, with some prompt processing upfront). A diffusion pipeline has distinct stages: text encoding, denoising loop, VAE decoding. Each stage has different compute and memory characteristics. The text encoder is a one-shot forward pass. The denoising loop is iterative and dominates latency. The VAE decoder is another one-shot pass but can be memory-hungry, especially for video.

How SGLang Diffusion Handles This

SGLang’s approach is interesting because they didn’t just bolt a diffusion pipeline onto their LLM engine. They introduced a new abstraction — ComposedPipelineBase — that orchestrates a series of modular PipelineStages. Each stage encapsulates a specific function: DenoisingStage for the iterative denoising loop, DecodingStage for VAE decoding, and so on.

But here’s what’s smart: they reuse the battle-tested SGLang scheduler and the sgl-kernel optimized kernels. The scheduling infrastructure that manages request queuing, memory allocation, and device utilization still applies — it just orchestrates a different kind of computation graph.
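The stage abstraction can be sketched like this. The stage names (DenoisingStage, DecodingStage) come from SGLang Diffusion, but the interfaces below are simplified assumptions for illustration, not the real code, which layers in scheduling, batching, and device management.

```python
# Illustrative sketch of the stage abstraction described above. Stage names
# match SGLang Diffusion's, but these interfaces are simplified assumptions.
class PipelineStage:
    def forward(self, batch):
        raise NotImplementedError

class TextEncodingStage(PipelineStage):
    def forward(self, batch):
        batch["embeds"] = f"embeds({batch['prompt']})"     # one-shot encode
        return batch

class DenoisingStage(PipelineStage):
    def forward(self, batch):
        batch["latent"] = f"denoised({batch['embeds']})"   # iterative loop lives here
        return batch

class DecodingStage(PipelineStage):
    def forward(self, batch):
        batch["image"] = f"vae({batch['latent']})"         # one-shot VAE decode
        return batch

class ComposedPipeline:
    """Runs modular stages in order, as ComposedPipelineBase orchestrates."""
    def __init__(self, stages):
        self.stages = stages

    def __call__(self, batch):
        for stage in self.stages:
            batch = stage.forward(batch)
        return batch
```

The payoff of composing stages rather than hard-coding one pipeline: each stage can get its own parallelism strategy and memory policy, and new model families slot in by swapping stages.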

The parallelism strategy is also tailored. Instead of the prefill-decode parallelism you see in LLM serving (like disaggregated prefill/decode), SGLang Diffusion uses:

  • USP (Unified Sequence Parallelism) combining Ulysses-SP and Ring-Attention for the core DiT transformer blocks
  • CFG parallelism to handle classifier-free guidance — where you need to run both conditional and unconditional forward passes
  • Tensor parallelism for other model components
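To see why CFG parallelism exists, look at what classifier-free guidance computes each step: two independent forward passes combined by a guidance weight. The forward pass below is a stub; the combination formula is the standard CFG form.

```python
# Classifier-free guidance needs two forward passes per denoising step: one
# conditioned on the prompt, one unconditional. CFG parallelism runs them on
# different devices; here we just show the math that combines them.
def toy_forward(latent, cond):
    # Stand-in for the DiT forward pass; a real model predicts noise here.
    bias = 1.0 if cond else 0.0
    return [x + bias for x in latent]

def cfg_step(latent, guidance_scale=7.5):
    # These two calls are independent -> they can run in parallel on two GPUs.
    cond_out = toy_forward(latent, cond=True)
    uncond_out = toy_forward(latent, cond=False)
    return [u + guidance_scale * (c - u) for c, u in zip(cond_out, uncond_out)]
```

Because the two passes share no state until the final combine, splitting them across devices is almost embarrassingly parallel — a free 2x on the dominant cost of each step.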

And they get real results from this: 1.2x to 5.9x speedups across different workloads on H100 and H200 GPUs.

The Memory Story Is Completely Different Too

With LLMs, your memory bottleneck is the KV cache. A long context window means a huge KV cache, and you’re constantly playing Tetris with GPU memory to fit as many concurrent requests as possible.

With diffusion models, especially video models, the bottleneck is the latent representation and the model weights themselves. A single video generation can require tens of gigabytes. SGLang Diffusion addresses this with layer-wise offloading (--dit-layerwise-offload), which can reduce peak VRAM usage by up to 30GB while actually improving performance by up to 58%. The idea is that you don’t need all transformer layers resident in GPU memory simultaneously — you can stage them in and out as the denoising loop progresses through the network.
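A minimal sketch of the staging idea, with the GPU-resident set modeled as a plain list. Real implementations overlap the host-to-device copy with compute on a separate CUDA stream, which is how offloading can *improve* throughput rather than just trade it for memory; none of that overlap is modeled here.

```python
# Hedged sketch of layer-wise offloading: only a small window of transformer
# layers is "resident" at a time; the rest stay in host memory. The eviction
# policy and window size here are illustrative, not SGLang's implementation.
def run_with_offload(layers, x, resident_window=2):
    resident = []
    peak = 0
    for i, layer in enumerate(layers):
        if len(resident) == resident_window:
            resident.pop(0)                # evict the oldest layer from GPU
        resident.append(i)                 # stage layer i onto the GPU
        peak = max(peak, len(resident))
        x = layer(x)                       # compute with the resident layer
    return x, peak
```

Peak memory is bounded by the window size instead of the total layer count — which is the whole point when a video DiT's weights plus latents won't fit at once.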

This is a completely foreign concept in LLM serving, where the model weights are typically fully resident and it’s the dynamic state (KV cache) that you’re managing.

The API Surface Looks Similar, But Don’t Be Fooled

SGLang Diffusion exposes an OpenAI-compatible API, which is nice for integration. You hit /v1/images/generations with a prompt and get back an image. Looks just like calling an LLM endpoint, right?

Under the hood, the two request lifecycles are nothing alike. An LLM request starts streaming tokens back quickly (low TTFT) and then continues for a while (high total latency for long outputs). A diffusion request has high latency with no intermediate output — you wait for all denoising steps to complete, then get the full result at once.

This has implications for load balancing, timeout configuration, health checks, and basically every SRE concern you might have. You can’t use TTFT as a health signal. Your load balancer needs to understand that a “slow” response is normal, not a sign of a stuck request. And your capacity planning needs to account for the fact that a single video generation request might tie up multiple GPUs for minutes, not milliseconds.
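As one concrete (and entirely hypothetical) example of what this means in practice, a timeout policy might budget diffusion requests by step count and output size rather than by a flat per-request limit. Every number and the cost model below are invented for illustration.

```python
# Hypothetical timeout policy illustrating the operational point above.
# All constants and the cost model are invented assumptions, not SGLang's.
def request_timeout_seconds(kind, num_frames=1, steps=50):
    if kind == "llm":
        return 60                               # streaming; TTFT is the health signal
    # Diffusion: no intermediate output, so the budget must scale with the
    # number of denoising steps and the size of the output (e.g. video frames).
    per_step = 0.5 * max(1, num_frames // 16)   # assumed per-step cost model
    return max(120, int(steps * per_step * 4))  # generous headroom, no TTFT to lean on
```

A flat 60-second timeout that works fine for chat would kill every video generation request; the policy has to know what kind of workload it is timing.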

Where It Gets Really Interesting: Convergence

The most forward-looking aspect of SGLang’s approach is that they’re positioning for the convergence of autoregressive and diffusion paradigms. Models like ByteDance’s Bagel and Meta’s Transfusion use a single transformer for both text and image generation. NVIDIA’s Fast-dLLM adapts autoregressive models for parallel generation using diffusion-style denoising.

And then there are Diffusion LLMs (dLLMs) like LLaDA 2.0, which SGLang also supports. These models generate text through denoising rather than autoregressive token prediction. They start with masked tokens and iteratively refine them — all tokens in parallel. SGLang handles these by cleverly reusing their existing Chunked-Prefill pipeline, since the block diffusion computation pattern looks surprisingly similar to prefill.
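The dLLM generation pattern can be sketched with a toy: start fully masked, propose all positions in parallel, commit a few per round. The "model" is a stub that always proposes the right token; real dLLMs commit positions by model confidence, and the leftmost-first commit rule here is an invented simplification.

```python
# Toy sketch of diffusion-style text generation: start fully masked and
# refine all positions in parallel, committing some tokens each round.
# Real dLLMs pick positions by model confidence; the "model" here is a stub.
MASK = None

def toy_predict(tokens, target):
    # Stand-in for the model: propose the target token at every masked slot.
    return [target[i] if t is MASK else t for i, t in enumerate(tokens)]

def dllm_generate(target, tokens_per_round=2):
    tokens = [MASK] * len(target)
    rounds = 0
    while MASK in tokens:
        proposal = toy_predict(tokens, target)   # one parallel pass over ALL slots
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        for i in masked[:tokens_per_round]:      # commit a few positions per round
            tokens[i] = proposal[i]
        rounds += 1
    return tokens, rounds
```

Each round is a full parallel pass over the whole sequence — which is exactly why the computation pattern resembles chunked prefill rather than decode.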

The fact that SGLang supports both traditional LLM serving, diffusion model serving for images/video, AND diffusion-based language models under one framework is a big deal. It means you don’t need separate inference stacks for different model types. One engine, one operational playbook (mostly), one set of monitoring and scaling patterns.

So What?

If you’re running inference infrastructure, the takeaway is this: diffusion model serving is not just “LLM serving with bigger models”. The compute patterns, memory management, parallelism strategies, and operational characteristics are fundamentally different.

SGLang’s approach of building specialized diffusion support within a unified framework — reusing what makes sense (scheduler, kernels, API layer) while introducing new abstractions where needed (pipeline stages, diffusion-specific parallelism) — feels like the right architecture for where the industry is heading.

The models are converging; the serving infrastructure should too.
