Mini-SGLang Overview
What is this project?
Mini-SGLang is a compact (~5,000 lines of Python), high-performance LLM inference framework that implements an OpenAI-compatible API server, an interactive shell, and batch inference. It supports models like Qwen and Llama via Hugging Face, deploys with a single command (e.g., python -m minisgl --model "Qwen/Qwen3-0.6B"), and integrates optimized kernels such as FlashAttention and FlashInfer so that throughput and latency match state-of-the-art systems.
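Because the server is OpenAI-compatible, any stock OpenAI client can talk to it. A minimal sketch, assuming the command above is already serving on localhost:8000 under the usual /v1 routes (the dummy api_key value and the exact route prefix are assumptions, not confirmed defaults):

```python
# Query a locally running Mini-SGLang server through the OpenAI client.
# Assumes `python -m minisgl --model "Qwen/Qwen3-0.6B"` is serving on
# localhost:8000; the /v1 base path and dummy API key are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Explain radix attention in one sentence."}],
    stream=True,  # tokens arrive as they are decoded
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```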
Why does it exist?
Full-featured LLM servers like SGLang are bloated and opaque, obscuring key optimizations behind massive codebases. Mini-SGLang fills this gap as a transparent, modular reference implementation—demystifying radix attention, chunked prefill, and overlap scheduling—while delivering comparable benchmarks (e.g., higher throughput than SGLang v0.4 on Qwen3-32B traces).
Who uses it?
- Researchers: Dissecting serving optimizations via readable code and ablations (e.g., MINISGL_DISABLE_OVERLAP_SCHEDULING=1; see the sketch after this list).
- Developers: Custom inference engines, prototyping features, or deploying lightweight servers (single- or multi-GPU).
- Teams: Production serving for long-context LLMs, benchmarking, or education—e.g., the API at localhost:8000 or terminal chat with /reset.
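The ablation switches are plain environment variables, so a comparison run needs nothing beyond setting one before the engine starts. A minimal sketch, assuming the offline benchmark at benchmark/offline/bench.py accepts a --model flag (that flag and any other CLI details are assumptions; check the script's --help):

```python
# Run the offline benchmark twice, once with overlap scheduling enabled and
# once with it disabled, to isolate the contribution of that optimization.
# The --model flag on bench.py is an assumption; check its --help first.
import os
import subprocess
import sys

BASE_CMD = [sys.executable, "benchmark/offline/bench.py", "--model", "Qwen/Qwen3-0.6B"]

for label, extra_env in [
    ("overlap scheduling ON", {}),
    ("overlap scheduling OFF", {"MINISGL_DISABLE_OVERLAP_SCHEDULING": "1"}),
]:
    print(f"=== {label} ===")
    subprocess.run(BASE_CMD, env={**os.environ, **extra_env}, check=True)
```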
Key Concepts
Before diving into code, grasp these 5 pillars (linked to core files for internals):
- Radix Cache: Trie-based KV cache reuse for prefix-sharing requests, slashing memory and recompute. Trade-off: O(log N) lookup vs. massive savings on shared prefixes. See runtime/radix_attention.py, and the toy sketch after this list.
- Chunked Prefill: Splits long prompts into chunks to bound peak HBM usage during attention. Clever: pipelines prefill and decode to hide latency. See the scheduler in runtime/scheduler.py.
- Overlap Scheduling: GPU compute overlaps CPU scheduling via async kernel launches, hiding ~50% of scheduling overhead. Pattern: an event loop over CUDA streams. Ablate via the env var above.
- Tensor Parallelism (TP): Shards the model across GPUs (e.g., --tp 4), with all-gather for cross-node attention. Trade-off: bandwidth-bound vs. scalable to 70B+ models.
- Optimized Kernels: FlashInfer/FlashAttention for decode; JIT CUDA via sgl-kernel for custom ops. Linux-only due to kernel deps.
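To make the radix-cache idea concrete, here is a toy per-token trie keyed on token IDs that returns the KV blocks of the longest cached prefix. It sketches only the matching logic; the names (ToyRadixCache, kv_block) are invented for illustration, and the real structure in runtime/radix_attention.py is more involved:

```python
# Toy illustration of radix-cache prefix matching: nodes are keyed by token
# IDs, and each node remembers which KV block is cached for that position.
# This is a sketch of the idea, not Mini-SGLang's actual data structure.
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    children: dict[int, "Node"] = field(default_factory=dict)
    kv_block: int | None = None  # where this token's KV entry lives


class ToyRadixCache:
    def __init__(self) -> None:
        self.root = Node()

    def insert(self, tokens: list[int], kv_blocks: list[int]) -> None:
        node = self.root
        for tok, blk in zip(tokens, kv_blocks):
            node = node.children.setdefault(tok, Node())
            node.kv_block = blk

    def match_prefix(self, tokens: list[int]) -> list[int]:
        """Return KV blocks for the longest cached prefix of `tokens`."""
        node, blocks = self.root, []
        for tok in tokens:
            child = node.children.get(tok)
            if child is None or child.kv_block is None:
                break
            blocks.append(child.kv_block)
            node = child
        return blocks


cache = ToyRadixCache()
cache.insert([1, 2, 3, 4], kv_blocks=[10, 11, 12, 13])
print(cache.match_prefix([1, 2, 3, 99]))  # [10, 11, 12]: only the new token needs prefill
```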
High-level data flow:
graph TD
A["Client Request<br/>(prompt, stream)"] --> B["API Server<br/>minisgl/__main__.py"]
B --> C["Request Scheduler<br/>runtime/scheduler.py"]
C --> D{"Radix Cache Hit?"}
D -->|Yes| E["Reuse KV<br/>runtime/radix_attention.py"]
D -->|No| F["Chunked Prefill<br/>→ Decode Loop"]
F --> G["TP All-Gather +<br/>FlashInfer Decode"]
G --> H["Stream Response"]
C -.->|Overlap| I["Async GPU Streams"]
style I fill:#f9f
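On the "No" branch above, prefill runs chunk by chunk so that peak activation memory is bounded by the chunk size rather than by the prompt length. A minimal sketch of that loop, where forward_extend(chunk, start) stands in for a hypothetical model call that appends the chunk's KV entries to the cache (both the name and the CHUNK_SIZE value are illustrative, not Mini-SGLang's API):

```python
# Sketch of chunked prefill: feed the prompt to the model in fixed-size slices
# so peak activation memory scales with CHUNK_SIZE, not with prompt length.
# `forward_extend` is a hypothetical model call, not Mini-SGLang's real API.
from typing import Callable, Sequence

CHUNK_SIZE = 2048  # illustrative; real schedulers tune this against HBM headroom


def chunked_prefill(
    prompt_tokens: Sequence[int],
    forward_extend: Callable[[Sequence[int], int], None],
) -> None:
    for start in range(0, len(prompt_tokens), CHUNK_SIZE):
        chunk = prompt_tokens[start : start + CHUNK_SIZE]
        # KV entries for earlier chunks are already cached, so attention for
        # this chunk covers positions [0, start + len(chunk)).
        forward_extend(chunk, start)


# Toy run: prints the position range each prefill step would cover.
chunked_prefill(
    list(range(5000)),
    lambda chunk, start: print(f"prefill positions {start}..{start + len(chunk) - 1}"),
)
```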
Project Structure
Modular monorepo (~5k LoC, fully typed). Core in runtime/; entrypoints minimal.
mini-sglang/
├── minisgl/ # CLI entry: __main__.py launches server/shell
├── runtime/ # Engine core: scheduler.py, radix_attention.py, engine.py
├── kernels/ # Custom CUDA (sgl-kernel, FlashInfer bindings)
├── benchmark/ # Ablations: offline/bench.py, online/bench_qwen.py
├── docs/ # features.md, structures.md (detailed diagrams)
└── models/ # HF loader integrations
Module dependencies:
graph LR
Main["__main__.py"] --> Server["server.py"]
Server --> Scheduler["scheduler.py"]
Scheduler --> Engine["engine.py"]
Engine --> Radix["radix_attention.py"]
Engine --> TP["tp.py"]
Engine --> Kernels["flashinfer kernels"]
Benchmark --> Engine
Start here: run the server, trace a request via pdb in runtime/scheduler.py (one way is sketched below), and tweak the radix cache for experiments.
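One way to do that tracing without editing the install is to launch the module under pdb, set a breakpoint in the scheduler, and then send a single request. A rough sketch, reusing the CLI flags from the quick-start command; the breakpoint path must be adjusted to your checkout:

```python
# Launch Mini-SGLang under pdb so a breakpoint can be placed in the scheduler
# before any request is processed. At the (Pdb) prompt, set a breakpoint,
# e.g. `b runtime/scheduler.py:1` (adjust path/line to your checkout), then
# `c`, and send a request from another terminal.
import pdb
import runpy
import sys

sys.argv = ["minisgl", "--model", "Qwen/Qwen3-0.6B"]  # same flags as the quick-start command
pdb.run('runpy.run_module("minisgl", run_name="__main__")')
```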