Mini-SGLang

Mini-SGLang Overview

What is this project?

Mini-SGLang is a compact (~5,000 lines of Python), high-performance LLM inference framework that provides an OpenAI-compatible API server, an interactive shell, and batch (offline) inference. It supports Hugging Face models such as Qwen and Llama, deploys with a single command (e.g., python -m minisgl --model "Qwen/Qwen3-0.6B"), and integrates optimized kernels such as FlashAttention and FlashInfer to match the throughput and latency of state-of-the-art serving systems.
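
For a quick smoke test once the server is running, any standard OpenAI client can talk to it. Below is a minimal sketch using the official openai Python package; the port 8000, the /v1 base path, and the absence of an API key are assumptions based on the OpenAI-compatible endpoint described above, not verbatim project configuration.

from openai import OpenAI

# Point the stock OpenAI client at the local Mini-SGLang server.
# Assumes the server listens on localhost:8000 and exposes the standard
# /v1/chat/completions route; adjust base_url/model if your setup differs.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Explain radix attention in one sentence."}],
    stream=True,  # stream tokens as they are generated
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()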

Why does it exist?

Full-featured LLM servers like SGLang are bloated and opaque, burying their key optimizations inside massive codebases. Mini-SGLang addresses this as a transparent, modular reference implementation that demystifies radix attention, chunked prefill, and overlap scheduling while delivering comparable benchmarks (e.g., higher throughput than SGLang v0.4 on Qwen3-32B traces).

Who uses it?

  • Researchers: Dissecting serving optimizations via readable code and ablations (e.g., MINISGL_DISABLE_OVERLAP_SCHEDULING=1).
  • Developers: Custom inference engines, prototyping features, or deploying lightweight servers (single/multi-GPU).
  • Teams: Production serving for long-context LLMs, benchmarking, or education—e.g., API at localhost:8000 or terminal chat with /reset.

Key Concepts

Before diving into the code, grasp these five pillars (with pointers to the core files that implement them):

  1. Radix Cache: Trie-based KV-cache reuse across requests that share a prefix, cutting both memory use and recomputation. Trade-off: a prefix-match lookup whose cost grows with the length of the matched prefix, in exchange for large savings whenever prefixes are shared (first sketch after this list). See runtime/radix_attention.py.
  2. Chunked Prefill: Splits long prompts into fixed-size chunks so that no single prefill can monopolize a scheduling step, bounding peak memory during attention and keeping decode latency steady; prefill chunks and decode steps can share a batch (second sketch after this list). See the scheduler in runtime/scheduler.py.
  3. Overlap Scheduling: CPU scheduling of the next batch overlaps with GPU execution of the current one via asynchronous kernel launches, hiding roughly half of the per-step overhead. Pattern: an event loop that keeps CUDA streams fed ahead of time (third sketch after this list). Ablate with MINISGL_DISABLE_OVERLAP_SCHEDULING=1.
  4. Tensor Parallelism (TP): Shards the model across GPUs (e.g., --tp 4) and combines the per-rank attention shards with collective communication (all-gather/all-reduce). Trade-off: communication is bound by interconnect bandwidth, but serving scales to 70B+ models.
  5. Optimized Kernels: FlashInfer/FlashAttention for decode; JIT CUDA via sgl-kernel for custom ops. Linux-only due to kernel deps.
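
To make the radix-cache idea concrete, here is a self-contained sketch of prefix matching over token IDs. It uses a plain per-token trie; the real structure in runtime/radix_attention.py compresses token runs into edges and also tracks KV-page indices, reference counts, and eviction. All class and method names below are illustrative, not taken from the project.

from __future__ import annotations
from dataclasses import dataclass, field

# Teaching sketch, not the data structure in runtime/radix_attention.py:
# match_prefix() reports how many leading tokens of a new request are already
# cached, i.e. how much KV can be reused instead of recomputed.

@dataclass
class _Node:
    children: dict[int, "_Node"] = field(default_factory=dict)

class PrefixCacheSketch:
    def __init__(self) -> None:
        self.root = _Node()

    def insert(self, tokens: list[int]) -> None:
        """Record the token sequence of a processed request."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, _Node())

    def match_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens are already in the cache."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

cache = PrefixCacheSketch()
cache.insert([101, 7, 7, 42, 9])           # e.g. shared system prompt + first turn
print(cache.match_prefix([101, 7, 7, 5]))  # -> 3 tokens of KV reusable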
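
Chunked prefill reduces to a bounded token budget per step. The sketch below only illustrates the slicing; the 512-token budget and function names are made up, and the real scheduling logic (including mixing decode tokens into the same step) lives in runtime/scheduler.py.

from typing import Iterator, Sequence

def prefill_chunks(prompt: Sequence[int], chunk_size: int = 512) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most chunk_size prompt tokens."""
    for start in range(0, len(prompt), chunk_size):
        yield prompt[start:start + chunk_size]

def run_prefill(prompt: Sequence[int], chunk_size: int = 512) -> None:
    cached = 0  # tokens whose KV entries have already been written
    for chunk in prefill_chunks(prompt, chunk_size):
        # In the engine this is one forward pass over `chunk`, attending to the
        # `cached` tokens already in the KV cache; other requests' decode steps
        # can share the same iteration, so no step exceeds the token budget.
        cached += len(chunk)
        print(f"prefilled {cached}/{len(prompt)} tokens")

run_prefill(list(range(1300)))  # 512 + 512 + 276 tokens over three steps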
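
And the overlap-scheduling pattern in miniature: while the device runs step N, the CPU prepares the batch for step N+1. Sleeps stand in for real work, and a worker thread stands in for the asynchronous CUDA launches the engine actually relies on; MINISGL_DISABLE_OVERLAP_SCHEDULING=1 is the project's switch for serializing these phases again in ablations.

import time
from concurrent.futures import ThreadPoolExecutor

def schedule_next_batch(step: int) -> str:
    time.sleep(0.01)   # CPU work: batching, cache lookups, metadata preparation
    return f"batch-{step}"

def gpu_execute(batch: str) -> None:
    time.sleep(0.03)   # stands in for the model forward pass on the GPU

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(schedule_next_batch, 0)
    for step in range(1, 5):
        batch = pending.result()                          # batch for this step is ready
        pending = pool.submit(schedule_next_batch, step)  # build the next batch on the CPU...
        gpu_execute(batch)                                # ...while this one "runs on the GPU"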

High-level data flow:

graph TD
    A["Client Request<br/>(prompt, stream)"] --> B["API Server<br/>minisgl/__main__.py"]
    B --> C["Request Scheduler<br/>runtime/scheduler.py"]
    C --> D{"Radix Cache Hit?"}
    D -->|Yes| E["Reuse KV<br/>runtime/radix_attention.py"]
    D -->|No| F["Chunked Prefill<br/>→ Decode Loop"]
    F --> G["TP All-Gather +<br/>FlashInfer Decode"]
    G --> H["Stream Response"]
    C -.->|Overlap| I["Async GPU Streams"]
    style I fill:#f9f

Project Structure

Modular monorepo (~5k LoC, fully typed). Core in runtime/; entrypoints minimal.

mini-sglang/
├── minisgl/              # CLI entry: __main__.py launches server/shell
├── runtime/              # Engine core: scheduler.py, radix_attention.py, engine.py
├── kernels/              # Custom CUDA (sgl-kernel, FlashInfer bindings)
├── benchmark/            # Ablations: offline/bench.py, online/bench_qwen.py
├── docs/                 # features.md, structures.md (detailed diagrams)
└── models/               # HF loader integrations

Module dependencies:

graph LR
    Main["__main__.py"] --> Server["server.py"]
    Server --> Scheduler["scheduler.py"]
    Scheduler --> Engine["engine.py"]
    Engine --> Radix["radix_attention.py"]
    Engine --> TP["tp.py"]
    Engine --> Kernels["flashinfer kernels"]
    Benchmark --> Engine

Start here: run the server, trace a request with pdb in runtime/scheduler.py, then tweak the radix cache for experiments.