Mini-SGLang Architecture
Mini-SGLang is a compact, high-performance LLM inference engine (~5,000 lines of Python) that mirrors key innovations from SGLang, such as radix cache, chunked prefill, overlap scheduling, and tensor parallelism. It uses PyTorch for core tensor ops, FlashInfer/FlashAttention for optimized kernels, and custom Python/CUDA components for efficiency. The design prioritizes readability and modularity, with full type annotations, while achieving near-SGLang performance on benchmarks.
High-Level Architecture
The architecture is a monolithic runtime with a clear separation of frontend (API/shell), scheduling, execution, and low-level kernels. Requests enter via HTTP (OpenAI-compatible) or shell, flow through a scheduler that manages batching and caching, and execute on a batched engine leveraging GPU kernels and distributed comms.
```mermaid
graph LR
subgraph Frontend["Frontend"]
Client["Client / OpenAI Clients"] -->|"HTTP POST /v1/chat/completions"| APIServer["FastAPI Server<br/>server/"]
User["Terminal User"] -->|"Interactive Chat"| Shell["REPL Shell<br/>shell.py"]
end
APIServer --> Core["Core Runtime<br/>core.py"]
Shell --> Core
Core --> Scheduler["Scheduler<br/>scheduler/"]
Scheduler -->|"Enqueue / Batch / Radix Lookup"| Queue["Request Queue"]
Scheduler -->|"Overlap Scheduling"| RadixCache["Radix KV Cache<br/>kvcache/ attention/"]
Queue --> Engine["Inference Engine<br/>engine/"]
Engine -->|"Prefill / Decode Batches"| Model["Model Loader / Layers<br/>models/ layers/ llm/"]
Model -->|"Forward Pass"| Kernels["Optimized Kernels<br/>kernel/ FlashInfer / FlashAttn"]
Kernels <--> RadixCache
Kernels <--> TP["Tensor Parallel<br/>distributed/"]
subgraph Backend["GPU Backend"]
TP
Kernels
end
Engine -->|"Logits / Tokens"| Scheduler
Scheduler -->|"Stream Response"| Core
Core --> Frontend
```
Key Interactions:
- Scheduler ↔ Engine: Bidirectional for iterative decoding (prefill once, then decode token by token).
- Radix Cache: Shared across requests for prefix reuse (trie-structured KV blocks).
- Tensor Parallel (TP): All-reduce comms during engine forward passes.
Component Breakdown
Frontend: API Server and Shell
Responsibility: Handles incoming requests, parses prompts/messages, and streams responses. OpenAI-API-compatible for easy integration (`/v1/chat/completions`, streaming support).
Key Files/Dirs:
- `server/`: FastAPI app (likely `server.py`), Uvicorn integration, request parsing via `message/` and `tokenizer/`.
- `shell.py`: PromptToolkit-based REPL with `/reset` for history management.
- `message/`: Structured messages (system/user/assistant).
Interfaces:
- Inbound: HTTP (FastAPI), stdin (shell).
- Outbound: Calls `core.py` to enqueue requests with the scheduler.
- Clever Pattern: A unified request abstraction via `message/` allows seamless switching between API and shell.
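Since the endpoint follows the OpenAI chat-completions protocol, any OpenAI-compatible client works. A minimal sketch with the official `openai` Python package; the base URL, port, and model name below are placeholders, not values taken from the repo:

```python
# Minimal client sketch: talk to the server through the OpenAI-compatible
# /v1/chat/completions endpoint. Assumptions: the server listens on
# http://localhost:8000 and the model id matches whatever was passed
# via --model; both are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",   # placeholder model id
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain radix caching in one sentence."},
    ],
    stream=True,                        # tokens arrive as they are decoded
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```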
Core Runtime
Responsibility: Entry point and orchestrator. Initializes model, scheduler, engine; launches server/shell based on CLI flags (e.g., `--model`, `--tp`, `--shell`).
Key Files:
- `__main__.py`: CLI parser; launched via `python -m minisgl`.
- `core.py`: Central `Runtime` class coordinating all components.
Interfaces:
- Wires scheduler/engine/model; environment config via `env.py`.
- Trade-off: Single-threaded initialization for simplicity; scales out via TP.
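To make the orchestration concrete, here is an illustrative wiring sketch; the class and method names are hypothetical stand-ins and only approximate what `core.py` and `__main__.py` actually do:

```python
# Illustrative wiring sketch (hypothetical names; core.py's real classes differ).
# The pattern: one Runtime object owns model, engine, and scheduler, built in
# dependency order, and the frontend (API server or shell) only talks to it.
import argparse


class Model:          # stand-in for models/ + layers/
    def __init__(self, name: str, tp_size: int):
        self.name, self.tp_size = name, tp_size


class Engine:         # stand-in for engine/: runs batched forward passes
    def __init__(self, model: Model):
        self.model = model


class Scheduler:      # stand-in for scheduler/: batching + radix cache
    def __init__(self, engine: Engine):
        self.engine = engine


class Runtime:
    """Central orchestrator: builds components in dependency order."""

    def __init__(self, model: str, tp: int, shell: bool):
        self.model = Model(model, tp)
        self.engine = Engine(self.model)
        self.scheduler = Scheduler(self.engine)
        self.shell = shell

    def serve(self) -> None:
        target = "REPL shell" if self.shell else "FastAPI server"
        print(f"serving {self.model.name} (tp={self.model.tp_size}) via {target}")


if __name__ == "__main__":
    p = argparse.ArgumentParser("minisgl")
    p.add_argument("--model", required=True)
    p.add_argument("--tp", type=int, default=1)
    p.add_argument("--shell", action="store_true")
    a = p.parse_args()
    Runtime(a.model, a.tp, a.shell).serve()
```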
Scheduler
Responsibility: Advanced batching with radix cache (trie-based prefix sharing), chunked prefill (splits long inputs), and overlap scheduling (CPU batching hidden behind GPU compute). Manages request lifecycle: enqueue, prioritize, dispatch batches.
Key Files/Dirs:
- `scheduler/`: Core scheduling logic and queue management.
- Integrates `utils/` for helpers.
Interfaces:
- From frontend: `add_request(prompt, params)`.
- To engine: `run_batch(seqs, is_prefill)`.
- To radix cache: lookup/insert KV blocks by prefix hash.
- Clever Pattern: Priority queue with decode-first scheduling; overlap via async GPU streams.
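The decode-first idea can be shown with a small, self-contained sketch (hypothetical types; the real `scheduler/` differs in detail): already-decoding requests are batched first, and any leftover token budget goes to chunked prefills from the wait queue.

```python
# Decode-first batching sketch (hypothetical types; scheduler/ differs in detail).
# Running requests that are already decoding get priority; new prefills are
# admitted only if the remaining token budget allows their (chunked) prompt.
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    prompt_len: int         # tokens still needing prefill
    decoding: bool = False  # True once the request has entered the decode phase


def build_batch(running: list[Request], waiting: deque[Request],
                token_budget: int, chunk_size: int = 1024) -> list[tuple[Request, int]]:
    """Return (request, num_tokens_this_step) pairs for one scheduler step."""
    batch: list[tuple[Request, int]] = []

    # 1) Decode-first: every running request contributes exactly one token.
    for req in running:
        if token_budget <= 0:
            break
        batch.append((req, 1))
        token_budget -= 1

    # 2) Spend the leftover budget on (chunked) prefills from the wait queue.
    while waiting and token_budget > 0:
        req = waiting[0]
        tokens = min(req.prompt_len, chunk_size, token_budget)
        batch.append((req, tokens))
        token_budget -= tokens
        req.prompt_len -= tokens
        if req.prompt_len == 0:        # prompt fully prefilled -> starts decoding
            req.decoding = True
            running.append(waiting.popleft())
        else:
            break                      # partial chunk; resume on the next step
    return batch


# Example: two decoding requests plus one long waiting prompt.
running = [Request(0, 0, True), Request(1, 0, True)]
waiting = deque([Request(2, 3000)])
print([(r.rid, n) for r, n in build_batch(running, waiting, token_budget=2048)])
# -> [(0, 1), (1, 1), (2, 1024)]
```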
KV Cache and Attention (RadixAttention)
Responsibility: Efficient KV storage/reuse. Radix cache uses a trie to share prefixes across requests, reducing recompute. Supports chunked access.
Key Files/Dirs:
- `kvcache/`: KV store implementation.
- `attention/`: Radix-aware attention ops.
Interfaces:
- Scheduler queries by sequence ID/prefix; engine reads/writes blocks.
- Trade-off: Memory-efficient (shares prefixes) but higher CPU overhead for trie ops vs. paged cache.
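A minimal sketch of the prefix-sharing idea, assuming a one-token-per-node trie (the real `kvcache/` implementation is more compact and also handles eviction):

```python
# Radix-cache sketch: a token-level trie maps prompt prefixes to cached KV
# block ids, so a new request only prefills the part of its prompt that is
# not already covered by an earlier request. Hypothetical structure.
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    children: dict[int, Node] = field(default_factory=dict)
    kv_block: int | None = None          # id of the cached KV for this token


class RadixCache:
    def __init__(self):
        self.root = Node()
        self._next_block = 0

    def match_prefix(self, tokens: list[int]) -> list[int]:
        """Return cached KV block ids for the longest shared prefix."""
        node, blocks = self.root, []
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            blocks.append(node.kv_block)
        return blocks

    def insert(self, tokens: list[int]) -> None:
        node = self.root
        for t in tokens:
            if t not in node.children:
                node.children[t] = Node(kv_block=self._next_block)
                self._next_block += 1
            node = node.children[t]


cache = RadixCache()
cache.insert([1, 2, 3, 4])                    # first request prefills everything
print(len(cache.match_prefix([1, 2, 3, 9])))  # 3 -> only 1 new token to prefill
```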
Inference Engine
Responsibility: Batched forward passes (prefill/decode). Handles chunking, looping over decode tokens, TP sync.
Key Files/Dirs:
- `engine/`: Batch execution loop.
Interfaces:
- From scheduler: Batched sequences.
- To model/layers: `forward(hidden_states, positions)`.
- Pattern: Loop unrolling for decode, with GPU stream overlap.
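The prefill-then-decode control flow the engine drives can be reduced to a toy loop (the `forward` stand-in below is a placeholder, not the repo's model interface):

```python
# Prefill/decode loop sketch (toy stand-in model; engine/ drives the real one).
# Prefill runs the whole prompt once to fill the KV cache and produce the
# first token; decode then feeds back one token per step until EOS or max_len.
import torch

VOCAB, EOS = 32_000, 2

def forward(token_ids: torch.Tensor, kv_len: int) -> torch.Tensor:
    """Stand-in for model.forward: returns logits for the last position."""
    torch.manual_seed(kv_len + token_ids.numel())   # deterministic toy logits
    return torch.randn(VOCAB)

def generate(prompt: torch.Tensor, max_new_tokens: int = 32) -> list[int]:
    # Prefill: process the full prompt in one batched pass.
    logits = forward(prompt, kv_len=0)
    next_tok = int(torch.argmax(logits))
    out, kv_len = [next_tok], prompt.numel()

    # Decode: one token per step, reusing the (conceptual) KV cache.
    while next_tok != EOS and len(out) < max_new_tokens:
        logits = forward(torch.tensor([next_tok]), kv_len=kv_len)
        next_tok = int(torch.argmax(logits))
        out.append(next_tok)
        kv_len += 1
    return out

print(generate(torch.randint(0, VOCAB, (16,))))
```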
Model, Layers, and Kernels
Responsibility: Model loading (HuggingFace transformers), layer execution (MLP/Attention), custom kernels for speed.
Key Files/Dirs:
- `models/`: Model config/loader.
- `layers/`: Linear/MLP/RMSNorm implementations.
- `llm/`: LLM-specific wrappers.
- `kernel/`: Custom PyTorch ops; integrates `sgl_kernel` and FlashInfer; C++ sources in `csrc/`.
Interfaces:
- Engine drives sequential layer calls; kernels handle attn/concat.
- Clever Pattern: Kernel fusion (FlashInfer paged KV); TP all-reduce in the kernel path.
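As a representative example of what lives in `layers/`, here is a plain-PyTorch RMSNorm; a sketch for clarity, since the repo's version may dispatch to a fused kernel instead:

```python
# RMSNorm sketch: normalize by the root-mean-square of the last dimension,
# then apply a learned scale. Eager PyTorch for readability; a fused CUDA
# kernel (e.g. via sgl_kernel) would do the same math in one launch.
import torch
from torch import nn


class RMSNorm(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        var = x.pow(2).mean(dim=-1, keepdim=True)
        return x * torch.rsqrt(var + self.eps) * self.weight


x = torch.randn(2, 8, 64)
print(RMSNorm(64)(x).shape)   # torch.Size([2, 8, 64])
```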
Distributed (Tensor Parallelism)
Responsibility: Multi-GPU sharding (column-wise for layers), all-reduce for attn outputs.
Key Files/Dirs:
- `distributed/`: TP initialization, communication primitives.
- Tests: `tests/kernel/test_comm.py`.
Interfaces:
- Transparent to engine; wraps model sharding.
- Trade-off: Simple TP (no pipeline), scales to 4+ GPUs via NVLink.
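To make the all-reduce concrete, here is a small `torch.distributed` sketch of a row-parallel linear, the half of a Megatron-style split where each rank's partial output is summed. It uses the gloo/CPU backend so it runs anywhere under `torchrun`; the repo's `distributed/` code targets NCCL on GPUs and has its own wrappers. The script name and shapes are illustrative.

```python
# Tensor-parallel sketch: a row-parallel linear layer. Each rank owns a slice
# of the weight along the input (reduction) dimension, computes a partial
# output, and an all-reduce sums the partials into the full result.
# Launch with:  torchrun --nproc_per_node=2 tp_sketch.py
import torch
import torch.distributed as dist

def row_parallel_linear(x_shard: torch.Tensor, w_shard: torch.Tensor) -> torch.Tensor:
    partial = x_shard @ w_shard          # [batch, out], partial sum on this rank
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    return partial

def main() -> None:
    dist.init_process_group(backend="gloo")
    rank, world = dist.get_rank(), dist.get_world_size()

    torch.manual_seed(0)                 # same full tensors on every rank
    x = torch.randn(4, 8)                # [batch, in]
    w = torch.randn(8, 16)               # [in, out]

    # Shard x and w along the reduction dimension (in = 8).
    in_per_rank = x.shape[1] // world
    sl = slice(rank * in_per_rank, (rank + 1) * in_per_rank)
    y = row_parallel_linear(x[:, sl], w[sl, :])

    if rank == 0:
        print(torch.allclose(y, x @ w, atol=1e-5))   # True: matches the dense op

if __name__ == "__main__":
    main()
```

Column-parallel layers are the mirror image: the output dimension is sharded and no communication is needed until a following row-parallel layer performs the all-reduce.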
Data Flow
Typical online serving flow for a `/v1/chat/completions` request (streaming decode). Long prompts are prefilled in chunks; decode loops token by token, reusing radix-cached prefixes on hits.
```mermaid
sequenceDiagram
participant C as Client
participant API as API Server (server/)
participant Sch as Scheduler (scheduler/)
participant Radix as Radix KV (kvcache/)
participant Eng as Engine (engine/)
participant Model as Model/Layers (models/)
participant Kern as Kernels (kernel/)
C->>+API: POST /v1/chat/completions {messages, stream=true}
API->>+Sch: runtime.scheduler.add_request(messages)
Sch->>+Radix: lookup_prefix(seq_id)
Radix-->>-Sch: kv_blocks or miss
Sch->>+Sch: batch_requests()
Sch->>+Eng: run_prefill_batch(seqs, positions)
Eng->>+Model: model.forward(seq_lens, positions)
Model->>+Kern: attn(queries, radix_kv)
Kern->>+Radix: evict/insert_kv_blocks()
Radix-->>-Kern: cached_kv
Kern-->>-Model: attn_out
Model-->>-Eng: logits
Eng-->>-Sch: new_tokens (top-k sample)
loop Decode (until EOS)
Sch->>+Eng: run_decode_batch(1 token)
Eng->>+Model: forward(extend positions)
Note over Model,Kern: Reuse radix KV +1 token
Model-->>-Eng: logits
Eng-->>-Sch: token
Sch-->>-API: yield token (stream)
end
API-->>-C: [token1, token2, ...] done
```
Notes:
- Chunked Prefill: Long inputs split into fixed-size chunks (e.g., 1024 tokens).
- Overlap: Scheduler batches next while engine decodes prior.
- Radix Hit: Skips prefill recompute for shared prefixes.
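A tiny sketch of the chunking arithmetic (the 1024-token chunk size mirrors the example above; the helper name is hypothetical):

```python
# Chunked-prefill sketch: a long prompt is prefilled in fixed-size chunks so
# peak activation memory stays bounded. Each chunk attends to all KV written
# by earlier chunks (already in the cache) plus its own tokens.
def prefill_chunks(prompt_len: int, chunk_size: int = 1024):
    done = 0
    while done < prompt_len:
        n = min(chunk_size, prompt_len - done)
        # Positions computed in this step, and how many cached KV entries
        # they can attend to (everything before them).
        yield {"positions": range(done, done + n), "cached_kv_len": done}
        done += n

for step, chunk in enumerate(prefill_chunks(2500)):
    print(step, len(chunk["positions"]), chunk["cached_kv_len"])
# -> 0 1024 0 / 1 1024 1024 / 2 452 2048
```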
Key Design Decisions
- Architectural Pattern: Monolith with modular components (single process, one directory per concern). Event-loop-like scheduler (PyTorch async) rather than a full actor model; batch-oriented (dynamic batching) for throughput.
- Optimizations as First-Class (a minimal overlap sketch appears after this list):

| Feature | Benefit | Trade-off |
| --- | --- | --- |
| Radix Cache | Prefix sharing (e.g., 2-5x TTFT improvement in traces) | Trie CPU overhead (~5-10% of scheduler time); memory fragmentation |
| Chunked Prefill | Lowers peak HBM usage (long context on H200) | Minor latency cost (+1-2% TTFT) |
| Overlap Scheduling | Hides 20-50% of CPU overhead | Requires GPU streams; disable via `MINISGL_DISABLE_OVERLAP_SCHEDULING=1` |
| FlashInfer Kernels | State-of-the-art attention speed | CUDA/Linux only; JIT compile time |
- Trade-offs for Compactness:
  - Python-heavy (~95% Python, minimal C++ in `kernel/csrc/`): readable, but no AOT Rust/C++ for production scale.
  - HF Transformers loader: easy model support, but slower loading vs. custom safetensors.
  - Single runtime (no microservices): low latency and easy debugging; scales vertically (TP).
  - No pipeline parallelism: simpler; focuses on TP + batching.
- Extensibility: Override the scheduler/engine in `core.py`; add kernels via `sgl_kernel`. Benchmarks in `benchmark/` validate performance (e.g., matches SGLang on Qwen traces).
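A toy illustration of the overlap idea: because CUDA launches are asynchronous, the CPU can prepare the next batch while the GPU is still busy, and only synchronize when the previous batch's tokens are needed. This is a sketch of the principle, not the repo's scheduler (which, per the table above, can be disabled with `MINISGL_DISABLE_OVERLAP_SCHEDULING=1`).

```python
# Overlap-scheduling sketch: kernel launches are asynchronous, so CPU-side
# scheduling of batch i+1 runs while the GPU executes batch i; the scheduler
# blocks only when it actually needs batch i's result.
import torch

def run_step(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for a forward pass: queued on the GPU, returns immediately.
    return x @ x

def prepare_next_batch() -> None:
    # Stand-in for CPU-side work: radix lookup, batching, tensor packing.
    sum(range(100_000))

if torch.cuda.is_available():
    x = torch.randn(4096, 4096, device="cuda")
    out = run_step(x)        # GPU starts working; this call does not block
    prepare_next_batch()     # CPU scheduling overlaps with the GPU compute
    torch.cuda.synchronize() # block only when the result is actually needed
    print(out.shape)
```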
This design demystifies LLM serving: the scheduler is the “brain,” the kernels are the “muscle.” Dive into `core.py` for the wiring.