Mini-SGLang Architecture

Mini-SGLang is a compact, high-performance LLM inference engine (~5,000 lines of Python) that mirrors key innovations from SGLang, such as radix cache, chunked prefill, overlap scheduling, and tensor parallelism. It uses PyTorch for core tensor ops, FlashInfer/FlashAttention for optimized kernels, and custom Python/CUDA components for efficiency. The design prioritizes readability and modularity, with full type annotations, while achieving near-SGLang performance on benchmarks.

High-Level Architecture

The architecture is a monolithic runtime with a clear separation of frontend (API/shell), scheduling, execution, and low-level kernels. Requests enter via HTTP (OpenAI-compatible) or shell, flow through a scheduler that manages batching and caching, and execute on a batched engine leveraging GPU kernels and distributed comms.

graph LR
    subgraph Frontend["Frontend"]
        Client["Client / OpenAI Clients"] -->|"HTTP POST /v1/chat/completions"| APIServer["FastAPI Server<br/>server/"]
        User["Terminal User"] -->|"Interactive Chat"| Shell["REPL Shell<br/>shell.py"]
    end

    APIServer --> Core["Core Runtime<br/>core.py"]
    Shell --> Core

    Core --> Scheduler["Scheduler<br/>scheduler/"]
    Scheduler -->|"Enqueue / Batch / Radix Lookup"| Queue["Request Queue"]
    Scheduler -->|"Overlap Scheduling"| RadixCache["Radix KV Cache<br/>kvcache/ attention/"]

    Queue --> Engine["Inference Engine<br/>engine/"]
    Engine -->|"Prefill / Decode Batches"| Model["Model Loader / Layers<br/>models/ layers/ llm/"]
    Model -->|"Forward Pass"| Kernels["Optimized Kernels<br/>kernel/ FlashInfer / FlashAttn"]
    Kernels <--> RadixCache
    Kernels <--> TP["Tensor Parallel<br/>distributed/"]

    subgraph Backend["GPU Backend"]
        TP
        Kernels
    end

    Engine -->|"Logits / Tokens"| Scheduler
    Scheduler -->|"Stream Response"| Core
    Core --> Frontend

Key Interactions:

  • Scheduler ↔ Engine: Bidirectional for iterative decode (prefill once, decode tokens).
  • Radix Cache: Shared across requests for prefix reuse (trie-structured KV blocks).
  • Tensor Parallel (TP): All-reduce comms during engine forward passes.

Component Breakdown

Frontend: API Server and Shell

Responsibility: Handles incoming requests, parses prompts/messages, and streams responses. OpenAI-compatible (/v1/chat/completions with streaming) for easy integration.

Key Files/Dirs:

  • server/: FastAPI app (server.py likely), Uvicorn integration, request parsing via message/ and tokenizer/.
  • shell.py: PromptToolkit-based REPL with /reset for history management.
  • message/: Structured messages (system/user/assistant).

Interfaces:

  • Inbound: HTTP (FastAPI), stdin (shell).
  • Outbound: Calls core.py to enqueue requests in scheduler.
  • Clever Pattern: Unified request abstraction via message/ allows seamless API/shell switching.
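
As a sketch of the streaming pattern only (not the actual server/ code), the snippet below shows an OpenAI-style streaming endpoint in FastAPI with a stubbed token generator standing in for the scheduler hand-off; everything except the route path is illustrative:

import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def stream_tokens(messages):
    # Stub: the real frontend enqueues the request through core.py and the
    # scheduler, then yields tokens as the engine decodes them.
    for tok in ["Hello", ",", " world", "!"]:
        yield tok

@app.post("/v1/chat/completions")
async def chat_completions(body: dict):
    async def sse():
        async for tok in stream_tokens(body["messages"]):
            chunk = {"choices": [{"delta": {"content": tok}}]}
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(sse(), media_type="text/event-stream")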

Core Runtime

Responsibility: Entry point and orchestrator. Initializes model, scheduler, engine; launches server/shell based on CLI flags (e.g., --model, --tp, --shell).

Key Files:

  • __main__.py: CLI parser, launches via python -m minisgl.
  • core.py: Central Runtime class coordinating all components.

Interfaces:

  • Wires scheduler/engine/model; env config via env.py.
  • Trade-off: Single-threaded init for simplicity, but scales via TP.
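
A hedged sketch of the CLI-to-runtime hand-off; the flags mirror those listed above, while the wiring itself is described only in comments because the real classes live in core.py, shell.py, and server/:

import argparse

def main():
    # Mirrors the flags mentioned above: python -m minisgl --model ... --tp N [--shell]
    parser = argparse.ArgumentParser(prog="python -m minisgl")
    parser.add_argument("--model", required=True, help="HF model name or path")
    parser.add_argument("--tp", type=int, default=1, help="tensor-parallel degree")
    parser.add_argument("--shell", action="store_true", help="start the REPL instead of the API server")
    args = parser.parse_args()

    # In the real core.py, a single Runtime object is built here: it loads the
    # model (models/), allocates the radix KV cache (kvcache/), and constructs
    # the engine (engine/) and scheduler (scheduler/) before handing control to
    # either shell.py or the FastAPI server in server/.
    print(f"would launch {'shell' if args.shell else 'API server'} "
          f"for model={args.model} with tp={args.tp}")

if __name__ == "__main__":
    main()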

Scheduler

Responsibility: Advanced batching with radix cache (trie-based prefix sharing), chunked prefill (splits long inputs), and overlap scheduling (CPU batching hidden behind GPU compute). Manages request lifecycle: enqueue, prioritize, dispatch batches.

Key Files/Dirs:

  • scheduler/: Core scheduling logic, queue management.
  • Integrates utils/ for helpers.

Interfaces:

  • From frontend: add_request(prompt, params).
  • To engine: run_batch(seqs, is_prefill).
  • To radix cache: Lookup/insert KV blocks by prefix hash.
  • Clever Pattern: Priority queue with decode-first scheduling; overlap via async GPU streams.
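
A toy illustration of decode-first batching under a token budget, with plain dataclasses in place of the real request objects; radix lookup, chunked prefill, and overlap are omitted, and all names are illustrative:

from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_ids: list[int]
    output_ids: list[int] = field(default_factory=list)

class TinyScheduler:
    """Decode-first policy: requests already decoding are batched every step;
    waiting prompts are admitted for prefill only within a token budget."""

    def __init__(self, max_prefill_tokens: int = 4096):
        self.waiting: deque[Request] = deque()
        self.running: list[Request] = []
        self.max_prefill_tokens = max_prefill_tokens

    def add_request(self, prompt_ids: list[int]) -> Request:
        req = Request(prompt_ids)
        self.waiting.append(req)
        return req

    def next_batch(self) -> tuple[str, list[Request]]:
        if self.running:                           # decode step: one new token per running request
            return "decode", list(self.running)
        batch, budget = [], self.max_prefill_tokens
        while self.waiting and len(self.waiting[0].prompt_ids) <= budget:
            req = self.waiting.popleft()
            budget -= len(req.prompt_ids)
            batch.append(req)
        self.running.extend(batch)                 # these start decoding on later steps
        return "prefill", batch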

KV Cache and Attention (RadixAttention)

Responsibility: Efficient KV storage/reuse. Radix cache uses a trie to share prefixes across requests, reducing recompute. Supports chunked access.

Key Files/Dirs:

  • kvcache/ and attention/: radix cache structure and attention-side KV access (per the architecture diagram above).

Interfaces:

  • Scheduler queries by sequence ID/prefix; engine reads/writes blocks.
  • Trade-off: Memory-efficient (shares prefixes) but higher CPU overhead for trie ops vs. paged cache.
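
A self-contained sketch of the prefix-sharing idea: a per-token trie keyed by token IDs. The real radix tree compresses token runs into edges and tracks GPU block indices, reference counts, and eviction; here a placeholder object stands in for a KV block.

class RadixNode:
    def __init__(self):
        self.children: dict[int, "RadixNode"] = {}
        self.kv_block = None          # would point at cached KV pages on the GPU

class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, token_ids: list[int]) -> int:
        """Return how many leading tokens already have cached KV."""
        node, matched = self.root, 0
        for t in token_ids:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

    def insert(self, token_ids: list[int]) -> None:
        """Add the path for a finished sequence so later requests can reuse it."""
        node = self.root
        for t in token_ids:
            node = node.children.setdefault(t, RadixNode())
            node.kv_block = node.kv_block or object()   # placeholder for a real block

# Example: the second request reuses the shared prefix of the first.
cache = RadixCache()
cache.insert([1, 2, 3, 4, 5])
print(cache.match_prefix([1, 2, 3, 9]))   # -> 3 tokens of KV skip recompute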

Inference Engine

Responsibility: Batched forward passes (prefill/decode). Handles prefill chunking, the per-token decode loop, and TP synchronization.

Key Files/Dirs:

  • engine/: batched prefill/decode execution (per the architecture diagram above).

Interfaces:

  • From scheduler: Batched sequences.
  • To model/layers: forward(hidden_states, positions).
  • Pattern: Loop unrolling for decode, with GPU stream overlap.
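
A miniature prefill-then-decode loop (greedy sampling, single sequence, no KV cache), intended only to show the shape of the engine's control flow; the real engine/ batches many sequences, reuses the radix KV cache, and overlaps with scheduling:

import torch

@torch.inference_mode()
def generate(model, input_ids: torch.Tensor, max_new_tokens: int, eos_id: int) -> torch.Tensor:
    """model(input_ids) -> logits of shape [1, seq_len, vocab]."""
    # Prefill: one forward pass over the whole prompt (would fill the KV cache).
    logits = model(input_ids)
    next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
    out = [next_tok]

    # Decode: one token per step; with a KV cache only the new token would be fed in.
    for _ in range(max_new_tokens - 1):
        if next_tok.item() == eos_id:
            break
        logits = model(torch.cat([input_ids, *out], dim=1))  # no KV cache in this sketch
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        out.append(next_tok)
    return torch.cat(out, dim=1)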

Model, Layers, and Kernels

Responsibility: Model loading (HuggingFace transformers), layer execution (MLP/Attention), custom kernels for speed.

Key Files/Dirs:

  • models/: Model config/loader.
  • layers/: Linear/MLP/RMSNorm impls.
  • llm/: LLM-specific wrappers.
  • kernel/: Custom PyTorch ops, integrates sgl_kernel, FlashInfer; C++ src in csrc/.

Interfaces:

  • Engine drives sequential layer calls; kernels handle attn/concat.
  • Clever Pattern: Kernel fusion (FlashInfer paged KV); TP all-reduce in kernels.
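
To make the layer-call pattern concrete, here is a compact decoder block using PyTorch SDPA and nn.RMSNorm (PyTorch 2.4+); the real layers/ code shards the projections for TP, uses a gated MLP, and dispatches to FlashInfer/FlashAttention with paged KV, so treat this as an illustrative stand-in:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderLayer(nn.Module):
    def __init__(self, hidden: int = 512, heads: int = 8, mlp_mult: int = 4):
        super().__init__()
        self.heads, self.head_dim = heads, hidden // heads
        self.norm1 = nn.RMSNorm(hidden)
        self.qkv = nn.Linear(hidden, 3 * hidden, bias=False)
        self.o_proj = nn.Linear(hidden, hidden, bias=False)
        self.norm2 = nn.RMSNorm(hidden)
        self.mlp = nn.Sequential(
            nn.Linear(hidden, mlp_mult * hidden, bias=False),
            nn.SiLU(),
            nn.Linear(mlp_mult * hidden, hidden, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: [batch, seq, hidden]
        b, s, _ = x.shape
        h = self.norm1(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q, k, v = (t.view(b, s, self.heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # fused kernels replace this
        attn = attn.transpose(1, 2).reshape(b, s, -1)
        x = x + self.o_proj(attn)
        return x + self.mlp(self.norm2(x))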

Distributed (Tensor Parallelism)

Responsibility: Multi-GPU sharding (column-wise for layers), all-reduce for attn outputs.

Key Files/Dirs:

  • distributed/: tensor-parallel setup and collective communication (per the architecture diagram above).

Interfaces:

  • Transparent to engine; wraps model sharding.
  • Trade-off: Simple TP (no pipeline), scales to 4+ GPUs via NVLink.
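
A sketch of the Megatron-style split implied above, assuming an initialized torch.distributed process group (e.g. launched with torchrun): the first projection is column-parallel (no communication) and the output projection is row-parallel, finished with the all-reduce mentioned above. Class names are illustrative.

import torch
import torch.nn as nn
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    """Each rank holds a slice of the output dimension; no communication here,
    the sharded activations feed straight into the row-parallel layer."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        tp = dist.get_world_size()
        assert out_features % tp == 0
        self.local = nn.Linear(in_features, out_features // tp, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.local(x)

class RowParallelLinear(nn.Module):
    """Each rank holds a slice of the input dimension; partial outputs are
    summed across ranks with an all-reduce (the comm noted above)."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        tp = dist.get_world_size()
        assert in_features % tp == 0
        self.local = nn.Linear(in_features // tp, out_features, bias=False)

    def forward(self, x_shard: torch.Tensor) -> torch.Tensor:
        partial = self.local(x_shard)                 # partial sum on this rank
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        return partial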

Data Flow

Typical online serving flow for a /v1/chat/completions request with streaming decode. Prefill is chunked if the prompt is long; decode loops token by token, with radix cache hits skipping recomputation.

sequenceDiagram
    participant C as Client
    participant API as API Server (server/)
    participant Sch as Scheduler (scheduler/)
    participant Radix as Radix KV (kvcache/)
    participant Eng as Engine (engine/)
    participant Model as Model/Layers (models/)
    participant Kern as Kernels (kernel/)

    C->>+API: POST /v1/chat/completions {messages, stream=true}
    API->>+Sch: runtime.scheduler.add_request(messages)
    Sch->>+Radix: lookup_prefix(seq_id)
    Radix-->>-Sch: kv_blocks or miss
    Sch->>+Sch: batch_requests()
    Sch->>+Eng: run_prefill_batch(seqs, positions)
    Eng->>+Model: model.forward(seq_lens, positions)
    Model->>+Kern: attn(queries, radix_kv)
    Kern->>+Radix: evict/insert_kv_blocks()
    Radix-->>-Kern: cached_kv
    Kern-->>-Model: attn_out
    Model-->>-Eng: logits
    Eng-->>-Sch: new_tokens (top-k sample)

    loop Decode (until EOS)
        Sch->>+Eng: run_decode_batch(1 token)
        Eng->>+Model: forward(extend positions)
        Note over Model,Kern: Reuse radix KV +1 token
        Model-->>-Eng: logits
        Eng-->>-Sch: token
        Sch-->>-API: yield token (stream)
    end

    API-->>-C: [token1, token2, ...] done

Notes:

  • Chunked Prefill: Long inputs are split into fixed-size chunks (e.g., 1024 tokens); see the sketch after these notes.
  • Overlap: Scheduler batches next while engine decodes prior.
  • Radix Hit: Skips prefill recompute for shared prefixes.
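
A minimal sketch of the chunked-prefill note above, with run_prefill standing in for one engine forward pass that appends the chunk's KV to the cache; the 1024-token chunk size is just the example value from the note:

def chunked_prefill(run_prefill, prompt_ids: list[int], chunk_size: int = 1024) -> None:
    """Feed the prompt through prefill in fixed-size pieces so peak activation
    memory scales with the chunk, while the KV cache accumulates the full prefix."""
    for start in range(0, len(prompt_ids), chunk_size):
        chunk = prompt_ids[start:start + chunk_size]
        run_prefill(chunk, start)   # forward pass over `chunk` starting at position `start`

# Example: a 3000-token prompt becomes chunks of 1024 + 1024 + 952 tokens.
chunked_prefill(lambda chunk, pos: print(f"{len(chunk)} tokens at position {pos}"),
                prompt_ids=list(range(3000)))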

Key Design Decisions

  • Architectural Pattern: Monolith with modular components (single process, one directory per concern). The scheduler is event-loop-like (PyTorch async) rather than a full actor model, and execution is batch-oriented (dynamic batching) for throughput.

  • Optimizations as First-Class:

    • Radix Cache. Benefit: prefix sharing (e.g., 2-5x faster TTFT in traces). Trade-off: trie CPU overhead (~5-10% of scheduler time); memory fragmentation.
    • Chunked Prefill. Benefit: lower peak HBM use (long context on H200). Trade-off: minor latency cost (+1-2% TTFT).
    • Overlap Scheduling. Benefit: hides 20-50% of CPU overhead (sketched after this list). Trade-off: requires GPU streams; disable via MINISGL_DISABLE_OVERLAP_SCHEDULING=1.
    • FlashInfer Kernels. Benefit: state-of-the-art attention speed. Trade-off: CUDA/Linux only; JIT compile time.
  • Trade-offs for Compactness:

    • Python-heavy (~95% Python, minimal C++ in kernel/csrc/); readable but no AOT Rust/C++ for prod scale.
    • HF Transformers loader: Easy model support, but slower load vs. custom safetensors.
    • Single runtime (no microservices): Low latency, easy debug; scales vertically (TP).
    • No pipeline parallelism: Simpler, focuses on TP + batching.
  • Extensibility: Override the scheduler/engine wiring in core.py; add kernels via sgl_kernel. Benchmarks in benchmark/ validate the design (e.g., matching SGLang on Qwen traces).
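
A toy version of the overlap idea sketched after this list as promised: the GPU forward pass for step N is launched asynchronously and the CPU prepares the batch for step N+1 before synchronizing. Here engine and scheduler are hypothetical objects, and the real scheduler additionally has to resolve the dependency of the next decode batch on tokens still being computed.

import torch

def serve_loop(engine, scheduler, steps: int = 100):
    """Overlap sketch (requires a CUDA device): hide CPU batch building
    behind asynchronously launched GPU work."""
    compute_stream = torch.cuda.current_stream()
    copy_stream = torch.cuda.Stream()
    batch = scheduler.next_batch()
    for _ in range(steps):
        logits = engine.forward(batch)        # async kernel launches on the compute stream
        next_batch = scheduler.next_batch()   # CPU work runs while the GPU is still busy
        copy_stream.wait_stream(compute_stream)
        with torch.cuda.stream(copy_stream):  # sample + copy back on a side stream
            tokens = logits.argmax(dim=-1).to("cpu", non_blocking=True)
        copy_stream.synchronize()             # sampled tokens are now visible on the CPU
        scheduler.commit(batch, tokens)       # hypothetical: append tokens, retire finished requests
        batch = next_batch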

This design demystifies LLM serving: scheduler is the “brain,” kernels the “muscle.” Dive into core.py for wiring.