If you’re running SGLang in production to serve large language models, proper observability is non-negotiable. SGLang exposes a rich set of Prometheus metrics that give you deep visibility into token throughput, request latencies, queue depths, and cache efficiency. In this post, I’ll walk you through every metric available and show you how to use them effectively for production monitoring.
## Why Monitor SGLang?
SGLang has become one of the go-to inference engines for LLM serving, powering deployments across thousands of GPUs. When you're serving requests at high throughput, you need to know:
- Are requests queueing up? High queue depth means you need to scale.
- What’s the time to first token (TTFT)? Users notice slow initial responses.
- Is the prefix cache working? Good cache hit rates dramatically improve throughput.
- Are all workers healthy? In distributed setups, worker failures need fast detection.
Let’s dive into the metrics that answer these questions.
## Enabling Metrics
First, enable the metrics endpoint when launching your SGLang server:
```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000 \
  --enable-metrics
```
Metrics are then available at `http://localhost:30000/metrics` in Prometheus exposition format.
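If you're pointing your own Prometheus at the server, a minimal scrape config is enough. This is just a sketch with a placeholder job name and target; the default `metrics_path` of `/metrics` already matches the endpoint above:

```yaml
# prometheus.yml (fragment) -- job name and target are placeholders for your deployment
scrape_configs:
  - job_name: "sglang"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:30000"]   # SGLang server from the launch command above
```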
**Version note:** SGLang v0.5.4+ changed the metric prefix from `sglang:` to `sglang_`. If you're upgrading, update your Grafana dashboards and alerting rules accordingly.
## Server Engine Metrics
These metrics come directly from the SGLang inference engine and give you visibility into the core serving performance.
### Token Throughput
| Metric | Type | Description |
|---|---|---|
| `sglang:prompt_tokens_total` | Counter | Total prefill (input) tokens processed |
| `sglang:generation_tokens_total` | Counter | Total generation (output) tokens produced |
| `sglang:cached_tokens_total` | Counter | Tokens served from the prefix cache |
| `sglang:gen_throughput` | Gauge | Current generation rate (tokens/sec) |
Useful queries:
```promql
# Token generation rate over 5 minutes
rate(sglang:generation_tokens_total[5m])

# Input vs output token ratio
rate(sglang:prompt_tokens_total[5m]) / rate(sglang:generation_tokens_total[5m])
```
### Queue and Running State
| Metric | Type | Description |
|---|---|---|
| `sglang:num_running_reqs` | Gauge | Currently processing requests |
| `sglang:num_queue_reqs` | Gauge | Requests waiting in queue |
| `sglang:num_used_tokens` | Gauge | Tokens in KV cache |
These are your capacity indicators. When `num_queue_reqs` starts climbing, you're approaching saturation.
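Because these gauges are spiky, I prefer smoothed and trending views over raw values. These are generic PromQL patterns rather than anything SGLang-specific; pick windows and thresholds that match your own traffic:

```promql
# Smoothed queue depth (ignores momentary spikes)
avg_over_time(sglang:num_queue_reqs[10m])

# Queue trend: a sustained positive slope means demand is outpacing capacity
deriv(sglang:num_queue_reqs[10m])
```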
### Cache Efficiency
| Metric | Type | Description |
|---|---|---|
| `sglang:cache_hit_rate` | Gauge | Prefix cache hit ratio (0.0-1.0) |
| `sglang:token_usage` | Gauge | KV cache utilization ratio |
SGLang's RadixAttention prefix caching is one of its killer features. A healthy deployment should see cache hit rates of 0.3-0.5 or higher for repetitive workloads (shared system prompts, templated requests).
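If you want a windowed view instead of the instantaneous gauge, you can derive a hit ratio from the token counters. This assumes cached tokens are counted alongside prompt tokens during prefill; check your `/metrics` output to confirm the semantics on your version:

```promql
# Prefix cache hit ratio over the last 5 minutes, derived from counters
rate(sglang:cached_tokens_total[5m]) / rate(sglang:prompt_tokens_total[5m])
```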
### Latency Histograms
This is where it gets interesting. SGLang exposes three critical latency distributions:
#### Time to First Token (TTFT)

```promql
sglang:time_to_first_token_seconds_bucket{le="0.1"}
sglang:time_to_first_token_seconds_bucket{le="0.5"}
sglang:time_to_first_token_seconds_bucket{le="1.0"}
...
```
TTFT is what users “feel” first. Get your P95 with:
```promql
histogram_quantile(0.95, rate(sglang:time_to_first_token_seconds_bucket[5m]))
```
#### End-to-End Latency

```promql
sglang:e2e_request_latency_seconds_bucket{le="5.0"}
sglang:e2e_request_latency_seconds_bucket{le="10.0"}
...
```
Total request duration from arrival to completion.
#### Time Per Output Token (TPOT)

```promql
sglang:time_per_output_token_seconds_bucket{le="0.05"}
sglang:time_per_output_token_seconds_bucket{le="0.1"}
...
```
Also called inter-token latency (ITL). This affects perceived streaming speed.
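The average ITL (see the cheat sheet below) hides tail behavior, and the tail is what makes streaming feel janky, so it's worth plotting a high quantile from the same histogram alongside it:

```promql
# P95 inter-token latency
histogram_quantile(0.95, rate(sglang:time_per_output_token_seconds_bucket[5m]))
```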
## Understanding Phase Labels
Many SGLang metrics include a `phase` label that distinguishes between the two distinct phases of LLM inference:
- `phase="prefill"` - Processing the input prompt (compute-bound)
- `phase="decode"` - Generating output tokens one at a time (memory-bandwidth-bound)
This is critical for debugging performance issues because each phase has different bottlenecks:
```promql
sglang:num_running_reqs{phase="prefill"}  # Requests currently in prefill
sglang:num_running_reqs{phase="decode"}   # Requests currently decoding
```
Why this matters:
| Phase | Bottleneck | Symptom of Issues |
|---|---|---|
| Prefill | GPU compute | High TTFT, long prompt processing |
| Decode | Memory bandwidth | Slow token generation, high TPOT |
Useful queries with phase labels:
```promql
# Prefill vs decode request distribution
sglang:num_running_reqs{phase="prefill"}
  / ignoring(phase) sglang:num_running_reqs{phase="decode"}

# Track prefill-heavy workloads (long prompts)
sum by (phase) (rate(sglang:prompt_tokens_total[5m]))
```
If you see many requests stuck in prefill, you may have long input sequences overwhelming compute. If decode is the bottleneck, consider batching strategies or memory optimization.
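One way to spot prompt-heavy traffic is to approximate the average prompt length per request from metrics we've already covered. Treat this as a rough estimate; it assumes request completions roughly track arrivals over the window:

```promql
# Approximate average input tokens per request
rate(sglang:prompt_tokens_total[5m]) / rate(sglang:e2e_request_latency_seconds_count[5m])
```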
## Router Metrics
If you’re running SGLang Router (sglang_router) for load balancing across multiple workers, you get an additional set of metrics at the router level (default port 29000).
```bash
python -m sglang_router.launch_router \
  --worker-urls http://worker1:8000 http://worker2:8001 \
  --prometheus-host 0.0.0.0 \
  --prometheus-port 29000
```
### Key Router Metrics
| Metric | Type | Description |
|---|---|---|
| `sgl_router_requests_total` | Counter | Requests by route and method |
| `sgl_router_processed_requests_total` | Counter | Requests handled per worker |
| `sgl_router_active_workers` | Gauge | Healthy worker count |
| `sgl_router_running_requests` | Gauge | In-flight requests per worker |
| `sgl_router_cache_hits_total` | Counter | Cache-aware routing hits |
| `sgl_router_cache_misses_total` | Counter | Cache-aware routing misses |
The cache-aware routing metrics are particularly useful—they tell you how effectively the router is directing requests to workers that already have relevant prefixes cached.
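To put that on a dashboard as a single ratio:

```promql
# Fraction of routing decisions that landed on a worker with a warm prefix cache
rate(sgl_router_cache_hits_total[5m])
  / (rate(sgl_router_cache_hits_total[5m]) + rate(sgl_router_cache_misses_total[5m]))
```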
## Production Alerting Rules
Here’s a starter set of alerting rules I recommend:
```yaml
groups:
  - name: sglang_alerts
    rules:
      # Queue backing up
      - alert: SGLangHighQueueDepth
        expr: sglang:num_queue_reqs > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Request queue depth is high ({{ $value }})"
          description: "Consider scaling up workers or reducing request rate"

      # Slow time to first token
      - alert: SGLangHighTTFT
        expr: histogram_quantile(0.95, rate(sglang:time_to_first_token_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 TTFT exceeds 5 seconds"

      # Cache not being utilized
      - alert: SGLangLowCacheHitRate
        expr: sglang:cache_hit_rate < 0.1
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "Prefix cache hit rate below 10%"
          description: "Check if request patterns allow for caching"

      # Worker down
      - alert: SGLangNoActiveWorkers
        expr: sgl_router_active_workers < 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No healthy SGLang workers available"
```
## Dashboard Queries Cheat Sheet
Here are the PromQL queries I use most often:
```promql
# Overall throughput
rate(sglang:generation_tokens_total[5m])

# P50/P95/P99 TTFT
histogram_quantile(0.50, rate(sglang:time_to_first_token_seconds_bucket[5m]))
histogram_quantile(0.95, rate(sglang:time_to_first_token_seconds_bucket[5m]))
histogram_quantile(0.99, rate(sglang:time_to_first_token_seconds_bucket[5m]))

# Average inter-token latency
rate(sglang:time_per_output_token_seconds_sum[5m])
  / rate(sglang:time_per_output_token_seconds_count[5m])

# Requests per second
rate(sglang:e2e_request_latency_seconds_count[5m])

# Cache efficiency
sglang:cache_hit_rate

# Queue pressure
sglang:num_queue_reqs / (sglang:num_running_reqs + 1)

# Router load balance check (per-worker request rate)
rate(sgl_router_processed_requests_total[5m])
```
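To quantify how evenly the router is spreading load, I compare the busiest and least busy workers. This assumes each worker exposes its own series of the processed-requests counter, as the per-worker description above implies; a ratio close to 1 means load is well balanced:

```promql
# Worker load imbalance (busiest worker rate / least busy worker rate)
max(rate(sgl_router_processed_requests_total[5m]))
  / min(rate(sgl_router_processed_requests_total[5m]))
```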
## Quick Setup with Docker Compose
SGLang ships with an example monitoring stack:
```bash
cd examples/monitoring
docker compose up -d
```
This spins up Prometheus and Grafana with a pre-configured dashboard. Access Grafana at `http://localhost:3000` (default credentials: admin/admin).
## Complete Metrics Reference
For quick reference, here’s the complete list:
### Server Engine Metrics
| Metric | Type | Description |
|---|---|---|
| `sglang:prompt_tokens_total` | Counter | Total prefill (input) tokens processed |
| `sglang:generation_tokens_total` | Counter | Total generation (output) tokens produced |
| `sglang:cached_tokens_total` | Counter | Tokens served from the prefix cache |
| `sglang:gen_throughput` | Gauge | Current generation rate (tokens/sec) |
| `sglang:num_running_reqs` | Gauge | Currently processing requests |
| `sglang:num_queue_reqs` | Gauge | Requests waiting in queue |
| `sglang:num_used_tokens` | Gauge | Tokens in KV cache |
| `sglang:cache_hit_rate` | Gauge | Prefix cache hit ratio (0.0-1.0) |
| `sglang:token_usage` | Gauge | KV cache utilization ratio |
| `sglang:time_to_first_token_seconds` | Histogram | Time until first token generated |
| `sglang:e2e_request_latency_seconds` | Histogram | Total request duration |
| `sglang:time_per_output_token_seconds` | Histogram | Inter-token latency (ITL) |
| `sglang:func_latency_seconds` | Histogram | Internal function latencies |
### Router Metrics
| Metric | Type | Description |
|---|---|---|
| `sgl_router_requests_total` | Counter | Requests by route and method |
| `sgl_router_processed_requests_total` | Counter | Requests handled per worker |
| `sgl_router_active_workers` | Gauge | Healthy worker count |
| `sgl_router_running_requests` | Gauge | In-flight requests per worker |
| `sgl_router_worker_health` | Gauge | Worker health status |
| `sgl_router_cache_hits_total` | Counter | Cache-aware routing hits |
| `sgl_router_cache_misses_total` | Counter | Cache-aware routing misses |
| `sgl_router_generate_duration_seconds` | Histogram | Generation request duration |
## Conclusion
Proper observability is what separates a demo from a production deployment. SGLang’s Prometheus metrics give you everything you need to monitor throughput, latencies, queue health, and cache efficiency. Set up your dashboards, configure alerts for queue depth and TTFT, and you’ll catch issues before your users do.