If you’re running SGLang in production to serve large language models, proper observability is non-negotiable. SGLang exposes a rich set of Prometheus metrics that give you deep visibility into token throughput, request latencies, queue depths, and cache efficiency. In this post, I’ll walk you through every metric available and show you how to use them effectively for production monitoring.
## Why Monitor SGLang?
SGLang has become one of the go-to inference engines for LLM serving, powering deployments across thousands of GPUs. When you're serving requests at high throughput, you need to know:
- Are requests queueing up? High queue depth means you need to scale.
- What’s the time to first token (TTFT)? Users notice slow initial responses.
- Is the prefix cache working? Good cache hit rates dramatically improve throughput.
- Are all workers healthy? In distributed setups, worker failures need fast detection.
Let’s dive into the metrics that answer these questions.
## Enabling Metrics
First, enable the metrics endpoint when launching your SGLang server:
```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000 \
  --enable-metrics
```
Metrics are then available at `http://localhost:30000/metrics` in Prometheus exposition format.
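If you're pointing your own Prometheus at the server, a minimal scrape config is enough. This is just a sketch with a placeholder job name and target; the default `metrics_path` of `/metrics` already matches the endpoint above:

```yaml
# prometheus.yml (fragment) -- job name and target are placeholders for your deployment
scrape_configs:
  - job_name: "sglang"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:30000"]   # SGLang server from the launch command above
```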
**Version note:** SGLang v0.5.4+ changed the metric prefix from `sglang:` to `sglang_`. If you're upgrading, update your Grafana dashboards and alerting rules accordingly.
## Server Engine Metrics
These metrics come directly from the SGLang inference engine and give you visibility into the core serving performance.
### Token Throughput
| Metric | Type | Description |
|---|---|---|
| `sglang:prompt_tokens_total` | Counter | Total prefill (input) tokens processed |
| `sglang:generation_tokens_total` | Counter | Total generation (output) tokens produced |
| `sglang:cached_tokens_total` | Counter | Tokens served from the prefix cache |
| `sglang:gen_throughput` | Gauge | Current generation rate (tokens/sec) |
Useful queries:
```promql
# Token generation rate over 5 minutes
rate(sglang:generation_tokens_total[5m])

# Input vs output token ratio
rate(sglang:prompt_tokens_total[5m]) / rate(sglang:generation_tokens_total[5m])
```
### Queue and Running State
| Metric | Type | Description |
|---|---|---|
| `sglang:num_running_reqs` | Gauge | Currently processing requests |
| `sglang:num_queue_reqs` | Gauge | Requests waiting in queue |
| `sglang:num_used_tokens` | Gauge | Tokens in KV cache |
These are your capacity indicators. When `num_queue_reqs` starts climbing, you're approaching saturation.
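Because these gauges are spiky, I prefer smoothed and trending views over raw values. These are generic PromQL patterns rather than anything SGLang-specific; pick windows and thresholds that match your own traffic:

```promql
# Smoothed queue depth (ignores momentary spikes)
avg_over_time(sglang:num_queue_reqs[10m])

# Queue trend: a sustained positive slope means demand is outpacing capacity
deriv(sglang:num_queue_reqs[10m])
```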
### Cache Efficiency
| Metric | Type | Description |
|---|---|---|
| `sglang:cache_hit_rate` | Gauge | Prefix cache hit ratio (0.0-1.0) |
| `sglang:token_usage` | Gauge | KV cache utilization ratio |
SGLang's RadixAttention prefix caching is one of its killer features. A healthy deployment should see cache hit rates of 0.3-0.5 or higher for repetitive workloads (shared system prompts, templated requests).
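If you want a windowed view instead of the instantaneous gauge, you can derive a hit ratio from the token counters. This assumes cached tokens are counted alongside prompt tokens during prefill; check your `/metrics` output to confirm the semantics on your version:

```promql
# Prefix cache hit ratio over the last 5 minutes, derived from counters
rate(sglang:cached_tokens_total[5m]) / rate(sglang:prompt_tokens_total[5m])
```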
### Latency Histograms
This is where it gets interesting. SGLang exposes three critical latency distributions:
#### Time to First Token (TTFT)

```promql
sglang:time_to_first_token_seconds_bucket{le="0.1"}
sglang:time_to_first_token_seconds_bucket{le="0.5"}
sglang:time_to_first_token_seconds_bucket{le="1.0"}
...
```
TTFT is what users “feel” first. Get your P95 with:
```promql
histogram_quantile(0.95, rate(sglang:time_to_first_token_seconds_bucket[5m]))
```
#### End-to-End Latency

```promql
sglang:e2e_request_latency_seconds_bucket{le="5.0"}
sglang:e2e_request_latency_seconds_bucket{le="10.0"}
...
```
Total request duration from arrival to completion.
#### Time Per Output Token (TPOT)

```promql
sglang:time_per_output_token_seconds_bucket{le="0.05"}
sglang:time_per_output_token_seconds_bucket{le="0.1"}
...
```
Also called inter-token latency (ITL). This affects perceived streaming speed.
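The average ITL (see the cheat sheet below) hides tail behavior, and the tail is what makes streaming feel janky, so it's worth plotting a high quantile from the same histogram alongside it:

```promql
# P95 inter-token latency
histogram_quantile(0.95, rate(sglang:time_per_output_token_seconds_bucket[5m]))
```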
## Understanding Phase Labels
Many SGLang metrics include a `phase` label that distinguishes between the two distinct phases of LLM inference:
- `phase="prefill"` - Processing the input prompt (compute-bound)
- `phase="decode"` - Generating output tokens one at a time (memory-bandwidth-bound)
This is critical for debugging performance issues because each phase has different bottlenecks:
```promql
sglang:num_running_reqs{phase="prefill"}  # Requests currently in prefill
sglang:num_running_reqs{phase="decode"}   # Requests currently decoding
```
Why this matters:
| Phase | Bottleneck | Symptom of Issues |
|---|---|---|
| Prefill | GPU compute | High TTFT, long prompt processing |
| Decode | Memory bandwidth | Slow token generation, high TPOT |
Useful queries with phase labels:
```promql
# Prefill vs decode request distribution
sglang:num_running_reqs{phase="prefill"}
  / ignoring(phase) sglang:num_running_reqs{phase="decode"}

# Track prefill-heavy workloads (long prompts)
sum by (phase) (rate(sglang:prompt_tokens_total[5m]))
```
If you see many requests stuck in prefill, you may have long input sequences overwhelming compute. If decode is the bottleneck, consider batching strategies or memory optimization.
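One way to spot prompt-heavy traffic is to approximate the average prompt length per request from metrics we've already covered. Treat this as a rough estimate; it assumes request completions roughly track arrivals over the window:

```promql
# Approximate average input tokens per request
rate(sglang:prompt_tokens_total[5m]) / rate(sglang:e2e_request_latency_seconds_count[5m])
```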
## Router Metrics
If you’re running SGLang Router (sglang_router) for load balancing across multiple workers, you get an additional set of metrics at the router level (default port 29000).
```bash
python -m sglang_router.launch_router \
  --worker-urls http://worker1:8000 http://worker2:8001 \
  --prometheus-host 0.0.0.0 \
  --prometheus-port 29000
```
### Key Router Metrics
| Metric | Type | Description |
|---|---|---|
| `sgl_router_requests_total` | Counter | Requests by route and method |
| `sgl_router_processed_requests_total` | Counter | Requests handled per worker |
| `sgl_router_active_workers` | Gauge | Healthy worker count |
| `sgl_router_running_requests` | Gauge | In-flight requests per worker |
| `sgl_router_cache_hits_total` | Counter | Cache-aware routing hits |
| `sgl_router_cache_misses_total` | Counter | Cache-aware routing misses |
The cache-aware routing metrics are particularly useful—they tell you how effectively the router is directing requests to workers that already have relevant prefixes cached.
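To put that on a dashboard as a single ratio:

```promql
# Fraction of routing decisions that landed on a worker with a warm prefix cache
rate(sgl_router_cache_hits_total[5m])
  / (rate(sgl_router_cache_hits_total[5m]) + rate(sgl_router_cache_misses_total[5m]))
```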
## Production Alerting Rules
Here’s a starter set of alerting rules I recommend:
```yaml
groups:
  - name: sglang_alerts
    rules:
      # Queue backing up
      - alert: SGLangHighQueueDepth
        expr: sglang:num_queue_reqs > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Request queue depth is high ({{ $value }})"
          description: "Consider scaling up workers or reducing request rate"

      # Slow time to first token
      - alert: SGLangHighTTFT
        expr: histogram_quantile(0.95, rate(sglang:time_to_first_token_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 TTFT exceeds 5 seconds"

      # Cache not being utilized
      - alert: SGLangLowCacheHitRate
        expr: sglang:cache_hit_rate < 0.1
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "Prefix cache hit rate below 10%"
          description: "Check if request patterns allow for caching"

      # Worker down
      - alert: SGLangNoActiveWorkers
        expr: sgl_router_active_workers < 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No healthy SGLang workers available"
```
## Dashboard Queries Cheat Sheet
Here are the PromQL queries I use most often:
```promql
# Overall throughput
rate(sglang:generation_tokens_total[5m])

# P50/P95/P99 TTFT
histogram_quantile(0.50, rate(sglang:time_to_first_token_seconds_bucket[5m]))
histogram_quantile(0.95, rate(sglang:time_to_first_token_seconds_bucket[5m]))
histogram_quantile(0.99, rate(sglang:time_to_first_token_seconds_bucket[5m]))

# Average inter-token latency
rate(sglang:time_per_output_token_seconds_sum[5m])
  / rate(sglang:time_per_output_token_seconds_count[5m])

# Requests per second
rate(sglang:e2e_request_latency_seconds_count[5m])

# Cache efficiency
sglang:cache_hit_rate

# Queue pressure
sglang:num_queue_reqs / (sglang:num_running_reqs + 1)

# Router load balance check (per-worker request rate)
rate(sgl_router_processed_requests_total[5m])
```
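To quantify how evenly the router is spreading load, I compare the busiest and least busy workers. This assumes each worker exposes its own series of the processed-requests counter, as the per-worker description above implies; a ratio close to 1 means load is well balanced:

```promql
# Worker load imbalance (busiest worker rate / least busy worker rate)
max(rate(sgl_router_processed_requests_total[5m]))
  / min(rate(sgl_router_processed_requests_total[5m]))
```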
## Quick Setup with Docker Compose
SGLang ships with an example monitoring stack:
```bash
cd examples/monitoring
docker compose up -d
```
This spins up Prometheus and Grafana with a pre-configured dashboard. Access Grafana at `http://localhost:3000` (default credentials: admin/admin).
## Complete Metrics Reference
For quick reference, here’s the complete list:
### Server Engine Metrics
| Metric | Type | Description |
|---|---|---|
| `sglang:prompt_tokens_total` | Counter | Total prefill (input) tokens processed |
| `sglang:generation_tokens_total` | Counter | Total generation (output) tokens produced |
| `sglang:cached_tokens_total` | Counter | Tokens served from the prefix cache |
| `sglang:gen_throughput` | Gauge | Current generation rate (tokens/sec) |
| `sglang:num_running_reqs` | Gauge | Currently processing requests |
| `sglang:num_queue_reqs` | Gauge | Requests waiting in queue |
| `sglang:num_used_tokens` | Gauge | Tokens in KV cache |
| `sglang:cache_hit_rate` | Gauge | Prefix cache hit ratio (0.0-1.0) |
| `sglang:token_usage` | Gauge | KV cache utilization ratio |
| `sglang:time_to_first_token_seconds` | Histogram | Time until first token generated |
| `sglang:e2e_request_latency_seconds` | Histogram | Total request duration |
| `sglang:time_per_output_token_seconds` | Histogram | Inter-token latency (ITL) |
| `sglang:func_latency_seconds` | Histogram | Internal function latencies |
### Router Metrics
| Metric | Type | Description |
|---|---|---|
| `sgl_router_requests_total` | Counter | Requests by route and method |
| `sgl_router_processed_requests_total` | Counter | Requests handled per worker |
| `sgl_router_active_workers` | Gauge | Healthy worker count |
| `sgl_router_running_requests` | Gauge | In-flight requests per worker |
| `sgl_router_worker_health` | Gauge | Worker health status |
| `sgl_router_cache_hits_total` | Counter | Cache-aware routing hits |
| `sgl_router_cache_misses_total` | Counter | Cache-aware routing misses |
| `sgl_router_generate_duration_seconds` | Histogram | Generation request duration |
## Conclusion
Proper observability is what separates a demo from a production deployment. SGLang’s Prometheus metrics give you everything you need to monitor throughput, latencies, queue health, and cache efficiency. Set up your dashboards, configure alerts for queue depth and TTFT, and you’ll catch issues before your users do.