SGLang Prometheus Metrics: A Guide for Production Monitoring

If you’re running SGLang in production to serve large language models, proper observability is non-negotiable. SGLang exposes a rich set of Prometheus metrics that give you deep visibility into token throughput, request latencies, queue depths, and cache efficiency. In this post, I’ll walk you through every metric available and show you how to use them effectively for production monitoring.

Why Monitor SGLang?

SGLang has become one of the go-to inference engines for LLM serving, powering deployments across thousands of GPUs. When you’re serving requests at high throughput, you need to know:

  • Are requests queueing up? High queue depth means you need to scale.
  • What’s the time to first token (TTFT)? Users notice slow initial responses.
  • Is the prefix cache working? Good cache hit rates dramatically improve throughput.
  • Are all workers healthy? In distributed setups, worker failures need fast detection.

Let’s dive into the metrics that answer these questions.

Enabling Metrics

First, enable the metrics endpoint when launching your SGLang server:

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000 \
  --enable-metrics

Metrics are then available at http://localhost:30000/metrics in Prometheus exposition format.

Version Note: SGLang v0.5.4+ changed the metric prefix from sglang: to sglang_. If you’re upgrading, update your Grafana dashboards and alerting rules accordingly.
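
If you need dashboards that keep working through the transition, PromQL’s or operator lets a panel fall back between the two names. A minimal sketch, assuming both spellings may be present in your Prometheus during the rollout:

# Matches whichever prefix the scraped server emits
rate(sglang:generation_tokens_total[5m]) or rate(sglang_generation_tokens_total[5m])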

Server Engine Metrics

These metrics come directly from the SGLang inference engine and give you visibility into the core serving performance.

Token Throughput

Metric | Type | Description
sglang:prompt_tokens_total | Counter | Total prefill (input) tokens processed
sglang:generation_tokens_total | Counter | Total generation (output) tokens produced
sglang:cached_tokens_total | Counter | Tokens served from prefix cache
sglang:gen_throughput | Gauge | Current generation rate (tokens/sec)

Useful queries:

# Token generation rate over 5 minutes
rate(sglang:generation_tokens_total[5m])

# Input vs output token ratio
rate(sglang:prompt_tokens_total[5m]) / rate(sglang:generation_tokens_total[5m])

Queue and Running State

Metric | Type | Description
sglang:num_running_reqs | Gauge | Currently processing requests
sglang:num_queue_reqs | Gauge | Requests waiting in queue
sglang:num_used_tokens | Gauge | Tokens in KV cache

These are your capacity indicators. When num_queue_reqs starts climbing, you’re approaching saturation.
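
A raw gauge snapshot can be noisy. To see whether the queue is actually trending upward rather than briefly spiking, a simple sketch is to look at its slope over a window:

# Positive values mean the queue has been growing over the last 10 minutes
deriv(sglang:num_queue_reqs[10m])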

Cache Efficiency

Metric | Type | Description
sglang:cache_hit_rate | Gauge | Prefix cache hit ratio (0.0-1.0)
sglang:token_usage | Gauge | KV cache utilization ratio

SGLang’s RadixAttention prefix caching is one of its killer features. A healthy deployment should see cache hit rates in the 0.3-0.5 range or higher for repetitive workloads (like shared system prompts).
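
The gauge is an instantaneous value; if you want a windowed view instead, you can derive a rough hit ratio from the token counters. This is a sketch that assumes cached tokens are counted as a subset of the prompt tokens:

# Approximate fraction of prefill tokens served from the prefix cache
rate(sglang:cached_tokens_total[5m]) / rate(sglang:prompt_tokens_total[5m])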

Latency Histograms

This is where it gets interesting. SGLang exposes three critical latency distributions:

Time to First Token (TTFT)

sglang:time_to_first_token_seconds_bucket{le="0.1"}
sglang:time_to_first_token_seconds_bucket{le="0.5"}
sglang:time_to_first_token_seconds_bucket{le="1.0"}
...

TTFT is what users “feel” first. Get your P95 with:

histogram_quantile(0.95, rate(sglang:time_to_first_token_seconds_bucket[5m]))

End-to-End Latency

sglang:e2e_request_latency_seconds_bucket{le="5.0"}
sglang:e2e_request_latency_seconds_bucket{le="10.0"}
...

Total request duration from arrival to completion.
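
The same histogram_quantile pattern from TTFT applies here, for example to watch the P95 end-to-end latency:

histogram_quantile(0.95, rate(sglang:e2e_request_latency_seconds_bucket[5m]))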

Time Per Output Token (TPOT)

sglang:time_per_output_token_seconds_bucket{le="0.05"}
sglang:time_per_output_token_seconds_bucket{le="0.1"}
...

Also called inter-token latency (ITL). This affects perceived streaming speed.
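
As with the other histograms, you can pull percentiles from it; a P95 inter-token latency query looks like this:

histogram_quantile(0.95, rate(sglang:time_per_output_token_seconds_bucket[5m]))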

Understanding Phase Labels

Many SGLang metrics include a phase label that distinguishes the two phases of LLM inference:

  • phase="prefill" - Processing the input prompt (compute-bound)
  • phase="decode" - Generating output tokens one at a time (memory-bandwidth-bound)

This is critical for debugging performance issues because each phase has different bottlenecks:

sglang:num_running_reqs{phase="prefill"}  # Requests currently in prefill
sglang:num_running_reqs{phase="decode"}   # Requests currently decoding

Why this matters:

Phase | Bottleneck | Symptom of Issues
Prefill | GPU compute | High TTFT, long prompt processing
Decode | Memory bandwidth | Slow token generation, high TPOT

Useful queries with phase labels:

# Prefill vs decode request distribution
sglang:num_running_reqs{phase="prefill"} / ignoring(phase) sglang:num_running_reqs{phase="decode"}

# Track prefill-heavy workloads (long prompts)
sum by (phase) (rate(sglang:prompt_tokens_total[5m]))

If you see many requests stuck in prefill, you may have long input sequences overwhelming compute. If decode is the bottleneck, consider batching strategies or memory optimization.
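
One compact way to watch this balance on a dashboard is the share of in-flight requests currently in prefill, assuming the phase label is present on this gauge as shown above:

# Fraction of running requests that are in the prefill phase
sum(sglang:num_running_reqs{phase="prefill"}) / sum(sglang:num_running_reqs)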

Router Metrics

If you’re running SGLang Router (sglang_router) for load balancing across multiple workers, you get an additional set of metrics at the router level (default port 29000).

python -m sglang_router.launch_router \
  --worker-urls http://worker1:8000 http://worker2:8001 \
  --prometheus-host 0.0.0.0 \
  --prometheus-port 29000

Key Router Metrics

Metric | Type | Description
sgl_router_requests_total | Counter | Requests by route and method
sgl_router_processed_requests_total | Counter | Requests handled per worker
sgl_router_active_workers | Gauge | Healthy worker count
sgl_router_running_requests | Gauge | In-flight requests per worker
sgl_router_cache_hits_total | Counter | Cache-aware routing hits
sgl_router_cache_misses_total | Counter | Cache-aware routing misses

The cache-aware routing metrics are particularly useful—they tell you how effectively the router is directing requests to workers that already have relevant prefixes cached.
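
A windowed ratio is easier to read on a dashboard than the raw counters; a minimal sketch:

# Cache-aware routing hit ratio over the last 5 minutes
sum(rate(sgl_router_cache_hits_total[5m]))
  / (sum(rate(sgl_router_cache_hits_total[5m])) + sum(rate(sgl_router_cache_misses_total[5m])))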

Production Alerting Rules

Here’s a starter set of alerting rules I recommend:

groups:
  - name: sglang_alerts
    rules:
      # Queue backing up
      - alert: SGLangHighQueueDepth
        expr: sglang:num_queue_reqs > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Request queue depth is high ({{ $value }})"
          description: "Consider scaling up workers or reducing request rate"

      # Slow time to first token
      - alert: SGLangHighTTFT
        expr: histogram_quantile(0.95, rate(sglang:time_to_first_token_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 TTFT exceeds 5 seconds"

      # Cache not being utilized
      - alert: SGLangLowCacheHitRate
        expr: sglang:cache_hit_rate < 0.1
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "Prefix cache hit rate below 10%"
          description: "Check if request patterns allow for caching"

      # Worker down
      - alert: SGLangNoActiveWorkers
        expr: sgl_router_active_workers < 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No healthy SGLang workers available"

Dashboard Queries Cheat Sheet

Here are the PromQL queries I use most often:

# Overall throughput
rate(sglang:generation_tokens_total[5m])

# P50/P95/P99 TTFT
histogram_quantile(0.50, rate(sglang:time_to_first_token_seconds_bucket[5m]))
histogram_quantile(0.95, rate(sglang:time_to_first_token_seconds_bucket[5m]))
histogram_quantile(0.99, rate(sglang:time_to_first_token_seconds_bucket[5m]))

# Average inter-token latency
rate(sglang:time_per_output_token_seconds_sum[5m])
  / rate(sglang:time_per_output_token_seconds_count[5m])

# Requests per second
rate(sglang:e2e_request_latency_seconds_count[5m])

# Cache efficiency
sglang:cache_hit_rate

# Queue pressure
sglang:num_queue_reqs / (sglang:num_running_reqs + 1)

# Router load balance check
rate(sgl_router_processed_requests_total[5m])

Quick Setup with Docker Compose

SGLang ships with an example monitoring stack:

cd examples/monitoring
docker compose up -d

This spins up Prometheus and Grafana with a pre-configured dashboard. Access Grafana at http://localhost:3000 (default: admin/admin).

Complete Metrics Reference

For quick reference, here’s the complete list:

Server Engine Metrics

Metric | Type | Description
sglang:prompt_tokens_total | Counter | Total prefill (input) tokens processed
sglang:generation_tokens_total | Counter | Total generation (output) tokens produced
sglang:cached_tokens_total | Counter | Tokens served from prefix cache
sglang:gen_throughput | Gauge | Current generation rate (tokens/sec)
sglang:num_running_reqs | Gauge | Currently processing requests
sglang:num_queue_reqs | Gauge | Requests waiting in queue
sglang:num_used_tokens | Gauge | Tokens in KV cache
sglang:cache_hit_rate | Gauge | Prefix cache hit ratio (0.0-1.0)
sglang:token_usage | Gauge | KV cache utilization ratio
sglang:time_to_first_token_seconds | Histogram | Time until first token generated
sglang:e2e_request_latency_seconds | Histogram | Total request duration
sglang:time_per_output_token_seconds | Histogram | Inter-token latency (ITL)
sglang:func_latency_seconds | Histogram | Internal function latencies

Router Metrics

Metric | Type | Description
sgl_router_requests_total | Counter | Requests by route and method
sgl_router_processed_requests_total | Counter | Requests handled per worker
sgl_router_active_workers | Gauge | Healthy worker count
sgl_router_running_requests | Gauge | In-flight requests per worker
sgl_router_worker_health | Gauge | Worker health status
sgl_router_cache_hits_total | Counter | Cache-aware routing hits
sgl_router_cache_misses_total | Counter | Cache-aware routing misses
sgl_router_generate_duration_seconds | Histogram | Generation request duration

Conclusion

Proper observability is what separates a demo from a production deployment. SGLang’s Prometheus metrics give you everything you need to monitor throughput, latencies, queue health, and cache efficiency. Set up your dashboards, configure alerts for queue depth and TTFT, and you’ll catch issues before your users do.
