If you’re deploying large language models on Kubernetes with tensor parallelism across multiple GPUs or nodes, you’ve probably hit this problem: some pods in your group get scheduled, others don’t, and the whole deployment hangs. NCCL initialization never completes, ranks don’t sync, and you’re left debugging deadlocks on expensive GPU nodes. This is the exact problem gang scheduling solves.
What Is Gang Scheduling?
Gang scheduling ensures that a group of pods is scheduled on an all-or-nothing basis. If the cluster cannot accommodate the entire group (or a defined minimum number of pods), none of the pods are bound to a node. They all wait until resources are available for the full group.
In a busy or fragmented cluster, the default Kubernetes scheduler will happily place 5 out of 8 pods and leave the remaining 3 pending. For most workloads, that’s fine—each pod operates independently. But for distributed LLM inference with tensor parallelism, partial scheduling is a recipe for failure.
Why LLM Inference Needs Gang Scheduling
The core issue is tensor parallelism (TP). When you shard a model like Llama 3 70B across 8 GPUs (TP=8), all 8 processes must start together to initialize the NCCL communication ring. If only 6 of 8 pods are running:
- NCCL initialization blocks waiting for all ranks to join
- The running pods hold GPU memory but do no useful work
- In a fragmented cluster, the remaining 2 pods may never get scheduled because other workloads claim the resources first
- You end up with a deadlock: the running pods won’t release resources until they complete, but they can’t complete without the missing pods
This gets worse with multi-node deployments where you need, say, 4 nodes with 8 GPUs each for TP=8 across 32 GPUs. Without gang scheduling, partial placement across nodes leads to the same hanging behavior, but now with more expensive resources sitting idle.
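For concreteness, here's a minimal sketch of the kind of deployment the default scheduler handles badly (names, image, and label values are illustrative, not from any real setup): a StatefulSet whose 8 replicas each request one GPU. Nothing ties the replicas together, so the scheduler binds however many fit and leaves the rest Pending.

```yaml
# Illustrative only: 8 independent replicas, each needing one GPU.
# The default scheduler treats each pod separately -- it may bind
# 6 and leave 2 Pending, and NCCL init in the bound pods will hang.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: llm-tp8              # hypothetical name
spec:
  replicas: 8                # TP=8: all ranks must start together
  serviceName: llm-tp8
  selector:
    matchLabels:
      app: llm-tp8
  template:
    metadata:
      labels:
        app: llm-tp8
    spec:
      containers:
        - name: worker
          image: example.com/llm-server:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
```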
Gang Scheduling Solutions in Kubernetes
Before Kubernetes 1.35 introduced native gang scheduling support via the Workload API (currently in alpha), the ecosystem relied on external schedulers and controllers.
Volcano
Volcano is the OG gang scheduler for Kubernetes. It’s been around for years and is battle-tested in large AI clusters.
The approach: create a PodGroup CRD with minMember set to your group size, annotate your pods with Volcano-specific keys, and set schedulerName: volcano.
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: llm-inference-group
spec:
  minMember: 8  # TP size
  queue: default
```
Then on each pod:
```yaml
metadata:
  annotations:
    scheduling.volcano.sh/group-name: llm-inference-group
spec:
  schedulerName: volcano
```
Volcano won’t schedule any pod in the group until all 8 can be placed. It works well, but requires running a separate scheduler alongside the default one.
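If you'd rather not manage PodGroups by hand, Volcano's own Job CRD creates the group implicitly via `minAvailable`. A sketch under the same TP=8 assumption (the job name and container image are placeholders):

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: llm-inference        # hypothetical name
spec:
  minAvailable: 8            # gang size: schedule all 8 pods or none
  schedulerName: volcano
  queue: default
  tasks:
    - replicas: 8
      name: worker
      template:
        spec:
          containers:
            - name: worker
              image: example.com/llm-server:latest   # placeholder
              resources:
                limits:
                  nvidia.com/gpu: 1
          restartPolicy: Never
```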
Koordinator
Koordinator takes a similar approach with its own gang scheduling annotations on pods. It provides broader co-scheduling capabilities and integrates with its own resource management system.
Kueue
Kueue is the newer approach and the one gaining the most momentum. Rather than replacing the scheduler, Kueue acts as a job queueing system that gates admission. It uses `podSets` within `Workload` custom resources to define groups that must be admitted together.
The key advantage: Kueue works with the default scheduler. It holds back the entire workload until the cluster has capacity for all pod sets, then releases them for scheduling simultaneously. This is cleaner than running a separate scheduler and integrates well with cluster autoscaling and quota management.
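In practice, `Workload` objects are usually generated by Kueue's controllers from Jobs or other workload kinds rather than written by hand, but the shape is worth seeing. A sketch of a `Workload` with a single pod set of 8 (the names, queue, and image are illustrative):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: llm-inference        # hypothetical; usually controller-generated
spec:
  queueName: default-queue   # LocalQueue to submit to
  podSets:
    - name: main
      count: 8               # all 8 pods admitted together, or none
      template:
        spec:
          containers:
            - name: worker
              image: example.com/llm-server:latest   # placeholder
              resources:
                requests:
                  nvidia.com/gpu: 1
```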
OME: Gang Scheduling for Model Serving
The real eye-opener for me was OME (Open Model Engine)—a Kubernetes operator from the SGLang team (blog post). OME is purpose-built for LLM inference and takes a model-driven approach: you declare the model you want to serve, and the operator figures out the optimal deployment topology.
OME has first-class gang scheduling via Kueue integration. When you define an InferenceService (OME’s main custom resource), the controller:
- Determines the deployment mode automatically—single-node, multi-node TP, or PD disaggregated (prefill-decode separation)
- Creates Kueue Workload objects with `podSets` where `minCount` matches your TP/group size
- Ensures all pods in a group start together or wait until resources are available
No manual PodGroup CRDs. No scheduler annotations. The gang scheduling semantics are derived from the model’s requirements.
Prefill-Decode Disaggregation
This is where OME’s gang scheduling really shines. Prefill-decode (PD) disaggregation is one of SGLang’s key architectural patterns, and OME makes it a first-class deployment mode.
The idea: prefill and decode have fundamentally different resource profiles.
- Prefill (engine) is compute-bound—it processes the full input prompt in one forward pass. You want higher TP for bursty compute, optimizing for fast time to first token (TTFT) on long prompts.
- Decode (decoder) is memory-bandwidth-bound—it generates tokens one at a time autoregressively. You want lower TP per replica but more replicas for concurrent generation throughput (TPS).
OME deploys these as separate components that scale independently:
- `engine`: handles prefill, often with higher TP
- `decoder`: handles decode, with horizontal scaling for concurrency
- `router` (optional): cache-aware load balancing with KV cache handoff via RDMA
Each component’s pods form a gang-scheduled group. A prefill group with TP=4 won’t partially boot—all 4 pods start together or none do. Same for decode groups. This prevents the cascading failure where a partially started engine component leaves the router returning 5xx errors for endpoints that never came up.
Example InferenceService
Here’s what a PD disaggregated deployment looks like with OME:
```yaml
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: my-llama-chat
spec:
  model:
    name: llama-3-70b-instruct
  engine:   # prefill component
    minReplicas: 2
    maxReplicas: 10
  decoder:  # decode component
    minReplicas: 4
    maxReplicas: 20
  router:
    minReplicas: 2
```
From this, OME:
- Selects the appropriate SGLang PD serving runtime
- Configures TP sizes based on the model’s requirements
- Creates Kueue workloads with gang scheduling constraints per component
- Injects node affinities and RDMA configuration
- Manages the lifecycle as replicas scale up and down
You scale engine replicas from 2 to 10 and decoder replicas from 4 to 20 independently, and Kueue ensures each new replica group is gang-scheduled correctly. During scale-up or spot instance reclamation, partial groups are held back rather than allowed to deadlock.
When You Need This
Gang scheduling becomes essential when:
- TP > 1: Any tensor-parallel deployment where NCCL requires all ranks present at initialization
- Multi-node inference: Large models spanning multiple nodes where partial placement wastes GPU resources across machines
- PD disaggregation: Separate prefill and decode components that each need their full pod group running
- Spot/preemptible instances: Where nodes can disappear and replacement pods need to be gang-scheduled with survivors
- Busy multi-tenant clusters: Where resource fragmentation makes partial scheduling likely
Without gang enforcement in these scenarios, you’ll see constant initialization failures, wasted GPU hours on pods that can’t do useful work, and cascading errors through your serving stack.