1. What problem does gang scheduling solve for distributed training jobs?
2. In Kueue, what is the relationship between a ClusterQueue and a LocalQueue?
3. What is the purpose of taints and tolerations on GPU nodes?
4. What does Karpenter do differently from the legacy Cluster Autoscaler?
5. What is Dominant Resource Fairness (DRF)?
6. How does Kueue handle job admission when a cluster is at capacity?
7. What is the benefit of bin-packing pods onto fewer GPU nodes?
8. In a Kueue cohort, what happens when one team's ClusterQueue has idle GPU quota?
9. Why are spot instances well-suited for training workloads but less ideal for inference?
10. What is the difference between Volcano's preemption and reclaim actions?
11. What does NVIDIA MIG allow on an A100 GPU?
12. What is the recommended architecture for large-scale GPU clusters using both Kueue and Volcano?
Every pod in Kubernetes can be assigned a PriorityClass, a cluster-scoped object that maps a name to an integer priority. When the scheduler cannot fit a new pod onto any node, it may preempt (evict) lower-priority pods to make room. A typical AI platform uses a tiered scheme:
| Priority Class | Value | Typical Use |
|---|---|---|
| system-critical | 1,000,000 | Kubernetes system components |
| production-inference | 10,000 | Real-time model serving |
| training-high | 1,000 | Scheduled training jobs with SLAs |
| training-standard | 500 | Standard research experiments |
| development | 100 | Interactive notebooks, dev jobs |
With this setup, a newly submitted production inference pod can preempt a running development notebook without disrupting higher-priority work. Research jobs at training-standard cannot block production inference — the scheduler always prefers the higher value.
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-high
value: 1000
globalDefault: false
description: "High-priority training jobs with team SLAs"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: gpt-finetune-v3
spec:
  template:
    spec:
      priorityClassName: training-high
      restartPolicy: Never   # required for a batch/v1 Job pod template
      containers:
      - name: trainer
        image: my-registry/trainer:latest
        resources:
          limits:
            nvidia.com/gpu: "4"
```
GPU nodes are specialized and expensive. Without controls, ordinary CPU workloads can land on GPU nodes and waste hardware. Two mechanisms prevent this:
```shell
# Taint the GPU node so only pods that tolerate the taint can land there
kubectl taint nodes gpu-node-01 nvidia.com/gpu=present:NoSchedule
```

```yaml
# Pod spec fragment: toleration for the GPU taint, plus node affinity
# pinning the pod to a specific GPU model
spec:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.product
            operator: In
            values: ["NVIDIA-A100-SXM4-80GB"]
```
For distributed training across multiple GPUs, physical placement directly impacts performance. Pods sharing an NVLink fabric communicate orders of magnitude faster than pods on separate nodes connected by Ethernet. Pod affinity lets a job request co-location on the same high-bandwidth domain, while topologySpreadConstraints controls how pods spread across failure domains.
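As a sketch, podAffinity can pull a job's workers onto the same high-bandwidth block. The label selector and topology key below are illustrative assumptions; substitute the labels your nodes actually carry:

```yaml
# Pod spec fragment (illustrative): co-locate all pods of one training
# job in the same topology domain. "job-name" and the topology key
# "example.com/nvlink-domain" are assumed labels, not standard ones.
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            job-name: gpt-finetune-v3
        topologyKey: example.com/nvlink-domain
```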
The default Kubernetes scheduler assigns pods as fast as they arrive. In a multi-tenant GPU cluster, a single team can submit thousands of jobs and consume all GPUs immediately. There is no mechanism for fairness, burst borrowing, or holding a job until all resources are simultaneously available. Kueue intercepts Jobs before pod creation, places them in a managed queue, and admits entire workloads only when resources are guaranteed.
| Resource | Scope | Purpose |
|---|---|---|
| ClusterQueue | Cluster-wide | Defines a resource pool with quotas, cohort membership, and sharing rules |
| LocalQueue | Namespaced | Points to a ClusterQueue; teams submit jobs to their LocalQueue |
| Workload | Namespaced | Kueue's internal representation of an admitted job (auto-created) |
Cluster administrators control resource pools through ClusterQueues. Teams interact only with their namespaced LocalQueue. This cleanly separates platform governance from user access.
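A minimal sketch of the two objects, with illustrative names and quota values (the ResourceFlavor named below is assumed to exist):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq
spec:
  namespaceSelector: {}   # accept workloads from any namespace with a matching LocalQueue
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor   # assumed pre-existing ResourceFlavor
      resources:
      - name: "cpu"
        nominalQuota: 64
      - name: "memory"
        nominalQuota: 256Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 16
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
  namespace: team-a
spec:
  clusterQueue: team-a-cq
```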
When multiple ClusterQueues belong to the same cohort, their unused nominal quota becomes a shared pool. Fair sharing uses a scoring system: the queue consuming the least relative to its entitlement receives the highest admission priority. The queue consuming the most is the first target for preemption. This creates a self-balancing system converging toward equitable utilization.
Kueue supports two queueing strategies, StrictFIFO and BestEffortFIFO. When a high-priority workload cannot fit, behavior is governed by the reclaimWithinCohort preemption policy, which takes one of three values:
| Policy | Behavior |
|---|---|
| reclaimWithinCohort: Never | Do not preempt any workload in the cohort |
| reclaimWithinCohort: LowerPriority | Preempt cohort workloads with lower priority |
| reclaimWithinCohort: Any | Preempt any cohort workload regardless of priority |
The LessThanInitialShare policy adds a fairness guard: preemption is only allowed if the preempting queue's share would remain strictly less than the target queue's current share.
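A ClusterQueue fragment showing where cohort membership and preemption rules are declared (the cohort name is illustrative):

```yaml
# Illustrative ClusterQueue fragment: join a cohort and set preemption rules
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq
spec:
  cohort: ml-teams   # unused nominal quota is shared within this cohort
  preemption:
    reclaimWithinCohort: LowerPriority   # reclaim borrowed quota from lower-priority cohort workloads
    withinClusterQueue: LowerPriority    # preempt this queue's own lower-priority workloads
```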
Kueue integrates natively with batch/v1 Job, RayJob, PyTorchJob, TFJob, and JobSet. Each integration wraps the custom resource in a Kueue Workload and manages its spec.suspend field.
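In practice, a job opts in by naming its LocalQueue via the kueue.x-k8s.io/queue-name label; the job name and queue name below are illustrative:

```yaml
# A batch/v1 Job managed by Kueue: it starts suspended, and Kueue
# flips spec.suspend to false once the workload is admitted.
apiVersion: batch/v1
kind: Job
metadata:
  name: resnet-train
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue   # illustrative LocalQueue name
spec:
  suspend: true
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: my-registry/trainer:latest
        resources:
          limits:
            nvidia.com/gpu: "1"
```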
Consider a distributed training job needing 16 GPUs across 4 nodes. The default scheduler may place 12 pods, leaving 4 waiting. The placed pods hold GPUs while waiting for peers — creating a deadlock where everyone waits and nobody makes progress.
Gang scheduling solves this with an all-or-nothing guarantee. A PodGroup is admitted only when all requested resources are simultaneously available. If 16 GPUs are not free at once, the entire job waits without holding any resources.
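With Volcano, the gang size is expressed as minAvailable on a Volcano Job; a sketch for the 16-GPU scenario above (names and image are illustrative):

```yaml
# Illustrative Volcano Job: admitted only when all 16 workers fit at once
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: dist-train
spec:
  schedulerName: volcano
  minAvailable: 16        # gang size: all-or-nothing admission
  tasks:
  - name: worker
    replicas: 16
    template:
      spec:
        restartPolicy: OnFailure
        containers:
        - name: trainer
          image: my-registry/trainer:latest
          resources:
            limits:
              nvidia.com/gpu: "1"
```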
| Plugin | Behavior |
|---|---|
| proportion | Allocates cluster capacity in proportion to queue weights |
| gang | Enforces all-or-nothing PodGroup admission |
| binpack | Packs pods densely onto fewer nodes to leave whole nodes free |
| drf | Dominant Resource Fairness — maximizes fairness across multi-resource requests |
| priority | Respects PriorityClass values when ordering jobs |
| nodeorder | Scores nodes based on resource fit, NUMA locality, and topology |
| Responsibility | Tool |
|---|---|
| Admission control and quota enforcement | Kueue |
| Burst borrowing across tenant boundaries | Kueue (cohorts) |
| Strict gang semantics (all-or-nothing) | Volcano |
| MPI/PyTorch/TensorFlow operator integration | Volcano |
| DRF and multi-resource fairness | Volcano |
Many clusters appear busy while actual GPU compute utilization sits below 30%. NVIDIA DCGM exposes per-GPU metrics (SM utilization, memory utilization, NVLink bandwidth) that flow into Prometheus and Grafana.
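With the DCGM exporter feeding Prometheus, a query along these lines surfaces allocated-but-idle GPUs (the metric name follows the dcgm-exporter convention; verify it against your exporter version):

```promql
# GPUs whose average utilization over the last hour is below 10%
avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]) < 10
```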
| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Oversized requests | GPU allocated, SM util < 10% | Right-size with DCGM data; use MIG |
| Partial allocation deadlock | Jobs pending despite free GPUs | Enable Volcano gang scheduling |
| Single-tenant monopoly | One namespace consumes all GPUs | Kueue ClusterQueue quotas |
| Node fragmentation | Multiple nodes partially used | Enable Volcano binpack plugin |
MIG partitions a single A100 or H100 into isolated slices with dedicated memory and compute. An A100 80GB can split into up to seven 1g.10gb MIG instances, each appearing as an independent GPU to Kubernetes.
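With the NVIDIA device plugin in its mixed strategy, each MIG profile is exposed as its own extended resource name, so a pod can request a slice directly (pod name and image are illustrative):

```yaml
# Illustrative pod requesting one 1g.10gb MIG slice instead of a full GPU
apiVersion: v1
kind: Pod
metadata:
  name: mig-notebook
spec:
  containers:
  - name: notebook
    image: my-registry/notebook:latest
    resources:
      limits:
        nvidia.com/mig-1g.10gb: "1"
```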
GPU spot instances offer up to 70% discounts with the tradeoff that the cloud provider can reclaim them. Training workloads with checkpointing are well-suited; inference serving live traffic is less suitable.
| Factor | Recommendation |
|---|---|
| Instance diversity | Request multiple GPU families (G4dn, G5, P4) |
| AZ spread | Request across all AZs for wider capacity pool |
| Checkpointing | Save model state to durable storage at regular intervals |
| Interruption handling | Use node termination handlers for graceful drain |
| Fallback | Configure on-demand for critical non-checkpointable jobs |
| Capability | Cluster Autoscaler | Karpenter |
|---|---|---|
| Node selection | Fixed node groups | Dynamic, constraint-based |
| Provisioning speed | 3–5 minutes | Under 60 seconds |
| Instance diversity | Per-group definition | Flexible NodePool requirements |
| Consolidation | Basic scale-down only | Active bin-packing with pod migration |
| Spot integration | Via node group configuration | First-class, automatic fallback |
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  template:
    spec:
      nodeClassRef:            # required in karpenter.sh/v1; name is illustrative
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
      - key: "karpenter.k8s.aws/instance-family"
        operator: In
        values: ["g4dn", "g5", "p4d"]
      - key: "karpenter.sh/capacity-type"
        operator: In
        values: ["spot", "on-demand"]
      - key: "topology.kubernetes.io/zone"
        operator: In
        values: ["us-east-1a", "us-east-1b", "us-east-1c"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # v1 name for the former WhenUnderutilized
    consolidateAfter: 30s
```