Chapter 6: Resource Scheduling and Cluster Optimization

Learning Objectives

Pre-Assessment Quiz

1. What problem does gang scheduling solve for distributed training jobs?

It ensures pods are spread across different availability zones
It prevents partial allocation deadlock by requiring all pods to be schedulable before any start
It encrypts communication between training pods
It automatically checkpoints model state during training

2. In Kueue, what is the relationship between a ClusterQueue and a LocalQueue?

They are the same object with different names
A LocalQueue is namespaced and points to a cluster-wide ClusterQueue that defines the resource pool
A ClusterQueue is a child of the LocalQueue
LocalQueues replace ClusterQueues when fair-sharing is enabled

3. What is the purpose of taints and tolerations on GPU nodes?

To increase GPU clock speed for AI workloads
To monitor GPU temperature and power consumption
To prevent non-GPU workloads from scheduling on expensive GPU nodes
To enable NVLink communication between GPUs

4. What does Karpenter do differently from the legacy Cluster Autoscaler?

It provisions nodes from pre-defined node groups only
It reads pending pod requirements and provisions exactly the right node type dynamically
It only works with CPU nodes, not GPU nodes
It replaces the Kubernetes scheduler entirely

5. What is Dominant Resource Fairness (DRF)?

A policy that gives all resources to the highest-priority job
A multi-resource fairness algorithm that equalizes the fraction of each job's most-consumed resource
A mechanism for time-slicing GPUs between pods
A network protocol for distributed training communication

6. How does Kueue handle job admission when a cluster is at capacity?

It immediately rejects the job with an error
It holds the job in a suspended state in the queue until resources are available
It creates new nodes automatically to fit the job
It splits the job into smaller sub-jobs

7. What is the benefit of bin-packing pods onto fewer GPU nodes?

It improves network latency between pods
It leaves entire nodes free that can be scaled down, reducing cost
It enables GPU memory sharing between pods
It disables preemption for packed pods

8. In a Kueue cohort, what happens when one team's ClusterQueue has idle GPU quota?

The idle GPUs are powered down to save energy
Other ClusterQueues in the same cohort can borrow the idle quota
The quota is permanently reassigned to the busiest team
An alert is sent to the cluster administrator

9. Why are spot instances well-suited for training workloads but less ideal for inference?

Spot instances have slower GPUs than on-demand instances
Training can checkpoint and resume after interruption; live inference cannot tolerate sudden termination
Spot instances do not support NVIDIA drivers
Inference workloads use too little GPU to justify spot pricing

10. What is the difference between Volcano's preemption and reclaim actions?

They are identical mechanisms with different names
Preemption evicts lower-priority pods; reclaim returns capacity from queues exceeding their fair share
Preemption is for CPU pods; reclaim is for GPU pods
Reclaim only works with Kueue, not Volcano alone

11. What does NVIDIA MIG allow on an A100 GPU?

Overclocking the GPU beyond its rated speed
Partitioning the GPU into isolated slices with dedicated memory and compute
Connecting multiple GPUs via NVLink into a single virtual GPU
Running GPU workloads without a device driver

12. What is the recommended architecture for large-scale GPU clusters using both Kueue and Volcano?

Use Volcano for everything and disable Kueue
Use Kueue for admission control and quotas; use Volcano for gang scheduling and pod placement
Use Kueue only for inference and Volcano only for training
Alternate between Kueue and Volcano on different days of the week

Section 1: Kubernetes Scheduling for AI

Priority Classes and Preemption

Every pod in Kubernetes can be assigned a PriorityClass, a cluster-scoped object that maps a name to a numerical priority value. When the scheduler cannot fit a new pod onto any node, it may preempt (evict) lower-priority pods to make room. A typical AI platform uses a tiered scheme:

| Priority Class | Value | Typical Use |
| --- | --- | --- |
| system-critical | 1,000,000 | Kubernetes system components |
| production-inference | 10,000 | Real-time model serving |
| training-high | 1,000 | Scheduled training jobs with SLAs |
| training-standard | 500 | Standard research experiments |
| development | 100 | Interactive notebooks, dev jobs |

With this setup, a newly submitted production inference pod can preempt a running development notebook without disrupting higher-priority work. Research jobs at training-standard cannot block production inference — the scheduler always prefers the higher value.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-high
value: 1000
globalDefault: false
description: "High-priority training jobs with team SLAs"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: gpt-finetune-v3
spec:
  template:
    spec:
      priorityClassName: training-high
      containers:
        - name: trainer
          image: my-registry/trainer:latest
          resources:
            limits:
              nvidia.com/gpu: "4"

Node Affinity, Taints, and Tolerations

GPU nodes are specialized and expensive. Without controls, ordinary CPU workloads can land on GPU nodes and waste hardware. Two mechanisms prevent this:

# Taint the GPU node
kubectl taint nodes gpu-node-01 nvidia.com/gpu=present:NoSchedule

# Pod toleration
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "present"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values: ["NVIDIA-A100-SXM4-80GB"]

Topology-Aware Scheduling

For distributed training across multiple GPUs, physical placement directly impacts performance: pods sharing an NVLink fabric communicate orders of magnitude faster than pods on separate nodes connected by Ethernet. Pod affinity lets a job request co-location on the same high-bandwidth domain, while topologySpreadConstraints controls how replicas spread across failure domains.
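A minimal sketch of a co-location request, using pod affinity with a node-level topology key (the job label is illustrative; a rack or NVLink-domain node label would work the same way):

```yaml
# All pods carrying job-name=gpt-finetune-v3 must land on the same node,
# so the workers share that node's NVLink fabric.
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              job-name: gpt-finetune-v3
          topologyKey: kubernetes.io/hostname   # "same node" = same NVLink domain
```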

Key Takeaway

Priority classes, taints and tolerations, and topology-aware placement work together: they keep expensive GPU nodes reserved for GPU workloads and guarantee that production pods can displace lower-priority experiments when capacity runs out.

Section 2: Kueue — Kubernetes-Native Job Queuing

The Problem Kueue Solves

The default Kubernetes scheduler assigns pods as fast as they arrive. In a multi-tenant GPU cluster, a single team can submit thousands of jobs and consume all GPUs immediately. There is no mechanism for fairness, burst borrowing, or holding a job until all resources are simultaneously available. Kueue intercepts Jobs before pod creation, places them in a managed queue, and admits entire workloads only when resources are guaranteed.

ClusterQueue, LocalQueue, and Workloads

| Resource | Scope | Purpose |
| --- | --- | --- |
| ClusterQueue | Cluster-wide | Defines a resource pool with quotas, cohort membership, and sharing rules |
| LocalQueue | Namespaced | Points to a ClusterQueue; teams submit jobs to their LocalQueue |
| Workload | Namespaced | Kueue's internal representation of an admitted job (auto-created) |

Cluster administrators control resource pools through ClusterQueues. Teams interact only with their namespaced LocalQueue. This cleanly separates platform governance from user access.
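As an illustrative sketch of the two objects (names are invented, and it assumes a ResourceFlavor named default-flavor already exists):

```yaml
# Cluster-scoped resource pool, owned by the platform team.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-research
spec:
  namespaceSelector: {}          # which namespaces may submit via LocalQueues
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: "cpu"
              nominalQuota: 64
            - name: "memory"
              nominalQuota: 256Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 8
---
# Namespaced entry point that teams actually submit to.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-research
  namespace: team-research
spec:
  clusterQueue: cq-research      # points at the cluster-wide pool
```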

Kueue Job Queuing — Fair-Share Admission with Borrowing

[Figure: Kueue fair-share admission with borrowing. Two tenants share a 24-GPU pool through the cohort org-cohort. Step 1: the research ClusterQueue (nominal 8 GPUs, borrowing limit 8) admits llm-ft (4 GPUs) and embed (2 GPUs) via lq-research. Step 2: the production ClusterQueue (nominal 16 GPUs, borrowing limit 0) admits serve (8 GPUs) via lq-prod. Step 3: research borrows 4 idle production GPUs to admit rlhf, running 12 GPUs total (8 own + 4 borrowed). Fair-share scores: research 75%, production 50%.]

Fair-Sharing and Borrowing

When multiple ClusterQueues belong to the same cohort, their unused nominal quota becomes a shared pool. Fair sharing uses a scoring system: the queue consuming the least relative to its entitlement receives the highest admission priority. The queue consuming the most is the first target for preemption. This creates a self-balancing system converging toward equitable utilization.
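A hedged sketch of the research queue from the scenario above, joined to a cohort with an explicit borrowing limit (field values are illustrative):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-research
spec:
  cohort: org-cohort              # queues in the same cohort share idle quota
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 8     # guaranteed entitlement
              borrowingLimit: 8   # may borrow up to 8 more idle GPUs from the cohort
```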

Priority-Based Preemption

Kueue supports two queueing strategies: StrictFIFO and BestEffortFIFO. Three preemption policies govern behavior when a high-priority workload cannot fit:

| Policy | Behavior |
| --- | --- |
| reclaimWithinCohort: Never | Do not preempt any workload in the cohort |
| reclaimWithinCohort: LowerPriority | Preempt cohort workloads with lower priority |
| reclaimWithinCohort: Any | Preempt any cohort workload regardless of priority |

The LessThanInitialShare policy adds a fairness guard: preemption is only allowed if the preempting queue's share would remain strictly less than the target queue's current share.
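These policies are set per ClusterQueue. A fragment showing where they live (queue name and quota are illustrative):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-training
spec:
  preemption:
    reclaimWithinCohort: LowerPriority   # reclaim borrowed capacity from lower-priority cohort workloads
    withinClusterQueue: LowerPriority    # preempt lower-priority workloads inside this queue
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 16
```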

Integration with Jobs, RayJobs, and Training Operator

Kueue integrates natively with batch/v1 Job, RayJob, PyTorchJob, TFJob, and JobSet. Each integration wraps the custom resource in a Kueue Workload and manages its spec.suspend field.
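Submitting through Kueue is a one-label change on an ordinary Job; a sketch (queue and image names are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: bert-pretrain
  labels:
    kueue.x-k8s.io/queue-name: lq-research   # route through the team's LocalQueue
spec:
  suspend: true              # Kueue flips this to false on admission
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: my-registry/trainer:latest
          resources:
            limits:
              nvidia.com/gpu: "2"
```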

Key Takeaway

Kueue adds the queueing layer Kubernetes lacks: jobs wait suspended in LocalQueues until their ClusterQueue has quota, and cohorts allow controlled borrowing of idle capacity between teams.

Section 3: Volcano Batch Scheduler

Gang Scheduling and the Deadlock Problem

Consider a distributed training job that needs 16 single-GPU worker pods across 4 nodes. The default scheduler may place 12 of them, leaving 4 pending. The placed pods hold their GPUs while waiting for peers, creating a deadlock in which everyone waits and nobody makes progress.

Gang scheduling solves this with an all-or-nothing guarantee. A PodGroup is admitted only when all requested resources are simultaneously available. If 16 GPUs are not free at once, the entire job waits without holding any resources.
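In Volcano this guarantee is expressed as a PodGroup with minMember; a sketch for the 16-GPU job above (names and queue are illustrative; worker pods reference the group via an annotation and set schedulerName: volcano):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: dist-train-16gpu
  namespace: team-research
spec:
  minMember: 16               # all 16 workers must be schedulable before any start
  minResources:
    nvidia.com/gpu: "16"      # aggregate resources required for admission
  queue: research             # Volcano queue that accounts for this job
```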

Priority Preemption — High-Priority Job Evicts Lower-Priority Pods

[Figure: priority preemption on a 3-node cluster (4 GPUs per node, 12 total). Phase 1: the cluster is fully occupied by dev pods (priority 100) and train pods (priority 500). Phase 2: a production-inference job (priority 10,000) arrives needing 4 GPUs. Phase 3: the scheduler evicts the 4 lowest-priority dev pods from Node 1. Phase 4: the inference pods are placed on Node 1; train pods are unaffected.]

Volcano Plugins

| Plugin | Behavior |
| --- | --- |
| proportion | Allocates cluster capacity in proportion to queue weights |
| gang | Enforces all-or-nothing PodGroup admission |
| binpack | Packs pods densely onto fewer nodes to leave whole nodes free |
| drf | Dominant Resource Fairness — maximizes fairness across multi-resource requests |
| priority | Respects PriorityClass values when ordering jobs |
| nodeorder | Scores nodes based on resource fit, NUMA locality, and topology |
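These plugins are enabled in the scheduler's configuration. A sketch of a typical volcano-scheduler ConfigMap payload, assuming the default action pipeline (tier ordering is one reasonable choice, not the only one):

```yaml
# volcano-scheduler.conf: actions run in order each scheduling cycle;
# plugins in earlier tiers take precedence over later tiers.
actions: "enqueue, allocate, backfill, preempt, reclaim"
tiers:
  - plugins:
      - name: priority      # order jobs by PriorityClass
      - name: gang          # all-or-nothing PodGroup admission
  - plugins:
      - name: drf           # multi-resource fairness
      - name: predicates    # node feasibility checks
      - name: proportion    # weight-proportional queue shares
      - name: nodeorder     # score feasible nodes
      - name: binpack       # prefer dense placement
```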

Preemption vs. Reclaim

Volcano separates two eviction actions. The preempt action evicts lower-priority pods to make room for a higher-priority job within a queue. The reclaim action instead returns capacity from queues that have exceeded their fair share, redistributing it to queues running below their entitlement.

Kueue + Volcano: Better Together

| Responsibility | Tool |
| --- | --- |
| Admission control and quota enforcement | Kueue |
| Burst borrowing across tenant boundaries | Kueue (cohorts) |
| Strict gang semantics (all-or-nothing) | Volcano |
| MPI/PyTorch/TensorFlow operator integration | Volcano |
| DRF and multi-resource fairness | Volcano |
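A hedged sketch of how the two can meet on a single Job: Kueue admits it through the queue-name label, while Volcano gang-schedules its pods. Whether this composes cleanly depends on the versions of both controllers; all names here are illustrative.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-train
  labels:
    kueue.x-k8s.io/queue-name: lq-research    # Kueue: admission + quota
spec:
  suspend: true                               # held until Kueue admits
  parallelism: 4
  completions: 4
  template:
    metadata:
      annotations:
        scheduling.k8s.io/group-name: llm-train-pg   # Volcano PodGroup (gang)
    spec:
      schedulerName: volcano                  # Volcano: pod placement
      restartPolicy: Never
      containers:
        - name: trainer
          image: my-registry/trainer:latest
          resources:
            limits:
              nvidia.com/gpu: "4"
```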

Key Takeaway

Volcano's gang scheduling admits a PodGroup only when all of its pods fit at once, eliminating partial-allocation deadlock, while its plugin pipeline layers on fairness, bin-packing, and priority-aware placement.

Section 4: Cluster Bin-Packing and Cost Optimization

GPU Utilization Monitoring

Many clusters appear busy while actual GPU compute utilization sits below 30%. NVIDIA DCGM exposes per-GPU metrics (SM utilization, memory utilization, NVLink bandwidth) that flow into Prometheus and Grafana.
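As a sketch, a Prometheus alerting rule over the dcgm-exporter metric DCGM_FI_DEV_GPU_UTIL can flag allocated-but-idle GPUs (the threshold, durations, and label names are illustrative):

```yaml
# Fires when a GPU averages under 10% utilization for a sustained period,
# which usually signals an oversized request or an idle notebook.
groups:
  - name: gpu-utilization
    rules:
      - alert: GPUAllocatedButIdle
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "GPU utilization under 10% for over an hour — consider right-sizing or MIG"
```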

| Anti-Pattern | Symptom | Fix |
| --- | --- | --- |
| Oversized requests | GPU allocated, SM util < 10% | Right-size with DCGM data; use MIG |
| Partial allocation deadlock | Jobs pending despite free GPUs | Enable Volcano gang scheduling |
| Single-tenant monopoly | One namespace consumes all GPUs | Kueue ClusterQueue quotas |
| Node fragmentation | Multiple nodes partially used | Enable Volcano binpack plugin |

NVIDIA MIG

MIG partitions a single A100 or H100 into isolated slices with dedicated memory and compute. An A100 80GB can split into up to seven 1g.10gb MIG instances, each appearing as an independent GPU to Kubernetes.
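A pod then requests a slice by profile name; this sketch assumes the NVIDIA GPU Operator's "mixed" MIG strategy, which exposes per-profile resource names:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: small-inference
spec:
  containers:
    - name: server
      image: my-registry/server:latest
      resources:
        limits:
          nvidia.com/mig-1g.10gb: "1"   # one isolated 10 GB slice, not a full A100
```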

Bin-Packing Optimization — Efficient vs. Wasteful GPU Placement

[Figure: spread vs. bin-packed placement on three 4-GPU nodes. Spread (wasteful): pods P1, P2, P3 land one per node, leaving 9 of 12 GPUs idle (25% utilization) with no node eligible for scale-down. Bin-packed (efficient): all three pods share Node 1 (3 of 4 GPUs used, 75% utilization); Nodes 2 and 3 scale down, roughly a 67% cost saving. Karpenter consolidation automatically migrates pods and terminates empty nodes.]

Spot and Preemptible Instances

GPU spot instances offer up to 70% discounts with the tradeoff that the cloud provider can reclaim them. Training workloads with checkpointing are well-suited; inference serving live traffic is less suitable.

| Factor | Recommendation |
| --- | --- |
| Instance diversity | Request multiple GPU families (G4dn, G5, P4) |
| AZ spread | Request across all AZs for a wider capacity pool |
| Checkpointing | Save model state to durable storage at regular intervals |
| Interruption handling | Use node termination handlers for graceful drain |
| Fallback | Configure on-demand for critical non-checkpointable jobs |
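A sketch of a spot-pinned, checkpoint-friendly training Job (the capacity-type label is set on Karpenter-provisioned nodes; the checkpoint path and image are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: resumable-train
spec:
  backoffLimit: 10                 # tolerate repeated spot interruptions via retries
  template:
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: spot   # only run on discounted spot capacity
      restartPolicy: OnFailure
      containers:
        - name: trainer
          image: my-registry/trainer:latest
          args: ["--resume-from", "s3://ckpts/resumable-train/latest"]  # resume from last checkpoint
          resources:
            limits:
              nvidia.com/gpu: "1"
```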

Karpenter vs. Cluster Autoscaler

| Capability | Cluster Autoscaler | Karpenter |
| --- | --- | --- |
| Node selection | Fixed node groups | Dynamic, constraint-based |
| Provisioning speed | 3–5 minutes | Under 60 seconds |
| Instance diversity | Per-group definition | Flexible NodePool requirements |
| Consolidation | Basic scale-down only | Active bin-packing with pod migration |
| Spot integration | Via node group configuration | First-class, automatic fallback |

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  template:
    spec:
      requirements:
        - key: "karpenter.k8s.aws/instance-family"
          operator: In
          values: ["g4dn", "g5", "p4d"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s

Key Takeaway

Dense bin-packing, MIG partitioning, spot capacity with checkpointing, and fast consolidation via Karpenter are the main levers for cutting GPU cluster cost without sacrificing throughput.

Post-Assessment Quiz

1. What problem does gang scheduling solve for distributed training jobs?

It ensures pods are spread across different availability zones
It prevents partial allocation deadlock by requiring all pods to be schedulable before any start
It encrypts communication between training pods
It automatically checkpoints model state during training

2. In Kueue, what is the relationship between a ClusterQueue and a LocalQueue?

They are the same object with different names
A LocalQueue is namespaced and points to a cluster-wide ClusterQueue that defines the resource pool
A ClusterQueue is a child of the LocalQueue
LocalQueues replace ClusterQueues when fair-sharing is enabled

3. What is the purpose of taints and tolerations on GPU nodes?

To increase GPU clock speed for AI workloads
To monitor GPU temperature and power consumption
To prevent non-GPU workloads from scheduling on expensive GPU nodes
To enable NVLink communication between GPUs

4. What does Karpenter do differently from the legacy Cluster Autoscaler?

It provisions nodes from pre-defined node groups only
It reads pending pod requirements and provisions exactly the right node type dynamically
It only works with CPU nodes, not GPU nodes
It replaces the Kubernetes scheduler entirely

5. What is Dominant Resource Fairness (DRF)?

A policy that gives all resources to the highest-priority job
A multi-resource fairness algorithm that equalizes the fraction of each job's most-consumed resource
A mechanism for time-slicing GPUs between pods
A network protocol for distributed training communication

6. How does Kueue handle job admission when a cluster is at capacity?

It immediately rejects the job with an error
It holds the job in a suspended state in the queue until resources are available
It creates new nodes automatically to fit the job
It splits the job into smaller sub-jobs

7. What is the benefit of bin-packing pods onto fewer GPU nodes?

It improves network latency between pods
It leaves entire nodes free that can be scaled down, reducing cost
It enables GPU memory sharing between pods
It disables preemption for packed pods

8. In a Kueue cohort, what happens when one team's ClusterQueue has idle GPU quota?

The idle GPUs are powered down to save energy
Other ClusterQueues in the same cohort can borrow the idle quota
The quota is permanently reassigned to the busiest team
An alert is sent to the cluster administrator

9. Why are spot instances well-suited for training workloads but less ideal for inference?

Spot instances have slower GPUs than on-demand instances
Training can checkpoint and resume after interruption; live inference cannot tolerate sudden termination
Spot instances do not support NVIDIA drivers
Inference workloads use too little GPU to justify spot pricing

10. What is the difference between Volcano's preemption and reclaim actions?

They are identical mechanisms with different names
Preemption evicts lower-priority pods; reclaim returns capacity from queues exceeding their fair share
Preemption is for CPU pods; reclaim is for GPU pods
Reclaim only works with Kueue, not Volcano alone

11. What does NVIDIA MIG allow on an A100 GPU?

Overclocking the GPU beyond its rated speed
Partitioning the GPU into isolated slices with dedicated memory and compute
Connecting multiple GPUs via NVLink into a single virtual GPU
Running GPU workloads without a device driver

12. What is the recommended architecture for large-scale GPU clusters using both Kueue and Volcano?

Use Volcano for everything and disable Kueue
Use Kueue for admission control and quotas; use Volcano for gang scheduling and pod placement
Use Kueue only for inference and Volcano only for training
Alternate between Kueue and Volcano on different days of the week


Answer Explanations