Chapter 6: Resource Scheduling and Cluster Optimization

Learning Objectives

Pre-Assessment Quiz

1. What problem does gang scheduling solve for distributed training jobs?

It ensures pods are spread across different availability zones
It prevents partial allocation deadlock by requiring all pods to be schedulable before any start
It encrypts communication between training pods
It automatically checkpoints model state during training

2. In Kueue, what is the relationship between a ClusterQueue and a LocalQueue?

They are the same object with different names
A LocalQueue is namespaced and points to a cluster-wide ClusterQueue that defines the resource pool
A ClusterQueue is a child of the LocalQueue
LocalQueues replace ClusterQueues when fair-sharing is enabled

3. What is the purpose of taints and tolerations on GPU nodes?

To increase GPU clock speed for AI workloads
To monitor GPU temperature and power consumption
To prevent non-GPU workloads from scheduling on expensive GPU nodes
To enable NVLink communication between GPUs

4. What does Karpenter do differently from the legacy Cluster Autoscaler?

It provisions nodes from pre-defined node groups only
It reads pending pod requirements and provisions exactly the right node type dynamically
It only works with CPU nodes, not GPU nodes
It replaces the Kubernetes scheduler entirely

5. What is Dominant Resource Fairness (DRF)?

A policy that gives all resources to the highest-priority job
A multi-resource fairness algorithm that equalizes the fraction of each job's most-consumed resource
A mechanism for time-slicing GPUs between pods
A network protocol for distributed training communication

6. How does Kueue handle job admission when a cluster is at capacity?

It immediately rejects the job with an error
It holds the job in a suspended state in the queue until resources are available
It creates new nodes automatically to fit the job
It splits the job into smaller sub-jobs

7. What is the benefit of bin-packing pods onto fewer GPU nodes?

It improves network latency between pods
It leaves entire nodes free that can be scaled down, reducing cost
It enables GPU memory sharing between pods
It disables preemption for packed pods

8. In a Kueue cohort, what happens when one team's ClusterQueue has idle GPU quota?

The idle GPUs are powered down to save energy
Other ClusterQueues in the same cohort can borrow the idle quota
The quota is permanently reassigned to the busiest team
An alert is sent to the cluster administrator

9. Why are spot instances well-suited for training workloads but less ideal for inference?

Spot instances have slower GPUs than on-demand instances
Training can checkpoint and resume after interruption; live inference cannot tolerate sudden termination
Spot instances do not support NVIDIA drivers
Inference workloads use too little GPU to justify spot pricing

10. What is the difference between Volcano's preemption and reclaim actions?

They are identical mechanisms with different names
Preemption evicts lower-priority pods; reclaim returns capacity from queues exceeding their fair share
Preemption is for CPU pods; reclaim is for GPU pods
Reclaim only works with Kueue, not Volcano alone

11. What does NVIDIA MIG allow on an A100 GPU?

Overclocking the GPU beyond its rated speed
Partitioning the GPU into isolated slices with dedicated memory and compute
Connecting multiple GPUs via NVLink into a single virtual GPU
Running GPU workloads without a device driver

12. What is the recommended architecture for large-scale GPU clusters using both Kueue and Volcano?

Use Volcano for everything and disable Kueue
Use Kueue for admission control and quotas; use Volcano for gang scheduling and pod placement
Use Kueue only for inference and Volcano only for training
Alternate between Kueue and Volcano on different days of the week

Section 1: Kubernetes Scheduling for AI

Priority Classes and Preemption

Every pod in Kubernetes can be assigned a PriorityClass, a cluster-scoped object that maps a name to a numerical priority value. When the scheduler cannot fit a new pod onto any node, it may preempt (evict) lower-priority pods to make room. A typical AI platform uses a tiered scheme:

| Priority Class | Value | Typical Use |
| --- | --- | --- |
| system-critical | 1,000,000 | Kubernetes system components |
| production-inference | 10,000 | Real-time model serving |
| training-high | 1,000 | Scheduled training jobs with SLAs |
| training-standard | 500 | Standard research experiments |
| development | 100 | Interactive notebooks, dev jobs |

With this setup, a newly submitted production inference pod can preempt a running development notebook without disrupting higher-priority work. Research jobs at training-standard cannot block production inference — the scheduler always prefers the higher value.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-high
value: 1000
globalDefault: false
description: "High-priority training jobs with team SLAs"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: gpt-finetune-v3
spec:
  template:
    spec:
      priorityClassName: training-high
      containers:
        - name: trainer
          image: my-registry/trainer:latest
          resources:
            limits:
              nvidia.com/gpu: "4"

Node Affinity, Taints, and Tolerations

GPU nodes are specialized and expensive. Without controls, ordinary CPU workloads can land on GPU nodes and waste hardware. Two mechanisms prevent this:

# Taint the GPU node
kubectl taint nodes gpu-node-01 nvidia.com/gpu=present:NoSchedule

# Pod toleration
spec:
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Equal"
      value: "present"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values: ["NVIDIA-A100-SXM4-80GB"]

Topology-Aware Scheduling

For distributed training across multiple GPUs, physical placement directly impacts performance: pods sharing an NVLink fabric communicate orders of magnitude faster than pods on separate nodes connected by Ethernet. Pod affinity lets a job request co-location on the same high-bandwidth domain, while topologySpreadConstraints controls how replicas spread across failure domains.
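A minimal sketch of a co-location request, using pod affinity with a node-level topology key (the job label is illustrative; a rack or NVLink-domain node label would work the same way):

```yaml
# All pods carrying job-name=gpt-finetune-v3 must land on the same node,
# so the workers share that node's NVLink fabric.
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              job-name: gpt-finetune-v3
          topologyKey: kubernetes.io/hostname   # "same node" = same NVLink domain
```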

Key Takeaway

Priority classes, taints and tolerations, and topology-aware placement work together: they keep expensive GPU nodes reserved for GPU workloads and guarantee that production pods can displace lower-priority experiments when capacity runs out.

Section 2: Kueue — Kubernetes-Native Job Queuing

The Problem Kueue Solves

The default Kubernetes scheduler assigns pods as fast as they arrive. In a multi-tenant GPU cluster, a single team can submit thousands of jobs and consume all GPUs immediately. There is no mechanism for fairness, burst borrowing, or holding a job until all resources are simultaneously available. Kueue intercepts Jobs before pod creation, places them in a managed queue, and admits entire workloads only when resources are guaranteed.

ClusterQueue, LocalQueue, and Workloads

| Resource | Scope | Purpose |
| --- | --- | --- |
| ClusterQueue | Cluster-wide | Defines a resource pool with quotas, cohort membership, and sharing rules |
| LocalQueue | Namespaced | Points to a ClusterQueue; teams submit jobs to their LocalQueue |
| Workload | Namespaced | Kueue's internal representation of an admitted job (auto-created) |

Cluster administrators control resource pools through ClusterQueues. Teams interact only with their namespaced LocalQueue. This cleanly separates platform governance from user access.
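As an illustrative sketch of the two objects (names are invented, and it assumes a ResourceFlavor named default-flavor already exists):

```yaml
# Cluster-scoped resource pool, owned by the platform team.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-research
spec:
  namespaceSelector: {}          # which namespaces may submit via LocalQueues
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: "cpu"
              nominalQuota: 64
            - name: "memory"
              nominalQuota: 256Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 8
---
# Namespaced entry point that teams actually submit to.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: lq-research
  namespace: team-research
spec:
  clusterQueue: cq-research      # points at the cluster-wide pool
```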

Kueue Job Queuing — Fair-Share Admission with Borrowing

[Figure: Kueue fair-share admission with borrowing. Two tenants share a 24-GPU pool through the cohort org-cohort. Step 1: the research ClusterQueue (nominal 8 GPUs, borrowing limit 8) admits llm-ft (4 GPUs) and embed (2 GPUs) via lq-research. Step 2: the production ClusterQueue (nominal 16 GPUs, borrowing limit 0) admits serve (8 GPUs) via lq-prod. Step 3: research borrows 4 idle production GPUs to admit rlhf, running 12 GPUs total (8 own + 4 borrowed). Fair-share scores: research 75%, production 50%.]

Fair-Sharing and Borrowing

When multiple ClusterQueues belong to the same cohort, their unused nominal quota becomes a shared pool. Fair sharing uses a scoring system: the queue consuming the least relative to its entitlement receives the highest admission priority. The queue consuming the most is the first target for preemption. This creates a self-balancing system converging toward equitable utilization.
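A hedged sketch of the research queue from the scenario above, joined to a cohort with an explicit borrowing limit (field values are illustrative):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-research
spec:
  cohort: org-cohort              # queues in the same cohort share idle quota
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 8     # guaranteed entitlement
              borrowingLimit: 8   # may borrow up to 8 more idle GPUs from the cohort
```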

Priority-Based Preemption

Kueue supports two queueing strategies: StrictFIFO and BestEffortFIFO. Three preemption policies govern behavior when a high-priority workload cannot fit:

| Policy | Behavior |
| --- | --- |
| reclaimWithinCohort: Never | Do not preempt any workload in the cohort |
| reclaimWithinCohort: LowerPriority | Preempt cohort workloads with lower priority |
| reclaimWithinCohort: Any | Preempt any cohort workload regardless of priority |

The LessThanInitialShare policy adds a fairness guard: preemption is only allowed if the preempting queue's share would remain strictly less than the target queue's current share.
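These policies are set per ClusterQueue. A fragment showing where they live (queue name and quota are illustrative):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-training
spec:
  preemption:
    reclaimWithinCohort: LowerPriority   # reclaim borrowed capacity from lower-priority cohort workloads
    withinClusterQueue: LowerPriority    # preempt lower-priority workloads inside this queue
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 16
```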

Integration with Jobs, RayJobs, and Training Operator

Kueue integrates natively with batch/v1 Job, RayJob, PyTorchJob, TFJob, and JobSet. Each integration wraps the custom resource in a Kueue Workload and manages its spec.suspend field.
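Submitting through Kueue is a one-label change on an ordinary Job; a sketch (queue and image names are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: bert-pretrain
  labels:
    kueue.x-k8s.io/queue-name: lq-research   # route through the team's LocalQueue
spec:
  suspend: true              # Kueue flips this to false on admission
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: my-registry/trainer:latest
          resources:
            limits:
              nvidia.com/gpu: "2"
```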

Key Takeaway

Kueue adds the queueing layer Kubernetes lacks: jobs wait suspended in LocalQueues until their ClusterQueue has quota, and cohorts allow controlled borrowing of idle capacity between teams.

Section 3: Volcano Batch Scheduler

Gang Scheduling and the Deadlock Problem

Consider a distributed training job that needs 16 single-GPU worker pods across 4 nodes. The default scheduler may place 12 of them, leaving 4 pending. The placed pods hold their GPUs while waiting for peers, creating a deadlock in which everyone waits and nobody makes progress.

Gang scheduling solves this with an all-or-nothing guarantee. A PodGroup is admitted only when all requested resources are simultaneously available. If 16 GPUs are not free at once, the entire job waits without holding any resources.
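In Volcano this guarantee is expressed as a PodGroup with minMember; a sketch for the 16-GPU job above (names and queue are illustrative; worker pods reference the group via an annotation and set schedulerName: volcano):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: dist-train-16gpu
  namespace: team-research
spec:
  minMember: 16               # all 16 workers must be schedulable before any start
  minResources:
    nvidia.com/gpu: "16"      # aggregate resources required for admission
  queue: research             # Volcano queue that accounts for this job
```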

Priority Preemption — High-Priority Job Evicts Lower-Priority Pods

[Figure: priority preemption on a 3-node cluster (4 GPUs per node, 12 total). Phase 1: the cluster is fully occupied by dev pods (priority 100) and train pods (priority 500). Phase 2: a production-inference job (priority 10,000) arrives needing 4 GPUs. Phase 3: the scheduler evicts the 4 lowest-priority dev pods from Node 1. Phase 4: the inference pods are placed on Node 1; train pods are unaffected.]

Volcano Plugins

| Plugin | Behavior |
| --- | --- |
| proportion | Allocates cluster capacity in proportion to queue weights |
| gang | Enforces all-or-nothing PodGroup admission |
| binpack | Packs pods densely onto fewer nodes to leave whole nodes free |
| drf | Dominant Resource Fairness — maximizes fairness across multi-resource requests |
| priority | Respects PriorityClass values when ordering jobs |
| nodeorder | Scores nodes based on resource fit, NUMA locality, and topology |
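These plugins are enabled in the scheduler's configuration. A sketch of a typical volcano-scheduler ConfigMap payload, assuming the default action pipeline (tier ordering is one reasonable choice, not the only one):

```yaml
# volcano-scheduler.conf: actions run in order each scheduling cycle;
# plugins in earlier tiers take precedence over later tiers.
actions: "enqueue, allocate, backfill, preempt, reclaim"
tiers:
  - plugins:
      - name: priority      # order jobs by PriorityClass
      - name: gang          # all-or-nothing PodGroup admission
  - plugins:
      - name: drf           # multi-resource fairness
      - name: predicates    # node feasibility checks
      - name: proportion    # weight-proportional queue shares
      - name: nodeorder     # score feasible nodes
      - name: binpack       # prefer dense placement
```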

Preemption vs. Reclaim

Volcano separates two eviction actions. The preempt action evicts lower-priority pods to make room for a higher-priority job within a queue. The reclaim action instead returns capacity from queues that have exceeded their fair share, redistributing it to queues running below their entitlement.

Kueue + Volcano: Better Together

| Responsibility | Tool |
| --- | --- |
| Admission control and quota enforcement | Kueue |
| Burst borrowing across tenant boundaries | Kueue (cohorts) |
| Strict gang semantics (all-or-nothing) | Volcano |
| MPI/PyTorch/TensorFlow operator integration | Volcano |
| DRF and multi-resource fairness | Volcano |
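A hedged sketch of how the two can meet on a single Job: Kueue admits it through the queue-name label, while Volcano gang-schedules its pods. Whether this composes cleanly depends on the versions of both controllers; all names here are illustrative.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-train
  labels:
    kueue.x-k8s.io/queue-name: lq-research    # Kueue: admission + quota
spec:
  suspend: true                               # held until Kueue admits
  parallelism: 4
  completions: 4
  template:
    metadata:
      annotations:
        scheduling.k8s.io/group-name: llm-train-pg   # Volcano PodGroup (gang)
    spec:
      schedulerName: volcano                  # Volcano: pod placement
      restartPolicy: Never
      containers:
        - name: trainer
          image: my-registry/trainer:latest
          resources:
            limits:
              nvidia.com/gpu: "4"
```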

Key Takeaway

Volcano's gang scheduling admits a PodGroup only when all of its pods fit at once, eliminating partial-allocation deadlock, while its plugin pipeline layers on fairness, bin-packing, and priority-aware placement.

Section 4: Cluster Bin-Packing and Cost Optimization

GPU Utilization Monitoring

Many clusters appear busy while actual GPU compute utilization sits below 30%. NVIDIA DCGM exposes per-GPU metrics (SM utilization, memory utilization, NVLink bandwidth) that flow into Prometheus and Grafana.
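As a sketch, a Prometheus alerting rule over the dcgm-exporter metric DCGM_FI_DEV_GPU_UTIL can flag allocated-but-idle GPUs (the threshold, durations, and label names are illustrative):

```yaml
# Fires when a GPU averages under 10% utilization for a sustained period,
# which usually signals an oversized request or an idle notebook.
groups:
  - name: gpu-utilization
    rules:
      - alert: GPUAllocatedButIdle
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "GPU utilization under 10% for over an hour — consider right-sizing or MIG"
```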

| Anti-Pattern | Symptom | Fix |
| --- | --- | --- |
| Oversized requests | GPU allocated, SM util < 10% | Right-size with DCGM data; use MIG |
| Partial allocation deadlock | Jobs pending despite free GPUs | Enable Volcano gang scheduling |
| Single-tenant monopoly | One namespace consumes all GPUs | Kueue ClusterQueue quotas |
| Node fragmentation | Multiple nodes partially used | Enable Volcano binpack plugin |

NVIDIA MIG

MIG partitions a single A100 or H100 into isolated slices with dedicated memory and compute. An A100 80GB can split into up to seven 1g.10gb MIG instances, each appearing as an independent GPU to Kubernetes.
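A pod then requests a slice by profile name; this sketch assumes the NVIDIA GPU Operator's "mixed" MIG strategy, which exposes per-profile resource names:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: small-inference
spec:
  containers:
    - name: server
      image: my-registry/server:latest
      resources:
        limits:
          nvidia.com/mig-1g.10gb: "1"   # one isolated 10 GB slice, not a full A100
```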

Bin-Packing Optimization — Efficient vs. Wasteful GPU Placement

[Figure: spread vs. bin-packed placement on three 4-GPU nodes. Spread (wasteful): pods P1, P2, P3 land one per node, leaving 9 of 12 GPUs idle (25% utilization) with no node eligible for scale-down. Bin-packed (efficient): all three pods share Node 1 (3 of 4 GPUs used, 75% utilization); Nodes 2 and 3 scale down, roughly a 67% cost saving. Karpenter consolidation automatically migrates pods and terminates empty nodes.]

Spot and Preemptible Instances

GPU spot instances offer up to 70% discounts with the tradeoff that the cloud provider can reclaim them. Training workloads with checkpointing are well-suited; inference serving live traffic is less suitable.

| Factor | Recommendation |
| --- | --- |
| Instance diversity | Request multiple GPU families (G4dn, G5, P4) |
| AZ spread | Request across all AZs for a wider capacity pool |
| Checkpointing | Save model state to durable storage at regular intervals |
| Interruption handling | Use node termination handlers for graceful drain |
| Fallback | Configure on-demand for critical non-checkpointable jobs |
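A sketch of a spot-pinned, checkpoint-friendly training Job (the capacity-type label is set on Karpenter-provisioned nodes; the checkpoint path and image are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: resumable-train
spec:
  backoffLimit: 10                 # tolerate repeated spot interruptions via retries
  template:
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: spot   # only run on discounted spot capacity
      restartPolicy: OnFailure
      containers:
        - name: trainer
          image: my-registry/trainer:latest
          args: ["--resume-from", "s3://ckpts/resumable-train/latest"]  # resume from last checkpoint
          resources:
            limits:
              nvidia.com/gpu: "1"
```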

Karpenter vs. Cluster Autoscaler

| Capability | Cluster Autoscaler | Karpenter |
| --- | --- | --- |
| Node selection | Fixed node groups | Dynamic, constraint-based |
| Provisioning speed | 3–5 minutes | Under 60 seconds |
| Instance diversity | Per-group definition | Flexible NodePool requirements |
| Consolidation | Basic scale-down only | Active bin-packing with pod migration |
| Spot integration | Via node group configuration | First-class, automatic fallback |

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  template:
    spec:
      requirements:
        - key: "karpenter.k8s.aws/instance-family"
          operator: In
          values: ["g4dn", "g5", "p4d"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s

Key Takeaway

Dense bin-packing, MIG partitioning, spot capacity with checkpointing, and fast consolidation via Karpenter are the main levers for cutting GPU cluster cost without sacrificing throughput.

Post-Assessment Quiz

1. What problem does gang scheduling solve for distributed training jobs?

It ensures pods are spread across different availability zones
It prevents partial allocation deadlock by requiring all pods to be schedulable before any start
It encrypts communication between training pods
It automatically checkpoints model state during training

2. In Kueue, what is the relationship between a ClusterQueue and a LocalQueue?

They are the same object with different names
A LocalQueue is namespaced and points to a cluster-wide ClusterQueue that defines the resource pool
A ClusterQueue is a child of the LocalQueue
LocalQueues replace ClusterQueues when fair-sharing is enabled

3. What is the purpose of taints and tolerations on GPU nodes?

To increase GPU clock speed for AI workloads
To monitor GPU temperature and power consumption
To prevent non-GPU workloads from scheduling on expensive GPU nodes
To enable NVLink communication between GPUs

4. What does Karpenter do differently from the legacy Cluster Autoscaler?

It provisions nodes from pre-defined node groups only
It reads pending pod requirements and provisions exactly the right node type dynamically
It only works with CPU nodes, not GPU nodes
It replaces the Kubernetes scheduler entirely

5. What is Dominant Resource Fairness (DRF)?

A policy that gives all resources to the highest-priority job
A multi-resource fairness algorithm that equalizes the fraction of each job's most-consumed resource
A mechanism for time-slicing GPUs between pods
A network protocol for distributed training communication

6. How does Kueue handle job admission when a cluster is at capacity?

It immediately rejects the job with an error
It holds the job in a suspended state in the queue until resources are available
It creates new nodes automatically to fit the job
It splits the job into smaller sub-jobs

7. What is the benefit of bin-packing pods onto fewer GPU nodes?

It improves network latency between pods
It leaves entire nodes free that can be scaled down, reducing cost
It enables GPU memory sharing between pods
It disables preemption for packed pods

8. In a Kueue cohort, what happens when one team's ClusterQueue has idle GPU quota?

The idle GPUs are powered down to save energy
Other ClusterQueues in the same cohort can borrow the idle quota
The quota is permanently reassigned to the busiest team
An alert is sent to the cluster administrator

9. Why are spot instances well-suited for training workloads but less ideal for inference?

Spot instances have slower GPUs than on-demand instances
Training can checkpoint and resume after interruption; live inference cannot tolerate sudden termination
Spot instances do not support NVIDIA drivers
Inference workloads use too little GPU to justify spot pricing

10. What is the difference between Volcano's preemption and reclaim actions?

They are identical mechanisms with different names
Preemption evicts lower-priority pods; reclaim returns capacity from queues exceeding their fair share
Preemption is for CPU pods; reclaim is for GPU pods
Reclaim only works with Kueue, not Volcano alone

11. What does NVIDIA MIG allow on an A100 GPU?

Overclocking the GPU beyond its rated speed
Partitioning the GPU into isolated slices with dedicated memory and compute
Connecting multiple GPUs via NVLink into a single virtual GPU
Running GPU workloads without a device driver

12. What is the recommended architecture for large-scale GPU clusters using both Kueue and Volcano?

Use Volcano for everything and disable Kueue
Use Kueue for admission control and quotas; use Volcano for gang scheduling and pod placement
Use Kueue only for inference and Volcano only for training
Alternate between Kueue and Volcano on different days of the week


Answer Explanations