Chapter 10: Production Patterns and Scaling AI Platforms

Learning Objectives

Pre-Study Assessment

1. What is the primary limitation of namespace-based isolation compared to virtual clusters?

Namespaces cannot enforce ResourceQuotas
Custom Resource Definitions (CRDs) are cluster-scoped and cannot be isolated per namespace
Namespaces do not support RBAC
Namespaces cannot be used with NetworkPolicies

2. In Kueue, what happens to idle quota allocated to one team when another team needs resources?

It is permanently lost until the next billing cycle
It stays reserved and cannot be used by others
Unused quota can be lent to neighboring teams via borrowing limits
It is automatically deleted by the garbage collector

3. What is the primary advantage of the burst-to-cloud pattern?

It eliminates the need for on-premises hardware entirely
On-prem handles steady-state workloads while cloud absorbs peak demand spikes
It requires no network connection between clusters
It automatically migrates all data to the cloud

4. What does a PodDisruptionBudget (PDB) protect against?

Hardware failures and OOM kills
Voluntary disruptions such as node drains during maintenance
Network partitions between pods
Container image pull failures

5. How does DRA (Dynamic Resource Allocation) improve upon traditional device plugins?

It removes the need for GPU drivers entirely
It allows attribute-based requests like memory size and NVLink connectivity instead of simple counts
It only supports NVIDIA GPUs
It replaces the Kubernetes scheduler with a new component

6. What is the best mitigation for losing training progress when a node is drained?

Running training jobs in privileged mode
Using only spot instances
Frequent checkpointing combined with SIGTERM handlers that save state before exit
Disabling node drains entirely

7. What problem does LeaderWorkerSet solve?

Autoscaling inference replicas based on request rate
Managing tightly coupled leader/worker process groups for distributed training and inference
Encrypting network traffic between pods
Scheduling CronJobs for periodic model retraining

8. In a hybrid training/serving architecture, what serves as the synchronization bridge?

A shared NFS mount between clusters
A versioned model registry backed by object storage
Direct etcd replication between clusters
A message queue like Kafka

9. What is the purpose of chaos engineering for AI infrastructure?

To stress-test GPUs until they fail permanently
To deliberately inject failures and discover weaknesses before they manifest in production
To benchmark maximum throughput of inference endpoints
To test whether models produce correct predictions

10. Which DRA API object advertises available hardware on each node?

DeviceClass
ResourceClaim
ResourceSlice
ResourceQuota

11. What is the "noisy neighbor" problem in multi-tenancy?

When a DNS misconfiguration causes pods to resolve the wrong service
When one tenant's resource-intensive workload degrades performance for others sharing the same infrastructure
When log output from one pod pollutes another pod's stdout
When two pods are scheduled on the same node by mistake

12. In a tiered disaster recovery strategy, what does Tier 2 (model artifact replication) typically achieve for RPO?

Zero RPO (synchronous replication)
Minutes (asynchronous cross-region replication)
Days (weekly manual backups)
Hours (nightly snapshots)

Section 1: Multi-Tenant AI Platform Design

The Tenant Problem

Imagine a university research computing center. Every faculty lab needs GPU time, every PhD student needs to run experiments, and everyone believes their deadline is the most important. Without structure, the researcher with the most aggressive scripts wins -- and everyone else waits. This is the noisy neighbor problem, the central challenge of multi-tenancy.

Multi-tenancy on Kubernetes exists on a spectrum. Soft multi-tenancy assumes cooperative tenants within the same organization; the goal is fairness. Hard multi-tenancy treats tenants as potentially hostile, guarding against data exfiltration, privilege escalation, and denial-of-service. Most enterprise AI platforms fall between these poles.

Namespace-Based Isolation vs. Virtual Clusters

Namespace-based isolation is the Kubernetes default. Each team gets one or more namespaces with RBAC Roles, NetworkPolicies, ResourceQuotas, and LimitRanges scoped to them. It is operationally simple and works well when teams share a common API version.
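As a sketch of how these primitives combine, a team namespace might carry a quota like the following (the namespace name and numbers are illustrative, and `requests.nvidia.com/gpu` assumes the NVIDIA device plugin is installed):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-cv-quota
  namespace: team-cv               # illustrative team namespace
spec:
  hard:
    requests.cpu: "200"
    requests.memory: 800Gi
    requests.nvidia.com/gpu: "8"   # cap the team at 8 GPUs in aggregate
    pods: "100"
```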

The key limitation: namespaces are a logical boundary, not a physical one. CRDs are cluster-scoped, so if team A installs a CRD at v1alpha1 and team B needs v1beta1, there is an irreconcilable conflict.

Virtual clusters (e.g., vCluster) solve this by giving each tenant a fully isolated Kubernetes API server running as pods inside the host cluster. Tenants can install arbitrary CRDs without affecting each other, while the host cluster still owns compute resources.

Dimension              | Namespace Isolation            | Virtual Clusters
Operational complexity | Low                            | Medium
CRD isolation          | No (cluster-scoped)            | Yes (per-vcluster)
Control plane overhead | None                           | One vcluster pod set per tenant
Security boundary      | Logical                        | Near-physical
Recommended for        | Internal teams, shared tooling | External tenants, CRD conflicts

Resource Governance with Kueue and Quotas

Isolation defines who can access resources; governance defines how much. Kubernetes provides two native primitives: ResourceQuota, which caps a namespace's aggregate consumption of CPU, memory, and extended resources such as nvidia.com/gpu, and LimitRange, which sets default and maximum requests and limits for individual pods and containers.

For sophisticated batch scheduling, Kueue provides ClusterQueue and LocalQueue objects. A ClusterQueue defines a pool with nominal quotas and borrowing limits; teams draw from their LocalQueue, and unused quota flows to whoever needs it. For hardware-level isolation, NVIDIA MIG partitions physical A100/H100 GPUs into isolated instances with dedicated memory and compute slices.
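A minimal sketch of this Kueue setup, assuming a single ResourceFlavor named default-flavor already exists and a cohort that enables borrowing (all names and numbers are illustrative):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-cv-cq
spec:
  cohort: ai-platform              # queues in the same cohort can borrow from each other
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 8            # guaranteed share
        borrowingLimit: 8          # may borrow up to 8 more idle GPUs from the cohort
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-cv-queue
  namespace: team-cv
spec:
  clusterQueue: team-cv-cq
```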

Self-Service Interfaces and Platform Engineering

Effective AI platforms provide self-service interfaces so data scientists can provision compute without filing tickets.

Platform engineering patterns include golden paths (pre-configured templates for common job types), paved roads with guardrails (safe defaults with escape hatches), and chargeback/showback (cost transparency via OpenCost or Kubecost).

Figure 10.1 — Multi-Tenant AI Platform with Kueue Governance
Diagram summary: three team namespaces (team-cv, team-nlp, team-recsys) submit training, fine-tuning, and serving jobs to per-team LocalQueues, which feed a single ClusterQueue (32 nominal GPUs, 48 max-burst, fair-share with borrowing) scheduling onto a shared 32-GPU node pool; idle GPUs remain borrowable across teams.

Key Takeaway

Section 2: Multi-Cluster and Hybrid Architectures

Why One Cluster Is Not Enough

A single Kubernetes cluster has fundamental limitations: blast radius (a control plane outage takes down everything), geographic constraints (data sovereignty), cost optimization (on-prem is cheaper for steady-state but wasteful for peaks), and technology heterogeneity (training needs A100s, serving needs global load balancing).

Training On-Prem, Serving in Cloud

A common pattern separates training and serving lifecycles. Training runs on-premises on owned hardware with InfiniBand/RoCE networking. Trained model artifacts are pushed to a model registry (MLflow, W&B, or OCI registry). Cloud deployments detect new versions and roll them out with auto-scaling and global load balancing.
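On the serving side, deployments typically pin the exact artifact version they serve, so a rollout or rollback is just a manifest change. A minimal sketch, in which the image tag, artifact URI, and MODEL_URI environment variable are all illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-serving
  template:
    metadata:
      labels:
        app: llm-serving
    spec:
      containers:
      - name: server
        image: ghcr.io/example/llm-server:1.4.0
        env:
        - name: MODEL_URI                  # immutable, versioned artifact from the registry
          value: s3://models/llm/v12/
```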

Burst-to-Cloud for Peak Training Demand

Like a town's water supply drawing from a regional reservoir during drought, on-premises clusters handle baseline demand while cloud clusters absorb spikes. Implementation typically involves a federation or multi-cluster scheduling layer (e.g., Liqo or OCM) that overflows queued work to elastic cloud nodes, datasets pre-replicated to object storage so burst jobs avoid a WAN transfer, and spot instances to keep burst capacity affordable.

Multi-Cluster Federation for GPU Pools

At scale, federation creates a logical GPU pool spanning physical boundaries. A 1,000-GPU training job can be placed across clusters transparently. Key challenges include consistent networking (WAN degrades tight communication patterns), consistent software environments (driver/CUDA version alignment), and unified observability.

Cross-Cluster Synchronization

Data Type               | Direction                    | Tooling
Training datasets       | On-prem to Cloud             | Rclone, AWS DataSync, object storage replication
Model checkpoints       | Training cluster to Registry | MLflow, W&B, OCI artifact push
Serving model artifacts | Registry to Serving cluster  | CD pipeline, image pull, PVC snapshot
Cluster state backups   | All clusters to Remote store | Velero, Kasten K10

The key insight: model artifacts are immutable and versioned, making cross-cluster synchronization straightforward compared to mutable databases.

Figure 10.2 — Burst-to-Cloud: On-Prem Overflow to Cloud GPU Nodes
Diagram summary: an on-premises cluster (8-GPU pool, InfiniBand) running steady-state workloads exceeds capacity, and queued jobs (a 4-GPU hyperparameter sweep, a 2-GPU eval run, a 2-GPU fine-tune) burst over WAN/VPN to an elastic cloud GPU pool coordinated by a federation controller (Liqo/OCM) with cost-aware scheduling; spot instances yield roughly 60-70% savings versus on-demand, and a dataset pre-replicated to S3 avoids a WAN transfer per job.

Key Takeaway

Section 3: Production Hardening and Reliability

Pod Disruption Budgets for Inference Services

When nodes are drained for maintenance, Kubernetes evicts pods. For high-traffic inference services with strict latency SLAs, evicting all replicas simultaneously is catastrophic. PodDisruptionBudgets (PDBs) constrain how many pods can be voluntarily disrupted at once:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-pdb
spec:
  minAvailable: 2          # At least 2 replicas must stay up
  selector:
    matchLabels:
      app: llm-inference

PDBs only constrain voluntary disruptions (node drains, cluster upgrades). Involuntary disruptions (hardware failure, OOM kills) bypass PDBs and require replica counts and topology spread constraints.
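The topology-spread half of that defense can be sketched as a pod-template fragment that prevents all replicas from stacking on one node (it reuses the app: llm-inference label from the PDB example; the fragment belongs inside the Deployment's pod spec):

```yaml
# Pod template fragment: spread llm-inference replicas across nodes so a
# single hardware failure cannot take out every replica at once.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: llm-inference
```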

Graceful Shutdown for Long-Running Training

Training jobs run for hours or days, so a single node drain can erase substantial progress. Production hardening strategies include frequent checkpointing to durable storage, SIGTERM handlers that save state before exit, and terminationGracePeriodSeconds values long enough for a final checkpoint to complete.
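As a pod-level sketch of graceful shutdown, assuming the training process itself traps SIGTERM and writes a final checkpoint (the image name and grace period value are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  terminationGracePeriodSeconds: 600   # up to 10 minutes between SIGTERM and SIGKILL
  containers:
  - name: trainer
    image: ghcr.io/example/trainer:latest
    # The training loop is assumed to handle SIGTERM by flushing a checkpoint
    # to durable storage (e.g., object storage) before exiting.
```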

Disaster Recovery for Model Registries and Pipelines

DR Tier                 | What It Protects             | Tooling                    | Typical RPO
Tier 1: Cluster state   | YAML manifests, PVCs         | Velero, Kasten K10         | Hours
Tier 2: Model artifacts | Trained weights, checkpoints | S3 CRR, GCS multi-region   | Minutes
Tier 3: Active-passive  | Full platform availability   | OCM, Karmada, DNS failover | Minutes

Chaos Engineering for AI Infrastructure

Like fire drills, chaos engineering discovers weaknesses before they manifest in production. Relevant experiments for AI infrastructure include killing GPU nodes mid-training to verify checkpoint recovery, evicting inference replicas to confirm PodDisruptionBudgets hold under load, and cutting access to the model registry to test deployment fallbacks.

Tools like Chaos Mesh and LitmusChaos provide native Kubernetes fault injection experiments.
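For example, a Chaos Mesh experiment that kills a single inference replica to verify that the PDB and replica spread hold up (the namespace and labels are illustrative):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-one-inference-replica
spec:
  action: pod-kill
  mode: one                    # kill exactly one matching pod
  selector:
    namespaces: ["serving"]
    labelSelectors:
      app: llm-inference
```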

Key Takeaway

Section 4: The Future of AI on Kubernetes

Dynamic Resource Allocation (DRA) for GPUs

The traditional resources.limits: nvidia.com/gpu: 1 is a blunt instrument. The scheduler knows only that a pod needs one GPU -- it has no visibility into GPU memory, NVLink connectivity, or PCIe topology. DRA (GA in Kubernetes 1.34) replaces device plugins with attribute-based scheduling:

DRA Object    | Role                                                     | Created By
ResourceSlice | Advertises hardware on each node with device attributes  | DRA driver / node agent
DeviceClass   | Defines a category of requestable devices                | Cluster admin / driver
ResourceClaim | Workload's request for specific devices                  | User / pipeline

apiVersion: resource.k8s.io/v1   # DRA API is GA as of Kubernetes 1.34
kind: ResourceClaim
metadata:
  name: training-gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: nvidia-gpu
        selectors:
        - cel:
            # Match only devices advertising more than 40Gi of memory; the
            # capacity domain ("nvidia.com") depends on the installed DRA driver
            expression: device.capacity["nvidia.com"].memory.isGreaterThan(quantity("40Gi"))

DRA also supports topology-aware co-scheduling: a workload can request a GPU and a NIC on the same PCIe Root Complex, eliminating manual affinity rules.
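The deviceClassName a claim references must exist as a DeviceClass installed by an administrator or driver. A minimal sketch, in which the driver name is illustrative:

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: nvidia-gpu
spec:
  selectors:
  - cel:
      # Match devices published by the (illustrative) NVIDIA DRA driver
      expression: device.driver == "gpu.nvidia.com"
```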

LeaderWorkerSet and JobSet APIs

JobSet manages groups of related Job objects as a single unit with shared failure policies. LeaderWorkerSet targets LLM training/inference where one leader coordinates many workers, replacing the awkward headless-Service-plus-StatefulSet workaround.
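A minimal LeaderWorkerSet sketch, assuming the lws controller is installed (the image, replica count, and group size are illustrative; the leader template defaults to the worker template when omitted):

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llm-serve
spec:
  replicas: 2                # two independent leader+worker groups
  leaderWorkerTemplate:
    size: 4                  # each group: 1 leader + 3 workers
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: ghcr.io/example/vllm-worker:latest
```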

Emerging Hardware on Kubernetes

Hardware            | Primary Use               | K8s Integration
TPU v4/v5 (Google)  | Large-scale training      | GKE TPU node pools, google.com/tpu
AWS Trainium        | Cost-efficient training   | EKS Neuron device plugin
AWS Inferentia      | High-throughput inference | EKS Neuron device plugin
Intel Gaudi 2/3     | Training, inference       | Gaudi device plugin
AMD Instinct MI300X | Training, LLM inference   | ROCm device plugin

DRA's attribute-based model is architecturally suited to this diversity: a workload requesting "GPU with 80 GB memory and NVLink" can match H100, MI300X, or future hardware without changing job definitions. This enables true hardware abstraction for AI platforms.

Figure 10.3 — DRA vs Device Plugin: Improved Resource Allocation Lifecycle
Diagram summary: under the old device-plugin model, a pod spec requesting nvidia.com/gpu: 1 gives the scheduler only a count, so a random GPU on some node is assigned -- no memory visibility, no topology awareness, and manual node selectors. Under DRA, a ResourceClaim (memory > 40Gi, NVLink) is matched against ResourceSlices by an attribute-aware, topology-aware scheduler, producing an optimal binding such as an H100 80Gi with NVLink on the same PCIe root complex.

Key Takeaway

Post-Study Assessment

1. What is the primary limitation of namespace-based isolation compared to virtual clusters?

Namespaces cannot enforce ResourceQuotas
Custom Resource Definitions (CRDs) are cluster-scoped and cannot be isolated per namespace
Namespaces do not support RBAC
Namespaces cannot be used with NetworkPolicies

2. In Kueue, what happens to idle quota allocated to one team when another team needs resources?

It is permanently lost until the next billing cycle
It stays reserved and cannot be used by others
Unused quota can be lent to neighboring teams via borrowing limits
It is automatically deleted by the garbage collector

3. What is the primary advantage of the burst-to-cloud pattern?

It eliminates the need for on-premises hardware entirely
On-prem handles steady-state workloads while cloud absorbs peak demand spikes
It requires no network connection between clusters
It automatically migrates all data to the cloud

4. What does a PodDisruptionBudget (PDB) protect against?

Hardware failures and OOM kills
Voluntary disruptions such as node drains during maintenance
Network partitions between pods
Container image pull failures

5. How does DRA (Dynamic Resource Allocation) improve upon traditional device plugins?

It removes the need for GPU drivers entirely
It allows attribute-based requests like memory size and NVLink connectivity instead of simple counts
It only supports NVIDIA GPUs
It replaces the Kubernetes scheduler with a new component

6. What is the best mitigation for losing training progress when a node is drained?

Running training jobs in privileged mode
Using only spot instances
Frequent checkpointing combined with SIGTERM handlers that save state before exit
Disabling node drains entirely

7. What problem does LeaderWorkerSet solve?

Autoscaling inference replicas based on request rate
Managing tightly coupled leader/worker process groups for distributed training and inference
Encrypting network traffic between pods
Scheduling CronJobs for periodic model retraining

8. In a hybrid training/serving architecture, what serves as the synchronization bridge?

A shared NFS mount between clusters
A versioned model registry backed by object storage
Direct etcd replication between clusters
A message queue like Kafka

9. What is the purpose of chaos engineering for AI infrastructure?

To stress-test GPUs until they fail permanently
To deliberately inject failures and discover weaknesses before they manifest in production
To benchmark maximum throughput of inference endpoints
To test whether models produce correct predictions

10. Which DRA API object advertises available hardware on each node?

DeviceClass
ResourceClaim
ResourceSlice
ResourceQuota

11. What is the "noisy neighbor" problem in multi-tenancy?

When a DNS misconfiguration causes pods to resolve the wrong service
When one tenant's resource-intensive workload degrades performance for others sharing the same infrastructure
When log output from one pod pollutes another pod's stdout
When two pods are scheduled on the same node by mistake

12. In a tiered disaster recovery strategy, what does Tier 2 (model artifact replication) typically achieve for RPO?

Zero RPO (synchronous replication)
Minutes (asynchronous cross-region replication)
Days (weekly manual backups)
Hours (nightly snapshots)


Answer Explanations