Chapter 10: Production Patterns and Scaling AI Platforms

Learning Objectives

Pre-Study Assessment

1. What is the primary limitation of namespace-based isolation compared to virtual clusters?

Namespaces cannot enforce ResourceQuotas
Custom Resource Definitions (CRDs) are cluster-scoped and cannot be isolated per namespace
Namespaces do not support RBAC
Namespaces cannot be used with NetworkPolicies

2. In Kueue, what happens to idle quota allocated to one team when another team needs resources?

It is permanently lost until the next billing cycle
It stays reserved and cannot be used by others
Unused quota can be lent to neighboring teams via borrowing limits
It is automatically deleted by the garbage collector

3. What is the primary advantage of the burst-to-cloud pattern?

It eliminates the need for on-premises hardware entirely
On-prem handles steady-state workloads while cloud absorbs peak demand spikes
It requires no network connection between clusters
It automatically migrates all data to the cloud

4. What does a PodDisruptionBudget (PDB) protect against?

Hardware failures and OOM kills
Voluntary disruptions such as node drains during maintenance
Network partitions between pods
Container image pull failures

5. How does DRA (Dynamic Resource Allocation) improve upon traditional device plugins?

It removes the need for GPU drivers entirely
It allows attribute-based requests like memory size and NVLink connectivity instead of simple counts
It only supports NVIDIA GPUs
It replaces the Kubernetes scheduler with a new component

6. What is the best mitigation for losing training progress when a node is drained?

Running training jobs in privileged mode
Using only spot instances
Frequent checkpointing combined with SIGTERM handlers that save state before exit
Disabling node drains entirely

7. What problem does LeaderWorkerSet solve?

Autoscaling inference replicas based on request rate
Managing tightly coupled leader/worker process groups for distributed training and inference
Encrypting network traffic between pods
Scheduling CronJobs for periodic model retraining

8. In a hybrid training/serving architecture, what serves as the synchronization bridge?

A shared NFS mount between clusters
A versioned model registry backed by object storage
Direct etcd replication between clusters
A message queue like Kafka

9. What is the purpose of chaos engineering for AI infrastructure?

To stress-test GPUs until they fail permanently
To deliberately inject failures and discover weaknesses before they manifest in production
To benchmark maximum throughput of inference endpoints
To test whether models produce correct predictions

10. Which DRA API object advertises available hardware on each node?

DeviceClass
ResourceClaim
ResourceSlice
ResourceQuota

11. What is the "noisy neighbor" problem in multi-tenancy?

When a DNS misconfiguration causes pods to resolve the wrong service
When one tenant's resource-intensive workload degrades performance for others sharing the same infrastructure
When log output from one pod pollutes another pod's stdout
When two pods are scheduled on the same node by mistake

12. In a tiered disaster recovery strategy, what does Tier 2 (model artifact replication) typically achieve for RPO?

Zero RPO (synchronous replication)
Minutes (asynchronous cross-region replication)
Days (weekly manual backups)
Hours (nightly snapshots)

Section 1: Multi-Tenant AI Platform Design

The Tenant Problem

Imagine a university research computing center. Every faculty lab needs GPU time, every PhD student needs to run experiments, and everyone believes their deadline is the most important. Without structure, the researcher with the most aggressive scripts wins -- and everyone else waits. This is the noisy neighbor problem, the central challenge of multi-tenancy.

Multi-tenancy on Kubernetes exists on a spectrum. Soft multi-tenancy assumes cooperative tenants within the same organization; the goal is fairness. Hard multi-tenancy treats tenants as potentially hostile, guarding against data exfiltration, privilege escalation, and denial-of-service. Most enterprise AI platforms fall between these poles.

Namespace-Based Isolation vs. Virtual Clusters

Namespace-based isolation is the Kubernetes default. Each team gets one or more namespaces with RBAC Roles, NetworkPolicies, ResourceQuotas, and LimitRanges scoped to them. It is operationally simple and works well when teams share a common API version.
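As a sketch of how these primitives combine, a team namespace might carry a quota like the following (the namespace name and numbers are illustrative, and `requests.nvidia.com/gpu` assumes the NVIDIA device plugin is installed):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-cv-quota
  namespace: team-cv               # illustrative team namespace
spec:
  hard:
    requests.cpu: "200"
    requests.memory: 800Gi
    requests.nvidia.com/gpu: "8"   # cap the team at 8 GPUs in aggregate
    pods: "100"
```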

The key limitation: namespaces are a logical boundary, not a physical one. CRDs are cluster-scoped, so if team A installs a CRD at v1alpha1 and team B needs v1beta1, there is an irreconcilable conflict.

Virtual clusters (e.g., vCluster) solve this by giving each tenant a fully isolated Kubernetes API server running as pods inside the host cluster. Tenants can install arbitrary CRDs without affecting each other, while the host cluster still owns compute resources.

Dimension              | Namespace Isolation            | Virtual Clusters
Operational complexity | Low                            | Medium
CRD isolation          | No (cluster-scoped)            | Yes (per-vcluster)
Control plane overhead | None                           | One vcluster pod set per tenant
Security boundary      | Logical                        | Near-physical
Recommended for        | Internal teams, shared tooling | External tenants, CRD conflicts

Resource Governance with Kueue and Quotas

Isolation defines who can access resources; governance defines how much. Kubernetes provides two native primitives: ResourceQuota, which caps a namespace's aggregate consumption of CPU, memory, and extended resources such as nvidia.com/gpu, and LimitRange, which sets default and maximum requests and limits for individual pods and containers.

For sophisticated batch scheduling, Kueue provides ClusterQueue and LocalQueue objects. A ClusterQueue defines a pool with nominal quotas and borrowing limits; teams draw from their LocalQueue, and unused quota flows to whoever needs it. For hardware-level isolation, NVIDIA MIG partitions physical A100/H100 GPUs into isolated instances with dedicated memory and compute slices.
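A minimal sketch of this Kueue setup, assuming a single ResourceFlavor named default-flavor already exists and a cohort that enables borrowing (all names and numbers are illustrative):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-cv-cq
spec:
  cohort: ai-platform              # queues in the same cohort can borrow from each other
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 8            # guaranteed share
        borrowingLimit: 8          # may borrow up to 8 more idle GPUs from the cohort
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-cv-queue
  namespace: team-cv
spec:
  clusterQueue: team-cv-cq
```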

Self-Service Interfaces and Platform Engineering

Effective AI platforms provide self-service interfaces so data scientists can provision compute without filing tickets.

Platform engineering patterns include golden paths (pre-configured templates for common job types), paved roads with guardrails (safe defaults with escape hatches), and chargeback/showback (cost transparency via OpenCost or Kubecost).

Figure 10.1 — Multi-Tenant AI Platform with Kueue Governance
Diagram summary: three team namespaces (team-cv, team-nlp, team-recsys) submit training, fine-tuning, and serving jobs to per-team LocalQueues, which feed a single ClusterQueue (32 nominal GPUs, 48 max-burst, fair-share with borrowing) scheduling onto a shared 32-GPU node pool; idle GPUs remain borrowable across teams.

Key Takeaway

Section 2: Multi-Cluster and Hybrid Architectures

Why One Cluster Is Not Enough

A single Kubernetes cluster has fundamental limitations: blast radius (a control plane outage takes down everything), geographic constraints (data sovereignty), cost optimization (on-prem is cheaper for steady-state but wasteful for peaks), and technology heterogeneity (training needs A100s, serving needs global load balancing).

Training On-Prem, Serving in Cloud

A common pattern separates training and serving lifecycles. Training runs on-premises on owned hardware with InfiniBand/RoCE networking. Trained model artifacts are pushed to a model registry (MLflow, W&B, or OCI registry). Cloud deployments detect new versions and roll them out with auto-scaling and global load balancing.
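On the serving side, deployments typically pin the exact artifact version they serve, so a rollout or rollback is just a manifest change. A minimal sketch, in which the image tag, artifact URI, and MODEL_URI environment variable are all illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-serving
  template:
    metadata:
      labels:
        app: llm-serving
    spec:
      containers:
      - name: server
        image: ghcr.io/example/llm-server:1.4.0
        env:
        - name: MODEL_URI                  # immutable, versioned artifact from the registry
          value: s3://models/llm/v12/
```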

Burst-to-Cloud for Peak Training Demand

Like a town's water supply drawing from a regional reservoir during drought, on-premises clusters handle baseline demand while cloud clusters absorb spikes. Implementation typically involves a federation or multi-cluster scheduling layer (e.g., Liqo or OCM) that overflows queued work to elastic cloud nodes, datasets pre-replicated to object storage so burst jobs avoid a WAN transfer, and spot instances to keep burst capacity affordable.

Multi-Cluster Federation for GPU Pools

At scale, federation creates a logical GPU pool spanning physical boundaries. A 1,000-GPU training job can be placed across clusters transparently. Key challenges include consistent networking (WAN degrades tight communication patterns), consistent software environments (driver/CUDA version alignment), and unified observability.

Cross-Cluster Synchronization

Data Type               | Direction                    | Tooling
Training datasets       | On-prem to Cloud             | Rclone, AWS DataSync, object storage replication
Model checkpoints       | Training cluster to Registry | MLflow, W&B, OCI artifact push
Serving model artifacts | Registry to Serving cluster  | CD pipeline, image pull, PVC snapshot
Cluster state backups   | All clusters to Remote store | Velero, Kasten K10

The key insight: model artifacts are immutable and versioned, making cross-cluster synchronization straightforward compared to mutable databases.

Figure 10.2 — Burst-to-Cloud: On-Prem Overflow to Cloud GPU Nodes
Diagram summary: an on-premises cluster (8-GPU pool, InfiniBand) running steady-state workloads exceeds capacity, and queued jobs (a 4-GPU hyperparameter sweep, a 2-GPU eval run, a 2-GPU fine-tune) burst over WAN/VPN to an elastic cloud GPU pool coordinated by a federation controller (Liqo/OCM) with cost-aware scheduling; spot instances yield roughly 60-70% savings versus on-demand, and a dataset pre-replicated to S3 avoids a WAN transfer per job.

Key Takeaway

Section 3: Production Hardening and Reliability

Pod Disruption Budgets for Inference Services

When nodes are drained for maintenance, Kubernetes evicts pods. For high-traffic inference services with strict latency SLAs, evicting all replicas simultaneously is catastrophic. PodDisruptionBudgets (PDBs) constrain how many pods can be voluntarily disrupted at once:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-pdb
spec:
  minAvailable: 2          # At least 2 replicas must stay up
  selector:
    matchLabels:
      app: llm-inference

PDBs only constrain voluntary disruptions (node drains, cluster upgrades). Involuntary disruptions (hardware failure, OOM kills) bypass PDBs and require replica counts and topology spread constraints.
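The topology-spread half of that defense can be sketched as a pod-template fragment that prevents all replicas from stacking on one node (it reuses the app: llm-inference label from the PDB example; the fragment belongs inside the Deployment's pod spec):

```yaml
# Pod template fragment: spread llm-inference replicas across nodes so a
# single hardware failure cannot take out every replica at once.
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: llm-inference
```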

Graceful Shutdown for Long-Running Training

Training jobs run for hours or days, so a single node drain can erase substantial progress. Production hardening strategies include frequent checkpointing to durable storage, SIGTERM handlers that save state before exit, and terminationGracePeriodSeconds values long enough for a final checkpoint to complete.
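As a pod-level sketch of graceful shutdown, assuming the training process itself traps SIGTERM and writes a final checkpoint (the image name and grace period value are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  terminationGracePeriodSeconds: 600   # up to 10 minutes between SIGTERM and SIGKILL
  containers:
  - name: trainer
    image: ghcr.io/example/trainer:latest
    # The training loop is assumed to handle SIGTERM by flushing a checkpoint
    # to durable storage (e.g., object storage) before exiting.
```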

Disaster Recovery for Model Registries and Pipelines

DR Tier                 | What It Protects             | Tooling                    | Typical RPO
Tier 1: Cluster state   | YAML manifests, PVCs         | Velero, Kasten K10         | Hours
Tier 2: Model artifacts | Trained weights, checkpoints | S3 CRR, GCS multi-region   | Minutes
Tier 3: Active-passive  | Full platform availability   | OCM, Karmada, DNS failover | Minutes

Chaos Engineering for AI Infrastructure

Like fire drills, chaos engineering discovers weaknesses before they manifest in production. Relevant experiments for AI infrastructure include killing GPU nodes mid-training to verify checkpoint recovery, evicting inference replicas to confirm PodDisruptionBudgets hold under load, and cutting access to the model registry to test deployment fallbacks.

Tools like Chaos Mesh and LitmusChaos provide native Kubernetes fault injection experiments.
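For example, a Chaos Mesh experiment that kills a single inference replica to verify that the PDB and replica spread hold up (the namespace and labels are illustrative):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-one-inference-replica
spec:
  action: pod-kill
  mode: one                    # kill exactly one matching pod
  selector:
    namespaces: ["serving"]
    labelSelectors:
      app: llm-inference
```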

Key Takeaway

Section 4: The Future of AI on Kubernetes

Dynamic Resource Allocation (DRA) for GPUs

The traditional resources.limits: nvidia.com/gpu: 1 is a blunt instrument. The scheduler knows only that a pod needs one GPU -- it has no visibility into GPU memory, NVLink connectivity, or PCIe topology. DRA (GA in Kubernetes 1.34) replaces device plugins with attribute-based scheduling:

DRA Object    | Role                                                     | Created By
ResourceSlice | Advertises hardware on each node with device attributes  | DRA driver / node agent
DeviceClass   | Defines a category of requestable devices                | Cluster admin / driver
ResourceClaim | Workload's request for specific devices                  | User / pipeline

apiVersion: resource.k8s.io/v1   # DRA API is GA as of Kubernetes 1.34
kind: ResourceClaim
metadata:
  name: training-gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: nvidia-gpu
        selectors:
        - cel:
            # Match only devices advertising more than 40Gi of memory; the
            # capacity domain ("nvidia.com") depends on the installed DRA driver
            expression: device.capacity["nvidia.com"].memory.isGreaterThan(quantity("40Gi"))

DRA also supports topology-aware co-scheduling: a workload can request a GPU and a NIC on the same PCIe Root Complex, eliminating manual affinity rules.
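The deviceClassName a claim references must exist as a DeviceClass installed by an administrator or driver. A minimal sketch, in which the driver name is illustrative:

```yaml
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: nvidia-gpu
spec:
  selectors:
  - cel:
      # Match devices published by the (illustrative) NVIDIA DRA driver
      expression: device.driver == "gpu.nvidia.com"
```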

LeaderWorkerSet and JobSet APIs

JobSet manages groups of related Job objects as a single unit with shared failure policies. LeaderWorkerSet targets LLM training/inference where one leader coordinates many workers, replacing the awkward headless-Service-plus-StatefulSet workaround.
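A minimal LeaderWorkerSet sketch, assuming the lws controller is installed (the image, replica count, and group size are illustrative; the leader template defaults to the worker template when omitted):

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llm-serve
spec:
  replicas: 2                # two independent leader+worker groups
  leaderWorkerTemplate:
    size: 4                  # each group: 1 leader + 3 workers
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: ghcr.io/example/vllm-worker:latest
```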

Emerging Hardware on Kubernetes

Hardware            | Primary Use               | K8s Integration
TPU v4/v5 (Google)  | Large-scale training      | GKE TPU node pools, google.com/tpu
AWS Trainium        | Cost-efficient training   | EKS Neuron device plugin
AWS Inferentia      | High-throughput inference | EKS Neuron device plugin
Intel Gaudi 2/3     | Training, inference       | Gaudi device plugin
AMD Instinct MI300X | Training, LLM inference   | ROCm device plugin

DRA's attribute-based model is architecturally suited to this diversity: a workload requesting "GPU with 80 GB memory and NVLink" can match H100, MI300X, or future hardware without changing job definitions. This enables true hardware abstraction for AI platforms.

Figure 10.3 — DRA vs Device Plugin: Improved Resource Allocation Lifecycle
Diagram summary: under the old device-plugin model, a pod spec requesting nvidia.com/gpu: 1 gives the scheduler only a count, so a random GPU on some node is assigned -- no memory visibility, no topology awareness, and manual node selectors. Under DRA, a ResourceClaim (memory > 40Gi, NVLink) is matched against ResourceSlices by an attribute-aware, topology-aware scheduler, producing an optimal binding such as an H100 80Gi with NVLink on the same PCIe root complex.

Key Takeaway

Post-Study Assessment

1. What is the primary limitation of namespace-based isolation compared to virtual clusters?

Namespaces cannot enforce ResourceQuotas
Custom Resource Definitions (CRDs) are cluster-scoped and cannot be isolated per namespace
Namespaces do not support RBAC
Namespaces cannot be used with NetworkPolicies

2. In Kueue, what happens to idle quota allocated to one team when another team needs resources?

It is permanently lost until the next billing cycle
It stays reserved and cannot be used by others
Unused quota can be lent to neighboring teams via borrowing limits
It is automatically deleted by the garbage collector

3. What is the primary advantage of the burst-to-cloud pattern?

It eliminates the need for on-premises hardware entirely
On-prem handles steady-state workloads while cloud absorbs peak demand spikes
It requires no network connection between clusters
It automatically migrates all data to the cloud

4. What does a PodDisruptionBudget (PDB) protect against?

Hardware failures and OOM kills
Voluntary disruptions such as node drains during maintenance
Network partitions between pods
Container image pull failures

5. How does DRA (Dynamic Resource Allocation) improve upon traditional device plugins?

It removes the need for GPU drivers entirely
It allows attribute-based requests like memory size and NVLink connectivity instead of simple counts
It only supports NVIDIA GPUs
It replaces the Kubernetes scheduler with a new component

6. What is the best mitigation for losing training progress when a node is drained?

Running training jobs in privileged mode
Using only spot instances
Frequent checkpointing combined with SIGTERM handlers that save state before exit
Disabling node drains entirely

7. What problem does LeaderWorkerSet solve?

Autoscaling inference replicas based on request rate
Managing tightly coupled leader/worker process groups for distributed training and inference
Encrypting network traffic between pods
Scheduling CronJobs for periodic model retraining

8. In a hybrid training/serving architecture, what serves as the synchronization bridge?

A shared NFS mount between clusters
A versioned model registry backed by object storage
Direct etcd replication between clusters
A message queue like Kafka

9. What is the purpose of chaos engineering for AI infrastructure?

To stress-test GPUs until they fail permanently
To deliberately inject failures and discover weaknesses before they manifest in production
To benchmark maximum throughput of inference endpoints
To test whether models produce correct predictions

10. Which DRA API object advertises available hardware on each node?

DeviceClass
ResourceClaim
ResourceSlice
ResourceQuota

11. What is the "noisy neighbor" problem in multi-tenancy?

When a DNS misconfiguration causes pods to resolve the wrong service
When one tenant's resource-intensive workload degrades performance for others sharing the same infrastructure
When log output from one pod pollutes another pod's stdout
When two pods are scheduled on the same node by mistake

12. In a tiered disaster recovery strategy, what does Tier 2 (model artifact replication) typically achieve for RPO?

Zero RPO (synchronous replication)
Minutes (asynchronous cross-region replication)
Days (weekly manual backups)
Hours (nightly snapshots)


Answer Explanations