1. What is the primary limitation of namespace-based isolation compared to virtual clusters?
Namespaces cannot enforce ResourceQuotas
Custom Resource Definitions (CRDs) are cluster-scoped and cannot be isolated per namespace
Namespaces do not support RBAC
Namespaces cannot be used with NetworkPolicies
2. In Kueue, what happens to idle quota allocated to one team when another team needs resources?
It is permanently lost until the next billing cycle
It stays reserved and cannot be used by others
Unused quota can be lent to neighboring teams via borrowing limits
It is automatically deleted by the garbage collector
3. What is the primary advantage of the burst-to-cloud pattern?
It eliminates the need for on-premises hardware entirely
On-prem handles steady-state workloads while cloud absorbs peak demand spikes
It requires no network connection between clusters
It automatically migrates all data to the cloud
4. What does a PodDisruptionBudget (PDB) protect against?
Hardware failures and OOM kills
Voluntary disruptions such as node drains during maintenance
Network partitions between pods
Container image pull failures
5. How does DRA (Dynamic Resource Allocation) improve upon traditional device plugins?
It removes the need for GPU drivers entirely
It allows attribute-based requests like memory size and NVLink connectivity instead of simple counts
It only supports NVIDIA GPUs
It replaces the Kubernetes scheduler with a new component
6. What is the best mitigation for losing training progress when a node is drained?
Running training jobs in privileged mode
Using only spot instances
Frequent checkpointing combined with SIGTERM handlers that save state before exit
Disabling node drains entirely
7. What problem does LeaderWorkerSet solve?
Autoscaling inference replicas based on request rate
Managing tightly coupled leader/worker process groups for distributed training and inference
Encrypting network traffic between pods
Scheduling CronJobs for periodic model retraining
8. In a hybrid training/serving architecture, what serves as the synchronization bridge?
A shared NFS mount between clusters
A versioned model registry backed by object storage
Direct etcd replication between clusters
A message queue like Kafka
9. What is the purpose of chaos engineering for AI infrastructure?
To stress-test GPUs until they fail permanently
To deliberately inject failures and discover weaknesses before they manifest in production
To benchmark maximum throughput of inference endpoints
To test whether models produce correct predictions
10. Which DRA API object advertises available hardware on each node?
DeviceClass
ResourceClaim
ResourceSlice
ResourceQuota
11. What is the "noisy neighbor" problem in multi-tenancy?
When a DNS misconfiguration causes pods to resolve the wrong service
When one tenant's resource-intensive workload degrades performance for others sharing the same infrastructure
When log output from one pod pollutes another pod's stdout
When two pods are scheduled on the same node by mistake
12. In a tiered disaster recovery strategy, what does Tier 2 (model artifact replication) typically achieve for RPO?
Zero RPO (synchronous replication)
Minutes (asynchronous cross-region replication)
Days (weekly manual backups)
Hours (nightly snapshots)
The Tenant Problem
Imagine a university research computing center. Every faculty lab needs GPU time, every PhD student needs to run experiments, and everyone believes their deadline is the most important. Without structure, the researcher with the most aggressive scripts wins -- and everyone else waits. This is the noisy neighbor problem, the central challenge of multi-tenancy.
Multi-tenancy on Kubernetes exists on a spectrum. Soft multi-tenancy assumes cooperative tenants within the same organization; the goal is fairness. Hard multi-tenancy treats tenants as potentially hostile, guarding against data exfiltration, privilege escalation, and denial-of-service. Most enterprise AI platforms fall between these poles.
Namespace-Based Isolation vs. Virtual Clusters
Namespace-based isolation is the Kubernetes default. Each team gets one or more namespaces with RBAC Roles, NetworkPolicies, ResourceQuotas, and LimitRanges scoped to them. It is operationally simple and works well when teams share a common API version.
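Those namespace-scoped primitives compose into a tenant baseline. As an illustrative sketch (the team-a namespace is hypothetical), a default-deny NetworkPolicy blocks all ingress to a tenant's pods until explicit allow rules are added:

```yaml
# Illustrative default-deny ingress policy for a tenant namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a          # hypothetical tenant namespace
spec:
  podSelector: {}            # empty selector matches every pod in the namespace
  policyTypes:
  - Ingress                  # no ingress rules listed, so all ingress is denied
```

Equivalent deny-by-default egress rules and per-namespace RBAC Roles round out the boundary.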
The key limitation: namespaces are a logical boundary, not a physical one. CRDs are cluster-scoped, so if team A installs a CRD at v1alpha1 and team B needs v1beta1, there is an irreconcilable conflict.
Virtual clusters (e.g., vCluster) solve this by giving each tenant a fully isolated Kubernetes API server running as pods inside the host cluster. Tenants can install arbitrary CRDs without affecting each other, while the host cluster still owns compute resources.
| Dimension | Namespace Isolation | Virtual Clusters |
|---|---|---|
| Operational complexity | Low | Medium |
| CRD isolation | No (cluster-scoped) | Yes (per-vcluster) |
| Control plane overhead | None | One vcluster pod set per tenant |
| Security boundary | Logical | Near-physical |
| Recommended for | Internal teams, shared tooling | External tenants, CRD conflicts |
Resource Governance with Kueue and Quotas
Isolation defines who can access resources; governance defines how much. Kubernetes provides two native primitives:
- ResourceQuota: caps aggregate consumption of CPU, memory, and GPU within a namespace
- LimitRange: sets default and maximum resource requests for individual pods
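A minimal sketch of both primitives for a hypothetical team-a namespace (the quota values are illustrative, not recommendations):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "64"
    requests.memory: 256Gi
    requests.nvidia.com/gpu: "8"   # caps aggregate GPU requests in the namespace
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults
  namespace: team-a
spec:
  limits:
  - type: Container
    default:               # applied when a container omits limits
      cpu: "2"
      memory: 8Gi
    defaultRequest:        # applied when a container omits requests
      cpu: "1"
      memory: 4Gi
```

The LimitRange ensures every pod lands with sane requests, so the ResourceQuota accounting stays meaningful.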
For sophisticated batch scheduling, Kueue provides ClusterQueue and LocalQueue objects. A ClusterQueue defines a pool with nominal quotas and borrowing limits; teams draw from their LocalQueue, and unused quota flows to whoever needs it. For hardware-level isolation, NVIDIA MIG partitions physical A100/H100 GPUs into isolated instances with dedicated memory and compute slices.
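A hedged sketch of a ClusterQueue with a borrowing limit, following the kueue.x-k8s.io/v1beta1 API (the cohort and flavor names here are illustrative):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-queue
spec:
  cohort: research            # queues in the same cohort can lend idle quota to each other
  namespaceSelector: {}       # which namespaces may submit workloads to this queue
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: a100              # hypothetical ResourceFlavor
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 8       # the team's guaranteed share
        borrowingLimit: 4     # may borrow up to 4 idle GPUs from the cohort
```

Each team then submits jobs to a LocalQueue in its own namespace that points at this ClusterQueue.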
Self-Service Interfaces and Platform Engineering
Effective AI platforms provide self-service interfaces so data scientists can provision compute without filing tickets:
- Namespace-as-a-Service: GitOps-driven workflows where merging a PR automatically provisions namespaces, RBAC, and queues
- ML Platform portals: Kubeflow, MLflow on Kubernetes, or internal dashboards for launching jobs
- Policy enforcement: OPA Gatekeeper or Kyverno admission controllers enforce governance rules automatically
Platform engineering patterns include golden paths (pre-configured templates for common job types), paved roads with guardrails (safe defaults with escape hatches), and chargeback/showback (cost transparency via OpenCost or Kubecost).
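As one example of automated policy enforcement, a Kyverno ClusterPolicy can reject pods whose containers omit memory limits at admission time. This is a sketch: the policy name, message, and pattern are illustrative, and the field layout follows the kyverno.io/v1 API:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce   # reject non-compliant pods rather than just audit
  rules:
  - name: check-memory-limits
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "All containers must declare memory limits."
      pattern:
        spec:
          containers:
          - resources:
              limits:
                memory: "?*"         # any non-empty value satisfies the pattern
```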
Why One Cluster Is Not Enough
A single Kubernetes cluster has fundamental limitations: blast radius (a control plane outage takes down everything), geographic constraints (data sovereignty), cost optimization (on-prem is cheaper for steady-state but wasteful for peaks), and technology heterogeneity (training needs A100s, serving needs global load balancing).
Training On-Prem, Serving in Cloud
A common pattern separates training and serving lifecycles. Training runs on-premises on owned hardware with InfiniBand/RoCE networking. Trained model artifacts are pushed to a model registry (MLflow, W&B, or OCI registry). Cloud deployments detect new versions and roll them out with auto-scaling and global load balancing.
Burst-to-Cloud for Peak Training Demand
Like a town's water supply drawing from a regional reservoir during drought, on-premises clusters handle baseline demand while cloud clusters absorb spikes. Implementation involves:
- Cluster federation: Liqo, Open Cluster Management (OCM), or Karmada federate clusters behind a single control plane
- Cost-aware scheduling: Prefer on-prem when utilization is below threshold, fall back to cloud spot instances
- Data locality: Replicate datasets to cloud storage in advance to avoid expensive WAN transfers per job
Multi-Cluster Federation for GPU Pools
At scale, federation creates a logical GPU pool spanning physical boundaries. A 1,000-GPU training job can be placed across clusters transparently. Key challenges include consistent networking (WAN degrades tight communication patterns), consistent software environments (driver/CUDA version alignment), and unified observability.
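With Karmada, for instance, placement is declared through a PropagationPolicy. In this sketch the job and cluster names are hypothetical:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: training-job-placement
spec:
  resourceSelectors:
  - apiVersion: batch/v1
    kind: Job
    name: llm-pretrain          # hypothetical federated training job
  placement:
    clusterAffinity:
      clusterNames:
      - onprem-dgx              # preferred steady-state cluster
      - cloud-burst             # overflow capacity
```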
Cross-Cluster Synchronization
| Data Type | Direction | Tooling |
|---|---|---|
| Training datasets | On-prem to Cloud | Rclone, AWS DataSync, object storage replication |
| Model checkpoints | Training cluster to Registry | MLflow, W&B, OCI artifact push |
| Serving model artifacts | Registry to Serving cluster | CD pipeline, image pull, PVC snapshot |
| Cluster state backups | All clusters to Remote store | Velero, Kasten K10 |
The key insight: model artifacts are immutable and versioned, making cross-cluster synchronization straightforward compared to mutable databases.
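For the cluster-state piece, a Velero Schedule can ship recurring backups to a remote object store. This is a sketch; the cron expression, namespace list, and retention are illustrative:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-platform-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"            # nightly at 02:00
  template:
    includedNamespaces:
    - mlflow                       # hypothetical model-registry namespace
    - pipelines
    ttl: 720h                      # retain backups for 30 days
```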
Pod Disruption Budgets for Inference Services
When nodes are drained for maintenance, Kubernetes evicts pods. For high-traffic inference services with strict latency SLAs, evicting all replicas simultaneously is catastrophic. PodDisruptionBudgets (PDBs) constrain how many pods can be voluntarily disrupted at once:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-pdb
spec:
  minAvailable: 2          # At least 2 replicas must stay up
  selector:
    matchLabels:
      app: llm-inference
```
PDBs only constrain voluntary disruptions (node drains, cluster upgrades). Involuntary disruptions (hardware failure, OOM kills) bypass PDBs and require replica counts and topology spread constraints.
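For the involuntary side, a topologySpreadConstraints stanza in the deployment's pod template keeps replicas spread across nodes so one hardware failure cannot take them all down (fragment shown; labels match the PDB example above):

```yaml
# Pod template fragment: spread inference replicas across nodes.
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname   # at most one extra replica per node
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: llm-inference
```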
Graceful Shutdown for Long-Running Training
Training jobs run for hours or days. Production hardening strategies include:
- Frequent checkpointing: Every 10-15 minutes limits maximum work lost
- SIGTERM handlers: Training code registers a handler that triggers checkpoint save before exiting; the job controller reschedules and resumes
- Extended termination grace periods: setting terminationGracePeriodSeconds to 300-600 gives time for checkpoint writes
- Job prioritization: PriorityClasses ensure production jobs are not preempted for experiments
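The grace-period and shutdown-hook pieces can be sketched in the pod template. The image and checkpoint script path below are hypothetical; note that Kubernetes sends SIGTERM to the container only after any preStop hook completes:

```yaml
# Pod template fragment for a training job.
spec:
  terminationGracePeriodSeconds: 600   # time budget for the final checkpoint write
  containers:
  - name: trainer
    image: example.com/trainer:latest  # hypothetical image
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "/app/save_checkpoint.sh"]  # hypothetical script
```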
Disaster Recovery for Model Registries and Pipelines
| DR Tier | What It Protects | Tooling | Typical RPO |
|---|---|---|---|
| Tier 1: Cluster state | YAML manifests, PVCs | Velero, Kasten K10 | Hours |
| Tier 2: Model artifacts | Trained weights, checkpoints | S3 CRR, GCS multi-region | Minutes |
| Tier 3: Active-passive | Full platform availability | OCM, Karmada, DNS failover | Minutes |
Chaos Engineering for AI Infrastructure
Like fire drills, chaos engineering discovers weaknesses before they manifest in production. Relevant experiments for AI infrastructure:
- Node failure injection: Terminate a GPU node mid-training to verify checkpoint-and-resume
- Network partition simulation: Cut link to model registry to verify graceful failure and retry
- etcd latency injection: Observe operator reconciliation under degraded API server
- GPU memory pressure: Validate MIG partitioning or namespace quotas prevent interference
Tools like Chaos Mesh and LitmusChaos provide native Kubernetes fault injection experiments.
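A Chaos Mesh PodChaos experiment implements the node/pod failure injection described above. This is a sketch; the namespace and label values are illustrative:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-training-worker
spec:
  action: pod-kill           # terminate a pod to exercise checkpoint-and-resume
  mode: one                  # affect a single randomly chosen matching pod
  selector:
    namespaces:
    - training               # hypothetical namespace
    labelSelectors:
      app: llm-pretrain      # hypothetical job label
```

The success criterion is that the job controller reschedules the worker and training resumes from the last checkpoint without manual intervention.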
Dynamic Resource Allocation (DRA) for GPUs
The traditional resources.limits: nvidia.com/gpu: 1 is a blunt instrument. The scheduler knows only that a pod needs one GPU -- it has no visibility into GPU memory, NVLink connectivity, or PCIe topology. DRA (GA in Kubernetes 1.34) replaces device plugins with attribute-based scheduling:
| DRA Object | Role | Created By |
|---|---|---|
| ResourceSlice | Advertises hardware on each node with device attributes | DRA driver / node agent |
| DeviceClass | Defines a category of requestable devices | Cluster admin / driver |
| ResourceClaim | Workload's request for specific devices | User / pipeline |
```yaml
apiVersion: resource.k8s.io/v1alpha3   # pre-GA API group shown; GA clusters (1.34+) serve resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: training-gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: nvidia-gpu
      selectors:
      - cel:
          expression: device.attributes["memory"].isGreaterThan(quantity("40Gi"))
```
DRA also supports topology-aware co-scheduling: a workload can request a GPU and a NIC on the same PCIe Root Complex, eliminating manual affinity rules.
LeaderWorkerSet and JobSet APIs
JobSet manages groups of related Job objects as a single unit with shared failure policies. LeaderWorkerSet targets LLM training/inference where one leader coordinates many workers, replacing the awkward headless-Service-plus-StatefulSet workaround.
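A minimal LeaderWorkerSet sketch, following the leaderworkerset.x-k8s.io/v1 API (the image names are hypothetical): each replica group consists of one leader pod plus three workers, and the whole group is created, scaled, and restarted as a unit:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llm-serving
spec:
  replicas: 2                  # two independent leader/worker groups
  leaderWorkerTemplate:
    size: 4                    # 1 leader + 3 workers per group
    leaderTemplate:
      spec:
        containers:
        - name: leader
          image: example.com/vllm-leader:latest   # hypothetical image
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: example.com/vllm-worker:latest   # hypothetical image
```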
Emerging Hardware on Kubernetes
| Hardware | Primary Use | K8s Integration |
|---|---|---|
| TPU v4/v5 (Google) | Large-scale training | GKE TPU node pools, google.com/tpu |
| AWS Trainium | Cost-efficient training | EKS Neuron device plugin |
| AWS Inferentia | High-throughput inference | EKS Neuron device plugin |
| Intel Gaudi 2/3 | Training, inference | Gaudi device plugin |
| AMD Instinct MI300X | Training, LLM inference | ROCm device plugin |
DRA's attribute-based model is architecturally suited to this diversity: a workload requesting "GPU with 80 GB memory and NVLink" can match H100, MI300X, or future hardware without changing job definitions. This enables true hardware abstraction for AI platforms.