Chapter 1: Introduction to AI on Kubernetes

Learning Objectives

Section 1: Why Kubernetes for AI/ML

Pre-Quiz: Why Kubernetes for AI/ML

1. A distributed training job needs 8 GPUs simultaneously, but only some cluster nodes have GPUs. What Kubernetes mechanism ensures the job lands on appropriate hardware?

Resource quotas reject the job if GPU nodes are busy
The NVIDIA device plugin registers GPUs as schedulable resources, enabling the scheduler to place pods on GPU-equipped nodes
Container images automatically detect and bind to available GPUs at runtime
Kubernetes evicts CPU workloads to make room for GPU requests

2. An AI team trains a model on a local cluster, then deploys it to a cloud provider. Slightly different CUDA toolkit versions cause numerical discrepancies. Which Kubernetes concept most directly prevents this problem?

Namespace isolation ensures environment separation
Kubernetes autoscaling normalizes resource differences across environments
Containerization packages the exact CUDA version and dependencies into a portable image
The Kubernetes API server translates provider-specific configurations automatically

3. What is the primary practical advantage of Kubernetes having 96%+ industry adoption for AI teams specifically?

Higher adoption means Kubernetes clusters run faster
It creates a self-reinforcing ecosystem where every major AI framework and tool builds Kubernetes-native integrations
It eliminates the need for specialized GPU hardware
Adoption levels directly correlate with improved model accuracy

4. Two data science teams share a Kubernetes cluster. One team launches a poorly-tuned training job that consumes all available GPUs. Which Kubernetes feature prevents this from starving the other team?

Pod priority classes automatically kill lower-priority jobs
The container runtime throttles excessive GPU usage
Resource quotas on namespaces cap the total GPU consumption per team
Kubernetes distributes GPU resources equally among all pods by default

5. Compared to traditional HPC systems like Slurm, what is a key trade-off when using Kubernetes for AI workloads?

Kubernetes cannot schedule GPU workloads at all
Kubernetes has a steeper learning curve and more verbose manifests, but offers superior portability and richer ML-specific ecosystem tooling
Slurm supports containerization while Kubernetes does not
Kubernetes is limited to cloud environments only

Resource Orchestration Needs of AI Workloads

Training a large language model or running a computer vision pipeline demands fundamentally different resources than serving a web application. Where a web server might need two CPU cores and a few hundred megabytes of memory, a distributed training job might need dozens of GPUs, terabytes of RAM, and high-bandwidth interconnects simultaneously.

Kubernetes serves as an air-traffic control system for your compute cluster. Through plugins such as the NVIDIA device plugin, the scheduler gains awareness of GPU hardware on each node, placing GPU-hungry training jobs only on nodes that have GPUs available and reserving exactly the number requested.

Beyond GPU placement, Kubernetes enforces resource quotas — hard limits on how much CPU, memory, or GPU a team or namespace can consume. This is critical for multi-tenant clusters where one team's poorly-tuned job could starve every other team's workloads.
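A namespace-level quota of this kind takes only a few lines of YAML. The sketch below is illustrative (the team namespace and the numbers are assumptions, not values from this chapter); it caps one team at 8 GPUs:

```yaml
# Illustrative ResourceQuota: caps total GPU requests in the
# nlp-team namespace at 8, no matter how many pods are submitted.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-gpu-quota
  namespace: nlp-team          # hypothetical team namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"
    requests.cpu: "64"
    requests.memory: 256Gi
```

Once this quota is applied, the API server rejects any pod whose GPU request would push the namespace's total past 8, protecting the other teams' capacity.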

```mermaid
flowchart TD
    A["User submits YAML manifest\nrequesting 4 GPUs"] --> B["Kubernetes API Server\nreceives request"]
    B --> C["Scheduler queries\nnode resource state"]
    C --> D{"Node has\n4+ GPUs available?"}
    D -- No --> E["Node skipped"]
    D -- Yes --> F["Pod bound to\nGPU-equipped node"]
    F --> G["NVIDIA Device Plugin\nexposes nvidia.com/gpu resources"]
    G --> H["Training job runs\non reserved GPUs"]
    E --> C
```

Portability Across Cloud and On-Premises Environments

AI teams frequently move workloads between environments: from local development to staging to production across different cloud providers or on-premises data centers. Kubernetes solves this through containerization — packaging an application and all its dependencies into a portable, self-contained image. Because Kubernetes runs on AWS (EKS), Google Cloud (GKE), Azure (AKS), and on-premises infrastructure using the same API surface, a container image that works in development works in production without modification.

This consistency is not merely convenience — it is a correctness guarantee. AI models are notoriously sensitive to environment variations: a different version of NumPy or a different CUDA toolkit can produce different numerical outputs. Containers eliminate that class of problem by making the environment a reproducible artifact.

Ecosystem Maturity and Community Momentum

According to the CNCF Annual Survey 2023, more than 96% of organizations are using or evaluating Kubernetes. Every major AI framework — PyTorch, TensorFlow, JAX, Hugging Face Transformers — has Kubernetes-native tooling. Platforms like Kubeflow, Ray, and MLflow have all converged on Kubernetes as their deployment substrate.

| Factor | Kubernetes | Traditional HPC (Slurm) | Cloud-Managed VMs |
| --- | --- | --- | --- |
| GPU scheduling | Native via device plugins | Native, mature | Manual or autoscaling groups |
| Portability | Excellent (cloud + on-prem) | Limited (usually on-prem) | Cloud-vendor specific |
| Ecosystem tooling | Very rich (Kubeflow, KServe, Kueue) | Limited ML-specific tooling | Vendor-specific services |
| Multi-tenancy | Namespaces + quotas | Fair-share queues | Separate accounts/projects |
| Learning curve | Steep | Moderate (for HPC teams) | Low initially, high at scale |

Key Points: Why Kubernetes for AI/ML

Post-Quiz: Why Kubernetes for AI/ML

1. A distributed training job needs 8 GPUs simultaneously, but only some cluster nodes have GPUs. What Kubernetes mechanism ensures the job lands on appropriate hardware?

Resource quotas reject the job if GPU nodes are busy
The NVIDIA device plugin registers GPUs as schedulable resources, enabling the scheduler to place pods on GPU-equipped nodes
Container images automatically detect and bind to available GPUs at runtime
Kubernetes evicts CPU workloads to make room for GPU requests

2. An AI team trains a model on a local cluster, then deploys it to a cloud provider. Slightly different CUDA toolkit versions cause numerical discrepancies. Which Kubernetes concept most directly prevents this problem?

Namespace isolation ensures environment separation
Kubernetes autoscaling normalizes resource differences across environments
Containerization packages the exact CUDA version and dependencies into a portable image
The Kubernetes API server translates provider-specific configurations automatically

3. What is the primary practical advantage of Kubernetes having 96%+ industry adoption for AI teams specifically?

Higher adoption means Kubernetes clusters run faster
It creates a self-reinforcing ecosystem where every major AI framework and tool builds Kubernetes-native integrations
It eliminates the need for specialized GPU hardware
Adoption levels directly correlate with improved model accuracy

4. Two data science teams share a Kubernetes cluster. One team launches a poorly-tuned training job that consumes all available GPUs. Which Kubernetes feature prevents this from starving the other team?

Pod priority classes automatically kill lower-priority jobs
The container runtime throttles excessive GPU usage
Resource quotas on namespaces cap the total GPU consumption per team
Kubernetes distributes GPU resources equally among all pods by default

5. Compared to traditional HPC systems like Slurm, what is a key trade-off when using Kubernetes for AI workloads?

Kubernetes cannot schedule GPU workloads at all
Kubernetes has a steeper learning curve and more verbose manifests, but offers superior portability and richer ML-specific ecosystem tooling
Slurm supports containerization while Kubernetes does not
Kubernetes is limited to cloud environments only

Section 2: AI/ML Workload Characteristics

Pre-Quiz: AI/ML Workload Characteristics

1. A team is deciding which Kubernetes resource type to use for their model training pipeline. Training runs for 12 hours and should stop when complete. Which resource type is the best fit, and why?

Deployment, because it ensures the training process is always running
Job, because it runs pods to completion and does not restart them afterward
CronJob, because long-running tasks should be scheduled
StatefulSet, because training requires persistent state

2. An inference service experiences a sudden 50x traffic spike. What characteristic of real-time inference workloads makes this scenario fundamentally different from a batch training spike?

Inference requires more total GPU memory than training
Inference must maintain low latency under unpredictable load, requiring horizontal scale-out rather than scale-up
Training spikes are more expensive than inference spikes
Inference workloads cannot use GPUs during traffic spikes

3. A training pod's GPU is spending 60% of its time idle, waiting for data from storage. What is this problem called and what storage strategy addresses it?

GPU fragmentation; solved by using fractional GPU allocation
The I/O bottleneck; solved by using high-throughput storage like NVMe-backed PersistentVolumes for hot data
Memory thrashing; solved by increasing pod memory limits
Scheduler contention; solved by increasing the number of training replicas

4. Why does Kubernetes need additional tools like Kueue for AI batch workloads, even though it already supports Job resources?

Kubernetes Jobs cannot run on GPU nodes
The default scheduler lacks queue management, preemption, and fair-share scheduling needed for concurrent batch GPU jobs
Kueue replaces the Kubernetes scheduler entirely
Jobs in Kubernetes automatically restart on failure, which wastes GPU resources

5. Which combination correctly describes the GPU utilization pattern for batch training versus real-time inference?

Training: bursty and underutilized; Inference: continuously saturated
Training: continuously saturated; Inference: bursty and often underutilized
Both are continuously saturated during operation
Neither uses GPUs — they rely on CPUs for computation

Batch Training vs Real-Time Inference Patterns

Batch training is a compute marathon: a job starts, runs for hours or days consuming enormous GPU resources, and produces a saved model artifact. Training is not user-facing, so per-request latency is irrelevant; what matters is throughput and time to completion.

Real-time inference (online serving) is the opposite. A trained model sits behind an API endpoint, and users expect predictions in milliseconds. Inference workloads must handle unpredictable traffic spikes — scaling out horizontally by adding replicas rather than scaling up with more GPUs per pod.

| Dimension | Batch Training | Real-Time Inference |
| --- | --- | --- |
| Duration | Hours to days | Milliseconds per request |
| GPU utilization | Continuously saturated | Bursty, often underutilized |
| Latency sensitivity | Low | Very high |
| Traffic pattern | Predictable (job-scheduled) | Unpredictable spikes |
| Scaling strategy | Scale-up (more GPUs per job) | Scale-out (more replicas) |
| Failure tolerance | Restart job or checkpoint | Must be highly available |
Batch Training vs Real-Time Inference
[Figure: side-by-side comparison. Batch training (K8s resource: Job): a 100GB+ dataset feeds 4-GPU training pods saturated at ~95% utilization for hours to days on a predictable schedule, scaling up with more GPUs per job and restarting from checkpoints on failure, ending in a model artifact. Real-time inference (K8s resource: Deployment + Service): autoscaled single-GPU pods answer unpredictable traffic spikes in under 100ms at a bursty 30-70% GPU utilization, scaling out with more replicas and requiring high availability.]

GPU and Accelerator Requirements

Kubernetes does not understand GPUs out of the box. Hardware vendors publish device plugins that extend the Kubernetes API to expose accelerators as schedulable resources. The NVIDIA device plugin registers each GPU as an nvidia.com/gpu resource. A pod can request nvidia.com/gpu: 2 in its resource spec, and the scheduler places it only on a node with at least two available NVIDIA GPUs.
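As a minimal sketch (the image name is hypothetical), such a request looks like:

```yaml
# Minimal pod requesting 2 NVIDIA GPUs. For extended resources like
# nvidia.com/gpu, you specify the limit; the request defaults to the
# same value and cannot differ from it.
apiVersion: v1
kind: Pod
metadata:
  name: two-gpu-trainer
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: registry.example.com/train:latest   # hypothetical image
    command: ["python", "train.py"]
    resources:
      limits:
        nvidia.com/gpu: 2
```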

GPU resources are expensive: a multi-GPU cloud A100 instance can cost more than $30 per hour. Setting precise requests and limits in your pod spec is not optional for cost control.

Data-Intensive I/O Patterns

Training a large model requires moving enormous volumes of data into GPU memory efficiently. A GPU that spends more time waiting for data from disk than computing gradients is an expensive idle resource. This is the I/O bottleneck. In Kubernetes, choosing the right PersistentVolume type — fast NVMe-backed storage for hot data, object storage for cold datasets — is critical.
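In practice this means binding the training pod's data volume to a fast storage class. A sketch, assuming the cluster administrator has defined an NVMe-backed StorageClass named fast-nvme (the class name and size are illustrative):

```yaml
# Claim 500Gi of NVMe-backed storage for the hot training dataset.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: fast-nvme    # assumed NVMe-backed class
  resources:
    requests:
      storage: 500Gi
```

The training pod then mounts this claim as a volume, keeping hot data on fast local storage while cold datasets remain in cheaper object storage.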

Long-Running and Ephemeral Job Types

AI workloads span both extremes: a multi-day distributed training run must tolerate node failures gracefully via checkpointing, while a feature engineering step might be ephemeral — a pod spins up, processes data, writes results, and exits. The infrastructure must support tens to thousands of concurrent long-running batch jobs, which is a different operational profile from what Kubernetes was originally optimized for.
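An ephemeral step of this kind maps naturally onto a Job. The sketch below (image name hypothetical) runs once, retries at most twice on failure, and is garbage-collected an hour after finishing:

```yaml
# One-shot feature engineering step: runs to completion and is
# cleaned up automatically after an hour.
apiVersion: batch/v1
kind: Job
metadata:
  name: feature-prep
spec:
  backoffLimit: 2
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: prep
        image: registry.example.com/features:latest   # hypothetical image
        command: ["python", "build_features.py"]
```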

Key Points: AI/ML Workload Characteristics

Post-Quiz: AI/ML Workload Characteristics

1. A team is deciding which Kubernetes resource type to use for their model training pipeline. Training runs for 12 hours and should stop when complete. Which resource type is the best fit, and why?

Deployment, because it ensures the training process is always running
Job, because it runs pods to completion and does not restart them afterward
CronJob, because long-running tasks should be scheduled
StatefulSet, because training requires persistent state

2. An inference service experiences a sudden 50x traffic spike. What characteristic of real-time inference workloads makes this scenario fundamentally different from a batch training spike?

Inference requires more total GPU memory than training
Inference must maintain low latency under unpredictable load, requiring horizontal scale-out rather than scale-up
Training spikes are more expensive than inference spikes
Inference workloads cannot use GPUs during traffic spikes

3. A training pod's GPU is spending 60% of its time idle, waiting for data from storage. What is this problem called and what storage strategy addresses it?

GPU fragmentation; solved by using fractional GPU allocation
The I/O bottleneck; solved by using high-throughput storage like NVMe-backed PersistentVolumes for hot data
Memory thrashing; solved by increasing pod memory limits
Scheduler contention; solved by increasing the number of training replicas

4. Why does Kubernetes need additional tools like Kueue for AI batch workloads, even though it already supports Job resources?

Kubernetes Jobs cannot run on GPU nodes
The default scheduler lacks queue management, preemption, and fair-share scheduling needed for concurrent batch GPU jobs
Kueue replaces the Kubernetes scheduler entirely
Jobs in Kubernetes automatically restart on failure, which wastes GPU resources

5. Which combination correctly describes the GPU utilization pattern for batch training versus real-time inference?

Training: bursty and underutilized; Inference: continuously saturated
Training: continuously saturated; Inference: bursty and often underutilized
Both are continuously saturated during operation
Neither uses GPUs — they rely on CPUs for computation

Section 3: The AI/ML Lifecycle on Kubernetes

Pre-Quiz: The AI/ML Lifecycle on Kubernetes

1. A model deployed in production starts returning less accurate predictions over months because user behavior has changed. What is this phenomenon called, and which lifecycle stage addresses it?

Overfitting; addressed by collecting more training data
Model drift; addressed by the monitoring and retraining stage that detects distribution shifts and triggers automated retraining
Data leakage; addressed by fixing the feature engineering pipeline
Underfitting; addressed by increasing model complexity

2. In a distributed training job with 8 worker pods, 7 are running but the 8th cannot be scheduled due to insufficient GPU resources. What scheduling problem does this illustrate?

Resource fragmentation across namespaces
The lack of gang scheduling in the default Kubernetes scheduler, where all workers must start simultaneously to avoid wasting GPU resources
Insufficient resource quotas on the namespace
The NVIDIA device plugin failing to register GPUs

3. Which Kubernetes-native tool provides a unified InferenceService custom resource that abstracts over multiple serving runtimes and supports autoscaling to zero?

Kubeflow Trainer
Kueue
KServe
Kubeflow Pipelines

4. What is the correct ordering of AI/ML lifecycle stages, and how does the cycle close?

Training, Data Prep, Serving, Monitoring — loops back to Training
Data Preparation, Model Training, Model Serving, Monitoring — monitoring triggers retraining when drift is detected
Serving, Training, Monitoring, Data Prep — loops back to Serving
Data Preparation, Monitoring, Training, Serving — no loop

5. NVIDIA's disaggregated inference architecture separates prefill and decode into independent Kubernetes services. Why is this beneficial?

It reduces the total number of GPUs needed to zero
Because prefill is compute-bound and decode is memory-bandwidth-bound, separating them enables fine-grained resource allocation and better GPU utilization
It eliminates the need for model artifacts
It makes Kubernetes scheduling unnecessary

Data Preparation and Feature Engineering

Every AI project begins with data. Data preparation covers ingestion, cleaning, normalization, and transformation. Feature engineering extracts or constructs the numerical representations a model trains on. On Kubernetes, this stage typically runs as batch jobs using distributed data-processing frameworks like Apache Spark, Dask, or Ray, distributed across many pods for parallelism.

Model Training and Experimentation

This is where most GPU spend happens. Teams run many experiments in parallel, varying hyperparameters, architectures, or training data. Distributed training extends a single run across multiple pods. Kubeflow Trainer provides Kubernetes-native custom resources (PyTorchJob, JAXJob) to manage distributed training pod lifecycles.

The key scheduling challenge is gang scheduling: all worker pods must start simultaneously, because a job where 7 of 8 workers are running but waiting for the 8th is consuming 7 GPUs while doing zero useful work. Kueue addresses this with atomic job admission.
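With Kueue, a job is submitted in a suspended state and tagged with a queue name; Kueue unsuspends it only when the entire set of pods can be admitted at once. A sketch, assuming a LocalQueue named team-queue already exists (the queue and image names are assumptions):

```yaml
# 8-worker training job admitted atomically by Kueue: it stays
# suspended until all 8 single-GPU pods can be placed together.
apiVersion: batch/v1
kind: Job
metadata:
  name: ddp-train
  labels:
    kueue.x-k8s.io/queue-name: team-queue   # assumed LocalQueue
spec:
  suspend: true          # Kueue flips this to false on admission
  parallelism: 8
  completions: 8
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/train:latest   # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 1
```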

Model Serving and Inference

KServe is the Kubernetes-native standard for model serving. It provides a unified InferenceService custom resource that abstracts over multiple serving runtimes (Triton, TorchServe, ONNX Runtime, vLLM), handles autoscaling, and can scale to zero when no traffic is present.
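A minimal InferenceService sketch (the model name and storage path are hypothetical), with minReplicas: 0 enabling scale-to-zero:

```yaml
# KServe InferenceService serving an sklearn model; it scales to
# zero when idle and back up when traffic arrives.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-model
spec:
  predictor:
    minReplicas: 0           # allow scale-to-zero
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/churn/v3   # hypothetical artifact path
```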

For large language models, NVIDIA has introduced disaggregated inference that separates prefill (compute-bound) and decode (memory-bandwidth-bound) into independent Kubernetes services for better GPU utilization.

Monitoring and Retraining Loops

A deployed model is not static. Model drift — degradation of model performance as real-world data distributions shift — requires continuous monitoring and automated retraining. On Kubernetes, this involves metrics collection via Prometheus, drift detection via scheduled jobs, and automated retraining triggered when performance drops below thresholds.
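The scheduled drift check is a natural fit for a CronJob. A sketch (image name and schedule are illustrative) that runs nightly at 02:00:

```yaml
# Nightly drift-detection job: compares recent production inputs
# against the training distribution and triggers retraining if needed.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: drift-check
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: check
            image: registry.example.com/drift:latest   # hypothetical image
            command: ["python", "check_drift.py"]
```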

AI/ML Lifecycle on Kubernetes
[Figure: the four stages as a loop. Data Preparation (Spark/Dask/Ray as K8s Jobs) → Model Training (PyTorch/DeepSpeed via Kubeflow Trainer Jobs) → Model Serving (KServe/Triton/vLLM via the InferenceService CRD) → Monitoring (Prometheus/Grafana with drift detection), looping back to training when drift is detected; the whole cycle is orchestrated by Kubeflow Pipelines as DAGs on Kubernetes.]
```mermaid
flowchart TD
    A["Raw Data\nsensor logs, text, images"] --> B["Data Preparation\nSpark / Dask / Ray\nKubernetes Job"]
    B --> C["Feature Store\nor Object Storage"]
    C --> D["Model Training\nPyTorch / DeepSpeed\nKubeflow Trainer Job"]
    D --> E["Model Artifact\nsaved weights"]
    E --> F["Model Serving\nKServe InferenceService\nautoscaling Deployment"]
    F --> G["Production Traffic\nuser predictions"]
    G --> H["Monitoring\nPrometheus / Grafana\nDrift Detection"]
    H -- drift detected --> D
```
| Lifecycle Stage | Primary Tools | Kubernetes Resource Type |
| --- | --- | --- |
| Data preparation | Spark, Dask, Ray | Job, StatefulSet |
| Model training | PyTorch, DeepSpeed, JAX | Job (via Kubeflow Trainer) |
| Experiment tracking | MLflow, Weights & Biases | Deployment |
| Model serving | KServe, Triton, vLLM | InferenceService (CRD) |
| Pipeline orchestration | Kubeflow Pipelines | Custom resources |
| Monitoring | Prometheus, Grafana | Deployment, DaemonSet |

Key Points: The AI/ML Lifecycle on Kubernetes

Post-Quiz: The AI/ML Lifecycle on Kubernetes

1. A model deployed in production starts returning less accurate predictions over months because user behavior has changed. What is this phenomenon called, and which lifecycle stage addresses it?

Overfitting; addressed by collecting more training data
Model drift; addressed by the monitoring and retraining stage that detects distribution shifts and triggers automated retraining
Data leakage; addressed by fixing the feature engineering pipeline
Underfitting; addressed by increasing model complexity

2. In a distributed training job with 8 worker pods, 7 are running but the 8th cannot be scheduled due to insufficient GPU resources. What scheduling problem does this illustrate?

Resource fragmentation across namespaces
The lack of gang scheduling in the default Kubernetes scheduler, where all workers must start simultaneously to avoid wasting GPU resources
Insufficient resource quotas on the namespace
The NVIDIA device plugin failing to register GPUs

3. Which Kubernetes-native tool provides a unified InferenceService custom resource that abstracts over multiple serving runtimes and supports autoscaling to zero?

Kubeflow Trainer
Kueue
KServe
Kubeflow Pipelines

4. What is the correct ordering of AI/ML lifecycle stages, and how does the cycle close?

Training, Data Prep, Serving, Monitoring — loops back to Training
Data Preparation, Model Training, Model Serving, Monitoring — monitoring triggers retraining when drift is detected
Serving, Training, Monitoring, Data Prep — loops back to Serving
Data Preparation, Monitoring, Training, Serving — no loop

5. NVIDIA's disaggregated inference architecture separates prefill and decode into independent Kubernetes services. Why is this beneficial?

It reduces the total number of GPUs needed to zero
Because prefill is compute-bound and decode is memory-bandwidth-bound, separating them enables fine-grained resource allocation and better GPU utilization
It eliminates the need for model artifacts
It makes Kubernetes scheduling unnecessary

Section 4: Kubernetes Architecture Refresher for AI Practitioners

Pre-Quiz: Kubernetes Architecture for AI

1. Which Kubernetes component is responsible for deciding which worker node a new pod should run on?

etcd
The kubelet
The scheduler
The controller manager

2. A research team accidentally submits 100 GPU pods simultaneously in their namespace. With a 16-GPU resource quota, what happens?

All 100 pods are scheduled and compete for GPUs at runtime
Kubernetes admits only as many pods as fit within the 16-GPU quota and rejects the rest
The entire namespace is shut down for exceeding limits
Kubernetes automatically increases the quota to accommodate the request

3. An inference server should always have at least 3 replicas running and automatically replace crashed pods. Which Kubernetes resource type provides this behavior?

Job
CronJob
Deployment
Bare Pod

4. What happens when a pod exceeds its memory limit in Kubernetes?

The pod is throttled to use less memory
The pod is killed (OOMKilled)
The memory limit is automatically increased
Other pods on the same node are evicted to make room

5. In a multi-team AI cluster, the NLP team and computer vision team each have their own namespace. What is the primary purpose of this namespace separation?

Namespaces make pods run faster by reducing scheduling overhead
Namespaces provide logical isolation and enable per-team resource quotas to prevent one team from consuming all cluster resources
Namespaces ensure pods from different teams run on different physical nodes
Namespaces automatically encrypt communication between teams

Control Plane and Worker Node Roles

A Kubernetes cluster consists of control plane nodes and worker nodes. The control plane is the brain: it runs the API server (single entry point for all operations), the scheduler (decides where pods run), the controller manager (reconciles desired vs. actual state), and etcd (distributed key-value store for all cluster configuration).

Worker nodes are where workloads actually run. Each worker runs a kubelet (agent communicating with the control plane) and a container runtime (typically containerd). For AI, worker nodes are the GPU-equipped machines where training and inference pods are scheduled.

Kubernetes Architecture for AI Workloads
[Figure: the control plane (API Server as entry point for all operations; Scheduler with GPU awareness; Controller Manager reconciling desired vs. actual state; etcd as the cluster config key-value store) communicates via kubelets with three worker node types: GPU Worker Node 1 runs the NVIDIA device plugin and a PyTorch DDP training pod holding nvidia.com/gpu: 4; GPU Worker Node 2 runs an autoscaled KServe/vLLM inference pod holding nvidia.com/gpu: 1 with three GPUs idle; a CPU-only worker node runs Spark/Dask preprocessing and Prometheus/Grafana monitoring pods. Every worker runs kubelet + containerd.]

Pods, Deployments, Jobs, and CronJobs

Pod: the smallest schedulable unit. Every AI workload runs inside a pod. Pods are ephemeral — they do not restart unless managed by a higher-level resource.

Deployment: manages a set of identical, long-running pods and ensures the desired replica count is always running. The right resource for inference servers that should always be available.

Job: runs one or more pods to completion. The natural fit for training runs, preprocessing, and evaluation scripts. A completed Job is not restarted.

CronJob: a Job that runs on a schedule. Useful for periodic tasks: nightly feature recomputation, weekly retraining triggers, or hourly data ingestion.
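To make the contrast concrete, here is a minimal Deployment sketch for an always-on inference server (image name hypothetical), keeping three single-GPU replicas alive:

```yaml
# Always-on inference server: Kubernetes replaces any crashed pod
# to maintain 3 replicas, each reserving one GPU.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
      - name: server
        image: registry.example.com/serve:latest   # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 1
```

Swapping kind: Deployment for kind: Job (and dropping replicas) turns the same shape of manifest into a run-to-completion training workload.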

| Resource | Use Case in AI | Restart Behavior |
| --- | --- | --- |
| Pod (bare) | Rarely used directly | Never restarted |
| Deployment | Inference serving, MLflow server | Restarted on failure |
| Job | Training run, preprocessing, evaluation | Runs to completion |
| CronJob | Scheduled retraining, data ingestion | Runs on schedule |
| StatefulSet | Distributed databases, feature stores | Restarted with stable identity |

Namespaces and Resource Quotas

A namespace is a virtual partition within a cluster. Resources in one namespace are logically isolated from resources in another. In multi-team AI environments, namespaces are typically assigned per team (e.g., nlp-team, cv-team).

Resource quotas cap total resource consumption per namespace. Resource requests are what the scheduler reserves; limits are the maximum a pod can consume. For CPU, exceeding the limit causes throttling; for memory, exceeding the limit gets the pod OOMKilled. GPUs are extended resources: the request must equal the limit, and without vendor mechanisms such as NVIDIA MIG or time-slicing, a pod is allocated whole GPUs only.
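The request/limit distinction looks like this in a container spec (values are illustrative):

```yaml
# Requests are reserved by the scheduler; limits are the runtime cap.
resources:
  requests:
    cpu: "4"            # reserved for scheduling decisions
    memory: 16Gi
    nvidia.com/gpu: 1
  limits:
    cpu: "8"            # burst ceiling; beyond this, throttling
    memory: 16Gi        # beyond this, the pod is OOMKilled
    nvidia.com/gpu: 1   # whole GPU; must equal the request
```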

Key Points: Kubernetes Architecture for AI

Post-Quiz: Kubernetes Architecture for AI

1. Which Kubernetes component is responsible for deciding which worker node a new pod should run on?

etcd
The kubelet
The scheduler
The controller manager

2. A research team accidentally submits 100 GPU pods simultaneously in their namespace. With a 16-GPU resource quota, what happens?

All 100 pods are scheduled and compete for GPUs at runtime
Kubernetes admits only as many pods as fit within the 16-GPU quota and rejects the rest
The entire namespace is shut down for exceeding limits
Kubernetes automatically increases the quota to accommodate the request

3. An inference server should always have at least 3 replicas running and automatically replace crashed pods. Which Kubernetes resource type provides this behavior?

Job
CronJob
Deployment
Bare Pod

4. What happens when a pod exceeds its memory limit in Kubernetes?

The pod is throttled to use less memory
The pod is killed (OOMKilled)
The memory limit is automatically increased
Other pods on the same node are evicted to make room

5. In a multi-team AI cluster, the NLP team and computer vision team each have their own namespace. What is the primary purpose of this namespace separation?

Namespaces make pods run faster by reducing scheduling overhead
Namespaces provide logical isolation and enable per-team resource quotas to prevent one team from consuming all cluster resources
Namespaces ensure pods from different teams run on different physical nodes
Namespaces automatically encrypt communication between teams


Answer Explanations