Chapter 1: Introduction to AI on Kubernetes

Learning Objectives

Section 1: Why Kubernetes for AI/ML

Pre-Quiz: Why Kubernetes for AI/ML

1. A distributed training job needs 8 GPUs simultaneously, but only some cluster nodes have GPUs. What Kubernetes mechanism ensures the job lands on appropriate hardware?

Resource quotas reject the job if GPU nodes are busy
The NVIDIA device plugin registers GPUs as schedulable resources, enabling the scheduler to place pods on GPU-equipped nodes
Container images automatically detect and bind to available GPUs at runtime
Kubernetes evicts CPU workloads to make room for GPU requests

2. An AI team trains a model on a local cluster, then deploys it to a cloud provider. Slightly different CUDA toolkit versions cause numerical discrepancies. Which Kubernetes concept most directly prevents this problem?

Namespace isolation ensures environment separation
Kubernetes autoscaling normalizes resource differences across environments
Containerization packages the exact CUDA version and dependencies into a portable image
The Kubernetes API server translates provider-specific configurations automatically

3. What is the primary practical advantage of Kubernetes having 96%+ industry adoption for AI teams specifically?

Higher adoption means Kubernetes clusters run faster
It creates a self-reinforcing ecosystem where every major AI framework and tool builds Kubernetes-native integrations
It eliminates the need for specialized GPU hardware
Adoption levels directly correlate with improved model accuracy

4. Two data science teams share a Kubernetes cluster. One team launches a poorly-tuned training job that consumes all available GPUs. Which Kubernetes feature prevents this from starving the other team?

Pod priority classes automatically kill lower-priority jobs
The container runtime throttles excessive GPU usage
Resource quotas on namespaces cap the total GPU consumption per team
Kubernetes distributes GPU resources equally among all pods by default

5. Compared to traditional HPC systems like Slurm, what is a key trade-off when using Kubernetes for AI workloads?

Kubernetes cannot schedule GPU workloads at all
Kubernetes has a steeper learning curve and more verbose manifests, but offers superior portability and richer ML-specific ecosystem tooling
Slurm supports containerization while Kubernetes does not
Kubernetes is limited to cloud environments only

Resource Orchestration Needs of AI Workloads

Training a large language model or running a computer vision pipeline demands fundamentally different resources than serving a web application. Where a web server might need two CPU cores and a few hundred megabytes of memory, a distributed training job might need dozens of GPUs, terabytes of RAM, and high-bandwidth interconnects simultaneously.

Kubernetes serves as an air-traffic control system for your compute cluster. Through plugins such as the NVIDIA device plugin, the scheduler gains awareness of GPU hardware on each node, placing GPU-hungry training jobs only on nodes that have GPUs available and reserving exactly the number requested.

Beyond GPU placement, Kubernetes enforces resource quotas — hard limits on how much CPU, memory, or GPU a team or namespace can consume. This is critical for multi-tenant clusters where one team's poorly-tuned job could starve every other team's workloads.
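A namespace-level quota of this kind takes only a few lines of YAML. The sketch below is illustrative (the team namespace and the numbers are assumptions, not values from this chapter); it caps one team at 8 GPUs:

```yaml
# Illustrative ResourceQuota: caps total GPU requests in the
# nlp-team namespace at 8, no matter how many pods are submitted.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-gpu-quota
  namespace: nlp-team          # hypothetical team namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"
    requests.cpu: "64"
    requests.memory: 256Gi
```

Once this quota is applied, the API server rejects any pod whose GPU request would push the namespace's total past 8, protecting the other teams' capacity.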

```mermaid
flowchart TD
    A["User submits YAML manifest\nrequesting 4 GPUs"] --> B["Kubernetes API Server\nreceives request"]
    B --> C["Scheduler queries\nnode resource state"]
    C --> D{"Node has\n4+ GPUs available?"}
    D -- No --> E["Node skipped"]
    D -- Yes --> F["Pod bound to\nGPU-equipped node"]
    F --> G["NVIDIA Device Plugin\nexposes nvidia.com/gpu resources"]
    G --> H["Training job runs\non reserved GPUs"]
    E --> C
```

Portability Across Cloud and On-Premises Environments

AI teams frequently move workloads between environments: from local development to staging to production across different cloud providers or on-premises data centers. Kubernetes solves this through containerization — packaging an application and all its dependencies into a portable, self-contained image. Because Kubernetes runs on AWS (EKS), Google Cloud (GKE), Azure (AKS), and on-premises infrastructure using the same API surface, a container image that works in development works in production without modification.

This consistency is not merely convenience — it is a correctness guarantee. AI models are notoriously sensitive to environment variations: a different version of NumPy or a different CUDA toolkit can produce different numerical outputs. Containers eliminate that class of problem by making the environment a reproducible artifact.

Ecosystem Maturity and Community Momentum

According to the CNCF Annual Survey 2023, more than 96% of organizations are using or evaluating Kubernetes. Every major AI framework — PyTorch, TensorFlow, JAX, Hugging Face Transformers — has Kubernetes-native tooling. Platforms like Kubeflow, Ray, and MLflow have all converged on Kubernetes as their deployment substrate.

| Factor | Kubernetes | Traditional HPC (Slurm) | Cloud-Managed VMs |
| --- | --- | --- | --- |
| GPU scheduling | Native via device plugins | Native, mature | Manual or autoscaling groups |
| Portability | Excellent (cloud + on-prem) | Limited (usually on-prem) | Cloud-vendor specific |
| Ecosystem tooling | Very rich (Kubeflow, KServe, Kueue) | Limited ML-specific tooling | Vendor-specific services |
| Multi-tenancy | Namespaces + quotas | Fair-share queues | Separate accounts/projects |
| Learning curve | Steep | Moderate (for HPC teams) | Low initially, high at scale |

Key Points: Why Kubernetes for AI/ML

Post-Quiz: Why Kubernetes for AI/ML

1. A distributed training job needs 8 GPUs simultaneously, but only some cluster nodes have GPUs. What Kubernetes mechanism ensures the job lands on appropriate hardware?

Resource quotas reject the job if GPU nodes are busy
The NVIDIA device plugin registers GPUs as schedulable resources, enabling the scheduler to place pods on GPU-equipped nodes
Container images automatically detect and bind to available GPUs at runtime
Kubernetes evicts CPU workloads to make room for GPU requests

2. An AI team trains a model on a local cluster, then deploys it to a cloud provider. Slightly different CUDA toolkit versions cause numerical discrepancies. Which Kubernetes concept most directly prevents this problem?

Namespace isolation ensures environment separation
Kubernetes autoscaling normalizes resource differences across environments
Containerization packages the exact CUDA version and dependencies into a portable image
The Kubernetes API server translates provider-specific configurations automatically

3. What is the primary practical advantage of Kubernetes having 96%+ industry adoption for AI teams specifically?

Higher adoption means Kubernetes clusters run faster
It creates a self-reinforcing ecosystem where every major AI framework and tool builds Kubernetes-native integrations
It eliminates the need for specialized GPU hardware
Adoption levels directly correlate with improved model accuracy

4. Two data science teams share a Kubernetes cluster. One team launches a poorly-tuned training job that consumes all available GPUs. Which Kubernetes feature prevents this from starving the other team?

Pod priority classes automatically kill lower-priority jobs
The container runtime throttles excessive GPU usage
Resource quotas on namespaces cap the total GPU consumption per team
Kubernetes distributes GPU resources equally among all pods by default

5. Compared to traditional HPC systems like Slurm, what is a key trade-off when using Kubernetes for AI workloads?

Kubernetes cannot schedule GPU workloads at all
Kubernetes has a steeper learning curve and more verbose manifests, but offers superior portability and richer ML-specific ecosystem tooling
Slurm supports containerization while Kubernetes does not
Kubernetes is limited to cloud environments only

Section 2: AI/ML Workload Characteristics

Pre-Quiz: AI/ML Workload Characteristics

1. A team is deciding which Kubernetes resource type to use for their model training pipeline. Training runs for 12 hours and should stop when complete. Which resource type is the best fit, and why?

Deployment, because it ensures the training process is always running
Job, because it runs pods to completion and does not restart them afterward
CronJob, because long-running tasks should be scheduled
StatefulSet, because training requires persistent state

2. An inference service experiences a sudden 50x traffic spike. What characteristic of real-time inference workloads makes this scenario fundamentally different from a batch training spike?

Inference requires more total GPU memory than training
Inference must maintain low latency under unpredictable load, requiring horizontal scale-out rather than scale-up
Training spikes are more expensive than inference spikes
Inference workloads cannot use GPUs during traffic spikes

3. A training pod's GPU is spending 60% of its time idle, waiting for data from storage. What is this problem called and what storage strategy addresses it?

GPU fragmentation; solved by using fractional GPU allocation
The I/O bottleneck; solved by using high-throughput storage like NVMe-backed PersistentVolumes for hot data
Memory thrashing; solved by increasing pod memory limits
Scheduler contention; solved by increasing the number of training replicas

4. Why does Kubernetes need additional tools like Kueue for AI batch workloads, even though it already supports Job resources?

Kubernetes Jobs cannot run on GPU nodes
The default scheduler lacks queue management, preemption, and fair-share scheduling needed for concurrent batch GPU jobs
Kueue replaces the Kubernetes scheduler entirely
Jobs in Kubernetes automatically restart on failure, which wastes GPU resources

5. Which combination correctly describes the GPU utilization pattern for batch training versus real-time inference?

Training: bursty and underutilized; Inference: continuously saturated
Training: continuously saturated; Inference: bursty and often underutilized
Both are continuously saturated during operation
Neither uses GPUs — they rely on CPUs for computation

Batch Training vs Real-Time Inference Patterns

Batch training is a compute marathon: a job starts, runs for hours or days consuming enormous GPU resources, and produces a saved model artifact. Training is not user-facing, so per-request latency is irrelevant; what matters is throughput and time to completion.

Real-time inference (online serving) is the opposite. A trained model sits behind an API endpoint, and users expect predictions in milliseconds. Inference workloads must handle unpredictable traffic spikes — scaling out horizontally by adding replicas rather than scaling up with more GPUs per pod.

| Dimension | Batch Training | Real-Time Inference |
| --- | --- | --- |
| Duration | Hours to days | Milliseconds per request |
| GPU utilization | Continuously saturated | Bursty, often underutilized |
| Latency sensitivity | Low | Very high |
| Traffic pattern | Predictable (job-scheduled) | Unpredictable spikes |
| Scaling strategy | Scale-up (more GPUs per job) | Scale-out (more replicas) |
| Failure tolerance | Restart job or checkpoint | Must be highly available |
Batch Training vs Real-Time Inference
[Figure: side-by-side comparison. Batch training (K8s resource: Job): a 100GB+ dataset feeds 4-GPU training pods saturated at ~95% utilization for hours to days on a predictable schedule, scaling up with more GPUs per job and restarting from checkpoints on failure, ending in a model artifact. Real-time inference (K8s resource: Deployment + Service): autoscaled single-GPU pods answer unpredictable traffic spikes in under 100ms at a bursty 30-70% GPU utilization, scaling out with more replicas and requiring high availability.]

GPU and Accelerator Requirements

Kubernetes does not understand GPUs out of the box. Hardware vendors publish device plugins that extend the Kubernetes API to expose accelerators as schedulable resources. The NVIDIA device plugin registers each GPU as an nvidia.com/gpu resource. A pod can request nvidia.com/gpu: 2 in its resource spec, and the scheduler places it only on a node with at least two available NVIDIA GPUs.
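As a minimal sketch (the image name is hypothetical), such a request looks like:

```yaml
# Minimal pod requesting 2 NVIDIA GPUs. For extended resources like
# nvidia.com/gpu, you specify the limit; the request defaults to the
# same value and cannot differ from it.
apiVersion: v1
kind: Pod
metadata:
  name: two-gpu-trainer
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: registry.example.com/train:latest   # hypothetical image
    command: ["python", "train.py"]
    resources:
      limits:
        nvidia.com/gpu: 2
```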

GPU resources are expensive: a multi-GPU cloud A100 instance can cost more than $30 per hour. Setting precise requests and limits in your pod spec is not optional for cost control.

Data-Intensive I/O Patterns

Training a large model requires moving enormous volumes of data into GPU memory efficiently. A GPU that spends more time waiting for data from disk than computing gradients is an expensive idle resource. This is the I/O bottleneck. In Kubernetes, choosing the right PersistentVolume type — fast NVMe-backed storage for hot data, object storage for cold datasets — is critical.
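In practice this means binding the training pod's data volume to a fast storage class. A sketch, assuming the cluster administrator has defined an NVMe-backed StorageClass named fast-nvme (the class name and size are illustrative):

```yaml
# Claim 500Gi of NVMe-backed storage for the hot training dataset.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: fast-nvme    # assumed NVMe-backed class
  resources:
    requests:
      storage: 500Gi
```

The training pod then mounts this claim as a volume, keeping hot data on fast local storage while cold datasets remain in cheaper object storage.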

Long-Running and Ephemeral Job Types

AI workloads span both extremes: a multi-day distributed training run must tolerate node failures gracefully via checkpointing, while a feature engineering step might be ephemeral — a pod spins up, processes data, writes results, and exits. The infrastructure must support tens to thousands of concurrent long-running batch jobs, which is a different operational profile from what Kubernetes was originally optimized for.
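An ephemeral step of this kind maps naturally onto a Job. The sketch below (image name hypothetical) runs once, retries at most twice on failure, and is garbage-collected an hour after finishing:

```yaml
# One-shot feature engineering step: runs to completion and is
# cleaned up automatically after an hour.
apiVersion: batch/v1
kind: Job
metadata:
  name: feature-prep
spec:
  backoffLimit: 2
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: prep
        image: registry.example.com/features:latest   # hypothetical image
        command: ["python", "build_features.py"]
```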

Key Points: AI/ML Workload Characteristics

Post-Quiz: AI/ML Workload Characteristics

1. A team is deciding which Kubernetes resource type to use for their model training pipeline. Training runs for 12 hours and should stop when complete. Which resource type is the best fit, and why?

Deployment, because it ensures the training process is always running
Job, because it runs pods to completion and does not restart them afterward
CronJob, because long-running tasks should be scheduled
StatefulSet, because training requires persistent state

2. An inference service experiences a sudden 50x traffic spike. What characteristic of real-time inference workloads makes this scenario fundamentally different from a batch training spike?

Inference requires more total GPU memory than training
Inference must maintain low latency under unpredictable load, requiring horizontal scale-out rather than scale-up
Training spikes are more expensive than inference spikes
Inference workloads cannot use GPUs during traffic spikes

3. A training pod's GPU is spending 60% of its time idle, waiting for data from storage. What is this problem called and what storage strategy addresses it?

GPU fragmentation; solved by using fractional GPU allocation
The I/O bottleneck; solved by using high-throughput storage like NVMe-backed PersistentVolumes for hot data
Memory thrashing; solved by increasing pod memory limits
Scheduler contention; solved by increasing the number of training replicas

4. Why does Kubernetes need additional tools like Kueue for AI batch workloads, even though it already supports Job resources?

Kubernetes Jobs cannot run on GPU nodes
The default scheduler lacks queue management, preemption, and fair-share scheduling needed for concurrent batch GPU jobs
Kueue replaces the Kubernetes scheduler entirely
Jobs in Kubernetes automatically restart on failure, which wastes GPU resources

5. Which combination correctly describes the GPU utilization pattern for batch training versus real-time inference?

Training: bursty and underutilized; Inference: continuously saturated
Training: continuously saturated; Inference: bursty and often underutilized
Both are continuously saturated during operation
Neither uses GPUs — they rely on CPUs for computation

Section 3: The AI/ML Lifecycle on Kubernetes

Pre-Quiz: The AI/ML Lifecycle on Kubernetes

1. A model deployed in production starts returning less accurate predictions over months because user behavior has changed. What is this phenomenon called, and which lifecycle stage addresses it?

Overfitting; addressed by collecting more training data
Model drift; addressed by the monitoring and retraining stage that detects distribution shifts and triggers automated retraining
Data leakage; addressed by fixing the feature engineering pipeline
Underfitting; addressed by increasing model complexity

2. In a distributed training job with 8 worker pods, 7 are running but the 8th cannot be scheduled due to insufficient GPU resources. What scheduling problem does this illustrate?

Resource fragmentation across namespaces
The lack of gang scheduling in the default Kubernetes scheduler, where all workers must start simultaneously to avoid wasting GPU resources
Insufficient resource quotas on the namespace
The NVIDIA device plugin failing to register GPUs

3. Which Kubernetes-native tool provides a unified InferenceService custom resource that abstracts over multiple serving runtimes and supports autoscaling to zero?

Kubeflow Trainer
Kueue
KServe
Kubeflow Pipelines

4. What is the correct ordering of AI/ML lifecycle stages, and how does the cycle close?

Training, Data Prep, Serving, Monitoring — loops back to Training
Data Preparation, Model Training, Model Serving, Monitoring — monitoring triggers retraining when drift is detected
Serving, Training, Monitoring, Data Prep — loops back to Serving
Data Preparation, Monitoring, Training, Serving — no loop

5. NVIDIA's disaggregated inference architecture separates prefill and decode into independent Kubernetes services. Why is this beneficial?

It reduces the total number of GPUs needed to zero
Because prefill is compute-bound and decode is memory-bandwidth-bound, separating them enables fine-grained resource allocation and better GPU utilization
It eliminates the need for model artifacts
It makes Kubernetes scheduling unnecessary

Data Preparation and Feature Engineering

Every AI project begins with data. Data preparation covers ingestion, cleaning, normalization, and transformation. Feature engineering extracts or constructs the numerical representations a model trains on. On Kubernetes, this stage typically runs as batch jobs using distributed data-processing frameworks like Apache Spark, Dask, or Ray, distributed across many pods for parallelism.

Model Training and Experimentation

This is where most GPU spend happens. Teams run many experiments in parallel, varying hyperparameters, architectures, or training data. Distributed training extends a single run across multiple pods. Kubeflow Trainer provides Kubernetes-native custom resources (PyTorchJob, JAXJob) to manage distributed training pod lifecycles.

The key scheduling challenge is gang scheduling: all worker pods must start simultaneously, because a job where 7 of 8 workers are running but waiting for the 8th is consuming 7 GPUs while doing zero useful work. Kueue addresses this with atomic job admission.
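With Kueue, a job is submitted in a suspended state and tagged with a queue name; Kueue unsuspends it only when the entire set of pods can be admitted at once. A sketch, assuming a LocalQueue named team-queue already exists (the queue and image names are assumptions):

```yaml
# 8-worker training job admitted atomically by Kueue: it stays
# suspended until all 8 single-GPU pods can be placed together.
apiVersion: batch/v1
kind: Job
metadata:
  name: ddp-train
  labels:
    kueue.x-k8s.io/queue-name: team-queue   # assumed LocalQueue
spec:
  suspend: true          # Kueue flips this to false on admission
  parallelism: 8
  completions: 8
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/train:latest   # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 1
```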

Model Serving and Inference

KServe is the Kubernetes-native standard for model serving. It provides a unified InferenceService custom resource that abstracts over multiple serving runtimes (Triton, TorchServe, ONNX Runtime, vLLM), handles autoscaling, and can scale to zero when no traffic is present.
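A minimal InferenceService sketch (the model name and storage path are hypothetical), with minReplicas: 0 enabling scale-to-zero:

```yaml
# KServe InferenceService serving an sklearn model; it scales to
# zero when idle and back up when traffic arrives.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-model
spec:
  predictor:
    minReplicas: 0           # allow scale-to-zero
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/churn/v3   # hypothetical artifact path
```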

For large language models, NVIDIA has introduced disaggregated inference that separates prefill (compute-bound) and decode (memory-bandwidth-bound) into independent Kubernetes services for better GPU utilization.

Monitoring and Retraining Loops

A deployed model is not static. Model drift — degradation of model performance as real-world data distributions shift — requires continuous monitoring and automated retraining. On Kubernetes, this involves metrics collection via Prometheus, drift detection via scheduled jobs, and automated retraining triggered when performance drops below thresholds.
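The scheduled drift check is a natural fit for a CronJob. A sketch (image name and schedule are illustrative) that runs nightly at 02:00:

```yaml
# Nightly drift-detection job: compares recent production inputs
# against the training distribution and triggers retraining if needed.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: drift-check
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: check
            image: registry.example.com/drift:latest   # hypothetical image
            command: ["python", "check_drift.py"]
```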

AI/ML Lifecycle on Kubernetes
[Figure: the four stages as a loop. Data Preparation (Spark/Dask/Ray as K8s Jobs) → Model Training (PyTorch/DeepSpeed via Kubeflow Trainer Jobs) → Model Serving (KServe/Triton/vLLM via the InferenceService CRD) → Monitoring (Prometheus/Grafana with drift detection), looping back to training when drift is detected; the whole cycle is orchestrated by Kubeflow Pipelines as DAGs on Kubernetes.]
```mermaid
flowchart TD
    A["Raw Data\nsensor logs, text, images"] --> B["Data Preparation\nSpark / Dask / Ray\nKubernetes Job"]
    B --> C["Feature Store\nor Object Storage"]
    C --> D["Model Training\nPyTorch / DeepSpeed\nKubeflow Trainer Job"]
    D --> E["Model Artifact\nsaved weights"]
    E --> F["Model Serving\nKServe InferenceService\nautoscaling Deployment"]
    F --> G["Production Traffic\nuser predictions"]
    G --> H["Monitoring\nPrometheus / Grafana\nDrift Detection"]
    H -- drift detected --> D
```
| Lifecycle Stage | Primary Tools | Kubernetes Resource Type |
| --- | --- | --- |
| Data preparation | Spark, Dask, Ray | Job, StatefulSet |
| Model training | PyTorch, DeepSpeed, JAX | Job (via Kubeflow Trainer) |
| Experiment tracking | MLflow, Weights & Biases | Deployment |
| Model serving | KServe, Triton, vLLM | InferenceService (CRD) |
| Pipeline orchestration | Kubeflow Pipelines | Custom resources |
| Monitoring | Prometheus, Grafana | Deployment, DaemonSet |

Key Points: The AI/ML Lifecycle on Kubernetes

Post-Quiz: The AI/ML Lifecycle on Kubernetes

1. A model deployed in production starts returning less accurate predictions over months because user behavior has changed. What is this phenomenon called, and which lifecycle stage addresses it?

Overfitting; addressed by collecting more training data
Model drift; addressed by the monitoring and retraining stage that detects distribution shifts and triggers automated retraining
Data leakage; addressed by fixing the feature engineering pipeline
Underfitting; addressed by increasing model complexity

2. In a distributed training job with 8 worker pods, 7 are running but the 8th cannot be scheduled due to insufficient GPU resources. What scheduling problem does this illustrate?

Resource fragmentation across namespaces
The lack of gang scheduling in the default Kubernetes scheduler, where all workers must start simultaneously to avoid wasting GPU resources
Insufficient resource quotas on the namespace
The NVIDIA device plugin failing to register GPUs

3. Which Kubernetes-native tool provides a unified InferenceService custom resource that abstracts over multiple serving runtimes and supports autoscaling to zero?

Kubeflow Trainer
Kueue
KServe
Kubeflow Pipelines

4. What is the correct ordering of AI/ML lifecycle stages, and how does the cycle close?

Training, Data Prep, Serving, Monitoring — loops back to Training
Data Preparation, Model Training, Model Serving, Monitoring — monitoring triggers retraining when drift is detected
Serving, Training, Monitoring, Data Prep — loops back to Serving
Data Preparation, Monitoring, Training, Serving — no loop

5. NVIDIA's disaggregated inference architecture separates prefill and decode into independent Kubernetes services. Why is this beneficial?

It reduces the total number of GPUs needed to zero
Because prefill is compute-bound and decode is memory-bandwidth-bound, separating them enables fine-grained resource allocation and better GPU utilization
It eliminates the need for model artifacts
It makes Kubernetes scheduling unnecessary

Section 4: Kubernetes Architecture Refresher for AI Practitioners

Pre-Quiz: Kubernetes Architecture for AI

1. Which Kubernetes component is responsible for deciding which worker node a new pod should run on?

etcd
The kubelet
The scheduler
The controller manager

2. A research team accidentally submits 100 GPU pods simultaneously in their namespace. With a 16-GPU resource quota, what happens?

All 100 pods are scheduled and compete for GPUs at runtime
Kubernetes admits only as many pods as fit within the 16-GPU quota and rejects the rest
The entire namespace is shut down for exceeding limits
Kubernetes automatically increases the quota to accommodate the request

3. An inference server should always have at least 3 replicas running and automatically replace crashed pods. Which Kubernetes resource type provides this behavior?

Job
CronJob
Deployment
Bare Pod

4. What happens when a pod exceeds its memory limit in Kubernetes?

The pod is throttled to use less memory
The pod is killed (OOMKilled)
The memory limit is automatically increased
Other pods on the same node are evicted to make room

5. In a multi-team AI cluster, the NLP team and computer vision team each have their own namespace. What is the primary purpose of this namespace separation?

Namespaces make pods run faster by reducing scheduling overhead
Namespaces provide logical isolation and enable per-team resource quotas to prevent one team from consuming all cluster resources
Namespaces ensure pods from different teams run on different physical nodes
Namespaces automatically encrypt communication between teams

Control Plane and Worker Node Roles

A Kubernetes cluster consists of control plane nodes and worker nodes. The control plane is the brain: it runs the API server (single entry point for all operations), the scheduler (decides where pods run), the controller manager (reconciles desired vs. actual state), and etcd (distributed key-value store for all cluster configuration).

Worker nodes are where workloads actually run. Each worker runs a kubelet (agent communicating with the control plane) and a container runtime (typically containerd). For AI, worker nodes are the GPU-equipped machines where training and inference pods are scheduled.

Kubernetes Architecture for AI Workloads
[Figure: the control plane (API Server as entry point for all operations; Scheduler with GPU awareness; Controller Manager reconciling desired vs. actual state; etcd as the cluster config key-value store) communicates via kubelets with three worker node types: GPU Worker Node 1 runs the NVIDIA device plugin and a PyTorch DDP training pod holding nvidia.com/gpu: 4; GPU Worker Node 2 runs an autoscaled KServe/vLLM inference pod holding nvidia.com/gpu: 1 with three GPUs idle; a CPU-only worker node runs Spark/Dask preprocessing and Prometheus/Grafana monitoring pods. Every worker runs kubelet + containerd.]

Pods, Deployments, Jobs, and CronJobs

Pod: the smallest schedulable unit. Every AI workload runs inside a pod. Pods are ephemeral — they do not restart unless managed by a higher-level resource.

Deployment: manages a set of identical, long-running pods and ensures the desired replica count is always running. The right resource for inference servers that should always be available.

Job: runs one or more pods to completion. The natural fit for training runs, preprocessing, and evaluation scripts. A completed Job is not restarted.

CronJob: a Job that runs on a schedule. Useful for periodic tasks: nightly feature recomputation, weekly retraining triggers, or hourly data ingestion.
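To make the contrast concrete, here is a minimal Deployment sketch for an always-on inference server (image name hypothetical), keeping three single-GPU replicas alive:

```yaml
# Always-on inference server: Kubernetes replaces any crashed pod
# to maintain 3 replicas, each reserving one GPU.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
      - name: server
        image: registry.example.com/serve:latest   # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 1
```

Swapping kind: Deployment for kind: Job (and dropping replicas) turns the same shape of manifest into a run-to-completion training workload.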

| Resource | Use Case in AI | Restart Behavior |
| --- | --- | --- |
| Pod (bare) | Rarely used directly | Never restarted |
| Deployment | Inference serving, MLflow server | Restarted on failure |
| Job | Training run, preprocessing, evaluation | Runs to completion |
| CronJob | Scheduled retraining, data ingestion | Runs on schedule |
| StatefulSet | Distributed databases, feature stores | Restarted with stable identity |

Namespaces and Resource Quotas

A namespace is a virtual partition within a cluster. Resources in one namespace are logically isolated from resources in another. In multi-team AI environments, namespaces are typically assigned per team (e.g., nlp-team, cv-team).

Resource quotas cap total resource consumption per namespace. Resource requests are what the scheduler reserves; limits are the maximum a pod can consume. For CPU, exceeding the limit causes throttling; for memory, exceeding the limit gets the pod OOMKilled. GPUs are extended resources: the request must equal the limit, and without vendor mechanisms such as NVIDIA MIG or time-slicing, a pod is allocated whole GPUs only.
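The request/limit distinction looks like this in a container spec (values are illustrative):

```yaml
# Requests are reserved by the scheduler; limits are the runtime cap.
resources:
  requests:
    cpu: "4"            # reserved for scheduling decisions
    memory: 16Gi
    nvidia.com/gpu: 1
  limits:
    cpu: "8"            # burst ceiling; beyond this, throttling
    memory: 16Gi        # beyond this, the pod is OOMKilled
    nvidia.com/gpu: 1   # whole GPU; must equal the request
```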

Key Points: Kubernetes Architecture for AI

Post-Quiz: Kubernetes Architecture for AI

1. Which Kubernetes component is responsible for deciding which worker node a new pod should run on?

etcd
The kubelet
The scheduler
The controller manager

2. A research team accidentally submits 100 GPU pods simultaneously in their namespace. With a 16-GPU resource quota, what happens?

All 100 pods are scheduled and compete for GPUs at runtime
Kubernetes admits only as many pods as fit within the 16-GPU quota and rejects the rest
The entire namespace is shut down for exceeding limits
Kubernetes automatically increases the quota to accommodate the request

3. An inference server should always have at least 3 replicas running and automatically replace crashed pods. Which Kubernetes resource type provides this behavior?

Job
CronJob
Deployment
Bare Pod

4. What happens when a pod exceeds its memory limit in Kubernetes?

The pod is throttled to use less memory
The pod is killed (OOMKilled)
The memory limit is automatically increased
Other pods on the same node are evicted to make room

5. In a multi-team AI cluster, the NLP team and computer vision team each have their own namespace. What is the primary purpose of this namespace separation?

Namespaces make pods run faster by reducing scheduling overhead
Namespaces provide logical isolation and enable per-team resource quotas to prevent one team from consuming all cluster resources
Namespaces ensure pods from different teams run on different physical nodes
Namespaces automatically encrypt communication between teams


Answer Explanations