Configure GPU scheduling and resource requests in Kubernetes pods
Deploy and manage NVIDIA device plugins and GPU operators
Implement GPU sharing strategies including MIG and time-slicing
Troubleshoot common GPU allocation and driver issues
Section 1: GPU Fundamentals for Kubernetes
Pre-Quiz — What do you already know?
1. What is the primary role of a Kubernetes device plugin for GPUs?
It compiles CUDA code inside containers at runtime
It discovers GPU hardware, advertises it as extended resources, and handles device allocation to pods
It replaces the container runtime to support GPU workloads
It installs NVIDIA drivers directly onto worker node operating systems
2. When a pod requests nvidia.com/gpu: 1, what component is responsible for mounting the actual GPU device files into the container?
The Kubernetes API server
The kube-scheduler
The NVIDIA Container Toolkit, after the device plugin allocates the device
The etcd database
3. How does the NVIDIA device plugin communicate with the kubelet?
Through the Kubernetes API server using REST calls
Via gRPC over a Unix socket at /var/lib/kubelet/device-plugins/
Through environment variables set in the pod spec
By writing directly to the container filesystem
4. What is the key advantage of the Container Device Interface (CDI) over the traditional device plugin approach?
CDI provides faster GPU computation speeds
CDI decouples device injection from the container runtime, improving portability across containerd and CRI-O
CDI automatically installs GPU drivers on new nodes
CDI replaces the need for CUDA libraries entirely
5. Why does the NVIDIA Container Toolkit inject driver libraries at runtime rather than requiring them in every container image?
Container images cannot contain binary libraries
It decouples the workload image from the host driver version, so a single image works across nodes with different driver versions
Runtime injection is faster than pre-baked libraries
NVIDIA licensing prevents including drivers in container images
The GPU Landscape for AI
GPUs were originally designed for rendering pixels in parallel. That same parallel architecture — thousands of smaller compute cores working simultaneously — is exactly what neural network training and inference require. Three major vendors compete in the Kubernetes AI workload space:
| Vendor | Key Products | Kubernetes Support | Notes |
| --- | --- | --- | --- |
| NVIDIA | A100, H100, L40, RTX 40-series | Mature (device plugin + GPU Operator) | Dominant; CUDA ecosystem widely supported |
| AMD | Instinct MI300, RX 7900 XTX | Stable (ROCm device plugin) | Growing adoption; ROCm is AMD's CUDA equivalent |
| Intel | Gaudi 2/3, Arc GPUs | Emerging (Intel Device Plugin) | Strong for data center inference; OpenVINO |
NVIDIA dominates Kubernetes AI workloads because the CUDA (Compute Unified Device Architecture) ecosystem has been the default target for nearly every major AI framework (PyTorch, TensorFlow, JAX). However, the architectural patterns — device plugins, resource requests, sharing strategies — apply across all vendors.
Device Plugin Architecture and Discovery
Kubernetes was designed around a generic resource model. GPUs are not built-in concepts — they are extended resources: arbitrary named quantities a node advertises and a pod can request. The device plugin framework bridges physical GPU hardware and the Kubernetes resource model.
A device plugin is a small program running as a DaemonSet on each GPU node. It communicates with the kubelet over a Unix socket using gRPC and performs three core functions:
Discovery — Detects available devices using NVML (NVIDIA Management Library)
Advertisement — Registers devices as extended resources (e.g., nvidia.com/gpu: 4) so the scheduler is aware of them
Allocation — When a pod needs a GPU, the plugin injects the correct device files and environment variables into the container
Animation: Device Plugin Architecture — Discovery, Advertisement, and Allocation
Container Runtime GPU Passthrough
Even with a device plugin advertising GPU resources, the container itself still needs to use those GPUs. The NVIDIA Container Toolkit (formerly nvidia-docker2) hooks into the container runtime (containerd or CRI-O) and, at container launch time, injects the correct GPU devices and driver libraries without requiring those libraries in every container image.
What happens when a GPU pod starts:
Pod spec requests nvidia.com/gpu: 1
Scheduler finds a node with available nvidia.com/gpu capacity
kubelet calls the NVIDIA device plugin to allocate one GPU
Device plugin returns device paths (/dev/nvidia0) and environment variables
NVIDIA Container Toolkit intercepts the container start, mounts /dev/nvidia0, and injects CUDA libraries
Container starts with full GPU access
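The first step in that sequence — the pod spec — can be as minimal as the following sketch (the image tag and pod name are illustrative; any CUDA-enabled image works):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative CUDA base image
      command: ["nvidia-smi"]                             # prints GPU info if passthrough works
      resources:
        limits:
          nvidia.com/gpu: 1   # extended resource; triggers the allocation flow above
```

If everything in the chain is working, the pod's logs show the `nvidia-smi` table for the single allocated GPU.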
Key Takeaway
GPUs enter Kubernetes as extended resources via the device plugin framework
The NVIDIA device plugin runs as a DaemonSet, discovers GPUs with NVML, and advertises them as nvidia.com/gpu
The device plugin communicates with kubelet via gRPC over a Unix socket
The Container Toolkit handles runtime injection of GPU drivers and device files into containers — decoupling workload images from host driver versions
CDI (Container Device Interface) is an emerging standard for more portable device injection
Post-Quiz — What did you learn?
1. What is the primary role of a Kubernetes device plugin for GPUs?
It compiles CUDA code inside containers at runtime
It discovers GPU hardware, advertises it as extended resources, and handles device allocation to pods
It replaces the container runtime to support GPU workloads
It installs NVIDIA drivers directly onto worker node operating systems
2. When a pod requests nvidia.com/gpu: 1, what component is responsible for mounting the actual GPU device files into the container?
The Kubernetes API server
The kube-scheduler
The NVIDIA Container Toolkit, after the device plugin allocates the device
The etcd database
3. How does the NVIDIA device plugin communicate with the kubelet?
Through the Kubernetes API server using REST calls
Via gRPC over a Unix socket at /var/lib/kubelet/device-plugins/
Through environment variables set in the pod spec
By writing directly to the container filesystem
4. What is the key advantage of the Container Device Interface (CDI) over the traditional device plugin approach?
CDI provides faster GPU computation speeds
CDI decouples device injection from the container runtime, improving portability across containerd and CRI-O
CDI automatically installs GPU drivers on new nodes
CDI replaces the need for CUDA libraries entirely
5. Why does the NVIDIA Container Toolkit inject driver libraries at runtime rather than requiring them in every container image?
Container images cannot contain binary libraries
It decouples the workload image from the host driver version, so a single image works across nodes with different driver versions
Runtime injection is faster than pre-baked libraries
NVIDIA licensing prevents including drivers in container images
Section 2: NVIDIA GPU Operator
Pre-Quiz — What do you already know?
1. What problem does the NVIDIA GPU Operator solve that manual GPU management does not?
It provides GPU hardware at a lower cost
It automates the entire GPU software lifecycle — drivers, runtime config, device plugin, monitoring — using a reconciliation loop
It increases the number of CUDA cores available per GPU
It replaces Kubernetes scheduling entirely for GPU workloads
2. Why does the GPU Operator deploy NVIDIA drivers as a container rather than installing them directly on the host?
Containerized drivers run faster than native kernel modules
It avoids host OS version lock-in and simplifies node upgrades by keeping driver lifecycle independent of the host
Host-installed drivers are not compatible with Kubernetes
Container images are required by NVIDIA licensing
3. What happens if the CUDA Validator component in the GPU Operator's provisioning workflow fails?
The GPU Operator ignores the failure and proceeds to schedule workloads
The node is immediately drained and removed from the cluster
The node remains unschedulable for GPU workloads; the Operator logs the error and retries provisioning
All other GPU nodes in the cluster are also paused until the issue is resolved
4. What is the purpose of GPU Feature Discovery (GFD)?
It discovers new GPU hardware models before they are released
It automatically labels nodes with detailed GPU metadata (model, memory, CUDA version, MIG capability) for scheduling
It discovers and installs missing CUDA libraries on worker nodes
It monitors GPU temperature and automatically throttles workloads
5. A pod requires an 80 GB A100 GPU with MIG support. How can this requirement be expressed in Kubernetes without manually maintaining node pools?
By setting a CPU request equal to the number of GPU cores needed
By using node affinity rules that match GFD-generated labels like nvidia.com/gpu.product and nvidia.com/mig.capable
By deploying a separate scheduler exclusively for A100 nodes
By adding a ConfigMap that maps pod names to specific nodes
The Operator Pattern for GPU Management
Managing GPU nodes manually — installing drivers, configuring the container toolkit, deploying the device plugin, setting up monitoring — is a fragile, error-prone process. The NVIDIA GPU Operator encodes this operational knowledge into a Kubernetes Operator: a controller that continuously watches cluster state and reconciles it toward the desired GPU software configuration.
Components Managed by the GPU Operator
| Component | Purpose |
| --- | --- |
| NVIDIA Drivers | Kernel module enabling CUDA; deployed as a privileged container |
| Container Toolkit | Hooks the container runtime to inject GPU access into pods |
| Device Plugin | Advertises nvidia.com/gpu extended resources to the scheduler |
| GPU Feature Discovery | Labels nodes with GPU metadata (model, memory, CUDA version) |
| DCGM Exporter | Exposes GPU metrics (utilization, temperature, memory) for Prometheus |
| MIG Manager | Configures MIG partitioning on supported hardware |
| CUDA Validator | Runs a test workload to confirm CUDA is functional before scheduling |
Installation via Helm
```shell
# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Create the namespace and label it for privileged pod security
kubectl create namespace gpu-operator
kubectl label --overwrite namespace gpu-operator \
  pod-security.kubernetes.io/enforce=privileged

# Install the GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --wait
```
Automated Provisioning Workflow
For each new GPU node, the Operator follows a three-step workflow:
Discovery — Identifies nodes with NVIDIA GPUs using labels and hardware detection
Installation and Configuration — Deploys the driver container, configures the Container Toolkit, starts the device plugin and monitoring stack
Validation — The CUDA Validator runs a test workload; only after it passes does the node accept GPU workloads
The validation gate is critical: it prevents misconfigured nodes from silently accepting GPU pods that would then fail at runtime.
Node Labeling and GPU Feature Discovery
GPU Feature Discovery (GFD) automatically labels nodes with rich GPU metadata, enabling fine-grained scheduling decisions. Example labels on a node with an A100:
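The exact label set varies with the GFD and driver versions, but on an A100 node it typically includes entries like these (the values shown are illustrative):

```yaml
nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
nvidia.com/gpu.memory: "81920"        # MiB
nvidia.com/gpu.count: "8"
nvidia.com/cuda.driver.major: "550"   # host driver major version
nvidia.com/mig.capable: "true"
```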
With these labels, pods can express hardware requirements as node affinity rules rather than relying on manually maintained node pools.
Key Takeaway
The GPU Operator automates the entire GPU software lifecycle — drivers, runtime, device plugin, monitoring — using the Kubernetes Operator pattern
Drivers are deployed as containerized privileged workloads, avoiding host OS version lock-in
The CUDA Validator is a critical gate that prevents misconfigured nodes from receiving workloads
GFD labels provide rich node metadata (GPU model, memory, CUDA version, MIG capability) enabling intelligent scheduling
Post-Quiz — What did you learn?
1. What problem does the NVIDIA GPU Operator solve that manual GPU management does not?
It provides GPU hardware at a lower cost
It automates the entire GPU software lifecycle — drivers, runtime config, device plugin, monitoring — using a reconciliation loop
It increases the number of CUDA cores available per GPU
It replaces Kubernetes scheduling entirely for GPU workloads
2. Why does the GPU Operator deploy NVIDIA drivers as a container rather than installing them directly on the host?
Containerized drivers run faster than native kernel modules
It avoids host OS version lock-in and simplifies node upgrades by keeping driver lifecycle independent of the host
Host-installed drivers are not compatible with Kubernetes
Container images are required by NVIDIA licensing
3. What happens if the CUDA Validator component in the GPU Operator's provisioning workflow fails?
The GPU Operator ignores the failure and proceeds to schedule workloads
The node is immediately drained and removed from the cluster
The node remains unschedulable for GPU workloads; the Operator logs the error and retries provisioning
All other GPU nodes in the cluster are also paused until the issue is resolved
4. What is the purpose of GPU Feature Discovery (GFD)?
It discovers new GPU hardware models before they are released
It automatically labels nodes with detailed GPU metadata (model, memory, CUDA version, MIG capability) for scheduling
It discovers and installs missing CUDA libraries on worker nodes
It monitors GPU temperature and automatically throttles workloads
5. A pod requires an 80 GB A100 GPU with MIG support. How can this requirement be expressed in Kubernetes without manually maintaining node pools?
By setting a CPU request equal to the number of GPU cores needed
By using node affinity rules that match GFD-generated labels like nvidia.com/gpu.product and nvidia.com/mig.capable
By deploying a separate scheduler exclusively for A100 nodes
By adding a ConfigMap that maps pod names to specific nodes
Section 3: GPU Scheduling and Resource Requests
Pre-Quiz — What do you already know?
1. A pod spec requests nvidia.com/gpu: 1 but does not include a toleration for the nvidia.com/gpu:NoSchedule taint. What happens?
The pod is scheduled on the GPU node but runs without GPU access
The pod remains Pending because it cannot be scheduled on any tainted GPU node
The scheduler automatically adds the missing toleration
The pod is placed on a CPU-only node with emulated GPU access
2. Why must GPU resource requests and limits be equal in a Kubernetes pod spec?
Because GPUs are too expensive to overcommit — Kubernetes enforces 1:1 allocation for extended resources
This is just a recommendation, not a requirement
Because GPU memory cannot be measured in fractional units
Because the device plugin does not support the requests field
3. A cluster has both A100 nodes (for training) and T4 nodes (for inference). What mechanism prevents a training pod from landing on a T4 node?
The Kubernetes scheduler automatically detects training workloads and routes them to powerful GPUs
Node affinity rules matching GFD labels like nvidia.com/gpu.product to the required GPU model
Setting the GPU request to a number larger than T4 node capacity
Using a ConfigMap that lists approved nodes per workload type
4. NVLink provides up to 900 GB/s bandwidth between GPUs. Why does this matter for multi-GPU training jobs?
NVLink increases the number of CUDA cores available per GPU
NVLink reduces the time to copy gradient tensors between GPUs during distributed training, up to roughly 28x faster than PCIe 4.0
NVLink allows GPUs to share the same VRAM pool
NVLink eliminates the need for a scheduler by connecting GPUs directly
5. What limitation of the standard Kubernetes scheduler makes topology-aware scheduling necessary?
The scheduler cannot count GPU resources correctly
The scheduler treats extended resources as opaque integer counts and cannot reason about GPU topology, memory bandwidth, or NUMA locality
The scheduler does not support multi-GPU pods
The scheduler requires a GPU-specific API extension to function
Requesting GPUs in Pod Specs
Requesting a GPU follows the same resources.limits pattern as CPU and memory, with two constraints: GPU requests and limits must match, and the values must be whole numbers (no fractional GPUs by default).
The tolerations section is critical: GPU nodes are typically tainted with nvidia.com/gpu:NoSchedule to prevent non-GPU workloads from consuming expensive GPU node resources. A pod must explicitly tolerate this taint.
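Putting both pieces together, a training pod that requests one GPU and tolerates the GPU-node taint might look like this sketch (pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-job
spec:
  containers:
    - name: trainer
      image: pytorch/pytorch:latest   # illustrative training image
      resources:
        limits:
          nvidia.com/gpu: 1           # must equal the request; whole numbers only
  tolerations:
    - key: nvidia.com/gpu             # matches the taint applied to GPU nodes
      operator: Exists
      effect: NoSchedule
```

Without the tolerations block, this pod would remain Pending if every GPU node in the cluster carries the nvidia.com/gpu:NoSchedule taint.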
Node Affinity and GPU Topology Awareness
When a cluster has multiple GPU types, raw GPU count is insufficient. Node affinity rules allow targeting specific GPU models using GFD-generated labels:
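A pod that must land on an A100 node can pin itself to the GFD product label with a standard node affinity rule — a sketch, assuming the label value emitted for an 80 GB SXM4 A100:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.product   # GFD-generated label
              operator: In
              values:
                - NVIDIA-A100-SXM4-80GB
```

Because the label is applied automatically by GFD, this rule keeps working as A100 nodes are added or removed — no manually curated node pool required.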
Animation: GPU Scheduling Flow — Pod Spec to GPU Allocation
GPU Interconnect Bandwidth Comparison
| Interconnect | Bandwidth | Typical Use Case |
| --- | --- | --- |
| NVLink (4th gen, H100) | 900 GB/s | Multi-GPU LLM training on a single node |
| NVLink (3rd gen, A100) | 600 GB/s | Distributed training, large model sharding |
| PCIe 4.0 x16 | 32 GB/s | Inference, single-GPU training |
| InfiniBand HDR (inter-node) | 200 Gb/s | Multi-node distributed training |
Extended Resources and Custom Schedulers
The standard Kubernetes scheduler treats extended resources as opaque integer counts. It can count and subtract, but it cannot reason about GPU topology, memory bandwidth, or NUMA locality. For topology-aware placement, the GPU Operator integrates with the Topology Manager and NUMA-aware scheduler to co-locate GPU and CPU resources on the same NUMA node.
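On the kubelet side, NUMA alignment is controlled by the Topology Manager policy in the KubeletConfiguration. A minimal sketch, using the strictest policy:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: single-numa-node   # reject pods whose resources cannot be NUMA-aligned
cpuManagerPolicy: static                  # CPU pinning, so CPUs can be co-located with the GPU
```

With single-numa-node, a pod requesting a GPU plus pinned CPUs is admitted only if both can be satisfied from the same NUMA domain, avoiding cross-socket PCIe traffic.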
Key Takeaway
GPU resources use nvidia.com/gpu in pod specs; requests and limits must match
Tolerations are required to schedule on tainted GPU nodes
Topology-aware scheduling maximizes NVLink bandwidth (up to 28x faster than PCIe) for multi-GPU training
The Topology Manager and NUMA-aware scheduler co-locate GPU and CPU on the same NUMA domain
Post-Quiz — What did you learn?
1. A pod spec requests nvidia.com/gpu: 1 but does not include a toleration for the nvidia.com/gpu:NoSchedule taint. What happens?
The pod is scheduled on the GPU node but runs without GPU access
The pod remains Pending because it cannot be scheduled on any tainted GPU node
The scheduler automatically adds the missing toleration
The pod is placed on a CPU-only node with emulated GPU access
2. Why must GPU resource requests and limits be equal in a Kubernetes pod spec?
Because GPUs are too expensive to overcommit — Kubernetes enforces 1:1 allocation for extended resources
This is just a recommendation, not a requirement
Because GPU memory cannot be measured in fractional units
Because the device plugin does not support the requests field
3. A cluster has both A100 nodes (for training) and T4 nodes (for inference). What mechanism prevents a training pod from landing on a T4 node?
The Kubernetes scheduler automatically detects training workloads and routes them to powerful GPUs
Node affinity rules matching GFD labels like nvidia.com/gpu.product to the required GPU model
Setting the GPU request to a number larger than T4 node capacity
Using a ConfigMap that lists approved nodes per workload type
4. NVLink provides up to 900 GB/s bandwidth between GPUs. Why does this matter for multi-GPU training jobs?
NVLink increases the number of CUDA cores available per GPU
NVLink reduces the time to copy gradient tensors between GPUs during distributed training, up to roughly 28x faster than PCIe 4.0
NVLink allows GPUs to share the same VRAM pool
NVLink eliminates the need for a scheduler by connecting GPUs directly
5. What limitation of the standard Kubernetes scheduler makes topology-aware scheduling necessary?
The scheduler cannot count GPU resources correctly
The scheduler treats extended resources as opaque integer counts and cannot reason about GPU topology, memory bandwidth, or NUMA locality
The scheduler does not support multi-GPU pods
The scheduler requires a GPU-specific API extension to function
Section 4: GPU Sharing and Multi-Tenancy
Pre-Quiz — What do you already know?
1. What is the fundamental difference between MIG partitioning and time-slicing?
MIG is software-based while time-slicing uses hardware isolation
MIG creates hardware-enforced partitions with dedicated compute and memory, while time-slicing shares everything via rapid context switching
MIG works on any GPU while time-slicing requires Ampere hardware
MIG and time-slicing are identical in isolation; they differ only in configuration
2. An A100 is configured with 7 MIG instances at the 1g.10gb profile, and each instance is time-sliced into 4 replicas. How many total pod allocations does this produce?
7
11
28
56
3. Why is MPS (Multi-Process Service) preferred over time-slicing for replicated inference workloads?
MPS provides hardware-level isolation between processes
MPS funnels multiple CUDA processes through a single shared context, reducing context-switching overhead and improving throughput
MPS can split a GPU into more partitions than time-slicing
MPS is the only sharing method that works on pre-Ampere hardware
4. Which combination of GPU sharing strategies is NOT supported?
MIG + time-slicing
MIG + MPS on the same GPU
Time-slicing on a non-MIG GPU
MIG alone without time-slicing
5. A naively scheduled GPU cluster typically achieves about 13% GPU utilization. With advanced GPU sharing strategies, what utilization level can production clusters achieve?
About 25%
About 45%
Over 80%
Close to 100%
The GPU Utilization Problem
A single NVIDIA A100 costs $10,000–$30,000. Running one small inference workload at 5% utilization is financially untenable. GPU sharing runs multiple workloads on a single GPU to increase utilization and reduce per-workload cost. NVIDIA provides three distinct sharing mechanisms:
| Strategy | Isolation | Hardware Req. | Latency Impact | Best For |
| --- | --- | --- | --- | --- |
| MIG | Hardware (hard) | Ampere+ (A100, H100) | None | Production inference with SLAs |
| Time-Slicing | None (soft) | Any NVIDIA GPU | Context switch jitter | Dev/test, bursty workloads |
| MPS | Process-level (soft) | Any NVIDIA GPU | Low overhead | Throughput-focused inference replicas |
MIG (Multi-Instance GPU) Partitioning
MIG is a hardware feature on Ampere+ architecture that divides one physical GPU into up to 7 independent instances, each with dedicated compute engines, memory bandwidth, L2 cache, and DRAM. One workload in a MIG instance cannot read memory from another instance or interfere with its performance.
Animation: MIG Partitioning — One A100 GPU, Seven Isolated Instances
MIG Partition Profiles (A100 80GB)
| Profile | Compute Fraction | Memory | Max Instances |
| --- | --- | --- | --- |
| 1g.10gb | 1/7 | ~10 GB | 7 |
| 2g.20gb | 2/7 | ~20 GB | 3 |
| 3g.40gb | 3/7 | ~40 GB | 2 |
| 4g.40gb | 4/7 | ~40 GB | 1 |
| 7g.80gb | 7/7 | ~80 GB | 1 (whole GPU) |
Enabling MIG with the GPU Operator
```shell
# Label the node with the desired MIG strategy
kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.10gb

# The GPU Operator's MIG Manager detects the label change and
# reconfigures the GPU automatically. Verify the result:
kubectl describe node <gpu-node> | grep nvidia.com/mig
```
After configuration, the node advertises nvidia.com/mig-1g.10gb: 7 as an extended resource. Pods request MIG slices explicitly:
```yaml
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1
```
Time-Slicing GPUs Across Pods
Time-slicing works on any NVIDIA GPU. It configures the device plugin to advertise each physical GPU as multiple virtual GPUs via a ConfigMap. The GPU rapidly context-switches between active workloads.
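A typical sharing ConfigMap looks like the following sketch (the data key name and the replica count are illustrative; the GPU Operator selects a key via its device plugin configuration settings):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each physical GPU is advertised as 4 allocatable GPUs
```

Once applied, a node with one physical GPU reports nvidia.com/gpu: 4, and up to four pods can be scheduled onto it concurrently.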
Critical limitations: No memory isolation (all pods share VRAM), no compute isolation, latency jitter from context switching, no fair scheduling guarantees. Time-slicing is appropriate for dev/test clusters and bursty workloads, not production inference with SLAs.
MPS (Multi-Process Service)
MPS funnels multiple CUDA processes through a single shared CUDA context via a server daemon, reducing context-switching overhead. Ideal for replicated inference: running multiple copies of the same model with higher aggregate throughput than time-slicing.
Constraints: MPS and time-slicing cannot coexist. MPS is not supported on MIG-enabled devices. A crash in one CUDA process can destabilize the shared context — best for trusted workloads only.
Combining Strategies: MIG + Time-Slicing
For maximum density on A100/H100 hardware, MIG and time-slicing can be combined: 7 MIG instances × 4 time-sliced replicas = 28 pod allocations from a single physical GPU. Each MIG instance provides hardware memory isolation, while time-slicing adds density within each partition.
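In the combined setup, the time-slicing entry targets the MIG resource name rather than nvidia.com/gpu — a sketch, assuming the all-1g.10gb profile shown earlier and an illustrative replica count:

```yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/mig-1g.10gb
        replicas: 4   # 7 MIG instances x 4 replicas = 28 allocations per A100
```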
Cost Optimization Impact
Production clusters using advanced GPU sharing can move utilization from the typical 13% baseline to over 80%. The decision framework: isolation requirements first, hardware capabilities second, workload patterns third.
Key Takeaway
MIG provides hardware-enforced isolation (dedicated compute + memory) — right for production workloads with SLAs; requires Ampere+ hardware
Time-slicing is simplest but offers no isolation and introduces latency jitter — best for dev/test
MPS reduces sharing overhead for throughput-oriented inference replicas but provides weaker fault isolation
MIG + time-slicing can produce up to 28 allocations per A100
MIG + MPS is not supported; time-slicing + MPS cannot coexist
Proper GPU sharing can increase cluster utilization from 13% to over 80%
Post-Quiz — What did you learn?
1. What is the fundamental difference between MIG partitioning and time-slicing?
MIG is software-based while time-slicing uses hardware isolation
MIG creates hardware-enforced partitions with dedicated compute and memory, while time-slicing shares everything via rapid context switching
MIG works on any GPU while time-slicing requires Ampere hardware
MIG and time-slicing are identical in isolation; they differ only in configuration
2. An A100 is configured with 7 MIG instances at the 1g.10gb profile, and each instance is time-sliced into 4 replicas. How many total pod allocations does this produce?
7
11
28
56
3. Why is MPS (Multi-Process Service) preferred over time-slicing for replicated inference workloads?
MPS provides hardware-level isolation between processes
MPS funnels multiple CUDA processes through a single shared context, reducing context-switching overhead and improving throughput
MPS can split a GPU into more partitions than time-slicing
MPS is the only sharing method that works on pre-Ampere hardware
4. Which combination of GPU sharing strategies is NOT supported?
MIG + time-slicing
MIG + MPS on the same GPU
Time-slicing on a non-MIG GPU
MIG alone without time-slicing
5. A naively scheduled GPU cluster typically achieves about 13% GPU utilization. With advanced GPU sharing strategies, what utilization level can production clusters achieve?
About 25%
About 45%
Over 80%
Close to 100%