Chapter 2: GPU and Accelerator Management

Learning Objectives

Section 1: GPU Fundamentals for Kubernetes

Pre-Quiz — What do you already know?

1. What is the primary role of a Kubernetes device plugin for GPUs?

It compiles CUDA code inside containers at runtime
It discovers GPU hardware, advertises it as extended resources, and handles device allocation to pods
It replaces the container runtime to support GPU workloads
It installs NVIDIA drivers directly onto worker node operating systems

2. When a pod requests nvidia.com/gpu: 1, what component is responsible for mounting the actual GPU device files into the container?

The Kubernetes API server
The kube-scheduler
The NVIDIA Container Toolkit, after the device plugin allocates the device
The etcd database

3. How does the NVIDIA device plugin communicate with the kubelet?

Through the Kubernetes API server using REST calls
Via gRPC over a Unix socket at /var/lib/kubelet/device-plugins/
Through environment variables set in the pod spec
By writing directly to the container filesystem

4. What is the key advantage of the Container Device Interface (CDI) over the traditional device plugin approach?

CDI provides faster GPU computation speeds
CDI decouples device injection from the container runtime, improving portability across containerd and CRI-O
CDI automatically installs GPU drivers on new nodes
CDI replaces the need for CUDA libraries entirely

5. Why does the NVIDIA Container Toolkit inject driver libraries at runtime rather than requiring them in every container image?

Container images cannot contain binary libraries
It decouples the workload image from the host driver version, so a single image works across nodes with different driver versions
Runtime injection is faster than pre-baked libraries
NVIDIA licensing prevents including drivers in container images

The GPU Landscape for AI

GPUs were originally designed for rendering pixels in parallel. That same parallel architecture — thousands of smaller compute cores working simultaneously — is exactly what neural network training and inference require. Three major vendors compete in the Kubernetes AI workload space:

Vendor | Key Products | Kubernetes Support | Notes
NVIDIA | A100, H100, L40, RTX 40-series | Mature (device plugin + GPU Operator) | Dominant; CUDA ecosystem widely supported
AMD | Instinct MI300, RX 7900 XTX | Stable (ROCm device plugin) | Growing adoption; ROCm is the CUDA equivalent
Intel | Gaudi 2/3, Arc GPUs | Emerging (Intel Device Plugin) | Strong for data center inference; OpenVINO

NVIDIA dominates Kubernetes AI workloads because the CUDA (Compute Unified Device Architecture) ecosystem has been the default target for nearly every major AI framework (PyTorch, TensorFlow, JAX). However, the architectural patterns — device plugins, resource requests, sharing strategies — apply across all vendors.

Device Plugin Architecture and Discovery

Kubernetes was designed around a generic resource model. GPUs are not built-in concepts — they are extended resources: arbitrary named quantities a node advertises and a pod can request. The device plugin framework bridges physical GPU hardware and the Kubernetes resource model.

A device plugin is a small program running as a DaemonSet on each GPU node. It communicates with the kubelet over a Unix socket using gRPC and performs three core functions:

  1. Discovery — Detects available devices using NVML (NVIDIA Management Library)
  2. Advertisement — Registers devices as extended resources (e.g., nvidia.com/gpu: 4) so the scheduler is aware of them
  3. Allocation — When a pod needs a GPU, the plugin injects the correct device files and environment variables into the container
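Once advertisement completes, the extended resource appears on the node object itself. A node with four GPUs would report capacity along these lines (illustrative excerpt of `kubectl get node <gpu-node> -o yaml` output):

```yaml
# Illustrative node status excerpt after the device plugin registers
status:
  capacity:
    nvidia.com/gpu: "4"
  allocatable:
    nvidia.com/gpu: "4"
```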
Animation: Device Plugin Architecture — Discovery, Advertisement, and Allocation
[Diagram: four physical A100 GPUs are discovered via NVML, advertised to the kubelet as nvidia.com/gpu: 4 over the gRPC Unix socket, and allocated to a pod, with the Container Toolkit mounting /dev/nvidia0 and injecting CUDA libraries.]

Container Runtime GPU Passthrough

Even with a device plugin advertising GPU resources, the container itself still needs to use those GPUs. The NVIDIA Container Toolkit (formerly nvidia-docker2) hooks into the container runtime (containerd or CRI-O) and, at container launch time, injects the correct GPU devices and driver libraries without requiring those libraries in every container image.

What happens when a GPU pod starts:

  1. Pod spec requests nvidia.com/gpu: 1
  2. Scheduler finds a node with available nvidia.com/gpu capacity
  3. kubelet calls the NVIDIA device plugin to allocate one GPU
  4. Device plugin returns device paths (/dev/nvidia0) and environment variables
  5. NVIDIA Container Toolkit intercepts the container start, mounts /dev/nvidia0, and injects CUDA libraries
  6. Container starts with full GPU access
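Step 4 can be made concrete: the plugin's allocation response typically carries the device file path plus an environment variable that tells the Container Toolkit which GPU to expose. A sketch of what the container effectively receives (the exact UUID is assigned per device; names here are illustrative):

```yaml
# Illustrative result of allocation, as seen from inside the container
devices:
  - /dev/nvidia0                      # device file mounted by the Container Toolkit
env:
  NVIDIA_VISIBLE_DEVICES: "GPU-..."   # set by the device plugin; consumed by the toolkit
```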

Key Takeaway

Post-Quiz — What did you learn?

1. What is the primary role of a Kubernetes device plugin for GPUs?

It compiles CUDA code inside containers at runtime
It discovers GPU hardware, advertises it as extended resources, and handles device allocation to pods
It replaces the container runtime to support GPU workloads
It installs NVIDIA drivers directly onto worker node operating systems

2. When a pod requests nvidia.com/gpu: 1, what component is responsible for mounting the actual GPU device files into the container?

The Kubernetes API server
The kube-scheduler
The NVIDIA Container Toolkit, after the device plugin allocates the device
The etcd database

3. How does the NVIDIA device plugin communicate with the kubelet?

Through the Kubernetes API server using REST calls
Via gRPC over a Unix socket at /var/lib/kubelet/device-plugins/
Through environment variables set in the pod spec
By writing directly to the container filesystem

4. What is the key advantage of the Container Device Interface (CDI) over the traditional device plugin approach?

CDI provides faster GPU computation speeds
CDI decouples device injection from the container runtime, improving portability across containerd and CRI-O
CDI automatically installs GPU drivers on new nodes
CDI replaces the need for CUDA libraries entirely

5. Why does the NVIDIA Container Toolkit inject driver libraries at runtime rather than requiring them in every container image?

Container images cannot contain binary libraries
It decouples the workload image from the host driver version, so a single image works across nodes with different driver versions
Runtime injection is faster than pre-baked libraries
NVIDIA licensing prevents including drivers in container images

Section 2: NVIDIA GPU Operator

Pre-Quiz — What do you already know?

1. What problem does the NVIDIA GPU Operator solve that manual GPU management does not?

It provides GPU hardware at a lower cost
It automates the entire GPU software lifecycle — drivers, runtime config, device plugin, monitoring — using a reconciliation loop
It increases the number of CUDA cores available per GPU
It replaces Kubernetes scheduling entirely for GPU workloads

2. Why does the GPU Operator deploy NVIDIA drivers as a container rather than installing them directly on the host?

Containerized drivers run faster than native kernel modules
It avoids host OS version lock-in and simplifies node upgrades by keeping driver lifecycle independent of the host
Host-installed drivers are not compatible with Kubernetes
Container images are required by NVIDIA licensing

3. What happens if the CUDA Validator component in the GPU Operator's provisioning workflow fails?

The GPU Operator ignores the failure and proceeds to schedule workloads
The node is immediately drained and removed from the cluster
The node remains unschedulable for GPU workloads; the Operator logs the error and retries provisioning
All other GPU nodes in the cluster are also paused until the issue is resolved

4. What is the purpose of GPU Feature Discovery (GFD)?

It discovers new GPU hardware models before they are released
It automatically labels nodes with detailed GPU metadata (model, memory, CUDA version, MIG capability) for scheduling
It discovers and installs missing CUDA libraries on worker nodes
It monitors GPU temperature and automatically throttles workloads

5. A pod requires an 80 GB A100 GPU with MIG support. How can this requirement be expressed in Kubernetes without manually maintaining node pools?

By setting a CPU request equal to the number of GPU cores needed
By using node affinity rules that match GFD-generated labels like nvidia.com/gpu.product and nvidia.com/mig.capable
By deploying a separate scheduler exclusively for A100 nodes
By adding a ConfigMap that maps pod names to specific nodes

The Operator Pattern for GPU Management

Managing GPU nodes manually — installing drivers, configuring the container toolkit, deploying the device plugin, setting up monitoring — is a fragile, error-prone process. The NVIDIA GPU Operator encodes this operational knowledge into a Kubernetes Operator: a controller that continuously watches cluster state and reconciles it toward the desired GPU software configuration.

Components Managed by the GPU Operator

Component | Purpose
NVIDIA Drivers | Kernel module enabling CUDA; deployed as a privileged container
Container Toolkit | Hooks the container runtime to inject GPU access into pods
Device Plugin | Advertises nvidia.com/gpu extended resources to the scheduler
GPU Feature Discovery | Labels nodes with GPU metadata (model, memory, CUDA version)
DCGM Exporter | Exposes GPU metrics (utilization, temperature, memory) for Prometheus
MIG Manager | Configures MIG partitioning on supported hardware
CUDA Validator | Runs a test workload to confirm CUDA is functional before scheduling

Installation via Helm

# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Create the namespace and label it for privileged pod security
kubectl create namespace gpu-operator
kubectl label --overwrite namespace gpu-operator \
  pod-security.kubernetes.io/enforce=privileged

# Install the GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --wait

Automated Provisioning Workflow

For each new GPU node, the Operator follows a three-step workflow:

  1. Discovery — Identifies nodes with NVIDIA GPUs using labels and hardware detection
  2. Installation and Configuration — Deploys the driver container, configures the Container Toolkit, starts the device plugin and monitoring stack
  3. Validation — The CUDA Validator runs a test workload; only after it passes does the node accept GPU workloads

The validation gate is critical: it prevents misconfigured nodes from silently accepting GPU pods that would then fail at runtime.

Node Labeling and GPU Feature Discovery

GPU Feature Discovery (GFD) automatically labels nodes with rich GPU metadata, enabling fine-grained scheduling decisions. Example labels on a node with an A100:

nvidia.com/gpu.present=true
nvidia.com/gpu.product=A100-SXM4-80GB
nvidia.com/gpu.memory=81920
nvidia.com/gpu.count=8
nvidia.com/cuda.driver.major=525
nvidia.com/mig.capable=true

With these labels, pods can express hardware requirements as node affinity rules rather than relying on manually maintained node pools.
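For example, the requirement from the quiz above (an 80 GB A100 with MIG support) can be expressed with a plain nodeSelector against GFD labels. A minimal sketch, assuming the label values shown earlier:

```yaml
# Minimal sketch: target MIG-capable A100-80GB nodes via GFD labels
spec:
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-80GB
    nvidia.com/mig.capable: "true"
```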

Key Takeaway

Post-Quiz — What did you learn?

1. What problem does the NVIDIA GPU Operator solve that manual GPU management does not?

It provides GPU hardware at a lower cost
It automates the entire GPU software lifecycle — drivers, runtime config, device plugin, monitoring — using a reconciliation loop
It increases the number of CUDA cores available per GPU
It replaces Kubernetes scheduling entirely for GPU workloads

2. Why does the GPU Operator deploy NVIDIA drivers as a container rather than installing them directly on the host?

Containerized drivers run faster than native kernel modules
It avoids host OS version lock-in and simplifies node upgrades by keeping driver lifecycle independent of the host
Host-installed drivers are not compatible with Kubernetes
Container images are required by NVIDIA licensing

3. What happens if the CUDA Validator component in the GPU Operator's provisioning workflow fails?

The GPU Operator ignores the failure and proceeds to schedule workloads
The node is immediately drained and removed from the cluster
The node remains unschedulable for GPU workloads; the Operator logs the error and retries provisioning
All other GPU nodes in the cluster are also paused until the issue is resolved

4. What is the purpose of GPU Feature Discovery (GFD)?

It discovers new GPU hardware models before they are released
It automatically labels nodes with detailed GPU metadata (model, memory, CUDA version, MIG capability) for scheduling
It discovers and installs missing CUDA libraries on worker nodes
It monitors GPU temperature and automatically throttles workloads

5. A pod requires an 80 GB A100 GPU with MIG support. How can this requirement be expressed in Kubernetes without manually maintaining node pools?

By setting a CPU request equal to the number of GPU cores needed
By using node affinity rules that match GFD-generated labels like nvidia.com/gpu.product and nvidia.com/mig.capable
By deploying a separate scheduler exclusively for A100 nodes
By adding a ConfigMap that maps pod names to specific nodes

Section 3: GPU Scheduling and Resource Requests

Pre-Quiz — What do you already know?

1. A pod spec requests nvidia.com/gpu: 1 but does not include a toleration for the nvidia.com/gpu:NoSchedule taint. What happens?

The pod is scheduled on the GPU node but runs without GPU access
The pod remains Pending because it cannot be scheduled on any tainted GPU node
The scheduler automatically adds the missing toleration
The pod is placed on a CPU-only node with emulated GPU access

2. Why must GPU resource requests and limits be equal in a Kubernetes pod spec?

Because GPUs are too expensive to overcommit — Kubernetes enforces 1:1 allocation for extended resources
This is just a recommendation, not a requirement
Because GPU memory cannot be measured in fractional units
Because the device plugin does not support the requests field

3. A cluster has both A100 nodes (for training) and T4 nodes (for inference). What mechanism prevents a training pod from landing on a T4 node?

The Kubernetes scheduler automatically detects training workloads and routes them to powerful GPUs
Node affinity rules matching GFD labels like nvidia.com/gpu.product to the required GPU model
Setting the GPU request to a number larger than T4 node capacity
Using a ConfigMap that lists approved nodes per workload type

4. NVLink provides up to 900 GB/s bandwidth between GPUs. Why does this matter for multi-GPU training jobs?

NVLink increases the number of CUDA cores available per GPU
NVLink reduces the time to copy gradient tensors between GPUs during distributed training, potentially 20x faster than PCIe
NVLink allows GPUs to share the same VRAM pool
NVLink eliminates the need for a scheduler by connecting GPUs directly

5. What limitation of the standard Kubernetes scheduler makes topology-aware scheduling necessary?

The scheduler cannot count GPU resources correctly
The scheduler treats extended resources as opaque integer counts and cannot reason about GPU topology, memory bandwidth, or NUMA locality
The scheduler does not support multi-GPU pods
The scheduler requires a GPU-specific API extension to function

Requesting GPUs in Pod Specs

Requesting a GPU follows the same resources.limits pattern as CPU and memory, with two constraints: GPU requests and limits must be equal, and values must be whole numbers (no fractional GPUs by default).

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 1
    command: ["python", "train.py"]
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

The tolerations section is critical: GPU nodes are typically tainted with nvidia.com/gpu:NoSchedule to prevent non-GPU workloads from consuming expensive GPU node resources. A pod must explicitly tolerate this taint.
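The taint side of this contract lives on the node object. On a typical GPU node it would look like the following (illustrative; the taint value, if any, varies by setup):

```yaml
# Node spec excerpt: the taint the pod's toleration must match
spec:
  taints:
  - key: nvidia.com/gpu
    value: present        # illustrative value; some setups omit it
    effect: NoSchedule
```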

Node Affinity and GPU Topology Awareness

When a cluster has multiple GPU types, raw GPU count is insufficient. Node affinity rules allow targeting specific GPU models using GFD-generated labels:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.product
            operator: In
            values:
            - A100-SXM4-80GB
Animation: GPU Scheduling Flow — Pod Spec to GPU Allocation
[Diagram: a pod requesting nvidia.com/gpu: 2 with an A100-SXM4-80GB affinity rule and a NoSchedule toleration is filtered by the scheduler against node labels, GPU capacity, and taints; a T4 node fails the label match, a fully allocated A100 node fails the capacity check, and the pod lands on an A100 node with two free NVLink-connected GPUs, which the Topology Manager co-locates on the same NUMA node.]

GPU Interconnect Bandwidth Comparison

Interconnect | Bandwidth | Typical Use Case
NVLink (4th gen, H100) | 900 GB/s | Multi-GPU LLM training on a single node
NVLink (3rd gen, A100) | 600 GB/s | Distributed training, large model sharding
PCIe 4.0 x16 | 32 GB/s | Inference, single-GPU training
InfiniBand HDR (inter-node) | 200 Gb/s | Multi-node distributed training

Extended Resources and Custom Schedulers

The standard Kubernetes scheduler treats extended resources as opaque integer counts. It can count and subtract, but it cannot reason about GPU topology, memory bandwidth, or NUMA locality. For topology-aware placement, the GPU Operator integrates with the Topology Manager and NUMA-aware scheduler to co-locate GPU and CPU resources on the same NUMA node.
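NUMA co-location is enabled through kubelet configuration rather than the pod spec. A minimal sketch of the relevant KubeletConfiguration fields:

```yaml
# KubeletConfiguration excerpt: require GPU and CPU on one NUMA node
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: single-numa-node
cpuManagerPolicy: static    # static CPU pinning is required for NUMA alignment
```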

Key Takeaway

Post-Quiz — What did you learn?

1. A pod spec requests nvidia.com/gpu: 1 but does not include a toleration for the nvidia.com/gpu:NoSchedule taint. What happens?

The pod is scheduled on the GPU node but runs without GPU access
The pod remains Pending because it cannot be scheduled on any tainted GPU node
The scheduler automatically adds the missing toleration
The pod is placed on a CPU-only node with emulated GPU access

2. Why must GPU resource requests and limits be equal in a Kubernetes pod spec?

Because GPUs are too expensive to overcommit — Kubernetes enforces 1:1 allocation for extended resources
This is just a recommendation, not a requirement
Because GPU memory cannot be measured in fractional units
Because the device plugin does not support the requests field

3. A cluster has both A100 nodes (for training) and T4 nodes (for inference). What mechanism prevents a training pod from landing on a T4 node?

The Kubernetes scheduler automatically detects training workloads and routes them to powerful GPUs
Node affinity rules matching GFD labels like nvidia.com/gpu.product to the required GPU model
Setting the GPU request to a number larger than T4 node capacity
Using a ConfigMap that lists approved nodes per workload type

4. NVLink provides up to 900 GB/s bandwidth between GPUs. Why does this matter for multi-GPU training jobs?

NVLink increases the number of CUDA cores available per GPU
NVLink reduces the time to copy gradient tensors between GPUs during distributed training, potentially 20x faster than PCIe
NVLink allows GPUs to share the same VRAM pool
NVLink eliminates the need for a scheduler by connecting GPUs directly

5. What limitation of the standard Kubernetes scheduler makes topology-aware scheduling necessary?

The scheduler cannot count GPU resources correctly
The scheduler treats extended resources as opaque integer counts and cannot reason about GPU topology, memory bandwidth, or NUMA locality
The scheduler does not support multi-GPU pods
The scheduler requires a GPU-specific API extension to function

Section 4: GPU Sharing and Multi-Tenancy

Pre-Quiz — What do you already know?

1. What is the fundamental difference between MIG partitioning and time-slicing?

MIG is software-based while time-slicing uses hardware isolation
MIG creates hardware-enforced partitions with dedicated compute and memory, while time-slicing shares everything via rapid context switching
MIG works on any GPU while time-slicing requires Ampere hardware
MIG and time-slicing are identical in isolation; they differ only in configuration

2. An A100 is configured with 7 MIG instances at the 1g.10gb profile, and each instance is time-sliced into 4 replicas. How many total pod allocations does this produce?

7
11
28
56

3. Why is MPS (Multi-Process Service) preferred over time-slicing for replicated inference workloads?

MPS provides hardware-level isolation between processes
MPS funnels multiple CUDA processes through a single shared context, reducing context-switching overhead and improving throughput
MPS can split a GPU into more partitions than time-slicing
MPS is the only sharing method that works on pre-Ampere hardware

4. Which combination of GPU sharing strategies is NOT supported?

MIG + time-slicing
MIG + MPS on the same GPU
Time-slicing on a non-MIG GPU
MIG alone without time-slicing

5. A naively scheduled GPU cluster typically achieves about 13% GPU utilization. With advanced GPU sharing strategies, what utilization level can production clusters achieve?

About 25%
About 45%
Over 80%
Close to 100%

The GPU Utilization Problem

A single NVIDIA A100 costs $10,000–$30,000. Running one small inference workload at 5% utilization is financially untenable. GPU sharing runs multiple workloads on a single GPU to increase utilization and reduce per-workload cost. NVIDIA provides three distinct sharing mechanisms:

Strategy | Isolation | Hardware Req. | Latency Impact | Best For
MIG | Hardware (hard) | Ampere+ (A100, H100) | None | Production inference with SLAs
Time-Slicing | None (soft) | Any NVIDIA GPU | Context switch jitter | Dev/test, bursty workloads
MPS | Process-level (soft) | Any NVIDIA GPU | Low overhead | Throughput-focused inference replicas

MIG (Multi-Instance GPU) Partitioning

MIG is a hardware feature on Ampere+ architecture that divides one physical GPU into up to 7 independent instances, each with dedicated compute engines, memory bandwidth, L2 cache, and DRAM. One workload in a MIG instance cannot read memory from another instance or interfere with its performance.

Animation: MIG Partitioning — One A100 GPU, Seven Isolated Instances
[Diagram: a single A100 80GB (80 GB HBM2e, 6912 CUDA cores) is partitioned into seven 1g.10gb instances, each with 1/7 of compute and ~10 GB of memory; one pod is assigned per instance via nvidia.com/mig-1g.10gb: 1, with hardware isolation walls preventing cross-instance memory access or interference.]

MIG Partition Profiles (A100 80GB)

Profile | Compute Fraction | Memory | Max Instances
1g.10gb | 1/7 | ~10 GB | 7
2g.20gb | 2/7 | ~20 GB | 3
3g.40gb | 3/7 | ~40 GB | 2
4g.40gb | 4/7 | ~40 GB | 1
7g.80gb | 7/7 | ~80 GB | 1 (whole GPU)

Enabling MIG with the GPU Operator

# Label the node with the desired MIG strategy
kubectl label node <gpu-node> nvidia.com/mig.config=all-1g.10gb

# The GPU Operator's MIG Manager detects the label change and
# reconfigures the GPU automatically. Verify the result:
kubectl describe node <gpu-node> | grep nvidia.com/mig

After configuration, the node advertises nvidia.com/mig-1g.10gb: 7 as an extended resource. Pods request MIG slices explicitly:

resources:
  limits:
    nvidia.com/mig-1g.10gb: 1

Time-Slicing GPUs Across Pods

Time-slicing works on any NVIDIA GPU. It configures the device plugin to advertise each physical GPU as multiple virtual GPUs via a ConfigMap. The GPU rapidly context-switches between active workloads.

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        replicas: 4

Critical limitations: No memory isolation (all pods share VRAM), no compute isolation, latency jitter from context switching, no fair scheduling guarantees. Time-slicing is appropriate for dev/test clusters and bursty workloads, not production inference with SLAs.
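Creating the ConfigMap alone is not enough: the GPU Operator's ClusterPolicy must reference it so the device plugin picks it up. A sketch of the relevant ClusterPolicy excerpt (field names follow the GPU Operator's CRD; verify against your Operator version):

```yaml
# ClusterPolicy excerpt: point the device plugin at the time-slicing config
spec:
  devicePlugin:
    config:
      name: time-slicing-config   # the ConfigMap created above
      default: any                # which key in the ConfigMap's data to use
```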

MPS (Multi-Process Service)

MPS funnels multiple CUDA processes through a single shared CUDA context via a server daemon, reducing context-switching overhead. Ideal for replicated inference: running multiple copies of the same model with higher aggregate throughput than time-slicing.

Constraints: MPS and time-slicing cannot coexist. MPS is not supported on MIG-enabled devices. A crash in one CUDA process can destabilize the shared context — best for trusted workloads only.
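Recent versions of the NVIDIA device plugin accept an MPS sharing stanza analogous to the time-slicing one. A hedged sketch (check your device plugin version for exact support and field names):

```yaml
# Sketch: advertise each physical GPU as 4 MPS replicas
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
```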

Combining Strategies: MIG + Time-Slicing

For maximum density on A100/H100 hardware, MIG and time-slicing can be combined: 7 MIG instances × 4 time-sliced replicas = 28 pod allocations from a single physical GPU. Each MIG instance provides hardware memory isolation, while time-slicing adds density within each partition.
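The combined configuration is expressed by time-slicing the MIG resource name rather than nvidia.com/gpu. A sketch, assuming the all-1g.10gb MIG layout configured earlier:

```yaml
# Sketch: time-slice each 1g.10gb MIG instance into 4 replicas (7 x 4 = 28)
version: v1
flags:
  migStrategy: mixed        # expose MIG devices under distinct resource names
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/mig-1g.10gb
      replicas: 4
```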

Cost Optimization Impact

Production clusters using advanced GPU sharing can move utilization from the typical 13% baseline to over 80%. The decision framework: isolation requirements first, hardware capabilities second, workload patterns third.

Key Takeaway

Post-Quiz — What did you learn?

1. What is the fundamental difference between MIG partitioning and time-slicing?

MIG is software-based while time-slicing uses hardware isolation
MIG creates hardware-enforced partitions with dedicated compute and memory, while time-slicing shares everything via rapid context switching
MIG works on any GPU while time-slicing requires Ampere hardware
MIG and time-slicing are identical in isolation; they differ only in configuration

2. An A100 is configured with 7 MIG instances at the 1g.10gb profile, and each instance is time-sliced into 4 replicas. How many total pod allocations does this produce?

7
11
28
56

3. Why is MPS (Multi-Process Service) preferred over time-slicing for replicated inference workloads?

MPS provides hardware-level isolation between processes
MPS funnels multiple CUDA processes through a single shared context, reducing context-switching overhead and improving throughput
MPS can split a GPU into more partitions than time-slicing
MPS is the only sharing method that works on pre-Ampere hardware

4. Which combination of GPU sharing strategies is NOT supported?

MIG + time-slicing
MIG + MPS on the same GPU
Time-slicing on a non-MIG GPU
MIG alone without time-slicing

5. A naively scheduled GPU cluster typically achieves about 13% GPU utilization. With advanced GPU sharing strategies, what utilization level can production clusters achieve?

About 25%
About 45%
Over 80%
Close to 100%

Your Progress

Answer Explanations