Chapter 9: Monitoring, Observability, and Troubleshooting

Learning Objectives

Pre-Study Assessment

1. What is the primary purpose of deploying DCGM Exporter as a DaemonSet?

To centrally query GPU metrics from one pod
To ensure one metrics-exporting pod runs on every GPU node
To replace the NVIDIA device plugin
To automatically restart failed GPU workloads

2. Why can't infrastructure metrics alone tell you if a model is converging?

Infrastructure metrics are sampled too infrequently
A GPU can run at high utilization while training loss has plateaued
Infrastructure metrics only work for inference workloads
Prometheus cannot scrape GPU metrics

3. Which Xid error code indicates an uncorrectable GPU memory fault requiring immediate node cordoning?

Xid 13
Xid 31
Xid 48
Xid 79

4. What is the most effective way to detect a training stall via Prometheus?

Alert when GPU utilization drops to 0%
Alert when the training_last_step_timestamp is older than a threshold
Alert when the pod restarts
Alert when Grafana dashboard panels show no data

5. What environment variable should you set first when debugging NCCL communication failures?

CUDA_VISIBLE_DEVICES
NCCL_DEBUG=INFO
NVIDIA_DRIVER_CAPABILITIES=all
PYTORCH_CUDA_ALLOC_CONF

6. For inference observability, why are p99 latency metrics more important than mean latency?

Mean latency is not supported by Prometheus
p99 defines the worst experience for almost all users and is the SLO boundary
p99 is always lower than the mean, so it is more useful
Mean latency includes network overhead while p99 does not

7. What is the correct Prometheus metric type for tracking inference latency distributions?

Counter
Gauge
Histogram
Summary

8. A pod is running but training throughput is ~100x lower than expected. What is the most likely cause?

The learning rate is too low
The pod spec is missing nvidia.com/gpu resource requests
The Prometheus scrape interval is too long
The model has too many parameters

9. What does torch.distributed.monitored_barrier() help you diagnose?

GPU memory leaks during training
Which rank failed to arrive at a synchronization point
Whether the model has converged
The optimal batch size for distributed training

10. How do you detect GPU thermal throttling using DCGM metrics?

Alert when GPU power draw exceeds rated TDP
Correlate SM clock frequency drops with rising GPU temperature
Monitor the DCGM_FI_DEV_ECC_SBE_VOL_TOTAL metric
Check the pod restart count in Kubernetes

11. What Kubernetes custom resource connects DCGM Exporter to Prometheus?

PrometheusRule
ServiceMonitor
PodMonitor
ScrapeConfig

12. A distributed training job hangs for 1800 seconds and then fails. What is the default NCCL timeout behavior?

NCCL retries the collective operation indefinitely
NCCL waits for the default timeout period, then raises an error
NCCL kills the straggler rank and continues
NCCL logs a warning but does not fail

Section 1: GPU Monitoring with DCGM and Prometheus

The Three-Layer Observability Model

Effective AI workload observability requires monitoring three interlocking layers simultaneously: infrastructure (GPU hardware), platform (Kubernetes scheduling), and application (model metrics). No single layer tells the full story.

Layer | What It Measures | Key Tools
Infrastructure | GPU utilization, memory, temperature, power, ECC errors | DCGM Exporter, node_exporter
Platform | Pod health, scheduling events, resource quotas, job status | kube-state-metrics, Kubernetes Events
Application | Training loss, learning rate, inference latency, throughput | Custom Prometheus metrics, MLflow, W&B

NVIDIA DCGM Exporter Deployment

DCGM (Data Center GPU Manager) is NVIDIA's suite for managing and monitoring GPUs at scale. The DCGM Exporter component collects GPU telemetry and exposes it at an HTTP /metrics endpoint that Prometheus can scrape.

DCGM Exporter is deployed as a Kubernetes DaemonSet, meaning one pod runs on every GPU-equipped node. GPU metrics are node-local and cannot be collected from a central location without per-node agents.

Installation uses the official Helm chart:

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter

Key GPU Metrics

Metric Name | Description | Alert Threshold
DCGM_FI_DEV_GPU_UTIL | GPU compute utilization (%) | Alert if < 20% for sustained training
DCGM_FI_DEV_FB_USED | Framebuffer memory used (MiB) | Alert if > 95% of total
DCGM_FI_DEV_GPU_TEMP | GPU core temperature (C) | Alert if > 83 C
DCGM_FI_DEV_SM_CLOCK | Streaming Multiprocessor clock (MHz) | Alert on sustained drop during training
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | Volatile double-bit ECC errors | Alert immediately on any occurrence
DCGM_FI_DEV_XID_ERRORS | Xid errors (hardware/driver faults) | Alert on any new occurrence

Double-bit ECC errors are particularly critical: they indicate uncorrectable GPU memory faults and should immediately trigger node cordoning.
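
The "alert immediately" guidance maps to a rule like the following sketch (a PrometheusRule fragment; the runbook text is an assumption about your remediation process):

```yaml
- alert: GPUDoubleBitECCError
  # Any volatile DBE count means uncorrectable GPU memory faults
  expr: DCGM_FI_DEV_ECC_DBE_VOL_TOTAL > 0
  labels:
    severity: critical
  annotations:
    summary: "Uncorrectable ECC errors on {{ $labels.Hostname }} GPU {{ $labels.gpu }}"
    runbook: "Cordon the node, drain GPU workloads, schedule GPU replacement"
```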

Prometheus ServiceMonitor Configuration

A ServiceMonitor custom resource tells Prometheus how to discover and scrape the DCGM Exporter service. The release label must match the Prometheus Operator's serviceMonitorSelector -- this is the most common misconfiguration.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  namespaceSelector:
    matchNames:
    - gpu-operator-resources
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics
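
Once Prometheus has picked up the ServiceMonitor, a quick sanity query confirms metrics are flowing (label names follow dcgm-exporter's defaults; adjust if your install relabels):

```promql
# Average GPU utilization per device over the last 5 minutes
avg by (gpu, Hostname) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))
```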

Animation: GPU Monitoring Pipeline

[Diagram: a GPU node's GPUs (e.g. two A100s) feed the DCGM Host Engine; DCGM Exporter exposes telemetry at :9400/metrics; a ServiceMonitor tells Prometheus where to scrape and store it; Prometheus alert rules route to Alertmanager for notification, while Grafana dashboards (dashboard ID 12239) visualize metrics such as gpu_util=92%, temp=71C, mem=78%.]

Key Points: GPU Monitoring

Section 2: Training and Inference Observability

Custom Metrics for Training Workloads

Infrastructure metrics tell you whether the GPU is busy; they do not tell you whether the model is converging. A GPU can run at 95% utilization while training loss has plateaued. Application-layer metrics close this gap.

Metric | What It Reveals | Alert Condition
Training loss (per step) | Model convergence direction | Loss not decreasing over N steps = stall
Validation loss (per epoch) | Overfitting detection | Val loss diverging from train loss
Samples per second | GPU throughput efficiency | > 20% drop vs. baseline
Gradient norm | Exploding/vanishing gradient health | Norm > 100 or < 1e-7
Checkpoint write success | Recovery point availability | Alert on write failure

Expose these from your training code using the Prometheus Python client:

from prometheus_client import Gauge, start_http_server

# Expose /metrics on port 8080 for Prometheus to scrape
start_http_server(8080)

training_loss = Gauge('training_loss', 'Current training loss',
                      ['job_name', 'rank'])

for step, batch in enumerate(dataloader):
    # ... forward pass, backward pass ...
    # job_name and rank come from the training job's environment
    training_loss.labels(job_name=job_name, rank=rank).set(loss.item())

Inference Observability: Latency Distributions

Inference workloads prioritize latency over raw throughput. Tail latency (p99) matters most because it defines the worst experience users routinely receive. Use Prometheus Histograms for latency -- they allow computing arbitrary percentiles at query time:

from prometheus_client import Histogram

inference_latency = Histogram(
    'inference_request_duration_seconds',
    'End-to-end inference latency',
    ['model_name', 'model_version'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# Query p99 in PromQL:
# histogram_quantile(0.99,
#   sum(rate(inference_request_duration_seconds_bucket[5m])) by (le, model_name)
# )

Metric | Description | Why It Matters
p50 latency | Median request latency | Baseline serving speed
p95 latency | 95th percentile | Typical worst-case experience
p99 latency | 99th percentile | SLO boundary for most services
Queue depth | Pending requests | Scaling trigger
TTFT | Time to First Token | Critical for streaming LLM UX
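
To see why the mean hides the tail, here is a stdlib-only sketch with synthetic latencies (the 1% slow-path fraction and the latency ranges are illustrative):

```python
import random
import statistics

random.seed(0)

# 99% of requests are fast; 1% hit a slow path (retries, cold cache, long prompt).
latencies = [random.uniform(0.05, 0.15) for _ in range(9900)]
latencies += [random.uniform(2.0, 5.0) for _ in range(100)]

mean = statistics.mean(latencies)
# quantiles(n=100) returns the 1st..99th percentile cut points.
pct = statistics.quantiles(latencies, n=100)
p50, p95, p99 = pct[49], pct[94], pct[98]

print(f"mean={mean:.3f}s  p50={p50:.3f}s  p95={p95:.3f}s  p99={p99:.3f}s")
# The mean and p50 stay low, but p99 lands squarely in the slow tail --
# the number an SLO and a user-facing latency target actually care about.
```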

Distributed Training Communication Monitoring

In multi-node distributed training, collective communication operations (AllReduce, AllGather) are synchronization points where every GPU rank must arrive. A single slow or dead GPU stalls every other rank. Use torch.distributed.monitored_barrier() to identify which rank failed to arrive; note that it is only supported on the Gloo backend, so NCCL jobs typically create a secondary Gloo process group just for this check:

import datetime

import torch.distributed as dist

# monitored_barrier requires the Gloo backend
pg_gloo = dist.new_group(backend="gloo")

dist.monitored_barrier(
    group=pg_gloo,
    timeout=datetime.timedelta(seconds=60),
    wait_all_ranks=True,  # report every late rank, not just the first
)
# Raises RuntimeError identifying which rank(s) failed to arrive

Key Points: Training and Inference Observability

Section 3: Alerting and SLOs for AI Workloads

Designing Alerts That Matter

Alerts should page someone who can act on them immediately. Every alert should answer three questions: what is broken, where is it broken, and what is the immediate remediation step.

GPU Failure and Xid Error Alerts

Xid errors are the GPU driver's hardware fault codes. Common codes for AI workloads:

Xid Code | Meaning | Action
Xid 13 | Graphics Exception | Investigate workload, restart if persistent
Xid 31 | GPU memory page fault | Check for out-of-bounds CUDA memory access
Xid 48 | Double-bit ECC error | Cordon node immediately, replace GPU
Xid 63 | Row remapping failure | Schedule GPU replacement
Xid 79 | GPU has fallen off the bus | Node reboot required, hardware failure likely
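
A sketch of the "alert on any occurrence" guidance as a rule. Depending on exporter version, DCGM_FI_DEV_XID_ERRORS may report the most recent Xid value rather than a cumulative count, so changes() is a safer trigger than increase():

```yaml
- alert: GPUXidErrorObserved
  # Fires whenever a new Xid value is reported in the window
  expr: changes(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
  labels:
    severity: critical
  annotations:
    summary: "Xid error on {{ $labels.Hostname }} GPU {{ $labels.gpu }}"
```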

Training Stall Detection

A training stall is one of the most expensive failure modes: the cluster continues consuming expensive GPU-hours while producing no useful model updates. Detect stalls by alerting on the absence of training progress:

# Prometheus alert rule
- alert: TrainingJobStalled
  expr: (time() - training_last_step_timestamp_seconds) > 600
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Training job {{ $labels.job_name }} stalled"
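
The rule's logic is plain timestamp arithmetic, which makes it easy to sanity-check in your own alerting glue; a stdlib-only mirror of the expression (the function name is hypothetical):

```python
import time

STALL_AFTER_SECONDS = 600  # same threshold as the alert expression

def training_is_stalled(last_step_ts: float, now: float) -> bool:
    # Mirrors the PromQL: (time() - training_last_step_timestamp_seconds) > 600
    return (now - last_step_ts) > STALL_AFTER_SECONDS

now = time.time()
print(training_is_stalled(now - 30, now))    # stepped 30s ago -> False (healthy)
print(training_is_stalled(now - 1200, now))  # silent for 20 min -> True (stalled)
```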

SLO Definitions for Inference Services

SLO | Objective | Alert if
p99 latency | < 2 seconds | > 2s for 5m
p95 latency | < 500 ms | > 500 ms for 5m
Availability | > 99.9% | < 99.9% over 1h
TTFT | < 300 ms | > 300 ms for 5m
Error rate | < 0.1% | > 0.1% for 5m

KEDA can use these same Prometheus metrics as scaling signals, automatically adding inference replicas when the p99 SLO is at risk.
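
A hedged sketch of that wiring using KEDA's Prometheus scaler (the resource names, server address, and thresholds are assumptions for illustration):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-latency-scaler
spec:
  scaleTargetRef:
    name: inference-deployment        # hypothetical Deployment name
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090   # adjust to your install
      # Scale out before p99 reaches the 2s SLO boundary
      query: |
        histogram_quantile(0.99,
          sum(rate(inference_request_duration_seconds_bucket[5m])) by (le))
      threshold: "1.5"
```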

Animation: Alert Escalation Timeline
[Diagram: alert escalation timeline. t=0 normal (GPU temp 68C, SM clock 1.9 GHz); t=+3m threshold breach (temp 84C, clock 1.5 GHz); t=+8m the Prometheus rule's for: 5m condition is satisfied and the alert FIRES; t=+9m Alertmanager routes the notification to Slack / PagerDuty and pages the SRE; t=+12m remediation: kubectl cordon the node and reschedule the workload.]

Key Points: Alerting and SLOs

Section 4: Troubleshooting Common Issues

GPU Not Detected or Driver Mismatch

Symptom: Pod runs, but training throughput is ~100x lower than expected because the job is silently executing on CPU. This failure mode is dangerous precisely because nothing errors: the job looks healthy while wasting days of wall-clock time.

Root causes range from a missing nvidia.com/gpu resource request, to a failed NVIDIA device plugin, to a node that no longer advertises GPU capacity, to a broken container runtime. Work through them in order:

# Diagnostic sequence:
# 1. Does the pod actually request a GPU?
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].resources}'
# 2. Is the NVIDIA device plugin running on the node?
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds -o wide
# 3. Does the node advertise nvidia.com/gpu capacity?
kubectl get node <node> -o jsonpath='{.status.capacity}'
# 4. Can the container see the GPU at all?
kubectl exec -it <pod> -- nvidia-smi
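
If the first command shows no GPU in the container's resources, the fix is the resource limit itself (a sketch; adjust the count to your job):

```yaml
resources:
  limits:
    nvidia.com/gpu: 1   # without this, the container silently runs CPU-only
```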

NCCL Communication Failures

Symptom: Distributed training hangs indefinitely, then fails after 1800 seconds (default NCCL timeout).

Cause | Diagnostic Signal | Fix
Wrong network interface | NCCL logs show wrong IP | Set NCCL_SOCKET_IFNAME=eth0
Clock skew > 1ms | NTP logs show drift | Synchronize NTP/chrony
Mismatched tensor shapes | Stack trace at AllReduce | Audit model code
Dead GPU (silent failure) | One rank stops progressing | Check DCGM Xid metrics

Always set NCCL_DEBUG=INFO and NCCL_ASYNC_ERROR_HANDLING=1 as the first diagnostic step. Run nccl-tests to validate baseline inter-node bandwidth before large training runs.
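
As a sketch, these can be set defensively at the top of the training script, before any process group is initialized (NCCL reads them at communicator creation, so setting them later has no effect):

```python
import os

# Enable per-rank NCCL connection/topology logging
os.environ["NCCL_DEBUG"] = "INFO"
# Surface asynchronous NCCL errors as Python exceptions instead of silent hangs
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
# If NCCL picks the wrong interface, pin it explicitly (interface name varies):
# os.environ["NCCL_SOCKET_IFNAME"] = "eth0"

# ... only then: torch.distributed.init_process_group(backend="nccl", ...)
```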

Pod Scheduling Failures

Symptom: Pods remain in Pending state. Events show Insufficient nvidia.com/gpu.

GPU nodes commonly carry a NoSchedule taint. If your training pod is missing the matching toleration, it will never be scheduled:

tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule

Thermal Throttling

Symptom: Training throughput gradually decreases over a long run with no errors logged. When GPU temperature exceeds ~83C, the hardware automatically reduces SM clock frequency -- a GPU throttling from 1.95 GHz to 1.2 GHz loses approximately 38% of compute throughput.

Detection: Correlate DCGM_FI_DEV_GPU_TEMP with DCGM_FI_DEV_SM_CLOCK. A drop in SM clock correlated with rising temperature is the definitive signature.
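
That correlation can be encoded directly as a rule; a sketch (the 1400 MHz clock floor is an assumption, tune it per GPU model):

```yaml
- alert: GPUThermalThrottling
  # Hot GPU and depressed SM clock at the same time is the throttling signature
  expr: (DCGM_FI_DEV_GPU_TEMP > 83) and (DCGM_FI_DEV_SM_CLOCK < 1400)
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} is thermal throttling"
```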

Animation: GPU Scheduling Failure — Diagnostic Decision Tree
[Diagram: diagnostic decision tree for a pod that is running but slow. (1) Is nvidia.com/gpu present in resources.limits? If not, add nvidia.com/gpu: 1. (2) Is the device plugin pod running on the target node? If not, restart the device plugin DaemonSet. (3) Does node capacity show nvidia.com/gpu? If not, check the NVIDIA driver and reboot. (4) Do pod events show a scheduling error? If so, add the NoSchedule toleration. (5) Does nvidia-smi work inside the container? If not, fix the NVIDIA Container Runtime; if it works but no GPU is detected, check CUDA compatibility.]

Key Points: Troubleshooting

Post-Study Assessment

1. What is the primary purpose of deploying DCGM Exporter as a DaemonSet?

To centrally query GPU metrics from one pod
To ensure one metrics-exporting pod runs on every GPU node
To replace the NVIDIA device plugin
To automatically restart failed GPU workloads

2. Why can't infrastructure metrics alone tell you if a model is converging?

Infrastructure metrics are sampled too infrequently
A GPU can run at high utilization while training loss has plateaued
Infrastructure metrics only work for inference workloads
Prometheus cannot scrape GPU metrics

3. Which Xid error code indicates an uncorrectable GPU memory fault requiring immediate node cordoning?

Xid 13
Xid 31
Xid 48
Xid 79

4. What is the most effective way to detect a training stall via Prometheus?

Alert when GPU utilization drops to 0%
Alert when the training_last_step_timestamp is older than a threshold
Alert when the pod restarts
Alert when Grafana dashboard panels show no data

5. What environment variable should you set first when debugging NCCL communication failures?

CUDA_VISIBLE_DEVICES
NCCL_DEBUG=INFO
NVIDIA_DRIVER_CAPABILITIES=all
PYTORCH_CUDA_ALLOC_CONF

6. For inference observability, why are p99 latency metrics more important than mean latency?

Mean latency is not supported by Prometheus
p99 defines the worst experience for almost all users and is the SLO boundary
p99 is always lower than the mean, so it is more useful
Mean latency includes network overhead while p99 does not

7. What is the correct Prometheus metric type for tracking inference latency distributions?

Counter
Gauge
Histogram
Summary

8. A pod is running but training throughput is ~100x lower than expected. What is the most likely cause?

The learning rate is too low
The pod spec is missing nvidia.com/gpu resource requests
The Prometheus scrape interval is too long
The model has too many parameters

9. What does torch.distributed.monitored_barrier() help you diagnose?

GPU memory leaks during training
Which rank failed to arrive at a synchronization point
Whether the model has converged
The optimal batch size for distributed training

10. How do you detect GPU thermal throttling using DCGM metrics?

Alert when GPU power draw exceeds rated TDP
Correlate SM clock frequency drops with rising GPU temperature
Monitor the DCGM_FI_DEV_ECC_SBE_VOL_TOTAL metric
Check the pod restart count in Kubernetes

11. What Kubernetes custom resource connects DCGM Exporter to Prometheus?

PrometheusRule
ServiceMonitor
PodMonitor
ScrapeConfig

12. A distributed training job hangs for 1800 seconds and then fails. What is the default NCCL timeout behavior?

NCCL retries the collective operation indefinitely
NCCL waits for the default timeout period, then raises an error
NCCL kills the straggler rank and continues
NCCL logs a warning but does not fail

Your Progress

Answer Explanations