Chapter 9: Monitoring, Observability, and Troubleshooting

Learning Objectives

Pre-Study Assessment

1. What is the primary purpose of deploying DCGM Exporter as a DaemonSet?

To centrally query GPU metrics from one pod
To ensure one metrics-exporting pod runs on every GPU node
To replace the NVIDIA device plugin
To automatically restart failed GPU workloads

2. Why can't infrastructure metrics alone tell you if a model is converging?

Infrastructure metrics are sampled too infrequently
A GPU can run at high utilization while training loss has plateaued
Infrastructure metrics only work for inference workloads
Prometheus cannot scrape GPU metrics

3. Which Xid error code indicates an uncorrectable GPU memory fault requiring immediate node cordoning?

Xid 13
Xid 31
Xid 48
Xid 79

4. What is the most effective way to detect a training stall via Prometheus?

Alert when GPU utilization drops to 0%
Alert when the training_last_step_timestamp is older than a threshold
Alert when the pod restarts
Alert when Grafana dashboard panels show no data

5. What environment variable should you set first when debugging NCCL communication failures?

CUDA_VISIBLE_DEVICES
NCCL_DEBUG=INFO
NVIDIA_DRIVER_CAPABILITIES=all
PYTORCH_CUDA_ALLOC_CONF

6. For inference observability, why are p99 latency metrics more important than mean latency?

Mean latency is not supported by Prometheus
p99 defines the worst experience for almost all users and is the SLO boundary
p99 is always lower than the mean, so it is more useful
Mean latency includes network overhead while p99 does not

7. What is the correct Prometheus metric type for tracking inference latency distributions?

Counter
Gauge
Histogram
Summary

8. A pod is running but training throughput is ~100x lower than expected. What is the most likely cause?

The learning rate is too low
The pod spec is missing nvidia.com/gpu resource requests
The Prometheus scrape interval is too long
The model has too many parameters

9. What does torch.distributed.monitored_barrier() help you diagnose?

GPU memory leaks during training
Which rank failed to arrive at a synchronization point
Whether the model has converged
The optimal batch size for distributed training

10. How do you detect GPU thermal throttling using DCGM metrics?

Alert when GPU power draw exceeds rated TDP
Correlate SM clock frequency drops with rising GPU temperature
Monitor the DCGM_FI_DEV_ECC_SBE_VOL_TOTAL metric
Check the pod restart count in Kubernetes

11. What Kubernetes custom resource connects DCGM Exporter to Prometheus?

PrometheusRule
ServiceMonitor
PodMonitor
ScrapeConfig

12. A distributed training job hangs for 1800 seconds and then fails. What is the default NCCL timeout behavior?

NCCL retries the collective operation indefinitely
NCCL waits for the default timeout period, then raises an error
NCCL kills the straggler rank and continues
NCCL logs a warning but does not fail

Section 1: GPU Monitoring with DCGM and Prometheus

The Three-Layer Observability Model

Effective AI workload observability requires monitoring three interlocking layers simultaneously: infrastructure (GPU hardware), platform (Kubernetes scheduling), and application (model metrics). No single layer tells the full story.

Layer | What It Measures | Key Tools
Infrastructure | GPU utilization, memory, temperature, power, ECC errors | DCGM Exporter, node_exporter
Platform | Pod health, scheduling events, resource quotas, job status | kube-state-metrics, Kubernetes Events
Application | Training loss, learning rate, inference latency, throughput | Custom Prometheus metrics, MLflow, W&B

NVIDIA DCGM Exporter Deployment

DCGM (Data Center GPU Manager) is NVIDIA's suite for managing and monitoring GPUs at scale. The DCGM Exporter component collects GPU telemetry and exposes it at an HTTP /metrics endpoint that Prometheus can scrape.

DCGM Exporter is deployed as a Kubernetes DaemonSet, meaning one pod runs on every GPU-equipped node. GPU metrics are node-local and cannot be collected from a central location without per-node agents.

Installation uses the official Helm chart:

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter

Key GPU Metrics

Metric Name | Description | Alert Threshold
DCGM_FI_DEV_GPU_UTIL | GPU compute utilization (%) | Alert if < 20% for sustained training
DCGM_FI_DEV_FB_USED | Framebuffer memory used (MiB) | Alert if > 95% of total
DCGM_FI_DEV_GPU_TEMP | GPU core temperature (C) | Alert if > 83 C
DCGM_FI_DEV_SM_CLOCK | Streaming Multiprocessor clock (MHz) | Alert on sustained drop during training
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | Volatile double-bit ECC errors | Alert immediately on any occurrence
DCGM_FI_DEV_XID_ERRORS | Xid errors (hardware/driver faults) | Alert on any new occurrence

Double-bit ECC errors are particularly critical: they indicate uncorrectable GPU memory faults and should immediately trigger node cordoning.
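
The "alert immediately" guidance maps to a rule like the following sketch (a PrometheusRule fragment; the runbook text is an assumption about your remediation process):

```yaml
- alert: GPUDoubleBitECCError
  # Any volatile DBE count means uncorrectable GPU memory faults
  expr: DCGM_FI_DEV_ECC_DBE_VOL_TOTAL > 0
  labels:
    severity: critical
  annotations:
    summary: "Uncorrectable ECC errors on {{ $labels.Hostname }} GPU {{ $labels.gpu }}"
    runbook: "Cordon the node, drain GPU workloads, schedule GPU replacement"
```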

Prometheus ServiceMonitor Configuration

A ServiceMonitor custom resource tells Prometheus how to discover and scrape the DCGM Exporter service. The release label must match the Prometheus Operator's serviceMonitorSelector -- this is the most common misconfiguration.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  namespaceSelector:
    matchNames:
    - gpu-operator-resources
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics
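
Once Prometheus has picked up the ServiceMonitor, a quick sanity query confirms metrics are flowing (label names follow dcgm-exporter's defaults; adjust if your install relabels):

```promql
# Average GPU utilization per device over the last 5 minutes
avg by (gpu, Hostname) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))
```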

Animation: GPU Monitoring Pipeline

[Diagram: a GPU node's GPUs (e.g. two A100s) feed the DCGM Host Engine; DCGM Exporter exposes telemetry at :9400/metrics; a ServiceMonitor tells Prometheus where to scrape and store it; Prometheus alert rules route to Alertmanager for notification, while Grafana dashboards (dashboard ID 12239) visualize metrics such as gpu_util=92%, temp=71C, mem=78%.]

Key Points: GPU Monitoring

Section 2: Training and Inference Observability

Custom Metrics for Training Workloads

Infrastructure metrics tell you whether the GPU is busy; they do not tell you whether the model is converging. A GPU can run at 95% utilization while training loss has plateaued. Application-layer metrics close this gap.

Metric | What It Reveals | Alert Condition
Training loss (per step) | Model convergence direction | Loss not decreasing over N steps = stall
Validation loss (per epoch) | Overfitting detection | Val loss diverging from train loss
Samples per second | GPU throughput efficiency | > 20% drop vs. baseline
Gradient norm | Exploding/vanishing gradient health | Norm > 100 or < 1e-7
Checkpoint write success | Recovery point availability | Alert on write failure

Expose these from your training code using the Prometheus Python client:

from prometheus_client import Gauge, start_http_server

# Expose /metrics on port 8080 for Prometheus to scrape
start_http_server(8080)

training_loss = Gauge('training_loss', 'Current training loss',
                      ['job_name', 'rank'])

for step, batch in enumerate(dataloader):
    # ... forward pass, backward pass ...
    # job_name and rank come from the training job's environment
    training_loss.labels(job_name=job_name, rank=rank).set(loss.item())

Inference Observability: Latency Distributions

Inference workloads prioritize latency over raw throughput. Tail latency (p99) matters most because it defines the worst experience users routinely receive. Use Prometheus Histograms for latency -- they allow computing arbitrary percentiles at query time:

from prometheus_client import Histogram

inference_latency = Histogram(
    'inference_request_duration_seconds',
    'End-to-end inference latency',
    ['model_name', 'model_version'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# Query p99 in PromQL:
# histogram_quantile(0.99,
#   sum(rate(inference_request_duration_seconds_bucket[5m])) by (le, model_name)
# )

Metric | Description | Why It Matters
p50 latency | Median request latency | Baseline serving speed
p95 latency | 95th percentile | Typical worst-case experience
p99 latency | 99th percentile | SLO boundary for most services
Queue depth | Pending requests | Scaling trigger
TTFT | Time to First Token | Critical for streaming LLM UX
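
To see why the mean hides the tail, here is a stdlib-only sketch with synthetic latencies (the 1% slow-path fraction and the latency ranges are illustrative):

```python
import random
import statistics

random.seed(0)

# 99% of requests are fast; 1% hit a slow path (retries, cold cache, long prompt).
latencies = [random.uniform(0.05, 0.15) for _ in range(9900)]
latencies += [random.uniform(2.0, 5.0) for _ in range(100)]

mean = statistics.mean(latencies)
# quantiles(n=100) returns the 1st..99th percentile cut points.
pct = statistics.quantiles(latencies, n=100)
p50, p95, p99 = pct[49], pct[94], pct[98]

print(f"mean={mean:.3f}s  p50={p50:.3f}s  p95={p95:.3f}s  p99={p99:.3f}s")
# The mean and p50 stay low, but p99 lands squarely in the slow tail --
# the number an SLO and a user-facing latency target actually care about.
```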

Distributed Training Communication Monitoring

In multi-node distributed training, collective communication operations (AllReduce, AllGather) are synchronization points where every GPU rank must arrive. A single slow or dead GPU stalls every other rank. Use torch.distributed.monitored_barrier() to identify which rank failed to arrive; note that it is only supported on the Gloo backend, so NCCL jobs typically create a secondary Gloo process group just for this check:

import datetime

import torch.distributed as dist

# monitored_barrier requires the Gloo backend
pg_gloo = dist.new_group(backend="gloo")

dist.monitored_barrier(
    group=pg_gloo,
    timeout=datetime.timedelta(seconds=60),
    wait_all_ranks=True,  # report every late rank, not just the first
)
# Raises RuntimeError identifying which rank(s) failed to arrive

Key Points: Training and Inference Observability

Section 3: Alerting and SLOs for AI Workloads

Designing Alerts That Matter

Alerts should page someone who can act on them immediately. Every alert should answer three questions: what is broken, where is it broken, and what is the immediate remediation step.

GPU Failure and Xid Error Alerts

Xid errors are the GPU driver's hardware fault codes. Common codes for AI workloads:

Xid Code | Meaning | Action
Xid 13 | Graphics Exception | Investigate workload, restart if persistent
Xid 31 | GPU memory page fault | Check for out-of-bounds CUDA memory access
Xid 48 | Double-bit ECC error | Cordon node immediately, replace GPU
Xid 63 | Row remapping failure | Schedule GPU replacement
Xid 79 | GPU has fallen off the bus | Node reboot required, hardware failure likely
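
A sketch of the "alert on any occurrence" guidance as a rule. Depending on exporter version, DCGM_FI_DEV_XID_ERRORS may report the most recent Xid value rather than a cumulative count, so changes() is a safer trigger than increase():

```yaml
- alert: GPUXidErrorObserved
  # Fires whenever a new Xid value is reported in the window
  expr: changes(DCGM_FI_DEV_XID_ERRORS[5m]) > 0
  labels:
    severity: critical
  annotations:
    summary: "Xid error on {{ $labels.Hostname }} GPU {{ $labels.gpu }}"
```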

Training Stall Detection

A training stall is one of the most expensive failure modes: the cluster continues consuming expensive GPU-hours while producing no useful model updates. Detect stalls by alerting on the absence of training progress:

# Prometheus alert rule
- alert: TrainingJobStalled
  expr: (time() - training_last_step_timestamp_seconds) > 600
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Training job {{ $labels.job_name }} stalled"
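
The rule's logic is plain timestamp arithmetic, which makes it easy to sanity-check in your own alerting glue; a stdlib-only mirror of the expression (the function name is hypothetical):

```python
import time

STALL_AFTER_SECONDS = 600  # same threshold as the alert expression

def training_is_stalled(last_step_ts: float, now: float) -> bool:
    # Mirrors the PromQL: (time() - training_last_step_timestamp_seconds) > 600
    return (now - last_step_ts) > STALL_AFTER_SECONDS

now = time.time()
print(training_is_stalled(now - 30, now))    # stepped 30s ago -> False (healthy)
print(training_is_stalled(now - 1200, now))  # silent for 20 min -> True (stalled)
```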

SLO Definitions for Inference Services

SLO | Objective | Alert if
p99 latency | < 2 seconds | > 2s for 5m
p95 latency | < 500 ms | > 500 ms for 5m
Availability | > 99.9% | < 99.9% over 1h
TTFT | < 300 ms | > 300 ms for 5m
Error rate | < 0.1% | > 0.1% for 5m

KEDA can use these same Prometheus metrics as scaling signals, automatically adding inference replicas when the p99 SLO is at risk.
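
A hedged sketch of that wiring using KEDA's Prometheus scaler (the resource names, server address, and thresholds are assumptions for illustration):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-latency-scaler
spec:
  scaleTargetRef:
    name: inference-deployment        # hypothetical Deployment name
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090   # adjust to your install
      # Scale out before p99 reaches the 2s SLO boundary
      query: |
        histogram_quantile(0.99,
          sum(rate(inference_request_duration_seconds_bucket[5m])) by (le))
      threshold: "1.5"
```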

Animation: Alert Escalation Timeline
[Diagram: alert escalation timeline. t=0 normal (GPU temp 68C, SM clock 1.9 GHz); t=+3m threshold breach (temp 84C, clock 1.5 GHz); t=+8m the Prometheus rule's for: 5m condition is satisfied and the alert FIRES; t=+9m Alertmanager routes the notification to Slack / PagerDuty and pages the SRE; t=+12m remediation: kubectl cordon the node and reschedule the workload.]

Key Points: Alerting and SLOs

Section 4: Troubleshooting Common Issues

GPU Not Detected or Driver Mismatch

Symptom: Pod runs, but training throughput is ~100x lower than expected because the job is silently executing on CPU. This failure mode is dangerous precisely because nothing errors: the job looks healthy while wasting days of wall-clock time.

Root causes range from a missing nvidia.com/gpu resource request, to a failed NVIDIA device plugin, to a node that no longer advertises GPU capacity, to a broken container runtime. Work through them in order:

# Diagnostic sequence:
# 1. Does the pod actually request a GPU?
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].resources}'
# 2. Is the NVIDIA device plugin running on the node?
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds -o wide
# 3. Does the node advertise nvidia.com/gpu capacity?
kubectl get node <node> -o jsonpath='{.status.capacity}'
# 4. Can the container see the GPU at all?
kubectl exec -it <pod> -- nvidia-smi
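
If the first command shows no GPU in the container's resources, the fix is the resource limit itself (a sketch; adjust the count to your job):

```yaml
resources:
  limits:
    nvidia.com/gpu: 1   # without this, the container silently runs CPU-only
```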

NCCL Communication Failures

Symptom: Distributed training hangs indefinitely, then fails after 1800 seconds (default NCCL timeout).

Cause | Diagnostic Signal | Fix
Wrong network interface | NCCL logs show wrong IP | Set NCCL_SOCKET_IFNAME=eth0
Clock skew > 1ms | NTP logs show drift | Synchronize NTP/chrony
Mismatched tensor shapes | Stack trace at AllReduce | Audit model code
Dead GPU (silent failure) | One rank stops progressing | Check DCGM Xid metrics

Always set NCCL_DEBUG=INFO and NCCL_ASYNC_ERROR_HANDLING=1 as the first diagnostic step. Run nccl-tests to validate baseline inter-node bandwidth before large training runs.
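
As a sketch, these can be set defensively at the top of the training script, before any process group is initialized (NCCL reads them at communicator creation, so setting them later has no effect):

```python
import os

# Enable per-rank NCCL connection/topology logging
os.environ["NCCL_DEBUG"] = "INFO"
# Surface asynchronous NCCL errors as Python exceptions instead of silent hangs
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
# If NCCL picks the wrong interface, pin it explicitly (interface name varies):
# os.environ["NCCL_SOCKET_IFNAME"] = "eth0"

# ... only then: torch.distributed.init_process_group(backend="nccl", ...)
```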

Pod Scheduling Failures

Symptom: Pods remain in Pending state. Events show Insufficient nvidia.com/gpu.

GPU nodes commonly carry a NoSchedule taint. If your training pod is missing the matching toleration, it will never be scheduled:

tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule

Thermal Throttling

Symptom: Training throughput gradually decreases over a long run with no errors logged. When GPU temperature exceeds ~83C, the hardware automatically reduces SM clock frequency -- a GPU throttling from 1.95 GHz to 1.2 GHz loses approximately 38% of compute throughput.

Detection: Correlate DCGM_FI_DEV_GPU_TEMP with DCGM_FI_DEV_SM_CLOCK. A drop in SM clock correlated with rising temperature is the definitive signature.
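
That correlation can be encoded directly as a rule; a sketch (the 1400 MHz clock floor is an assumption, tune it per GPU model):

```yaml
- alert: GPUThermalThrottling
  # Hot GPU and depressed SM clock at the same time is the throttling signature
  expr: (DCGM_FI_DEV_GPU_TEMP > 83) and (DCGM_FI_DEV_SM_CLOCK < 1400)
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} is thermal throttling"
```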

Animation: GPU Scheduling Failure — Diagnostic Decision Tree
[Diagram: diagnostic decision tree for a pod that is running but slow. (1) Is nvidia.com/gpu present in resources.limits? If not, add nvidia.com/gpu: 1. (2) Is the device plugin pod running on the target node? If not, restart the device plugin DaemonSet. (3) Does node capacity show nvidia.com/gpu? If not, check the NVIDIA driver and reboot. (4) Do pod events show a scheduling error? If so, add the NoSchedule toleration. (5) Does nvidia-smi work inside the container? If not, fix the NVIDIA Container Runtime; if it works but no GPU is detected, check CUDA compatibility.]

Key Points: Troubleshooting

Post-Study Assessment

1. What is the primary purpose of deploying DCGM Exporter as a DaemonSet?

To centrally query GPU metrics from one pod
To ensure one metrics-exporting pod runs on every GPU node
To replace the NVIDIA device plugin
To automatically restart failed GPU workloads

2. Why can't infrastructure metrics alone tell you if a model is converging?

Infrastructure metrics are sampled too infrequently
A GPU can run at high utilization while training loss has plateaued
Infrastructure metrics only work for inference workloads
Prometheus cannot scrape GPU metrics

3. Which Xid error code indicates an uncorrectable GPU memory fault requiring immediate node cordoning?

Xid 13
Xid 31
Xid 48
Xid 79

4. What is the most effective way to detect a training stall via Prometheus?

Alert when GPU utilization drops to 0%
Alert when the training_last_step_timestamp is older than a threshold
Alert when the pod restarts
Alert when Grafana dashboard panels show no data

5. What environment variable should you set first when debugging NCCL communication failures?

CUDA_VISIBLE_DEVICES
NCCL_DEBUG=INFO
NVIDIA_DRIVER_CAPABILITIES=all
PYTORCH_CUDA_ALLOC_CONF

6. For inference observability, why are p99 latency metrics more important than mean latency?

Mean latency is not supported by Prometheus
p99 defines the worst experience for almost all users and is the SLO boundary
p99 is always lower than the mean, so it is more useful
Mean latency includes network overhead while p99 does not

7. What is the correct Prometheus metric type for tracking inference latency distributions?

Counter
Gauge
Histogram
Summary

8. A pod is running but training throughput is ~100x lower than expected. What is the most likely cause?

The learning rate is too low
The pod spec is missing nvidia.com/gpu resource requests
The Prometheus scrape interval is too long
The model has too many parameters

9. What does torch.distributed.monitored_barrier() help you diagnose?

GPU memory leaks during training
Which rank failed to arrive at a synchronization point
Whether the model has converged
The optimal batch size for distributed training

10. How do you detect GPU thermal throttling using DCGM metrics?

Alert when GPU power draw exceeds rated TDP
Correlate SM clock frequency drops with rising GPU temperature
Monitor the DCGM_FI_DEV_ECC_SBE_VOL_TOTAL metric
Check the pod restart count in Kubernetes

11. What Kubernetes custom resource connects DCGM Exporter to Prometheus?

PrometheusRule
ServiceMonitor
PodMonitor
ScrapeConfig

12. A distributed training job hangs for 1800 seconds and then fails. What is the default NCCL timeout behavior?

NCCL retries the collective operation indefinitely
NCCL waits for the default timeout period, then raises an error
NCCL kills the straggler rank and continues
NCCL logs a warning but does not fail

Your Progress

Answer Explanations