1. What is the primary purpose of deploying DCGM Exporter as a DaemonSet?
- To centrally query GPU metrics from one pod
- To ensure one metrics-exporting pod runs on every GPU node
- To replace the NVIDIA device plugin
- To automatically restart failed GPU workloads

2. Why can't infrastructure metrics alone tell you if a model is converging?
- Infrastructure metrics are sampled too infrequently
- A GPU can run at high utilization while training loss has plateaued
- Infrastructure metrics only work for inference workloads
- Prometheus cannot scrape GPU metrics

3. Which Xid error code indicates an uncorrectable GPU memory fault requiring immediate node cordoning?
- Xid 13
- Xid 31
- Xid 48
- Xid 79

4. What is the most effective way to detect a training stall via Prometheus?
- Alert when GPU utilization drops to 0%
- Alert when the training_last_step_timestamp is older than a threshold
- Alert when the pod restarts
- Alert when Grafana dashboard panels show no data

5. What environment variable should you set first when debugging NCCL communication failures?
- CUDA_VISIBLE_DEVICES
- NCCL_DEBUG=INFO
- NVIDIA_DRIVER_CAPABILITIES=all
- PYTORCH_CUDA_ALLOC_CONF

6. For inference observability, why are p99 latency metrics more important than mean latency?
- Mean latency is not supported by Prometheus
- p99 defines the worst experience for almost all users and is the SLO boundary
- p99 is always lower than the mean, so it is more useful
- Mean latency includes network overhead while p99 does not

7. What is the correct Prometheus metric type for tracking inference latency distributions?
- Counter
- Gauge
- Histogram
- Summary

8. A pod is running but training throughput is ~100x lower than expected. What is the most likely cause?
- The learning rate is too low
- The pod spec is missing nvidia.com/gpu resource requests
- The Prometheus scrape interval is too long
- The model has too many parameters

9. What does torch.distributed.monitored_barrier() help you diagnose?
- GPU memory leaks during training
- Which rank failed to arrive at a synchronization point
- Whether the model has converged
- The optimal batch size for distributed training

10. How do you detect GPU thermal throttling using DCGM metrics?
- Alert when GPU power draw exceeds rated TDP
- Correlate SM clock frequency drops with rising GPU temperature
- Monitor the DCGM_FI_DEV_ECC_SBE_VOL_TOTAL metric
- Check the pod restart count in Kubernetes

11. What Kubernetes custom resource connects DCGM Exporter to Prometheus?
- PrometheusRule
- ServiceMonitor
- PodMonitor
- ScrapeConfig

12. A distributed training job hangs for 1800 seconds and then fails. What is the default NCCL timeout behavior?
- NCCL retries the collective operation indefinitely
- NCCL waits for the default timeout period, then raises an error
- NCCL kills the straggler rank and continues
- NCCL logs a warning but does not fail
The Three-Layer Observability Model
Effective AI workload observability requires monitoring three interlocking layers simultaneously: infrastructure (GPU hardware), platform (Kubernetes scheduling), and application (model metrics). No single layer tells the full story.
| Layer | What It Measures | Key Tools |
|---|---|---|
| Infrastructure | GPU utilization, memory, temperature, power, ECC errors | DCGM Exporter, node_exporter |
| Platform | Pod health, scheduling events, resource quotas, job status | kube-state-metrics, Kubernetes Events |
| Application | Training loss, learning rate, inference latency, throughput | Custom Prometheus metrics, MLflow, W&B |
NVIDIA DCGM Exporter Deployment
DCGM (Data Center GPU Manager) is NVIDIA's suite for managing and monitoring GPUs at scale. The DCGM Exporter component collects GPU telemetry and exposes it at an HTTP /metrics endpoint that Prometheus can scrape.
DCGM Exporter is deployed as a Kubernetes DaemonSet, meaning one pod runs on every GPU-equipped node. GPU metrics are node-local and cannot be collected from a central location without per-node agents.
Installation uses the official Helm chart:
```shell
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install --generate-name gpu-helm-charts/dcgm-exporter
```
Key GPU Metrics
| Metric Name | Description | Alert Threshold |
|---|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU compute utilization (%) | Alert if < 20% for sustained training |
| DCGM_FI_DEV_FB_USED | Framebuffer memory used (MiB) | Alert if > 95% of total |
| DCGM_FI_DEV_GPU_TEMP | GPU core temperature (°C) | Alert if > 83 °C |
| DCGM_FI_DEV_SM_CLOCK | Streaming Multiprocessor clock (MHz) | Alert on sustained drop during training |
| DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | Volatile double-bit ECC errors | Alert immediately on any occurrence |
| DCGM_FI_DEV_XID_ERRORS | Xid error count (hardware/driver faults) | Alert on any increment |
Double-bit ECC errors are particularly critical: they indicate uncorrectable GPU memory faults and should immediately trigger node cordoning.
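This cordon-on-error policy can be expressed as a small decision function; a minimal sketch in Python, assuming metric samples arrive as a dict keyed by DCGM field name (the `should_cordon` helper and its shape are illustrative, not part of DCGM itself):

```python
# Sketch: decide whether a node should be cordoned from DCGM counter
# samples for one GPU. Helper name and input shape are illustrative.

def should_cordon(samples):
    """samples maps DCGM field names to their current values."""
    # Any volatile double-bit ECC error is uncorrectable: cordon immediately.
    if samples.get("DCGM_FI_DEV_ECC_DBE_VOL_TOTAL", 0) > 0:
        return True
    # Any Xid error also warrants keeping new work off the node until
    # the fault is investigated.
    if samples.get("DCGM_FI_DEV_XID_ERRORS", 0) > 0:
        return True
    return False

print(should_cordon({"DCGM_FI_DEV_ECC_DBE_VOL_TOTAL": 1}))  # True
```

In production the actual cordon would be issued through the Kubernetes API (e.g. by patching `spec.unschedulable`), driven by an alert on these counters.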
Prometheus ServiceMonitor Configuration
A ServiceMonitor custom resource tells Prometheus how to discover and scrape the DCGM Exporter service. The release label must match the Prometheus Operator's serviceMonitorSelector -- this is the most common misconfiguration.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  namespaceSelector:
    matchNames:
      - gpu-operator-resources
  selector:
    matchLabels:
      app: dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
```
Custom Metrics for Training Workloads
Infrastructure metrics tell you whether the GPU is busy; they do not tell you whether the model is converging. A GPU can run at 95% utilization while training loss has plateaued. Application-layer metrics close this gap.
| Metric | What It Reveals | Alert Condition |
|---|---|---|
| Training loss (per step) | Model convergence direction | Loss not decreasing over N steps = stall |
| Validation loss (per epoch) | Overfitting detection | Val loss diverging from train loss |
| Samples per second | GPU throughput efficiency | > 20% drop vs. baseline |
| Gradient norm | Exploding/vanishing gradient health | Norm > 100 or < 1e-7 |
| Checkpoint write success | Recovery point availability | Alert on write failure |
Expose these from your training code using the Prometheus Python client:
```python
from prometheus_client import Gauge, start_http_server

start_http_server(8080)

training_loss = Gauge('training_loss', 'Current training loss',
                      ['job_name', 'rank'])

for step, batch in enumerate(dataloader):
    # ... forward pass, backward pass ...
    training_loss.labels(job_name=job_name, rank=rank).set(loss.item())
```
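The gradient-norm row in the table reduces to simple arithmetic; a minimal sketch, assuming you already have per-parameter L2 norms (the helper names are hypothetical; `torch.nn.utils.clip_grad_norm_` returns the same global norm as a side effect, so you can export it for free):

```python
import math

def global_grad_norm(per_param_norms):
    # Global L2 norm over all parameters: the square root of the sum of
    # squared per-parameter norms.
    return math.sqrt(sum(n * n for n in per_param_norms))

def grad_norm_unhealthy(norm, high=100.0, low=1e-7):
    # Alert condition from the table: exploding (> 100) or vanishing (< 1e-7).
    return norm > high or norm < low

norm = global_grad_norm([3.0, 4.0])  # 5.0
```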
Inference Observability: Latency Distributions
Inference workloads are latency-bound rather than throughput-bound. Tail latency (p99) matters most because it defines the worst experience a user receives. Use Prometheus Histograms for latency -- they allow computing arbitrary percentiles at query time:
```python
from prometheus_client import Histogram

inference_latency = Histogram(
    'inference_request_duration_seconds',
    'End-to-end inference latency',
    ['model_name', 'model_version'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# Query p99 in PromQL:
# histogram_quantile(0.99,
#   sum(rate(inference_request_duration_seconds_bucket[5m])) by (le, model_name)
# )
```
| Metric | Description | Why It Matters |
|---|---|---|
| p50 latency | Median request latency | Baseline serving speed |
| p95 latency | 95th percentile | Typical worst-case experience |
| p99 latency | 99th percentile | SLO boundary for most services |
| Queue depth | Pending requests | Scaling trigger |
| TTFT | Time to First Token | Critical for streaming LLM UX |
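TTFT is measured by timestamping the first chunk of a streaming response; a minimal sketch, where `token_iter` stands in for any streaming LLM response and the function name is illustrative:

```python
import time

def measure_ttft(token_iter):
    """Consume a streaming response and return (tokens, ttft_seconds).
    Sketch only; in production, observe ttft into a Prometheus Histogram."""
    start = time.monotonic()
    ttft = None
    tokens = []
    for token in token_iter:
        if ttft is None:
            # Time to First Token: the latency the user actually perceives
            # before anything appears on screen.
            ttft = time.monotonic() - start
        tokens.append(token)
    return tokens, ttft

tokens, ttft = measure_ttft(iter(["Hello", " world"]))
```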
Distributed Training Communication Monitoring
In multi-node distributed training, collective communication operations (AllReduce, AllGather) are synchronization points where all GPU ranks must meet. A single slow or dead GPU causes every other rank to stall. Use torch.distributed.monitored_barrier() to identify which rank failed to arrive:
```python
import datetime

import torch.distributed as dist

dist.monitored_barrier(
    timeout=datetime.timedelta(seconds=60),
    wait_all_ranks=True
)
# Raises RuntimeError identifying which rank failed to reach the barrier
```
Designing Alerts That Matter
Alerts should page someone who can act on them immediately. Every alert should answer three questions: what is broken, where is it broken, and what is the immediate remediation step.
GPU Failure and Xid Error Alerts
Xid errors are the GPU driver's hardware fault codes. Common codes for AI workloads:
| Xid Code | Meaning | Action |
|---|---|---|
| Xid 13 | Graphics Exception | Investigate workload, restart if persistent |
| Xid 31 | GPU memory page fault | Check out-of-bounds CUDA memory access |
| Xid 48 | Double-bit ECC error | Cordon node immediately, replace GPU |
| Xid 63 | Row remapping failure | Schedule GPU replacement |
| Xid 79 | GPU has fallen off the bus | Node reboot required, hardware failure likely |
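Xid codes surface in the kernel log as `NVRM: Xid` lines; a minimal parsing sketch that maps a detected code to the table's action (the log line shown is an approximation of real NVRM messages, and the mapping dict is illustrative):

```python
import re

# Actions for the Xid codes in the table above (illustrative mapping).
XID_ACTIONS = {
    13: "Investigate workload, restart if persistent",
    31: "Check out-of-bounds CUDA memory access",
    48: "Cordon node immediately, replace GPU",
    63: "Schedule GPU replacement",
    79: "Node reboot required, hardware failure likely",
}

XID_RE = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<code>\d+)")

def parse_xid(line):
    """Extract the Xid code from a kernel log line, or None if absent."""
    m = XID_RE.search(line)
    return int(m.group("code")) if m else None

code = parse_xid("NVRM: Xid (PCI:0000:3b:00): 79, GPU has fallen off the bus")
action = XID_ACTIONS.get(code)  # "Node reboot required, ..."
```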
Training Stall Detection
A training stall is one of the most expensive failure modes: the cluster continues consuming expensive GPU-hours while producing no useful model updates. Detect stalls by alerting on the absence of training progress:
```yaml
# Prometheus alert rule
- alert: TrainingJobStalled
  expr: (time() - training_last_step_timestamp_seconds) > 600
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Training job {{ $labels.job_name }} stalled"
```
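This alert only works if the training loop keeps `training_last_step_timestamp_seconds` fresh -- with `prometheus_client`, a Gauge updated via `set_to_current_time()` after every step. The stall predicate itself is just a timestamp comparison; a minimal sketch mirroring the PromQL expression, useful for unit-testing the logic:

```python
import time

def is_stalled(last_step_timestamp, now=None, threshold=600.0):
    # Stalled when the most recent training step is older than `threshold`
    # seconds -- the same condition the PromQL expression evaluates.
    now = time.time() if now is None else now
    return (now - last_step_timestamp) > threshold

print(is_stalled(last_step_timestamp=0.0, now=601.0))  # True
```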
SLO Definitions for Inference Services
| SLO | Objective | Alert if |
|---|---|---|
| p99 latency | < 2 seconds | > 2s for 5m |
| p95 latency | < 500ms | > 500ms for 5m |
| Availability | > 99.9% | < 99.9% over 1h |
| TTFT | < 300ms | > 300ms for 5m |
| Error rate | < 0.1% | > 0.1% for 5m |
KEDA can use these same Prometheus metrics as scaling signals, automatically adding inference replicas when the p99 SLO is at risk.
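A sketch of what that looks like as a KEDA ScaledObject, reusing the p99 query from the histogram defined earlier; the `inference-server` Deployment name, replica bounds, and threshold are assumptions to adapt to your environment:

```yaml
# Illustrative KEDA ScaledObject: add replicas when p99 approaches the SLO.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-p99-scaler
spec:
  scaleTargetRef:
    name: inference-server        # hypothetical Deployment name
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.99,
            sum(rate(inference_request_duration_seconds_bucket[5m])) by (le))
        threshold: "1.5"          # scale out before the 2 s p99 SLO is breached
```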
GPU Not Detected or Driver Mismatch
Symptom: Pod runs but training throughput is ~100x lower than expected (CPU-only execution). This is a particularly dangerous silent failure that has burned weeks of researcher time.
Root Causes:
- Missing nvidia.com/gpu resource request in the pod spec
- NVIDIA Container Runtime not configured as default runtime on the node
- NVIDIA device plugin DaemonSet not running on the target node
- Node driver version mismatch with the container's CUDA version
```shell
# Diagnostic sequence:
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].resources}'
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds -o wide
kubectl get node <node> -o jsonpath='{.status.capacity}'
kubectl exec -it <pod> -- nvidia-smi
```
NCCL Communication Failures
Symptom: Distributed training hangs indefinitely, then fails after 1800 seconds (default NCCL timeout).
| Cause | Diagnostic Signal | Fix |
|---|---|---|
| Wrong network interface | NCCL logs show wrong IP | Set NCCL_SOCKET_IFNAME=eth0 |
| Clock skew > 1ms | NTP logs show drift | Synchronize NTP/chrony |
| Mismatched tensor shapes | Stack trace at AllReduce | Audit model code |
| Dead GPU (silent failure) | One rank stops progressing | Check DCGM Xid metrics |
Always set NCCL_DEBUG=INFO and NCCL_ASYNC_ERROR_HANDLING=1 as the first diagnostic step. Run nccl-tests to validate baseline inter-node bandwidth before large training runs.
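These variables must be in the environment before the process group is initialized, since NCCL reads them at startup; a minimal sketch of wiring them in from Python (the `eth0` interface name is environment-specific):

```python
import os

# Set NCCL diagnostics before torch.distributed.init_process_group() runs;
# NCCL only reads these at initialization time.
os.environ.setdefault("NCCL_DEBUG", "INFO")              # verbose NCCL logging
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")  # fail fast on async errors
# Pin the interface explicitly if NCCL picks the wrong one (name varies
# by environment):
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")
```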
Pod Scheduling Failures
Symptom: Pods remain in Pending state. Events show Insufficient nvidia.com/gpu.
GPU nodes commonly carry a NoSchedule taint. If your training pod is missing the matching toleration, it will never be scheduled:
```yaml
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```
Thermal Throttling
Symptom: Training throughput gradually decreases over a long run with no errors logged. When GPU temperature exceeds ~83 °C, the hardware automatically reduces SM clock frequency -- a GPU throttling from 1.95 GHz to 1.2 GHz loses approximately 38% of compute throughput.
Detection: Correlate DCGM_FI_DEV_GPU_TEMP with DCGM_FI_DEV_SM_CLOCK. A drop in SM clock correlated with rising temperature is the definitive signature.
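That signature can be checked programmatically once the two series are joined per GPU; a minimal sketch, where the baseline clock and thresholds are illustrative and should be tuned to your GPU model's specifications:

```python
def is_throttling(samples, temp_limit=83.0, base_clock_mhz=1950.0,
                  clock_drop_frac=0.10):
    # Flag thermal throttling when the SM clock has dropped noticeably below
    # its baseline while temperature sits at or above the throttle point.
    temp = samples["DCGM_FI_DEV_GPU_TEMP"]
    clock = samples["DCGM_FI_DEV_SM_CLOCK"]
    return temp >= temp_limit and clock < base_clock_mhz * (1 - clock_drop_frac)

print(is_throttling({"DCGM_FI_DEV_GPU_TEMP": 86,
                     "DCGM_FI_DEV_SM_CLOCK": 1200}))  # True
```

In PromQL the same correlation is expressed by alerting on both conditions together, e.g. `DCGM_FI_DEV_GPU_TEMP >= 83 and DCGM_FI_DEV_SM_CLOCK < <baseline>` joined on the GPU label.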