Chapter 13: Monitoring, Benchmarking, and Operations

Learning Objectives

Section 1: AI Infrastructure Benchmarking

Pre-Quiz: AI Infrastructure Benchmarking

1. What organization develops the MLPerf benchmark suite?

A) NVIDIA AI Labs
B) MLCommons
C) The Linux Foundation
D) IEEE Standards Association

2. What is the key metric for MLPerf Training?

A) Queries per second
B) Latency per query
C) Time-to-train
D) Storage throughput in GB/s

3. Which MLPerf Inference scenario emulates cloud-based deployments?

A) Single-stream
B) Multistream
C) Server
D) Offline

4. What does MLPerf Storage primarily measure?

A) GPU memory bandwidth
B) How fast storage can supply training data to accelerators
C) Network latency between storage nodes
D) Disk IOPS for random reads

5. In MLPerf submission categories, what does "Preview" mean?

A) Systems using experimental research hardware
B) Systems that must be submittable as Available in the next round
C) Systems only available through cloud rental
D) Systems that have been deprecated

Key Points

Why Benchmarking Matters

AI workloads stress every infrastructure layer -- compute, networking, storage, and memory. A single poorly performing component can bottleneck an entire distributed training job. Benchmarking establishes quantitative baselines that answer: Can this platform sustain the throughput needed? Where does performance degrade under scale? How does our infrastructure compare to industry peers?

MLPerf Benchmark Suite

```mermaid
graph TD
    A["MLPerf Benchmark Suite<br>(MLCommons)"] --> B["MLPerf Training"]
    A --> C["MLPerf Inference"]
    A --> D["MLPerf Storage"]
    B --> B1["Key Metric:<br>Time-to-Train"]
    B --> B2["Workloads: LLMs,<br>Text-to-Image,<br>Recommenders, GNNs"]
    C --> C1["Key Metrics:<br>Throughput & Latency"]
    C --> C2["Scenarios: Single-Stream,<br>Multistream, Server, Offline"]
    D --> D1["Key Metric:<br>Storage Throughput (GB/s)"]
    D --> D2["Focus: Data Supply Rate<br>to Accelerators"]
    style A fill:#1a5276,color:#fff
    style B fill:#2e86c1,color:#fff
    style C fill:#2e86c1,color:#fff
    style D fill:#2e86c1,color:#fff
```

MLPerf Inference Scenarios

| Scenario | Emulates | Primary Metric |
| --- | --- | --- |
| Single-stream | Mobile device workloads | Latency per query |
| Multistream | Autonomous vehicle workloads | Latency across concurrent streams |
| Server | Cloud-based setups | Throughput and latency (p99) |
| Offline | Batch processing | Maximum throughput |

Key Performance Metrics

| Metric | Unit | Significance |
| --- | --- | --- |
| Throughput | Queries/sec or tokens/sec | Inference requests or training samples processed per unit time |
| Latency | Milliseconds or seconds | Time from request to result; server scenario enforces p99 |
| Time-to-train | Minutes or hours | Wall-clock time to reach target model accuracy |
| Accuracy | Model-specific (mAP, BLEU) | Output quality -- ensures systems don't trade quality for speed |
| Storage throughput | GB/s | Rate storage delivers training data to accelerators |
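Since the server scenario enforces a p99 latency bound, a minimal sketch of computing that percentile from recorded query latencies may help. The sample data and the 50 ms target are illustrative, not from an actual MLPerf run.

```python
# Sketch: computing a p99 latency bound from recorded query latencies,
# the constraint the MLPerf server scenario enforces.

def percentile(samples, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    # Nearest-rank: ceil(pct/100 * N) gives the 1-based rank.
    rank = max(1, -(-len(ordered) * pct // 100))  # ceil via negated floor division
    return ordered[rank - 1]

latencies_ms = [12.1, 9.8, 11.4, 10.2, 48.0, 10.9, 11.1, 9.9, 10.4, 10.7]
p99 = percentile(latencies_ms, 99)
meets_sla = p99 <= 50.0  # hypothetical 50 ms p99 target
```

A single outlier (the 48 ms sample) dominates the p99 here, which is exactly why the server scenario measures tail latency rather than the mean.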

Cisco UCS and MLPerf

| Platform | GPU Configuration | Benchmark |
| --- | --- | --- |
| Cisco UCS C845A M8 | 8x NVIDIA H200 NVL, 8x NVIDIA L40S PCIe | MLPerf Inference 5.1: Datacenter |
| Cisco UCS C885A M8 HGX | 8x NVIDIA H100, 8x NVIDIA H200 | MLPerf Inference and Training |
| Cisco UCS X210c M8 | Intel Xeon 6 processors | MLPerf Inference v5.1 Datacenter |
| Cisco UCS C240 M8 | Intel Xeon 6 processors | MLPerf Inference Datacenter |

Interpreting Results and Identifying Bottlenecks

  1. Compare within scenarios -- a system excelling in offline throughput may underperform in the server scenario where latency constraints apply
  2. Check scaling efficiency -- if doubling GPUs doesn't nearly double throughput, investigate network bandwidth, PCIe saturation, or NVLink topology
  3. Examine storage results separately -- high GPU throughput is meaningless if storage starves the pipeline
  4. Consider the full stack -- MLPerf results reflect specific software configurations; ensure your production stack is comparable
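The scaling check in step 2 can be made concrete: compare measured throughput against ideal linear speedup. The throughput numbers and the 0.9 efficiency cutoff below are illustrative placeholders.

```python
# Sketch: quantifying multi-GPU scaling efficiency as described in step 2.

def scaling_efficiency(throughput_1x, throughput_nx, n):
    """Fraction of ideal linear speedup achieved when scaling to n GPUs."""
    return throughput_nx / (throughput_1x * n)

eff = scaling_efficiency(throughput_1x=1000, throughput_nx=1850, n=2)
# Efficiency well below ~0.9 suggests checking network bandwidth,
# PCIe saturation, or NVLink topology.
needs_investigation = eff < 0.9
```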
Key Takeaway: MLPerf provides standardized, reproducible benchmarks across Training, Inference, and Storage that enable objective comparison of AI platforms. Cisco UCS systems demonstrate near-linear multi-server scaling when paired with high-performance network fabrics.
Animation Slot: Interactive MLPerf scenario selector -- choose a scenario and see the metric, deployment pattern, and example workload
Post-Quiz: AI Infrastructure Benchmarking

1. Which MLPerf Inference scenario enforces p99 latency constraints and measures throughput?

A) Single-stream
B) Multistream
C) Server
D) Offline

2. If doubling GPUs does not nearly double throughput, which of these is NOT a likely bottleneck?

A) Network bandwidth
B) PCIe lane saturation
C) Model accuracy threshold
D) NVLink topology

3. What did Cisco demonstrate with the UCS C885A M8 in MLPerf submissions?

A) Lowest power consumption per GPU
B) Near-linear scaling for multi-server, multi-GPU inference
C) Highest single-GPU training speed
D) Best accuracy on text-to-image workloads

4. Why is MLPerf Storage important even when GPU throughput is high?

A) It measures GPU memory allocation efficiency
B) If storage can't feed GPUs fast enough, expensive accelerators sit idle
C) It replaces the need for network benchmarks
D) It only applies to inference workloads

5. An MLPerf "Available" submission means the system contains:

A) Experimental hardware not yet released
B) Only components available for purchase or cloud rental
C) Internal development hardware
D) Components that will be available next quarter

Section 2: Monitoring with Cisco Solutions

Pre-Quiz: Monitoring with Cisco Solutions

1. What is the primary role of Cisco Nexus Dashboard?

A) GPU driver management
B) Unified management console for monitoring, troubleshooting, and automating data center operations
C) Storage provisioning only
D) Cloud application deployment

2. What technology does Nexus Dashboard Insights use to establish performance baselines?

A) Static threshold rules only
B) AI/ML-powered dynamic baselining
C) Manual operator input
D) Vendor-provided default values

3. Which Cisco platform provides cloud-based infrastructure management and enriches Nexus Dashboard with defect/PSIRT data?

A) Cisco DNA Center
B) Cisco Intersight
C) Cisco Meraki
D) Cisco Umbrella

4. What AI-specific monitoring capability does Nexus Dashboard provide?

A) Model training code analysis
B) GPU utilization, memory, temperature, and distributed compute node monitoring
C) Automatic model hyperparameter tuning
D) Dataset quality scoring

5. What is a "threshold band" in Nexus Dashboard Insights?

A) A fixed limit set by the hardware manufacturer
B) The range around a dynamic baseline within which a KPI is considered normal
C) The maximum bandwidth of a network link
D) A frequency range for telemetry collection

Key Points

Nexus Dashboard Overview

Cisco Nexus Dashboard serves as "mission control" for the data center -- a single pane of glass aggregating data from every fabric, switch, and compute node into actionable intelligence. It hosts Nexus Dashboard Insights, which automates troubleshooting and enables rapid root-cause analysis.

Core capabilities include topology-aware visualization, real-time KPI monitoring, proactive troubleshooting, and predictive analytics with forecasting and optimization recommendations.

AI-Specific Monitoring

| Capability | Description |
| --- | --- |
| GPU performance monitoring | Deep visibility into GPU utilization, memory, temperature, and AI-specific performance demands |
| Distributed compute node monitoring | Real-time monitoring of network interfaces, NICs, GPUs, and compute nodes in training jobs |
| AI fabric support | Telemetry for routed and VXLAN-based AI fabrics, including rail-optimized and full-mesh topologies |
| Latency anomaly detection | Automatic detection of unusual delay spikes at flow granularity, correlated with burst events |

Dynamic Baselining Process

```mermaid
stateDiagram-v2
    [*] --> ObserveKPIs: Collect KPI data from fabric
    ObserveKPIs --> BuildBaseline: Analyze behavior patterns
    BuildBaseline --> MonitorAgainstBaseline: Baseline established
    MonitorAgainstBaseline --> MonitorAgainstBaseline: Metric within threshold band
    MonitorAgainstBaseline --> AnomalyDetected: Metric crosses threshold band
    AnomalyDetected --> GenerateAlert: Classify severity
    GenerateAlert --> CorrelateWithIntersight: Enrich with defect/PSIRT data
    CorrelateWithIntersight --> OperatorRemediation: Provide actionable guidance
    OperatorRemediation --> MonitorAgainstBaseline: Verify fix, resume monitoring
    MonitorAgainstBaseline --> UpdateBaseline: Network conditions change
    UpdateBaseline --> MonitorAgainstBaseline: Baseline recalculated
```

Rather than relying on static thresholds, Nexus Dashboard Insights creates network-specific baselines for each KPI based on observed behavior patterns, continuously updates them to reflect changing conditions, and generates anomaly alerts when network state crosses the threshold band. Administrators can also configure custom thresholds through global rules for fine-tuning alert sensitivity.
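A minimal sketch of the idea: maintain a rolling per-KPI baseline and flag samples that leave the threshold band. Nexus Dashboard Insights' actual models are more sophisticated; the 3-sigma band width and window size here are illustrative assumptions.

```python
# Sketch of dynamic baselining: a rolling mean/stddev per KPI defines a
# threshold band, and samples outside the band raise anomalies.
from collections import deque
import statistics

class DynamicBaseline:
    def __init__(self, window=100, sigmas=3.0):
        self.samples = deque(maxlen=window)  # baseline recalculated as the window slides
        self.sigmas = sigmas

    def observe(self, value):
        """Return True if value is anomalous relative to the current band."""
        anomalous = False
        if len(self.samples) >= 10:  # need enough history for a baseline
            mean = statistics.fmean(self.samples)
            stdev = statistics.pstdev(self.samples)
            band = self.sigmas * max(stdev, 1e-9)
            anomalous = abs(value - mean) > band
        self.samples.append(value)  # baseline keeps adapting to new conditions
        return anomalous

baseline = DynamicBaseline()
for latency_us in [50, 52, 49, 51, 50, 48, 52, 51, 49, 50]:
    baseline.observe(latency_us)
alert = baseline.observe(250)  # spike far outside the band -> anomaly
```

Because the spike is also appended to the window, the baseline adapts if elevated latency becomes the new normal, which is the behavior that reduces false alarms compared with static thresholds.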

Cisco Intersight Integration

Intersight provides cloud-based infrastructure management extending visibility across compute, storage, and networking. Integration with Nexus Dashboard creates a closed-loop workflow:

  1. Nexus Dashboard detects anomalous behavior
  2. Intersight correlates it with known defects or security advisories
  3. The operator receives actionable remediation guidance

Intersight enriches data with the known defect database, field notices, PSIRT alerts, and sustainability/power metrics.

Dashboard Design Best Practices

Effective dashboards follow a hierarchical drill-down model:

  1. Executive level -- cluster health and aggregate SLA compliance scores
  2. Fabric level -- topology views and anomaly counts
  3. Compute/GPU and network views -- per-node and per-link KPIs
  4. Resource level -- detailed drill-downs into GPU utilization, network latency per flow, and storage I/O rates

Key Takeaway: Cisco Nexus Dashboard and Intersight together provide comprehensive AI infrastructure monitoring -- from dynamic baselines powered by AI/ML to proactive security and defect correlation. Dashboard design should follow a hierarchical drill-down model from cluster health to individual resource metrics.
Animation Slot: Interactive dashboard hierarchy -- click through Executive Summary, Fabric Overview, Compute/GPU View, and Network View layers
Post-Quiz: Monitoring with Cisco Solutions

1. What triggers an anomaly alert in Nexus Dashboard Insights?

A) Any change in network configuration
B) When network state crosses the dynamic threshold band around a baseline
C) When a device reboots
D) Every time a new workload starts

2. What does Intersight contribute when integrated with Nexus Dashboard?

A) GPU driver updates
B) Known defect database, field notices, PSIRT alerts, and sustainability metrics
C) Model training acceleration
D) Network topology auto-configuration

3. Why are dynamic baselines preferred over static thresholds for AI fabric monitoring?

A) They require less compute resources
B) They adapt to changing network conditions, reducing false alarms and catching genuine anomalies
C) They are simpler to configure
D) They only monitor GPU metrics

4. Which fabric types does Nexus Dashboard support for AI workloads?

A) Only Cisco ACI fabrics
B) ACI, NX-OS, VXLAN EVPN, routed and VXLAN-based AI fabrics, and external fabrics
C) Only VXLAN-based fabrics
D) Only third-party fabrics via OpenConfig

5. In the hierarchical dashboard design, what belongs at the "Resource level"?

A) Aggregate SLA compliance scores
B) Topology views and anomaly counts
C) Detailed drill-downs into GPU utilization, network latency per flow, and storage I/O rates
D) Executive summary and cluster health

Section 3: Operational Telemetry and System Health

Pre-Quiz: Operational Telemetry and System Health

1. How does Model-Driven Telemetry (MDT) differ from SNMP polling?

A) MDT uses a pull model while SNMP uses push
B) MDT is a push model that streams data from devices; SNMP is a pull model that polls devices
C) There is no difference; they are the same protocol
D) MDT only works with Cisco devices while SNMP is vendor-neutral

2. What is gRPC in the context of streaming telemetry?

A) A data encoding format
B) A high-performance transport protocol used as the primary transport for telemetry
C) A Cisco-proprietary monitoring tool
D) A YANG model specification

3. What does YANG define?

A) The transport protocol for telemetry
B) The structure and semantics of telemetry data
C) The encryption algorithm for gRPC
D) The physical layer protocol for switch interconnects

4. At what GPU temperature does throttling typically trigger?

A) 65 degrees C
B) 75 degrees C
C) 85 degrees C
D) 95 degrees C

5. What does gNMI provide?

A) A proprietary Cisco management interface
B) A standardized, vendor-neutral interface for telemetry using gRPC and YANG
C) A replacement for syslog
D) A GPU monitoring library

Key Points

SNMP Polling vs. Streaming Telemetry

Traditional SNMP polling is like checking your mailbox every hour. Streaming telemetry is like receiving push notifications -- data arrives when it matters, without waiting for the next polling cycle.

```mermaid
sequenceDiagram
    participant M as Management Station
    participant D as Network Device
    rect rgb(220, 230, 240)
        Note over M,D: SNMP Polling (Pull Model)
        M->>D: SNMP GET Request
        D-->>M: SNMP Response (data)
        Note over M: Wait for next poll interval...
        M->>D: SNMP GET Request
        D-->>M: SNMP Response (data)
    end
    rect rgb(210, 240, 210)
        Note over M,D: Streaming Telemetry (Push Model)
        D->>M: gRPC stream: metrics update (T=0s)
        D->>M: gRPC stream: metrics update (T=10s)
        D->>M: gRPC stream: metrics update (T=20s)
        D->>M: gRPC stream: event-based alert
        Note over M: Near-real-time, no polling delay
    end
```

MDT Data Collection and Transport

| Mode | Behavior | Use Case |
| --- | --- | --- |
| Frequency-based (periodic) | Data collected at regular intervals | Continuous resource utilization monitoring |
| Event-based | Data collected only on change | Interface state changes, threshold violations |

| Transport | Encoding | Notes |
| --- | --- | --- |
| gRPC | GPB | Primary for high-performance telemetry; chunking for payloads > 12 MB |
| HTTP | JSON | Simpler setup; suitable for lower-volume streams |
| TCP dialout | GPB/JSON | Alternative when gRPC is not supported |

YANG Models and gNMI

YANG ("Yet Another Next Generation") defines the structure and semantics of telemetry data. Cisco NX-OS supports two types:

  1. Device YANG models -- Cisco-native models exposing detailed, platform-specific operational data
  2. OpenConfig YANG models -- vendor-neutral models that enable multi-vendor interoperability

gNMI (gRPC Network Management Interface) provides a standardized interface using gRPC transport and YANG-modeled data, enabling vendor-neutral monitoring pipelines across Cisco, Arista, Juniper, and other platforms.
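A minimal sketch of what a gNMI subscription describes: one YANG path, a streaming mode, and a sample interval. The dict layout below mirrors the JSON rendering of a gNMI SubscribeRequest for illustration; a real client (pygnmi, gNMIc, and similar tools) would serialize the equivalent structure over gRPC.

```python
# Sketch: describing a gNMI streaming subscription to an OpenConfig path.

def build_subscription(path, mode="SAMPLE", interval_ns=10_000_000_000):
    """Describe a streaming subscription to one YANG path (10 s default)."""
    return {
        "subscribe": {
            "mode": "STREAM",            # continuous push, not a one-shot poll
            "subscription": [{
                "path": path,
                "mode": mode,            # SAMPLE = periodic, ON_CHANGE = event-based
                "sample_interval": interval_ns,  # nanoseconds per the gNMI spec
            }],
        }
    }

req = build_subscription("openconfig-interfaces:interfaces/interface/state/counters")
```

The same structure with `mode="ON_CHANGE"` expresses event-based collection, matching the two MDT collection modes described earlier.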

Typical Telemetry Pipeline

```mermaid
flowchart LR
    A["Nexus 9000<br>NX-OS MDT Sensors"] -->|gRPC / GPB| B["Telegraf or<br>gNMI Collector"]
    B --> C["InfluxDB /<br>Prometheus"]
    C --> D["Grafana<br>Dashboard"]
```
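To illustrate the collector stage of this pipeline, the sketch below converts a decoded telemetry update into a simplified InfluxDB line-protocol string. The input values are stand-ins for what a gNMI collector would emit after decoding GPB, and the formatting is simplified (real line protocol also distinguishes integer and float field types).

```python
# Sketch: collector stage -- decoded telemetry update to line-protocol-style text.

def to_line_protocol(measurement, tags, fields, ts_ns):
    """Format one metric point as: measurement,tags fields timestamp."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

line = to_line_protocol(
    "interface_counters",
    tags={"device": "nexus-leaf1", "interface": "Eth1/49"},
    fields={"in_errors": 12, "in_octets": 9876543},
    ts_ns=1700000000000000000,
)
```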

System Health Thresholds

| Dimension | Key Metrics | Critical Thresholds |
| --- | --- | --- |
| GPU | Utilization %, memory, temperature, ECC errors | Utilization < 80% during training = bottleneck elsewhere; temp > 85C = throttling |
| Network | Interface utilization, drops, CRC errors, latency | Any drops on AI fabric links; latency spikes > 2x baseline |
| CPU/Memory | System CPU, memory utilization, process counts | CPU > 90% sustained; memory > 85% |
| Storage | IOPS, throughput (GB/s), queue depth, latency | Latency > 10ms may starve GPU pipeline |
| Power/Thermal | Power draw (W), inlet temp, fan speed | Power approaching PSU capacity; inlet temp > rated max |
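These thresholds can be encoded directly as a health check. The rules come from the table above; the record layout and sample values are illustrative assumptions.

```python
# Sketch: evaluating one telemetry snapshot against the critical thresholds.

def check_health(m):
    """Return a list of alert strings for any threshold the snapshot violates."""
    alerts = []
    if m["phase"] == "training" and m["gpu_util_pct"] < 80:
        alerts.append("GPU underutilized: bottleneck elsewhere")
    if m["gpu_temp_c"] > 85:
        alerts.append("GPU thermal throttling")
    if m["fabric_drops"] > 0:
        alerts.append("drops on AI fabric link")
    if m["latency_ms"] > 2 * m["latency_baseline_ms"]:
        alerts.append("latency spike > 2x baseline")
    if m["storage_latency_ms"] > 10:
        alerts.append("storage latency may starve GPU pipeline")
    return alerts

sample = {
    "phase": "training", "gpu_util_pct": 62, "gpu_temp_c": 71,
    "fabric_drops": 0, "latency_ms": 1.1, "latency_baseline_ms": 0.9,
    "storage_latency_ms": 14,
}
alerts = check_health(sample)
```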

Alert Severity Levels

| Severity | Color | Typical Use |
| --- | --- | --- |
| Critical | Red | Immediate action required; service-impacting event |
| Major | Orange | Significant degradation; investigation needed within minutes |
| Minor | Yellow | Deviation from baseline; schedule investigation |
| Warning | Blue | Informational; trend approaching threshold |

Alert Configuration Best Practices

  1. Start with dynamic baselines and tune custom thresholds based on observed false-positive rates
  2. Set tighter thresholds for AI fabric links where small latency increases cascade into training slowdowns
  3. Configure event-based telemetry for link state changes on GPU-to-switch connections
  4. Use frequency-based telemetry at 10-30 second intervals for utilization metrics during training
  5. Implement alert suppression during planned maintenance windows to avoid alarm fatigue
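Practice 5 can be sketched as a simple gate in the alerting path: non-critical alerts inside a maintenance window are suppressed. The window times and severity rule below are illustrative assumptions.

```python
# Sketch: suppressing non-critical alerts during planned maintenance windows.
from datetime import datetime, timezone

MAINTENANCE_WINDOWS = [
    # (start, end) in UTC for a planned fabric upgrade -- example values
    (datetime(2025, 3, 1, 2, 0, tzinfo=timezone.utc),
     datetime(2025, 3, 1, 6, 0, tzinfo=timezone.utc)),
]

def should_emit(alert_time, severity):
    """Emit critical alerts always; suppress others inside a maintenance window."""
    in_window = any(start <= alert_time < end for start, end in MAINTENANCE_WINDOWS)
    return severity == "critical" or not in_window

emit = should_emit(datetime(2025, 3, 1, 3, 30, tzinfo=timezone.utc), "minor")
```

Letting critical alerts through even during maintenance guards against a planned change masking a genuine service-impacting failure.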
Key Takeaway: Model-Driven Telemetry on Cisco NX-OS provides near-real-time, push-based monitoring using gRPC/GPB transport and YANG data models. Combined with gNMI for vendor-neutral collection, this approach far exceeds traditional SNMP polling for AI infrastructure.
Animation Slot: Animated comparison of SNMP polling intervals vs. streaming telemetry continuous data flow, showing latency difference in anomaly detection
Post-Quiz: Operational Telemetry and System Health

1. What encoding format does gRPC use for high-performance telemetry on NX-OS?

A) XML
B) JSON
C) Google Protocol Buffers (GPB)
D) ASN.1/BER

2. What collection mode should be used for interface state changes on GPU-to-switch connections?

A) Frequency-based at 60-second intervals
B) Event-based telemetry
C) SNMP polling every 5 minutes
D) Manual log review

3. What does GPU utilization below 80% during an active training job typically indicate?

A) The model is too small for the GPU
B) A bottleneck exists elsewhere (network, storage, or data pipeline)
C) The GPU hardware is defective
D) Training is complete

4. What is the advantage of OpenConfig YANG models over Device YANG models?

A) They are faster to process
B) They are vendor-neutral, enabling multi-vendor interoperability
C) They provide more detailed Cisco-specific data
D) They require less bandwidth

5. At what storage latency threshold may the GPU pipeline begin to starve?

A) 1ms
B) 5ms
C) 10ms
D) 100ms

Section 4: Log Correlation and Performance Analysis

Pre-Quiz: Log Correlation and Performance Analysis

1. How many severity levels does syslog define?

A) 5 (0-4)
B) 8 (0-7)
C) 10 (0-9)
D) 3 (Low, Medium, High)

2. What is the key difference between SNMP traps and informs?

A) Traps use TCP while informs use UDP
B) Informs require manager acknowledgment and are retransmitted if unacknowledged
C) Traps are encrypted but informs are not
D) There is no difference

3. What does log correlation accomplish?

A) It encrypts log messages for security
B) It connects events across multiple devices to reconstruct the full picture of an incident
C) It deletes duplicate log entries
D) It converts logs to a standard format

4. Which SNMP version is required for production data center environments?

A) SNMPv1
B) SNMPv2c
C) SNMPv3
D) Any version is acceptable

5. What metric best indicates training efficiency for AI workloads?

A) Network packet count
B) Throughput (samples/sec)
C) Disk space remaining
D) Number of active processes

Key Points

Syslog Severity Levels

| Level | Name | Description | Example |
| --- | --- | --- | --- |
| 0 | Emergency | System unusable | Hardware failure |
| 1 | Alert | Immediate action needed | Power supply failure |
| 2 | Critical | Critical conditions | Memory allocation failure |
| 3 | Error | Error conditions | Interface down |
| 4 | Warning | Warning conditions | Temperature approaching limit |
| 5 | Notification | Normal but significant | Interface up/down |
| 6 | Informational | Informational messages | Configuration change |
| 7 | Debug | Debug-level messages | Packet trace output |

Debug-level logging (severity 7) should only be enabled temporarily during troubleshooting as it can generate enormous data volumes and impact device performance.
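On the wire, a syslog message carries these levels in its PRI field: severity is PRI mod 8 and facility is PRI div 8. A short decoding sketch (severity names follow the table above):

```python
# Sketch: decoding the syslog PRI field into facility and severity.

SEVERITY_NAMES = [
    "Emergency", "Alert", "Critical", "Error",
    "Warning", "Notification", "Informational", "Debug",
]

def decode_pri(pri):
    """Split a PRI value into (facility, severity, severity name)."""
    facility, severity = divmod(pri, 8)
    return facility, severity, SEVERITY_NAMES[severity]

# <187> = facility 23 (local7, common for network devices), severity 3 (Error)
facility, severity, name = decode_pri(187)
```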

SNMP: Traps vs. Informs

| Type | Acknowledgment | Reliability | Use Case |
| --- | --- | --- | --- |
| Traps | None (fire-and-forget) | Lower -- no retry if lost | High-volume, non-critical notifications |
| Informs | Manager must acknowledge | Higher -- retransmitted if unacknowledged | Critical alerts requiring guaranteed delivery |

Example SNMPv3 host configuration:

snmp-server host 192.0.2.1 informs version 3 auth NMS

Streaming Telemetry vs. SNMP Comparison

| Characteristic | Streaming Telemetry (MDT) | SNMP |
| --- | --- | --- |
| Data model | Push (device initiates) | Pull (manager polls) |
| Latency | Near-real-time (seconds) | Polling interval dependent (minutes) |
| Scalability | High -- no polling overhead | Degrades with device count |
| Data richness | Full YANG model paths | MIB-constrained |
| Encoding | GPB (efficient) or JSON | ASN.1/BER |
| Best for | High-frequency AI fabric monitoring | Device discovery, capacity planning, legacy systems |

Log Correlation Workflow

```mermaid
flowchart TD
    A["Correlation Rule Configured<br>(root-cause + related messages)"] --> B["Correlator Captures<br>Matching Message"]
    B --> C["Start Timeout Timer"]
    C --> D["Continue Capturing<br>Matching Messages"]
    D --> E{"Timer Expired?"}
    E -- No --> D
    E -- Yes --> F{"Root-Cause Message<br>Received?"}
    F -- Yes --> G["Correlation Confirmed:<br>Group All Related Messages"]
    G --> H["Suppress Duplicates &<br>Highlight Root Cause"]
    H --> I["Deliver Correlated Alert<br>to Operator"]
    F -- No --> J["No Correlation:<br>Forward Messages Individually"]
    style G fill:#27ae60,color:#fff
    style J fill:#e74c3c,color:#fff
```

Cross-System Correlation Example

A distributed training job stalls. The correlation timeline:

| Time | Device | Event | Severity |
| --- | --- | --- | --- |
| 10:01 | Nexus 9364C | Interface Eth1/49 CRC errors | Warning |
| 10:01 | Nexus 9364C | Interface Eth1/49 flap | Error |
| 10:01 | GPU Server 3 | NCCL timeout on rank 12 | Error |
| 10:02 | Nexus Dashboard | Latency anomaly: Spine-Leaf 3 | Major |
| 10:02 | GPU Servers 1-8 | Training job checkpoint fail | Critical |
| 10:03 | Intersight | Known defect CSCxx12345 match | Info |

Root cause: a failing transceiver on Eth1/49 caused CRC errors, triggering a link flap, disrupting the NCCL collective operation, and stalling the entire training job. Intersight identified a matching known defect.
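The core of this correlation is grouping events inside a time window and applying a root-cause rule. The sketch below replays the incident timeline; the keyword-matching rule is an illustrative assumption, far simpler than a production correlator.

```python
# Sketch: time-window correlation over the incident timeline above.
from datetime import datetime, timedelta

events = [
    ("10:01", "Nexus 9364C", "Interface Eth1/49 CRC errors", "Warning"),
    ("10:01", "Nexus 9364C", "Interface Eth1/49 flap", "Error"),
    ("10:01", "GPU Server 3", "NCCL timeout on rank 12", "Error"),
    ("10:02", "Nexus Dashboard", "Latency anomaly: Spine-Leaf 3", "Major"),
    ("10:02", "GPU Servers 1-8", "Training job checkpoint fail", "Critical"),
]

def correlate(events, window=timedelta(minutes=5), root_keyword="CRC"):
    """Group events within the window; flag the first root-cause match."""
    parsed = [(datetime.strptime(t, "%H:%M"), dev, msg, sev)
              for t, dev, msg, sev in events]
    first = min(ts for ts, *_ in parsed)
    group = [e for e in parsed if e[0] - first <= window]
    root = next((e for e in group if root_keyword in e[2]), None)
    return root, group

root, group = correlate(events)
```

Grouping all five events under the CRC-error root cause is what turns five separate alarms into one actionable incident.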

AI/ML Performance Optimization Workflow

```mermaid
flowchart LR
    A["1. Establish<br>Baselines"] --> B["2. Monitor<br>Continuously"]
    B --> C["3. Detect<br>Deviations"]
    C --> D["4. Correlate<br>Across Layers"]
    D --> E["5. Remediate<br>& Verify"]
    E --> B
    style A fill:#1a5276,color:#fff
    style B fill:#2e86c1,color:#fff
    style C fill:#e67e22,color:#fff
    style D fill:#8e44ad,color:#fff
    style E fill:#27ae60,color:#fff
```

AI/ML Workload Performance Metrics

| Metric | What It Measures | Optimization Signal |
| --- | --- | --- |
| GPU utilization | % of GPU compute cycles in use | Low utilization = data pipeline or network bottleneck |
| GPU memory utilization | % of GPU HBM in use | Near 100% = batch size at maximum; OOM = oversized model |
| Iteration time | Time per training step | Increasing = degradation in compute, network, or storage |
| Throughput (samples/sec) | Training samples per second | Primary measure of training efficiency |
| Network throughput per GPU | Bandwidth for collective ops | Should approach theoretical max during all-reduce |
| Data loading time | Time waiting for next batch | High = storage or data pipeline bottleneck |
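Two of these metrics combine into a quick diagnosis: if data loading consumes a large share of each iteration, the GPUs are starving. The 15% cutoff and the sample numbers below are illustrative assumptions.

```python
# Sketch: flagging a data-pipeline bottleneck from per-iteration timings.

def diagnose_iteration(iteration_s, data_loading_s, samples):
    """Return (samples/sec, data-loading fraction, bottleneck hint or None)."""
    throughput = samples / iteration_s          # primary training-efficiency measure
    load_fraction = data_loading_s / iteration_s
    bottleneck = "data pipeline/storage" if load_fraction > 0.15 else None
    return throughput, load_fraction, bottleneck

throughput, frac, bottleneck = diagnose_iteration(
    iteration_s=0.50, data_loading_s=0.20, samples=256)
```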
Key Takeaway: Effective AI infrastructure monitoring combines syslog for event logging, SNMP for structured queries, and streaming telemetry for real-time metrics. Log correlation across systems transforms isolated events into actionable incident narratives for rapid root-cause analysis.
Animation Slot: Interactive timeline showing cross-system log correlation -- click events to see how CRC errors cascade through the stack to training job failure
Post-Quiz: Log Correlation and Performance Analysis

1. In the log correlation process, what happens if the root-cause message is NOT received before the timeout expires?

A) All messages are suppressed
B) The timer restarts
C) No correlation occurs; messages are forwarded individually
D) A critical alert is generated

2. What syslog severity level should be the minimum for production switch logging?

A) 3 (Error)
B) 5 (Notification)
C) 7 (Debug)
D) 0 (Emergency)

3. In the cross-system correlation example, what was the root cause of the training job failure?

A) GPU memory overflow
B) A failing transceiver causing CRC errors and link flap
C) Storage subsystem latency
D) NCCL software bug

4. Which SNMPv3 security feature ensures packets have not been tampered with in transit?

A) Encryption
B) Authentication
C) Message integrity
D) Access control lists

5. What does increasing iteration time during a training job signal?

A) The model is converging faster
B) Degradation in compute, network, or storage performance
C) The batch size was reduced
D) Training is nearly complete

Answer Explanations