Chapter 13: Monitoring, Benchmarking, and Operations

Learning Objectives

Section 1: AI Infrastructure Benchmarking

Pre-Quiz: AI Infrastructure Benchmarking

1. What organization develops the MLPerf benchmark suite?

A) NVIDIA AI Labs
B) MLCommons
C) The Linux Foundation
D) IEEE Standards Association

2. What is the key metric for MLPerf Training?

A) Queries per second
B) Latency per query
C) Time-to-train
D) Storage throughput in GB/s

3. Which MLPerf Inference scenario emulates cloud-based deployments?

A) Single-stream
B) Multistream
C) Server
D) Offline

4. What does MLPerf Storage primarily measure?

A) GPU memory bandwidth
B) How fast storage can supply training data to accelerators
C) Network latency between storage nodes
D) Disk IOPS for random reads

5. In MLPerf submission categories, what does "Preview" mean?

A) Systems using experimental research hardware
B) Systems that must be submittable as Available in the next round
C) Systems only available through cloud rental
D) Systems that have been deprecated

Key Points

Why Benchmarking Matters

AI workloads stress every infrastructure layer -- compute, networking, storage, and memory. A single poorly performing component can bottleneck an entire distributed training job. Benchmarking establishes quantitative baselines that answer: Can this platform sustain the throughput needed? Where does performance degrade under scale? How does our infrastructure compare to industry peers?

MLPerf Benchmark Suite

```mermaid
graph TD
    A["MLPerf Benchmark Suite<br>(MLCommons)"] --> B["MLPerf Training"]
    A --> C["MLPerf Inference"]
    A --> D["MLPerf Storage"]
    B --> B1["Key Metric:<br>Time-to-Train"]
    B --> B2["Workloads: LLMs,<br>Text-to-Image,<br>Recommenders, GNNs"]
    C --> C1["Key Metrics:<br>Throughput & Latency"]
    C --> C2["Scenarios: Single-Stream,<br>Multistream, Server, Offline"]
    D --> D1["Key Metric:<br>Storage Throughput (GB/s)"]
    D --> D2["Focus: Data Supply Rate<br>to Accelerators"]
    style A fill:#1a5276,color:#fff
    style B fill:#2e86c1,color:#fff
    style C fill:#2e86c1,color:#fff
    style D fill:#2e86c1,color:#fff
```

MLPerf Inference Scenarios

| Scenario | Emulates | Primary Metric |
| --- | --- | --- |
| Single-stream | Mobile device workloads | Latency per query |
| Multistream | Autonomous vehicle workloads | Latency across concurrent streams |
| Server | Cloud-based setups | Throughput and latency (p99) |
| Offline | Batch processing | Maximum throughput |

Key Performance Metrics

| Metric | Unit | Significance |
| --- | --- | --- |
| Throughput | Queries/sec or tokens/sec | Inference requests or training samples processed per unit time |
| Latency | Milliseconds or seconds | Time from request to result; server scenario enforces p99 |
| Time-to-train | Minutes or hours | Wall-clock time to reach target model accuracy |
| Accuracy | Model-specific (mAP, BLEU) | Output quality -- ensures systems don't trade quality for speed |
| Storage throughput | GB/s | Rate storage delivers training data to accelerators |
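Since the server scenario enforces a p99 latency bound, a minimal sketch of computing that percentile from recorded query latencies may help. The sample data and the 50 ms target are illustrative, not from an actual MLPerf run.

```python
# Sketch: computing a p99 latency bound from recorded query latencies,
# the constraint the MLPerf server scenario enforces.

def percentile(samples, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    # Nearest-rank: ceil(pct/100 * N) gives the 1-based rank.
    rank = max(1, -(-len(ordered) * pct // 100))  # ceil via negated floor division
    return ordered[rank - 1]

latencies_ms = [12.1, 9.8, 11.4, 10.2, 48.0, 10.9, 11.1, 9.9, 10.4, 10.7]
p99 = percentile(latencies_ms, 99)
meets_sla = p99 <= 50.0  # hypothetical 50 ms p99 target
```

A single outlier (the 48 ms sample) dominates the p99 here, which is exactly why the server scenario measures tail latency rather than the mean.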

Cisco UCS and MLPerf

| Platform | GPU Configuration | Benchmark |
| --- | --- | --- |
| Cisco UCS C845A M8 | 8x NVIDIA H200 NVL, 8x NVIDIA L40S PCIe | MLPerf Inference 5.1: Datacenter |
| Cisco UCS C885A M8 HGX | 8x NVIDIA H100, 8x NVIDIA H200 | MLPerf Inference and Training |
| Cisco UCS X210c M8 | Intel Xeon 6 processors | MLPerf Inference v5.1 Datacenter |
| Cisco UCS C240 M8 | Intel Xeon 6 processors | MLPerf Inference Datacenter |

Interpreting Results and Identifying Bottlenecks

  1. Compare within scenarios -- a system excelling in offline throughput may underperform in the server scenario where latency constraints apply
  2. Check scaling efficiency -- if doubling GPUs doesn't nearly double throughput, investigate network bandwidth, PCIe saturation, or NVLink topology
  3. Examine storage results separately -- high GPU throughput is meaningless if storage starves the pipeline
  4. Consider the full stack -- MLPerf results reflect specific software configurations; ensure your production stack is comparable
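The scaling check in step 2 can be made concrete: compare measured throughput against ideal linear speedup. The throughput numbers and the 0.9 efficiency cutoff below are illustrative placeholders.

```python
# Sketch: quantifying multi-GPU scaling efficiency as described in step 2.

def scaling_efficiency(throughput_1x, throughput_nx, n):
    """Fraction of ideal linear speedup achieved when scaling to n GPUs."""
    return throughput_nx / (throughput_1x * n)

eff = scaling_efficiency(throughput_1x=1000, throughput_nx=1850, n=2)
# Efficiency well below ~0.9 suggests checking network bandwidth,
# PCIe saturation, or NVLink topology.
needs_investigation = eff < 0.9
```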
Key Takeaway: MLPerf provides standardized, reproducible benchmarks across Training, Inference, and Storage that enable objective comparison of AI platforms. Cisco UCS systems demonstrate near-linear multi-server scaling when paired with high-performance network fabrics.
Animation Slot: Interactive MLPerf scenario selector -- choose a scenario and see the metric, deployment pattern, and example workload
Post-Quiz: AI Infrastructure Benchmarking

1. Which MLPerf Inference scenario enforces p99 latency constraints and measures throughput?

A) Single-stream
B) Multistream
C) Server
D) Offline

2. If doubling GPUs does not nearly double throughput, which of these is NOT a likely bottleneck?

A) Network bandwidth
B) PCIe lane saturation
C) Model accuracy threshold
D) NVLink topology

3. What did Cisco demonstrate with the UCS C885A M8 in MLPerf submissions?

A) Lowest power consumption per GPU
B) Near-linear scaling for multi-server, multi-GPU inference
C) Highest single-GPU training speed
D) Best accuracy on text-to-image workloads

4. Why is MLPerf Storage important even when GPU throughput is high?

A) It measures GPU memory allocation efficiency
B) If storage can't feed GPUs fast enough, expensive accelerators sit idle
C) It replaces the need for network benchmarks
D) It only applies to inference workloads

5. An MLPerf "Available" submission means the system contains:

A) Experimental hardware not yet released
B) Only components available for purchase or cloud rental
C) Internal development hardware
D) Components that will be available next quarter

Section 2: Monitoring with Cisco Solutions

Pre-Quiz: Monitoring with Cisco Solutions

1. What is the primary role of Cisco Nexus Dashboard?

A) GPU driver management
B) Unified management console for monitoring, troubleshooting, and automating data center operations
C) Storage provisioning only
D) Cloud application deployment

2. What technology does Nexus Dashboard Insights use to establish performance baselines?

A) Static threshold rules only
B) AI/ML-powered dynamic baselining
C) Manual operator input
D) Vendor-provided default values

3. Which Cisco platform provides cloud-based infrastructure management and enriches Nexus Dashboard with defect/PSIRT data?

A) Cisco DNA Center
B) Cisco Intersight
C) Cisco Meraki
D) Cisco Umbrella

4. What AI-specific monitoring capability does Nexus Dashboard provide?

A) Model training code analysis
B) GPU utilization, memory, temperature, and distributed compute node monitoring
C) Automatic model hyperparameter tuning
D) Dataset quality scoring

5. What is a "threshold band" in Nexus Dashboard Insights?

A) A fixed limit set by the hardware manufacturer
B) The range around a dynamic baseline within which a KPI is considered normal
C) The maximum bandwidth of a network link
D) A frequency range for telemetry collection

Key Points

Nexus Dashboard Overview

Cisco Nexus Dashboard serves as "mission control" for the data center -- a single pane of glass aggregating data from every fabric, switch, and compute node into actionable intelligence. It hosts Nexus Dashboard Insights, which automates troubleshooting and enables rapid root-cause analysis.

Core capabilities include topology-aware visualization, real-time KPI monitoring, proactive troubleshooting, and predictive analytics with forecasting and optimization recommendations.

AI-Specific Monitoring

| Capability | Description |
| --- | --- |
| GPU performance monitoring | Deep visibility into GPU utilization, memory, temperature, and AI-specific performance demands |
| Distributed compute node monitoring | Real-time monitoring of network interfaces, NICs, GPUs, and compute nodes in training jobs |
| AI fabric support | Telemetry for routed and VXLAN-based AI fabrics, including rail-optimized and full-mesh topologies |
| Latency anomaly detection | Automatic detection of unusual delay spikes at flow granularity, correlated with burst events |

Dynamic Baselining Process

```mermaid
stateDiagram-v2
    [*] --> ObserveKPIs: Collect KPI data from fabric
    ObserveKPIs --> BuildBaseline: Analyze behavior patterns
    BuildBaseline --> MonitorAgainstBaseline: Baseline established
    MonitorAgainstBaseline --> MonitorAgainstBaseline: Metric within threshold band
    MonitorAgainstBaseline --> AnomalyDetected: Metric crosses threshold band
    AnomalyDetected --> GenerateAlert: Classify severity
    GenerateAlert --> CorrelateWithIntersight: Enrich with defect/PSIRT data
    CorrelateWithIntersight --> OperatorRemediation: Provide actionable guidance
    OperatorRemediation --> MonitorAgainstBaseline: Verify fix, resume monitoring
    MonitorAgainstBaseline --> UpdateBaseline: Network conditions change
    UpdateBaseline --> MonitorAgainstBaseline: Baseline recalculated
```

Rather than relying on static thresholds, Nexus Dashboard Insights creates network-specific baselines for each KPI based on observed behavior patterns, continuously updates them to reflect changing conditions, and generates anomaly alerts when network state crosses the threshold band. Administrators can also configure custom thresholds through global rules for fine-tuning alert sensitivity.
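A minimal sketch of the idea: maintain a rolling per-KPI baseline and flag samples that leave the threshold band. Nexus Dashboard Insights' actual models are more sophisticated; the 3-sigma band width and window size here are illustrative assumptions.

```python
# Sketch of dynamic baselining: a rolling mean/stddev per KPI defines a
# threshold band, and samples outside the band raise anomalies.
from collections import deque
import statistics

class DynamicBaseline:
    def __init__(self, window=100, sigmas=3.0):
        self.samples = deque(maxlen=window)  # baseline recalculated as the window slides
        self.sigmas = sigmas

    def observe(self, value):
        """Return True if value is anomalous relative to the current band."""
        anomalous = False
        if len(self.samples) >= 10:  # need enough history for a baseline
            mean = statistics.fmean(self.samples)
            stdev = statistics.pstdev(self.samples)
            band = self.sigmas * max(stdev, 1e-9)
            anomalous = abs(value - mean) > band
        self.samples.append(value)  # baseline keeps adapting to new conditions
        return anomalous

baseline = DynamicBaseline()
for latency_us in [50, 52, 49, 51, 50, 48, 52, 51, 49, 50]:
    baseline.observe(latency_us)
alert = baseline.observe(250)  # spike far outside the band -> anomaly
```

Because the spike is also appended to the window, the baseline adapts if elevated latency becomes the new normal, which is the behavior that reduces false alarms compared with static thresholds.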

Cisco Intersight Integration

Intersight provides cloud-based infrastructure management extending visibility across compute, storage, and networking. Integration with Nexus Dashboard creates a closed-loop workflow:

  1. Nexus Dashboard detects anomalous behavior
  2. Intersight correlates it with known defects or security advisories
  3. The operator receives actionable remediation guidance

Intersight enriches data with the known defect database, field notices, PSIRT alerts, and sustainability/power metrics.

Dashboard Design Best Practices

Effective dashboards follow a hierarchical drill-down model:

  1. Executive level -- cluster health and aggregate SLA compliance scores
  2. Fabric level -- topology views and anomaly counts
  3. Compute/GPU and network views -- per-node and per-link KPIs
  4. Resource level -- detailed drill-downs into GPU utilization, network latency per flow, and storage I/O rates

Key Takeaway: Cisco Nexus Dashboard and Intersight together provide comprehensive AI infrastructure monitoring -- from dynamic baselines powered by AI/ML to proactive security and defect correlation. Dashboard design should follow a hierarchical drill-down model from cluster health to individual resource metrics.
Animation Slot: Interactive dashboard hierarchy -- click through Executive Summary, Fabric Overview, Compute/GPU View, and Network View layers
Post-Quiz: Monitoring with Cisco Solutions

1. What triggers an anomaly alert in Nexus Dashboard Insights?

A) Any change in network configuration
B) When network state crosses the dynamic threshold band around a baseline
C) When a device reboots
D) Every time a new workload starts

2. What does Intersight contribute when integrated with Nexus Dashboard?

A) GPU driver updates
B) Known defect database, field notices, PSIRT alerts, and sustainability metrics
C) Model training acceleration
D) Network topology auto-configuration

3. Why are dynamic baselines preferred over static thresholds for AI fabric monitoring?

A) They require less compute resources
B) They adapt to changing network conditions, reducing false alarms and catching genuine anomalies
C) They are simpler to configure
D) They only monitor GPU metrics

4. Which fabric types does Nexus Dashboard support for AI workloads?

A) Only Cisco ACI fabrics
B) ACI, NX-OS, VXLAN EVPN, routed and VXLAN-based AI fabrics, and external fabrics
C) Only VXLAN-based fabrics
D) Only third-party fabrics via OpenConfig

5. In the hierarchical dashboard design, what belongs at the "Resource level"?

A) Aggregate SLA compliance scores
B) Topology views and anomaly counts
C) Detailed drill-downs into GPU utilization, network latency per flow, and storage I/O rates
D) Executive summary and cluster health

Section 3: Operational Telemetry and System Health

Pre-Quiz: Operational Telemetry and System Health

1. How does Model-Driven Telemetry (MDT) differ from SNMP polling?

A) MDT uses a pull model while SNMP uses push
B) MDT is a push model that streams data from devices; SNMP is a pull model that polls devices
C) There is no difference; they are the same protocol
D) MDT only works with Cisco devices while SNMP is vendor-neutral

2. What is gRPC in the context of streaming telemetry?

A) A data encoding format
B) A high-performance transport protocol used as the primary transport for telemetry
C) A Cisco-proprietary monitoring tool
D) A YANG model specification

3. What does YANG define?

A) The transport protocol for telemetry
B) The structure and semantics of telemetry data
C) The encryption algorithm for gRPC
D) The physical layer protocol for switch interconnects

4. At what GPU temperature does throttling typically trigger?

A) 65 degrees C
B) 75 degrees C
C) 85 degrees C
D) 95 degrees C

5. What does gNMI provide?

A) A proprietary Cisco management interface
B) A standardized, vendor-neutral interface for telemetry using gRPC and YANG
C) A replacement for syslog
D) A GPU monitoring library

Key Points

SNMP Polling vs. Streaming Telemetry

Traditional SNMP polling is like checking your mailbox every hour. Streaming telemetry is like receiving push notifications -- data arrives when it matters, without waiting for the next polling cycle.

```mermaid
sequenceDiagram
    participant M as Management Station
    participant D as Network Device
    rect rgb(220, 230, 240)
        Note over M,D: SNMP Polling (Pull Model)
        M->>D: SNMP GET Request
        D-->>M: SNMP Response (data)
        Note over M: Wait for next poll interval...
        M->>D: SNMP GET Request
        D-->>M: SNMP Response (data)
    end
    rect rgb(210, 240, 210)
        Note over M,D: Streaming Telemetry (Push Model)
        D->>M: gRPC stream: metrics update (T=0s)
        D->>M: gRPC stream: metrics update (T=10s)
        D->>M: gRPC stream: metrics update (T=20s)
        D->>M: gRPC stream: event-based alert
        Note over M: Near-real-time, no polling delay
    end
```

MDT Data Collection and Transport

| Mode | Behavior | Use Case |
| --- | --- | --- |
| Frequency-based (periodic) | Data collected at regular intervals | Continuous resource utilization monitoring |
| Event-based | Data collected only on change | Interface state changes, threshold violations |

| Transport | Encoding | Notes |
| --- | --- | --- |
| gRPC | GPB | Primary for high-performance telemetry; chunking for payloads > 12 MB |
| HTTP | JSON | Simpler setup; suitable for lower-volume streams |
| TCP dialout | GPB/JSON | Alternative when gRPC is not supported |

YANG Models and gNMI

YANG ("Yet Another Next Generation") defines the structure and semantics of telemetry data. Cisco NX-OS supports two types:

  1. Device YANG models -- Cisco-native models exposing detailed, platform-specific operational data
  2. OpenConfig YANG models -- vendor-neutral models that enable multi-vendor interoperability

gNMI (gRPC Network Management Interface) provides a standardized interface using gRPC transport and YANG-modeled data, enabling vendor-neutral monitoring pipelines across Cisco, Arista, Juniper, and other platforms.
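A minimal sketch of what a gNMI subscription describes: one YANG path, a streaming mode, and a sample interval. The dict layout below mirrors the JSON rendering of a gNMI SubscribeRequest for illustration; a real client (pygnmi, gNMIc, and similar tools) would serialize the equivalent structure over gRPC.

```python
# Sketch: describing a gNMI streaming subscription to an OpenConfig path.

def build_subscription(path, mode="SAMPLE", interval_ns=10_000_000_000):
    """Describe a streaming subscription to one YANG path (10 s default)."""
    return {
        "subscribe": {
            "mode": "STREAM",            # continuous push, not a one-shot poll
            "subscription": [{
                "path": path,
                "mode": mode,            # SAMPLE = periodic, ON_CHANGE = event-based
                "sample_interval": interval_ns,  # nanoseconds per the gNMI spec
            }],
        }
    }

req = build_subscription("openconfig-interfaces:interfaces/interface/state/counters")
```

The same structure with `mode="ON_CHANGE"` expresses event-based collection, matching the two MDT collection modes described earlier.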

Typical Telemetry Pipeline

```mermaid
flowchart LR
    A["Nexus 9000<br>NX-OS MDT Sensors"] -->|gRPC / GPB| B["Telegraf or<br>gNMI Collector"]
    B --> C["InfluxDB /<br>Prometheus"]
    C --> D["Grafana<br>Dashboard"]
```
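To illustrate the collector stage of this pipeline, the sketch below converts a decoded telemetry update into a simplified InfluxDB line-protocol string. The input values are stand-ins for what a gNMI collector would emit after decoding GPB, and the formatting is simplified (real line protocol also distinguishes integer and float field types).

```python
# Sketch: collector stage -- decoded telemetry update to line-protocol-style text.

def to_line_protocol(measurement, tags, fields, ts_ns):
    """Format one metric point as: measurement,tags fields timestamp."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

line = to_line_protocol(
    "interface_counters",
    tags={"device": "nexus-leaf1", "interface": "Eth1/49"},
    fields={"in_errors": 12, "in_octets": 9876543},
    ts_ns=1700000000000000000,
)
```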

System Health Thresholds

| Dimension | Key Metrics | Critical Thresholds |
| --- | --- | --- |
| GPU | Utilization %, memory, temperature, ECC errors | Utilization < 80% during training = bottleneck elsewhere; temp > 85C = throttling |
| Network | Interface utilization, drops, CRC errors, latency | Any drops on AI fabric links; latency spikes > 2x baseline |
| CPU/Memory | System CPU, memory utilization, process counts | CPU > 90% sustained; memory > 85% |
| Storage | IOPS, throughput (GB/s), queue depth, latency | Latency > 10ms may starve GPU pipeline |
| Power/Thermal | Power draw (W), inlet temp, fan speed | Power approaching PSU capacity; inlet temp > rated max |
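These thresholds can be encoded directly as a health check. The rules come from the table above; the record layout and sample values are illustrative assumptions.

```python
# Sketch: evaluating one telemetry snapshot against the critical thresholds.

def check_health(m):
    """Return a list of alert strings for any threshold the snapshot violates."""
    alerts = []
    if m["phase"] == "training" and m["gpu_util_pct"] < 80:
        alerts.append("GPU underutilized: bottleneck elsewhere")
    if m["gpu_temp_c"] > 85:
        alerts.append("GPU thermal throttling")
    if m["fabric_drops"] > 0:
        alerts.append("drops on AI fabric link")
    if m["latency_ms"] > 2 * m["latency_baseline_ms"]:
        alerts.append("latency spike > 2x baseline")
    if m["storage_latency_ms"] > 10:
        alerts.append("storage latency may starve GPU pipeline")
    return alerts

sample = {
    "phase": "training", "gpu_util_pct": 62, "gpu_temp_c": 71,
    "fabric_drops": 0, "latency_ms": 1.1, "latency_baseline_ms": 0.9,
    "storage_latency_ms": 14,
}
alerts = check_health(sample)
```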

Alert Severity Levels

| Severity | Color | Typical Use |
| --- | --- | --- |
| Critical | Red | Immediate action required; service-impacting event |
| Major | Orange | Significant degradation; investigation needed within minutes |
| Minor | Yellow | Deviation from baseline; schedule investigation |
| Warning | Blue | Informational; trend approaching threshold |

Alert Configuration Best Practices

  1. Start with dynamic baselines and tune custom thresholds based on observed false-positive rates
  2. Set tighter thresholds for AI fabric links where small latency increases cascade into training slowdowns
  3. Configure event-based telemetry for link state changes on GPU-to-switch connections
  4. Use frequency-based telemetry at 10-30 second intervals for utilization metrics during training
  5. Implement alert suppression during planned maintenance windows to avoid alarm fatigue
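Practice 5 can be sketched as a simple gate in the alerting path: non-critical alerts inside a maintenance window are suppressed. The window times and severity rule below are illustrative assumptions.

```python
# Sketch: suppressing non-critical alerts during planned maintenance windows.
from datetime import datetime, timezone

MAINTENANCE_WINDOWS = [
    # (start, end) in UTC for a planned fabric upgrade -- example values
    (datetime(2025, 3, 1, 2, 0, tzinfo=timezone.utc),
     datetime(2025, 3, 1, 6, 0, tzinfo=timezone.utc)),
]

def should_emit(alert_time, severity):
    """Emit critical alerts always; suppress others inside a maintenance window."""
    in_window = any(start <= alert_time < end for start, end in MAINTENANCE_WINDOWS)
    return severity == "critical" or not in_window

emit = should_emit(datetime(2025, 3, 1, 3, 30, tzinfo=timezone.utc), "minor")
```

Letting critical alerts through even during maintenance guards against a planned change masking a genuine service-impacting failure.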
Key Takeaway: Model-Driven Telemetry on Cisco NX-OS provides near-real-time, push-based monitoring using gRPC/GPB transport and YANG data models. Combined with gNMI for vendor-neutral collection, this approach far exceeds traditional SNMP polling for AI infrastructure.
Animation Slot: Animated comparison of SNMP polling intervals vs. streaming telemetry continuous data flow, showing latency difference in anomaly detection
Post-Quiz: Operational Telemetry and System Health

1. What encoding format does gRPC use for high-performance telemetry on NX-OS?

A) XML
B) JSON
C) Google Protocol Buffers (GPB)
D) ASN.1/BER

2. What collection mode should be used for interface state changes on GPU-to-switch connections?

A) Frequency-based at 60-second intervals
B) Event-based telemetry
C) SNMP polling every 5 minutes
D) Manual log review

3. What does GPU utilization below 80% during an active training job typically indicate?

A) The model is too small for the GPU
B) A bottleneck exists elsewhere (network, storage, or data pipeline)
C) The GPU hardware is defective
D) Training is complete

4. What is the advantage of OpenConfig YANG models over Device YANG models?

A) They are faster to process
B) They are vendor-neutral, enabling multi-vendor interoperability
C) They provide more detailed Cisco-specific data
D) They require less bandwidth

5. At what storage latency threshold may the GPU pipeline begin to starve?

A) 1ms
B) 5ms
C) 10ms
D) 100ms

Section 4: Log Correlation and Performance Analysis

Pre-Quiz: Log Correlation and Performance Analysis

1. How many severity levels does syslog define?

A) 5 (0-4)
B) 8 (0-7)
C) 10 (0-9)
D) 3 (Low, Medium, High)

2. What is the key difference between SNMP traps and informs?

A) Traps use TCP while informs use UDP
B) Informs require manager acknowledgment and are retransmitted if unacknowledged
C) Traps are encrypted but informs are not
D) There is no difference

3. What does log correlation accomplish?

A) It encrypts log messages for security
B) It connects events across multiple devices to reconstruct the full picture of an incident
C) It deletes duplicate log entries
D) It converts logs to a standard format

4. Which SNMP version is required for production data center environments?

A) SNMPv1
B) SNMPv2c
C) SNMPv3
D) Any version is acceptable

5. What metric best indicates training efficiency for AI workloads?

A) Network packet count
B) Throughput (samples/sec)
C) Disk space remaining
D) Number of active processes

Key Points

Syslog Severity Levels

| Level | Name | Description | Example |
| --- | --- | --- | --- |
| 0 | Emergency | System unusable | Hardware failure |
| 1 | Alert | Immediate action needed | Power supply failure |
| 2 | Critical | Critical conditions | Memory allocation failure |
| 3 | Error | Error conditions | Interface down |
| 4 | Warning | Warning conditions | Temperature approaching limit |
| 5 | Notification | Normal but significant | Interface up/down |
| 6 | Informational | Informational messages | Configuration change |
| 7 | Debug | Debug-level messages | Packet trace output |

Debug-level logging (severity 7) should only be enabled temporarily during troubleshooting as it can generate enormous data volumes and impact device performance.
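On the wire, a syslog message carries these levels in its PRI field: severity is PRI mod 8 and facility is PRI div 8. A short decoding sketch (severity names follow the table above):

```python
# Sketch: decoding the syslog PRI field into facility and severity.

SEVERITY_NAMES = [
    "Emergency", "Alert", "Critical", "Error",
    "Warning", "Notification", "Informational", "Debug",
]

def decode_pri(pri):
    """Split a PRI value into (facility, severity, severity name)."""
    facility, severity = divmod(pri, 8)
    return facility, severity, SEVERITY_NAMES[severity]

# <187> = facility 23 (local7, common for network devices), severity 3 (Error)
facility, severity, name = decode_pri(187)
```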

SNMP: Traps vs. Informs

| Type | Acknowledgment | Reliability | Use Case |
| --- | --- | --- | --- |
| Traps | None (fire-and-forget) | Lower -- no retry if lost | High-volume, non-critical notifications |
| Informs | Manager must acknowledge | Higher -- retransmitted if unacknowledged | Critical alerts requiring guaranteed delivery |

Example SNMPv3 host configuration:

snmp-server host 192.0.2.1 informs version 3 auth NMS

Streaming Telemetry vs. SNMP Comparison

| Characteristic | Streaming Telemetry (MDT) | SNMP |
| --- | --- | --- |
| Data model | Push (device initiates) | Pull (manager polls) |
| Latency | Near-real-time (seconds) | Polling interval dependent (minutes) |
| Scalability | High -- no polling overhead | Degrades with device count |
| Data richness | Full YANG model paths | MIB-constrained |
| Encoding | GPB (efficient) or JSON | ASN.1/BER |
| Best for | High-frequency AI fabric monitoring | Device discovery, capacity planning, legacy systems |

Log Correlation Workflow

```mermaid
flowchart TD
    A["Correlation Rule Configured<br>(root-cause + related messages)"] --> B["Correlator Captures<br>Matching Message"]
    B --> C["Start Timeout Timer"]
    C --> D["Continue Capturing<br>Matching Messages"]
    D --> E{"Timer Expired?"}
    E -- No --> D
    E -- Yes --> F{"Root-Cause Message<br>Received?"}
    F -- Yes --> G["Correlation Confirmed:<br>Group All Related Messages"]
    G --> H["Suppress Duplicates &<br>Highlight Root Cause"]
    H --> I["Deliver Correlated Alert<br>to Operator"]
    F -- No --> J["No Correlation:<br>Forward Messages Individually"]
    style G fill:#27ae60,color:#fff
    style J fill:#e74c3c,color:#fff
```

Cross-System Correlation Example

A distributed training job stalls. The correlation timeline:

| Time | Device | Event | Severity |
| --- | --- | --- | --- |
| 10:01 | Nexus 9364C | Interface Eth1/49 CRC errors | Warning |
| 10:01 | Nexus 9364C | Interface Eth1/49 flap | Error |
| 10:01 | GPU Server 3 | NCCL timeout on rank 12 | Error |
| 10:02 | Nexus Dashboard | Latency anomaly: Spine-Leaf 3 | Major |
| 10:02 | GPU Servers 1-8 | Training job checkpoint fail | Critical |
| 10:03 | Intersight | Known defect CSCxx12345 match | Info |

Root cause: a failing transceiver on Eth1/49 caused CRC errors, triggering a link flap, disrupting the NCCL collective operation, and stalling the entire training job. Intersight identified a matching known defect.
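The core of this correlation is grouping events inside a time window and applying a root-cause rule. The sketch below replays the incident timeline; the keyword-matching rule is an illustrative assumption, far simpler than a production correlator.

```python
# Sketch: time-window correlation over the incident timeline above.
from datetime import datetime, timedelta

events = [
    ("10:01", "Nexus 9364C", "Interface Eth1/49 CRC errors", "Warning"),
    ("10:01", "Nexus 9364C", "Interface Eth1/49 flap", "Error"),
    ("10:01", "GPU Server 3", "NCCL timeout on rank 12", "Error"),
    ("10:02", "Nexus Dashboard", "Latency anomaly: Spine-Leaf 3", "Major"),
    ("10:02", "GPU Servers 1-8", "Training job checkpoint fail", "Critical"),
]

def correlate(events, window=timedelta(minutes=5), root_keyword="CRC"):
    """Group events within the window; flag the first root-cause match."""
    parsed = [(datetime.strptime(t, "%H:%M"), dev, msg, sev)
              for t, dev, msg, sev in events]
    first = min(ts for ts, *_ in parsed)
    group = [e for e in parsed if e[0] - first <= window]
    root = next((e for e in group if root_keyword in e[2]), None)
    return root, group

root, group = correlate(events)
```

Grouping all five events under the CRC-error root cause is what turns five separate alarms into one actionable incident.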

AI/ML Performance Optimization Workflow

```mermaid
flowchart LR
    A["1. Establish<br>Baselines"] --> B["2. Monitor<br>Continuously"]
    B --> C["3. Detect<br>Deviations"]
    C --> D["4. Correlate<br>Across Layers"]
    D --> E["5. Remediate<br>& Verify"]
    E --> B
    style A fill:#1a5276,color:#fff
    style B fill:#2e86c1,color:#fff
    style C fill:#e67e22,color:#fff
    style D fill:#8e44ad,color:#fff
    style E fill:#27ae60,color:#fff
```

AI/ML Workload Performance Metrics

| Metric | What It Measures | Optimization Signal |
| --- | --- | --- |
| GPU utilization | % of GPU compute cycles in use | Low utilization = data pipeline or network bottleneck |
| GPU memory utilization | % of GPU HBM in use | Near 100% = batch size at maximum; OOM = oversized model |
| Iteration time | Time per training step | Increasing = degradation in compute, network, or storage |
| Throughput (samples/sec) | Training samples per second | Primary measure of training efficiency |
| Network throughput per GPU | Bandwidth for collective ops | Should approach theoretical max during all-reduce |
| Data loading time | Time waiting for next batch | High = storage or data pipeline bottleneck |
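Two of these metrics combine into a quick diagnosis: if data loading consumes a large share of each iteration, the GPUs are starving. The 15% cutoff and the sample numbers below are illustrative assumptions.

```python
# Sketch: flagging a data-pipeline bottleneck from per-iteration timings.

def diagnose_iteration(iteration_s, data_loading_s, samples):
    """Return (samples/sec, data-loading fraction, bottleneck hint or None)."""
    throughput = samples / iteration_s          # primary training-efficiency measure
    load_fraction = data_loading_s / iteration_s
    bottleneck = "data pipeline/storage" if load_fraction > 0.15 else None
    return throughput, load_fraction, bottleneck

throughput, frac, bottleneck = diagnose_iteration(
    iteration_s=0.50, data_loading_s=0.20, samples=256)
```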
Key Takeaway: Effective AI infrastructure monitoring combines syslog for event logging, SNMP for structured queries, and streaming telemetry for real-time metrics. Log correlation across systems transforms isolated events into actionable incident narratives for rapid root-cause analysis.
Animation Slot: Interactive timeline showing cross-system log correlation -- click events to see how CRC errors cascade through the stack to training job failure
Post-Quiz: Log Correlation and Performance Analysis

1. In the log correlation process, what happens if the root-cause message is NOT received before the timeout expires?

A) All messages are suppressed
B) The timer restarts
C) No correlation occurs; messages are forwarded individually
D) A critical alert is generated

2. What syslog severity level should be the minimum for production switch logging?

A) 3 (Error)
B) 5 (Notification)
C) 7 (Debug)
D) 0 (Emergency)

3. In the cross-system correlation example, what was the root cause of the training job failure?

A) GPU memory overflow
B) A failing transceiver causing CRC errors and link flap
C) Storage subsystem latency
D) NCCL software bug

4. Which SNMPv3 security feature ensures packets have not been tampered with in transit?

A) Encryption
B) Authentication
C) Message integrity
D) Access control lists

5. What does increasing iteration time during a training job signal?

A) The model is converging faster
B) Degradation in compute, network, or storage performance
C) The batch size was reduced
D) Training is nearly complete

Answer Explanations