1. What happens when a packet is dropped in an RoCEv2 flow?
A) A simple TCP-style retransmission occurs
B) The packet is silently ignored by the receiver
C) A go-back-N recovery stalls the entire collective operation across all participating GPUs
D) The switch automatically re-queues the packet
2. Which four constraints must a high-throughput AI fabric satisfy simultaneously?
A) Encryption, compression, deduplication, and caching
B) Non-blocking bandwidth, ultra-low latency, zero packet loss, and deterministic performance at scale
C) High availability, redundancy, encryption, and multitenancy
D) QoS, VLAN segmentation, spanning tree, and link aggregation
3. What is a "rail-optimized" topology?
A) A ring topology connecting all GPUs in a loop
B) A topology with a dedicated spine per GPU rail to minimize east-west hops
C) A flat Layer 2 topology for maximum broadcast efficiency
D) A topology that uses railroad-style serial links between switches
4. What is the primary benefit of Dynamic Packet Prioritization (DPP)?
A) Encrypts packets based on priority level
B) Classifies flows as mice or elephants in hardware and maps short flows to a low-latency queue
C) Compresses high-priority packets for faster delivery
D) Duplicates packets across multiple paths for redundancy
5. Why is standard 5-tuple ECMP hashing problematic for RoCEv2 traffic?
A) RoCEv2 uses TCP, which does not support ECMP
B) RoCEv2 traffic shares the same UDP destination port (4791), reducing hash entropy
C) ECMP cannot be used on leaf-spine topologies
D) RoCEv2 packets are too large for ECMP to handle
Key Points
- A dropped RoCEv2 packet triggers go-back-N recovery that can stall an entire collective operation across all participating GPUs -- lossless transport is the foundational requirement.
- AI fabrics must satisfy four constraints simultaneously: non-blocking bandwidth, ultra-low latency, zero packet loss under congestion, and deterministic performance at scale.
- Leaf-spine (Clos) topologies scale from 2-tier (up to ~512 GPUs) to 3-tier with superspines (10,000+ GPUs). Rail-optimized designs dedicate a spine per GPU rail.
- High-radix switches like the Nexus 9364E-SG2 (64x 800GbE, 51.2 Tbps) reduce tier count, lowering latency and simplifying design.
- Buffer management features -- DPP, AFD, shared memory, and HBM deep buffers -- absorb synchronized incast bursts from collective operations.
- ECMP for RoCEv2 must parse BTH fields (opcode, destination QP, packet sequence number) to improve hashing entropy beyond the standard 5-tuple.
Why Lossless Matters for AI/ML
When hundreds or thousands of GPUs exchange gradients during a training run, the network is the single largest determinant of job completion time. RoCEv2 relies on lossless Ethernet because a dropped packet does not trigger a simple retransmission -- it triggers a go-back-N recovery at the transport layer that can stall an entire collective operation across every participating GPU.
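The asymmetry is easy to quantify. A minimal sketch (illustrative message sizes, not from any NIC specification) comparing go-back-N recovery with selective retransmission after a single drop:

```python
# Hypothetical sketch: cost of one dropped packet under go-back-N vs.
# selective retransmission. Numbers are illustrative only.

def goback_n_retransmit(total_pkts: int, dropped_psn: int) -> int:
    """Go-back-N: everything from the dropped packet sequence number (PSN)
    onward is resent, even packets that arrived intact."""
    return total_pkts - dropped_psn

def selective_retransmit(total_pkts: int, dropped_psn: int) -> int:
    """Selective repeat resends only the lost packet."""
    return 1

# A gradient message split into 1024 packets; packet 10 is dropped.
total = 1024
print(goback_n_retransmit(total, 10))   # 1014 packets resent
print(selective_retransmit(total, 10))  # 1 packet resent
```

Because every GPU in the collective waits on the slowest participant, the 1014-packet replay delays not just one flow but the whole operation.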
Design Principles for AI Fabrics
A high-throughput AI fabric must satisfy four simultaneous constraints:
- Non-blocking bandwidth -- every port-to-port path must sustain line-rate forwarding with no oversubscription in GPU-to-GPU switching tiers.
- Ultra-low latency -- collective communication operations (all-reduce, all-gather, reduce-scatter) are latency-sensitive; even microseconds of additional delay compound across thousands of iterations.
- Zero packet loss under congestion -- the fabric must absorb bursts and signal congestion without discarding frames.
- Deterministic performance at scale -- these guarantees must hold whether the cluster contains 32 GPUs or 32,000.
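The first constraint can be checked arithmetically: a switching tier is non-blocking when GPU-facing capacity does not exceed spine-facing capacity. A small sketch with invented port counts:

```python
# Illustrative non-blocking check. Port counts and speeds are invented
# examples, not a specific platform's configuration.

def oversubscription_ratio(down_ports: int, down_gbps: int,
                           up_ports: int, up_gbps: int) -> float:
    """Ratio of downlink to uplink capacity; 1.0 or less is non-blocking."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# 32x 400G down to GPUs, 16x 800G up to spines: non-blocking.
print(oversubscription_ratio(32, 400, 16, 800))  # 1.0
# Halving the uplinks makes the tier 2:1 oversubscribed.
print(oversubscription_ratio(32, 400, 8, 800))   # 2.0
```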
Topologies for AI/ML Workloads
| Topology | Scale | Tiers | Typical Use Case |
|---|---|---|---|
| 2-tier leaf-spine | Up to ~512 GPUs | Leaf + Spine | Single-pod training clusters |
| 3-tier leaf-spine-superspine | 512 to 10,000+ GPUs | Leaf + Spine + Superspine | Multi-pod, large-scale training |
| Rail-optimized | Varies | Dedicated spine per GPU rail | Minimizes east-west hops for collective ops |
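The scale figures follow from switch radix. A back-of-the-envelope sketch, assuming each leaf splits its ports evenly between GPU-facing and spine-facing links (real designs are further constrained by rails, optics, and resilience policy):

```python
# Theoretical non-blocking ceiling of a 2-tier Clos built from k-port
# switches. Assumption: each leaf uses half its ports for GPUs and half
# for spine uplinks; each spine port feeds one leaf.

def max_gpus_two_tier(radix: int) -> int:
    leaves = radix            # a spine with k ports can reach k leaves
    gpus_per_leaf = radix // 2
    return leaves * gpus_per_leaf

print(max_gpus_two_tier(32))  # 512 -- matches the ~512-GPU pod figure above
print(max_gpus_two_tier(64))  # 2048 theoretical ceiling at 64-port radix
```

Higher radix raises this ceiling, which is why high-radix switches let a pod stay at two tiers instead of three.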
Multi-Tier Leaf-Spine Topology
```mermaid
graph TD
    subgraph "Superspine Tier"
        SS1[Superspine-1]
        SS2[Superspine-2]
    end
    subgraph "Spine Tier"
        S1[Spine-1]
        S2[Spine-2]
        S3[Spine-3]
        S4[Spine-4]
    end
    subgraph "Leaf Tier - Pod A"
        L1[Leaf-1]
        L2[Leaf-2]
    end
    subgraph "Leaf Tier - Pod B"
        L3[Leaf-3]
        L4[Leaf-4]
    end
    subgraph "GPU Servers"
        G1[GPU Server 1]
        G2[GPU Server 2]
        G3[GPU Server 3]
        G4[GPU Server 4]
    end
    SS1 --- S1
    SS1 --- S2
    SS1 --- S3
    SS1 --- S4
    SS2 --- S1
    SS2 --- S2
    SS2 --- S3
    SS2 --- S4
    S1 --- L1
    S1 --- L2
    S2 --- L1
    S2 --- L2
    S3 --- L3
    S3 --- L4
    S4 --- L3
    S4 --- L4
    L1 --- G1
    L2 --- G2
    L3 --- G3
    L4 --- G4
```
Buffer Management and Queue Design
Buffers are the shock absorbers of a data center switch. When multiple flows converge on the same egress port simultaneously -- a pattern called incast -- the switch must temporarily store excess packets in buffers rather than dropping them.
| Feature | Function | Benefit |
|---|---|---|
| DPP | Classifies flows as mice (short) or elephants (long); maps mice to a low-latency queue | Low latency for short flows even during port congestion |
| AFD | Identifies individual 5-tuple flows and assigns per-flow drop probabilities | Prevents elephant flows from monopolizing shared buffers |
| Shared Memory | On-die packet buffer shared across all ports (e.g., 256 MB on Nexus 9364E-SG2) | Consistent burst absorption without per-port buffer starvation |
| Deep Buffer / HBM | Extended buffer using High Bandwidth Memory (e.g., 80 MB on-die + 8 GB HBM on Nexus 9332D-H2R) | Absorbs sustained congestion from inference workloads and large-scale incast |
ECMP Enhancements for RoCEv2: Standard 5-tuple ECMP hashing suffers from low entropy with RoCEv2 because all traffic shares UDP destination port 4791. Nexus 9000 switches with CloudScale H2R and H1 ASICs parse Base Transport Header (BTH) fields -- including opcode, destination queue pair, and packet sequence number -- to distribute flows more evenly across spine uplinks.
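The entropy problem and its fix can be sketched in a few lines. This toy model uses CRC32 over a tuple as a stand-in for the ASIC's hash function; the field choices are simplified and not vendor code:

```python
# Toy model of ECMP hash entropy for RoCEv2.
import zlib

def ecmp_path(fields: tuple, n_paths: int = 4) -> int:
    """Pick one of n_paths uplinks from a hash of the header fields."""
    return zlib.crc32(repr(fields).encode()) % n_paths

# With the UDP destination port fixed at 4791 and poor source-port entropy,
# every queue pair between two servers presents the same 5-tuple.
five_tuple = ("10.0.0.1", "10.0.0.2", 17, 4791, 4791)
print(len({ecmp_path(five_tuple) for _ in range(64)}))             # 1: polarized

# BTH-aware hashing mixes in the destination queue pair, restoring entropy.
print(len({ecmp_path(five_tuple + (dqp,)) for dqp in range(64)}))  # >1: spread out
```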
Key Takeaway: A lossless AI fabric requires non-blocking leaf-spine topologies, high-radix switches that minimize tier count, and intelligent buffer management that can absorb synchronized incast bursts without dropping packets. ECMP must be enhanced with RoCEv2-aware hashing to avoid flow polarization.
Animation Slot: Incast burst arriving at a switch egress port -- packets queued in shared buffer, DPP classifying mice vs. elephant flows, AFD assigning per-flow drop probabilities.
Post-Quiz -- Verify Your Understanding
1. Why does a single dropped RoCEv2 packet have such a severe impact on AI training?
A) It causes the entire job to restart from checkpoint
B) It triggers go-back-N recovery that stalls the entire collective operation across all participating GPUs
C) It corrupts the model weights permanently
D) It only affects the single GPU that sent the packet
2. The Nexus 9364E-SG2 has 64 ports of 800 GbE. How does its high-radix design benefit AI fabrics?
A) It allows more VLANs per port
B) It reduces the number of switching tiers needed, lowering latency and simplifying design
C) It increases the encryption throughput per port
D) It eliminates the need for ECMP
3. Which Nexus 9000 buffer feature uses High Bandwidth Memory to absorb sustained congestion?
A) Dynamic Packet Prioritization (DPP)
B) Approximate Fair Drop (AFD)
C) Deep Buffer / HBM (e.g., Nexus 9332D-H2R with 8 GB HBM)
D) Shared Memory Architecture
4. What BTH fields do Nexus 9000 CloudScale ASICs parse to improve RoCEv2 ECMP hashing?
A) Source MAC address and VLAN tag
B) Opcode, destination queue pair, and packet sequence number
C) TTL and IP identification fields
D) TCP window size and sequence number
5. What traffic pattern makes AI training particularly prone to buffer overflow?
A) Unicast streaming from a single source
B) Incast -- hundreds of GPUs transmitting simultaneously to overlapping destinations during collective operations
C) Multicast replication across VLANs
D) Broadcast storms from spanning tree convergence
Section 2: Lossless Fabric Design and Configuration
Pre-Quiz -- Test Your Existing Knowledge
1. What does Priority Flow Control (PFC) do differently from legacy IEEE 802.3x PAUSE?
A) PFC pauses all traffic on a link, while 802.3x is per-priority
B) PFC pauses traffic on a per-Class-of-Service basis, while 802.3x halts all traffic
C) PFC only works on 10 GbE links
D) PFC drops packets instead of pausing them
2. Why is ECN preferred over PFC as a congestion signal?
A) ECN drops packets faster than PFC can pause them
B) ECN marks packets to signal senders to reduce rate, avoiding the latency penalty of pausing
C) ECN requires no switch configuration
D) ECN is only used for best-effort traffic
3. What is the role of WRED when combined with ECN?
A) WRED encrypts packets before ECN marking
B) WRED probabilistically marks packets before queue saturation, providing early congestion signaling
C) WRED replaces ECN entirely in modern fabrics
D) WRED only operates on best-effort queues
4. What does PFC Watchdog (PFCWD) protect against?
A) Unauthorized access to switch management interfaces
B) PFC storms where a malfunctioning NIC continuously sends pause frames, cascading through the fabric
C) Spanning tree loops in Layer 2 networks
D) Excessive multicast traffic on lossless queues
5. Which CoS value is commonly used for the lossless RoCEv2 traffic class?
A) CoS 0
B) CoS 1
C) CoS 3
D) CoS 7
Key Points
- Lossless Ethernet rests on three pillars: PFC (IEEE 802.1Qbb), ECN (RFC 3168), and WRED -- configured consistently on every hop.
- ECN/WRED is the first line of defense (early congestion signaling); PFC is the safety net (prevents drops when ECN cannot keep up).
- If PFC is triggering frequently, it indicates that ECN thresholds or fabric capacity need adjustment.
- PFC must be enabled on every interface in the lossless path -- missing a single hop breaks the end-to-end lossless guarantee.
- Fabric-wide consistency requirements: matching CoS markings, PFC on every hop, jumbo MTU (9216) end-to-end, DCBX negotiation, and uniform WRED/ECN thresholds.
- PFC Watchdog (default 100 ms interval) must always be enabled to prevent a single malfunctioning NIC from cascading PFC storms across the entire fabric.
The Three Pillars of Lossless Ethernet
Lossless Ethernet is a fabric-wide property: it must be configured consistently on every hop between source NIC and destination NIC.
Priority Flow Control (PFC) -- IEEE 802.1Qbb: Pauses a specific Class of Service on an ingress port when receive buffers approach capacity. Unlike legacy 802.3x PAUSE (which halts all traffic), PFC is per-priority -- only the lossless class is paused while best-effort continues.
Explicit Congestion Notification (ECN) -- RFC 3168: A switch marks the IP ECN field (CE -- Congestion Experienced) in transit packets rather than dropping them. The receiver echoes this back, and the sender reduces its rate. ECN is preferred because it avoids the latency penalty of PFC pauses.
Weighted Random Early Detection (WRED): Probabilistically marks (ECN) or drops packets before queue saturation. Combined with ECN, WRED provides early congestion signaling that keeps queues short.
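As a simplified model, WRED's ECN-marking behavior is a linear ramp between its two thresholds. The 150 KB / 3000 KB values below mirror the configuration example in this section; real ASICs operate on a weighted-average queue depth rather than the instantaneous depth used here:

```python
# Simplified WRED/ECN marking curve. Thresholds match the section's
# configuration example; the linear ramp is a common textbook model.

def wred_mark_probability(queue_kb: float, min_kb: float = 150,
                          max_kb: float = 3000, max_prob: float = 1.0) -> float:
    """Probability of ECN-marking a packet at a given queue depth."""
    if queue_kb <= min_kb:
        return 0.0            # below the minimum threshold: never mark
    if queue_kb >= max_kb:
        return max_prob       # at or above the maximum threshold: always mark
    return max_prob * (queue_kb - min_kb) / (max_kb - min_kb)

print(wred_mark_probability(150))   # 0.0 -- queue still short
print(wred_mark_probability(1575))  # 0.5 -- midpoint of the ramp
print(wred_mark_probability(3000))  # 1.0 -- every packet marked
```

The early, probabilistic marking is what keeps queues short: senders are told to slow down well before the buffer fills and PFC has to fire.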
The following steps enable lossless transport for RoCEv2 on Cisco Nexus 9000 using CoS 3 as the no-drop class and CoS 0 as best-effort.
Configuration Workflow
```mermaid
flowchart TD
    A["Step 1: Enable PFC\nfeature priority-flow-control\nPFC mode on per-interface"] --> B["Step 2: Define QoS Classification\nclass-map: match CoS 3\npolicy-map: set qos-group 3"]
    B --> C["Step 3: Configure Queuing\nno-drop on lossless queue\nWRED + ECN thresholds\nmin 150 KB / max 3000 KB"]
    C --> D["Step 4: Apply Policies\nIngress: QoS classification\nEgress: Queuing policy"]
    D --> E["Step 5: Enable PFC Watchdog\npriority-flow-control\nwatch-dog-interval on"]
    E --> F["Step 6: Validate\nPFC frame counters\nECN mark counters\nZero-drop confirmation"]
    style A fill:#2d5986,color:#fff
    style B fill:#2d5986,color:#fff
    style C fill:#2d5986,color:#fff
    style D fill:#2d5986,color:#fff
    style E fill:#2d5986,color:#fff
    style F fill:#1a7a3a,color:#fff
```
Step 1: Enable PFC globally and on interfaces.
```
feature priority-flow-control

interface Ethernet1/1
  priority-flow-control mode on
```
Step 2: Define QoS classification.
```
class-map type qos match-all class-rocev2
  match cos 3

policy-map type qos rocev2-ingress
  class class-rocev2
    set qos-group 3
  class class-default
    set qos-group 0
```
Step 3: Configure queuing with no-drop and ECN/WRED.
```
policy-map type queuing rocev2-egress
  class type queuing c-out-8q-q3
    priority level 1
    no-drop
    random-detect minimum-threshold 150 kbytes maximum-threshold 3000 kbytes drop-probability 100 weight 0 ecn
  class type queuing c-out-8q-q-default
    bandwidth remaining percent 100
```
Step 4: Apply policies to interfaces.
```
interface Ethernet1/1
  service-policy type qos input rocev2-ingress
  service-policy type queuing output rocev2-egress
```
Step 5: Enable PFC Watchdog.
```
priority-flow-control watch-dog-interval on
```
Fabric-Wide QoS Consistency
| Requirement | Implementation |
|---|---|
| Consistent CoS marking | All NICs, hypervisors, and switches must agree on which CoS value is lossless (commonly CoS 3) |
| PFC on every hop | Enable PFC Tx and Rx on all interfaces in the no-drop path |
| Matching MTU | Jumbo frames (9216 bytes) end-to-end; a single 1500-byte link will fragment or drop oversized frames |
| DCBX negotiation | Data Center Bridging Exchange ensures NIC and switch agree on PFC parameters automatically |
| Uniform WRED/ECN thresholds | Consistent ECN marking thresholds across leaf and spine tiers to avoid asymmetric congestion signaling |
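Because a single mismatched hop breaks the end-to-end guarantee, these settings are worth auditing mechanically. A hypothetical sketch -- the device data, field names, and target values are invented for illustration:

```python
# Hypothetical fabric-consistency audit. Settings and device inventory
# are invented; in practice they would come from the switches' configs.

REQUIRED = {"lossless_cos": 3, "mtu": 9216, "pfc": "on", "ecn_min_kb": 150}

def audit(devices: dict) -> list:
    """Return (device, setting, actual_value) for every mismatch."""
    return [(name, key, cfg.get(key))
            for name, cfg in devices.items()
            for key, want in REQUIRED.items()
            if cfg.get(key) != want]

fabric = {
    "leaf-1": {"lossless_cos": 3, "mtu": 9216, "pfc": "on", "ecn_min_kb": 150},
    "leaf-2": {"lossless_cos": 3, "mtu": 1500, "pfc": "on", "ecn_min_kb": 150},
}
print(audit(fabric))  # [('leaf-2', 'mtu', 1500)] -- one bad hop breaks lossless
```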
Testing and Validation
- PFC frame verification -- generate controlled congestion and confirm PFC pause frames using show interface priority-flow-control.
- ECN marking verification -- capture packets at the receiver and confirm the CE bit; use show queuing interface for WRED ECN counters.
- Zero-drop confirmation -- under sustained load, verify drop counters remain at zero for the no-drop class.
- PFC Watchdog test -- simulate a PFC storm and confirm PFCWD activates and isolates the offending queue.
- ECMP balance validation -- verify RoCEv2 flow distribution across spine uplinks using show routing hash with BTH-aware hashing.
Key Takeaway: Lossless Ethernet requires PFC and ECN configured consistently on every hop. ECN is the preferred congestion signal; PFC is the safety net. PFC Watchdog must always be enabled to prevent a single malfunctioning host from taking down the entire fabric through a PFC storm.
Animation Slot: Step-by-step PFC pause frame propagation -- queue fills, pause sent upstream, sender halts, queue drains, resume signal sent. Show ECN marking flow in parallel.
Post-Quiz -- Verify Your Understanding
1. In the NX-OS configuration, what does the no-drop keyword under the queuing policy accomplish?
A) It disables logging for dropped packets
B) It designates the queue as lossless, ensuring PFC will pause upstream rather than dropping packets
C) It enables packet duplication for redundancy
D) It sets the queue to best-effort mode
2. What happens if a single link in the lossless path has a default 1500-byte MTU while the rest use 9216?
A) Traffic is automatically compressed to fit
B) Oversized frames will be fragmented or dropped, breaking the lossless guarantee
C) PFC automatically adjusts the MTU
D) ECMP re-routes around the mismatched link
3. What is the default PFC Watchdog interval on Nexus 9000, and what happens when it triggers?
A) 10 ms; it reboots the switch
B) 100 ms; it flushes the affected queue and drops new packets destined for it until the storm clears
C) 1 second; it disables PFC on the interface permanently
D) 500 ms; it sends an SNMP trap but takes no action
4. In the hierarchical congestion control model, what is the correct escalation order?
A) PFC first, then ECN if PFC fails
B) ECN/WRED signals first (early marking), then PFC as safety net, then PFCWD if PFC storms form
C) PFCWD first, then ECN, then PFC
D) Tail drop first, then WRED, then PFC
5. What WRED/ECN thresholds are configured in the example for the lossless queue?
A) Minimum 50 KB, maximum 500 KB
B) Minimum 150 KB, maximum 3000 KB with ECN marking
C) Minimum 1 MB, maximum 10 MB
D) No thresholds -- WRED is disabled for lossless queues
Section 3: Congestion Visibility with Nexus Dashboard Insights
Pre-Quiz -- Test Your Existing Knowledge
1. What is Nexus Dashboard Insights (NDI)?
A) A switch operating system that replaces NX-OS
B) An analytics and assurance platform that collects telemetry from Nexus switches for congestion monitoring
C) A protocol for encrypting data center traffic
D) A hardware module installed in Nexus 9000 switches
2. What is a "congestion tree" in the context of PFC?
A) A spanning tree variant used in lossless fabrics
B) A directed subgraph of switches and hosts affected by PFC pause propagation from a congestion root
C) A multicast distribution tree for RoCEv2 traffic
D) A routing tree computed by OSPF
3. How does NDI collect telemetry data?
A) Through SNMP polling every 60 seconds
B) Through hardware flow telemetry built into the Nexus 9000 ASIC
C) Through NetFlow export from spine switches only
D) Through syslog message parsing
4. What PFC metric helps identify whether pause events are brief or sustained?
A) PFC frame size
B) PFC duration tracking -- how long each pause event lasts
C) PFC source MAC address
D) PFC VLAN membership
5. In a congestion tree, what is an "innocent bystander"?
A) A switch that is powered off
B) A host whose traffic is paused due to PFC propagation, even though its traffic is unrelated to the original congestion
C) A monitoring probe that captures traffic passively
D) A spine switch with zero utilization
Key Points
- NDI provides real-time congestion scoring, ML-driven anomaly detection, AI job observability, and sustainability insights.
- Hardware flow telemetry in the Nexus 9000 ASIC accounts for every packet at per-device, per-interface, and per-flow granularity.
- ECN analytics include per-flow mark counters, historical trending, and correlation with specific application flows.
- PFC analytics include per-CoS frame counts, duration tracking, and PFC Watchdog activation events.
- Congestion tree visualization shows the full blast radius of PFC pause propagation -- the congestion root (bottleneck) and all affected upstream branches (including innocent bystanders).
- A single slow receiver can cascade PFC pauses across the entire fabric topology if not detected and remediated.
NDI for Congestion Monitoring
Configuring a lossless fabric is only half the battle -- you must also see what the fabric is doing in real time. NDI collects telemetry from Nexus switches and analyzes it with advanced algorithms for baselining, anomaly detection, and forecasting.
| Capability | Description |
|---|---|
| Real-time congestion scoring | Assigns a numerical congestion score to each device, interface, and flow based on buffer utilization, ECN marks, and PFC events |
| Anomaly detection | Proactively identifies performance bottlenecks using ML-driven baselines and raises alerts with suggested remediation |
| AI job observability | End-to-end visibility from network layer through to GPU utilization, correlating network events with training job performance |
| Sustainability insights | Monitors power consumption and provides optimization recommendations for energy-efficient operation |
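As an illustration of how such a score might combine its three inputs, here is a toy weighted model. The weights, normalization ceilings, and 0-100 scale are invented for the sketch, not NDI's actual formula:

```python
# Invented congestion-score model over the three inputs the table names:
# buffer utilization, ECN mark rate, and PFC pause rate.

def congestion_score(buffer_util: float, ecn_marks_per_s: float,
                     pfc_pauses_per_s: float) -> float:
    """0-100 score; PFC weighted heaviest since it means the safety net fired."""
    b = min(buffer_util, 1.0)                  # 0..1 buffer occupancy
    e = min(ecn_marks_per_s / 10_000, 1.0)     # normalize to an assumed ceiling
    p = min(pfc_pauses_per_s / 100, 1.0)
    return round(100 * (0.3 * b + 0.3 * e + 0.4 * p), 1)

print(congestion_score(0.2, 500, 0))    # 7.5 -- healthy port
print(congestion_score(0.9, 9000, 80))  # 86.0 -- sustained congestion
```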
ECN and PFC Analytics
ECN Metrics:
- ECN mark counters per device, per interface, and per flow
- Historical trending of ECN marking rates to identify worsening congestion
- Correlation of ECN marks with specific application flows to pinpoint affected workloads

PFC Metrics:
- PFC frames issued and received per interface, broken down by Class of Service
- PFC duration tracking -- how long each pause event lasts
- PFC Watchdog activation events, showing when and where storms were detected
Congestion Tree Visualization
When PFC pause frames propagate upstream from a congestion point, they create a tree-shaped pattern of affected switches and links. The congestion root is the bottleneck source (often a slow receiver or oversubscribed link), and the congestion branches include all upstream switches and hosts whose traffic is paused as a result.
PFC Pause Propagation -- Congestion Tree Formation
NDI visualizes these congestion trees on a topology map, allowing administrators to:
- Identify the congestion root (often a slow receiver or oversubscribed link)
- Trace all affected branches to understand the blast radius
- Correlate congestion trees with specific AI training jobs or collective operations
- Track how congestion trees form and dissolve over time
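Conceptually, tracing the blast radius is a graph walk from the congestion root along the links that carried pause frames. A sketch with an invented topology, showing how an innocent bystander ends up inside the tree:

```python
# Sketch of blast-radius tracing: BFS from the congestion root over links
# that carried PFC pauses. Topology and pause data are invented.
from collections import deque

def congestion_tree(root: str, paused_upstream: dict) -> list:
    """Every node reached from the root via pause propagation is affected."""
    seen, order, q = {root}, [root], deque([root])
    while q:
        node = q.popleft()
        for nbr in paused_upstream.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                order.append(nbr)
                q.append(nbr)
    return order

# A slow receiver on leaf-1 pauses spine-1 and spine-2, which in turn
# pause leaf-2 -- an innocent bystander with unrelated traffic.
pauses = {"leaf-1": ["spine-1", "spine-2"],
          "spine-1": ["leaf-2"], "spine-2": ["leaf-2"]}
print(congestion_tree("leaf-1", pauses))
# ['leaf-1', 'spine-1', 'spine-2', 'leaf-2']
```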
Key Takeaway: Nexus Dashboard Insights transforms raw switch telemetry into actionable congestion intelligence. Its congestion tree visualization is especially critical in lossless fabrics, where a single slow receiver can cascade PFC pauses across the entire fabric topology.
Animation Slot: Congestion tree forming in real time -- slow receiver triggers PFC pause propagation upstream through spine to leaves, highlighting innocent bystanders. NDI dashboard overlay showing congestion scores rising.
Post-Quiz -- Verify Your Understanding
1. What three data sources does NDI use to compute a congestion score?
A) CPU utilization, memory usage, and disk I/O
B) Buffer utilization, ECN marks, and PFC events
C) SNMP traps, syslog messages, and NetFlow records
D) BGP route count, OSPF adjacency state, and ARP table size
2. Server X is a slow receiver causing congestion. Server Y shares a spine with Server X but has unrelated traffic. What happens to Server Y?
A) Server Y is unaffected because PFC is per-priority
B) Server Y's lossless traffic is paused as an innocent bystander because PFC pauses propagate through the shared spine
C) Server Y's traffic is rerouted through a different spine automatically
D) Server Y's switch reboots due to the PFC storm
3. How does NDI detect anomalies in congestion behavior?
A) By comparing against static thresholds configured by the administrator
B) By using ML-driven baselines and raising alerts when metrics deviate from learned normal behavior
C) By running ping tests to every host every 5 seconds
D) By analyzing BGP routing table changes
4. At what granularity does NDI present telemetry data?
A) Only at the device level
B) Per-device, per-interface, and per-flow granularity
C) Only at the fabric level (aggregate)
D) Per-VLAN only
5. What critical operational question can PFC duration tracking help answer?
A) Which VLANs are configured on each port
B) Whether pause events are brief (normal burst absorption) or sustained (indicating a problem requiring intervention)
C) How many routes are in the routing table
D) Which firmware version is running on each switch
Section 4: AI/ML Traffic Flow Monitoring
Pre-Quiz -- Test Your Existing Knowledge
1. What is the dominant traffic pattern during AI training gradient synchronization?
A) North-south unicast from servers to clients
B) All-reduce -- every GPU sends to and receives from every other GPU, creating massive east-west traffic
C) Multicast from a single parameter server
D) Sequential point-to-point between adjacent GPUs only
2. How does inference traffic differ from training traffic?
A) Inference produces no network traffic
B) Inference generates sustained elephant flows (e.g., KV-cache transfers) rather than short synchronized bursts
C) Inference uses only broadcast traffic
D) Inference traffic is identical to training traffic
3. What is a microburst?
A) A sustained period of high link utilization lasting minutes
B) A short-lived traffic surge (microseconds to milliseconds) that exceeds link capacity even when average utilization is low
C) A type of packet corruption caused by electromagnetic interference
D) A brief power surge to the switch ASIC
4. Why does traditional polling-based monitoring miss microbursts?
A) Polling uses too much bandwidth
B) Polling samples every 30-60 seconds, while microbursts last microseconds to milliseconds
C) Polling only works on spine switches
D) Polling is incompatible with RoCEv2
5. For inference-optimized fabric design, what buffer strategy is recommended?
A) No buffers -- drop all excess packets
B) Shallow buffers with fast drain
C) Deep buffers (HBM) for sustained absorption of elephant flows
D) Per-VLAN dedicated buffers
Key Points
- AI training produces synchronized collective operations (all-reduce, all-gather, reduce-scatter) that create massive east-west incast bursts.
- AI inference generates sustained elephant flows (KV-cache transfers) that do not drain between phases -- buffers eventually overflow regardless of size.
- Training is optimized with shallow buffers and higher ECN thresholds; inference needs deep buffers (HBM) and lower ECN thresholds for early signaling.
- Microbursts are the signature traffic pattern of distributed AI training -- traditional 30-60 second polling completely misses them.
- NDI detects microbursts at sub-millisecond granularity using hardware telemetry: per-port burst tracking, temporal correlation, historical trending, and flow-level attribution.
- The NDI monitoring workflow: Baseline normal metrics, Alert on deviations, Diagnose with congestion trees and flow analytics, Tune ECN/WRED thresholds or redistribute workloads.
Traffic Patterns in AI Workloads
| Traffic Pattern | Workload Stage | Characteristics | Network Impact |
|---|---|---|---|
| All-Reduce | Training (gradient sync) | Every GPU sends to and receives from every other GPU | Massive east-west bandwidth; synchronized incast |
| All-Gather | Training (parameter broadcast) | Each GPU contributes a shard; all GPUs receive the full result | High fan-out followed by high fan-in |
| Reduce-Scatter | Training (distributed reduction) | Reduction result scattered across GPUs | Balanced but bursty bidirectional traffic |
| Pipeline parallel | Training/Inference | Data flows sequentially through model stages on different GPUs | Sustained point-to-point flows; latency-sensitive |
| KV-cache transfer | Inference (prefill/decode) | Large context windows require distributing key-value caches across nodes | Sustained elephant flows that can overwhelm buffers |
Training vs. Inference: Different Stress on the Network
Training: Transient Incast. GPUs execute a compute phase (forward/backward pass) followed by a communication phase (gradient synchronization). The communication phase produces highly synchronized bursts where many GPUs transmit simultaneously. This transient incast can exceed link capacity by 10x or more for microseconds. Training congestion is self-limiting: buffers absorb the burst, traffic drains during the compute phase.
Inference: Sustained Elephant Flows. Inference (especially for LLMs with extended context windows) creates sustained, long-lived flows as models distribute memory and KV-caches. These elephant flows do not drain between phases. A fabric designed solely for training assumptions will experience sustained congestion during inference.
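A toy buffer-occupancy model makes the contrast concrete. Timescales, sizes, and rates are illustrative only:

```python
# Toy FIFO-buffer model: training-style bursts drain between compute
# phases; an inference-style elephant flow never drains and overflows
# any finite buffer. All numbers invented for illustration.

def peak_occupancy(arrivals_mb: list, drain_mb_per_step: float,
                   buffer_mb: float) -> tuple:
    """Return (peak occupancy in MB, MB dropped) over the trace."""
    q = dropped = peak = 0.0
    for a in arrivals_mb:
        q = max(q + a - drain_mb_per_step, 0.0)
        if q > buffer_mb:
            dropped += q - buffer_mb   # overflow: frames would be lost
            q = buffer_mb
        peak = max(peak, q)
    return peak, dropped

# Training: a burst, then three quiet compute steps -- buffer drains fully.
training = [40, 0, 0, 0] * 4
# Inference: sustained arrivals slightly above drain rate -- queue only grows.
inference = [12] * 16

print(peak_occupancy(training, 10, 64))   # (30.0, 0.0) -- burst absorbed
print(peak_occupancy(inference, 10, 16))  # (16.0, 16.0) -- buffer overflows
```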
| Design Parameter | Training Optimized | Inference Optimized |
|---|---|---|
| Buffer strategy | Shallow buffers with fast drain | Deep buffers (HBM) for sustained absorption |
| ECN thresholds | Higher thresholds (allow burst absorption) | Lower thresholds (signal early before buffers fill) |
| PFC reliance | Occasional PFC acceptable | Minimize PFC; rely on ECN/WRED |
| Link capacity | Sized for peak burst bandwidth | Sized for sustained throughput |
Microburst Detection
A microburst is a short-lived surge of traffic -- typically microseconds to low milliseconds -- that exceeds link capacity even when average utilization appears low. Traditional polling-based monitoring (sampling every 30-60 seconds) completely misses them.
NDI Microburst Detection uses hardware telemetry from the Nexus 9000 ASIC at sub-millisecond granularity:
- Per-port burst tracking: Monitors instantaneous queue depth and flags when utilization exceeds thresholds.
- Temporal correlation: Correlates microbursts across multiple ports and switches to identify collective operation boundaries.
- Historical trending: Records microburst frequency, duration, and magnitude over time.
- Flow-level attribution: Attributes specific microbursts to individual RoCEv2 flows, identifying which GPU pairs cause congestion.
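The sampling-granularity point can be demonstrated directly: averaged over a polling window the port looks healthy, while fine-grained samples expose the burst. Trace data below is invented:

```python
# Why averaged polling misses microbursts: the same invented trace seen
# as a window average vs. fine-grained per-sample telemetry.

def average_utilization(samples: list) -> float:
    return sum(samples) / len(samples)

def microbursts(samples: list, line_rate: float) -> list:
    """Indices where instantaneous offered load exceeded line rate."""
    return [i for i, s in enumerate(samples) if s > line_rate]

# 1000 fine-grained load samples (Gbps) on a 400G port: mostly idle,
# with one 5-sample burst at 4x line rate.
trace = [10.0] * 1000
trace[500:505] = [1600.0] * 5

print(average_utilization(trace))  # well under 400 -- the poller sees "healthy"
print(microbursts(trace, 400.0))   # [500, 501, 502, 503, 504] -- the real story
```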
NDI Congestion Monitoring Workflow for AI/ML Traffic
```mermaid
flowchart TD
    A["1. Baseline\nEstablish normal ECN rates,\nPFC counts, microburst frequency\nper AI job type"] --> B["2. Alert\nAnomaly thresholds trigger on\ndeviation from baseline\ne.g., 3x PFC rate increase"]
    B --> C["3. Diagnose\nCongestion tree visualization\nidentifies root cause;\nflow-level analytics pinpoints\naffected workload"]
    C --> D["4. Tune\nAdjust ECN/WRED thresholds,\nrebalance ECMP paths,\nor redistribute workloads"]
    D --> A
    style A fill:#2d7d5e,color:#fff
    style B fill:#cc8833,color:#fff
    style C fill:#cc3333,color:#fff
    style D fill:#2d5986,color:#fff
```
NDI's traffic analytics also auto-discovers services by well-known Layer 4 ports or user-defined custom categories, baselines them, and tracks latency, congestion, and drops over time. This is particularly valuable for converged fabrics where RoCEv2 AI traffic coexists with storage (NVMe-oF) and management traffic.
Key Takeaway: AI training produces transient microbursts from synchronized collectives, while inference generates sustained elephant flows. Both require monitoring at sub-millisecond granularity. NDI's hardware-based flow telemetry provides the visibility needed to detect, diagnose, and resolve congestion before it impacts job completion time.
Animation Slot: Side-by-side comparison of training traffic (synchronized bursts with quiet intervals) vs. inference traffic (continuous elephant flows). Buffer fill levels shown over time for each pattern. NDI dashboard overlay showing microburst detection alerts.
Post-Quiz -- Verify Your Understanding
1. During all-reduce gradient synchronization, what type of congestion pattern occurs?
A) Sustained unicast congestion on a single link
B) Transient incast -- brief but intense synchronized bursts from many GPUs to overlapping destinations
C) Broadcast storm across all VLANs
D) No congestion because GPUs take turns transmitting
2. Why does a fabric designed solely for training assumptions fail during LLM inference workloads?
A) Inference requires different routing protocols
B) Inference generates sustained elephant flows that do not drain between phases, eventually overflowing buffers regardless of size
C) Inference uses a different Ethernet frame format
D) Inference requires more VLANs than training
3. What is the correct sequence of the NDI monitoring workflow?
4. How does NDI attribute a microburst to a specific workload?
A) By reading application-layer headers
B) Through hardware flow telemetry that traces individual RoCEv2 flows to specific GPU pairs or collective operations
C) By asking the administrator to label each flow manually
D) By correlating with DNS lookups
5. For an inference-optimized fabric, which ECN threshold strategy is recommended?
A) Disable ECN entirely and rely on PFC
B) Higher thresholds to allow maximum burst absorption
C) Lower thresholds to signal congestion early before deep buffers fill
D) No thresholds -- use tail drop instead