Chapter 10: Building Lossless Ethernet Fabrics

Learning Objectives

Section 1: High-Throughput Converged Fabric Architecture

Pre-Quiz -- Test Your Existing Knowledge

1. What happens when a packet is dropped in an RoCEv2 flow?

A) A simple TCP-style retransmission occurs
B) The packet is silently ignored by the receiver
C) A go-back-N recovery stalls the entire collective operation across all participating GPUs
D) The switch automatically re-queues the packet

2. Which four constraints must a high-throughput AI fabric satisfy simultaneously?

A) Encryption, compression, deduplication, and caching
B) Non-blocking bandwidth, ultra-low latency, zero packet loss, and deterministic performance at scale
C) High availability, redundancy, encryption, and multitenancy
D) QoS, VLAN segmentation, spanning tree, and link aggregation

3. What is a "rail-optimized" topology?

A) A ring topology connecting all GPUs in a loop
B) A topology with a dedicated spine per GPU rail to minimize east-west hops
C) A flat Layer 2 topology for maximum broadcast efficiency
D) A topology that uses railroad-style serial links between switches

4. What is the primary benefit of Dynamic Packet Prioritization (DPP)?

A) Encrypts packets based on priority level
B) Classifies flows as mice or elephant in hardware and maps short flows to a low-latency queue
C) Compresses high-priority packets for faster delivery
D) Duplicates packets across multiple paths for redundancy

5. Why is standard 5-tuple ECMP hashing problematic for RoCEv2 traffic?

A) RoCEv2 uses TCP, which does not support ECMP
B) RoCEv2 traffic shares the same UDP destination port (4791), reducing hash entropy
C) ECMP cannot be used on leaf-spine topologies
D) RoCEv2 packets are too large for ECMP to handle

Key Points

Why Lossless Matters for AI/ML

When hundreds or thousands of GPUs exchange gradients during a training run, the network is the single largest determinant of job completion time. RoCEv2 relies on lossless Ethernet because a dropped packet does not trigger a simple retransmission -- it triggers a go-back-N recovery at the transport layer that can stall an entire collective operation across every participating GPU.
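The cost asymmetry is easy to see in a back-of-the-envelope sketch. The function below, with an illustrative made-up window size, counts how many in-flight packets go-back-N resends after a single loss: the lost packet plus everything behind it in the window.

```python
def go_back_n_retransmits(window, loss_position):
    """Packets resent under go-back-N after a single loss: the lost packet
    at `loss_position` (0-based) plus every later packet in the window."""
    return window - loss_position

# One early drop in a 64-packet in-flight window forces almost the whole
# window to be resent -- and every GPU in the collective waits for it.
print(go_back_n_retransmits(window=64, loss_position=1))  # 63
```

With selective retransmission only the one lost packet would be resent; that gap is why RoCEv2 deployments depend on the fabric never dropping in the first place.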

Design Principles for AI Fabrics

A high-throughput AI fabric must satisfy four simultaneous constraints:

  1. Non-blocking bandwidth -- every port-to-port path must sustain line-rate forwarding with no oversubscription in GPU-to-GPU switching tiers.
  2. Ultra-low latency -- collective communication operations (all-reduce, all-gather, reduce-scatter) are latency-sensitive; even microseconds of additional delay compound across thousands of iterations.
  3. Zero packet loss under congestion -- the fabric must absorb bursts and signal congestion without discarding frames.
  4. Deterministic performance at scale -- these guarantees must hold whether the cluster contains 32 GPUs or 32,000.
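Constraint 1 can be checked arithmetically: a switching tier is non-blocking when server-facing bandwidth does not exceed fabric-facing bandwidth. A minimal sketch, with hypothetical port counts and speeds:

```python
def oversubscription_ratio(down_ports, down_gbps, up_ports, up_gbps):
    """Server-facing bandwidth divided by fabric-facing bandwidth on a leaf.
    A ratio of 1.0 or less means the tier is non-blocking."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# Illustrative leaf: 32 x 400G down to GPUs, 16 x 800G up to spines
print(oversubscription_ratio(32, 400, 16, 800))  # 1.0 -> non-blocking
```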

Topologies for AI/ML Workloads

| Topology | Scale | Tiers | Typical Use Case |
| --- | --- | --- | --- |
| 2-tier leaf-spine | Up to ~512 GPUs | Leaf + Spine | Single-pod training clusters |
| 3-tier leaf-spine-superspine | 512 to 10,000+ GPUs | Leaf + Spine + Superspine | Multi-pod, large-scale training |
| Rail-optimized | Varies | Dedicated spine per GPU rail | Minimizes east-west hops for collective ops |

Multi-Tier Leaf-Spine Topology

graph TD
  subgraph Superspine Tier
    SS1[Superspine-1]
    SS2[Superspine-2]
  end
  subgraph Spine Tier
    S1[Spine-1]
    S2[Spine-2]
    S3[Spine-3]
    S4[Spine-4]
  end
  subgraph Leaf Tier - Pod A
    L1[Leaf-1]
    L2[Leaf-2]
  end
  subgraph Leaf Tier - Pod B
    L3[Leaf-3]
    L4[Leaf-4]
  end
  subgraph GPU Servers
    G1[GPU Server 1]
    G2[GPU Server 2]
    G3[GPU Server 3]
    G4[GPU Server 4]
  end
  SS1 --- S1
  SS1 --- S2
  SS1 --- S3
  SS1 --- S4
  SS2 --- S1
  SS2 --- S2
  SS2 --- S3
  SS2 --- S4
  S1 --- L1
  S1 --- L2
  S2 --- L1
  S2 --- L2
  S3 --- L3
  S3 --- L4
  S4 --- L3
  S4 --- L4
  L1 --- G1
  L2 --- G2
  L3 --- G3
  L4 --- G4

Buffer Management and Queue Design

Buffers are the shock absorbers of a data center switch. When multiple flows converge on the same egress port simultaneously -- a pattern called incast -- the switch must temporarily store excess packets in buffers rather than dropping them.
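How much buffer an incast event requires can be estimated as the arriving bytes minus what the egress port drains during the burst. A rough sketch with illustrative numbers (the sender counts, burst sizes, and durations are made up for the example):

```python
def buffer_needed_for_incast(senders, burst_bytes_each, egress_gbps, burst_us):
    """Bytes a switch must buffer when `senders` flows converge on one egress
    port: total arriving bytes minus what the port drains during the burst."""
    arriving = senders * burst_bytes_each
    drained = egress_gbps * 125 * burst_us  # 1 Gb/s drains 125 bytes per us
    return max(0, arriving - drained)

# 32 senders each bursting 1 MB into a single 400G port over 100 us
print(buffer_needed_for_incast(32, 1_000_000, 400, 100))  # 27000000 -> 27 MB
```

The point of the sketch: even a short, fully synchronized burst can demand tens of megabytes of buffering on a single port, which is why shared-memory and HBM-backed designs matter.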

| Feature | Function | Benefit |
| --- | --- | --- |
| DPP | Classifies flows as mice (short) or elephant (long); maps mice to low-latency queue | Low latency for short flows even during port congestion |
| AFD | Identifies individual 5-tuple flows and assigns per-flow drop probabilities | Prevents elephant flows from monopolizing shared buffers |
| Shared Memory | On-die packet buffer shared across all ports (e.g., 256 MB on Nexus 9364E-SG2) | Consistent burst absorption without per-port buffer starvation |
| Deep Buffer / HBM | Extended buffer using High Bandwidth Memory (e.g., 80 MB on-die + 8 GB HBM on Nexus 9332D-H2R) | Absorbs sustained congestion from inference workloads and large-scale incast |

ECMP Enhancements for RoCEv2: Standard 5-tuple ECMP hashing suffers from low entropy with RoCEv2 because all traffic shares UDP destination port 4791. Nexus 9000 switches with CloudScale H2R and H1 ASICs parse Base Transport Header (BTH) fields -- including opcode, destination queue pair, and packet sequence number -- to distribute flows more evenly across spine uplinks.
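The entropy problem can be demonstrated with a toy hash. The sketch below uses CRC32 as a stand-in for the ASIC's hash function and hypothetical addresses: 16 queue pairs between the same pair of servers collapse onto one uplink under a degenerate 5-tuple, while folding in the BTH destination queue pair spreads them out.

```python
import zlib

def ecmp_path(fields, n_links):
    """Pick an uplink by hashing header fields (CRC32 stands in for the
    switch ASIC's hash function)."""
    key = "|".join(str(f) for f in fields)
    return zlib.crc32(key.encode()) % n_links

n_links = 8
# 16 RoCEv2 queue pairs between the same two servers sharing an identical
# 5-tuple (src IP, dst IP, protocol, src port, dst port 4791).
five_tuple = {ecmp_path(("10.0.0.1", "10.0.0.2", 17, 4791, 4791), n_links)
              for _ in range(16)}
# BTH-aware hashing adds the destination queue pair to the hash key.
bth_aware = {ecmp_path(("10.0.0.1", "10.0.0.2", 17, 4791, 4791, qp), n_links)
             for qp in range(16)}
print(len(five_tuple), len(bth_aware))  # 1 uplink used vs. several
```

Many NICs also vary the UDP source port per queue pair to add entropy; the sketch shows the degenerate case that BTH-aware hashing addresses in hardware.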

Key Takeaway: A lossless AI fabric requires non-blocking leaf-spine topologies, high-radix switches that minimize tier count, and intelligent buffer management that can absorb synchronized incast bursts without dropping packets. ECMP must be enhanced with RoCEv2-aware hashing to avoid flow polarization.
Animation Slot: Incast burst arriving at a switch egress port -- packets queued in shared buffer, DPP classifying mice vs. elephant flows, AFD assigning per-flow drop probabilities.
Post-Quiz -- Verify Your Understanding

1. Why does a single dropped RoCEv2 packet have such a severe impact on AI training?

A) It causes the entire job to restart from checkpoint
B) It triggers go-back-N recovery that stalls the entire collective operation across all participating GPUs
C) It corrupts the model weights permanently
D) It only affects the single GPU that sent the packet

2. The Nexus 9364E-SG2 has 64 ports of 800 GbE. How does its high-radix design benefit AI fabrics?

A) It allows more VLANs per port
B) It reduces the number of switching tiers needed, lowering latency and simplifying design
C) It increases the encryption throughput per port
D) It eliminates the need for ECMP

3. Which Nexus 9000 buffer feature uses High Bandwidth Memory to absorb sustained congestion?

A) Dynamic Packet Prioritization (DPP)
B) Approximate Fair Drop (AFD)
C) Deep Buffer / HBM (e.g., Nexus 9332D-H2R with 8 GB HBM)
D) Shared Memory Architecture

4. What BTH fields do Nexus 9000 CloudScale ASICs parse to improve RoCEv2 ECMP hashing?

A) Source MAC address and VLAN tag
B) Opcode, destination queue pair, and packet sequence number
C) TTL and IP identification fields
D) TCP window size and sequence number

5. What traffic pattern makes AI training particularly prone to buffer overflow?

A) Unicast streaming from a single source
B) Incast -- hundreds of GPUs transmitting simultaneously to overlapping destinations during collective operations
C) Multicast replication across VLANs
D) Broadcast storms from spanning tree convergence

Section 2: Lossless Fabric Design and Configuration

Pre-Quiz -- Test Your Existing Knowledge

1. What does Priority Flow Control (PFC) do differently from legacy IEEE 802.3x PAUSE?

A) PFC pauses all traffic on a link, while 802.3x is per-priority
B) PFC pauses traffic on a per-Class-of-Service basis, while 802.3x halts all traffic
C) PFC only works on 10 GbE links
D) PFC drops packets instead of pausing them

2. Why is ECN preferred over PFC as a congestion signal?

A) ECN drops packets faster than PFC can pause them
B) ECN marks packets to signal senders to reduce rate, avoiding the latency penalty of pausing
C) ECN requires no switch configuration
D) ECN is only used for best-effort traffic

3. What is the role of WRED when combined with ECN?

A) WRED encrypts packets before ECN marking
B) WRED probabilistically marks packets before queue saturation, providing early congestion signaling
C) WRED replaces ECN entirely in modern fabrics
D) WRED only operates on best-effort queues

4. What does PFC Watchdog (PFCWD) protect against?

A) Unauthorized access to switch management interfaces
B) PFC storms where a malfunctioning NIC continuously sends pause frames, cascading through the fabric
C) Spanning tree loops in Layer 2 networks
D) Excessive multicast traffic on lossless queues

5. Which CoS value is commonly used for the lossless RoCEv2 traffic class?

A) CoS 0
B) CoS 1
C) CoS 3
D) CoS 7

Key Points

The Three Pillars of Lossless Ethernet

Lossless Ethernet is a fabric-wide property: it must be configured consistently on every hop between the source NIC and the destination NIC.

  1. Priority Flow Control (PFC) -- IEEE 802.1Qbb: Pauses a specific Class of Service on an ingress port when receive buffers approach capacity. Unlike legacy 802.3x PAUSE (which halts all traffic), PFC is per-priority -- only the lossless class is paused while best-effort continues.
  2. Explicit Congestion Notification (ECN) -- RFC 3168: A switch marks the IP ECN field (CE -- Congestion Experienced) in transit packets rather than dropping them. The receiver echoes this back, and the sender reduces its rate. ECN is preferred because it avoids the latency penalty of PFC pauses.
  3. Weighted Random Early Detection (WRED): Probabilistically marks (ECN) or drops packets before queue saturation. Combined with ECN, WRED provides early congestion signaling that keeps queues short.
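The WRED/ECN interaction in pillar 3 is a linear ramp: below the minimum threshold nothing is marked, above the maximum everything is, and in between the mark probability rises linearly. A sketch using the 150 KB / 3000 KB thresholds from the configuration walkthrough in this section:

```python
def wred_mark_probability(queue_kb, min_kb=150, max_kb=3000, max_prob=1.0):
    """ECN mark probability for a given queue depth: 0 below min_kb,
    max_prob above max_kb, and linear in between."""
    if queue_kb <= min_kb:
        return 0.0
    if queue_kb >= max_kb:
        return max_prob
    return max_prob * (queue_kb - min_kb) / (max_kb - min_kb)

print(wred_mark_probability(150))   # 0.0 -> queue short, nothing marked
print(wred_mark_probability(1575))  # 0.5 -> halfway, half of packets marked
print(wred_mark_probability(3000))  # 1.0 -> every packet gets the CE bit
```

The `drop-probability 100` in the example configuration corresponds to `max_prob=1.0` here: at the maximum threshold, every packet in the lossless class is ECN-marked rather than dropped.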

Hierarchical Congestion Control

stateDiagram-v2
  [*] --> Normal: Traffic flowing
  Normal --> ECN_WRED_Active: Queue depth rises above WRED min threshold
  ECN_WRED_Active --> Normal: Senders reduce rate after ECN feedback
  ECN_WRED_Active --> PFC_Triggered: Queue depth continues rising despite ECN
  PFC_Triggered --> ECN_WRED_Active: Queue drains below PFC threshold
  PFC_Triggered --> PFCWD_Activated: Pause persists beyond watchdog interval
  PFCWD_Activated --> Normal: Storm cleared, queue flushed

NX-OS Lossless QoS Configuration Walkthrough

The following steps enable lossless transport for RoCEv2 on Cisco Nexus 9000 using CoS 3 as the no-drop class and CoS 0 as best-effort.

Configuration Workflow

flowchart TD
  A["Step 1: Enable PFC\nfeature priority-flow-control\nPFC mode on per-interface"] --> B["Step 2: Define QoS Classification\nclass-map: match CoS 3\npolicy-map: set qos-group 3"]
  B --> C["Step 3: Configure Queuing\nno-drop on lossless queue\nWRED + ECN thresholds\nmin 150 KB / max 3000 KB"]
  C --> D["Step 4: Apply Policies\nIngress: QoS classification\nEgress: Queuing policy"]
  D --> E["Step 5: Enable PFC Watchdog\npriority-flow-control\nwatch-dog-interval on"]
  E --> F["Step 6: Validate\nPFC frame counters\nECN mark counters\nZero-drop confirmation"]
  style A fill:#2d5986,color:#fff
  style B fill:#2d5986,color:#fff
  style C fill:#2d5986,color:#fff
  style D fill:#2d5986,color:#fff
  style E fill:#2d5986,color:#fff
  style F fill:#1a7a3a,color:#fff

Step 1: Enable PFC globally and on interfaces.

feature priority-flow-control

interface Ethernet1/1
  priority-flow-control mode on

Step 2: Define QoS classification.

class-map type qos match-all class-rocev2
  match cos 3

policy-map type qos rocev2-ingress
  class class-rocev2
    set qos-group 3
  class class-default
    set qos-group 0

Step 3: Configure queuing with no-drop and ECN/WRED.

policy-map type queuing rocev2-egress
  class type queuing c-out-8q-q3
    priority level 1
    no-drop
    random-detect minimum-threshold 150 kbytes maximum-threshold 3000 kbytes drop-probability 100 weight 0 ecn
  class type queuing c-out-8q-q-default
    bandwidth remaining percent 100

Step 4: Apply policies to interfaces.

interface Ethernet1/1
  service-policy type qos input rocev2-ingress
  service-policy type queuing output rocev2-egress

Step 5: Enable PFC Watchdog.

priority-flow-control watch-dog-interval on

Fabric-Wide QoS Consistency

| Requirement | Implementation |
| --- | --- |
| Consistent CoS marking | All NICs, hypervisors, and switches must agree on which CoS value is lossless (commonly CoS 3) |
| PFC on every hop | Enable PFC Tx and Rx on all interfaces in the no-drop path |
| Matching MTU | Jumbo frames (9216 bytes) end-to-end; a single 1500-byte link will fragment or drop oversized frames |
| DCBX negotiation | Data Center Bridging Exchange ensures NIC and switch agree on PFC parameters automatically |
| Uniform WRED/ECN thresholds | Consistent ECN marking thresholds across leaf and spine tiers to avoid asymmetric congestion signaling |

Testing and Validation

  1. PFC frame verification -- generate controlled congestion and confirm PFC pause frames using show interface priority-flow-control.
  2. ECN marking verification -- capture packets at the receiver and confirm CE bit; use show queuing interface for WRED ECN counters.
  3. Zero-drop confirmation -- under sustained load, verify drop counters remain at zero for the no-drop class.
  4. PFC Watchdog test -- simulate a PFC storm and confirm PFCWD activates and isolates the offending queue.
  5. ECMP balance validation -- verify RoCEv2 flow distribution across spine uplinks using show routing hash with BTH-aware hashing.
Key Takeaway: Lossless Ethernet requires PFC and ECN configured consistently on every hop. ECN is the preferred congestion signal; PFC is the safety net. PFC Watchdog must always be enabled to prevent a single malfunctioning host from taking down the entire fabric through a PFC storm.
Animation Slot: Step-by-step PFC pause frame propagation -- queue fills, pause sent upstream, sender halts, queue drains, resume signal sent. Show ECN marking flow in parallel.
Post-Quiz -- Verify Your Understanding

1. In the NX-OS configuration, what does the no-drop keyword under the queuing policy accomplish?

A) It disables logging for dropped packets
B) It designates the queue as lossless, ensuring PFC will pause upstream rather than dropping packets
C) It enables packet duplication for redundancy
D) It sets the queue to best-effort mode

2. What happens if a single link in the lossless path has a default 1500-byte MTU while the rest use 9216?

A) Traffic is automatically compressed to fit
B) Oversized frames will be fragmented or dropped, breaking the lossless guarantee
C) PFC automatically adjusts the MTU
D) ECMP re-routes around the mismatched link

3. What is the default PFC Watchdog interval on Nexus 9000, and what happens when it triggers?

A) 10 ms; it reboots the switch
B) 100 ms; it flushes the affected queue and drops new packets destined for it until the storm clears
C) 1 second; it disables PFC on the interface permanently
D) 500 ms; it sends an SNMP trap but takes no action

4. In the hierarchical congestion control model, what is the correct escalation order?

A) PFC first, then ECN if PFC fails
B) ECN/WRED signals first (early marking), then PFC as safety net, then PFCWD if PFC storms form
C) PFCWD first, then ECN, then PFC
D) Tail drop first, then WRED, then PFC

5. What WRED/ECN thresholds are configured in the example for the lossless queue?

A) Minimum 50 KB, maximum 500 KB
B) Minimum 150 KB, maximum 3000 KB with ECN marking
C) Minimum 1 MB, maximum 10 MB
D) No thresholds -- WRED is disabled for lossless queues

Section 3: Congestion Visibility with Nexus Dashboard Insights

Pre-Quiz -- Test Your Existing Knowledge

1. What is Nexus Dashboard Insights (NDI)?

A) A switch operating system that replaces NX-OS
B) An analytics and assurance platform that collects telemetry from Nexus switches for congestion monitoring
C) A protocol for encrypting data center traffic
D) A hardware module installed in Nexus 9000 switches

2. What is a "congestion tree" in the context of PFC?

A) A spanning tree variant used in lossless fabrics
B) A directed subgraph of switches and hosts affected by PFC pause propagation from a congestion root
C) A multicast distribution tree for RoCEv2 traffic
D) A routing tree computed by OSPF

3. How does NDI collect telemetry data?

A) Through SNMP polling every 60 seconds
B) Through hardware flow telemetry built into the Nexus 9000 ASIC
C) Through NetFlow export from spine switches only
D) Through syslog message parsing

4. What PFC metric helps identify whether pause events are brief or sustained?

A) PFC frame size
B) PFC duration tracking -- how long each pause event lasts
C) PFC source MAC address
D) PFC VLAN membership

5. In a congestion tree, what is an "innocent bystander"?

A) A switch that is powered off
B) A host whose traffic is paused due to PFC propagation, even though its traffic is unrelated to the original congestion
C) A monitoring probe that captures traffic passively
D) A spine switch with zero utilization

Key Points

NDI for Congestion Monitoring

Configuring a lossless fabric is only half the battle -- you must also see what the fabric is doing in real time. NDI collects telemetry data from Nexus switches and analyzes it with advanced algorithms for baselining, anomaly detection, and forecasting.

| Capability | Description |
| --- | --- |
| Real-time congestion scoring | Assigns a numerical congestion score to each device, interface, and flow based on buffer utilization, ECN marks, and PFC events |
| Anomaly detection | Proactively identifies performance bottlenecks using ML-driven baselines and raises alerts with suggested remediation |
| AI job observability | End-to-end visibility from network layer through to GPU utilization, correlating network events with training job performance |
| Sustainability insights | Monitors power consumption and provides optimization recommendations for energy-efficient operation |
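NDI's exact scoring formula is not public, but a weighted combination of the three inputs named above conveys the idea. Everything below -- the weights and the 0-100 scale -- is a hypothetical illustration, not the product's algorithm:

```python
def congestion_score(buffer_util, ecn_mark_rate, pfc_pause_rate,
                     weights=(0.4, 0.3, 0.3)):
    """Illustrative 0-100 score from normalized (0-1) buffer utilization,
    ECN mark rate, and PFC pause rate. The weights are hypothetical."""
    wb, we, wp = weights
    return round(100 * (wb * buffer_util + we * ecn_mark_rate
                        + wp * pfc_pause_rate), 1)

print(congestion_score(0.2, 0.0, 0.0))  # 8.0  -> healthy interface
print(congestion_score(0.9, 0.6, 0.8))  # 78.0 -> sustained congestion
```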

ECN and PFC Analytics

ECN Metrics: per-interface and per-flow counts of CE-marked packets, tracked against total traffic to show how often senders are being asked to slow down.

PFC Metrics: pause frame counters (sent and received) per interface, plus pause duration tracking that distinguishes brief pauses from sustained ones.

Congestion Tree Visualization

When PFC pause frames propagate upstream from a congestion point, they create a tree-shaped pattern of affected switches and links. The congestion root is the bottleneck source (often a slow receiver or oversubscribed link), and the congestion branches include all upstream switches and hosts whose traffic is paused as a result.

PFC Pause Propagation -- Congestion Tree Formation

flowchart LR
  SrvA["Server A\n(Sender)"] -->|"RoCEv2 traffic"| L1["Leaf-1"]
  SrvB["Server B\n(Sender)"] -->|"RoCEv2 traffic"| L1
  L1 -->|"Uplink"| Sp1["Spine-1"]
  Sp1 -->|"Downlink"| L3["Leaf-3"]
  Sp1 -->|"Downlink"| L4["Leaf-4"]
  L3 -->|"Delivery"| SrvX["Server X\n(Slow Receiver)\nCONGESTION ROOT"]
  L4 -->|"Delivery"| SrvY["Server Y\n(Innocent Bystander)"]
  SrvX -. "Buffer full" .-> L3
  L3 -. "PFC Pause" .-> Sp1
  Sp1 -. "PFC Pause" .-> L1
  Sp1 -. "PFC Pause" .-> L4
  L1 -. "PFC Pause" .-> SrvA
  L1 -. "PFC Pause" .-> SrvB
  style SrvX fill:#cc3333,color:#fff
  style SrvY fill:#cc8833,color:#fff
  style SrvA fill:#cc8833,color:#fff
  style SrvB fill:#cc8833,color:#fff
  style Sp1 fill:#996633,color:#fff
  style L1 fill:#996633,color:#fff
  style L3 fill:#996633,color:#fff
  style L4 fill:#996633,color:#fff
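The propagation pattern in the diagram is a graph traversal: starting from the congestion root, follow each pause edge upstream to collect every affected device. A minimal sketch, with node names matching the diagram:

```python
from collections import deque

def congestion_tree(pause_edges, root):
    """Breadth-first walk of PFC pause propagation: `pause_edges` maps each
    node to the upstream nodes it sends pause frames to."""
    affected, queue = {root}, deque([root])
    while queue:
        node = queue.popleft()
        for upstream in pause_edges.get(node, []):
            if upstream not in affected:
                affected.add(upstream)
                queue.append(upstream)
    return affected

pause_edges = {
    "ServerX": ["Leaf-3"],            # slow receiver's buffer fills first
    "Leaf-3":  ["Spine-1"],
    "Spine-1": ["Leaf-1", "Leaf-4"],  # pause spreads to BOTH downstream leaves
    "Leaf-1":  ["ServerA", "ServerB"],
}
tree = congestion_tree(pause_edges, "ServerX")
print(sorted(tree))
# Leaf-4 ends up paused even though Server Y's traffic is unrelated to the
# congestion root -- that is the innocent-bystander effect.
```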

NDI visualizes these congestion trees on a topology map, allowing administrators to identify the congestion root, distinguish innocent bystanders from actual contributors, and target remediation at the true bottleneck rather than its downstream symptoms.

Key Takeaway: Nexus Dashboard Insights transforms raw switch telemetry into actionable congestion intelligence. Its congestion tree visualization is especially critical in lossless fabrics, where a single slow receiver can cascade PFC pauses across the entire fabric topology.
Animation Slot: Congestion tree forming in real time -- slow receiver triggers PFC pause propagation upstream through spine to leaves, highlighting innocent bystanders. NDI dashboard overlay showing congestion scores rising.
Post-Quiz -- Verify Your Understanding

1. What three data sources does NDI use to compute a congestion score?

A) CPU utilization, memory usage, and disk I/O
B) Buffer utilization, ECN marks, and PFC events
C) SNMP traps, syslog messages, and NetFlow records
D) BGP route count, OSPF adjacency state, and ARP table size

2. Server X is a slow receiver causing congestion. Server Y shares a spine with Server X but has unrelated traffic. What happens to Server Y?

A) Server Y is unaffected because PFC is per-priority
B) Server Y's lossless traffic is paused as an innocent bystander because PFC pauses propagate through the shared spine
C) Server Y's traffic is rerouted through a different spine automatically
D) Server Y's switch reboots due to the PFC storm

3. How does NDI detect anomalies in congestion behavior?

A) By comparing against static thresholds configured by the administrator
B) By using ML-driven baselines and raising alerts when metrics deviate from learned normal behavior
C) By running ping tests to every host every 5 seconds
D) By analyzing BGP routing table changes

4. At what granularity does NDI present telemetry data?

A) Only at the device level
B) Per-device, per-interface, and per-flow granularity
C) Only at the fabric level (aggregate)
D) Per-VLAN only

5. What critical operational question can PFC duration tracking help answer?

A) Which VLANs are configured on each port
B) Whether pause events are brief (normal burst absorption) or sustained (indicating a problem requiring intervention)
C) How many routes are in the routing table
D) Which firmware version is running on each switch

Section 4: AI/ML Traffic Flow Monitoring

Pre-Quiz -- Test Your Existing Knowledge

1. What is the dominant traffic pattern during AI training gradient synchronization?

A) North-south unicast from servers to clients
B) All-reduce -- every GPU sends to and receives from every other GPU, creating massive east-west traffic
C) Multicast from a single parameter server
D) Sequential point-to-point between adjacent GPUs only

2. How does inference traffic differ from training traffic?

A) Inference produces no network traffic
B) Inference generates sustained elephant flows (e.g., KV-cache transfers) rather than short synchronized bursts
C) Inference uses only broadcast traffic
D) Inference traffic is identical to training traffic

3. What is a microburst?

A) A sustained period of high link utilization lasting minutes
B) A short-lived traffic surge (microseconds to milliseconds) that exceeds link capacity even when average utilization is low
C) A type of packet corruption caused by electromagnetic interference
D) A brief power surge to the switch ASIC

4. Why does traditional polling-based monitoring miss microbursts?

A) Polling uses too much bandwidth
B) Polling samples every 30-60 seconds, while microbursts last microseconds to milliseconds
C) Polling only works on spine switches
D) Polling is incompatible with RoCEv2

5. For inference-optimized fabric design, what buffer strategy is recommended?

A) No buffers -- drop all excess packets
B) Shallow buffers with fast drain
C) Deep buffers (HBM) for sustained absorption of elephant flows
D) Per-VLAN dedicated buffers

Key Points

Traffic Patterns in AI Workloads

| Traffic Pattern | Workload Stage | Characteristics | Network Impact |
| --- | --- | --- | --- |
| All-Reduce | Training (gradient sync) | Every GPU sends to and receives from every other GPU | Massive east-west bandwidth; synchronized incast |
| All-Gather | Training (parameter broadcast) | Each GPU contributes a shard; all GPUs receive full result | High fan-out followed by high fan-in |
| Reduce-Scatter | Training (distributed reduction) | Reduction result scattered across GPUs | Balanced but bursty bidirectional traffic |
| Pipeline parallel | Training/Inference | Data flows sequentially through model stages on different GPUs | Sustained point-to-point flows; latency-sensitive |
| KV-cache transfer | Inference (prefill/decode) | Large context windows require distributing key-value caches across nodes | Sustained elephant flows that can overwhelm buffers |

Training vs. Inference: Different Stress on the Network

Training: Transient Incast. GPUs execute a compute phase (forward/backward pass) followed by a communication phase (gradient synchronization). The communication phase produces highly synchronized bursts where many GPUs transmit simultaneously. This transient incast can exceed link capacity by 10x or more for microseconds. Training congestion is self-limiting: buffers absorb the burst, traffic drains during the compute phase.

Inference: Sustained Elephant Flows. Inference (especially for LLMs with extended context windows) creates sustained, long-lived flows as models distribute memory and KV-caches. These elephant flows do not drain between phases. A fabric designed solely for training assumptions will experience sustained congestion during inference.
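The two load shapes produce very different buffer occupancy. A toy queue model (arbitrary units, illustrative rates) shows training bursts draining fully during each compute phase, while a sustained inference-style flow ratchets the queue upward:

```python
def buffer_trace(arrival_rates, drain_rate, capacity):
    """Queue depth per tick: arrivals minus drain, clamped to [0, capacity]."""
    depth, trace = 0, []
    for rate in arrival_rates:
        depth = min(capacity, max(0, depth + rate - drain_rate))
        trace.append(depth)
    return trace

drain, cap = 10, 100
training = [30, 30, 0, 0, 0, 0] * 3  # synchronized bursts, then compute phase
inference = [14] * 18                # sustained elephant flow, no quiet phase

print(max(buffer_trace(training, drain, cap)))   # 40 -> peaks, then drains to 0
print(buffer_trace(inference, drain, cap)[-1])   # 72 -> still climbing at the end
```

Self-limiting bursts peak and recover each cycle; the sustained flow never gets a quiet phase in which to drain, which is why the inference column of the table below favors deep buffers and earlier ECN signaling.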

| Design Parameter | Training Optimized | Inference Optimized |
| --- | --- | --- |
| Buffer strategy | Shallow buffers with fast drain | Deep buffers (HBM) for sustained absorption |
| ECN thresholds | Higher thresholds (allow burst absorption) | Lower thresholds (signal early before buffers fill) |
| PFC reliance | Occasional PFC acceptable | Minimize PFC; rely on ECN/WRED |
| Link capacity | Sized for peak burst bandwidth | Sized for sustained throughput |

Microburst Detection

A microburst is a short-lived surge of traffic -- typically microseconds to low milliseconds -- that exceeds link capacity even when average utilization appears low. Traditional polling-based monitoring (sampling every 30-60 seconds) completely misses them.
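The sampling-rate argument is easy to quantify. With illustrative numbers, a 3 ms line-rate microburst on a 400G link barely moves a 30-second average:

```python
def average_utilization(samples):
    return sum(samples) / len(samples)

# 1 ms samples over 30 s: background load of 20 Gb/s, plus one 3 ms
# microburst at the full 400 Gb/s line rate.
samples_gbps = [20] * 29_997 + [400] * 3

print(round(average_utilization(samples_gbps), 1))  # 20.0 -> the poll sees nothing
print(max(samples_gbps))                            # 400  -> the actual peak
```

A 30-second poll reports roughly 5% utilization while the link briefly saturated; only sub-millisecond telemetry reveals the burst.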

NDI Microburst Detection uses hardware telemetry from the Nexus 9000 ASIC at sub-millisecond granularity, tracing individual RoCEv2 flows to the specific GPU pairs or collective operations responsible for each burst.

NDI Congestion Monitoring Workflow for AI/ML Traffic

flowchart TD
  A["1. Baseline\nEstablish normal ECN rates,\nPFC counts, microburst frequency\nper AI job type"] --> B["2. Alert\nAnomaly thresholds trigger on\ndeviation from baseline\ne.g., 3x PFC rate increase"]
  B --> C["3. Diagnose\nCongestion tree visualization\nidentifies root cause;\nflow-level analytics pinpoints\naffected workload"]
  C --> D["4. Tune\nAdjust ECN/WRED thresholds,\nrebalance ECMP paths,\nor redistribute workloads"]
  D --> A
  style A fill:#2d7d5e,color:#fff
  style B fill:#cc8833,color:#fff
  style C fill:#cc3333,color:#fff
  style D fill:#2d5986,color:#fff

NDI's traffic analytics also auto-discovers services by well-known Layer 4 ports or user-defined custom categories, baselines them, and tracks latency, congestion, and drops over time. This is particularly valuable for converged fabrics where RoCEv2 AI traffic coexists with storage (NVMe-oF) and management traffic.

Key Takeaway: AI training produces transient microbursts from synchronized collectives, while inference generates sustained elephant flows. Both require monitoring at sub-millisecond granularity. NDI's hardware-based flow telemetry provides the visibility needed to detect, diagnose, and resolve congestion before it impacts job completion time.
Animation Slot: Side-by-side comparison of training traffic (synchronized bursts with quiet intervals) vs. inference traffic (continuous elephant flows). Buffer fill levels shown over time for each pattern. NDI dashboard overlay showing microburst detection alerts.
Post-Quiz -- Verify Your Understanding

1. During all-reduce gradient synchronization, what type of congestion pattern occurs?

A) Sustained unicast congestion on a single link
B) Transient incast -- brief but intense synchronized bursts from many GPUs to overlapping destinations
C) Broadcast storm across all VLANs
D) No congestion because GPUs take turns transmitting

2. Why does a fabric designed solely for training assumptions fail during LLM inference workloads?

A) Inference requires different routing protocols
B) Inference generates sustained elephant flows that do not drain between phases, eventually overflowing buffers regardless of size
C) Inference uses a different Ethernet frame format
D) Inference requires more VLANs than training

3. What is the correct sequence of the NDI monitoring workflow?

A) Alert, Diagnose, Baseline, Tune
B) Baseline, Alert, Diagnose, Tune (then repeat)
C) Tune, Baseline, Alert, Diagnose
D) Diagnose, Tune, Alert, Baseline

4. How does NDI attribute a microburst to a specific workload?

A) By reading application-layer headers
B) Through hardware flow telemetry that traces individual RoCEv2 flows to specific GPU pairs or collective operations
C) By asking the administrator to label each flow manually
D) By correlating with DNS lookups

5. For an inference-optimized fabric, which ECN threshold strategy is recommended?

A) Disable ECN entirely and rely on PFC
B) Higher thresholds to allow maximum burst absorption
C) Lower thresholds to signal congestion early before deep buffers fill
D) No thresholds -- use tail drop instead

Answer Explanations