Chapter 9: High-Performance Networking — RDMA, RoCE, and QoS

Learning Objectives

Section 1: RDMA Fundamentals

Pre-Quiz: RDMA Fundamentals

1. What is the primary benefit of RDMA over traditional TCP/IP networking?

It encrypts data in transit by default
It bypasses the OS kernel to transfer data directly between memory regions
It uses smaller packet sizes for faster transmission
It compresses data before sending across the network

2. Why is RDMA critical for distributed AI training?

It provides built-in model checkpointing
It eliminates the need for GPU-to-GPU communication
It reduces latency and CPU overhead for gradient exchange between GPUs
It automatically scales the number of GPUs in a cluster

3. Which RDMA implementation uses a dedicated physical fabric with its own switches and cables?

RoCE v1
RoCE v2
InfiniBand
iWARP

4. In the RDMA architecture, what component handles transport reliability and segmentation?

The operating system kernel
The application layer
The NIC hardware (HCA / RNIC)
The TCP/UDP stack

5. Approximately how much TCO savings can RoCEv2 offer compared to InfiniBand?

10%
25%
55%
80%

Key Points

What Is RDMA?

Remote Direct Memory Access (RDMA) is a technology that enables direct memory-to-memory data transfers between two computers without involving the operating system of either machine -- and, for one-sided operations such as RDMA Read and Write, without involving the remote CPU at all. By bypassing the kernel network stack entirely, RDMA dramatically reduces latency, eliminates most per-packet CPU overhead, and increases throughput -- three properties essential for distributed AI training workloads.

Think of traditional networking like sending a package through a corporate mailroom: your data leaves the application, passes through several departments (the kernel, the TCP/IP stack, device drivers), gets wrapped in packaging at each stop, and eventually reaches the wire. RDMA is more like a private conveyor belt installed directly between two offices -- the data moves straight from the sender's desk to the receiver's desk with no intermediary handling.

Why RDMA Matters for AI/ML

In AI/ML clusters, GPUs across multiple servers must exchange gradient data at extremely high speed during distributed training. NVIDIA's GPUDirect RDMA enables GPU-to-GPU memory transfers over the network without staging data through system memory or the host CPU. This is critical because gradient exchange happens at every training step: any added latency or host-CPU overhead in the collective operations directly stretches job completion time across the entire cluster.

The RDMA Protocol Stack

```mermaid
flowchart LR
    subgraph Traditional["Traditional TCP/IP Path"]
        direction TB
        A1["Application"] --> A2["Socket API"]
        A2 --> A3["TCP/UDP Transport"]
        A3 --> A4["IP Layer"]
        A4 --> A5["Device Driver"]
        A5 --> A6["NIC Hardware"]
    end
    subgraph RDMA["RDMA Path"]
        direction TB
        B1["Application"] --> B2["User-Space Verbs API"]
        B2 --> B3["RNIC / HCA\n(handles transport,\nsegmentation, reassembly\nin hardware)"]
    end
    style Traditional fill:#f9e0e0,stroke:#c0392b
    style RDMA fill:#e0f9e0,stroke:#27ae60
```

In a conventional stack, application data traverses the socket API, TCP/UDP transport, IP layer, and device driver before reaching the NIC hardware. Each layer involves kernel context switches and memory copies. With RDMA, applications interact directly with the NIC through a user-space verbs API. The NIC itself handles transport reliability, segmentation, and reassembly in hardware.

Three RDMA Implementations

| Feature | InfiniBand | RoCE v1 | RoCE v2 |
|---|---|---|---|
| Year Introduced | ~1999 | 2010 | 2014 |
| Layer | Dedicated L1-L4 fabric | Ethernet L2 only | Ethernet + UDP/IP (L3 routable) |
| EtherType / Port | N/A (own physical layer) | EtherType 0x8915 | UDP destination port 4791 |
| Routability | IB subnet routing (OpenSM) | Same L2 broadcast domain only | Full IP routing across subnets |
| Lossless Guarantee | Built-in credit-based flow control | Requires PFC/ECN on Ethernet | Requires PFC/ECN on Ethernet |
| Typical Use Case | HPC, scientific computing | Legacy/niche deployments | AI/ML data center fabrics |

InfiniBand vs. Ethernet RDMA: The Trade-Off

InfiniBand remains the gold standard for raw RDMA performance. Its dedicated fabric provides the lowest, most consistent latency because every component is purpose-built for RDMA. Flow control is handled natively through a credit-based mechanism, so there is no need for complex PFC/ECN configuration.

However, InfiniBand requires a completely separate network infrastructure with specialized switches, cables, and management tools (such as the OpenSM subnet manager). Ethernet-based RoCEv2 deployments can achieve up to 55% total cost of ownership (TCO) savings -- including 56% OpEx savings and 55% CapEx savings over three years. For organizations already invested in Ethernet infrastructure, RoCEv2 allows them to add RDMA capabilities without building a parallel network.

Animation: Side-by-side comparison of traditional TCP/IP data path (multiple kernel copies) vs. RDMA zero-copy path, showing packets moving from application memory to wire
Post-Quiz: RDMA Fundamentals

1. Which API do RDMA applications use to communicate with the NIC?

Socket API
User-space verbs API
POSIX file I/O API
REST API

2. What makes RoCE v1 impractical for modern leaf-spine data center designs?

It requires InfiniBand switches
It only operates at Layer 2 and cannot cross router boundaries
It does not support lossless transport
It is limited to 10G speeds

3. NVIDIA GPUDirect RDMA enables what specific capability?

GPU-to-CPU memory transfers over PCIe
GPU-to-GPU memory transfers over the network without staging through system memory
Direct disk-to-GPU data loading
Automatic GPU memory defragmentation

4. Which flow control mechanism does InfiniBand use natively (without additional configuration)?

PFC (Priority Flow Control)
ECN (Explicit Congestion Notification)
Credit-based flow control
TCP window scaling

5. In the context of RDMA, what does HCA stand for?

High-Capacity Adapter
Host Channel Adapter
Hardware Control Agent
Host Convergence Accelerator

Section 2: RoCE and RoCEv2 Protocols

Pre-Quiz: RoCE and RoCEv2 Protocols

1. What UDP destination port does RoCEv2 use?

443
4791
8080
3260

2. What EtherType does RoCE v1 use for encapsulation?

0x0800
0x86DD
0x8915
0x8906

3. What is the approximate end-to-end latency of the Cisco Nexus 9000 for leaf-spine traffic?

50 microseconds
4.5 microseconds
1 millisecond
100 nanoseconds

4. Which field in the RoCEv2 packet provides end-to-end data integrity verification?

FCS (Frame Check Sequence)
UDP checksum
ICRC (Invariant CRC)
IP header checksum

5. What is the purpose of the UDP source port in RoCEv2 packets?

It identifies the sending application
It is derived from a hash of flow parameters for ECMP load balancing
It always matches the destination port 4791
It is randomly assigned by the operating system

Key Points

RoCE v1 vs. RoCE v2

RoCE v1, introduced by the InfiniBand Trade Association (IBTA) in 2010, encapsulates InfiniBand transport packets directly inside Ethernet frames using EtherType 0x8915. Because it operates at Layer 2 only, RoCE v1 traffic cannot cross router boundaries -- both communicating hosts must reside in the same Ethernet broadcast domain.

RoCE v2, introduced in 2014, solves this by wrapping the InfiniBand transport header inside a UDP/IP packet (IPv4 or IPv6) with UDP destination port 4791. This "Routable RoCE" can traverse any IP network, making it fully compatible with leaf-spine Clos topologies. RoCEv2 is the dominant Ethernet-based RDMA transport for AI workloads today.

RoCEv2 Packet Format

```mermaid
flowchart LR
    subgraph RoCEv1["RoCE v1 Frame"]
        direction LR
        V1A["Ethernet Header\n(EtherType 0x8915)"] --> V1B["IB GRH"] --> V1C["IB BTH"] --> V1D["Payload"] --> V1E["ICRC"] --> V1F["FCS"]
    end
    subgraph RoCEv2["RoCE v2 Packet"]
        direction LR
        V2A["Ethernet\nHeader"] --> V2B["IP\nHeader"] --> V2C["UDP Header\n(dst 4791)"] --> V2D["IB BTH"] --> V2E["Payload"] --> V2F["ICRC"] --> V2G["FCS"]
    end
```

The RoCEv2 packet adds two key headers on top of the original InfiniBand transport: an IP header (IPv4 or IPv6) that makes the packet routable across subnets, and a UDP header with destination port 4791 whose source port is derived from a hash of flow parameters so that ECMP can spread flows across paths.
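Those extra headers carry a measurable per-packet cost, which is one reason RoCEv2 fabrics run jumbo MTUs. A small sketch of the encapsulation overhead (the header sizes are the standard on-wire values for an untagged IPv4 frame; the payload sizes are illustrative):

```python
# Sketch: per-packet overhead of RoCEv2 encapsulation (IPv4, no VLAN tag).
HEADERS = {
    "ethernet": 14,  # dst MAC, src MAC, EtherType
    "ipv4": 20,      # no options
    "udp": 8,        # src port (ECMP entropy), dst port 4791, length, checksum
    "bth": 12,       # InfiniBand Base Transport Header
    "icrc": 4,       # invariant CRC (end-to-end integrity)
    "fcs": 4,        # Ethernet frame check sequence
}
OVERHEAD = sum(HEADERS.values())  # 62 bytes per packet

def efficiency(payload_bytes: int) -> float:
    """Fraction of on-wire bytes that carry RDMA payload."""
    return payload_bytes / (payload_bytes + OVERHEAD)

eff_1k = efficiency(1024)   # ~94.3% of on-wire bytes are payload
eff_4k = efficiency(4096)   # ~98.5% -- why large MTUs (e.g. 9216) help RoCEv2
```

Larger payloads amortize the fixed 62-byte overhead, which is consistent with the jumbo MTU (9216) configured in the Nexus example below.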

Deploying RoCEv2 on Cisco Nexus 9000

Deploying RoCEv2 requires configuring several interdependent features: PFC for lossless transport, ECN for congestion signaling, ETS for bandwidth allocation, and DCBX for parameter negotiation. Here is a basic RoCEv2 QoS configuration using the Modular QoS CLI (MQC):

```
! Step 1: Classification -- match RoCEv2 data and CNP traffic
class-map type qos match-all class-rocev2-data
  match dscp 26
class-map type qos match-all class-cnp
  match dscp 48

! Step 2: QoS policy -- set CoS and queue group for matched traffic
policy-map type qos rocev2-ingress-policy
  class class-rocev2-data
    set qos-group 3
  class class-cnp
    set qos-group 6

! Step 3: Queuing policy -- configure ECN thresholds and scheduling
policy-map type queuing rocev2-queuing-policy
  class type queuing c-out-8q-q3
    bandwidth percent 50
    random-detect minimum-threshold 150 kbytes maximum-threshold 3000 kbytes
    congestion-control ecn
  class type queuing c-out-8q-q6
    priority level 1

! Step 4: Network QoS -- enable PFC on the RoCEv2 traffic class
policy-map type network-qos rocev2-network-policy
  class type network-qos c-8q-nq3
    pause no-drop
    mtu 9216

! Step 5: Apply policies
system qos
  service-policy type network-qos rocev2-network-policy
  service-policy type queuing output rocev2-queuing-policy

interface Ethernet1/1
  service-policy type qos input rocev2-ingress-policy
```
Animation: RoCEv2 packet encapsulation walkthrough -- show how an InfiniBand transport payload gets wrapped with BTH, then UDP header (port 4791), then IP header, then Ethernet header as it traverses the stack
Post-Quiz: RoCE and RoCEv2 Protocols

1. In the Nexus MQC configuration, which DSCP value is used for RoCEv2 data traffic?

0
26
46
48

2. Which DSCP value is assigned to Congestion Notification Packets (CNPs)?

26
34
46
48

3. Under which NX-OS CLI context is the network-qos policy applied?

interface configuration
router ospf
system qos
vlan configuration

4. Why was RoCE v2 necessary for modern data center deployments?

RoCE v1 had no congestion control
RoCE v1 could not cross Layer 3 boundaries in leaf-spine topologies
RoCE v1 was limited to 10G link speeds
RoCE v1 required InfiniBand switches

5. What command enables lossless (no-drop) behavior for a traffic class in NX-OS?

congestion-control ecn
priority level 1
pause no-drop
bandwidth percent 50

Section 3: Congestion Control Mechanisms

Pre-Quiz: Congestion Control Mechanisms

1. What IEEE standard defines Priority Flow Control (PFC)?

IEEE 802.1Qaz
IEEE 802.1Qbb
IEEE 802.3x
RFC 3168

2. What does ECN do when a switch detects congestion?

Drops the congested packets
Sends a PFC PAUSE frame upstream
Marks packets with Congestion Experienced (CE) bits
Reroutes traffic to an alternate path

3. In the DCQCN framework, what role does the receiver NIC play?

Congestion Point -- marks packets
Reaction Point -- reduces sending rate
Notification Point -- generates CNP back to sender
Distribution Point -- load balances traffic

4. What protocol does DCBX use as its transport?

BGP
OSPF
LLDP
CDP

5. What is the purpose of Enhanced Transmission Selection (ETS)?

It encrypts traffic between switches
It provides bandwidth allocation and scheduling across traffic classes
It detects and mitigates PFC storms
It maps IP addresses to MAC addresses

Key Points

Priority Flow Control (PFC) -- IEEE 802.1Qbb

Standard Ethernet is inherently lossy -- when switch buffers overflow, packets are silently dropped. For RoCEv2, packet loss is catastrophic: it triggers expensive transport-layer retransmissions that can increase AI training job completion times by orders of magnitude.

PFC extends the legacy IEEE 802.3x PAUSE mechanism to operate on a per-priority basis. Rather than pausing all traffic on a link, PFC can selectively pause only the congested Class of Service (CoS) value while allowing other priorities to continue flowing.

```mermaid
stateDiagram-v2
    [*] --> Normal
    Normal : Buffer below xon threshold\nAll priorities transmitting
    Normal --> Rising : Buffer utilization\nincreasing
    Rising : Buffer between xon and xoff\nMonitoring per-CoS usage
    Rising --> PFC_Pause : Buffer exceeds\nxoff threshold
    PFC_Pause : PFC PAUSE frame sent upstream\nCongested CoS paused\nOther CoS priorities flow normally
    PFC_Pause --> Draining : Upstream stops\nsending paused CoS
    Draining : Buffer draining\nWaiting for xon threshold
    Draining --> Normal : Buffer drops\nbelow xon threshold\n(Resume signal sent)
    Rising --> Normal : Congestion\nclears naturally
```

How PFC operates:

  1. Each switch port monitors ingress buffer utilization per CoS priority (0 through 7).
  2. When buffer usage exceeds the xoff threshold, the switch sends a PFC PAUSE frame upstream, specifying which priority to pause.
  3. The upstream device stops transmitting that priority for the specified pause quanta.
  4. When buffer usage drops below the xon threshold, the switch signals resume.
  5. Traffic on all other priorities continues uninterrupted.
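The steps above are a classic hysteresis loop: pause at xoff, resume only once the buffer has drained below the lower xon threshold, and do nothing in between. A minimal sketch of that logic for one priority (the byte thresholds are illustrative, not Nexus defaults):

```python
# Sketch: per-priority PFC pause/resume hysteresis for one ingress buffer.
class PfcMonitor:
    def __init__(self, xoff: int, xon: int):
        assert xon < xoff, "resume threshold must sit below the pause threshold"
        self.xoff, self.xon = xoff, xon
        self.paused = False

    def update(self, buffer_bytes: int):
        """Return the frame to emit upstream, if any, for this buffer level."""
        if not self.paused and buffer_bytes >= self.xoff:
            self.paused = True
            return "PAUSE"          # PFC frame naming the congested priority
        if self.paused and buffer_bytes <= self.xon:
            self.paused = False
            return "RESUME"         # pause quanta of 0 re-enables transmission
        return None                 # inside the hysteresis band: no state change

mon = PfcMonitor(xoff=900_000, xon=300_000)
events = [mon.update(b) for b in (100_000, 950_000, 800_000, 250_000)]
# events == [None, "PAUSE", None, "RESUME"]
```

The gap between xon and xoff prevents the switch from flapping between pause and resume as the buffer oscillates near a single threshold.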

PFC Risks and Mitigations

| Risk | Description | Cisco Mitigation |
|---|---|---|
| Head-of-line blocking | Pausing one flow blocks all flows sharing the same priority | Dedicate a CoS to RoCEv2; avoid mixing traffic types |
| PFC storms | Cascading pause frames propagate across the fabric | PFC watchdog timer on Nexus 9000 |
| Deadlocks | Circular buffer dependencies cause a permanent traffic stall | PFC deadlock detection and recovery |
| Reduced throughput | Excessive pausing degrades effective bandwidth | Tune ECN thresholds so PFC fires only as a last resort |

Explicit Congestion Notification (ECN) -- RFC 3168

While PFC is a reactive, hop-by-hop emergency brake, ECN is a proactive, end-to-end congestion signaling mechanism. ECN marks packets rather than dropping them, allowing endpoints to reduce their sending rate before buffers overflow.

  1. The sender sets the ECN field in the IP header to ECN-Capable Transport (ECT) -- binary 01 or 10.
  2. When a switch queue depth crosses a configured threshold, it changes the ECN field to Congestion Experienced (CE) -- binary 11.
  3. The receiver detects the CE marking and generates a Congestion Notification Packet (CNP) back to the sender (DSCP 48, strict-priority queue).
  4. The sender receives the CNP and reduces its transmission rate.

ECN Threshold Parameters

| Parameter | Function |
|---|---|
| Minimum threshold | Below this queue depth, no ECN marking occurs |
| Maximum threshold | Above this queue depth, all ECN-capable packets are marked CE |
| Between min and max | Probabilistic marking -- probability increases linearly with queue depth |
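The three regions in the table form a RED/WRED-style marking curve. A sketch of that curve, using the 150/3000-kbyte thresholds from the Nexus configuration earlier in this chapter as defaults (the `max_prob` parameter is an illustrative knob, not an NX-OS setting):

```python
# Sketch: linear ECN marking probability between min and max queue thresholds.
def ecn_mark_probability(queue_kb: float, min_kb: float = 150,
                         max_kb: float = 3000, max_prob: float = 1.0) -> float:
    if queue_kb <= min_kb:
        return 0.0    # below minimum threshold: never mark
    if queue_kb >= max_kb:
        return 1.0    # above maximum threshold: mark every ECN-capable packet
    # between thresholds: probability rises linearly with queue depth
    return max_prob * (queue_kb - min_kb) / (max_kb - min_kb)
```

Tuning the minimum threshold low enough ensures ECN starts signaling the senders well before the buffer approaches the PFC xoff point.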

DCQCN: Putting ECN and PFC Together

```mermaid
sequenceDiagram
    participant Sender as Sender NIC (Reaction Point)
    participant Switch as Switch (Congestion Point)
    participant Receiver as Receiver NIC (Notification Point)
    Sender->>Switch: Data packets with ECN ECT bits set
    Note over Switch: Queue depth crosses ECN threshold
    Switch->>Receiver: Packets marked with ECN CE (11)
    Note over Receiver: Detects CE-marked packets
    Receiver->>Sender: CNP (DSCP 48, strict priority)
    Note over Sender: Multiplicative decrease of sending rate
    Sender->>Switch: Reduced-rate data packets
    Note over Sender: Additive increase recovery over time
    Sender->>Switch: Gradually increasing rate
    rect rgb(255, 230, 230)
        Note over Sender,Receiver: If ECN cannot control congestion (sudden incast)
        Switch-->>Sender: PFC PAUSE frame (emergency fail-safe)
        Note over Sender: Transmission halted for paused CoS
    end
```

| DCQCN Role | Location | Function |
|---|---|---|
| Congestion Point (CP) | Switch | Monitors queue depth; marks packets with ECN CE bits |
| Notification Point (NP) | Receiver NIC | Detects CE-marked packets; generates CNP back to sender |
| Reaction Point (RP) | Sender NIC | Receives CNP; reduces rate via multiplicative decrease; recovers via additive increase |
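The reaction point's behavior reduces to two moves: cut the rate multiplicatively when a CNP arrives, and creep back up additively while the path stays quiet. A deliberately simplified sketch (real DCQCN also maintains a target rate and an EWMA of the congestion estimate; the constants here are illustrative):

```python
# Sketch: DCQCN reaction-point rate control, reduced to its two core moves.
def on_cnp(rate_gbps: float, alpha: float = 0.5) -> float:
    """CNP received: cut the sending rate multiplicatively."""
    return rate_gbps * (1 - alpha / 2)

def on_quiet_period(rate_gbps: float, step_gbps: float = 5.0,
                    line_rate_gbps: float = 100.0) -> float:
    """No CNPs for a while: recover additively, capped at line rate."""
    return min(rate_gbps + step_gbps, line_rate_gbps)

rate = 100.0
rate = on_cnp(rate)                # 75.0 after the first CNP
rate = on_cnp(rate)                # 56.25 after a second CNP
for _ in range(3):
    rate = on_quiet_period(rate)   # recovers to 71.25 over three quiet periods
```

The asymmetry (fast decrease, slow increase) is what lets DCQCN drain queues quickly during incast while still reclaiming bandwidth once congestion clears.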

Enhanced Transmission Selection (ETS) -- IEEE 802.1Qaz

ETS provides bandwidth allocation and scheduling across traffic classes. It ensures RoCEv2 traffic receives a guaranteed share of link bandwidth while preventing best-effort traffic from starving.

Example ETS Allocation for an AI Fabric

| Traffic Class | Traffic Type | CoS | Bandwidth | Scheduling |
|---|---|---|---|---|
| TC 3 | RoCEv2 Data | 3 | 50% | Weighted (ETS) |
| TC 6 | CNP | 6 | -- | Strict Priority |
| TC 4 | Storage (NVMe-oF) | 4 | 25% | Weighted (ETS) |
| TC 0 | Best-effort | 0 | 20% | Weighted (ETS) |
| TC 7 | Management/Control | 7 | 5% | Weighted (ETS) |
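The arithmetic behind those weights is straightforward. A sketch of how they divide a 400 Gbps link when every class is backlogged (the link speed is illustrative, and the strict-priority CNP class is assumed to consume negligible bandwidth):

```python
# Sketch: ETS weighted shares on a fully backlogged link.
LINK_GBPS = 400
ets_weights = {"RoCEv2": 50, "Storage": 25, "Best-effort": 20, "Mgmt": 5}
assert sum(ets_weights.values()) == 100   # ETS weights must cover the link

shares = {cls: LINK_GBPS * pct / 100 for cls, pct in ets_weights.items()}
# shares == {"RoCEv2": 200.0, "Storage": 100.0, "Best-effort": 80.0, "Mgmt": 20.0}
```

Note that ETS percentages are guarantees, not caps: when a class is idle, its unused bandwidth is redistributed to the backlogged classes.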

Data Center Bridging Exchange (DCBX) -- IEEE 802.1Qaz

DCBX is the discovery and negotiation protocol that automates DCB parameter exchange between connected devices. It rides on top of LLDP and negotiates three key TLV parameters:

  1. PFC Configuration TLV: Which CoS priorities are lossless (PFC-enabled).
  2. ETS Configuration / Recommendation TLV: Bandwidth allocation percentages and traffic class mappings.
  3. Application Priority TLV: Which applications (RoCEv2, FCoE) map to which CoS priorities.

DCBX supports a willing/non-willing negotiation model. Typically, switches are non-willing (authoritative) and host NICs are willing, so the switch pushes consistent policy to all endpoints.
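The willing/non-willing outcome can be sketched as a single decision per negotiated feature (for example, the PFC configuration TLV). This is a simplification of the 802.1Qaz state machine, and the priority values are illustrative:

```python
# Sketch: which DCB configuration becomes operational after a DCBX exchange.
def dcbx_operational(local_cfg, local_willing, peer_cfg, peer_willing):
    if local_willing and not peer_willing:
        return peer_cfg    # adopt the authoritative peer's parameters
    return local_cfg       # otherwise keep the locally configured parameters

switch_pfc = {"lossless_priorities": [3]}   # switch: non-willing, authoritative
nic_pfc = {"lossless_priorities": []}       # host NIC: willing, unconfigured

# Typical deployment: the NIC adopts the switch's PFC config...
assert dcbx_operational(nic_pfc, True, switch_pfc, False) == switch_pfc
# ...while the switch keeps its own regardless of the NIC's settings.
assert dcbx_operational(switch_pfc, False, nic_pfc, True) == switch_pfc
```

Making the switch authoritative guarantees that every attached NIC ends up with the same lossless-priority and ETS settings, which is exactly what a RoCEv2 fabric needs.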

Animation: DCQCN congestion control flow -- show ECN marking at the switch, CNP generation at receiver, sender rate reduction with multiplicative decrease, then gradual additive increase recovery; include PFC activation as a fail-safe overlay
Post-Quiz: Congestion Control Mechanisms

1. What happens when PFC buffer usage exceeds the xoff threshold?

Packets are dropped from the queue
The switch sends a PFC PAUSE frame upstream for that priority
ECN marks are added to all packets
The switch reroutes traffic to another port

2. What binary value in the IP ECN field indicates Congestion Experienced?

00
01
10
11

3. In DCBX, what does a "willing" device do?

It refuses all parameter changes from peers
It accepts DCB parameters from its peer
It only negotiates ETS, not PFC
It sends configuration to all devices in the VLAN

4. Which Cisco Nexus 9000 feature mitigates PFC storm propagation?

ECN threshold tuning
PFC watchdog timer
DCBX willing mode
ETS strict priority scheduling

5. In a well-tuned AI fabric, which mechanism should handle the vast majority of congestion events?

PFC PAUSE frames
Packet dropping
ECN (via DCQCN)
DCBX renegotiation

Section 4: QoS and Load Distribution for AI

Pre-Quiz: QoS and Load Distribution for AI

1. What are the three policy-map types in Cisco NX-OS MQC?

type input, type output, type global
type qos, type queuing, type network-qos
type classification, type scheduling, type marking
type ingress, type egress, type system

2. Why should you never mix drop-eligible and lossless traffic in the same queue?

It causes VLAN mismatches
PFC pauses the entire queue, so drop-eligible traffic gets unnecessarily paused and lossless traffic may be dropped
It exceeds the maximum MTU size
It disables ECMP load balancing

3. What is the main problem with static ECMP for AI training workloads?

It does not support IPv6
Elephant flows hash to the same path, causing congestion while other paths sit idle
It requires manual path configuration
It does not work with leaf-spine topologies

4. What is a "flowlet" in Dynamic Load Balancing?

A new TCP connection within an existing session
A burst of packets within a flow, separated from other bursts by idle gaps
A flow that is less than 1 KB in size
A multicast packet group

5. What throughput improvement did Cisco benchmarks show for DLB flowlet mode over static ECMP?

5%
18.6%
35%
55%

Key Points

QoS Policy Design with MQC

```mermaid
flowchart LR
    subgraph Ingress["Ingress (per-interface)"]
        QOS["policy-map type qos\n\nClassification and Marking\n- Match DSCP, CoS, ACL\n- Assign qos-group"]
    end
    subgraph Egress["Egress (per-interface or system-wide)"]
        QUEUING["policy-map type queuing\n\nScheduling and Buffering\n- Bandwidth allocation\n- ECN thresholds\n- Strict priority queues"]
    end
    subgraph SystemWide["System-Wide (system qos)"]
        NQOS["policy-map type network-qos\n\nLossless Behavior\n- PFC enable/disable\n- MTU per traffic class\n- DCBX parameters"]
    end
    QOS -->|"qos-group\nmapping"| QUEUING
    NQOS -.->|"PFC and MTU\napplied globally"| QUEUING
    style Ingress fill:#e8f4fd,stroke:#2980b9
    style Egress fill:#fdf2e8,stroke:#e67e22
    style SystemWide fill:#f0e8fd,stroke:#8e44ad
```

Cisco NX-OS implements QoS through the Modular QoS CLI (MQC) framework with three distinct policy-map types:

| Policy-Map Type | Purpose | Scope |
|---|---|---|
| type qos | Classification and marking -- matches traffic by DSCP, CoS, ACL; assigns qos-groups | Per-interface (ingress) |
| type queuing | Queue scheduling, bandwidth allocation, ECN thresholds, queue depth limits | Per-interface or system-wide (egress) |
| type network-qos | Network-wide behavior: PFC enable/disable, MTU per traffic class, DCBX | System-wide under system qos |

Traffic Classification and Marking

| Traffic Type | DSCP Value | CoS | Queue | Treatment |
|---|---|---|---|---|
| RoCEv2 Data | 26 (AF31) or 24 (CS3) | 3 | Queue 3 | Lossless: PFC-enabled, ECN-enabled |
| CNP | 48 (CS6) | 6 | Queue 6 | Strict priority, low-latency delivery |
| Storage (iSCSI/NVMe-oF) | 14 (AF13) | 4 | Queue 4 | May be lossless depending on requirements |
| Best-effort / Default | 0 (BE) | 0 | Queue 0 | Default drop-eligible treatment |
| Management / Control | 46 (EF) or 48 (CS6) | 7 | Queue 7 | Strict priority, small bandwidth |

A critical design rule: never mix drop-eligible and lossless (no-drop) traffic in the same queue. When PFC pauses a queue, all traffic in that queue is paused. Cisco recommends dedicating a queue exclusively to no-drop RoCEv2 traffic.

Load Distribution Strategies

```mermaid
flowchart TD
    subgraph Static["Static ECMP"]
        S_SRC["Source GPU"] --> S_HASH{"5-tuple\nhash"}
        S_HASH -->|"Flow A"| S_P1["Path 1\n(congested)"]
        S_HASH -->|"Flow B"| S_P1
        S_HASH -->|"Flow C"| S_P2["Path 2\n(idle)"]
        S_P1 --> S_DST["Destination"]
        S_P2 --> S_DST
    end
    subgraph Flowlet["Flowlet DLB"]
        F_SRC["Source GPU"] --> F_DET{"Detect idle\ngap in flow"}
        F_DET -->|"Flowlet 1"| F_P1["Path 1"]
        F_DET -->|"Flowlet 2"| F_P2["Path 2"]
        F_DET -->|"Flowlet 3"| F_P1
        F_P1 --> F_DST["Destination"]
        F_P2 --> F_DST
    end
    subgraph PerPacket["Per-Packet PLB"]
        PP_SRC["Source GPU"] --> PP_RR{"Round-robin\nper packet"}
        PP_RR -->|"Pkt 1,3,5"| PP_P1["Path 1"]
        PP_RR -->|"Pkt 2,4,6"| PP_P2["Path 2"]
        PP_P1 --> PP_DST["Destination\n(must handle reorder)"]
        PP_P2 --> PP_DST
    end
    style Static fill:#f9e0e0,stroke:#c0392b
    style Flowlet fill:#e0f9e0,stroke:#27ae60
    style PerPacket fill:#e0e0f9,stroke:#2c3e9c
```

The ECMP Problem with Elephant Flows

Traditional ECMP assigns entire flows to a single path based on a static 5-tuple hash. AI training generates large, persistent "elephant flows" between GPUs that can last minutes or hours. When multiple elephant flows hash to the same path, that link becomes congested while parallel links sit idle.
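The pinning problem follows directly from how static ECMP picks a path: a deterministic hash of the 5-tuple, computed once per packet but identical for every packet of a flow. A sketch (zlib.crc32 stands in for the switch ASIC's hash function; the addresses, ports, and path count are illustrative):

```python
# Sketch: static ECMP path selection as a deterministic 5-tuple hash.
import zlib

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, n_paths):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % n_paths

# Two long-lived elephant flows between GPU servers (proto 17 = UDP, RoCEv2
# dst port 4791). Each keeps whatever path its hash selects for its entire
# lifetime -- if they collide on one uplink, that uplink stays congested
# while the others sit idle, no matter how long the flows run.
flow_a = ecmp_path("10.0.1.1", "10.0.2.1", 49152, 4791, 17, n_paths=4)
flow_b = ecmp_path("10.0.1.2", "10.0.2.2", 49153, 4791, 17, n_paths=4)
```

This is also why RoCEv2 varies the UDP source port per queue pair: it is the only 5-tuple field that injects per-flow entropy into the hash.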

Dynamic Load Balancing (DLB)

Cisco Nexus 9000 switches (NX-OS 10.5(1)F and later) support Layer 3 ECMP Dynamic Load Balancing:

| Mode | Granularity | Reordering Risk | Best For |
|---|---|---|---|
| Static ECMP | Per-flow (5-tuple hash) | None | General-purpose workloads |
| Flowlet (FLB) | Per-flowlet (gap-based) | Minimal | RoCEv2 AI training (default) |
| Per-Packet (PLB) | Per-packet | Yes (endpoint must handle) | Endpoints with reorder tolerance |
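The flowlet idea is that a flow's packets arrive in bursts separated by idle gaps; if a gap exceeds the worst-case path-delay difference, the next burst can safely take a different path without reordering. A sketch of the gap-based split (the 50 µs timeout and the arrival times are illustrative):

```python
# Sketch: splitting one flow's packet arrival times into flowlets.
def split_flowlets(arrival_times_us, gap_threshold_us=50):
    flowlets, current = [], [arrival_times_us[0]]
    for prev, now in zip(arrival_times_us, arrival_times_us[1:]):
        if now - prev > gap_threshold_us:
            flowlets.append(current)   # idle gap: close the current flowlet
            current = []
        current.append(now)
    flowlets.append(current)
    return flowlets

bursts = split_flowlets([0, 5, 10, 200, 205, 500])
# bursts == [[0, 5, 10], [200, 205], [500]]  -> three separately routable units
```

Each returned burst can be hashed independently to an uplink, which is how flowlet DLB redistributes load mid-flow while keeping intra-burst ordering intact.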

Approximate Fair Drop (AFD)

AFD distinguishes high-bandwidth "elephant flows" from short-lived "mice flows." It applies more aggressive drop probability to elephant flows during congestion, preventing them from starving mice flows. This is useful on front-end (north-south) networks where AI inference traffic coexists with smaller management and storage flows.
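The core of AFD can be sketched as a drop probability proportional to how far a flow exceeds its fair share; flows at or below the fair share are untouched. This is a simplification -- the real algorithm estimates per-flow rates approximately in hardware rather than tracking them exactly, and the rates below are illustrative:

```python
# Sketch: the fairness rule behind Approximate Fair Drop.
def afd_drop_probability(arrival_gbps: float, fair_gbps: float) -> float:
    if arrival_gbps <= fair_gbps:
        return 0.0                         # mice flows: never dropped by AFD
    return 1.0 - fair_gbps / arrival_gbps  # elephants: trimmed toward fair share

assert afd_drop_probability(1.0, 10.0) == 0.0    # mouse flow passes untouched
assert afd_drop_probability(40.0, 10.0) == 0.75  # elephant cut back to ~10 Gbps
```

Dropping 75% of a 40 Gbps elephant leaves it roughly 10 Gbps of goodput, so every surviving flow converges toward the same fair share.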

Fabric Design: Spine-Leaf (Clos) Topology

```mermaid
graph TD
    S1["Spine 1"] --- L1["Leaf 1"]
    S1 --- L2["Leaf 2"]
    S1 --- L3["Leaf 3"]
    S1 --- L4["Leaf 4"]
    S2["Spine 2"] --- L1
    S2 --- L2
    S2 --- L3
    S2 --- L4
    S3["Spine 3"] --- L1
    S3 --- L2
    S3 --- L3
    S3 --- L4
    L1 --- G1["GPU Server 1"]
    L1 --- G2["GPU Server 2"]
    L2 --- G3["GPU Server 3"]
    L2 --- G4["GPU Server 4"]
    L3 --- G5["GPU Server 5"]
    L3 --- G6["GPU Server 6"]
    L4 --- G7["GPU Server 7"]
    L4 --- G8["GPU Server 8"]
    style S1 fill:#d5e8d4,stroke:#82b366
    style S2 fill:#d5e8d4,stroke:#82b366
    style S3 fill:#d5e8d4,stroke:#82b366
    style L1 fill:#dae8fc,stroke:#6c8ebf
    style L2 fill:#dae8fc,stroke:#6c8ebf
    style L3 fill:#dae8fc,stroke:#6c8ebf
    style L4 fill:#dae8fc,stroke:#6c8ebf
```

The spine-leaf (Clos) topology is the recommended architecture for AI/ML clusters: every leaf connects to every spine, so any two GPU servers are at most two hops apart with consistent latency, bandwidth scales horizontally (add spines for capacity, leaves for ports), and the multiple equal-cost paths between leaves are exactly what DLB needs to spread RoCEv2 traffic.

Cisco Nexus Dashboard Management Tools

Animation: Comparison of static ECMP vs. flowlet DLB -- show elephant flows hashing to the same path with static ECMP causing congestion, then flowlet DLB detecting idle gaps and redistributing flowlets across available paths for balanced utilization
Post-Quiz: QoS and Load Distribution for AI

1. Which MQC policy-map type is applied under "system qos" for network-wide lossless behavior?

type qos
type queuing
type network-qos
type global-qos

2. What distinguishes Flowlet DLB from Per-Packet PLB?

FLB only works on Layer 2 networks
FLB redistributes at idle gaps to avoid reordering; PLB distributes every packet but risks reordering
PLB requires dedicated InfiniBand hardware
FLB provides more even distribution than PLB

3. What does Approximate Fair Drop (AFD) do during congestion?

Pauses all traffic equally
Applies more aggressive drop probability to elephant flows to protect mice flows
Marks all packets with ECN CE bits
Reroutes elephant flows to dedicated paths

4. Which Cisco tool automates fabric provisioning and QoS configuration for AI/ML network profiles?

Nexus Dashboard Insights (NDI)
Cisco DNA Center
Nexus Dashboard Fabric Controller (NDFC)
Cisco ACI

5. Why is the spine-leaf topology ideal for AI/ML clusters?

It requires fewer cables than a traditional three-tier design
It provides non-blocking, consistent latency, horizontal scalability, and equal-cost paths for DLB
It eliminates the need for QoS configuration
It only requires two switches total

Answer Explanations