Chapter 9: High-Performance Networking — RDMA, RoCE, and QoS
Learning Objectives
Explain RDMA and RoCE/RoCEv2 protocol mechanisms and operations
Configure congestion control mechanisms including PFC, ECN, and ETS
Implement QoS for AI traffic on Cisco data center networks
Design load distribution strategies for AI fabric traffic
Section 1: RDMA Fundamentals
Pre-Quiz: RDMA Fundamentals
1. What is the primary benefit of RDMA over traditional TCP/IP networking?
It encrypts data in transit by default
It bypasses the OS kernel to transfer data directly between memory regions
It uses smaller packet sizes for faster transmission
It compresses data before sending across the network
2. Why is RDMA critical for distributed AI training?
It provides built-in model checkpointing
It eliminates the need for GPU-to-GPU communication
It reduces latency and CPU overhead for gradient exchange between GPUs
It automatically scales the number of GPUs in a cluster
3. Which RDMA implementation uses a dedicated physical fabric with its own switches and cables?
RoCE v1
RoCE v2
InfiniBand
iWARP
4. In the RDMA architecture, what component handles transport reliability and segmentation?
The operating system kernel
The application layer
The NIC hardware (HCA / RNIC)
The TCP/UDP stack
5. Approximately how much TCO savings can RoCEv2 offer compared to InfiniBand?
10%
25%
55%
80%
Key Points
RDMA enables direct memory-to-memory data transfers between computers, bypassing the OS kernel and CPU entirely
NVIDIA GPUDirect RDMA allows GPU-to-GPU memory transfers over the network without staging through system memory
RDMA applications use a user-space verbs API to communicate directly with the NIC (HCA or RNIC)
Three RDMA implementations exist: InfiniBand (dedicated fabric), RoCE v1 (L2 Ethernet), and RoCE v2 (routable UDP/IP)
RoCEv2 can achieve up to 55% TCO savings over InfiniBand but requires careful QoS configuration for lossless transport
What Is RDMA?
Remote Direct Memory Access (RDMA) is a technology that enables direct memory-to-memory data transfers between two computers without involving the operating system or CPU of either machine. By bypassing the kernel network stack entirely, RDMA dramatically reduces latency, eliminates CPU overhead, and increases throughput -- three properties essential for distributed AI training workloads.
Think of traditional networking like sending a package through a corporate mailroom: your data leaves the application, passes through several departments (the kernel, the TCP/IP stack, device drivers), gets wrapped in packaging at each stop, and eventually reaches the wire. RDMA is more like a private conveyor belt installed directly between two offices -- the data moves straight from the sender's desk to the receiver's desk with no intermediary handling.
Why RDMA Matters for AI/ML
In AI/ML clusters, GPUs across multiple servers must exchange gradient data at extremely high speed during distributed training. NVIDIA's GPUDirect RDMA enables GPU-to-GPU memory transfers over the network without staging data through system memory or the host CPU. This is critical because:
Latency sensitivity: Collective operations like AllReduce require synchronization across hundreds or thousands of GPUs. Every microsecond of network latency compounds across iterations.
CPU offload: Without RDMA, the host CPU must copy data between GPU memory, system memory, and the NIC. RDMA eliminates these copies.
Throughput: Modern AI fabrics operate at 100G, 200G, and 400G per port. RDMA's zero-copy architecture can saturate these links efficiently.
The RDMA Protocol Stack
flowchart LR
subgraph Traditional["Traditional TCP/IP Path"]
direction TB
A1["Application"] --> A2["Socket API"]
A2 --> A3["TCP/UDP Transport"]
A3 --> A4["IP Layer"]
A4 --> A5["Device Driver"]
A5 --> A6["NIC Hardware"]
end
subgraph RDMA["RDMA Path"]
direction TB
B1["Application"] --> B2["User-Space Verbs API"]
B2 --> B3["RNIC / HCA\n(handles transport,\nsegmentation, reassembly\nin hardware)"]
end
style Traditional fill:#f9e0e0,stroke:#c0392b
style RDMA fill:#e0f9e0,stroke:#27ae60
In a conventional stack, application data traverses the socket API, TCP/UDP transport, IP layer, and device driver before reaching the NIC hardware. Each layer involves kernel context switches and memory copies. With RDMA, applications interact directly with the NIC through a user-space verbs API. The NIC itself handles transport reliability, segmentation, and reassembly in hardware.
Three RDMA Implementations
| Feature | InfiniBand | RoCE v1 | RoCE v2 |
| --- | --- | --- | --- |
| Year Introduced | ~1999 | 2010 | 2014 |
| Layer | Dedicated L1-L4 fabric | Ethernet L2 only | Ethernet + UDP/IP (L3 routable) |
| EtherType / Port | N/A (own physical layer) | EtherType 0x8915 | UDP destination port 4791 |
| Routability | IB subnet routing (OpenSM) | Same L2 broadcast domain only | Full IP routing across subnets |
| Lossless Guarantee | Built-in credit-based flow control | Requires PFC/ECN on Ethernet | Requires PFC/ECN on Ethernet |
| Typical Use Case | HPC, scientific computing | Legacy/niche deployments | AI/ML data center fabrics |
InfiniBand vs. Ethernet RDMA: The Trade-Off
InfiniBand remains the gold standard for raw RDMA performance. Its dedicated fabric provides the lowest, most consistent latency because every component is purpose-built for RDMA. Flow control is handled natively through a credit-based mechanism, so there is no need for complex PFC/ECN configuration.
However, InfiniBand requires a completely separate network infrastructure with specialized switches, cables, and management tools (such as the OpenSM subnet manager). Ethernet-based RoCEv2 deployments can achieve up to 55% total cost of ownership (TCO) savings -- including 56% OpEx savings and 55% CapEx savings over three years. For organizations already invested in Ethernet infrastructure, RoCEv2 allows them to add RDMA capabilities without building a parallel network.
Animation: Side-by-side comparison of traditional TCP/IP data path (multiple kernel copies) vs. RDMA zero-copy path, showing packets moving from application memory to wire
Post-Quiz: RDMA Fundamentals
1. Which API do RDMA applications use to communicate with the NIC?
Socket API
User-space verbs API
POSIX file I/O API
REST API
2. What makes RoCE v1 impractical for modern leaf-spine data center designs?
It requires InfiniBand switches
It only operates at Layer 2 and cannot cross router boundaries
It does not support lossless transport
It is limited to 10G speeds
3. NVIDIA GPUDirect RDMA enables what specific capability?
GPU-to-CPU memory transfers over PCIe
GPU-to-GPU memory transfers over the network without staging through system memory
Direct disk-to-GPU data loading
Automatic GPU memory defragmentation
4. Which flow control mechanism does InfiniBand use natively (without additional configuration)?
PFC (Priority Flow Control)
ECN (Explicit Congestion Notification)
Credit-based flow control
TCP window scaling
5. In the context of RDMA, what does HCA stand for?
High-Capacity Adapter
Host Channel Adapter
Hardware Control Agent
Host Convergence Accelerator
Section 2: RoCE and RoCEv2 Protocols
Pre-Quiz: RoCE and RoCEv2 Protocols
1. What UDP destination port does RoCEv2 use?
443
4791
8080
3260
2. What EtherType does RoCE v1 use for encapsulation?
0x0800
0x86DD
0x8915
0x8906
3. What is the approximate end-to-end latency of the Cisco Nexus 9000 for leaf-spine traffic?
50 microseconds
4.5 microseconds
1 millisecond
100 nanoseconds
4. Which field in the RoCEv2 packet provides end-to-end data integrity verification?
FCS (Frame Check Sequence)
UDP checksum
ICRC (Invariant CRC)
IP header checksum
5. What is the purpose of the UDP source port in RoCEv2 packets?
It identifies the sending application
It is derived from a hash of flow parameters for ECMP load balancing
It always matches the destination port 4791
It is randomly assigned by the operating system
Key Points
RoCE v1 (2010) uses EtherType 0x8915 and operates at Layer 2 only -- no routing across subnets
RoCE v2 (2014) wraps IB transport in UDP/IP (destination port 4791), enabling full Layer 3 routability
The IB Base Transport Header (BTH) contains opcode, partition key, destination queue pair number, and packet sequence number
ICRC provides end-to-end integrity that survives IP header modifications during routing
Deploying RoCEv2 on Nexus 9000 requires coordinated configuration of classification (type qos), queuing (type queuing), and network-qos policies
Cisco Nexus 9000 provides approximately 4.5 microsecond end-to-end latency across leaf and spine
RoCE v1 vs. RoCE v2
RoCE v1, introduced by the InfiniBand Trade Association (IBTA) in 2010, encapsulates InfiniBand transport packets directly inside Ethernet frames using EtherType 0x8915. Because it operates at Layer 2 only, RoCE v1 traffic cannot cross router boundaries -- both communicating hosts must reside in the same Ethernet broadcast domain.
RoCE v2, introduced in 2014, solves this by wrapping the InfiniBand transport header inside a UDP/IP packet (IPv4 or IPv6) with UDP destination port 4791. This "Routable RoCE" can traverse any IP network, making it fully compatible with leaf-spine Clos topologies. RoCEv2 is the dominant Ethernet-based RDMA transport for AI workloads today.
RoCEv2 Packet Format
flowchart LR
subgraph RoCEv1["RoCE v1 Frame"]
direction LR
V1A["Ethernet Header\n(EtherType 0x8915)"] --> V1B["IB GRH"] --> V1C["IB BTH"] --> V1D["Payload"] --> V1E["ICRC"] --> V1F["FCS"]
end
subgraph RoCEv2["RoCE v2 Packet"]
direction LR
V2A["Ethernet\nHeader"] --> V2B["IP\nHeader"] --> V2C["UDP Header\n(dst 4791)"] --> V2D["IB BTH"] --> V2E["Payload"] --> V2F["ICRC"] --> V2G["FCS"]
end
The RoCEv2 packet adds two key headers on top of the original InfiniBand transport:
IP Header: Provides Layer 3 addressing and enables routing. The DSCP field is used for QoS classification and ECN signaling.
UDP Header: Uses destination port 4791. The source port is typically derived from a hash of flow parameters, which is important for ECMP load balancing.
IB Base Transport Header (BTH): Contains the InfiniBand operation code, partition key, destination queue pair (QP) number, and packet sequence number. NX-OS 10.6(1)F supports filtering on BTH fields in access lists.
ICRC (Invariant CRC): Provides end-to-end data integrity verification that is not affected by IP header modifications during routing.
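As a rough illustration of this layering, the sketch below packs a simplified 12-byte BTH and derives an ECMP-friendly UDP source port from queue pair numbers. The field layout is abbreviated (solicited-event, migration, pad, and version bits are left at zero) and the hash constant is illustrative, not taken from any NIC implementation.

```python
import struct

ROCEV2_UDP_DPORT = 4791  # IANA-assigned UDP destination port for RoCEv2

def build_bth(opcode: int, pkey: int, dest_qp: int, psn: int) -> bytes:
    """Pack a simplified 12-byte IB Base Transport Header: opcode,
    partition key, 24-bit destination QP, 24-bit packet sequence number."""
    word1 = struct.pack("!BBH", opcode, 0, pkey)   # opcode, flags/tver, pkey
    word2 = struct.pack("!I", dest_qp & 0xFFFFFF)  # top byte = reserved
    word3 = struct.pack("!I", psn & 0xFFFFFF)      # top byte = ack-req/reserved
    return word1 + word2 + word3

def ecmp_udp_sport(src_qp: int, dest_qp: int) -> int:
    """Derive the UDP source port from flow parameters (queue pair numbers
    here) so distinct RDMA flows hash onto different ECMP paths.
    The mixing constant is illustrative only."""
    return 0xC000 | ((src_qp * 0x9E3779B1 ^ dest_qp) & 0x3FFF)

bth = build_bth(opcode=0x04, pkey=0xFFFF, dest_qp=0x12345, psn=1000)
```

Because intermediate routers never touch the BTH or payload, the ICRC computed over these invariant fields survives IP header rewrites end to end.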
Deploying RoCEv2 on Cisco Nexus 9000
Deploying RoCEv2 requires configuring several interdependent features: PFC for lossless transport, ECN for congestion signaling, ETS for bandwidth allocation, and DCBX for parameter negotiation. Here is a basic RoCEv2 QoS configuration using the Modular QoS CLI (MQC):
! Step 1: Classification -- match RoCEv2 data and CNP traffic
class-map type qos match-all class-rocev2-data
match dscp 26
class-map type qos match-all class-cnp
match dscp 48
! Step 2: QoS policy -- set CoS and queue group for matched traffic
policy-map type qos rocev2-ingress-policy
class class-rocev2-data
set qos-group 3
class class-cnp
set qos-group 6
! Step 3: Queuing policy -- configure ECN thresholds and scheduling
policy-map type queuing rocev2-queuing-policy
class type queuing c-out-8q-q3
bandwidth percent 50
random-detect minimum-threshold 150 kbytes maximum-threshold 3000 kbytes
congestion-control ecn
class type queuing c-out-8q-q6
priority level 1
! Step 4: Network QoS -- enable PFC on the RoCEv2 traffic class
policy-map type network-qos rocev2-network-policy
class type network-qos c-8q-nq3
pause no-drop
mtu 9216
! Step 5: Apply policies
system qos
service-policy type network-qos rocev2-network-policy
service-policy type queuing output rocev2-queuing-policy
interface Ethernet1/1
service-policy type qos input rocev2-ingress-policy
Animation: RoCEv2 packet encapsulation walkthrough -- show how an InfiniBand transport payload gets wrapped with BTH, then UDP header (port 4791), then IP header, then Ethernet header as it traverses the stack
Post-Quiz: RoCE and RoCEv2 Protocols
1. In the Nexus MQC configuration, which DSCP value is used for RoCEv2 data traffic?
0
26
46
48
2. Which DSCP value is assigned to Congestion Notification Packets (CNPs)?
26
34
46
48
3. Under which NX-OS CLI context is the network-qos policy applied?
interface configuration
router ospf
system qos
vlan configuration
4. Why was RoCE v2 necessary for modern data center deployments?
RoCE v1 had no congestion control
RoCE v1 could not cross Layer 3 boundaries in leaf-spine topologies
RoCE v1 was limited to 10G link speeds
RoCE v1 required InfiniBand switches
5. What command enables lossless (no-drop) behavior for a traffic class in NX-OS?
congestion-control ecn
priority level 1
pause no-drop
bandwidth percent 50
Section 3: Congestion Control Mechanisms
Pre-Quiz: Congestion Control Mechanisms
1. What IEEE standard defines Priority Flow Control (PFC)?
IEEE 802.1Qaz
IEEE 802.1Qbb
IEEE 802.3x
RFC 3168
2. What does ECN do when a switch detects congestion?
Drops the congested packets
Sends a PFC PAUSE frame upstream
Marks packets with Congestion Experienced (CE) bits
Reroutes traffic to an alternate path
3. In the DCQCN framework, what role does the receiver NIC play?
Congestion Point -- marks packets
Reaction Point -- reduces sending rate
Notification Point -- generates CNP back to sender
Distribution Point -- load balances traffic
4. What protocol does DCBX use as its transport?
BGP
OSPF
LLDP
CDP
5. What is the purpose of Enhanced Transmission Selection (ETS)?
It encrypts traffic between switches
It provides bandwidth allocation and scheduling across traffic classes
It detects and mitigates PFC storms
It maps IP addresses to MAC addresses
Key Points
PFC (IEEE 802.1Qbb) provides per-priority pause to prevent packet loss -- it is the emergency fail-safe
PFC uses xoff/xon thresholds: PAUSE frame sent when buffer exceeds xoff, resume when buffer drops below xon
PFC risks include head-of-line blocking, PFC storms, and deadlocks -- mitigated by PFC watchdog and deadlock detection on Nexus 9000
ECN (RFC 3168) is proactive end-to-end congestion signaling: switches mark packets CE, receivers generate CNPs (DSCP 48)
DCQCN defines three roles: Congestion Point (switch), Notification Point (receiver NIC), Reaction Point (sender NIC)
In a well-tuned network, ECN handles routine congestion; PFC fires only during sudden bursts
ETS (IEEE 802.1Qaz) guarantees minimum bandwidth percentages per traffic class with strict priority and weighted scheduling
DCBX negotiates PFC, ETS, and application priority settings using a willing/non-willing model over LLDP
Priority Flow Control (PFC) -- IEEE 802.1Qbb
Standard Ethernet is inherently lossy -- when switch buffers overflow, packets are silently dropped. For RoCEv2, packet loss is catastrophic: it triggers expensive transport-layer retransmissions that can increase AI training job completion times by orders of magnitude.
PFC extends the legacy IEEE 802.3x PAUSE mechanism to operate on a per-priority basis. Rather than pausing all traffic on a link, PFC can selectively pause only the congested Class of Service (CoS) value while allowing other priorities to continue flowing.
stateDiagram-v2
[*] --> Normal
Normal : Buffer below xon threshold\nAll priorities transmitting
Normal --> Rising : Buffer utilization\nincreasing
Rising : Buffer between xon and xoff\nMonitoring per-CoS usage
Rising --> PFC_Pause : Buffer exceeds\nxoff threshold
PFC_Pause : PFC PAUSE frame sent upstream\nCongested CoS paused\nOther CoS priorities flow normally
PFC_Pause --> Draining : Upstream stops\nsending paused CoS
Draining : Buffer draining\nWaiting for xon threshold
Draining --> Normal : Buffer drops\nbelow xon threshold\n(Resume signal sent)
Rising --> Normal : Congestion\nclears naturally
How PFC operates:
Each switch port monitors ingress buffer utilization per CoS priority (0 through 7).
When buffer usage exceeds the xoff threshold, the switch sends a PFC PAUSE frame upstream, specifying which priority to pause.
The upstream device stops transmitting that priority for the specified pause quanta.
When buffer usage drops below the xon threshold, the switch signals resume.
Traffic on all other priorities continues uninterrupted.
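The xoff/xon behavior above can be sketched as a toy per-priority buffer model (the threshold values are illustrative, not Nexus defaults):

```python
class PfcBuffer:
    """Toy model of one CoS priority's ingress buffer with PFC
    xoff/xon thresholds (values in KB, illustrative only)."""

    def __init__(self, xoff_kb=300, xon_kb=150):
        self.xoff, self.xon = xoff_kb, xon_kb
        self.depth = 0
        self.paused = False  # True => PAUSE frame outstanding upstream

    def enqueue(self, kb):
        self.depth += kb
        if not self.paused and self.depth > self.xoff:
            self.paused = True   # send PFC PAUSE for this priority only
        return self.paused

    def dequeue(self, kb):
        self.depth = max(0, self.depth - kb)
        if self.paused and self.depth < self.xon:
            self.paused = False  # send resume (pause quanta of zero)
        return self.paused
```

The gap between xoff and xon provides hysteresis, so a buffer hovering near one threshold does not flap between pause and resume on every packet.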
PFC Risks and Mitigations
| Risk | Description | Cisco Mitigation |
| --- | --- | --- |
| Head-of-line blocking | Pausing one flow blocks all flows sharing the same priority | Dedicate a CoS to RoCEv2; avoid mixing traffic types |
| PFC storms | Cascading pause frames propagate across the fabric | PFC watchdog timer on Nexus 9000 |
| Deadlocks | Circular buffer dependencies cause permanent traffic stall | PFC deadlock detection and recovery |
| Reduced throughput | Excessive pausing degrades effective bandwidth | Tune ECN thresholds so PFC fires only as a last resort |
Explicit Congestion Notification (ECN) -- RFC 3168
While PFC is a reactive, hop-by-hop emergency brake, ECN is a proactive, end-to-end congestion signaling mechanism. ECN marks packets rather than dropping them, allowing endpoints to reduce their sending rate before buffers overflow. The signaling sequence works as follows:
The sender sets the ECN field in the IP header to ECN-Capable Transport (ECT) -- binary 01 or 10.
When a switch queue depth crosses a configured threshold, it changes the ECN field to Congestion Experienced (CE) -- binary 11.
The receiver detects the CE marking and generates a Congestion Notification Packet (CNP) back to the sender (DSCP 48, strict-priority queue).
The sender receives the CNP and reduces its transmission rate.
ECN Threshold Parameters
| Parameter | Function |
| --- | --- |
| Minimum threshold | Below this queue depth, no ECN marking occurs |
| Maximum threshold | Above this queue depth, all ECN-capable packets are marked CE |
| Between min and max | Probabilistic marking -- probability increases linearly with queue depth |
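These thresholds translate into a simple marking-probability curve. The sketch below uses the 150/3000 KB values from the earlier rocev2-queuing-policy example:

```python
def ecn_mark_probability(queue_kb, min_kb=150, max_kb=3000, max_p=1.0):
    """WRED/ECN-style marking probability: zero below the minimum
    threshold, max_p at or above the maximum, linear in between.
    Thresholds mirror the queuing-policy example (150/3000 KB)."""
    if queue_kb <= min_kb:
        return 0.0
    if queue_kb >= max_kb:
        return max_p
    return max_p * (queue_kb - min_kb) / (max_kb - min_kb)
```

Tuning these two numbers controls how early sources are asked to slow down: a low minimum threshold signals congestion sooner at the cost of some throughput on bursty but healthy queues.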
DCQCN: Putting ECN and PFC Together
sequenceDiagram
participant Sender as Sender NIC (Reaction Point)
participant Switch as Switch (Congestion Point)
participant Receiver as Receiver NIC (Notification Point)
Sender->>Switch: Data packets with ECN ECT bits set
Note over Switch: Queue depth crosses ECN threshold
Switch->>Receiver: Packets marked with ECN CE (11)
Note over Receiver: Detects CE-marked packets
Receiver->>Sender: CNP (DSCP 48, strict priority)
Note over Sender: Multiplicative decrease of sending rate
Sender->>Switch: Reduced-rate data packets
Note over Sender: Additive increase recovery over time
Sender->>Switch: Gradually increasing rate
rect rgb(255, 230, 230)
Note over Sender,Receiver: If ECN cannot control congestion (sudden incast)
Switch-->>Sender: PFC PAUSE frame (emergency fail-safe)
Note over Sender: Transmission halted for paused CoS
end
| DCQCN Role | Location | Function |
| --- | --- | --- |
| Congestion Point (CP) | Switch | Monitors queue depth; marks packets with ECN CE bits |
| Notification Point (NP) | Receiver NIC | Detects CE-marked packets; generates CNP back to sender |
| Reaction Point (RP) | Sender NIC | Receives CNP; reduces rate via multiplicative decrease; recovers via additive increase |
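A minimal sketch of one reaction-point step, assuming a fixed rate-reduction factor and additive-increase constant (real NICs adapt the reduction factor via an EWMA of CNP arrivals, and recovery typically moves through fast-recovery and hyper-increase phases):

```python
def dcqcn_rate_step(rate_gbps, line_gbps, cnp_received,
                    alpha=0.5, ai_gbps=0.5):
    """One simplified DCQCN reaction-point step: multiplicative
    decrease when a CNP arrived in this interval, additive increase
    toward line rate otherwise. Constants are illustrative."""
    if cnp_received:
        return rate_gbps * (1 - alpha / 2)          # cut sending rate
    return min(line_gbps, rate_gbps + ai_gbps)      # recover gradually
```

The asymmetry is deliberate: rate drops sharply the moment congestion is signaled, then climbs back slowly, keeping queue depths near the ECN threshold rather than oscillating into PFC territory.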
Enhanced Transmission Selection (ETS) -- IEEE 802.1Qaz
ETS provides bandwidth allocation and scheduling across traffic classes. It ensures RoCEv2 traffic receives a guaranteed share of link bandwidth without starving best-effort traffic.
Traffic Class Mapping: Up to 8 traffic classes (TC 0-7), each mapped to one or more 802.1p CoS priorities.
Bandwidth Allocation: Each traffic class receives a guaranteed minimum percentage (must sum to 100%). Unused bandwidth is redistributed proportionally.
Scheduling: ETS supports both strict priority (for CNPs) and weighted scheduling (for data classes).
Example ETS Allocation for an AI Fabric
| Traffic Class | Traffic Type | CoS | Bandwidth | Scheduling |
| --- | --- | --- | --- | --- |
| TC 3 | RoCEv2 Data | 3 | 50% | Weighted (ETS) |
| TC 6 | CNP | 6 | -- | Strict Priority |
| TC 4 | Storage (NVMe-oF) | 4 | 25% | Weighted (ETS) |
| TC 0 | Best-effort | 0 | 20% | Weighted (ETS) |
| TC 7 | Management/Control | 7 | 5% | Weighted (ETS) |
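The rule that unused bandwidth is redistributed proportionally can be sketched as a small water-filling loop (a toy model of 802.1Qaz weighted scheduling, not switch-accurate, and ignoring the strict-priority class):

```python
def ets_allocate(weights, demands, link=100.0):
    """Approximate ETS bandwidth sharing: each class is guaranteed
    weight/sum(weights) of the link; capacity a class does not use is
    redistributed by weight among classes that still have demand."""
    alloc = [0.0] * len(weights)
    free = link
    active = set(range(len(weights)))
    while free > 1e-9 and active:
        total_w = sum(weights[i] for i in active)
        grants = {i: free * weights[i] / total_w for i in active}
        free = 0.0
        for i, g in grants.items():
            take = min(g, demands[i] - alloc[i])   # cap at remaining demand
            alloc[i] += take
            free += g - take                       # leftover to redistribute
        active = {i for i in active if demands[i] - alloc[i] > 1e-9}
    return alloc
```

For example, with the table's weights and a quiet storage class, the RoCEv2 class can temporarily exceed its 50% guarantee because the idle share flows back to classes that still have traffic queued.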
Data Center Bridging Exchange (DCBX) -- IEEE 802.1Qaz
DCBX is the discovery and negotiation protocol that automates DCB parameter exchange between connected devices. It rides on top of LLDP and negotiates three key TLV parameters:
PFC Configuration TLV: Which CoS priorities are lossless (PFC-enabled).
ETS Configuration / Recommendation TLV: Bandwidth allocation percentages and traffic class mappings.
Application Priority TLV: Which applications (RoCEv2, FCoE) map to which CoS priorities.
DCBX supports a willing/non-willing negotiation model. Typically, switches are non-willing (authoritative) and host NICs are willing, so the switch pushes consistent policy to all endpoints.
Animation: DCQCN congestion control flow -- show ECN marking at the switch, CNP generation at receiver, sender rate reduction with multiplicative decrease, then gradual additive increase recovery; include PFC activation as a fail-safe overlay
Post-Quiz: Congestion Control Mechanisms
1. What happens when PFC buffer usage exceeds the xoff threshold?
Packets are dropped from the queue
The switch sends a PFC PAUSE frame upstream for that priority
ECN marks are added to all packets
The switch reroutes traffic to another port
2. What binary value in the IP ECN field indicates Congestion Experienced?
00
01
10
11
3. In DCBX, what does a "willing" device do?
It refuses all parameter changes from peers
It accepts DCB parameters from its peer
It only negotiates ETS, not PFC
It sends configuration to all devices in the VLAN
4. Which Cisco Nexus 9000 feature mitigates PFC storm propagation?
ECN threshold tuning
PFC watchdog timer
DCBX willing mode
ETS strict priority scheduling
5. In a well-tuned AI fabric, which mechanism should handle the vast majority of congestion events?
PFC PAUSE frames
Packet dropping
ECN (via DCQCN)
DCBX renegotiation
Section 4: QoS and Load Distribution for AI
Pre-Quiz: QoS and Load Distribution for AI
1. What are the three policy-map types in Cisco NX-OS MQC?
type input, type output, type global
type qos, type queuing, type network-qos
type classification, type scheduling, type marking
type ingress, type egress, type system
2. Why should you never mix drop-eligible and lossless traffic in the same queue?
It causes VLAN mismatches
PFC pauses the entire queue, so drop-eligible traffic gets unnecessarily paused and lossless traffic may be dropped
It exceeds the maximum MTU size
It disables ECMP load balancing
3. What is the main problem with static ECMP for AI training workloads?
It does not support IPv6
Elephant flows hash to the same path, causing congestion while other paths sit idle
It requires manual path configuration
It does not work with leaf-spine topologies
4. What is a "flowlet" in Dynamic Load Balancing?
A new TCP connection within an existing session
A burst of packets within a flow, separated from other bursts by idle gaps
A flow that is less than 1 KB in size
A multicast packet group
5. What throughput improvement did Cisco benchmarks show for DLB flowlet mode over static ECMP?
5%
18.6%
35%
55%
Key Points
NX-OS MQC uses three policy-map types: type qos (classification, per-interface ingress), type queuing (scheduling, egress), type network-qos (PFC/MTU, system-wide)
Never mix drop-eligible and lossless (no-drop) traffic in the same queue -- dedicate a queue exclusively to RoCEv2
RoCEv2 data uses DSCP 26 (AF31) and CNPs use DSCP 48 (CS6) with strict-priority scheduling
Static ECMP fails for AI workloads because elephant flows create link hot spots
DLB Flowlet mode (default on Nexus 9000 with NX-OS 10.5(1)F) re-hashes flows during idle gaps, delivering 18.6% throughput gain
Per-Packet Load Balancing (PLB) provides the most even distribution but requires endpoints that handle reordering
Approximate Fair Drop (AFD) distinguishes elephant vs. mice flows, applying more aggressive drop to elephant flows during congestion
Spine-leaf (Clos) topology provides non-blocking, consistent latency, and ECMP-friendly paths for AI fabrics
The NX-OS Modular QoS CLI (MQC) Framework
flowchart LR
subgraph Ingress["Ingress (per-interface)"]
QOS["policy-map type qos\n\nClassification and Marking\n- Match DSCP, CoS, ACL\n- Assign qos-group"]
end
subgraph Egress["Egress (per-interface or system-wide)"]
QUEUING["policy-map type queuing\n\nScheduling and Buffering\n- Bandwidth allocation\n- ECN thresholds\n- Strict priority queues"]
end
subgraph SystemWide["System-Wide (system qos)"]
NQOS["policy-map type network-qos\n\nLossless Behavior\n- PFC enable/disable\n- MTU per traffic class\n- DCBX parameters"]
end
QOS -->|"qos-group\nmapping"| QUEUING
NQOS -.->|"PFC and MTU\napplied globally"| QUEUING
style Ingress fill:#e8f4fd,stroke:#2980b9
style Egress fill:#fdf2e8,stroke:#e67e22
style SystemWide fill:#f0e8fd,stroke:#8e44ad
Cisco NX-OS implements QoS through the Modular QoS CLI (MQC) framework with three distinct policy-map types:
| Policy-Map Type | Purpose | Scope |
| --- | --- | --- |
| type qos | Classification and marking -- matches traffic by DSCP, CoS, ACL; assigns qos-groups | Ingress, per-interface |
| type queuing | Scheduling and buffering -- bandwidth allocation, ECN thresholds, strict-priority queues | Egress, per-interface or system-wide |
| type network-qos | Network-wide lossless behavior -- PFC enable/disable, MTU per traffic class, DCBX | System-wide under system qos |
Traffic Classification and Marking
| Traffic Type | DSCP Value | CoS | Queue | Treatment |
| --- | --- | --- | --- | --- |
| RoCEv2 Data | 26 (AF31) or 24 (CS3) | 3 | Queue 3 | Lossless: PFC-enabled, ECN-enabled |
| CNP | 48 (CS6) | 6 | Queue 6 | Strict priority, low-latency delivery |
| Storage (iSCSI/NVMe-oF) | 14 (AF13) | 4 | Queue 4 | May be lossless depending on requirements |
| Best-effort / Default | 0 (BE) | 0 | Queue 0 | Default drop-eligible treatment |
| Management / Control | 46 (EF) or 48 (CS6) | 7 | Queue 7 | Strict priority, small bandwidth |
A critical design rule: never mix drop-eligible and lossless (no-drop) traffic in the same queue. When PFC pauses a queue, all traffic in that queue is paused. Cisco recommends dedicating a queue exclusively to no-drop RoCEv2 traffic.
The Problem with Static ECMP
Traditional ECMP assigns entire flows to a single path based on a static 5-tuple hash. AI training generates large, persistent "elephant flows" between GPUs that can last minutes or hours. When multiple elephant flows hash to the same path, that link becomes congested while parallel links sit idle.
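A minimal sketch of the static hash, assuming CRC32 over the 5-tuple (real switch ASICs use their own hash functions and seeds):

```python
import zlib

def ecmp_path(five_tuple, n_paths=4):
    """Static ECMP: hash the 5-tuple once; every packet of the flow --
    however large the flow -- follows the same path for its lifetime."""
    key = "|".join(map(str, five_tuple)).encode()
    return zlib.crc32(key) % n_paths

flow = ("10.0.0.1", "10.0.1.1", 49152, 4791, 17)  # src, dst, sport, 4791, UDP

# With only a handful of elephant flows over four paths, an uneven
# spread is likely: some links carry multiple elephants, others sit idle.
paths = [ecmp_path(("10.0.0.%d" % i, "10.0.1.1", 49152 + i, 4791, 17))
         for i in range(8)]
```

Because a GPU cluster generates few, very large flows rather than many small ones, the law of large numbers never smooths out these hash collisions.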
Dynamic Load Balancing (DLB)
Cisco Nexus 9000 switches (NX-OS 10.5(1)F and later) support Layer 3 ECMP Dynamic Load Balancing:
Flowlet Load Balancing (FLB) -- Default Mode: Identifies natural idle gaps within long-lived flows. When a gap exceeds a configurable threshold, the next burst (flowlet) can be routed to a different ECMP path. Packets within a single flowlet always follow the same path, avoiding reordering. Cisco benchmarks showed 18.6% performance gain (43.48 GB/s vs. 36.66 GB/s) over static ECMP.
Per-Packet Load Balancing (PLB): Distributes individual packets across all available ECMP paths. Provides the most even distribution but requires endpoints that can handle packet reordering.
| Mode | Granularity | Reordering Risk | Best For |
| --- | --- | --- | --- |
| Static ECMP | Per-flow (5-tuple hash) | None | General-purpose workloads |
| Flowlet (FLB) | Per-flowlet (gap-based) | Minimal | RoCEv2 AI training (default) |
| Per-Packet (PLB) | Per-packet | Yes (endpoint must handle) | Endpoints with reorder tolerance |
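The flowlet mechanism can be sketched as follows. A real switch selects the least-utilized path at each gap; simple rotation stands in for that here, and the 500-microsecond gap threshold is illustrative, not an NX-OS default:

```python
FLOWLET_GAP_US = 500  # idle-gap threshold separating flowlets (illustrative)

class FlowletBalancer:
    """Re-hash a flow to a new path only when an idle gap exceeds the
    flowlet threshold, so packets within one burst never reorder."""

    def __init__(self, n_paths):
        self.n_paths = n_paths
        self.state = {}  # flow -> (last_seen_us, path)

    def route(self, flow, now_us):
        last, path = self.state.get(flow, (None, None))
        if last is None or now_us - last > FLOWLET_GAP_US:
            # New flowlet: free to pick a different path (rotation here
            # stands in for choosing the least-loaded ECMP member).
            path = 0 if path is None else (path + 1) % self.n_paths
        self.state[flow] = (now_us, path)
        return path
```

The key invariant: any packet arriving within the gap threshold of its predecessor stays on the predecessor's path, so in-flight packets on the old path always land before the first packet of the next flowlet.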
Approximate Fair Drop (AFD)
AFD distinguishes high-bandwidth "elephant flows" from short-lived "mice flows." It applies more aggressive drop probability to elephant flows during congestion, preventing them from starving mice flows. This is useful on front-end (north-south) networks where AI inference traffic coexists with smaller management and storage flows.
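AFD's core idea reduces to comparing each flow's measured arrival rate against a computed fair rate. The formula below captures that intuition and is a simplification of the actual Nexus implementation:

```python
def afd_drop_prob(flow_rate_gbps, fair_rate_gbps):
    """Simplified AFD: flows at or below the computed fair share are
    never dropped; drop probability grows with how far a flow's
    arrival rate exceeds the fair rate."""
    if flow_rate_gbps <= fair_rate_gbps:
        return 0.0
    return min(1.0, 1.0 - fair_rate_gbps / flow_rate_gbps)
```

Mice flows, which by definition arrive below the fair rate, pass through untouched, while a 20 Gbps elephant sharing a 10 Gbps fair rate sees half its packets dropped until it backs off.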
Spine-Leaf Topology for AI Fabrics
The spine-leaf (Clos) topology is the recommended architecture for AI/ML clusters:
Non-blocking: With 1:1 oversubscription ratio, every GPU can communicate at full line rate simultaneously.
Consistent latency: Every source-destination pair traverses the same number of hops (leaf-spine-leaf).
Horizontal scalability: Add capacity by adding more spine or leaf switches without redesigning the fabric.
ECMP-friendly: All leaf-to-leaf paths are equal cost, maximizing DLB effectiveness.
Cisco Nexus Dashboard Management Tools
Nexus Dashboard Fabric Controller (NDFC): Automates fabric provisioning including QoS configuration for PFC/ECN, template-based AI/ML network profiles, and consistent policy deployment.
Nexus Dashboard Insights (NDI): Provides real-time visibility into congestion hot spots, flow telemetry, PFC/ECN statistics, and anomaly detection.
Animation: Comparison of static ECMP vs. flowlet DLB -- show elephant flows hashing to the same path with static ECMP causing congestion, then flowlet DLB detecting idle gaps and redistributing flowlets across available paths for balanced utilization
Post-Quiz: QoS and Load Distribution for AI
1. Which MQC policy-map type is applied under "system qos" for network-wide lossless behavior?
type qos
type queuing
type network-qos
type global-qos
2. What distinguishes Flowlet DLB from Per-Packet PLB?
FLB only works on Layer 2 networks
FLB redistributes at idle gaps to avoid reordering; PLB distributes every packet but risks reordering
PLB requires dedicated InfiniBand hardware
FLB provides more even distribution than PLB
3. What does Approximate Fair Drop (AFD) do during congestion?
Pauses all traffic equally
Applies more aggressive drop probability to elephant flows to protect mice flows
Marks all packets with ECN CE bits
Reroutes elephant flows to dedicated paths
4. Which Cisco tool automates fabric provisioning and QoS configuration for AI/ML network profiles?
Nexus Dashboard Insights (NDI)
Cisco DNA Center
Nexus Dashboard Fabric Controller (NDFC)
Cisco ACI
5. Why is the spine-leaf topology ideal for AI/ML clusters?
It requires fewer cables than a traditional three-tier design
It provides non-blocking, consistent latency, horizontal scalability, and equal-cost paths for DLB