Chapter 9: High-Performance Networking — RDMA, RoCE, and QoS
Learning Objectives
Explain RDMA and RoCE/RoCEv2 protocol mechanisms and operations
Configure congestion control mechanisms including PFC, ECN, and ETS
Implement QoS for AI traffic on Cisco data center networks
Design load distribution strategies for AI fabric traffic
Section 1: RDMA Fundamentals
Pre-Quiz: RDMA Fundamentals
1. What is the primary benefit of RDMA over traditional TCP/IP networking?
It encrypts data in transit by default
It bypasses the OS kernel to transfer data directly between memory regions
It uses smaller packet sizes for faster transmission
It compresses data before sending across the network
2. Why is RDMA critical for distributed AI training?
It provides built-in model checkpointing
It eliminates the need for GPU-to-GPU communication
It reduces latency and CPU overhead for gradient exchange between GPUs
It automatically scales the number of GPUs in a cluster
3. Which RDMA implementation uses a dedicated physical fabric with its own switches and cables?
RoCE v1
RoCE v2
InfiniBand
iWARP
4. In the RDMA architecture, what component handles transport reliability and segmentation?
The operating system kernel
The application layer
The NIC hardware (HCA / RNIC)
The TCP/UDP stack
5. Approximately how much TCO savings can RoCEv2 offer compared to InfiniBand?
10%
25%
55%
80%
Key Points
RDMA enables direct memory-to-memory data transfers between computers, bypassing the OS kernel and CPU entirely
NVIDIA GPUDirect RDMA allows GPU-to-GPU memory transfers over the network without staging through system memory
RDMA applications use a user-space verbs API to communicate directly with the NIC (HCA or RNIC)
Three RDMA implementations exist: InfiniBand (dedicated fabric), RoCE v1 (L2 Ethernet), and RoCE v2 (routable UDP/IP)
RoCEv2 can achieve up to 55% TCO savings over InfiniBand but requires careful QoS configuration for lossless transport
What Is RDMA?
Remote Direct Memory Access (RDMA) is a technology that enables direct memory-to-memory data transfers between two computers without involving the operating system or CPU of either machine. By bypassing the kernel network stack entirely, RDMA dramatically reduces latency, eliminates CPU overhead, and increases throughput -- three properties essential for distributed AI training workloads.
Think of traditional networking like sending a package through a corporate mailroom: your data leaves the application, passes through several departments (the kernel, the TCP/IP stack, device drivers), gets wrapped in packaging at each stop, and eventually reaches the wire. RDMA is more like a private conveyor belt installed directly between two offices -- the data moves straight from the sender's desk to the receiver's desk with no intermediary handling.
Why RDMA Matters for AI/ML
In AI/ML clusters, GPUs across multiple servers must exchange gradient data at extremely high speed during distributed training. NVIDIA's GPUDirect RDMA enables GPU-to-GPU memory transfers over the network without staging data through system memory or the host CPU. This is critical because:
Latency sensitivity: Collective operations like AllReduce require synchronization across hundreds or thousands of GPUs. Every microsecond of network latency compounds across iterations.
CPU offload: Without RDMA, the host CPU must copy data between GPU memory, system memory, and the NIC. RDMA eliminates these copies.
Throughput: Modern AI fabrics operate at 100G, 200G, and 400G per port. RDMA's zero-copy architecture can saturate these links efficiently.
The RDMA Protocol Stack
flowchart LR
subgraph Traditional["Traditional TCP/IP Path"]
direction TB
A1["Application"] --> A2["Socket API"]
A2 --> A3["TCP/UDP Transport"]
A3 --> A4["IP Layer"]
A4 --> A5["Device Driver"]
A5 --> A6["NIC Hardware"]
end
subgraph RDMA["RDMA Path"]
direction TB
B1["Application"] --> B2["User-Space Verbs API"]
B2 --> B3["RNIC / HCA\n(handles transport,\nsegmentation, reassembly\nin hardware)"]
end
style Traditional fill:#f9e0e0,stroke:#c0392b
style RDMA fill:#e0f9e0,stroke:#27ae60
In a conventional stack, application data traverses the socket API, TCP/UDP transport, IP layer, and device driver before reaching the NIC hardware. Each layer involves kernel context switches and memory copies. With RDMA, applications interact directly with the NIC through a user-space verbs API. The NIC itself handles transport reliability, segmentation, and reassembly in hardware.
Three RDMA Implementations
| Feature | InfiniBand | RoCE v1 | RoCE v2 |
| --- | --- | --- | --- |
| Year Introduced | ~1999 | 2010 | 2014 |
| Layer | Dedicated L1-L4 fabric | Ethernet L2 only | Ethernet + UDP/IP (L3 routable) |
| EtherType / Port | N/A (own physical layer) | EtherType 0x8915 | UDP destination port 4791 |
| Routability | IB subnet routing (OpenSM) | Same L2 broadcast domain only | Full IP routing across subnets |
| Lossless Guarantee | Built-in credit-based flow control | Requires PFC/ECN on Ethernet | Requires PFC/ECN on Ethernet |
| Typical Use Case | HPC, scientific computing | Legacy/niche deployments | AI/ML data center fabrics |
InfiniBand vs. Ethernet RDMA: The Trade-Off
InfiniBand remains the gold standard for raw RDMA performance. Its dedicated fabric provides the lowest, most consistent latency because every component is purpose-built for RDMA. Flow control is handled natively through a credit-based mechanism, so there is no need for complex PFC/ECN configuration.
However, InfiniBand requires a completely separate network infrastructure with specialized switches, cables, and management tools (such as the OpenSM subnet manager). Ethernet-based RoCEv2 deployments can achieve up to 55% total cost of ownership (TCO) savings -- including 56% OpEx savings and 55% CapEx savings over three years. For organizations already invested in Ethernet infrastructure, RoCEv2 allows them to add RDMA capabilities without building a parallel network.
Animation: Side-by-side comparison of traditional TCP/IP data path (multiple kernel copies) vs. RDMA zero-copy path, showing packets moving from application memory to wire
Post-Quiz: RDMA Fundamentals
1. Which API do RDMA applications use to communicate with the NIC?
Socket API
User-space verbs API
POSIX file I/O API
REST API
2. What makes RoCE v1 impractical for modern leaf-spine data center designs?
It requires InfiniBand switches
It only operates at Layer 2 and cannot cross router boundaries
It does not support lossless transport
It is limited to 10G speeds
3. NVIDIA GPUDirect RDMA enables what specific capability?
GPU-to-CPU memory transfers over PCIe
GPU-to-GPU memory transfers over the network without staging through system memory
Direct disk-to-GPU data loading
Automatic GPU memory defragmentation
4. Which flow control mechanism does InfiniBand use natively (without additional configuration)?
PFC (Priority Flow Control)
ECN (Explicit Congestion Notification)
Credit-based flow control
TCP window scaling
5. In the context of RDMA, what does HCA stand for?
High-Capacity Adapter
Host Channel Adapter
Hardware Control Agent
Host Convergence Accelerator
Section 2: RoCE and RoCEv2 Protocols
Pre-Quiz: RoCE and RoCEv2 Protocols
1. What UDP destination port does RoCEv2 use?
443
4791
8080
3260
2. What EtherType does RoCE v1 use for encapsulation?
0x0800
0x86DD
0x8915
0x8906
3. What is the approximate end-to-end latency of the Cisco Nexus 9000 for leaf-spine traffic?
50 microseconds
4.5 microseconds
1 millisecond
100 nanoseconds
4. Which field in the RoCEv2 packet provides end-to-end data integrity verification?
FCS (Frame Check Sequence)
UDP checksum
ICRC (Invariant CRC)
IP header checksum
5. What is the purpose of the UDP source port in RoCEv2 packets?
It identifies the sending application
It is derived from a hash of flow parameters for ECMP load balancing
It always matches the destination port 4791
It is randomly assigned by the operating system
Key Points
RoCE v1 (2010) uses EtherType 0x8915 and operates at Layer 2 only -- no routing across subnets
RoCE v2 (2014) wraps IB transport in UDP/IP (destination port 4791), enabling full Layer 3 routability
The IB Base Transport Header (BTH) contains opcode, partition key, destination queue pair number, and packet sequence number
ICRC provides end-to-end integrity that survives IP header modifications during routing
Deploying RoCEv2 on Nexus 9000 requires coordinated configuration of classification (type qos), queuing (type queuing), and network-qos policies
Cisco Nexus 9000 provides approximately 4.5 microsecond end-to-end latency across leaf and spine
RoCE v1 vs. RoCE v2
RoCE v1, introduced by the InfiniBand Trade Association (IBTA) in 2010, encapsulates InfiniBand transport packets directly inside Ethernet frames using EtherType 0x8915. Because it operates at Layer 2 only, RoCE v1 traffic cannot cross router boundaries -- both communicating hosts must reside in the same Ethernet broadcast domain.
RoCE v2, introduced in 2014, solves this by wrapping the InfiniBand transport header inside a UDP/IP packet (IPv4 or IPv6) with UDP destination port 4791. This "Routable RoCE" can traverse any IP network, making it fully compatible with leaf-spine Clos topologies. RoCEv2 is the dominant Ethernet-based RDMA transport for AI workloads today.
RoCEv2 Packet Format
flowchart LR
subgraph RoCEv1["RoCE v1 Frame"]
direction LR
V1A["Ethernet Header\n(EtherType 0x8915)"] --> V1B["IB GRH"] --> V1C["IB BTH"] --> V1D["Payload"] --> V1E["ICRC"] --> V1F["FCS"]
end
subgraph RoCEv2["RoCE v2 Packet"]
direction LR
V2A["Ethernet\nHeader"] --> V2B["IP\nHeader"] --> V2C["UDP Header\n(dst 4791)"] --> V2D["IB BTH"] --> V2E["Payload"] --> V2F["ICRC"] --> V2G["FCS"]
end
The RoCEv2 packet adds two key headers on top of the original InfiniBand transport:
IP Header: Provides Layer 3 addressing and enables routing. The DSCP field is used for QoS classification and ECN signaling.
UDP Header: Uses destination port 4791. The source port is typically derived from a hash of flow parameters, which is important for ECMP load balancing.
IB Base Transport Header (BTH): Contains the InfiniBand operation code, partition key, destination queue pair (QP) number, and packet sequence number. NX-OS 10.6(1)F supports filtering on BTH fields in access lists.
ICRC (Invariant CRC): Provides end-to-end data integrity verification that is not affected by IP header modifications during routing.
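As a rough illustration of this layering, the sketch below packs a simplified 12-byte BTH and derives an ECMP-friendly UDP source port from queue pair numbers. The field layout is abbreviated (solicited-event, migration, pad, and version bits are left at zero) and the hash constant is illustrative, not taken from any NIC implementation.

```python
import struct

ROCEV2_UDP_DPORT = 4791  # IANA-assigned UDP destination port for RoCEv2

def build_bth(opcode: int, pkey: int, dest_qp: int, psn: int) -> bytes:
    """Pack a simplified 12-byte IB Base Transport Header: opcode,
    partition key, 24-bit destination QP, 24-bit packet sequence number."""
    word1 = struct.pack("!BBH", opcode, 0, pkey)   # opcode, flags/tver, pkey
    word2 = struct.pack("!I", dest_qp & 0xFFFFFF)  # top byte = reserved
    word3 = struct.pack("!I", psn & 0xFFFFFF)      # top byte = ack-req/reserved
    return word1 + word2 + word3

def ecmp_udp_sport(src_qp: int, dest_qp: int) -> int:
    """Derive the UDP source port from flow parameters (queue pair numbers
    here) so distinct RDMA flows hash onto different ECMP paths.
    The mixing constant is illustrative only."""
    return 0xC000 | ((src_qp * 0x9E3779B1 ^ dest_qp) & 0x3FFF)

bth = build_bth(opcode=0x04, pkey=0xFFFF, dest_qp=0x12345, psn=1000)
```

Because intermediate routers never touch the BTH or payload, the ICRC computed over these invariant fields survives IP header rewrites end to end.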
Deploying RoCEv2 on Cisco Nexus 9000
Deploying RoCEv2 requires configuring several interdependent features: PFC for lossless transport, ECN for congestion signaling, ETS for bandwidth allocation, and DCBX for parameter negotiation. Here is a basic RoCEv2 QoS configuration using the Modular QoS CLI (MQC):
! Step 1: Classification -- match RoCEv2 data and CNP traffic
class-map type qos match-all class-rocev2-data
match dscp 26
class-map type qos match-all class-cnp
match dscp 48
! Step 2: QoS policy -- set CoS and queue group for matched traffic
policy-map type qos rocev2-ingress-policy
class class-rocev2-data
set qos-group 3
class class-cnp
set qos-group 6
! Step 3: Queuing policy -- configure ECN thresholds and scheduling
policy-map type queuing rocev2-queuing-policy
class type queuing c-out-8q-q3
bandwidth percent 50
random-detect minimum-threshold 150 kbytes maximum-threshold 3000 kbytes
congestion-control ecn
class type queuing c-out-8q-q6
priority level 1
! Step 4: Network QoS -- enable PFC on the RoCEv2 traffic class
policy-map type network-qos rocev2-network-policy
class type network-qos c-8q-nq3
pause no-drop
mtu 9216
! Step 5: Apply policies
system qos
service-policy type network-qos rocev2-network-policy
service-policy type queuing output rocev2-queuing-policy
interface Ethernet1/1
service-policy type qos input rocev2-ingress-policy
Animation: RoCEv2 packet encapsulation walkthrough -- show how an InfiniBand transport payload gets wrapped with BTH, then UDP header (port 4791), then IP header, then Ethernet header as it traverses the stack
Post-Quiz: RoCE and RoCEv2 Protocols
1. In the Nexus MQC configuration, which DSCP value is used for RoCEv2 data traffic?
0
26
46
48
2. Which DSCP value is assigned to Congestion Notification Packets (CNPs)?
26
34
46
48
3. Under which NX-OS CLI context is the network-qos policy applied?
interface configuration
router ospf
system qos
vlan configuration
4. Why was RoCE v2 necessary for modern data center deployments?
RoCE v1 had no congestion control
RoCE v1 could not cross Layer 3 boundaries in leaf-spine topologies
RoCE v1 was limited to 10G link speeds
RoCE v1 required InfiniBand switches
5. What command enables lossless (no-drop) behavior for a traffic class in NX-OS?
congestion-control ecn
priority level 1
pause no-drop
bandwidth percent 50
Section 3: Congestion Control Mechanisms
Pre-Quiz: Congestion Control Mechanisms
1. What IEEE standard defines Priority Flow Control (PFC)?
IEEE 802.1Qaz
IEEE 802.1Qbb
IEEE 802.3x
RFC 3168
2. What does ECN do when a switch detects congestion?
Drops the congested packets
Sends a PFC PAUSE frame upstream
Marks packets with Congestion Experienced (CE) bits
Reroutes traffic to an alternate path
3. In the DCQCN framework, what role does the receiver NIC play?
Congestion Point -- marks packets
Reaction Point -- reduces sending rate
Notification Point -- generates CNP back to sender
Distribution Point -- load balances traffic
4. What protocol does DCBX use as its transport?
BGP
OSPF
LLDP
CDP
5. What is the purpose of Enhanced Transmission Selection (ETS)?
It encrypts traffic between switches
It provides bandwidth allocation and scheduling across traffic classes
It detects and mitigates PFC storms
It maps IP addresses to MAC addresses
Key Points
PFC (IEEE 802.1Qbb) provides per-priority pause to prevent packet loss -- it is the emergency fail-safe
PFC uses xoff/xon thresholds: PAUSE frame sent when buffer exceeds xoff, resume when buffer drops below xon
PFC risks include head-of-line blocking, PFC storms, and deadlocks -- mitigated by PFC watchdog and deadlock detection on Nexus 9000
ECN (RFC 3168) is proactive end-to-end congestion signaling: switches mark packets CE, receivers generate CNPs (DSCP 48)
DCQCN defines three roles: Congestion Point (switch), Notification Point (receiver NIC), Reaction Point (sender NIC)
In a well-tuned network, ECN handles routine congestion; PFC fires only during sudden bursts
ETS (IEEE 802.1Qaz) guarantees minimum bandwidth percentages per traffic class with strict priority and weighted scheduling
DCBX negotiates PFC, ETS, and application priority settings using a willing/non-willing model over LLDP
Priority Flow Control (PFC) -- IEEE 802.1Qbb
Standard Ethernet is inherently lossy -- when switch buffers overflow, packets are silently dropped. For RoCEv2, packet loss is catastrophic: it triggers expensive transport-layer retransmissions that can increase AI training job completion times by orders of magnitude.
PFC extends the legacy IEEE 802.3x PAUSE mechanism to operate on a per-priority basis. Rather than pausing all traffic on a link, PFC can selectively pause only the congested Class of Service (CoS) value while allowing other priorities to continue flowing.
stateDiagram-v2
[*] --> Normal
Normal : Buffer below xon threshold\nAll priorities transmitting
Normal --> Rising : Buffer utilization\nincreasing
Rising : Buffer between xon and xoff\nMonitoring per-CoS usage
Rising --> PFC_Pause : Buffer exceeds\nxoff threshold
PFC_Pause : PFC PAUSE frame sent upstream\nCongested CoS paused\nOther CoS priorities flow normally
PFC_Pause --> Draining : Upstream stops\nsending paused CoS
Draining : Buffer draining\nWaiting for xon threshold
Draining --> Normal : Buffer drops\nbelow xon threshold\n(Resume signal sent)
Rising --> Normal : Congestion\nclears naturally
How PFC operates:
Each switch port monitors ingress buffer utilization per CoS priority (0 through 7).
When buffer usage exceeds the xoff threshold, the switch sends a PFC PAUSE frame upstream, specifying which priority to pause.
The upstream device stops transmitting that priority for the specified pause quanta.
When buffer usage drops below the xon threshold, the switch signals resume.
Traffic on all other priorities continues uninterrupted.
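The xoff/xon behavior above can be sketched as a toy per-priority buffer model (the threshold values are illustrative, not Nexus defaults):

```python
class PfcBuffer:
    """Toy model of one CoS priority's ingress buffer with PFC
    xoff/xon thresholds (values in KB, illustrative only)."""

    def __init__(self, xoff_kb=300, xon_kb=150):
        self.xoff, self.xon = xoff_kb, xon_kb
        self.depth = 0
        self.paused = False  # True => PAUSE frame outstanding upstream

    def enqueue(self, kb):
        self.depth += kb
        if not self.paused and self.depth > self.xoff:
            self.paused = True   # send PFC PAUSE for this priority only
        return self.paused

    def dequeue(self, kb):
        self.depth = max(0, self.depth - kb)
        if self.paused and self.depth < self.xon:
            self.paused = False  # send resume (pause quanta of zero)
        return self.paused
```

The gap between xoff and xon provides hysteresis, so a buffer hovering near one threshold does not flap between pause and resume on every packet.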
PFC Risks and Mitigations
| Risk | Description | Cisco Mitigation |
| --- | --- | --- |
| Head-of-line blocking | Pausing one flow blocks all flows sharing the same priority | Dedicate a CoS to RoCEv2; avoid mixing traffic types |
| PFC storms | Cascading pause frames propagate across the fabric | PFC watchdog timer on Nexus 9000 |
| Deadlocks | Circular buffer dependencies cause permanent traffic stall | PFC deadlock detection and recovery |
| Reduced throughput | Excessive pausing degrades effective bandwidth | Tune ECN thresholds so PFC fires only as a last resort |
Explicit Congestion Notification (ECN) -- RFC 3168
While PFC is a reactive, hop-by-hop emergency brake, ECN is a proactive, end-to-end congestion signaling mechanism. ECN marks packets rather than dropping them, allowing endpoints to reduce their sending rate before buffers overflow. The signaling sequence works as follows:
The sender sets the ECN field in the IP header to ECN-Capable Transport (ECT) -- binary 01 or 10.
When a switch queue depth crosses a configured threshold, it changes the ECN field to Congestion Experienced (CE) -- binary 11.
The receiver detects the CE marking and generates a Congestion Notification Packet (CNP) back to the sender (DSCP 48, strict-priority queue).
The sender receives the CNP and reduces its transmission rate.
ECN Threshold Parameters
| Parameter | Function |
| --- | --- |
| Minimum threshold | Below this queue depth, no ECN marking occurs |
| Maximum threshold | Above this queue depth, all ECN-capable packets are marked CE |
| Between min and max | Probabilistic marking -- probability increases linearly with queue depth |
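These thresholds translate into a simple marking-probability curve. The sketch below uses the 150/3000 KB values from the earlier rocev2-queuing-policy example:

```python
def ecn_mark_probability(queue_kb, min_kb=150, max_kb=3000, max_p=1.0):
    """WRED/ECN-style marking probability: zero below the minimum
    threshold, max_p at or above the maximum, linear in between.
    Thresholds mirror the queuing-policy example (150/3000 KB)."""
    if queue_kb <= min_kb:
        return 0.0
    if queue_kb >= max_kb:
        return max_p
    return max_p * (queue_kb - min_kb) / (max_kb - min_kb)
```

Tuning these two numbers controls how early sources are asked to slow down: a low minimum threshold signals congestion sooner at the cost of some throughput on bursty but healthy queues.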
DCQCN: Putting ECN and PFC Together
sequenceDiagram
participant Sender as Sender NIC (Reaction Point)
participant Switch as Switch (Congestion Point)
participant Receiver as Receiver NIC (Notification Point)
Sender->>Switch: Data packets with ECN ECT bits set
Note over Switch: Queue depth crosses ECN threshold
Switch->>Receiver: Packets marked with ECN CE (11)
Note over Receiver: Detects CE-marked packets
Receiver->>Sender: CNP (DSCP 48, strict priority)
Note over Sender: Multiplicative decrease of sending rate
Sender->>Switch: Reduced-rate data packets
Note over Sender: Additive increase recovery over time
Sender->>Switch: Gradually increasing rate
rect rgb(255, 230, 230)
Note over Sender,Receiver: If ECN cannot control congestion (sudden incast)
Switch-->>Sender: PFC PAUSE frame (emergency fail-safe)
Note over Sender: Transmission halted for paused CoS
end
| DCQCN Role | Location | Function |
| --- | --- | --- |
| Congestion Point (CP) | Switch | Monitors queue depth; marks packets with ECN CE bits |
| Notification Point (NP) | Receiver NIC | Detects CE-marked packets; generates CNP back to sender |
| Reaction Point (RP) | Sender NIC | Receives CNP; reduces rate via multiplicative decrease; recovers via additive increase |
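A minimal sketch of one reaction-point step, assuming a fixed rate-reduction factor and additive-increase constant (real NICs adapt the reduction factor via an EWMA of CNP arrivals, and recovery typically moves through fast-recovery and hyper-increase phases):

```python
def dcqcn_rate_step(rate_gbps, line_gbps, cnp_received,
                    alpha=0.5, ai_gbps=0.5):
    """One simplified DCQCN reaction-point step: multiplicative
    decrease when a CNP arrived in this interval, additive increase
    toward line rate otherwise. Constants are illustrative."""
    if cnp_received:
        return rate_gbps * (1 - alpha / 2)          # cut sending rate
    return min(line_gbps, rate_gbps + ai_gbps)      # recover gradually
```

The asymmetry is deliberate: rate drops sharply the moment congestion is signaled, then climbs back slowly, keeping queue depths near the ECN threshold rather than oscillating into PFC territory.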
Enhanced Transmission Selection (ETS) -- IEEE 802.1Qaz
ETS provides bandwidth allocation and scheduling across traffic classes. It ensures RoCEv2 traffic receives a guaranteed share of link bandwidth without starving best-effort traffic.
Traffic Class Mapping: Up to 8 traffic classes (TC 0-7), each mapped to one or more 802.1p CoS priorities.
Bandwidth Allocation: Each traffic class receives a guaranteed minimum percentage (must sum to 100%). Unused bandwidth is redistributed proportionally.
Scheduling: ETS supports both strict priority (for CNPs) and weighted scheduling (for data classes).
Example ETS Allocation for an AI Fabric
| Traffic Class | Traffic Type | CoS | Bandwidth | Scheduling |
| --- | --- | --- | --- | --- |
| TC 3 | RoCEv2 Data | 3 | 50% | Weighted (ETS) |
| TC 6 | CNP | 6 | -- | Strict Priority |
| TC 4 | Storage (NVMe-oF) | 4 | 25% | Weighted (ETS) |
| TC 0 | Best-effort | 0 | 20% | Weighted (ETS) |
| TC 7 | Management/Control | 7 | 5% | Weighted (ETS) |
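The rule that unused bandwidth is redistributed proportionally can be sketched as a small water-filling loop (a toy model of 802.1Qaz weighted scheduling, not switch-accurate, and ignoring the strict-priority class):

```python
def ets_allocate(weights, demands, link=100.0):
    """Approximate ETS bandwidth sharing: each class is guaranteed
    weight/sum(weights) of the link; capacity a class does not use is
    redistributed by weight among classes that still have demand."""
    alloc = [0.0] * len(weights)
    free = link
    active = set(range(len(weights)))
    while free > 1e-9 and active:
        total_w = sum(weights[i] for i in active)
        grants = {i: free * weights[i] / total_w for i in active}
        free = 0.0
        for i, g in grants.items():
            take = min(g, demands[i] - alloc[i])   # cap at remaining demand
            alloc[i] += take
            free += g - take                       # leftover to redistribute
        active = {i for i in active if demands[i] - alloc[i] > 1e-9}
    return alloc
```

For example, with the table's weights and a quiet storage class, the RoCEv2 class can temporarily exceed its 50% guarantee because the idle share flows back to classes that still have traffic queued.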
Data Center Bridging Exchange (DCBX) -- IEEE 802.1Qaz
DCBX is the discovery and negotiation protocol that automates DCB parameter exchange between connected devices. It rides on top of LLDP and negotiates three key TLV parameters:
PFC Configuration TLV: Which CoS priorities are lossless (PFC-enabled).
ETS Configuration / Recommendation TLV: Bandwidth allocation percentages and traffic class mappings.
Application Priority TLV: Which applications (RoCEv2, FCoE) map to which CoS priorities.
DCBX supports a willing/non-willing negotiation model. Typically, switches are non-willing (authoritative) and host NICs are willing, so the switch pushes consistent policy to all endpoints.
Animation: DCQCN congestion control flow -- show ECN marking at the switch, CNP generation at receiver, sender rate reduction with multiplicative decrease, then gradual additive increase recovery; include PFC activation as a fail-safe overlay
Post-Quiz: Congestion Control Mechanisms
1. What happens when PFC buffer usage exceeds the xoff threshold?
Packets are dropped from the queue
The switch sends a PFC PAUSE frame upstream for that priority
ECN marks are added to all packets
The switch reroutes traffic to another port
2. What binary value in the IP ECN field indicates Congestion Experienced?
00
01
10
11
3. In DCBX, what does a "willing" device do?
It refuses all parameter changes from peers
It accepts DCB parameters from its peer
It only negotiates ETS, not PFC
It sends configuration to all devices in the VLAN
4. Which Cisco Nexus 9000 feature mitigates PFC storm propagation?
ECN threshold tuning
PFC watchdog timer
DCBX willing mode
ETS strict priority scheduling
5. In a well-tuned AI fabric, which mechanism should handle the vast majority of congestion events?
PFC PAUSE frames
Packet dropping
ECN (via DCQCN)
DCBX renegotiation
Section 4: QoS and Load Distribution for AI
Pre-Quiz: QoS and Load Distribution for AI
1. What are the three policy-map types in Cisco NX-OS MQC?
type input, type output, type global
type qos, type queuing, type network-qos
type classification, type scheduling, type marking
type ingress, type egress, type system
2. Why should you never mix drop-eligible and lossless traffic in the same queue?
It causes VLAN mismatches
PFC pauses the entire queue, so drop-eligible traffic gets unnecessarily paused and lossless traffic may be dropped
It exceeds the maximum MTU size
It disables ECMP load balancing
3. What is the main problem with static ECMP for AI training workloads?
It does not support IPv6
Elephant flows hash to the same path, causing congestion while other paths sit idle
It requires manual path configuration
It does not work with leaf-spine topologies
4. What is a "flowlet" in Dynamic Load Balancing?
A new TCP connection within an existing session
A burst of packets within a flow, separated from other bursts by idle gaps
A flow that is less than 1 KB in size
A multicast packet group
5. What throughput improvement did Cisco benchmarks show for DLB flowlet mode over static ECMP?
5%
18.6%
35%
55%
Key Points
NX-OS MQC uses three policy-map types: type qos (classification, per-interface ingress), type queuing (scheduling, egress), type network-qos (PFC/MTU, system-wide)
Never mix drop-eligible and lossless (no-drop) traffic in the same queue -- dedicate a queue exclusively to RoCEv2
RoCEv2 data uses DSCP 26 (AF31) and CNPs use DSCP 48 (CS6) with strict-priority scheduling
Static ECMP fails for AI workloads because elephant flows create link hot spots
DLB Flowlet mode (default on Nexus 9000 with NX-OS 10.5(1)F) re-hashes flows during idle gaps, delivering 18.6% throughput gain
Per-Packet Load Balancing (PLB) provides the most even distribution but requires endpoints that handle reordering
Approximate Fair Drop (AFD) distinguishes elephant vs. mice flows, applying more aggressive drop to elephant flows during congestion
Spine-leaf (Clos) topology provides non-blocking, consistent latency, and ECMP-friendly paths for AI fabrics
The NX-OS Modular QoS CLI (MQC) Framework
flowchart LR
subgraph Ingress["Ingress (per-interface)"]
QOS["policy-map type qos\n\nClassification and Marking\n- Match DSCP, CoS, ACL\n- Assign qos-group"]
end
subgraph Egress["Egress (per-interface or system-wide)"]
QUEUING["policy-map type queuing\n\nScheduling and Buffering\n- Bandwidth allocation\n- ECN thresholds\n- Strict priority queues"]
end
subgraph SystemWide["System-Wide (system qos)"]
NQOS["policy-map type network-qos\n\nLossless Behavior\n- PFC enable/disable\n- MTU per traffic class\n- DCBX parameters"]
end
QOS -->|"qos-group\nmapping"| QUEUING
NQOS -.->|"PFC and MTU\napplied globally"| QUEUING
style Ingress fill:#e8f4fd,stroke:#2980b9
style Egress fill:#fdf2e8,stroke:#e67e22
style SystemWide fill:#f0e8fd,stroke:#8e44ad
Cisco NX-OS implements QoS through the Modular QoS CLI (MQC) framework with three distinct policy-map types:
| Policy-Map Type | Purpose | Scope |
| --- | --- | --- |
| type qos | Classification and marking -- matches traffic by DSCP, CoS, ACL; assigns qos-groups | Ingress, per-interface |
| type queuing | Scheduling and buffering -- bandwidth allocation, ECN thresholds, strict-priority queues | Egress, per-interface or system-wide |
| type network-qos | Network-wide lossless behavior -- PFC enable/disable, MTU per traffic class, DCBX | System-wide under system qos |
Traffic Classification and Marking
| Traffic Type | DSCP Value | CoS | Queue | Treatment |
| --- | --- | --- | --- | --- |
| RoCEv2 Data | 26 (AF31) or 24 (CS3) | 3 | Queue 3 | Lossless: PFC-enabled, ECN-enabled |
| CNP | 48 (CS6) | 6 | Queue 6 | Strict priority, low-latency delivery |
| Storage (iSCSI/NVMe-oF) | 14 (AF13) | 4 | Queue 4 | May be lossless depending on requirements |
| Best-effort / Default | 0 (BE) | 0 | Queue 0 | Default drop-eligible treatment |
| Management / Control | 46 (EF) or 48 (CS6) | 7 | Queue 7 | Strict priority, small bandwidth |
A critical design rule: never mix drop-eligible and lossless (no-drop) traffic in the same queue. When PFC pauses a queue, all traffic in that queue is paused. Cisco recommends dedicating a queue exclusively to no-drop RoCEv2 traffic.
The Problem with Static ECMP
Traditional ECMP assigns entire flows to a single path based on a static 5-tuple hash. AI training generates large, persistent "elephant flows" between GPUs that can last minutes or hours. When multiple elephant flows hash to the same path, that link becomes congested while parallel links sit idle.
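A minimal sketch of the static hash, assuming CRC32 over the 5-tuple (real switch ASICs use their own hash functions and seeds):

```python
import zlib

def ecmp_path(five_tuple, n_paths=4):
    """Static ECMP: hash the 5-tuple once; every packet of the flow --
    however large the flow -- follows the same path for its lifetime."""
    key = "|".join(map(str, five_tuple)).encode()
    return zlib.crc32(key) % n_paths

flow = ("10.0.0.1", "10.0.1.1", 49152, 4791, 17)  # src, dst, sport, 4791, UDP

# With only a handful of elephant flows over four paths, an uneven
# spread is likely: some links carry multiple elephants, others sit idle.
paths = [ecmp_path(("10.0.0.%d" % i, "10.0.1.1", 49152 + i, 4791, 17))
         for i in range(8)]
```

Because a GPU cluster generates few, very large flows rather than many small ones, the law of large numbers never smooths out these hash collisions.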
Dynamic Load Balancing (DLB)
Cisco Nexus 9000 switches (NX-OS 10.5(1)F and later) support Layer 3 ECMP Dynamic Load Balancing:
Flowlet Load Balancing (FLB) -- Default Mode: Identifies natural idle gaps within long-lived flows. When a gap exceeds a configurable threshold, the next burst (flowlet) can be routed to a different ECMP path. Packets within a single flowlet always follow the same path, avoiding reordering. Cisco benchmarks showed 18.6% performance gain (43.48 GB/s vs. 36.66 GB/s) over static ECMP.
Per-Packet Load Balancing (PLB): Distributes individual packets across all available ECMP paths. Provides the most even distribution but requires endpoints that can handle packet reordering.
| Mode | Granularity | Reordering Risk | Best For |
| --- | --- | --- | --- |
| Static ECMP | Per-flow (5-tuple hash) | None | General-purpose workloads |
| Flowlet (FLB) | Per-flowlet (gap-based) | Minimal | RoCEv2 AI training (default) |
| Per-Packet (PLB) | Per-packet | Yes (endpoint must handle) | Endpoints with reorder tolerance |
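The flowlet mechanism can be sketched as follows. A real switch selects the least-utilized path at each gap; simple rotation stands in for that here, and the 500-microsecond gap threshold is illustrative, not an NX-OS default:

```python
FLOWLET_GAP_US = 500  # idle-gap threshold separating flowlets (illustrative)

class FlowletBalancer:
    """Re-hash a flow to a new path only when an idle gap exceeds the
    flowlet threshold, so packets within one burst never reorder."""

    def __init__(self, n_paths):
        self.n_paths = n_paths
        self.state = {}  # flow -> (last_seen_us, path)

    def route(self, flow, now_us):
        last, path = self.state.get(flow, (None, None))
        if last is None or now_us - last > FLOWLET_GAP_US:
            # New flowlet: free to pick a different path (rotation here
            # stands in for choosing the least-loaded ECMP member).
            path = 0 if path is None else (path + 1) % self.n_paths
        self.state[flow] = (now_us, path)
        return path
```

The key invariant: any packet arriving within the gap threshold of its predecessor stays on the predecessor's path, so in-flight packets on the old path always land before the first packet of the next flowlet.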
Approximate Fair Drop (AFD)
AFD distinguishes high-bandwidth "elephant flows" from short-lived "mice flows." It applies more aggressive drop probability to elephant flows during congestion, preventing them from starving mice flows. This is useful on front-end (north-south) networks where AI inference traffic coexists with smaller management and storage flows.
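AFD's core idea reduces to comparing each flow's measured arrival rate against a computed fair rate. The formula below captures that intuition and is a simplification of the actual Nexus implementation:

```python
def afd_drop_prob(flow_rate_gbps, fair_rate_gbps):
    """Simplified AFD: flows at or below the computed fair share are
    never dropped; drop probability grows with how far a flow's
    arrival rate exceeds the fair rate."""
    if flow_rate_gbps <= fair_rate_gbps:
        return 0.0
    return min(1.0, 1.0 - fair_rate_gbps / flow_rate_gbps)
```

Mice flows, which by definition arrive below the fair rate, pass through untouched, while a 20 Gbps elephant sharing a 10 Gbps fair rate sees half its packets dropped until it backs off.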
Spine-Leaf Topology for AI Fabrics
The spine-leaf (Clos) topology is the recommended architecture for AI/ML clusters:
Non-blocking: With 1:1 oversubscription ratio, every GPU can communicate at full line rate simultaneously.
Consistent latency: Every source-destination pair traverses the same number of hops (leaf-spine-leaf).
Horizontal scalability: Add capacity by adding more spine or leaf switches without redesigning the fabric.
ECMP-friendly: All leaf-to-leaf paths are equal cost, maximizing DLB effectiveness.
Cisco Nexus Dashboard Management Tools
Nexus Dashboard Fabric Controller (NDFC): Automates fabric provisioning including QoS configuration for PFC/ECN, template-based AI/ML network profiles, and consistent policy deployment.
Nexus Dashboard Insights (NDI): Provides real-time visibility into congestion hot spots, flow telemetry, PFC/ECN statistics, and anomaly detection.
Animation: Comparison of static ECMP vs. flowlet DLB -- show elephant flows hashing to the same path with static ECMP causing congestion, then flowlet DLB detecting idle gaps and redistributing flowlets across available paths for balanced utilization
Post-Quiz: QoS and Load Distribution for AI
1. Which MQC policy-map type is applied under "system qos" for network-wide lossless behavior?
type qos
type queuing
type network-qos
type global-qos
2. What distinguishes Flowlet DLB from Per-Packet PLB?
FLB only works on Layer 2 networks
FLB redistributes at idle gaps to avoid reordering; PLB distributes every packet but risks reordering
PLB requires dedicated InfiniBand hardware
FLB provides more even distribution than PLB
3. What does Approximate Fair Drop (AFD) do during congestion?
Pauses all traffic equally
Applies more aggressive drop probability to elephant flows to protect mice flows
Marks all packets with ECN CE bits
Reroutes elephant flows to dedicated paths
4. Which Cisco tool automates fabric provisioning and QoS configuration for AI/ML network profiles?
Nexus Dashboard Insights (NDI)
Cisco DNA Center
Nexus Dashboard Fabric Controller (NDFC)
Cisco ACI
5. Why is the spine-leaf topology ideal for AI/ML clusters?
It requires fewer cables than a traditional three-tier design
It provides non-blocking, consistent latency, horizontal scalability, and equal-cost paths for DLB