Chapter 5: Network Architecture for AI Workloads

Learning Objectives

Section 1: Network Requirements for AI Workloads

Pre-Quiz: Test Your Baseline Knowledge

1. What per-node bandwidth does a modern AI training cluster typically require?

- 100-200 Gbps
- 400 Gbps
- 800 Gbps to 1.6 Tbps
- 10 Gbps

2. Which RDMA transport technology delivers the lowest latency for GPU-to-GPU communication?

- TCP/IP
- RoCE v2
- InfiniBand
- iSCSI

3. What happens in a spine-leaf fabric when one spine switch fails in an 8-spine design?

- Complete network outage
- 50% bandwidth reduction
- 12.5% bandwidth reduction with traffic rerouted
- No impact at all

4. Why is the AI backend RDMA fabric typically isolated from general-purpose traffic?

- To save on switch port costs
- To prevent non-AI traffic from triggering PFC storms
- Because RDMA only works on separate physical cables
- For regulatory compliance reasons only

5. What oversubscription ratio is acceptable for AI inference networks?

- 1:1 only
- 2:1 to 3:1
- 5:1 or higher
- Oversubscription does not matter for inference

Key Points

Bandwidth: Training vs. Inference

AI workloads split into two categories with very different network demands. Training is the bandwidth-hungry phase: hundreds or thousands of GPUs exchange gradient updates after every forward and backward pass. Modern LLM clusters require 800 Gbps to 1.6 Tbps per node. In Cisco's reference architecture, a single Scalable Unit (SU) houses 256 GPUs across 32 HGX H200 systems, all communicating at full line rate.

Inference requires less per-node bandwidth (100-400 Gbps) but demands consistent low-latency responses at scale for user-facing SLAs.
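The per-node figures above can be sanity-checked with a quick calculation. This sketch assumes 8 GPUs per HGX H200 system and one dedicated RDMA NIC per GPU at 100G or 200G; neither assumption is stated explicitly in the text, but both are consistent with the numbers quoted.

```python
# Sanity-check the Scalable Unit (SU) and per-node bandwidth figures.
# Assumptions: 8 GPUs per HGX H200 system, one RDMA NIC per GPU.
GPUS_PER_NODE = 8
HGX_SYSTEMS_PER_SU = 32

gpus_per_su = HGX_SYSTEMS_PER_SU * GPUS_PER_NODE   # 256, matching the SU spec

def node_bandwidth_gbps(nic_gbps: int) -> int:
    """Aggregate backend bandwidth of one node with one NIC per GPU."""
    return GPUS_PER_NODE * nic_gbps

print(gpus_per_su)                 # 256
print(node_bandwidth_gbps(100))    # 800 Gbps  (low end of the quoted range)
print(node_bandwidth_gbps(200))    # 1600 Gbps (1.6 Tbps, high end)
```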

| Characteristic | Training | Inference |
| --- | --- | --- |
| Traffic pattern | Bursty, synchronized (AllReduce) | Steady-state, request/response |
| Per-node bandwidth | 800 Gbps - 1.6 Tbps | 100 - 400 Gbps |
| Latency sensitivity | Microsecond-level (collective ops) | Low milliseconds (user-facing SLA) |
| Packet loss tolerance | Near-zero (lossless RDMA) | Low (TCP retransmits acceptable) |
| Oversubscription tolerance | 1:1 (non-blocking) | 2:1 - 3:1 acceptable |

Analogy: Training is like a relay race where every runner (GPU) must hand off the baton (gradient data) at exactly the same moment. If one handoff is slow, the entire team waits. Inference is more like a customer-service counter -- each request is independent.
```mermaid
stateDiagram-v2
    direction LR
    [*] --> ForwardPass
    ForwardPass --> BackwardPass : Compute gradients
    BackwardPass --> AllReduce : Exchange gradients
    AllReduce --> WaitForSlowest : Synchronized barrier
    WaitForSlowest --> ForwardPass : All GPUs resume
```

Figure 5.1: Bulk-Synchronous Training -- GPU Synchronization Cycle

Latency Sensitivity

Collective communication operations (AllReduce, parameter sync) are exquisitely latency-sensitive because distributed training follows a bulk-synchronous model: every GPU must finish communicating before any GPU can begin the next computation step. The slowest flow dictates cluster pace.

| Transport Technology | Typical Latency (Single Hop) | Notes |
| --- | --- | --- |
| TCP/IP | ~50 microseconds | Kernel stack overhead |
| RoCE v2 | 5 - 10 microseconds | Bypasses kernel; requires lossless fabric |
| InfiniBand | 1 - 2 microseconds | Purpose-built RDMA fabric; lowest latency |

```mermaid
sequenceDiagram
    participant App as Application
    participant NIC as Network Interface
    participant Fabric as Network Fabric
    participant RNIC as Remote NIC
    participant RApp as Remote Application
    Note over App, RApp: TCP/IP Path (~50 us)
    App->>App: System call into kernel
    App->>NIC: Kernel copies data to NIC buffer
    NIC->>Fabric: Transmit packet
    Fabric->>RNIC: Deliver packet
    RNIC->>RApp: Kernel copies data to app memory
    Note over App, RApp: RoCE v2 / RDMA Path (5-10 us)
    App->>NIC: NIC reads directly from app memory
    NIC->>Fabric: Transmit packet (UDP/IP)
    Fabric->>RNIC: Deliver packet
    RNIC->>RApp: NIC writes directly to app memory
    Note over App, RApp: No kernel involvement
```

Figure 5.2: RDMA Latency Comparison -- TCP/IP vs. RoCE v2
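To see why single-hop latency compounds, consider a ring AllReduce, which needs 2(N-1) sequential communication steps across N GPUs. The ring algorithm and the per-step latencies below are illustrative assumptions (bandwidth and serialization terms are deliberately ignored to isolate the latency component):

```python
# Latency cost of a ring AllReduce: 2*(N-1) sequential steps, so the
# per-hop transport latency is amplified by roughly 2N.
def ring_allreduce_latency_us(n_gpus: int, step_latency_us: float) -> float:
    return 2 * (n_gpus - 1) * step_latency_us

for transport, lat_us in [("TCP/IP", 50.0), ("RoCE v2", 7.5), ("InfiniBand", 1.5)]:
    total = ring_allreduce_latency_us(256, lat_us)
    print(f"{transport}: {total:.0f} us per AllReduce (latency term only)")
```

At 256 GPUs the 510 steps turn a ~50 us TCP/IP hop into tens of milliseconds of pure latency per gradient exchange, while RDMA transports stay in the low milliseconds or below.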

Redundancy and High Availability

Unplanned downtime in an AI cluster wastes expensive GPU compute hours. Redundancy is achieved through multiple spine switches with ECMP load balancing: every leaf has an equal-cost path through each spine, so flows spread across all spines and a single spine failure removes only 1/N of fabric capacity (12.5% in an 8-spine design) while traffic reroutes over the surviving paths.

```mermaid
flowchart TD
    GPU_A[GPU Node A] --> Leaf1[Leaf Switch 1]
    GPU_B[GPU Node B] --> Leaf2[Leaf Switch 2]
    Leaf1 -->|Path 1| Spine1[Spine 1]
    Leaf1 -->|Path 2| Spine2[Spine 2]
    Leaf1 -->|Path 3| Spine3[Spine 3]
    Leaf1 -.->|Path 4| Spine4["Spine 4 (FAILED)"]
    Spine1 --> Leaf2
    Spine2 --> Leaf2
    Spine3 --> Leaf2
    Spine4 -.-> Leaf2
    style Spine4 fill:#ffcdd2,stroke:#E53935,stroke-dasharray: 5 5
```

Figure: ECMP Load Balancing and Graceful Degradation
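The graceful-degradation property is simple arithmetic, sketched below (the spine counts mirror the figure and the 8-spine example from the pre-quiz):

```python
# ECMP graceful degradation: each leaf load-balances flows across N spines,
# so losing one spine removes only 1/N of leaf-to-spine capacity.
def surviving_capacity(total_spines: int, failed_spines: int) -> float:
    """Fraction of uplink capacity remaining after spine failures."""
    return (total_spines - failed_spines) / total_spines

print(surviving_capacity(8, 1))   # 0.875 -> 12.5% reduction in an 8-spine design
print(surviving_capacity(4, 1))   # 0.75  -> the 4-spine fabric in the figure
```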

Animation slot: Interactive visualization showing ECMP path selection and graceful degradation when a spine switch fails.

Security Considerations

The AI backend RDMA fabric is typically kept physically or logically isolated from general-purpose data center traffic. Isolation prevents non-AI flows from consuming lossless-class buffers and triggering PFC storms that can stall an entire training job, and it limits exposure of the GPU fabric to the rest of the network.
Post-Quiz: Verify Your Understanding

1. In a Cisco reference AI cluster, a Scalable Unit (SU) houses how many GPUs?

- 64
- 128
- 256
- 512

2. RoCE v2 achieves low latency primarily by:

- Using dedicated InfiniBand switches
- Compressing data before transmission
- Bypassing the kernel, allowing NICs to read/write application memory directly
- Using TCP acceleration hardware

3. Why is jitter especially damaging to AI training performance?

- It causes data corruption in gradient values
- A single straggler flow with jitter can stall hundreds of GPUs in synchronous operations
- It only affects inference workloads
- It increases power consumption of switches

4. What is the primary purpose of ECMP in an AI fabric?

- Encrypting traffic between GPUs
- Distributing flows across multiple equal-cost paths for load balancing and fault tolerance
- Reducing the number of spine switches needed
- Compressing packet headers

5. What packet loss tolerance does AI training demand?

- Up to 1% loss is acceptable
- Near-zero (lossless RDMA fabric required)
- Same as standard web traffic
- Loss tolerance depends on model size only

Section 2: Network Topology Design for AI

Pre-Quiz: Test Your Baseline Knowledge

1. How many hops does it take for any endpoint to reach any other endpoint in a spine-leaf fabric?

- 1 hop
- 2 hops (leaf-spine-leaf)
- 3 hops minimum
- Variable depending on load

2. Which traffic direction dominates in AI training workloads?

- North-south (client to server)
- East-west (server to server)
- Both equally
- It depends on model architecture only

3. What is the oversubscription ratio on the Cisco Nexus 9364D-GX2A leaf switch for AI backends?

- 2:1
- 3:1
- 1:1 (non-blocking)
- 4:1

4. In a rail-optimized topology, how are GPUs connected to leaf switches?

- All GPUs in a server connect to one leaf
- Each GPU position across all servers connects to a dedicated leaf switch (rail)
- GPUs connect directly to spine switches
- GPUs connect via a separate management network

5. What cost and power savings can rail-only networks achieve compared to full-bisection designs?

- 5-10% savings
- 10-20% savings
- 38-77% cost reduction and 37-75% power reduction
- No measurable savings

Key Points

Spine-Leaf Architecture

The spine-leaf (Clos) topology is the dominant architecture in AI-ready data centers. Leaf switches connect to endpoints (GPUs, storage). Spine switches interconnect all leaf switches in a full mesh. Any endpoint can reach any other endpoint in exactly two hops, providing predictable, uniform latency.

For larger deployments, a super-spine (third tier) is added. Cisco's three-tier design scales to 32 planes with 32 switches each, maintaining 1:1 oversubscription within an SU and typically 3:1 at the spine-to-super-spine boundary.

```mermaid
flowchart TD
    subgraph Spines
        S1[Spine 1]
        S2[Spine 2]
        S3[Spine 3]
        S4[Spine 4]
    end
    subgraph Leaves
        L1[Leaf 1]
        L2[Leaf 2]
        L3[Leaf 3]
        L4[Leaf 4]
    end
    L1 --- S1
    L1 --- S2
    L1 --- S3
    L1 --- S4
    L2 --- S1
    L2 --- S2
    L2 --- S3
    L2 --- S4
    L3 --- S1
    L3 --- S2
    L3 --- S3
    L3 --- S4
    L4 --- S1
    L4 --- S2
    L4 --- S3
    L4 --- S4
    L1 --- G1[GPUs]
    L2 --- G2[GPUs]
    L3 --- G3[GPUs]
    L4 --- G4[GPUs]
```

Figure: Spine-Leaf Full-Mesh Topology

East-West vs. North-South Traffic

| Traffic Direction | Definition | Dominant In | Optimization |
| --- | --- | --- | --- |
| North-South | Client-to-server (in/out of DC) | Web apps, APIs, inference | Perimeter firewalls, load balancers |
| East-West | Server-to-server (within DC) | AI training, distributed storage | Non-blocking fabric, ECMP, RDMA |

When 1,024 GPUs perform AllReduce, every GPU sends data to many other GPUs across the fabric. This east-west traffic can saturate spine links if the network lacks sufficient bisection bandwidth.

```mermaid
flowchart TD
    subgraph NorthSouth["North-South (Traditional)"]
        Client([Client]) -->|Request| FW[Firewall/LB]
        FW --> Server1[Web Server]
    end
    subgraph EastWest["East-West (AI Training)"]
        GPUA[GPU A] <-->|AllReduce| Leaf1[Leaf]
        GPUB[GPU B] <-->|AllReduce| Leaf2[Leaf]
        Leaf1 <-->|Full bisection BW| SpineX[Spine]
        Leaf2 <-->|Full bisection BW| SpineX
    end
    style NorthSouth fill:#e8f4f8,stroke:#2196F3
    style EastWest fill:#fff3e0,stroke:#FF9800
```

Figure 5.3: East-West vs. North-South Traffic Patterns

Rail-Optimized Topology

In a rail-optimized design, each "rail" is a dedicated parallel data path connecting one specific GPU position across all servers to a single leaf switch. For an 8-GPU-per-server cluster: GPU 0 in every server connects to Leaf A (Rail 0), GPU 1 to Leaf B (Rail 1), and so on.

This exploits sparse communication patterns in LLM training. Rail-only networks (without spine) can achieve the same training performance while reducing network cost by 38-77% and power by 37-75%.
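The wiring rule is mechanical: a GPU's rail index equals its position within the server, regardless of which server it sits in. A minimal sketch (leaf labels and server counts are illustrative):

```python
# Rail-optimized wiring rule: rail index == GPU position within the server,
# independent of which server the GPU sits in.
def rail_map(num_servers: int, gpus_per_server: int) -> dict:
    """Map (server, gpu_position) -> rail index."""
    return {(s, g): g for s in range(num_servers) for g in range(gpus_per_server)}

rails = rail_map(num_servers=2, gpus_per_server=4)
print(rails[(0, 1)], rails[(1, 1)])   # both GPU 1s land on Rail 1 (Leaf B)
```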

```mermaid
flowchart LR
    subgraph Server1[Server 1]
        S1G0[GPU 0]
        S1G1[GPU 1]
        S1G2[GPU 2]
        S1G3[GPU 3]
    end
    subgraph Server2[Server 2]
        S2G0[GPU 0]
        S2G1[GPU 1]
        S2G2[GPU 2]
        S2G3[GPU 3]
    end
    S1G0 --> R0[Rail 0 - Leaf A]
    S2G0 --> R0
    S1G1 --> R1[Rail 1 - Leaf B]
    S2G1 --> R1
    S1G2 --> R2[Rail 2 - Leaf C]
    S2G2 --> R2
    S1G3 --> R3[Rail 3 - Leaf D]
    S2G3 --> R3
    style R0 fill:#c8e6c9,stroke:#388E3C
    style R1 fill:#bbdefb,stroke:#1976D2
    style R2 fill:#fff9c4,stroke:#F9A825
    style R3 fill:#ce93d8,stroke:#7B1FA2
```

Figure: Rail-Optimized Topology

Oversubscription and Capacity Planning

Oversubscription is the ratio of total downlink to total uplink bandwidth on a leaf switch -- the single most critical capacity planning metric.

| Ratio | Meaning | Use Case |
| --- | --- | --- |
| 1:1 | Non-blocking; full bandwidth | AI training backend (mandatory) |
| 2:1 | Half bandwidth at peak | Inference, storage networks |
| 3:1 | One-third bandwidth at peak | General enterprise, management |
| > 3:1 | Significant contention likely | Not recommended for AI |

Exam Tip: A 128-port leaf at 400G with 64 ports to GPUs (25.6 Tbps) and 64 ports to spines (25.6 Tbps) = 1:1. Change to 96 GPU ports and 32 spine ports = 3:1. Know how to calculate this ratio.
Animation slot: Interactive oversubscription calculator -- adjust GPU ports vs. spine ports to see the ratio change in real time.
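The exam-tip calculation reduces to one division. A minimal sketch (the 400G port speed is the default from the example; any uniform port speed cancels out):

```python
# Oversubscription ratio = total downlink (GPU-facing) bandwidth over
# total uplink (spine-facing) bandwidth on a leaf switch.
def oversubscription_ratio(gpu_ports: int, spine_ports: int,
                           port_gbps: int = 400) -> float:
    return (gpu_ports * port_gbps) / (spine_ports * port_gbps)

print(oversubscription_ratio(64, 64))   # 1.0 -> 1:1 non-blocking (training)
print(oversubscription_ratio(96, 32))   # 3.0 -> 3:1 (exam-tip example)
```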
Post-Quiz: Verify Your Understanding

1. Why does a three-tier fabric add a super-spine layer?

- To reduce the number of leaf switches needed
- To scale beyond what a two-tier spine-leaf can support
- To eliminate ECMP routing
- To add firewall capability

2. What port split does the Nexus 9364D-GX2A use to achieve 1:1 oversubscription?

- 48 GPU-facing + 16 spine uplinks
- 32 GPU-facing + 32 spine uplinks
- 64 GPU-facing + 64 spine uplinks
- 16 GPU-facing + 48 spine uplinks

3. Why do traditional north-south-optimized data centers fail under AI workloads?

- They lack enough server ports
- They have east-west bottlenecks from oversubscribed aggregation/core tiers
- They cannot run RDMA protocols
- They use incompatible cabling

4. What communication property of LLM training makes rail-optimized topologies effective?

- All GPUs communicate with all other GPUs equally
- Sparse, predictable communication patterns aligned with parallelism dimensions
- Only adjacent GPUs communicate
- Communication only occurs during inference

5. What is the typical oversubscription ratio at the spine-to-super-spine boundary in Cisco's three-tier AI design?

- 1:1
- 2:1
- 3:1
- 5:1

Section 3: Key Network Challenges for AI/ML

Pre-Quiz: Test Your Baseline Knowledge

1. What is the "straggler problem" in distributed AI training?

- A GPU running outdated firmware
- The slowest GPU in a sync step forces all other GPUs to idle
- A spine switch dropping packets randomly
- A storage node running out of disk space

2. Which mechanism pauses traffic per-priority to prevent buffer overflow?

- ECN
- DCQCN
- PFC (Priority Flow Control)
- ECMP

3. What collective communication operation is most commonly used in distributed training?

- Broadcast
- AllReduce
- Scatter
- Point-to-point

4. What MTU size must be configured end-to-end for AI lossless fabrics?

- 1500 bytes
- 4000 bytes
- 9000 bytes (Jumbo MTU)
- 16000 bytes

5. What Cisco tool provides real-time congestion monitoring for AI fabrics?

- Cisco DNA Center
- Cisco Nexus Dashboard Insights
- Cisco Meraki Dashboard
- Cisco ISE

Key Points

Scalability in Large GPU Clusters

Scaling from hundreds to thousands of GPUs introduces compounding challenges:

- Adding leaf switches without proportionally adding spine ports degrades the oversubscription ratio.
- Beyond the scale a two-tier spine-leaf fabric can support, a super-spine tier becomes necessary, adding cost and an extra hop.
- Larger clusters widen the blast radius of any single straggler or congestion event, because every GPU in a synchronous job waits on the slowest flow.

Bottlenecks in Distributed Training

  1. Synchronized congestion: Collective operations cause all GPUs to burst traffic simultaneously -- far more damaging than random congestion.
  2. Tail latency dominance: If 1,023 GPUs finish in 5 us but one takes 50 us, all 1,023 idle for 45 us.
  3. Packet loss amplification: A single dropped packet triggers RDMA retransmission, stalling all GPUs in the cluster.
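The tail-latency arithmetic from item 2 generalizes directly: wasted GPU-time is cluster size times the straggler gap.

```python
# Tail-latency dominance: in a bulk-synchronous step, every GPU waits for
# the slowest flow, so idle time scales with cluster size.
def wasted_gpu_microseconds(n_gpus: int, typical_us: float,
                            straggler_us: float) -> float:
    """Total idle GPU-time while waiting on a single straggler."""
    return (n_gpus - 1) * (straggler_us - typical_us)

# 1,023 GPUs finish in 5 us; one straggler takes 50 us:
print(wasted_gpu_microseconds(1024, 5.0, 50.0))   # 46035.0 us of idle GPU-time
```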

| Mechanism | Function |
| --- | --- |
| PFC (Priority Flow Control) | Pauses traffic per-priority to prevent buffer overflow and packet drops |
| ECN (Explicit Congestion Notification) | Marks packets when queue depth exceeds threshold, signaling sender to slow down |
| DCQCN | Combines ECN with rate-based congestion control for RDMA traffic |
| Jumbo MTU 9000 | Reduces per-packet overhead; must be enabled end-to-end |

```mermaid
flowchart LR
    Sender[Sender GPU/NIC] -->|Data packets| Switch1[Ingress Switch]
    Switch1 -->|Forward| Switch2[Egress Switch]
    Switch2 -->|Deliver| Receiver[Receiver GPU/NIC]
    Switch2 -->|Queue depth exceeds threshold?| ECN_Check{ECN Threshold}
    ECN_Check -->|Yes| ECN_Mark[Mark packet with ECN bit]
    ECN_Mark --> Receiver
    Receiver -->|CNP: Congestion Notification| Sender
    Sender -->|DCQCN: Reduce rate| Switch1
    Switch2 -->|Buffer nearly full?| PFC_Check{PFC Threshold}
    PFC_Check -->|Yes| PFC_Pause[Send PFC PAUSE upstream]
    PFC_Pause --> Switch1
    style ECN_Check fill:#fff9c4,stroke:#F9A825
    style PFC_Check fill:#ffcdd2,stroke:#E53935
    style ECN_Mark fill:#fff9c4,stroke:#F9A825
    style PFC_Pause fill:#ffcdd2,stroke:#E53935
```

Figure 5.4: Lossless Fabric Mechanisms -- PFC, ECN, and DCQCN Interaction
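The sender's reaction to a CNP in DCQCN can be sketched in a few lines. This is a simplified model: real DCQCN also recovers the rate toward a target over time via timers and byte counters, and the smoothing gain `g` and initial `alpha` here are illustrative values, not tuned defaults.

```python
# Simplified sketch of DCQCN's sender-side reaction to a CNP (Congestion
# Notification Packet): the congestion estimate alpha is raised, then the
# sending rate is cut multiplicatively. Rate recovery is omitted.
def on_cnp(rate_gbps: float, alpha: float, g: float = 1 / 256):
    alpha = (1 - g) * alpha + g              # congestion estimate grows
    rate_gbps *= (1 - alpha / 2)             # multiplicative rate decrease
    return rate_gbps, alpha

rate, alpha = 400.0, 0.5
for _ in range(3):                           # three back-to-back CNPs
    rate, alpha = on_cnp(rate, alpha)
    print(f"rate={rate:.1f} Gbps, alpha={alpha:.3f}")
```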

Analogy: Imagine an orchestra where every musician must play in perfect unison. If even one musician's sheet music arrives late (packet delay), the entire orchestra must pause and wait. A lossless network is like a perfectly reliable music distribution system.

Connectivity Models: RoCE v2 vs. InfiniBand

| Feature | RoCE v2 | InfiniBand |
| --- | --- | --- |
| Latency | 5 - 10 microseconds | 1 - 2 microseconds |
| Jitter | Moderate (requires tuning) | Extremely low (by design) |
| Packet loss handling | Requires PFC/ECN/DCQCN config | Built-in credit-based flow control |
| Ecosystem | Standard Ethernet (Cisco Nexus) | Dedicated IB switches (NVIDIA) |
| Cost | Lower (existing Ethernet) | Higher (separate fabric) |
| Scale proof | Meta (Llama), IBM (Granite) | NVIDIA DGX SuperPOD |

Exam Tip: Cisco's AI networking strategy centers on RoCE v2 over Nexus platforms. Know the three non-negotiable requirements: lossless transport (PFC + ECN + DCQCN), non-blocking leaf-tier bandwidth, and jumbo MTU 9000 end-to-end.
Animation slot: Step-by-step visualization of PFC pause frame and ECN marking during a congestion event.
Post-Quiz: Verify Your Understanding

1. Why is synchronized congestion more damaging than random congestion in AI fabrics?

- It happens less frequently
- All GPUs burst traffic simultaneously during collective ops, creating correlated congestion
- It only affects the spine layer
- Random congestion does not exist in data centers

2. What does ECN do when queue depth exceeds a threshold?

- Drops the packet immediately
- Marks the packet with an ECN bit, signaling the sender to slow down
- Reroutes the packet to a different spine
- Compresses the packet to reduce size

3. Which companies have proven RoCE v2 at hyperscale for AI training?

- Google and Amazon
- Meta (Llama) and IBM (Granite)
- Microsoft and Oracle
- NVIDIA and AMD

4. What advantage does InfiniBand have over RoCE v2 for packet loss handling?

- It uses TCP retransmission
- Built-in credit-based flow control (no PFC/ECN configuration needed)
- It ignores packet loss entirely
- It uses forward error correction only

5. A single dropped packet in an RDMA fabric causes:

- No impact -- RDMA handles loss gracefully
- Retransmission delay that forces all GPUs in the cluster to wait
- Automatic failover to TCP
- Only the affected GPU pauses

Section 4: Optical and Copper Technologies for AI

Pre-Quiz: Test Your Baseline Knowledge

1. What is the maximum practical reach of an 800G DAC (passive copper cable)?

- 100 meters
- 50 meters
- 3-5 meters
- 500 meters

2. What modulation technique do 800G transceivers use to double the bit rate per lane?

- NRZ (Non-Return-to-Zero)
- QAM-16
- PAM4 (4-level Pulse Amplitude Modulation)
- OFDM

3. Which transceiver reach designation is appropriate for inter-building links up to 2 km?

- SR8
- DR4
- FR4
- LR4

4. What emerging technology integrates optical transceivers directly onto the switch ASIC?

- e-Tube
- Co-Packaged Optics (CPO)
- Active Optical Cables (AOC)
- Silicon photonics

5. QSFP-DD supports up to how many electrical lanes?

- 4 lanes
- 8 lanes
- 16 lanes
- 2 lanes

Key Points

Copper Interconnects

Direct Attach Cables (DACs) are passive copper assemblies with pre-attached transceivers, saving 50-70% vs. optical at distances up to 3-5 meters. Zero power consumption and near-zero latency contribution make them ideal for within-rack connections.

Active Electrical Cables (AECs) embed signal-conditioning electronics, extending copper reach to 5-10 meters while consuming 25-50% less power than Active Optical Cables. AECs are emerging as the preferred choice for top-of-rack to end-of-row links in AI clusters.

Copper limitation: At 800G+, the skin effect causes signal loss to increase with frequency. Industry consensus: copper cannot scale beyond 800G -- at 1.6 Tbps, cables become too short, thick, and impractical.

Optical Interconnects

Fiber optics deliver data faster and over longer distances with minimal signal loss. Optical cables are thinner and lighter, enabling higher density. However, they cost up to 7x as much as equivalent copper connections and consume more power.

For AI workloads, optical is essential for any link exceeding 5-10 meters: all spine uplinks, inter-rack, and inter-building links.

Transceiver Form Factors

| Form Factor | Electrical Lanes | Max Speed | Best For |
| --- | --- | --- | --- |
| QSFP-DD | 8 x 100G | 800 Gbps | Dense leaf switches; backward compatible with QSFP+/28/56 |
| OSFP | 8 x 100G | 800 Gbps | Spine switches; high-power 800G deployments |

Both use PAM4 modulation (4-level Pulse Amplitude Modulation), encoding 2 bits per symbol to double the bit rate per lane compared to NRZ.
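The bits-per-symbol arithmetic behind PAM4 is shown below. The 53.125 GBd symbol rate is the common raw (pre-FEC) electrical rate for 100G-per-lane signaling; treating it as the lane rate here is a simplification that ignores coding overhead.

```python
# PAM4 vs NRZ: bits per symbol = log2(levels). At the same symbol rate,
# PAM4 (4 levels) carries twice the bits of NRZ (2 levels).
import math

def lane_rate_gbps(symbol_rate_gbaud: float, levels: int) -> float:
    return symbol_rate_gbaud * math.log2(levels)

print(lane_rate_gbps(53.125, 2))       # NRZ:  53.125 Gbps per lane
print(lane_rate_gbps(53.125, 4))       # PAM4: 106.25 Gbps per lane
print(lane_rate_gbps(53.125, 4) * 8)   # 8 lanes -> 850.0 Gbps raw (~800G usable)
```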

Transceiver Reach Designations

| Designation | Fiber Type | Max Distance | Typical Use |
| --- | --- | --- | --- |
| SR8 (Short Reach) | Multimode OM4/OM5 | 100 m | Intra-rack, same-row |
| DR4 (Data Center Reach) | Single-mode | 500 m | Inter-rack, spine uplinks |
| FR4 (Far Reach) | Single-mode | 2 km | Inter-building, campus |
| LR4 (Long Reach) | Single-mode | 10 km | Campus backbone, DCI |

```mermaid
flowchart TD
    Start([Select Interconnect]) --> Dist{Link Distance?}
    Dist -->|Under 3 m| DAC[800G DAC - Passive Copper]
    Dist -->|3 - 10 m| AEC[800G AEC - Active Electrical]
    Dist -->|10 - 100 m| SR8[800G SR8 - Multimode Fiber]
    Dist -->|100 - 500 m| DR4[800G DR4 - Single-mode]
    Dist -->|500 m - 2 km| FR4[800G FR4 - Single-mode]
    Dist -->|2 - 10 km| LR4[800G LR4 - Single-mode]
    style DAC fill:#c8e6c9,stroke:#388E3C
    style AEC fill:#c8e6c9,stroke:#388E3C
    style SR8 fill:#bbdefb,stroke:#1976D2
    style DR4 fill:#bbdefb,stroke:#1976D2
    style FR4 fill:#ce93d8,stroke:#7B1FA2
    style LR4 fill:#ce93d8,stroke:#7B1FA2
```

Figure 5.5: Interconnect Technology Selection by Distance
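The decision tree in Figure 5.5 is easy to encode. A minimal sketch; the behavior at exact boundary distances is a simplifying assumption:

```python
# Distance-based interconnect selection mirroring Figure 5.5.
def select_interconnect(distance_m: float) -> str:
    if distance_m < 3:
        return "800G DAC (passive copper)"
    if distance_m <= 10:
        return "800G AEC (active electrical)"
    if distance_m <= 100:
        return "800G SR8 (multimode fiber)"
    if distance_m <= 500:
        return "800G DR4 (single-mode)"
    if distance_m <= 2_000:
        return "800G FR4 (single-mode)"
    if distance_m <= 10_000:
        return "800G LR4 (single-mode)"
    return "beyond 800G pluggable reach"

print(select_interconnect(2))    # in-rack GPU-to-leaf -> DAC
print(select_interconnect(50))   # leaf-to-spine across rows -> SR8
```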

Emerging Technologies

Co-Packaged Optics (CPO) integrates optical transceivers directly onto the switch ASIC package, eliminating pluggable transceivers. Expected to be critical for 1.6 Tbps and beyond, reducing electrical-to-optical conversion losses.

e-Tube uses RF data transmission through a plastic dielectric waveguide -- a third option between copper and optical offering copper-like simplicity with optical-like bandwidth density for short reach.

The optical interconnect market for AI data centers reached $3.75 billion in 2025, projected to reach $18.36 billion by 2033 (21.87% CAGR). Analysts predict all AI DC interconnects will be optical within five years.

Exam Tip: For the 300-640 DCAI exam, know QSFP-DD and OSFP form factors, understand PAM4 modulation at 800G, and be able to match transceiver reach designations (SR8/DR4/FR4/LR4) to deployment distances.
Animation slot: Interactive distance-to-transceiver selector -- input a distance and see the recommended interconnect technology with cost/power tradeoffs.
Post-Quiz: Verify Your Understanding

1. For a GPU NIC-to-leaf connection within a rack (under 3 m), which interconnect is optimal?

- 800G SR8 optics
- 800G DAC (passive copper)
- 800G DR4 optics
- 800G FR4 optics

2. Why can copper not practically scale beyond 800G for data center interconnects?

- Copper is too expensive at high speeds
- The skin effect causes signal loss at high frequencies, requiring impractically thick/short cables
- Copper does not support PAM4 modulation
- Regulatory limits on copper bandwidth

3. What fiber type does SR8 use, and what is its maximum reach?

- Single-mode, 500 m
- Multimode (OM4/OM5), 100 m
- Multimode, 2 km
- Single-mode, 100 m

4. How does PAM4 modulation double the bit rate compared to NRZ?

- By using two separate fiber strands
- By encoding 2 bits per symbol using 4 amplitude levels
- By doubling the clock frequency
- By compressing the data before transmission

5. For a leaf-to-spine link spanning 50 meters across data center rows, which transceiver is most appropriate?

- 800G DAC
- 800G AEC
- 800G SR8 over multimode fiber
- 800G LR4 over single-mode fiber


Answer Explanations