Evaluate network deployments based on AI workload requirements for bandwidth, latency, redundancy, scalability, and security.
Design network topologies optimized for AI east-west traffic patterns using spine-leaf and rail-optimized architectures.
Identify key network challenges specific to AI/ML application requirements, including collective communication bottlenecks and lossless fabric design.
Select appropriate optical and copper interconnect technologies for different segments of an AI data center network.
Section 1: Network Requirements for AI Workloads
Pre-Quiz: Test Your Baseline Knowledge
1. What per-node bandwidth does a modern AI training cluster typically require?
   - 100-200 Gbps
   - 400 Gbps
   - 800 Gbps to 1.6 Tbps
   - 10 Gbps
2. Which RDMA transport technology delivers the lowest latency for GPU-to-GPU communication?
   - TCP/IP
   - RoCE v2
   - InfiniBand
   - iSCSI
3. What happens in a spine-leaf fabric when one spine switch fails in an 8-spine design?
   - Complete network outage
   - 50% bandwidth reduction
   - 12.5% bandwidth reduction with traffic rerouted
   - No impact at all
4. Why is the AI backend RDMA fabric typically isolated from general-purpose traffic?
   - To save on switch port costs
   - To prevent non-AI traffic from triggering PFC storms
   - Because RDMA only works on separate physical cables
   - For regulatory compliance reasons only
5. What oversubscription ratio is acceptable for AI inference networks?
   - 1:1 only
   - 2:1 to 3:1
   - 5:1 or higher
   - Oversubscription does not matter for inference
Key Points
AI training demands 800 Gbps to 1.6 Tbps per node with non-blocking (1:1) oversubscription; inference needs 100-400 Gbps and tolerates 2:1 to 3:1.
RoCE v2 achieves 5-10 microsecond latency by bypassing the OS kernel (RDMA); InfiniBand achieves 1-2 microseconds.
Jitter (variation in latency) is as critical as average latency -- a single straggler flow can stall hundreds of GPUs.
Full-mesh leaf-spine with ECMP provides both load balancing and fault tolerance with graceful degradation.
AI fabric security requires VRF segmentation, dedicated lossless fabric isolation, and east-west firewall policies.
Bandwidth: Training vs. Inference
AI workloads split into two categories with very different network demands. Training is the bandwidth-hungry phase: hundreds or thousands of GPUs exchange gradient updates after every forward and backward pass. Modern LLM clusters require 800 Gbps to 1.6 Tbps per node. In Cisco's reference architecture, a single Scalable Unit (SU) houses 256 GPUs across 32 HGX H200 systems, all communicating at full line rate.
Inference requires less per-node bandwidth (100-400 Gbps) but demands consistent low-latency responses at scale for user-facing SLAs.
| Characteristic | Training | Inference |
| --- | --- | --- |
| Traffic pattern | Bursty, synchronized (AllReduce) | Steady-state, request/response |
| Per-node bandwidth | 800 Gbps - 1.6 Tbps | 100 - 400 Gbps |
| Latency sensitivity | Microsecond-level (collective ops) | Low milliseconds (user-facing SLA) |
| Packet loss tolerance | Near-zero (lossless RDMA) | Low (TCP retransmits acceptable) |
| Oversubscription tolerance | 1:1 (non-blocking) | 2:1 - 3:1 acceptable |
Analogy: Training is like a relay race where every runner (GPU) must hand off the baton (gradient data) at exactly the same moment. If one handoff is slow, the entire team waits. Inference is more like a customer-service counter -- each request is independent.
Figure 5.1: Bulk-Synchronous Training -- GPU Synchronization Cycle
Latency Sensitivity
Collective communication operations (AllReduce, parameter sync) are exquisitely latency-sensitive because distributed training follows a bulk-synchronous model: every GPU must finish communicating before any GPU can begin the next computation step. The slowest flow dictates cluster pace.
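The barrier effect is easy to model. A minimal sketch, using hypothetical flow-completion times, of why a single straggler flow sets the pace for the entire cluster:

```python
# Bulk-synchronous step: the barrier waits for the slowest flow.
# Hypothetical flow-completion times (microseconds) for 512 GPUs:
# 511 healthy RoCE v2 flows plus one jittery straggler.
comm_times_us = [8.0] * 511 + [50.0]

mean_flow_us = sum(comm_times_us) / len(comm_times_us)
step_time_us = max(comm_times_us)  # every GPU idles until the last flow finishes

print(f"mean flow: {mean_flow_us:.2f} us, step time: {step_time_us:.1f} us")
```

Even though the average flow finishes in about 8 microseconds, the synchronization step takes 50 microseconds: the whole cluster runs at the straggler's speed.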
| Transport Technology | Typical Latency (Single Hop) | Notes |
| --- | --- | --- |
| TCP/IP | ~50 microseconds | Kernel stack overhead |
| RoCE v2 | 5 - 10 microseconds | Bypasses kernel; requires lossless fabric |
| InfiniBand | 1 - 2 microseconds | Purpose-built RDMA fabric; lowest latency |
```mermaid
sequenceDiagram
    participant App as Application
    participant NIC as Network Interface
    participant Fabric as Network Fabric
    participant RNIC as Remote NIC
    participant RApp as Remote Application
    Note over App, RApp: TCP/IP Path (~50 us)
    App->>App: System call into kernel
    App->>NIC: Kernel copies data to NIC buffer
    NIC->>Fabric: Transmit packet
    Fabric->>RNIC: Deliver packet
    RNIC->>RApp: Kernel copies data to app memory
    Note over App, RApp: RoCE v2 / RDMA Path (5-10 us)
    App->>NIC: NIC reads directly from app memory
    NIC->>Fabric: Transmit packet (UDP/IP)
    Fabric->>RNIC: Deliver packet
    RNIC->>RApp: NIC writes directly to app memory
    Note over App, RApp: No kernel involvement
```
Figure 5.2: RDMA Latency Comparison -- TCP/IP vs. RoCE v2
Redundancy and High Availability
Unplanned downtime in an AI cluster wastes expensive GPU compute hours. Redundancy is achieved through:
Full-mesh leaf-spine connectivity -- every leaf connects to every spine, eliminating single points of failure.
ECMP routing -- distributes flows across multiple equal-cost paths for load balancing and fault tolerance.
Graceful degradation -- losing one spine in an 8-spine fabric reduces bandwidth by 12.5%, not a full outage.
Access control -- GPU nodes reachable only from authorized management and storage networks; east-west firewall policies restrict lateral movement.
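The graceful-degradation arithmetic can be sketched directly (a hypothetical helper, assuming ECMP rebalances flows evenly across the surviving spines):

```python
def remaining_capacity(total_spines: int, failed_spines: int) -> float:
    """Fraction of fabric bandwidth left after spine failures,
    assuming ECMP spreads flows evenly across surviving spines."""
    surviving = total_spines - failed_spines
    return surviving / total_spines

# Losing 1 spine of 8 costs 12.5% of bandwidth, not a full outage.
print(remaining_capacity(8, 1))  # 0.875
```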
Animation slot: Interactive visualization showing ECMP path selection and graceful degradation when a spine switch fails.
Post-Quiz: Verify Your Understanding
1. In a Cisco reference AI cluster, a Scalable Unit (SU) houses how many GPUs?
   - 64
   - 128
   - 256
   - 512
2. RoCE v2 achieves low latency primarily by:
   - Using dedicated InfiniBand switches
   - Compressing data before transmission
   - Bypassing the kernel, allowing NICs to read/write application memory directly
   - Using TCP acceleration hardware
3. Why is jitter especially damaging to AI training performance?
   - It causes data corruption in gradient values
   - A single straggler flow with jitter can stall hundreds of GPUs in synchronous operations
   - It only affects inference workloads
   - It increases power consumption of switches
4. What is the primary purpose of ECMP in an AI fabric?
   - Encrypting traffic between GPUs
   - Distributing flows across multiple equal-cost paths for load balancing and fault tolerance
   - Reducing the number of spine switches needed
   - Compressing packet headers
5. What packet loss tolerance does AI training demand?
   - Up to 1% loss is acceptable
   - Near-zero (lossless RDMA fabric required)
   - Same as standard web traffic
   - Loss tolerance depends on model size only
Section 2: Network Topology Design for AI
Pre-Quiz: Test Your Baseline Knowledge
1. How many hops does it take for any endpoint to reach any other endpoint in a spine-leaf fabric?
   - 1 hop
   - 2 hops (leaf-spine-leaf)
   - 3 hops minimum
   - Variable depending on load
2. Which traffic direction dominates in AI training workloads?
   - North-south (client to server)
   - East-west (server to server)
   - Both equally
   - It depends on model architecture only
3. What is the oversubscription ratio on the Cisco Nexus 9364D-GX2A leaf switch for AI backends?
   - 2:1
   - 3:1
   - 1:1 (non-blocking)
   - 4:1
4. In a rail-optimized topology, how are GPUs connected to leaf switches?
   - All GPUs in a server connect to one leaf
   - Each GPU position across all servers connects to a dedicated leaf switch (rail)
   - GPUs connect directly to spine switches
   - GPUs connect via a separate management network
5. What cost and power savings can rail-only networks achieve compared to full-bisection designs?
   - 5-10% savings
   - 10-20% savings
   - 38-77% cost reduction and 37-75% power reduction
   - No measurable savings
Key Points
Spine-leaf (Clos) provides uniform 2-hop latency; every leaf connects to every spine in a full mesh.
AI training traffic is overwhelmingly east-west; traditional north-south-optimized designs fail under AI workloads.
Rail-optimized topology aligns each GPU position to a dedicated leaf switch, exploiting sparse communication patterns.
1:1 oversubscription at the leaf tier is non-negotiable for AI training; 3:1 is acceptable at spine-to-super-spine.
Spine-Leaf Architecture
The spine-leaf (Clos) topology is the dominant architecture in AI-ready data centers. Leaf switches connect to endpoints (GPUs, storage). Spine switches interconnect all leaf switches in a full mesh. Any endpoint can reach any other endpoint in exactly two hops, providing predictable, uniform latency.
For larger deployments, a super-spine (third tier) is added. Cisco's three-tier design scales to 32 planes with 32 switches each, maintaining 1:1 oversubscription within an SU and typically 3:1 at the spine-to-super-spine boundary.
When 1,024 GPUs perform AllReduce, every GPU sends data to many other GPUs across the fabric. This east-west traffic can saturate spine links if the network lacks sufficient bisection bandwidth.
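To get a feel for the traffic volume, here is a rough sketch of per-GPU data sent in one synchronization step, assuming the standard ring-AllReduce cost of 2(N-1)/N times the payload; the 70B-parameter fp16 model size is a hypothetical example, not from this document:

```python
def allreduce_bytes_per_gpu(payload_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU transmits in one ring AllReduce:
    2 * (N - 1) / N * payload (standard ring-algorithm cost)."""
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes

# Hypothetical example: gradients for a 70B-parameter model in fp16 (~140 GB)
# synchronized across 1,024 GPUs.
gb_per_gpu = allreduce_bytes_per_gpu(140e9, 1024) / 1e9
print(f"{gb_per_gpu:.1f} GB sent per GPU per sync step")
```

Nearly twice the payload crosses the fabric from every GPU on every step, which is why spine links saturate without full bisection bandwidth.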
Figure 5.3: East-West vs. North-South Traffic Patterns
Rail-Optimized Topology
In a rail-optimized design, each "rail" is a dedicated parallel data path connecting one specific GPU position across all servers to a single leaf switch. For an 8-GPU-per-server cluster: GPU 0 in every server connects to Leaf A (Rail 0), GPU 1 to Leaf B (Rail 1), and so on.
This exploits sparse communication patterns in LLM training. Rail-only networks (without spine) can achieve the same training performance while reducing network cost by 38-77% and power by 37-75%.
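The rail mapping can be sketched as a small helper; the leaf naming is hypothetical, since real designs label rails per site convention:

```python
def rail_leaf(server_id: int, gpu_index: int) -> str:
    """Map a GPU to its leaf switch in a rail-optimized fabric.
    The GPU's position within the server determines its rail/leaf;
    the server ID is deliberately ignored."""
    return f"leaf-rail-{gpu_index}"  # hypothetical naming scheme

# GPU 0 in every server shares a rail; GPU 1 lands on a different leaf.
assert rail_leaf(server_id=0, gpu_index=0) == rail_leaf(server_id=31, gpu_index=0)
assert rail_leaf(server_id=0, gpu_index=0) != rail_leaf(server_id=0, gpu_index=1)
```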
Oversubscription is the ratio of total downlink to total uplink bandwidth on a leaf switch -- the single most critical capacity planning metric.
| Ratio | Meaning | Use Case |
| --- | --- | --- |
| 1:1 | Non-blocking; full bandwidth | AI training backend (mandatory) |
| 2:1 | Half bandwidth at peak | Inference, storage networks |
| 3:1 | One-third bandwidth at peak | General enterprise, management |
| > 3:1 | Significant contention likely | Not recommended for AI |
Exam Tip: A 128-port leaf at 400G with 64 ports to GPUs (25.6 Tbps) and 64 ports to spines (25.6 Tbps) = 1:1. Change to 96 GPU ports and 32 spine ports = 3:1. Know how to calculate this ratio.
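The exam tip's arithmetic as a quick sketch (port speed is parameterized, with 400G assumed by default):

```python
def oversubscription(gpu_ports: int, spine_ports: int, port_gbps: int = 400) -> float:
    """Downlink-to-uplink bandwidth ratio on a leaf switch."""
    downlink_gbps = gpu_ports * port_gbps    # toward GPUs
    uplink_gbps = spine_ports * port_gbps    # toward spines
    return downlink_gbps / uplink_gbps

print(oversubscription(64, 64))  # 1.0 -> 1:1, non-blocking
print(oversubscription(96, 32))  # 3.0 -> 3:1
```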
Animation slot: Interactive oversubscription calculator -- adjust GPU ports vs. spine ports to see the ratio change in real time.
Post-Quiz: Verify Your Understanding
1. Why does a three-tier fabric add a super-spine layer?
   - To reduce the number of leaf switches needed
   - To scale beyond what a two-tier spine-leaf can support
   - To eliminate ECMP routing
   - To add firewall capability
2. What port split does the Nexus 9364D-GX2A use to achieve 1:1 oversubscription?
3. Why do traditional north-south-optimized data centers fail under AI workloads?
   - They lack enough server ports
   - They have east-west bottlenecks from oversubscribed aggregation/core tiers
   - They cannot run RDMA protocols
   - They use incompatible cabling
4. What communication property of LLM training makes rail-optimized topologies effective?
   - All GPUs communicate with all other GPUs equally
   - Sparse, predictable communication patterns aligned with parallelism dimensions
   - Only adjacent GPUs communicate
   - Communication only occurs during inference
5. What is the typical oversubscription ratio at the spine-to-super-spine boundary in Cisco's three-tier AI design?
   - 1:1
   - 2:1
   - 3:1
   - 5:1
Section 3: Key Network Challenges for AI/ML
Pre-Quiz: Test Your Baseline Knowledge
1. What is the "straggler problem" in distributed AI training?
   - A GPU running outdated firmware
   - The slowest GPU in a sync step forces all other GPUs to idle
   - A spine switch dropping packets randomly
   - A storage node running out of disk space
2. Which mechanism pauses traffic per-priority to prevent buffer overflow?
   - ECN
   - DCQCN
   - PFC (Priority Flow Control)
   - ECMP
3. What collective communication operation is most commonly used in distributed training?
   - Broadcast
   - AllReduce
   - Scatter
   - Point-to-point
4. What MTU size must be configured end-to-end for AI lossless fabrics?
Analogy: Imagine an orchestra where every musician must play in perfect unison. If even one musician's sheet music arrives late (packet delay), the entire orchestra must pause and wait. A lossless network is like a perfectly reliable music distribution system.
Connectivity Models: RoCE v2 vs. InfiniBand
| Feature | RoCE v2 | InfiniBand |
| --- | --- | --- |
| Latency | 5 - 10 microseconds | 1 - 2 microseconds |
| Jitter | Moderate (requires tuning) | Extremely low (by design) |
| Packet loss handling | Requires PFC/ECN/DCQCN config | Built-in credit-based flow control |
| Ecosystem | Standard Ethernet (Cisco Nexus) | Dedicated IB switches (NVIDIA) |
| Cost | Lower (existing Ethernet) | Higher (separate fabric) |
| Scale proof | Meta (Llama), IBM (Granite) | NVIDIA DGX SuperPOD |
Exam Tip: Cisco's AI networking strategy centers on RoCE v2 over Nexus platforms. Know the three non-negotiable requirements: lossless transport (PFC + ECN + DCQCN), non-blocking leaf-tier bandwidth, and jumbo MTU 9000 end-to-end.
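For intuition only, here is a toy sketch of how an ECN-reacting sender behaves. The real DCQCN algorithm maintains a smoothed congestion estimate with timers and byte counters; the constants below are purely illustrative:

```python
def next_rate_gbps(rate: float, ecn_marked: bool,
                   cut: float = 0.5, step: float = 5.0,
                   line_rate: float = 400.0) -> float:
    """One toy sender-rate update: multiplicative cut on an ECN mark,
    additive recovery toward line rate otherwise (illustrative constants)."""
    if ecn_marked:
        return rate * (1.0 - cut)
    return min(rate + step, line_rate)

rate = 400.0
rate = next_rate_gbps(rate, ecn_marked=True)   # congestion signaled: rate halves
rate = next_rate_gbps(rate, ecn_marked=False)  # no mark: gradual recovery
print(rate)
```

The key property this captures: ECN marks throttle senders before buffers overflow, so PFC pause frames remain a last-resort backstop rather than the primary congestion control.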
Animation slot: Step-by-step visualization of PFC pause frame and ECN marking during a congestion event.
Post-Quiz: Verify Your Understanding
1. Why is synchronized congestion more damaging than random congestion in AI fabrics?
   - It happens less frequently
   - All GPUs burst traffic simultaneously during collective ops, creating correlated congestion
   - It only affects the spine layer
   - Random congestion does not exist in data centers
2. What does ECN do when queue depth exceeds a threshold?
   - Drops the packet immediately
   - Marks the packet with an ECN bit, signaling the sender to slow down
   - Reroutes the packet to a different spine
   - Compresses the packet to reduce size
3. Which companies have proven RoCE v2 at hyperscale for AI training?
   - Google and Amazon
   - Meta (Llama) and IBM (Granite)
   - Microsoft and Oracle
   - NVIDIA and AMD
4. What advantage does InfiniBand have over RoCE v2 for packet loss handling?
   - It uses TCP retransmission
   - Built-in credit-based flow control (no PFC/ECN configuration needed)
   - It ignores packet loss entirely
   - It uses forward error correction only
5. A single dropped packet in an RDMA fabric causes:
   - No impact -- RDMA handles loss gracefully
   - Retransmission delay that forces all GPUs in the cluster to wait
   - Automatic failover to TCP
   - Only the affected GPU pauses
Section 4: Optical and Copper Technologies for AI
Pre-Quiz: Test Your Baseline Knowledge
1. What is the maximum practical reach of an 800G DAC (passive copper cable)?
   - 100 meters
   - 50 meters
   - 3-5 meters
   - 500 meters
2. What modulation technique do 800G transceivers use to double the bit rate per lane?
5. QSFP-DD supports up to how many electrical lanes?
   - 4 lanes
   - 8 lanes
   - 16 lanes
   - 2 lanes
Key Points
DACs (passive copper): lowest cost, zero power, up to 3-5 m -- ideal for intra-rack GPU-to-leaf connections.
AECs (active electrical copper): 5-10 m reach, 25-50% less power than AOCs -- good for top-of-rack to end-of-row.
Copper cannot practically scale beyond 800G due to skin effect; 1.6 Tbps and above will be optical.
QSFP-DD and OSFP: both use 8 lanes at 100G each with PAM4 modulation for 800 Gbps total.
Transceiver reach: SR8 (100 m multimode), DR4 (500 m single-mode), FR4 (2 km), LR4 (10 km).
Copper Interconnects
Direct Attach Cables (DACs) are passive copper assemblies with pre-attached transceivers, saving 50-70% vs. optical at distances up to 3-5 meters. Zero power consumption and near-zero latency contribution make them ideal for within-rack connections.
Active Electrical Cables (AECs) embed signal-conditioning electronics, extending copper reach to 5-10 meters while consuming 25-50% less power than Active Optical Cables. AECs are emerging as preferred for top-of-rack to end-of-row in AI clusters.
Copper limitation: the skin effect confines current to a conductor's surface, so signal loss rises with frequency. Industry consensus holds that copper cannot scale beyond 800G -- at 1.6 Tbps, cables become too short, too thick, and impractical.
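The physics behind this limit is the standard skin-depth relation from general electromagnetics (not specific to this document): at frequency $f$, current is confined to a surface layer of depth

```latex
% Skin depth: rho = conductor resistivity, mu = permeability, f = frequency
\[
  \delta = \sqrt{\frac{\rho}{\pi f \mu}}
\]
% Effective resistance scales inversely with the conducting cross-section,
% so conductor loss grows roughly as the square root of frequency:
\[
  R_{\text{eff}} \propto \frac{1}{\delta} \propto \sqrt{f}
\]
```

Doubling the symbol rate therefore always raises per-meter loss, which is why each copper speed generation reaches a shorter distance than the last.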
Optical Interconnects
Fiber optics deliver data faster and over longer distances with minimal signal loss. Optical cables are thinner and lighter, enabling higher density. However, they cost up to 7x equivalent copper connections and consume more power.
For AI workloads, optical is essential for any link exceeding 5-10 meters: all spine uplinks, inter-rack, and inter-building links.
Transceiver Form Factors
| Form Factor | Electrical Lanes | Max Speed | Best For |
| --- | --- | --- | --- |
| QSFP-DD | 8 x 100G | 800 Gbps | Dense leaf switches; backward compatible with QSFP+/28/56 |
| OSFP | 8 x 100G | 800 Gbps | Spine switches; high-power 800G deployments |
Both use PAM4 modulation (4-level Pulse Amplitude Modulation), encoding 2 bits per symbol to double the bit rate per lane compared to NRZ.
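The lane arithmetic can be checked with a tiny sketch (a nominal 50 GBd symbol rate, ignoring FEC overhead):

```python
# PAM4 encodes 2 bits per symbol (4 amplitude levels), doubling NRZ's 1 bit/symbol.
def lane_gbps(symbol_rate_gbd: float, bits_per_symbol: int) -> float:
    """Line rate per electrical lane: symbol rate times bits per symbol."""
    return symbol_rate_gbd * bits_per_symbol

def module_gbps(lanes: int, lane_rate: float) -> float:
    """Aggregate module bandwidth across all electrical lanes."""
    return lanes * lane_rate

per_lane = lane_gbps(50.0, 2)    # nominal 100G per lane (FEC overhead ignored)
print(module_gbps(8, per_lane))  # 800.0 -> an 800G QSFP-DD/OSFP module
```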
Figure 5.5: Interconnect Technology Selection by Distance
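The distance-based selection in Figure 5.5 can be sketched as a simple lookup using the reach figures quoted in this section (the function name and tiering are illustrative):

```python
def pick_interconnect(distance_m: float) -> str:
    """Return the shortest-reach technology that covers the link,
    using the reach figures from this section."""
    if distance_m <= 5:
        return "DAC (passive copper)"
    if distance_m <= 10:
        return "AEC (active copper)"
    if distance_m <= 100:
        return "SR8 (multimode optics)"
    if distance_m <= 500:
        return "DR4 (single-mode optics)"
    if distance_m <= 2000:
        return "FR4 (single-mode optics)"
    if distance_m <= 10000:
        return "LR4 (single-mode optics)"
    raise ValueError("beyond standard data-center reach")

print(pick_interconnect(3))   # intra-rack GPU-to-leaf -> DAC
print(pick_interconnect(50))  # leaf-to-spine across rows -> SR8
```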
Emerging Technologies
Co-Packaged Optics (CPO) integrates optical transceivers directly onto the switch ASIC package, eliminating pluggable transceivers. Expected to be critical for 1.6 Tbps and beyond, reducing electrical-to-optical conversion losses.
e-Tube uses RF data transmission through a plastic dielectric waveguide -- a third option between copper and optical offering copper-like simplicity with optical-like bandwidth density for short reach.
The optical interconnect market for AI data centers reached $3.75 billion in 2025, projected to reach $18.36 billion by 2033 (21.87% CAGR). Analysts predict all AI DC interconnects will be optical within five years.
Exam Tip: For the 300-640 DCAI exam, know QSFP-DD and OSFP form factors, understand PAM4 modulation at 800G, and be able to match transceiver reach designations (SR8/DR4/FR4/LR4) to deployment distances.
Animation slot: Interactive distance-to-transceiver selector -- input a distance and see the recommended interconnect technology with cost/power tradeoffs.
Post-Quiz: Verify Your Understanding
1. For a GPU NIC-to-leaf connection within a rack (under 3 m), which interconnect is optimal?
2. Why can copper not practically scale beyond 800G for data center interconnects?
   - Copper is too expensive at high speeds
   - The skin effect causes signal loss at high frequencies, requiring impractically thick/short cables
   - Copper does not support PAM4 modulation
   - Regulatory limits on copper bandwidth
3. What fiber type does SR8 use, and what is its maximum reach?
   - Single-mode, 500 m
   - Multimode (OM4/OM5), 100 m
   - Multimode, 2 km
   - Single-mode, 100 m
4. How does PAM4 modulation double the bit rate compared to NRZ?
   - By using two separate fiber strands
   - By encoding 2 bits per symbol using 4 amplitude levels
   - By doubling the clock frequency
   - By compressing the data before transmission
5. For a leaf-to-spine link spanning 50 meters across data center rows, which transceiver is most appropriate?
   - 800G DAC
   - 800G AEC
   - 800G SR8 over multimode fiber
   - 800G LR4 over single-mode fiber