Chapter 6: Compute and GPU Architecture for AI

Learning Objectives

Section 1: Compute Resource Evaluation for AI

Pre-Quiz: Compute Resource Evaluation

1. For AI training workloads, which resource is the primary scaling dimension?

A) CPU core count
B) System RAM capacity
C) GPU count and GPU memory capacity
D) Storage IOPS

2. What is the recommended ratio for provisioning system RAM relative to total GPU memory?

A) 1:1 ratio
B) 2-4x total GPU memory
C) 8-16x total GPU memory
D) Equal to the model parameter count in GB

3. Which memory type provides the highest bandwidth for AI training on NVIDIA data center GPUs?

A) DDR5 system RAM
B) GDDR6X
C) HBM3/HBM3E
D) LPDDR5

4. NVIDIA vGPU technology on NVSwitch-equipped systems can allocate how many GPUs to a single VM?

A) Only 1 GPU
B) Up to 4 GPUs
C) 1, 2, 4, or 8 GPUs
D) Unlimited GPUs

5. When scaling AI workloads, why should architects maximize per-node GPU density before adding nodes?

A) It reduces power consumption
B) Intra-node NVLink communication is an order of magnitude faster than inter-node networking
C) It simplifies software licensing
D) More nodes always perform worse than fewer nodes

Key Points

CPU vs. GPU Roles in AI Workloads

Think of a data center AI system like a factory. The CPU is the factory manager -- it coordinates tasks, moves data, and handles logistics. The GPU is the assembly line -- it performs the same operation on thousands of items simultaneously. AI training is overwhelmingly an assembly-line problem: matrix multiplication across millions of parameters maps naturally to the massively parallel GPU architecture.

| Workload Phase | Primary Compute | CPU Role | GPU Role | Memory Priority |
|---|---|---|---|---|
| Data preprocessing | CPU-heavy | ETL, augmentation, tokenization | Minimal | System RAM |
| Model training (large) | GPU-dominant | Data loading, orchestration | Matrix math, gradient compute | HBM capacity + bandwidth |
| Model fine-tuning | GPU-moderate | Checkpoint management | Smaller-scale gradient updates | HBM capacity |
| Inference (batch) | GPU-moderate | Request batching, post-processing | Forward pass execution | HBM bandwidth |
| Inference (real-time) | GPU + CPU | Request routing, pre/post-processing | Low-latency forward pass | HBM latency |

Memory Hierarchy: HBM, GDDR, and System RAM

```mermaid
graph TD
    subgraph Tier1["Tier 1: GPU On-Package Memory"]
        HBM["HBM3 / HBM3E\n2-3+ TB/s bandwidth\n40-288 GB per GPU"]
    end
    subgraph Tier2["Tier 2: GPU Card Memory"]
        GDDR["GDDR6X\n500-1,000 GB/s bandwidth\n12-24 GB per GPU"]
    end
    subgraph Tier3["Tier 3: CPU System Memory"]
        DDR5["DDR5 System RAM\n50-100 GB/s bandwidth\n256 GB - 2 TB per server"]
    end
    HBM -->|"Feeds Tensor Cores"| GPU["GPU Compute"]
    GDDR -->|"Inference and edge AI"| GPU
    DDR5 -->|"Data staging"| CPU["CPU / Data Pipeline"]
    CPU -->|"Feeds data to GPU"| GPU
    style Tier1 fill:#1a5c1a,stroke:#333,color:#fff
    style Tier2 fill:#1a3d6e,stroke:#333,color:#fff
    style Tier3 fill:#6e3d1a,stroke:#333,color:#fff
```

| Memory Type | Typical Bandwidth | Capacity Range | Primary Use |
|---|---|---|---|
| HBM3/HBM3E | 2-3+ TB/s | 40-288 GB per GPU | Training, large-scale inference |
| GDDR6X | 500-1,000 GB/s | 12-24 GB per GPU | Inference, edge AI |
| DDR5 (System) | 50-100 GB/s | 256 GB - 2 TB per server | Data staging, preprocessing |
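The capacity and sizing rules in this section can be turned into a rough back-of-the-envelope calculator. As a sketch only: the ~16 bytes-per-parameter figure below is a common rule of thumb for mixed-precision training with the Adam optimizer (FP16 weights + gradients, FP32 master weights, two optimizer moments) and is an assumption, not a number stated in this chapter; the 2-4x system RAM rule is from this section.

```python
def training_memory_gb(params_billion: float, bytes_per_param: int = 16) -> float:
    """Rough HBM footprint for mixed-precision Adam training.

    Assumes ~16 bytes/parameter: FP16 weights (2 B) + gradients (2 B)
    + FP32 master weights (4 B) + two Adam moments (8 B).
    Activations are excluded -- they add a workload-dependent amount.
    """
    return params_billion * 1e9 * bytes_per_param / 1e9  # GB

def recommended_system_ram_gb(gpus: int, hbm_per_gpu_gb: int) -> tuple:
    """Chapter guideline: provision system RAM at 2-4x total GPU memory."""
    total_hbm = gpus * hbm_per_gpu_gb
    return (2 * total_hbm, 4 * total_hbm)

# A 70B-parameter model needs ~1,120 GB of HBM for weights, gradients,
# and optimizer state alone, so it must be sharded across many GPUs.
print(training_memory_gb(70))            # 1120.0
# An 8x 80 GB server (640 GB total HBM) calls for 1,280-2,560 GB of RAM.
print(recommended_system_ram_gb(8, 80))  # (1280, 2560)
```

This is why GPU memory capacity, not CPU metrics, drives node sizing: the model state must fit in aggregate HBM before any other provisioning decision matters.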

Virtualization and vGPU

NVIDIA vGPU technology enables multiple VMs to share a single physical GPU, or a single VM to harness multiple GPUs. On NVSwitch-equipped systems (Ampere, Hopper, Blackwell HGX platforms), NVIDIA Fabric Manager creates a unified memory domain enabling multi-GPU VMs with 1, 2, 4, or 8 GPUs.

Scalability: Scale-Up vs. Scale-Out

Scale-up adds more GPUs within a single node (e.g., 4 to 8 GPUs via NVLink). Scale-out adds more nodes via high-speed networking. Intra-node GPU-to-GPU communication via NVLink is an order of magnitude faster than inter-node networking, so maximize per-node GPU density first.
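The "order of magnitude" claim can be sanity-checked with simple arithmetic, using the per-link figures cited in this chapter. This is a simplified model that ignores latency, protocol overhead, and collective-communication algorithms:

```python
# Time to move a 10 GB gradient shard between two GPUs, intra-node vs
# inter-node, using this chapter's illustrative link speeds.
payload_gb = 10

nvlink_gbps  = 900        # GB/s -- NVLink 4.0 per-GPU bidirectional (Hopper)
network_gbps = 400 / 8    # a 400 Gb/s InfiniBand/Ethernet link = 50 GB/s

t_nvlink  = payload_gb / nvlink_gbps    # ~0.011 s
t_network = payload_gb / network_gbps   # 0.2 s

print(round(t_network / t_nvlink))      # ~18x slower across nodes
```

Even with this generous model (a dedicated 400 Gb/s link per GPU pair), inter-node transfers are roughly an order of magnitude slower, which is why per-node GPU density should be maximized first.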

Animation: Scale-up vs. scale-out data flow showing NVLink bandwidth advantage within a node compared to inter-node network links
Key Takeaway: AI compute evaluation centers on GPU count, HBM capacity, and interconnect bandwidth -- not traditional CPU metrics. Provision system RAM at 2-4x total GPU memory, and leverage vGPU for multi-tenant flexibility.
Post-Quiz: Compute Resource Evaluation

1. During large-scale model training, the GPU primarily performs which function?

A) Data loading and ETL
B) Matrix math and gradient computation
C) Request routing
D) Checkpoint management

2. HBM3E on the NVIDIA Blackwell Ultra architecture supports up to how much memory per GPU?

A) 40 GB
B) 80 GB
C) 288 GB
D) 512 GB

3. Which NVIDIA component configures NVSwitch memory fabric for vGPU multi-GPU VMs?

A) CUDA Runtime
B) NVIDIA Fabric Manager
C) NCCL Library
D) TensorRT

4. If a server has 8 GPUs each with 80 GB HBM, what is the recommended minimum system RAM?

A) 640 GB (1x GPU memory)
B) 1,280 GB (2x GPU memory)
C) 5,120 GB (8x GPU memory)
D) 320 GB (0.5x GPU memory)

5. What vGPU capability allows GPU-accelerated VMs to be moved between hosts during maintenance?

A) Fractional GPU sharing
B) Multi-GPU aggregation
C) Live migration
D) Snapshot and recovery

Section 2: GPU Deployment and Interconnects

Pre-Quiz: GPU Deployment and Interconnects

1. What problem does NVLink solve compared to PCIe-based GPU-to-GPU communication?

A) Reduces GPU power consumption
B) Eliminates the bottleneck of routing GPU traffic through the CPU and PCIe bus
C) Adds encryption to GPU data transfers
D) Enables GPUs from different vendors to communicate

2. NVLink 5.0 (Blackwell) provides how much bidirectional bandwidth per GPU?

A) 600 GB/s
B) 900 GB/s
C) 1,800 GB/s
D) 3,600 GB/s

3. What is the primary function of NVSwitch?

A) Convert PCIe signals to NVLink signals
B) Enable any-to-any GPU communication at full NVLink speed via a crossbar fabric
C) Manage GPU power distribution
D) Provide inter-node networking between servers

4. How many GPUs does the NVIDIA NVL72 rack-scale architecture connect?

A) 8 GPUs
B) 36 GPUs
C) 72 GPUs
D) 576 GPUs

5. For which workload type is PCIe-based GPU connectivity still cost-effective?

A) Large-scale distributed training
B) Single-GPU inference and development environments
C) Multi-node LLM training
D) Real-time multi-GPU inference

Key Points

NVLink: Direct GPU-to-GPU Interconnect

NVLink is NVIDIA's proprietary high-bandwidth, low-latency point-to-point interconnect. Without NVLink, GPU-to-GPU data transfers must route through the PCIe bus and CPU -- like shipping parts between two factories via a shared highway through a central depot. NVLink is the dedicated private conveyor belt between them.

```mermaid
flowchart LR
    subgraph PCIe_Path["PCIe Path - Indirect"]
        direction LR
        G1a["GPU 1"] -->|"PCIe Bus"| CPU1["CPU"] -->|"PCIe Bus"| G2a["GPU 2"]
    end
    subgraph NVLink_Path["NVLink Path - Direct"]
        direction LR
        G1b["GPU 1"] <-->|"NVLink Direct\nUp to 1,800 GB/s"| G2b["GPU 2"]
    end
    style PCIe_Path fill:#6e1a1a,stroke:#333,color:#fff
    style NVLink_Path fill:#1a5c1a,stroke:#333,color:#fff
```

NVLink Generational Bandwidth Progression

| Generation | GPU Architecture | Bandwidth per GPU (Bidir) | Links per GPU |
|---|---|---|---|
| NVLink 1.0 | Pascal (P100) | 160 GB/s | 4 |
| NVLink 3.0 | Ampere (A100) | 600 GB/s | 12 |
| NVLink 4.0 | Hopper (H100) | 900 GB/s | 18 |
| NVLink 5.0 | Blackwell (GB200) | 1,800 GB/s | 18 |
| NVLink 6.0 | Rubin | 3,600 GB/s (3.6 TB/s) | -- |

NVSwitch: Full-Mesh GPU Communication

NVSwitch acts as a high-speed crossbar switch enabling any GPU to communicate with any other GPU at full NVLink speed. Without NVSwitch, connecting 8 GPUs in a full mesh would exhaust available NVLink lanes. NVSwitch centralizes the switching function.
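The lane-exhaustion argument is simple combinatorics. A quick sketch, using the 18-links-per-GPU figure for Hopper from the NVLink table:

```python
from math import comb

gpus = 8
links_per_gpu = 18          # NVLink 4.0 (H100)

# A direct full mesh needs one point-to-point bundle per GPU pair:
pairs = comb(gpus, 2)       # 28 pairwise connections

# Each GPU would have to split its 18 links across 7 peers, leaving
# only 2 links (a fraction of its bandwidth) per peer:
links_per_peer = links_per_gpu // (gpus - 1)   # 2

print(pairs, links_per_peer)   # 28 2
```

With NVSwitch, each GPU instead points all of its links at the crossbar fabric and reaches every peer at full NVLink speed.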

```mermaid
graph TD
    subgraph With_NVSwitch["With NVSwitch: Centralized Crossbar"]
        SW["NVSwitch\nCrossbar Fabric"]
        B1["GPU 0"] <--> SW
        B2["GPU 1"] <--> SW
        B3["GPU 2"] <--> SW
        B4["GPU 3"] <--> SW
        B5["GPU 4"] <--> SW
        B6["GPU 5"] <--> SW
        B7["GPU 6"] <--> SW
        B8["GPU 7"] <--> SW
    end
    style SW fill:#d4a017,stroke:#333,color:#000
    style With_NVSwitch fill:#1a3d6e,stroke:#666,color:#fff
```

| NVSwitch Gen | Architecture | Bidirectional Bandwidth | Ports per Chip |
|---|---|---|---|
| Gen 1 | Volta (V100) | 900 GB/s | 18 NVLink ports |
| Gen 2 | Ampere (A100) | 600 GB/s | 36 NVLink ports |
| Gen 3 | Hopper (H100/H200) | 25.6 Tb/s | -- |
| Gen 4 | Blackwell | 14.4 TB/s non-blocking | 72 NVLink 5.0 ports |

The NVL72 connects 72 GPUs in an all-to-all topology delivering 130 TB/s of aggregate NVLink bandwidth. A GPT-scale 1.8 trillion-parameter model trains approximately 4x faster and serves inference approximately 30x faster on NVL72 compared to 8-GPU systems.

PCIe vs. NVLink Decision Guide

| Characteristic | PCIe Gen 5 x16 | NVLink 5.0 |
|---|---|---|
| Bidirectional bandwidth | ~128 GB/s | 1,800 GB/s |
| Bandwidth ratio | 1x (baseline) | ~14x PCIe Gen 5 |
| Topology | Shared bus through CPU | Direct GPU-to-GPU |
| Cost | Lower (standard servers) | Higher (HGX/NVSwitch systems) |
| Best for | Single-GPU inference, general compute | Multi-GPU training, large model inference |
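The ~14x ratio follows directly from the two bandwidth figures:

```python
pcie_gen5_x16_gbps = 128    # ~GB/s bidirectional, PCIe Gen 5 x16
nvlink5_gbps = 1800         # GB/s bidirectional per GPU, NVLink 5.0

ratio = nvlink5_gbps / pcie_gen5_x16_gbps
print(round(ratio, 1))      # 14.1
```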

Multi-GPU Communication Hierarchy

```mermaid
graph TD
    subgraph Node["Single Server Node"]
        subgraph GPU["Single GPU"]
            TC["Tensor Cores"] <-->|"Intra-GPU\nFastest"| HBM_M["HBM"]
        end
        GPU <-->|"Intra-Node / Inter-GPU\nNVLink: 900-1,800 GB/s"| GPU2["Other GPUs\nin Same Node"]
    end
    Node <-->|"Inter-Node\nInfiniBand / Ethernet\n200-400 Gb/s per link"| Node2["Other Server Nodes"]
    style GPU fill:#1a5c1a,stroke:#333,color:#fff
    style Node fill:#1a3d6e,stroke:#333,color:#fff
```

NVIDIA NCCL (Collective Communications Library) is topology-aware: tensor parallelism (splitting a single layer across GPUs) stays within a node on NVLink, while pipeline parallelism (splitting sequential layers across nodes) tolerates higher inter-node latency.

Animation: Data flow comparison showing tensor parallelism within an NVLink-connected node vs. pipeline parallelism across inter-node network links
Key Takeaway: NVLink provides up to 14x the bandwidth of PCIe Gen 5 for GPU-to-GPU communication. NVSwitch enables any-to-any GPU connectivity at full NVLink speed. PCIe remains cost-effective for single-GPU inference, but multi-GPU training demands NVLink/NVSwitch.
Post-Quiz: GPU Deployment and Interconnects

1. NVLink 5.0 delivers approximately how many times the bandwidth of PCIe Gen 5?

A) 4x
B) 7x
C) 14x
D) 28x

2. In the HGX H100 system, how many NVSwitch chips provide 3.6 TB/s across eight GPUs?

A) 1
B) 2
C) 4
D) 8

3. Which parallelism strategy is typically confined within a single node because it requires the highest bandwidth?

A) Data parallelism
B) Pipeline parallelism
C) Tensor parallelism
D) Expert parallelism

4. Which Cisco product line is recommended for cost-effective single-GPU inference with PCIe GPUs?

A) Cisco Dense AI GPU servers with HGX
B) Cisco UCS C-Series with PCIe GPUs
C) Cisco Nexus Hyperfabric switches
D) Cisco MDS storage directors

5. The NVIDIA NVLink Switch system can connect up to how many GPUs in a non-blocking fabric?

A) 72 GPUs
B) 256 GPUs
C) 576 GPUs
D) 1,024 GPUs

Section 3: AI-Enabling Hardware

Pre-Quiz: AI-Enabling Hardware

1. For AI workloads, which GPU component matters more than raw CUDA Core count?

A) Clock speed
B) Tensor Core count and HBM capacity
C) GDDR memory bandwidth
D) Number of PCIe lanes

2. What is the core function of Tensor Cores?

A) General-purpose floating point computation
B) Matrix multiply-accumulate (MMA) operations in a single clock cycle
C) Video encoding and decoding
D) Memory management and caching

3. What does a DPU (Data Processing Unit) offload from the host CPU?

A) AI model training computation
B) Networking, storage I/O, and security functions
C) GPU memory management
D) Database query processing

4. Which Tensor Core generation introduced FP8 precision and the Transformer Engine?

A) 2nd Gen (Turing)
B) 3rd Gen (Ampere)
C) 4th Gen (Hopper)
D) 5th Gen (Blackwell)

5. Approximately what percentage of AI model training tasks are now offloaded to DPUs?

A) 10%
B) 20%
C) 35%
D) 50%

Key Points

CUDA Cores vs. Tensor Cores

CUDA Cores are general-purpose parallel processors (128 per SM) handling FP32, INT32, FP16/BF16, activation functions, and preprocessing. Tensor Cores are purpose-built for deep learning matrix multiply-accumulate (MMA) operations -- D = A x B + C -- in a single clock cycle.
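A pure-Python sketch of the MMA operation a Tensor Core executes on a small tile in one step (shown at 2x2 for readability; real Tensor Cores operate on larger tiles in low-precision formats):

```python
def mma(A, B, C):
    """Fused matrix multiply-accumulate, D = A @ B + C,
    for square tiles given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) + C[i][j]
             for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 1], [1, 1]]
print(mma(A, B, C))   # [[20, 23], [44, 51]]
```

A CUDA Core would execute each multiply and each add as a separate instruction; the Tensor Core fuses the entire tile-level D = A x B + C into a single operation, which is where the deep learning speedup comes from.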

Critical exam insight: An A100 with 6,912 CUDA Cores outperforms an RTX 3090 with 10,496 CUDA Cores in deep learning because the A100 has 432 third-generation Tensor Cores and 40 GB of HBM2e.

Tensor Core Generational Evolution

| Generation | Architecture | Key Precision Additions | Notable Feature |
|---|---|---|---|
| 1st Gen | Volta (V100) | FP16 mixed-precision | First Tensor Cores |
| 2nd Gen | Turing | INT8, INT4 | Inference optimization |
| 3rd Gen | Ampere (A100) | TF32, BF16, FP64 | Sparsity support (2:4) |
| 4th Gen | Hopper (H100) | FP8 | Transformer Engine |
| 5th Gen | Blackwell | FP4, FP6 | Extended low-precision |
Animation: Visual comparison of CUDA Core (single operations) vs. Tensor Core (matrix multiply-accumulate in one cycle) execution patterns

DPUs (Data Processing Units)

DPUs are the third pillar of modern data center compute alongside CPUs and GPUs. A DPU combines a high-speed network interface, programmable Arm CPU cores, and hardware acceleration engines for networking, storage, and security.

```mermaid
flowchart TD
    subgraph Without_DPU["Without DPU"]
        CPU1["CPU"] -->|"Data pipeline"| GPU1["GPU"]
        CPU1 -->|"Also handles"| NET1["Networking"]
        CPU1 -->|"Also handles"| SEC1["Encryption / Firewall"]
        CPU1 -->|"Also handles"| STO1["Storage I/O"]
    end
    subgraph With_DPU["With DPU"]
        CPU2["CPU\n100% focused on\ndata pipeline"] -->|"Full bandwidth\ndata feed"| GPU2["GPU\nFully utilized"]
        DPU["DPU\nBlueField"] -->|"Offloads"| NET2["Networking"]
        DPU -->|"Offloads"| SEC2["Encryption / Firewall"]
        DPU -->|"Offloads"| STO2["Storage I/O"]
    end
    style Without_DPU fill:#6e1a1a,stroke:#333,color:#fff
    style With_DPU fill:#1a5c1a,stroke:#333,color:#fff
    style DPU fill:#d4a017,stroke:#333,color:#000
```

NVIDIA BlueField DPU Family

| Model | Network Speed | Compute | Key Capabilities |
|---|---|---|---|
| BlueField-2 | 200 Gb/s | 8 Arm A72 cores | Networking, storage, security offload |
| BlueField-3 | 400 Gb/s | 16 Arm A78 cores, 16 GB DDR5 | Line-rate SDN, crypto |
| BlueField-4 | 800 Gb/s | 6x compute of BF-3 | Gigascale AI factory support |

SmartNICs for AI Infrastructure

SmartNICs and DPUs are closely related -- a DPU is essentially a SmartNIC with a full programmable compute complex. By offloading networking, storage I/O, and security processing, they free host CPU cycles for the data pipeline and keep GPUs fed at full bandwidth.

Hardware Selection: Training vs. Inference

| Dimension | Training | Inference |
|---|---|---|
| GPU tier | High-end (H100, B200, GB200) | Mid-range acceptable (A30, L4, L40S) |
| GPU memory | Maximum HBM (80-288 GB) | Moderate (24-80 GB) |
| Interconnect | NVLink/NVSwitch required | PCIe often sufficient |
| Precision | FP16/BF16/TF32 (mixed) | INT8/FP8/INT4 (quantized) |
| Batch size | Large (maximize utilization) | Variable (latency vs. throughput) |
| Duration | Hours to weeks | Continuous, 24/7 |
| Cost priority | Performance per dollar per training hour | Performance per watt for ongoing costs |
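The precision row has a direct memory consequence. A sketch of weight-only memory for a 7B-parameter model at these precisions (KV cache and activations are extra, and real quantization schemes add small per-group scale overheads):

```python
# Bytes per parameter for the precision formats named in this section.
BYTES_PER_PARAM = {"FP32": 4, "FP16/BF16": 2, "INT8/FP8": 1, "INT4/FP4": 0.5}

params = 7e9  # a 7B-parameter model
for fmt, nbytes in BYTES_PER_PARAM.items():
    print(f"{fmt:>10}: {params * nbytes / 1e9:.1f} GB")
# FP32 needs 28 GB for weights alone; INT4 quantization cuts that to
# 3.5 GB -- the model fits on far cheaper inference GPUs.
```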
Key Takeaway: Tensor Cores and HBM capacity -- not raw CUDA Core count -- determine AI training performance. DPUs offload infrastructure tasks to prevent CPU bottlenecks, and hardware selection must align with the distinct resource profiles of training vs. inference.
Post-Quiz: AI-Enabling Hardware

1. The Transformer Engine in Hopper GPUs automatically selects between which precisions on a per-layer basis?

A) FP32 and FP16
B) FP8 and higher precision
C) INT8 and INT4
D) TF32 and BF16

2. What network speed does the BlueField-3 DPU support?

A) 100 Gb/s
B) 200 Gb/s
C) 400 Gb/s
D) 800 Gb/s

3. Which Tensor Core generation introduced sparsity support (2:4 structured sparsity)?

A) 1st Gen (Volta)
B) 2nd Gen (Turing)
C) 3rd Gen (Ampere)
D) 4th Gen (Hopper)

4. For inference workloads, which numerical precision formats are typically used for maximum throughput?

A) FP32 and FP64
B) INT8/FP8/INT4 (quantized)
C) FP16 and BF16 only
D) TF32 exclusively

5. Why does the A100 outperform the RTX 3090 for deep learning despite having fewer CUDA Cores?

A) Higher clock speed
B) More Tensor Cores (432 vs. fewer) and HBM2e memory
C) More PCIe lanes
D) Better cooling design

Section 4: Compute Resource Solutions

Pre-Quiz: Compute Resource Solutions

1. Virtualized GPU deployments achieve what percentage of bare-metal performance in MLPerf benchmarks?

A) 60-75%
B) 75-85%
C) 95-100%
D) 100% with no overhead

2. Which deployment model is recommended for latency-critical production inference?

A) Virtualized vGPU only
B) Bare-metal
C) Containers on shared VMs
D) Serverless GPU

3. What is the typical virtualization overhead range in real-world GPU deployments?

A) 0-5%
B) 5-10%
C) 15-25%
D) 30-50%

4. In a hybrid AI infrastructure, where should large-scale LLM training be directed?

A) vGPU tier
B) Bare-metal tier
C) Cloud burst tier
D) CPU-only tier

5. Virtualized configurations use what percentage of CPU cores to deliver near bare-metal GPU performance?

A) 100% of CPU cores
B) 85-95% of CPU cores
C) 28.5-67% of CPU cores
D) 10-20% of CPU cores

Key Points

GPU Deployment Model Decision Flow

```mermaid
flowchart TD
    Start["AI Workload\nDeployment Decision"] --> Q1{"Large-scale\ndistributed training?"}
    Q1 -->|"Yes"| BM["Bare-Metal\nHGX + NVLink/NVSwitch"]
    Q1 -->|"No"| Q2{"Latency-critical\nproduction inference?"}
    Q2 -->|"Yes"| BM
    Q2 -->|"No"| Q3{"Multi-tenant or\nshared environment?"}
    Q3 -->|"Yes"| VGPU["Virtualized vGPU\nUCS C-Series + NVIDIA vGPU"]
    Q3 -->|"No"| Q4{"Dev/test or\nfine-tuning workload?"}
    Q4 -->|"Yes"| VGPU
    Q4 -->|"No"| HYBRID["Hybrid Strategy\nBare-metal for training\nvGPU for inference/dev"]
    style BM fill:#1a5c1a,stroke:#333,color:#fff
    style VGPU fill:#1a3d6e,stroke:#333,color:#fff
    style HYBRID fill:#d4a017,stroke:#333,color:#000
```
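The same decision flow can be encoded as a small function. This is purely illustrative -- the function name and return strings are invented for this sketch, not part of any Cisco or NVIDIA tooling:

```python
def deployment_tier(large_scale_training: bool,
                    latency_critical_inference: bool,
                    multi_tenant: bool,
                    dev_test_or_finetune: bool) -> str:
    """Walks the deployment decision flow from top to bottom."""
    if large_scale_training or latency_critical_inference:
        return "bare-metal (HGX + NVLink/NVSwitch)"
    if multi_tenant or dev_test_or_finetune:
        return "virtualized vGPU (UCS C-Series + NVIDIA vGPU)"
    return "hybrid (bare-metal for training, vGPU for inference/dev)"

print(deployment_tier(True, False, False, False))   # bare-metal tier
print(deployment_tier(False, False, True, False))   # vGPU tier
```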

Bare-Metal GPU Deployment

Applications access GPU resources directly on the OS with no hypervisor layer, delivering maximum performance. Recommended for large-scale distributed training, latency-critical production inference, and multi-node jobs that span NVLink/NVSwitch fabrics.

Tradeoff: no live migration, snapshots, or dynamic resource reallocation without additional orchestration.

Virtualized GPU (vGPU) Deployment

NVIDIA vGPU with VMware VCF enables GPU resources to be shared, partitioned, and dynamically allocated. In MLPerf benchmarks, virtualized configurations deliver 95-100% of bare-metal performance while using only 28.5-67% of available CPU cores, leaving the remaining CPU and memory capacity free for other workloads.

Recommended for: development/experimentation, variable inference workloads, multi-tenant environments, and small-to-medium model fine-tuning.

Hybrid Deployment Strategy

Most mature enterprise AI organizations adopt a hybrid model:

| Bare-Metal Tier | vGPU Tier |
|---|---|
| LLM training | Development/test |
| Production serving | Model fine-tuning |
| Real-time inference | Batch inference |
| Multi-node training | Multi-tenant shared clusters |
| HGX + NVLink/NVSwitch | UCS C-Series + NVIDIA vGPU |

Cisco AI Infrastructure Portfolio

| Solution Category | Products | Use Case |
|---|---|---|
| Dense AI GPU Servers | Cisco servers with NVIDIA HGX / AMD OAM | Large-scale training, multi-GPU |
| PCIe GPU Servers | UCS C-Series with NVIDIA RTX Pro GPUs | Inference, development, VDI |
| Validated Accelerators | NVIDIA A30 Tensor Core GPU for UCS C-Series | Mainstream AI inference, training, HPC |
| Network Fabric | Cisco Nexus Hyperfabric | AI-optimized networking |
| Virtualization | VMware VCF with NVIDIA vGPU on Cisco UCS | Multi-tenant AI environments |
Animation: Decision tree walkthrough showing how different workload types map to bare-metal, vGPU, or hybrid deployment tiers
Key Takeaway: Virtualized GPU deployments now achieve 95-100% of bare-metal performance in benchmarks, but bare-metal remains essential for the most demanding training workloads. A hybrid strategy delivers the best balance of performance, utilization, and operational efficiency.
Post-Quiz: Compute Resource Solutions

1. Which deployment model trades operational flexibility for maximum GPU performance?

A) Virtualized vGPU
B) Bare-metal
C) Hybrid
D) Serverless

2. In a hybrid strategy, which workloads belong on the vGPU tier?

A) LLM training and production serving
B) Development, testing, fine-tuning, and batch inference
C) Real-time inference only
D) Multi-node distributed training

3. The Cisco Nexus Hyperfabric AI reference architecture is compliant with which vendor's validated designs?

A) AMD
B) Intel
C) NVIDIA
D) Broadcom

4. What key advantage do virtualized configurations offer by using only 28.5-67% of CPU cores?

A) Lower power consumption only
B) Remaining CPU and memory capacity can run other workloads, improving ROI
C) Faster GPU performance
D) Simplified networking

5. Which Cisco product line supports NVIDIA HGX and AMD OAM for large-scale AI training?

A) UCS C-Series rack servers
B) Cisco Dense AI GPU servers
C) Cisco Nexus 9000 switches
D) Cisco Catalyst access points


Answer Explanations