Chapter 6: Compute and GPU Architecture for AI

Learning Objectives

Section 1: Compute Resource Evaluation for AI

Pre-Quiz: Compute Resource Evaluation

1. For AI training workloads, which resource is the primary scaling dimension?

A) CPU core count
B) System RAM capacity
C) GPU count and GPU memory capacity
D) Storage IOPS

2. What is the recommended ratio for provisioning system RAM relative to total GPU memory?

A) 1:1 ratio
B) 2-4x total GPU memory
C) 8-16x total GPU memory
D) Equal to the model parameter count in GB

3. Which memory type provides the highest bandwidth for AI training on NVIDIA data center GPUs?

A) DDR5 system RAM
B) GDDR6X
C) HBM3/HBM3E
D) LPDDR5

4. NVIDIA vGPU technology on NVSwitch-equipped systems can allocate how many GPUs to a single VM?

A) Only 1 GPU
B) Up to 4 GPUs
C) 1, 2, 4, or 8 GPUs
D) Unlimited GPUs

5. When scaling AI workloads, why should architects maximize per-node GPU density before adding nodes?

A) It reduces power consumption
B) Intra-node NVLink communication is an order of magnitude faster than inter-node networking
C) It simplifies software licensing
D) More nodes always perform worse than fewer nodes

Key Points

CPU vs. GPU Roles in AI Workloads

Think of a data center AI system like a factory. The CPU is the factory manager -- it coordinates tasks, moves data, and handles logistics. The GPU is the assembly line -- it performs the same operation on thousands of items simultaneously. AI training is overwhelmingly an assembly-line problem: matrix multiplication across millions of parameters maps naturally to the massively parallel GPU architecture.

| Workload Phase | Primary Compute | CPU Role | GPU Role | Memory Priority |
|---|---|---|---|---|
| Data preprocessing | CPU-heavy | ETL, augmentation, tokenization | Minimal | System RAM |
| Model training (large) | GPU-dominant | Data loading, orchestration | Matrix math, gradient compute | HBM capacity + bandwidth |
| Model fine-tuning | GPU-moderate | Checkpoint management | Smaller-scale gradient updates | HBM capacity |
| Inference (batch) | GPU-moderate | Request batching, post-processing | Forward pass execution | HBM bandwidth |
| Inference (real-time) | GPU + CPU | Request routing, pre/post-processing | Low-latency forward pass | HBM latency |

Memory Hierarchy: HBM, GDDR, and System RAM

```mermaid
graph TD
    subgraph Tier1["Tier 1: GPU On-Package Memory"]
        HBM["HBM3 / HBM3E\n2-3+ TB/s bandwidth\n40-288 GB per GPU"]
    end
    subgraph Tier2["Tier 2: GPU Card Memory"]
        GDDR["GDDR6X\n500-1,000 GB/s bandwidth\n12-24 GB per GPU"]
    end
    subgraph Tier3["Tier 3: CPU System Memory"]
        DDR5["DDR5 System RAM\n50-100 GB/s bandwidth\n256 GB - 2 TB per server"]
    end
    HBM -->|"Feeds Tensor Cores"| GPU["GPU Compute"]
    GDDR -->|"Inference and edge AI"| GPU
    DDR5 -->|"Data staging"| CPU["CPU / Data Pipeline"]
    CPU -->|"Feeds data to GPU"| GPU
    style Tier1 fill:#1a5c1a,stroke:#333,color:#fff
    style Tier2 fill:#1a3d6e,stroke:#333,color:#fff
    style Tier3 fill:#6e3d1a,stroke:#333,color:#fff
```

| Memory Type | Typical Bandwidth | Capacity Range | Primary Use |
|---|---|---|---|
| HBM3/HBM3E | 2-3+ TB/s | 40-288 GB per GPU | Training, large-scale inference |
| GDDR6X | 500-1,000 GB/s | 12-24 GB per GPU | Inference, edge AI |
| DDR5 (System) | 50-100 GB/s | 256 GB - 2 TB per server | Data staging, preprocessing |
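The capacity and sizing rules in this section can be turned into a rough back-of-the-envelope calculator. As a sketch only: the ~16 bytes-per-parameter figure below is a common rule of thumb for mixed-precision training with the Adam optimizer (FP16 weights + gradients, FP32 master weights, two optimizer moments) and is an assumption, not a number stated in this chapter; the 2-4x system RAM rule is from this section.

```python
def training_memory_gb(params_billion: float, bytes_per_param: int = 16) -> float:
    """Rough HBM footprint for mixed-precision Adam training.

    Assumes ~16 bytes/parameter: FP16 weights (2 B) + gradients (2 B)
    + FP32 master weights (4 B) + two Adam moments (8 B).
    Activations are excluded -- they add a workload-dependent amount.
    """
    return params_billion * 1e9 * bytes_per_param / 1e9  # GB

def recommended_system_ram_gb(gpus: int, hbm_per_gpu_gb: int) -> tuple:
    """Chapter guideline: provision system RAM at 2-4x total GPU memory."""
    total_hbm = gpus * hbm_per_gpu_gb
    return (2 * total_hbm, 4 * total_hbm)

# A 70B-parameter model needs ~1,120 GB of HBM for weights, gradients,
# and optimizer state alone, so it must be sharded across many GPUs.
print(training_memory_gb(70))            # 1120.0
# An 8x 80 GB server (640 GB total HBM) calls for 1,280-2,560 GB of RAM.
print(recommended_system_ram_gb(8, 80))  # (1280, 2560)
```

This is why GPU memory capacity, not CPU metrics, drives node sizing: the model state must fit in aggregate HBM before any other provisioning decision matters.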

Virtualization and vGPU

NVIDIA vGPU technology enables multiple VMs to share a single physical GPU, or a single VM to harness multiple GPUs. On NVSwitch-equipped systems (Ampere, Hopper, Blackwell HGX platforms), NVIDIA Fabric Manager creates a unified memory domain enabling multi-GPU VMs with 1, 2, 4, or 8 GPUs.

Scalability: Scale-Up vs. Scale-Out

Scale-up adds more GPUs within a single node (e.g., 4 to 8 GPUs via NVLink). Scale-out adds more nodes via high-speed networking. Intra-node GPU-to-GPU communication via NVLink is an order of magnitude faster than inter-node networking, so maximize per-node GPU density first.
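The "order of magnitude" claim can be sanity-checked with simple arithmetic, using the per-link figures cited in this chapter. This is a simplified model that ignores latency, protocol overhead, and collective-communication algorithms:

```python
# Time to move a 10 GB gradient shard between two GPUs, intra-node vs
# inter-node, using this chapter's illustrative link speeds.
payload_gb = 10

nvlink_gbps  = 900        # GB/s -- NVLink 4.0 per-GPU bidirectional (Hopper)
network_gbps = 400 / 8    # a 400 Gb/s InfiniBand/Ethernet link = 50 GB/s

t_nvlink  = payload_gb / nvlink_gbps    # ~0.011 s
t_network = payload_gb / network_gbps   # 0.2 s

print(round(t_network / t_nvlink))      # ~18x slower across nodes
```

Even with this generous model (a dedicated 400 Gb/s link per GPU pair), inter-node transfers are roughly an order of magnitude slower, which is why per-node GPU density should be maximized first.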

Animation: Scale-up vs. scale-out data flow showing NVLink bandwidth advantage within a node compared to inter-node network links
Key Takeaway: AI compute evaluation centers on GPU count, HBM capacity, and interconnect bandwidth -- not traditional CPU metrics. Provision system RAM at 2-4x total GPU memory, and leverage vGPU for multi-tenant flexibility.
Post-Quiz: Compute Resource Evaluation

1. During large-scale model training, the GPU primarily performs which function?

A) Data loading and ETL
B) Matrix math and gradient computation
C) Request routing
D) Checkpoint management

2. HBM3E on the NVIDIA Blackwell Ultra architecture supports up to how much memory per GPU?

A) 40 GB
B) 80 GB
C) 288 GB
D) 512 GB

3. Which NVIDIA component configures NVSwitch memory fabric for vGPU multi-GPU VMs?

A) CUDA Runtime
B) NVIDIA Fabric Manager
C) NCCL Library
D) TensorRT

4. If a server has 8 GPUs each with 80 GB HBM, what is the recommended minimum system RAM?

A) 640 GB (1x GPU memory)
B) 1,280 GB (2x GPU memory)
C) 5,120 GB (8x GPU memory)
D) 320 GB (0.5x GPU memory)

5. What vGPU capability allows GPU-accelerated VMs to be moved between hosts during maintenance?

A) Fractional GPU sharing
B) Multi-GPU aggregation
C) Live migration
D) Snapshot and recovery

Section 2: GPU Deployment and Interconnects

Pre-Quiz: GPU Deployment and Interconnects

1. What problem does NVLink solve compared to PCIe-based GPU-to-GPU communication?

A) Reduces GPU power consumption
B) Eliminates the bottleneck of routing GPU traffic through the CPU and PCIe bus
C) Adds encryption to GPU data transfers
D) Enables GPUs from different vendors to communicate

2. NVLink 5.0 (Blackwell) provides how much bidirectional bandwidth per GPU?

A) 600 GB/s
B) 900 GB/s
C) 1,800 GB/s
D) 3,600 GB/s

3. What is the primary function of NVSwitch?

A) Convert PCIe signals to NVLink signals
B) Enable any-to-any GPU communication at full NVLink speed via a crossbar fabric
C) Manage GPU power distribution
D) Provide inter-node networking between servers

4. How many GPUs does the NVIDIA NVL72 rack-scale architecture connect?

A) 8 GPUs
B) 36 GPUs
C) 72 GPUs
D) 576 GPUs

5. For which workload type is PCIe-based GPU connectivity still cost-effective?

A) Large-scale distributed training
B) Single-GPU inference and development environments
C) Multi-node LLM training
D) Real-time multi-GPU inference

Key Points

NVLink: Direct GPU-to-GPU Interconnect

NVLink is NVIDIA's proprietary high-bandwidth, low-latency point-to-point interconnect. Without NVLink, GPU-to-GPU data transfers must route through the PCIe bus and CPU -- like shipping parts between two factories via a shared highway through a central depot. NVLink is the dedicated private conveyor belt between them.

```mermaid
flowchart LR
    subgraph PCIe_Path["PCIe Path - Indirect"]
        direction LR
        G1a["GPU 1"] -->|"PCIe Bus"| CPU1["CPU"] -->|"PCIe Bus"| G2a["GPU 2"]
    end
    subgraph NVLink_Path["NVLink Path - Direct"]
        direction LR
        G1b["GPU 1"] <-->|"NVLink Direct\nUp to 1,800 GB/s"| G2b["GPU 2"]
    end
    style PCIe_Path fill:#6e1a1a,stroke:#333,color:#fff
    style NVLink_Path fill:#1a5c1a,stroke:#333,color:#fff
```

NVLink Generational Bandwidth Progression

| Generation | GPU Architecture | Bandwidth per GPU (Bidir) | Links per GPU |
|---|---|---|---|
| NVLink 1.0 | Pascal (P100) | 160 GB/s | 4 |
| NVLink 3.0 | Ampere (A100) | 600 GB/s | 12 |
| NVLink 4.0 | Hopper (H100) | 900 GB/s | 18 |
| NVLink 5.0 | Blackwell (GB200) | 1,800 GB/s | 18 |
| NVLink 6.0 | Rubin | 3,600 GB/s (3.6 TB/s) | -- |

NVSwitch: Full-Mesh GPU Communication

NVSwitch acts as a high-speed crossbar switch enabling any GPU to communicate with any other GPU at full NVLink speed. Without NVSwitch, connecting 8 GPUs in a full mesh would exhaust available NVLink lanes. NVSwitch centralizes the switching function.
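The lane-exhaustion argument is simple combinatorics. A quick sketch, using the 18-links-per-GPU figure for Hopper from the NVLink table:

```python
from math import comb

gpus = 8
links_per_gpu = 18          # NVLink 4.0 (H100)

# A direct full mesh needs one point-to-point bundle per GPU pair:
pairs = comb(gpus, 2)       # 28 pairwise connections

# Each GPU would have to split its 18 links across 7 peers, leaving
# only 2 links (a fraction of its bandwidth) per peer:
links_per_peer = links_per_gpu // (gpus - 1)   # 2

print(pairs, links_per_peer)   # 28 2
```

With NVSwitch, each GPU instead points all of its links at the crossbar fabric and reaches every peer at full NVLink speed.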

```mermaid
graph TD
    subgraph With_NVSwitch["With NVSwitch: Centralized Crossbar"]
        SW["NVSwitch\nCrossbar Fabric"]
        B1["GPU 0"] <--> SW
        B2["GPU 1"] <--> SW
        B3["GPU 2"] <--> SW
        B4["GPU 3"] <--> SW
        B5["GPU 4"] <--> SW
        B6["GPU 5"] <--> SW
        B7["GPU 6"] <--> SW
        B8["GPU 7"] <--> SW
    end
    style SW fill:#d4a017,stroke:#333,color:#000
    style With_NVSwitch fill:#1a3d6e,stroke:#666,color:#fff
```

| NVSwitch Gen | Architecture | Bidirectional Bandwidth | Ports per Chip |
|---|---|---|---|
| Gen 1 | Volta (V100) | 900 GB/s | 18 NVLink ports |
| Gen 2 | Ampere (A100) | 600 GB/s | 36 NVLink ports |
| Gen 3 | Hopper (H100/H200) | 25.6 Tb/s | -- |
| Gen 4 | Blackwell | 14.4 TB/s non-blocking | 72 NVLink 5.0 ports |

The NVL72 connects 72 GPUs in an all-to-all topology delivering 130 TB/s of aggregate NVLink bandwidth. A GPT-scale 1.8 trillion-parameter model trains approximately 4x faster and serves inference approximately 30x faster on NVL72 compared to 8-GPU systems.

PCIe vs. NVLink Decision Guide

| Characteristic | PCIe Gen 5 x16 | NVLink 5.0 |
|---|---|---|
| Bidirectional bandwidth | ~128 GB/s | 1,800 GB/s |
| Bandwidth ratio | 1x (baseline) | ~14x PCIe Gen 5 |
| Topology | Shared bus through CPU | Direct GPU-to-GPU |
| Cost | Lower (standard servers) | Higher (HGX/NVSwitch systems) |
| Best for | Single-GPU inference, general compute | Multi-GPU training, large model inference |
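The ~14x ratio follows directly from the two bandwidth figures:

```python
pcie_gen5_x16_gbps = 128    # ~GB/s bidirectional, PCIe Gen 5 x16
nvlink5_gbps = 1800         # GB/s bidirectional per GPU, NVLink 5.0

ratio = nvlink5_gbps / pcie_gen5_x16_gbps
print(round(ratio, 1))      # 14.1
```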

Multi-GPU Communication Hierarchy

```mermaid
graph TD
    subgraph Node["Single Server Node"]
        subgraph GPU["Single GPU"]
            TC["Tensor Cores"] <-->|"Intra-GPU\nFastest"| HBM_M["HBM"]
        end
        GPU <-->|"Intra-Node / Inter-GPU\nNVLink: 900-1,800 GB/s"| GPU2["Other GPUs\nin Same Node"]
    end
    Node <-->|"Inter-Node\nInfiniBand / Ethernet\n200-400 Gb/s per link"| Node2["Other Server Nodes"]
    style GPU fill:#1a5c1a,stroke:#333,color:#fff
    style Node fill:#1a3d6e,stroke:#333,color:#fff
```

NVIDIA NCCL (Collective Communications Library) is topology-aware: tensor parallelism (splitting a single layer across GPUs) stays within a node on NVLink, while pipeline parallelism (splitting sequential layers across nodes) tolerates higher inter-node latency.

Animation: Data flow comparison showing tensor parallelism within an NVLink-connected node vs. pipeline parallelism across inter-node network links
Key Takeaway: NVLink provides up to 14x the bandwidth of PCIe Gen 5 for GPU-to-GPU communication. NVSwitch enables any-to-any GPU connectivity at full NVLink speed. PCIe remains cost-effective for single-GPU inference, but multi-GPU training demands NVLink/NVSwitch.
Post-Quiz: GPU Deployment and Interconnects

1. NVLink 5.0 delivers approximately how many times the bandwidth of PCIe Gen 5?

A) 4x
B) 7x
C) 14x
D) 28x

2. In the HGX H100 system, how many NVSwitch chips provide 3.6 TB/s across eight GPUs?

A) 1
B) 2
C) 4
D) 8

3. Which parallelism strategy is typically confined within a single node because it requires the highest bandwidth?

A) Data parallelism
B) Pipeline parallelism
C) Tensor parallelism
D) Expert parallelism

4. Which Cisco product line is recommended for cost-effective single-GPU inference with PCIe GPUs?

A) Cisco Dense AI GPU servers with HGX
B) Cisco UCS C-Series with PCIe GPUs
C) Cisco Nexus Hyperfabric switches
D) Cisco MDS storage directors

5. The NVIDIA NVLink Switch system can connect up to how many GPUs in a non-blocking fabric?

A) 72 GPUs
B) 256 GPUs
C) 576 GPUs
D) 1,024 GPUs

Section 3: AI-Enabling Hardware

Pre-Quiz: AI-Enabling Hardware

1. For AI workloads, which GPU component matters more than raw CUDA Core count?

A) Clock speed
B) Tensor Core count and HBM capacity
C) GDDR memory bandwidth
D) Number of PCIe lanes

2. What is the core function of Tensor Cores?

A) General-purpose floating point computation
B) Matrix multiply-accumulate (MMA) operations in a single clock cycle
C) Video encoding and decoding
D) Memory management and caching

3. What does a DPU (Data Processing Unit) offload from the host CPU?

A) AI model training computation
B) Networking, storage I/O, and security functions
C) GPU memory management
D) Database query processing

4. Which Tensor Core generation introduced FP8 precision and the Transformer Engine?

A) 2nd Gen (Turing)
B) 3rd Gen (Ampere)
C) 4th Gen (Hopper)
D) 5th Gen (Blackwell)

5. Approximately what percentage of AI model training tasks are now offloaded to DPUs?

A) 10%
B) 20%
C) 35%
D) 50%

Key Points

CUDA Cores vs. Tensor Cores

CUDA Cores are general-purpose parallel processors (128 per SM) handling FP32, INT32, FP16/BF16, activation functions, and preprocessing. Tensor Cores are purpose-built for deep learning matrix multiply-accumulate (MMA) operations -- D = A x B + C -- in a single clock cycle.
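A pure-Python sketch of the MMA operation a Tensor Core executes on a small tile in one step (shown at 2x2 for readability; real Tensor Cores operate on larger tiles in low-precision formats):

```python
def mma(A, B, C):
    """Fused matrix multiply-accumulate, D = A @ B + C,
    for square tiles given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) + C[i][j]
             for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 1], [1, 1]]
print(mma(A, B, C))   # [[20, 23], [44, 51]]
```

A CUDA Core would execute each multiply and each add as a separate instruction; the Tensor Core fuses the entire tile-level D = A x B + C into a single operation, which is where the deep learning speedup comes from.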

Critical exam insight: An A100 with 6,912 CUDA Cores outperforms an RTX 3090 with 10,496 CUDA Cores in deep learning because the A100 has 432 third-generation Tensor Cores and 40 GB of HBM2e.

Tensor Core Generational Evolution

| Generation | Architecture | Key Precision Additions | Notable Feature |
|---|---|---|---|
| 1st Gen | Volta (V100) | FP16 mixed-precision | First Tensor Cores |
| 2nd Gen | Turing | INT8, INT4 | Inference optimization |
| 3rd Gen | Ampere (A100) | TF32, BF16, FP64 | Sparsity support (2:4) |
| 4th Gen | Hopper (H100) | FP8 | Transformer Engine |
| 5th Gen | Blackwell | FP4, FP6 | Extended low-precision |
Animation: Visual comparison of CUDA Core (single operations) vs. Tensor Core (matrix multiply-accumulate in one cycle) execution patterns

DPUs (Data Processing Units)

DPUs are the third pillar of modern data center compute alongside CPUs and GPUs. A DPU combines a high-speed network interface, programmable Arm CPU cores, and hardware acceleration engines for networking, storage, and security.

```mermaid
flowchart TD
    subgraph Without_DPU["Without DPU"]
        CPU1["CPU"] -->|"Data pipeline"| GPU1["GPU"]
        CPU1 -->|"Also handles"| NET1["Networking"]
        CPU1 -->|"Also handles"| SEC1["Encryption / Firewall"]
        CPU1 -->|"Also handles"| STO1["Storage I/O"]
    end
    subgraph With_DPU["With DPU"]
        CPU2["CPU\n100% focused on\ndata pipeline"] -->|"Full bandwidth\ndata feed"| GPU2["GPU\nFully utilized"]
        DPU["DPU\nBlueField"] -->|"Offloads"| NET2["Networking"]
        DPU -->|"Offloads"| SEC2["Encryption / Firewall"]
        DPU -->|"Offloads"| STO2["Storage I/O"]
    end
    style Without_DPU fill:#6e1a1a,stroke:#333,color:#fff
    style With_DPU fill:#1a5c1a,stroke:#333,color:#fff
    style DPU fill:#d4a017,stroke:#333,color:#000
```

NVIDIA BlueField DPU Family

| Model | Network Speed | Compute | Key Capabilities |
|---|---|---|---|
| BlueField-2 | 200 Gb/s | 8 Arm A72 cores | Networking, storage, security offload |
| BlueField-3 | 400 Gb/s | 16 Arm A78 cores, 16 GB DDR5 | Line-rate SDN, crypto |
| BlueField-4 | 800 Gb/s | 6x compute of BF-3 | Gigascale AI factory support |

SmartNICs for AI Infrastructure

SmartNICs and DPUs are closely related -- a DPU is essentially a SmartNIC with a full programmable compute complex. By offloading networking, storage I/O, and security processing, they free host CPU cycles for the data pipeline and keep GPUs fed at full bandwidth.

Hardware Selection: Training vs. Inference

| Dimension | Training | Inference |
|---|---|---|
| GPU tier | High-end (H100, B200, GB200) | Mid-range acceptable (A30, L4, L40S) |
| GPU memory | Maximum HBM (80-288 GB) | Moderate (24-80 GB) |
| Interconnect | NVLink/NVSwitch required | PCIe often sufficient |
| Precision | FP16/BF16/TF32 (mixed) | INT8/FP8/INT4 (quantized) |
| Batch size | Large (maximize utilization) | Variable (latency vs. throughput) |
| Duration | Hours to weeks | Continuous, 24/7 |
| Cost priority | Performance per dollar per training hour | Performance per watt for ongoing costs |
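The precision row has a direct memory consequence. A sketch of weight-only memory for a 7B-parameter model at these precisions (KV cache and activations are extra, and real quantization schemes add small per-group scale overheads):

```python
# Bytes per parameter for the precision formats named in this section.
BYTES_PER_PARAM = {"FP32": 4, "FP16/BF16": 2, "INT8/FP8": 1, "INT4/FP4": 0.5}

params = 7e9  # a 7B-parameter model
for fmt, nbytes in BYTES_PER_PARAM.items():
    print(f"{fmt:>10}: {params * nbytes / 1e9:.1f} GB")
# FP32 needs 28 GB for weights alone; INT4 quantization cuts that to
# 3.5 GB -- the model fits on far cheaper inference GPUs.
```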
Key Takeaway: Tensor Cores and HBM capacity -- not raw CUDA Core count -- determine AI training performance. DPUs offload infrastructure tasks to prevent CPU bottlenecks, and hardware selection must align with the distinct resource profiles of training vs. inference.
Post-Quiz: AI-Enabling Hardware

1. The Transformer Engine in Hopper GPUs automatically selects between which precisions on a per-layer basis?

A) FP32 and FP16
B) FP8 and higher precision
C) INT8 and INT4
D) TF32 and BF16

2. What network speed does the BlueField-3 DPU support?

A) 100 Gb/s
B) 200 Gb/s
C) 400 Gb/s
D) 800 Gb/s

3. Which Tensor Core generation introduced sparsity support (2:4 structured sparsity)?

A) 1st Gen (Volta)
B) 2nd Gen (Turing)
C) 3rd Gen (Ampere)
D) 4th Gen (Hopper)

4. For inference workloads, which numerical precision formats are typically used for maximum throughput?

A) FP32 and FP64
B) INT8/FP8/INT4 (quantized)
C) FP16 and BF16 only
D) TF32 exclusively

5. Why does the A100 outperform the RTX 3090 for deep learning despite having fewer CUDA Cores?

A) Higher clock speed
B) More Tensor Cores (432 vs. fewer) and HBM2e memory
C) More PCIe lanes
D) Better cooling design

Section 4: Compute Resource Solutions

Pre-Quiz: Compute Resource Solutions

1. Virtualized GPU deployments achieve what percentage of bare-metal performance in MLPerf benchmarks?

A) 60-75%
B) 75-85%
C) 95-100%
D) 100% with no overhead

2. Which deployment model is recommended for latency-critical production inference?

A) Virtualized vGPU only
B) Bare-metal
C) Containers on shared VMs
D) Serverless GPU

3. What is the typical virtualization overhead range in real-world GPU deployments?

A) 0-5%
B) 5-10%
C) 15-25%
D) 30-50%

4. In a hybrid AI infrastructure, where should large-scale LLM training be directed?

A) vGPU tier
B) Bare-metal tier
C) Cloud burst tier
D) CPU-only tier

5. Virtualized configurations use what percentage of CPU cores to deliver near bare-metal GPU performance?

A) 100% of CPU cores
B) 85-95% of CPU cores
C) 28.5-67% of CPU cores
D) 10-20% of CPU cores

Key Points

GPU Deployment Model Decision Flow

```mermaid
flowchart TD
    Start["AI Workload\nDeployment Decision"] --> Q1{"Large-scale\ndistributed training?"}
    Q1 -->|"Yes"| BM["Bare-Metal\nHGX + NVLink/NVSwitch"]
    Q1 -->|"No"| Q2{"Latency-critical\nproduction inference?"}
    Q2 -->|"Yes"| BM
    Q2 -->|"No"| Q3{"Multi-tenant or\nshared environment?"}
    Q3 -->|"Yes"| VGPU["Virtualized vGPU\nUCS C-Series + NVIDIA vGPU"]
    Q3 -->|"No"| Q4{"Dev/test or\nfine-tuning workload?"}
    Q4 -->|"Yes"| VGPU
    Q4 -->|"No"| HYBRID["Hybrid Strategy\nBare-metal for training\nvGPU for inference/dev"]
    style BM fill:#1a5c1a,stroke:#333,color:#fff
    style VGPU fill:#1a3d6e,stroke:#333,color:#fff
    style HYBRID fill:#d4a017,stroke:#333,color:#000
```
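The same decision flow can be encoded as a small function. This is purely illustrative -- the function name and return strings are invented for this sketch, not part of any Cisco or NVIDIA tooling:

```python
def deployment_tier(large_scale_training: bool,
                    latency_critical_inference: bool,
                    multi_tenant: bool,
                    dev_test_or_finetune: bool) -> str:
    """Walks the deployment decision flow from top to bottom."""
    if large_scale_training or latency_critical_inference:
        return "bare-metal (HGX + NVLink/NVSwitch)"
    if multi_tenant or dev_test_or_finetune:
        return "virtualized vGPU (UCS C-Series + NVIDIA vGPU)"
    return "hybrid (bare-metal for training, vGPU for inference/dev)"

print(deployment_tier(True, False, False, False))   # bare-metal tier
print(deployment_tier(False, False, True, False))   # vGPU tier
```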

Bare-Metal GPU Deployment

Applications access GPU resources directly on the OS with no hypervisor layer, delivering maximum performance. Recommended for large-scale distributed training, latency-critical production inference, and multi-node jobs that span NVLink/NVSwitch fabrics.

Tradeoff: no live migration, snapshots, or dynamic resource reallocation without additional orchestration.

Virtualized GPU (vGPU) Deployment

NVIDIA vGPU with VMware VCF enables GPU resources to be shared, partitioned, and dynamically allocated. In MLPerf benchmarks, virtualized configurations deliver 95-100% of bare-metal performance while using only 28.5-67% of available CPU cores, leaving the remaining CPU and memory capacity free for other workloads.

Recommended for: development/experimentation, variable inference workloads, multi-tenant environments, and small-to-medium model fine-tuning.

Hybrid Deployment Strategy

Most mature enterprise AI organizations adopt a hybrid model:

| Bare-Metal Tier | vGPU Tier |
|---|---|
| LLM training | Development/test |
| Production serving | Model fine-tuning |
| Real-time inference | Batch inference |
| Multi-node training | Multi-tenant shared clusters |
| HGX + NVLink/NVSwitch | UCS C-Series + NVIDIA vGPU |

Cisco AI Infrastructure Portfolio

| Solution Category | Products | Use Case |
|---|---|---|
| Dense AI GPU Servers | Cisco servers with NVIDIA HGX / AMD OAM | Large-scale training, multi-GPU |
| PCIe GPU Servers | UCS C-Series with NVIDIA RTX Pro GPUs | Inference, development, VDI |
| Validated Accelerators | NVIDIA A30 Tensor Core GPU for UCS C-Series | Mainstream AI inference, training, HPC |
| Network Fabric | Cisco Nexus Hyperfabric | AI-optimized networking |
| Virtualization | VMware VCF with NVIDIA vGPU on Cisco UCS | Multi-tenant AI environments |
Animation: Decision tree walkthrough showing how different workload types map to bare-metal, vGPU, or hybrid deployment tiers
Key Takeaway: Virtualized GPU deployments now achieve 95-100% of bare-metal performance in benchmarks, but bare-metal remains essential for the most demanding training workloads. A hybrid strategy delivers the best balance of performance, utilization, and operational efficiency.
Post-Quiz: Compute Resource Solutions

1. Which deployment model trades operational flexibility for maximum GPU performance?

A) Virtualized vGPU
B) Bare-metal
C) Hybrid
D) Serverless

2. In a hybrid strategy, which workloads belong on the vGPU tier?

A) LLM training and production serving
B) Development, testing, fine-tuning, and batch inference
C) Real-time inference only
D) Multi-node distributed training

3. The Cisco Nexus Hyperfabric AI reference architecture is compliant with which vendor's validated designs?

A) AMD
B) Intel
C) NVIDIA
D) Broadcom

4. What key advantage do virtualized configurations offer by using only 28.5-67% of CPU cores?

A) Lower power consumption only
B) Remaining CPU and memory capacity can run other workloads, improving ROI
C) Faster GPU performance
D) Simplified networking

5. Which Cisco product line supports NVIDIA HGX and AMD OAM for large-scale AI training?

A) UCS C-Series rack servers
B) Cisco Dense AI GPU servers
C) Cisco Nexus 9000 switches
D) Cisco Catalyst access points


Answer Explanations