Provision system RAM at 2-4x total GPU memory to prevent data pipeline bottlenecks
vGPU enables fractional GPU sharing, multi-GPU aggregation (1/2/4/8 GPUs per VM), live migration, and snapshots
Scale-up (more GPUs per node via NVLink) before scale-out (more nodes via network) for optimal performance
CPU vs. GPU Roles in AI Workloads
Think of a data center AI system like a factory. The CPU is the factory manager -- it coordinates tasks, moves data, and handles logistics. The GPU is the assembly line -- it performs the same operation on thousands of items simultaneously. AI training is overwhelmingly an assembly-line problem: matrix multiplication across millions of parameters maps naturally to the massively parallel GPU architecture.
| Workload Phase | Primary Compute | CPU Role | GPU Role | Memory Priority |
|---|---|---|---|---|
| Data preprocessing | CPU-heavy | ETL, augmentation, tokenization | Minimal | System RAM |
| Model training (large) | GPU-dominant | Data loading, orchestration | Matrix math, gradient compute | HBM capacity + bandwidth |
| Model fine-tuning | GPU-moderate | Checkpoint management | Smaller-scale gradient updates | HBM capacity |
| Inference (batch) | GPU-moderate | Request batching, post-processing | Forward pass execution | HBM bandwidth |
| Inference (real-time) | GPU + CPU | Request routing, pre/post-processing | Low-latency forward pass | HBM latency |
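The assembly-line analogy can be made concrete: a dense neural-network layer reduces to one matrix multiply plus a bias add, exactly the operation GPUs parallelize. This is an illustrative NumPy sketch (the shapes are arbitrary examples):

```python
import numpy as np

# Illustrative: one dense layer's forward pass is a single matrix multiply.
batch, d_in, d_out = 32, 1024, 4096
x = np.random.rand(batch, d_in).astype(np.float32)   # input activations
w = np.random.rand(d_in, d_out).astype(np.float32)   # layer weights
b = np.zeros(d_out, dtype=np.float32)                # bias

y = x @ w + b  # batch * d_in * d_out multiply-adds, all independent

# Every output element is an independent dot product -- ideal work for
# thousands of GPU cores running simultaneously.
print(y.shape)                  # (32, 4096)
print(batch * d_in * d_out)     # 134217728 multiply-adds in one layer
```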
Memory Hierarchy: HBM, GDDR, and System RAM
```mermaid
graph TD
    subgraph Tier1["Tier 1: GPU On-Package Memory"]
        HBM["HBM3 / HBM3E\n2-3+ TB/s bandwidth\n40-288 GB per GPU"]
    end
    subgraph Tier2["Tier 2: GPU Card Memory"]
        GDDR["GDDR6X\n500-1,000 GB/s bandwidth\n12-24 GB per GPU"]
    end
    subgraph Tier3["Tier 3: CPU System Memory"]
        DDR5["DDR5 System RAM\n50-100 GB/s bandwidth\n256 GB - 2 TB per server"]
    end
    HBM -->|"Feeds Tensor Cores"| GPU["GPU Compute"]
    GDDR -->|"Inference and edge AI"| GPU
    DDR5 -->|"Data staging"| CPU["CPU / Data Pipeline"]
    CPU -->|"Feeds data to GPU"| GPU
    style Tier1 fill:#1a5c1a,stroke:#333,color:#fff
    style Tier2 fill:#1a3d6e,stroke:#333,color:#fff
    style Tier3 fill:#6e3d1a,stroke:#333,color:#fff
```
| Memory Type | Typical Bandwidth | Capacity Range | Primary Use |
|---|---|---|---|
| HBM3/HBM3E | 2-3+ TB/s | 40-288 GB per GPU | Training, large-scale inference |
| GDDR6X | 500-1,000 GB/s | 12-24 GB per GPU | Inference, edge AI |
| DDR5 (System) | 50-100 GB/s | 256 GB - 2 TB per server | Data staging, preprocessing |
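The bandwidth gap between tiers is easiest to feel as time. A back-of-the-envelope sketch, using the typical figures from the table above (illustrative numbers, not measurements):

```python
# Time to stream 80 GB of model weights once at each tier's typical bandwidth.
model_gb = 80
tiers_gbps = {"HBM3E": 3000, "GDDR6X": 1000, "DDR5": 100}  # GB/s

for name, bw in tiers_gbps.items():
    ms = model_gb / bw * 1000
    print(f"{name}: {ms:.0f} ms per full pass over the weights")
# HBM3E is ~30x faster than DDR5 here -- which is why HBM feeds the
# Tensor Cores while system RAM only stages data.
```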
Virtualization and vGPU
NVIDIA vGPU technology enables multiple VMs to share a single physical GPU, or a single VM to harness multiple GPUs. On NVSwitch-equipped systems (Ampere, Hopper, Blackwell HGX platforms), NVIDIA Fabric Manager creates a unified memory domain enabling multi-GPU VMs with 1, 2, 4, or 8 GPUs.
Fractional GPU sharing -- allocate a portion of a GPU to lightweight inference workloads
Multi-GPU aggregation -- assign 2, 4, or 8 GPUs to a single VM for training
Live migration -- move GPU-accelerated VMs between hosts for maintenance
Snapshot and recovery -- protect AI environments with standard VM tooling
Scalability: Scale-Up vs. Scale-Out
Scale-up adds more GPUs within a single node (e.g., 4 to 8 GPUs via NVLink). Scale-out adds more nodes via high-speed networking. Intra-node GPU-to-GPU communication via NVLink is an order of magnitude faster than inter-node networking, so maximize per-node GPU density first.
Animation: Scale-up vs. scale-out data flow showing NVLink bandwidth advantage within a node compared to inter-node network links
Key Takeaway: AI compute evaluation centers on GPU count, HBM capacity, and interconnect bandwidth -- not traditional CPU metrics. Provision system RAM at 2-4x total GPU memory, and leverage vGPU for multi-tenant flexibility.
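The 2-4x RAM provisioning rule from the takeaway above can be captured in a small helper (a sketch; the function name and signature are mine, not from any vendor tool):

```python
def recommended_system_ram(num_gpus: int, hbm_gb_per_gpu: int,
                           multiplier: int = 2) -> int:
    """Apply the 2-4x rule: system RAM >= multiplier x total GPU memory."""
    if not 2 <= multiplier <= 4:
        raise ValueError("rule of thumb uses a 2-4x multiplier")
    return num_gpus * hbm_gb_per_gpu * multiplier

# 8 GPUs x 80 GB HBM -> at least 1,280 GB (2x), up to 2,560 GB (4x)
print(recommended_system_ram(8, 80))                 # 1280
print(recommended_system_ram(8, 80, multiplier=4))   # 2560
```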
Post-Quiz: Compute Resource Evaluation
1. During large-scale model training, the GPU primarily performs which function?
A) Data loading and ETL
B) Matrix math and gradient computation
C) Request routing
D) Checkpoint management
2. HBM3E on the NVIDIA Blackwell Ultra architecture supports up to how much memory per GPU?
A) 40 GB
B) 80 GB
C) 288 GB
D) 512 GB
3. Which NVIDIA component configures NVSwitch memory fabric for vGPU multi-GPU VMs?
A) CUDA Runtime
B) NVIDIA Fabric Manager
C) NCCL Library
D) TensorRT
4. If a server has 8 GPUs each with 80 GB HBM, what is the recommended minimum system RAM?
A) 640 GB (1x GPU memory)
B) 1,280 GB (2x GPU memory)
C) 5,120 GB (8x GPU memory)
D) 320 GB (0.5x GPU memory)
5. What vGPU capability allows GPU-accelerated VMs to be moved between hosts during maintenance?
A) Fractional GPU sharing
B) Multi-GPU aggregation
C) Live migration
D) Snapshot and recovery
Section 2: GPU Deployment and Interconnects
Pre-Quiz: GPU Deployment and Interconnects
1. What problem does NVLink solve compared to PCIe-based GPU-to-GPU communication?
A) Reduces GPU power consumption
B) Eliminates the bottleneck of routing GPU traffic through the CPU and PCIe bus
C) Adds encryption to GPU data transfers
D) Enables GPUs from different vendors to communicate
2. NVLink 5.0 (Blackwell) provides how much bidirectional bandwidth per GPU?
A) 600 GB/s
B) 900 GB/s
C) 1,800 GB/s
D) 3,600 GB/s
3. What is the primary function of NVSwitch?
A) Convert PCIe signals to NVLink signals
B) Enable any-to-any GPU communication at full NVLink speed via a crossbar fabric
C) Manage GPU power distribution
D) Provide inter-node networking between servers
4. How many GPUs does the NVIDIA NVL72 rack-scale architecture connect?
A) 8 GPUs
B) 36 GPUs
C) 72 GPUs
D) 576 GPUs
5. For which workload type is PCIe-based GPU connectivity still cost-effective?
A) Large-scale distributed training
B) Single-GPU inference and development environments
C) Multi-node LLM training
D) Real-time multi-GPU inference
Key Points
NVLink provides direct GPU-to-GPU communication, bypassing CPU and PCIe -- up to 1,800 GB/s (NVLink 5.0), which is 14x PCIe Gen 5
NVSwitch is a crossbar switch enabling any-to-any GPU communication at full NVLink speed
Gen 4 NVSwitch has 72 NVLink 5.0 ports, enabling rack-scale NVL72 with 260 TB/s aggregate bandwidth
PCIe remains cost-effective for single-GPU inference; NVLink/NVSwitch is essential for multi-GPU training
NVLink: Direct GPU-to-GPU Interconnect
NVLink is NVIDIA's proprietary high-bandwidth, low-latency point-to-point interconnect. Without NVLink, GPU-to-GPU data transfers must route through the PCIe bus and CPU -- like shipping parts between two factories via a shared highway through a central depot. NVLink is the dedicated private conveyor belt between them.
```mermaid
flowchart LR
    subgraph PCIe_Path["PCIe Path - Indirect"]
        direction LR
        G1a["GPU 1"] -->|"PCIe Bus"| CPU1["CPU"] -->|"PCIe Bus"| G2a["GPU 2"]
    end
    subgraph NVLink_Path["NVLink Path - Direct"]
        direction LR
        G1b["GPU 1"] <-->|"NVLink Direct\nUp to 1,800 GB/s"| G2b["GPU 2"]
    end
    style PCIe_Path fill:#6e1a1a,stroke:#333,color:#fff
    style NVLink_Path fill:#1a5c1a,stroke:#333,color:#fff
```
NVLink Generational Bandwidth Progression
| Generation | GPU Architecture | Bandwidth per GPU (Bidir) | Links per GPU |
|---|---|---|---|
| NVLink 1.0 | Pascal (P100) | 160 GB/s | 4 |
| NVLink 3.0 | Ampere (A100) | 600 GB/s | 12 |
| NVLink 4.0 | Hopper (H100) | 900 GB/s | 18 |
| NVLink 5.0 | Blackwell (GB200) | 1,800 GB/s | 18 |
| NVLink 6.0 | Rubin | 3,600 GB/s (3.6 TB/s) | -- |
NVSwitch: Full-Mesh GPU Communication
NVSwitch acts as a high-speed crossbar switch enabling any GPU to communicate with any other GPU at full NVLink speed. Without NVSwitch, connecting 8 GPUs in a full mesh would exhaust available NVLink lanes. NVSwitch centralizes the switching function.
The NVL72 connects 72 GPUs in an all-to-all topology delivering 260 TB/s of aggregate bandwidth. A GPT-scale 1.8 trillion-parameter model trains approximately 4x faster and serves inference approximately 30x faster on NVL72 compared to 8-GPU systems.
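The interconnect advantage translates directly into synchronization time. An illustrative calculation using the bandwidth figures from this section (payload size is an arbitrary example):

```python
# Time to exchange a 10 GB gradient tensor between two GPUs.
payload_gb = 10
pcie_gen5_gbps = 128    # PCIe Gen 5 x16, bidirectional
nvlink5_gbps = 1800     # NVLink 5.0, bidirectional

pcie_ms = payload_gb / pcie_gen5_gbps * 1000
nvlink_ms = payload_gb / nvlink5_gbps * 1000
speedup = nvlink5_gbps / pcie_gen5_gbps

print(f"PCIe Gen 5: {pcie_ms:.1f} ms")    # ~78 ms
print(f"NVLink 5.0: {nvlink_ms:.1f} ms")  # ~6 ms
print(f"Speedup: ~{speedup:.0f}x")        # ~14x
```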
PCIe vs. NVLink Decision Guide
| Characteristic | PCIe Gen 5 x16 | NVLink 5.0 |
|---|---|---|
| Bidirectional bandwidth | ~128 GB/s | 1,800 GB/s |
| Bandwidth ratio | 1x (baseline) | ~14x PCIe Gen 5 |
| Topology | Shared bus through CPU | Direct GPU-to-GPU |
| Cost | Lower (standard servers) | Higher (HGX/NVSwitch systems) |
| Best for | Single-GPU inference, general compute | Multi-GPU training, large model inference |
Multi-GPU Communication Hierarchy
```mermaid
graph TD
    subgraph Node["Single Server Node"]
        subgraph GPU["Single GPU"]
            TC["Tensor Cores"] <-->|"Intra-GPU\nFastest"| HBM_M["HBM"]
        end
        GPU <-->|"Intra-Node / Inter-GPU\nNVLink: 900-1,800 GB/s"| GPU2["Other GPUs\nin Same Node"]
    end
    Node <-->|"Inter-Node\nInfiniBand / Ethernet\n200-400 Gb/s per link"| Node2["Other Server Nodes"]
    style GPU fill:#1a5c1a,stroke:#333,color:#fff
    style Node fill:#1a3d6e,stroke:#333,color:#fff
```
NVIDIA NCCL (Collective Communications Library) is topology-aware: tensor parallelism (splitting a single layer across GPUs) stays within a node on NVLink, while pipeline parallelism (splitting sequential layers across nodes) tolerates higher inter-node latency.
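The placement logic can be sketched as a simple rule (a hypothetical helper, not an NCCL API; NCCL discovers topology automatically at runtime):

```python
# Sketch: map a parallelism strategy to where it should run, based on its
# bandwidth sensitivity, following the rule of thumb above.
def placement(strategy: str) -> str:
    intra_node = {"tensor"}            # needs NVLink-class bandwidth
    inter_node = {"pipeline", "data"}  # tolerates network latency
    if strategy in intra_node:
        return "intra-node (NVLink/NVSwitch)"
    if strategy in inter_node:
        return "inter-node (InfiniBand/Ethernet) acceptable"
    raise ValueError(f"unknown strategy: {strategy}")

print(placement("tensor"))    # intra-node (NVLink/NVSwitch)
print(placement("pipeline"))  # inter-node (InfiniBand/Ethernet) acceptable
```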
Animation: Data flow comparison showing tensor parallelism within an NVLink-connected node vs. pipeline parallelism across inter-node network links
Key Takeaway: NVLink provides up to 14x the bandwidth of PCIe Gen 5 for GPU-to-GPU communication. NVSwitch enables any-to-any GPU connectivity at full NVLink speed. PCIe remains cost-effective for single-GPU inference, but multi-GPU training demands NVLink/NVSwitch.
Post-Quiz: GPU Deployment and Interconnects
1. NVLink 5.0 delivers approximately how many times the bandwidth of PCIe Gen 5?
A) 4x
B) 7x
C) 14x
D) 28x
2. In the HGX H100 system, how many NVSwitch chips provide 3.6 TB/s across eight GPUs?
A) 1
B) 2
C) 4
D) 8
3. Which parallelism strategy is typically confined within a single node because it requires the highest bandwidth?
A) Data parallelism
B) Pipeline parallelism
C) Tensor parallelism
D) Expert parallelism
4. Which Cisco product line is recommended for cost-effective single-GPU inference with PCIe GPUs?
A) Cisco Dense AI GPU servers with HGX
B) Cisco UCS C-Series with PCIe GPUs
C) Cisco Nexus Hyperfabric switches
D) Cisco MDS storage directors
5. The NVIDIA NVLink Switch system can connect up to how many GPUs in a non-blocking fabric?
A) 72 GPUs
B) 256 GPUs
C) 576 GPUs
D) 1,024 GPUs
Section 3: AI-Enabling Hardware
Pre-Quiz: AI-Enabling Hardware
1. For AI workloads, which GPU component matters more than raw CUDA Core count?
A) Clock speed
B) Tensor Core count and HBM capacity
C) GDDR memory bandwidth
D) Number of PCIe lanes
2. What is the core function of Tensor Cores?
A) General-purpose floating point computation
B) Matrix multiply-accumulate (MMA) operations in a single clock cycle
C) Video encoding and decoding
D) Memory management and caching
3. What does a DPU (Data Processing Unit) offload from the host CPU?
A) AI model training computation
B) Networking, storage I/O, and security functions
C) GPU memory management
D) Database query processing
4. Which Tensor Core generation introduced FP8 precision and the Transformer Engine?
A) 2nd Gen (Turing)
B) 3rd Gen (Ampere)
C) 4th Gen (Hopper)
D) 5th Gen (Blackwell)
5. Approximately what percentage of AI model training tasks are now offloaded to DPUs?
A) 10%
B) 20%
C) 35%
D) 50%
Key Points
Tensor Cores (not CUDA Cores) determine deep learning performance -- up to 20x faster than CUDA Cores alone for neural network workloads
The Transformer Engine (Hopper/Blackwell) auto-selects FP8 vs. higher precision per layer
DPUs (BlueField) offload networking, storage, and security so CPU cycles can focus on feeding GPUs -- ~35% of training tasks offloaded
Training demands high-end GPUs with max HBM and NVLink; inference can use mid-range GPUs with PCIe and quantized precision (INT8/FP8)
CUDA Cores vs. Tensor Cores
CUDA Cores are general-purpose parallel processors (128 per SM) handling FP32, INT32, FP16/BF16, activation functions, and preprocessing. Tensor Cores are purpose-built for deep learning matrix multiply-accumulate (MMA) operations -- D = A x B + C -- in a single clock cycle.
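The Tensor Core primitive can be demonstrated with NumPy. Real Tensor Cores fuse this multiply-accumulate over small tiles (e.g. FP16 inputs with FP32 accumulation) in a single clock cycle; this sketch only illustrates the math, not the hardware:

```python
import numpy as np

# The MMA primitive: D = A x B + C, with low-precision inputs and a
# higher-precision accumulator (mixed precision).
A = np.random.rand(4, 4).astype(np.float16)   # low-precision input
B = np.random.rand(4, 4).astype(np.float16)   # low-precision input
C = np.zeros((4, 4), dtype=np.float32)        # FP32 accumulator

D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D.shape, D.dtype)  # (4, 4) float32
```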
Critical exam insight: An A100 with 6,912 CUDA Cores outperforms an RTX 3090 with 10,496 CUDA Cores in deep learning because the A100 has 432 third-generation Tensor Cores and 40 GB of HBM2e.
Tensor Core Generational Evolution
| Generation | Architecture | Key Precision Additions | Notable Feature |
|---|---|---|---|
| 1st Gen | Volta (V100) | FP16 mixed-precision | First Tensor Cores |
| 2nd Gen | Turing | INT8, INT4 | Inference optimization |
| 3rd Gen | Ampere (A100) | TF32, BF16, FP64 | Sparsity support (2:4) |
| 4th Gen | Hopper (H100) | FP8 | Transformer Engine |
| 5th Gen | Blackwell | FP4, FP6 | Extended low-precision |
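Why the precision format matters: lower-precision types trade numeric range and resolution for throughput. A quick NumPy demonstration of FP16's limited range:

```python
import numpy as np

# FP16 overflows where FP32 does not -- its maximum value is ~65,504.
x = 70000.0
print(np.float16(x))   # inf
print(np.float32(x))   # 70000.0

# BF16 and TF32 keep FP32's exponent range at reduced mantissa precision,
# which is one reason Ampere added them for training stability.
```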
Animation: Visual comparison of CUDA Core (single operations) vs. Tensor Core (matrix multiply-accumulate in one cycle) execution patterns
DPUs (Data Processing Units)
DPUs are the third pillar of modern data center compute alongside CPUs and GPUs. A DPU combines a high-speed network interface, programmable Arm CPU cores, and hardware acceleration engines for networking, storage, and security.
SmartNICs and DPUs are closely related -- a DPU is essentially a SmartNIC with a full programmable compute complex. Performance impacts:
Encapsulation and encryption offload reduces latency by up to 4x
Storage read bandwidth improves by up to 48%
NVIDIA Spectrum-X delivers 1.6x faster AI network performance over traditional Ethernet
Close to 50% of cloud service providers now rely on DPUs for workload optimization
Hardware Selection: Training vs. Inference
| Dimension | Training | Inference |
|---|---|---|
| GPU tier | High-end (H100, B200, GB200) | Mid-range acceptable (A30, L4, L40S) |
| GPU memory | Maximum HBM (80-288 GB) | Moderate (24-80 GB) |
| Interconnect | NVLink/NVSwitch required | PCIe often sufficient |
| Precision | FP16/BF16/TF32 (mixed) | INT8/FP8/INT4 (quantized) |
| Batch size | Large (maximize utilization) | Variable (latency vs. throughput) |
| Duration | Hours to weeks | Continuous, 24/7 |
| Cost priority | Performance per dollar per training hour | Performance per watt for ongoing costs |
Key Takeaway: Tensor Cores and HBM capacity -- not raw CUDA Core count -- determine AI training performance. DPUs offload infrastructure tasks to prevent CPU bottlenecks, and hardware selection must align with the distinct resource profiles of training vs. inference.
Post-Quiz: AI-Enabling Hardware
1. The Transformer Engine in Hopper GPUs automatically selects between which precisions on a per-layer basis?
A) FP32 and FP16
B) FP8 and higher precision
C) INT8 and INT4
D) TF32 and BF16
2. What network speed does the BlueField-3 DPU support?
A) 100 Gb/s
B) 200 Gb/s
C) 400 Gb/s
D) 800 Gb/s
3. Which Tensor Core generation introduced sparsity support (2:4 structured sparsity)?
A) 1st Gen (Volta)
B) 2nd Gen (Turing)
C) 3rd Gen (Ampere)
D) 4th Gen (Hopper)
4. For inference workloads, which numerical precision formats are typically used for maximum throughput?
A) FP32 and FP64
B) INT8/FP8/INT4 (quantized)
C) FP16 and BF16 only
D) TF32 exclusively
5. Why does the A100 outperform the RTX 3090 for deep learning despite having fewer CUDA Cores?
A) Higher clock speed
B) More Tensor Cores (432 vs. fewer) and HBM2e memory
C) More PCIe lanes
D) Better cooling design
Section 4: Compute Resource Solutions
Pre-Quiz: Compute Resource Solutions
1. Virtualized GPU deployments achieve what percentage of bare-metal performance in MLPerf benchmarks?
A) 60-75%
B) 75-85%
C) 95-100%
D) 100% with no overhead
2. Which deployment model is recommended for latency-critical production inference?
A) Virtualized vGPU only
B) Bare-metal
C) Containers on shared VMs
D) Serverless GPU
3. What is the typical virtualization overhead range in real-world GPU deployments?
A) 0-5%
B) 5-10%
C) 15-25%
D) 30-50%
4. In a hybrid AI infrastructure, where should large-scale LLM training be directed?
A) vGPU tier
B) Bare-metal tier
C) Cloud burst tier
D) CPU-only tier
5. Virtualized configurations use what percentage of CPU cores to deliver near bare-metal GPU performance?
A) 100% of CPU cores
B) 85-95% of CPU cores
C) 28.5-67% of CPU cores
D) 10-20% of CPU cores
Key Points
Bare-metal: maximum performance, no hypervisor overhead, best for large-scale training and latency-critical inference
vGPU: 95-100% bare-metal performance in benchmarks (15-25% real-world overhead), uses only 28.5-67% of CPU cores
Containers on VM platforms retain up to 99% of bare-metal throughput
Hybrid strategy: bare-metal for training/production, vGPU for dev/test and flexible workloads
Cisco portfolio: Dense AI GPU servers (HGX) for training, UCS C-Series (PCIe) for inference, Nexus Hyperfabric for networking
Bare-Metal GPU Deployment
Bare-metal runs the AI stack directly on the hardware with no hypervisor layer. It is the right choice for maximum-throughput workloads -- anywhere a 5-25% virtualization overhead is unacceptable.
Tradeoff: no live migration, snapshots, or dynamic resource reallocation without additional orchestration.
Virtualized GPU (vGPU) Deployment
NVIDIA vGPU with VMware VCF enables GPU resources to be shared, partitioned, and dynamically allocated. Key performance findings:
MLPerf benchmarks: 95-100% of bare-metal performance
Containers on VM platforms: up to 99% of bare-metal throughput
Real-world overhead: 15-25% depending on workload characteristics
Uses only 28.5-67% of CPU cores and 50-83% of physical memory -- remaining capacity can run other workloads
Recommended for: development/experimentation, variable inference workloads, multi-tenant environments, and small-to-medium model fine-tuning.
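The CPU headroom finding is worth quantifying. An illustrative calculation using the upper end of the reported core-utilization range (the core count is an arbitrary example):

```python
# A vGPU configuration delivering near bare-metal GPU throughput while
# using 67% of CPU cores leaves the remainder for other tenants.
total_cores = 128
vgpu_fraction = 0.67

cores_for_ai = round(total_cores * vgpu_fraction)
cores_free = total_cores - cores_for_ai
print(cores_for_ai, cores_free)  # 86 cores for AI, 42 free for other work
```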
Hybrid Deployment Strategy
Most mature enterprise AI organizations adopt a hybrid model:
| Bare-Metal Tier | vGPU Tier |
|---|---|
| LLM training | Development/test |
| Production serving | Model fine-tuning |
| Real-time inference | Batch inference |
| Multi-node training | Multi-tenant shared clusters |
| HGX + NVLink/NVSwitch | UCS C-Series + NVIDIA vGPU |
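The tier assignment above can be sketched as a lookup (a hypothetical helper; real placement also weighs SLAs, data locality, and cluster policy):

```python
# Map workload types to deployment tiers, per the hybrid strategy above.
BARE_METAL = {"llm-training", "production-serving",
              "real-time-inference", "multi-node-training"}
VGPU = {"dev-test", "fine-tuning", "batch-inference", "multi-tenant"}

def tier(workload: str) -> str:
    if workload in BARE_METAL:
        return "bare-metal (HGX + NVLink/NVSwitch)"
    if workload in VGPU:
        return "vGPU (UCS C-Series + NVIDIA vGPU)"
    return "review manually"

print(tier("llm-training"))     # bare-metal (HGX + NVLink/NVSwitch)
print(tier("batch-inference"))  # vGPU (UCS C-Series + NVIDIA vGPU)
```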
Cisco AI Infrastructure Portfolio
| Solution Category | Products | Use Case |
|---|---|---|
| Dense AI GPU Servers | Cisco servers with NVIDIA HGX / AMD OAM | Large-scale training, multi-GPU |
| PCIe GPU Servers | UCS C-Series with NVIDIA RTX Pro GPUs | Inference, development, VDI |
| Validated Accelerators | NVIDIA A30 Tensor Core GPU for UCS C-Series | Mainstream AI inference, training, HPC |
| Network Fabric | Cisco Nexus Hyperfabric | AI-optimized networking |
| Virtualization | VMware VCF with NVIDIA vGPU on Cisco UCS | Multi-tenant AI environments |
Animation: Decision tree walkthrough showing how different workload types map to bare-metal, vGPU, or hybrid deployment tiers
Key Takeaway: Virtualized GPU deployments now achieve 95-100% of bare-metal performance in benchmarks, but bare-metal remains essential for the most demanding training workloads. A hybrid strategy delivers the best balance of performance, utilization, and operational efficiency.
Post-Quiz: Compute Resource Solutions
1. Which deployment model trades operational flexibility for maximum GPU performance?
A) Virtualized vGPU
B) Bare-metal
C) Hybrid
D) Serverless
2. In a hybrid strategy, which workloads belong on the vGPU tier?
A) LLM training and production serving
B) Development, testing, fine-tuning, and batch inference
C) Real-time inference only
D) Multi-node distributed training
3. The Cisco Nexus Hyperfabric AI reference architecture is compliant with which vendor's validated designs?
A) AMD
B) Intel
C) NVIDIA
D) Broadcom
4. What key advantage do virtualized configurations offer by using only 28.5-67% of CPU cores?
A) Lower power consumption only
B) Remaining CPU and memory capacity can run other workloads, improving ROI
C) Faster GPU performance
D) Simplified networking
5. Which Cisco product line supports NVIDIA HGX and AMD OAM for large-scale AI training?