Cloud AI Infrastructure
Cloud-based AI infrastructure leverages distributed computing resources from providers such as AWS, Google Cloud, and Microsoft Azure. These platforms offer virtualized resources, GPU clusters, AI-specialized accelerators (such as TPUs and other custom ASICs), and managed AI services on demand.
- Elastic scalability -- Provision hundreds of GPUs for a week of training, then release them, paying only for what you use.
- Lower upfront costs -- Eliminates large capital expenditures, though sustained usage increases OpEx.
- Managed services -- Pre-built AI/ML services, managed Kubernetes clusters, and turnkey GPU instances.
- Global availability -- Teams across geographies access the same infrastructure.
Analogy: Cloud AI is like renting a commercial kitchen by the hour. You get professional-grade equipment without buying it, but the hourly rate adds up if you cook all day, every day.
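The rent-vs-buy trade-off in the analogy can be made concrete with a break-even sketch. All figures below are illustrative assumptions, not vendor quotes:

```python
# Rent-vs-buy break-even for GPU capacity.
# All prices here are illustrative assumptions, not real quotes.

def breakeven_hours(purchase_cost: float, cloud_hourly: float,
                    onprem_hourly_overhead: float = 0.0) -> float:
    """Hours of use at which owning becomes cheaper than renting."""
    return purchase_cost / (cloud_hourly - onprem_hourly_overhead)

# A $250k 8-GPU server vs. a $32/hour cloud instance, with $4/hour
# of on-prem power, cooling, and admin overhead:
hours = breakeven_hours(250_000, 32.0, 4.0)
print(f"Break-even after ~{hours:,.0f} hours (~{hours / 8760:.1f} years of 24/7 use)")
```

Under these assumed numbers, continuous use pays off the hardware in about a year, which is why "cook all day, every day" workloads tend to move on-premises.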
On-Premises AI Infrastructure
In the on-premises model, specialized hardware is deployed within the organization's own data center, providing maximum control, customization, and compliance assurance.
- Data sovereignty and compliance -- Healthcare (HIPAA), finance (PCI DSS, SOX), and government (FedRAMP) workloads benefit from data that never leaves the organization's physical control.
- Predictable performance -- Dedicated hardware eliminates noisy-neighbor effects.
- Higher upfront CapEx -- Requires significant capital expenditure up front, but can be more cost-effective than cloud at sustained, high utilization.
- Full control -- Complete authority over hardware, network, security, and software stack.
Hybrid AI Infrastructure
The hybrid model integrates both cloud and on-premises resources with unified orchestration. This is increasingly the default architecture for enterprise AI.
- Workload optimization -- Training in the cloud, inference on-premises.
- Burst capacity -- Baseline on-premises, cloud for peak demand.
- Unified management -- Consistent deployment and monitoring across environments.
Worked Example: A hospital trains its diagnostic imaging model on a cloud GPU cluster using anonymized data. The trained model deploys to on-premises inference servers in radiology to satisfy HIPAA. Periodic retraining uses new anonymized data in the cloud.
Edge AI and Fog Computing
Edge AI processes data and runs inference directly on devices or local servers at or near the point of data generation. Key benefits include ultra-low latency, bandwidth efficiency, enhanced privacy, and operational resilience.
Cisco's Fog Computing Three-Layer Architecture:
graph TD
Cloud["Cloud / Data Center Layer\n(Central Data Center)\nLong-term storage, ML training,\nglobal analytics"]
Fog["Edge / Fog Layer\n(Routers, Switches, Gateways)\nAggregates, filters, and\npre-processes data locally"]
Device["Device Layer\n(IoT Endpoints)\nCollects raw data from sensors,\ncameras, and IoT devices"]
Device -->|"Raw data\n(high volume)"| Fog
Fog -->|"Summarized results\n(reduced bandwidth)"| Cloud
Cloud -->|"Model updates\n& policies"| Fog
Fog -->|"Commands &\nconfigurations"| Device
| Layer | Location | Function |
| --- | --- | --- |
| Device Layer | IoT endpoints | Collects raw data from sensors, cameras, and IoT devices |
| Edge/Fog Layer | Routers, switches, gateways | Aggregates, filters, and pre-processes data locally |
| Cloud/Data Center | Central data center | Long-term storage, ML training, and global analytics |
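The bandwidth-reduction role of the fog layer can be sketched in a few lines. The 10 Hz sample rate and one-minute window are illustrative assumptions:

```python
# Fog-layer pre-processing sketch: a window of raw sensor readings is
# reduced to a compact summary before crossing the WAN to the cloud.
# The 10 Hz sample rate and one-minute window are illustrative assumptions.
from statistics import mean

def fog_aggregate(raw_readings: list) -> dict:
    """Summarize one window of device-layer readings."""
    return {
        "count": len(raw_readings),
        "mean": mean(raw_readings),
        "min": min(raw_readings),
        "max": max(raw_readings),
    }

# 600 raw readings (10 Hz for one minute) shrink to four numbers.
window = [20.0 + 0.01 * i for i in range(600)]
summary = fog_aggregate(window)
print(summary)
```

Only the four-field summary travels to the cloud layer; the 600 raw samples stay local, which is the "reduced bandwidth" arrow in the diagram above.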
Deployment Model Comparison
flowchart LR
subgraph Cloud["Cloud AI"]
C1["Elastic scalability"]
C2["OpEx model"]
C3["Best for: Training &\nexperimentation"]
end
subgraph OnPrem["On-Premises AI"]
O1["Fixed capacity"]
O2["CapEx model"]
O3["Best for: Regulated\nindustries & inference"]
end
subgraph Hybrid["Hybrid AI"]
H1["Burst-capable"]
H2["Balanced OpEx/CapEx"]
H3["Best for: Mixed\ntraining & inference"]
end
subgraph Edge["Edge AI"]
E1["Limited per site"]
E2["Ultra-low latency"]
E3["Best for: Real-time\ninference & IoT"]
end
Cloud -->|"Train in cloud,\ndeploy on-prem"| Hybrid
OnPrem -->|"Burst to cloud\nfor peak demand"| Hybrid
Hybrid -->|"Deploy models\nto edge"| Edge
| Factor | Cloud AI | On-Premises AI | Hybrid AI | Edge AI |
| --- | --- | --- | --- | --- |
| Upfront Cost | Low (OpEx) | High (CapEx) | Medium (balanced) | Low to medium |
| Scalability | Elastic, on-demand | Fixed capacity | Burst-capable | Limited per site |
| Latency | Variable | Predictable, low | Varies by placement | Ultra-low |
| Data Control | Provider-dependent | Full control | Split control | Local control |
| Best For | Training, experimentation | Regulated industries | Mixed training/inference | Real-time inference, IoT |
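The trade-offs in the table can be reduced to a rule-of-thumb selector. The priority order below (latency first, then compliance, then burstiness) is an illustrative assumption, not an official decision tree:

```python
# Rule-of-thumb deployment selector mirroring the comparison table.
# The priority order (latency > compliance > burstiness) is an assumption.

def recommend_deployment(regulated: bool, realtime: bool, bursty: bool) -> str:
    if realtime:
        return "Edge AI"          # ultra-low latency at the data source
    if regulated and bursty:
        return "Hybrid AI"        # keep data on-prem, burst to cloud
    if regulated:
        return "On-Premises AI"   # full data control, predictable load
    return "Cloud AI"             # elastic training and experimentation

print(recommend_deployment(regulated=True, realtime=False, bursty=True))
```

The hospital worked example above follows the same logic: regulated plus bursty training demand lands on the hybrid model.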
Network Components
AI models require constant data exchange between compute nodes and storage, making low-latency, high-bandwidth networking crucial. AI data center networks include three distinct planes:
graph TD
subgraph Backend["Backend Plane (GPU-to-GPU)"]
GPU1["GPU Node 1"] <-->|"InfiniBand / RoCE\nGradient sync"| GPU2["GPU Node 2"]
GPU2 <-->|"InfiniBand / RoCE\nGradient sync"| GPU3["GPU Node N"]
end
subgraph Frontend["Frontend Plane (Storage Access)"]
CN["Compute Nodes"] <-->|"100/400GbE\nNVMe-oF"| ST["Storage Systems"]
end
subgraph Mgmt["Management Plane"]
ORCH["Orchestration\n(Kubernetes / Slurm)"] --- MON["Monitoring &\nTelemetry"]
MON --- ADM["Administration\n& Security"]
end
Backend ~~~ Frontend
Frontend ~~~ Mgmt
| Network Plane | Purpose | Typical Technology | Key Requirement |
| --- | --- | --- | --- |
| Backend (GPU-to-GPU) | Gradient synchronization during distributed training | InfiniBand, RoCE | Highest bandwidth, lowest latency |
| Frontend (storage access) | Connects compute to storage | 100/400GbE, NVMe-oF | Sustained high throughput |
| Management | Cluster orchestration, monitoring | Standard Ethernet | Reliability, security |
Exam Tip: RoCE (RDMA over Converged Ethernet) enables remote direct memory access over standard Ethernet. Building high-performance, lossless RoCE networks is a key DCAI exam topic.
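To see why the backend plane has the strictest requirements, consider the traffic of a ring all-reduce, where each GPU sends and receives roughly 2(N-1)/N times the gradient size every step. The model size, precision, and NIC speed below are illustrative assumptions:

```python
# Backend-plane traffic estimate: in a ring all-reduce each GPU moves
# about 2*(N-1)/N times the gradient size through its NIC per step.
# Model size, precision, and NIC speed are illustrative assumptions.

def allreduce_bytes_per_gpu(param_count: int, bytes_per_param: int, n_gpus: int) -> float:
    """Bytes each GPU sends (and receives) in one ring all-reduce."""
    return 2 * (n_gpus - 1) / n_gpus * param_count * bytes_per_param

def sync_time_seconds(param_count: int, bytes_per_param: int,
                      n_gpus: int, link_gbps: float) -> float:
    """Lower-bound gradient sync time on a link of link_gbps Gbit/s."""
    bytes_per_second = link_gbps * 1e9 / 8
    return allreduce_bytes_per_gpu(param_count, bytes_per_param, n_gpus) / bytes_per_second

# 7B parameters, fp16 gradients (2 bytes each), 8 GPUs, one 400G NIC each:
t = sync_time_seconds(7_000_000_000, 2, 8, 400)
print(f"~{t:.2f} s of pure gradient traffic per training step")
```

At roughly half a second of communication per step, gradient sync must overlap with computation to avoid stalling the GPUs, which is why the backend plane demands the highest bandwidth and lowest latency.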
Compute: GPUs, Accelerators, and NVLink
GPUs are the primary compute engine for AI, offering thousands of cores for parallel processing. Key considerations when selecting GPUs:
- Memory capacity -- HBM (High Bandwidth Memory) determines model and batch sizes a GPU can handle
- Interconnect speed -- NVLink and PCIe Gen5 determine GPU communication speed
- Vendor ecosystem -- NVIDIA dominates with CUDA, cuDNN, and the NGC container catalog
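Memory capacity is usually the binding constraint. A common rule of thumb for mixed-precision training with Adam (fp16 weights and gradients plus fp32 master weights and two optimizer moments, about 16 bytes per parameter, ignoring activations) can be sketched as:

```python
# Rough HBM sizing for mixed-precision training with Adam:
# fp16 weights (2 B) + fp16 gradients (2 B) + fp32 master weights (4 B)
# + fp32 Adam moments m and v (4 B each) = 16 bytes per parameter.
# Activation memory is workload-dependent and ignored here.

def training_memory_gb(param_count: int, bytes_per_param: int = 16) -> float:
    return param_count * bytes_per_param / 1e9

print(f"{training_memory_gb(7_000_000_000):.0f} GB")  # a 7B model, before activations
```

The result (~112 GB) already exceeds a single 80 GB HBM GPU, which is why even modestly sized models are trained across multiple GPUs connected by fast interconnects.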
NVLink specifications:
| Specification | Value |
| --- | --- |
| Bidirectional bandwidth (NVLink 5.0) | 1.8 TB/s per GPU |
| Bandwidth advantage over PCIe 5.0 | 14x |
| NVLink connections per Blackwell GPU | Up to 18 x 100 GB/s links |
| NVLink Switch capacity | 14.4 TB/s switching (144 ports) |
| Maximum GPUs in non-blocking fabric | Up to 576 GPUs |
| Vera Rubin NVL72 aggregate bandwidth | 260 TB/s across 72 GPUs |
Multi-Level Interconnect Hierarchy
graph TD
subgraph Domain1["NVLink Domain (Rack 1)"]
NVS1["NVLink Switch\n3.6 TB/s"] --- G1A["GPU 1"]
NVS1 --- G1B["GPU 2"]
NVS1 --- G1C["GPU ..."]
NVS1 --- G1D["GPU 72"]
end
subgraph Domain2["NVLink Domain (Rack 2)"]
NVS2["NVLink Switch\n3.6 TB/s"] --- G2A["GPU 1"]
NVS2 --- G2B["GPU 2"]
NVS2 --- G2C["GPU ..."]
NVS2 --- G2D["GPU 72"]
end
Domain1 <-->|"Level 2: Inter-domain\nEthernet / InfiniBand\n400G-800G per NIC"| Domain2
- Level 1 -- Intra-domain (NVLink/NVSwitch): Sub-microsecond latency, TB/s bandwidth within a single NVLink domain
- Level 2 -- Inter-domain (Ethernet/InfiniBand): Microsecond latency, 400G-800G per NIC across racks
DPUs (Data Processing Units): NVIDIA BlueField DPUs offload networking, storage acceleration, and security tasks from CPUs and GPUs -- essential for high-performance, secure AI factories.
Analogy: A single GPU is like a single worker on an assembly line -- fast, but limited. A GPU cluster is the entire factory floor, with hundreds of workers coordinating through high-speed communication channels.
Virtualization and Containerization
| Characteristic | Virtual Machines | Containers |
| --- | --- | --- |
| Isolation | Full OS per instance | Shared host kernel |
| Startup Time | Minutes | Seconds |
| Resource Overhead | Higher (full OS) | Lower (shared kernel) |
| Density | Fewer per host | Many more per host |
| Portability | Hardware-dependent | Highly portable |
| Best AI Use Case | Legacy workloads, strong isolation | Inference services, CI/CD pipelines |
GPU Sharing Mechanisms
graph LR
CS["CUDA Streams\n(None)"] --> TS["Time-Slicing\n(Low)"]
TS --> MPS["CUDA MPS\n(Medium)"]
MPS --> MIG["MIG\n(High)"]
MIG --> VGPU["vGPU\n(Highest)"]
| Mechanism | Description | Isolation | Best For |
| --- | --- | --- | --- |
| CUDA Streams | Multiple operations concurrently on one GPU | None | Single-app parallelism |
| Time-slicing | Containers take turns with rapid context switching | Low | Light inference |
| CUDA MPS | Multiple processes share a GPU concurrently | Medium | Mixed small workloads |
| MIG | Partitions one GPU into isolated instances | High | Multi-tenant environments |
| vGPU | Full GPU virtualization for VM workloads | Highest | VM-based enterprise |
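The table maps naturally onto a "least isolation that meets the requirement" rule. The function below encodes that reading as a sketch; the decision order is an assumption, not official NVIDIA guidance:

```python
# Pick the lightest GPU-sharing mechanism that satisfies the workload's
# requirements, following the isolation ladder in the table above.
# The decision order is an assumption, not official NVIDIA guidance.

def pick_sharing_mechanism(vm_based: bool = False, multi_tenant: bool = False,
                           multi_process: bool = False, shared_light: bool = False) -> str:
    if vm_based:
        return "vGPU"          # full virtualization for VM workloads
    if multi_tenant:
        return "MIG"           # hardware-isolated GPU partitions
    if multi_process:
        return "CUDA MPS"      # concurrent processes, medium isolation
    if shared_light:
        return "Time-slicing"  # containers take turns on the GPU
    return "CUDA Streams"      # parallelism within a single application

print(pick_sharing_mechanism(multi_tenant=True))
```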
Exam Tip: The NVIDIA GPU Operator enables GPU-accelerated containers and VMs in the same Kubernetes cluster. Worker nodes use the label nvidia.com/gpu.workload.config set to container, vm-passthrough, or vm-vgpu. A node runs one workload type at a time.
Orchestration and Monitoring
Kubernetes is the de facto standard for container orchestration in AI environments. For AI workloads it provides:
- GPU scheduling -- Native support through device plugins
- Autoscaling -- Horizontal pod autoscaling adjusts inference replicas based on demand
- Job management -- Kubernetes Jobs/CronJobs manage batch training with retry and completion tracking
- Resource quotas -- Namespaces and quotas ensure fair sharing of GPU resources
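A minimal pod manifest requesting a GPU through the device plugin's `nvidia.com/gpu` resource can be sketched as a Python dict (the container image tag is illustrative):

```python
# Minimal Kubernetes pod manifest requesting one GPU via the NVIDIA
# device plugin's nvidia.com/gpu resource, expressed as a Python dict.
# The container image tag is illustrative.
import json

gpu_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "train-job"},
    "spec": {
        "restartPolicy": "Never",
        "containers": [{
            "name": "trainer",
            "image": "nvcr.io/nvidia/pytorch:24.01-py3",  # illustrative tag
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }],
    },
}
print(json.dumps(gpu_pod, indent=2))
```

The scheduler places the pod only on a node whose device plugin advertises a free `nvidia.com/gpu`, which is how GPU scheduling stays declarative.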
Slurm is the dominant scheduler in HPC-style environments for large-scale GPU training jobs.
Physical Infrastructure
- An NVIDIA H100 GPU draws 700W per unit
- A rack of 8 GPU nodes can draw 30-50 kW (vs. 5-10 kW for a rack of traditional servers)
- Air cooling becomes insufficient above ~30 kW per rack
- Direct liquid cooling (coolant through cold plates on GPUs) is increasingly standard
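Taking the figures above at face value (8 nodes per rack, an assumed 8 x 700 W GPUs per node), the GPU draw alone can be estimated:

```python
# GPU power draw per rack, using the figures above: 8 nodes per rack,
# an assumed 8 GPUs per node, 700 W per H100. CPU, NIC, and fan
# overhead is deliberately excluded.

def rack_gpu_power_kw(nodes: int, gpus_per_node: int, gpu_watts: float) -> float:
    return nodes * gpus_per_node * gpu_watts / 1000

kw = rack_gpu_power_kw(8, 8, 700)
print(f"~{kw:.1f} kW of GPU power alone -- above the ~30 kW air-cooling limit")
```

Even before counting CPUs, NICs, and fans, such a rack exceeds the practical air-cooling threshold, which is why direct liquid cooling is becoming standard.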
SAN and Fibre Channel
A Storage Area Network (SAN) is a dedicated high-speed network that provides block-level storage access. SANs remain relevant in AI infrastructure through the adoption of NVMe over Fibre Channel.
Fibre Channel provides a mature, stable, and well-tooled storage fabric. FC-NVMe delivers latencies in the 50-100 microsecond range and allows NVMe commands over existing FC fabrics without ripping and replacing infrastructure.
Analogy: If NVMe is a sports car engine, Fibre Channel SAN is the established highway system. FC-NVMe lets you drive that sports car on the highways you already built, rather than constructing entirely new roads.
NVMe: Block and File Storage
NVMe (Non-Volatile Memory Express) is a storage protocol designed specifically for flash memory. NVMe SSDs connect directly to the CPU via PCIe, bypassing legacy storage controllers for maximum performance.
NVMe over Fabrics (NVMe-oF) extends the NVMe protocol across network fabrics, delivering high-speed storage access to remote servers. Key benefits:
- Reduced CPU utilization on application host servers
- Shared access to centralized high-performance storage pools
- Efficient data orchestration and better resource utilization
- Flexible expansion of storage configuration
| Storage Technology | Latency | Protocol | Typical AI Use Case |
| --- | --- | --- | --- |
| Local NVMe SSD | ~10 microseconds | PCIe direct | Active training datasets, checkpoints |
| NVMe-oF (Ethernet) | ~30-50 microseconds | NVMe over TCP/RoCE | Shared dataset access, inference model serving |
| FC-NVMe (Fibre Channel) | ~50-100 microseconds | NVMe over FC | Enterprise SAN integration, mixed workloads |
| Traditional SAS/SATA SSD | ~100+ microseconds | SCSI/AHCI | Cold storage, archival |
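Latency compounds quickly for small-read access patterns. A sketch using the table's latencies (midpoints where a range is given, throughput limits and caching deliberately ignored) shows why hot training data belongs on local NVMe:

```python
# Time for one million serial small reads at each tier's latency.
# Latencies follow the table above (midpoints where a range is given);
# throughput limits, queuing, and caching are deliberately ignored.

LATENCY_US = {
    "Local NVMe SSD": 10,
    "NVMe-oF (Ethernet)": 40,
    "FC-NVMe (Fibre Channel)": 75,
    "Traditional SAS/SATA SSD": 100,
}

def serial_read_seconds(n_reads: int, latency_us: float) -> float:
    return n_reads * latency_us / 1e6

for tier, lat in LATENCY_US.items():
    print(f"{tier}: {serial_read_seconds(1_000_000, lat):.0f} s")
```

Real pipelines issue reads in parallel, but the ratios hold: a 10x latency gap becomes a 10x gap in per-sample access time.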
Storage Architecture Considerations
Tiered Storage Architecture
graph TD
DL["Data Lake\n(Object Storage / Tape)\nTier 4 - Cold"] -->|"Data ingestion\n& preparation"| PFS["Parallel File System\n(Lustre / BeeGFS)\nTier 2 - Warm"]
PFS -->|"Load active\ntraining batch"| NVME["Local NVMe SSDs\nTier 1 - Hot"]
NVME -->|"Checkpoints &\nmodel artifacts"| NVMEOF["NVMe-oF / SAN\nTier 2-3 - Warm/Cool"]
NVMEOF -->|"Archive completed\nexperiments"| DL
NVME <-->|"GPU reads/writes\n~10 us latency"| GPU["GPU Cluster\n(Training / Inference)"]
NVMEOF <-->|"Shared access\n~30-100 us latency"| GPU
| Tier | Temperature | Technology | Use Case |
| --- | --- | --- | --- |
| Tier 1 | Hot | Local NVMe SSDs | Active training data, model checkpoints |
| Tier 2 | Warm | NVMe-oF / Parallel FS | Shared datasets, model repositories |
| Tier 3 | Cool | SAN / NAS | Completed experiments, archived models |
| Tier 4 | Cold | Object storage / Tape | Long-term data lake, compliance archives |
Key Design Principles
- Match storage tier to access pattern. Training checkpoints need local NVMe speed. Historical datasets accessed weekly can reside on networked storage.
- Plan for data movement. AI pipelines move data between tiers -- data movement tooling and bandwidth must be planned accordingly.
- Consider data gravity. Large datasets are expensive to move. Place compute as close to data as possible, or use NVMe-oF to extend high-speed access.
- Protect checkpoints. Training runs on large models can take days or weeks. Losing a checkpoint can mean restarting days of GPU-hours.
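The "protect checkpoints" principle has a classic quantitative companion: the Young/Daly approximation, which sets the checkpoint interval to sqrt(2 x checkpoint cost x MTBF). The numbers below are illustrative assumptions:

```python
# Young/Daly approximation for checkpoint interval:
# interval ~= sqrt(2 * checkpoint_cost * MTBF).
# The 0.1 h write cost and 500 h cluster MTBF are illustrative assumptions.
import math

def checkpoint_interval_hours(checkpoint_cost_h: float, mtbf_h: float) -> float:
    return math.sqrt(2 * checkpoint_cost_h * mtbf_h)

interval = checkpoint_interval_hours(0.1, 500)
print(f"Checkpoint roughly every {interval:.0f} hours")
```

Checkpointing more often than this wastes GPU time on writes; less often risks losing more work than the writes would have cost.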