Chapter 3: AI Infrastructure Types and Components

Learning Objectives

Section 1: Types of AI Infrastructure

Pre-Quiz: Types of AI Infrastructure

1. Which AI infrastructure model uses an OpEx-based pricing approach with elastic scalability?

A) On-Premises AI B) Cloud AI C) Edge AI D) Fog Computing

2. A hospital needs to process patient diagnostic images locally to comply with HIPAA. Which deployment model best fits this inference workload?

A) Cloud AI B) On-Premises AI C) Edge AI D) Hybrid AI (inference portion)

3. Which organization co-founded the OpenFog Consortium in 2015?

A) Google B) Amazon C) Cisco D) IBM

4. What is the primary advantage of edge AI over cloud AI?

A) Higher computational capacity B) Lower upfront cost C) Ultra-low latency for real-time inference D) Better managed services

5. In the Cisco fog computing architecture, what is the role of the edge/fog layer?

A) Long-term storage and ML training B) Collecting raw data from sensors C) Aggregating, filtering, and pre-processing data locally D) Running global analytics dashboards

Key Points

Cloud AI Infrastructure

Cloud-based AI infrastructure leverages distributed computing resources from providers such as AWS, Google Cloud, and Microsoft Azure. These platforms offer virtualized resources, GPU clusters, AI-specialized accelerators (such as TPUs and other custom ASICs), and managed AI services on demand.

Analogy: Cloud AI is like renting a commercial kitchen by the hour. You get professional-grade equipment without buying it, but the hourly rate adds up if you cook all day, every day.

On-Premises AI Infrastructure

In the on-premises model, specialized hardware is deployed within the organization's own data center, providing maximum control, customization, and compliance assurance.

Hybrid AI Infrastructure

The hybrid model integrates both cloud and on-premises resources with unified orchestration. This is increasingly the default architecture for enterprise AI.

Worked Example: A hospital trains its diagnostic imaging model on a cloud GPU cluster using anonymized data. The trained model deploys to on-premises inference servers in radiology to satisfy HIPAA. Periodic retraining uses new anonymized data in the cloud.

Edge AI and Fog Computing

Edge AI processes data and runs inference directly on devices or local servers at or near the point of data generation. Key benefits include ultra-low latency, bandwidth efficiency, enhanced privacy, and operational resilience.

Cisco's Fog Computing Three-Layer Architecture:

```mermaid
graph TD
    Cloud["Cloud / Data Center Layer\n(Central Data Center)\nLong-term storage, ML training,\nglobal analytics"]
    Fog["Edge / Fog Layer\n(Routers, Switches, Gateways)\nAggregates, filters, and\npre-processes data locally"]
    Device["Device Layer\n(IoT Endpoints)\nCollects raw data from sensors,\ncameras, and IoT devices"]
    Device -->|"Raw data\n(high volume)"| Fog
    Fog -->|"Summarized results\n(reduced bandwidth)"| Cloud
    Cloud -->|"Model updates\n& policies"| Fog
    Fog -->|"Commands &\nconfigurations"| Device
```

| Layer | Location | Function |
| --- | --- | --- |
| Device Layer | IoT endpoints | Collects raw data from sensors, cameras, and IoT devices |
| Edge/Fog Layer | Routers, switches, gateways | Aggregates, filters, and pre-processes data locally |
| Cloud/Data Center | Central data center | Long-term storage, ML training, and global analytics |
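
The fog layer's aggregate-and-filter role can be sketched in a few lines of Python. This is an illustrative toy (the `fog_aggregate` function, threshold, and window size are invented for this example), but it shows why the fog layer reduces bandwidth: many raw device-layer readings leave for the cloud as a few summary records.

```python
from statistics import mean

def fog_aggregate(raw_readings, threshold=75.0, window=60):
    """Fog-layer pre-processing: summarize raw device-layer readings
    in fixed windows before forwarding to the cloud layer."""
    summaries = []
    for start in range(0, len(raw_readings), window):
        vals = raw_readings[start:start + window]
        summaries.append({
            "mean": round(mean(vals), 2),
            "max": max(vals),
            "alerts": sum(1 for v in vals if v > threshold),
        })
    return summaries  # far fewer records than the raw stream

# 600 raw sensor readings collapse into 10 summary records
readings = [70 + (i % 20) for i in range(600)]
print(len(fog_aggregate(readings)))  # 10
```

The same pattern appears in real gateways: only anomalies and rollups cross the WAN link, while raw telemetry stays local.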

Deployment Model Comparison

```mermaid
flowchart LR
    subgraph Cloud["Cloud AI"]
        C1["Elastic scalability"]
        C2["OpEx model"]
        C3["Best for: Training &\nexperimentation"]
    end
    subgraph OnPrem["On-Premises AI"]
        O1["Fixed capacity"]
        O2["CapEx model"]
        O3["Best for: Regulated\nindustries & inference"]
    end
    subgraph Hybrid["Hybrid AI"]
        H1["Burst-capable"]
        H2["Balanced OpEx/CapEx"]
        H3["Best for: Mixed\ntraining & inference"]
    end
    subgraph Edge["Edge AI"]
        E1["Limited per site"]
        E2["Ultra-low latency"]
        E3["Best for: Real-time\ninference & IoT"]
    end
    Cloud -->|"Train in cloud,\ndeploy on-prem"| Hybrid
    OnPrem -->|"Burst to cloud\nfor peak demand"| Hybrid
    Hybrid -->|"Deploy models\nto edge"| Edge
```

| Factor | Cloud AI | On-Premises AI | Hybrid AI | Edge AI |
| --- | --- | --- | --- | --- |
| Upfront Cost | Low (OpEx) | High (CapEx) | Medium (balanced) | Low to medium |
| Scalability | Elastic, on-demand | Fixed capacity | Burst-capable | Limited per site |
| Latency | Variable | Predictable, low | Varies by placement | Ultra-low |
| Data Control | Provider-dependent | Full control | Split control | Local control |
| Best For | Training, experimentation | Regulated industries | Mixed training/inference | Real-time inference, IoT |

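
As a rough illustration, the "Best For" row of the comparison above can be turned into a toy rule-of-thumb selector. The function and its workload keys are invented for this sketch; real deployment decisions also involve cost modeling, data gravity, and compliance review.

```python
def recommend_deployment(workload):
    """Toy selector mirroring the deployment comparison table.
    Workload keys (illustrative): needs_realtime, regulated,
    owns_hardware, mixed_training_inference."""
    if workload.get("needs_realtime"):
        return "Edge AI"            # ultra-low latency inference
    if workload.get("regulated") and workload.get("owns_hardware"):
        return "On-Premises AI"     # full data control, CapEx
    if workload.get("mixed_training_inference"):
        return "Hybrid AI"          # train in cloud, infer on-prem
    return "Cloud AI"               # elastic, OpEx, experimentation

print(recommend_deployment({"needs_realtime": True}))  # Edge AI
```
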
Animation: Interactive deployment model selector -- drag workload characteristics to see the recommended infrastructure type
Post-Quiz: Types of AI Infrastructure

1. A startup with no hardware budget needs to train a large language model over two weeks. Which deployment model is most appropriate?

A) On-Premises AI B) Cloud AI C) Edge AI D) Fog Computing

2. In a hybrid AI architecture, where does model training typically occur and where is inference typically deployed?

A) Training on-premises, inference in cloud B) Training in cloud, inference on-premises C) Both training and inference in cloud D) Both training and inference at the edge

3. Which layer in Cisco's fog computing architecture reduces bandwidth by aggregating and filtering data before sending it to the cloud?

A) Device Layer B) Cloud/Data Center Layer C) Edge/Fog Layer D) Application Layer

4. An autonomous vehicle factory needs AI inference that continues working even when internet connectivity drops. Which deployment model is best?

A) Cloud AI B) Hybrid AI C) Edge AI D) On-Premises AI

5. What year did Cisco coin the term "fog computing"?

A) 2010 B) 2012 C) 2015 D) 2018

Section 2: Core Components of AI Environments

Pre-Quiz: Core Components of AI Environments

1. Which network plane handles GPU-to-GPU gradient synchronization during distributed training?

A) Management Plane B) Frontend Plane C) Backend Plane D) Control Plane

2. What is the bidirectional bandwidth of NVLink 5.0 per GPU?

A) 600 GB/s B) 900 GB/s C) 1.8 TB/s D) 3.6 TB/s

3. Which GPU sharing mechanism provides the highest level of isolation?

A) Time-slicing B) CUDA MPS C) MIG (Multi-Instance GPU) D) vGPU

4. How much power does a typical rack with 8 GPU nodes consume?

A) 5-10 kW B) 10-20 kW C) 30-50 kW D) 100-150 kW

5. What key advantage do containers have over virtual machines for AI inference workloads?

A) Stronger isolation B) Full OS per instance C) Faster startup time (seconds vs. minutes) and lower resource overhead D) Better hardware-dependent portability

Key Points

Network Components

Training and serving AI models require constant data exchange between compute nodes and storage, making low-latency, high-bandwidth networking crucial. AI data center networks comprise three distinct planes:

```mermaid
graph TD
    subgraph Backend["Backend Plane (GPU-to-GPU)"]
        GPU1["GPU Node 1"] <-->|"InfiniBand / RoCE\nGradient sync"| GPU2["GPU Node 2"]
        GPU2 <-->|"InfiniBand / RoCE\nGradient sync"| GPU3["GPU Node N"]
    end
    subgraph Frontend["Frontend Plane (Storage Access)"]
        CN["Compute Nodes"] <-->|"100/400GbE\nNVMe-oF"| ST["Storage Systems"]
    end
    subgraph Mgmt["Management Plane"]
        ORCH["Orchestration\n(Kubernetes / Slurm)"] --- MON["Monitoring &\nTelemetry"]
        MON --- ADM["Administration\n& Security"]
    end
    Backend ~~~ Frontend
    Frontend ~~~ Mgmt
```

| Network Plane | Purpose | Typical Technology | Key Requirement |
| --- | --- | --- | --- |
| Backend (GPU-to-GPU) | Gradient synchronization during distributed training | InfiniBand, RoCE | Highest bandwidth, lowest latency |
| Frontend (storage access) | Connects compute to storage | 100/400GbE, NVMe-oF | Sustained high throughput |
| Management | Cluster orchestration, monitoring | Standard Ethernet | Reliability, security |

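
To make the backend plane's job concrete, here is a toy all-reduce in plain Python. Real clusters run NCCL or similar collectives over InfiniBand/RoCE; this sketch only shows the logical outcome of gradient synchronization: every node ends up holding the element-wise average of all nodes' gradients.

```python
def allreduce_average(node_gradients):
    """Toy all-reduce: each node receives the element-wise average
    of every node's gradient vector -- the traffic pattern the
    backend plane carries on every training step."""
    n = len(node_gradients)
    summed = [sum(vals) for vals in zip(*node_gradients)]
    averaged = [s / n for s in summed]
    return [list(averaged) for _ in range(n)]  # one copy per node

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 nodes, 2 params each
print(allreduce_average(grads)[0])  # [3.0, 4.0]
```

Because this exchange happens once per training step across every GPU pair, the backend fabric needs the highest bandwidth and lowest latency of the three planes.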
Exam Tip: RoCE (RDMA over Converged Ethernet) enables remote direct memory access over standard Ethernet. Building high-performance, lossless RoCE networks is a key DCAI exam topic.

Compute: GPUs, Accelerators, and NVLink

GPUs are the primary compute engine for AI, offering thousands of cores for parallel processing. Because distributed training is often limited by communication rather than raw compute, the bandwidth of the interconnect between GPUs is a key selection consideration.

NVLink specifications:

| Specification | Value |
| --- | --- |
| Bidirectional bandwidth (NVLink 5.0) | 1.8 TB/s per GPU |
| Bandwidth advantage over PCIe 5.0 | 14x |
| NVLink connections per Blackwell GPU | Up to 18 x 100 GB/s links |
| NVLink Switch capacity | 14.4 TB/s switching (144 ports) |
| Maximum GPUs in non-blocking fabric | Up to 576 GPUs |
| Vera Rubin NVL72 aggregate bandwidth | 260 TB/s across 72 GPUs |
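
A quick back-of-the-envelope calculation shows what the 14x figure means in practice. The payload size is arbitrary, and the model ignores protocol overhead; it simply divides data volume by the bandwidth numbers in the table above.

```python
def transfer_seconds(gigabytes, bandwidth_gb_s):
    """Idealized transfer time, ignoring protocol overhead."""
    return gigabytes / bandwidth_gb_s

NVLINK5_GB_S = 1800                   # 1.8 TB/s bidirectional per GPU
PCIE5_X16_GB_S = NVLINK5_GB_S / 14    # ~128 GB/s, per the 14x figure

payload_gb = 180  # e.g., a large inter-GPU gradient exchange
print(round(transfer_seconds(payload_gb, NVLINK5_GB_S), 2))    # 0.1
print(round(transfer_seconds(payload_gb, PCIE5_X16_GB_S), 1))  # 1.4
```

At scale, that 14x gap per step compounds across thousands of training steps, which is why NVLink domains matter for large-model training throughput.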

Multi-Level Interconnect Hierarchy

```mermaid
graph TD
    subgraph Domain1["NVLink Domain (Rack 1)"]
        NVS1["NVLink Switch\n3.6 TB/s"] --- G1A["GPU 1"]
        NVS1 --- G1B["GPU 2"]
        NVS1 --- G1C["GPU ..."]
        NVS1 --- G1D["GPU 72"]
    end
    subgraph Domain2["NVLink Domain (Rack 2)"]
        NVS2["NVLink Switch\n3.6 TB/s"] --- G2A["GPU 1"]
        NVS2 --- G2B["GPU 2"]
        NVS2 --- G2C["GPU ..."]
        NVS2 --- G2D["GPU 72"]
    end
    Domain1 <-->|"Level 2: Inter-domain\nEthernet / InfiniBand\n400G-800G per NIC"| Domain2
```

DPUs (Data Processing Units): NVIDIA BlueField DPUs offload networking, storage acceleration, and security tasks from CPUs and GPUs -- essential for high-performance, secure AI factories.

Analogy: A single GPU is like a single worker on an assembly line -- fast, but limited. A GPU cluster is the entire factory floor, with hundreds of workers coordinating through high-speed communication channels.

Virtualization and Containerization

| Characteristic | Virtual Machines | Containers |
| --- | --- | --- |
| Isolation | Full OS per instance | Shared host kernel |
| Startup Time | Minutes | Seconds |
| Resource Overhead | Higher (full OS) | Lower (shared kernel) |
| Density | Fewer per host | Many more per host |
| Portability | Hardware-dependent | Highly portable |
| Best AI Use Case | Legacy workloads, strong isolation | Inference services, CI/CD pipelines |

GPU Sharing Mechanisms

```mermaid
graph LR
    CS["CUDA Streams\n(None)"] --> TS["Time-Slicing\n(Low)"]
    TS --> MPS["CUDA MPS\n(Medium)"]
    MPS --> MIG["MIG\n(High)"]
    MIG --> VGPU["vGPU\n(Highest)"]
```

| Mechanism | Description | Isolation | Best For |
| --- | --- | --- | --- |
| CUDA Streams | Multiple operations concurrently on one GPU | None | Single-app parallelism |
| Time-slicing | Containers take turns with rapid context switching | Low | Light inference |
| CUDA MPS | Multiple processes share a GPU concurrently | Medium | Mixed small workloads |
| MIG | Partitions one GPU into isolated instances | High | Multi-tenant environments |
| vGPU | Full GPU virtualization for VM workloads | Highest | VM-based enterprise |

Exam Tip: The NVIDIA GPU Operator enables GPU-accelerated containers and VMs in the same Kubernetes cluster. Worker nodes use the label nvidia.com/gpu.workload.config set to container, vm-passthrough, or vm-vgpu. A node runs one workload type at a time.
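
The exam tip's label can be illustrated with a small Python sketch that builds a node-patch body. The helper function and node name are hypothetical; only the label key and its three allowed values come from the GPU Operator behavior described above.

```python
# Label key and allowed values per the NVIDIA GPU Operator exam tip.
LABEL_KEY = "nvidia.com/gpu.workload.config"
ALLOWED = {"container", "vm-passthrough", "vm-vgpu"}

def workload_label_patch(workload_type):
    """Build a Kubernetes node-patch body setting the GPU workload
    label. Apply it with, e.g.:
    kubectl patch node gpu-worker-01 --patch '<json>'
    ("gpu-worker-01" is a hypothetical node name)."""
    if workload_type not in ALLOWED:
        raise ValueError(f"unknown workload type: {workload_type}")
    return {"metadata": {"labels": {LABEL_KEY: workload_type}}}

patch = workload_label_patch("container")
print(patch["metadata"]["labels"][LABEL_KEY])  # container
```

Because a node runs one workload type at a time, changing the label effectively repurposes the node between container and VM duty.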

Orchestration and Monitoring

Kubernetes is the de facto standard for container orchestration in AI environments, providing scheduling, scaling, and GPU-aware resource management for AI workloads.

Slurm is the dominant scheduler in HPC-style environments for large-scale GPU training jobs.

Physical Infrastructure

Modern GPU racks draw far more power than traditional enterprise racks: a rack with 8 GPU nodes typically consumes 30-50 kW. Above roughly 30 kW per rack, air cooling reaches its limits and direct liquid cooling becomes necessary.

Animation: Interactive GPU sharing mechanism comparison -- toggle between isolation levels to see resource allocation
Post-Quiz: Core Components of AI Environments

1. Which two technologies are commonly used for the backend (GPU-to-GPU) network plane?

A) Standard Ethernet and NFS B) InfiniBand and RoCE C) Fibre Channel and iSCSI D) Wi-Fi 6 and Bluetooth

2. How many times faster is NVLink 5.0 compared to PCIe 5.0?

A) 4x B) 8x C) 14x D) 20x

3. Which GPU sharing mechanism partitions a single GPU into isolated instances with dedicated compute, memory, and cache?

A) Time-slicing B) CUDA MPS C) MIG (Multi-Instance GPU) D) CUDA Streams

4. What is the Kubernetes node label used by the NVIDIA GPU Operator to configure GPU workload types?

A) gpu.workload.type B) nvidia.com/gpu.workload.config C) kubernetes.io/gpu-mode D) accelerator.nvidia.com/type

5. What cooling technology becomes necessary when rack power density exceeds approximately 30 kW?

A) Enhanced air cooling with raised floors B) Direct liquid cooling C) Passive heat sinks D) Outdoor ambient cooling

Section 3: Storage Components for AI

Pre-Quiz: Storage Components for AI

1. What is a SAN (Storage Area Network)?

A) A wireless storage solution for IoT devices B) A dedicated high-speed network providing block-level storage access C) A cloud-only object storage service D) A file-based backup system

2. What is the typical latency of a local NVMe SSD?

A) ~1 microsecond B) ~10 microseconds C) ~100 microseconds D) ~1 millisecond

3. What does NVMe-oF stand for?

A) NVMe over Fibre only B) NVMe over Fabrics C) NVMe over Flash D) NVMe optimized Format

4. Which storage tier is best for active training datasets and model checkpoints?

A) Tier 4 (Cold) -- Object storage B) Tier 3 (Cool) -- SAN/NAS C) Tier 1 (Hot) -- Local NVMe SSDs D) Tier 2 (Warm) -- Parallel file systems

5. What type of file system stripes data across multiple storage servers for aggregate bandwidth in distributed AI training?

A) NTFS B) ext4 C) Parallel file system (Lustre, BeeGFS) D) FAT32

Key Points

SAN and Fibre Channel

A Storage Area Network (SAN) is a dedicated high-speed network that provides block-level storage access. SANs remain relevant in AI infrastructure through the adoption of NVMe over Fibre Channel.

Fibre Channel provides a mature, stable, and well-tooled storage fabric. FC-NVMe delivers latencies in the 50-100 microsecond range and carries NVMe commands over existing FC fabrics without a rip-and-replace of infrastructure.

Analogy: If NVMe is a sports car engine, Fibre Channel SAN is the established highway system. FC-NVMe lets you drive that sports car on the highways you already built, rather than constructing entirely new roads.

NVMe: Block and File Storage

NVMe (Non-Volatile Memory Express) is a storage protocol designed specifically for flash memory. NVMe SSDs connect directly to the CPU via PCIe, bypassing legacy storage controllers for maximum performance.

NVMe over Fabrics (NVMe-oF) extends the NVMe protocol across network fabrics, delivering high-speed storage access to remote servers. Key benefits:

| Storage Technology | Latency | Protocol | Typical AI Use Case |
| --- | --- | --- | --- |
| Local NVMe SSD | ~10 microseconds | PCIe direct | Active training datasets, checkpoints |
| NVMe-oF (Ethernet) | ~30-50 microseconds | NVMe over TCP/RoCE | Shared dataset access, inference model serving |
| FC-NVMe (Fibre Channel) | ~50-100 microseconds | NVMe over FC | Enterprise SAN integration, mixed workloads |
| Traditional SAS/SATA SSD | ~100+ microseconds | SCSI/AHCI | Cold storage, archival |
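
Latency dominates small, serial accesses. As a rough illustration (a single outstanding request with no queuing, which is a simplifying assumption), the latencies in the table translate into an upper bound on back-to-back small reads per second:

```python
def max_serial_iops(latency_us):
    """Upper bound on strictly serial small reads per second:
    one operation completes every `latency_us` microseconds."""
    return 1_000_000 / latency_us

# Using the latencies from the table above
print(int(max_serial_iops(10)))   # 100000 -- local NVMe SSD
print(int(max_serial_iops(100)))  # 10000  -- SAS/SATA or FC-NVMe worst case
```

Real drives reach far higher IOPS through deep queues and parallelism, but the serial bound explains why latency-sensitive steps such as random dataset sampling favor the hot tier.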

Storage Architecture Considerations

Tiered Storage Architecture

```mermaid
graph TD
    DL["Data Lake\n(Object Storage / Tape)\nTier 4 - Cold"] -->|"Data ingestion\n& preparation"| PFS["Parallel File System\n(Lustre / BeeGFS)\nTier 2 - Warm"]
    PFS -->|"Load active\ntraining batch"| NVME["Local NVMe SSDs\nTier 1 - Hot"]
    NVME -->|"Checkpoints &\nmodel artifacts"| NVMEOF["NVMe-oF / SAN\nTier 2-3 - Warm/Cool"]
    NVMEOF -->|"Archive completed\nexperiments"| DL
    NVME <-->|"GPU reads/writes\n~10 us latency"| GPU["GPU Cluster\n(Training / Inference)"]
    NVMEOF <-->|"Shared access\n~30-100 us latency"| GPU
```

| Tier | Temperature | Technology | Use Case |
| --- | --- | --- | --- |
| Tier 1 | Hot | Local NVMe SSDs | Active training data, model checkpoints |
| Tier 2 | Warm | NVMe-oF / Parallel FS | Shared datasets, model repositories |
| Tier 3 | Cool | SAN / NAS | Completed experiments, archived models |
| Tier 4 | Cold | Object storage / Tape | Long-term data lake, compliance archives |

Key Design Principles

  1. Match storage tier to access pattern. Training checkpoints need local NVMe speed. Historical datasets accessed weekly can reside on networked storage.
  2. Plan for data movement. AI pipelines move data between tiers -- data movement tooling and bandwidth must be planned accordingly.
  3. Consider data gravity. Large datasets are expensive to move. Place compute as close to data as possible, or use NVMe-oF to extend high-speed access.
  4. Protect checkpoints. Training runs on large models can take days or weeks. Losing a checkpoint can mean restarting days of GPU-hours.
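Principle 4 can be sketched as a minimal checkpoint writer: write to a temporary file, fsync, atomically rename, then prune old checkpoints. The function name and file layout are invented for this example; real training frameworks provide their own checkpoint APIs, and production setups also replicate checkpoints to warm-tier storage.

```python
import os
import tempfile

def save_checkpoint(ckpt_dir, step, blob, keep_last=3):
    """Write a checkpoint atomically, then prune old ones so a crash
    mid-write never leaves zero usable checkpoints."""
    final = os.path.join(ckpt_dir, f"ckpt-{step:08d}.bin")
    tmp = final + ".tmp"
    with open(tmp, "wb") as f:
        f.write(blob)
        f.flush()
        os.fsync(f.fileno())   # durable on disk before publishing
    os.replace(tmp, final)     # atomic rename: all-or-nothing
    ckpts = sorted(p for p in os.listdir(ckpt_dir) if p.endswith(".bin"))
    for old in ckpts[:-keep_last]:
        os.remove(os.path.join(ckpt_dir, old))

with tempfile.TemporaryDirectory() as d:
    for step in range(5):
        save_checkpoint(d, step, b"weights")
    print(sorted(os.listdir(d)))  # only the last 3 checkpoints remain
```

Keeping several recent checkpoints (rather than just the latest) guards against the case where the newest checkpoint itself is corrupt.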
Animation: Interactive storage tier explorer -- visualize data flow through the AI pipeline from ingestion to archival
Post-Quiz: Storage Components for AI

1. What is the key advantage of FC-NVMe for organizations with existing Fibre Channel infrastructure?

A) It replaces FC entirely with Ethernet B) It runs NVMe commands over existing FC fabrics without rip-and-replace C) It provides sub-microsecond latency D) It eliminates the need for storage administrators

2. Which storage protocol connects directly to the CPU via PCIe, bypassing legacy storage controllers?

A) SCSI B) SATA C) NVMe D) SAS

3. In a tiered AI storage architecture, where should shared datasets and model repositories be stored?

A) Tier 1 (Hot) -- Local NVMe B) Tier 2 (Warm) -- NVMe-oF / Parallel FS C) Tier 3 (Cool) -- SAN / NAS D) Tier 4 (Cold) -- Object storage

4. Why is checkpoint protection critical in AI training?

A) Checkpoints contain sensitive user data B) Losing a checkpoint can mean restarting days or weeks of GPU-hours C) Checkpoints are required for regulatory compliance D) Checkpoints cannot be regenerated

5. Which parallel file systems are commonly used in distributed AI training for high-throughput concurrent data access?

A) NTFS and ext4 B) Lustre and BeeGFS C) FAT32 and exFAT D) ZFS and Btrfs


Answer Explanations