Cloud AI Infrastructure
Cloud-based AI infrastructure leverages distributed computing resources from providers such as AWS, Google Cloud, and Microsoft Azure. These platforms offer virtualized resources, GPU clusters, AI-specialized accelerators (such as TPUs and other custom ASICs), and managed AI services on demand.
- Elastic scalability -- Provision hundreds of GPUs for a week of training, then release them, paying only for what you use.
- Lower upfront costs -- Eliminates large capital expenditures, though sustained usage increases OpEx.
- Managed services -- Pre-built AI/ML services, managed Kubernetes clusters, and turnkey GPU instances.
- Global availability -- Teams across geographies access the same infrastructure.
Analogy: Cloud AI is like renting a commercial kitchen by the hour. You get professional-grade equipment without buying it, but the hourly rate adds up if you cook all day, every day.
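The rent-vs-buy trade-off in the analogy can be made concrete with a break-even sketch. All figures below are illustrative assumptions, not vendor quotes:

```python
# Rent-vs-buy break-even for GPU capacity.
# All prices here are illustrative assumptions, not real quotes.

def breakeven_hours(purchase_cost: float, cloud_hourly: float,
                    onprem_hourly_overhead: float = 0.0) -> float:
    """Hours of use at which owning becomes cheaper than renting."""
    return purchase_cost / (cloud_hourly - onprem_hourly_overhead)

# A $250k 8-GPU server vs. a $32/hour cloud instance, with $4/hour
# of on-prem power, cooling, and admin overhead:
hours = breakeven_hours(250_000, 32.0, 4.0)
print(f"Break-even after ~{hours:,.0f} hours (~{hours / 8760:.1f} years of 24/7 use)")
```

Under these assumed numbers, continuous use pays off the hardware in about a year, which is why "cook all day, every day" workloads tend to move on-premises.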
On-Premises AI Infrastructure
In the on-premises model, specialized hardware is deployed within the organization's own data center, providing maximum control, customization, and compliance assurance.
- Data sovereignty and compliance -- Healthcare (HIPAA), finance (PCI DSS, SOX), and government (FedRAMP) workloads benefit from data that never leaves the organization's physical control.
- Predictable performance -- Dedicated hardware eliminates noisy-neighbor effects.
- Higher upfront CapEx -- Requires significant capital expenditure up front, but can be more cost-effective than cloud at sustained, high utilization.
- Full control -- Complete authority over hardware, network, security, and software stack.
Hybrid AI Infrastructure
The hybrid model integrates both cloud and on-premises resources with unified orchestration. This is increasingly the default architecture for enterprise AI.
- Workload optimization -- Training in the cloud, inference on-premises.
- Burst capacity -- Baseline on-premises, cloud for peak demand.
- Unified management -- Consistent deployment and monitoring across environments.
Worked Example: A hospital trains its diagnostic imaging model on a cloud GPU cluster using anonymized data. The trained model deploys to on-premises inference servers in radiology to satisfy HIPAA. Periodic retraining uses new anonymized data in the cloud.
Edge AI and Fog Computing
Edge AI processes data and runs inference directly on devices or local servers at or near the point of data generation. Key benefits include ultra-low latency, bandwidth efficiency, enhanced privacy, and operational resilience.
Cisco's Fog Computing Three-Layer Architecture:
graph TD
Cloud["Cloud / Data Center Layer\n(Central Data Center)\nLong-term storage, ML training,\nglobal analytics"]
Fog["Edge / Fog Layer\n(Routers, Switches, Gateways)\nAggregates, filters, and\npre-processes data locally"]
Device["Device Layer\n(IoT Endpoints)\nCollects raw data from sensors,\ncameras, and IoT devices"]
Device -->|"Raw data\n(high volume)"| Fog
Fog -->|"Summarized results\n(reduced bandwidth)"| Cloud
Cloud -->|"Model updates\n& policies"| Fog
Fog -->|"Commands &\nconfigurations"| Device
| Layer | Location | Function |
| --- | --- | --- |
| Device Layer | IoT endpoints | Collects raw data from sensors, cameras, and IoT devices |
| Edge/Fog Layer | Routers, switches, gateways | Aggregates, filters, and pre-processes data locally |
| Cloud/Data Center | Central data center | Long-term storage, ML training, and global analytics |
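The bandwidth-reduction role of the fog layer can be sketched in a few lines. The 10 Hz sample rate and one-minute window are illustrative assumptions:

```python
# Fog-layer pre-processing sketch: a window of raw sensor readings is
# reduced to a compact summary before crossing the WAN to the cloud.
# The 10 Hz sample rate and one-minute window are illustrative assumptions.
from statistics import mean

def fog_aggregate(raw_readings: list) -> dict:
    """Summarize one window of device-layer readings."""
    return {
        "count": len(raw_readings),
        "mean": mean(raw_readings),
        "min": min(raw_readings),
        "max": max(raw_readings),
    }

# 600 raw readings (10 Hz for one minute) shrink to four numbers.
window = [20.0 + 0.01 * i for i in range(600)]
summary = fog_aggregate(window)
print(summary)
```

Only the four-field summary travels to the cloud layer; the 600 raw samples stay local, which is the "reduced bandwidth" arrow in the diagram above.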
Deployment Model Comparison
flowchart LR
subgraph Cloud["Cloud AI"]
C1["Elastic scalability"]
C2["OpEx model"]
C3["Best for: Training &\nexperimentation"]
end
subgraph OnPrem["On-Premises AI"]
O1["Fixed capacity"]
O2["CapEx model"]
O3["Best for: Regulated\nindustries & inference"]
end
subgraph Hybrid["Hybrid AI"]
H1["Burst-capable"]
H2["Balanced OpEx/CapEx"]
H3["Best for: Mixed\ntraining & inference"]
end
subgraph Edge["Edge AI"]
E1["Limited per site"]
E2["Ultra-low latency"]
E3["Best for: Real-time\ninference & IoT"]
end
Cloud -->|"Train in cloud,\ndeploy on-prem"| Hybrid
OnPrem -->|"Burst to cloud\nfor peak demand"| Hybrid
Hybrid -->|"Deploy models\nto edge"| Edge
| Factor | Cloud AI | On-Premises AI | Hybrid AI | Edge AI |
| --- | --- | --- | --- | --- |
| Upfront Cost | Low (OpEx) | High (CapEx) | Medium (balanced) | Low to medium |
| Scalability | Elastic, on-demand | Fixed capacity | Burst-capable | Limited per site |
| Latency | Variable | Predictable, low | Varies by placement | Ultra-low |
| Data Control | Provider-dependent | Full control | Split control | Local control |
| Best For | Training, experimentation | Regulated industries | Mixed training/inference | Real-time inference, IoT |
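The trade-offs in the table can be reduced to a rule-of-thumb selector. The priority order below (latency first, then compliance, then burstiness) is an illustrative assumption, not an official decision tree:

```python
# Rule-of-thumb deployment selector mirroring the comparison table.
# The priority order (latency > compliance > burstiness) is an assumption.

def recommend_deployment(regulated: bool, realtime: bool, bursty: bool) -> str:
    if realtime:
        return "Edge AI"          # ultra-low latency at the data source
    if regulated and bursty:
        return "Hybrid AI"        # keep data on-prem, burst to cloud
    if regulated:
        return "On-Premises AI"   # full data control, predictable load
    return "Cloud AI"             # elastic training and experimentation

print(recommend_deployment(regulated=True, realtime=False, bursty=True))
```

The hospital worked example above follows the same logic: regulated plus bursty training demand lands on the hybrid model.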
Network Components
AI models require constant data exchange between compute nodes and storage, making low-latency, high-bandwidth networking crucial. AI data center networks include three distinct planes:
graph TD
subgraph Backend["Backend Plane (GPU-to-GPU)"]
GPU1["GPU Node 1"] <-->|"InfiniBand / RoCE\nGradient sync"| GPU2["GPU Node 2"]
GPU2 <-->|"InfiniBand / RoCE\nGradient sync"| GPU3["GPU Node N"]
end
subgraph Frontend["Frontend Plane (Storage Access)"]
CN["Compute Nodes"] <-->|"100/400GbE\nNVMe-oF"| ST["Storage Systems"]
end
subgraph Mgmt["Management Plane"]
ORCH["Orchestration\n(Kubernetes / Slurm)"] --- MON["Monitoring &\nTelemetry"]
MON --- ADM["Administration\n& Security"]
end
Backend ~~~ Frontend
Frontend ~~~ Mgmt
| Network Plane | Purpose | Typical Technology | Key Requirement |
| --- | --- | --- | --- |
| Backend (GPU-to-GPU) | Gradient synchronization during distributed training | InfiniBand, RoCE | Highest bandwidth, lowest latency |
| Frontend (storage access) | Connects compute to storage | 100/400GbE, NVMe-oF | Sustained high throughput |
| Management | Cluster orchestration, monitoring | Standard Ethernet | Reliability, security |
Exam Tip: RoCE (RDMA over Converged Ethernet) enables remote direct memory access over standard Ethernet. Building high-performance, lossless RoCE networks is a key DCAI exam topic.
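To see why the backend plane has the strictest requirements, consider the traffic of a ring all-reduce, where each GPU sends and receives roughly 2(N-1)/N times the gradient size every step. The model size, precision, and NIC speed below are illustrative assumptions:

```python
# Backend-plane traffic estimate: in a ring all-reduce each GPU moves
# about 2*(N-1)/N times the gradient size through its NIC per step.
# Model size, precision, and NIC speed are illustrative assumptions.

def allreduce_bytes_per_gpu(param_count: int, bytes_per_param: int, n_gpus: int) -> float:
    """Bytes each GPU sends (and receives) in one ring all-reduce."""
    return 2 * (n_gpus - 1) / n_gpus * param_count * bytes_per_param

def sync_time_seconds(param_count: int, bytes_per_param: int,
                      n_gpus: int, link_gbps: float) -> float:
    """Lower-bound gradient sync time on a link of link_gbps Gbit/s."""
    bytes_per_second = link_gbps * 1e9 / 8
    return allreduce_bytes_per_gpu(param_count, bytes_per_param, n_gpus) / bytes_per_second

# 7B parameters, fp16 gradients (2 bytes each), 8 GPUs, one 400G NIC each:
t = sync_time_seconds(7_000_000_000, 2, 8, 400)
print(f"~{t:.2f} s of pure gradient traffic per training step")
```

At roughly half a second of communication per step, gradient sync must overlap with computation to avoid stalling the GPUs, which is why the backend plane demands the highest bandwidth and lowest latency.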
Compute: GPUs, Accelerators, and NVLink
GPUs are the primary compute engine for AI, offering thousands of cores for parallel processing. Key considerations when selecting GPUs:
- Memory capacity -- HBM (High Bandwidth Memory) determines model and batch sizes a GPU can handle
- Interconnect speed -- NVLink and PCIe Gen5 determine GPU communication speed
- Vendor ecosystem -- NVIDIA dominates with CUDA, cuDNN, and the NGC container catalog
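Memory capacity is usually the binding constraint. A common rule of thumb for mixed-precision training with Adam (fp16 weights and gradients plus fp32 master weights and two optimizer moments, about 16 bytes per parameter, ignoring activations) can be sketched as:

```python
# Rough HBM sizing for mixed-precision training with Adam:
# fp16 weights (2 B) + fp16 gradients (2 B) + fp32 master weights (4 B)
# + fp32 Adam moments m and v (4 B each) = 16 bytes per parameter.
# Activation memory is workload-dependent and ignored here.

def training_memory_gb(param_count: int, bytes_per_param: int = 16) -> float:
    return param_count * bytes_per_param / 1e9

print(f"{training_memory_gb(7_000_000_000):.0f} GB")  # a 7B model, before activations
```

The result (~112 GB) already exceeds a single 80 GB HBM GPU, which is why even modestly sized models are trained across multiple GPUs connected by fast interconnects.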
NVLink specifications:
| Specification | Value |
| --- | --- |
| Bidirectional bandwidth (NVLink 5.0) | 1.8 TB/s per GPU |
| Bandwidth advantage over PCIe 5.0 | 14x |
| NVLink connections per Blackwell GPU | Up to 18 x 100 GB/s links |
| NVLink Switch capacity | 14.4 TB/s switching (144 ports) |
| Maximum GPUs in non-blocking fabric | Up to 576 GPUs |
| Vera Rubin NVL72 aggregate bandwidth | 260 TB/s across 72 GPUs |
Multi-Level Interconnect Hierarchy
graph TD
subgraph Domain1["NVLink Domain (Rack 1)"]
NVS1["NVLink Switch\n3.6 TB/s"] --- G1A["GPU 1"]
NVS1 --- G1B["GPU 2"]
NVS1 --- G1C["GPU ..."]
NVS1 --- G1D["GPU 72"]
end
subgraph Domain2["NVLink Domain (Rack 2)"]
NVS2["NVLink Switch\n3.6 TB/s"] --- G2A["GPU 1"]
NVS2 --- G2B["GPU 2"]
NVS2 --- G2C["GPU ..."]
NVS2 --- G2D["GPU 72"]
end
Domain1 <-->|"Level 2: Inter-domain\nEthernet / InfiniBand\n400G-800G per NIC"| Domain2
- Level 1 -- Intra-domain (NVLink/NVSwitch): Sub-microsecond latency, TB/s bandwidth within a single NVLink domain
- Level 2 -- Inter-domain (Ethernet/InfiniBand): Microsecond latency, 400G-800G per NIC across racks
DPUs (Data Processing Units): NVIDIA BlueField DPUs offload networking, storage acceleration, and security tasks from CPUs and GPUs -- essential for high-performance, secure AI factories.
Analogy: A single GPU is like a single worker on an assembly line -- fast, but limited. A GPU cluster is the entire factory floor, with hundreds of workers coordinating through high-speed communication channels.
Virtualization and Containerization
| Characteristic | Virtual Machines | Containers |
| --- | --- | --- |
| Isolation | Full OS per instance | Shared host kernel |
| Startup Time | Minutes | Seconds |
| Resource Overhead | Higher (full OS) | Lower (shared kernel) |
| Density | Fewer per host | Many more per host |
| Portability | Hardware-dependent | Highly portable |
| Best AI Use Case | Legacy workloads, strong isolation | Inference services, CI/CD pipelines |
GPU Sharing Mechanisms
graph LR
CS["CUDA Streams\n(None)"] --> TS["Time-Slicing\n(Low)"]
TS --> MPS["CUDA MPS\n(Medium)"]
MPS --> MIG["MIG\n(High)"]
MIG --> VGPU["vGPU\n(Highest)"]
| Mechanism | Description | Isolation | Best For |
| --- | --- | --- | --- |
| CUDA Streams | Multiple operations concurrently on one GPU | None | Single-app parallelism |
| Time-slicing | Containers take turns with rapid context switching | Low | Light inference |
| CUDA MPS | Multiple processes share a GPU concurrently | Medium | Mixed small workloads |
| MIG | Partitions one GPU into isolated instances | High | Multi-tenant environments |
| vGPU | Full GPU virtualization for VM workloads | Highest | VM-based enterprise |
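The table maps naturally onto a "least isolation that meets the requirement" rule. The function below encodes that reading as a sketch; the decision order is an assumption, not official NVIDIA guidance:

```python
# Pick the lightest GPU-sharing mechanism that satisfies the workload's
# requirements, following the isolation ladder in the table above.
# The decision order is an assumption, not official NVIDIA guidance.

def pick_sharing_mechanism(vm_based: bool = False, multi_tenant: bool = False,
                           multi_process: bool = False, shared_light: bool = False) -> str:
    if vm_based:
        return "vGPU"          # full virtualization for VM workloads
    if multi_tenant:
        return "MIG"           # hardware-isolated GPU partitions
    if multi_process:
        return "CUDA MPS"      # concurrent processes, medium isolation
    if shared_light:
        return "Time-slicing"  # containers take turns on the GPU
    return "CUDA Streams"      # parallelism within a single application

print(pick_sharing_mechanism(multi_tenant=True))
```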
Exam Tip: The NVIDIA GPU Operator enables GPU-accelerated containers and VMs in the same Kubernetes cluster. Worker nodes use the label nvidia.com/gpu.workload.config set to container, vm-passthrough, or vm-vgpu. A node runs one workload type at a time.
Orchestration and Monitoring
Kubernetes is the de facto standard for container orchestration in AI environments. For AI workloads it provides:
- GPU scheduling -- Native support through device plugins
- Autoscaling -- Horizontal pod autoscaling adjusts inference replicas based on demand
- Job management -- Kubernetes Jobs/CronJobs manage batch training with retry and completion tracking
- Resource quotas -- Namespaces and quotas ensure fair sharing of GPU resources
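A minimal pod manifest requesting a GPU through the device plugin's `nvidia.com/gpu` resource can be sketched as a Python dict (the container image tag is illustrative):

```python
# Minimal Kubernetes pod manifest requesting one GPU via the NVIDIA
# device plugin's nvidia.com/gpu resource, expressed as a Python dict.
# The container image tag is illustrative.
import json

gpu_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "train-job"},
    "spec": {
        "restartPolicy": "Never",
        "containers": [{
            "name": "trainer",
            "image": "nvcr.io/nvidia/pytorch:24.01-py3",  # illustrative tag
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }],
    },
}
print(json.dumps(gpu_pod, indent=2))
```

The scheduler places the pod only on a node whose device plugin advertises a free `nvidia.com/gpu`, which is how GPU scheduling stays declarative.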
Slurm is the dominant scheduler in HPC-style environments for large-scale GPU training jobs.
Physical Infrastructure
- An NVIDIA H100 GPU draws 700W per unit
- A rack of 8 GPU nodes can draw 30-50 kW (vs. 5-10 kW for a rack of traditional servers)
- Air cooling becomes insufficient above ~30 kW per rack
- Direct liquid cooling (coolant through cold plates on GPUs) is increasingly standard
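Taking the figures above at face value (8 nodes per rack, an assumed 8 x 700 W GPUs per node), the GPU draw alone can be estimated:

```python
# GPU power draw per rack, using the figures above: 8 nodes per rack,
# an assumed 8 GPUs per node, 700 W per H100. CPU, NIC, and fan
# overhead is deliberately excluded.

def rack_gpu_power_kw(nodes: int, gpus_per_node: int, gpu_watts: float) -> float:
    return nodes * gpus_per_node * gpu_watts / 1000

kw = rack_gpu_power_kw(8, 8, 700)
print(f"~{kw:.1f} kW of GPU power alone -- above the ~30 kW air-cooling limit")
```

Even before counting CPUs, NICs, and fans, such a rack exceeds the practical air-cooling threshold, which is why direct liquid cooling is becoming standard.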
SAN and Fibre Channel
A Storage Area Network (SAN) is a dedicated high-speed network that provides block-level storage access. SANs remain relevant in AI infrastructure through the adoption of NVMe over Fibre Channel.
Fibre Channel provides a mature, stable, and well-tooled storage fabric. FC-NVMe delivers latencies in the 50-100 microsecond range and allows NVMe commands over existing FC fabrics without ripping and replacing infrastructure.
Analogy: If NVMe is a sports car engine, Fibre Channel SAN is the established highway system. FC-NVMe lets you drive that sports car on the highways you already built, rather than constructing entirely new roads.
NVMe: Block and File Storage
NVMe (Non-Volatile Memory Express) is a storage protocol designed specifically for flash memory. NVMe SSDs connect directly to the CPU via PCIe, bypassing legacy storage controllers for maximum performance.
NVMe over Fabrics (NVMe-oF) extends the NVMe protocol across network fabrics, delivering high-speed storage access to remote servers. Key benefits:
- Reduced CPU utilization on application host servers
- Shared access to centralized high-performance storage pools
- Efficient data orchestration and better resource utilization
- Flexible expansion of storage configuration
| Storage Technology | Latency | Protocol | Typical AI Use Case |
| --- | --- | --- | --- |
| Local NVMe SSD | ~10 microseconds | PCIe direct | Active training datasets, checkpoints |
| NVMe-oF (Ethernet) | ~30-50 microseconds | NVMe over TCP/RoCE | Shared dataset access, inference model serving |
| FC-NVMe (Fibre Channel) | ~50-100 microseconds | NVMe over FC | Enterprise SAN integration, mixed workloads |
| Traditional SAS/SATA SSD | ~100+ microseconds | SCSI/AHCI | Cold storage, archival |
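Latency compounds quickly for small-read access patterns. A sketch using the table's latencies (midpoints where a range is given, throughput limits and caching deliberately ignored) shows why hot training data belongs on local NVMe:

```python
# Time for one million serial small reads at each tier's latency.
# Latencies follow the table above (midpoints where a range is given);
# throughput limits, queuing, and caching are deliberately ignored.

LATENCY_US = {
    "Local NVMe SSD": 10,
    "NVMe-oF (Ethernet)": 40,
    "FC-NVMe (Fibre Channel)": 75,
    "Traditional SAS/SATA SSD": 100,
}

def serial_read_seconds(n_reads: int, latency_us: float) -> float:
    return n_reads * latency_us / 1e6

for tier, lat in LATENCY_US.items():
    print(f"{tier}: {serial_read_seconds(1_000_000, lat):.0f} s")
```

Real pipelines issue reads in parallel, but the ratios hold: a 10x latency gap becomes a 10x gap in per-sample access time.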
Storage Architecture Considerations
Tiered Storage Architecture
graph TD
DL["Data Lake\n(Object Storage / Tape)\nTier 4 - Cold"] -->|"Data ingestion\n& preparation"| PFS["Parallel File System\n(Lustre / BeeGFS)\nTier 2 - Warm"]
PFS -->|"Load active\ntraining batch"| NVME["Local NVMe SSDs\nTier 1 - Hot"]
NVME -->|"Checkpoints &\nmodel artifacts"| NVMEOF["NVMe-oF / SAN\nTier 2-3 - Warm/Cool"]
NVMEOF -->|"Archive completed\nexperiments"| DL
NVME <-->|"GPU reads/writes\n~10 us latency"| GPU["GPU Cluster\n(Training / Inference)"]
NVMEOF <-->|"Shared access\n~30-100 us latency"| GPU
| Tier | Temperature | Technology | Use Case |
| --- | --- | --- | --- |
| Tier 1 | Hot | Local NVMe SSDs | Active training data, model checkpoints |
| Tier 2 | Warm | NVMe-oF / Parallel FS | Shared datasets, model repositories |
| Tier 3 | Cool | SAN / NAS | Completed experiments, archived models |
| Tier 4 | Cold | Object storage / Tape | Long-term data lake, compliance archives |
Key Design Principles
- Match storage tier to access pattern. Training checkpoints need local NVMe speed. Historical datasets accessed weekly can reside on networked storage.
- Plan for data movement. AI pipelines move data between tiers -- data movement tooling and bandwidth must be planned accordingly.
- Consider data gravity. Large datasets are expensive to move. Place compute as close to data as possible, or use NVMe-oF to extend high-speed access.
- Protect checkpoints. Training runs on large models can take days or weeks. Losing a checkpoint can mean restarting days of GPU-hours.
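The "protect checkpoints" principle has a classic quantitative companion: the Young/Daly approximation, which sets the checkpoint interval to sqrt(2 x checkpoint cost x MTBF). The numbers below are illustrative assumptions:

```python
# Young/Daly approximation for checkpoint interval:
# interval ~= sqrt(2 * checkpoint_cost * MTBF).
# The 0.1 h write cost and 500 h cluster MTBF are illustrative assumptions.
import math

def checkpoint_interval_hours(checkpoint_cost_h: float, mtbf_h: float) -> float:
    return math.sqrt(2 * checkpoint_cost_h * mtbf_h)

interval = checkpoint_interval_hours(0.1, 500)
print(f"Checkpoint roughly every {interval:.0f} hours")
```

Checkpointing more often than this wastes GPU time on writes; less often risks losing more work than the writes would have cost.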