Chapter 7: Storage Architecture for AI Workloads

Learning Objectives

Section 1: Storage Requirements for AI

Pre-Quiz: Storage Requirements for AI

1. Which storage performance metric is most critical for AI training workloads?

A) IOPS
B) Throughput (GB/s)
C) Latency
D) Metadata operations per second

2. What is a practical rule of thumb for capacity planning relative to raw dataset size in AI environments?

A) 1.5x the raw dataset size
B) 2x the raw dataset size
C) 3-5x the raw dataset size
D) 10x the raw dataset size

3. Which scaling pattern is preferred for modern AI storage architectures?

A) Scale-up (vertical)
B) Scale-out (horizontal)
C) Scale-down
D) Both are equally preferred

4. Which redundancy technique distributes data and parity across multiple nodes and is more storage-efficient than replication?

A) RAID 10
B) Multi-site replication
C) Erasure coding
D) Mirroring

5. AI inference workloads primarily demand which performance characteristic?

A) Large sequential throughput
B) High IOPS and low latency
C) Maximum capacity
D) Write-optimized storage

Key Points

Performance Metrics: IOPS, Throughput, and Latency

Storage in an AI data center is not a passive repository -- it is the fuel line feeding the GPU engines. When that fuel line cannot deliver data fast enough, even the most powerful GPU cluster sits idle.

| Metric | Definition | Primary AI Phase | Target Range |
| --- | --- | --- | --- |
| IOPS | Discrete read/write operations per second | Inference (many small, random I/Os) | Hundreds of thousands to millions |
| Throughput | Volume of data transferred per unit time (GB/s) | Training (large sequential reads) | Tens to hundreds of GB/s aggregate |
| Latency | Time between I/O request and response | Both (idle GPU cycles = wasted money) | Sub-millisecond; microseconds for NVMe |

Highway analogy: IOPS = number of cars entering per second. Throughput = total cargo tonnage per second. Latency = travel time per car. Training needs wide highways with heavy trucks (throughput). Inference needs fast sports cars arriving constantly (high IOPS, low latency).
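As a back-of-envelope check on how these metrics relate, throughput is simply IOPS multiplied by I/O size. The sketch below uses illustrative figures (25k IOPS of 1 MiB reads for training, 500k IOPS of 4 KiB reads for inference), not measured or vendor numbers:

```python
# Illustrative sketch: how IOPS and I/O size combine into throughput.
# All figures are assumptions for demonstration, not specifications.

def throughput_gbps(iops: float, block_size_bytes: int) -> float:
    """Sustained throughput in GB/s implied by an IOPS rate and I/O size."""
    return iops * block_size_bytes / 1e9

# Training: fewer, larger I/Os -- throughput-bound.
train = throughput_gbps(iops=25_000, block_size_bytes=1_048_576)   # 1 MiB reads
# Inference: many tiny I/Os -- IOPS- and latency-bound.
infer = throughput_gbps(iops=500_000, block_size_bytes=4_096)      # 4 KiB reads

print(f"Training:  {train:.1f} GB/s from 25k IOPS of 1 MiB reads")
print(f"Inference: {infer:.2f} GB/s from 500k IOPS of 4 KiB reads")
```

Note how the inference workload sustains 20x the IOPS yet a fraction of the throughput, which is why the two phases stress different storage metrics.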

```mermaid
flowchart LR
    subgraph AI_Workload["AI Workload"]
        direction TB
        T["Training Phase"]
        I["Inference Phase"]
    end
    subgraph Metrics["Storage Metrics"]
        direction TB
        IOPS["IOPS\n(Operations/sec)"]
        TP["Throughput\n(GB/s)"]
        LAT["Latency\n(Response Time)"]
    end
    subgraph Targets["Performance Targets"]
        direction TB
        T1["100s of thousands\nto millions IOPS"]
        T2["Tens to hundreds\nof GB/s"]
        T3["Sub-millisecond;\nmicroseconds for NVMe"]
    end
    T -- "Large sequential reads" --> TP
    T -- "Checkpoint writes" --> TP
    I -- "Many small random I/Os" --> IOPS
    I -- "Real-time responses" --> LAT
    IOPS --> T1
    TP --> T2
    LAT --> T3
    style T fill:#4a90d9,color:#fff
    style I fill:#d94a4a,color:#fff
    style IOPS fill:#f0ad4e,color:#000
    style TP fill:#f0ad4e,color:#000
    style LAT fill:#f0ad4e,color:#000
```

Training vs. Inference Storage Demands

| Characteristic | Training | Inference |
| --- | --- | --- |
| I/O Pattern | Large sequential reads | Small random reads/writes |
| Critical Metric | Throughput (GB/s) | IOPS and latency |
| Data Volume | Terabytes to petabytes per job | Kilobytes to megabytes per request |
| Concurrency | Moderate (batch feeding) | Very high (thousands of simultaneous requests) |
| Checkpoint Writes | Periodic, large sequential writes | Rarely applicable |

Capacity Planning

AI datasets are growing exponentially. A single large language model may train on datasets measured in hundreds of terabytes. Capacity planning must account for more than the raw data: dataset copies and versions, preprocessing and feature-engineering outputs, and periodic training checkpoints all consume space.

Plan for 3-5x the raw dataset size when accounting for all copies, preprocessing outputs, and checkpoint storage.
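The 3-5x rule of thumb fits in a few lines of code; the function name and defaults here are ours, not from any standard tool:

```python
def provisioned_capacity_tb(raw_tb: float, low: float = 3.0, high: float = 5.0):
    """Return the (min, max) provisioning range for a raw dataset size,
    using the 3-5x rule of thumb for copies, preprocessing, and checkpoints."""
    return raw_tb * low, raw_tb * high

# The 50 TB scenario from the post-quiz:
lo, hi = provisioned_capacity_tb(50)
print(f"50 TB raw dataset -> provision {lo:.0f}-{hi:.0f} TB")
```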

Scalability Patterns

```mermaid
flowchart TD
    subgraph ScaleUp["Scale-Up (Vertical)"]
        direction TB
        SU1["Single Node\n4 drives, 10 GB/s"]
        SU2["Single Node\n8 drives, 20 GB/s"]
        SU3["Single Node\n16 drives, 40 GB/s"]
        SU1 -->|"Add faster/more\ndrives"| SU2
        SU2 -->|"Add faster/more\ndrives"| SU3
        SU3 -->|"Hardware\nceiling reached"| LIMIT["Cannot scale\nfurther"]
    end
    subgraph ScaleOut["Scale-Out (Horizontal)"]
        direction TB
        N1["Node 1\n10 GB/s"]
        N2["Node 2\n10 GB/s"]
        N3["Node 3\n10 GB/s"]
        N4["Node N\n10 GB/s"]
        CLUSTER["Cluster Total:\nN x 10 GB/s\nLinear scaling"]
        N1 --> CLUSTER
        N2 --> CLUSTER
        N3 --> CLUSTER
        N4 --> CLUSTER
    end
    style LIMIT fill:#d94a4a,color:#fff
    style CLUSTER fill:#5cb85c,color:#fff
```

Modern AI storage architectures favor scale-out designs because they allow capacity and performance to grow linearly with node count.
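A toy model of the two scaling patterns makes the contrast concrete; the 10 GB/s-per-node baseline and the 40 GB/s single-chassis ceiling are the illustrative figures from the diagram, not real hardware limits:

```python
SCALE_UP_CEILING_GBPS = 40.0  # hypothetical single-chassis limit (from the diagram)

def scale_up_throughput(drives: int) -> float:
    """Scale-up: 4 drives -> 10 GB/s, doubling drives doubles throughput,
    until the single-node hardware ceiling is reached."""
    return min(10.0 * (drives / 4), SCALE_UP_CEILING_GBPS)

def scale_out_throughput(nodes: int, per_node_gbps: float = 10.0) -> float:
    """Scale-out: ideal linear aggregate growth with node count, no ceiling
    (real clusters lose some of this to network and coordination overhead)."""
    return nodes * per_node_gbps

print(scale_up_throughput(64))    # capped at the 40 GB/s ceiling
print(scale_out_throughput(64))   # 640 GB/s, still scaling linearly
```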

Redundancy and Availability

The cost of retraining a model due to data loss far exceeds the cost of implementing proper redundancy. Key techniques include replication (full copies of data, including multi-site replication for disaster recovery) and erasure coding, which distributes data and parity fragments across multiple nodes and is considerably more storage-efficient than keeping full replicas.

Key Takeaway: AI storage evaluation requires balancing four interdependent factors -- performance (IOPS, throughput, latency), capacity, scalability, and redundancy. Training prioritizes sequential throughput; inference demands high IOPS and ultra-low latency.
Animation: Interactive visualization showing how GPU utilization drops as storage latency increases, illustrating the cost of I/O starvation
Post-Quiz: Storage Requirements for AI

1. A training job processes a 50 TB dataset. Approximately how much total storage should be provisioned accounting for versioning, checkpoints, and preprocessing?

A) 50 TB
B) 100 TB
C) 150-250 TB
D) 500 TB

2. An inference service handling thousands of simultaneous requests with kilobyte-sized payloads should optimize for which metric?

A) Sequential throughput (GB/s)
B) IOPS and latency
C) Raw capacity (PB)
D) Write throughput

3. Why do modern AI storage architectures favor scale-out over scale-up?

A) Scale-out is always cheaper per TB
B) Scale-out avoids hardware ceilings and allows linear growth in capacity and performance
C) Scale-up is no longer supported by vendors
D) Scale-out requires less network bandwidth

4. Which redundancy technique is most storage-efficient for large-scale AI deployments?

A) RAID 10 (mirroring + striping)
B) Triple replication
C) Erasure coding
D) RAID 0 (striping only)

5. During AI training, periodic checkpoint writes are characterized as:

A) Small random writes at high IOPS
B) Large sequential writes
C) Metadata-only operations
D) Read-only operations

Section 2: Storage Protocols and Technologies

Pre-Quiz: Storage Protocols and Technologies

1. What type of flow control does Fibre Channel use to prevent congestion?

A) Drop-and-retransmit (like Ethernet)
B) Credit-based flow control
C) Token-based flow control
D) Sliding window flow control

2. How many command queues does NVMe support compared to legacy SCSI?

A) 32 queues vs. 1 queue
B) 256 queues vs. 32 queues
C) Up to 65,535 queues vs. 1 queue (with 32 entries)
D) 1,024 queues vs. 64 queues

3. Which NVMe-oF transport offers the lowest latency?

A) NVMe/FC
B) NVMe/TCP
C) NVMe/RDMA (RoCEv2 or InfiniBand)
D) NVMe/iSCSI

4. Which storage type offers virtually unlimited scalability at the lowest cost per GB?

A) Block storage (SAN)
B) File storage (NAS)
C) Object storage
D) Direct-attached storage (DAS)

5. What unique capability do Cisco MDS 9000 Series switches provide for storage analytics?

A) Software-based packet sampling
B) On-chip analytics calculating 70+ metrics per I/O flow
C) NetFlow-based traffic analysis
D) SNMP trap-based monitoring only

Key Points

Storage Area Networks (SAN) and Fibre Channel

A SAN is a dedicated high-speed network providing block-level access to storage. Fibre Channel (FC) is the traditional SAN backbone, and its defining characteristic is credit-based flow control: a frame is transmitted only when the receiver has buffer credits available, making the fabric lossless -- in contrast to standard Ethernet's drop-and-retransmit behavior.

The Cisco MDS 9000 Series inspects FC and SCSI/NVMe headers of all I/O exchanges, calculating more than 70 metrics per I/O flow using dedicated on-chip hardware -- the industry's first on-chip analytics for NVMe, FC, and FC-SCSI traffic.

NVMe and NVMe over Fabrics (NVMe-oF)

NVMe replaced the legacy SCSI command set with a protocol designed for flash storage. Where SCSI uses a single queue with 32 entries, NVMe supports up to 65,535 queues with 65,536 entries each.

NVMe-oF extends NVMe across a network fabric. By avoiding SCSI emulation layers entirely, its fastest transports keep end-to-end latency in the 20-30 microsecond range.
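The queue arithmetic behind the SCSI-vs-NVMe comparison is worth spelling out (these are the protocol-specification maximums, not what any single device exposes in practice):

```python
# Parallelism arithmetic from the SCSI vs NVMe comparison above.
scsi_outstanding = 1 * 32            # one queue with 32 entries
nvme_outstanding = 65_535 * 65_536   # up to 65,535 queues x 65,536 entries each

print(f"SCSI: {scsi_outstanding} outstanding commands")
print(f"NVMe: {nvme_outstanding:,} outstanding commands "
      f"(~{nvme_outstanding // scsi_outstanding:,}x more parallelism)")
```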

| Transport | Network | Typical Latency | Key Advantage | Key Limitation |
| --- | --- | --- | --- | --- |
| NVMe/FC | Fibre Channel | 50-100 µs | Leverages existing FC infrastructure | Requires FC switches and HBAs |
| NVMe/RDMA | RoCEv2 or InfiniBand | 20-30 µs | Lowest latency; bypasses CPU | Requires lossless Ethernet (PFC/ECN) |
| NVMe/TCP | Standard Ethernet/TCP | Higher than RDMA | Works on any TCP/IP network | CPU overhead from TCP processing |
```mermaid
flowchart TD
    NVMe["NVMe Protocol\n65,535 queues x 65,536 entries"]
    NVMe --> NVMeoF["NVMe over Fabrics\n(NVMe-oF)"]
    NVMeoF --> FC["NVMe/FC\nFibre Channel\n50-100 µs"]
    NVMeoF --> RDMA["NVMe/RDMA\nRoCEv2 or InfiniBand\n20-30 µs"]
    NVMeoF --> TCP["NVMe/TCP\nStandard Ethernet\nHigher latency"]
    FC --> MDS["Cisco MDS 9000\nFC Switches"]
    RDMA --> NEXUS1["Cisco Nexus 9000\nLossless Ethernet\n(PFC/ECN)"]
    TCP --> NEXUS2["Cisco Nexus 9000\nStandard Ethernet"]
    MDS --> STORAGE["NVMe Storage\nArrays"]
    NEXUS1 --> STORAGE
    NEXUS2 --> STORAGE
    style NVMe fill:#4a90d9,color:#fff
    style NVMeoF fill:#5bc0de,color:#000
    style FC fill:#f0ad4e,color:#000
    style RDMA fill:#5cb85c,color:#fff
    style TCP fill:#d9534f,color:#fff
    style STORAGE fill:#6c757d,color:#fff
```

NVMe-oF enables composable disaggregated infrastructure (CDI), where storage, compute, GPU, and FPGA resources can be independently scaled. Direct GPU-to-storage data access through DPUs bypasses CPU bottlenecks entirely.

Block vs. File vs. Object Storage

```mermaid
flowchart TD
    AI["AI Storage\nRequirements"]
    AI --> BLOCK["Block Storage (SAN)\nFC, iSCSI, NVMe-oF"]
    AI --> FILE["File Storage (NAS)\nNFS, SMB/CIFS"]
    AI --> OBJ["Object Storage\nS3, REST API"]
    BLOCK --> B_USE["GPU training feeds\nCheckpoints\nDatabases"]
    FILE --> F_USE["Inference serving\nShared datasets\nParallel FS for HPC"]
    OBJ --> O_USE["Data lakes\nRaw archives\nUnstructured data"]
    style BLOCK fill:#d94a4a,color:#fff
    style FILE fill:#f0ad4e,color:#000
    style OBJ fill:#5cb85c,color:#fff
    style B_USE fill:#f5c6cb,color:#000
    style F_USE fill:#ffeeba,color:#000
    style O_USE fill:#c3e6cb,color:#000
```
| Feature | Block (SAN) | File (NAS) | Object |
| --- | --- | --- | --- |
| Protocol | FC, iSCSI, NVMe-oF | NFS, SMB/CIFS | S3, REST API |
| Latency | Microseconds (NVMe-oF) to low ms | Low ms to tens of ms | Ms to hundreds of ms |
| IOPS | Highest | Moderate | Lower |
| Scalability | Moderate (scale-up) | Moderate to high | Virtually unlimited |
| Cost per GB | Highest | Moderate | Lowest |
| Best AI Use | Training GPU feeds, checkpoints | Inference, shared datasets | Data lakes, archives |

Parallel file systems (GPFS/Spectrum Scale, Lustre, WEKA, BeeGFS) extend NAS concepts for HPC by striping data across multiple storage nodes, providing concurrent access from thousands of clients.
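To illustrate the striping idea (a generic round-robin sketch, not any specific file system's layout policy), here is how chunks of one file map onto multiple storage nodes so that reads can proceed in parallel:

```python
def stripe_placement(file_size_mb: int, stripe_mb: int, nodes: int):
    """Round-robin chunk-to-node assignment, the core idea behind striping
    in parallel file systems. Returns (chunk_index, node_index) pairs."""
    chunks = -(-file_size_mb // stripe_mb)  # ceiling division
    return [(i, i % nodes) for i in range(chunks)]

# A 16 MB file striped in 4 MB chunks across 3 storage nodes:
placement = stripe_placement(16, 4, 3)
print(placement)  # chunk 3 wraps back to node 0
```

Because consecutive chunks live on different nodes, a client (or thousands of clients) can read them concurrently, multiplying effective throughput by the node count.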

Key Takeaway: No single storage type fits all AI workload needs. Block storage with NVMe-oF delivers peak performance for training, file storage/parallel FS provides flexible shared access for inference, and object storage offers unmatched scalability for data lakes. Most production AI environments use all three in a tiered architecture.
Animation: Interactive comparison showing data flow through block, file, and object storage paths with latency and throughput meters
Post-Quiz: Storage Protocols and Technologies

1. A data center is deploying NVMe-oF and needs the absolute lowest latency. Which transport should they choose, and what infrastructure does it require?

A) NVMe/TCP -- requires standard Ethernet switches
B) NVMe/RDMA -- requires lossless Ethernet with PFC/ECN or InfiniBand
C) NVMe/FC -- requires Fibre Channel switches
D) NVMe/iSCSI -- requires iSCSI initiators

2. Which Cisco switch platform would you use for FC-NVMe connectivity with on-chip SAN analytics?

A) Cisco Nexus 9000
B) Cisco Catalyst 9000
C) Cisco MDS 9000 Series
D) Cisco Nexus 3000

3. An organization needs to store 10 PB of raw training data cost-effectively with S3-compatible access. Which storage type is most appropriate?

A) Block storage (SAN)
B) File storage (NAS)
C) Object storage
D) Direct-attached NVMe

4. What fundamental difference in flow control distinguishes Fibre Channel from standard Ethernet?

A) FC uses larger frame sizes
B) FC uses credit-based flow control (lossless) vs. Ethernet's drop-and-retransmit
C) FC operates at higher speeds
D) FC uses IP-based addressing

5. What does NVMe-oF enable in terms of infrastructure design?

A) Fixed server-storage ratios
B) Composable disaggregated infrastructure (CDI) with independently scalable resources
C) Elimination of all network switches
D) Direct-attached storage only

Section 3: Data Preparation for AI

Pre-Quiz: Data Preparation for AI

1. What percentage of a typical ML project is spent on data preparation?

A) 10-20%
B) 30-40%
C) 60-80%
D) 90-95%

2. In the medallion architecture, which layer contains raw, unprocessed data?

A) Gold layer
B) Silver layer
C) Bronze layer
D) Platinum layer

3. What is "data leakage" in the context of ML data preparation?

A) Data being stolen by attackers
B) Data loss due to hardware failure
C) Model accessing information during training that would not be available at inference time
D) Data being duplicated across storage tiers

4. How many core steps are in the data preparation process?

A) 3
B) 5
C) 7
D) 10

5. For time-series data, how should the train/test split be performed to avoid data leakage?

A) Random shuffling
B) Stratified random sampling
C) Chronological splitting
D) K-fold cross-validation

Key Points

The Data Preparation Process

Data preparation transforms raw, messy data into clean, structured inputs for ML algorithms. The process follows seven core steps:

| Step | Description | Storage Implication |
| --- | --- | --- |
| 1. Collection | Gather data from databases, APIs, IoT sensors, logs | High-throughput object storage for landing zone |
| 2. Cleaning | Handle missing values, remove duplicates, fix inconsistencies | Fast read/write on working tier |
| 3. Integration | Combine data from multiple sources via ETL/ELT | Schema mapping across systems; network bandwidth critical |
| 4. Transformation | Normalize, encode, scale features | High-IOPS storage for iterative processing |
| 5. Feature Engineering | Extract, select, and create predictive features | Intermediate storage for feature stores |
| 6. Validation | Verify data quality, detect drift, check schema | Metadata-rich storage for lineage tracking |
| 7. Splitting | Divide into training, validation, and test sets | Multiple copies on training-tier storage |
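A toy, in-memory sketch of three of the seven steps (cleaning, transformation, splitting); real pipelines would use tooling such as pandas or Spark, and the dataset here is invented for illustration:

```python
# Toy dataset: sensor readings, one of which is missing.
raw = [{"temp": 21.0}, {"temp": None}, {"temp": 19.5}, {"temp": 23.5}]

# Step 2 -- Cleaning: drop records with missing values.
clean = [r for r in raw if r["temp"] is not None]

# Step 4 -- Transformation: min-max normalize the feature to [0, 1].
lo = min(r["temp"] for r in clean)
hi = max(r["temp"] for r in clean)
scaled = [(r["temp"] - lo) / (hi - lo) for r in clean]

# Step 7 -- Splitting: hold out the final 25% as a test set.
cut = int(len(scaled) * 0.75)
train, test = scaled[:cut], scaled[cut:]
print(train, test)
```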

Pipeline Architecture and the Medallion Pattern

```mermaid
flowchart LR
    subgraph Bronze["Bronze Layer\nObject Storage (S3)"]
        C1["1. Collection\nRaw data ingest"]
    end
    subgraph Silver["Silver Layer\nParallel File System"]
        C2["2. Cleaning"]
        C3["3. Integration"]
        C4["4. Transformation"]
        C5["5. Feature Engineering"]
        C6["6. Validation"]
    end
    subgraph Gold["Gold Layer\nNVMe Block Storage"]
        C7["7. Splitting\nTrain / Val / Test"]
        GPU["GPU Cluster\nModel Training"]
    end
    C1 --> C2 --> C3 --> C4 --> C5 --> C6 --> C7 --> GPU
    style Bronze fill:#cd7f32,color:#fff
    style Silver fill:#c0c0c0,color:#000
    style Gold fill:#ffd700,color:#000
```

Storage tiers mapped to pipeline stages: raw data lands in the bronze layer on S3-compatible object storage; cleaning, integration, transformation, feature engineering, and validation run in the silver layer on a parallel file system; and the final train/validation/test splits live in the gold layer on NVMe block storage, ready to feed the GPU cluster.

Data Quality, Labeling, and Common Pitfalls

Data Leakage is the most dangerous data preparation failure. The model appears to perform brilliantly in testing but fails in production because it accessed information during training that would not be available at inference time. For time-series data, always split chronologically -- never randomly.
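A small demonstration of the split rule (with an invented time-ordered dataset): a chronological cut keeps every training timestamp earlier than every test timestamp, while a random shuffle can place future records in the training set.

```python
import random

# Ten days of time-ordered observations (illustrative data).
records = [{"day": d, "value": d * 1.5} for d in range(10)]

# Correct for time series: split on time -- train strictly precedes test.
train_chrono, test_chrono = records[:8], records[8:]
assert max(r["day"] for r in train_chrono) < min(r["day"] for r in test_chrono)

# Dangerous for time series: shuffling mixes future days into training.
shuffled = records[:]
random.Random(0).shuffle(shuffled)
train_rand, test_rand = shuffled[:8], shuffled[8:]
leaks = max(r["day"] for r in train_rand) > min(r["day"] for r in test_rand)
print(f"Random split leaks future data: {leaks}")
```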

Data Versioning is essential for reproducibility. Without tracking dataset changes over time, teams cannot trace model outputs back to specific data states.
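One lightweight way to realize versioning is to content-hash the dataset, so each model run can record exactly which data state it trained on. Production teams typically use purpose-built tools such as DVC or lakeFS; this sketch only illustrates the idea:

```python
import hashlib
import json

def dataset_version(records: list) -> str:
    """Derive a short, deterministic version identifier from dataset content.
    Any change to the data yields a different identifier."""
    payload = json.dumps(records, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

v1 = dataset_version([{"id": 1, "label": "cat"}])
v2 = dataset_version([{"id": 1, "label": "dog"}])
print(v1, v2)
assert v1 != v2  # relabeling one record produces a new dataset version
```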

Privacy and Compliance under GDPR, CCPA, and similar regulations impose strict controls. Storage must support access controls, encryption at rest, and audit logging throughout the pipeline.

Key Takeaway: Data preparation is deeply intertwined with storage architecture. Each pipeline stage has distinct storage requirements. The medallion architecture (bronze/silver/gold) maps storage tiers to data maturity stages.
Animation: Step-by-step walkthrough of data flowing through the medallion architecture, showing transformations at each layer with storage tier indicators
Post-Quiz: Data Preparation for AI

1. In the worked example of an image classification pipeline, where do raw images initially land?

A) NVMe block storage
B) Parallel file system
C) S3-compatible object storage (bronze layer)
D) GPU local storage

2. Which medallion layer maps to NVMe block storage for GPU-ready training data?

A) Bronze layer
B) Silver layer
C) Gold layer
D) Archive layer

3. Why is random shuffling dangerous for time-series train/test splits?

A) It reduces dataset size
B) It leaks future information into the training set
C) It causes class imbalance
D) It requires more storage

4. What storage characteristic is most important for the data cleaning and transformation steps?

A) Maximum capacity at lowest cost
B) High IOPS for iterative read/write processing
C) Rich custom metadata
D) S3-compatible API access

5. Why is data versioning essential in ML workflows?

A) It reduces storage costs
B) It enables reproducibility by tracing model outputs to specific data states
C) It speeds up training
D) It replaces the need for backups

Section 4: Software-Defined Storage and Data Strategies

Pre-Quiz: Software-Defined Storage and Data Strategies

1. What does software-defined storage (SDS) decouple?

A) Compute from networking
B) Storage management from underlying hardware
C) Applications from operating systems
D) GPUs from CPUs

2. Which storage tier (Tier 0) provides nanosecond-to-microsecond latency for active model weights?

A) NVMe SSD arrays
B) Object storage (S3)
C) GPU HBM / local NVMe
D) Parallel file system

3. What is the primary purpose of a caching layer (like Alluxio or WEKA) in a tiered AI storage architecture?

A) Replace object storage entirely
B) Keep frequently accessed data on faster media while fetching less-used data from slower tiers
C) Provide backup copies of data
D) Compress data for long-term archival

4. In a Cisco-based tiered storage deployment, which platform handles the hot tier with NVMe/FC?

A) Cisco Nexus 9000
B) Cisco MDS 9000
C) Cisco Catalyst 9000
D) Cisco UCS

5. What architecture uses a flat structure with object storage to store data with metadata tags and unique identifiers?

A) Data warehouse
B) Data lake
C) Data mart
D) Data cube

Key Points

Software-Defined Storage (SDS) for AI

Software-defined storage decouples storage management from hardware, placing intelligence in a software layer that manages heterogeneous devices through a unified control plane.

Analogy: Think of SDS like a smart traffic management system. Traditional storage is like fixed, dedicated roads for each destination. SDS acts as an intelligent routing layer that dynamically directs traffic across whatever roads exist, adding new routes as needed.

| Benefit | Description | AI Impact |
| --- | --- | --- |
| Disaggregation | Independent scaling of compute, storage, networking | Scale storage without buying new GPU servers |
| Hardware abstraction | Manage diverse devices through single interface | Mix NVMe, SSD, and HDD tiers seamlessly |
| Automation | Programmatic provisioning, monitoring, maintenance | Reduce OpEx; faster experiment iteration |
| Cost efficiency | Eliminate overprovisioning; use commodity hardware | Lower per-TB cost for multi-PB data lakes |
| Policy-driven tiering | Auto-move data between tiers based on access patterns | Hot training data on NVMe; cold archives on object |
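Policy-driven tiering reduces to a placement rule keyed on access patterns. The sketch below uses days-since-last-access with illustrative thresholds -- they are not any product's defaults:

```python
# Assumed policy: idle-time thresholds chosen for illustration only.
TIER_POLICY = [
    (1.0, "Tier 1 NVMe (hot)"),                    # touched within a day
    (30.0, "Tier 2 parallel FS (warm)"),           # touched within a month
    (float("inf"), "Tier 3 object storage (cold)") # everything older
]

def place(days_idle: float) -> str:
    """Return the tier a dataset should live on, given days since last access."""
    for max_idle, tier in TIER_POLICY:
        if days_idle <= max_idle:
            return tier
    return TIER_POLICY[-1][1]

print(place(0.5))   # active training data stays hot
print(place(7))     # shared dataset drops to warm
print(place(365))   # archive lands on object storage
```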

Data Tiering and Caching Strategies

| Tier | Media | Latency | Cost | AI Use Case |
| --- | --- | --- | --- | --- |
| Tier 0 (Cache) | GPU HBM / local NVMe | Nanoseconds to microseconds | Highest | Active model weights, current mini-batch |
| Tier 1 (Hot) | NVMe SSD arrays | Microseconds | High | Active training datasets, checkpoints |
| Tier 2 (Warm) | Parallel FS / SAS SSD | Low milliseconds | Moderate | Shared datasets, feature stores |
| Tier 3 (Cold) | Object storage (S3) | Milliseconds to seconds | Lowest | Data lake, raw archives, compliance |
```mermaid
flowchart TD
    GPU["GPU Cluster"]
    GPU ---|"Nanoseconds-\nmicroseconds"| T0
    subgraph T0["Tier 0 - Cache"]
        HBM["GPU HBM /\nLocal NVMe"]
    end
    T0 ---|"Microseconds"| T1
    subgraph T1["Tier 1 - Hot"]
        NVME["NVMe SSD Arrays\n(Cisco MDS 9000 + NVMe/FC)"]
    end
    T1 ---|"Low\nmilliseconds"| T2
    subgraph T2["Tier 2 - Warm"]
        PFS["Parallel File System\n(Cisco Nexus 9000 + NFS)"]
    end
    T2 ---|"Milliseconds\nto seconds"| T3
    subgraph T3["Tier 3 - Cold"]
        OBJ["Object Storage - S3\n(Cisco Nexus 9000)"]
    end
    CACHE["Caching Layer\n(WEKA / Alluxio)"]
    T3 -.->|"Pre-stage next\nepoch data"| CACHE
    CACHE -.->|"Promote to\nhot tier"| T1
    style T0 fill:#d94a4a,color:#fff
    style T1 fill:#f0ad4e,color:#000
    style T2 fill:#5bc0de,color:#000
    style T3 fill:#5cb85c,color:#fff
    style CACHE fill:#9b59b6,color:#fff
```

Storage Integration with Cisco AI Infrastructure

Fibre Channel SAN (Cisco MDS 9000 Series)

Carries FC and FC-NVMe traffic for the hot storage tier, with on-chip analytics across every I/O flow.

Ethernet Storage Fabric (Cisco Nexus 9000 Series / ACI)

Carries NFS, S3, and NVMe/TCP traffic, and provides the Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) required for lossless NVMe/RDMA.

Worked Example: 64-GPU Training Cluster

  1. Hot tier (MDS 9000 + NVMe/FC): 200 TB NVMe all-flash via 32 Gbps FC at 100+ GB/s aggregate
  2. Warm tier (Nexus 9000 + NFS): 1 PB parallel file system via 100 GbE for shared datasets and checkpoints
  3. Cold tier (Nexus 9000 + S3): 10 PB object storage for complete data lake and archives
  4. Caching layer: Distributed cache (WEKA/Alluxio) on NVMe nodes pre-stages next epoch data
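A quick sanity check on the worked example's hot tier: does 100 GB/s of aggregate throughput keep 64 GPUs fed? The per-GPU ingest rate below is an assumed figure for illustration, not from the source:

```python
gpus = 64
hot_tier_gbps = 100.0      # aggregate hot-tier throughput from the worked example
per_gpu_need_gbps = 1.5    # assumed per-GPU data ingest rate during training

supplied_per_gpu = hot_tier_gbps / gpus   # 1.5625 GB/s available per GPU
ok = supplied_per_gpu >= per_gpu_need_gbps
print(f"{supplied_per_gpu:.2f} GB/s per GPU available vs "
      f"{per_gpu_need_gbps} GB/s assumed demand -> {'OK' if ok else 'starved'}")
```

Under these assumptions the tier just keeps up; a faster GPU generation or larger samples would shift the balance, which is exactly the calculation capacity planners rerun for each deployment.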
Key Takeaway: Software-defined storage provides the abstraction and automation layer needed to manage multi-tier AI storage architectures at scale. Combined with Cisco MDS (Fibre Channel) and Nexus 9000 (Ethernet) switching, organizations deliver NVMe-class performance where needed while leveraging cost-effective object storage for bulk data.
Animation: Interactive tiered storage simulator showing data movement between tiers as training progresses, with cost and latency indicators per tier
Post-Quiz: Software-Defined Storage and Data Strategies

1. A 64-GPU training cluster needs 100+ GB/s aggregate throughput for its hot storage tier. Which Cisco platform and protocol combination is described for this role?

A) Cisco Nexus 9000 with NFS
B) Cisco MDS 9000 with NVMe/FC over 32 Gbps Fibre Channel
C) Cisco Catalyst 9000 with iSCSI
D) Cisco Nexus 3000 with NVMe/TCP

2. Which SDS benefit allows organizations to scale storage independently without purchasing new GPU servers?

A) Automation
B) Cost efficiency
C) Disaggregation
D) Hardware abstraction

3. In the tiered architecture, what does the caching layer (WEKA/Alluxio) specifically do for AI training?

A) Replaces NVMe storage entirely
B) Pre-stages the next training epoch's data from cold to hot tier
C) Compresses checkpoint files
D) Provides backup copies to a remote site

4. What Cisco Nexus 9000 feature creates the lossless Ethernet environment required for NVMe/RDMA?

A) VXLAN overlay
B) Priority Flow Control (PFC) and Explicit Congestion Notification (ECN)
C) OSPF routing
D) Spanning Tree Protocol

5. Why have data lakes built on S3-compatible object storage become the de facto standard for AI/ML data repositories?

A) They provide the lowest latency
B) They use flat architecture with metadata tags, enabling virtually limitless scalability at low cost with decoupled compute
C) They support Fibre Channel natively
D) They require no networking infrastructure


Answer Explanations