Chapter 7: Storage Architecture for AI Workloads

Learning Objectives

Section 1: Storage Requirements for AI

Pre-Quiz: Storage Requirements for AI

1. Which storage performance metric is most critical for AI training workloads?

A) IOPS
B) Throughput (GB/s)
C) Latency
D) Metadata operations per second

2. What is a practical rule of thumb for capacity planning relative to raw dataset size in AI environments?

A) 1.5x the raw dataset size
B) 2x the raw dataset size
C) 3-5x the raw dataset size
D) 10x the raw dataset size

3. Which scaling pattern is preferred for modern AI storage architectures?

A) Scale-up (vertical)
B) Scale-out (horizontal)
C) Scale-down
D) Both are equally preferred

4. Which redundancy technique distributes data and parity across multiple nodes and is more storage-efficient than replication?

A) RAID 10
B) Multi-site replication
C) Erasure coding
D) Mirroring

5. AI inference workloads primarily demand which performance characteristic?

A) Large sequential throughput
B) High IOPS and low latency
C) Maximum capacity
D) Write-optimized storage

Key Points

Performance Metrics: IOPS, Throughput, and Latency

Storage in an AI data center is not a passive repository -- it is the fuel line feeding the GPU engines. When that fuel line cannot deliver data fast enough, even the most powerful GPU cluster sits idle.

| Metric | Definition | Primary AI Phase | Target Range |
| --- | --- | --- | --- |
| IOPS | Discrete read/write operations per second | Inference (many small, random I/Os) | Hundreds of thousands to millions |
| Throughput | Volume of data transferred per unit time (GB/s) | Training (large sequential reads) | Tens to hundreds of GB/s aggregate |
| Latency | Time between I/O request and response | Both (idle GPU cycles = wasted money) | Sub-millisecond; microseconds for NVMe |

Highway analogy: IOPS = number of cars entering per second. Throughput = total cargo tonnage per second. Latency = travel time per car. Training needs wide highways with heavy trucks (throughput). Inference needs fast sports cars arriving constantly (high IOPS, low latency).
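As a back-of-envelope check on how these metrics relate, throughput is simply IOPS multiplied by I/O size. The sketch below uses illustrative figures (25k IOPS of 1 MiB reads for training, 500k IOPS of 4 KiB reads for inference), not measured or vendor numbers:

```python
# Illustrative sketch: how IOPS and I/O size combine into throughput.
# All figures are assumptions for demonstration, not specifications.

def throughput_gbps(iops: float, block_size_bytes: int) -> float:
    """Sustained throughput in GB/s implied by an IOPS rate and I/O size."""
    return iops * block_size_bytes / 1e9

# Training: fewer, larger I/Os -- throughput-bound.
train = throughput_gbps(iops=25_000, block_size_bytes=1_048_576)   # 1 MiB reads
# Inference: many tiny I/Os -- IOPS- and latency-bound.
infer = throughput_gbps(iops=500_000, block_size_bytes=4_096)      # 4 KiB reads

print(f"Training:  {train:.1f} GB/s from 25k IOPS of 1 MiB reads")
print(f"Inference: {infer:.2f} GB/s from 500k IOPS of 4 KiB reads")
```

Note how the inference workload sustains 20x the IOPS yet a fraction of the throughput, which is why the two phases stress different storage metrics.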

```mermaid
flowchart LR
    subgraph AI_Workload["AI Workload"]
        direction TB
        T["Training Phase"]
        I["Inference Phase"]
    end
    subgraph Metrics["Storage Metrics"]
        direction TB
        IOPS["IOPS\n(Operations/sec)"]
        TP["Throughput\n(GB/s)"]
        LAT["Latency\n(Response Time)"]
    end
    subgraph Targets["Performance Targets"]
        direction TB
        T1["100s of thousands\nto millions IOPS"]
        T2["Tens to hundreds\nof GB/s"]
        T3["Sub-millisecond;\nmicroseconds for NVMe"]
    end
    T -- "Large sequential reads" --> TP
    T -- "Checkpoint writes" --> TP
    I -- "Many small random I/Os" --> IOPS
    I -- "Real-time responses" --> LAT
    IOPS --> T1
    TP --> T2
    LAT --> T3
    style T fill:#4a90d9,color:#fff
    style I fill:#d94a4a,color:#fff
    style IOPS fill:#f0ad4e,color:#000
    style TP fill:#f0ad4e,color:#000
    style LAT fill:#f0ad4e,color:#000
```

Training vs. Inference Storage Demands

| Characteristic | Training | Inference |
| --- | --- | --- |
| I/O Pattern | Large sequential reads | Small random reads/writes |
| Critical Metric | Throughput (GB/s) | IOPS and latency |
| Data Volume | Terabytes to petabytes per job | Kilobytes to megabytes per request |
| Concurrency | Moderate (batch feeding) | Very high (thousands of simultaneous requests) |
| Checkpoint Writes | Periodic, large sequential writes | Rarely applicable |

Capacity Planning

AI datasets are growing exponentially. A single large language model may train on datasets measured in hundreds of terabytes. Capacity planning must account for more than the raw data: dataset copies and versions, preprocessing and feature-engineering outputs, and periodic training checkpoints all consume space.

Plan for 3-5x the raw dataset size when accounting for all copies, preprocessing outputs, and checkpoint storage.
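The 3-5x rule of thumb fits in a few lines of code; the function name and defaults here are ours, not from any standard tool:

```python
def provisioned_capacity_tb(raw_tb: float, low: float = 3.0, high: float = 5.0):
    """Return the (min, max) provisioning range for a raw dataset size,
    using the 3-5x rule of thumb for copies, preprocessing, and checkpoints."""
    return raw_tb * low, raw_tb * high

# The 50 TB scenario from the post-quiz:
lo, hi = provisioned_capacity_tb(50)
print(f"50 TB raw dataset -> provision {lo:.0f}-{hi:.0f} TB")
```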

Scalability Patterns

```mermaid
flowchart TD
    subgraph ScaleUp["Scale-Up (Vertical)"]
        direction TB
        SU1["Single Node\n4 drives, 10 GB/s"]
        SU2["Single Node\n8 drives, 20 GB/s"]
        SU3["Single Node\n16 drives, 40 GB/s"]
        SU1 -->|"Add faster/more\ndrives"| SU2
        SU2 -->|"Add faster/more\ndrives"| SU3
        SU3 -->|"Hardware\nceiling reached"| LIMIT["Cannot scale\nfurther"]
    end
    subgraph ScaleOut["Scale-Out (Horizontal)"]
        direction TB
        N1["Node 1\n10 GB/s"]
        N2["Node 2\n10 GB/s"]
        N3["Node 3\n10 GB/s"]
        N4["Node N\n10 GB/s"]
        CLUSTER["Cluster Total:\nN x 10 GB/s\nLinear scaling"]
        N1 --> CLUSTER
        N2 --> CLUSTER
        N3 --> CLUSTER
        N4 --> CLUSTER
    end
    style LIMIT fill:#d94a4a,color:#fff
    style CLUSTER fill:#5cb85c,color:#fff
```

Modern AI storage architectures favor scale-out designs because they allow capacity and performance to grow linearly with node count.
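A toy model of the two scaling patterns makes the contrast concrete; the 10 GB/s-per-node baseline and the 40 GB/s single-chassis ceiling are the illustrative figures from the diagram, not real hardware limits:

```python
SCALE_UP_CEILING_GBPS = 40.0  # hypothetical single-chassis limit (from the diagram)

def scale_up_throughput(drives: int) -> float:
    """Scale-up: 4 drives -> 10 GB/s, doubling drives doubles throughput,
    until the single-node hardware ceiling is reached."""
    return min(10.0 * (drives / 4), SCALE_UP_CEILING_GBPS)

def scale_out_throughput(nodes: int, per_node_gbps: float = 10.0) -> float:
    """Scale-out: ideal linear aggregate growth with node count, no ceiling
    (real clusters lose some of this to network and coordination overhead)."""
    return nodes * per_node_gbps

print(scale_up_throughput(64))    # capped at the 40 GB/s ceiling
print(scale_out_throughput(64))   # 640 GB/s, still scaling linearly
```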

Redundancy and Availability

The cost of retraining a model due to data loss far exceeds the cost of implementing proper redundancy. Key techniques include replication (full copies of data, including multi-site replication for disaster recovery) and erasure coding, which distributes data and parity fragments across multiple nodes and is considerably more storage-efficient than keeping full replicas.

Key Takeaway: AI storage evaluation requires balancing four interdependent factors -- performance (IOPS, throughput, latency), capacity, scalability, and redundancy. Training prioritizes sequential throughput; inference demands high IOPS and ultra-low latency.
Animation: Interactive visualization showing how GPU utilization drops as storage latency increases, illustrating the cost of I/O starvation
Post-Quiz: Storage Requirements for AI

1. A training job processes a 50 TB dataset. Approximately how much total storage should be provisioned accounting for versioning, checkpoints, and preprocessing?

A) 50 TB
B) 100 TB
C) 150-250 TB
D) 500 TB

2. An inference service handling thousands of simultaneous requests with kilobyte-sized payloads should optimize for which metric?

A) Sequential throughput (GB/s)
B) IOPS and latency
C) Raw capacity (PB)
D) Write throughput

3. Why do modern AI storage architectures favor scale-out over scale-up?

A) Scale-out is always cheaper per TB
B) Scale-out avoids hardware ceilings and allows linear growth in capacity and performance
C) Scale-up is no longer supported by vendors
D) Scale-out requires less network bandwidth

4. Which redundancy technique is most storage-efficient for large-scale AI deployments?

A) RAID 10 (mirroring + striping)
B) Triple replication
C) Erasure coding
D) RAID 0 (striping only)

5. During AI training, periodic checkpoint writes are characterized as:

A) Small random writes at high IOPS
B) Large sequential writes
C) Metadata-only operations
D) Read-only operations

Section 2: Storage Protocols and Technologies

Pre-Quiz: Storage Protocols and Technologies

1. What type of flow control does Fibre Channel use to prevent congestion?

A) Drop-and-retransmit (like Ethernet)
B) Credit-based flow control
C) Token-based flow control
D) Sliding window flow control

2. How many command queues does NVMe support compared to legacy SCSI?

A) 32 queues vs. 1 queue
B) 256 queues vs. 32 queues
C) Up to 65,535 queues vs. 1 queue (with 32 entries)
D) 1,024 queues vs. 64 queues

3. Which NVMe-oF transport offers the lowest latency?

A) NVMe/FC
B) NVMe/TCP
C) NVMe/RDMA (RoCEv2 or InfiniBand)
D) NVMe/iSCSI

4. Which storage type offers virtually unlimited scalability at the lowest cost per GB?

A) Block storage (SAN)
B) File storage (NAS)
C) Object storage
D) Direct-attached storage (DAS)

5. What unique capability do Cisco MDS 9000 Series switches provide for storage analytics?

A) Software-based packet sampling
B) On-chip analytics calculating 70+ metrics per I/O flow
C) NetFlow-based traffic analysis
D) SNMP trap-based monitoring only

Key Points

Storage Area Networks (SAN) and Fibre Channel

A SAN is a dedicated high-speed network providing block-level access to storage. Fibre Channel (FC) is the traditional SAN backbone, and its defining characteristic is credit-based flow control: a frame is transmitted only when the receiver has buffer credits available, making the fabric lossless -- in contrast to standard Ethernet's drop-and-retransmit behavior.

The Cisco MDS 9000 Series inspects FC and SCSI/NVMe headers of all I/O exchanges, calculating more than 70 metrics per I/O flow using dedicated on-chip hardware -- the industry's first on-chip analytics for NVMe, FC, and FC-SCSI traffic.

NVMe and NVMe over Fabrics (NVMe-oF)

NVMe replaced the legacy SCSI command set with a protocol designed for flash storage. Where SCSI uses a single queue with 32 entries, NVMe supports up to 65,535 queues with 65,536 entries each.

NVMe-oF extends NVMe across a network fabric. By avoiding SCSI emulation layers entirely, its fastest transports keep end-to-end latency in the 20-30 microsecond range.
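The queue arithmetic behind the SCSI-vs-NVMe comparison is worth spelling out (these are the protocol-specification maximums, not what any single device exposes in practice):

```python
# Parallelism arithmetic from the SCSI vs NVMe comparison above.
scsi_outstanding = 1 * 32            # one queue with 32 entries
nvme_outstanding = 65_535 * 65_536   # up to 65,535 queues x 65,536 entries each

print(f"SCSI: {scsi_outstanding} outstanding commands")
print(f"NVMe: {nvme_outstanding:,} outstanding commands "
      f"(~{nvme_outstanding // scsi_outstanding:,}x more parallelism)")
```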

| Transport | Network | Typical Latency | Key Advantage | Key Limitation |
| --- | --- | --- | --- | --- |
| NVMe/FC | Fibre Channel | 50-100 µs | Leverages existing FC infrastructure | Requires FC switches and HBAs |
| NVMe/RDMA | RoCEv2 or InfiniBand | 20-30 µs | Lowest latency; bypasses CPU | Requires lossless Ethernet (PFC/ECN) |
| NVMe/TCP | Standard Ethernet/TCP | Higher than RDMA | Works on any TCP/IP network | CPU overhead from TCP processing |
```mermaid
flowchart TD
    NVMe["NVMe Protocol\n65,535 queues x 65,536 entries"]
    NVMe --> NVMeoF["NVMe over Fabrics\n(NVMe-oF)"]
    NVMeoF --> FC["NVMe/FC\nFibre Channel\n50-100 µs"]
    NVMeoF --> RDMA["NVMe/RDMA\nRoCEv2 or InfiniBand\n20-30 µs"]
    NVMeoF --> TCP["NVMe/TCP\nStandard Ethernet\nHigher latency"]
    FC --> MDS["Cisco MDS 9000\nFC Switches"]
    RDMA --> NEXUS1["Cisco Nexus 9000\nLossless Ethernet\n(PFC/ECN)"]
    TCP --> NEXUS2["Cisco Nexus 9000\nStandard Ethernet"]
    MDS --> STORAGE["NVMe Storage\nArrays"]
    NEXUS1 --> STORAGE
    NEXUS2 --> STORAGE
    style NVMe fill:#4a90d9,color:#fff
    style NVMeoF fill:#5bc0de,color:#000
    style FC fill:#f0ad4e,color:#000
    style RDMA fill:#5cb85c,color:#fff
    style TCP fill:#d9534f,color:#fff
    style STORAGE fill:#6c757d,color:#fff
```

NVMe-oF enables composable disaggregated infrastructure (CDI), where storage, compute, GPU, and FPGA resources can be independently scaled. Direct GPU-to-storage data access through DPUs bypasses CPU bottlenecks entirely.

Block vs. File vs. Object Storage

```mermaid
flowchart TD
    AI["AI Storage\nRequirements"]
    AI --> BLOCK["Block Storage (SAN)\nFC, iSCSI, NVMe-oF"]
    AI --> FILE["File Storage (NAS)\nNFS, SMB/CIFS"]
    AI --> OBJ["Object Storage\nS3, REST API"]
    BLOCK --> B_USE["GPU training feeds\nCheckpoints\nDatabases"]
    FILE --> F_USE["Inference serving\nShared datasets\nParallel FS for HPC"]
    OBJ --> O_USE["Data lakes\nRaw archives\nUnstructured data"]
    style BLOCK fill:#d94a4a,color:#fff
    style FILE fill:#f0ad4e,color:#000
    style OBJ fill:#5cb85c,color:#fff
    style B_USE fill:#f5c6cb,color:#000
    style F_USE fill:#ffeeba,color:#000
    style O_USE fill:#c3e6cb,color:#000
```
| Feature | Block (SAN) | File (NAS) | Object |
| --- | --- | --- | --- |
| Protocol | FC, iSCSI, NVMe-oF | NFS, SMB/CIFS | S3, REST API |
| Latency | Microseconds (NVMe-oF) to low ms | Low ms to tens of ms | Ms to hundreds of ms |
| IOPS | Highest | Moderate | Lower |
| Scalability | Moderate (scale-up) | Moderate to high | Virtually unlimited |
| Cost per GB | Highest | Moderate | Lowest |
| Best AI Use | Training GPU feeds, checkpoints | Inference, shared datasets | Data lakes, archives |

Parallel file systems (GPFS/Spectrum Scale, Lustre, WEKA, BeeGFS) extend NAS concepts for HPC by striping data across multiple storage nodes, providing concurrent access from thousands of clients.
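To illustrate the striping idea (a generic round-robin sketch, not any specific file system's layout policy), here is how chunks of one file map onto multiple storage nodes so that reads can proceed in parallel:

```python
def stripe_placement(file_size_mb: int, stripe_mb: int, nodes: int):
    """Round-robin chunk-to-node assignment, the core idea behind striping
    in parallel file systems. Returns (chunk_index, node_index) pairs."""
    chunks = -(-file_size_mb // stripe_mb)  # ceiling division
    return [(i, i % nodes) for i in range(chunks)]

# A 16 MB file striped in 4 MB chunks across 3 storage nodes:
placement = stripe_placement(16, 4, 3)
print(placement)  # chunk 3 wraps back to node 0
```

Because consecutive chunks live on different nodes, a client (or thousands of clients) can read them concurrently, multiplying effective throughput by the node count.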

Key Takeaway: No single storage type fits all AI workload needs. Block storage with NVMe-oF delivers peak performance for training, file storage/parallel FS provides flexible shared access for inference, and object storage offers unmatched scalability for data lakes. Most production AI environments use all three in a tiered architecture.
Animation: Interactive comparison showing data flow through block, file, and object storage paths with latency and throughput meters
Post-Quiz: Storage Protocols and Technologies

1. A data center is deploying NVMe-oF and needs the absolute lowest latency. Which transport should they choose, and what infrastructure does it require?

A) NVMe/TCP -- requires standard Ethernet switches
B) NVMe/RDMA -- requires lossless Ethernet with PFC/ECN or InfiniBand
C) NVMe/FC -- requires Fibre Channel switches
D) NVMe/iSCSI -- requires iSCSI initiators

2. Which Cisco switch platform would you use for FC-NVMe connectivity with on-chip SAN analytics?

A) Cisco Nexus 9000
B) Cisco Catalyst 9000
C) Cisco MDS 9000 Series
D) Cisco Nexus 3000

3. An organization needs to store 10 PB of raw training data cost-effectively with S3-compatible access. Which storage type is most appropriate?

A) Block storage (SAN)
B) File storage (NAS)
C) Object storage
D) Direct-attached NVMe

4. What fundamental difference in flow control distinguishes Fibre Channel from standard Ethernet?

A) FC uses larger frame sizes
B) FC uses credit-based flow control (lossless) vs. Ethernet's drop-and-retransmit
C) FC operates at higher speeds
D) FC uses IP-based addressing

5. What does NVMe-oF enable in terms of infrastructure design?

A) Fixed server-storage ratios
B) Composable disaggregated infrastructure (CDI) with independently scalable resources
C) Elimination of all network switches
D) Direct-attached storage only

Section 3: Data Preparation for AI

Pre-Quiz: Data Preparation for AI

1. What percentage of a typical ML project is spent on data preparation?

A) 10-20%
B) 30-40%
C) 60-80%
D) 90-95%

2. In the medallion architecture, which layer contains raw, unprocessed data?

A) Gold layer
B) Silver layer
C) Bronze layer
D) Platinum layer

3. What is "data leakage" in the context of ML data preparation?

A) Data being stolen by attackers
B) Data loss due to hardware failure
C) Model accessing information during training that would not be available at inference time
D) Data being duplicated across storage tiers

4. How many core steps are in the data preparation process?

A) 3
B) 5
C) 7
D) 10

5. For time-series data, how should the train/test split be performed to avoid data leakage?

A) Random shuffling
B) Stratified random sampling
C) Chronological splitting
D) K-fold cross-validation

Key Points

The Data Preparation Process

Data preparation transforms raw, messy data into clean, structured inputs for ML algorithms. The process follows seven core steps:

| Step | Description | Storage Implication |
| --- | --- | --- |
| 1. Collection | Gather data from databases, APIs, IoT sensors, logs | High-throughput object storage for landing zone |
| 2. Cleaning | Handle missing values, remove duplicates, fix inconsistencies | Fast read/write on working tier |
| 3. Integration | Combine data from multiple sources via ETL/ELT | Schema mapping across systems; network bandwidth critical |
| 4. Transformation | Normalize, encode, scale features | High-IOPS storage for iterative processing |
| 5. Feature Engineering | Extract, select, and create predictive features | Intermediate storage for feature stores |
| 6. Validation | Verify data quality, detect drift, check schema | Metadata-rich storage for lineage tracking |
| 7. Splitting | Divide into training, validation, and test sets | Multiple copies on training-tier storage |
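A toy, in-memory sketch of three of the seven steps (cleaning, transformation, splitting); real pipelines would use tooling such as pandas or Spark, and the dataset here is invented for illustration:

```python
# Toy dataset: sensor readings, one of which is missing.
raw = [{"temp": 21.0}, {"temp": None}, {"temp": 19.5}, {"temp": 23.5}]

# Step 2 -- Cleaning: drop records with missing values.
clean = [r for r in raw if r["temp"] is not None]

# Step 4 -- Transformation: min-max normalize the feature to [0, 1].
lo = min(r["temp"] for r in clean)
hi = max(r["temp"] for r in clean)
scaled = [(r["temp"] - lo) / (hi - lo) for r in clean]

# Step 7 -- Splitting: hold out the final 25% as a test set.
cut = int(len(scaled) * 0.75)
train, test = scaled[:cut], scaled[cut:]
print(train, test)
```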

Pipeline Architecture and the Medallion Pattern

```mermaid
flowchart LR
    subgraph Bronze["Bronze Layer\nObject Storage (S3)"]
        C1["1. Collection\nRaw data ingest"]
    end
    subgraph Silver["Silver Layer\nParallel File System"]
        C2["2. Cleaning"]
        C3["3. Integration"]
        C4["4. Transformation"]
        C5["5. Feature Engineering"]
        C6["6. Validation"]
    end
    subgraph Gold["Gold Layer\nNVMe Block Storage"]
        C7["7. Splitting\nTrain / Val / Test"]
        GPU["GPU Cluster\nModel Training"]
    end
    C1 --> C2 --> C3 --> C4 --> C5 --> C6 --> C7 --> GPU
    style Bronze fill:#cd7f32,color:#fff
    style Silver fill:#c0c0c0,color:#000
    style Gold fill:#ffd700,color:#000
```

Storage tiers mapped to pipeline stages: raw data lands in the bronze layer on S3-compatible object storage; cleaning, integration, transformation, feature engineering, and validation run in the silver layer on a parallel file system; and the final train/validation/test splits live in the gold layer on NVMe block storage, ready to feed the GPU cluster.

Data Quality, Labeling, and Common Pitfalls

Data Leakage is the most dangerous data preparation failure. The model appears to perform brilliantly in testing but fails in production because it accessed information during training that would not be available at inference time. For time-series data, always split chronologically -- never randomly.
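A small demonstration of the split rule (with an invented time-ordered dataset): a chronological cut keeps every training timestamp earlier than every test timestamp, while a random shuffle can place future records in the training set.

```python
import random

# Ten days of time-ordered observations (illustrative data).
records = [{"day": d, "value": d * 1.5} for d in range(10)]

# Correct for time series: split on time -- train strictly precedes test.
train_chrono, test_chrono = records[:8], records[8:]
assert max(r["day"] for r in train_chrono) < min(r["day"] for r in test_chrono)

# Dangerous for time series: shuffling mixes future days into training.
shuffled = records[:]
random.Random(0).shuffle(shuffled)
train_rand, test_rand = shuffled[:8], shuffled[8:]
leaks = max(r["day"] for r in train_rand) > min(r["day"] for r in test_rand)
print(f"Random split leaks future data: {leaks}")
```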

Data Versioning is essential for reproducibility. Without tracking dataset changes over time, teams cannot trace model outputs back to specific data states.
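One lightweight way to realize versioning is to content-hash the dataset, so each model run can record exactly which data state it trained on. Production teams typically use purpose-built tools such as DVC or lakeFS; this sketch only illustrates the idea:

```python
import hashlib
import json

def dataset_version(records: list) -> str:
    """Derive a short, deterministic version identifier from dataset content.
    Any change to the data yields a different identifier."""
    payload = json.dumps(records, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

v1 = dataset_version([{"id": 1, "label": "cat"}])
v2 = dataset_version([{"id": 1, "label": "dog"}])
print(v1, v2)
assert v1 != v2  # relabeling one record produces a new dataset version
```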

Privacy and Compliance under GDPR, CCPA, and similar regulations impose strict controls. Storage must support access controls, encryption at rest, and audit logging throughout the pipeline.

Key Takeaway: Data preparation is deeply intertwined with storage architecture. Each pipeline stage has distinct storage requirements. The medallion architecture (bronze/silver/gold) maps storage tiers to data maturity stages.
Animation: Step-by-step walkthrough of data flowing through the medallion architecture, showing transformations at each layer with storage tier indicators
Post-Quiz: Data Preparation for AI

1. In the worked example of an image classification pipeline, where do raw images initially land?

A) NVMe block storage
B) Parallel file system
C) S3-compatible object storage (bronze layer)
D) GPU local storage

2. Which medallion layer maps to NVMe block storage for GPU-ready training data?

A) Bronze layer
B) Silver layer
C) Gold layer
D) Archive layer

3. Why is random shuffling dangerous for time-series train/test splits?

A) It reduces dataset size
B) It leaks future information into the training set
C) It causes class imbalance
D) It requires more storage

4. What storage characteristic is most important for the data cleaning and transformation steps?

A) Maximum capacity at lowest cost
B) High IOPS for iterative read/write processing
C) Rich custom metadata
D) S3-compatible API access

5. Why is data versioning essential in ML workflows?

A) It reduces storage costs
B) It enables reproducibility by tracing model outputs to specific data states
C) It speeds up training
D) It replaces the need for backups

Section 4: Software-Defined Storage and Data Strategies

Pre-Quiz: Software-Defined Storage and Data Strategies

1. What does software-defined storage (SDS) decouple?

A) Compute from networking
B) Storage management from underlying hardware
C) Applications from operating systems
D) GPUs from CPUs

2. Which storage tier (Tier 0) provides nanosecond-to-microsecond latency for active model weights?

A) NVMe SSD arrays
B) Object storage (S3)
C) GPU HBM / local NVMe
D) Parallel file system

3. What is the primary purpose of a caching layer (like Alluxio or WEKA) in a tiered AI storage architecture?

A) Replace object storage entirely
B) Keep frequently accessed data on faster media while fetching less-used data from slower tiers
C) Provide backup copies of data
D) Compress data for long-term archival

4. In a Cisco-based tiered storage deployment, which platform handles the hot tier with NVMe/FC?

A) Cisco Nexus 9000
B) Cisco MDS 9000
C) Cisco Catalyst 9000
D) Cisco UCS

5. What architecture uses a flat structure with object storage to store data with metadata tags and unique identifiers?

A) Data warehouse
B) Data lake
C) Data mart
D) Data cube

Key Points

Software-Defined Storage (SDS) for AI

Software-defined storage decouples storage management from hardware, placing intelligence in a software layer that manages heterogeneous devices through a unified control plane.

Analogy: Think of SDS like a smart traffic management system. Traditional storage is like fixed, dedicated roads for each destination. SDS acts as an intelligent routing layer that dynamically directs traffic across whatever roads exist, adding new routes as needed.

| Benefit | Description | AI Impact |
| --- | --- | --- |
| Disaggregation | Independent scaling of compute, storage, networking | Scale storage without buying new GPU servers |
| Hardware abstraction | Manage diverse devices through single interface | Mix NVMe, SSD, and HDD tiers seamlessly |
| Automation | Programmatic provisioning, monitoring, maintenance | Reduce OpEx; faster experiment iteration |
| Cost efficiency | Eliminate overprovisioning; use commodity hardware | Lower per-TB cost for multi-PB data lakes |
| Policy-driven tiering | Auto-move data between tiers based on access patterns | Hot training data on NVMe; cold archives on object |
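Policy-driven tiering reduces to a placement rule keyed on access patterns. The sketch below uses days-since-last-access with illustrative thresholds -- they are not any product's defaults:

```python
# Assumed policy: idle-time thresholds chosen for illustration only.
TIER_POLICY = [
    (1.0, "Tier 1 NVMe (hot)"),                    # touched within a day
    (30.0, "Tier 2 parallel FS (warm)"),           # touched within a month
    (float("inf"), "Tier 3 object storage (cold)") # everything older
]

def place(days_idle: float) -> str:
    """Return the tier a dataset should live on, given days since last access."""
    for max_idle, tier in TIER_POLICY:
        if days_idle <= max_idle:
            return tier
    return TIER_POLICY[-1][1]

print(place(0.5))   # active training data stays hot
print(place(7))     # shared dataset drops to warm
print(place(365))   # archive lands on object storage
```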

Data Tiering and Caching Strategies

| Tier | Media | Latency | Cost | AI Use Case |
| --- | --- | --- | --- | --- |
| Tier 0 (Cache) | GPU HBM / local NVMe | Nanoseconds to microseconds | Highest | Active model weights, current mini-batch |
| Tier 1 (Hot) | NVMe SSD arrays | Microseconds | High | Active training datasets, checkpoints |
| Tier 2 (Warm) | Parallel FS / SAS SSD | Low milliseconds | Moderate | Shared datasets, feature stores |
| Tier 3 (Cold) | Object storage (S3) | Milliseconds to seconds | Lowest | Data lake, raw archives, compliance |
```mermaid
flowchart TD
    GPU["GPU Cluster"]
    GPU ---|"Nanoseconds-\nmicroseconds"| T0
    subgraph T0["Tier 0 - Cache"]
        HBM["GPU HBM /\nLocal NVMe"]
    end
    T0 ---|"Microseconds"| T1
    subgraph T1["Tier 1 - Hot"]
        NVME["NVMe SSD Arrays\n(Cisco MDS 9000 + NVMe/FC)"]
    end
    T1 ---|"Low\nmilliseconds"| T2
    subgraph T2["Tier 2 - Warm"]
        PFS["Parallel File System\n(Cisco Nexus 9000 + NFS)"]
    end
    T2 ---|"Milliseconds\nto seconds"| T3
    subgraph T3["Tier 3 - Cold"]
        OBJ["Object Storage - S3\n(Cisco Nexus 9000)"]
    end
    CACHE["Caching Layer\n(WEKA / Alluxio)"]
    T3 -.->|"Pre-stage next\nepoch data"| CACHE
    CACHE -.->|"Promote to\nhot tier"| T1
    style T0 fill:#d94a4a,color:#fff
    style T1 fill:#f0ad4e,color:#000
    style T2 fill:#5bc0de,color:#000
    style T3 fill:#5cb85c,color:#fff
    style CACHE fill:#9b59b6,color:#fff
```

Storage Integration with Cisco AI Infrastructure

Fibre Channel SAN (Cisco MDS 9000 Series)

Carries FC and FC-NVMe traffic for the hot storage tier, with on-chip analytics across every I/O flow.

Ethernet Storage Fabric (Cisco Nexus 9000 Series / ACI)

Carries NFS, S3, and NVMe/TCP traffic, and provides the Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) required for lossless NVMe/RDMA.

Worked Example: 64-GPU Training Cluster

  1. Hot tier (MDS 9000 + NVMe/FC): 200 TB NVMe all-flash via 32 Gbps FC at 100+ GB/s aggregate
  2. Warm tier (Nexus 9000 + NFS): 1 PB parallel file system via 100 GbE for shared datasets and checkpoints
  3. Cold tier (Nexus 9000 + S3): 10 PB object storage for complete data lake and archives
  4. Caching layer: Distributed cache (WEKA/Alluxio) on NVMe nodes pre-stages next epoch data
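A quick sanity check on the worked example's hot tier: does 100 GB/s of aggregate throughput keep 64 GPUs fed? The per-GPU ingest rate below is an assumed figure for illustration, not from the source:

```python
gpus = 64
hot_tier_gbps = 100.0      # aggregate hot-tier throughput from the worked example
per_gpu_need_gbps = 1.5    # assumed per-GPU data ingest rate during training

supplied_per_gpu = hot_tier_gbps / gpus   # 1.5625 GB/s available per GPU
ok = supplied_per_gpu >= per_gpu_need_gbps
print(f"{supplied_per_gpu:.2f} GB/s per GPU available vs "
      f"{per_gpu_need_gbps} GB/s assumed demand -> {'OK' if ok else 'starved'}")
```

Under these assumptions the tier just keeps up; a faster GPU generation or larger samples would shift the balance, which is exactly the calculation capacity planners rerun for each deployment.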
Key Takeaway: Software-defined storage provides the abstraction and automation layer needed to manage multi-tier AI storage architectures at scale. Combined with Cisco MDS (Fibre Channel) and Nexus 9000 (Ethernet) switching, organizations deliver NVMe-class performance where needed while leveraging cost-effective object storage for bulk data.
Animation: Interactive tiered storage simulator showing data movement between tiers as training progresses, with cost and latency indicators per tier
Post-Quiz: Software-Defined Storage and Data Strategies

1. A 64-GPU training cluster needs 100+ GB/s aggregate throughput for its hot storage tier. Which Cisco platform and protocol combination is described for this role?

A) Cisco Nexus 9000 with NFS
B) Cisco MDS 9000 with NVMe/FC over 32 Gbps Fibre Channel
C) Cisco Catalyst 9000 with iSCSI
D) Cisco Nexus 3000 with NVMe/TCP

2. Which SDS benefit allows organizations to scale storage independently without purchasing new GPU servers?

A) Automation
B) Cost efficiency
C) Disaggregation
D) Hardware abstraction

3. In the tiered architecture, what does the caching layer (WEKA/Alluxio) specifically do for AI training?

A) Replaces NVMe storage entirely
B) Pre-stages the next training epoch's data from cold to hot tier
C) Compresses checkpoint files
D) Provides backup copies to a remote site

4. What Cisco Nexus 9000 feature creates the lossless Ethernet environment required for NVMe/RDMA?

A) VXLAN overlay
B) Priority Flow Control (PFC) and Explicit Congestion Notification (ECN)
C) OSPF routing
D) Spanning Tree Protocol

5. Why have data lakes built on S3-compatible object storage become the de facto standard for AI/ML data repositories?

A) They provide the lowest latency
B) They use flat architecture with metadata tags, enabling virtually limitless scalability at low cost with decoupled compute
C) They support Fibre Channel natively
D) They require no networking infrastructure


Answer Explanations