Evaluate storage deployments based on AI workload requirements for capacity, performance, redundancy, and scalability
Describe storage protocols and technologies including SAN, Fibre Channel, NVMe, block, and file storage
Explain data preparation strategies and software-defined storage approaches for AI environments
Section 1: Storage Requirements for AI
Pre-Quiz: Storage Requirements for AI
1. Which storage performance metric is most critical for AI training workloads?
A) IOPS
B) Throughput (GB/s)
C) Latency
D) Metadata operations per second
2. What is a practical rule of thumb for capacity planning relative to raw dataset size in AI environments?
A) 1.5x the raw dataset size
B) 2x the raw dataset size
C) 3-5x the raw dataset size
D) 10x the raw dataset size
3. Which scaling pattern is preferred for modern AI storage architectures?
A) Scale-up (vertical)
B) Scale-out (horizontal)
C) Scale-down
D) Both are equally preferred
4. Which redundancy technique distributes data and parity across multiple nodes and is more storage-efficient than replication?
A) RAID 10
B) Multi-site replication
C) Erasure coding
D) Mirroring
5. AI inference workloads primarily demand which performance characteristic?
A) Large sequential throughput
B) High IOPS and low latency
C) Maximum capacity
D) Write-optimized storage
Key Points
Three performance metrics: IOPS (operations/sec for inference), Throughput (GB/s for training), Latency (sub-ms target; microseconds for NVMe)
Training vs. inference: Training needs large sequential reads (throughput); inference needs many small random I/Os (IOPS + low latency)
Capacity planning: Plan for 3-5x raw dataset size to cover versioning, checkpoints, and preprocessing
Scale-out preferred: Capacity and performance grow linearly with node count, avoiding hardware ceilings
Redundancy is non-negotiable: RAID, erasure coding, multi-site replication, and automated failover protect against costly retraining
Performance Metrics: IOPS, Throughput, and Latency
Storage in an AI data center is not a passive repository -- it is the fuel line feeding the GPU engines. When that fuel line cannot deliver data fast enough, even the most powerful GPU cluster sits idle.
| Metric | Definition | Primary AI Phase | Target Range |
| --- | --- | --- | --- |
| IOPS | Discrete read/write operations per second | Inference (many small, random I/Os) | Hundreds of thousands to millions |
| Throughput | Volume of data transferred per unit time (GB/s) | Training (large sequential reads) | Tens to hundreds of GB/s aggregate |
| Latency | Time between I/O request and response | Both (idle GPU cycles = wasted money) | Sub-millisecond; microseconds for NVMe |
Highway analogy: IOPS = number of cars entering per second. Throughput = total cargo tonnage per second. Latency = travel time per car. Training needs wide highways with heavy trucks (throughput). Inference needs fast sports cars arriving constantly (high IOPS, low latency).
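The throughput side of this picture can be sized with simple arithmetic: aggregate read demand is the per-GPU consumption rate multiplied by the cluster size. A minimal Python sketch -- the GPU count, sample rate, and sample size below are illustrative assumptions, not benchmarks:

```python
def training_throughput_gbs(num_gpus, samples_per_sec_per_gpu, sample_mb):
    """Aggregate read throughput (GB/s) needed to keep every GPU fed.

    A rough lower bound: real pipelines overlap prefetch with compute,
    but storage must still sustain this rate on average.
    """
    per_gpu_gbs = samples_per_sec_per_gpu * sample_mb / 1024  # MB/s -> GB/s
    return num_gpus * per_gpu_gbs

# Hypothetical cluster: 64 GPUs, each consuming 2,000 samples/s of 1 MB images.
demand = training_throughput_gbs(64, 2000, 1.0)
print(f"{demand:.0f} GB/s aggregate read demand")  # 125 GB/s
```

Even this modest assumed workload lands in the "tens to hundreds of GB/s" range from the table above.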
```mermaid
flowchart LR
    subgraph AI_Workload["AI Workload"]
        direction TB
        T["Training Phase"]
        I["Inference Phase"]
    end
    subgraph Metrics["Storage Metrics"]
        direction TB
        IOPS["IOPS\n(Operations/sec)"]
        TP["Throughput\n(GB/s)"]
        LAT["Latency\n(Response Time)"]
    end
    subgraph Targets["Performance Targets"]
        direction TB
        T1["100s of thousands\nto millions IOPS"]
        T2["Tens to hundreds\nof GB/s"]
        T3["Sub-millisecond;\nmicroseconds for NVMe"]
    end
    T -- "Large sequential reads" --> TP
    T -- "Checkpoint writes" --> TP
    I -- "Many small random I/Os" --> IOPS
    I -- "Real-time responses" --> LAT
    IOPS --> T1
    TP --> T2
    LAT --> T3
    style T fill:#4a90d9,color:#fff
    style I fill:#d94a4a,color:#fff
    style IOPS fill:#f0ad4e,color:#000
    style TP fill:#f0ad4e,color:#000
    style LAT fill:#f0ad4e,color:#000
```
Training vs. Inference Storage Demands
| Characteristic | Training | Inference |
| --- | --- | --- |
| I/O Pattern | Large sequential reads | Small random reads/writes |
| Critical Metric | Throughput (GB/s) | IOPS and latency |
| Data Volume | Terabytes to petabytes per job | Kilobytes to megabytes per request |
| Concurrency | Moderate (batch feeding) | Very high (thousands of simultaneous requests) |
| Checkpoint Writes | Periodic, large sequential writes | Rarely applicable |
Capacity Planning
AI datasets are growing exponentially. A single large language model may train on datasets measured in hundreds of terabytes. Capacity planning must account for:
Raw dataset storage: Original collected data in object storage or data lakes
Preprocessed datasets: Cleaned, transformed data on higher-performance tiers
Model checkpoints: Periodic snapshots of model weights (tens of GB each for large models)
Experiment tracking: Multiple versions across dozens or hundreds of experiments
Plan for 3-5x the raw dataset size when accounting for all copies, preprocessing outputs, and checkpoint storage.
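The 3-5x rule is simple enough to encode directly. A quick sketch, using the 50 TB dataset from the post-quiz as the worked input:

```python
def provisioned_capacity_tb(raw_tb, multiplier_low=3, multiplier_high=5):
    """Apply the 3-5x capacity-planning rule of thumb to a raw dataset size,
    covering versioning, checkpoints, and preprocessing outputs."""
    return raw_tb * multiplier_low, raw_tb * multiplier_high

low, high = provisioned_capacity_tb(50)   # 50 TB raw dataset
print(f"Provision {low}-{high} TB")       # Provision 150-250 TB
```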
Modern AI storage architectures favor scale-out designs because they allow capacity and performance to grow linearly with node count.
Redundancy and Availability
The cost of retraining a model due to data loss far exceeds the cost of implementing proper redundancy. Key techniques include:
RAID: Traditional disk-level redundancy within a single node (RAID 5, RAID 6, RAID 10)
Erasure coding: Distributes data and parity across multiple nodes; more storage-efficient than replication
Multi-site replication: Copies data across geographically separated locations
Automated failover: Ensures storage remains accessible when nodes or paths fail
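The storage-efficiency claim for erasure coding follows from simple arithmetic: a k+m scheme stores k data fragments plus m parity fragments, so raw consumption is (k+m)/k times the usable data. The 8+2 layout below is a common configuration chosen here as an assumed example:

```python
def storage_overhead(data_fragments, parity_fragments):
    """Raw-to-usable storage ratio for a k+m erasure-coding scheme."""
    return (data_fragments + parity_fragments) / data_fragments

ec = storage_overhead(8, 2)   # 8+2 erasure coding -> 1.25x raw storage
rep = 3.0                     # triple replication  -> 3.0x raw storage
print(f"EC 8+2 overhead: {ec:.2f}x vs. triple replication: {rep:.1f}x")
```

Both schemes here tolerate two simultaneous failures, but erasure coding does it with a fraction of the raw capacity.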
Key Takeaway: AI storage evaluation requires balancing four interdependent factors -- performance (IOPS, throughput, latency), capacity, scalability, and redundancy. Training prioritizes sequential throughput; inference demands high IOPS and ultra-low latency.
Animation: Interactive visualization showing how GPU utilization drops as storage latency increases, illustrating the cost of I/O starvation
Post-Quiz: Storage Requirements for AI
1. A training job processes a 50 TB dataset. Approximately how much total storage should be provisioned accounting for versioning, checkpoints, and preprocessing?
A) 50 TB
B) 100 TB
C) 150-250 TB
D) 500 TB
2. An inference service handling thousands of simultaneous requests with kilobyte-sized payloads should optimize for which metric?
A) Sequential throughput (GB/s)
B) IOPS and latency
C) Raw capacity (PB)
D) Write throughput
3. Why do modern AI storage architectures favor scale-out over scale-up?
A) Scale-out is always cheaper per TB
B) Scale-out avoids hardware ceilings and allows linear growth in capacity and performance
C) Scale-up is no longer supported by vendors
D) Scale-out requires less network bandwidth
4. Which redundancy technique is most storage-efficient for large-scale AI deployments?
A) RAID 10 (mirroring + striping)
B) Triple replication
C) Erasure coding
D) RAID 0 (striping only)
5. During AI training, periodic checkpoint writes are characterized as:
A) Small random writes at high IOPS
B) Large sequential writes
C) Metadata-only operations
D) Read-only operations
Section 2: Storage Protocols and Technologies
Pre-Quiz: Storage Protocols and Technologies
1. What type of flow control does Fibre Channel use to prevent congestion?
A) Drop-and-retransmit (like Ethernet)
B) Credit-based flow control
C) Token-based flow control
D) Sliding window flow control
2. How many command queues does NVMe support compared to legacy SCSI?
A) 32 queues vs. 1 queue
B) 256 queues vs. 32 queues
C) Up to 65,535 queues vs. 1 queue (with 32 entries)
D) 1,024 queues vs. 64 queues
3. Which NVMe-oF transport offers the lowest latency?
A) NVMe/FC
B) NVMe/TCP
C) NVMe/RDMA (RoCEv2 or InfiniBand)
D) NVMe/iSCSI
4. Which storage type offers virtually unlimited scalability at the lowest cost per GB?
A) Block storage (SAN)
B) File storage (NAS)
C) Object storage
D) Direct-attached storage (DAS)
5. What unique capability do Cisco MDS 9000 Series switches provide for storage analytics?
A) Software-based packet sampling
B) On-chip analytics calculating 70+ metrics per I/O flow
C) NetFlow-based traffic analysis
D) SNMP trap-based monitoring only
Key Points
Fibre Channel SAN: Lossless, credit-based flow control; speeds up to 128 Gbps; zoning and LUN masking for security
NVMe: Replaces SCSI; 65,535 queues x 65,536 entries for massive parallelism; designed for flash
Block storage: Highest IOPS, lowest latency, highest cost -- ideal for GPU training feeds
File storage (NAS): Shared access via NFS/SMB; parallel file systems bridge NAS and SAN performance
Object storage: Flat namespace, S3 API, virtually unlimited scale, lowest cost -- standard for data lakes
Cisco MDS 9000: On-chip SAN analytics with 70+ metrics per I/O flow; FC-NVMe support
Cisco Nexus 9000/ACI: RoCEv2 for NVMe/RDMA with PFC/ECN lossless Ethernet
Storage Area Networks (SAN) and Fibre Channel
A SAN is a dedicated high-speed network providing block-level access to storage. Fibre Channel (FC) is the traditional SAN backbone with key characteristics:
Speeds: 8, 16, 32, 64, and 128 Gbps per port
Credit-based flow control: Prevents congestion by ensuring a sender never transmits more frames than the receiver can accept
Zoning and LUN masking: Security mechanisms restricting which hosts can see which storage volumes
Lossless delivery: No frames dropped under normal operation
The Cisco MDS 9000 Series inspects FC and SCSI/NVMe headers of all I/O exchanges, calculating more than 70 metrics per I/O flow using dedicated on-chip hardware -- the industry's first on-chip analytics for NVMe, FC, and FC-SCSI traffic.
NVMe and NVMe over Fabrics (NVMe-oF)
NVMe replaced the legacy SCSI command set with a protocol designed for flash storage. Where SCSI uses a single queue with 32 entries, NVMe supports up to 65,535 queues with 65,536 entries each.
NVMe-oF extends NVMe across a network, achieving 20-30 microsecond latency by avoiding SCSI emulation layers entirely.
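Little's law connects queue depth, latency, and IOPS: the number of I/Os in flight must equal IOPS multiplied by latency. A sketch of that relationship (the target figures are illustrative):

```python
def required_queue_depth(target_iops, latency_s):
    """Little's law: in-flight I/Os = IOPS x latency (seconds)."""
    return target_iops * latency_s

# To sustain 1,000,000 IOPS at 100 microseconds of device latency:
depth = required_queue_depth(1_000_000, 100e-6)
print(f"{depth:.0f} concurrent I/Os needed")  # 100
```

The same arithmetic shows why deep queues matter: at 100 microseconds, a single 32-entry queue caps out near 32 / 0.0001 = 320,000 IOPS, while NVMe's tens of thousands of queues remove that ceiling and let every CPU core drive I/O in parallel.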
NVMe-oF enables composable disaggregated infrastructure (CDI), where storage, compute, GPU, and FPGA resources can be independently scaled. Direct GPU-to-storage data access through DPUs bypasses CPU bottlenecks entirely.
Parallel file systems (GPFS/Spectrum Scale, Lustre, WEKA, BeeGFS) extend NAS concepts for HPC by striping data across multiple storage nodes, providing concurrent access from thousands of clients.
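The striping idea can be illustrated with a toy round-robin placement -- a deliberate simplification, since real parallel file systems use far more sophisticated layout and metadata schemes:

```python
def chunk_placement(file_size_mb, chunk_mb, num_nodes):
    """Map a file's chunks to storage nodes round-robin, the basic idea
    behind striping: each node serves a different chunk concurrently."""
    num_chunks = -(-file_size_mb // chunk_mb)      # ceiling division
    return [chunk % num_nodes for chunk in range(num_chunks)]

layout = chunk_placement(file_size_mb=40, chunk_mb=10, num_nodes=4)
print(layout)  # [0, 1, 2, 3] -- four nodes each serve one chunk in parallel
```

With the stripe spread across all four nodes, a single client read can draw on the aggregate bandwidth of the whole cluster rather than one server.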
Key Takeaway: No single storage type fits all AI workload needs. Block storage with NVMe-oF delivers peak performance for training, file storage/parallel FS provides flexible shared access for inference, and object storage offers unmatched scalability for data lakes. Most production AI environments use all three in a tiered architecture.
Animation: Interactive comparison showing data flow through block, file, and object storage paths with latency and throughput meters
Post-Quiz: Storage Protocols and Technologies
1. A data center is deploying NVMe-oF and needs the absolute lowest latency. Which transport should they choose, and what infrastructure does it require?
A) NVMe/TCP -- requires standard Ethernet switches
B) NVMe/RDMA -- requires lossless Ethernet with PFC/ECN or InfiniBand
C) NVMe/FC -- requires Fibre Channel switches
D) NVMe/iSCSI -- requires iSCSI initiators
2. Which Cisco switch platform would you use for FC-NVMe connectivity with on-chip SAN analytics?
A) Cisco Nexus 9000
B) Cisco Catalyst 9000
C) Cisco MDS 9000 Series
D) Cisco Nexus 3000
3. An organization needs to store 10 PB of raw training data cost-effectively with S3-compatible access. Which storage type is most appropriate?
A) Block storage (SAN)
B) File storage (NAS)
C) Object storage
D) Direct-attached NVMe
4. What fundamental difference in flow control distinguishes Fibre Channel from standard Ethernet?
A) FC uses larger frame sizes
B) FC uses credit-based flow control (lossless) vs. Ethernet's drop-and-retransmit
C) FC operates at higher speeds
D) FC uses IP-based addressing
5. What does NVMe-oF enable in terms of infrastructure design?
A) Fixed server-storage ratios
B) Composable disaggregated infrastructure (CDI) with independently scalable resources
C) Elimination of all network switches
D) Direct-attached storage only
Section 3: Data Preparation for AI
Pre-Quiz: Data Preparation for AI
1. What percentage of a typical ML project is spent on data preparation?
A) 10-20%
B) 30-40%
C) 60-80%
D) 90-95%
2. In the medallion architecture, which layer contains raw, unprocessed data?
A) Gold layer
B) Silver layer
C) Bronze layer
D) Platinum layer
3. What is "data leakage" in the context of ML data preparation?
A) Data being stolen by attackers
B) Data loss due to hardware failure
C) Model accessing information during training that would not be available at inference time
D) Data being duplicated across storage tiers
4. How many core steps are in the data preparation process?
A) 3
B) 5
C) 7
D) 10
5. For time-series data, how should the train/test split be performed to avoid data leakage?
A) Random shuffling
B) Stratified random sampling
C) Chronological splitting
D) K-fold cross-validation
Key Points
60-80% of ML effort is spent on data preparation -- it is the primary determinant of model success
Ingestion tier: High-throughput object storage (S3-compatible) for raw data landing
Processing tier: High-IOPS NVMe or parallel file systems for transformation
Serving tier: Low-latency block or file storage for model training data access
Archive tier: Cold object storage for long-term dataset retention
Data Quality, Labeling, and Common Pitfalls
Data Leakage is the most dangerous data preparation failure. The model appears to perform brilliantly in testing but fails in production because it accessed information during training that would not be available at inference time. For time-series data, always split chronologically -- never randomly.
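A chronological split is a few lines of code; the point is simply to cut the timeline rather than shuffle it. A minimal sketch (the 80/20 fraction is a common but assumed default):

```python
def chronological_split(records, train_fraction=0.8):
    """Split time-ordered records without shuffling, so no future rows
    leak into training. `records` must already be sorted by timestamp."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]

days = list(range(100))            # stand-in for 100 days of observations
train, test = chronological_split(days)
assert max(train) < min(test)      # every training day precedes every test day
```

A random shuffle would violate that final assertion, silently handing the model glimpses of the future.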
Data Versioning is essential for reproducibility. Without tracking dataset changes over time, teams cannot trace model outputs back to specific data states.
Privacy and Compliance under GDPR, CCPA, and similar regulations impose strict controls. Storage must support access controls, encryption at rest, and audit logging throughout the pipeline.
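One minimal way to make a data state traceable is a content hash over the dataset's files: any change to any file yields a new version identifier. The sketch below is illustrative only (dedicated tools such as DVC implement this idea far more completely):

```python
import hashlib

def dataset_fingerprint(files):
    """Derive a version ID from (name, bytes) pairs. Sorting makes the
    hash independent of listing order; any content change alters the ID."""
    digest = hashlib.sha256()
    for name, data in sorted(files):
        digest.update(name.encode())
        digest.update(data)
    return digest.hexdigest()[:12]

v1 = dataset_fingerprint([("a.csv", b"1,2"), ("b.csv", b"3,4")])
v2 = dataset_fingerprint([("a.csv", b"1,2"), ("b.csv", b"3,5")])
assert v1 != v2   # a one-byte change produces a different dataset version
```

Recording this fingerprint alongside each training run lets a team trace any model output back to the exact data state that produced it.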
Key Takeaway: Data preparation is deeply intertwined with storage architecture. Each pipeline stage has distinct storage requirements. The medallion architecture (bronze/silver/gold) maps storage tiers to data maturity stages.
Animation: Step-by-step walkthrough of data flowing through the medallion architecture, showing transformations at each layer with storage tier indicators
Post-Quiz: Data Preparation for AI
1. In the worked example of an image classification pipeline, where do raw images initially land?
A) NVMe block storage
B) Parallel file system
C) S3-compatible object storage (bronze layer)
D) GPU local storage
2. Which medallion layer maps to NVMe block storage for GPU-ready training data?
A) Bronze layer
B) Silver layer
C) Gold layer
D) Archive layer
3. Why is random shuffling dangerous for time-series train/test splits?
A) It reduces dataset size
B) It leaks future information into the training set
C) It causes class imbalance
D) It requires more storage
4. What storage characteristic is most important for the data cleaning and transformation steps?
A) Maximum capacity at lowest cost
B) High IOPS for iterative read/write processing
C) Rich custom metadata
D) S3-compatible API access
5. Why is data versioning essential in ML workflows?
A) It reduces storage costs
B) It enables reproducibility by tracing model outputs to specific data states
C) It speeds up training
D) It replaces the need for backups
Section 4: Software-Defined Storage and Data Strategies
Pre-Quiz: Software-Defined Storage and Data Strategies
1. What does software-defined storage (SDS) decouple?
A) Compute from networking
B) Storage management from underlying hardware
C) Applications from operating systems
D) GPUs from CPUs
2. Which storage tier (Tier 0) provides nanosecond-to-microsecond latency for active model weights?
A) NVMe SSD arrays
B) Object storage (S3)
C) GPU HBM / local NVMe
D) Parallel file system
3. What is the primary purpose of a caching layer (like Alluxio or WEKA) in a tiered AI storage architecture?
A) Replace object storage entirely
B) Keep frequently accessed data on faster media while fetching less-used data from slower tiers
C) Provide backup copies of data
D) Compress data for long-term archival
4. In a Cisco-based tiered storage deployment, which platform handles the hot tier with NVMe/FC?
A) Cisco Nexus 9000
B) Cisco MDS 9000
C) Cisco Catalyst 9000
D) Cisco UCS
5. What architecture uses a flat structure with object storage to store data with metadata tags and unique identifiers?
A) Data warehouse
B) Data lake
C) Data mart
D) Data cube
Key Points
SDS decouples storage management from hardware -- enables policy-driven automation, hardware abstraction, and independent scaling
Caching layers (Alluxio, WEKA) pre-stage next epoch data from cold to hot tier
Data lakes on S3-compatible object storage are the de facto AI/ML data repository standard
Cisco integration: MDS 9000 for FC/NVMe-FC hot tier; Nexus 9000 for RoCEv2/NFS warm tier and S3 cold tier
Software-Defined Storage (SDS) for AI
Software-defined storage decouples storage management from hardware, placing intelligence in a software layer that manages heterogeneous devices through a unified control plane.
Analogy: Think of SDS like a smart traffic management system. Traditional storage is like fixed, dedicated roads for each destination. SDS acts as an intelligent routing layer that dynamically directs traffic across whatever roads exist, adding new routes as needed.
| Benefit | Description | AI Impact |
| --- | --- | --- |
| Disaggregation | Independent scaling of compute, storage, networking | Storage capacity grows without purchasing new GPU servers |
Fibre Channel SAN Fabric (Cisco MDS 9000 Series)
High-performance block storage connectivity for GPU training clusters
FC-NVMe support enables NVMe commands over existing FC fabrics
On-chip SAN analytics: real-time visibility with 70+ metrics per flow
Zoning and LUN masking enforce security and multi-tenancy
Ethernet Storage Fabric (Cisco Nexus 9000 Series / ACI)
RoCEv2 support enables NVMe/RDMA for lowest-latency Ethernet storage
PFC and ECN create lossless Ethernet required for RDMA
ACI policy model enforces storage traffic QoS and segmentation
NVMe/TCP support for environments not requiring RDMA-level latency
Worked Example: 64-GPU Training Cluster
Hot tier (MDS 9000 + NVMe/FC): 200 TB NVMe all-flash via 32 Gbps FC at 100+ GB/s aggregate
Warm tier (Nexus 9000 + NFS): 1 PB parallel file system via 100 GbE for shared datasets and checkpoints
Cold tier (Nexus 9000 + S3): 10 PB object storage for complete data lake and archives
Caching layer: Distributed cache (WEKA/Alluxio) on NVMe nodes pre-stages next epoch data
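The hot-tier figures above can be sanity-checked with rough link arithmetic. The sketch below assumes roughly 90% usable efficiency per port; real 32GFC payload rates depend on encoding and framing, so treat the result as an estimate, not a design:

```python
import math

def fc_links_needed(target_gbs, link_gbps, efficiency=0.9):
    """Estimate how many Fibre Channel links a hot tier needs to reach
    an aggregate throughput goal, derating each link's raw bit rate."""
    usable_gbs = link_gbps / 8 * efficiency   # Gbps -> GB/s, then derate
    return math.ceil(target_gbs / usable_gbs)

# The worked example's hot tier: 100 GB/s aggregate over 32 Gbps FC ports.
print(fc_links_needed(100, 32))   # 28 links spread across hosts and arrays
```

The takeaway is that "100+ GB/s aggregate" is a fabric-wide property: no single port delivers it, so the MDS fabric must aggregate dozens of active paths.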
Key Takeaway: Software-defined storage provides the abstraction and automation layer needed to manage multi-tier AI storage architectures at scale. Combined with Cisco MDS (Fibre Channel) and Nexus 9000 (Ethernet) switching, organizations deliver NVMe-class performance where needed while leveraging cost-effective object storage for bulk data.
Animation: Interactive tiered storage simulator showing data movement between tiers as training progresses, with cost and latency indicators per tier
Post-Quiz: Software-Defined Storage and Data Strategies
1. A 64-GPU training cluster needs 100+ GB/s aggregate throughput for its hot storage tier. Which Cisco platform and protocol combination is described for this role?
A) Cisco Nexus 9000 with NFS
B) Cisco MDS 9000 with NVMe/FC over 32 Gbps Fibre Channel
C) Cisco Catalyst 9000 with iSCSI
D) Cisco Nexus 3000 with NVMe/TCP
2. Which SDS benefit allows organizations to scale storage independently without purchasing new GPU servers?
A) Automation
B) Cost efficiency
C) Disaggregation
D) Hardware abstraction
3. In the tiered architecture, what does the caching layer (WEKA/Alluxio) specifically do for AI training?
A) Replaces NVMe storage entirely
B) Pre-stages the next training epoch's data from cold to hot tier
C) Compresses checkpoint files
D) Provides backup copies to a remote site
4. What Cisco Nexus 9000 feature creates the lossless Ethernet environment required for NVMe/RDMA?
A) VXLAN overlay
B) Priority Flow Control (PFC) and Explicit Congestion Notification (ECN)
C) OSPF routing
D) Spanning Tree Protocol
5. Why have data lakes built on S3-compatible object storage become the de facto standard for AI/ML data repositories?
A) They provide the lowest latency
B) They use flat architecture with metadata tags, enabling virtually limitless scalability at low cost with decoupled compute