Chapter 3: Storage and Data Management for AI Pipelines

Learning Objectives

By the end of this chapter, you should be able to: characterize the I/O profiles of training, checkpointing, and inference; provision storage with PersistentVolumes, PVCs, StorageClasses, and CSI drivers; compare NFS, Lustre, object storage, local NVMe, and Rook-Ceph for AI workloads; and accelerate data pipelines with Fluid, Alluxio, and JuiceFS.

Pre-Quiz: Test Your Current Knowledge

Pre-Study Assessment

1. Which I/O characteristic best describes AI model training workloads?

Low-latency random reads of small key-value pairs
Sustained high-throughput sequential/random reads across repeated epochs
Write-heavy append-only logging
Infrequent large sequential writes with no reads

2. What Kubernetes access mode is required for distributed training where multiple pods must read the same dataset?

ReadWriteOnce (RWO)
ReadOnlyMany (ROX)
ReadWriteMany (RWX)
SingleNodeWriter (SNW)

3. What is the primary role of a CSI driver in Kubernetes?

To schedule pods onto GPU nodes
To decouple storage provisioning from the Kubernetes core via a standard interface
To compress training data before writing to disk
To replicate pods across availability zones

4. Why is Lustre preferred over NFS for large-scale distributed training?

Lustre is free while NFS requires a license
Lustre stripes data across multiple servers for linearly scalable aggregate bandwidth
NFS does not support any form of shared access
Lustre stores data in object storage format natively

5. What Kubernetes resource does Fluid use to declare the location of training data?

ConfigMap
Dataset CRD
PersistentVolume
DaemonSet

6. Which storage solution provides block, file, and object interfaces from a single on-premises platform?

Amazon EBS
NFS
Rook-Ceph
hostPath volumes

7. What is the main advantage of JuiceFS over Alluxio for AI training workloads?

JuiceFS supports more cloud providers
JuiceFS provides full POSIX compliance while Alluxio has partial support
JuiceFS is faster at object storage reads
JuiceFS requires no metadata engine

8. What does "data-affinity scheduling" in Fluid accomplish?

It replicates data to every node in the cluster
It places training pods on nodes that already hold cached copies of required data
It encrypts data at rest on each node
It compresses datasets to reduce storage costs

9. Which storage type is best suited for checkpoint writes during training?

Low-latency block storage optimized for random reads
High burst write throughput storage with strong durability
Object storage with eventual consistency
In-memory tmpfs volumes

10. What is the purpose of a Fluid DataLoad CRD?

To define GPU resource limits for training pods
To pre-warm the cache by pulling data before a training job starts
To migrate data between cloud providers
To delete stale cached data after training completes

11. What does volumeBindingMode: WaitForFirstConsumer accomplish in a StorageClass?

It delays volume creation until a pod actually uses the PVC
It creates the volume immediately but delays formatting
It waits for the cluster administrator to approve the volume
It queues volume creation until cluster resources are idle

12. In JuiceFS's three-layer cache stack, what is Layer 1?

Local NVMe SSD
Object store (S3/MinIO)
Kernel page cache (RAM)
Redis metadata cache

13. Why should checkpoint storage and training data storage use separate tiers?

Checkpoints require encryption but training data does not
Checkpoint write bursts can consume bandwidth needed by training reads if they share a tier
Training data must be stored in object format while checkpoints use block format
Kubernetes does not allow two PVCs on the same StorageClass

Section 1: Storage Requirements for AI Workloads

Before choosing a storage technology, it is essential to understand what AI workloads actually demand. Training, checkpointing, and inference each have fundamentally different I/O profiles, and choosing a single storage tier for all three is rarely the right answer.

Dataset Sizes and I/O Patterns

Training is overwhelmingly read-heavy. A large language model training run may iterate over terabytes of tokenized text hundreds of times across multiple epochs, generating sustained high-bandwidth read traffic. Both IOPS and throughput matter simultaneously, since individual read requests are typically small (tens of kilobytes) but aggregate to tens or hundreds of GB/s across a distributed job.

Inference is quite different. Model weights are loaded once at startup (a large sequential read), and then each serving request triggers small, latency-sensitive reads. Inference optimization focuses on low latency for random reads rather than raw throughput.

Characteristic       | Training                     | Inference
---------------------|------------------------------|-------------------------------
Primary operation    | Sequential/random reads      | Random reads (small KV)
Access frequency     | Repeated (multi-epoch)       | Continuous, latency-sensitive
Data volume          | Tens to hundreds of TB       | Model weights + serving cache
Throughput priority  | Very high (GB/s aggregate)   | Moderate
Latency priority     | Moderate                     | Very low (ms)
Write pattern        | Checkpoints (periodic burst) | Logs, output tokens (small)

POSIX vs Object Storage Tradeoffs

Most AI frameworks (PyTorch DataLoader, TensorFlow tf.data) expect a POSIX filesystem with familiar directory trees, file opens, seeks, and symbolic links. Object storage (S3, GCS, MinIO) organises data as key-value blobs and does not natively support operations like truncate, append, or symbolic links.

Practical AI pipelines adopt a hybrid approach: raw datasets live in object storage for durability and cost efficiency, while a POSIX-compatible cache layer (such as JuiceFS or a parallel file system) presents data to training pods as a conventional filesystem.

Checkpoint and Model Artifact Storage

Checkpoints are periodic snapshots of model weights, optimizer state, and training metadata. They impose distinctive requirements: high burst write throughput, strong durability, and rare reads (only on restart). Platform teams should provision separate storage tiers for checkpoints with QoS enforcement to prevent write bursts from crowding out training read bandwidth.
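One way to enforce this separation is to give checkpoints their own StorageClass and PVC. The following is a minimal sketch, assuming the AWS EBS CSI driver; the class name, volume type, and sizes are illustrative, not prescriptive:

```yaml
# Hypothetical checkpoint tier, kept separate from the training-data
# StorageClass so write bursts do not contend with training reads.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: checkpoint-tier            # assumed name
provisioner: ebs.csi.aws.com
parameters:
  type: st1                        # throughput-optimized; gp3/io2 are alternatives
reclaimPolicy: Retain              # keep checkpoints even if the PVC is deleted
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: checkpoints
  namespace: ml-training
spec:
  storageClassName: checkpoint-tier
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Ti
```

The Retain reclaim policy is the notable choice: unlike scratch volumes, checkpoint data should survive PVC deletion.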

Completed model artefacts (final trained weights, tokenizers, configs) are written once and read frequently by inference servers. Object storage is ideal here, providing versioning, content-addressable retrieval, and global distribution.

Storage Tiering: Data Flow from Hot to Cold

[Figure: Storage tiering, with data aging from hot to cold]
- HOT TIER (high performance / high cost): local NVMe SSD; sub-ms latency, multi-GB/s throughput. Use cases: active training batches, scratch computation, checkpoint writes.
- WARM TIER: NFS / Lustre / CephFS; network-attached, shared, ReadWriteMany access. Use cases: shared training datasets, multi-pod read access, distributed training.
- COLD TIER (low cost / high scale): S3 / MinIO / GCS object storage; petabyte-scale, durable archive. Use cases: raw dataset archive, model artifact storage, long-term checkpoints.

Key Takeaway

Training, checkpointing, and inference have distinct I/O profiles (sustained high-bandwidth reads, periodic burst writes, and latency-sensitive reads, respectively), so match each workload to its own storage tier rather than forcing all three onto one.

Section 2: Kubernetes Storage Primitives

PersistentVolumes, PVCs, and StorageClasses

A PersistentVolume (PV) is a cluster-level resource representing a piece of storage that exists independently of any pod. A PersistentVolumeClaim (PVC) is a namespaced request for storage. Pods reference PVCs (not PVs directly), and Kubernetes matches or dynamically provisions the appropriate PV. A StorageClass defines a profile of storage (provisioner, performance tier, reclaim policy) and enables dynamic provisioning.

# StorageClass: high-performance NVMe-backed block storage
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nvme-training
provisioner: ebs.csi.aws.com
parameters:
  type: io2
  iopsPerGB: "50"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
# PVC requesting 500 GiB of NVMe storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-scratch
  namespace: ml-training
spec:
  storageClassName: nvme-training
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Gi
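A pod then consumes the claim by name; a minimal sketch (the container image and mount path are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
  namespace: ml-training
spec:
  containers:
    - name: train
      image: pytorch/pytorch:latest   # illustrative image
      volumeMounts:
        - name: scratch
          mountPath: /scratch         # fast node-local scratch for the job
  volumes:
    - name: scratch
      persistentVolumeClaim:
        claimName: training-scratch   # the PVC defined above
```

Because the StorageClass uses WaitForFirstConsumer, the EBS volume is created in the same availability zone as the node this pod lands on.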

CSI Drivers for Cloud and On-Premises Storage

The Container Storage Interface (CSI) decouples storage provisioning from the Kubernetes core. A CSI driver consists of a Controller plugin (Deployment, handles volume lifecycle), a Node plugin (DaemonSet, handles mount/unmount), and an External provisioner sidecar (watches for new PVCs).

Driver                        | Backend           | Best Fit
------------------------------|-------------------|-------------------------------------
ebs.csi.aws.com               | Amazon EBS        | Single-node training scratch
efs.csi.aws.com               | Amazon EFS (NFS)  | Shared datasets, RWX
fsx.csi.aws.com               | FSx for Lustre    | High-throughput distributed training
rook-ceph.rbd.csi.ceph.com    | Rook-Ceph (block) | On-premises block storage
rook-ceph.cephfs.csi.ceph.com | Rook-Ceph (file)  | On-premises RWX
csi.juicefs.com               | JuiceFS           | Distributed POSIX cache layer

ReadWriteMany Access Modes for Distributed Training

Kubernetes PVs support three access modes: ReadWriteOnce (RWO), ReadOnlyMany (ROX), and ReadWriteMany (RWX). RWX is the critical mode for distributed training where hundreds of pods must read from the same dataset PVC. Only network/parallel file systems support RWX natively; block storage cannot.
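A shared dataset claim looks like the sketch below, assuming a network-file-system-backed StorageClass (the class name is an assumption):

```yaml
# RWX claim: many training pods mount the same dataset, read-mostly.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-dataset
  namespace: ml-training
spec:
  storageClassName: efs-shared   # assumed NFS/EFS-backed StorageClass
  accessModes:
    - ReadWriteMany              # required for multi-pod shared access
  resources:
    requests:
      storage: 10Ti
```

If the backing StorageClass were block storage (e.g. EBS), this claim would fail to bind, since block volumes cannot offer RWX.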

CSI Driver Architecture: PVC to Storage Backend Provisioning Flow

[Figure: CSI provisioning flow from PVC to storage backend]
1. A training pod mounts a PVC (500Gi, RWO) at /data; the PVC references the nvme-training StorageClass.
2. The StorageClass selects the CSI driver (ebs.csi.aws.com); the external provisioner sidecar in the Controller plugin (Deployment) creates a PersistentVolume and triggers provisioning.
3. The backend volume (AWS EBS io2 NVMe, physical infrastructure) is provisioned and bound to the PV; the Node plugin (DaemonSet) mounts it on the pod's node.

Key Takeaway

PVCs decouple workloads from storage details: a StorageClass plus a CSI driver dynamically provisions the right backend on demand, and ReadWriteMany is the access mode that makes shared datasets possible for distributed training.

Section 3: High-Performance Storage Solutions

Network File Systems: NFS, Lustre, and BeeGFS

NFS is the simplest shared filesystem -- a central server exports directories that any node can mount as RWX PVs. However, a single NFS server becomes a bandwidth bottleneck at multi-GB/s scale.

Lustre is a high-performance parallel filesystem that separates metadata (MDS) from data (OSS). Data is striped across multiple object storage servers, allowing aggregate bandwidth to scale linearly with server count. Lustre is the go-to backend for the most demanding distributed training workloads.

BeeGFS offers similar metadata/data separation with simpler operational characteristics, popular in on-premises HPC environments.
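On AWS, Lustre can be provisioned dynamically through the FSx CSI driver listed earlier. A sketch of such a StorageClass follows; parameter names follow the AWS FSx for Lustre CSI driver, and the network IDs are placeholders:

```yaml
# Sketch: dynamically provisioned FSx for Lustre filesystem per PVC.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fsx-lustre
provisioner: fsx.csi.aws.com
parameters:
  subnetId: subnet-0123456789abcdef0       # placeholder VPC subnet
  securityGroupIds: sg-0123456789abcdef0   # placeholder security group
  deploymentType: SCRATCH_2                # scratch tier suits transient training runs
mountOptions:
  - flock                                  # enable file locking for frameworks that need it
```

SCRATCH deployment types trade durability for cost and throughput, which fits datasets that can be re-staged from object storage.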

Object Storage: S3, MinIO, GCS

MinIO is the most widely deployed self-hosted object storage for AI, speaking the S3 API. Object storage is not normally mounted as POSIX; training pipelines access it through SDKs or FUSE tools. For model artefact distribution, object storage provides version control and content-addressable retrieval.

Local NVMe and hostPath

The fastest storage available is local NVMe SSD with sub-millisecond latency. Ideal for scratch data via hostPath volumes (simple, node-tied) or Local PersistentVolumes (with nodeAffinity rules).

# Local PV backed by NVMe on a specific node
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-nvme-node01
spec:
  capacity:
    storage: 2Ti
  accessModes: [ReadWriteOnce]
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-nvme
  local:
    path: /mnt/nvme0
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: [gpu-node-01]
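Local PVs are not dynamically provisioned; they pair with a no-provisioner StorageClass so that binding waits until a pod is actually scheduled onto the node holding the disk:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme
provisioner: kubernetes.io/no-provisioner  # PVs are created manually or by an operator
volumeBindingMode: WaitForFirstConsumer    # bind only once a pod is scheduled
```

Without WaitForFirstConsumer, a PVC could bind to the PV on gpu-node-01 while the scheduler places the pod elsewhere, leaving the pod unschedulable.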

Rook-Ceph for Distributed Storage

Rook-Ceph deploys and manages Ceph within the cluster via a Kubernetes operator, providing three interfaces simultaneously:

Interface     | Kubernetes Access             | Use Case
--------------|-------------------------------|--------------------------------
RBD (block)   | rook-ceph.rbd.csi.ceph.com    | Checkpoints, single-pod scratch
CephFS (file) | rook-ceph.cephfs.csi.ceph.com | ReadWriteMany shared datasets
RGW (object)  | S3-compatible API             | Raw datasets, model artefacts

Rook-Ceph turns commodity disks into a self-healing storage pool with automatic re-replication on node failure.
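A CephFS-backed RWX StorageClass is sketched below, loosely following the Rook examples; the clusterID, filesystem, and pool names are assumptions that must match the deployed CephCluster and CephFilesystem, and the CSI secret parameters are omitted for brevity:

```yaml
# Sketch: CephFS StorageClass for ReadWriteMany shared datasets.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-shared
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph   # namespace of the Rook operator (assumed)
  fsName: myfs           # CephFilesystem name (assumed)
  pool: myfs-data0       # data pool backing the filesystem (assumed)
reclaimPolicy: Delete
```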

Key Takeaway

Choose the backend by scale and access pattern: NFS for simple shared access, Lustre or BeeGFS when aggregate bandwidth must scale with server count, local NVMe for scratch, and Rook-Ceph for unified block/file/object storage on-premises.

Section 4: Data Caching and Pipeline Optimization

Fluid: Dataset Acceleration on Kubernetes

Fluid (CNCF sandbox) provides Kubernetes-native orchestration for distributed dataset caching via Dataset and Runtime CRDs. The Dataset CRD declares data location without coupling to storage technology. The Runtime CRD selects the caching engine (Alluxio, JuiceFS) and its parameters.

Data-affinity scheduling places training pods on nodes with cached data. Automated data operations include DataLoad (pre-warm), DataMigrate (tier movement), and DataBackup (snapshots).

# Fluid Dataset + JuiceFSRuntime
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: imagenet
  namespace: ml-training
spec:
  mounts:
    - mountPoint: s3://my-bucket/imagenet/
      name: imagenet
      options:
        fs.s3a.endpoint: s3.amazonaws.com
---
apiVersion: data.fluid.io/v1alpha1
kind: JuiceFSRuntime
metadata:
  name: imagenet
  namespace: ml-training
spec:
  replicas: 4
  tieredstore:
    levels:
      - mediumtype: SSD
        path: /mnt/cache        # node-local SSD cache directory
        quota: 200Gi
        high: "0.95"
        low: "0.7"
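A DataLoad can then pre-warm the cache for the Dataset above before the training job starts; a minimal sketch (the target path is illustrative):

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: imagenet-warmup
  namespace: ml-training
spec:
  dataset:
    name: imagenet          # the Dataset defined above
    namespace: ml-training
  target:
    - path: /               # pre-warm the whole mount; narrower paths also work
      replicas: 1           # number of cached copies per file
```

Running this before the job means even epoch 1 reads from cache instead of paying the cold-start fetch from S3.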

Alluxio and JuiceFS for Tiered Caching

Alluxio provides a unified namespace over multiple backends but has partial POSIX support (missing truncate, symlinks, xattr). JuiceFS provides full POSIX compliance with separated metadata/data architecture and a three-layer cache stack achieving 300-800 MB/s per node:

  1. Layer 1: Kernel page cache (RAM, sub-microsecond)
  2. Layer 2: Local NVMe SSD (300-800 MB/s)
  3. Layer 3: Object store (S3/MinIO, network latency)
Feature          | Alluxio                     | JuiceFS
-----------------|-----------------------------|------------------------------
POSIX compliance | Partial                     | Full
Metadata         | Integrated                  | Separated (pluggable)
Cache perf       | High                        | Very high (300-800 MB/s)
Complexity       | High                        | Moderate
Multi-storage    | Strong (HDFS, S3, NFS, GCS) | S3-compatible

Prefetching and Data Locality Strategies

Prefetching (pre-warming the cache before a job starts) and data-affinity scheduling work together: the cluster pays the remote-fetch cost at most once, and subsequent reads are served from node-local cache on the nodes where the training pods were placed.

Data Caching Pipeline: Cache Hits and Misses During Training

[Figure: data caching pipeline, cache hits and misses during training]
- Epoch 1 (cold cache): DataLoader pods issue POSIX reads through the Fluid/JuiceFS cache layer (FUSE and worker pods); every read misses, data is fetched from remote object storage (s3://my-bucket/imagenet/) at network latency, and the local SSD cache and RAM page cache fill.
- Epoch 2+ (warm cache): reads hit the local SSD or page cache at 300-800 MB/s per node, removing the network from the hot path.

Key Takeaway

A caching layer such as Fluid with Alluxio or JuiceFS turns low-cost object storage into a warm POSIX cache: the first epoch pays the network cost once, and later epochs read at local-SSD speed from pods scheduled next to their cached data.

Post-Quiz: Test What You Learned

Post-Study Assessment

1. Which I/O characteristic best describes AI model training workloads?

Low-latency random reads of small key-value pairs
Sustained high-throughput sequential/random reads across repeated epochs
Write-heavy append-only logging
Infrequent large sequential writes with no reads

2. What Kubernetes access mode is required for distributed training where multiple pods must read the same dataset?

ReadWriteOnce (RWO)
ReadOnlyMany (ROX)
ReadWriteMany (RWX)
SingleNodeWriter (SNW)

3. What is the primary role of a CSI driver in Kubernetes?

To schedule pods onto GPU nodes
To decouple storage provisioning from the Kubernetes core via a standard interface
To compress training data before writing to disk
To replicate pods across availability zones

4. Why is Lustre preferred over NFS for large-scale distributed training?

Lustre is free while NFS requires a license
Lustre stripes data across multiple servers for linearly scalable aggregate bandwidth
NFS does not support any form of shared access
Lustre stores data in object storage format natively

5. What Kubernetes resource does Fluid use to declare the location of training data?

ConfigMap
Dataset CRD
PersistentVolume
DaemonSet

6. Which storage solution provides block, file, and object interfaces from a single on-premises platform?

Amazon EBS
NFS
Rook-Ceph
hostPath volumes

7. What is the main advantage of JuiceFS over Alluxio for AI training workloads?

JuiceFS supports more cloud providers
JuiceFS provides full POSIX compliance while Alluxio has partial support
JuiceFS is faster at object storage reads
JuiceFS requires no metadata engine

8. What does "data-affinity scheduling" in Fluid accomplish?

It replicates data to every node in the cluster
It places training pods on nodes that already hold cached copies of required data
It encrypts data at rest on each node
It compresses datasets to reduce storage costs

9. Which storage type is best suited for checkpoint writes during training?

Low-latency block storage optimized for random reads
High burst write throughput storage with strong durability
Object storage with eventual consistency
In-memory tmpfs volumes

10. What is the purpose of a Fluid DataLoad CRD?

To define GPU resource limits for training pods
To pre-warm the cache by pulling data before a training job starts
To migrate data between cloud providers
To delete stale cached data after training completes

11. What does volumeBindingMode: WaitForFirstConsumer accomplish in a StorageClass?

It delays volume creation until a pod actually uses the PVC
It creates the volume immediately but delays formatting
It waits for the cluster administrator to approve the volume
It queues volume creation until cluster resources are idle

12. In JuiceFS's three-layer cache stack, what is Layer 1?

Local NVMe SSD
Object store (S3/MinIO)
Kernel page cache (RAM)
Redis metadata cache

13. Why should checkpoint storage and training data storage use separate tiers?

Checkpoints require encryption but training data does not
Checkpoint write bursts can consume bandwidth needed by training reads if they share a tier
Training data must be stored in object format while checkpoints use block format
Kubernetes does not allow two PVCs on the same StorageClass
