Chapter 3: Storage and Data Management for AI Pipelines

Learning Objectives

By the end of this chapter, you should be able to: characterize the I/O profiles of training, checkpointing, and inference; provision storage with PersistentVolumes, PVCs, StorageClasses, and CSI drivers; compare NFS, Lustre, object storage, local NVMe, and Rook-Ceph for AI workloads; and accelerate data pipelines with Fluid, Alluxio, and JuiceFS.

Pre-Quiz: Test Your Current Knowledge

Pre-Study Assessment

1. Which I/O characteristic best describes AI model training workloads?

Low-latency random reads of small key-value pairs
Sustained high-throughput sequential/random reads across repeated epochs
Write-heavy append-only logging
Infrequent large sequential writes with no reads

2. What Kubernetes access mode is required for distributed training where multiple pods must read the same dataset?

ReadWriteOnce (RWO)
ReadOnlyMany (ROX)
ReadWriteMany (RWX)
SingleNodeWriter (SNW)

3. What is the primary role of a CSI driver in Kubernetes?

To schedule pods onto GPU nodes
To decouple storage provisioning from the Kubernetes core via a standard interface
To compress training data before writing to disk
To replicate pods across availability zones

4. Why is Lustre preferred over NFS for large-scale distributed training?

Lustre is free while NFS requires a license
Lustre stripes data across multiple servers for linearly scalable aggregate bandwidth
NFS does not support any form of shared access
Lustre stores data in object storage format natively

5. What Kubernetes resource does Fluid use to declare the location of training data?

ConfigMap
Dataset CRD
PersistentVolume
DaemonSet

6. Which storage solution provides block, file, and object interfaces from a single on-premises platform?

Amazon EBS
NFS
Rook-Ceph
hostPath volumes

7. What is the main advantage of JuiceFS over Alluxio for AI training workloads?

JuiceFS supports more cloud providers
JuiceFS provides full POSIX compliance while Alluxio has partial support
JuiceFS is faster at object storage reads
JuiceFS requires no metadata engine

8. What does "data-affinity scheduling" in Fluid accomplish?

It replicates data to every node in the cluster
It places training pods on nodes that already hold cached copies of required data
It encrypts data at rest on each node
It compresses datasets to reduce storage costs

9. Which storage type is best suited for checkpoint writes during training?

Low-latency block storage optimized for random reads
High burst write throughput storage with strong durability
Object storage with eventual consistency
In-memory tmpfs volumes

10. What is the purpose of a Fluid DataLoad CRD?

To define GPU resource limits for training pods
To pre-warm the cache by pulling data before a training job starts
To migrate data between cloud providers
To delete stale cached data after training completes

11. What does volumeBindingMode: WaitForFirstConsumer accomplish in a StorageClass?

It delays volume creation until a pod actually uses the PVC
It creates the volume immediately but delays formatting
It waits for the cluster administrator to approve the volume
It queues volume creation until cluster resources are idle

12. In JuiceFS's three-layer cache stack, what is Layer 1?

Local NVMe SSD
Object store (S3/MinIO)
Kernel page cache (RAM)
Redis metadata cache

13. Why should checkpoint storage and training data storage use separate tiers?

Checkpoints require encryption but training data does not
Checkpoint write bursts can consume bandwidth needed by training reads if they share a tier
Training data must be stored in object format while checkpoints use block format
Kubernetes does not allow two PVCs on the same StorageClass

Section 1: Storage Requirements for AI Workloads

Before choosing a storage technology, it is essential to understand what AI workloads actually demand. Training, checkpointing, and inference each have fundamentally different I/O profiles, and choosing a single storage tier for all three is rarely the right answer.

Dataset Sizes and I/O Patterns

Training is overwhelmingly read-heavy. A large language model training run may iterate over terabytes of tokenized text hundreds of times across multiple epochs, generating sustained high-bandwidth read traffic. Both IOPS and throughput matter simultaneously, since individual read requests are typically small (tens of kilobytes) but aggregate to tens or hundreds of GB/s across a distributed job.

Inference is quite different. Model weights are loaded once at startup (a large sequential read), and then each serving request triggers small, latency-sensitive reads. Inference optimization focuses on low latency for random reads rather than raw throughput.

Characteristic       | Training                     | Inference
---------------------|------------------------------|-------------------------------
Primary operation    | Sequential/random reads      | Random reads (small KV)
Access frequency     | Repeated (multi-epoch)       | Continuous, latency-sensitive
Data volume          | Tens to hundreds of TB       | Model weights + serving cache
Throughput priority  | Very high (GB/s aggregate)   | Moderate
Latency priority     | Moderate                     | Very low (ms)
Write pattern        | Checkpoints (periodic burst) | Logs, output tokens (small)

POSIX vs Object Storage Tradeoffs

Most AI frameworks (PyTorch DataLoader, TensorFlow tf.data) expect a POSIX filesystem with familiar directory trees, file opens, seeks, and symbolic links. Object storage (S3, GCS, MinIO) organises data as key-value blobs and does not natively support operations like truncate, append, or symbolic links.

Practical AI pipelines adopt a hybrid approach: raw datasets live in object storage for durability and cost efficiency, while a POSIX-compatible cache layer (such as JuiceFS or a parallel file system) presents data to training pods as a conventional filesystem.

Checkpoint and Model Artifact Storage

Checkpoints are periodic snapshots of model weights, optimizer state, and training metadata. They impose distinctive requirements: high burst write throughput, strong durability, and rare reads (only on restart). Platform teams should provision separate storage tiers for checkpoints with QoS enforcement to prevent write bursts from crowding out training read bandwidth.
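One way to enforce this separation is to give checkpoints their own StorageClass and PVC. The following is a minimal sketch, assuming the AWS EBS CSI driver; the class name, volume type, and sizes are illustrative, not prescriptive:

```yaml
# Hypothetical checkpoint tier, kept separate from the training-data
# StorageClass so write bursts do not contend with training reads.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: checkpoint-tier            # assumed name
provisioner: ebs.csi.aws.com
parameters:
  type: st1                        # throughput-optimized; gp3/io2 are alternatives
reclaimPolicy: Retain              # keep checkpoints even if the PVC is deleted
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: checkpoints
  namespace: ml-training
spec:
  storageClassName: checkpoint-tier
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Ti
```

The Retain reclaim policy is the notable choice: unlike scratch volumes, checkpoint data should survive PVC deletion.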

Completed model artefacts (final trained weights, tokenizers, configs) are written once and read frequently by inference servers. Object storage is ideal here, providing versioning, content-addressable retrieval, and global distribution.

Storage Tiering: Data Flow from Hot to Cold

[Figure: Storage tiering, with data aging from hot to cold]
- HOT TIER (high performance / high cost): local NVMe SSD; sub-ms latency, multi-GB/s throughput. Use cases: active training batches, scratch computation, checkpoint writes.
- WARM TIER: NFS / Lustre / CephFS; network-attached, shared, ReadWriteMany access. Use cases: shared training datasets, multi-pod read access, distributed training.
- COLD TIER (low cost / high scale): S3 / MinIO / GCS object storage; petabyte-scale, durable archive. Use cases: raw dataset archive, model artifact storage, long-term checkpoints.

Key Takeaway

Training, checkpointing, and inference have distinct I/O profiles (sustained high-bandwidth reads, periodic burst writes, and latency-sensitive reads, respectively), so match each workload to its own storage tier rather than forcing all three onto one.

Section 2: Kubernetes Storage Primitives

PersistentVolumes, PVCs, and StorageClasses

A PersistentVolume (PV) is a cluster-level resource representing a piece of storage that exists independently of any pod. A PersistentVolumeClaim (PVC) is a namespaced request for storage. Pods reference PVCs (not PVs directly), and Kubernetes matches or dynamically provisions the appropriate PV. A StorageClass defines a profile of storage (provisioner, performance tier, reclaim policy) and enables dynamic provisioning.

# StorageClass: high-performance NVMe-backed block storage
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nvme-training
provisioner: ebs.csi.aws.com
parameters:
  type: io2
  iopsPerGB: "50"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
# PVC requesting 500 GiB of NVMe storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-scratch
  namespace: ml-training
spec:
  storageClassName: nvme-training
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Gi
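A pod then consumes the claim by name; a minimal sketch (the container image and mount path are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
  namespace: ml-training
spec:
  containers:
    - name: train
      image: pytorch/pytorch:latest   # illustrative image
      volumeMounts:
        - name: scratch
          mountPath: /scratch         # fast node-local scratch for the job
  volumes:
    - name: scratch
      persistentVolumeClaim:
        claimName: training-scratch   # the PVC defined above
```

Because the StorageClass uses WaitForFirstConsumer, the EBS volume is created in the same availability zone as the node this pod lands on.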

CSI Drivers for Cloud and On-Premises Storage

The Container Storage Interface (CSI) decouples storage provisioning from the Kubernetes core. A CSI driver consists of a Controller plugin (Deployment, handles volume lifecycle), a Node plugin (DaemonSet, handles mount/unmount), and an External provisioner sidecar (watches for new PVCs).

Driver                        | Backend           | Best Fit
------------------------------|-------------------|-------------------------------------
ebs.csi.aws.com               | Amazon EBS        | Single-node training scratch
efs.csi.aws.com               | Amazon EFS (NFS)  | Shared datasets, RWX
fsx.csi.aws.com               | FSx for Lustre    | High-throughput distributed training
rook-ceph.rbd.csi.ceph.com    | Rook-Ceph (block) | On-premises block storage
rook-ceph.cephfs.csi.ceph.com | Rook-Ceph (file)  | On-premises RWX
csi.juicefs.com               | JuiceFS           | Distributed POSIX cache layer

ReadWriteMany Access Modes for Distributed Training

Kubernetes PVs support three access modes: ReadWriteOnce (RWO), ReadOnlyMany (ROX), and ReadWriteMany (RWX). RWX is the critical mode for distributed training where hundreds of pods must read from the same dataset PVC. Only network/parallel file systems support RWX natively; block storage cannot.
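A shared dataset claim looks like the sketch below, assuming a network-file-system-backed StorageClass (the class name is an assumption):

```yaml
# RWX claim: many training pods mount the same dataset, read-mostly.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-dataset
  namespace: ml-training
spec:
  storageClassName: efs-shared   # assumed NFS/EFS-backed StorageClass
  accessModes:
    - ReadWriteMany              # required for multi-pod shared access
  resources:
    requests:
      storage: 10Ti
```

If the backing StorageClass were block storage (e.g. EBS), this claim would fail to bind, since block volumes cannot offer RWX.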

CSI Driver Architecture: PVC to Storage Backend Provisioning Flow

[Figure: CSI provisioning flow from PVC to storage backend]
1. A training pod mounts a PVC (500Gi, RWO) at /data; the PVC references the nvme-training StorageClass.
2. The StorageClass selects the CSI driver (ebs.csi.aws.com); the external provisioner sidecar in the Controller plugin (Deployment) creates a PersistentVolume and triggers provisioning.
3. The backend volume (AWS EBS io2 NVMe, physical infrastructure) is provisioned and bound to the PV; the Node plugin (DaemonSet) mounts it on the pod's node.

Key Takeaway

PVCs decouple workloads from storage details: a StorageClass plus a CSI driver dynamically provisions the right backend on demand, and ReadWriteMany is the access mode that makes shared datasets possible for distributed training.

Section 3: High-Performance Storage Solutions

Network File Systems: NFS, Lustre, and BeeGFS

NFS is the simplest shared filesystem -- a central server exports directories that any node can mount as RWX PVs. However, a single NFS server becomes a bandwidth bottleneck at multi-GB/s scale.

Lustre is a high-performance parallel filesystem that separates metadata (MDS) from data (OSS). Data is striped across multiple object storage servers, allowing aggregate bandwidth to scale linearly with server count. Lustre is the go-to backend for the most demanding distributed training workloads.

BeeGFS offers similar metadata/data separation with simpler operational characteristics, popular in on-premises HPC environments.
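On AWS, Lustre can be provisioned dynamically through the FSx CSI driver listed earlier. A sketch of such a StorageClass follows; parameter names follow the AWS FSx for Lustre CSI driver, and the network IDs are placeholders:

```yaml
# Sketch: dynamically provisioned FSx for Lustre filesystem per PVC.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fsx-lustre
provisioner: fsx.csi.aws.com
parameters:
  subnetId: subnet-0123456789abcdef0       # placeholder VPC subnet
  securityGroupIds: sg-0123456789abcdef0   # placeholder security group
  deploymentType: SCRATCH_2                # scratch tier suits transient training runs
mountOptions:
  - flock                                  # enable file locking for frameworks that need it
```

SCRATCH deployment types trade durability for cost and throughput, which fits datasets that can be re-staged from object storage.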

Object Storage: S3, MinIO, GCS

MinIO is the most widely deployed self-hosted object storage for AI, speaking the S3 API. Object storage is not normally mounted as POSIX; training pipelines access it through SDKs or FUSE tools. For model artefact distribution, object storage provides version control and content-addressable retrieval.

Local NVMe and hostPath

The fastest storage available is local NVMe SSD with sub-millisecond latency. Ideal for scratch data via hostPath volumes (simple, node-tied) or Local PersistentVolumes (with nodeAffinity rules).

# Local PV backed by NVMe on a specific node
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-nvme-node01
spec:
  capacity:
    storage: 2Ti
  accessModes: [ReadWriteOnce]
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-nvme
  local:
    path: /mnt/nvme0
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: [gpu-node-01]
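Local PVs are not dynamically provisioned; they pair with a no-provisioner StorageClass so that binding waits until a pod is actually scheduled onto the node holding the disk:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme
provisioner: kubernetes.io/no-provisioner  # PVs are created manually or by an operator
volumeBindingMode: WaitForFirstConsumer    # bind only once a pod is scheduled
```

Without WaitForFirstConsumer, a PVC could bind to the PV on gpu-node-01 while the scheduler places the pod elsewhere, leaving the pod unschedulable.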

Rook-Ceph for Distributed Storage

Rook-Ceph deploys and manages Ceph within the cluster via a Kubernetes operator, providing three interfaces simultaneously:

Interface     | Kubernetes Access             | Use Case
--------------|-------------------------------|--------------------------------
RBD (block)   | rook-ceph.rbd.csi.ceph.com    | Checkpoints, single-pod scratch
CephFS (file) | rook-ceph.cephfs.csi.ceph.com | ReadWriteMany shared datasets
RGW (object)  | S3-compatible API             | Raw datasets, model artefacts

Rook-Ceph turns commodity disks into a self-healing storage pool with automatic re-replication on node failure.
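A CephFS-backed RWX StorageClass is sketched below, loosely following the Rook examples; the clusterID, filesystem, and pool names are assumptions that must match the deployed CephCluster and CephFilesystem, and the CSI secret parameters are omitted for brevity:

```yaml
# Sketch: CephFS StorageClass for ReadWriteMany shared datasets.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-shared
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph   # namespace of the Rook operator (assumed)
  fsName: myfs           # CephFilesystem name (assumed)
  pool: myfs-data0       # data pool backing the filesystem (assumed)
reclaimPolicy: Delete
```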

Key Takeaway

Choose the backend by scale and access pattern: NFS for simple shared access, Lustre or BeeGFS when aggregate bandwidth must scale with server count, local NVMe for scratch, and Rook-Ceph for unified block/file/object storage on-premises.

Section 4: Data Caching and Pipeline Optimization

Fluid: Dataset Acceleration on Kubernetes

Fluid (CNCF sandbox) provides Kubernetes-native orchestration for distributed dataset caching via Dataset and Runtime CRDs. The Dataset CRD declares data location without coupling to storage technology. The Runtime CRD selects the caching engine (Alluxio, JuiceFS) and its parameters.

Data-affinity scheduling places training pods on nodes with cached data. Automated data operations include DataLoad (pre-warm), DataMigrate (tier movement), and DataBackup (snapshots).

# Fluid Dataset + JuiceFSRuntime
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: imagenet
  namespace: ml-training
spec:
  mounts:
    - mountPoint: s3://my-bucket/imagenet/
      name: imagenet
      options:
        fs.s3a.endpoint: s3.amazonaws.com
---
apiVersion: data.fluid.io/v1alpha1
kind: JuiceFSRuntime
metadata:
  name: imagenet
  namespace: ml-training
spec:
  replicas: 4
  tieredstore:
    levels:
      - mediumtype: SSD
        path: /mnt/cache        # node-local SSD cache directory
        quota: 200Gi
        high: "0.95"
        low: "0.7"
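A DataLoad can then pre-warm the cache for the Dataset above before the training job starts; a minimal sketch (the target path is illustrative):

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: imagenet-warmup
  namespace: ml-training
spec:
  dataset:
    name: imagenet          # the Dataset defined above
    namespace: ml-training
  target:
    - path: /               # pre-warm the whole mount; narrower paths also work
      replicas: 1           # number of cached copies per file
```

Running this before the job means even epoch 1 reads from cache instead of paying the cold-start fetch from S3.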

Alluxio and JuiceFS for Tiered Caching

Alluxio provides a unified namespace over multiple backends but has partial POSIX support (missing truncate, symlinks, xattr). JuiceFS provides full POSIX compliance with separated metadata/data architecture and a three-layer cache stack achieving 300-800 MB/s per node:

  1. Layer 1: Kernel page cache (RAM, sub-microsecond)
  2. Layer 2: Local NVMe SSD (300-800 MB/s)
  3. Layer 3: Object store (S3/MinIO, network latency)
Feature          | Alluxio                     | JuiceFS
-----------------|-----------------------------|------------------------------
POSIX compliance | Partial                     | Full
Metadata         | Integrated                  | Separated (pluggable)
Cache perf       | High                        | Very high (300-800 MB/s)
Complexity       | High                        | Moderate
Multi-storage    | Strong (HDFS, S3, NFS, GCS) | S3-compatible

Prefetching and Data Locality Strategies

Prefetching (pre-warming the cache before a job starts) and data-affinity scheduling work together: the cluster pays the remote-fetch cost at most once, and subsequent reads are served from node-local cache on the nodes where the training pods were placed.

Data Caching Pipeline: Cache Hits and Misses During Training

[Figure: data caching pipeline, cache hits and misses during training]
- Epoch 1 (cold cache): DataLoader pods issue POSIX reads through the Fluid/JuiceFS cache layer (FUSE and worker pods); every read misses, data is fetched from remote object storage (s3://my-bucket/imagenet/) at network latency, and the local SSD cache and RAM page cache fill.
- Epoch 2+ (warm cache): reads hit the local SSD or page cache at 300-800 MB/s per node, removing the network from the hot path.

Key Takeaway

A caching layer such as Fluid with Alluxio or JuiceFS turns low-cost object storage into a warm POSIX cache: the first epoch pays the network cost once, and later epochs read at local-SSD speed from pods scheduled next to their cached data.

Post-Quiz: Test What You Learned

Post-Study Assessment

1. Which I/O characteristic best describes AI model training workloads?

Low-latency random reads of small key-value pairs
Sustained high-throughput sequential/random reads across repeated epochs
Write-heavy append-only logging
Infrequent large sequential writes with no reads

2. What Kubernetes access mode is required for distributed training where multiple pods must read the same dataset?

ReadWriteOnce (RWO)
ReadOnlyMany (ROX)
ReadWriteMany (RWX)
SingleNodeWriter (SNW)

3. What is the primary role of a CSI driver in Kubernetes?

To schedule pods onto GPU nodes
To decouple storage provisioning from the Kubernetes core via a standard interface
To compress training data before writing to disk
To replicate pods across availability zones

4. Why is Lustre preferred over NFS for large-scale distributed training?

Lustre is free while NFS requires a license
Lustre stripes data across multiple servers for linearly scalable aggregate bandwidth
NFS does not support any form of shared access
Lustre stores data in object storage format natively

5. What Kubernetes resource does Fluid use to declare the location of training data?

ConfigMap
Dataset CRD
PersistentVolume
DaemonSet

6. Which storage solution provides block, file, and object interfaces from a single on-premises platform?

Amazon EBS
NFS
Rook-Ceph
hostPath volumes

7. What is the main advantage of JuiceFS over Alluxio for AI training workloads?

JuiceFS supports more cloud providers
JuiceFS provides full POSIX compliance while Alluxio has partial support
JuiceFS is faster at object storage reads
JuiceFS requires no metadata engine

8. What does "data-affinity scheduling" in Fluid accomplish?

It replicates data to every node in the cluster
It places training pods on nodes that already hold cached copies of required data
It encrypts data at rest on each node
It compresses datasets to reduce storage costs

9. Which storage type is best suited for checkpoint writes during training?

Low-latency block storage optimized for random reads
High burst write throughput storage with strong durability
Object storage with eventual consistency
In-memory tmpfs volumes

10. What is the purpose of a Fluid DataLoad CRD?

To define GPU resource limits for training pods
To pre-warm the cache by pulling data before a training job starts
To migrate data between cloud providers
To delete stale cached data after training completes

11. What does volumeBindingMode: WaitForFirstConsumer accomplish in a StorageClass?

It delays volume creation until a pod actually uses the PVC
It creates the volume immediately but delays formatting
It waits for the cluster administrator to approve the volume
It queues volume creation until cluster resources are idle

12. In JuiceFS's three-layer cache stack, what is Layer 1?

Local NVMe SSD
Object store (S3/MinIO)
Kernel page cache (RAM)
Redis metadata cache

13. Why should checkpoint storage and training data storage use separate tiers?

Checkpoints require encryption but training data does not
Checkpoint write bursts can consume bandwidth needed by training reads if they share a tier
Training data must be stored in object format while checkpoints use block format
Kubernetes does not allow two PVCs on the same StorageClass
