1. Which I/O characteristic best describes AI model training workloads?
Low-latency random reads of small key-value pairs
Sustained high-throughput sequential/random reads across repeated epochs
Write-heavy append-only logging
Infrequent large sequential writes with no reads
2. What Kubernetes access mode is required for distributed training where multiple pods must read the same dataset?
ReadWriteOnce (RWO)
ReadOnlyMany (ROX)
ReadWriteMany (RWX)
SingleNodeWriter (SNW)
3. What is the primary role of a CSI driver in Kubernetes?
To schedule pods onto GPU nodes
To decouple storage provisioning from the Kubernetes core via a standard interface
To compress training data before writing to disk
To replicate pods across availability zones
4. Why is Lustre preferred over NFS for large-scale distributed training?
Lustre is free while NFS requires a license
Lustre stripes data across multiple servers for linearly scalable aggregate bandwidth
NFS does not support any form of shared access
Lustre stores data in object storage format natively
5. What Kubernetes resource does Fluid use to declare the location of training data?
ConfigMap
Dataset CRD
PersistentVolume
DaemonSet
6. Which storage solution provides block, file, and object interfaces from a single on-premises platform?
Amazon EBS
NFS
Rook-Ceph
hostPath volumes
7. What is the main advantage of JuiceFS over Alluxio for AI training workloads?
JuiceFS supports more cloud providers
JuiceFS provides full POSIX compliance while Alluxio has partial support
JuiceFS is faster at object storage reads
JuiceFS requires no metadata engine
8. What does "data-affinity scheduling" in Fluid accomplish?
It replicates data to every node in the cluster
It places training pods on nodes that already hold cached copies of required data
It encrypts data at rest on each node
It compresses datasets to reduce storage costs
9. Which storage type is best suited for checkpoint writes during training?
Low-latency block storage optimized for random reads
High burst write throughput storage with strong durability
Object storage with eventual consistency
In-memory tmpfs volumes
10. What is the purpose of a Fluid DataLoad CRD?
To define GPU resource limits for training pods
To pre-warm the cache by pulling data before a training job starts
To migrate data between cloud providers
To delete stale cached data after training completes
11. What does volumeBindingMode: WaitForFirstConsumer accomplish in a StorageClass?
It delays volume creation until a pod actually uses the PVC
It creates the volume immediately but delays formatting
It waits for the cluster administrator to approve the volume
It queues volume creation until cluster resources are idle
12. In JuiceFS's three-layer cache stack, what is Layer 1?
Local NVMe SSD
Object store (S3/MinIO)
Kernel page cache (RAM)
Redis metadata cache
13. Why should checkpoint storage and training data storage use separate tiers?
Checkpoints require encryption but training data does not
Checkpoint write bursts can consume bandwidth needed by training reads if they share a tier
Training data must be stored in object format while checkpoints use block format
Kubernetes does not allow two PVCs on the same StorageClass
Before choosing a storage technology, it is essential to understand what AI workloads actually demand. Training, checkpointing, and inference each have fundamentally different I/O profiles, and choosing a single storage tier for all three is rarely the right answer.
Dataset Sizes and I/O Patterns
Training is overwhelmingly read-heavy. A large language model training run may iterate over terabytes of tokenized text hundreds of times across multiple epochs, generating sustained high-bandwidth read traffic. Both IOPS and throughput matter simultaneously, since individual read requests are typically small (tens of kilobytes) but aggregate to tens or hundreds of GB/s across a distributed job.
Inference is quite different. Model weights are loaded once at startup (a large sequential read), and then each serving request triggers small, latency-sensitive reads. Inference optimization focuses on low latency for random reads rather than raw throughput.
| Characteristic | Training | Inference |
| --- | --- | --- |
| Primary operation | Sequential/random reads | Random reads (small KV) |
| Access frequency | Repeated (multi-epoch) | Continuous, latency-sensitive |
| Data volume | Tens to hundreds of TB | Model weights + serving cache |
| Throughput priority | Very high (GB/s aggregate) | Moderate |
| Latency priority | Moderate | Very low (ms) |
| Write pattern | Checkpoints (periodic burst) | Logs, output tokens (small) |
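A quick way to make these numbers concrete is to estimate the aggregate read bandwidth a run demands from dataset size, epoch count, and wall-clock budget. A back-of-the-envelope sketch (the 50 TB / 100 epoch / 7-day figures are illustrative assumptions, not measurements):

```python
def required_read_bandwidth_gbps(dataset_tb: float, epochs: int, hours: float) -> float:
    """Aggregate read bandwidth (GB/s) needed to stream the dataset
    `epochs` times within `hours` of wall-clock time."""
    total_gb = dataset_tb * 1000 * epochs  # total data read over the run, in GB
    return total_gb / (hours * 3600)       # spread over the run duration, in GB/s

# Illustrative: 50 TB dataset, 100 epochs, 7-day run
bw = required_read_bandwidth_gbps(50, 100, 7 * 24)
print(f"{bw:.1f} GB/s aggregate")  # roughly 8.3 GB/s
```

Even this conservative example lands in multi-GB/s territory, which is why single-server NFS rarely suffices for distributed training.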
POSIX vs Object Storage Tradeoffs
Most AI frameworks (PyTorch DataLoader, TensorFlow tf.data) expect a POSIX filesystem with familiar directory trees, file opens, seeks, and symbolic links. Object storage (S3, GCS, MinIO) organises data as key-value blobs and does not natively support operations like truncate, append, or symbolic links.
Practical AI pipelines adopt a hybrid approach: raw datasets live in object storage for durability and cost efficiency, while a POSIX-compatible cache layer (JuiceFS, a parallel file system) presents data to training pods as a conventional filesystem.
Checkpoint and Model Artifact Storage
Checkpoints are periodic snapshots of model weights, optimizer state, and training metadata. They impose distinctive requirements: high burst write throughput, strong durability, and rare reads (only on restart). Platform teams should provision separate storage tiers for checkpoints with QoS enforcement to prevent write bursts from crowding out training read bandwidth.
Completed model artefacts (final trained weights, tokenizers, configs) are written once and read frequently by inference servers. Object storage is ideal here, providing versioning, content-addressable retrieval, and global distribution.
PersistentVolumes, PVCs, and StorageClasses
A PersistentVolume (PV) is a cluster-level resource representing a piece of storage that exists independently of any pod. A PersistentVolumeClaim (PVC) is a namespaced request for storage. Pods reference PVCs (not PVs directly), and Kubernetes matches or dynamically provisions the appropriate PV. A StorageClass defines a profile of storage (provisioner, performance tier, reclaim policy) and enables dynamic provisioning.
```yaml
# StorageClass: high-performance NVMe-backed block storage
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nvme-training
provisioner: ebs.csi.aws.com
parameters:
  type: io2
  iopsPerGB: "50"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```
```yaml
# PVC requesting 500 GiB of NVMe storage
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-scratch
  namespace: ml-training
spec:
  storageClassName: nvme-training
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Gi
```
CSI Drivers for Cloud and On-Premises Storage
The Container Storage Interface (CSI) decouples storage provisioning from the Kubernetes core. A CSI driver consists of a Controller plugin (Deployment, handles volume lifecycle), a Node plugin (DaemonSet, handles mount/unmount), and an External provisioner sidecar (watches for new PVCs).
| Driver | Backend | Best Fit |
| --- | --- | --- |
| ebs.csi.aws.com | Amazon EBS | Single-node training scratch |
| efs.csi.aws.com | Amazon EFS (NFS) | Shared datasets, RWX |
| fsx.csi.aws.com | FSx for Lustre | High-throughput distributed training |
| rook-ceph.rbd.csi.ceph.com | Rook-Ceph (block) | On-premises block storage |
| rook-ceph.cephfs.csi.ceph.com | Rook-Ceph (file) | On-premises RWX |
| csi.juicefs.com | JuiceFS | Distributed POSIX cache layer |
ReadWriteMany Access Modes for Distributed Training
Kubernetes PVs support three access modes: ReadWriteOnce (RWO), ReadOnlyMany (ROX), and ReadWriteMany (RWX). RWX is the critical mode for distributed training where hundreds of pods must read from the same dataset PVC. Only network/parallel file systems support RWX natively; block storage cannot.
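A shared-dataset claim for distributed training therefore requests RWX. A sketch, assuming an RWX-capable StorageClass (the class name `efs-datasets` is an illustrative assumption; CephFS or JuiceFS classes work the same way):

```yaml
# RWX claim so many training pods can mount the same dataset
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: imagenet-shared
  namespace: ml-training
spec:
  storageClassName: efs-datasets  # assumed name of an RWX-capable class
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Ti
```

Every pod in the training job can then mount this one PVC read-only or read-write, rather than each needing its own copy of the data.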
Network File Systems: NFS, Lustre, and BeeGFS
NFS is the simplest shared filesystem: a central server exports directories that any node can mount as RWX PVs. A single NFS server, however, becomes a bandwidth bottleneck at multi-GB/s scale.
Lustre is a high-performance parallel filesystem that separates metadata (MDS) from data (OSS). Data is striped across multiple object storage servers, allowing aggregate bandwidth to scale linearly with server count. Lustre is the go-to backend for the most demanding distributed training workloads.
BeeGFS offers similar metadata/data separation with simpler operational characteristics, popular in on-premises HPC environments.
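The striping idea behind Lustre's scalability can be shown in a few lines: chunks of a file map round-robin onto object storage servers, so a sequential scan fans out across every server at once. A simplified model (real Lustre layouts are per-file configurable; this sketch assumes a fixed stripe size and count):

```python
def oss_for_offset(offset: int, stripe_size: int, oss_count: int) -> int:
    """Round-robin mapping of a file byte offset to an object storage
    server (OSS) index, as in a simple striped layout."""
    return (offset // stripe_size) % oss_count

# With 1 MiB stripes over 4 servers, a sequential scan touches all 4,
# so aggregate bandwidth scales with server count.
stripe = 1 << 20
servers = {oss_for_offset(off, stripe, 4) for off in range(0, 8 * stripe, stripe)}
print(sorted(servers))  # [0, 1, 2, 3]
```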
Object Storage: S3, MinIO, GCS
MinIO is the most widely deployed self-hosted object storage for AI, speaking the S3 API. Object storage is not normally mounted as POSIX; training pipelines access it through SDKs or FUSE tools. For model artefact distribution, object storage provides version control and content-addressable retrieval.
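The SDK access pattern is list-then-get over keys rather than open/seek over paths. The sketch below models that pattern against an in-memory stand-in for an S3-compatible bucket; a real pipeline would use boto3 or the MinIO client, and the `FakeObjectStore` class here is purely illustrative:

```python
class FakeObjectStore:
    """In-memory stand-in for an S3-compatible bucket (illustrative only)."""
    def __init__(self, objects: dict[str, bytes]):
        self._objects = objects

    def list_keys(self, prefix: str) -> list[str]:
        # Mirrors a ListObjects call filtered by key prefix
        return sorted(k for k in self._objects if k.startswith(prefix))

    def get(self, key: str) -> bytes:
        # Mirrors a GetObject call
        return self._objects[key]

def iter_samples(store, prefix: str):
    """List-then-get: the canonical object-storage read pattern."""
    for key in store.list_keys(prefix):
        yield key, store.get(key)

store = FakeObjectStore({
    "imagenet/train/0001.tar": b"shard-1",
    "imagenet/train/0002.tar": b"shard-2",
})
for key, blob in iter_samples(store, "imagenet/train/"):
    print(key, len(blob))
```

Note what is missing compared to POSIX: no seek within an object, no append, no rename-in-place. That gap is exactly what FUSE layers and caches like JuiceFS paper over.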
Local NVMe and hostPath
The fastest storage available is local NVMe SSD, with sub-millisecond latency. It is best suited to scratch data, exposed either through hostPath volumes (simple, but node-tied) or through Local PersistentVolumes (which carry nodeAffinity rules so pods land on the node holding the disk).
```yaml
# Local PV backed by NVMe on a specific node
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-nvme-node01
spec:
  capacity:
    storage: 2Ti
  accessModes: [ReadWriteOnce]
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-nvme
  local:
    path: /mnt/nvme0
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: [gpu-node-01]
```
Rook-Ceph for Distributed Storage
Rook-Ceph deploys and manages Ceph within the cluster via a Kubernetes operator, providing three interfaces simultaneously:
| Interface | Kubernetes Access | Use Case |
| --- | --- | --- |
| RBD (block) | rook-ceph.rbd.csi.ceph.com | Checkpoints, single-pod scratch |
| CephFS (file) | rook-ceph.cephfs.csi.ceph.com | ReadWriteMany shared datasets |
| RGW (object) | S3-compatible API | Raw datasets, model artefacts |
Rook-Ceph turns commodity disks into a self-healing storage pool with automatic re-replication on node failure.
Fluid: Dataset Acceleration on Kubernetes
Fluid (CNCF sandbox) provides Kubernetes-native orchestration for distributed dataset caching via Dataset and Runtime CRDs. The Dataset CRD declares data location without coupling to storage technology. The Runtime CRD selects the caching engine (Alluxio, JuiceFS) and its parameters.
Data-affinity scheduling places training pods on nodes with cached data. Automated data operations include DataLoad (pre-warm), DataMigrate (tier movement), and DataBackup (snapshots).
```yaml
# Fluid Dataset + JuiceFSRuntime
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: imagenet
  namespace: ml-training
spec:
  mounts:
    - mountPoint: s3://my-bucket/imagenet/
      name: imagenet
      options:
        fs.s3a.endpoint: s3.amazonaws.com
---
apiVersion: data.fluid.io/v1alpha1
kind: JuiceFSRuntime
metadata:
  name: imagenet
  namespace: ml-training
spec:
  replicas: 4
  tieredstore:
    levels:
      - mediumtype: SSD
        path: /dev/shm
        quota: 200Gi
        high: "0.95"
        low: "0.7"
```
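A DataLoad resource can then pre-warm this cache before the training job launches. A sketch referencing the `imagenet` Dataset; field values such as the target path and replica count are illustrative, and exact schema details vary by Fluid version:

```yaml
# Pre-warm the imagenet cache ahead of the training job
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: imagenet-warmup
  namespace: ml-training
spec:
  dataset:
    name: imagenet
    namespace: ml-training
  loadMetadata: true   # also warm directory listings for epoch iteration
  target:
    - path: /          # warm the whole dataset (illustrative)
      replicas: 1
```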
Alluxio and JuiceFS for Tiered Caching
Alluxio provides a unified namespace over multiple backends but has partial POSIX support (missing truncate, symlinks, xattr). JuiceFS provides full POSIX compliance with separated metadata/data architecture and a three-layer cache stack achieving 300-800 MB/s per node:
- Layer 1: Kernel page cache (RAM, sub-microsecond)
- Layer 2: Local NVMe SSD (300-800 MB/s)
- Layer 3: Object store (S3/MinIO, network latency)
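The lookup order of such a stack, RAM first, SSD next, object store last, with promotion on a lower-tier hit, can be sketched as a toy read-through cache (assumptions: dict-backed tiers, no eviction or size accounting):

```python
class TieredCache:
    """Toy read-through cache modelling a three-layer stack.
    Checks fast tiers first, falls back to the backing store,
    then promotes the value into the faster tiers. Illustrative only."""
    def __init__(self, backing: dict):
        self.ram: dict = {}      # Layer 1: kernel page cache analogue
        self.ssd: dict = {}      # Layer 2: local NVMe analogue
        self.backing = backing   # Layer 3: object store analogue
        self.hits = {"ram": 0, "ssd": 0, "backing": 0}

    def read(self, key):
        if key in self.ram:
            self.hits["ram"] += 1
            return self.ram[key]
        if key in self.ssd:
            self.hits["ssd"] += 1
            self.ram[key] = self.ssd[key]  # promote to the RAM tier
            return self.ram[key]
        self.hits["backing"] += 1
        value = self.backing[key]          # slow path: fetch from object store
        self.ssd[key] = value              # populate both cache tiers
        self.ram[key] = value
        return value

cache = TieredCache({"shard-0": b"data"})
cache.read("shard-0")  # cold: served from the backing store
cache.read("shard-0")  # warm: served from RAM
print(cache.hits)      # {'ram': 1, 'ssd': 0, 'backing': 1}
```

Multi-epoch training is the best case for this design: after the first epoch pays the object-store cost once, subsequent epochs are served almost entirely from the top two tiers.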
| Feature | Alluxio | JuiceFS |
| --- | --- | --- |
| POSIX compliance | Partial | Full |
| Metadata | Integrated | Separated (pluggable) |
| Cache perf | High | Very high (300-800 MB/s) |
| Complexity | High | Moderate |
| Multi-storage | Strong (HDFS, S3, NFS, GCS) | S3-compatible |
Prefetching and Data Locality Strategies
- Fluid DataLoad (pre-warming): Pull the dataset into cache before training starts
- List caching on CSI FUSE drivers: Cache directory listings in the kernel for repeated epoch iterations
- File cache for multi-epoch training: Store frequently accessed data on local node storage
- Data-affinity pod scheduling: Schedule pods where their data already resides
- Storage QoS tiering: Enforce per-volume IOPS/bandwidth limits to prevent noisy neighbours
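The pre-warming idea also applies within a single job: overlap the next batch's I/O with the current batch's compute using a background thread and a bounded queue. A minimal stdlib sketch of that pattern (production loaders such as PyTorch's DataLoader use worker processes for the same effect):

```python
import threading
import queue

def prefetching_loader(fetch, keys, depth: int = 2):
    """Yield fetch(key) for each key, fetching ahead in a background
    thread so I/O overlaps with the consumer's compute."""
    q: queue.Queue = queue.Queue(maxsize=depth)
    sentinel = object()  # marks end of the key stream

    def worker():
        for key in keys:
            q.put(fetch(key))  # blocks once `depth` items are buffered
        q.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while (item := q.get()) is not sentinel:
        yield item

# Illustrative use: `fetch` would normally read a shard from storage
batches = list(prefetching_loader(lambda k: k * 2, [1, 2, 3]))
print(batches)  # [2, 4, 6]
```

The bounded queue is the important design choice: it caps memory use while still hiding storage latency behind compute.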