Chapter 5 — Data and Pipeline Versioning

Learning Objectives

Section 1: Reproducibility in ML

Pre-Reading Quiz — Reproducibility

1. Why is git alone insufficient for versioning an ML workflow?

a. Git cannot track Python files that import non-standard libraries.
b. Git poorly handles large binary datasets, ephemeral environments, and stochastic state.
c. Git only supports one branch at a time, so parallel experiments are impossible.
d. Git automatically deletes files larger than 1 MB, breaking model artifacts.

2. Which tuple uniquely defines a reproducible ML run?

a. (branch name, dataset name, OS version)
b. (git commit, data version, container image digest, seed/config)
c. (MLflow run ID, hostname, timestamp)
d. (model accuracy, loss value, training duration)

3. Which level of reproducibility requires identical floating-point outputs across runs?

a. Conceptual reproducibility
b. Statistical reproducibility
c. Numerical reproducibility
d. Bitwise reproducibility

An ML system is the marriage of code, data, environment, and stochastic processes. Each can drift independently and silently invalidate yesterday's results. A reproducible experiment is one where a tuple of (git commit, data version, container image digest, seed/config) uniquely defines the run.
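
One way to make the run tuple concrete is to hash it into a single fingerprint that can be attached to logs and artifacts. The sketch below is illustrative only: the class name, field names, and values are placeholders, not part of any particular tool.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunIdentity:
    """The four coordinates that pin down a run (field names are illustrative)."""
    git_commit: str
    data_version: str
    image_digest: str
    seed: int

    def fingerprint(self) -> str:
        # Serialize with sorted keys so the same identity always hashes the same way.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


# Placeholder values; real runs would use full commit SHAs and digests.
run_a = RunIdentity("f4a2c8e", "md5:9f1ec4", "sha256:abc123", 1234)
run_b = RunIdentity("f4a2c8e", "md5:9f1ec4", "sha256:abc123", 1234)
run_c = RunIdentity("f4a2c8e", "md5:9f1ec4", "sha256:abc123", 1235)  # only the seed differs
```

Two runs with the same four coordinates share a fingerprint; changing any single coordinate, here the seed, produces a different one.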

The Four Dimensions of ML Reproducibility

Dimension | Covers | Primary Tools | Common Failure
Code | Scripts, configs, pipelines | Git, Airflow, Kubeflow, dvc.yaml | Untracked notebook edits
Data | Raw inputs, splits, features, labels | DVC, lakeFS, Delta Lake | Silent overwrites, schema drift
Environment | OS, CUDA, libraries | Docker, pip-tools, Poetry, conda-lock | "Worked yesterday" post-upgrade
Randomness | Initialization, shuffling, dropout | Framework seed APIs, deterministic flags | One forgotten seed call

graph TD
    R[Reproducible ML Run]
    R --> C[Code Dimension]
    R --> D[Data Dimension]
    R --> E[Environment Dimension]
    R --> S[Randomness Dimension]
    C --> C1[Git commit SHA]
    C --> C2[Pipeline-as-code]
    D --> D1[DVC hashes]
    D --> D2[lakeFS commits]
    D --> D3[Delta Lake version]
    E --> E1[Container digest]
    E --> E2[Lockfiles]
    E --> E3[CUDA + driver]
    S --> S1[Seed APIs]
    S --> S2[Deterministic flags]
    S --> S3[Per-rank seed offsets]

Reproducibility Levels

PyTorch Determinism Playbook

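import os
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Affects only subprocesses; this interpreter's hash seed was fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Needed for deterministic cuBLAS; must be set before CUDA is initialized.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Raises a RuntimeError if an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False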

For DistributedDataParallel, derive each rank's seed as base_seed + rank so that workers do not draw identical random streams for shuffling, augmentation, and dropout. None of these flags guarantees bitwise reproducibility across different GPU architectures.
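
The per-rank offset can be sketched without a GPU by giving each simulated rank its own generator. The helper names rank_seed and first_draws are hypothetical. In a real DistributedDataParallel job the rank would typically come from torch.distributed.get_rank() and the derived seed would be passed to a seeding function such as set_seed.

```python
import random


def rank_seed(base_seed: int, rank: int) -> int:
    """Seed for one worker: base_seed + rank."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    return base_seed + rank


def first_draws(base_seed: int, world_size: int, n: int = 3) -> list[list[float]]:
    """Simulate each rank drawing n random numbers from its own generator."""
    draws = []
    for rank in range(world_size):
        rng = random.Random(rank_seed(base_seed, rank))  # independent generator per rank
        draws.append([rng.random() for _ in range(n)])
    return draws


run_1 = first_draws(base_seed=1234, world_size=4)
run_2 = first_draws(base_seed=1234, world_size=4)
```

Repeating the job with the same base seed reproduces every rank's stream, while different ranks within one job see different streams.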

Key Points

Section 2: Data Versioning Tools

Pre-Reading Quiz — Data Versioning Tools

1. In DVC, what does git actually store when you run dvc add data/raw/?

a. A compressed copy of the entire dataset.
b. A small .dvc metadata file containing the content hash; the bytes live in a DVC remote.
c. Nothing; DVC bypasses git entirely.
d. A symbolic link to S3 with embedded credentials.

2. What makes lakeFS branches cheap to create even over petabyte-scale data?

a. It physically duplicates objects in parallel using GPUs.
b. Copy-on-write semantics: objects are copied only when modified on the branch.
c. It limits branches to 1 GB maximum size.
d. It stores everything in RAM until merge.

3. How does Delta Lake provide time-travel queries on object storage?

a. It snapshots Parquet files every hour to a backup bucket.
b. A _delta_log directory records every transaction; engines read it to determine which files constitute a given table version.
c. It periodically uploads CSV exports to a versioning service.
d. It uses git submodules to track each Parquet file.

Large datasets do not fit in git, and git's text-oriented diffing is of little use on binary files. Three patterns dominate: a git-like project tool (DVC), a branchable layer over object storage (lakeFS), and a transactional table format (Delta Lake).

DVC: Git-Like at the Project Level

dvc add computes a content hash, moves the bytes into a local cache, and writes a tiny .dvc metadata file that you commit to git. After dvc push, the actual bytes live in a DVC remote (S3/GCS/Azure). The git history of the .dvc files is the version history of your data.

git checkout v1.3-paper-submission
dvc pull          # downloads exact data hashes for this commit
dvc repro         # re-runs the pipeline defined in dvc.yaml

Animation A1: DVC + Git Workflow

[Animation: a workspace holding data/raw/ (40 GB) runs dvc add, which leaves code plus tiny .dvc pointer files (kilobytes) for the Git repository; dvc push sends the dataset bytes to a content-addressable DVC remote (S3 / GCS / Azure); another engineer reproduces the exact dataset state with git checkout followed by dvc pull.]

Git tracks tiny .dvc metadata pointers; DVC pushes the actual bytes to a separate remote. Reproducing a run is git checkout + dvc pull.
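
The pointer-plus-cache idea can be sketched in a few lines of standard-library Python. This is a toy model of content addressing, not DVC's actual implementation or file format: the function names, the .pointer.json suffix, and the cache layout are all assumptions made for illustration.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def add_to_cache(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a hash-addressed cache and write a pointer file next to it."""
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    # Shard by the first two hex characters so no single directory grows too large.
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copyfile(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"md5": digest, "path": data_file.name}))
    return pointer


def restore(pointer: Path, cache_dir: Path, target: Path) -> None:
    """Rebuild the data file from the cache using only the pointer's hash."""
    digest = json.loads(pointer.read_text())["md5"]
    shutil.copyfile(cache_dir / digest[:2] / digest[2:], target)


workdir = Path(tempfile.mkdtemp())
raw = workdir / "train.csv"
raw.write_text("id,label\n1,0\n2,1\n")
pointer_file = add_to_cache(raw, workdir / "cache")
restore(pointer_file, workdir / "cache", workdir / "restored.csv")
```

Only the small pointer file would need to go into git; the hash inside it is enough to fetch the exact bytes back from the cache.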

lakeFS: Branchable Over Object Storage

lakeFS versions an entire bucket, exposing S3 (or GCS/Azure) through a versioning layer with branches, commits, and merges. Branches are cheap because of copy-on-write — objects are copied only when modified on the branch.

Animation A2: lakeFS Branching and Merge

[Animation: main carries commits c1, c2, c3; dev forks at c3 (zero-copy, copy-on-write until modified), adds the commits feature-v4 and model-win, then merges back into main.]

A dev branch forks from main at zero storage cost. New writes happen only on the branch. If the experiment wins, merge back; if not, discard.
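
Copy-on-write branching can be illustrated with a toy in-memory repository in which a branch is only a map from paths to immutable object IDs. This is a simplified sketch under stated assumptions, not how lakeFS is implemented: the ToyRepo class and its methods are hypothetical, and the merge is a naive "source wins" update.

```python
class ToyRepo:
    """Branches map logical paths to immutable object IDs; the objects are shared."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}  # object ID -> bytes, never mutated
        self.branches: dict[str, dict[str, str]] = {"main": {}}
        self._next_id = 0

    def write(self, branch: str, path: str, data: bytes) -> None:
        object_id = f"obj-{self._next_id}"
        self._next_id += 1
        self.objects[object_id] = data  # new bytes are stored only on write
        self.branches[branch][path] = object_id

    def create_branch(self, name: str, source: str) -> None:
        # Copy the pointer map only; no object bytes are duplicated.
        self.branches[name] = dict(self.branches[source])

    def read(self, branch: str, path: str) -> bytes:
        return self.objects[self.branches[branch][path]]

    def merge(self, source: str, target: str) -> None:
        # Naive "source wins" merge; a real system would detect conflicts.
        self.branches[target].update(self.branches[source])


repo = ToyRepo()
repo.write("main", "features/v3.parquet", b"v3")
repo.create_branch("dev", source="main")
objects_after_fork = len(repo.objects)  # still one object: the fork copied nothing
repo.write("dev", "features/v3.parquet", b"v4-experiment")
```

After the write on dev, main still reads the original bytes; a second object exists only because the branch modified that path.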

Delta Lake and Iceberg: ACID Time Travel for Tables

Delta Lake is a transactional table format: Parquet files plus a _delta_log/ directory that records every transaction as JSON or checkpoint entries. Engines read the log to determine which files constitute a given table version.

-- Read the exact features table the model was trained on
SELECT * FROM features.user_features VERSION AS OF 37;
SELECT * FROM features.user_features TIMESTAMP AS OF '2026-05-01 09:00:00';

Animation A3: Delta Lake Time Travel

[Animation: the _delta_log/ entries are v35 (add A, B), v36 (add C), v37 (remove B, add D), v38 (add E), and v39 (HEAD, add F). The query SELECT * FROM t VERSION AS OF 36 highlights version 36 and restores the snapshot part-A + part-B + part-C as the current view.]

The transaction log resolves "VERSION AS OF 36" to the exact set of Parquet files present at that version — reads only A, B, and C.
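
Resolving a version amounts to replaying the log's add and remove actions in order. The sketch below uses the five log entries from the animation as plain Python tuples; the real log stores JSON actions and checkpoints, and the function name files_at is hypothetical.

```python
# Each entry: (version, files added, files removed), taken from the animation's log.
LOG = [
    (35, {"part-A", "part-B"}, set()),
    (36, {"part-C"}, set()),
    (37, {"part-D"}, {"part-B"}),
    (38, {"part-E"}, set()),
    (39, {"part-F"}, set()),
]


def files_at(log, version: int) -> set[str]:
    """Replay add/remove actions in order, up to and including `version`."""
    known_versions = {v for v, _, _ in log}
    if version not in known_versions:
        raise ValueError(f"version {version} is not in the log")
    live: set[str] = set()
    for v, added, removed in sorted(log, key=lambda entry: entry[0]):
        if v > version:
            break
        live -= removed
        live |= added
    return live


snapshot_36 = files_at(LOG, 36)
snapshot_37 = files_at(LOG, 37)
```

Version 36 resolves to parts A, B, and C, while version 37 drops part-B and gains part-D. No data file is rewritten to serve either query.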

Tool Comparison

Dimension | DVC | lakeFS | Delta Lake
Mental model | Git for data in an ML repo | Git for a bucket | ACID table with a transaction log
Scope | Single project | Entire data lake | Per-table
Versioning unit | .dvc hash | Commit ID | Table version / timestamp
Sweet spot scale | GBs to low TBs | TBs to PBs, cross-team | TBs to PBs, tabular
Time travel | git checkout + dvc checkout | s3://repo/<commit-id>/path | VERSION AS OF n

Large organizations often stack all three: lakeFS at the bucket layer, Delta Lake for ACID tables inside it, DVC for project-local slices and models.

Key Points

Section 3: Pipeline and Environment Versioning

Pre-Reading Quiz — Pipeline & Environment

1. Why should you log the container image digest rather than the tag?

a. Digests are shorter and easier to type.
b. Tags like :latest or :v1.3 can be re-pointed; SHA-256 digests cannot.
c. Digests automatically include the build date.
d. Tags do not work with GPU images.

2. What is the role of a lockfile like requirements.lock or poetry.lock in a Dockerfile?

a. It encrypts dependencies for secure delivery.
b. It pins every transitive dependency to an exact version + hash so rebuilds are deterministic.
c. It compresses the requirements list to save bandwidth.
d. It tells Docker which port to expose.

3. Which statement best captures "pipeline-as-code"?

a. Pipelines must be implemented in Rust for performance.
b. Orchestration topology lives in version-controlled source files, not in a UI someone clicked.
c. Each pipeline step must be a one-line bash command.
d. Pipelines are auto-generated by an LLM from a prompt.

Docker as the Environment Unit

A container image encodes the OS userland, CUDA/cuDNN, Python interpreter, system libraries, and all Python dependencies into a single immutable artifact identified by a SHA-256 digest. Two engineers running docker pull myorg/ml@sha256:abc123... get byte-identical image contents; only the host kernel and GPU driver remain outside the image.

FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y \
    git python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.lock /app/
RUN pip install --no-cache-dir -r requirements.lock
COPY . /app
CMD ["python3", "train.py", "--config", "config/exp1.yaml"]

Two rules: (1) log the image digest, never the tag; (2) treat the image as immutable — rebuild and re-tag instead of docker exec-ing fixes.
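
Rule (1) can be enforced mechanically before a run is logged. The following is a minimal sketch that checks whether an image reference ends in a SHA-256 digest; the function names and the example digest are hypothetical, and the regular expression is a simplification of the full image-reference grammar.

```python
import re

# A digest-pinned reference ends in "@sha256:" followed by 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"^(?P<name>[^@\s]+)@sha256:(?P<hex>[0-9a-f]{64})$")


def is_pinned(image_ref: str) -> bool:
    """True if the reference names an immutable digest rather than a movable tag."""
    return _DIGEST_RE.match(image_ref) is not None


def require_pinned(image_ref: str) -> str:
    """Return the reference unchanged, or raise if it could be re-pointed."""
    if not is_pinned(image_ref):
        raise ValueError(f"image reference is not digest-pinned: {image_ref!r}")
    return image_ref


example_digest = "sha256:" + "ab" * 32  # placeholder digest, not a real image
pinned_ref = require_pinned("myorg/ml@" + example_digest)
```

A reference such as myorg/ml:latest fails the check, so an experiment tracker built on it would refuse to record a tag where a digest is expected.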

Lockfile Tools

Pipeline-as-Code

Pipeline definitions belong in version-controlled source files, not in a UI someone clicked six months ago. Forms include Airflow DAGs (Python), Kubeflow/Argo (YAML), DVC pipelines (dvc.yaml), and Python-native frameworks (Dagster, Prefect, Metaflow).

stages:
  prepare_features:
    cmd: python src/features.py --input data/raw --output data/features
    deps: [data/raw, src/features.py]
    outs: [data/features]
  train:
    cmd: python src/train.py --features data/features --model models/churn.pt
    deps: [data/features, src/train.py]
    outs: [models/churn.pt]
    metrics:
      - metrics.json:
          cache: false
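
Because each stage declares its deps and outs, the execution order can be derived rather than written by hand. The sketch below is not DVC's own scheduler: the stage definitions are copied by hand into a Python dict instead of being parsed from YAML, and stage_order is a hypothetical helper built on the standard library's graphlib.

```python
from graphlib import TopologicalSorter

# Hand-copied from the two pipeline stages; a real script would parse the YAML file.
STAGES = {
    "prepare_features": {
        "deps": ["data/raw", "src/features.py"],
        "outs": ["data/features"],
    },
    "train": {
        "deps": ["data/features", "src/train.py"],
        "outs": ["models/churn.pt"],
    },
}


def stage_order(stages: dict) -> list[str]:
    """Order stages so every producer runs before its consumers."""
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    # TopologicalSorter takes {node: predecessors} and raises CycleError on cycles.
    return list(TopologicalSorter(graph).static_order())


order = stage_order(STAGES)
```

Here train depends on data/features, which prepare_features produces, so prepare_features is scheduled first.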

Compute Environment Metadata

Beyond the image digest, mature teams log runtime configuration alongside each experiment so a run is auditable months later:

{
  "image_digest": "sha256:abc123...",
  "git_sha": "f4a2c8e",
  "data_dvc_hash": "md5:9f1ec4...",
  "gpu_model": "NVIDIA A100-SXM4-80GB",
  "cuda_version": "12.1",
  "driver_version": "535.104.05",
  "seed": 1234
}
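
A record like this is easiest to keep complete when one function assembles it. The sketch below is one possible shape, not a standard: build_run_record is a hypothetical helper, the identifiers are placeholder values, and the python_version, platform, and argv fields are additions beyond the record shown. GPU fields are omitted because they need a GPU library; in PyTorch, torch.cuda.get_device_name and torch.version.cuda are one possible source.

```python
import json
import platform
import sys


def build_run_record(image_digest: str, git_sha: str, data_hash: str, seed: int) -> dict:
    """Combine caller-supplied identifiers with facts the interpreter can report itself."""
    return {
        "image_digest": image_digest,
        "git_sha": git_sha,
        "data_dvc_hash": data_hash,
        "seed": seed,
        # Extra fields that are cheap to capture at runtime.
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "argv": list(sys.argv),
    }


# Placeholder identifiers; real values would be full digests and hashes.
record = build_run_record("sha256:abc123", "f4a2c8e", "md5:9f1ec4", 1234)
serialized = json.dumps(record, sort_keys=True, indent=2)
```

Serializing with sorted keys keeps the stored records easy to diff between runs.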

Key Points

Section 4: Lineage and Provenance

Pre-Reading Quiz — Lineage & Provenance

1. Which entities form the core of the OpenLineage data model?

a. User, Role, Permission, Resource
b. Job, Run, Dataset, Event (with Facets)
c. Source, Transform, Sink, Buffer
d. Container, Pod, Service, Ingress

2. How is lineage typically emitted for Airflow, Spark, and dbt?

a. Engineers manually write JSON events after each run.
b. Auto-emitting integrations (providers, listener jars, plugins) hook into the tool's lifecycle.
c. It's harvested nightly by scraping log files with regex.
d. It's encoded into git commit messages and parsed later.

3. Why does the EU AI Act + GDPR make lineage compliance-critical?

a. Lineage automatically anonymizes PII before training.
b. It supports queries like "which models depend on user X's data" needed for right-to-be-forgotten and AI Act traceability.
c. Lineage replaces the need for differential privacy.
d. It exempts companies from data-processing obligations.

Data lineage is the graph that ties raw sources to features to models to predictions. It is the difference between "this model was trained on user data" and "this exact run, with this commit SHA, on these specific Delta table versions, produced this artifact."

OpenLineage Data Model

Facets carry rich metadata: schema, columnLineage, dataQualityMetrics, errorMessage, sourceCode, sql, plus custom ML facets for hyperparameters.
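
To give a feel for how a run, a job, datasets, and facets fit together in one event, the sketch below assembles an event as a plain dict without any client library. It is a simplified illustration, not a spec-complete event: the namespace, producer URI, and the ml_hyperparameters facet name are assumptions, and a real event carries additional required metadata such as schema URLs.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(event_type: str, job_name: str, inputs: list[str],
                   outputs: list[str], hyperparameters: dict) -> dict:
    """Assemble a minimal lineage event as a plain dict (no client library involved)."""
    namespace = "ml-platform"  # hypothetical namespace
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/training-job",  # placeholder producer URI
        "run": {
            "runId": str(uuid.uuid4()),
            # Custom facet name is an assumption, not part of the standard facet set.
            "facets": {"ml_hyperparameters": hyperparameters},
        },
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": n} for n in inputs],
        "outputs": [{"namespace": namespace, "name": n} for n in outputs],
    }


event = make_run_event(
    "COMPLETE",
    "train_churn_model",
    inputs=["features.user_features"],
    outputs=["models.churn_model"],
    hyperparameters={"learning_rate": 0.001, "epochs": 10},
)
payload = json.dumps(event)
```

The dataset names in inputs and outputs are what let a lineage store connect this run to the jobs upstream and downstream of it.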

Auto-Integrations

End-to-End Lineage Graph

graph LR
    K[(Kafka events)]
    K -->|ingest_events| RE[raw.events]
    RE -->|dbt: stg_events| SE[stg.events]
    SE -->|Spark: user_features_job| UF["features.user_features<br/>Delta v=42"]
    UF -->|train_churn_model| MC["models.churn_model:v1.3<br/>git SHA + image digest"]
    UF -->|batch_inference| CP[predictions.churn_scores]
    MC -->|batch_inference| CP
    CP -->|consumed| APP[CRM, dashboards]

In Marquez (the reference open-source store/UI), start at predictions.churn_scores and walk upstream through the model artifact, feature table, dbt staging model, all the way to raw events. Every node carries run history, schema, and facets.
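
The upstream walk is an ordinary graph traversal. The sketch below encodes the churn lineage graph as an adjacency dict with simplified node names (version suffixes and the downstream applications are left out) and walks it in both directions; it does not call Marquez or any lineage API, and the function names are hypothetical.

```python
from collections import deque

# Edges point downstream (producer -> consumer); node names are simplified.
EDGES = {
    "kafka.events": ["raw.events"],
    "raw.events": ["stg.events"],
    "stg.events": ["features.user_features"],
    "features.user_features": ["models.churn_model", "predictions.churn_scores"],
    "models.churn_model": ["predictions.churn_scores"],
    "predictions.churn_scores": [],
}


def reachable(edges: dict, start: str) -> set[str]:
    """Breadth-first walk returning every node reachable from start (start excluded)."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        for nxt in edges.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(edges: dict) -> dict:
    """Flip edge direction so the same walk can go upstream."""
    flipped = {node: [] for node in edges}
    for src, dsts in edges.items():
        for dst in dsts:
            flipped.setdefault(dst, []).append(src)
    return flipped


upstream_of_scores = reachable(reverse(EDGES), "predictions.churn_scores")
downstream_of_raw = reachable(EDGES, "raw.events")
```

The upstream set answers "what did these predictions depend on", and the downstream set answers the compliance-style question "what is affected by this raw source".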

Compliance Queries Become Graph Walks

Debugging via Lineage

Suppose the churn model's AUC drops from 0.80 to 0.72 overnight. The walk: open the latest training run and inspect its facets (unchanged); walk upstream to features.user_features (row count down 30%); walk upstream to stg.events (schema-mismatch error in dbt). Fix the problem once at its source, re-run, and verify that the downstream graph turns green.

Key Points

Post-Reading Quizzes

Post-Reading Quiz — Reproducibility

1. Why is git alone insufficient for versioning an ML workflow?

a. Git cannot track Python files that import non-standard libraries.
b. Git poorly handles large binary datasets, ephemeral environments, and stochastic state.
c. Git only supports one branch at a time, so parallel experiments are impossible.
d. Git automatically deletes files larger than 1 MB, breaking model artifacts.

2. Which tuple uniquely defines a reproducible ML run?

a. (branch name, dataset name, OS version)
b. (git commit, data version, container image digest, seed/config)
c. (MLflow run ID, hostname, timestamp)
d. (model accuracy, loss value, training duration)

3. Which level of reproducibility requires identical floating-point outputs across runs?

a. Conceptual reproducibility
b. Statistical reproducibility
c. Numerical reproducibility
d. Bitwise reproducibility

Post-Reading Quiz — Data Versioning Tools

1. In DVC, what does git actually store when you run dvc add data/raw/?

a. A compressed copy of the entire dataset.
b. A small .dvc metadata file containing the content hash; the bytes live in a DVC remote.
c. Nothing; DVC bypasses git entirely.
d. A symbolic link to S3 with embedded credentials.

2. What makes lakeFS branches cheap to create even over petabyte-scale data?

a. It physically duplicates objects in parallel using GPUs.
b. Copy-on-write semantics: objects are copied only when modified on the branch.
c. It limits branches to 1 GB maximum size.
d. It stores everything in RAM until merge.

3. How does Delta Lake provide time-travel queries on object storage?

a. It snapshots Parquet files every hour to a backup bucket.
b. A _delta_log directory records every transaction; engines read it to determine which files constitute a given table version.
c. It periodically uploads CSV exports to a versioning service.
d. It uses git submodules to track each Parquet file.

Post-Reading Quiz — Pipeline & Environment

1. Why should you log the container image digest rather than the tag?

a. Digests are shorter and easier to type.
b. Tags like :latest or :v1.3 can be re-pointed; SHA-256 digests cannot.
c. Digests automatically include the build date.
d. Tags do not work with GPU images.

2. What is the role of a lockfile like requirements.lock or poetry.lock in a Dockerfile?

a. It encrypts dependencies for secure delivery.
b. It pins every transitive dependency to an exact version + hash so rebuilds are deterministic.
c. It compresses the requirements list to save bandwidth.
d. It tells Docker which port to expose.

3. Which statement best captures "pipeline-as-code"?

a. Pipelines must be implemented in Rust for performance.
b. Orchestration topology lives in version-controlled source files, not in a UI someone clicked.
c. Each pipeline step must be a one-line bash command.
d. Pipelines are auto-generated by an LLM from a prompt.

Post-Reading Quiz — Lineage & Provenance

1. Which entities form the core of the OpenLineage data model?

a. User, Role, Permission, Resource
b. Job, Run, Dataset, Event (with Facets)
c. Source, Transform, Sink, Buffer
d. Container, Pod, Service, Ingress

2. How is lineage typically emitted for Airflow, Spark, and dbt?

a. Engineers manually write JSON events after each run.
b. Auto-emitting integrations (providers, listener jars, plugins) hook into the tool's lifecycle.
c. It's harvested nightly by scraping log files with regex.
d. It's encoded into git commit messages and parsed later.

3. Why does the EU AI Act + GDPR make lineage compliance-critical?

a. Lineage automatically anonymizes PII before training.
b. It supports queries like "which models depend on user X's data" needed for right-to-be-forgotten and AI Act traceability.
c. Lineage replaces the need for differential privacy.
d. It exempts companies from data-processing obligations.

Answer Explanations