Study Guide: Chapter 5 — Data and Pipeline Versioning

Section 1: Reproducibility in ML

Pre-Reading Quiz — Reproducibility

1. Why is git alone insufficient for versioning an ML workflow?

Git cannot track Python files that import non-standard libraries.
Git poorly handles large binary datasets, ephemeral environments, and stochastic state.
Git only supports one branch at a time, so parallel experiments are impossible.
Git automatically deletes files larger than 1 MB, breaking model artifacts.

2. Which tuple uniquely defines a reproducible ML run?

(branch name, dataset name, OS version)
(git commit, data version, container image digest, seed/config)
(MLflow run ID, hostname, timestamp)
(model accuracy, loss value, training duration)

3. Which level of reproducibility requires identical floating-point outputs across runs?

Conceptual reproducibility
Statistical reproducibility
Numerical reproducibility
Bitwise reproducibility

An ML system is the marriage of code, data, environment, and stochastic processes. Each can drift independently and silently invalidate yesterday's results. A reproducible experiment is one where a tuple of (git commit, data version, container image digest, seed/config) uniquely defines the run.
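
As a minimal sketch of the tuple above, a run identity can be modeled as an immutable record and hashed into a single fingerprint. The names RunIdentity and fingerprint are hypothetical, chosen only for illustration, and the field values are placeholders rather than real commits or digests.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunIdentity:
    """Hypothetical record of the four values that pin down a run."""
    git_commit: str
    data_version: str
    image_digest: str
    seed: int

    def fingerprint(self) -> str:
        # Serialize with sorted keys so the same fields always hash the same way.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


run_a = RunIdentity("f4a2c8e", "md5:9f1ec4", "sha256:abc123", seed=1234)
run_b = RunIdentity("f4a2c8e", "md5:9f1ec4", "sha256:abc123", seed=1234)
run_c = RunIdentity("f4a2c8e", "md5:9f1ec4", "sha256:abc123", seed=1235)

print(run_a.fingerprint() == run_b.fingerprint())  # True
print(run_a.fingerprint() == run_c.fingerprint())  # False
```

Changing any one element, here only the seed, yields a different fingerprint, which is the sense in which the tuple "uniquely defines" the run.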

The Four Dimensions of ML Reproducibility

Dimension	Covers	Primary Tools	Common Failure
Code	Scripts, configs, pipelines	Git, Airflow, Kubeflow, dvc.yaml	Untracked notebook edits
Data	Raw inputs, splits, features, labels	DVC, lakeFS, Delta Lake	Silent overwrites, schema drift
Environment	OS, CUDA, libraries	Docker, pip-tools, Poetry, conda-lock	"Worked yesterday" post-upgrade
Randomness	Initialization, shuffling, dropout	Framework seed APIs, deterministic flags	One forgotten seed call

graph TD
    R[Reproducible ML Run]
    R --> C[Code Dimension]
    R --> D[Data Dimension]
    R --> E[Environment Dimension]
    R --> S[Randomness Dimension]
    C --> C1[Git commit SHA]
    C --> C2[Pipeline-as-code]
    D --> D1[DVC hashes]
    D --> D2[lakeFS commits]
    D --> D3[Delta Lake version]
    E --> E1[Container digest]
    E --> E2[Lockfiles]
    E --> E3[CUDA + driver]
    S --> S1[Seed APIs]
    S --> S2[Deterministic flags]
    S --> S3[Per-rank seed offsets]

Reproducibility Levels

Bitwise: identical floats, requires identical hardware + deterministic kernels.
Numerical: outputs agree within a small tolerance (for example, loss within 1e-4); feasible on the same GPU family.
Statistical: distributions match but individual runs differ — fine for research.
Conceptual: same methodology, reimplemented — academic bar, not production.
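
The difference between the bitwise and numerical levels can be made concrete with a small comparison. This is a sketch using only the standard library; the helper names bitwise_equal and numerically_close are hypothetical, and the loss values are made-up illustration data, not measurements.

```python
import math
import struct
from typing import Sequence


def bitwise_equal(a: Sequence[float], b: Sequence[float]) -> bool:
    """True only if every float64 has an identical bit pattern."""
    if len(a) != len(b):
        return False
    return struct.pack(f"{len(a)}d", *a) == struct.pack(f"{len(b)}d", *b)


def numerically_close(a: Sequence[float], b: Sequence[float], tol: float = 1e-4) -> bool:
    """True if every pair differs by at most tol (absolute difference)."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=0.0, abs_tol=tol) for x, y in zip(a, b)
    )


losses_run1 = [0.6931, 0.5012, 0.4120]
losses_run2 = [0.6931, 0.5012, 0.41203]  # last step differs by 3e-5

print(bitwise_equal(losses_run1, losses_run1))      # True
print(bitwise_equal(losses_run1, losses_run2))      # False
print(numerically_close(losses_run1, losses_run2))  # True
```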

PyTorch Determinism Playbook

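import os
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Only affects child processes; the running interpreter fixed its hash seed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Needed for deterministic cuBLAS on CUDA 10.2+; set before any CUDA work begins.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.use_deterministic_algorithms(True)  # raise on ops with no deterministic kernel
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False    # disable autotuned kernel selection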

For DistributedDataParallel, derive each rank's seed as base_seed + rank. None of these flags guarantee bitwise reproducibility across different GPU architectures.
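
The per-rank rule can be sketched without any GPU. The example below uses Python's random module as a stand-in for framework generators; rank_seed and sample_stream are hypothetical helper names, and in a real job the rank would typically come from the distributed runtime (for example torch.distributed.get_rank()) rather than a loop.

```python
import random


def rank_seed(base_seed: int, rank: int) -> int:
    """Derive a distinct, reproducible seed for each distributed rank."""
    return base_seed + rank


def sample_stream(seed: int, n: int = 3) -> list:
    # A private generator avoids touching the global random state.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


base_seed = 1234
streams = {rank: sample_stream(rank_seed(base_seed, rank)) for rank in range(4)}

# Each rank draws different numbers, so shuffling and dropout are not duplicated...
print(len({tuple(s) for s in streams.values()}))  # 4
# ...yet rerunning with the same base seed reproduces every rank's stream.
print(streams[2] == sample_stream(rank_seed(base_seed, 2)))  # True
```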

Section 2: Data Versioning Tools

Pre-Reading Quiz — Data Versioning Tools

1. In DVC, what does git actually store when you run dvc add data/raw/?

A compressed copy of the entire dataset.
A small .dvc metadata file containing the content hash; the bytes live in a DVC remote.
Nothing; DVC bypasses git entirely.
A symbolic link to S3 with embedded credentials.

2. What makes lakeFS branches cheap to create even over petabyte-scale data?

It physically duplicates objects in parallel using GPUs.
Copy-on-write semantics: objects are copied only when modified on the branch.
It limits branches to 1 GB maximum size.
It stores everything in RAM until merge.

3. How does Delta Lake provide time-travel queries on object storage?

It snapshots Parquet files every hour to a backup bucket.
A _delta_log directory records every transaction; engines read it to determine which files constitute a given table version.
It periodically uploads CSV exports to a versioning service.
It uses git submodules to track each Parquet file.

Large datasets do not fit in git, and git's text-oriented diffing makes binary diffs useless. Three patterns dominate: a git-like project tool (DVC), a branchable layer over object storage (lakeFS), and a transactional table format (Delta Lake).

DVC: Git-Like at the Project Level

dvc add computes a content hash, moves bytes into a local cache, and writes a tiny .dvc metadata file you commit to git. After dvc push, the actual bytes live in a DVC remote (S3/GCS/Azure). The git history of the .dvc files is therefore the version history of your data.

git checkout v1.3-paper-submission
dvc pull          # downloads the data files whose hashes this commit records
dvc repro         # re-runs the pipeline defined in dvc.yaml
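
The content-addressing idea behind dvc add can be illustrated with a toy version. This is not DVC's implementation: toy_add, the cache layout, and the .pointer.json file name are all hypothetical, and DVC's real metadata format and cache structure differ.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def toy_add(data_file: Path, cache_dir: Path) -> Path:
    """Copy a file into a content-addressed cache and write a small pointer file."""
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    # Hypothetical layout: first two hex chars as a directory, the rest as the name.
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    pointer = data_file.parent / (data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"md5": digest, "path": data_file.name}))
    return pointer


workdir = Path(tempfile.mkdtemp())
raw = workdir / "train.csv"
raw.write_text("user_id,churned\n1,0\n2,1\n")

pointer_file = toy_add(raw, workdir / "cache")
meta = json.loads(pointer_file.read_text())
print(meta["path"], len(meta["md5"]))  # train.csv 32
```

Only the small pointer file would go into git; the cached bytes are what a remote would hold.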

lakeFS: Branchable Over Object Storage

lakeFS versions an entire bucket, exposing S3 (or GCS/Azure) through a versioning layer with branches, commits, and merges. Branches are cheap because of copy-on-write — objects are copied only when modified on the branch.
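
Copy-on-write can be sketched as a toy in-memory model in which a branch is just a mapping from paths to immutable object ids. ToyRepo and its methods are hypothetical and do not reflect the lakeFS API or its storage format.

```python
class ToyRepo:
    """Toy model: a branch maps each path to an immutable object id."""

    def __init__(self):
        self.objects = {}             # object id -> bytes (stored once)
        self.branches = {"main": {}}  # branch name -> {path: object id}

    def put(self, branch: str, path: str, data: bytes) -> None:
        object_id = f"obj-{len(self.objects)}"
        self.objects[object_id] = data  # new bytes are written only here
        self.branches[branch][path] = object_id

    def create_branch(self, name: str, source: str) -> None:
        # Copies pointers only; no object bytes are duplicated.
        self.branches[name] = dict(self.branches[source])

    def get(self, branch: str, path: str) -> bytes:
        return self.objects[self.branches[branch][path]]


repo = ToyRepo()
repo.put("main", "events/day1.parquet", b"v1")
repo.put("main", "events/day2.parquet", b"v1")
repo.create_branch("experiment", source="main")
objects_after_branch = len(repo.objects)

repo.put("experiment", "events/day2.parquet", b"v2")

print(objects_after_branch)                           # 2
print(len(repo.objects))                              # 3
print(repo.get("main", "events/day2.parquet"))        # b'v1'
print(repo.get("experiment", "events/day2.parquet"))  # b'v2'
```

Creating the branch adds no objects; only the one modified file produces a new object. A real system would also avoid copying the whole path mapping, which this dictionary copy simplifies away.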

Delta Lake and Iceberg: ACID Time Travel for Tables

Delta Lake is a transactional table format: Parquet data files plus a _delta_log/ directory that records every transaction as JSON commit files, with periodic checkpoint files that summarize the log. Engines read the log to determine which files constitute a given table version.

-- Read the exact features table the model was trained on
SELECT * FROM features.user_features VERSION AS OF 37;
SELECT * FROM features.user_features TIMESTAMP AS OF '2026-05-01 09:00:00';
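
How a reader resolves a table version from the log can be sketched as a replay of add and remove actions. The structure below is a deliberately simplified toy: real Delta log entries carry much more metadata, and toy_log and files_at_version are hypothetical names.

```python
from typing import Dict, List, Set

# Toy transaction log: one entry per commit, keyed by version number.
# "add" and "remove" mirror the kinds of file actions a Delta log records.
toy_log: Dict[int, List[dict]] = {
    0: [{"add": "part-000.parquet"}],
    1: [{"add": "part-001.parquet"}],
    2: [{"remove": "part-000.parquet"}, {"add": "part-002.parquet"}],
}


def files_at_version(log: Dict[int, List[dict]], version: int) -> Set[str]:
    """Replay commits 0..version and return the files that make up that snapshot."""
    live: Set[str] = set()
    for v in range(version + 1):
        for action in log[v]:
            if "add" in action:
                live.add(action["add"])
            elif "remove" in action:
                live.discard(action["remove"])
    return live


print(sorted(files_at_version(toy_log, 1)))  # ['part-000.parquet', 'part-001.parquet']
print(sorted(files_at_version(toy_log, 2)))  # ['part-001.parquet', 'part-002.parquet']
```

Time travel falls out naturally: reading version 1 simply stops the replay early, and the removed file is still on storage to be read.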

Tool Comparison

Dimension	DVC	lakeFS	Delta Lake
Mental model	Git for data in ML repo	Git for a bucket	ACID table with txn log
Scope	Single project	Entire data lake	Per-table
Versioning unit	.dvc hash	Commit ID	Table version / timestamp
Sweet spot scale	GBs to low TBs	TBs to PBs cross-team	TBs to PBs tabular
Time travel	`git checkout + dvc checkout`	`s3://repo/<commit-id>/`	`VERSION AS OF n`

Large organizations often stack all three: lakeFS at the bucket layer, Delta Lake for ACID tables inside it, DVC for project-local slices and models.

Section 3: Pipeline and Environment Versioning

Pre-Reading Quiz — Pipeline & Environment

1. Why should you log the container image digest rather than the tag?

Digests are shorter and easier to type.
Tags like :latest or :v1.3 can be re-pointed; SHA-256 digests cannot.
Digests automatically include the build date.
Tags do not work with GPU images.

2. What is the role of a lockfile like requirements.lock or poetry.lock in a Dockerfile?

It encrypts dependencies for secure delivery.
It pins every transitive dependency to an exact version + hash so rebuilds are deterministic.
It compresses the requirements list to save bandwidth.
It tells Docker which port to expose.

3. Which statement best captures "pipeline-as-code"?

Pipelines must be implemented in Rust for performance.
Orchestration topology lives in version-controlled source files, not in a UI someone clicked.
Each pipeline step must be a one-line bash command.
Pipelines are auto-generated by an LLM from a prompt.

Docker as the Environment Unit

A container image encodes the OS userland, CUDA/cuDNN, Python interpreter, system libs, and all Python dependencies into a single immutable artifact identified by a SHA-256 digest. Two engineers running docker pull myorg/ml@sha256:abc123... execute against byte-identical images; the host kernel and GPU driver still come from the host, which is why they are logged separately.

FROM nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y \
    git python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.lock /app/
RUN pip install --no-cache-dir -r requirements.lock
COPY . /app
CMD ["python3", "train.py", "--config", "config/exp1.yaml"]

Two rules: (1) log the image digest, never the tag; (2) treat the image as immutable — rebuild and re-tag instead of docker exec-ing fixes.
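
Rule (1) is easy to enforce mechanically. The sketch below checks whether an image reference is pinned by digest before it is logged; is_pinned_by_digest is a hypothetical helper, and the digest used in the example is a placeholder, not a real image.

```python
import re

# An image reference pinned by digest ends with "@sha256:" plus 64 hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_pinned_by_digest(image_ref: str) -> bool:
    return bool(_DIGEST_RE.search(image_ref))


fake_digest = "a" * 64  # placeholder, not a real image digest
print(is_pinned_by_digest(f"myorg/ml@sha256:{fake_digest}"))  # True
print(is_pinned_by_digest("myorg/ml:v1.3"))                   # False
print(is_pinned_by_digest("myorg/ml:latest"))                 # False
```

A training launcher could refuse to start when this check fails, so that a mutable tag never ends up in experiment metadata.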

Lockfile Tools

pip-tools (pip-compile) produces a fully pinned requirements.txt from requirements.in, including transitive deps, plus hashes when run with --generate-hashes.
Poetry resolves pyproject.toml into poetry.lock with exact versions and content hashes.
conda-lock generates platform-specific lockfiles from environment.yml, capturing non-Python packages too.
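
Whichever tool produces the lockfile, a quick sanity check is that every requirement is pinned to an exact version. The following is a rough sketch for pip-style files only; unpinned_requirements is a hypothetical helper, it does not handle every syntax pip accepts, and the package lines are illustrative.

```python
def unpinned_requirements(lock_text: str) -> list:
    """Return requirement lines that do not pin an exact version with '=='."""
    problems = []
    for raw_line in lock_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()     # drop comments
        if not line or line.startswith("--"):        # skip blanks, options, hash lines
            continue
        line = line.rstrip("\\").strip()             # drop line-continuation marker
        requirement = line.split(";", 1)[0].strip()  # drop environment markers
        if "==" not in requirement:
            problems.append(requirement)
    return problems


sample = """
numpy==1.26.4 \\
    --hash=sha256:placeholder
torch>=2.0
requests
pandas==2.2.2 ; python_version >= "3.9"
"""

print(unpinned_requirements(sample))  # ['torch>=2.0', 'requests']
```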

Pipeline-as-Code

Pipeline definitions belong in version-controlled source files, not in a UI someone clicked six months ago. Forms include Airflow DAGs (Python), Kubeflow/Argo (YAML), DVC pipelines (dvc.yaml), and Python-native frameworks (Dagster, Prefect, Metaflow).

stages:
  prepare_features:
    cmd: python src/features.py --input data/raw --output data/features
    deps: [data/raw, src/features.py]
    outs: [data/features]
  train:
    cmd: python src/train.py --features data/features --model models/churn.pt
    deps: [data/features, src/train.py]
    outs: [models/churn.pt]
    metrics:
      - metrics.json:
          cache: false
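
Because each stage declares its deps and outs, the execution order can be derived rather than hand-written. The sketch below illustrates that idea with the standard library's graphlib; it is not how DVC itself is implemented, and the stages dictionary simply restates the two stages above as Python data.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# The two stages above, written as plain dictionaries for illustration.
stages = {
    "train": {"deps": ["data/features", "src/train.py"], "outs": ["models/churn.pt"]},
    "prepare_features": {"deps": ["data/raw", "src/features.py"], "outs": ["data/features"]},
}


def execution_order(stages: dict) -> list:
    """Order stages so each runs after the stages that produce its dependencies."""
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(stages))  # ['prepare_features', 'train']
```

The stages are listed in the "wrong" order on purpose: the order comes from the declared dependencies, not from their position in the file.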

Compute Environment Metadata

Beyond the image digest, mature teams log runtime configuration alongside each experiment so a run is auditable months later:

{
  "image_digest": "sha256:abc123...",
  "git_sha": "f4a2c8e",
  "data_dvc_hash": "md5:9f1ec4...",
  "gpu_model": "NVIDIA A100-SXM4-80GB",
  "cuda_version": "12.1",
  "driver_version": "535.104.05",
  "seed": 1234
}
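
A record like this is usually assembled at the start of a run. The sketch below gathers the parts that the standard library can discover on its own; collect_run_metadata and current_git_sha are hypothetical helpers, and GPU, CUDA, and driver fields are left out because they depend on the framework and driver tooling in use.

```python
import json
import platform
import subprocess
import sys


def current_git_sha() -> str:
    """Return the short commit SHA, or 'unknown' outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def collect_run_metadata(seed: int, image_digest: str) -> dict:
    # image_digest is passed in: this sketch assumes a launcher or CI job injects it,
    # since code inside a container cannot reliably discover its own digest.
    return {
        "image_digest": image_digest,
        "git_sha": current_git_sha(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "argv": sys.argv,
        "seed": seed,
    }


metadata = collect_run_metadata(seed=1234, image_digest="sha256:abc123...")
print(json.dumps(metadata, indent=2))
```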

Section 4: Lineage and Provenance

Pre-Reading Quiz — Lineage & Provenance

1. Which entities form the core of the OpenLineage data model?

User, Role, Permission, Resource
Job, Run, Dataset, Event (with Facets)
Source, Transform, Sink, Buffer
Container, Pod, Service, Ingress

2. How is lineage typically emitted for Airflow, Spark, and dbt?

Engineers manually write JSON events after each run.
Auto-emitting integrations (providers, listener jars, plugins) hook into the tool's lifecycle.
It's harvested nightly by scraping log files with regex.
It's encoded into git commit messages and parsed later.

3. Why does the EU AI Act + GDPR make lineage compliance-critical?

Lineage automatically anonymizes PII before training.
It supports queries like "which models depend on user X's data" needed for right-to-be-forgotten and AI Act traceability.
Lineage replaces the need for differential privacy.
It exempts companies from data-processing obligations.

Data lineage is the graph that ties raw sources to features to models to predictions. It is the difference between "this model was trained on user data" and "this exact run, with this commit SHA, on these specific Delta table versions, produced this artifact."

OpenLineage Data Model

Job: a logical unit of work (Airflow task, Spark job, dbt model). Identified by name + namespace.
Run: a single execution of a Job, identified by a runId (UUID).
Dataset: a logical dataset read or written — tables, files, streams, model artifacts.
Event: START / COMPLETE / FAIL JSON messages with inputs, outputs, and extensible facets.

Facets carry rich metadata: schema, columnLineage, dataQualityMetrics, errorMessage, sourceCode, sql, plus custom ML facets for hyperparameters.
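
To make the model concrete, here is a sketch that builds dictionaries shaped like run events, without using any client library. make_run_event, the "example" namespace, and the job and dataset names are hypothetical; the field names follow the general shape of the OpenLineage event format, but the sketch omits fields the specification requires (such as producer and schemaURL), so treat it as illustrative only.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(event_type: str, run_id: str, job_name: str,
                   inputs: list, outputs: list, namespace: str = "example") -> dict:
    """Build a dictionary shaped like a lineage run event (illustrative only)."""
    def as_datasets(names):
        return [{"namespace": namespace, "name": n} for n in names]

    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id, "facets": {}},
        "job": {"namespace": namespace, "name": job_name, "facets": {}},
        "inputs": as_datasets(inputs),
        "outputs": as_datasets(outputs),
    }


run_id = str(uuid.uuid4())
start = make_run_event("START", run_id, "train_churn", ["features.user_features"], [])
complete = make_run_event("COMPLETE", run_id, "train_churn",
                          ["features.user_features"], ["models/churn.pt"])

# Both events share one runId, which lets a consumer stitch them into a single Run.
print(start["run"]["runId"] == complete["run"]["runId"])  # True
print(json.dumps(complete["outputs"]))
```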

Auto-Integrations

Airflow: OpenLineage provider hooks task-lifecycle callbacks; inspects operators to infer inputs/outputs.
Spark: A listener jar attaches to logical/physical plans; emits jobs + datasets with column-level lineage where derivable.
dbt: A plugin reads manifest.json + run_results.json for rich column-level lineage from compiled SQL.

End-to-End Lineage Graph

In Marquez (the reference open-source store/UI), start at predictions.churn_scores and walk upstream through the model artifact, feature table, dbt staging model, all the way to raw events. Every node carries run history, schema, and facets.

Compliance Queries Become Graph Walks

"Show all models trained on PII-tagged datasets" → filter by sensitivity: PII, walk downstream.
"User X revoked consent — which models need retraining?" → identify their datasets, walk forward to models.
"Prove this credit model didn't use applicant_race" → column-level lineage from training inputs.

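The first query in the list above can be sketched as a breadth-first walk over a lineage graph. The graph, the tags, and the "model." naming convention below are all hypothetical illustration data; a real deployment would query the lineage store rather than an in-memory dictionary.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the nodes listed as its value.
downstream = {
    "raw.events": ["stg.events"],
    "raw.crm_users": ["features.user_features"],
    "stg.events": ["features.user_features"],
    "features.user_features": ["model.churn"],
    "raw.weather": ["features.weather"],
    "features.weather": ["model.demand"],
}
tags = {"raw.crm_users": {"PII"}}


def affected_models(start_nodes) -> set:
    """Breadth-first walk downstream; collect every model node reached."""
    seen, queue = set(start_nodes), deque(start_nodes)
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return {n for n in seen if n.startswith("model.")}


pii_sources = [name for name, node_tags in tags.items() if "PII" in node_tags]
print(sorted(affected_models(pii_sources)))  # ['model.churn']
```

The consent-revocation query has the same shape: start from the datasets holding that user's records instead of from PII-tagged sources.
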
Debugging via Lineage

Churn model AUC drops 0.80 → 0.72 overnight. Walk: open latest training run, inspect facets (unchanged), walk upstream to features.user_features (row count down 30%), walk upstream to stg.events (schema-mismatch error in dbt). Fix once, re-run, verify the downstream graph turns green.
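
The same walk can be sketched in code: move upstream from the failing node and stop at the unhealthy node whose own inputs are all healthy. The graph and health flags below are hypothetical data mirroring the scenario above, root_causes is an illustrative helper, and the recursion assumes the lineage graph has no cycles.

```python
# Hypothetical upstream edges and per-node health from the latest runs.
upstream = {
    "model.churn": ["features.user_features"],
    "features.user_features": ["stg.events"],
    "stg.events": ["raw.events"],
    "raw.events": [],
}
healthy = {
    "model.churn": False,             # AUC dropped
    "features.user_features": False,  # row count down 30%
    "stg.events": False,              # schema-mismatch error
    "raw.events": True,
}


def root_causes(node: str) -> set:
    """Return unhealthy nodes at or above `node` whose own inputs are all healthy."""
    if healthy[node]:
        return set()
    bad_parents = [p for p in upstream[node] if not healthy[p]]
    if not bad_parents:
        return {node}
    causes = set()
    for parent in bad_parents:
        causes |= root_causes(parent)
    return causes


print(root_causes("model.churn"))  # {'stg.events'}
```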

