Decompose ML reproducibility into its four orthogonal dimensions and identify which tools address each.
Compare DVC, lakeFS, and Delta Lake/Iceberg by mental model, scope, and ideal use case.
Build a digest-pinned, lockfile-driven container image and define a pipeline-as-code workflow.
Trace an end-to-end OpenLineage graph from raw events to model predictions and use it for compliance and debugging.
Section 1: Reproducibility in ML
Pre-Reading Quiz — Reproducibility
1. Why is git alone insufficient for versioning an ML workflow?
Git cannot track Python files that import non-standard libraries.
Git poorly handles large binary datasets, ephemeral environments, and stochastic state.
Git only supports one branch at a time, so parallel experiments are impossible.
Git automatically deletes files larger than 1 MB, breaking model artifacts.
2. Which tuple uniquely defines a reproducible ML run?
(branch name, dataset name, OS version)
(git commit, data version, container image digest, seed/config)
(MLflow run ID, hostname, timestamp)
(model accuracy, loss value, training duration)
3. Which level of reproducibility requires identical floating-point outputs across runs?
An ML system is the marriage of code, data, environment, and stochastic processes. Each can drift independently and silently invalidate yesterday's results. A reproducible experiment is one where a tuple of (git commit, data version, container image digest, seed/config) uniquely defines the run.
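One way to make the tuple concrete is to record it as a small manifest and hash it into a single run fingerprint. The sketch below is only an illustration: the `RunManifest` class, its field names, and the sample values are hypothetical and not part of any tool named in this lesson.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """The four coordinates of a run (illustrative field names)."""
    git_commit: str
    data_version: str
    image_digest: str
    seed: int

    def fingerprint(self) -> str:
        # Sorted keys give a stable serialization, so equal tuples hash equally.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


# Placeholder values, not real identifiers.
fake_digest = "sha256:" + "ab" * 32
run_a = RunManifest("3f2a1bc", "data-v7", fake_digest, seed=42)
same_as_a = RunManifest("3f2a1bc", "data-v7", fake_digest, seed=42)
run_b = RunManifest("3f2a1bc", "data-v7", fake_digest, seed=43)

print(run_a.fingerprint() == same_as_a.fingerprint())  # True
print(run_a.fingerprint() == run_b.fingerprint())      # False
```

Changing any one coordinate, here only the seed, yields a different fingerprint, which is the property the tuple is meant to provide.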
The Four Dimensions of ML Reproducibility
Dimension | Covers | Primary Tools | Common Failure
--- | --- | --- | ---
Code | Scripts, configs, pipelines | Git, Airflow, Kubeflow, dvc.yaml | Untracked notebook edits
Data | Raw inputs, splits, features, labels | DVC, lakeFS, Delta Lake | Silent overwrites, schema drift
Environment | OS, CUDA, libraries | Docker, pip-tools, Poetry, conda-lock | "Worked yesterday" post-upgrade
Randomness | Initialization, shuffling, dropout | Framework seed APIs, deterministic flags | One forgotten seed call
graph TD
R[Reproducible ML Run]
R --> C[Code Dimension]
R --> D[Data Dimension]
R --> E[Environment Dimension]
R --> S[Randomness Dimension]
C --> C1[Git commit SHA]
C --> C2[Pipeline-as-code]
D --> D1[DVC hashes]
D --> D2[lakeFS commits]
D --> D3[Delta Lake version]
E --> E1[Container digest]
E --> E2[Lockfiles]
E --> E3[CUDA + driver]
S --> S1[Seed APIs]
S --> S2[Deterministic flags]
S --> S3[Per-rank seed offsets]
For DistributedDataParallel, derive each rank's seed as base_seed + rank so that workers do not all draw the same random stream. Neither seed APIs nor deterministic flags guarantee bitwise reproducibility across different GPU architectures.
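A minimal sketch of per-rank seeding follows. The helper names `rank_seed` and `seed_everything` are hypothetical. NumPy and PyTorch are seeded only if they happen to be installed, and the cuDNN settings shown are one common choice rather than a complete determinism recipe.

```python
import random


def rank_seed(base_seed: int, rank: int) -> int:
    """Per-rank seed as described above: base_seed + rank."""
    return base_seed + rank


def seed_everything(base_seed: int, rank: int = 0) -> int:
    """Seed the RNGs that are available in this process; return the seed used."""
    seed = rank_seed(base_seed, rank)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)  # seeds the CPU and CUDA generators
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; nothing to seed
    return seed


# Re-seeding with the same (base_seed, rank) replays the same random stream.
seed_everything(42, rank=1)
first = random.random()
seed_everything(42, rank=1)
second = random.random()
print(first == second)  # True
```

In a real job the rank would come from the launcher's environment; it is passed explicitly here to keep the sketch self-contained.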
Key Points
Reproducibility is a four-dimensional problem: code, data, environment, randomness.
Git alone covers only the code dimension; the other three need their own toolchains.
A run is reproducible only when the full tuple (commit, data hash, image digest, seed) is recorded and restorable.
Pick the lowest acceptable reproducibility level — chasing bitwise determinism in a Spark feature job wastes effort.
Distributed training adds variance via non-associative floating-point reductions and non-deterministic CUDA kernels.
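The non-associativity mentioned in the last point can be seen without a GPU. The sketch below sums the same three numbers in two orders, standing in for two different reduction schedules; the function name `reduce_in_order` is illustrative.

```python
def reduce_in_order(values, order):
    """Sum values in the given index order, as one reduction schedule might."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


values = [0.1, 0.2, 0.3]
forward = reduce_in_order(values, [0, 1, 2])   # (0.1 + 0.2) + 0.3
backward = reduce_in_order(values, [2, 1, 0])  # (0.3 + 0.2) + 0.1

print(forward == backward)  # False
```

The two results differ only in the last bits, but that is enough to break bitwise equality when the reduction order changes between runs.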
Section 2: Data Versioning Tools
Pre-Reading Quiz — Data Versioning Tools
1. In DVC, what does git actually store when you run dvc add data/raw/?
A compressed copy of the entire dataset.
A small .dvc metadata file containing the content hash; the bytes live in a DVC remote.
Nothing — DVC bypasses git entirely.
A symbolic link to S3 with embedded credentials.
2. What makes lakeFS branches cheap to create even over petabyte-scale data?
It physically duplicates objects in parallel using GPUs.
Copy-on-write semantics — objects are copied only when modified on the branch.
It limits branches to 1 GB maximum size.
It stores everything in RAM until merge.
3. How does Delta Lake provide time-travel queries on object storage?
It snapshots Parquet files every hour to a backup bucket.
A _delta_log directory records every transaction; engines read it to determine which files constitute a given table version.
It periodically uploads CSV exports to a versioning service.
It uses git submodules to track each Parquet file.
Large datasets do not fit in git, and git's text-oriented diffing makes binary diffs useless. Three patterns dominate: a git-like project tool (DVC), a branchable layer over object storage (lakeFS), and a transactional table format (Delta Lake, Iceberg).
DVC: Git-Like at the Project Level
dvc add computes a content hash, moves bytes into a cache, and writes a tiny .dvc metadata file you commit to git. Actual bytes live in a DVC remote (S3/GCS/Azure). The git history of .dvc files is the git history of your data.
git checkout v1.3-paper-submission
dvc pull # downloads exact data hashes for this commit
dvc repro # re-runs the pipeline defined in dvc.yaml
Animation A1: DVC + Git Workflow
Git tracks tiny .dvc metadata pointers; DVC pushes the actual bytes to a separate remote. Reproducing a run is git checkout + dvc pull.
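The pointer-plus-cache idea can be sketched in a few lines. This is a toy model and not DVC's implementation: the functions `toy_add` and `toy_checkout`, the `.pointer.json` file, and the use of SHA-256 are all assumptions made for illustration, and the file is copied into the cache rather than moved.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def toy_add(data_file: Path, cache_dir: Path) -> Path:
    """Hash a file, copy it into a content-addressed cache, write a pointer."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(data_file, cached)
    # The small pointer file is what would be committed to git.
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"sha256": digest, "path": data_file.name}))
    return pointer


def toy_checkout(pointer: Path, cache_dir: Path) -> bytes:
    """Resolve a pointer back to the exact bytes it was created from."""
    digest = json.loads(pointer.read_text())["sha256"]
    return (cache_dir / digest[:2] / digest[2:]).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data = root / "train.csv"
    data.write_text("id,label\n1,0\n2,1\n")
    pointer = toy_add(data, root / "cache")
    data.write_text("overwritten")  # simulate a silent overwrite
    restored = toy_checkout(pointer, root / "cache")

print(restored == b"id,label\n1,0\n2,1\n")  # True
```

Because the pointer records a content hash, the original bytes remain recoverable even after the working copy is overwritten.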
lakeFS: Branchable Over Object Storage
lakeFS versions an entire bucket, exposing S3 (or GCS/Azure) through a versioning layer with branches, commits, and merges. Branches are cheap because of copy-on-write — objects are copied only when modified on the branch.
Animation A2: lakeFS Branching and Merge
A dev branch forks from main at zero storage cost. New writes happen only on the branch. If the experiment wins, merge back; if not, discard.
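Copy-on-write can be illustrated with a toy in-memory repository. The `ToyRepo` class below is hypothetical and greatly simplified compared with lakeFS: a branch is just a path-to-object-id table, creating a branch copies that table but no object bytes, and the merge is a naive "source wins" with no conflict detection.

```python
class ToyRepo:
    def __init__(self):
        self.objects = {}             # object id -> bytes
        self.branches = {"main": {}}  # branch -> {path: object id}
        self._next_id = 0

    def put(self, branch, path, data):
        # Every write creates a new object; existing objects are never mutated.
        oid = f"obj-{self._next_id}"
        self._next_id += 1
        self.objects[oid] = data
        self.branches[branch][path] = oid

    def create_branch(self, name, source="main"):
        # Only the pointer table is copied, not the underlying bytes.
        self.branches[name] = dict(self.branches[source])

    def get(self, branch, path):
        return self.objects[self.branches[branch][path]]

    def merge(self, source, target="main"):
        # Naive merge: pointers from the source branch win.
        self.branches[target].update(self.branches[source])


repo = ToyRepo()
repo.put("main", "events.parquet", b"v1")
repo.create_branch("dev")
objects_after_branch = len(repo.objects)   # still 1: branching stored nothing new

repo.put("dev", "events.parquet", b"v2")   # the write lands only on dev
main_before_merge = repo.get("main", "events.parquet")
repo.merge("dev")
main_after_merge = repo.get("main", "events.parquet")

print(objects_after_branch, main_before_merge, main_after_merge)  # 1 b'v1' b'v2'
```

Discarding the experiment would amount to deleting the `dev` entry from `branches`; `main` never saw the new object until the merge.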
Delta Lake and Iceberg: ACID Time Travel for Tables
Delta Lake is a transactional table format: Parquet data files plus a _delta_log/ directory that records every transaction as a JSON commit file, with periodic checkpoint files that summarize the log so far. Engines read the log to determine which files constitute a given table version.
-- Read the exact features table the model was trained on
SELECT * FROM features.user_features VERSION AS OF 37;
SELECT * FROM features.user_features TIMESTAMP AS OF '2026-05-01 09:00:00';
Animation A3: Delta Lake Time Travel
The transaction log resolves "VERSION AS OF 36" to the exact set of Parquet files present at that version — reads only A, B, and C.
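How a log resolves a version to a file set can be sketched by replaying add and remove actions. The log below is a toy with made-up file names and only two action types; the real _delta_log carries much more metadata per action.

```python
def files_at_version(log, version):
    """Replay add/remove actions for commits 0..version (inclusive)."""
    live = set()
    for commit in log[: version + 1]:
        for action, path in commit:
            if action == "add":
                live.add(path)
            elif action == "remove":
                live.discard(path)
    return sorted(live)


# Each list entry is one commit; its index is the table version.
toy_log = [
    [("add", "A.parquet")],                           # version 0
    [("add", "B.parquet"), ("add", "C.parquet")],     # version 1
    [("remove", "A.parquet"), ("add", "D.parquet")],  # version 2
]

print(files_at_version(toy_log, 1))  # ['A.parquet', 'B.parquet', 'C.parquet']
print(files_at_version(toy_log, 2))  # ['B.parquet', 'C.parquet', 'D.parquet']
```

Reading an older version never touches files added later, and files removed later are still reachable as long as they have not been physically deleted.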
Tool Comparison
Dimension | DVC | lakeFS | Delta Lake
--- | --- | --- | ---
Mental model | Git for data in ML repo | Git for a bucket | ACID table with txn log
Scope | Single project | Entire data lake | Per-table
Versioning unit | .dvc hash | Commit ID | Table version / timestamp
Sweet spot scale | GBs to low TBs | TBs to PBs cross-team | TBs to PBs tabular
Time travel | git checkout + dvc checkout | s3://repo/<commit-id>/path | VERSION AS OF n
Large organizations often stack all three: lakeFS at the bucket layer, Delta Lake for ACID tables inside it, DVC for project-local slices and models.
Key Points
DVC stores hashes in git and bytes in a separate remote (S3/GCS/Azure).
lakeFS provides zero-copy branches over object storage via copy-on-write semantics.
Delta Lake's _delta_log records every transaction, giving ACID and time-travel queries.
The three tools operate at different layers (project, bucket, table) and frequently coexist.
Iceberg is Delta Lake's closest competitor — choose based on engine ecosystem (Databricks vs Trino/Snowflake).
Section 3: Pipeline and Environment Versioning
Pre-Reading Quiz — Pipeline & Environment
1. Why should you log the container image digest rather than the tag?
Digests are shorter and easier to type.
Tags like :latest or :v1.3 can be re-pointed; SHA-256 digests cannot.
Digests automatically include the build date.
Tags do not work with GPU images.
2. What is the role of a lockfile like requirements.lock or poetry.lock in a Dockerfile?
It encrypts dependencies for secure delivery.
It pins every transitive dependency to an exact version + hash so rebuilds are deterministic.
It compresses the requirements list to save bandwidth.
It tells Docker which port to expose.
3. Which statement best captures "pipeline-as-code"?
Pipelines must be implemented in Rust for performance.
Orchestration topology lives in version-controlled source files, not in a UI someone clicked.
Each pipeline step must be a one-line bash command.
Pipelines are auto-generated by an LLM from a prompt.
Docker as the Environment Unit
A container image encodes the OS userland, CUDA/cuDNN, Python interpreter, system libs, and all Python dependencies into a single immutable artifact identified by a SHA-256 digest. Two engineers running docker pull myorg/ml@sha256:abc123... get byte-identical image contents; the host kernel and GPU driver sit outside the image and must be tracked separately.
Two rules: (1) log the image digest, never the tag; (2) treat the image as immutable — rebuild and re-tag instead of docker exec-ing fixes.
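Rule (1) can be enforced at logging time with a small guard. This is a sketch under assumptions: the function name `is_digest_pinned` is hypothetical, and the pattern only checks for a trailing `@sha256:` followed by 64 hex characters rather than validating the full image reference grammar.

```python
import re

# Anything without whitespace or '@', then '@sha256:' and 64 lowercase hex chars.
DIGEST_REF = re.compile(r"[^\s@]+@sha256:[0-9a-f]{64}")


def is_digest_pinned(image_ref: str) -> bool:
    """True if the reference ends in a SHA-256 digest rather than only a tag."""
    return DIGEST_REF.fullmatch(image_ref) is not None


fake_digest = "a" * 64  # placeholder, not a real image digest

print(is_digest_pinned("myorg/ml:latest"))                      # False
print(is_digest_pinned(f"myorg/ml@sha256:{fake_digest}"))       # True
print(is_digest_pinned(f"myorg/ml:v1.3@sha256:{fake_digest}"))  # True
```

A training script could refuse to start, or at least warn, when the reference it is asked to log fails this check.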
Lockfile Tools
pip-tools (pip-compile) produces a fully pinned requirements.txt from requirements.in, covering transitive dependencies and, when run with --generate-hashes, their content hashes.
Poetry resolves pyproject.toml into poetry.lock with exact versions and content hashes.
conda-lock generates platform-specific lockfiles from environment.yml, capturing non-Python packages too.
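Whatever tool produces the lockfile, a build can verify that it is actually locked. The sketch below assumes a pip-style requirements format with backslash continuations and `--hash=sha256:` options; the function name and the package names `alpha` and `beta` are hypothetical.

```python
def unpinned_requirements(lock_text: str) -> list[str]:
    """Return requirements that lack an exact '==' pin or a sha256 hash."""
    # Join backslash continuations so each requirement is one logical line.
    logical_lines = lock_text.replace("\\\n", " ").splitlines()
    problems = []
    for raw in logical_lines:
        line = raw.strip()
        # Skip blanks, comments, and global options such as --index-url.
        if not line or line.startswith("#") or line.startswith("-"):
            continue
        requirement = line.split()[0]
        if "==" not in requirement or "--hash=sha256:" not in line:
            problems.append(requirement)
    return problems


fake_hash = "0" * 64  # placeholder hash
sample = (
    "alpha==1.0.0 \\\n"
    f"    --hash=sha256:{fake_hash}\n"
    "    # via beta\n"
    "beta>=2.0\n"
)

print(unpinned_requirements(sample))  # ['beta>=2.0']
```

Running a check like this before the image build catches a loosely pinned dependency before it silently changes the environment.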
Pipeline-as-Code
Pipeline definitions belong in version-controlled source files, not in a UI someone clicked six months ago. Forms include Airflow DAGs (Python), Kubeflow/Argo (YAML), DVC pipelines (dvc.yaml), and Python-native frameworks (Dagster, Prefect, Metaflow).
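Stripped of any particular orchestrator, pipeline-as-code means the topology is plain data in a source file that can be diffed and reviewed. The sketch below uses only the standard library (`graphlib`, Python 3.9+); the step names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on. This dict is the part
# that lives in version control and gets code review.
PIPELINE = {
    "ingest": set(),
    "featurize": {"ingest"},
    "train": {"featurize"},
    "evaluate": {"train", "featurize"},
}


def execution_order(pipeline):
    """Return one valid execution order for the declared dependencies."""
    return list(TopologicalSorter(pipeline).static_order())


print(execution_order(PIPELINE))  # ['ingest', 'featurize', 'train', 'evaluate']
```

Adding or rewiring a step becomes a reviewed change to this mapping, and the git history records when the topology changed.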
Pre-Reading Quiz — Lineage & Provenance
1. Which entities form the core of the OpenLineage data model?
2. How is lineage typically emitted for Airflow, Spark, and dbt?
Engineers manually write JSON events after each run.
Auto-emitting integrations (providers, listener jars, plugins) hook into the tool's lifecycle.
It's harvested nightly by scraping log files with regex.
It's encoded into git commit messages and parsed later.
3. Why does the EU AI Act + GDPR make lineage compliance-critical?
Lineage automatically anonymizes PII before training.
It supports queries like "which models depend on user X's data" needed for right-to-be-forgotten and AI Act traceability.
Lineage replaces the need for differential privacy.
It exempts companies from data-processing obligations.
Data lineage is the graph that ties raw sources to features to models to predictions. It is the difference between "this model was trained on user data" and "this exact run, with this commit SHA, on these specific Delta table versions, produced this artifact."
OpenLineage Data Model
Job: a logical unit of work (Airflow task, Spark job, dbt model). Identified by name + namespace.
Run: a single execution of a Job, identified by a runId (UUID).
Dataset: a logical dataset read or written — tables, files, streams, model artifacts.
Event: START / COMPLETE / FAIL JSON messages with inputs, outputs, and extensible facets.
Facets carry rich metadata: schema, columnLineage, dataQualityMetrics, errorMessage, sourceCode, sql, plus custom ML facets for hyperparameters.
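To make the model concrete, the sketch below assembles an event-shaped dictionary by hand using only the standard library. It is not the OpenLineage client: the helper `make_run_event`, the `ml-platform` namespace, and the producer URL are assumptions, facets are left out, and a spec-compliant event would also carry a schemaURL field.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(event_type, job_name, inputs, outputs,
                   run_id=None, namespace="ml-platform"):
    """Build a minimal OpenLineage-style run event as a plain dict."""
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": n} for n in inputs],
        "outputs": [{"namespace": namespace, "name": n} for n in outputs],
        "producer": "https://example.com/toy-producer",  # placeholder URL
    }


# START and COMPLETE for the same execution share one runId.
start = make_run_event("START", "user_features_job",
                       inputs=["stg.events"], outputs=["features.user_features"])
complete = make_run_event("COMPLETE", "user_features_job",
                          inputs=["stg.events"], outputs=["features.user_features"],
                          run_id=start["run"]["runId"])

print(json.dumps(complete["job"]))
```

The shared runId is what lets a lineage store stitch the two events into a single Run of the Job.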
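Airflow: A provider package hooks into the task lifecycle and emits jobs, runs, and datasets for supported operators.
Spark: A listener jar attaches to logical/physical plans; emits jobs + datasets with column-level lineage where derivable.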
dbt: A plugin reads manifest.json + run_results.json for rich column-level lineage from compiled SQL.
End-to-End Lineage Graph
graph LR
K[(Kafka events)]
K -->|ingest_events| RE[raw.events]
RE -->|dbt: stg_events| SE[stg.events]
SE -->|Spark: user_features_job| UF[features.user_features Delta v=42]
UF -->|train_churn_model| MC[models.churn_model:v1.3 git SHA + image digest]
UF -->|batch_inference| CP[predictions.churn_scores]
MC -->|batch_inference| CP
CP -->|consumed| APP[CRM, dashboards]
In Marquez (the reference open-source store/UI), start at predictions.churn_scores and walk upstream through the model artifact, feature table, dbt staging model, all the way to raw events. Every node carries run history, schema, and facets.
Compliance Queries Become Graph Walks
"Show all models trained on PII-tagged datasets" → filter by sensitivity: PII, walk downstream.
"User X revoked consent — which models need retraining?" → identify their datasets, walk forward to models.
"Prove this credit model didn't use applicant_race" → column-level lineage from training inputs.
Debugging via Lineage
Churn model AUC drops 0.80 → 0.72 overnight. Walk: open latest training run, inspect facets (unchanged), walk upstream to features.user_features (row count down 30%), walk upstream to stg.events (schema-mismatch error in dbt). Fix once, re-run, verify the downstream graph turns green.
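Both the compliance queries and this debugging walk reduce to traversals of the same graph, downstream for impact and upstream for root cause. The sketch below hard-codes the example graph from this section; the `TAGS` and `STATUS` dictionaries and the function names are hypothetical stand-ins for what a lineage store would return.

```python
from collections import deque

# Edges point downstream, following the example graph in this section.
DOWNSTREAM = {
    "raw.events": ["stg.events"],
    "stg.events": ["features.user_features"],
    "features.user_features": ["models.churn_model", "predictions.churn_scores"],
    "models.churn_model": ["predictions.churn_scores"],
    "predictions.churn_scores": [],
}
# Invert the edges once to get the upstream view.
UPSTREAM = {node: [] for node in DOWNSTREAM}
for parent, children in DOWNSTREAM.items():
    for child in children:
        UPSTREAM[child].append(parent)

TAGS = {"raw.events": {"PII"}}  # hypothetical sensitivity tags
STATUS = {                      # hypothetical health per node
    "features.user_features": "row count down 30%",
    "stg.events": "schema-mismatch error",
}


def reachable(start, edges):
    """Breadth-first walk returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in edges.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def models_trained_on(tag):
    """Compliance query: models downstream of any dataset carrying the tag."""
    hits = set()
    for dataset, tags in TAGS.items():
        if tag in tags:
            hits |= {n for n in reachable(dataset, DOWNSTREAM)
                     if n.startswith("models.")}
    return sorted(hits)


def root_causes(start):
    """Debugging query: unhealthy upstream nodes whose own parents are healthy."""
    candidates = {start} | reachable(start, UPSTREAM)
    return sorted(n for n in candidates
                  if n in STATUS and not any(p in STATUS for p in UPSTREAM[n]))


print(models_trained_on("PII"))           # ['models.churn_model']
print(root_causes("models.churn_model"))  # ['stg.events']
```

The feature table is unhealthy too, but it is excluded as a root cause because its parent is also unhealthy, which mirrors the manual walk described above.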
Key Points
OpenLineage standardizes lineage events around Jobs, Runs, Datasets, and extensible Facets.
Auto-integrations for Airflow, Spark, and dbt mean you do not emit lineage by hand.
Marquez ingests and visualizes the lineage graph for both forward and backward navigation.
Lineage converts GDPR/EU AI Act compliance questions from forensic SQL into dashboard clicks.
Beyond compliance, lineage enables fast root-cause analysis and impact analysis for planned changes.
Post-Reading Quizzes
Post-Reading Quiz — Reproducibility
1. Why is git alone insufficient for versioning an ML workflow?
Git cannot track Python files that import non-standard libraries.
Git poorly handles large binary datasets, ephemeral environments, and stochastic state.
Git only supports one branch at a time, so parallel experiments are impossible.
Git automatically deletes files larger than 1 MB, breaking model artifacts.
2. Which tuple uniquely defines a reproducible ML run?
(branch name, dataset name, OS version)
(git commit, data version, container image digest, seed/config)
(MLflow run ID, hostname, timestamp)
(model accuracy, loss value, training duration)
3. Which level of reproducibility requires identical floating-point outputs across runs?
Post-Reading Quiz — Data Versioning Tools
1. In DVC, what does git actually store when you run dvc add data/raw/?
A compressed copy of the entire dataset.
A small .dvc metadata file containing the content hash; the bytes live in a DVC remote.
Nothing — DVC bypasses git entirely.
A symbolic link to S3 with embedded credentials.
2. What makes lakeFS branches cheap to create even over petabyte-scale data?
It physically duplicates objects in parallel using GPUs.
Copy-on-write semantics — objects are copied only when modified on the branch.
It limits branches to 1 GB maximum size.
It stores everything in RAM until merge.
3. How does Delta Lake provide time-travel queries on object storage?
It snapshots Parquet files every hour to a backup bucket.
A _delta_log directory records every transaction; engines read it to determine which files constitute a given table version.
It periodically uploads CSV exports to a versioning service.
It uses git submodules to track each Parquet file.
Post-Reading Quiz — Pipeline & Environment
1. Why should you log the container image digest rather than the tag?
Digests are shorter and easier to type.
Tags like :latest or :v1.3 can be re-pointed; SHA-256 digests cannot.
Digests automatically include the build date.
Tags do not work with GPU images.
2. What is the role of a lockfile like requirements.lock or poetry.lock in a Dockerfile?
It encrypts dependencies for secure delivery.
It pins every transitive dependency to an exact version + hash so rebuilds are deterministic.
It compresses the requirements list to save bandwidth.
It tells Docker which port to expose.
3. Which statement best captures "pipeline-as-code"?
Pipelines must be implemented in Rust for performance.
Orchestration topology lives in version-controlled source files, not in a UI someone clicked.
Each pipeline step must be a one-line bash command.
Pipelines are auto-generated by an LLM from a prompt.
Post-Reading Quiz — Lineage & Provenance
1. Which entities form the core of the OpenLineage data model?
2. How is lineage typically emitted for Airflow, Spark, and dbt?
Engineers manually write JSON events after each run.
Auto-emitting integrations (providers, listener jars, plugins) hook into the tool's lifecycle.
It's harvested nightly by scraping log files with regex.
It's encoded into git commit messages and parsed later.
3. Why does the EU AI Act + GDPR make lineage compliance-critical?
Lineage automatically anonymizes PII before training.
It supports queries like "which models depend on user X's data" needed for right-to-be-forgotten and AI Act traceability.
Lineage replaces the need for differential privacy.
It exempts companies from data-processing obligations.