Study Guide: Foundations of ML Pipelines and MLOps

Pre-Quiz — Section 1: What is an ML Pipeline?

1. Why is an ML pipeline a better fit for production than a notebook, even when both produce the same model file?

A) Notebooks cannot import Python packages. B) Pipelines provide modular, versioned, orchestrated stages with retries, lineage, and reproducibility that notebooks lack. C) Notebooks always run slower than scripts. D) Pipelines are required by Python's standard library.

2. An upstream team renames a column from "region" to "state" without warning. Why would a cron-scheduled standalone script handle this worse than a real ML pipeline?

A) The script would automatically detect the rename and rewrite its code. B) A pipeline orchestrator surfaces a task-level failure with metadata and lineage; a script just emits stack traces and lacks the structured retry, logging, and data-validation affordances of a pipeline. C) Scripts can never read CSV files. D) Pipelines disable cron jobs to prevent collisions.

3. The chapter compares an ML pipeline to a franchise restaurant scaling a home cook's recipe. What property of production ML does this analogy primarily illustrate?

A) Production ML needs standardization, calibrated inputs/outputs, and feedback loops so the workflow runs repeatably across many environments and operators. B) Models must always be retrained at 02:00 daily. C) Only chefs can deploy ML models. D) Notebooks make recipes taste better than pipelines.

1. What is an ML Pipeline?

Key Points

An ML pipeline is a reusable, orchestrated DAG of stages (ingest → validate → features → train → evaluate → validate → deploy → monitor), not a single notebook or script.
Notebooks are for exploration; scripts are for single-task automation; pipelines are for end-to-end production with versioning, retries, and lineage.
Production ML demands pipelines because data is alive, models decay, reproducibility is non-trivial, train-serve skew lurks, and stakeholders are heterogeneous.
Pipelines coordinate data engineers, data scientists, ML engineers, DevOps, business owners, and compliance — each owning a different artifact handoff.
The franchise-restaurant analogy: a recipe alone is not enough to feed 10,000 customers per day; you need measured inputs, calibrated process, and feedback.

An ML pipeline is the industrialized version of an experimental workflow — the set of automated, orchestrated, repeatable steps that take data from its source and turn it into a deployed, monitored, retrainable model. Each stage is a clearly defined unit with known inputs and outputs; the orchestrator wires the stages together, runs them on schedule or on demand, retries on transient failure, and records what happened for later inspection.
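The description above can be sketched in a few lines of Python. This is a toy in-process runner under stated assumptions, not a real orchestrator: `Stage`, `run_pipeline`, and the three example stages are hypothetical names, and the runner retries on any exception, whereas a production system would distinguish transient failures from permanent ones.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Stage:
    """A pipeline stage: a named function with declared inputs and one output."""
    name: str
    fn: Callable[..., Any]
    inputs: List[str] = field(default_factory=list)
    max_retries: int = 2


def run_pipeline(stages: List[Stage], backoff_s: float = 0.0) -> Dict[str, Any]:
    """Run stages in the order given, retrying failures and logging every attempt."""
    artifacts: Dict[str, Any] = {}
    run_log: List[dict] = []
    for stage in stages:
        args = [artifacts[name] for name in stage.inputs]
        for attempt in range(1, stage.max_retries + 2):
            try:
                artifacts[stage.name] = stage.fn(*args)
                run_log.append({"stage": stage.name, "attempt": attempt, "status": "ok"})
                break
            except Exception as exc:  # a sketch: real systems retry only transient errors
                run_log.append({"stage": stage.name, "attempt": attempt,
                                "status": "failed", "error": repr(exc)})
                if attempt > stage.max_retries:
                    raise
                time.sleep(backoff_s)
    artifacts["_run_log"] = run_log
    return artifacts


_calls = {"ingest": 0}


def flaky_ingest() -> List[float]:
    """Fails on the first call to simulate a transient error, then returns rows."""
    _calls["ingest"] += 1
    if _calls["ingest"] == 1:
        raise ConnectionError("temporary network failure")
    return [1.0, 2.0, 3.0, 4.0]


def center_features(rows: List[float]) -> List[float]:
    mean = sum(rows) / len(rows)
    return [r - mean for r in rows]


def train(features: List[float]) -> Dict[str, float]:
    # Stand-in for real training: summarizes the features into one number.
    return {"weights": sum(abs(f) for f in features)}


result = run_pipeline([
    Stage("ingest", flaky_ingest),
    Stage("features", center_features, inputs=["ingest"]),
    Stage("train", train, inputs=["features"]),
])
```

The run log records one failed and one successful attempt for `ingest`. That is the kind of structured, per-task record that a bare cron-scheduled script does not produce.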

Why not just take a well-written script and slap a cron job on it? Because production ML carries a unique combination of five properties: data is alive (inputs change continuously and silently); models degrade through concept drift even when the code never changes; reproducibility is non-trivial (the data snapshot, random seeds, code, and library versions are all needed); train-serve skew creates silent regressions when training and serving preprocess data differently; and stakeholders are heterogeneous (data scientists, data engineers, ML engineers, DevOps, security, compliance, and business owners all touch the system). The orchestrated pipeline is the technical artifact that lets a team manage all five at once.
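To illustrate the "data is alive" property, here is a minimal sketch of a data-validation step that would catch the silent column rename from "region" to "state" used in the quiz. The schema in `EXPECTED_COLUMNS` and the names `SchemaError` and `validate_columns` are assumptions for illustration; dedicated validation libraries check far more than column names.

```python
from typing import Dict, List, Set

# Hypothetical schema agreed with the upstream team.
EXPECTED_COLUMNS: Set[str] = {"customer_id", "region", "amount"}


class SchemaError(ValueError):
    """Raised when an incoming batch does not match the expected schema."""


def validate_columns(rows: List[Dict[str, object]],
                     expected: Set[str] = EXPECTED_COLUMNS) -> None:
    """Fail fast, with a readable message, if columns were added, removed, or renamed."""
    if not rows:
        raise SchemaError("no rows received")
    actual = set(rows[0].keys())
    missing = sorted(expected - actual)
    unexpected = sorted(actual - expected)
    if missing or unexpected:
        raise SchemaError(
            f"missing columns: {missing}; unexpected columns: {unexpected}"
        )


good_batch = [{"customer_id": 1, "region": "CA", "amount": 10.0}]
renamed_batch = [{"customer_id": 1, "state": "CA", "amount": 10.0}]

validate_columns(good_batch)  # passes silently

try:
    validate_columns(renamed_batch)
    rename_error = ""
except SchemaError as exc:
    rename_error = str(exc)
```

Run as a dedicated pipeline stage, the check turns a confusing downstream stack trace into a named task failure that points at the renamed column.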

Notebook vs. Script vs. Pipeline

Property	Notebook	Script	ML Pipeline
Primary purpose	Exploration	One-task automation	End-to-end production
Execution model	Interactive, stateful cells	Linear, top-to-bottom	DAG of tasks, often containerized
Retries	Manual	None or shell-level	Orchestrator-managed
Versioning	Often missing	Git on code only	Git + data + model registry
Observability	Cell output	Stdout	Structured logs + metadata + lineage

Figure 1.1: End-to-end ML pipeline stages with feedback loop

flowchart LR
    A[Ingest] --> B[Validate Data]
    B --> C[Prepare Features]
    C --> D[Train Model]
    D --> E[Evaluate Model]
    E --> F{Validate Model}
    F -->|Pass| G[Deploy Model]
    F -.->|Fail| D
    G --> H[Monitor]
    H -.->|Drift / Decay| A

Pre-Quiz — Section 2: The MLOps Discipline

1. A team has automated their training DAG (Airflow), schedules nightly retraining, and logs metrics to MLflow, but pipeline code changes are still tested and deployed manually. Which Google MLOps maturity level best describes them?

A) Level 0 — Manual ML, because they still review metrics by hand. B) Level 1 — Pipeline Automation, because the training workflow is automated but the surrounding software lifecycle is not. C) Level 2 — CI/CD + Continuous Training, because they use Airflow. D) Level 3 — AGI ML, because MLflow is used.

2. Sculley et al. describe ML systems as a "high-interest credit card" of technical debt. Which failure mode best illustrates a debt that traditional software CI/CD would not catch?

A) A null-pointer exception in serving code. B) A typo in an HTML template. C) An upstream column changes semantics silently (e.g., "USD" becomes "USD,EUR"), degrading model accuracy without any code change or alert. D) A failed unit test on a deterministic addition function.

3. What single technical property primarily distinguishes MLOps from classic DevOps?

A) MLOps treats data and models as first-class versioned, tested, and monitored artifacts alongside code, not just code. B) MLOps requires using Kubernetes; DevOps does not. C) MLOps forbids git. D) MLOps is exclusively for cloud workloads.

2. The MLOps Discipline

Key Points

MLOps generalizes DevOps by treating data, models, and experimental configurations as first-class objects to be versioned, tested, deployed, and monitored.
Google's three-level maturity model: L0 (manual notebooks), L1 (orchestrated pipeline DAG), L2 (full CI/CD plus Continuous Training driven by drift).
L1 is the level many teams aspire to; L2 is a roadmap, not a checklist. Not every system needs L2, only those with rapidly changing data, high stakes, or many models.
Distinctive ML failure modes drive the discipline: train-serve skew, silent data dependency changes, hidden technical debt, feedback loops, reproducibility gaps.
The tooling landscape has stabilized into categories: orchestration, data versioning, experiment tracking, feature stores, registries, serving, monitoring.

MLOps applies DevOps-style automation, testing, and monitoring to the entire machine learning lifecycle. Crucially, it extends beyond classic DevOps by treating data, models, and experimental configurations as first-class objects to be versioned, tested, deployed, and monitored.

The Three Maturity Levels

Level 0 is what most teams start with: a notebook on a laptop, a .pkl emailed to an engineer, no orchestrator, no automated testing, no drift detection. This is acceptable for low-stakes, infrequently retrained models but fragile for anything customer-facing.

Level 1 automates the training workflow itself: the notebook becomes a four-task Airflow DAG (ingest → features → train → evaluate), every nightly run logs metrics to MLflow, and trained models land in a registry tagged with the data snapshot and code commit. Errors now manifest as task-level Airflow failures with structured metadata.
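The idea of a registry entry "tagged with the data snapshot and code commit" can be sketched as follows. This is an in-memory illustration under stated assumptions, not the MLflow Model Registry API: `ModelVersion`, `InMemoryRegistry`, and the example snapshot and commit identifiers are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ModelVersion:
    """One immutable registry entry linking a model to what produced it."""
    name: str
    version: int
    model_sha256: str
    data_snapshot: str
    code_commit: str
    metrics: Dict[str, float]


class InMemoryRegistry:
    """Toy registry: appends versions and never overwrites an existing one."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[ModelVersion]] = {}

    def register(self, name: str, model_bytes: bytes, data_snapshot: str,
                 code_commit: str, metrics: Dict[str, float]) -> ModelVersion:
        versions = self._versions.setdefault(name, [])
        entry = ModelVersion(
            name=name,
            version=len(versions) + 1,
            model_sha256=hashlib.sha256(model_bytes).hexdigest(),
            data_snapshot=data_snapshot,
            code_commit=code_commit,
            metrics=dict(metrics),
        )
        versions.append(entry)
        return entry

    def latest(self, name: str) -> ModelVersion:
        return self._versions[name][-1]


registry = InMemoryRegistry()
# Snapshot and commit identifiers below are made up for the example.
registry.register("churn", b"model-v1", "snapshot-2024-01-01", "abc1234", {"auc": 0.81})
registry.register("churn", b"model-v2", "snapshot-2024-01-02", "abc1234", {"auc": 0.83})
latest = registry.latest("churn")
```

Because each version carries its data snapshot and code commit, the question "what produced this model?" becomes a lookup rather than an investigation.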

Level 2 is full production-grade ML. Pull requests trigger CI runs with unit tests, data validation, and "train-on-sample" smoke tests, and a successful CI run triggers CD. In production, an Evidently dashboard tracks input distributions; when the Population Stability Index (PSI) of a monitored feature crosses 0.2, a retraining pipeline fires. The candidate model is shadow-deployed, then canary-released at 5% → 25% → 100% of traffic; rollback is one button.
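The drift trigger mentioned above can be sketched with a pure-Python Population Stability Index. The 0.2 threshold comes from the text; the equal-width binning, the `eps` floor for empty bins, and the function names `psi` and `should_retrain` are assumptions for illustration. Monitoring tools may bin differently.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference sample and a live sample.

    PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are
    the fractions of reference and live values that fall in bin i.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference sample

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(reference: Sequence[float], live: Sequence[float],
                   threshold: float = 0.2) -> bool:
    """Fire the retraining pipeline when PSI crosses the threshold."""
    return psi(reference, live) > threshold


reference = [float(v) for v in range(100)]
shifted = [v + 50.0 for v in reference]  # simulated drift: the whole distribution moves
```

With these samples, `should_retrain(reference, reference)` is false and `should_retrain(reference, shifted)` is true, which is the signal that would start a new training run.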

Figure 1.2: Google MLOps maturity model (L0 to L2)

flowchart TD
    L0["Level 0: Manual ML<br/>Notebooks, .pkl handoffs<br/>No orchestration, no tests"]
    L1["Level 1: Pipeline Automation<br/>Orchestrated DAG (Airflow/KFP)<br/>Scheduled retraining, basic monitoring"]
    L2["Level 2: CI/CD + Continuous Training<br/>Automated tests, canary deploys<br/>Drift-triggered retraining"]
    L0 ==>|"Refactor notebook into DAG"| L1
    L1 ==>|"Add CI/CD, drift monitoring, auto-retraining"| L2
    L2 -.->|Roadmap, not checklist| L1

Tooling Categories

Category	Purpose	Examples
Orchestration	Run pipeline DAGs	Airflow, Kubeflow Pipelines, Prefect, Dagster
Data versioning	Track datasets like code	DVC, LakeFS, Delta Lake
Experiment tracking	Log runs and params	MLflow, W&B, Neptune
Feature stores	Serve the same features to training and serving (reduces train-serve skew)	Feast, Tecton, Vertex AI Feature Store
Model registries	Catalog model versions	MLflow Model Registry, SageMaker Registry
Serving	Run models behind APIs	TF Serving, TorchServe, BentoML, KServe
Monitoring	Track drift and performance	Evidently, NannyML, Arize, WhyLabs
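Train-serve skew, which feature stores are meant to reduce, usually comes from writing the same feature logic twice. The sketch below shows the underlying idea without a feature store: one `transform` function and one set of fitted parameters shared by both paths. All names here (`TransformParams`, `fit_transform_params`, `transform`, the "amount" field) are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TransformParams:
    """Statistics learned at training time and shipped alongside the model."""
    mean: float
    std: float


def fit_transform_params(amounts: List[float]) -> TransformParams:
    mean = sum(amounts) / len(amounts)
    variance = sum((x - mean) ** 2 for x in amounts) / len(amounts)
    return TransformParams(mean=mean, std=math.sqrt(variance) or 1.0)


def transform(raw: Dict[str, float], params: TransformParams) -> List[float]:
    """Single source of truth for feature logic, used by training AND serving."""
    return [(raw["amount"] - params.mean) / params.std]


# Training path: fit parameters once, then transform every training row.
training_rows = [{"amount": 10.0}, {"amount": 20.0}, {"amount": 30.0}]
params = fit_transform_params([r["amount"] for r in training_rows])
training_features = [transform(r, params) for r in training_rows]

# Serving path: reuse the same function and the same fitted parameters.
serving_features = transform({"amount": 20.0}, params)
```

Since the serving path imports the same function and loads the same parameters, the two paths cannot drift apart unnoticed.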

Pre-Quiz — Section 3: Anatomy of an End-to-End Pipeline

1. Why must an ML pipeline be a DAG — specifically acyclic — rather than a graph with cycles such as train → evaluate → train?

A) Acyclic graphs are required by Python type hints. B) A cycle would never terminate; retraining loops are implemented at a higher level by having monitoring trigger a brand-new run of the training pipeline. C) Cycles always cause memory leaks in Airflow. D) DAGs cannot have parallel branches.

2. The model-validation gate stage exists between evaluate and deploy. What would happen if a team omitted it from their pipeline?

A) The pipeline would run faster, with no downside. B) Mermaid diagrams would render incorrectly. C) Regressions could reach production: any newly trained model would auto-deploy regardless of whether its metrics beat the baseline or held on protected subgroups. D) The model would always achieve higher accuracy.

3. Which trigger is unique to ML pipelines and does not exist for traditional software CI/CD?

A) Git commit triggers. B) Cron schedule triggers. C) Manual operator triggers. D) Drift-driven triggers, where a monitoring system fires the pipeline because input distributions or model performance crossed a threshold.

3. Anatomy of an End-to-End Pipeline

Key Points

Eight canonical stages: ingest, validate data, prepare features, train, evaluate, validate model, deploy, monitor.
The DAG structure is load-bearing: edges encode artifact/metadata dependencies, not time; that is what enables parallelism, retries, and conditional branching.
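ML pipelines have many triggers: code commit, schedule, event (new data), drift, performance, manual. Traditional CI/CD is driven mainly by commits, with schedule and manual triggers as secondary options; the data-driven triggers (new data, drift, performance) are specific to ML.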
Every run produces artifacts (data, features, models, metrics) plus metadata/lineage — the chain-of-custody that lets you ask "what produced this model?".
TFX components map almost 1:1 to the canonical stages (ExampleGen, StatisticsGen, Transform, Trainer, Evaluator, Pusher); the same shape recurs in Kubeflow Pipelines, Airflow, and Vertex AI Pipelines.

The DAG is not arbitrary jargon; each word carries a constraint. Directed: edges have arrows, so train starts only after prepare_features finishes. Acyclic: there are no cycles. You cannot have train → evaluate → train in a single DAG, because the scheduler could never find a valid execution order and the run would never terminate. Retraining loops are instead implemented at a higher level: the monitoring pipeline triggers a brand-new run of the training pipeline. Graph: because dependencies are explicit, tasks with no mutual dependency can run in parallel.
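Both properties can be demonstrated with Python's standard-library `graphlib` module (Python 3.9+). This is a sketch with hypothetical task names. The helper `execution_waves` groups tasks that could run in parallel, and `has_cycle` shows how a train → evaluate → train loop is rejected before anything runs.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

# Each task maps to the set of tasks it depends on (its predecessors).
dag: Dict[str, Set[str]] = {
    "validate_data": {"ingest"},
    "prepare_features": {"validate_data"},
    "train": {"prepare_features"},
    "train_baseline": {"prepare_features"},
    "evaluate": {"train", "train_baseline"},
}


def execution_waves(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into waves; tasks in the same wave have no mutual dependency."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


def has_cycle(graph: Dict[str, Set[str]]) -> bool:
    try:
        TopologicalSorter(graph).prepare()
        return False
    except CycleError:
        return True


waves = execution_waves(dag)
cyclic = {"train": {"evaluate"}, "evaluate": {"train"}}
```

Here `train` and `train_baseline` land in the same wave because neither depends on the other, while `has_cycle(cyclic)` returns true for the loop.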

Triggers are where ML pipelines diverge most visibly from software CI/CD. Software CI/CD is driven primarily by code changes (the git push), with schedules and manual runs as secondary triggers; ML pipelines additionally accept event-driven, drift-driven, and performance-driven triggers. In a Level 2 system the orchestrator records why each run was triggered, which is invaluable forensic information when a model misbehaves later.
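Recording why a run started can be as simple as attaching a trigger type and a free-text reason to each run record. The sketch below assumes hypothetical names (`Trigger`, `RunRecord`, `start_run`, `runs_triggered_by`) and made-up run identifiers; real orchestrators store this in their metadata database.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import List


class Trigger(Enum):
    CODE = "code"
    SCHEDULE = "schedule"
    EVENT = "event"
    DRIFT = "drift"
    PERFORMANCE = "performance"
    MANUAL = "manual"


@dataclass(frozen=True)
class RunRecord:
    run_id: str
    trigger: Trigger
    reason: str
    started_at: str  # ISO 8601 timestamp in UTC


def start_run(run_id: str, trigger: Trigger, reason: str) -> RunRecord:
    """Create the record that explains, later on, why this run exists."""
    return RunRecord(run_id, trigger, reason,
                     datetime.now(timezone.utc).isoformat())


def runs_triggered_by(records: List[RunRecord], trigger: Trigger) -> List[str]:
    return [r.run_id for r in records if r.trigger is trigger]


history = [
    start_run("run-001", Trigger.SCHEDULE, "nightly retraining"),
    start_run("run-002", Trigger.DRIFT, "PSI above 0.2 on a monitored feature"),
    start_run("run-003", Trigger.MANUAL, "operator requested rerun"),
]
drift_runs = runs_triggered_by(history, Trigger.DRIFT)
```

Filtering the history by trigger type then answers questions such as "which models in production were produced by drift-triggered runs?".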

Figure 1.3: DAG orchestration with parallel branches and conditional gating

flowchart TD
    Ingest((Ingest)) --> Validate[Validate Data]
    Validate --> Features[Prepare Features]
    Features --> Train[Train Model]
    Features --> Baseline[Train Baseline]
    Train --> Eval[Evaluate Candidate]
    Baseline --> Eval
    Eval --> Gate{"Metrics > baseline?"}
    Gate -->|Yes| Deploy[Deploy]
    Gate -.->|No| Stop((Stop))

Stage-by-Stage Summary

Stage	Output	Key Risk if Skipped
Ingest	Versioned data reference	Non-reproducible training data
Validate data	Pass/fail + statistics report	Silently poisoned models
Prepare features	Training dataset + transform graph	Train-serve skew
Train	Model artifact + metrics	Wasted compute, unreproducible models
Evaluate	Slice metrics, plots	Aggregate-good but subgroup-bad models
Validate (gate)	Approve/reject decision	Regressions reach production
Deploy	Running endpoint	Slow rollback, downtime
Monitor	Alerts, drift signals	Silent decay over weeks
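The validate (gate) row can be sketched as a function that compares a candidate's metrics with the baseline's, both overall and per slice. The function name `validate_model`, the metric layout, the slice names, and the tolerance `max_slice_drop` are assumptions chosen for illustration, not recommended values.

```python
from typing import Dict, List, Tuple


def validate_model(candidate: Dict[str, float], baseline: Dict[str, float],
                   min_gain: float = 0.0,
                   max_slice_drop: float = 0.02) -> Tuple[bool, List[str]]:
    """Approve only if the candidate beats the baseline overall and holds on every slice.

    Both dicts map a metric key to a score; the key "overall" is the aggregate
    metric and every other key is a slice (for example, a protected subgroup).
    """
    reasons: List[str] = []
    if candidate["overall"] <= baseline["overall"] + min_gain:
        reasons.append("overall metric does not beat baseline")
    for slice_name, base_score in baseline.items():
        if slice_name == "overall":
            continue
        cand_score = candidate.get(slice_name)
        if cand_score is None:
            reasons.append(f"slice '{slice_name}' missing from candidate metrics")
        elif cand_score < base_score - max_slice_drop:
            reasons.append(f"slice '{slice_name}' regressed beyond tolerance")
    return (not reasons, reasons)


baseline = {"overall": 0.90, "slice_a": 0.88, "slice_b": 0.85}
good_candidate = {"overall": 0.92, "slice_a": 0.89, "slice_b": 0.85}
# Better in aggregate, but one subgroup got noticeably worse.
bad_candidate = {"overall": 0.93, "slice_a": 0.91, "slice_b": 0.78}

approved_good, _ = validate_model(good_candidate, baseline)
approved_bad, rejection_reasons = validate_model(bad_candidate, baseline)
```

The second candidate is rejected despite its higher aggregate score. That is the "aggregate-good but subgroup-bad" case from the table.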

Pre-Quiz — Section 4: ML Pipelines vs Software CI/CD

1. What is the single load-bearing insight that drives almost every difference between ML pipelines and software CI/CD?

A) ML pipelines are always written in Python. B) ML behavior is determined by code interacting with data, not by code alone. C) ML pipelines run more slowly than software CI/CD. D) Software CI/CD does not use containers.

2. Two intertwined loops characterize MLOps: software CI/CD and Continuous Training (CT). Which best describes the difference?

A) CI/CD is triggered by code commits and ships pipeline/serving code; CT is triggered by data signals (new data, drift, performance) and produces new model artifacts. B) CI/CD is for staging; CT is for production. They never share infrastructure. C) CT replaces CI/CD entirely in MLOps. D) CT is triggered by git commits; CI/CD by drift.

3. A team versions code in git but not data. Why is "I rolled back to last week's git SHA" insufficient to reproduce last week's model?

A) Git SHAs expire after seven days. B) Models are bit-for-bit identical regardless of inputs. C) In ML, the model is a function of code AND data; without versioned data snapshots (and seeds, env, library pins), the same code can produce a different model. D) Last week's git SHA always contains the model.pkl.

4. ML Pipelines vs Software CI/CD

Key Points

ML pipelines do not replace CI/CD — they superimpose data and model concerns on top of it.
The defining insight: in software, behavior is determined by code; in ML, behavior is determined by code interacting with data. Every consequence flows from this.
MLOps adds a second loop, Continuous Training (CT), triggered by data signals rather than code commits, that retrains and redeploys models when the world changes.
Tests become probabilistic and threshold-based: slice metrics, fairness, statistical drift — not just assert add(2,3) == 5.
Reproducing an ML run requires data snapshot + seeds + environment + library pins; containers freeze the environment but cannot freeze the data.
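The contrast between an exact assertion and a threshold-based test can be sketched with a toy model. Everything here is an assumption for illustration: the synthetic two-class data, the midpoint-threshold "classifier", the function names, and the thresholds 0.90 and 0.85, which would be chosen per project in practice.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def train_and_evaluate(seed: int, n: int = 200) -> float:
    """Train a toy threshold classifier on synthetic data and return held-out accuracy."""
    rng = random.Random(seed)

    def sample(label: int) -> float:
        # Class 0 is centered at 0.0 and class 1 at 3.0, both with unit variance.
        return rng.gauss(3.0 if label else 0.0, 1.0)

    train = [(sample(y), y) for y in (0, 1) for _ in range(n)]
    mean0 = statistics.fmean([x for x, y in train if y == 0])
    mean1 = statistics.fmean([x for x, y in train if y == 1])
    threshold = (mean0 + mean1) / 2  # the "model" is just this learned threshold

    test = [(sample(y), y) for y in (0, 1) for _ in range(n)]
    correct = sum((x > threshold) == bool(y) for x, y in test)
    return correct / len(test)


def check_quality(seeds: Sequence[int] = (0, 1, 2, 3, 4),
                  min_mean: float = 0.90,
                  min_worst: float = 0.85) -> Tuple[bool, List[float]]:
    """Pass if accuracy is high on average AND no single seed falls below a floor."""
    scores = [train_and_evaluate(s) for s in seeds]
    passed = statistics.fmean(scores) >= min_mean and min(scores) >= min_worst
    return passed, scores


passed, scores = check_quality()
```

The test does not assert one exact accuracy. It asserts that the distribution of outcomes across seeds stays above agreed thresholds, which is what "probabilistic and threshold-based" means in practice.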

In traditional CI/CD, the build is ideally a pure function of code: the same source plus the same toolchain yields the same binary, bit-for-bit. In ML, code alone does not determine the output: the same training code and hyperparameters applied to different data yield a different model. This forces data into the artifact universe. In software engineering, code is the source and the binary is the build output; in ML, both code and data are the source and the model is the build output. If you do not version both inputs, you cannot reproduce the output.
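One way to make "version both inputs" concrete is to fingerprint every input of a training run together. The sketch below is an illustration under stated assumptions: `dataset_digest`, `run_fingerprint`, the commit string, and the library pin are hypothetical, and real systems hash dataset files or snapshot identifiers rather than in-memory rows.

```python
import hashlib
import json
from typing import Dict, List


def dataset_digest(rows: List[Dict[str, object]]) -> str:
    """Content hash of a dataset; any changed value changes the digest."""
    h = hashlib.sha256()
    for row in rows:
        h.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


def run_fingerprint(code_commit: str, data_digest: str, seed: int,
                    library_pins: Dict[str, str]) -> str:
    """Hash of everything that determines the model: code, data, seed, environment."""
    manifest = {
        "code_commit": code_commit,
        "data_digest": data_digest,
        "seed": seed,
        "library_pins": library_pins,
    }
    canonical = json.dumps(manifest, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


pins = {"numpy": "1.26.4"}  # illustrative pin, not a recommendation
last_week = dataset_digest([{"region": "CA", "amount": 10.0}])
this_week = dataset_digest([{"region": "CA", "amount": 12.5}])

fp_last_week = run_fingerprint("abc1234", last_week, seed=42, library_pins=pins)
fp_this_week = run_fingerprint("abc1234", this_week, seed=42, library_pins=pins)
```

The two fingerprints differ even though the commit is identical, which is why rolling back to last week's git SHA alone does not reproduce last week's model.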

Two intertwined loops therefore characterize MLOps:

The software CI/CD loop is triggered by code commits and builds/tests/deploys pipeline code, training code, and serving code.
The model CT loop is triggered by data signals (new data, drift, performance degradation) and reruns the training pipeline using the current codebase and fresh data, producing new model artifacts that are evaluated and potentially deployed.

Figure 1.4: Two intertwined loops — CI/CD plus Continuous Training

flowchart TD
    subgraph CICD["Software CI/CD Loop"]
        Commit[Git Commit] --> Build[Build & Test]
        Build --> DeployCode[Deploy Pipeline Code]
    end
    subgraph CT["Continuous Training Loop"]
        Signal["Data Signal: new data / drift / decay"] --> Retrain[Retrain on Fresh Data]
        Retrain --> EvalCT[Evaluate Model]
        EvalCT --> DeployModel[Deploy Model Artifact]
    end
    DeployCode -.->|Updates pipeline used by| Retrain
    DeployModel -.->|Production metrics feed| Signal

Hidden Technical Debt Categories

Debt Category	What it Looks Like	Why It Hurts
Data dependency	Upstream column semantics shift silently	No compiler catches a semantic shift
Configuration	Hyperparameters sprawl across experiments	"Which settings produced our best model?" becomes archaeology
Glue-code	Ad hoc scripts stitching things together	Brittle; every change requires careful manual reasoning
Feedback loop	Model outputs shape its own training data	Self-reinforcing biases
Reproducibility	No data versioning, no env pinning	Cannot recreate past results
Monitoring	No drift or performance dashboards	Silent decay over weeks
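The data-dependency row describes a shift that no compiler or unit test sees. A minimal sketch of a guard for the "USD" becoming "USD,EUR" example is a value-domain check on categorical columns. The allowed set and the names `ALLOWED_CURRENCIES` and `domain_report` are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Set

# Hypothetical contract with the upstream team: this column only ever holds "USD".
ALLOWED_CURRENCIES: Set[str] = {"USD"}


def domain_report(values: Iterable[str], allowed: Set[str]) -> Dict[str, object]:
    """Summarize values that fall outside the agreed domain of a categorical column."""
    counts = Counter(values)
    total = sum(counts.values())
    unexpected = {v: n for v, n in counts.items() if v not in allowed}
    rate = sum(unexpected.values()) / total if total else 0.0
    return {
        "unexpected_values": unexpected,
        "violation_rate": rate,
        "ok": not unexpected,
    }


batch = ["USD", "USD", "USD,EUR", "USD"]
report = domain_report(batch, ALLOWED_CURRENCIES)
```

The code that reads the column still runs without error on this batch. Only an explicit check on the data surfaces the change before it degrades the model.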
