Define an ML pipeline and distinguish it from a one-off training script or exploratory notebook.
Explain the MLOps maturity model and articulate the gap between research-grade ML and production-grade ML.
Identify the major stages of an end-to-end ML pipeline and describe how they interconnect through a directed acyclic graph (DAG).
Compare ML pipelines to traditional software CI/CD pipelines along the dimensions of artifacts, testing, triggers, and feedback loops.
Recognize the canonical failure modes — technical debt, train-serve skew, silent data dependency changes — that motivate disciplined pipeline engineering.
Pre-Quiz — Section 1: What is an ML Pipeline?
1. Why is an ML pipeline a better fit for production than a notebook, even when both produce the same model file?
2. An upstream team renames a column from "region" to "state" without warning. Why would a cron-scheduled standalone script handle this worse than a real ML pipeline?
3. The chapter compares an ML pipeline to a franchise restaurant scaling a home cook's recipe. What property of production ML does this analogy primarily illustrate?
1. What is an ML Pipeline?
Key Points
An ML pipeline is a reusable, orchestrated DAG of stages (ingest → validate data → prepare features → train → evaluate → validate model → deploy → monitor), not a single notebook or script.
Notebooks are for exploration; scripts are for single-task automation; pipelines are for end-to-end production with versioning, retries, and lineage.
Production ML demands pipelines because data is alive, models decay, reproducibility is non-trivial, train-serve skew lurks, and stakeholders are heterogeneous.
Pipelines coordinate data engineers, data scientists, ML engineers, DevOps, business owners, and compliance — each owning a different artifact handoff.
The franchise-restaurant analogy: a recipe alone is not enough to feed 10,000 customers per day; you need measured inputs, calibrated process, and feedback.
An ML pipeline is the industrialized version of an experimental workflow — the set of automated, orchestrated, repeatable steps that take data from its source and turn it into a deployed, monitored, retrainable model. Each stage is a clearly defined unit with known inputs and outputs; the orchestrator wires the stages together, runs them on schedule or on demand, retries on transient failure, and records what happened for later inspection.
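The orchestration idea in the paragraph above can be sketched in a few lines of plain Python. This is a toy illustration, not how any particular orchestrator is implemented: `Stage`, `Orchestrator`, and `TransientError` are hypothetical names, the stages run strictly in sequence, and the "model" is just a mean.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a network blip)."""


@dataclass
class Stage:
    name: str
    run: Callable[[dict[str, Any]], dict[str, Any]]  # reads artifacts, returns new ones
    max_retries: int = 2


@dataclass
class Orchestrator:
    stages: list[Stage]
    run_log: list[dict[str, Any]] = field(default_factory=list)

    def execute(self, initial: dict[str, Any]) -> dict[str, Any]:
        artifacts = dict(initial)
        for stage in self.stages:
            # One initial attempt plus up to max_retries retries.
            for attempt in range(1, stage.max_retries + 2):
                try:
                    outputs = stage.run(artifacts)
                except TransientError as exc:
                    exhausted = attempt > stage.max_retries
                    self.run_log.append({
                        "stage": stage.name,
                        "attempt": attempt,
                        "status": "failed" if exhausted else "retry",
                        "error": str(exc),
                    })
                    if exhausted:
                        raise
                    continue
                artifacts.update(outputs)
                self.run_log.append({"stage": stage.name, "attempt": attempt, "status": "ok"})
                break
        return artifacts


_calls = {"ingest": 0}


def ingest(_: dict[str, Any]) -> dict[str, Any]:
    # Fails on the first call only, to simulate a transient source outage.
    _calls["ingest"] += 1
    if _calls["ingest"] == 1:
        raise TransientError("source temporarily unavailable")
    return {"rows": [1.0, 2.0, 3.0, 4.0]}


def train(artifacts: dict[str, Any]) -> dict[str, Any]:
    rows = artifacts["rows"]
    return {"model": {"mean": sum(rows) / len(rows)}}


pipeline = Orchestrator([Stage("ingest", ingest), Stage("train", train)])
result = pipeline.execute({})
statuses = [entry["status"] for entry in pipeline.run_log]
```

The run log is the point of the sketch: the transient failure is retried without human intervention, and the record of what happened survives the run.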
Why not just take a well-written script and slap a cron job on it? Because production ML carries a unique combination of properties: data is alive (inputs change continuously and silently), models degrade through concept drift even when code never changes, reproducibility is non-trivial (data snapshot + seeds + code + library versions all needed), train-serve skew creates silent regressions when training and serving preprocess differently, and stakeholders are heterogeneous. Data scientists, data engineers, ML engineers, DevOps, security, compliance, and business owners all touch the system. The orchestrated pipeline is the technical artifact that lets a team manage all five at once.
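To make "data is alive" concrete, here is a minimal sketch of a data-validation stage that fails loudly when an upstream column is renamed (the "region" to "state" rename from the quiz). The schema, column names, and `validate_schema` helper are assumptions made for illustration; real pipelines typically use a dedicated validation library.

```python
from typing import Mapping, Sequence

# Hypothetical contract: the columns and Python types the training stage expects.
EXPECTED_SCHEMA: dict[str, type] = {"user_id": int, "region": str, "amount": float}


def validate_schema(
    rows: Sequence[Mapping[str, object]], schema: Mapping[str, type]
) -> list[str]:
    """Return human-readable problems; an empty list means the batch passed."""
    problems: list[str] = []
    for i, row in enumerate(rows):
        missing = sorted(set(schema) - set(row))
        unexpected = sorted(set(row) - set(schema))
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
        if unexpected:
            problems.append(f"row {i}: unexpected columns {unexpected}")
        for col, expected_type in schema.items():
            if col in row and not isinstance(row[col], expected_type):
                problems.append(
                    f"row {i}: column {col!r} is {type(row[col]).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return problems


good_batch = [{"user_id": 1, "region": "EMEA", "amount": 9.5}]
renamed_batch = [{"user_id": 1, "state": "EMEA", "amount": 9.5}]  # upstream renamed "region"

good_problems = validate_schema(good_batch, EXPECTED_SCHEMA)
renamed_problems = validate_schema(renamed_batch, EXPECTED_SCHEMA)
```

A cron-scheduled script without such a gate would either crash deep inside training or, worse, keep running on a default value; a pipeline stops at the validation stage and reports exactly which column changed.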
Notebook vs. Script vs. Pipeline
| Property | Notebook | Script | ML Pipeline |
| --- | --- | --- | --- |
| Primary purpose | Exploration | One-task automation | End-to-end production |
| Execution model | Interactive, stateful cells | Linear, top-to-bottom | DAG of containers |
| Retries | Manual | None or shell-level | Orchestrator-managed |
| Versioning | Often missing | Git on code only | Git + data + model registry |
| Observability | Cell output | Stdout | Structured logs + metadata + lineage |
Figure 1.1: End-to-end ML pipeline stages with feedback loop
flowchart LR
A[Ingest] --> B[Validate Data]
B --> C[Prepare Features]
C --> D[Train Model]
D --> E[Evaluate Model]
E --> F{Validate Model}
F -->|Pass| G[Deploy Model]
F -.->|Fail| D
G --> H[Monitor]
H -.->|Drift / Decay| A
Post-Quiz — Section 1: What is an ML Pipeline?
1. Why is an ML pipeline a better fit for production than a notebook, even when both produce the same model file?
2. An upstream team renames a column from "region" to "state" without warning. Why would a cron-scheduled standalone script handle this worse than a real ML pipeline?
3. The chapter compares an ML pipeline to a franchise restaurant scaling a home cook's recipe. What property of production ML does this analogy primarily illustrate?
Pre-Quiz — Section 2: The MLOps Discipline
1. A team has automated their training DAG (Airflow), schedules nightly retraining, and logs metrics to MLflow, but pipeline code changes are still tested and deployed manually. Which Google MLOps maturity level best describes them?
2. Sculley et al. describe ML systems as a "high-interest credit card" of technical debt. Which failure mode best illustrates a debt that traditional software CI/CD would not catch?
3. What single technical property primarily distinguishes MLOps from classic DevOps?
2. The MLOps Discipline
Key Points
MLOps generalizes DevOps by treating data, models, and experimental configurations as first-class objects to be versioned, tested, deployed, and monitored.
Google's three-level maturity model: L0 (manual notebooks), L1 (orchestrated pipeline DAG), L2 (full CI/CD plus Continuous Training driven by drift).
L1 is what many teams aspire to; L2 is a roadmap, not a checklist. Not every system needs L2: only those with rapidly changing data, high stakes, or many models.
Distinctive ML failure modes drive the discipline: train-serve skew, silent data dependency changes, hidden technical debt, feedback loops, reproducibility gaps.
The tooling landscape has stabilized into categories: orchestration, data versioning, experiment tracking, feature stores, registries, serving, monitoring.
MLOps applies DevOps-style automation, testing, and monitoring to the entire machine learning lifecycle. Crucially, it extends beyond classic DevOps by treating data, models, and experimental configurations as first-class objects to be versioned, tested, deployed, and monitored.
The Three Maturity Levels
Level 0 is what most teams start with: a notebook on a laptop, a .pkl emailed to an engineer, no orchestrator, no automated testing, no drift detection. This is acceptable for low-stakes, infrequently retrained models but fragile for anything customer-facing.
Level 1 automates the training workflow itself: the notebook becomes a four-task Airflow DAG (ingest → features → train → evaluate), every nightly run logs metrics to MLflow, and trained models land in a registry tagged with the data snapshot and code commit. Errors now manifest as task-level Airflow failures with structured metadata.
Level 2 is full production-grade ML: PRs trigger CI runs with unit tests, data validation, and "train-on-sample" smoke tests; CI success triggers CD; in production an Evidently dashboard tracks input distributions; when the population stability index (PSI) of a monitored input crosses 0.2, a retraining pipeline fires; the candidate is shadow-deployed, then canary-released at 5% → 25% → 100%; rollback is one button.
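The drift trigger in the Level 2 description can be illustrated with a small population stability index (PSI) calculation, using the standard formula: the sum over bins of (current fraction minus reference fraction) times the log of their ratio. This is a stdlib-only sketch and not the Evidently API; the equal-width binning, the `eps` floor, and the synthetic Gaussian data are assumptions chosen for the example.

```python
import bisect
import math
import random
from typing import Sequence

PSI_RETRAIN_THRESHOLD = 0.2  # threshold quoted in the text above


def bin_fractions(
    values: Sequence[float], edges: Sequence[float], eps: float = 1e-6
) -> list[float]:
    """Fraction of values per bin; out-of-range values are clamped into the end bins."""
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        idx = bisect.bisect_right(edges, v) - 1
        counts[min(max(idx, 0), n_bins - 1)] += 1
    # Floor each fraction at eps so the logarithm below is always defined.
    return [max(c / len(values), eps) for c in counts]


def psi(reference: Sequence[float], current: Sequence[float], n_bins: int = 10) -> float:
    """PSI of `current` against `reference`, using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread; cannot build bins")
    edges = [lo + i * (hi - lo) / n_bins for i in range(n_bins + 1)]
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Synthetic data: a training-time sample, a fresh sample from the same
# distribution, and a sample whose mean has shifted by one standard deviation.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same_dist = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

psi_same = psi(reference, same_dist)
psi_shifted = psi(reference, shifted)
should_retrain = psi_shifted > PSI_RETRAIN_THRESHOLD
```

In a Level 2 setup the comparison against the threshold would live in the monitoring stage, and a `True` result would start a new run of the training pipeline rather than retrain in place.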
Figure 1.2: Google MLOps maturity model (L0 to L2)
Each level rises in sequence; the blue climb arrows mark the transitions from manual to automated to fully continuous.
Tooling Categories
| Category | Purpose | Examples |
| --- | --- | --- |
| Orchestration | Run pipeline DAGs | Airflow, Kubeflow Pipelines, Prefect, Dagster |
| Data versioning | Track datasets like code | DVC, LakeFS, Delta Lake |
| Experiment tracking | Log runs and params | MLflow, W&B, Neptune |
| Feature stores | Eliminate train-serve skew | Feast, Tecton, Vertex Feature Store |
| Model registries | Catalog model versions | MLflow Model Registry, SageMaker Registry |
| Serving | Run models behind APIs | TF Serving, TorchServe, BentoML, KServe |
| Monitoring | Track drift and performance | Evidently, NannyML, Arize, WhyLabs |
Post-Quiz — Section 2: The MLOps Discipline
1. A team has automated their training DAG (Airflow), schedules nightly retraining, and logs metrics to MLflow, but pipeline code changes are still tested and deployed manually. Which Google MLOps maturity level best describes them?
2. Sculley et al. describe ML systems as a "high-interest credit card" of technical debt. Which failure mode best illustrates a debt that traditional software CI/CD would not catch?
3. What single technical property primarily distinguishes MLOps from classic DevOps?
Pre-Quiz — Section 3: Anatomy of an End-to-End Pipeline
1. Why must an ML pipeline be a DAG — specifically acyclic — rather than a graph with cycles such as train → evaluate → train?
2. The model-validation gate stage exists between evaluate and deploy. What would happen if a team omitted it from their pipeline?
3. Which trigger is unique to ML pipelines and does not exist for traditional software CI/CD?
3. Anatomy of an End-to-End Pipeline
Key Points
The DAG structure is load-bearing: edges encode artifact/metadata dependencies, not time; that is what enables parallelism, retries, and conditional branching.
ML pipelines have many triggers: code commit, schedule, event (new data), drift, performance, manual. Traditional CI/CD is driven chiefly by one (the commit).
Every run produces artifacts (data, features, models, metrics) plus metadata/lineage — the chain-of-custody that lets you ask "what produced this model?".
TFX components map almost 1:1 to canonical stages (ExampleGen, StatisticsGen, Transform, Trainer, Evaluator, Pusher); the same shape recurs in Kubeflow, Airflow, and Vertex.
The DAG is not arbitrary jargon; each word carries a constraint. Directed: edges have arrows, so train starts only after prepare_features finishes. Acyclic: there are no cycles, so you cannot have train → evaluate → train in a single DAG; a cycle leaves the orchestrator with no valid execution order, and a run that followed it would never terminate. Retraining loops are implemented at a higher level instead: the monitoring pipeline triggers a brand-new run of the training pipeline. Graph: because dependencies are explicit, tasks with no dependency on each other can run in parallel.
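The three properties can be seen directly with Python's standard-library `graphlib` module (Python 3.9+). The sketch below is illustrative only: the task names follow the text, `compute_statistics` is a hypothetical side branch added to show parallelism, and `execution_waves` is a helper invented for this example rather than part of any orchestrator.

```python
from graphlib import CycleError, TopologicalSorter

# Each key maps a task to the set of tasks it depends on (its predecessors).
pipeline = {
    "ingest": set(),
    "validate_data": {"ingest"},
    "prepare_features": {"validate_data"},
    "compute_statistics": {"validate_data"},  # hypothetical side branch
    "train": {"prepare_features"},
    "evaluate": {"train", "compute_statistics"},
}


def execution_waves(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves; tasks in the same wave have no mutual dependency."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    waves: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose predecessors are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(pipeline)

# Introduce the forbidden loop: train now also waits on evaluate.
cyclic = dict(pipeline)
cyclic["train"] = {"prepare_features", "evaluate"}
try:
    execution_waves(cyclic)
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

The third wave contains both `compute_statistics` and `prepare_features`, which is where an orchestrator could run tasks in parallel; the cyclic variant is rejected before any task starts.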
Triggers are where ML pipelines diverge most visibly from software CI/CD. Software CI/CD is driven chiefly by one trigger (the git push); ML pipelines accept code-driven, schedule-driven, event-driven, drift-driven, performance-driven, and manual triggers. In a Level 2 system the orchestrator records why each run was triggered, which is invaluable forensic information when a model misbehaves later.
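Recording why a run started needs very little machinery. The following is a minimal sketch, assuming an in-memory list stands in for the orchestrator's metadata store; `Trigger`, `RunRecord`, the run IDs, and the reason strings are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Trigger(Enum):
    """The six trigger kinds listed in the text."""
    CODE_COMMIT = "code_commit"
    SCHEDULE = "schedule"
    NEW_DATA = "new_data"
    DRIFT = "drift"
    PERFORMANCE = "performance"
    MANUAL = "manual"


@dataclass(frozen=True)
class RunRecord:
    run_id: str
    trigger: Trigger
    reason: str
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def runs_triggered_by(history: list[RunRecord], trigger: Trigger) -> list[RunRecord]:
    """Filter the run history by the kind of trigger that started each run."""
    return [record for record in history if record.trigger is trigger]


history = [
    RunRecord("run-001", Trigger.SCHEDULE, "nightly retrain"),
    RunRecord("run-002", Trigger.DRIFT, "PSI on a monitored feature exceeded 0.2"),
    RunRecord("run-003", Trigger.CODE_COMMIT, "feature code changed"),
]
drift_runs = runs_triggered_by(history, Trigger.DRIFT)
```

When a model misbehaves, a query like `runs_triggered_by(history, Trigger.DRIFT)` answers the first forensic question: was this model produced by a routine run or by a reaction to changing data?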
Figure 1.3: DAG orchestration with parallel branches and conditional gating
Post-Quiz — Section 3: Anatomy of an End-to-End Pipeline
1. Why must an ML pipeline be a DAG — specifically acyclic — rather than a graph with cycles such as train → evaluate → train?
2. The model-validation gate stage exists between evaluate and deploy. What would happen if a team omitted it from their pipeline?
3. Which trigger is unique to ML pipelines and does not exist for traditional software CI/CD?
Pre-Quiz — Section 4: ML Pipelines vs Software CI/CD
1. What is the single load-bearing insight that drives almost every difference between ML pipelines and software CI/CD?
2. Two intertwined loops characterize MLOps: software CI/CD and Continuous Training (CT). Which best describes the difference?
3. A team versions code in git but not data. Why is "I rolled back to last week's git SHA" insufficient to reproduce last week's model?
4. ML Pipelines vs Software CI/CD
Key Points
ML pipelines do not replace CI/CD — they superimpose data and model concerns on top of it.
The defining insight: in software, behavior is determined by code; in ML, behavior is determined by code interacting with data. Every consequence flows from this.
MLOps adds a second loop, Continuous Training (CT), triggered by data signals rather than code commits, that retrains and redeploys models when the world changes.
Tests become probabilistic and threshold-based: slice metrics, fairness, statistical drift — not just assert add(2,3) == 5.
Reproducing an ML run requires data snapshot + seeds + environment + library pins; containers freeze the environment but cannot freeze the data.
In traditional CI/CD, the build is ideally a pure function of code: in a reproducible build, the same source plus the same toolchain yields the same binary, bit for bit. In ML, the function is impure: the same training code plus the same hyperparameters but different data yields a different model. This forces data into the artifact universe. In software engineering, code is the source and the binary is the build output; in ML, both code and data are the source and the model is the build output. If you do not version both inputs, you cannot reproduce the output.
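One way to make "version both inputs" tangible is a run fingerprint that hashes everything the model depends on. This is a sketch under stated assumptions: the `fingerprint` helper, the commit string, the CSV bytes, and the library pin are hypothetical, and a real system would hash a dataset snapshot reference rather than raw bytes held in memory.

```python
import hashlib
import json


def fingerprint(
    code_commit: str, data_bytes: bytes, seed: int, library_pins: dict[str, str]
) -> str:
    """Hash every input that determines the trained model, not just the code."""
    payload = {
        "code_commit": code_commit,
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "seed": seed,
        "library_pins": library_pins,
    }
    # sort_keys gives a canonical serialization, so equal inputs hash equally.
    canonical = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


pins = {"examplelib": "1.0.0"}  # hypothetical pinned dependency
last_week_data = b"user_id,region,amount\n1,EMEA,9.5\n"
this_week_data = b"user_id,region,amount\n1,EMEA,9.5\n2,APAC,3.0\n"

fp_last_week = fingerprint("abc1234", last_week_data, seed=42, library_pins=pins)
fp_rerun = fingerprint("abc1234", last_week_data, seed=42, library_pins=pins)
fp_this_week = fingerprint("abc1234", this_week_data, seed=42, library_pins=pins)
```

The same git commit appears in all three calls, yet the third fingerprint differs because the data changed, which is why rolling back code alone does not recover last week's model.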
Two intertwined loops therefore characterize MLOps:
The software CI/CD loop is triggered by code commits and builds/tests/deploys pipeline code, training code, and serving code.
The model CT loop is triggered by data signals (new data, drift, performance degradation) and reruns the training pipeline using the current codebase and fresh data, producing new model artifacts that are evaluated and potentially deployed.
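The "evaluated and potentially deployed" step of the CT loop is a threshold-based gate rather than an exact-equality test. Here is a minimal sketch, assuming accuracy-like metrics keyed by slice; the thresholds, slice names, and metric values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    failures: list[str]


def validation_gate(
    candidate: dict[str, float],
    baseline: dict[str, float],
    min_overall: float = 0.80,     # hypothetical absolute floor
    max_slice_drop: float = 0.02,  # hypothetical tolerated regression per slice
) -> GateResult:
    """Block deployment if the overall metric is too low or any slice regresses."""
    failures: list[str] = []
    if candidate["overall"] < min_overall:
        failures.append(f"overall {candidate['overall']:.3f} below floor {min_overall:.3f}")
    for slice_name, base_value in baseline.items():
        if slice_name == "overall":
            continue
        drop = base_value - candidate.get(slice_name, 0.0)
        if drop > max_slice_drop:
            failures.append(f"{slice_name} dropped by {drop:.3f}")
    return GateResult(passed=not failures, failures=failures)


baseline = {"overall": 0.85, "region=EMEA": 0.84, "region=APAC": 0.86}
candidate_ok = {"overall": 0.87, "region=EMEA": 0.85, "region=APAC": 0.85}
# Better overall, but one slice regresses badly.
candidate_bad = {"overall": 0.88, "region=EMEA": 0.90, "region=APAC": 0.78}

ok_result = validation_gate(candidate_ok, baseline)
bad_result = validation_gate(candidate_bad, baseline)
```

The second candidate has the best overall score and still fails, which is the behavior a slice-aware gate is meant to provide before a retrained model replaces the one in production.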
Figure 1.4: Two intertwined loops — CI/CD plus Continuous Training
flowchart TD
subgraph CICD["Software CI/CD Loop"]
Commit[Git Commit] --> Build[Build & Test]
Build --> DeployCode[Deploy Pipeline Code]
end
subgraph CT["Continuous Training Loop"]
Signal["Data Signal:<br/>new data / drift / decay"] --> Retrain[Retrain on Fresh Data]
Retrain --> EvalCT[Evaluate Model]
EvalCT --> DeployModel[Deploy Model Artifact]
end
DeployCode -.->|Updates pipeline used by| Retrain
DeployModel -.->|Production metrics feed| Signal
Hidden Technical Debt Categories
| Debt Category | What It Looks Like | Why It Hurts |
| --- | --- | --- |
| Data dependency | Upstream column semantics shift silently | No compiler catches a semantic shift |
| Configuration | Hyperparameters sprawl across experiments | "Which settings produced our best model?" becomes archaeology |
| Glue-code | Ad hoc scripts stitching things together | Brittle; every change requires careful manual reasoning |
| Feedback loop | Model outputs shape its own training data | Self-reinforcing biases |
| Reproducibility | No data versioning, no env pinning | Cannot recreate past results |
| Monitoring | No drift or performance dashboards | Silent decay over weeks |
Post-Quiz — Section 4: ML Pipelines vs Software CI/CD
1. What is the single load-bearing insight that drives almost every difference between ML pipelines and software CI/CD?
2. Two intertwined loops characterize MLOps: software CI/CD and Continuous Training (CT). Which best describes the difference?
3. A team versions code in git but not data. Why is "I rolled back to last week's git SHA" insufficient to reproduce last week's model?