Chapter 1: Foundations of ML Pipelines and MLOps

Pre-Quiz — Section 1: What is an ML Pipeline?

1. Why is an ML pipeline a better fit for production than a notebook, even when both produce the same model file?

2. An upstream team renames a column from "region" to "state" without warning. Why would a cron-scheduled standalone script handle this worse than a real ML pipeline?

3. The chapter compares an ML pipeline to a franchise restaurant scaling a home cook's recipe. What property of production ML does this analogy primarily illustrate?

1. What is an ML Pipeline?

Key Points

An ML pipeline is the industrialized version of an experimental workflow — the set of automated, orchestrated, repeatable steps that take data from its source and turn it into a deployed, monitored, retrainable model. Each stage is a clearly defined unit with known inputs and outputs; the orchestrator wires the stages together, runs them on schedule or on demand, retries on transient failure, and records what happened for later inspection.
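The orchestrator behavior described above (run stages in order, retry on transient failure, record what happened) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `TransientError`, `run_pipeline`, and the two toy stages are hypothetical names, and a real orchestrator such as Airflow adds scheduling, isolation, and persistence that this sketch omits.

```python
import time
from typing import Any, Callable, List, Tuple


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a network blip)."""


Stage = Tuple[str, Callable[[Any], Any]]


def run_pipeline(stages: List[Stage], payload: Any, max_attempts: int = 3,
                 backoff_s: float = 0.0) -> Tuple[Any, List[dict]]:
    """Run stages in order, retrying transient failures and logging each attempt."""
    run_log: List[dict] = []
    for name, fn in stages:
        for attempt in range(1, max_attempts + 1):
            try:
                payload = fn(payload)  # output of one stage is input to the next
                run_log.append({"stage": name, "attempt": attempt, "status": "ok"})
                break
            except TransientError as exc:
                final = attempt == max_attempts
                run_log.append({"stage": name, "attempt": attempt,
                                "status": "failed" if final else "retry",
                                "error": str(exc)})
                if final:
                    raise
                time.sleep(backoff_s * attempt)  # simple linear backoff
    return payload, run_log


# Toy stages: ingest fails once, then succeeds.
_calls = {"n": 0}


def flaky_ingest(_: Any) -> List[float]:
    _calls["n"] += 1
    if _calls["n"] == 1:
        raise TransientError("source temporarily unavailable")
    return [1.0, 2.0, 3.0]


def mean_feature(rows: List[float]) -> float:
    return sum(rows) / len(rows)


result, log = run_pipeline([("ingest", flaky_ingest), ("features", mean_feature)], None)
```

The returned log is the "record of what happened": one entry per attempt, so a failed first try of the ingest stage stays visible even though the run as a whole succeeded.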

Why not just take a well-written script and slap a cron job on it? Because production ML carries a distinctive combination of five properties. Data is alive: inputs change continuously and silently. Models degrade through concept drift even when the code never changes. Reproducibility is non-trivial: it requires the data snapshot, random seeds, code, and library versions together. Train-serve skew creates silent regressions when training and serving preprocess data differently. And stakeholders are heterogeneous: data scientists, data engineers, ML engineers, DevOps, security, compliance, and business owners all touch the system. The orchestrated pipeline is the technical artifact that lets a team manage all five at once.
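To make "inputs change silently" concrete, here is a minimal sketch of a schema check that a data-validation stage might run before training. The column names and the `EXPECTED_SCHEMA` contract are hypothetical (they echo the "region" to "state" rename from the quiz), and dedicated validation libraries cover far more than this.

```python
from typing import Dict, List, Mapping

# Hypothetical contract for one upstream table.
EXPECTED_SCHEMA: Dict[str, type] = {"user_id": int, "region": str, "amount": float}


def validate_batch(rows: List[Mapping[str, object]],
                   schema: Mapping[str, type]) -> List[str]:
    """Return human-readable schema violations; an empty list means the batch passes."""
    problems: List[str] = []
    for i, row in enumerate(rows):
        missing = sorted(set(schema) - set(row))
        extra = sorted(set(row) - set(schema))
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
        if extra:
            problems.append(f"row {i}: unexpected columns {extra}")
        for col, expected in schema.items():
            if col in row and not isinstance(row[col], expected):
                problems.append(
                    f"row {i}: {col} is {type(row[col]).__name__}, "
                    f"expected {expected.__name__}"
                )
    return problems


good = [{"user_id": 1, "region": "west", "amount": 9.5}]
renamed = [{"user_id": 1, "state": "west", "amount": 9.5}]  # upstream rename

good_report = validate_batch(good, EXPECTED_SCHEMA)
renamed_report = validate_batch(renamed, EXPECTED_SCHEMA)
```

A cron-scheduled script would typically either crash deep inside feature code or train on a default value; a validation stage fails early and says which column changed.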

Notebook vs. Script vs. Pipeline

Property | Notebook | Script | ML Pipeline
Primary purpose | Exploration | One-task automation | End-to-end production
Execution model | Interactive, stateful cells | Linear, top-to-bottom | DAG of containers
Retries | Manual | None or shell-level | Orchestrator-managed
Versioning | Often missing | Git on code only | Git + data + model registry
Observability | Cell output | Stdout | Structured logs + metadata + lineage

Figure 1.1: End-to-end ML pipeline stages with feedback loop

flowchart LR
    A[Ingest] --> B[Validate Data]
    B --> C[Prepare Features]
    C --> D[Train Model]
    D --> E[Evaluate Model]
    E --> F{Validate Model}
    F -->|Pass| G[Deploy Model]
    F -.->|Fail| D
    G --> H[Monitor]
    H -.->|Drift / Decay| A

Pre-Quiz — Section 2: The MLOps Discipline

1. A team has automated their training DAG (Airflow), schedules nightly retraining, and logs metrics to MLflow, but pipeline code changes are still tested and deployed manually. Which Google MLOps maturity level best describes them?

2. Sculley et al. describe ML systems as a "high-interest credit card" of technical debt. Which failure mode best illustrates a debt that traditional software CI/CD would not catch?

3. What single technical property primarily distinguishes MLOps from classic DevOps?

2. The MLOps Discipline

Key Points

MLOps applies DevOps-style automation, testing, and monitoring to the entire machine learning lifecycle. Crucially, it extends beyond classic DevOps by treating data, models, and experimental configurations as first-class objects to be versioned, tested, deployed, and monitored.

The Three Maturity Levels

Level 0 is what most teams start with: a notebook on a laptop, a .pkl emailed to an engineer, no orchestrator, no automated testing, no drift detection. Acceptable for low-stakes, infrequently-retrained models — fragile for anything customer-facing.

Level 1 automates the training workflow itself: the notebook becomes a four-task Airflow DAG (ingest → features → train → evaluate), every nightly run logs metrics to MLflow, and trained models land in a registry tagged with the data snapshot and code commit. Errors now manifest as task-level Airflow failures with structured metadata.

Level 2 is full production-grade ML: PRs trigger CI runs with unit tests, data validation, and "train-on-sample" smoke tests; CI success triggers CD; in production an Evidently dashboard tracks input distributions; when the Population Stability Index (PSI) of a monitored input crosses 0.2, a commonly used rule-of-thumb threshold for a significant shift, a retraining pipeline fires; the candidate is shadow-deployed, then canary-released at 5% → 25% → 100%; rollback is one button.
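The drift trigger mentioned above can be illustrated with a small PSI calculation. PSI compares the binned distribution of a feature in a reference window against a current window: PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are the reference and current bin fractions. The sketch below is a hand-rolled version under assumptions: the bin edges and sample values are made up, the 0.2 threshold comes from the text, and monitoring tools such as Evidently may bin and smooth differently.

```python
import math
from typing import List, Sequence

PSI_THRESHOLD = 0.2  # threshold quoted in the text


def bin_fractions(values: Sequence[float], edges: Sequence[float],
                  eps: float = 1e-6) -> List[float]:
    """Fraction of values per bin; `edges` are the interior cut points."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(1 for e in edges if v >= e)  # number of cut points at or below v
        counts[idx] += 1
    total = len(values)
    # Floor empty bins at eps so the logarithm below is always defined.
    return [max(c / total, eps) for c in counts]


def psi(reference: Sequence[float], current: Sequence[float],
        edges: Sequence[float]) -> float:
    """Population Stability Index between a reference and a current sample."""
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical feature values: the current window is shifted upward by 0.5.
reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
shifted = [v + 0.5 for v in reference]
edges = [0.25, 0.5, 0.75]

stable_score = psi(reference, reference, edges)
drift_score = psi(reference, shifted, edges)
should_retrain = drift_score > PSI_THRESHOLD
```

In a Level 2 setup, `should_retrain` would be the signal that starts a new run of the training pipeline rather than a value someone reads off a dashboard.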

Figure 1.2: Google MLOps maturity model (L0 to L2)

flowchart TD
    L0["Level 0: Manual ML<br/>Notebooks, .pkl handoffs<br/>No orchestration, no tests"]
    L1["Level 1: Pipeline Automation<br/>Orchestrated DAG (Airflow/KFP)<br/>Scheduled retraining, basic monitoring"]
    L2["Level 2: CI/CD + Continuous Training<br/>Automated tests, canary deploys<br/>Drift-triggered retraining"]
    L0 ==>|"Refactor notebook<br/>into DAG"| L1
    L1 ==>|"Add CI/CD,<br/>drift monitoring,<br/>auto-retraining"| L2
    L2 -.->|Roadmap, not checklist| L1

Tooling Categories

Category | Purpose | Examples
Orchestration | Run pipeline DAGs | Airflow, Kubeflow Pipelines, Prefect, Dagster
Data versioning | Track datasets like code | DVC, LakeFS, Delta Lake
Experiment tracking | Log runs and params | MLflow, W&B, Neptune
Feature stores | Eliminate train-serve skew | Feast, Tecton, Vertex Feature Store
Model registries | Catalog model versions | MLflow Model Registry, SageMaker Registry
Serving | Run models behind APIs | TF Serving, TorchServe, BentoML, KServe
Monitoring | Track drift and performance | Evidently, NannyML, Arize, WhyLabs

Pre-Quiz — Section 3: Anatomy of an End-to-End Pipeline

1. Why must an ML pipeline be a DAG — specifically acyclic — rather than a graph with cycles such as train → evaluate → train?

2. The model-validation gate stage exists between evaluate and deploy. What would happen if a team omitted it from their pipeline?

3. Which trigger is unique to ML pipelines and does not exist for traditional software CI/CD?

3. Anatomy of an End-to-End Pipeline

Key Points

The term DAG is not arbitrary jargon; each word carries a guarantee. Directed: edges have arrows, so train only starts after prepare_features finishes. Acyclic: there are no cycles, so you cannot have train → evaluate → train in a single DAG; a cycle has no valid execution order, and a run containing one would have no guaranteed end. Retraining loops are instead implemented at a higher level: the monitoring pipeline triggers a brand-new run of the training pipeline. Graph: dependencies are explicit, so tasks with no mutual dependency can run in parallel.
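The three properties can be seen directly with Python's standard-library `graphlib` module (Python 3.9+). The sketch below is illustrative only: the task names follow Figure 1.3, and `execution_waves` is a hypothetical helper, not how any particular orchestrator schedules work.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

# Each key maps a task to the set of tasks it depends on.
dag: Dict[str, Set[str]] = {
    "validate_data": {"ingest"},
    "prepare_features": {"validate_data"},
    "train_model": {"prepare_features"},
    "train_baseline": {"prepare_features"},
    "evaluate": {"train_model", "train_baseline"},
}


def execution_waves(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into waves; tasks within one wave have no mutual dependency."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph contains a cycle
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(dag)

# A train -> evaluate -> train loop is rejected before anything runs.
cyclic = {"train": {"evaluate"}, "evaluate": {"train"}}
try:
    execution_waves(cyclic)
    cycle_rejected = False
except CycleError:
    cycle_rejected = True
```

Here `train_model` and `train_baseline` land in the same wave, which is the parallelism the "Graph" property buys, while the cyclic graph fails at `prepare()` rather than looping forever.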

Triggers are where ML pipelines diverge most visibly from software CI/CD. Software CI/CD is driven mainly by one trigger, the code change (a git push or merge); ML pipelines accept code-driven, schedule-driven, event-driven, drift-driven, performance-driven, and manual triggers. In a Level 2 system the orchestrator records why each run was triggered, which is invaluable forensic information when a model misbehaves later.
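One way to picture "recording why each run was triggered" is a small run record that carries the trigger type alongside the run ID. The names below (`Trigger`, `RunRecord`, `runs_triggered_by`) and the sample history are hypothetical; real orchestrators store comparable metadata in their own formats.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List


class Trigger(Enum):
    """The six trigger types listed in the text."""
    CODE = "code"
    SCHEDULE = "schedule"
    EVENT = "event"
    DRIFT = "drift"
    PERFORMANCE = "performance"
    MANUAL = "manual"


@dataclass(frozen=True)
class RunRecord:
    run_id: str
    trigger: Trigger
    reason: str  # free-text detail for later forensics
    started_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def runs_triggered_by(history: List[RunRecord], trigger: Trigger) -> List[str]:
    """Return the IDs of all runs started by the given trigger type."""
    return [r.run_id for r in history if r.trigger is trigger]


history = [
    RunRecord("run-001", Trigger.SCHEDULE, "nightly retraining"),
    RunRecord("run-002", Trigger.DRIFT, "PSI above 0.2 on an input feature"),
    RunRecord("run-003", Trigger.CODE, "pipeline code change merged"),
]

drift_runs = runs_triggered_by(history, Trigger.DRIFT)
```

When a deployed model misbehaves, a query like this answers "was this model produced by a routine schedule or by a drift alarm?" without digging through logs.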

Figure 1.3: DAG orchestration with parallel branches and conditional gating

flowchart TD
    Ingest((Ingest)) --> Validate[Validate Data]
    Validate --> Features[Prepare Features]
    Features --> Train[Train Model]
    Features --> Baseline[Train Baseline]
    Train --> Eval[Evaluate Candidate]
    Baseline --> Eval
    Eval --> Gate{Metrics > baseline?}
    Gate -->|Yes| Deploy[Deploy]
    Gate -.->|No| Stop((Stop))

Stage-by-Stage Summary

Stage | Output | Key Risk if Skipped
Ingest | Versioned data reference | Non-reproducible training data
Validate data | Pass/fail + statistics report | Silently poisoned models
Prepare features | Training dataset + transform graph | Train-serve skew
Train | Model artifact + metrics | Wasted compute, unreproducible models
Evaluate | Slice metrics, plots | Aggregate-good but subgroup-bad models
Validate (gate) | Approve/reject decision | Regressions reach production
Deploy | Running endpoint | Slow rollback, downtime
Monitor | Alerts, drift signals | Silent decay over weeks
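The gate stage in the table can be sketched as a function that turns evaluation metrics into an approve/reject decision. This is a minimal illustration, assuming metrics arrive as dictionaries with an "overall" entry plus per-slice entries; the metric values, slice names, and the `max_slice_drop` tolerance are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GateDecision:
    approved: bool
    reasons: List[str]


def validation_gate(candidate: Dict[str, float], baseline: Dict[str, float],
                    min_gain: float = 0.0,
                    max_slice_drop: float = 0.02) -> GateDecision:
    """Approve only if the overall metric beats baseline and no slice regresses too far."""
    reasons: List[str] = []
    if candidate["overall"] <= baseline["overall"] + min_gain:
        reasons.append("overall metric does not beat baseline")
    for name, base_value in baseline.items():
        if name == "overall":
            continue
        # A slice missing from the candidate counts as a score of 0.0.
        drop = base_value - candidate.get(name, 0.0)
        if drop > max_slice_drop:
            reasons.append(f"slice '{name}' dropped by {drop:.3f}")
    return GateDecision(approved=not reasons, reasons=reasons)


baseline = {"overall": 0.90, "region=west": 0.88, "region=east": 0.91}
good = {"overall": 0.92, "region=west": 0.89, "region=east": 0.92}
bad = {"overall": 0.93, "region=west": 0.80, "region=east": 0.95}

good_decision = validation_gate(good, baseline)
bad_decision = validation_gate(bad, baseline)
```

The second candidate has the best overall score yet is rejected, which is the "aggregate-good but subgroup-bad" case from the Evaluate row: without the gate, that regression would reach production.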

Pre-Quiz — Section 4: ML Pipelines vs Software CI/CD

1. What is the single load-bearing insight that drives almost every difference between ML pipelines and software CI/CD?

2. Two intertwined loops characterize MLOps: software CI/CD and Continuous Training (CT). Which best describes the difference?

3. A team versions code in git but not data. Why is "I rolled back to last week's git SHA" insufficient to reproduce last week's model?

4. ML Pipelines vs Software CI/CD

Key Points

In traditional CI/CD, the build is ideally a pure function of code: the same source plus the same toolchain yields the same binary, bit-for-bit when the build is reproducible. In ML, code alone does not determine the output: the same training code and the same hyperparameters applied to different data yield a different model. This forces data into the artifact universe. In software engineering, code is the source and the binary is the build output; in ML, both code and data are the source and the model is the build output. If you do not version both inputs, you cannot reproduce the output.
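A toy example makes the point that code is only one of the inputs. In the sketch below, "training" is deliberately trivial (the model is the mean of the data) and the fingerprint hashes the raw rows; both are simplifying assumptions, as are the names `fingerprint`, `train`, and the placeholder commit ID. A real system would hash a data snapshot identifier rather than the data itself.

```python
import hashlib
import json
from typing import Dict, List


def fingerprint(code_sha: str, data_rows: List[float],
                hyperparams: Dict[str, float], seed: int) -> str:
    """Hash every input that determines the model, not just the code."""
    payload = json.dumps(
        {"code": code_sha, "data": data_rows, "hyperparams": hyperparams, "seed": seed},
        sort_keys=True,  # stable key order so equal inputs give equal hashes
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def train(data_rows: List[float]) -> float:
    """Stand-in for training: the 'model' is just the mean of the data."""
    return sum(data_rows) / len(data_rows)


CODE_SHA = "abc123"  # hypothetical git commit, identical for both runs
PARAMS = {"learning_rate": 0.1}

last_week = [1.0, 2.0, 3.0]
this_week = [1.0, 2.0, 9.0]  # same code, different data

model_a, model_b = train(last_week), train(this_week)
fp_a = fingerprint(CODE_SHA, last_week, PARAMS, seed=42)
fp_b = fingerprint(CODE_SHA, this_week, PARAMS, seed=42)
```

Both runs share one commit, yet the models and fingerprints differ. Rolling back to the commit alone recovers the code but not `model_a`; only the fingerprint's full set of inputs identifies it.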

Two intertwined loops therefore characterize MLOps:

Figure 1.4: Two intertwined loops — CI/CD plus Continuous Training

flowchart TD
    subgraph CICD["Software CI/CD Loop"]
        Commit[Git Commit] --> Build[Build & Test]
        Build --> DeployCode[Deploy Pipeline Code]
    end
    subgraph CT["Continuous Training Loop"]
        Signal["Data Signal:<br/>new data / drift / decay"] --> Retrain[Retrain on Fresh Data]
        Retrain --> EvalCT[Evaluate Model]
        EvalCT --> DeployModel[Deploy Model Artifact]
    end
    DeployCode -.->|Updates pipeline used by| Retrain
    DeployModel -.->|Production metrics feed| Signal

Hidden Technical Debt Categories

Debt Category | What It Looks Like | Why It Hurts
Data dependency | Upstream column semantics shift silently | No compiler catches a semantic shift
Configuration | Hyperparameters sprawl across experiments | "Which settings produced our best model?" becomes archaeology
Glue-code | Ad hoc scripts stitching things together | Brittle; every change requires careful manual reasoning
Feedback loop | Model outputs shape its own training data | Self-reinforcing biases
Reproducibility | No data versioning, no env pinning | Cannot recreate past results
Monitoring | No drift or performance dashboards | Silent decay over weeks
