Chapter 6: Pipeline Orchestration Frameworks

Learning Objectives

Section 1: Orchestration Concepts

Pre-Quiz — Section 1

1. In every orchestrator, the component that decides what runs next by scanning DAG definitions and checking dependencies is the:

Executor Scheduler Operator Metadata DB

2. The deeper paradigm split between Airflow and Dagster is best characterized as:

Python vs YAML authoring Single-machine vs distributed execution Task-centric (“did it run?”) vs asset-centric (“is it fresh?”) Open source vs commercial

3. A team needs to fire a pipeline whenever a file lands in S3, without burning a worker slot continuously polling. The best modern primitive is:

A traditional sensor in a loop A cron schedule running every minute A deferrable sensor or event-driven trigger A manual run triggered by the on-call engineer

An orchestrator is to ML pipelines what air traffic control is to aircraft: it does not fly the plane, but it sequences takeoffs, prevents collisions, and decides what is allowed to land. In production, something has to wake up at 02:00 UTC, run yesterday's feature build, fan out training across three customer segments, conditionally deploy only winning models, and retry tomorrow if the warehouse was flaky.

1.1 DAGs, Tasks, Operators, and Executors

Every modern orchestrator decomposes into four primitives:

The DAG is the plan; the executor is the labor. A “scaled” orchestrator usually means a scaled executor — but the scheduler is almost always the bottleneck first, especially with thousands of short tasks.

Animation A1 — DAG Execution Order

ingest_raw validate + features train_retail train_smb train_enterprise evaluate + register

Tasks light up in topological order. The three training tasks activate simultaneously (parallel branches); evaluate_all only fires once all three upstream branches complete.

1.2 Imperative vs Declarative

Imperative workflows (Prefect, Metaflow) let you write Python that does things; decorators turn function calls into a runtime graph. Declarative workflows (Argo YAML) ask you to describe the desired graph and let the system schedule it. Airflow sits in between — Python files that read declaratively but execute imperatively at parse time. The trade-off: imperative is easy to author but hard to inspect statically; declarative is easier for platform teams to validate but feels verbose.

1.3 Materialization vs Orchestration

Orchestration-first systems (Airflow, Prefect, Argo) think in tasks: did this job run? Materialization-first systems (Dagster, Flyte, KFP, ZenML to varying degrees) think in assets: is this dataset, feature table, or model up-to-date relative to its inputs? For ML the canonical question is rarely “did training run?” — it is “is this model version fresh given the latest features and the latest code?”

1.4 Triggers — Cron, Sensors, Events

Key Points — Section 1

Section 2: General-Purpose Orchestrators

Pre-Quiz — Section 2

1. Apache Airflow's classical weakness for ML workloads is:

Inability to run Python code Scheduler strain on DAGs with many short, fine-grained tasks (e.g., large hyperparameter sweeps) Lack of a web UI No support for Kubernetes

2. The unique selling point of Dagster versus Airflow and Prefect is:

It is the only orchestrator with a UI Asset-based orchestration with first-class partitions, freshness, and backfills It does not require Python knowledge It is the cheapest commercial offering

3. Argo Workflows is most accurately described as:

A Python ML framework competing directly with Metaflow A Kubernetes-native declarative workflow engine that higher-level tools (like KFP) compile into A managed cloud service from AWS A replacement for the Kubernetes scheduler itself

General-purpose orchestrators were designed for ETL and arbitrary batch jobs, not ML. That history is both their strength (mature, battle-tested, huge ecosystem) and weakness (they treat models and datasets as opaque side effects).

2.1 Apache Airflow Architecture

Airflow is the de facto standard for data engineering. Its four moving parts are:

Animation A2 — Airflow Architecture: Task Flow

DAG Files (Python) Scheduler Task Queue (Celery/K8s) Worker Metadata DB (Postgres) Scheduler → Queue → Worker → Metadata DB (state writes flow back continuously)

The scheduler scans DAG files and pushes a ready task into the queue; a worker picks it up, executes the operator, and writes the result back to the metadata DB. Each box highlights as the work passes through.

Airflow 2.x added the TaskFlow API, datasets as first-class citizens, a stable scheduler, a REST API, and DAG versioning. Its ML story is mostly through operators: KubernetesPodOperator, DatabricksRunNowOperator, SageMakerTrainingOperator. The orchestration is Airflow's; the work itself is delegated.

2.2 Prefect 2.x and Dagster

Prefect 2 (“Orion”) is the most Pythonic of the major orchestrators. A @flow is a normal Python function; @task decorates the steps. Workers pull from work pools and execute on whatever infrastructure you point them at.

Dagster takes the radical position that the unit of orchestration should be the asset. An @asset declaration says “this dataset/model exists, and depends on these other assets.” The runtime materializes assets in the right order, exposes partitions and freshness in the UI, and runs backfills by selecting partition ranges. For teams already on dbt, Dagster brings the same lineage philosophy to features and models.

2.3 Argo Workflows on Kubernetes

Argo Workflows is the Kubernetes-native workflow engine: workflows are CRDs, every step is a pod, and the controller reconciles state. It is declarative YAML — rarely used directly by data scientists, but it is the substrate KFP and other tools compile into.

2.4 Strengths and Weaknesses for ML

flowchart TD subgraph Authors[DAG Authors] DAGs[DAG Files
Python] end subgraph Control[Control Plane] Scheduler[Scheduler] Webserver[Webserver / UI] end subgraph State[State] MetaDB[(Metadata DB
Postgres / MySQL)] end subgraph Compute[Compute Plane] Executor[Executor
Celery / K8s / Local] W1[Worker 1] W2[Worker 2] W3[Worker N] end DAGs --> Scheduler Scheduler <--> MetaDB Webserver <--> MetaDB Scheduler --> Executor Executor --> W1 Executor --> W2 Executor --> W3 W1 --> MetaDB W2 --> MetaDB W3 --> MetaDB
OrchestratorML StrengthsML Weaknesses
AirflowHuge operator ecosystem, mature, ubiquitous in data engineeringNo native model/dataset lineage, scheduler strains on many short tasks
PrefectPythonic, fast onboarding, easy to wrap ML scriptsNo first-class asset model, smaller ecosystem
DagsterAsset-based lineage, partitions, native backfills, freshnessConcept overhead (ops/jobs/assets/repos), not an experiment tracker
ArgoK8s-native, infinitely scalable, declarativeToo low-level for direct ML use; YAML-only

Key Points — Section 2

Section 3: ML-Native Orchestrators

Pre-Quiz — Section 3

1. In Kubeflow Pipelines v2, artifact lineage (“this model came from these features, which came from this dataset”) is recorded in:

ML Metadata (MLMD) The Argo workflow controller A standalone PostgreSQL outside the cluster Git commits on the pipeline repository

2. Metaflow's signature feature for data-scientist productivity is:

Mandatory Kubernetes deployment Local-to-cloud transparency: same code runs locally and at scale on Batch/K8s by adding decorators A SQL-only DSL Built-in feature store and model registry replacing all external tools

3. ZenML is best described as:

A direct competitor to Kubernetes A meta-orchestrator that compiles ML-centric definitions to your existing backend (Airflow, KFP, Step Functions, Vertex) A managed SaaS owned by AWS A YAML-only declarative engine like Argo

ML-native orchestrators start from a different premise: pipelines are sequences of typed ML steps that produce versioned artifacts (datasets, models, metrics), and the platform should track those artifacts as first-class entities.

3.1 Kubeflow Pipelines

KFP v2 is the canonical open-source K8s-native ML orchestrator. Pipelines are written in a Python DSL that compiles to a Kubernetes workflow (Argo or Tekton). Every step is a containerized component with typed inputs and outputs; lineage is tracked in ML Metadata (MLMD). The cost is operational: running Kubeflow means running K8s, the KFP control plane, MLMD, MinIO/object store, and ideally Istio. Vertex AI Pipelines is Google's managed offering that speaks the KFP SDK.

flowchart TD SDK[KFP Python SDK] -->|compile| Spec[Pipeline Spec
YAML / IR] Spec --> API[KFP API Server] API --> Argo[Argo / Tekton
Workflow Controller] API <--> MLMD[(ML Metadata
MLMD)] Argo --> P1[Step Pod 1] Argo --> P2[Step Pod 2] Argo --> P3[Step Pod N] P1 --> Store[(Artifact Store
MinIO / GCS / S3)] P2 --> Store P3 --> Store P1 --> MLMD P2 --> MLMD P3 --> MLMD UI[KFP UI] <--> API UI <--> MLMD

3.2 Metaflow

Open-sourced by Netflix, Metaflow is unapologetically optimized for data scientists. A flow is a Python class with @step methods; you run python flow.py run locally and the same code scales out by adding @batch or @kubernetes to a step. Anything assigned to self becomes a versioned artifact persisted to S3 and queryable by run ID. The signature feature is local-to-cloud transparency: a 10-row sample locally and a 10-million-row partition on AWS Batch — same code.

3.3 ZenML and Flyte

Flyte originated at Lyft for ML at scale and is the strongest “K8s-native, typed, reproducible” option. Tasks are Python functions with type hints; Flyte uses those hints to serialize artifacts, content-address cache outputs, and validate the graph at compile time. Resource specs (GPU, memory, accelerator) are decorator arguments.

ZenML is a meta-orchestrator: rather than executing pipelines itself, it compiles ML-centric definitions into the backend you already have (Airflow, KFP, K8s, Step Functions, Vertex). On top of that it provides typed Artifact classes and stack abstractions for swapping artifact stores or trackers without rewriting code.

3.4 Vertex AI and SageMaker Pipelines

Vertex AI Pipelines (GCP) runs KFP-compatible pipelines without the K8s ops burden, integrates with Vertex Metadata, bills per pipeline-second. SageMaker Pipelines (AWS) provides a CI/CD-flavored DSL tightly integrated with SageMaker Training, Processing, and Model Registry. Both trade flexibility for managed convenience.

3.5 The Big Comparison

ToolParadigmK8sArtifact TrackingCachingBest Fit
Airflow 2.xTask DAGPodOperatorExternalNone nativeCoarse-grained ML on existing data infra
Prefect 2.xPythonic flow/taskK8s workersGeneric result cacheGenericPython ML teams, limited platform ops
Dagster 1.xAsset / op / jobFirst-classMaterializationsMemoizationModern data + ML platforms
Argo WorkflowsDeclarative YAMLNativeVolume artifactsManualSubstrate for higher-level tools
Kubeflow v2Component DAGNative onlyMLMD (typed)Content-addressableK8s-first ML, GCP/Vertex
Flyte 1.xTyped Python tasksNative onlyPlatform-level lineageContent-addressableLarge-scale K8s ML platforms
Metaflow 2.xPythonic @stepOptionalself.x versionedResume from past runsML developer productivity, AWS-leaning
ZenMLMeta-orchestratorDelegatedTyped metadata storeInherits + ownAvoiding lock-in to one backend

Key Points — Section 3

Section 4: Operationalizing Pipelines

Pre-Quiz — Section 4

1. The recommended default retry policy for ML pipeline tasks is:

Retry indefinitely with no delay 3-5 retries with exponential backoff (capped at 30-60 min) plus jitter to avoid thundering herd A single retry after exactly one hour No retries — page on-call on the first failure

2. Why is idempotency a prerequisite for safe retries?

It makes tasks run faster Without it, retrying a task that partially completed can duplicate or corrupt data (e.g., insert rows three times) It is required by Kubernetes It reduces cloud costs by 50%

3. When triggering a backfill of a year of daily partitions, the two operational rules from the chapter are:

Run all 365 partitions in parallel and never tag the runs Limit concurrency to avoid overloading downstream systems, and tag backfill runs distinctly from scheduled runs Pause the scheduler entirely until backfill finishes Use no caching so every partition recomputes from scratch

Choosing an orchestrator is the easy half. Operating one well — so it survives bad data, flaky infrastructure, half-written deployments, and frantic backfills — requires a small set of disciplines that are essentially the same across every tool.

4.1 A Worked Example: Daily Training DAG

The canonical Prefect example builds features for a date partition, validates them, conditionally trains one model per customer segment, evaluates each, and registers winners. Four things to notice:

  1. Retries with jitter are declared on the decorator, not embedded in business logic.
  2. The feature build is cached by input hash — rerunning with the same date skips recomputation.
  3. Conditional execution is a plain Python if; Prefect treats it as a visible branch.
  4. Training fans out dynamically via .map() — equivalents are Airflow .expand(), Dagster DynamicOutput, Flyte @dynamic, KFP dsl.ParallelFor.

4.2 Retries, Timeouts, and Idempotency

Default playbook: 3-5 retries with exponential backoff capped at 30-60 minutes, plus jitter to avoid thundering-herd. Without idempotency, retries are dangerous. The canonical patterns:

Treat the orchestrator's state as ephemeral and the data system's state as the source of truth. Timeouts are the safety net — every long-running task should have a timeout shorter than the next scheduled run.

Animation A3 — Retry with Exponential Backoff

time → attempt 1 FAIL wait 1s attempt 2 FAIL wait 2s attempt 3 FAIL wait 4s attempt 4 SUCCESS delay = base * 2^attempt + jitter red = failure, blue bar = backoff wait, green = success

Three failures each double the wait (1s → 2s → 4s) before the fourth attempt succeeds. Jitter (random noise added to the delay) prevents many failed tasks from retrying in lockstep.

4.3 Backfills and Catch-Up

A backfill runs a pipeline for a historical range, usually because you fixed a bug, ingested missing data, or rebuilt features under a new schema. Mechanisms differ widely:

Two operational rules: (1) limit concurrency — a year of daily features in parallel will overload the warehouse and cost a fortune. (2) Tag backfill runs distinctly so monitoring and cost dashboards can distinguish them from scheduled runs.

4.4 Parameterization

Parameterize everything that changes between runs: date, env, model_name, segment, hyperparameters. Good practice: type-hint every parameter, validate at flow start, and tag runs with parameter values so you can filter “all daily-training runs with model_type=xgboost in env=prod” later.

4.5 Resource Management and Queueing

stateDiagram-v2 [*] --> Pending Pending --> Running: scheduler dispatches Running --> Success: exit 0 Running --> Failed: exit != 0 / timeout Failed --> Backoff: attempts < max Backoff --> Running: wait = base * 2^n + jitter Failed --> DeadLetter: attempts >= max Success --> [*] DeadLetter --> [*]: alert on-call

Key Points — Section 4

Post-Quiz — Reinforcement

Post-Quiz — Section 1: Orchestration Concepts

1. In every orchestrator, the component that decides what runs next by scanning DAG definitions and checking dependencies is the:

Executor Scheduler Operator Metadata DB

2. The deeper paradigm split between Airflow and Dagster is best characterized as:

Python vs YAML authoring Single-machine vs distributed execution Task-centric (“did it run?”) vs asset-centric (“is it fresh?”) Open source vs commercial

3. A team needs to fire a pipeline whenever a file lands in S3, without burning a worker slot continuously polling. The best modern primitive is:

A traditional sensor in a loop A cron schedule running every minute A deferrable sensor or event-driven trigger A manual run triggered by the on-call engineer
Post-Quiz — Section 2: General-Purpose Orchestrators

1. Apache Airflow's classical weakness for ML workloads is:

Inability to run Python code Scheduler strain on DAGs with many short, fine-grained tasks (e.g., large hyperparameter sweeps) Lack of a web UI No support for Kubernetes

2. The unique selling point of Dagster versus Airflow and Prefect is:

It is the only orchestrator with a UI Asset-based orchestration with first-class partitions, freshness, and backfills It does not require Python knowledge It is the cheapest commercial offering

3. Argo Workflows is most accurately described as:

A Python ML framework competing directly with Metaflow A Kubernetes-native declarative workflow engine that higher-level tools (like KFP) compile into A managed cloud service from AWS A replacement for the Kubernetes scheduler itself
Post-Quiz — Section 3: ML-Native Orchestrators

1. In Kubeflow Pipelines v2, artifact lineage (“this model came from these features, which came from this dataset”) is recorded in:

ML Metadata (MLMD) The Argo workflow controller A standalone PostgreSQL outside the cluster Git commits on the pipeline repository

2. Metaflow's signature feature for data-scientist productivity is:

Mandatory Kubernetes deployment Local-to-cloud transparency: same code runs locally and at scale on Batch/K8s by adding decorators A SQL-only DSL Built-in feature store and model registry replacing all external tools

3. ZenML is best described as:

A direct competitor to Kubernetes A meta-orchestrator that compiles ML-centric definitions to your existing backend (Airflow, KFP, Step Functions, Vertex) A managed SaaS owned by AWS A YAML-only declarative engine like Argo
Post-Quiz — Section 4: Operationalizing Pipelines

1. The recommended default retry policy for ML pipeline tasks is:

Retry indefinitely with no delay 3-5 retries with exponential backoff (capped at 30-60 min) plus jitter to avoid thundering herd A single retry after exactly one hour No retries — page on-call on the first failure

2. Why is idempotency a prerequisite for safe retries?

It makes tasks run faster Without it, retrying a task that partially completed can duplicate or corrupt data (e.g., insert rows three times) It is required by Kubernetes It reduces cloud costs by 50%

3. When triggering a backfill of a year of daily partitions, the two operational rules from the chapter are:

Run all 365 partitions in parallel and never tag the runs Limit concurrency to avoid overloading downstream systems, and tag backfill runs distinctly from scheduled runs Pause the scheduler entirely until backfill finishes Use no caching so every partition recomputes from scratch

Your Progress

Answer Explanations