Chapter 6: Pipeline Orchestration Frameworks

Learning Objectives

Explain the four core orchestration primitives — DAG, task, operator, executor — and distinguish scheduler from executor scaling.
Compare task-centric vs asset/materialization-centric orchestration paradigms and their implications for ML lineage.
Evaluate general-purpose orchestrators (Airflow, Prefect, Dagster, Argo) against ML-native ones (KFP, Flyte, Metaflow, ZenML, Vertex, SageMaker).
Apply operational disciplines: idempotent retries with jittered exponential backoff, timeouts, backfills, parameterization, and resource pooling.
Read a Prefect DAG and identify caching, conditional execution, and dynamic fan-out patterns that generalize across frameworks.

Section 1: Orchestration Concepts

Pre-Quiz — Section 1

1. In every orchestrator, the component that decides what runs next by scanning DAG definitions and checking dependencies is the:

Executor Scheduler Operator Metadata DB

2. The deeper paradigm split between Airflow and Dagster is best characterized as:

Python vs YAML authoring Single-machine vs distributed execution Task-centric (“did it run?”) vs asset-centric (“is it fresh?”) Open source vs commercial

3. A team needs to fire a pipeline whenever a file lands in S3, without burning a worker slot continuously polling. The best modern primitive is:

A traditional sensor in a loop A cron schedule running every minute A deferrable sensor or event-driven trigger A manual run triggered by the on-call engineer

An orchestrator is to ML pipelines what air traffic control is to aircraft: it does not fly the plane, but it sequences takeoffs, prevents collisions, and decides what is allowed to land. In production, something has to wake up at 02:00 UTC, run yesterday's feature build, fan out training across three customer segments, conditionally deploy only winning models, and retry tomorrow if the warehouse was flaky.

1.1 DAGs, Tasks, Operators, and Executors

Every modern orchestrator decomposes into four primitives:

DAG (Directed Acyclic Graph): the dependency structure. Nodes are work units; edges are “must run before” relationships; no cycles guarantees termination.
Task (also: step, op, component): a single unit of work — a Python function, SQL query, or container.
Operator: a reusable template for a class of tasks. Airflow's BashOperator, PythonOperator, and KubernetesPodOperator are canonical.
Executor: the engine that actually runs tasks — Local, Celery, Kubernetes, Dask.

The DAG is the plan; the executor is the labor. A “scaled” orchestrator usually means a scaled executor — but the scheduler is almost always the bottleneck first, especially with thousands of short tasks.

Animation A1 — DAG Execution Order

Tasks light up in topological order. The three training tasks activate simultaneously (parallel branches); evaluate_all only fires once all three upstream branches complete.

1.2 Imperative vs Declarative

Imperative workflows (Prefect, Metaflow) let you write Python that does things; decorators turn function calls into a runtime graph. Declarative workflows (Argo YAML) ask you to describe the desired graph and let the system schedule it. Airflow sits in between — Python files that read declaratively but execute imperatively at parse time. The trade-off: imperative is easy to author but hard to inspect statically; declarative is easier for platform teams to validate but feels verbose.

1.3 Materialization vs Orchestration

Orchestration-first systems (Airflow, Prefect, Argo) think in tasks: did this job run? Materialization-first systems (Dagster, Flyte, KFP, ZenML to varying degrees) think in assets: is this dataset, feature table, or model up-to-date relative to its inputs? For ML the canonical question is rarely “did training run?” — it is “is this model version fresh given the latest features and the latest code?”

1.4 Triggers — Cron, Sensors, Events

Cron schedules fire on calendar time. Simple and universal.
Sensors poll until an external condition is true. Convenient but can monopolize workers. Modern Airflow offers deferrable sensors that release the slot.
Events push a trigger from outside (Pub/Sub, S3 notification, webhook). Preferred over polling at scale.

Key Points — Section 1

The four orchestration primitives are DAG, task, operator, and executor; the scheduler is the often-overlooked fifth piece and is usually the first bottleneck.
Imperative (Python) authoring is comfortable for data scientists; declarative (YAML) is easier for platform teams to validate.
Task-centric orchestration answers “did it run?”; asset/materialization-centric orchestration answers “is it fresh?” — the latter is closer to what ML actually needs.
Triggers come in three flavors: cron, sensors, and events. Prefer events (or deferrable sensors) at scale to avoid wasting worker slots polling.
Lineage and incremental recomputation strategy follow directly from the task-vs-asset paradigm choice.

Section 2: General-Purpose Orchestrators

Pre-Quiz — Section 2

1. Apache Airflow's classical weakness for ML workloads is:

Inability to run Python code Scheduler strain on DAGs with many short, fine-grained tasks (e.g., large hyperparameter sweeps) Lack of a web UI No support for Kubernetes

2. The unique selling point of Dagster versus Airflow and Prefect is:

It is the only orchestrator with a UI Asset-based orchestration with first-class partitions, freshness, and backfills It does not require Python knowledge It is the cheapest commercial offering

3. Argo Workflows is most accurately described as:

A Python ML framework competing directly with Metaflow A Kubernetes-native declarative workflow engine that higher-level tools (like KFP) compile into A managed cloud service from AWS A replacement for the Kubernetes scheduler itself

General-purpose orchestrators were designed for ETL and arbitrary batch jobs, not ML. That history is both their strength (mature, battle-tested, huge ecosystem) and weakness (they treat models and datasets as opaque side effects).

2.1 Apache Airflow Architecture

Airflow is the de facto standard for data engineering. Its four moving parts are:

Metadata database (Postgres/MySQL): stores DAG definitions, run history, task state.
Scheduler: scans DAG files, decides what is ready, dispatches it.
Webserver: renders the UI.
Executor(s): Local, Celery, Kubernetes, or LocalKubernetes — runs the tasks.

Animation A2 — Airflow Architecture: Task Flow

The scheduler scans DAG files and pushes a ready task into the queue; a worker picks it up, executes the operator, and writes the result back to the metadata DB. Each box highlights as the work passes through.

Airflow 2.x added the TaskFlow API, datasets as first-class citizens, a stable scheduler, a REST API, and DAG versioning. Its ML story is mostly through operators: KubernetesPodOperator, DatabricksRunNowOperator, SageMakerTrainingOperator. The orchestration is Airflow's; the work itself is delegated.

2.2 Prefect 2.x and Dagster

Prefect 2 (“Orion”) is the most Pythonic of the major orchestrators. A @flow is a normal Python function; @task decorates the steps. Workers pull from work pools and execute on whatever infrastructure you point them at.

Dagster takes the radical position that the unit of orchestration should be the asset. An @asset declaration says “this dataset/model exists, and depends on these other assets.” The runtime materializes assets in the right order, exposes partitions and freshness in the UI, and runs backfills by selecting partition ranges. For teams already on dbt, Dagster brings the same lineage philosophy to features and models.

2.3 Argo Workflows on Kubernetes

Argo Workflows is the Kubernetes-native workflow engine: workflows are CRDs, every step is a pod, and the controller reconciles state. It is declarative YAML — rarely used directly by data scientists, but it is the substrate KFP and other tools compile into.

2.4 Strengths and Weaknesses for ML

flowchart TD subgraph Authors[DAG Authors] DAGs[DAG Files
Python] end subgraph Control[Control Plane] Scheduler[Scheduler] Webserver[Webserver / UI] end subgraph State[State] MetaDB[(Metadata DB
Postgres / MySQL)] end subgraph Compute[Compute Plane] Executor[Executor
Celery / K8s / Local] W1[Worker 1] W2[Worker 2] W3[Worker N] end DAGs --> Scheduler Scheduler <--> MetaDB Webserver <--> MetaDB Scheduler --> Executor Executor --> W1 Executor --> W2 Executor --> W3 W1 --> MetaDB W2 --> MetaDB W3 --> MetaDB

Orchestrator	ML Strengths	ML Weaknesses
Airflow	Huge operator ecosystem, mature, ubiquitous in data engineering	No native model/dataset lineage, scheduler strains on many short tasks
Prefect	Pythonic, fast onboarding, easy to wrap ML scripts	No first-class asset model, smaller ecosystem
Dagster	Asset-based lineage, partitions, native backfills, freshness	Concept overhead (ops/jobs/assets/repos), not an experiment tracker
Argo	K8s-native, infinitely scalable, declarative	Too low-level for direct ML use; YAML-only

Key Points — Section 2

Airflow's four moving parts — metadata DB, scheduler, webserver, executor — are decoupled; ML support is via operators delegating to external systems.
Prefect wraps existing Python ML code with decorators and is the lightest-weight choice for a small team.
Dagster is uniquely strong for data-aware lineage and partition-driven backfills.
Argo is the K8s-native substrate other tools (notably KFP) compile to; rarely written by data scientists directly.
With the exception of Dagster, general-purpose tools treat ML artifacts as opaque side effects — plan to pair them with a model registry and experiment tracker.

Section 3: ML-Native Orchestrators

Pre-Quiz — Section 3

1. In Kubeflow Pipelines v2, artifact lineage (“this model came from these features, which came from this dataset”) is recorded in:

ML Metadata (MLMD) The Argo workflow controller A standalone PostgreSQL outside the cluster Git commits on the pipeline repository

2. Metaflow's signature feature for data-scientist productivity is:

Mandatory Kubernetes deployment Local-to-cloud transparency: same code runs locally and at scale on Batch/K8s by adding decorators A SQL-only DSL Built-in feature store and model registry replacing all external tools

3. ZenML is best described as:

A direct competitor to Kubernetes A meta-orchestrator that compiles ML-centric definitions to your existing backend (Airflow, KFP, Step Functions, Vertex) A managed SaaS owned by AWS A YAML-only declarative engine like Argo

ML-native orchestrators start from a different premise: pipelines are sequences of typed ML steps that produce versioned artifacts (datasets, models, metrics), and the platform should track those artifacts as first-class entities.

3.1 Kubeflow Pipelines

KFP v2 is the canonical open-source K8s-native ML orchestrator. Pipelines are written in a Python DSL that compiles to a Kubernetes workflow (Argo or Tekton). Every step is a containerized component with typed inputs and outputs; lineage is tracked in ML Metadata (MLMD). The cost is operational: running Kubeflow means running K8s, the KFP control plane, MLMD, MinIO/object store, and ideally Istio. Vertex AI Pipelines is Google's managed offering that speaks the KFP SDK.

flowchart TD SDK[KFP Python SDK] -->|compile| Spec[Pipeline Spec
YAML / IR] Spec --> API[KFP API Server] API --> Argo[Argo / Tekton
Workflow Controller] API <--> MLMD[(ML Metadata
MLMD)] Argo --> P1[Step Pod 1] Argo --> P2[Step Pod 2] Argo --> P3[Step Pod N] P1 --> Store[(Artifact Store
MinIO / GCS / S3)] P2 --> Store P3 --> Store P1 --> MLMD P2 --> MLMD P3 --> MLMD UI[KFP UI] <--> API UI <--> MLMD

3.2 Metaflow

Open-sourced by Netflix, Metaflow is unapologetically optimized for data scientists. A flow is a Python class with @step methods; you run python flow.py run locally and the same code scales out by adding @batch or @kubernetes to a step. Anything assigned to self becomes a versioned artifact persisted to S3 and queryable by run ID. The signature feature is local-to-cloud transparency: a 10-row sample locally and a 10-million-row partition on AWS Batch — same code.

3.3 ZenML and Flyte

Flyte originated at Lyft for ML at scale and is the strongest “K8s-native, typed, reproducible” option. Tasks are Python functions with type hints; Flyte uses those hints to serialize artifacts, content-address cache outputs, and validate the graph at compile time. Resource specs (GPU, memory, accelerator) are decorator arguments.

ZenML is a meta-orchestrator: rather than executing pipelines itself, it compiles ML-centric definitions into the backend you already have (Airflow, KFP, K8s, Step Functions, Vertex). On top of that it provides typed Artifact classes and stack abstractions for swapping artifact stores or trackers without rewriting code.

3.4 Vertex AI and SageMaker Pipelines

Vertex AI Pipelines (GCP) runs KFP-compatible pipelines without the K8s ops burden, integrates with Vertex Metadata, bills per pipeline-second. SageMaker Pipelines (AWS) provides a CI/CD-flavored DSL tightly integrated with SageMaker Training, Processing, and Model Registry. Both trade flexibility for managed convenience.

3.5 The Big Comparison

Tool	Paradigm	K8s	Artifact Tracking	Caching	Best Fit
Airflow 2.x	Task DAG	PodOperator	External	None native	Coarse-grained ML on existing data infra
Prefect 2.x	Pythonic flow/task	K8s workers	Generic result cache	Generic	Python ML teams, limited platform ops
Dagster 1.x	Asset / op / job	First-class	Materializations	Memoization	Modern data + ML platforms
Argo Workflows	Declarative YAML	Native	Volume artifacts	Manual	Substrate for higher-level tools
Kubeflow v2	Component DAG	Native only	MLMD (typed)	Content-addressable	K8s-first ML, GCP/Vertex
Flyte 1.x	Typed Python tasks	Native only	Platform-level lineage	Content-addressable	Large-scale K8s ML platforms
Metaflow 2.x	Pythonic `@step`	Optional	`self.x` versioned	Resume from past runs	ML developer productivity, AWS-leaning
ZenML	Meta-orchestrator	Delegated	Typed metadata store	Inherits + own	Avoiding lock-in to one backend

Key Points — Section 3

KFP v2 compiles a typed Python DSL to Argo/Tekton workflows on K8s and tracks artifact lineage in MLMD.
Metaflow wins on developer ergonomics; assignment to self automatically versions artifacts, and the same code runs locally or in the cloud.
Flyte uses Python type hints for serialization, caching, and graph validation, with resource specs as decorator arguments.
ZenML is a meta-orchestrator: it compiles ML pipelines onto your existing engine (Airflow, KFP, Step Functions, Vertex) so you can swap backends without rewriting code.
Vertex AI Pipelines and SageMaker Pipelines are the cloud-managed escape hatches that trade flexibility for zero K8s ops.

Section 4: Operationalizing Pipelines

Pre-Quiz — Section 4

1. The recommended default retry policy for ML pipeline tasks is:

Retry indefinitely with no delay 3-5 retries with exponential backoff (capped at 30-60 min) plus jitter to avoid thundering herd A single retry after exactly one hour No retries — page on-call on the first failure

2. Why is idempotency a prerequisite for safe retries?

It makes tasks run faster Without it, retrying a task that partially completed can duplicate or corrupt data (e.g., insert rows three times) It is required by Kubernetes It reduces cloud costs by 50%

3. When triggering a backfill of a year of daily partitions, the two operational rules from the chapter are:

Run all 365 partitions in parallel and never tag the runs Limit concurrency to avoid overloading downstream systems, and tag backfill runs distinctly from scheduled runs Pause the scheduler entirely until backfill finishes Use no caching so every partition recomputes from scratch

Choosing an orchestrator is the easy half. Operating one well — so it survives bad data, flaky infrastructure, half-written deployments, and frantic backfills — requires a small set of disciplines that are essentially the same across every tool.

4.1 A Worked Example: Daily Training DAG

The canonical Prefect example builds features for a date partition, validates them, conditionally trains one model per customer segment, evaluates each, and registers winners. Four things to notice:

Retries with jitter are declared on the decorator, not embedded in business logic.
The feature build is cached by input hash — rerunning with the same date skips recomputation.
Conditional execution is a plain Python if; Prefect treats it as a visible branch.
Training fans out dynamically via .map() — equivalents are Airflow .expand(), Dagster DynamicOutput, Flyte @dynamic, KFP dsl.ParallelFor.

4.2 Retries, Timeouts, and Idempotency

Default playbook: 3-5 retries with exponential backoff capped at 30-60 minutes, plus jitter to avoid thundering-herd. Without idempotency, retries are dangerous. The canonical patterns:

Partitioned writes keyed by date or dataset_id (INSERT OVERWRITE PARTITION, or overwrite the S3 prefix).
Versioned artifact paths for models (s3://models/.../{run_id}/model.pkl); consumers load by run_id, not “latest”.
Deduplication keys on warehouse writes — business key plus run_id enables clean MERGE/UPSERT replays.

Treat the orchestrator's state as ephemeral and the data system's state as the source of truth. Timeouts are the safety net — every long-running task should have a timeout shorter than the next scheduled run.

Animation A3 — Retry with Exponential Backoff

Three failures each double the wait (1s → 2s → 4s) before the fourth attempt succeeds. Jitter (random noise added to the delay) prevents many failed tasks from retrying in lockstep.

4.3 Backfills and Catch-Up

A backfill runs a pipeline for a historical range, usually because you fixed a bug, ingested missing data, or rebuilt features under a new schema. Mechanisms differ widely:

Airflow: catchup=True for automatic; airflow dags backfill -s START -e END CLI; {{ ds }} templates.
Dagster: first-class partitioned backfill UI/CLI; pick a range, one run per partition.
Prefect: no built-in concept; loop over dates and create parameterized runs.
Flyte: LaunchPlans with parameters; cache hits accelerate unchanged parts.
KFP: no first-class backfill; script that loops and submits runs.

Two operational rules: (1) limit concurrency — a year of daily features in parallel will overload the warehouse and cost a fortune. (2) Tag backfill runs distinctly so monitoring and cost dashboards can distinguish them from scheduled runs.

4.4 Parameterization

Parameterize everything that changes between runs: date, env, model_name, segment, hyperparameters. Good practice: type-hint every parameter, validate at flow start, and tag runs with parameter values so you can filter “all daily-training runs with model_type=xgboost in env=prod” later.

4.5 Resource Management and Queueing

Work pools / queues: partition workers by purpose (“gpu”, “default”, “backfill”).
Resource specs per task: KFP and Flyte let each step declare CPU/GPU/memory; the K8s scheduler bin-packs them.
Concurrency limits: per-DAG, per-task, per-pool — guards against pile-ups.
Priorities: prod runs should outrank ad-hoc experiments.

stateDiagram-v2 [*] --> Pending Pending --> Running: scheduler dispatches Running --> Success: exit 0 Running --> Failed: exit != 0 / timeout Failed --> Backoff: attempts < max Backoff --> Running: wait = base * 2^n + jitter Failed --> DeadLetter: attempts >= max Success --> [*] DeadLetter --> [*]: alert on-call

Key Points — Section 4

Default retry policy: 3-5 attempts, exponential backoff capped at 30-60 min, with jitter — declared on the decorator, not in business logic.
Idempotency is the prerequisite for safe retries. Use partitioned overwrites, versioned artifact paths, and dedup keys.
Timeouts on every long-running task prevent runaway jobs from blocking downstream pipelines.
Backfills must use bounded concurrency and distinct tagging; only Dagster and (partially) Airflow have first-class support.
Resource isolation via work pools, per-task resource specs, and concurrency limits keeps GPU jobs and feature builds from contending for the same slots.

Post-Quiz — Reinforcement

Post-Quiz — Section 1: Orchestration Concepts