Apply operational disciplines: idempotent retries with jittered exponential backoff, timeouts, backfills, parameterization, and resource pooling.
Read a Prefect DAG and identify caching, conditional execution, and dynamic fan-out patterns that generalize across frameworks.
Section 1: Orchestration Concepts
Pre-Quiz — Section 1
1. In every orchestrator, the component that decides what runs next by scanning DAG definitions and checking dependencies is the:
ExecutorSchedulerOperatorMetadata DB
2. The deeper paradigm split between Airflow and Dagster is best characterized as:
Python vs YAML authoringSingle-machine vs distributed executionTask-centric (“did it run?”) vs asset-centric (“is it fresh?”)Open source vs commercial
3. A team needs to fire a pipeline whenever a file lands in S3, without burning a worker slot continuously polling. The best modern primitive is:
A traditional sensor in a loopA cron schedule running every minuteA deferrable sensor or event-driven triggerA manual run triggered by the on-call engineer
An orchestrator is to ML pipelines what air traffic control is to aircraft: it does not fly the plane, but it sequences takeoffs, prevents collisions, and decides what is allowed to land. In production, something has to wake up at 02:00 UTC, run yesterday's feature build, fan out training across three customer segments, conditionally deploy only winning models, and retry tomorrow if the warehouse was flaky.
1.1 DAGs, Tasks, Operators, and Executors
Every modern orchestrator decomposes into four primitives:
DAG (Directed Acyclic Graph): the dependency structure. Nodes are work units; edges are “must run before” relationships; no cycles guarantees termination.
Task (also: step, op, component): a single unit of work — a Python function, SQL query, or container.
Operator: a reusable template for a class of tasks. Airflow's BashOperator, PythonOperator, and KubernetesPodOperator are canonical.
Executor: the engine that actually runs tasks — Local, Celery, Kubernetes, Dask.
The DAG is the plan; the executor is the labor. A “scaled” orchestrator usually means a scaled executor — but the scheduler is almost always the bottleneck first, especially with thousands of short tasks.
Animation A1 — DAG Execution Order
Tasks light up in topological order. The three training tasks activate simultaneously (parallel branches); evaluate_all only fires once all three upstream branches complete.
1.2 Imperative vs Declarative
Imperative workflows (Prefect, Metaflow) let you write Python that does things; decorators turn function calls into a runtime graph. Declarative workflows (Argo YAML) ask you to describe the desired graph and let the system schedule it. Airflow sits in between — Python files that read declaratively but execute imperatively at parse time. The trade-off: imperative is easy to author but hard to inspect statically; declarative is easier for platform teams to validate but feels verbose.
1.3 Materialization vs Orchestration
Orchestration-first systems (Airflow, Prefect, Argo) think in tasks: did this job run?Materialization-first systems (Dagster, Flyte, KFP, ZenML to varying degrees) think in assets: is this dataset, feature table, or model up-to-date relative to its inputs? For ML the canonical question is rarely “did training run?” — it is “is this model version fresh given the latest features and the latest code?”
1.4 Triggers — Cron, Sensors, Events
Cron schedules fire on calendar time. Simple and universal.
Sensors poll until an external condition is true. Convenient but can monopolize workers. Modern Airflow offers deferrable sensors that release the slot.
Events push a trigger from outside (Pub/Sub, S3 notification, webhook). Preferred over polling at scale.
Key Points — Section 1
The four orchestration primitives are DAG, task, operator, and executor; the scheduler is the often-overlooked fifth piece and is usually the first bottleneck.
Imperative (Python) authoring is comfortable for data scientists; declarative (YAML) is easier for platform teams to validate.
Task-centric orchestration answers “did it run?”; asset/materialization-centric orchestration answers “is it fresh?” — the latter is closer to what ML actually needs.
Triggers come in three flavors: cron, sensors, and events. Prefer events (or deferrable sensors) at scale to avoid wasting worker slots polling.
Lineage and incremental recomputation strategy follow directly from the task-vs-asset paradigm choice.
Section 2: General-Purpose Orchestrators
Pre-Quiz — Section 2
1. Apache Airflow's classical weakness for ML workloads is:
Inability to run Python codeScheduler strain on DAGs with many short, fine-grained tasks (e.g., large hyperparameter sweeps)Lack of a web UINo support for Kubernetes
2. The unique selling point of Dagster versus Airflow and Prefect is:
It is the only orchestrator with a UIAsset-based orchestration with first-class partitions, freshness, and backfillsIt does not require Python knowledgeIt is the cheapest commercial offering
3. Argo Workflows is most accurately described as:
A Python ML framework competing directly with MetaflowA Kubernetes-native declarative workflow engine that higher-level tools (like KFP) compile intoA managed cloud service from AWSA replacement for the Kubernetes scheduler itself
General-purpose orchestrators were designed for ETL and arbitrary batch jobs, not ML. That history is both their strength (mature, battle-tested, huge ecosystem) and weakness (they treat models and datasets as opaque side effects).
2.1 Apache Airflow Architecture
Airflow is the de facto standard for data engineering. Its four moving parts are:
Metadata database (Postgres/MySQL): stores DAG definitions, run history, task state.
Scheduler: scans DAG files, decides what is ready, dispatches it.
Webserver: renders the UI.
Executor(s): Local, Celery, Kubernetes, or LocalKubernetes — runs the tasks.
Animation A2 — Airflow Architecture: Task Flow
The scheduler scans DAG files and pushes a ready task into the queue; a worker picks it up, executes the operator, and writes the result back to the metadata DB. Each box highlights as the work passes through.
Airflow 2.x added the TaskFlow API, datasets as first-class citizens, a stable scheduler, a REST API, and DAG versioning. Its ML story is mostly through operators: KubernetesPodOperator, DatabricksRunNowOperator, SageMakerTrainingOperator. The orchestration is Airflow's; the work itself is delegated.
2.2 Prefect 2.x and Dagster
Prefect 2 (“Orion”) is the most Pythonic of the major orchestrators. A @flow is a normal Python function; @task decorates the steps. Workers pull from work pools and execute on whatever infrastructure you point them at.
Dagster takes the radical position that the unit of orchestration should be the asset. An @asset declaration says “this dataset/model exists, and depends on these other assets.” The runtime materializes assets in the right order, exposes partitions and freshness in the UI, and runs backfills by selecting partition ranges. For teams already on dbt, Dagster brings the same lineage philosophy to features and models.
2.3 Argo Workflows on Kubernetes
Argo Workflows is the Kubernetes-native workflow engine: workflows are CRDs, every step is a pod, and the controller reconciles state. It is declarative YAML — rarely used directly by data scientists, but it is the substrate KFP and other tools compile into.
Concept overhead (ops/jobs/assets/repos), not an experiment tracker
Argo
K8s-native, infinitely scalable, declarative
Too low-level for direct ML use; YAML-only
Key Points — Section 2
Airflow's four moving parts — metadata DB, scheduler, webserver, executor — are decoupled; ML support is via operators delegating to external systems.
Prefect wraps existing Python ML code with decorators and is the lightest-weight choice for a small team.
Dagster is uniquely strong for data-aware lineage and partition-driven backfills.
Argo is the K8s-native substrate other tools (notably KFP) compile to; rarely written by data scientists directly.
With the exception of Dagster, general-purpose tools treat ML artifacts as opaque side effects — plan to pair them with a model registry and experiment tracker.
Section 3: ML-Native Orchestrators
Pre-Quiz — Section 3
1. In Kubeflow Pipelines v2, artifact lineage (“this model came from these features, which came from this dataset”) is recorded in:
ML Metadata (MLMD)The Argo workflow controllerA standalone PostgreSQL outside the clusterGit commits on the pipeline repository
2. Metaflow's signature feature for data-scientist productivity is:
Mandatory Kubernetes deploymentLocal-to-cloud transparency: same code runs locally and at scale on Batch/K8s by adding decoratorsA SQL-only DSLBuilt-in feature store and model registry replacing all external tools
3. ZenML is best described as:
A direct competitor to KubernetesA meta-orchestrator that compiles ML-centric definitions to your existing backend (Airflow, KFP, Step Functions, Vertex)A managed SaaS owned by AWSA YAML-only declarative engine like Argo
ML-native orchestrators start from a different premise: pipelines are sequences of typed ML steps that produce versioned artifacts (datasets, models, metrics), and the platform should track those artifacts as first-class entities.
3.1 Kubeflow Pipelines
KFP v2 is the canonical open-source K8s-native ML orchestrator. Pipelines are written in a Python DSL that compiles to a Kubernetes workflow (Argo or Tekton). Every step is a containerized component with typed inputs and outputs; lineage is tracked in ML Metadata (MLMD). The cost is operational: running Kubeflow means running K8s, the KFP control plane, MLMD, MinIO/object store, and ideally Istio. Vertex AI Pipelines is Google's managed offering that speaks the KFP SDK.
flowchart TD
SDK[KFP Python SDK] -->|compile| Spec[Pipeline Spec YAML / IR]
Spec --> API[KFP API Server]
API --> Argo[Argo / Tekton Workflow Controller]
API <--> MLMD[(ML Metadata MLMD)]
Argo --> P1[Step Pod 1]
Argo --> P2[Step Pod 2]
Argo --> P3[Step Pod N]
P1 --> Store[(Artifact Store MinIO / GCS / S3)]
P2 --> Store
P3 --> Store
P1 --> MLMD
P2 --> MLMD
P3 --> MLMD
UI[KFP UI] <--> API
UI <--> MLMD
3.2 Metaflow
Open-sourced by Netflix, Metaflow is unapologetically optimized for data scientists. A flow is a Python class with @step methods; you run python flow.py run locally and the same code scales out by adding @batch or @kubernetes to a step. Anything assigned to self becomes a versioned artifact persisted to S3 and queryable by run ID. The signature feature is local-to-cloud transparency: a 10-row sample locally and a 10-million-row partition on AWS Batch — same code.
3.3 ZenML and Flyte
Flyte originated at Lyft for ML at scale and is the strongest “K8s-native, typed, reproducible” option. Tasks are Python functions with type hints; Flyte uses those hints to serialize artifacts, content-address cache outputs, and validate the graph at compile time. Resource specs (GPU, memory, accelerator) are decorator arguments.
ZenML is a meta-orchestrator: rather than executing pipelines itself, it compiles ML-centric definitions into the backend you already have (Airflow, KFP, K8s, Step Functions, Vertex). On top of that it provides typed Artifact classes and stack abstractions for swapping artifact stores or trackers without rewriting code.
3.4 Vertex AI and SageMaker Pipelines
Vertex AI Pipelines (GCP) runs KFP-compatible pipelines without the K8s ops burden, integrates with Vertex Metadata, bills per pipeline-second. SageMaker Pipelines (AWS) provides a CI/CD-flavored DSL tightly integrated with SageMaker Training, Processing, and Model Registry. Both trade flexibility for managed convenience.
3.5 The Big Comparison
Tool
Paradigm
K8s
Artifact Tracking
Caching
Best Fit
Airflow 2.x
Task DAG
PodOperator
External
None native
Coarse-grained ML on existing data infra
Prefect 2.x
Pythonic flow/task
K8s workers
Generic result cache
Generic
Python ML teams, limited platform ops
Dagster 1.x
Asset / op / job
First-class
Materializations
Memoization
Modern data + ML platforms
Argo Workflows
Declarative YAML
Native
Volume artifacts
Manual
Substrate for higher-level tools
Kubeflow v2
Component DAG
Native only
MLMD (typed)
Content-addressable
K8s-first ML, GCP/Vertex
Flyte 1.x
Typed Python tasks
Native only
Platform-level lineage
Content-addressable
Large-scale K8s ML platforms
Metaflow 2.x
Pythonic @step
Optional
self.x versioned
Resume from past runs
ML developer productivity, AWS-leaning
ZenML
Meta-orchestrator
Delegated
Typed metadata store
Inherits + own
Avoiding lock-in to one backend
Key Points — Section 3
KFP v2 compiles a typed Python DSL to Argo/Tekton workflows on K8s and tracks artifact lineage in MLMD.
Metaflow wins on developer ergonomics; assignment to self automatically versions artifacts, and the same code runs locally or in the cloud.
Flyte uses Python type hints for serialization, caching, and graph validation, with resource specs as decorator arguments.
ZenML is a meta-orchestrator: it compiles ML pipelines onto your existing engine (Airflow, KFP, Step Functions, Vertex) so you can swap backends without rewriting code.
Vertex AI Pipelines and SageMaker Pipelines are the cloud-managed escape hatches that trade flexibility for zero K8s ops.
Section 4: Operationalizing Pipelines
Pre-Quiz — Section 4
1. The recommended default retry policy for ML pipeline tasks is:
Retry indefinitely with no delay3-5 retries with exponential backoff (capped at 30-60 min) plus jitter to avoid thundering herdA single retry after exactly one hourNo retries — page on-call on the first failure
2. Why is idempotency a prerequisite for safe retries?
It makes tasks run fasterWithout it, retrying a task that partially completed can duplicate or corrupt data (e.g., insert rows three times)It is required by KubernetesIt reduces cloud costs by 50%
3. When triggering a backfill of a year of daily partitions, the two operational rules from the chapter are:
Run all 365 partitions in parallel and never tag the runsLimit concurrency to avoid overloading downstream systems, and tag backfill runs distinctly from scheduled runsPause the scheduler entirely until backfill finishesUse no caching so every partition recomputes from scratch
Choosing an orchestrator is the easy half. Operating one well — so it survives bad data, flaky infrastructure, half-written deployments, and frantic backfills — requires a small set of disciplines that are essentially the same across every tool.
4.1 A Worked Example: Daily Training DAG
The canonical Prefect example builds features for a date partition, validates them, conditionally trains one model per customer segment, evaluates each, and registers winners. Four things to notice:
Retries with jitter are declared on the decorator, not embedded in business logic.
The feature build is cached by input hash — rerunning with the same date skips recomputation.
Conditional execution is a plain Python if; Prefect treats it as a visible branch.
Training fans out dynamically via .map() — equivalents are Airflow .expand(), Dagster DynamicOutput, Flyte @dynamic, KFP dsl.ParallelFor.
4.2 Retries, Timeouts, and Idempotency
Default playbook: 3-5 retries with exponential backoff capped at 30-60 minutes, plus jitter to avoid thundering-herd. Without idempotency, retries are dangerous. The canonical patterns:
Partitioned writes keyed by date or dataset_id (INSERT OVERWRITE PARTITION, or overwrite the S3 prefix).
Versioned artifact paths for models (s3://models/.../{run_id}/model.pkl); consumers load by run_id, not “latest”.
Deduplication keys on warehouse writes — business key plus run_id enables clean MERGE/UPSERT replays.
Treat the orchestrator's state as ephemeral and the data system's state as the source of truth. Timeouts are the safety net — every long-running task should have a timeout shorter than the next scheduled run.
Animation A3 — Retry with Exponential Backoff
Three failures each double the wait (1s → 2s → 4s) before the fourth attempt succeeds. Jitter (random noise added to the delay) prevents many failed tasks from retrying in lockstep.
4.3 Backfills and Catch-Up
A backfill runs a pipeline for a historical range, usually because you fixed a bug, ingested missing data, or rebuilt features under a new schema. Mechanisms differ widely:
Airflow: catchup=True for automatic; airflow dags backfill -s START -e END CLI; {{ ds }} templates.
Dagster: first-class partitioned backfill UI/CLI; pick a range, one run per partition.
Prefect: no built-in concept; loop over dates and create parameterized runs.
Flyte: LaunchPlans with parameters; cache hits accelerate unchanged parts.
KFP: no first-class backfill; script that loops and submits runs.
Two operational rules: (1) limit concurrency — a year of daily features in parallel will overload the warehouse and cost a fortune. (2) Tag backfill runs distinctly so monitoring and cost dashboards can distinguish them from scheduled runs.
4.4 Parameterization
Parameterize everything that changes between runs: date, env, model_name, segment, hyperparameters. Good practice: type-hint every parameter, validate at flow start, and tag runs with parameter values so you can filter “all daily-training runs with model_type=xgboost in env=prod” later.
4.5 Resource Management and Queueing
Work pools / queues: partition workers by purpose (“gpu”, “default”, “backfill”).
Resource specs per task: KFP and Flyte let each step declare CPU/GPU/memory; the K8s scheduler bin-packs them.
Concurrency limits: per-DAG, per-task, per-pool — guards against pile-ups.
Priorities: prod runs should outrank ad-hoc experiments.
Default retry policy: 3-5 attempts, exponential backoff capped at 30-60 min, with jitter — declared on the decorator, not in business logic.
Idempotency is the prerequisite for safe retries. Use partitioned overwrites, versioned artifact paths, and dedup keys.
Timeouts on every long-running task prevent runaway jobs from blocking downstream pipelines.
Backfills must use bounded concurrency and distinct tagging; only Dagster and (partially) Airflow have first-class support.
Resource isolation via work pools, per-task resource specs, and concurrency limits keeps GPU jobs and feature builds from contending for the same slots.
Post-Quiz — Reinforcement
Post-Quiz — Section 1: Orchestration Concepts
1. In every orchestrator, the component that decides what runs next by scanning DAG definitions and checking dependencies is the:
ExecutorSchedulerOperatorMetadata DB
2. The deeper paradigm split between Airflow and Dagster is best characterized as:
Python vs YAML authoringSingle-machine vs distributed executionTask-centric (“did it run?”) vs asset-centric (“is it fresh?”)Open source vs commercial
3. A team needs to fire a pipeline whenever a file lands in S3, without burning a worker slot continuously polling. The best modern primitive is:
A traditional sensor in a loopA cron schedule running every minuteA deferrable sensor or event-driven triggerA manual run triggered by the on-call engineer
1. Apache Airflow's classical weakness for ML workloads is:
Inability to run Python codeScheduler strain on DAGs with many short, fine-grained tasks (e.g., large hyperparameter sweeps)Lack of a web UINo support for Kubernetes
2. The unique selling point of Dagster versus Airflow and Prefect is:
It is the only orchestrator with a UIAsset-based orchestration with first-class partitions, freshness, and backfillsIt does not require Python knowledgeIt is the cheapest commercial offering
3. Argo Workflows is most accurately described as:
A Python ML framework competing directly with MetaflowA Kubernetes-native declarative workflow engine that higher-level tools (like KFP) compile intoA managed cloud service from AWSA replacement for the Kubernetes scheduler itself
Post-Quiz — Section 3: ML-Native Orchestrators
1. In Kubeflow Pipelines v2, artifact lineage (“this model came from these features, which came from this dataset”) is recorded in:
ML Metadata (MLMD)The Argo workflow controllerA standalone PostgreSQL outside the clusterGit commits on the pipeline repository
2. Metaflow's signature feature for data-scientist productivity is:
Mandatory Kubernetes deploymentLocal-to-cloud transparency: same code runs locally and at scale on Batch/K8s by adding decoratorsA SQL-only DSLBuilt-in feature store and model registry replacing all external tools
3. ZenML is best described as:
A direct competitor to KubernetesA meta-orchestrator that compiles ML-centric definitions to your existing backend (Airflow, KFP, Step Functions, Vertex)A managed SaaS owned by AWSA YAML-only declarative engine like Argo
Post-Quiz — Section 4: Operationalizing Pipelines
1. The recommended default retry policy for ML pipeline tasks is:
Retry indefinitely with no delay3-5 retries with exponential backoff (capped at 30-60 min) plus jitter to avoid thundering herdA single retry after exactly one hourNo retries — page on-call on the first failure
2. Why is idempotency a prerequisite for safe retries?
It makes tasks run fasterWithout it, retrying a task that partially completed can duplicate or corrupt data (e.g., insert rows three times)It is required by KubernetesIt reduces cloud costs by 50%
3. When triggering a backfill of a year of daily partitions, the two operational rules from the chapter are:
Run all 365 partitions in parallel and never tag the runsLimit concurrency to avoid overloading downstream systems, and tag backfill runs distinctly from scheduled runsPause the scheduler entirely until backfill finishesUse no caching so every partition recomputes from scratch