Running AI Workloads on Kubernetes — Interactive Study Guide
1. What type of computation graph do ML pipelines use to represent step dependencies?
2. What does KFP v2's pipeline caching avoid when component inputs have not changed?
3. In MLflow's Model Registry, what does the "champion" alias indicate?
4. What problem does a feature store like Feast primarily solve?
5. In a GitOps workflow, what is the single source of truth for cluster state?
6. Which KFP v2 decorator wraps a non-Python tool or shell script into a pipeline step?
7. Why should ML pipeline steps pin container images by digest rather than tag?
8. What is the role of Argo Workflows in the Kubeflow Pipelines architecture?
9. Which Feast component provides low-latency feature lookups during model serving?
10. What does a validation gate do in an ML CI/CD pipeline?
11. Which tool is a general-purpose Kubernetes CI/CD framework often used alongside ArgoCD in MLOps?
12. In Feast, what process moves computed features from the offline store to Redis for serving?
An ML pipeline is a directed acyclic graph (DAG) of computational steps. Each node is a containerized task; edges carry data artifacts or parameter values from one task to the next. Getting this orchestration right on Kubernetes is the foundation of production MLOps.
Kubeflow Pipelines (KFP) provides a Python SDK for defining pipelines plus backend services for running them on any Kubernetes cluster. KFP v2 introduced major improvements over v1:
| Feature | KFP v1 | KFP v2 |
|---|---|---|
| Component decorator | @component (limited) | @dsl.component and @dsl.container_component |
| Intermediate representation | Argo YAML | Backend-agnostic IR |
| Artifact visibility | Hidden implementation detail | First-class DAG nodes |
| Nested pipelines | Not supported | Pipelines as pipeline components |
| Single component execution | Full pipeline required | Run individual components |
The @dsl.component decorator converts a plain Python function into a self-contained pipeline
step that KFP can containerize and schedule. The @dsl.container_component decorator gives
precise control over the container command — useful when wrapping non-Python tools or shell scripts.
```python
from kfp import dsl
from kfp.dsl import Dataset, Metrics, Model, Output, Input


@dsl.component(base_image="python:3.11-slim",
               packages_to_install=["pandas", "scikit-learn"])
def preprocess(raw_data_path: str, dataset: Output[Dataset]):
    import pandas as pd
    df = pd.read_csv(raw_data_path).dropna()
    df.to_csv(dataset.path, index=False)


@dsl.component(base_image="python:3.11-slim",
               packages_to_install=["pandas", "scikit-learn"])
def train(dataset: Input[Dataset], model: Output[Model],
          n_estimators: int = 100):
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    import joblib
    df = pd.read_csv(dataset.path)
    X, y = df.drop("label", axis=1), df["label"]
    clf = RandomForestClassifier(n_estimators=n_estimators)
    clf.fit(X, y)
    joblib.dump(clf, model.path)


@dsl.component(base_image="python:3.11-slim",
               packages_to_install=["pandas", "scikit-learn"])
def evaluate(dataset: Input[Dataset], model: Input[Model],
             metrics: Output[Metrics]):
    # Minimal evaluation step referenced by the pipeline below
    import pandas as pd
    import joblib
    df = pd.read_csv(dataset.path)
    X, y = df.drop("label", axis=1), df["label"]
    clf = joblib.load(model.path)
    metrics.log_metric("accuracy", float(clf.score(X, y)))


@dsl.pipeline(name="random-forest-pipeline")
def rf_pipeline(data_path: str, n_estimators: int = 100):
    prep = preprocess(raw_data_path=data_path)
    fit = train(dataset=prep.outputs["dataset"],
                n_estimators=n_estimators)
    evaluate(dataset=prep.outputs["dataset"],
             model=fit.outputs["model"])
```
Notice how Dataset and Model types are declared as Output and
Input parameters. KFP v2 surfaces these as first-class artifact nodes in the pipeline
visualization, so engineers can trace exactly which model artifact came from which training run.
Figure 7.1 — ML Pipeline DAG: steps execute in dependency order, passing typed artifacts along edges
KFP compiles Python pipelines into an intermediate representation executed by Argo Workflows as the default backend. Argo can also be used directly — teams working in multiple languages or wrapping legacy tools often prefer writing Argo YAML. Argo provides DAG-based execution, parameter passing, S3/GCS artifact management, and configurable retry policies.
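For teams writing Argo YAML directly, the same two-step dependency could be sketched as follows (image references are hypothetical placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: rf-pipeline-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: preprocess
            template: preprocess
          - name: train
            template: train
            dependencies: [preprocess]   # train waits for preprocess
    - name: preprocess
      container:
        image: registry.example.com/preprocess:1.0   # hypothetical image
        command: [python, preprocess.py]
    - name: train
      container:
        image: registry.example.com/train:1.0        # hypothetical image
        command: [python, train.py]
```

The `dag` template expresses edges with `dependencies`, mirroring what the KFP SDK derives automatically from data passing.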
Tekton occupies a different niche: it is a general-purpose CI/CD framework on Kubernetes. Its strength in MLOps is in the CI/CD layer — building container images, running linting and unit tests before training, and triggering downstream deployments after model validation. Tekton integrates naturally with ArgoCD to create a full GitOps deployment chain.
Effective pipelines treat all tunable values as explicit parameters. This enables, for example, changing the data_path parameter to move from staging to production data without editing pipeline code.
Pipeline caching is a standout feature: when a component's inputs (parameter values and artifact
content hashes) match a previous execution, KFP reuses the cached output. Change n_estimators and
only train and evaluate re-execute; preprocess returns instantly from cache.
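The caching idea can be sketched with the standard library alone; this is an illustration of the concept, not KFP's actual implementation:

```python
import hashlib
import json

_cache = {}


def cache_key(component_name, params, artifact_hashes):
    """Deterministic key over a step's name, its JSON-serializable
    parameters, and the content hashes of its input artifacts."""
    payload = json.dumps(
        {"component": component_name,
         "params": params,
         "artifacts": sorted(artifact_hashes)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def run_cached(component_name, params, artifact_hashes, run_fn):
    """Execute run_fn only when no prior run matches the cache key."""
    key = cache_key(component_name, params, artifact_hashes)
    if key not in _cache:
        _cache[key] = run_fn()
    return _cache[key]
```

Because artifact hashes feed the key, a changed input file invalidates the cache even when parameters are identical.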
Typed artifacts (Dataset, Model) make data lineage visible in the DAG.
Every training run produces metrics, parameters, and artifacts. Without a system to capture them, teams lose track of which configuration produced which result. MLflow is the most widely adopted open-source solution, deployed on Kubernetes as a three-tier architecture:
| Kubernetes Object | Purpose |
|---|---|
| Deployment | Runs the MLflow tracking server as a scalable pod |
| Service + Ingress | Exposes the UI and REST API |
| StatefulSet (PostgreSQL) | Persists run metadata and model registry entries |
| Secret / ConfigMap | Stores database credentials and S3 endpoint configuration |
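The Deployment tier could be sketched roughly as follows; the image tag, bucket name, and Secret name are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-tracking
  namespace: mlops
spec:
  replicas: 2
  selector:
    matchLabels: {app: mlflow}
  template:
    metadata:
      labels: {app: mlflow}
    spec:
      containers:
        - name: mlflow
          image: ghcr.io/mlflow/mlflow:v2.12.1   # pin in production
          command: ["mlflow", "server",
                    "--backend-store-uri", "$(BACKEND_STORE_URI)",
                    "--artifacts-destination", "s3://mlflow-artifacts",
                    "--host", "0.0.0.0", "--port", "5000"]
          envFrom:
            - secretRef: {name: mlflow-db-credentials}  # hypothetical Secret
          ports:
            - containerPort: 5000
```

The backend store URI and object-store credentials come from the Secret/ConfigMap tier, keeping the pod spec free of plaintext credentials.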
```python
import mlflow

mlflow.set_tracking_uri(
    "http://mlflow-service.mlops.svc.cluster.local:5000"
)
mlflow.set_experiment("fraud-detection-v2")

with mlflow.start_run():
    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("max_depth", 8)
    mlflow.log_metric("accuracy", 0.942)
    mlflow.log_metric("f1_score", 0.917)
    mlflow.sklearn.log_model(clf, "model")
```
The MLflow Model Registry provides a versioned catalog of model artifacts with semantic aliases that communicate production status:
| Alias | Meaning |
|---|---|
| champion | The model currently serving production traffic |
| challenger | A candidate model undergoing A/B testing |
| shadow | A model receiving traffic copies for offline evaluation |
| archived | A retired model retained for audit purposes |
Promotion between stages is a deliberate, audited action — not an automatic side-effect of training. Downstream systems reference aliases rather than hard-coded version numbers, enabling safe canary promotions and rollbacks.
For richer visualization and real-time collaboration, managed services like Weights & Biases and Neptune extend the MLflow pattern. Both integrate via environment variables and lightweight SDK calls. The key trade-off: self-hosted MLflow keeps data governance simple; managed SaaS reduces operational burden.
Imagine building a fraud detection model. During training, you compute "transactions in the last 24 hours" using a batch SQL query. In production, the same feature must be computed in real time from a live event stream. If these two computations diverge even slightly, the model degrades silently. This is training-serving skew — one of the most common causes of silent model degradation in production.
Feast solves this by providing a single versioned feature definition shared between training and serving. Its architecture separates two concerns with distinct storage backends:
| Component | Storage Backend | Use Case |
|---|---|---|
| Offline store | PostgreSQL, BigQuery, Snowflake | Historical feature retrieval for training |
| Online store | Redis | Low-latency feature lookup during serving |
| Registry | PostgreSQL | Feature definitions, versioning, metadata |
Figure 7.4 — Feast dual-path architecture: offline store feeds training, online store (Redis) serves inference, both share one feature definition
The workflow is:
1. Features are defined as FeatureView objects and committed to Git.
2. A materialization job copies computed feature values from the offline store into Redis for serving.
3. Training retrieves historical features via get_historical_features; serving reads the same features via get_online_features, using identical definitions.

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes a local feature_store.yaml

# Training: historical features
# (entity_df holds entity keys plus event timestamps)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_stats:transactions_24h",
              "user_stats:avg_amount_7d"],
).to_df()

# Serving: online features (identical feature names)
online_features = store.get_online_features(
    features=["user_stats:transactions_24h",
              "user_stats:avg_amount_7d"],
    entity_rows=[{"user_id": "u-12345"}],
).to_dict()
```
ML CI/CD must do everything traditional CI/CD does, and more: retrain models on new data, validate statistical model quality before promotion, manage large binary artifacts (model weights), and handle the fact that passing tests does not guarantee a model will perform well on tomorrow's data distribution.
GitOps makes Git the single source of truth for cluster state. A GitOps controller (ArgoCD or Flux) watches a Git repository and continuously reconciles the cluster to match what is declared there. For ML, this means model configuration changes are committed to Git, ArgoCD detects the drift, and applies the update automatically.
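An ArgoCD Application declaring this reconciliation loop might look like the following sketch; the repository URL and paths are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: fraud-model-serving
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/ml-deployments  # hypothetical repo
    targetRevision: main
    path: serving/fraud-detection
  destination:
    server: https://kubernetes.default.svc
    namespace: models
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift in the cluster
```

With `automated` sync enabled, promoting a model is simply a Git commit that updates the manifest in `serving/fraud-detection`.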
Figure 7.5 — GitOps ML deployment loop: commit to Git triggers CI, ArgoCD syncs to cluster, monitoring feeds back into retraining
A validation gate blocks promotion unless a model meets defined quality thresholds:
| Gate Type | Example Criterion | Action on Failure |
|---|---|---|
| Accuracy threshold | Accuracy >= 0.90 on held-out test set | Block promotion, alert team |
| Regression check | F1 score >= 95% of current champion's F1 | Block promotion |
| Data quality check | Feature distributions within 2 sigma of training distribution | Block promotion |
| Latency check | p99 inference latency <= 100ms under load | Block promotion |
| Bias/fairness audit | Equal opportunity difference <= 0.05 | Block promotion |
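The first three gates in the table can be expressed as a small pure-Python check; metric names and thresholds here are illustrative defaults:

```python
def validate_candidate(metrics, champion_metrics,
                       min_accuracy=0.90,
                       regression_tolerance=0.95,
                       max_p99_latency_ms=100.0):
    """Return (passed, reasons): reasons lists every failed gate."""
    reasons = []
    if metrics["accuracy"] < min_accuracy:
        reasons.append(
            f"accuracy {metrics['accuracy']:.3f} below {min_accuracy}")
    if metrics["f1"] < regression_tolerance * champion_metrics["f1"]:
        reasons.append("f1 regressed versus current champion")
    if metrics["p99_latency_ms"] > max_p99_latency_ms:
        reasons.append("p99 latency over budget")
    return (not reasons), reasons
```

In a pipeline, a non-empty `reasons` list would block promotion and be attached to the alert sent to the team.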
ML images have unique challenges: a PyTorch training image with CUDA can exceed 10 GB, dependencies must be mutually compatible (CUDA + cuDNN + PyTorch + Python versions), and exact reproducibility months later is required.
| Practice | Implementation |
|---|---|
| Multi-stage builds | Separate build-time deps from runtime image |
| Pinned base images | FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime |
| Image signing | Cosign signatures verified at admission time |
| Shared base images | One org-wide CUDA base; per-project layers on top |
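A multi-stage build following the first two practices might be sketched like this; the requirements file and entrypoint are illustrative:

```dockerfile
# Build stage: full toolchain for compiling wheels
FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-devel AS build
COPY requirements.txt .
RUN pip wheel --no-cache-dir -r requirements.txt -w /wheels

# Runtime stage: slim image without compilers or headers
FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
COPY --from=build /wheels /wheels
RUN pip install --no-cache-dir /wheels/*.whl && rm -rf /wheels
COPY train.py .
ENTRYPOINT ["python", "train.py"]
```

Only the runtime stage ships to the cluster, which keeps the image smaller and its attack surface narrower.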
Reproducibility requires discipline at four layers: code (versioned in Git), data (versioned inputs), environment (pinned container images), and randomness. Random seeds must be set explicitly, for example with torch.manual_seed() and numpy.random.seed().
Pinning pipeline steps to image digests (sha256:...) rather than tags is the single most
impactful reproducibility practice — digests are content-addressed and immutable, unlike tags which
can be overwritten.
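A CI step can enforce digest pinning with a simple check; this helper is an illustrative sketch, not part of any pipeline SDK:

```python
import re

# An OCI image digest reference ends in "@sha256:" plus 64 hex chars
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """True when an image reference is pinned by content digest."""
    return bool(_DIGEST_RE.search(image_ref))
```

Running this over every image reference in a compiled pipeline spec turns the pinning convention into an enforced gate.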