Chapter 13 — Monitoring, CI/CD, and Production Operations
Learning Objectives
Distinguish operational metrics, data drift, and concept drift, and select the right detection methods (KS, PSI, rolling AUC) and tools (Evidently, WhyLabs, Arize, Fiddler).
Design a five-layer CI/CD/CT pipeline with data and model gates, GitOps deployment via Argo CD, and progressive delivery with Argo Rollouts.
Apply governance artifacts (model cards, datasheets) and map ML systems to EU AI Act risk classes while layering adversarial and PII defenses.
Define ML-aware SLOs and error budgets, route alerts through purpose-built runbooks, and execute Git-based rollback with blameless post-mortems.
Section 1: Production Monitoring
An ML service is first a network service: RED metrics (Rate, Errors, Duration) and USE metrics (Utilization, Saturation, Errors) describe whether the system is healthy. Latency is reported as p50/p95/p99, saturation as GPU utilization and queue depth, errors as HTTP 5xx and timeouts. Prometheus or OpenTelemetry usually feeds Grafana dashboards.
Beyond plumbing, ML systems track data drift (covariate shift) — when feature distributions move away from the training reference. A reference window (training set or recent known-good period) is compared against a current window (e.g., last 24 hours) using the Kolmogorov-Smirnov test for continuous features, chi-square for categoricals, or the Population Stability Index (PSI < 0.1 stable, 0.1–0.25 moderate, > 0.25 significant).
Concept drift is more insidious: inputs look the same but the input → output relationship has changed (spam tactics evolve, "engaged user" gets redefined). Detection requires labels, which may arrive with 90-day lag, so teams use rolling AUC/F1/MAE, proxy metrics like CTR, and human-in-the-loop sampling.
Tool landscape: Evidently and whylogs are open-source profiling libraries paired with Evidently Cloud / WhyLabs Observatory; Arize and Fiddler are commercial-only platforms. Evidently fits small batch teams, WhyLabs fits many-model fleets, Arize fits embedding/RAG/LLM debugging, and Fiddler fits regulated industries that need explainability and audit trails.
Animation A1: Concept Drift Detection Loop
From prediction stream to retraining trigger
Predictions stream in → feature distribution shifts → KS statistic crosses threshold → alert fires → CT retraining triggers → new model deploys.
flowchart TD
A[Model Serving Endpoint] --> B[Operational Metrics RED + USE]
A --> C[Data Drift KS / chi-square / PSI]
A --> D[Concept Drift rolling AUC / F1 / proxy]
B --> E[Prometheus / OpenTelemetry]
C --> F[Evidently / whylogs profiles]
D --> F
E --> G[Grafana Dashboards]
F --> H[WhyLabs / Arize / Fiddler]
G --> I[Alert Manager]
H --> I
I --> J[PagerDuty / Slack On-call]
J --> K[Incident Runbook]
Key Points
Three layers: ops metrics (RED/USE) catch outages, data drift catches input-distribution shifts, concept drift catches input→output relationship changes.
Concept drift needs labels — and labels often lag 90+ days, so use rolling metrics, proxies, and human-in-the-loop sampling.
Open-source + commercial pairing is the dominant pattern: Evidently/whylogs in-pipeline, plus WhyLabs/Arize/Fiddler for hosted UI, alerting, and collaboration.
Tool selection follows team profile: Evidently for cost-sensitive batch, Arize for LLM/RAG embedding debugging, Fiddler for regulated industries.
Pre-Section Quiz: Production Monitoring
1. A PSI value of 0.32 between the training reference window and the last 24 hours of feature data indicates:
Stable distribution — no action needed.Moderate drift — schedule a review next sprint.Significant drift — investigate and consider retraining.A bug in the PSI calculation — PSI cannot exceed 0.25.
2. A spam filter still receives email-shaped text, but spammers have changed tactics and the model's precision is collapsing. This is best described as:
Data drift (covariate shift).Concept drift (the input→output relationship changed).An operational latency issue.A schema validation failure.
3. A regulated bank wants drift detection plus explainability and detailed audit trails for examiners. Which platform best fits per the chapter?
Evidently OSS only, with Grafana alerting.Arize, for embedding-cluster RAG tracing.Fiddler, for regulated industries needing explainability + audit.whylogs profiles stored in S3 with no UI.
Section 2: Continuous Integration and Delivery
ML CI extends classical CI with data and model tests. The five-layer gating sequence is: static checks (lint, types) → unit tests (transforms, tokenizers) → data contract tests (Great Expectations / Soda schema and quality rules) → model fast tests (smoke train on a small sample, "AUC > 0.5", "no NaN predictions") → integration tests (serving API talks to feature store).
GitOps with Argo CD treats every deployable artifact — Kubernetes manifests, model version pointers — as a versioned file in Git. A typical four-repo layout: app/service repo, ML pipeline repo, GitOps env repos (staging/prod manifests), and a model registry (MLflow, SageMaker, Vertex). Argo CD watches only the env repos, diffs against live state, and reconciles. Rollback = git revert.
Continuous Training (CT) is the ML-native extension. Triggers are time-based (nightly cron), data-based (new batch arrives, PSI > 0.25, label drift), or performance-based (online metric breaches SLO). CT runs validation → feature compute → train → eval → registry promotion → opens a PR that bumps MODEL_VERSION in the env repo. Argo CD reconciles, pods restart.
Promotion gates compare candidate vs champion on held-out data, check fairness across protected subgroups, enforce latency/resource budgets, and run adversarial probes. Argo Rollouts ramps traffic (5% → 20% → 50% → 100%) with automated metric analysis; failures abort. Shadow deployments mirror prod traffic via Istio/Envoy/NGINX without returning responses, comparing champion vs challenger offline.
Animation A2: CI/CD/CT Pipeline
From code commit to canary, with CT loop back
Code commit → CI gates (unit, data, model) → image build → registry → CD deploys to staging → monitor → CT triggers retraining on drift.
flowchart LR
A[Code Commit] --> B[CI: lint + unit + data + model tests]
B --> C[Build Image + push registry]
C --> D[GitOps PR env repo]
D --> E[Argo CD Sync]
E --> F[Argo Rollouts canary / shadow]
F --> G[Production Serving]
G --> H[Monitoring ops + drift + perf]
H -.drift / SLO breach.-> I[CT Trigger]
H -.cron / new labels.-> I
I --> J[Data Validation + Training]
J --> K[Eval vs Champion + Fairness Gates]
K --> L[Model Registry promote]
L --> D
Animation A3: GitOps with Argo CD
Git is the source of truth; Argo CD reconciles
Developer commits manifest → Argo CD polls env repo → diffs cluster state → applies sync → pods updated. Rollback = revert the commit.
sequenceDiagram
participant Dev as Developer
participant App as App / ML Repo
participant Reg as Model Registry
participant Env as GitOps Env Repo
participant Argo as Argo CD
participant K8s as Kubernetes Cluster
Dev->>App: push code / pipeline change
App->>Reg: register new model version
App->>Env: open PR (image tag + MODEL_VERSION)
Dev->>Env: review + merge
Argo->>Env: poll desired state
Argo->>K8s: diff vs live state
Argo->>K8s: apply manifests (sync)
K8s-->>Argo: report health
Argo-->>Dev: status / drift alerts
Key Points
Five test layers gate every commit: static, unit, data contract, fast-model, integration — each must pass before the next runs.
GitOps makes every deployment a Git commit; Argo CD reconciles cluster state against the env repo, and rollback is git revert.
CT triggers are not code commits but data/perf events: cron, new labels, PSI > threshold, online metric SLO breach.
Promotion gates include champion-vs-challenger margin, fairness across subgroups, latency budgets, and adversarial robustness probes.
Argo Rollouts + shadow: canary ramps 5→20→50→100% with auto-abort; shadow mirrors prod traffic via service mesh without returning responses.
Pre-Section Quiz: CI/CD/CT
1. In a four-repo GitOps layout, what does Argo CD actually watch?
Only the model registry artifacts.Only the application code repo for Dockerfile changes.Only the GitOps env repos containing the Kubernetes manifests.All four repos simultaneously and merges PRs automatically.
2. Which of the following is NOT a typical Continuous Training (CT) trigger?
Nightly cron job to absorb new labeled data.PSI on a feature crosses 0.25.A developer pushing a lint-only change to the README.Online accuracy breaches the SLO threshold.
3. A shadow deployment differs from a canary in that it:
Ramps traffic in steps from 5% to 100% to the new version.Mirrors production traffic to the new model but does NOT return its responses to users.Only runs in staging and never sees production data.Is faster to roll back than a Git revert.
Section 3: Governance and Security
Model cards are versioned, immutable documents attached to every significant model version: identity, intended use and out-of-scope uses, training data summary, performance by subgroup, known failure modes, safety controls, and change history. They must be linked to deployments so auditors can trace which card was live on any given day. Datasheets for datasets play the same role for data: provenance, legal basis, schema, biases, preprocessing, retention, and usage constraints.
The EU AI Act classifies systems by risk and imposes proportionate obligations. Prohibited: social scoring, real-time public biometric ID, subliminal manipulation. High-risk: credit scoring, medical triage, employment screening, critical infrastructure — full risk management, technical docs, data governance, human oversight, post-market monitoring, EU registration. GPAI/Foundation: transparency, copyright, systemic-risk evaluation. Limited-risk: chatbot/deepfake disclosure. Minimal-risk: voluntary codes.
Adversarial robustness follows a program: threat model (white/gray/black box) → test with attack libraries → layer defenses (adversarial training, robust optimization, L2, input sanitization, anomaly detection) → continuous monitoring for error-rate spikes.
LLM-specific defenses against prompt injection must be layered: input filters at the boundary, output filters for toxicity/PII/self-harm, context isolation keeping system prompts away from user content and treating retrieved docs as untrusted, allow-listed tool access with least privilege, adversarial fine-tuning, and red-teaming. Monitor for repeated jailbreak attempts and unusual tool invocation patterns.
Data protection: PII detection + redaction/tokenization before logging; AES-256 at rest with KMS, TLS 1.2+ in transit; RBAC on the registry with separate training/serving service accounts; just-in-time prod elevation; immutable audit logs; secrets in HashiCorp Vault / AWS Secrets Manager with automated rotation; GDPR right-to-erasure must propagate to vector stores.
Key Points
Model cards + datasheets are versioned, immutable, and linked to deployments — they make audits tractable and satisfy EU AI Act traceability obligations.
EU AI Act risk classes: prohibited, high-risk, GPAI, limited-risk, minimal-risk; obligations are proportionate to class.
Prompt-injection defense is layered: no single control suffices — input filters + output filters + context isolation + tool allow-lists + adversarial fine-tuning + red-teaming.
Context isolation treats retrieved RAG documents as untrusted user content, not as trusted system instructions.
Data protection essentials: PII redaction in logs, encryption (AES-256 / TLS 1.2+), RBAC, vault-managed secrets, GDPR erasure that reaches vector stores.
Pre-Section Quiz: Governance and Security
1. Under the EU AI Act, a hospital's diagnostic triage model would fall into which risk class?
Minimal risk — voluntary codes of conduct only.Limited risk — chatbot-style transparency disclosure.High-risk — full technical docs, risk management, human oversight, EU registration.Prohibited — cannot be deployed in the EU.
2. Which prompt-injection defense is described as treating retrieved RAG documents as untrusted rather than as trusted system instructions?
Output filtering for toxicity.Context isolation.Just-in-time secret rotation.RBAC on the model registry.
3. Which statement about model cards is correct per the chapter?
They are written once at project inception and never updated.They are mutable working documents that anyone can edit live.They are versioned and immutable once published, and linked to the deployments they cover.They replace datasheets for datasets — only one artifact is needed.
Section 4: Operations and Reliability
SLOs and error budgets from classical SRE extend in three dimensions for ML. Operational: availability, p95 latency, error rate. Model quality: rolling AUC, calibration error, false-positive/negative rates. Safety: cap rate of policy-violating outputs (e.g., "no more than 0.1% of LLM responses flagged in any 30-day window"). Drift: cap PSI or Jensen-Shannon divergence on critical features. When the error budget burns down, deploys freeze and remediation takes priority.
Four runbooks cover the ML incident landscape:
General model runbook — SLO/error-budget breaches and unexpected behavior: switch to known-good fallback, capture diagnostics, notify stakeholders.
Data/privacy runbook — PII leakage: contain (revoke tokens, rotate keys), scope assessment, DPO coordination on GDPR / EU AI Act timelines.
Quality/drift runbook — distribution shifts: validate monitoring data itself, restrict/roll back model, investigate upstream data, decide between retrain and re-tune.
Rollback options: Git revert + Argo CD reconcile; Argo Rollouts auto-abort during canary; blue-green flip via Service/Ingress; registry-based pointer change in a ConfigMap. Blameless post-mortems follow every significant incident with action items, runbook updates, and (where relevant) entries into the EU AI Act technical file.
flowchart LR
A[Define SLIs latency, AUC, safety, PSI] --> B[Set SLOs per dimension]
B --> C[Compute Error Budget 100% - SLO]
C --> D[Measure Live SLIs]
D --> E{Budget remaining?}
E -- Yes --> F[Ship new release spend budget]
F --> D
E -- No --> G[Freeze deploys remediation only]
G --> H[Post-mortem + runbook update]
H --> D
flowchart TD
A[Alert Fires SLO / drift / safety / PII] --> B{Classify incident type}
B -- Ops / SLO --> C[General Model Runbook]
B -- Safety / LLM --> D[Guardrail Runbook]
B -- Privacy / PII --> E[Data Runbook]
B -- Drift / Quality --> F[Quality Runbook]
C --> G{Severity high?}
D --> G
E --> G
F --> G
G -- Yes --> H[Switch to fallback or rollback via Git]
G -- No --> I[Throttle / mitigate]
H --> J[Capture diagnostics + notify stakeholders]
I --> J
J --> K{Regulatory notification?}
K -- Yes --> L[DPO + EU AI Act filing]
K -- No --> M[Blameless post-mortem]
L --> M
M --> N[Update runbooks model cards, datasheets]
Key Points
ML SLOs span four dimensions: operational, model quality, safety, and drift — each with its own error budget.
Error budget exhaustion freezes deploys: business commits to remediation over new features when the budget burns down.
Four runbooks — general model, safety/LLM guardrail, data/privacy, quality/drift — each with roles, escalation, time-to-respond targets, and recovery preconditions.
Rollback toolbox: Git revert + Argo reconcile, Argo Rollouts auto-abort, blue-green Service flip, or registry-pointer ConfigMap change.
Blameless post-mortems close the loop by updating runbooks, model cards, datasheets — and feeding the EU AI Act technical file when relevant.
Pre-Section Quiz: Operations and Reliability
1. An ML team's safety SLO states "no more than 0.1% of LLM responses flagged by the safety classifier in any 30-day window." What does the error budget represent?
The 99.9% of responses that must pass the classifier; the budget caps the 0.1% allowed shortfall.The total monthly OpenAI bill.The latency p95 threshold for the endpoint.The number of model versions allowed in production simultaneously.
2. A canary deployment shows error-rate regression at the 20% traffic step. What does Argo Rollouts do automatically?
Immediately escalates to 100% to flush out the bug.Aborts the rollout and leaves the stable ReplicaSet serving all traffic.Pages the entire engineering org regardless of severity.Deletes the previous version to force a rebuild.
3. A PII leakage incident has been confirmed. Which runbook is appropriate, per the chapter's four-runbook taxonomy?
General model runbook — switch to fallback and capture diagnostics.Quality/drift runbook — re-tune thresholds and retrain.Data/privacy runbook — contain, scope assessment, DPO coordination.Safety/LLM guardrail runbook — update filters and adversarial training.
Post-Session Review Quizzes
Post Quiz: Production Monitoring
1. A PSI value of 0.32 between the training reference window and the last 24 hours of feature data indicates:
Stable distribution — no action needed.Moderate drift — schedule a review next sprint.Significant drift — investigate and consider retraining.A bug in the PSI calculation — PSI cannot exceed 0.25.
2. A spam filter still receives email-shaped text, but spammers have changed tactics and the model's precision is collapsing. This is best described as:
Data drift (covariate shift).Concept drift (the input→output relationship changed).An operational latency issue.A schema validation failure.
3. A regulated bank wants drift detection plus explainability and detailed audit trails for examiners. Which platform best fits per the chapter?
Evidently OSS only, with Grafana alerting.Arize, for embedding-cluster RAG tracing.Fiddler, for regulated industries needing explainability + audit.whylogs profiles stored in S3 with no UI.
Post Quiz: CI/CD/CT
1. In a four-repo GitOps layout, what does Argo CD actually watch?
Only the model registry artifacts.Only the application code repo for Dockerfile changes.Only the GitOps env repos containing the Kubernetes manifests.All four repos simultaneously and merges PRs automatically.
2. Which of the following is NOT a typical Continuous Training (CT) trigger?
Nightly cron job to absorb new labeled data.PSI on a feature crosses 0.25.A developer pushing a lint-only change to the README.Online accuracy breaches the SLO threshold.
3. A shadow deployment differs from a canary in that it:
Ramps traffic in steps from 5% to 100% to the new version.Mirrors production traffic to the new model but does NOT return its responses to users.Only runs in staging and never sees production data.Is faster to roll back than a Git revert.
Post Quiz: Governance and Security
1. Under the EU AI Act, a hospital's diagnostic triage model would fall into which risk class?
Minimal risk — voluntary codes of conduct only.Limited risk — chatbot-style transparency disclosure.High-risk — full technical docs, risk management, human oversight, EU registration.Prohibited — cannot be deployed in the EU.
2. Which prompt-injection defense is described as treating retrieved RAG documents as untrusted rather than as trusted system instructions?
Output filtering for toxicity.Context isolation.Just-in-time secret rotation.RBAC on the model registry.
3. Which statement about model cards is correct per the chapter?
They are written once at project inception and never updated.They are mutable working documents that anyone can edit live.They are versioned and immutable once published, and linked to the deployments they cover.They replace datasheets for datasets — only one artifact is needed.
Post Quiz: Operations and Reliability
1. An ML team's safety SLO states "no more than 0.1% of LLM responses flagged by the safety classifier in any 30-day window." What does the error budget represent?
The 99.9% of responses that must pass the classifier; the budget caps the 0.1% allowed shortfall.The total monthly OpenAI bill.The latency p95 threshold for the endpoint.The number of model versions allowed in production simultaneously.
2. A canary deployment shows error-rate regression at the 20% traffic step. What does Argo Rollouts do automatically?
Immediately escalates to 100% to flush out the bug.Aborts the rollout and leaves the stable ReplicaSet serving all traffic.Pages the entire engineering org regardless of severity.Deletes the previous version to force a rebuild.
3. A PII leakage incident has been confirmed. Which runbook is appropriate, per the chapter's four-runbook taxonomy?
General model runbook — switch to fallback and capture diagnostics.Quality/drift runbook — re-tune thresholds and retrain.Data/privacy runbook — contain, scope assessment, DPO coordination.Safety/LLM guardrail runbook — update filters and adversarial training.