Chapter 13 — Monitoring, CI/CD, and Production Operations

Learning Objectives

Section 1: Production Monitoring

An ML service is first a network service: RED metrics (Rate, Errors, Duration) and USE metrics (Utilization, Saturation, Errors) describe whether the system is healthy. Latency is reported as p50/p95/p99, saturation as GPU utilization and queue depth, errors as HTTP 5xx and timeouts. Prometheus or OpenTelemetry usually feeds Grafana dashboards.

Beyond plumbing, ML systems track data drift (covariate shift) — when feature distributions move away from the training reference. A reference window (training set or recent known-good period) is compared against a current window (e.g., last 24 hours) using the Kolmogorov-Smirnov test for continuous features, chi-square for categoricals, or the Population Stability Index (PSI < 0.1 stable, 0.1–0.25 moderate, > 0.25 significant).

Concept drift is more insidious: inputs look the same but the input → output relationship has changed (spam tactics evolve, "engaged user" gets redefined). Detection requires labels, which may arrive with 90-day lag, so teams use rolling AUC/F1/MAE, proxy metrics like CTR, and human-in-the-loop sampling.

Tool landscape: Evidently and whylogs are open-source profiling libraries paired with Evidently Cloud / WhyLabs Observatory; Arize and Fiddler are commercial-only platforms. Evidently fits small batch teams, WhyLabs fits many-model fleets, Arize fits embedding/RAG/LLM debugging, and Fiddler fits regulated industries that need explainability and audit trails.

Animation A1: Concept Drift Detection Loop

From prediction stream to retraining trigger

Predictions stream in → feature distribution shifts → KS statistic crosses threshold → alert fires → CT retraining triggers → new model deploys.
Prediction Stream incoming requests Feature Distribution reference vs current KS Test D-statistic threshold > 0.15 PagerDuty drift alert ! CT Trigger → Retraining New Model Live
flowchart TD A[Model Serving Endpoint] --> B[Operational Metrics
RED + USE] A --> C[Data Drift
KS / chi-square / PSI] A --> D[Concept Drift
rolling AUC / F1 / proxy] B --> E[Prometheus / OpenTelemetry] C --> F[Evidently / whylogs profiles] D --> F E --> G[Grafana Dashboards] F --> H[WhyLabs / Arize / Fiddler] G --> I[Alert Manager] H --> I I --> J[PagerDuty / Slack On-call] J --> K[Incident Runbook]

Key Points

Pre-Section Quiz: Production Monitoring

1. A PSI value of 0.32 between the training reference window and the last 24 hours of feature data indicates:

Stable distribution — no action needed. Moderate drift — schedule a review next sprint. Significant drift — investigate and consider retraining. A bug in the PSI calculation — PSI cannot exceed 0.25.

2. A spam filter still receives email-shaped text, but spammers have changed tactics and the model's precision is collapsing. This is best described as:

Data drift (covariate shift). Concept drift (the input→output relationship changed). An operational latency issue. A schema validation failure.

3. A regulated bank wants drift detection plus explainability and detailed audit trails for examiners. Which platform best fits per the chapter?

Evidently OSS only, with Grafana alerting. Arize, for embedding-cluster RAG tracing. Fiddler, for regulated industries needing explainability + audit. whylogs profiles stored in S3 with no UI.

Section 2: Continuous Integration and Delivery

ML CI extends classical CI with data and model tests. The five-layer gating sequence is: static checks (lint, types) → unit tests (transforms, tokenizers) → data contract tests (Great Expectations / Soda schema and quality rules) → model fast tests (smoke train on a small sample, "AUC > 0.5", "no NaN predictions") → integration tests (serving API talks to feature store).

GitOps with Argo CD treats every deployable artifact — Kubernetes manifests, model version pointers — as a versioned file in Git. A typical four-repo layout: app/service repo, ML pipeline repo, GitOps env repos (staging/prod manifests), and a model registry (MLflow, SageMaker, Vertex). Argo CD watches only the env repos, diffs against live state, and reconciles. Rollback = git revert.

Continuous Training (CT) is the ML-native extension. Triggers are time-based (nightly cron), data-based (new batch arrives, PSI > 0.25, label drift), or performance-based (online metric breaches SLO). CT runs validation → feature compute → train → eval → registry promotion → opens a PR that bumps MODEL_VERSION in the env repo. Argo CD reconciles, pods restart.

Promotion gates compare candidate vs champion on held-out data, check fairness across protected subgroups, enforce latency/resource budgets, and run adversarial probes. Argo Rollouts ramps traffic (5% → 20% → 50% → 100%) with automated metric analysis; failures abort. Shadow deployments mirror prod traffic via Istio/Envoy/NGINX without returning responses, comparing champion vs challenger offline.

Animation A2: CI/CD/CT Pipeline

From code commit to canary, with CT loop back

Code commit → CI gates (unit, data, model) → image build → registry → CD deploys to staging → monitor → CT triggers retraining on drift.
Code Commit CI Gates unit data model Image + Registry Argo CD Sync Canary + Staging Production Monitoring ops + drift + perf ! CT Trigger retrain + promote drift / SLO breach → CT → registry → CI
flowchart LR A[Code Commit] --> B[CI: lint + unit
+ data + model tests] B --> C[Build Image
+ push registry] C --> D[GitOps PR
env repo] D --> E[Argo CD Sync] E --> F[Argo Rollouts
canary / shadow] F --> G[Production Serving] G --> H[Monitoring
ops + drift + perf] H -.drift / SLO breach.-> I[CT Trigger] H -.cron / new labels.-> I I --> J[Data Validation
+ Training] J --> K[Eval vs Champion
+ Fairness Gates] K --> L[Model Registry
promote] L --> D

Animation A3: GitOps with Argo CD

Git is the source of truth; Argo CD reconciles

Developer commits manifest → Argo CD polls env repo → diffs cluster state → applies sync → pods updated. Rollback = revert the commit.
Git Env Repo source of truth commit: bump v1.4.2 Argo CD controller diff detected Kubernetes Cluster live state pod pod pod v1.4.2 deployed poll sync rollback = git revert → Argo re-syncs
sequenceDiagram participant Dev as Developer participant App as App / ML Repo participant Reg as Model Registry participant Env as GitOps Env Repo participant Argo as Argo CD participant K8s as Kubernetes Cluster Dev->>App: push code / pipeline change App->>Reg: register new model version App->>Env: open PR (image tag + MODEL_VERSION) Dev->>Env: review + merge Argo->>Env: poll desired state Argo->>K8s: diff vs live state Argo->>K8s: apply manifests (sync) K8s-->>Argo: report health Argo-->>Dev: status / drift alerts

Key Points

Pre-Section Quiz: CI/CD/CT

1. In a four-repo GitOps layout, what does Argo CD actually watch?

Only the model registry artifacts. Only the application code repo for Dockerfile changes. Only the GitOps env repos containing the Kubernetes manifests. All four repos simultaneously and merges PRs automatically.

2. Which of the following is NOT a typical Continuous Training (CT) trigger?

Nightly cron job to absorb new labeled data. PSI on a feature crosses 0.25. A developer pushing a lint-only change to the README. Online accuracy breaches the SLO threshold.

3. A shadow deployment differs from a canary in that it:

Ramps traffic in steps from 5% to 100% to the new version. Mirrors production traffic to the new model but does NOT return its responses to users. Only runs in staging and never sees production data. Is faster to roll back than a Git revert.

Section 3: Governance and Security

Model cards are versioned, immutable documents attached to every significant model version: identity, intended use and out-of-scope uses, training data summary, performance by subgroup, known failure modes, safety controls, and change history. They must be linked to deployments so auditors can trace which card was live on any given day. Datasheets for datasets play the same role for data: provenance, legal basis, schema, biases, preprocessing, retention, and usage constraints.

The EU AI Act classifies systems by risk and imposes proportionate obligations. Prohibited: social scoring, real-time public biometric ID, subliminal manipulation. High-risk: credit scoring, medical triage, employment screening, critical infrastructure — full risk management, technical docs, data governance, human oversight, post-market monitoring, EU registration. GPAI/Foundation: transparency, copyright, systemic-risk evaluation. Limited-risk: chatbot/deepfake disclosure. Minimal-risk: voluntary codes.

Adversarial robustness follows a program: threat model (white/gray/black box) → test with attack libraries → layer defenses (adversarial training, robust optimization, L2, input sanitization, anomaly detection) → continuous monitoring for error-rate spikes.

LLM-specific defenses against prompt injection must be layered: input filters at the boundary, output filters for toxicity/PII/self-harm, context isolation keeping system prompts away from user content and treating retrieved docs as untrusted, allow-listed tool access with least privilege, adversarial fine-tuning, and red-teaming. Monitor for repeated jailbreak attempts and unusual tool invocation patterns.

Data protection: PII detection + redaction/tokenization before logging; AES-256 at rest with KMS, TLS 1.2+ in transit; RBAC on the registry with separate training/serving service accounts; just-in-time prod elevation; immutable audit logs; secrets in HashiCorp Vault / AWS Secrets Manager with automated rotation; GDPR right-to-erasure must propagate to vector stores.

Key Points

Pre-Section Quiz: Governance and Security

1. Under the EU AI Act, a hospital's diagnostic triage model would fall into which risk class?

Minimal risk — voluntary codes of conduct only. Limited risk — chatbot-style transparency disclosure. High-risk — full technical docs, risk management, human oversight, EU registration. Prohibited — cannot be deployed in the EU.

2. Which prompt-injection defense is described as treating retrieved RAG documents as untrusted rather than as trusted system instructions?

Output filtering for toxicity. Context isolation. Just-in-time secret rotation. RBAC on the model registry.

3. Which statement about model cards is correct per the chapter?

They are written once at project inception and never updated. They are mutable working documents that anyone can edit live. They are versioned and immutable once published, and linked to the deployments they cover. They replace datasheets for datasets — only one artifact is needed.

Section 4: Operations and Reliability

SLOs and error budgets from classical SRE extend in three dimensions for ML. Operational: availability, p95 latency, error rate. Model quality: rolling AUC, calibration error, false-positive/negative rates. Safety: cap rate of policy-violating outputs (e.g., "no more than 0.1% of LLM responses flagged in any 30-day window"). Drift: cap PSI or Jensen-Shannon divergence on critical features. When the error budget burns down, deploys freeze and remediation takes priority.

Four runbooks cover the ML incident landscape:

Rollback options: Git revert + Argo CD reconcile; Argo Rollouts auto-abort during canary; blue-green flip via Service/Ingress; registry-based pointer change in a ConfigMap. Blameless post-mortems follow every significant incident with action items, runbook updates, and (where relevant) entries into the EU AI Act technical file.

Future workloads: LLMOps adds prompt versioning, response evaluation (LLM-as-a-judge), token-cost monitoring, per-tenant safety policies. RAG adds retrieval quality (recall@k, document relevance), chunking strategies, embedding drift, answer faithfulness. Agents add trace observability, tool privilege guardrails, and budget caps against runaway loops.

flowchart LR A[Define SLIs
latency, AUC, safety, PSI] --> B[Set SLOs
per dimension] B --> C[Compute Error Budget
100% - SLO] C --> D[Measure Live SLIs] D --> E{Budget
remaining?} E -- Yes --> F[Ship new release
spend budget] F --> D E -- No --> G[Freeze deploys
remediation only] G --> H[Post-mortem
+ runbook update] H --> D
flowchart TD A[Alert Fires
SLO / drift / safety / PII] --> B{Classify
incident type} B -- Ops / SLO --> C[General Model Runbook] B -- Safety / LLM --> D[Guardrail Runbook] B -- Privacy / PII --> E[Data Runbook] B -- Drift / Quality --> F[Quality Runbook] C --> G{Severity
high?} D --> G E --> G F --> G G -- Yes --> H[Switch to fallback
or rollback via Git] G -- No --> I[Throttle / mitigate] H --> J[Capture diagnostics
+ notify stakeholders] I --> J J --> K{Regulatory
notification?} K -- Yes --> L[DPO + EU AI Act filing] K -- No --> M[Blameless post-mortem] L --> M M --> N[Update runbooks
model cards, datasheets]

Key Points

Pre-Section Quiz: Operations and Reliability

1. An ML team's safety SLO states "no more than 0.1% of LLM responses flagged by the safety classifier in any 30-day window." What does the error budget represent?

The 99.9% of responses that must pass the classifier; the budget caps the 0.1% allowed shortfall. The total monthly OpenAI bill. The latency p95 threshold for the endpoint. The number of model versions allowed in production simultaneously.

2. A canary deployment shows error-rate regression at the 20% traffic step. What does Argo Rollouts do automatically?

Immediately escalates to 100% to flush out the bug. Aborts the rollout and leaves the stable ReplicaSet serving all traffic. Pages the entire engineering org regardless of severity. Deletes the previous version to force a rebuild.

3. A PII leakage incident has been confirmed. Which runbook is appropriate, per the chapter's four-runbook taxonomy?

General model runbook — switch to fallback and capture diagnostics. Quality/drift runbook — re-tune thresholds and retrain. Data/privacy runbook — contain, scope assessment, DPO coordination. Safety/LLM guardrail runbook — update filters and adversarial training.

Post-Session Review Quizzes

Post Quiz: Production Monitoring

1. A PSI value of 0.32 between the training reference window and the last 24 hours of feature data indicates:

Stable distribution — no action needed. Moderate drift — schedule a review next sprint. Significant drift — investigate and consider retraining. A bug in the PSI calculation — PSI cannot exceed 0.25.

2. A spam filter still receives email-shaped text, but spammers have changed tactics and the model's precision is collapsing. This is best described as:

Data drift (covariate shift). Concept drift (the input→output relationship changed). An operational latency issue. A schema validation failure.

3. A regulated bank wants drift detection plus explainability and detailed audit trails for examiners. Which platform best fits per the chapter?

Evidently OSS only, with Grafana alerting. Arize, for embedding-cluster RAG tracing. Fiddler, for regulated industries needing explainability + audit. whylogs profiles stored in S3 with no UI.
Post Quiz: CI/CD/CT

1. In a four-repo GitOps layout, what does Argo CD actually watch?

Only the model registry artifacts. Only the application code repo for Dockerfile changes. Only the GitOps env repos containing the Kubernetes manifests. All four repos simultaneously and merges PRs automatically.

2. Which of the following is NOT a typical Continuous Training (CT) trigger?

Nightly cron job to absorb new labeled data. PSI on a feature crosses 0.25. A developer pushing a lint-only change to the README. Online accuracy breaches the SLO threshold.

3. A shadow deployment differs from a canary in that it:

Ramps traffic in steps from 5% to 100% to the new version. Mirrors production traffic to the new model but does NOT return its responses to users. Only runs in staging and never sees production data. Is faster to roll back than a Git revert.
Post Quiz: Governance and Security

1. Under the EU AI Act, a hospital's diagnostic triage model would fall into which risk class?

Minimal risk — voluntary codes of conduct only. Limited risk — chatbot-style transparency disclosure. High-risk — full technical docs, risk management, human oversight, EU registration. Prohibited — cannot be deployed in the EU.

2. Which prompt-injection defense is described as treating retrieved RAG documents as untrusted rather than as trusted system instructions?

Output filtering for toxicity. Context isolation. Just-in-time secret rotation. RBAC on the model registry.

3. Which statement about model cards is correct per the chapter?

They are written once at project inception and never updated. They are mutable working documents that anyone can edit live. They are versioned and immutable once published, and linked to the deployments they cover. They replace datasheets for datasets — only one artifact is needed.
Post Quiz: Operations and Reliability

1. An ML team's safety SLO states "no more than 0.1% of LLM responses flagged by the safety classifier in any 30-day window." What does the error budget represent?

The 99.9% of responses that must pass the classifier; the budget caps the 0.1% allowed shortfall. The total monthly OpenAI bill. The latency p95 threshold for the endpoint. The number of model versions allowed in production simultaneously.

2. A canary deployment shows error-rate regression at the 20% traffic step. What does Argo Rollouts do automatically?

Immediately escalates to 100% to flush out the bug. Aborts the rollout and leaves the stable ReplicaSet serving all traffic. Pages the entire engineering org regardless of severity. Deletes the previous version to force a rebuild.

3. A PII leakage incident has been confirmed. Which runbook is appropriate, per the chapter's four-runbook taxonomy?

General model runbook — switch to fallback and capture diagnostics. Quality/drift runbook — re-tune thresholds and retrain. Data/privacy runbook — contain, scope assessment, DPO coordination. Safety/LLM guardrail runbook — update filters and adversarial training.

Your Progress

Answer Explanations