Chapter 12 — SLOs, Alerting, and Operational Excellence
Capstone chapter. From "we collect telemetry" to "we run services to a contract."
Learning Objectives
Define SLIs and SLOs using PromQL and OpenTelemetry-sourced metrics (availability, latency, freshness, correctness).
Build actionable, low-noise alerting based on multi-window multi-burn-rate logic and error-budget policy.
Configure Alertmanager routing, grouping, inhibition, and silencing to reduce on-call fatigue without losing real signal.
Design a complete observability platform that evolves from greenfield through a mature production deployment, including capacity planning for the platform itself.
Recognize emerging trends — continuous profiling, AI-assisted RCA, and semantic convention convergence — and decide when to invest in them.
Pre-Quiz — Half 1 (Sections 1 & 2)
1. For a service with a 99.9% availability SLO measured over a 30-day window, approximately how much "bad" time is the error budget?
A. About 4.3 minutes per month
B. About 43 minutes per month
C. About 7.2 hours per month
D. About 30 minutes per month (one per day)
2. A good Service Level Indicator (SLI) is best expressed as:
A. The raw count of 5xx responses returned by the service
B. A snapshot of CPU and memory utilization on the host
C. A ratio of "good events" to "valid events" from the user's perspective
D. The wall-clock time since the last deploy
3. If your service is burning error budget at rate B = 14.4 against a 30-day SLO window, roughly how long until the budget is fully exhausted?
A. ~2 days
B. ~14 hours
C. ~30 days (by definition)
D. Already gone — 14.4 is a per-minute exhaustion rate
4. Why does the canonical MWMBR alert pair a short (e.g. 5m) window with a long (e.g. 1h) window?
A. The short window fires fast; the long window prevents one-off blips from paging anyone
B. Prometheus cannot evaluate a single window faster than 1h
C. Alertmanager requires two windows for HA quorum
D. It doubles the recording-rule cost intentionally so the alert is more sensitive
5. Which of the following is the most common anti-pattern in alerting that this chapter warns against?
A. Linking a runbook URL on every page
B. Paging on raw causes (CPU > 90%) instead of on user-visible symptoms
C. Using a for: clause on alert rules
D. Inhibiting per-pod alerts when a node-down alert is firing
1. Service Level Indicators and Objectives
A production service is a promise to its users. SLIs measure how well you keep it, SLOs are the targets you commit to, and the error budget is the contractually allowed slack between perfect and "good enough."
Choosing Good SLIs
A good SLI is a ratio between good events and valid events — e.g. "the fraction of HTTP requests that returned a non-5xx within 300 ms" — measured from the user's perspective. Ratios normalize naturally with traffic: 1% errors at 10 RPS is the same as 1% errors at 10,000 RPS.
Analogy: SLIs are the dashboard of a car. Availability = "does the engine start?"; latency = "how fast can it accelerate?"; freshness = "how old is the GPS reading?"; correctness = "does the odometer match what we drove?"
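The good/valid ratio can be sketched in a few lines of Python. This is a minimal illustration rather than production code: the `Request` record, the `sli` function, and the choice to treat an empty window as fully compliant are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int          # HTTP status code
    duration_ms: float   # server-side latency in milliseconds


def sli(requests: list[Request], latency_limit_ms: float = 300.0) -> float:
    """Fraction of valid requests that were good (non-5xx and within the limit)."""
    valid = len(requests)
    if valid == 0:
        return 1.0  # assumption: no traffic counts as meeting the SLI
    good = sum(
        1 for r in requests
        if r.status < 500 and r.duration_ms <= latency_limit_ms
    )
    return good / valid


# The same 1% failure rate at two very different traffic levels
low_traffic = [Request(200, 120.0)] * 99 + [Request(503, 80.0)]
high_traffic = low_traffic * 1000

print(sli(low_traffic), sli(high_traffic))
```

Both calls return the same ratio, which is the traffic-normalizing property described above.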
Figure 12.1: SLI → SLO → Error Budget → Burn Rate
graph TD
SLI[Service Level Indicator good events / valid events e.g. non-5xx requests under 300ms]
SLO[Service Level Objective target on the SLI e.g. 99.9% over 30 days]
EB[Error Budget = 1 - SLO e.g. 0.1% ~ 43.2 min / month]
BR[Burn Rate observed_error_fraction / error_budget how fast budget is consumed]
A[Burn-Rate Alert fires when consumption is unsustainable page or ticket by severity]
SLI --> SLO
SLO --> EB
EB --> BR
BR --> A
Error Budgets
Once you commit to an SLO, the arithmetic is trivial: error_budget = 1 - SLO. For 99.9% over 30 days: 0.001 × 43,200 minutes = 43.2 minutes/month. That number, not the percentage, is the most useful artifact in your program: it converts an abstraction into a budget engineers and PMs can reason about. If you have burned 30 minutes this month, about 13 remain, which immediately changes deployment risk decisions.
| SLO | Bad fraction | 30-day budget | 90-day budget |
|---|---|---|---|
| 99% | 1.0% | 7 h 12 m | 21 h 36 m |
| 99.5% | 0.5% | 3 h 36 m | 10 h 48 m |
| 99.9% | 0.1% | 43.2 m | 2 h 9.6 m |
| 99.95% | 0.05% | 21.6 m | 1 h 4.8 m |
| 99.99% | 0.01% | 4.32 m | 12.96 m |
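The budget figures above follow directly from error_budget = 1 - SLO multiplied by the window length. A small sketch, with `error_budget_minutes` as a hypothetical helper name, reproduces them for any SLO and window:

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed bad minutes: (1 - SLO) times the window length in minutes."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


for slo in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    row = [round(error_budget_minutes(slo, days), 2) for days in (30, 90)]
    print(f"{slo:.2%}: {row[0]} min / 30 d, {row[1]} min / 90 d")
```

Working in minutes rather than percentages keeps the result in the unit that deployment and incident discussions actually use.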
Animation A — Error Budget Burn (99.9% / 30-day)
A 99.9% SLO allots ~43 minutes of bad time per month. As incidents accrue, the bar depletes; at 50% you ticket, at 100% you freeze risky deploys until the window resets.
Burn Rate and Multi-Window Multi-Burn-Rate
The burn rate B = observed_error_fraction / error_budget answers the only question that matters: at the current pace, when do we run out? A budget burning at a constant rate B lasts window / B, so at B = 14.4 a 30-day budget is gone in 30 / 14.4 ≈ 2.1 days.
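As a worked example of the burn-rate definition, the sketch below computes B and the time to exhaustion. The function names are illustrative, and the calculation assumes the burn rate stays constant and the budget starts full.

```python
def burn_rate(error_fraction: float, slo: float) -> float:
    """Observed error fraction divided by the error budget (1 - SLO)."""
    return error_fraction / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# 1.44% of requests failing against a 99.9% SLO
b = burn_rate(error_fraction=0.0144, slo=0.999)
print(round(b, 1), round(days_to_exhaustion(b), 2))
```

A burn rate of exactly 1 spends the budget in exactly one SLO window; anything above 1 is unsustainable over the long run.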
The naïve "alert when error rate > threshold" approach pages on transient blips and misses slow erosion. The Google SRE workbook's MWMBR pattern fixes both by requiring agreement across two windows:
| Alert | Short window | Long window | Burn rate | Severity |
|---|---|---|---|---|
| Fast-burn page | 5 m | 1 h | 14.4 | page |
| Medium-burn page | 30 m | 6 h | 6 | page |
| Slow-burn ticket | 2 h | 24 h | 3 | ticket |
| Slowest-burn ticket | 6 h | 3 d | 1 | ticket |
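The burn-rate thresholds in the table are not arbitrary: each is the rate that would spend a chosen share of the 30-day budget within the long window. The sketch below back-computes them. The budget shares (2%, 5%, 10%, 10%) are assumptions derived from the table's burn rates, not values stated in this chapter, and `burn_rate_threshold` is a hypothetical helper.

```python
SLO_WINDOW_HOURS = 30 * 24  # 30-day SLO window


def burn_rate_threshold(budget_fraction: float, long_window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget within the long window."""
    return budget_fraction * SLO_WINDOW_HOURS / long_window_hours


# (share of budget consumed, long window in hours); the shares are assumptions
TIERS = {
    "fast-burn page": (0.02, 1),
    "medium-burn page": (0.05, 6),
    "slow-burn ticket": (0.10, 24),
    "slowest-burn ticket": (0.10, 72),
}

for name, (fraction, hours) in TIERS.items():
    print(f"{name}: {burn_rate_threshold(fraction, hours):.1f}x")
```

Reading the table this way makes tuning easier: to page earlier, lower the share of budget you are willing to lose before a human is told.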
# Recording rules: compute once, reuse everywhere (group name is illustrative)
groups:
  - name: slo-checkout
    rules:
      - record: slo:http_errors:ratio_rate5m
        expr: |
          sum by (service, env) (rate(http_requests_total{code=~"5.."}[5m]))
          /
          sum by (service, env) (rate(http_requests_total[5m]))
      # Same ratio over the long window; the alert below needs both series
      - record: slo:http_errors:ratio_rate1h
        expr: |
          sum by (service, env) (rate(http_requests_total{code=~"5.."}[1h]))
          /
          sum by (service, env) (rate(http_requests_total[1h]))
      # Fast-burn MWMBR alert: 14.4x burn against a 0.1% error budget
      - alert: SLOErrorBudgetBurnFast
        expr: |
          (
            slo:http_errors:ratio_rate5m{service="checkout"} > (14.4 * 0.001)
            and
            slo:http_errors:ratio_rate1h{service="checkout"} > (14.4 * 0.001)
          )
          and
          sum by (service, env) (rate(http_requests_total{service="checkout"}[5m])) > 1
        for: 2m
        labels: { severity: page, slo: availability-99.9-30d, team: payments }
        annotations:
          summary: "Checkout burning error budget at >14.4x in {{ $labels.env }}"
          runbook_url: "https://runbooks.example.com/checkout/slo-fast-burn"
Two production details: the sum(rate(...)) > 1 gate kills spurious 100% ratios in low-traffic windows; latency SLOs reuse the same machinery with a histogram le-bucket numerator.
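To illustrate the latency case, the sketch below computes a good/valid ratio from cumulative histogram buckets, where each bucket counts requests at or below its upper bound `le`. The bucket boundaries and counts are made-up sample data, and `latency_sli` is a hypothetical helper; the point is only that the numerator is the bucket at the SLO threshold and the denominator is the `+Inf` bucket.

```python
def latency_sli(cumulative_buckets: dict[float, float], threshold_s: float) -> float:
    """Good/valid ratio from cumulative histogram buckets keyed by upper bound `le`."""
    if threshold_s not in cumulative_buckets:
        raise KeyError(f"no bucket boundary at {threshold_s}s; pick an existing le")
    total = cumulative_buckets[float("inf")]  # the +Inf bucket counts every request
    if total == 0:
        return 1.0  # assumption: no traffic counts as meeting the SLI
    return cumulative_buckets[threshold_s] / total


# Sample data: 990 of 1000 requests finished within 300 ms
buckets = {0.1: 700.0, 0.3: 990.0, 1.0: 998.0, float("inf"): 1000.0}

good_ratio = latency_sli(buckets, 0.3)
error_ratio = 1.0 - good_ratio
print(round(good_ratio, 4), round(error_ratio, 4))
```

One design consequence: the SLO threshold has to coincide with a bucket boundary, so histogram buckets should be chosen with the latency target in mind.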
Animation B — Multi-Window Multi-Burn-Rate (MWMBR)
A real outage spikes both the short and long windows past the threshold simultaneously. A transient blip only spikes the 5m window, so it never pages. The 6h window catches slow erosion that the short windows would miss.
Figure 12.2: MWMBR Alert Escalation
graph TD
ER[Observed Error Ratio recording rules 5m, 30m, 1h, 2h, 6h, 24h, 3d]
F1{5m AND 1h above 14.4x burn?}
M1{30m AND 6h above 6x burn?}
S1{2h AND 24h above 3x burn?}
SS1{6h AND 3d above 1x burn?}
P1[Fast-Burn PAGE ~2 days to exhaustion]
P2[Medium-Burn PAGE ~5 days]
T1[Slow-Burn TICKET ~10 days]
T2[Slowest-Burn TICKET exhaustion in window]
OK[No alert — budget healthy]
ER --> F1
F1 -->|yes| P1
F1 -->|no| M1
M1 -->|yes| P2
M1 -->|no| S1
S1 -->|yes| T1
S1 -->|no| SS1
SS1 -->|yes| T2
SS1 -->|no| OK
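The escalation ladder in Figure 12.2 can be sketched as a small classifier. This is an illustration under stated assumptions: the error ratios per window are hand-written sample values standing in for recording-rule outputs, and `Tier`, `TIERS`, and `classify` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Tier(NamedTuple):
    name: str
    short: str
    long: str
    burn_rate: float
    severity: str


TIERS = [
    Tier("fast-burn", "5m", "1h", 14.4, "page"),
    Tier("medium-burn", "30m", "6h", 6.0, "page"),
    Tier("slow-burn", "2h", "24h", 3.0, "ticket"),
    Tier("slowest-burn", "6h", "3d", 1.0, "ticket"),
]


def classify(error_ratio: dict[str, float], slo: float = 0.999) -> Optional[Tier]:
    """Return the first tier whose short AND long windows both exceed its threshold."""
    budget = 1.0 - slo
    for tier in TIERS:
        threshold = tier.burn_rate * budget
        if error_ratio[tier.short] > threshold and error_ratio[tier.long] > threshold:
            return tier
    return None


# A transient blip: only the shortest windows are elevated
blip = {"5m": 0.05, "30m": 0.009, "1h": 0.004, "2h": 0.002,
        "6h": 0.0007, "24h": 0.0002, "3d": 0.0001}
# A real outage: short and long windows are both above 14.4x
outage = {"5m": 0.05, "30m": 0.04, "1h": 0.03, "2h": 0.015,
          "6h": 0.005, "24h": 0.0013, "3d": 0.0004}
# Slow erosion: a steady 0.2% error rate that never spikes
erosion = {"5m": 0.002, "30m": 0.002, "1h": 0.002, "2h": 0.002,
           "6h": 0.002, "24h": 0.0015, "3d": 0.0012}

for sample in (blip, outage, erosion):
    tier = classify(sample)
    print(tier.name if tier else "no alert")
```

The blip produces no alert, the outage lands on the fast-burn page, and the erosion case is caught only by the slowest tier.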
Key Takeaway — Section 1
SLIs are good/valid ratios, measured from the user's perspective.
error_budget = 1 − SLO. For 99.9% over 30 days that is ~43.2 minutes.
Burn rate B = observed_error_fraction / error_budget answers "how soon do we run out?"
MWMBR pairs short + long windows so you page on real spikes and catch slow erosion, without crying wolf.
2. Alerting Architecture
A well-tuned SLO is wasted if it lands in a flood of noise at 3 a.m. The alerting architecture — how alerts route, group, inhibit, and reach humans — decides whether on-call is sustainable.
Alertmanager: Grouping, Routing, Inhibition
Grouping collapses related alerts into one notification. group_by: [alertname, service, severity, env] produces one notification per actual incident; grouping by pod or container floods you during every rolling deploy.
Routing trees branch by severity, env, and team ownership. Critical prod → PagerDuty; warnings → Slack channel; info → low-priority or nowhere.
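Inhibition suppresses symptom alerts when a known root-cause alert is firing (e.g. a node-down alert mutes every per-pod alert on that node). Unlike a silence, it needs no human action and lifts as soon as the root-cause alert resolves.
For HA, run three Alertmanager replicas with gossip clustering. The replicas share silences and the notification log, so restarting one neither drops nor duplicates notifications for active alerts.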
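Grouping and inhibition are easier to reason about with a toy model. The sketch below is a simplification, not Alertmanager's implementation: alerts are plain label dictionaries, and `group_alerts` and `inhibit` are hypothetical functions that mimic the effect of `group_by` and of an inhibition rule with source matchers, target matchers, and a list of labels that must be equal.

```python
from collections import defaultdict

GROUP_BY = ("alertname", "service", "severity", "env")


def group_alerts(alerts: list[dict[str, str]]) -> dict[tuple[str, ...], list[dict[str, str]]]:
    """Bucket alerts by their GROUP_BY label values; one notification per bucket."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in GROUP_BY)
        groups[key].append(labels)
    return dict(groups)


def inhibit(alerts, source_match, target_match, equal):
    """Drop target alerts that share the `equal` labels with a firing source alert."""
    def matches(labels, matcher):
        return all(labels.get(k) == v for k, v in matcher.items())

    sources = [a for a in alerts if matches(a, source_match)]
    kept = []
    for alert in alerts:
        suppressed = matches(alert, target_match) and any(
            all(alert.get(k) == s.get(k) for k in equal) for s in sources
        )
        if not suppressed:
            kept.append(alert)
    return kept


# Three pod alerts from one service: two on node-1, one on node-2
pods = [
    {"alertname": "PodNotReady", "service": "checkout", "severity": "warning",
     "env": "prod", "node": f"node-{n}", "pod": f"checkout-{i}"}
    for i, n in enumerate((1, 1, 2))
]
node_down = {"alertname": "NodeDown", "severity": "critical",
             "env": "prod", "node": "node-1"}

print(len(group_alerts(pods)))
remaining = inhibit(pods + [node_down], {"alertname": "NodeDown"},
                    {"alertname": "PodNotReady"}, ["node"])
print([a["alertname"] for a in remaining])
```

Because `pod` is not in the grouping key, the three pod alerts collapse into one notification, and with the node-down alert firing only the pod on the healthy node survives inhibition.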
On-Call Ergonomics
Every page must be actionable: the on-call engineer should know within 60 seconds what to do. The hard rule is no page without a runbook URL. A runbook entry answers:
How do I confirm this is real? (PromQL, dashboard, log query)
What is the most likely cause?
What are the safe remediation steps?
When do I escalate, and to whom?
What is the rollback criterion?
Anti-Patterns
Paging on causes (CPU > 90%) instead of symptoms (users see errors). A CPU at 99% with happy users is not an incident; the same CPU at 60% during a brownout is. Anchor pages on the SLI ladder.
Static thresholds on dynamic systems — "latency > 500 ms" pages every Black Friday. Burn-rate alerts are the cure.
Alerts without owners — every rule must carry a team label.
Per-pod alerts in Kubernetes — alert on Deployment/Service aggregates, not individual replicas.
Unbounded silences — always set durations and require a ticket comment.
No for: clause — without it every blip becomes a page.
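Several of these anti-patterns can be caught mechanically before a rule ever reaches production. The sketch below lints an alert rule that has already been parsed into a dictionary. It is a rough illustration: `lint_rule` is a hypothetical helper, the `severity: page` convention follows the rule shown earlier in this chapter, and the per-pod check is a crude text heuristic rather than a PromQL parser.

```python
import re

BY_CLAUSE = re.compile(r"\bby\s*\(([^)]*)\)")


def lint_rule(rule: dict) -> list[str]:
    """Return the anti-patterns from the list above that a parsed alert rule exhibits."""
    problems = []
    labels = rule.get("labels", {})
    annotations = rule.get("annotations", {})
    if not rule.get("for"):
        problems.append("no for: clause")
    if "team" not in labels:
        problems.append("no owner (team label)")
    if labels.get("severity") == "page" and "runbook_url" not in annotations:
        problems.append("page without runbook_url")
    # Crude heuristic: any `by (...)` aggregation that keeps the pod label
    for clause in BY_CLAUSE.findall(rule.get("expr", "")):
        if "pod" in (label.strip() for label in clause.split(",")):
            problems.append("per-pod alert")
            break
    return problems


cpu_page = {
    "alert": "HighCPU",
    "expr": "sum by (pod) (rate(container_cpu_usage_seconds_total[5m])) > 0.9",
    "labels": {"severity": "page"},
}
print(lint_rule(cpu_page))
```

Run as a CI step over every rule file, a check like this turns the anti-pattern list into an enforced policy rather than a convention.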
Figure 12.3: Alertmanager Routing with Grouping and Inhibition
flowchart TD
P[Prometheus alert rules fire] --> AM[Alertmanager HA cluster x3]
AM --> G[Group by alertname, service, severity, env]
G --> I{Inhibition rules root-cause active?}
I -->|suppressed| X[Drop symptom alert]
I -->|allowed| R{Route by severity}
R -->|critical| RE{env?}
R -->|warning| SW[Slack #alerts-warnings repeat 12h]
R -->|info| SI[Slack #alerts-info repeat 24h]
RE -->|prod| RT{team?}
RE -->|staging or dev| SN[Slack #nonprod]
RT -->|payments| PDP[PagerDuty payments rotation]
RT -->|platform| PDF[PagerDuty platform rotation]
PDP --> RB[Runbook URL + dashboard link + silence controls]
PDF --> RB
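The branches of Figure 12.3 can be written down as a plain function, which is a convenient way to unit-test routing intent before translating it into an Alertmanager routing tree. The receiver strings and the catch-all default below are assumptions for illustration; only the branch structure comes from the figure.

```python
def route(labels: dict[str, str]) -> str:
    """Return the receiver for an alert, following the branches of Figure 12.3."""
    severity = labels.get("severity")
    if severity == "warning":
        return "slack:#alerts-warnings"
    if severity == "info":
        return "slack:#alerts-info"
    if severity == "critical":
        if labels.get("env") != "prod":
            return "slack:#nonprod"
        team = labels.get("team")
        if team in ("payments", "platform"):
            return f"pagerduty:{team}"
    # Assumption: anything that matches no branch goes to a catch-all receiver
    return "slack:#alerts-unrouted"


print(route({"severity": "critical", "env": "prod", "team": "payments"}))
print(route({"severity": "critical", "env": "staging", "team": "payments"}))
print(route({"severity": "warning", "env": "prod", "team": "platform"}))
```

The explicit fallback matters: a critical production alert with a missing or misspelled team label should land somewhere visible instead of disappearing.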
Key Takeaway — Section 2
Grouping + inhibition + severity-aware routing is what makes on-call humane.
Every page must link a runbook and point at a user-visible symptom, never a raw cause metric.
Run Alertmanager HA (3 replicas, gossip) to avoid duplicate or dropped notifications across restarts.
Track alerts/incident, % pages after-hours, and % alerts with runbooks. Any number trending wrong is fixable.
Pre-Quiz — Half 2 (Sections 3 & 4)
6. In the six-layer reference architecture, what is "meta-observability"?
A. A second Grafana dashboard for business KPIs
B. A small, segregated Prometheus + Alertmanager that monitors the observability stack itself
C. The OTel Collector's internal metrics endpoint
D. A vendor-specific term for trace metrics
7. Which Prometheus meta-metric is the #1 outage cause for the observability stack itself?
A. prometheus_target_scrape_pool_sync_total failures
B. prometheus_rule_evaluation_duration_seconds
C. prometheus_tsdb_head_series — cardinality runaway
D. up{job="prometheus"} flipping to 0
8. The chapter describes migrating from Prometheus-only stacks as "wrap, don't replace." What does that mean in practice?
A. Wrap every Prometheus alert in a Python script before sending it to PagerDuty
B. Keep Prometheus where it works; add OTel Collectors as the entry point for new signals (using prometheusremotewrite and the prometheus receiver)
C. Tunnel Prometheus traffic through a service mesh sidecar before scraping
D. Rip out Prometheus and replace it with the OTel Collector entirely on day one
9. What unique question does continuous profiling answer that metrics, logs, and traces cannot?
A. "Is something wrong?"
B. "Where in the request path is the slowness?"
C. "Exactly which code is consuming resources, and how did it change?"
D. "What error message did the user see?"
10. Why does the chapter argue that investing in OTel semantic-convention conformance pays off strategically?
A. It's required by Prometheus 3.x or scraping silently breaks
B. It shrinks vendor lock-in, enables fleet-wide queries, and makes AI-assisted RCA more useful
C. It reduces Prometheus memory usage by ~50% per series
D. It is the only way to enable inhibition rules in Alertmanager
3. Putting It All Together
You have all the pieces. The job now is integration — a platform that holds together at scale, can roll out incrementally, and is itself observable.
Reference Architecture — Six Logical Layers
1. Instrumentation: OTel SDKs + auto-instrumentation, emitting OTLP for metrics, traces, logs, and profiles.
2. Collection: OTel Collector DaemonSets + fan-in Deployments; Prometheus via the Operator for scrape targets (kube-state-metrics, node-exporter, cAdvisor).
3. Storage: Prometheus TSDB (short-term) → Thanos/Mimir/Cortex (long-term); Tempo/Jaeger for traces; Loki/Elasticsearch for logs; Pyroscope/Parca for profiles.
4. Rules & Alerting: recording rules, MWMBR alerting rules, and an HA Alertmanager cluster.
5. Visualization: Grafana with datasources across every signal store.
6. Meta-observability: a small, segregated Prometheus + Alertmanager whose only job is monitoring the observability stack itself.
Figure 12.4: Reference Observability Platform on Kubernetes
flowchart LR
subgraph L1[1. Instrumentation]
APP[Application Pods OTel SDKs auto-instrumentation]
end
subgraph L2[2. Collection]
DS[OTel Collector DaemonSet]
DEP[OTel Collector Deployment fan-in]
PROM[Prometheus Operator ServiceMonitors]
end
subgraph L3[3. Storage]
TSDB[Prometheus TSDB]
LTM[Thanos / Mimir]
TR[Tempo / Jaeger]
LG[Loki]
PR[Pyroscope / Parca]
end
subgraph L4[4. Rules & Alerting]
RR[Recording Rules]
AR[Alerting Rules MWMBR]
AM[Alertmanager HA]
end
subgraph L5[5. Visualization]
GR[Grafana]
end
subgraph L6[6. Meta-Observability]
WD[Watchdog Prometheus + Alertmanager]
end
APP --> DS
APP --> DEP
APP --> PROM
DS --> TSDB
DS --> TR
DS --> LG
DEP --> LTM
PROM --> TSDB
TSDB --> LTM
TSDB --> RR
RR --> AR
AR --> AM
TSDB --> GR
LTM --> GR
TR --> GR
LG --> GR
PR --> GR
AM --> GR
WD -.watches.-> L2
WD -.watches.-> L3
WD -.watches.-> L4
Greenfield Rollout
| Week | Action | Outcome |
|---|---|---|
| 1 | Deploy kube-prometheus-stack with default rules | Cluster metrics, basic alerts, Grafana |
| 2 | OTel Collector DaemonSet → Prometheus + Tempo | First auto-instrumented traces |
| 3 | Define 2–3 critical SLOs with MWMBR alerts | First user-anchored pages |
| 4 | Alertmanager routing tree + inhibition + runbooks | Reduced noise; team ownership |
| 5–6 | Long-term store via remote-write | Multi-month retention |
| 7–8 | Logs (Loki) + continuous profiling (Pyroscope) | Full four-signal stack |
Migration from Prometheus-only stacks follows the "wrap, don't replace" rule. Keep Prometheus where it works (scrape-based infra, alerting); add OTel Collectors as the entry point for new signals. The Collector's prometheusremotewrite exporter ships OTel metrics into Prometheus/Mimir, and the prometheus receiver scrapes existing exporters — a unified pipeline without a forklift migration.
Meta-Observability: Capacity Planning the Platform
| Component | Meta-metric | Why it matters |
|---|---|---|
| Prometheus | prometheus_rule_evaluation_duration_seconds | Rule lag = late alerts |
| Prometheus | prometheus_tsdb_head_series | Cardinality runaway = #1 outage cause |
| Alertmanager | alertmanager_notifications_failed_total | Pages may not reach humans |
| Alertmanager | alertmanager_cluster_members | Cluster peers lost → duplicate notifications |
| OTel Collector | otelcol_exporter_queue_size | Backpressure → telemetry loss |
| OTel Collector | otelcol_processor_dropped_spans | Sampling working too aggressively |
Sizing rules of thumb:
Prometheus memory: ~3 KB per active series in steady state; +25% headroom for query bursts.
Tempo/Jaeger: head-based sampling at 1–10%, plus tail-based sampling for errors/slow traces.
OTel Collector: 1 GB RAM per ~20k spans/sec with batching; CPU scales linearly with attribute count.
Loki: log volume dominates. WARN+ in prod is sustainable; DEBUG in prod is not.
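The first and third rules of thumb translate into a back-of-the-envelope estimator. The sketch below simply encodes the numbers quoted above (about 3 KB per active series plus 25% headroom, and roughly 1 GB per 20k spans/sec); the function names are hypothetical, the units are decimal (1 GB = 1,000,000 KB), and real sizing should be validated against the meta-metrics of a running system.

```python
def prometheus_memory_gb(active_series: int, kb_per_series: float = 3.0,
                         headroom: float = 0.25) -> float:
    """Steady-state memory estimate in GB, including headroom for query bursts."""
    return active_series * kb_per_series * (1 + headroom) / 1_000_000


def collector_memory_gb(spans_per_sec: float, spans_per_gb: float = 20_000) -> float:
    """Collector RAM estimate in GB, assuming batching is enabled."""
    return spans_per_sec / spans_per_gb


# Example: 2 million active series and 50k spans per second
print(round(prometheus_memory_gb(2_000_000), 2))
print(round(collector_memory_gb(50_000), 2))
```

Feeding `prometheus_tsdb_head_series` into an estimate like this gives an early warning before cardinality growth turns into an out-of-memory restart.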
Key Takeaway — Section 3
Six layers: instrument → collect → store → rule → visualize → observe itself.
Roll out incrementally; the Operator + Collector keep configuration as code (GitOps).
"Wrap, don't replace" Prometheus when introducing OpenTelemetry.
Watchdog Prometheus + Alertmanager monitor the monitors — a few hundred lines of YAML.
4. Where Observability Is Heading
The fundamentals — SLOs, MWMBR, Alertmanager hygiene — have been stable for nearly a decade. The interesting changes are at the edges.
Profiles as a Fourth Signal
Three pillars (metrics, logs, traces) become four. Continuous profiling — always-on stack-trace sampling at 50–200 Hz with 1–3% overhead — is the fourth.
| Signal | Question |
|---|---|
| Metrics | Is something wrong? |
| Logs | What happened? |
| Traces | Where in the request path? |
| Profiles | Exactly which code is consuming resources? |
Example: metrics show p99 rising from 200 → 500 ms after deploy. Traces narrow it to CalculateDiscounts spans. Profiles reveal 40% of CPU is in a new calculate_rewards_v2 function, dominated by hashmap operations. Without profiles you have a suspect; with profiles you have the exact lines of code.
Grafana Pyroscope (Prometheus-like labels, multi-language SDKs, eBPF) and Parca / Polar Signals (Kubernetes-native eBPF DaemonSet) lead the OSS space. The maturing OpenTelemetry profiling signal adds profiles as a first-class OTLP signal sharing resource attributes and trace/span IDs.
Animation C — The Four Pillars Converge Under OTel
Three pillars stand. The profiles signal slides in from the right as the fourth, answering "exactly which lines of code?" with flame-graph data. All four converge under shared OTel semantic-convention attributes that make signals portable across vendors and analyzable by AI assistants.
AI-Assisted Root Cause Analysis
Vendors and OSS projects are building systems that:
Correlate signals automatically when an SLO burn alert fires — matching traces, topology, profiles, recent commits, feature flags.
Summarize hypotheses — an LLM drafts "what changed?" and "where is load concentrated?" for the on-call to edit.
Suggest queries in natural language, grounded in the available metric and log schemas.
Two prerequisites: OTel semantic conventions for vocabulary, and structured runbooks for remediation playbooks. This is not a replacement for engineering judgment — it removes the manual query-writing tax during incidents.
Semantic Convention Convergence
The longest-running success of OTel is that vendors converge on the same names: service.name, http.response.status_code, db.system, k8s.pod.name, deployment.environment. Practical implications:
Vendor lock-in shrinks — switching APMs becomes a Collector reconfiguration.
Cross-team analytics work — fleet-wide SLO compliance from a single query.
AI assistants become more useful — standard names mean cross-org generalization.
Strategic takeaway: invest in conformance now. Add CI checks that reject non-standard attribute names; require OTel resource attributes via the Collector's resourcedetection processor.
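A CI conformance check can start very small. The sketch below flags attribute names that are not on an allowlist. It is only an illustration: the allowlist holds the five names quoted in this section, whereas a real check would presumably load the full set of semantic-convention names for the version you pin, and `nonstandard_attributes` is a hypothetical helper.

```python
STANDARD_ATTRIBUTES = frozenset({
    "service.name",
    "http.response.status_code",
    "db.system",
    "k8s.pod.name",
    "deployment.environment",
})


def nonstandard_attributes(attributes: dict[str, object],
                           allowed: frozenset = STANDARD_ATTRIBUTES) -> list[str]:
    """Return attribute names that are not on the allowlist, sorted for stable output."""
    return sorted(name for name in attributes if name not in allowed)


span_attributes = {
    "service.name": "checkout",
    "http.response.status_code": 500,
    "httpStatus": 500,   # ad-hoc duplicate of a standard attribute
    "svc": "checkout",   # ad-hoc abbreviation
}

violations = nonstandard_attributes(span_attributes)
print(violations)
# In CI, a non-empty list would fail the build, e.g. via sys.exit(1)
```

Rejecting names at review time is far cheaper than renaming attributes after dashboards, alerts, and queries already depend on them.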
Figure 12.5: The Four Pillars and the Path Forward
graph TD
subgraph PILLARS[Signals of Modern Observability]
M[Metrics Is something wrong? Prometheus, OTLP]
L[Logs What happened? Loki, Elasticsearch]
T[Traces Where in the request path? Tempo, Jaeger]
PF[Profiles Which lines of code? Pyroscope, Parca]
end
SC[OTel Semantic Conventions service.name, http.response.status_code, k8s.pod.name, deployment.environment]
AI[AI-Assisted RCA correlate signals, summarize hypotheses, suggest queries from runbooks]
INC[Incident faster MTTD & MTTR portable across vendors]
M --> SC
L --> SC
T --> SC
PF --> SC
SC --> AI
AI --> INC
Key Takeaway — Section 4
Profiles become the fourth signal — answering "which lines of code?" that metrics, logs, and traces cannot.
AI-assisted RCA accelerates incident response when grounded in standardized telemetry and structured runbooks.
Semantic-convention convergence erodes vendor lock-in. Conformance is a strategic investment, enforced in CI.
Build the foundation on OTel today; future capabilities will plug into the same pipeline without re-platforming.
Chapter Summary & Key Terms
You've crossed the bridge from "we collect telemetry" to "we operate to a contract." SLIs measure user-visible behavior as good/valid ratios. SLOs are targets on SLIs; the complement, 1 − SLO, is the error budget. MWMBR alerts page only when the budget is genuinely at risk. Alertmanager turns alerts into humane notifications via grouping, severity-aware routing, inhibition, runbook-linked annotations, and HA clustering. A reference architecture layers instrumentation, collection, storage, rules, visualization, and meta-observability. The next frontier is continuous profiling, AI-assisted RCA, and semantic-convention convergence, all of which plug into the OTel foundation you build today.
| Term | Definition |
|---|---|
| SLI | Service Level Indicator: a ratio of good to valid events reflecting user experience. |
| SLO | Service Level Objective: target value/range for an SLI over a defined window. |
| Error budget | Allowed "bad" behavior, computed as 1 − SLO. ~43 min/mo for 99.9%. |
| Burn rate | Ratio of observed error fraction to error budget; B = 14.4 exhausts a 30-day budget in ~2 days. |
| MWMBR | Multi-window multi-burn-rate: short + long windows must both cross to fire. |
| Inhibition rule | Alertmanager rule suppressing symptom alerts while a root-cause alert is firing. |
| Runbook | Operational doc linked from every page; confirm, cause, remediate, escalate, rollback. |
| Meta-observability | Monitoring the observability platform itself via a segregated stack. |
| Continuous profiling | Always-on stack-trace sampling at 50–200 Hz with 1–3% overhead; the fourth signal. |
| OTel semantic conventions | Standardized attribute names enabling cross-vendor portability and AI tooling. |