Study Guide: Chapter 12 — SLOs, Alerting, and Operational Excellence

Capstone chapter. From "we collect telemetry" to "we run services to a contract."

Pre-Quiz — Half 1 (Sections 1 & 2)

1. For a service with a 99.9% availability SLO measured over a 30-day window, approximately how much "bad" time is the error budget?

A. About 4.3 minutes per month

B. About 43 minutes per month

C. About 7.2 hours per month

D. About 30 minutes per month (one per day)

2. A good Service Level Indicator (SLI) is best expressed as:

A. The raw count of 5xx responses returned by the service

B. A snapshot of CPU and memory utilization on the host

C. A ratio of "good events" to "valid events" from the user's perspective

D. The wall-clock time since the last deploy

3. If your service is consuming its error budget at burn rate B = 14.4 against a 30-day SLO window, roughly how long until the budget is fully exhausted?

A. ~2 days

B. ~14 hours

C. ~30 days (by definition)

D. Already gone — 14.4 is a per-minute exhaustion rate

4. Why does the canonical MWMBR alert pair a short (e.g. 5m) window with a long (e.g. 1h) window?

A. The short window fires fast; the long window prevents one-off blips from paging anyone

B. Prometheus cannot evaluate a single window faster than 1h

C. Alertmanager requires two windows for HA quorum

D. It doubles the recording-rule cost intentionally so the alert is more sensitive

5. Which of the following is the most common anti-pattern in alerting that this chapter warns against?

A. Linking a runbook URL on every page

B. Paging on raw causes (CPU > 90%) instead of on user-visible symptoms

C. Using a for: clause on alert rules

D. Inhibiting per-pod alerts when a node-down alert is firing

1. Service Level Indicators and Objectives

A production service is a promise to its users. SLIs measure how well you keep it, SLOs are the targets you commit to, and the error budget is the contractually allowed slack between perfect and "good enough."

Choosing Good SLIs

A good SLI is a ratio between good events and valid events — e.g. "the fraction of HTTP requests that returned a non-5xx within 300 ms" — measured from the user's perspective. Ratios normalize naturally with traffic: 1% errors at 10 RPS is the same as 1% errors at 10,000 RPS.
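A minimal sketch of the good/valid ratio, assuming requests are available as in-memory records. The `Request` type, its field names, and the choice to report an empty window as fully met are illustrative assumptions, not something the chapter prescribes; in practice the ratio is computed in PromQL as shown in the table below.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    latency_ms: float  # observed latency in milliseconds


def sli_ratio(requests: list[Request], latency_threshold_ms: float = 300.0) -> float:
    """Fraction of valid requests that were good (non-5xx and fast enough)."""
    valid = len(requests)
    if valid == 0:
        # Assumption: with no traffic, report the SLI as met instead of dividing by zero.
        return 1.0
    good = sum(
        1 for r in requests
        if r.status < 500 and r.latency_ms <= latency_threshold_ms
    )
    return good / valid


# The same 1% failure rate gives the same SLI at very different traffic volumes.
low_traffic = [Request(200, 50.0)] * 99 + [Request(503, 50.0)]
high_traffic = [Request(200, 50.0)] * 9900 + [Request(503, 50.0)] * 100
print(sli_ratio(low_traffic), sli_ratio(high_traffic))
```

Both calls yield the same ratio, which is the normalization property described above: the SLI does not change when traffic grows by two orders of magnitude.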

Four families cover most services:

SLI Family	Question	PromQL Shape
Availability	Did the request succeed?	`sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`
Latency	Was the response fast enough?	`sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) / sum(rate(http_request_duration_seconds_count[5m]))`
Freshness	Is the data recent enough?	`time() - max(pipeline_last_success_timestamp_seconds) < 600`
Correctness	Did we compute the right answer?	Domain-specific — reconciliation counters

Analogy: SLIs are the dashboard of a car. Availability = "does the engine start?"; latency = "how fast can it accelerate?"; freshness = "how old is the GPS reading?"; correctness = "does the odometer match what we drove?"

Figure 12.1: SLI → SLO → Error Budget → Burn Rate

graph TD
    SLI["Service Level Indicator<br/>good events / valid events<br/>e.g. non-5xx requests under 300ms"]
    SLO["Service Level Objective<br/>target on the SLI<br/>e.g. 99.9% over 30 days"]
    EB["Error Budget<br/>= 1 - SLO<br/>e.g. 0.1% ~ 43.2 min / month"]
    BR["Burn Rate<br/>observed_error_fraction / error_budget<br/>how fast budget is consumed"]
    A["Burn-Rate Alert<br/>fires when consumption is unsustainable<br/>page or ticket by severity"]
    SLI --> SLO
    SLO --> EB
    EB --> BR
    BR --> A

Error Budgets

Once you commit to an SLO, the arithmetic is trivial: error_budget = 1 - SLO. For 99.9% over 30 days: 0.001 × 43,200 minutes = 43.2 minutes/month. That number, not the percentage, is the most useful artifact in your SLO program: it converts an abstraction into a budget engineers and PMs can reason about. If you have burned 30 minutes this month, you have about 13 left, which immediately changes deployment risk decisions.

SLO	Bad fraction	30-day budget	90-day budget
99%	1.0%	7 h 12 m	21 h 36 m
99.5%	0.5%	3 h 36 m	10 h 48 m
99.9%	0.1%	43.2 m	2 h 9.6 m
99.95%	0.05%	21.6 m	1 h 4.8 m
99.99%	0.01%	4.32 m	12.96 m
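The table above can be reproduced with a few lines of arithmetic. This is a sketch under the chapter's own definition (error_budget = 1 - SLO); the function names are hypothetical and the window is assumed to be a whole number of days.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed 'bad' minutes for an SLO (e.g. 0.999) over a window in days."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def remaining_budget_minutes(slo: float, window_days: int, burned_minutes: float) -> float:
    """Budget left after some bad minutes have already been spent."""
    return error_budget_minutes(slo, window_days) - burned_minutes


for slo in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    print(
        f"{slo:.2%}: {error_budget_minutes(slo, 30):8.2f} min / 30 d, "
        f"{error_budget_minutes(slo, 90):8.2f} min / 90 d"
    )
```

For the 99.9% example, `remaining_budget_minutes(0.999, 30, 30)` comes out at roughly 13.2 minutes, matching the "about 13 left" reasoning in the text.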

Burn Rate and Multi-Window Multi-Burn-Rate

The burn rate B = observed_error_fraction / error_budget answers the only question that matters: at the current pace, when do we run out? A budget sized for the whole window lasts window / B, so at B = 14.4 a 30-day budget is gone in 30 / 14.4 ≈ 2.1 days.
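As a worked example of the burn-rate definition, the sketch below computes B and the time to exhaustion. It assumes the burn rate stays constant and that the full budget is still available; the function names are hypothetical.

```python
def burn_rate(observed_error_fraction: float, slo: float) -> float:
    """Burn rate B = observed error fraction / error budget, where budget = 1 - SLO."""
    budget = 1.0 - slo
    if budget <= 0.0:
        raise ValueError("an SLO of 100% leaves no error budget to burn")
    return observed_error_fraction / budget


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is spent if the burn rate stays constant."""
    if rate <= 0.0:
        return float("inf")
    return window_days / rate


# 1.44% of requests failing against a 99.9% SLO is a burn rate of about 14.4.
b = burn_rate(observed_error_fraction=0.0144, slo=0.999)
print(f"B = {b:.1f}, budget gone in {days_to_exhaustion(b):.2f} days")
```

A burn rate of exactly 1 spends the budget in exactly one window, which is why the slowest tier in the alert ladder uses 1 as its threshold.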

The naïve "alert when error rate > threshold" approach pages on transient blips and misses slow erosion. The Google SRE workbook's MWMBR pattern fixes both by requiring agreement across two windows:

Alert	Short	Long	Burn rate	Severity
Fast-burn page	5 m	1 h	14.4	page
Medium-burn page	30 m	6 h	6	page
Slow-burn ticket	2 h	24 h	3	ticket
Slowest-burn ticket	6 h	3 d	1	ticket

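# Recording rules: compute once, reuse everywhere
- record: slo:http_errors:ratio_rate5m
  expr: |
    sum by (service, env) (rate(http_requests_total{code=~"5.."}[5m]))
    /
    sum by (service, env) (rate(http_requests_total[5m]))
# Same ratio over the long window referenced by the fast-burn alert below
- record: slo:http_errors:ratio_rate1h
  expr: |
    sum by (service, env) (rate(http_requests_total{code=~"5.."}[1h]))
    /
    sum by (service, env) (rate(http_requests_total[1h]))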

# Fast-burn MWMBR alert
- alert: SLOErrorBudgetBurnFast
  expr: |
    (
      slo:http_errors:ratio_rate5m{service="checkout"} > (14.4 * 0.001)
      and
      slo:http_errors:ratio_rate1h{service="checkout"} > (14.4 * 0.001)
    )
    and
    sum by (service, env) (rate(http_requests_total{service="checkout"}[5m])) > 1
  for: 2m
  labels: { severity: page, slo: availability-99.9-30d, team: payments }
  annotations:
    summary: "Checkout burning error budget at >14.4x in {{ $labels.env }}"
    runbook_url: "https://runbooks.example.com/checkout/slo-fast-burn"

Two production details: the sum(rate(...)) > 1 gate kills spurious 100% ratios in low-traffic windows; latency SLOs reuse the same machinery with a histogram le-bucket numerator.
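The two-window agreement can also be sketched outside PromQL. The example below is an illustration only, assuming the per-window error ratios have already been computed (for instance by recording rules) and are passed in as a dictionary keyed by window name; `BurnTier` and `evaluate_mwmbr` are hypothetical names, and the low-traffic gate is omitted.

```python
from typing import NamedTuple, Optional


class BurnTier(NamedTuple):
    name: str
    short_window: str
    long_window: str
    burn_rate: float
    severity: str


# Tiers copied from the MWMBR table, ordered from most to least urgent.
TIERS = [
    BurnTier("fast", "5m", "1h", 14.4, "page"),
    BurnTier("medium", "30m", "6h", 6.0, "page"),
    BurnTier("slow", "2h", "24h", 3.0, "ticket"),
    BurnTier("slowest", "6h", "3d", 1.0, "ticket"),
]


def evaluate_mwmbr(error_ratios: dict[str, float], slo: float) -> Optional[BurnTier]:
    """Return the most urgent tier whose short AND long windows both exceed
    burn_rate * error_budget, or None when no tier fires."""
    budget = 1.0 - slo
    for tier in TIERS:
        threshold = tier.burn_rate * budget
        short = error_ratios.get(tier.short_window, 0.0)
        long_ = error_ratios.get(tier.long_window, 0.0)
        if short > threshold and long_ > threshold:
            return tier
    return None


# A short blip: the 5m window is hot, but no long window agrees, so nothing fires.
blip = {"5m": 0.05, "30m": 0.009, "1h": 0.005, "2h": 0.0025,
        "6h": 0.0009, "24h": 0.0003, "3d": 0.0001}
# Slow erosion: never dramatic, but sustained enough to open a ticket.
erosion = {"5m": 0.004, "30m": 0.004, "1h": 0.004, "2h": 0.004,
           "6h": 0.004, "24h": 0.0035, "3d": 0.002}

print(evaluate_mwmbr(blip, slo=0.999))
print(evaluate_mwmbr(erosion, slo=0.999))
```

The blip returns `None` even though its 5-minute ratio is far above the fast-burn threshold, while the erosion case lands on the slow tier with severity `ticket`. These are the two failure modes of a single static threshold that the pattern is meant to address.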

Figure 12.2: MWMBR Alert Escalation

graph TD
    ER["Observed Error Ratio<br/>recording rules 5m, 1h, 6h, 3d"]
    F1{"5m AND 1h<br/>above 14.4x burn?"}
    M1{"30m AND 6h<br/>above 6x burn?"}
    S1{"2h AND 24h<br/>above 3x burn?"}
    SS1{"6h AND 3d<br/>above 1x burn?"}
    P1["Fast-Burn PAGE<br/>~2 days to exhaustion"]
    P2["Medium-Burn PAGE<br/>~5 days"]
    T1["Slow-Burn TICKET<br/>~10 days"]
    T2["Slowest-Burn TICKET<br/>exhaustion in window"]
    OK["No alert: budget healthy"]
    ER --> F1
    F1 -->|yes| P1
    F1 -->|no| M1
    M1 -->|yes| P2
    M1 -->|no| S1
    S1 -->|yes| T1
    S1 -->|no| SS1
    SS1 -->|yes| T2
    SS1 -->|no| OK

2. Alerting Architecture

A well-tuned SLO is wasted if it lands in a flood of noise at 3 a.m. The alerting architecture — how alerts route, group, inhibit, and reach humans — decides whether on-call is sustainable.

Alertmanager: Grouping, Routing, Inhibition

Grouping collapses related alerts into one notification. group_by: [alertname, service, severity, env] produces one notification per actual incident; grouping by pod or container floods you during every rolling deploy.
Routing trees branch by severity, env, and team ownership. Critical prod → PagerDuty; warnings → Slack channel; info → low-priority or nowhere.
Inhibition suppresses symptom alerts when a known root-cause alert is firing (e.g. a node-down alert silences every per-pod alert on that node).

route:
  receiver: 'default-slack'
  group_by: ['alertname', 'service', 'severity', 'env']
  group_wait: 60s
  group_interval: 10m
  repeat_interval: 4h
  routes:
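    # "page" is matched as well so the burn-rate alerts labeled severity: page reach PagerDuty.
    - matchers: [ severity=~"critical|page" ]
      routes:
        - matchers: [ env="prod" ]
          receiver: 'pagerduty-prod'
          # Alerts that match no child route below are handled by this receiver.
          routes:
            - matchers: [ team="payments" ]
              receiver: 'pagerduty-payments'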

inhibit_rules:
  - source_matchers: [ severity="critical", alertname="KubernetesNodeDown" ]
    target_matchers: [ alertname=~"KubePodCrashLooping|KubePodNotReady|InstanceDown" ]
    equal: ['node', 'env']

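For HA, run three Alertmanager replicas with gossip clustering. The replicas share notification state and silences, so each notification is sent once and restarting or losing one replica neither drops nor re-sends active alerts.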
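To make the inhibition rule above concrete, here is a simplified sketch of the matching logic, not Alertmanager's actual implementation. It assumes alerts are plain label dictionaries, handles only exact-match source labels and a set of target alert names, and ignores edge cases such as an alert matching both sides of a rule; `is_inhibited` is a hypothetical helper.

```python
Alert = dict[str, str]  # label name -> label value


def is_inhibited(
    target: Alert,
    firing: list[Alert],
    source_match: dict[str, str],
    target_alertnames: set[str],
    equal: list[str],
) -> bool:
    """True if `target` is a symptom alert and some firing source alert has the
    same value for every label listed in `equal`."""
    if target.get("alertname") not in target_alertnames:
        return False
    for source in firing:
        if any(source.get(k) != v for k, v in source_match.items()):
            continue  # not a root-cause alert for this rule
        if all(source.get(k) == target.get(k) for k in equal):
            return True
    return False


node_down = {"alertname": "KubernetesNodeDown", "severity": "critical",
             "node": "n1", "env": "prod"}
pod_on_n1 = {"alertname": "KubePodNotReady", "node": "n1", "env": "prod"}
pod_on_n2 = {"alertname": "KubePodNotReady", "node": "n2", "env": "prod"}

rule = dict(
    source_match={"severity": "critical", "alertname": "KubernetesNodeDown"},
    target_alertnames={"KubePodCrashLooping", "KubePodNotReady", "InstanceDown"},
    equal=["node", "env"],
)
print(is_inhibited(pod_on_n1, [node_down], **rule))
print(is_inhibited(pod_on_n2, [node_down], **rule))
```

The pod alert on `n1` is suppressed because it shares `node` and `env` with the firing node-down alert; the pod alert on `n2` still notifies. The `equal` list is what keeps a single node failure from silencing unrelated pods.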

On-Call Ergonomics

Every page must be actionable: the on-call engineer should know within 60 seconds what to do. The hard rule is no page without a runbook URL. A runbook entry answers:

How do I confirm this is real? (PromQL, dashboard, log query)
What is the most likely cause?
What are the safe remediation steps?
When do I escalate, and to whom?
What is the rollback criterion?

Anti-Patterns

Paging on causes (CPU > 90%) instead of symptoms (users see errors). A CPU at 99% with happy users is not an incident; the same CPU at 60% during a brownout is. Anchor pages on the SLI ladder.
Static thresholds on dynamic systems — "latency > 500 ms" pages every Black Friday. Burn-rate alerts are the cure.
Alerts without owners — every rule must carry a team label.
Per-pod alerts in Kubernetes — alert on Deployment/Service aggregates, not individual replicas.
Unbounded silences — always set durations and require a ticket comment.
No for: clause — without it every blip becomes a page.
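Several of these anti-patterns can be caught mechanically before a rule ships. The sketch below is one possible lint, assuming each alerting rule has already been parsed from YAML into a dictionary; the function name, the message wording, and the choice to require a runbook only for `severity: page` are assumptions for illustration.

```python
def lint_alert_rule(rule: dict) -> list[str]:
    """Return hygiene problems for one alerting rule given as a parsed dict."""
    problems = []
    labels = rule.get("labels", {})
    annotations = rule.get("annotations", {})
    if not labels.get("team"):
        problems.append("missing team label (alert has no owner)")
    if not rule.get("for"):
        problems.append("missing for: clause (every blip can fire)")
    if labels.get("severity") == "page" and not annotations.get("runbook_url"):
        problems.append("page-severity alert without runbook_url")
    return problems


good_rule = {
    "alert": "SLOErrorBudgetBurnFast",
    "for": "2m",
    "labels": {"severity": "page", "team": "payments"},
    "annotations": {
        "runbook_url": "https://runbooks.example.com/checkout/slo-fast-burn",
    },
}
bad_rule = {"alert": "HighCPU", "labels": {"severity": "page"}}

print(lint_alert_rule(good_rule))
print(lint_alert_rule(bad_rule))
```

Run in CI against every rule file, a check like this turns "no page without a runbook URL" from a convention into a gate.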

Figure 12.3: Alertmanager Routing with Grouping and Inhibition

flowchart TD
    P["Prometheus<br/>alert rules fire"] --> AM["Alertmanager<br/>HA cluster x3"]
    AM --> G["Group by<br/>alertname, service,<br/>severity, env"]
    G --> I{"Inhibition rules<br/>root-cause active?"}
    I -->|suppressed| X["Drop symptom alert"]
    I -->|allowed| R{"Route by severity"}
    R -->|critical| RE{"env?"}
    R -->|warning| SW["Slack #alerts-warnings<br/>repeat 12h"]
    R -->|info| SI["Slack #alerts-info<br/>repeat 24h"]
    RE -->|prod| RT{"team?"}
    RE -->|staging or dev| SN["Slack #nonprod"]
    RT -->|payments| PDP["PagerDuty<br/>payments rotation"]
    RT -->|platform| PDF["PagerDuty<br/>platform rotation"]
    PDP --> RB["Runbook URL<br/>+ dashboard link<br/>+ silence controls"]
    PDF --> RB

Post-Quiz — Half 1 (Sections 1 & 2)

1. For a service with a 99.9% availability SLO measured over a 30-day window, approximately how much "bad" time is the error budget?

A. About 4.3 minutes per month

B. About 43 minutes per month

C. About 7.2 hours per month

D. About 30 minutes per month (one per day)

2. A good Service Level Indicator (SLI) is best expressed as:

A. The raw count of 5xx responses returned by the service

B. A snapshot of CPU and memory utilization on the host

C. A ratio of "good events" to "valid events" from the user's perspective

D. The wall-clock time since the last deploy

3. If your service is consuming its error budget at burn rate B = 14.4 against a 30-day SLO window, roughly how long until the budget is fully exhausted?

A. ~2 days

B. ~14 hours

C. ~30 days (by definition)

D. Already gone — 14.4 is a per-minute exhaustion rate

4. Why does the canonical MWMBR alert pair a short (e.g. 5m) window with a long (e.g. 1h) window?

A. The short window fires fast; the long window prevents one-off blips from paging anyone

B. Prometheus cannot evaluate a single window faster than 1h

C. Alertmanager requires two windows for HA quorum

D. It doubles the recording-rule cost intentionally so the alert is more sensitive

5. Which of the following is the most common anti-pattern in alerting that this chapter warns against?

A. Linking a runbook URL on every page

B. Paging on raw causes (CPU > 90%) instead of on user-visible symptoms

C. Using a for: clause on alert rules

D. Inhibiting per-pod alerts when a node-down alert is firing

Pre-Quiz — Half 2 (Sections 3 & 4)

6. In the six-layer reference architecture, what is "meta-observability"?

A. A second Grafana dashboard for business KPIs

B. A small, segregated Prometheus + Alertmanager that monitors the observability stack itself

C. The OTel Collector's internal metrics endpoint

D. A vendor-specific term for trace metrics

7. Which Prometheus meta-metric tracks the #1 outage cause for the observability stack itself?

A. prometheus_target_scrape_pool_sync_total failures

B. prometheus_rule_evaluation_duration_seconds

C. prometheus_tsdb_head_series — cardinality runaway

D. up{job="prometheus"} flipping to 0

8. The chapter describes migrating from Prometheus-only stacks as "wrap, don't replace." What does that mean in practice?

A. Wrap every Prometheus alert in a Python script before sending it to PagerDuty

B. Keep Prometheus where it works; add OTel Collectors as the entry point for new signals (using prometheusremotewrite and the prometheus receiver)

C. Tunnel Prometheus traffic through a service mesh sidecar before scraping

D. Rip out Prometheus and replace it with the OTel Collector entirely on day one

9. What unique question does continuous profiling answer that metrics, logs, and traces cannot?

A. "Is something wrong?"

B. "Where in the request path is the slowness?"

C. "Exactly which code is consuming resources, and how did it change?"

D. "What error message did the user see?"

10. Why does the chapter argue that investing in OTel semantic-convention conformance pays off strategically?

A. It's required by Prometheus 3.x or scraping silently breaks

B. It shrinks vendor lock-in, enables fleet-wide queries, and makes AI-assisted RCA more useful

C. It reduces Prometheus memory usage by ~50% per series

D. It is the only way to enable inhibition rules in Alertmanager

3. Putting It All Together

You have all the pieces. The job now is integration — a platform that holds together at scale, can roll out incrementally, and is itself observable.

Reference Architecture — Six Logical Layers

Instrumentation — OTel SDKs + auto-instrumentation, emitting OTLP for metrics, traces, logs, and profiles.
Collection — OTel Collector DaemonSets + fan-in Deployments; Prometheus via the Operator for scrape targets (kube-state-metrics, node-exporter, cAdvisor).
Storage — Prometheus TSDB (short-term) → Thanos/Mimir/Cortex (long-term); Tempo/Jaeger for traces; Loki/Elasticsearch for logs; Pyroscope/Parca for profiles.
Rules & alerting — recording + alerting rules, Alertmanager HA, runbook hosting.
Visualization — Grafana with datasources across every signal store.
Meta-observability — a small, segregated Prometheus + Alertmanager whose only job is monitoring the observability stack itself.

Figure 12.4: Reference Observability Platform on Kubernetes

flowchart LR
    subgraph L1["1. Instrumentation"]
        APP["Application Pods<br/>OTel SDKs<br/>auto-instrumentation"]
    end
    subgraph L2["2. Collection"]
        DS["OTel Collector<br/>DaemonSet"]
        DEP["OTel Collector<br/>Deployment fan-in"]
        PROM["Prometheus Operator<br/>ServiceMonitors"]
    end
    subgraph L3["3. Storage"]
        TSDB["Prometheus TSDB"]
        LTM["Thanos / Mimir"]
        TR["Tempo / Jaeger"]
        LG["Loki"]
        PR["Pyroscope / Parca"]
    end
    subgraph L4["4. Rules & Alerting"]
        RR["Recording Rules"]
        AR["Alerting Rules MWMBR"]
        AM["Alertmanager HA"]
    end
    subgraph L5["5. Visualization"]
        GR["Grafana"]
    end
    subgraph L6["6. Meta-Observability"]
        WD["Watchdog Prometheus<br/>+ Alertmanager"]
    end
    APP --> DS
    APP --> DEP
    APP --> PROM
    DS --> TSDB
    DS --> TR
    DS --> LG
    DEP --> LTM
    PROM --> TSDB
    TSDB --> LTM
    TSDB --> RR
    RR --> AR
    AR --> AM
    TSDB --> GR
    LTM --> GR
    TR --> GR
    LG --> GR
    PR --> GR
    AM --> GR
    WD -. watches .-> L2
    WD -. watches .-> L3
    WD -. watches .-> L4

Greenfield Rollout

Week	Action	Outcome
1	Deploy kube-prometheus-stack with default rules	Cluster metrics, basic alerts, Grafana
2	OTel Collector DaemonSet → Prometheus + Tempo	First auto-instrumented traces
3	Define 2–3 critical SLOs with MWMBR alerts	First user-anchored pages
4	Alertmanager routing tree + inhibition + runbooks	Reduced noise; team ownership
5–6	Long-term store via remote-write	Multi-month retention
7–8	Logs (Loki) + continuous profiling (Pyroscope)	Full four-signal stack

Migration from Prometheus-only stacks follows the "wrap, don't replace" rule. Keep Prometheus where it works (scrape-based infra, alerting); add OTel Collectors as the entry point for new signals. The Collector's prometheusremotewrite exporter ships OTel metrics into Prometheus/Mimir, and the prometheus receiver scrapes existing exporters — a unified pipeline without a forklift migration.

Meta-Observability: Capacity Planning the Platform

Component	Meta-Metric	Why It Matters
Prometheus	`prometheus_rule_evaluation_duration_seconds`	Rule lag = late alerts
Prometheus	`prometheus_tsdb_head_series`	Cardinality runaway = #1 outage cause
Alertmanager	`alertmanager_notifications_failed_total`	Pages may not reach humans
Alertmanager	`alertmanager_cluster_members`	Cluster membership shrinks → duplicate notifications
OTel Collector	`otelcol_exporter_queue_size`	Backpressure → telemetry loss
OTel Collector	`otelcol_processor_dropped_spans`	Sampling working too aggressively

Sizing rules of thumb:

Prometheus memory: ~3 KB per active series in steady state; +25% headroom for query bursts.
Prometheus disk: ~1.5 bytes/sample after compression × samples/sec × retention seconds.
Tempo/Jaeger: head-based sampling at 1–10%, plus tail-based sampling for errors/slow traces.
OTel Collector: 1 GB RAM per ~20k spans/sec with batching; CPU scales linearly with attribute count.
Loki: log volume dominates. WARN+ in prod is sustainable; DEBUG in prod is not.
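The first, second, and fourth rules of thumb above translate directly into arithmetic. The sketch below applies them as stated, assuming 1 KB = 1000 bytes and 1 GB = 10^9 bytes; the function names and the example workload are hypothetical, and the results are planning estimates rather than guarantees.

```python
def prometheus_memory_gb(active_series: int, kb_per_series: float = 3.0,
                         headroom: float = 0.25) -> float:
    """Steady-state memory estimate plus query-burst headroom, in GB."""
    bytes_needed = active_series * kb_per_series * 1000 * (1.0 + headroom)
    return bytes_needed / 1e9


def prometheus_disk_gb(samples_per_sec: float, retention_days: float,
                       bytes_per_sample: float = 1.5) -> float:
    """Disk estimate: bytes/sample x samples/sec x retention seconds, in GB."""
    retention_seconds = retention_days * 24 * 3600
    return bytes_per_sample * samples_per_sec * retention_seconds / 1e9


def collector_ram_gb(spans_per_sec: float, spans_per_gb: float = 20_000) -> float:
    """OTel Collector RAM estimate at ~20k spans/sec per GB with batching."""
    return spans_per_sec / spans_per_gb


# Hypothetical workload: 2M active series, 100k samples/sec, 15-day retention,
# 50k spans/sec through the Collector tier.
print(f"Prometheus memory: {prometheus_memory_gb(2_000_000):.1f} GB")
print(f"Prometheus disk:   {prometheus_disk_gb(100_000, 15):.1f} GB")
print(f"Collector RAM:     {collector_ram_gb(50_000):.1f} GB")
```

For this workload the estimates come to about 7.5 GB of Prometheus memory, 194.4 GB of disk, and 2.5 GB of Collector RAM. Compare such numbers against the meta-metrics in the table above before trusting them.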

4. Where Observability Is Heading

The fundamentals — SLOs, MWMBR, Alertmanager hygiene — have been stable for nearly a decade. The interesting changes are at the edges.

Profiles as a Fourth Signal

Three pillars (metrics, logs, traces) become four. Continuous profiling — always-on stack-trace sampling at 50–200 Hz with 1–3% overhead — is the fourth.

Signal	Question
Metrics	Is something wrong?
Logs	What happened?
Traces	Where in the request path?
Profiles	Exactly which code is consuming resources?

Example: metrics show p99 rising from 200 → 500 ms after deploy. Traces narrow it to CalculateDiscounts spans. Profiles reveal 40% of CPU is in a new calculate_rewards_v2 function, dominated by hashmap operations. Without profiles you have a suspect; with profiles you have the exact lines of code.

Grafana Pyroscope (Prometheus-like labels, multi-language SDKs, eBPF) and Parca / Polar Signals (Kubernetes-native eBPF DaemonSet) lead the OSS space. The maturing OpenTelemetry profiling signal adds profiles as a first-class OTLP signal sharing resource attributes and trace/span IDs.

AI-Assisted Root Cause Analysis

Vendors and OSS projects are building systems that:

Correlate signals automatically when an SLO burn alert fires — matching traces, topology, profiles, recent commits, feature flags.
Summarize hypotheses — an LLM drafts "what changed?" and "where is load concentrated?" for the on-call to edit.
Suggest queries in natural language, grounded in the available metric and log schemas.

Two prerequisites: OTel semantic conventions for vocabulary, and structured runbooks for remediation playbooks. This is not a replacement for engineering judgment — it removes the manual query-writing tax during incidents.

Semantic Convention Convergence

The longest-running success of OTel is that vendors converge on the same names: service.name, http.response.status_code, db.system, k8s.pod.name, deployment.environment. Practical implications:

Vendor lock-in shrinks — switching APMs becomes a Collector reconfiguration.
Cross-team analytics work — fleet-wide SLO compliance from a single query.
AI assistants become more useful — standard names mean cross-org generalization.

Strategic takeaway: invest in conformance now. Add CI checks that reject non-standard attribute names; require OTel resource attributes via the Collector's resourcedetection processor.
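A CI conformance check can be as small as an allowlist comparison. The sketch below is illustrative only: the allowlist contains just the five names mentioned in this section, the set of required resource attributes is an assumption, and a real check would load whatever attribute registry your organization has adopted.

```python
# Hypothetical allowlist limited to the names cited in this section.
ALLOWED_ATTRIBUTES = frozenset({
    "service.name",
    "http.response.status_code",
    "db.system",
    "k8s.pod.name",
    "deployment.environment",
})
# Assumption: these two resource attributes are mandatory on all telemetry.
REQUIRED_RESOURCE_ATTRIBUTES = frozenset({"service.name", "deployment.environment"})


def check_attributes(attributes: dict[str, object]) -> list[str]:
    """Return conformance violations for one set of telemetry attributes."""
    violations = [
        f"non-standard attribute: {name}"
        for name in sorted(attributes)
        if name not in ALLOWED_ATTRIBUTES
    ]
    violations += [
        f"missing required attribute: {name}"
        for name in sorted(REQUIRED_RESOURCE_ATTRIBUTES)
        if name not in attributes
    ]
    return violations


print(check_attributes({"service.name": "checkout", "deployment.environment": "prod"}))
print(check_attributes({"svc": "checkout"}))
```

Failing the build when the returned list is non-empty keeps ad hoc names such as `svc` from reaching production, where they would break fleet-wide queries.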

Figure 12.5: The Four Pillars and the Path Forward

graph TD
    subgraph PILLARS["Signals of Modern Observability"]
        M["Metrics<br/>Is something wrong?<br/>Prometheus, OTLP"]
        L["Logs<br/>What happened?<br/>Loki, Elasticsearch"]
        T["Traces<br/>Where in the request path?<br/>Tempo, Jaeger"]
        PF["Profiles<br/>Which lines of code?<br/>Pyroscope, Parca"]
    end
    SC["OTel Semantic Conventions<br/>service.name, http.response.status_code,<br/>k8s.pod.name, deployment.environment"]
    AI["AI-Assisted RCA<br/>correlate signals, summarize hypotheses,<br/>suggest queries from runbooks"]
    INC["Incident<br/>faster MTTD & MTTR<br/>portable across vendors"]
    M --> SC
    L --> SC
    T --> SC
    PF --> SC
    SC --> AI
    AI --> INC

Post-Quiz — Half 2 (Sections 3 & 4)

6. In the six-layer reference architecture, what is "meta-observability"?

A. A second Grafana dashboard for business KPIs

B. A small, segregated Prometheus + Alertmanager that monitors the observability stack itself

C. The OTel Collector's internal metrics endpoint

D. A vendor-specific term for trace metrics

7. Which Prometheus meta-metric tracks the #1 outage cause for the observability stack itself?

A. prometheus_target_scrape_pool_sync_total failures

B. prometheus_rule_evaluation_duration_seconds

C. prometheus_tsdb_head_series — cardinality runaway

D. up{job="prometheus"} flipping to 0

8. The chapter describes migrating from Prometheus-only stacks as "wrap, don't replace." What does that mean in practice?

A. Wrap every Prometheus alert in a Python script before sending it to PagerDuty

B. Keep Prometheus where it works; add OTel Collectors as the entry point for new signals (using prometheusremotewrite and the prometheus receiver)

C. Tunnel Prometheus traffic through a service mesh sidecar before scraping

D. Rip out Prometheus and replace it with the OTel Collector entirely on day one

9. What unique question does continuous profiling answer that metrics, logs, and traces cannot?

A. "Is something wrong?"

B. "Where in the request path is the slowness?"

C. "Exactly which code is consuming resources, and how did it change?"

D. "What error message did the user see?"

10. Why does the chapter argue that investing in OTel semantic-convention conformance pays off strategically?

A. It's required by Prometheus 3.x or scraping silently breaks

B. It shrinks vendor lock-in, enables fleet-wide queries, and makes AI-assisted RCA more useful

C. It reduces Prometheus memory usage by ~50% per series

D. It is the only way to enable inhibition rules in Alertmanager

Chapter Summary & Key Terms

You've crossed the bridge from "we collect telemetry" to "we operate to a contract." SLIs measure user-visible behavior as good/valid ratios. SLOs are targets on SLIs; their complement (1 − SLO) is the error budget. MWMBR alerts page only when the budget is genuinely at risk. Alertmanager turns alerts into humane notifications via grouping, severity-aware routing, inhibition, runbook-linked annotations, and HA clustering. A reference architecture layers instrumentation, collection, storage, rules, visualization, and meta-observability. The next frontier is continuous profiling, AI-assisted RCA, and semantic-convention convergence, all of which plug into the OTel foundation you build today.

Term	Definition
SLI	Service Level Indicator — a ratio of good to valid events reflecting user experience.
SLO	Service Level Objective — target value/range for an SLI over a defined window.
Error budget	Allowed "bad" behavior, computed as `1 − SLO`. ~43 min/mo for 99.9%.
Burn rate	Ratio of observed error fraction to error budget; `B = 14.4` exhausts a 30-day budget in ~2 days.
MWMBR	Multi-window multi-burn-rate — short + long windows must both cross to fire.
Inhibition rule	Alertmanager rule suppressing symptom alerts while a root-cause alert is firing.
Runbook	Operational doc linked from every page; confirm, cause, remediate, escalate, rollback.
Meta-observability	Monitoring the observability platform itself via a segregated stack.
Continuous profiling	Always-on stack-trace sampling at 50–200 Hz with 1–3% overhead — the fourth signal.
OTel semantic conventions	Standardized attribute names enabling cross-vendor portability and AI tooling.
