Chapter 12 — SLOs, Alerting, and Operational Excellence

Capstone chapter. From "we collect telemetry" to "we run services to a contract."

Learning Objectives

Pre-Quiz — Half 1 (Sections 1 & 2)

1. For a service with a 99.9% availability SLO measured over a 30-day window, approximately how much "bad" time is the error budget?

A. About 4.3 minutes per month
B. About 43 minutes per month
C. About 7.2 hours per month
D. About 30 minutes per month (one per day)

2. A good Service Level Indicator (SLI) is best expressed as:

A. The raw count of 5xx responses returned by the service
B. A snapshot of CPU and memory utilization on the host
C. A ratio of "good events" to "valid events" from the user's perspective
D. The wall-clock time since the last deploy

3. If your service is consuming its error budget at a burn rate of B = 14.4 against a 30-day SLO window, roughly how long until the budget is fully exhausted?

A. ~2 days
B. ~14 hours
C. ~30 days (by definition)
D. Already gone — 14.4 is a per-minute exhaustion rate

4. Why does the canonical MWMBR alert pair a short (e.g. 5m) window with a long (e.g. 1h) window?

A. The short window fires fast; the long window prevents one-off blips from paging anyone
B. Prometheus cannot evaluate a single window faster than 1h
C. Alertmanager requires two windows for HA quorum
D. It doubles the recording-rule cost intentionally so the alert is more sensitive

5. Which of the following is the most common anti-pattern in alerting that this chapter warns against?

A. Linking a runbook URL on every page
B. Paging on raw causes (CPU > 90%) instead of on user-visible symptoms
C. Using a for: clause on alert rules
D. Inhibiting per-pod alerts when a node-down alert is firing

1. Service Level Indicators and Objectives

A production service is a promise to its users. SLIs measure how well you keep it, SLOs are the targets you commit to, and the error budget is the contractually allowed slack between perfect and "good enough."

Choosing Good SLIs

A good SLI is a ratio between good events and valid events — e.g. "the fraction of HTTP requests that returned a non-5xx within 300 ms" — measured from the user's perspective. Ratios normalize naturally with traffic: 1% errors at 10 RPS is the same as 1% errors at 10,000 RPS.

Four families cover most services:

| SLI family | Question | PromQL shape |
| --- | --- | --- |
| Availability | Did the request succeed? | sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m])) |
| Latency | Was the response fast enough? | sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) / sum(rate(http_request_duration_seconds_count[5m])) |
| Freshness | Is the data recent enough? | time() - max(pipeline_last_success_timestamp_seconds) < 600 |
| Correctness | Did we compute the right answer? | Domain-specific (reconciliation counters) |

Analogy: SLIs are the dashboard of a car. Availability = "does the engine start?"; latency = "how fast can it accelerate?"; freshness = "how old is the GPS reading?"; correctness = "does the odometer match what we drove?"

Figure 12.1: SLI → SLO → Error Budget → Burn Rate

[Diagram: a top-down chain. Service Level Indicator (good events / valid events, e.g. non-5xx requests under 300 ms) → Service Level Objective (target on the SLI, e.g. 99.9% over 30 days) → Error Budget (1 - SLO, e.g. 0.1%, about 43.2 min/month) → Burn Rate (observed_error_fraction / error_budget, how fast the budget is consumed) → Burn-Rate Alert (fires when consumption is unsustainable; page or ticket by severity).]

Error Budgets

Once you commit to an SLO, the arithmetic is trivial: error_budget = 1 - SLO. For 99.9% over 30 days: 0.001 × 43,200 minutes ≈ 43.2 minutes/month. That number — not the percentage — is the most useful artifact in your program: it converts an abstraction into a budget engineers and PMs can reason about. If you have burned 30 min this month you have 13 left, which immediately changes deployment risk decisions.

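| SLO | Bad fraction | 30-day budget | 90-day budget |
| --- | --- | --- | --- |
| 99% | 1.0% | 7 h 12 m | 21 h 36 m |
| 99.5% | 0.5% | 3 h 36 m | 10 h 48 m |
| 99.9% | 0.1% | 43.2 m | 2 h 9.6 m |
| 99.95% | 0.05% | 21.6 m | 1 h 4.8 m |
| 99.99% | 0.01% | 4.32 m | 12.96 m |
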
Animation A — Error Budget Burn (99.9% / 30-day)
[Animation: a 30-day error budget bar of 43.2 minutes depleting from 100% as incidents accrue over the window, turning yellow at 50% remaining and red (ship freeze) at 0%.]

A 99.9% SLO allots ~43 minutes of bad time per month. As incidents accrue, the bar depletes; once 50% of the budget is spent you open a ticket, and once 100% is spent you freeze risky deploys until the window resets.
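The budget arithmetic in this section can be sketched in a few lines of Python. This is a minimal illustration rather than part of any tool: the function names `error_budget_minutes` and `budget_remaining_fraction` are hypothetical, and the sketch assumes the SLO is given as a fraction (0.999 for 99.9%) over a window measured in whole days.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes for an SLO over a window: (1 - SLO) * window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining_fraction(slo: float, bad_minutes: float,
                              window_days: int = 30) -> float:
    """Fraction of the budget still unspent; negative means overspent."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - bad_minutes) / budget


if __name__ == "__main__":
    # 99.9% over 30 days, with 30 bad minutes already spent this window.
    print(round(error_budget_minutes(0.999), 2))
    print(round(budget_remaining_fraction(0.999, bad_minutes=30.0), 3))
```

With 30 of the 43.2 minutes spent, roughly 31% of the budget remains, which is the kind of number that can feed a deploy-risk decision.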

Burn Rate and Multi-Window Multi-Burn-Rate

The burn rate B = observed_error_fraction / error_budget answers the only question that matters: at the current pace, when do we run out? If B = 14.4, a 30-day budget is gone in ~2 days.
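The "~2 days" figure follows from dividing the window length by the burn rate. The sketch below works through that step; the helper names are hypothetical, and it assumes a constant burn rate starting from a full budget, which real incidents rarely provide.

```python
def burn_rate(observed_error_fraction: float, slo: float) -> float:
    """B = observed_error_fraction / error_budget, where error_budget = 1 - SLO."""
    return observed_error_fraction / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is spent at a constant burn rate."""
    if rate <= 0:
        return float("inf")  # no errors, so the budget is never exhausted
    return window_days / rate


if __name__ == "__main__":
    # 1.44% of requests failing against a 99.9% SLO is a burn rate near 14.4.
    b = burn_rate(0.0144, slo=0.999)
    print(round(b, 1), round(days_to_exhaustion(b), 2))
```

A burn rate of exactly 1 spends the budget in exactly one window, which is why the slowest alert tier uses 1 as its threshold.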

The naïve "alert when error rate > threshold" approach pages on transient blips and misses slow erosion. The Google SRE workbook's MWMBR pattern fixes both by requiring agreement across two windows:

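| Alert | Short window | Long window | Burn rate | Severity |
| --- | --- | --- | --- | --- |
| Fast-burn page | 5 m | 1 h | 14.4 | page |
| Medium-burn page | 30 m | 6 h | 6 | page |
| Slow-burn ticket | 2 h | 24 h | 3 | ticket |
| Slowest-burn ticket | 6 h | 3 d | 1 | ticket |
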
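# Recording rules: compute once, reuse everywhere
- record: slo:http_errors:ratio_rate5m
  expr: |
    sum by (service, env) (rate(http_requests_total{code=~"5.."}[5m]))
    /
    sum by (service, env) (rate(http_requests_total[5m]))
# The fast-burn alert also reads a 1h ratio, so record it the same way
- record: slo:http_errors:ratio_rate1h
  expr: |
    sum by (service, env) (rate(http_requests_total{code=~"5.."}[1h]))
    /
    sum by (service, env) (rate(http_requests_total[1h]))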

# Fast-burn MWMBR alert
- alert: SLOErrorBudgetBurnFast
  expr: |
    (
      slo:http_errors:ratio_rate5m{service="checkout"} > (14.4 * 0.001)
      and
      slo:http_errors:ratio_rate1h{service="checkout"} > (14.4 * 0.001)
    )
    and
    sum by (service, env) (rate(http_requests_total{service="checkout"}[5m])) > 1
  for: 2m
  labels: { severity: page, slo: availability-99.9-30d, team: payments }
  annotations:
    summary: "Checkout burning error budget at >14.4x in {{ $labels.env }}"
    runbook_url: "https://runbooks.example.com/checkout/slo-fast-burn"

Two production details: the sum(rate(...)) > 1 gate (more than one request per second) suppresses spurious 100% ratios in low-traffic windows, where a single failed request can dominate the ratio; latency SLOs reuse the same machinery with a histogram le-bucket numerator.

Animation B — Multi-Window Multi-Burn-Rate (MWMBR)
[Animation: three burn-rate gauges. The 5-minute window (fast detection) and the 1-hour window (confirmation) each show observed burn against the 14.4× threshold; a 6-hour window acts as the slow-burn catcher. A page fires only when both the 5m and 1h windows are above 14.4×.]

A real outage spikes both the short and long windows past the threshold simultaneously. A transient blip only spikes the 5m window, so it never pages. The 6h window catches slow erosion that the short windows would miss.
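The two-window agreement rule can be sketched as a small decision function. This is an illustrative model of the MWMBR table in this section, not how Prometheus evaluates rules: the `Tier` type, the `TIERS` list, and `first_firing_tier` are hypothetical names, and the input is assumed to be a mapping from window label to the observed error ratio for that window.

```python
from typing import Mapping, NamedTuple, Optional


class Tier(NamedTuple):
    name: str
    short_window: str
    long_window: str
    burn_rate: float
    severity: str


# Tiers copied from the MWMBR table, checked from fastest to slowest.
TIERS = [
    Tier("fast-burn", "5m", "1h", 14.4, "page"),
    Tier("medium-burn", "30m", "6h", 6.0, "page"),
    Tier("slow-burn", "2h", "24h", 3.0, "ticket"),
    Tier("slowest-burn", "6h", "3d", 1.0, "ticket"),
]


def first_firing_tier(error_ratios: Mapping[str, float],
                      slo: float) -> Optional[Tier]:
    """Return the first tier whose short AND long windows both exceed
    burn_rate * error_budget, or None if no tier fires."""
    budget = 1.0 - slo
    for tier in TIERS:
        threshold = tier.burn_rate * budget
        short = error_ratios.get(tier.short_window, 0.0)
        long_ = error_ratios.get(tier.long_window, 0.0)
        if short > threshold and long_ > threshold:
            return tier
    return None


if __name__ == "__main__":
    blip = {"5m": 0.05, "1h": 0.002}    # short spike only
    outage = {"5m": 0.05, "1h": 0.02}   # both windows elevated
    print(first_firing_tier(blip, slo=0.999))
    print(first_firing_tier(outage, slo=0.999))
```

The blip leaves the 1h window below 14.4 × 0.001, so nothing fires; the outage crosses the threshold in both windows and maps to the fast-burn page.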

Figure 12.2: MWMBR Alert Escalation

[Diagram: a decision cascade over the observed error ratio (recording rules for 5m, 1h, 6h, 3d). If 5m AND 1h are above 14.4x burn: fast-burn PAGE (~2 days to exhaustion). Otherwise, if 30m AND 6h are above 6x burn: medium-burn PAGE (~5 days). Otherwise, if 2h AND 24h are above 3x burn: slow-burn TICKET (~10 days). Otherwise, if 6h AND 3d are above 1x burn: slowest-burn TICKET (exhaustion within the window). Otherwise: no alert, budget healthy.]

Key Takeaway — Section 1

2. Alerting Architecture

A well-tuned SLO alert is wasted if it lands in a flood of noise at 3 a.m. The alerting architecture (how alerts route, group, inhibit, and reach humans) decides whether on-call is sustainable.

Alertmanager: Grouping, Routing, Inhibition

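route:
  receiver: 'default-slack'   # fallback when no child route matches
  group_by: ['alertname', 'service', 'severity', 'env']
  group_wait: 60s             # wait before the first notification for a new group
  group_interval: 10m         # wait before notifying about new alerts in an existing group
  repeat_interval: 4h         # re-send interval while a group keeps firing
  routes:
    # Assumes paging alerts carry severity="critical". The fast-burn SLO example
    # in Section 1 uses severity: page, so align the label or widen this matcher.
    - matchers: [ severity="critical" ]
      routes:
        - matchers: [ env="prod" ]
          receiver: 'pagerduty-prod'
          continue: true
          routes:
            - matchers: [ team="payments" ]
              receiver: 'pagerduty-payments'

inhibit_rules:
  # While a node is down, suppress per-pod and per-instance symptom alerts
  # that share the same node and env labels.
  - source_matchers: [ severity="critical", alertname="KubernetesNodeDown" ]
    target_matchers: [ alertname=~"KubePodCrashLooping|KubePodNotReady|InstanceDown" ]
    equal: ['node', 'env']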

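For HA, run three Alertmanager replicas with gossip clustering and point Prometheus at all of them. The replicas share silences and notification state, so one can restart without losing pages, and the cluster deduplicates notifications instead of sending one per replica.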
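To make the inhibition rule above concrete, here is a small Python model of its matching logic. It is a sketch of the semantics as described in this section, not Alertmanager's implementation: alerts are modeled as plain label dictionaries, the helper names are hypothetical, and a missing label is assumed to compare equal to an empty one.

```python
import re
from typing import Dict, List

Alert = Dict[str, str]  # an alert modeled as its label set

# Mirrors the inhibit rule in the config above.
SOURCE_MATCH = {"severity": "critical", "alertname": "KubernetesNodeDown"}
TARGET_ALERTNAME = re.compile(
    r"KubePodCrashLooping|KubePodNotReady|InstanceDown")
EQUAL_LABELS = ["node", "env"]


def is_source(alert: Alert) -> bool:
    """True if the alert matches every source matcher."""
    return all(alert.get(k) == v for k, v in SOURCE_MATCH.items())


def is_target(alert: Alert) -> bool:
    """True if the alertname fully matches the target regex."""
    return TARGET_ALERTNAME.fullmatch(alert.get("alertname", "")) is not None


def is_inhibited(alert: Alert, firing: List[Alert]) -> bool:
    """A target alert is suppressed when some firing source alert agrees
    with it on every label listed under `equal`."""
    if not is_target(alert):
        return False
    return any(
        is_source(src)
        and all(src.get(lbl, "") == alert.get(lbl, "") for lbl in EQUAL_LABELS)
        for src in firing
    )


if __name__ == "__main__":
    node_down = {"alertname": "KubernetesNodeDown", "severity": "critical",
                 "node": "n1", "env": "prod"}
    pod_n1 = {"alertname": "KubePodNotReady", "node": "n1", "env": "prod"}
    pod_n2 = {"alertname": "KubePodNotReady", "node": "n2", "env": "prod"}
    print(is_inhibited(pod_n1, [node_down]), is_inhibited(pod_n2, [node_down]))
```

The pod alert on the failed node is suppressed, while the pod alert on a healthy node still notifies, which is the point of the `equal` list.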

On-Call Ergonomics

Every page must be actionable: the on-call engineer should know within 60 seconds what to do. The hard rule is no page without a runbook URL. A runbook entry answers:

  1. How do I confirm this is real? (PromQL, dashboard, log query)
  2. What is the most likely cause?
  3. What are the safe remediation steps?
  4. When do I escalate, and to whom?
  5. What is the rollback criterion?
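The "no page without a runbook URL" rule lends itself to an automated check. The sketch below is one possible lint, under stated assumptions: rules are already parsed into dictionaries shaped like the alert rule in Section 1, paging severities are taken to be `page` and `critical`, and the function name is hypothetical.

```python
from typing import Any, Dict, List

# Assumption: these severity values are the ones that reach a pager.
PAGING_SEVERITIES = {"page", "critical"}


def rules_missing_runbook(rules: List[Dict[str, Any]]) -> List[str]:
    """Return names of paging alert rules with no runbook_url annotation."""
    missing = []
    for rule in rules:
        if "alert" not in rule:
            continue  # recording rules have no alert name and never page
        severity = rule.get("labels", {}).get("severity")
        runbook = rule.get("annotations", {}).get("runbook_url", "")
        if severity in PAGING_SEVERITIES and not runbook.strip():
            missing.append(rule["alert"])
    return missing


if __name__ == "__main__":
    rules = [
        {"alert": "SLOErrorBudgetBurnFast",
         "labels": {"severity": "page"},
         "annotations": {"runbook_url":
                         "https://runbooks.example.com/checkout/slo-fast-burn"}},
        {"alert": "NoRunbookYet",
         "labels": {"severity": "page"},
         "annotations": {"summary": "pages without guidance"}},
    ]
    print(rules_missing_runbook(rules))
```

Run in CI, a non-empty result would fail the build before a page without guidance ever reaches on-call.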

Anti-Patterns

Figure 12.3: Alertmanager Routing with Grouping and Inhibition

[Diagram: Prometheus alert rules fire into an Alertmanager HA cluster (x3), which groups by alertname, service, severity, and env, then applies inhibition rules: symptom alerts are dropped while a root-cause alert is active. Remaining alerts route by severity. Critical alerts branch on env: prod alerts go by team to the payments or platform PagerDuty rotation, each page carrying a runbook URL, dashboard link, and silence controls; staging and dev alerts go to Slack #nonprod. Warning alerts go to Slack #alerts-warnings (repeat 12h) and info alerts to Slack #alerts-info (repeat 24h).]

Key Takeaway — Section 2

Post-Quiz — Half 1 (Sections 1 & 2)

1. For a service with a 99.9% availability SLO measured over a 30-day window, approximately how much "bad" time is the error budget?

A. About 4.3 minutes per month
B. About 43 minutes per month
C. About 7.2 hours per month
D. About 30 minutes per month (one per day)

2. A good Service Level Indicator (SLI) is best expressed as:

A. The raw count of 5xx responses returned by the service
B. A snapshot of CPU and memory utilization on the host
C. A ratio of "good events" to "valid events" from the user's perspective
D. The wall-clock time since the last deploy

3. If your service is consuming its error budget at a burn rate of B = 14.4 against a 30-day SLO window, roughly how long until the budget is fully exhausted?

A. ~2 days
B. ~14 hours
C. ~30 days (by definition)
D. Already gone — 14.4 is a per-minute exhaustion rate

4. Why does the canonical MWMBR alert pair a short (e.g. 5m) window with a long (e.g. 1h) window?

A. The short window fires fast; the long window prevents one-off blips from paging anyone
B. Prometheus cannot evaluate a single window faster than 1h
C. Alertmanager requires two windows for HA quorum
D. It doubles the recording-rule cost intentionally so the alert is more sensitive

5. Which of the following is the most common anti-pattern in alerting that this chapter warns against?

A. Linking a runbook URL on every page
B. Paging on raw causes (CPU > 90%) instead of on user-visible symptoms
C. Using a for: clause on alert rules
D. Inhibiting per-pod alerts when a node-down alert is firing
Pre-Quiz — Half 2 (Sections 3 & 4)

6. In the six-layer reference architecture, what is "meta-observability"?

A. A second Grafana dashboard for business KPIs
B. A small, segregated Prometheus + Alertmanager that monitors the observability stack itself
C. The OTel Collector's internal metrics endpoint
D. A vendor-specific term for trace metrics

7. Which Prometheus meta-metric tracks the #1 outage cause for the observability stack itself?

A. prometheus_target_scrape_pool_sync_total failures
B. prometheus_rule_evaluation_duration_seconds
C. prometheus_tsdb_head_series — cardinality runaway
D. up{job="prometheus"} flipping to 0

8. The chapter describes migrating from Prometheus-only stacks as "wrap, don't replace." What does that mean in practice?

A. Wrap every Prometheus alert in a Python script before sending it to PagerDuty
B. Keep Prometheus where it works; add OTel Collectors as the entry point for new signals (using prometheusremotewrite and the prometheus receiver)
C. Tunnel Prometheus traffic through a service mesh sidecar before scraping
D. Rip out Prometheus and replace it with the OTel Collector entirely on day one

9. What unique question does continuous profiling answer that metrics, logs, and traces cannot?

A. "Is something wrong?"
B. "Where in the request path is the slowness?"
C. "Exactly which code is consuming resources, and how did it change?"
D. "What error message did the user see?"

10. Why does the chapter argue that investing in OTel semantic-convention conformance pays off strategically?

A. It's required by Prometheus 3.x or scraping silently breaks
B. It shrinks vendor lock-in, enables fleet-wide queries, and makes AI-assisted RCA more useful
C. It reduces Prometheus memory usage by ~50% per series
D. It is the only way to enable inhibition rules in Alertmanager

3. Putting It All Together

You have all the pieces. The job now is integration — a platform that holds together at scale, can roll out incrementally, and is itself observable.

Reference Architecture — Six Logical Layers

  1. Instrumentation — OTel SDKs + auto-instrumentation, emitting OTLP for metrics, traces, logs, and profiles.
  2. Collection — OTel Collector DaemonSets + fan-in Deployments; Prometheus via the Operator for scrape targets (kube-state-metrics, node-exporter, cAdvisor).
  3. Storage — Prometheus TSDB (short-term) → Thanos/Mimir/Cortex (long-term); Tempo/Jaeger for traces; Loki/Elasticsearch for logs; Pyroscope/Parca for profiles.
  4. Rules & alerting — recording + alerting rules, Alertmanager HA, runbook hosting.
  5. Visualization — Grafana with datasources across every signal store.
  6. Meta-observability — a small, segregated Prometheus + Alertmanager whose only job is monitoring the observability stack itself.

Figure 12.4: Reference Observability Platform on Kubernetes

[Diagram: six layers, left to right. 1. Instrumentation: application pods with OTel SDKs and auto-instrumentation. 2. Collection: OTel Collector DaemonSet, OTel Collector fan-in Deployment, Prometheus Operator with ServiceMonitors. 3. Storage: Prometheus TSDB, Thanos / Mimir, Tempo / Jaeger, Loki, Pyroscope / Parca. 4. Rules & Alerting: recording rules, MWMBR alerting rules, Alertmanager HA. 5. Visualization: Grafana. 6. Meta-Observability: watchdog Prometheus + Alertmanager. Flows: application pods feed the DaemonSet, the fan-in Deployment, and Prometheus; the DaemonSet writes to the TSDB, Tempo / Jaeger, and Loki; the fan-in Deployment writes to Thanos / Mimir; Prometheus writes to the TSDB, which feeds Thanos / Mimir and the recording rules; recording rules feed alerting rules, which feed Alertmanager; Grafana reads from every store and from Alertmanager; the watchdog watches layers 2, 3, and 4.]

Greenfield Rollout

| Week | Action | Outcome |
| --- | --- | --- |
| 1 | Deploy kube-prometheus-stack with default rules | Cluster metrics, basic alerts, Grafana |
| 2 | OTel Collector DaemonSet → Prometheus + Tempo | First auto-instrumented traces |
| 3 | Define 2–3 critical SLOs with MWMBR alerts | First user-anchored pages |
| 4 | Alertmanager routing tree + inhibition + runbooks | Reduced noise; team ownership |
| 5–6 | Long-term store via remote-write | Multi-month retention |
| 7–8 | Logs (Loki) + continuous profiling (Pyroscope) | Full four-signal stack |

Migration from Prometheus-only stacks follows the "wrap, don't replace" rule. Keep Prometheus where it works (scrape-based infra, alerting); add OTel Collectors as the entry point for new signals. The Collector's prometheusremotewrite exporter ships OTel metrics into Prometheus/Mimir, and the prometheus receiver scrapes existing exporters — a unified pipeline without a forklift migration.

Meta-Observability: Capacity Planning the Platform

| Component | Meta-metric | Why it matters |
| --- | --- | --- |
| Prometheus | prometheus_rule_evaluation_duration_seconds | Rule lag = late alerts |
| Prometheus | prometheus_tsdb_head_series | Cardinality runaway = #1 outage cause |
| Alertmanager | alertmanager_notifications_failed_total | Pages may not reach humans |
| Alertmanager | alertmanager_cluster_members | HA quorum lost → duplicates |
| OTel Collector | otelcol_exporter_queue_size | Backpressure → telemetry loss |
| OTel Collector | otelcol_processor_dropped_spans | Sampling working too aggressively |

Sizing rules of thumb:

Key Takeaway — Section 3

4. Where Observability Is Heading

The fundamentals — SLOs, MWMBR, Alertmanager hygiene — have been stable for nearly a decade. The interesting changes are at the edges.

Profiles as a Fourth Signal

Three pillars (metrics, logs, traces) become four. Continuous profiling — always-on stack-trace sampling at 50–200 Hz with 1–3% overhead — is the fourth.

| Signal | Question |
| --- | --- |
| Metrics | Is something wrong? |
| Logs | What happened? |
| Traces | Where in the request path? |
| Profiles | Exactly which code is consuming resources? |

Example: metrics show p99 rising from 200 → 500 ms after deploy. Traces narrow it to CalculateDiscounts spans. Profiles reveal 40% of CPU is in a new calculate_rewards_v2 function, dominated by hashmap operations. Without profiles you have a suspect; with profiles you have the exact lines of code.

Grafana Pyroscope (Prometheus-like labels, multi-language SDKs, eBPF) and Parca / Polar Signals (Kubernetes-native eBPF DaemonSet) lead the OSS space. The maturing OpenTelemetry profiling signal adds profiles as a first-class OTLP signal sharing resource attributes and trace/span IDs.

Animation C — The Four Pillars Converge Under OTel
[Animation: four pillars labeled Metrics (Is something wrong?), Logs (What happened?), Traces (Where in the request path?), and Profiles (Which lines of code?, marked NEW), resting on OpenTelemetry Semantic Conventions: service.name, http.response.status_code, k8s.pod.name.]

Three pillars stand. Profiles slides in from the right as the fourth signal — answering "exactly which lines of code?" with flame-graph data. All four converge under shared OTel semantic-convention attributes that make signals portable across vendors and analyzable by AI assistants.

AI-Assisted Root Cause Analysis

Vendors and OSS projects are building systems that correlate signals, summarize hypotheses, and suggest queries from runbooks.

Two prerequisites: OTel semantic conventions for vocabulary, and structured runbooks for remediation playbooks. This is not a replacement for engineering judgment — it removes the manual query-writing tax during incidents.

Semantic Convention Convergence

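The longest-running success of OTel is that vendors converge on the same names: service.name, http.response.status_code, db.system, k8s.pod.name, deployment.environment. Practical implications: less vendor lock-in, fleet-wide queries that work across services, and more useful AI-assisted RCA.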

Strategic takeaway: invest in conformance now. Add CI checks that reject non-standard attribute names; require OTel resource attributes via the Collector's resourcedetection processor.
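A CI check of the kind suggested here could start as small as the sketch below. The allowlist is only the handful of attribute names mentioned in this chapter and stands in for a fuller registry; the function name and the sample attributes are hypothetical.

```python
from typing import Any, List, Mapping

# Assumption: a tiny allowlist built from the names used in this chapter.
# A real check would load a much fuller set of approved attribute names.
ALLOWED_ATTRIBUTES = frozenset({
    "service.name",
    "http.response.status_code",
    "db.system",
    "k8s.pod.name",
    "deployment.environment",
})


def non_standard_attributes(attributes: Mapping[str, Any]) -> List[str]:
    """Return attribute names that are not on the allowlist, sorted."""
    return sorted(name for name in attributes if name not in ALLOWED_ATTRIBUTES)


if __name__ == "__main__":
    span_attributes = {
        "service.name": "checkout",
        "http_status": 500,   # ad hoc name; the allowlist has http.response.status_code
        "env": "prod",        # ad hoc name; the allowlist has deployment.environment
    }
    offenders = non_standard_attributes(span_attributes)
    if offenders:
        raise SystemExit(f"non-standard attribute names: {offenders}")
```

Exiting non-zero is what lets a CI job reject the change; the same function could also run against sampled telemetry rather than source code.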

Figure 12.5: The Four Pillars and the Path Forward

[Diagram: the four signals, Metrics (Is something wrong? Prometheus, OTLP), Logs (What happened? Loki, Elasticsearch), Traces (Where in the request path? Tempo, Jaeger), and Profiles (Which lines of code? Pyroscope, Parca), all feed OTel Semantic Conventions (service.name, http.response.status_code, k8s.pod.name, deployment.environment). These feed AI-Assisted RCA (correlate signals, summarize hypotheses, suggest queries from runbooks), which leads to incidents with faster MTTD and MTTR, portable across vendors.]

Key Takeaway — Section 4

Post-Quiz — Half 2 (Sections 3 & 4)

6. In the six-layer reference architecture, what is "meta-observability"?

A. A second Grafana dashboard for business KPIs
B. A small, segregated Prometheus + Alertmanager that monitors the observability stack itself
C. The OTel Collector's internal metrics endpoint
D. A vendor-specific term for trace metrics

7. Which Prometheus meta-metric tracks the #1 outage cause for the observability stack itself?

A. prometheus_target_scrape_pool_sync_total failures
B. prometheus_rule_evaluation_duration_seconds
C. prometheus_tsdb_head_series — cardinality runaway
D. up{job="prometheus"} flipping to 0

8. The chapter describes migrating from Prometheus-only stacks as "wrap, don't replace." What does that mean in practice?

A. Wrap every Prometheus alert in a Python script before sending it to PagerDuty
B. Keep Prometheus where it works; add OTel Collectors as the entry point for new signals (using prometheusremotewrite and the prometheus receiver)
C. Tunnel Prometheus traffic through a service mesh sidecar before scraping
D. Rip out Prometheus and replace it with the OTel Collector entirely on day one

9. What unique question does continuous profiling answer that metrics, logs, and traces cannot?

A. "Is something wrong?"
B. "Where in the request path is the slowness?"
C. "Exactly which code is consuming resources, and how did it change?"
D. "What error message did the user see?"

10. Why does the chapter argue that investing in OTel semantic-convention conformance pays off strategically?

A. It's required by Prometheus 3.x or scraping silently breaks
B. It shrinks vendor lock-in, enables fleet-wide queries, and makes AI-assisted RCA more useful
C. It reduces Prometheus memory usage by ~50% per series
D. It is the only way to enable inhibition rules in Alertmanager

Chapter Summary & Key Terms

You've crossed the bridge from "we collect telemetry" to "we operate to a contract." SLIs measure user-visible behavior as good/valid ratios. SLOs are targets on SLIs; the complement, 1 − SLO, is the error budget. MWMBR alerts page only when the budget is genuinely at risk. Alertmanager turns alerts into humane notifications via grouping, severity-aware routing, inhibition, runbook-linked annotations, and HA clustering. A reference architecture layers instrumentation, collection, storage, rules, visualization, and meta-observability. The next frontier is continuous profiling, AI-assisted RCA, and semantic-convention convergence, all of which plug into the OTel foundation you build today.

| Term | Definition |
| --- | --- |
| SLI | Service Level Indicator: a ratio of good to valid events reflecting user experience. |
| SLO | Service Level Objective: target value/range for an SLI over a defined window. |
| Error budget | Allowed "bad" behavior, computed as 1 − SLO. ~43 min/mo for 99.9%. |
| Burn rate | Ratio of observed error fraction to error budget; B = 14.4 exhausts a 30-day budget in ~2 days. |
| MWMBR | Multi-window multi-burn-rate: short + long windows must both cross to fire. |
| Inhibition rule | Alertmanager rule suppressing symptom alerts while a root-cause alert is firing. |
| Runbook | Operational doc linked from every page; confirm, cause, remediate, escalate, rollback. |
| Meta-observability | Monitoring the observability platform itself via a segregated stack. |
| Continuous profiling | Always-on stack-trace sampling at 50–200 Hz with 1–3% overhead; the fourth signal. |
| OTel semantic conventions | Standardized attribute names enabling cross-vendor portability and AI tooling. |
