Chapter 1: Foundations of Cloud-native Observability

Part 1 — From Monitoring to Observability & The Three Pillars

Pre-Reading Quiz — Part 1

1. A team has solid dashboards for CPU, error rate, and p99 latency, yet a novel bug only appears when tenant_type="enterprise" hits a specific endpoint on a single deployment version. Why is observability (not just better monitoring) the right answer?

a) Dashboards always undersample CPU data so the team should switch to per-second collection.
b) Observability lets engineers slice telemetry by arbitrary high-cardinality labels and pivot across signals to surface unknown-unknowns.
c) Monitoring only works when there are fewer than 10 services in production.
d) Observability replaces alerts with machine learning that automatically files incidents.

2. Which best characterizes the practical investigative path observability enables during an incident?

a) Alert → runbook → reboot the affected node.
b) Metric anomaly → pivot to matching traces → drill into the specific log lines for the failing span.
c) Open a ticket, wait for the next standup, and assign it to the responsible team.
d) Search every log line in the cluster for the word "error".

3. Why would adding a user_id label to a Prometheus counter like http_requests_total be a poor design choice?

a) Prometheus cannot store labels longer than 8 characters.
b) It introduces unbounded high cardinality, causing each user to produce a new time series and risking memory exhaustion on the Prometheus server.
c) User IDs change at every deployment, so the metric becomes inaccurate.
d) Prometheus requires user IDs to be hashed before scraping.

4. Two engineers are debating: "We have logs and metrics — why do we still need traces?" What is the strongest argument for adding distributed tracing?

a) Traces replace logs at lower storage cost.
b) Traces natively capture the causal structure of a request across multiple services, which neither metrics nor logs reconstruct by themselves.
c) Traces are required for any Prometheus scrape configuration to work.
d) Traces are the only way to graph CPU usage.

5. A latency histogram bucket shows a tall bar at the 2-second mark for thousands of requests. How do exemplars help an SRE here?

a) They reduce the histogram's bucket count to make queries faster.
b) They link that specific metric data point to one concrete trace that contributed to the bucket, enabling a one-click jump from "the chart looks bad" to a representative slow trace.
c) They convert metrics into logs automatically.
d) They guarantee 100% trace sampling for slow requests.

1. From Monitoring to Observability

Key Points

Observability originated in control theory: the degree to which a system's internal state can be inferred from its external outputs. Modern software borrows the idea — a system is observable when its telemetry (metrics, logs, traces) is rich enough that engineers can reconstruct what is happening inside it, even for situations no one anticipated.

This contrasts with monitoring, which is fundamentally about checking known signals against predefined thresholds. Monitoring answers the questions you wrote down ahead of time: "Is checkout-service returning too many 5xx errors?" or "Is CPU on node-3 above 90%?" It is a tripwire built around failure modes you already understand.

Known unknowns are failure modes you can name (exhausted database connections, a full disk, CrashLoopBackOff). They are perfect targets for pre-built alerts. Unknown unknowns are the novel combinations no one imagined. Consider sporadic 500s that occur only during peak traffic, only at /checkout/card, only on v2, and only for tenant_type="enterprise". With pure monitoring you see only a vague error spike. With observability you slice metrics by labels, pivot to traces, then jump to the logs for the failing spans, and uncover a misconfigured PAYMENT_TIMEOUT=200ms in v2.
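The slicing step can be illustrated with a minimal sketch. It assumes request events are available as plain dictionaries; the records, field names, and the `error_rate_by` helper are hypothetical and exist only for this example. Grouping by a single label dilutes the failure, while grouping by the full label combination isolates it.

```python
from collections import Counter

# Hypothetical request records; every field name and value is illustrative.
requests = [
    {"endpoint": "/checkout/card", "version": "v2", "tenant_type": "enterprise", "status": 500},
    {"endpoint": "/checkout/card", "version": "v2", "tenant_type": "enterprise", "status": 500},
    {"endpoint": "/checkout/card", "version": "v2", "tenant_type": "free", "status": 200},
    {"endpoint": "/checkout/card", "version": "v1", "tenant_type": "enterprise", "status": 200},
    {"endpoint": "/cart", "version": "v2", "tenant_type": "enterprise", "status": 200},
    {"endpoint": "/cart", "version": "v1", "tenant_type": "free", "status": 200},
]


def error_rate_by(records, *labels):
    """Return {tuple of label values: fraction of records with a 5xx status}."""
    totals, errors = Counter(), Counter()
    for record in records:
        key = tuple(record[label] for label in labels)
        totals[key] += 1
        if record["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# One label: v2 merely looks "somewhat worse" than v1.
by_version = error_rate_by(requests, "version")

# Three labels: a single slice stands out with every request failing.
by_slice = error_rate_by(requests, "endpoint", "version", "tenant_type")
worst = max(by_slice, key=by_slice.get)
```

A real observability backend performs this grouping over stored telemetry rather than an in-memory list, but the shape of the question is the same: pick the labels at query time, not at dashboard-design time.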

Dashboards summarize telemetry into a small number of pre-chosen charts. That works when the questions are stable. In a microservice deployment with 30 services, 5 in-flight versions, dozens of tenants, multiple regions, and Kubernetes attributes (namespace, deployment, pod, node), the meaningful dimensional space runs into the millions. No fixed dashboard can pre-render every slice.

Figure 1.1: Monitoring dashboards vs. observability query interfaces

flowchart LR
    subgraph Monitoring["Monitoring (pre-built dashboards)"]
        direction TB
        G1["Gauge: CPU %"]
        G2["Gauge: Error rate"]
        G3["Gauge: Latency p99"]
        G4["Fixed thresholds<br/>and alerts"]
    end
    subgraph Observability["Observability (ad-hoc query interface)"]
        direction TB
        Q["Free-form query<br/>by service, version,<br/>tenant, region, endpoint"]
        Q --> R1["Slice metrics"]
        Q --> R2["Pivot to traces"]
        Q --> R3["Drill into logs"]
    end
    Monitoring -. "answers known questions" .-> Known["Known unknowns"]
    Observability -. "discovers new questions" .-> Unknown["Unknown unknowns"]

Animation 1.1 — Monitoring vs. Observability: Surfacing Unknown-Unknowns

Monitoring dashboard panels: CPU %, Error rate, p99 latency, Memory %
Unknown-unknowns, surfaced by the observability query: tenant_type="enterprise" & endpoint="/checkout/card" & version="v2"

Monitoring shows the known charts (left). The novel failure (right) hides among combinations no dashboard pre-renders — until observability lets you query the right slice.

2. The Three Pillars: Metrics, Logs, and Traces

Key Points

The three pillars are complementary because each handles cardinality, cost, and temporal granularity differently. Metrics are cheap, summarizable, ideal for alerting; a typical sample is http_requests_total{service="api",status="500"}. Their weakness is aggregation — you trade per-event detail for compactness. Logs are discrete events, usually structured JSON, that capture exact parameter values and stack traces; their weakness is storage and indexing volume. Traces are made of spans — units of work in one service with parent-child links — propagated via headers like W3C traceparent; their weakness is also volume, which is why traces are usually sampled.

| Signal | Best for | Cardinality tolerance | Typical cost driver |
| --- | --- | --- | --- |
| Metrics | Alerting, trends, SLOs | Low | Number of time series |
| Logs | Detailed per-event context | High | Storage and indexing volume |
| Traces | Cross-service request flow | High | Sample rate and span count |

Figure 1.2: The three pillars and their connective tissue

flowchart TD
    Obs["Observability"]
    Obs --> M["Metrics<br/>aggregated numeric<br/>time series"]
    Obs --> L["Logs<br/>discrete structured<br/>events"]
    Obs --> T["Traces<br/>causal request flow<br/>across services"]
    M -- "exemplars<br/>(metric point to trace)" --> T
    T -- "trace_id in log lines" --> L
    L -- "span_id back to trace" --> T
    M -. "resource attributes<br/>(k8s.namespace, service, version)" .- L
    M -. "resource attributes" .- T
    L -. "resource attributes" .- T

The leap from "three pillars" to "one fabric" happens when signals are correlated. Exemplars attach a trace ID to individual metric data points, linking a histogram bucket to one concrete trace that landed in it. Trace IDs embedded in structured logs let you jump from any log line to the full trace, and from any span back to that span's logs. Resource attributes (k8s.namespace.name, k8s.deployment.name) tie all three pillars to the underlying Kubernetes workload.
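To make the exemplar idea concrete, here is a toy sketch of a latency histogram that remembers one trace ID per bucket. The `ExemplarHistogram` class is hypothetical and is not how any particular client library stores exemplars. It also uses non-cumulative buckets for simplicity, whereas Prometheus histogram buckets are cumulative.

```python
import bisect


class ExemplarHistogram:
    """Toy latency histogram that keeps one exemplar trace ID per bucket."""

    def __init__(self, upper_bounds):
        # A final +Inf bucket catches observations above the largest bound.
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.counts = [0] * len(self.upper_bounds)
        self.exemplars = [None] * len(self.upper_bounds)

    def observe(self, seconds, trace_id):
        # Index of the first bucket whose upper bound is >= the observation.
        i = bisect.bisect_left(self.upper_bounds, seconds)
        self.counts[i] += 1
        self.exemplars[i] = trace_id  # the most recent trace wins

    def exemplar_for(self, upper_bound):
        """Return the trace ID remembered for the bucket with this bound."""
        return self.exemplars[self.upper_bounds.index(upper_bound)]


hist = ExemplarHistogram([0.5, 1.0, 2.0])
hist.observe(0.12, "0af7651916cd43dd8448eb211c80319c")
hist.observe(1.80, "4bf92f3577b34da6a3ce929d0e0e4736")

# "The 2-second bucket looks bad": jump straight to a trace that landed in it.
slow_trace = hist.exemplar_for(2.0)
```

Keeping only one exemplar per bucket is what keeps the cost low: the metric stays aggregated, and the trace ID is a pointer rather than a copy of the request detail.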

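Cardinality, the number of unique label combinations, is the most important operational concept here. A Prometheus counter labeled http_requests_total{service, endpoint, status, pod} with 10 services × 50 endpoints × 5 statuses × 200 pods can produce up to 500,000 time series. Add user_id and the count multiplies by the number of users, quickly reaching many millions and risking OOM crashes. The rule: keep low-cardinality labels in metrics, put high-cardinality detail in logs and traces, and sample traces. Head-based sampling is cheap but decides before the outcome is known, so it can drop the very error traces you need. Tail-based sampling decides after the trace completes and can be configured to keep every error and slow trace, at the cost of buffering spans until the decision is made.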
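The cardinality arithmetic can be checked with a few lines. This is a rough upper-bound sketch: it assumes every label value can co-occur with every other, which overstates real deployments (a pod serves only one service). The user count of 100,000 is a made-up figure for illustration.

```python
from math import prod


def max_series(label_values):
    """Upper bound on time series: product of distinct values per label."""
    return prod(label_values.values())


labels = {"service": 10, "endpoint": 50, "status": 5, "pod": 200}
baseline = max_series(labels)  # 10 * 50 * 5 * 200

# Hypothetical: 100,000 distinct users added as a label.
with_users = max_series({**labels, "user_id": 100_000})
```

Each additional label multiplies the bound rather than adding to it, which is why a single unbounded label does more damage than several small ones.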

Animation 1.2 — Three Pillars and Their Correlation Links

Metrics: Low-cardinality time series. Cheap, summarizable, ideal for alerts and SLOs.

Logs: High-cardinality discrete events. Capture exact parameters, exceptions, business context.

Traces: Causal end-to-end request flow. The only signal that natively captures distributed structure.

Correlation links: exemplars (metrics↔traces), trace IDs in logs (logs↔traces), and shared resource attributes binding all three.

Part 2 — Cloud-native Operational Context & The CNCF Landscape

Pre-Reading Quiz — Part 2

6. A dashboard is keyed by individual pod name. After a rolling deployment the panels are full of orphaned series. What is the correct cloud-native fix?

a) Disable rolling deployments so pod names remain stable.
b) Aggregate at the workload level using stable labels like service, namespace, and version, and use Kubernetes service discovery for scrape targets.
c) Increase the metrics retention window to 1 year.
d) Move all metric collection into application logs.

7. Why are traditional in-process APM agents insufficient for a polyglot microservices architecture?

a) They cannot read JSON payloads.
b) They are tightly coupled to specific language runtimes and cannot reliably stitch a request together across services, languages, and message-queue hops.
c) They emit too few metrics per minute.
d) They require Kubernetes to be installed on every developer laptop.

8. The application metrics for checkout-service look fine, but users report intermittent failures. The team uses Istio. What observability blind spot is the most likely culprit?

a) PromQL queries are case-sensitive.
b) The Envoy sidecar's mesh-level behavior (retries, mTLS failures, outlier detection) is invisible because only application metrics are being scraped.
c) The pods are running on the wrong node taint.
d) The application is not using a service account.

9. Which best describes the complementary roles of Prometheus and OpenTelemetry in a 2025 CNCF stack?

a) They are direct competitors and a team must pick exactly one.
b) OpenTelemetry handles how telemetry is produced and routed (SDKs, conventions, Collector, OTLP); Prometheus handles where metrics live and how they are queried and alerted on (TSDB, PromQL, Alertmanager).
c) OpenTelemetry only exists for traces; Prometheus only exists for logs.
d) Both store traces, metrics, and logs and one is chosen at random.

10. Why does OpenTelemetry's vendor neutrality matter operationally?

a) It guarantees the Collector is faster than every commercial agent.
b) Instrumentation is decoupled from backend choice, so switching or combining vendors becomes a Collector configuration change rather than an application-code rewrite.
c) It blocks commercial vendors from accepting OTLP data.
d) It forces all telemetry to be stored in a single backend.

3. Cloud-native Operational Context

Key Points

Traditional monitoring assumed long-lived hosts with stable identities. Kubernetes inverts that: pods are created and destroyed in seconds due to autoscaling, rolling deployments, crash loops, and probe failures. A dashboard keyed by pod name becomes unreadable; historical series for a specific pod are meaningless once it's gone; alerts firing on a pod that just terminated produce 404s when an operator clicks through.

Figure 1.3: Rolling deployment re-keys metrics from pods to the Deployment

flowchart LR
    subgraph T0["t=0: v1 steady state"]
        P1A["pod checkout-v1-a"]
        P1B["pod checkout-v1-b"]
        P1C["pod checkout-v1-c"]
    end
    subgraph T1["t=1: rolling update"]
        P1B2["pod checkout-v1-b"]
        P2A["pod checkout-v2-a"]
        P2B["pod checkout-v2-b"]
    end
    subgraph T2["t=2: v2 steady state"]
        P2A2["pod checkout-v2-a"]
        P2B2["pod checkout-v2-b"]
        P2C["pod checkout-v2-c"]
    end
    T0 --> T1 --> T2
    T0 --> Q["Stable query:<br/>sum by (service, version)<br/>(rate(http_requests_total[5m]))"]
    T1 --> Q
    T2 --> Q
    Q --> Dash["Continuous series<br/>keyed by Deployment +<br/>version, not pod name"]

The cloud-native fix: treat workloads as the unit of observability. Aggregate across all pods in a Deployment using service, namespace, and version. Use Prometheus's Kubernetes service discovery to find scrape targets dynamically. Send container logs via a node-level DaemonSet collector (Fluent Bit, Vector) to a centralized store so records survive the pod.
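Workload-level aggregation amounts to dropping the pod label and summing what remains. The sketch below mimics a `sum by (service, version)` aggregation over a handful of per-pod samples taken mid-rollout. The sample values and the `sum_by` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-pod request rates (requests/second) during a rolling update.
samples = [
    ({"service": "checkout", "version": "v1", "pod": "checkout-v1-b"}, 40.0),
    ({"service": "checkout", "version": "v2", "pod": "checkout-v2-a"}, 35.0),
    ({"service": "checkout", "version": "v2", "pod": "checkout-v2-b"}, 25.0),
]


def sum_by(samples, *keep):
    """Sum sample values, grouping only by the labels named in `keep`."""
    grouped = defaultdict(float)
    for labels, value in samples:
        grouped[tuple(labels[name] for name in keep)] += value
    return dict(grouped)


# Pod names vanish from the result, so the series survives pod churn.
per_workload = sum_by(samples, "service", "version")
```

Because the result is keyed by service and version only, replacing every pod leaves the keys unchanged and the dashboard series continuous.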

In a monolith, debugging meant reading one process's stack trace. Microservices destroy that comfort: a single request can traverse dozens of services in different languages over multiple protocols (HTTP, gRPC, Kafka). There is no single stack — only a distributed call hierarchy that lives in no one process's memory. Traditional APM agents tied to specific runtimes cannot stitch this together. The cloud-native answer is distributed tracing on open standards: OpenTelemetry SDKs in every language, W3C Trace Context propagation, automatic injection/extraction in HTTP clients, gRPC interceptors, and message brokers.

Animation 1.3 — The Death of the Call Stack: A Distributed Trace

gateway (Go) → auth (Java) → checkout (Node.js) → payment (Python) → db (Postgres)

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

A single user request enters at the gateway and threads through four polyglot services. The pulse carries the traceparent header so every span shares one trace ID — making the distributed call hierarchy reconstructable.
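The propagation step can be sketched by hand to show what instrumentation libraries do on each hop. This minimal example handles only the version 00 layout of the W3C traceparent header (version, trace ID, parent span ID, flags). The function and type names are hypothetical; in practice an OpenTelemetry SDK performs this extraction and injection for you.

```python
import secrets
from typing import NamedTuple

HEX = set("0123456789abcdef")


class TraceParent(NamedTuple):
    version: str
    trace_id: str   # 32 hex chars, shared by every span in the trace
    parent_id: str  # 16 hex chars, the caller's span ID
    flags: str      # 2 hex chars, e.g. "01" means sampled


def parse_traceparent(header: str) -> TraceParent:
    """Parse a version-00 traceparent header, raising ValueError if malformed."""
    parts = header.strip().split("-")
    if [len(p) for p in parts] != [2, 32, 16, 2]:
        raise ValueError("malformed traceparent")
    if not all(set(p) <= HEX for p in parts):
        raise ValueError("fields must be lowercase hex")
    tp = TraceParent(*parts)
    if set(tp.trace_id) == {"0"} or set(tp.parent_id) == {"0"}:
        raise ValueError("all-zero trace or parent id is invalid")
    return tp


def child_header(incoming: TraceParent) -> str:
    """Header to send downstream: same trace ID, this service's new span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    return f"{incoming.version}-{incoming.trace_id}-{new_span_id}-{incoming.flags}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = parse_traceparent(child_header(incoming))
```

The trace ID is copied unchanged while the span ID is regenerated at every hop, which is what lets a backend reassemble parent-child links across services written in different languages.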

A service mesh (Istio, Linkerd, Consul Connect) injects a sidecar proxy — commonly Envoy — next to each application pod. mTLS, retries, timeouts, circuit breakers, and traffic splitting execute in the proxy, not the app. This creates a blind spot: the app may report a healthy error rate while the mesh is silently retrying 503s or failing TLS handshakes. Treat mesh proxies as first-class observability targets — scrape Envoy stats, collect mesh access logs, and combine application spans with mesh spans so end-to-end traces show exactly where time was spent.
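A small simulation shows how proxy retries hide failures from the application. It assumes a sidecar that retries on 503 up to a fixed limit; the `call_via_sidecar` function and the upstream response sequences are hypothetical and not a model of any specific mesh's retry policy.

```python
def call_via_sidecar(attempt_statuses, max_retries=2):
    """Replay upstream responses through a proxy that retries on 503.

    Returns (status the application sees, statuses the proxy saw).
    """
    seen = []
    for status in attempt_statuses[: max_retries + 1]:
        seen.append(status)
        if status != 503:
            break  # success or a non-retryable status ends the attempt loop
    return seen[-1], seen


# Hypothetical upstream behaviour for three logical requests.
upstream = [[200], [503, 200], [503, 503, 200]]

app_view, mesh_view = [], []
for attempts in upstream:
    final_status, proxy_statuses = call_via_sidecar(attempts)
    app_view.append(final_status)
    mesh_view.extend(proxy_statuses)

app_error_rate = sum(s >= 500 for s in app_view) / len(app_view)
mesh_error_rate = sum(s >= 500 for s in mesh_view) / len(mesh_view)
```

Here the application reports no errors while half of the proxy's upstream attempts failed. Only the proxy's own telemetry reveals the extra latency and load those retries add.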

Figure 1.4: Application and sidecar emit separate, correlated telemetry streams

flowchart LR
    subgraph Pod["Kubernetes Pod"]
        App["Application container<br/>(business logic)"]
        Envoy["Envoy sidecar<br/>(mTLS, retries, routing)"]
        App <-->|"localhost"| Envoy
    end
    App -->|"app metrics<br/>(/metrics)"| Prom["Prometheus"]
    App -->|"app traces<br/>(OTLP spans)"| Backend["Tracing backend<br/>(Jaeger / Tempo)"]
    Envoy -->|"mesh metrics<br/>(Envoy stats)"| Prom
    Envoy -->|"mesh access logs"| Logs["Log backend<br/>(Loki)"]
    Envoy -->|"mesh spans"| Backend
    Prom --> Graf["Grafana<br/>correlated view"]
    Backend --> Graf
    Logs --> Graf

4. The CNCF Observability Landscape

Key Points

Prometheus is the dominant CNCF metrics system and a CNCF Graduated project, the second to graduate after Kubernetes. It is pull-based: services expose HTTP /metrics endpoints in the Prometheus exposition (or OpenMetrics) format and Prometheus scrapes them on a schedule, storing the data in a purpose-built TSDB queried with PromQL. Alerts are PromQL rules whose firing alerts are routed through Alertmanager. Prometheus is metrics-only, and that single-responsibility focus is a feature, not a limitation. For scale beyond a single instance, the ecosystem provides PromQL-compatible long-term stores: Thanos, Cortex, Mimir, and VictoriaMetrics.

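OpenTelemetry (OTel) is the CNCF's answer to vendor lock-in. At the time of writing it is a CNCF Incubating project, whereas Prometheus and Jaeger are Graduated. Its scope is fundamentally different from Prometheus's: it is not a backend but an instrumentation and pipeline standard. Its main components are language SDKs with auto-instrumentation, semantic conventions for attribute names, the Collector for receiving, processing, and routing telemetry, and OTLP, the wire protocol that carries it.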

Instrument once with OpenTelemetry, and backend choice becomes a configuration decision rather than a code rewrite. Migrating from one tracing vendor to another, or running multiple in parallel, requires changes only in Collector pipelines.
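The decoupling can be illustrated with a toy pipeline in which the set of backends is data rather than code. Everything here is hypothetical: `vendor_a` and `vendor_b` are in-memory lists standing in for real backends, and `make_pipeline` only imitates the role a Collector configuration plays.

```python
from typing import Callable, Dict, List

# Hypothetical backends: each simply records the spans it receives.
received: Dict[str, List[dict]] = {"vendor_a": [], "vendor_b": []}
EXPORTERS: Dict[str, Callable[[dict], None]] = {
    name: sink.append for name, sink in received.items()
}


def make_pipeline(config):
    """Build an export function from configuration alone."""
    exporters = [EXPORTERS[name] for name in config["exporters"]]

    def export(span):
        for send in exporters:
            send(span)

    return export


def handle_request(export):
    # Application code: identical no matter where telemetry ends up.
    export({"name": "GET /checkout/card", "duration_ms": 212})


# Start with one backend, then add a second by changing only the config.
handle_request(make_pipeline({"exporters": ["vendor_a"]}))
handle_request(make_pipeline({"exporters": ["vendor_a", "vendor_b"]}))
```

`handle_request` never changes between the two calls. Only the configuration passed to `make_pipeline` does, which is the property that makes a backend migration a pipeline change rather than a code rewrite.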

| Backend | Signal | Role |
| --- | --- | --- |
| Prometheus | Metrics | Local scraping, TSDB, PromQL, alerting |
| Mimir / Cortex / Thanos / VictoriaMetrics | Metrics | Long-term, horizontally scalable Prometheus-compatible storage |
| Jaeger | Traces | Distributed tracing backend, CNCF Graduated |
| Tempo | Traces | Grafana-stack trace backend optimized for object storage |
| Loki | Logs | Grafana-stack log aggregation, label-indexed storage |
| Grafana | Visualization | Dashboards, alerting UI, multi-backend query interface |

Commercial vendors — Datadog, New Relic, Splunk, Honeycomb, Dynatrace, Chronosphere, Lightstep — natively ingest OTLP and mostly support Prometheus remote-write. Teams can move between open-source and SaaS, or run hybrid, without rewriting instrumentation.

Figure 1.5: End-to-end cloud-native observability stack

flowchart LR
    Apps["Applications<br/>(Go, Java, Python, ...)"] --> SDK["OpenTelemetry SDK<br/>+ auto-instrumentation"]
    SDK -->|"OTLP (gRPC/HTTP)"| Col["OpenTelemetry Collector<br/>(DaemonSet / gateway)"]
    Scrape["Prometheus scrape<br/>(/metrics endpoints)"] --> Prom["Prometheus<br/>(local TSDB + PromQL)"]
    Apps -. "expose /metrics" .-> Scrape
    Col -->|"metrics<br/>(remote-write)"| Mimir["Mimir / Thanos /<br/>VictoriaMetrics<br/>(long-term metrics)"]
    Prom -->|"remote-write"| Mimir
    Col -->|"traces"| Tempo["Tempo / Jaeger<br/>(trace storage)"]
    Col -->|"logs"| Loki["Loki / OpenSearch<br/>(log storage)"]
    Prom --> Graf["Grafana<br/>(dashboards + alerts)"]
    Mimir --> Graf
    Tempo --> Graf
    Loki --> Graf
    Prom --> AM["Alertmanager"]

In this arrangement Prometheus and OpenTelemetry are complementary, not competitors: use OpenTelemetry for how you instrument and move telemetry; use Prometheus for where metrics live and how you query and alert on them.
