Chapter 1: Foundations of Cloud-native Observability
Learning Objectives
Distinguish between monitoring and observability and explain why cloud-native systems require the latter
Identify the three pillars of observability and the role each plays in incident response
Describe the cloud-native operational challenges (ephemeral workloads, distributed services, polyglot stacks) that motivate Prometheus and OpenTelemetry
Part 1 — From Monitoring to Observability & The Three Pillars
Pre-Reading Quiz — Part 1
1. A team has solid dashboards for CPU, error rate, and p99 latency, yet a novel bug only appears when tenant_type="enterprise" hits a specific endpoint on a single deployment version. Why is observability (not just better monitoring) the right answer?
Dashboards always undersample CPU data so the team should switch to per-second collection.
Observability lets engineers slice telemetry by arbitrary high-cardinality labels and pivot across signals to surface unknown-unknowns.
Monitoring only works when there are fewer than 10 services in production.
Observability replaces alerts with machine learning that automatically files incidents.
2. Which best characterizes the practical investigative path observability enables during an incident?
Alert → runbook → reboot the affected node.
Metric anomaly → pivot to matching traces → drill into the specific log lines for the failing span.
Open a ticket, wait for the next standup, and assign it to the responsible team.
Search every log line in the cluster for the word "error".
3. Why would adding a user_id label to a Prometheus counter like http_requests_total be a poor design choice?
Prometheus cannot store labels longer than 8 characters.
It introduces unbounded high cardinality, causing each user to produce a new time series and risking memory exhaustion on the Prometheus server.
User IDs change at every deployment, so the metric becomes inaccurate.
Prometheus requires user IDs to be hashed before scraping.
4. Two engineers are debating: "We have logs and metrics — why do we still need traces?" What is the strongest argument for adding distributed tracing?
Traces replace logs at lower storage cost.
Traces natively capture the causal structure of a request across multiple services, which neither metrics nor logs reconstruct by themselves.
Traces are required for any Prometheus scrape configuration to work.
Traces are the only way to graph CPU usage.
5. A latency histogram bucket shows a tall bar at the 2-second mark for thousands of requests. How do exemplars help an SRE here?
They reduce the histogram's bucket count to make queries faster.
They link that specific metric data point to one concrete trace that contributed to the bucket, enabling a one-click jump from "the chart looks bad" to a representative slow trace.
They convert metrics into logs automatically.
They guarantee 100% trace sampling for slow requests.
1. From Monitoring to Observability
Key Points
Observability is borrowed from control theory: a system is observable when its emitted telemetry is rich enough to reconstruct its internal state — including for situations no one anticipated.
Monitoring checks known signals against thresholds; observability supports ad-hoc exploration of correlated telemetry.
Known unknowns vs. unknown unknowns: monitoring covers failure modes you can enumerate in advance; observability is what surfaces the novel combinations distributed systems produce.
Dashboards summarize signals but cannot anticipate every meaningful slice. The dimensional space across services, versions, tenants, regions, and Kubernetes attributes runs into millions of combinations.
The signature workflow: metric → trace → log — pivoting fluidly between pillars instead of staring at pre-built charts.
Observability originated in control theory: the degree to which a system's internal state can be inferred from its external outputs. Modern software borrows the idea — a system is observable when its telemetry (metrics, logs, traces) is rich enough that engineers can reconstruct what is happening inside it, even for situations no one anticipated.
This contrasts with monitoring, which is fundamentally about checking known signals against predefined thresholds. Monitoring answers the questions you wrote down ahead of time: "Is checkout-service returning too many 5xx errors?" or "Is CPU on node-3 above 90%?" It is a tripwire built around failure modes you already understand.
Known unknowns are failure modes you can name (database connection exhaustion, disk full, CrashLoopBackOff), which makes them perfect targets for pre-built alerts. Unknown unknowns are the novel combinations no one imagined. Consider sporadic 500s that occur only during peak traffic, only at /checkout/card, only on v2, and only for tenant_type="enterprise". With pure monitoring you see only a vague error spike; with observability you slice metrics by labels, pivot to traces, then jump to the logs for the failing spans, which in this example reveal a misconfigured PAYMENT_TIMEOUT=200ms in v2.
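The slicing step in that investigation can be sketched in a few lines of Python. This is a minimal illustration, not a real query engine: the event records, field names, and the `error_rate_by` helper are all hypothetical, and a production system would run the equivalent query against its telemetry backend.

```python
from collections import Counter

# Hypothetical request events; field names and values are illustrative only.
EVENTS = [
    {"endpoint": "/checkout/card", "version": "v2", "tenant_type": "enterprise", "status": 500},
    {"endpoint": "/checkout/card", "version": "v2", "tenant_type": "enterprise", "status": 500},
    {"endpoint": "/checkout/card", "version": "v2", "tenant_type": "free", "status": 200},
    {"endpoint": "/checkout/card", "version": "v1", "tenant_type": "enterprise", "status": 200},
    {"endpoint": "/cart", "version": "v2", "tenant_type": "enterprise", "status": 200},
    {"endpoint": "/cart", "version": "v1", "tenant_type": "free", "status": 200},
]


def error_rate_by(events, *dimensions):
    """Return {label-tuple: 5xx error rate} for any chosen combination of dimensions."""
    totals, errors = Counter(), Counter()
    for event in events:
        key = tuple(event[d] for d in dimensions)
        totals[key] += 1
        if event["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# A single pre-chosen dimension only hints that something is wrong on v2.
by_version = error_rate_by(EVENTS, "version")

# An ad-hoc three-way slice isolates the failing combination.
by_slice = error_rate_by(EVENTS, "endpoint", "version", "tenant_type")
worst = max(by_slice, key=by_slice.get)

print(by_version)
print(worst, by_slice[worst])
```

The point of the sketch is that the dimensions are chosen at query time. A dashboard built around `version` alone would show a diluted error rate, while the three-way slice pins the failures to one endpoint, version, and tenant type.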
Dashboards summarize telemetry into a small number of pre-chosen charts. That works when the questions are stable. In a microservice deployment with 30 services, 5 in-flight versions, dozens of tenants, multiple regions, and Kubernetes attributes (namespace, deployment, pod, node), the meaningful dimensional space runs into the millions. No fixed dashboard can pre-render every slice.
Figure 1.1: Monitoring dashboards vs. observability query interfaces
flowchart LR
subgraph Monitoring["Monitoring (pre-built dashboards)"]
direction TB
G1["Gauge: CPU %"]
G2["Gauge: Error rate"]
G3["Gauge: Latency p99"]
G4["Fixed thresholds and alerts"]
end
subgraph Observability["Observability (ad-hoc query interface)"]
direction TB
Q["Free-form query by service, version, tenant, region, endpoint"]
Q --> R1["Slice metrics"]
Q --> R2["Pivot to traces"]
Q --> R3["Drill into logs"]
end
Monitoring -. "answers known questions" .-> Known["Known unknowns"]
Observability -. "discovers new questions" .-> Unknown["Unknown unknowns"]
Animation 1.1 — Monitoring vs. Observability: Surfacing Unknown-Unknowns
Monitoring shows the known charts (left). The novel failure (right) hides among combinations no dashboard pre-renders — until observability lets you query the right slice.
2. The Three Pillars: Metrics, Logs, and Traces
Key Points
Metrics are low-cardinality numeric time series — ideal for alerting, SLOs, and trends.
Logs are high-cardinality structured events — ideal for capturing what exactly happened in one moment.
Traces are end-to-end causal records of a request across services — the only signal that captures distributed call structure.
Correlation turns three pillars into one fabric: exemplars link metrics to traces, trace IDs in logs link logs to traces, and resource attributes tie everything to the workload.
Cardinality is a first-class design concern. Push low-cardinality dimensions to metrics; high-cardinality detail to logs and traces; sample traces aggressively.
The three pillars are complementary because each handles cardinality, cost, and temporal granularity differently. Metrics are cheap, summarizable, ideal for alerting; a typical sample is http_requests_total{service="api",status="500"}. Their weakness is aggregation — you trade per-event detail for compactness. Logs are discrete events, usually structured JSON, that capture exact parameter values and stack traces; their weakness is storage and indexing volume. Traces are made of spans — units of work in one service with parent-child links — propagated via headers like W3C traceparent; their weakness is also volume, which is why traces are usually sampled.
| Signal | Best for | Cardinality tolerance | Typical cost driver |
|---|---|---|---|
| Metrics | Alerting, trends, SLOs | Low | Number of time series |
| Logs | Detailed per-event context | High | Storage and indexing volume |
| Traces | Cross-service request flow | High | Sample rate and span count |
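To make the W3C traceparent header mentioned above concrete, here is a minimal parsing sketch. It assumes the version 00 layout (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags, joined by hyphens) and deliberately ignores forward-compatibility rules for future versions. The `TraceContext` type and `parse_traceparent` function are names invented for this example, not part of any SDK.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex), lowercase hex only.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent header value; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx)
```

Every service that receives this header reuses the trace ID and records the parent span ID, which is what lets a backend reassemble spans from many processes into one trace.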
Figure 1.2: The three pillars and their connective tissue
flowchart TD
Obs["Observability"]
Obs --> M["Metrics: aggregated numeric time series"]
Obs --> L["Logs: discrete structured events"]
Obs --> T["Traces: causal request flow across services"]
M -- "exemplars (metric point to trace)" --> T
T -- "trace_id in log lines" --> L
L -- "span_id back to trace" --> T
M -. "resource attributes (k8s.namespace, service, version)" .- L
M -. "resource attributes" .- T
L -. "resource attributes" .- T
The leap from "three pillars" to "one fabric" happens when signals are correlated. Exemplars attach pointers to metric data points linking a histogram bucket to one concrete trace. Trace IDs embedded in structured logs let you jump from any log line to the full trace, and from any span back to that span's logs. Resource attributes (k8s.namespace.name, k8s.deployment.name) tie all three pillars to the underlying Kubernetes workload.
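The log-to-trace link described above depends on every log line carrying the active trace and span IDs. The following sketch shows one way to do that with Python's standard `logging` module only. The logger name, the `JsonTraceFormatter` class, and the ID values are assumptions for illustration; an OpenTelemetry SDK would normally inject these fields from the current span context instead of receiving them through `extra`.

```python
import io
import json
import logging


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object that carries trace correlation fields."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed through `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(payload)


def build_logger(stream):
    logger = logging.getLogger("checkout-demo")  # hypothetical logger name
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonTraceFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.error(
    "payment timed out",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7"},
)
line = json.loads(buffer.getvalue())
print(line)
```

With `trace_id` present as a structured field, a log store can index it, and a UI can turn any log line into a link to the full trace.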
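Cardinality, the number of unique label combinations, is the most important operational concept here. A Prometheus counter http_requests_total carrying the labels service, endpoint, status, and pod, with 10 services × 50 endpoints × 5 statuses × 200 pods, can produce up to 500,000 time series. Add user_id and the count multiplies by the number of distinct users, reaching millions or more and risking OOM crashes on the Prometheus server. The rule: keep low-cardinality labels in metrics, put high-cardinality detail in logs and traces, and sample traces. Head-based sampling decides when a request starts, so it is cheap but cannot favor requests that later fail. Tail-based sampling decides after the trace completes, so it can be configured to keep every error and slow trace, at the cost of buffering spans until the decision is made.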
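The arithmetic above is easy to reproduce. The sketch below computes the worst-case series count as the product of distinct values per label. The `series_upper_bound` helper is a name made up for this example, and the figure of 1,000 distinct users is an assumption chosen only to show the multiplication effect. Real series counts are usually lower than this bound because not every label combination occurs.

```python
from math import prod


def series_upper_bound(label_values):
    """Worst-case time series count: the product of distinct values per label."""
    return prod(label_values.values())


labels = {"service": 10, "endpoint": 50, "status": 5, "pod": 200}
base = series_upper_bound(labels)

# Assumed: 1,000 distinct user IDs. Any unbounded label behaves the same way.
with_users = series_upper_bound({**labels, "user_id": 1_000})

print(f"without user_id: {base:,}")
print(f"with user_id:    {with_users:,}")
```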
Animation 1.2 — Three Pillars and Their Correlation Links
Three cards (Metrics, Logs, Traces) appear in sequence, then the correlation links draw in: exemplars (metrics↔traces), trace IDs in logs (logs↔traces), and shared resource attributes binding all three.
Post-Reading Quiz — Part 1
1. A team has solid dashboards for CPU, error rate, and p99 latency, yet a novel bug only appears when tenant_type="enterprise" hits a specific endpoint on a single deployment version. Why is observability (not just better monitoring) the right answer?
Dashboards always undersample CPU data so the team should switch to per-second collection.
Observability lets engineers slice telemetry by arbitrary high-cardinality labels and pivot across signals to surface unknown-unknowns.
Monitoring only works when there are fewer than 10 services in production.
Observability replaces alerts with machine learning that automatically files incidents.
2. Which best characterizes the practical investigative path observability enables during an incident?
Alert → runbook → reboot the affected node.
Metric anomaly → pivot to matching traces → drill into the specific log lines for the failing span.
Open a ticket, wait for the next standup, and assign it to the responsible team.
Search every log line in the cluster for the word "error".
3. Why would adding a user_id label to a Prometheus counter like http_requests_total be a poor design choice?
Prometheus cannot store labels longer than 8 characters.
It introduces unbounded high cardinality, causing each user to produce a new time series and risking memory exhaustion on the Prometheus server.
User IDs change at every deployment, so the metric becomes inaccurate.
Prometheus requires user IDs to be hashed before scraping.
4. Two engineers are debating: "We have logs and metrics — why do we still need traces?" What is the strongest argument for adding distributed tracing?
Traces replace logs at lower storage cost.
Traces natively capture the causal structure of a request across multiple services, which neither metrics nor logs reconstruct by themselves.
Traces are required for any Prometheus scrape configuration to work.
Traces are the only way to graph CPU usage.
5. A latency histogram bucket shows a tall bar at the 2-second mark for thousands of requests. How do exemplars help an SRE here?
They reduce the histogram's bucket count to make queries faster.
They link that specific metric data point to one concrete trace that contributed to the bucket, enabling a one-click jump from "the chart looks bad" to a representative slow trace.
They convert metrics into logs automatically.
They guarantee 100% trace sampling for slow requests.
Part 2 — Cloud-native Operational Context & The CNCF Landscape
Pre-Reading Quiz — Part 2
6. A dashboard is keyed by individual pod name. After a rolling deployment the panels are full of orphaned series. What is the correct cloud-native fix?
Disable rolling deployments so pod names remain stable.
Aggregate at the workload level using stable labels like service, namespace, and version, and use Kubernetes service discovery for scrape targets.
Increase the metrics retention window to 1 year.
Move all metric collection into application logs.
7. Why are traditional in-process APM agents insufficient for a polyglot microservices architecture?
They cannot read JSON payloads.
They are tightly coupled to specific language runtimes and cannot reliably stitch a request together across services, languages, and message-queue hops.
They emit too few metrics per minute.
They require Kubernetes to be installed on every developer laptop.
8. The application metrics for checkout-service look fine, but users report intermittent failures. The team uses Istio. What observability blind spot is the most likely culprit?
PromQL queries are case-sensitive.
The Envoy sidecar's mesh-level behavior (retries, mTLS failures, outlier detection) is invisible because only application metrics are being scraped.
The pods are running on the wrong node taint.
The application is not using a service account.
9. Which best describes the complementary roles of Prometheus and OpenTelemetry in a 2025 CNCF stack?
They are direct competitors and a team must pick exactly one.
OpenTelemetry handles how telemetry is produced and routed (SDKs, conventions, Collector, OTLP); Prometheus handles where metrics live and how they are queried and alerted on (TSDB, PromQL, Alertmanager).
OpenTelemetry only exists for traces; Prometheus only exists for logs.
Both store traces, metrics, and logs and one is chosen at random.
10. Why does OpenTelemetry's vendor neutrality matter operationally?
It guarantees the Collector is faster than every commercial agent.
Instrumentation is decoupled from backend choice, so switching or combining vendors becomes a Collector configuration change rather than an application-code rewrite.
It blocks commercial vendors from accepting OTLP data.
It forces all telemetry to be stored in a single backend.
3. Cloud-native Operational Context
Key Points
Pods are ephemeral by design. Names and IPs are not stable identifiers; host-centric monitoring breaks down on pod-keyed dashboards and stale alerts.
Workloads — not pods — are the unit of observability. Aggregate by service, namespace, deployment, version.
Kubernetes service discovery dynamically finds scrape targets via the API server instead of maintaining static lists.
Microservices kill the single call stack. Distributed tracing with OpenTelemetry + W3C Trace Context is the only practical way to reconstruct a polyglot request's full path.
Service mesh sidecars are first-class observability targets. mTLS, retries, and routing happen in Envoy — not the app — so mesh proxies must be scraped and traced alongside applications.
Traditional monitoring assumed long-lived hosts with stable identities. Kubernetes inverts that: pods are created and destroyed in seconds due to autoscaling, rolling deployments, crash loops, and probe failures. A dashboard keyed by pod name becomes unreadable; historical series for a specific pod are meaningless once it's gone; alerts firing on a pod that just terminated produce 404s when an operator clicks through.
Figure 1.3: Rolling deployment re-keys metrics from pods to the Deployment
flowchart LR
subgraph T0["t=0: v1 steady state"]
P1A["pod checkout-v1-a"]
P1B["pod checkout-v1-b"]
P1C["pod checkout-v1-c"]
end
subgraph T1["t=1: rolling update"]
P1B2["pod checkout-v1-b"]
P2A["pod checkout-v2-a"]
P2B["pod checkout-v2-b"]
end
subgraph T2["t=2: v2 steady state"]
P2A2["pod checkout-v2-a"]
P2B2["pod checkout-v2-b"]
P2C["pod checkout-v2-c"]
end
T0 --> T1 --> T2
T0 --> Q["Stable query: sum by (service, version) (rate(http_requests_total))"]
T1 --> Q
T2 --> Q
Q --> Dash["Continuous series keyed by Deployment + version, not pod name"]
The cloud-native fix: treat workloads as the unit of observability. Aggregate across all pods in a Deployment using service, namespace, and version. Use Prometheus's Kubernetes service discovery to find scrape targets dynamically. Send container logs via a node-level DaemonSet collector (Fluent Bit, Vector) to a centralized store so records survive the pod.
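The effect of workload-level aggregation can be shown with a small simulation of the rollout in Figure 1.3. This is a sketch of what a `sum by (service)` style aggregation does, written in plain Python. The sample values and the `sum_by` helper are hypothetical, and a real deployment would express this as a PromQL query rather than application code.

```python
from collections import defaultdict

# Hypothetical per-pod request rates (req/s) at three scrape times during a rollout.
SAMPLES = [
    {"t": 0, "pod": "checkout-v1-a", "service": "checkout", "version": "v1", "rate": 40.0},
    {"t": 0, "pod": "checkout-v1-b", "service": "checkout", "version": "v1", "rate": 35.0},
    {"t": 0, "pod": "checkout-v1-c", "service": "checkout", "version": "v1", "rate": 45.0},
    {"t": 1, "pod": "checkout-v1-b", "service": "checkout", "version": "v1", "rate": 50.0},
    {"t": 1, "pod": "checkout-v2-a", "service": "checkout", "version": "v2", "rate": 30.0},
    {"t": 1, "pod": "checkout-v2-b", "service": "checkout", "version": "v2", "rate": 40.0},
    {"t": 2, "pod": "checkout-v2-a", "service": "checkout", "version": "v2", "rate": 42.0},
    {"t": 2, "pod": "checkout-v2-b", "service": "checkout", "version": "v2", "rate": 38.0},
    {"t": 2, "pod": "checkout-v2-c", "service": "checkout", "version": "v2", "rate": 40.0},
]


def sum_by(samples, *labels):
    """Sum rates per timestamp, grouped by the given stable labels."""
    out = defaultdict(float)
    for s in samples:
        out[(s["t"], tuple(s[label] for label in labels))] += s["rate"]
    return dict(out)


by_service = sum_by(SAMPLES, "service")
by_version = sum_by(SAMPLES, "service", "version")
distinct_pods = {s["pod"] for s in SAMPLES}

print(by_service)
print(by_version)
print(len(distinct_pods), "pod-level series vs 1 service-level series")
```

Six pod names come and go over the three scrapes, yet the service-level total stays continuous, and the per-version view shows traffic shifting from v1 to v2.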
In a monolith, debugging meant reading one process's stack trace. Microservices destroy that comfort: a single request can traverse dozens of services in different languages over multiple protocols (HTTP, gRPC, Kafka). There is no single stack — only a distributed call hierarchy that lives in no one process's memory. Traditional APM agents tied to specific runtimes cannot stitch this together. The cloud-native answer is distributed tracing on open standards: OpenTelemetry SDKs in every language, W3C Trace Context propagation, automatic injection/extraction in HTTP clients, gRPC interceptors, and message brokers.
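Context propagation is the mechanism that makes this reconstruction possible. The sketch below simulates it with the standard library only: each hop extracts the incoming traceparent value, starts its own span under the same trace ID, and injects a new header for the next hop. The service names and the `handle` and `simulate` functions are hypothetical, and real services would rely on OpenTelemetry propagators instead of formatting the header by hand.

```python
import secrets


def handle(service, incoming):
    """Start a span for `service`, continuing the trace in `incoming` if there is one."""
    if incoming is None:
        # First hop: start a new trace with the sampled flag set.
        trace_id, parent_id, flags = secrets.token_hex(16), None, "01"
    else:
        _version, trace_id, parent_id, flags = incoming.split("-")
    span_id = secrets.token_hex(8)
    span = {"service": service, "trace_id": trace_id, "span_id": span_id, "parent_id": parent_id}
    # Outgoing header: same trace ID, this span becomes the parent of the next hop.
    outgoing = f"00-{trace_id}-{span_id}-{flags}"
    return span, outgoing


def simulate(services):
    """Pass one request through a linear chain of services and collect their spans."""
    spans, header = [], None
    for service in services:
        span, header = handle(service, header)
        spans.append(span)
    return spans


spans = simulate(["gateway", "orders", "payments", "inventory"])
for span in spans:
    print(span["service"], span["trace_id"], span["span_id"], span["parent_id"])
```

Because every span shares one trace ID and records its parent's span ID, a tracing backend can rebuild the call hierarchy even though no single process ever held the whole picture.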
Animation 1.3 — The Death of the Call Stack: A Distributed Trace
A single user request enters at the gateway and threads through four polyglot services. The pulse carries the traceparent header so every span shares one trace ID — making the distributed call hierarchy reconstructable.
A service mesh (Istio, Linkerd, Consul Connect) injects a sidecar proxy — commonly Envoy — next to each application pod. mTLS, retries, timeouts, circuit breakers, and traffic splitting execute in the proxy, not the app. This creates a blind spot: the app may report a healthy error rate while the mesh is silently retrying 503s or failing TLS handshakes. Treat mesh proxies as first-class observability targets — scrape Envoy stats, collect mesh access logs, and combine application spans with mesh spans so end-to-end traces show exactly where time was spent.
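A quick calculation shows how large this blind spot can be. The counters below are invented for illustration and use hypothetical field names, not Envoy's actual stat names. They model a caller whose sidecar retries failed attempts, so most failures never reach the application.

```python
def ratio(numerator, denominator):
    """Safe division for rates; returns 0.0 when there is no traffic."""
    return numerator / denominator if denominator else 0.0


# Hypothetical counters for one scrape window.
app_view = {
    "requests": 1000,   # logical calls made by the application
    "errors_5xx": 2,    # failures the application saw after retries were exhausted
}
mesh_view = {
    "attempts": 1180,      # attempts sent by the sidecar, retries included
    "retries": 180,        # attempts that were retries of a failed attempt
    "attempt_5xx": 182,    # attempts that came back 5xx
}

app_error_rate = ratio(app_view["errors_5xx"], app_view["requests"])
mesh_error_rate = ratio(mesh_view["attempt_5xx"], mesh_view["attempts"])
retry_ratio = ratio(mesh_view["retries"], app_view["requests"])

print(f"app-reported error rate:  {app_error_rate:.1%}")
print(f"mesh attempt error rate:  {mesh_error_rate:.1%}")
print(f"retries per logical call: {retry_ratio:.2f}")
```

In this made-up window the application reports a 0.2% error rate while roughly 15% of attempts fail at the proxy. Only proxy-level telemetry reveals the gap, and the added latency that the retries cause.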
Figure 1.4: Application and sidecar emit separate, correlated telemetry streams
4. The CNCF Landscape
Key Points
Prometheus is the de-facto metrics standard — a CNCF Graduated, pull-based TSDB with PromQL, Alertmanager, and a massive exporter ecosystem. Metrics only.
OpenTelemetry is the vendor-neutral instrumentation standard — SDKs in every language, semantic conventions, the Collector, and OTLP across traces, metrics, and logs.
They are complementary, not competitors: OpenTelemetry covers how telemetry is produced and routed; Prometheus covers where metrics live and how they are queried.
The 2025 stack: instrument with OTel SDKs → collect via OTel Collector → store metrics in Prometheus + Mimir/Thanos/VictoriaMetrics → traces in Jaeger/Tempo → logs in Loki → visualize in Grafana.
Vendor neutrality via OTLP means switching or combining commercial backends (Datadog, Honeycomb, New Relic, etc.) is a Collector config change — not a code rewrite.
Prometheus is the dominant CNCF metrics system. It was the second project to reach CNCF Graduated status, in 2018, after Kubernetes. It is pull-based: services expose HTTP /metrics endpoints in the Prometheus exposition (or OpenMetrics) format, and Prometheus scrapes them on a schedule, storing the data in a purpose-built TSDB queried with PromQL. Alerts are PromQL rules routed through Alertmanager. Prometheus is metrics-only; that single-responsibility focus is a feature, not a limitation. For scale beyond a single instance, the ecosystem provides PromQL-compatible long-term stores: Thanos, Cortex, Mimir, and VictoriaMetrics.
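To show what a scrape actually reads, here is a minimal sketch that renders a counter in the Prometheus text exposition format. The `render_counter` function is a hypothetical helper written for this example; it skips label-value escaping and other details that the official client libraries handle, so treat it as an illustration of the format rather than a replacement for a client library.

```python
def render_counter(name, help_text, samples):
    """Render one counter family as exposition text.

    `samples` is a list of (labels dict, value) pairs. Label values are assumed
    to contain no quotes, backslashes, or newlines (no escaping is applied).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


body = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [
        ({"service": "api", "status": "200"}, 1027),
        ({"service": "api", "status": "500"}, 3),
    ],
)
print(body, end="")
```

A service would return this text from its /metrics endpoint. Each distinct label combination is one line here and one time series in the TSDB, which is why label choice drives cost.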
OpenTelemetry (OTel) is the CNCF's answer to vendor lock-in. It moved to the CNCF Incubating level in 2021 (check the CNCF project page for its current maturity level). Its scope is fundamentally different from Prometheus's: it is not a backend, it is an instrumentation and pipeline standard. Components:
SDKs in Go, Java, Python, Node.js, Rust, .NET, Ruby, and more — for traces, metrics, and logs.
Auto-instrumentation packages that wrap common libraries (HTTP, gRPC, DB drivers, message queues).
Semantic conventions so http.route, k8s.namespace.name, service.version mean the same thing everywhere.
OpenTelemetry Collector — vendor-neutral receive/process/export pipeline; deploy as sidecar, DaemonSet, or gateway.
OTLP — the standard wire protocol for all three signals (gRPC or HTTP).
Instrument once with OpenTelemetry, and backend choice becomes a configuration decision rather than a code rewrite. Migrating from one tracing vendor to another, or running multiple in parallel, requires changes only in Collector pipelines.
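The "configuration, not code" idea can be modeled in a few lines. The sketch below is a conceptual toy, not the Collector's real configuration schema or API: `build_pipeline`, `handle_request`, and the two in-memory backends are hypothetical stand-ins. It shows application code that emits spans without naming a backend, while a config dictionary decides where they go.

```python
from typing import Callable, Dict, List

Span = Dict[str, str]
Exporter = Callable[[List[Span]], None]


def build_pipeline(config: Dict[str, List[str]], registry: Dict[str, Exporter]) -> Exporter:
    """Resolve exporter names from config into a single fan-out export function."""
    exporters = [registry[name] for name in config["exporters"]]

    def export(batch: List[Span]) -> None:
        for exporter in exporters:
            exporter(batch)

    return export


def handle_request(export: Exporter) -> None:
    """Stand-in for instrumented application code; it never names a backend."""
    export([{"name": "GET /checkout", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}])


# Two hypothetical backends, modeled as in-memory lists.
backend_a: List[Span] = []
backend_b: List[Span] = []
REGISTRY: Dict[str, Exporter] = {"backend_a": backend_a.extend, "backend_b": backend_b.extend}

# Same application code, two different configurations.
handle_request(build_pipeline({"exporters": ["backend_a"]}, REGISTRY))
handle_request(build_pipeline({"exporters": ["backend_a", "backend_b"]}, REGISTRY))

print(len(backend_a), len(backend_b))
```

Adding the second backend required no change to `handle_request`, which mirrors how a migration or a parallel run would be confined to the pipeline definition.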
Commercial vendors — Datadog, New Relic, Splunk, Honeycomb, Dynatrace, Chronosphere, Lightstep — natively ingest OTLP and mostly support Prometheus remote-write. Teams can move between open-source and SaaS, or run hybrid, without rewriting instrumentation.
In this arrangement Prometheus and OpenTelemetry are complementary, not competitors: use OpenTelemetry for how you instrument and move telemetry; use Prometheus for where metrics live and how you query and alert on them.
Post-Reading Quiz — Part 2
6. A dashboard is keyed by individual pod name. After a rolling deployment the panels are full of orphaned series. What is the correct cloud-native fix?
Disable rolling deployments so pod names remain stable.
Aggregate at the workload level using stable labels like service, namespace, and version, and use Kubernetes service discovery for scrape targets.
Increase the metrics retention window to 1 year.
Move all metric collection into application logs.
7. Why are traditional in-process APM agents insufficient for a polyglot microservices architecture?
They cannot read JSON payloads.
They are tightly coupled to specific language runtimes and cannot reliably stitch a request together across services, languages, and message-queue hops.
They emit too few metrics per minute.
They require Kubernetes to be installed on every developer laptop.
8. The application metrics for checkout-service look fine, but users report intermittent failures. The team uses Istio. What observability blind spot is the most likely culprit?
PromQL queries are case-sensitive.
The Envoy sidecar's mesh-level behavior (retries, mTLS failures, outlier detection) is invisible because only application metrics are being scraped.
The pods are running on the wrong node taint.
The application is not using a service account.
9. Which best describes the complementary roles of Prometheus and OpenTelemetry in a 2025 CNCF stack?
They are direct competitors and a team must pick exactly one.
OpenTelemetry handles how telemetry is produced and routed (SDKs, conventions, Collector, OTLP); Prometheus handles where metrics live and how they are queried and alerted on (TSDB, PromQL, Alertmanager).
OpenTelemetry only exists for traces; Prometheus only exists for logs.
Both store traces, metrics, and logs and one is chosen at random.
10. Why does OpenTelemetry's vendor neutrality matter operationally?
It guarantees the Collector is faster than every commercial agent.
Instrumentation is decoupled from backend choice, so switching or combining vendors becomes a Collector configuration change rather than an application-code rewrite.
It blocks commercial vendors from accepting OTLP data.
It forces all telemetry to be stored in a single backend.