Cloud-native Observability with Prometheus and OpenTelemetry
An intermediate, hands-on guide to instrumenting, collecting, querying, and operationalizing telemetry in cloud-native systems using Prometheus and OpenTelemetry.
Table of Contents
- Chapter 1: Foundations of Cloud-native Observability
- Chapter 2: The Three Signals: Metrics, Logs, and Traces in Depth
- Chapter 3: Prometheus Architecture and Data Model
- Chapter 4: PromQL: Querying Time-Series Data
- Chapter 5: OpenTelemetry Architecture: API, SDK, and Collector
- Chapter 6: Instrumentation: Manual, Automatic, and Zero-Code
- Chapter 7: Distributed Tracing with OpenTelemetry
- Chapter 8: Metrics Pipeline: Bridging OpenTelemetry and Prometheus
- Chapter 9: Logs, Events, and Cross-Signal Correlation
- Chapter 10: The OpenTelemetry Collector in Depth
- Chapter 11: Sampling, Performance, and Cost Control
- Chapter 12: SLOs, Alerting, and Operational Excellence
Chapter 1: Foundations of Cloud-native Observability
Learning Objectives
- Distinguish between monitoring and observability, and explain why cloud-native systems require the latter to operate safely at scale.
- Identify the three pillars of observability — metrics, logs, and traces — and articulate the role each plays during incident response.
- Describe the cloud-native operational challenges (ephemeral workloads, distributed services, polyglot stacks, service meshes) that motivate Prometheus and OpenTelemetry as foundational tooling.
From Monitoring to Observability
Definition of observability borrowed from control theory
The word observability originated in control theory, where it describes the degree to which the internal state of a system can be inferred from its external outputs. Modern software borrows that idea: a system is observable when its emitted telemetry — metrics, logs, and traces — is rich enough that engineers can reconstruct what is happening inside it, even for situations no one anticipated [Source: https://opentelemetry.io].
This stands in contrast to monitoring, which is fundamentally about checking known signals against predefined thresholds. Monitoring answers questions you wrote down ahead of time: “Is checkout-service returning too many 5xx errors?” or “Is CPU usage on node node-3 above 90%?” [Source: https://prometheus.io]. It is a detection mechanism — a tripwire — built around failure modes you already understand.
A useful analogy: monitoring is like the dashboard of a car. Speed, fuel level, engine temperature, and a few warning lights cover the most common failure cases. Observability is more like having a full diagnostic port plus a mechanic’s toolkit on hand. When the car behaves strangely in a way the dashboard cannot explain — a vibration only at 73 mph in the rain — you need the ability to probe, query, and correlate signals that were never wired up to a warning light.
Key Takeaway: Monitoring tells you that something is wrong by checking predefined signals; observability gives you the raw material to figure out what and why by enabling ad-hoc exploration of correlated telemetry. Both are necessary, but observability is what makes complex cloud-native systems debuggable.
Known-unknowns vs unknown-unknowns
The classic framing for this distinction is known unknowns versus unknown unknowns. A known unknown is a failure mode you can name in advance: “the database might run out of connections,” “the disk might fill up,” “a pod might enter CrashLoopBackOff.” For each of these, you can pre-build a metric, threshold, and alert [Source: https://prometheus.io].
Unknown unknowns are the failures that nobody on the team has imagined yet. Consider a real-world scenario: sporadic 500 errors appear from checkout-service only during peak traffic, only for the /checkout/card endpoint, only on version v2, and only for tenant_type="enterprise". No engineer wrote an alert for that combination of conditions. With pure monitoring you might see only a vague error-rate spike. With observability, you can slice the metric by labels, pivot to the traces matching the failing requests, and finally jump to the logs for the specific failing spans to discover that a misconfigured PAYMENT_TIMEOUT=200ms environment variable was deployed in v2 [Source: https://opentelemetry.io].
That investigative path — metric → trace → log — is the practical essence of observability. It lets you discover the question you didn’t know to ask.
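The "slice the metric by labels" step can be sketched in a few lines. This is a minimal illustration, assuming an in-memory list of hypothetical request records (`REQUESTS`) and two hypothetical helpers (`error_rate_by`, `worst_slice`); a real investigation would run the equivalent grouping as a query against a metrics or trace backend.

```python
from collections import defaultdict

# Hypothetical request records, invented for illustration only.
REQUESTS = [
    {"endpoint": "/checkout/card", "version": "v2", "tenant_type": "enterprise", "status": 500},
    {"endpoint": "/checkout/card", "version": "v2", "tenant_type": "enterprise", "status": 500},
    {"endpoint": "/checkout/card", "version": "v2", "tenant_type": "enterprise", "status": 200},
    {"endpoint": "/checkout/card", "version": "v1", "tenant_type": "enterprise", "status": 200},
    {"endpoint": "/checkout/card", "version": "v2", "tenant_type": "free", "status": 200},
    {"endpoint": "/checkout/paypal", "version": "v2", "tenant_type": "enterprise", "status": 200},
]


def error_rate_by(records, dimensions):
    """Group records by the given label names and return the 5xx ratio per group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for rec in records:
        key = tuple(rec[d] for d in dimensions)
        totals[key] += 1
        if rec["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


def worst_slice(records, dimensions):
    """Return the (label values, error ratio) pair with the highest error ratio."""
    rates = error_rate_by(records, dimensions)
    return max(rates.items(), key=lambda item: item[1])
```

Grouping by no dimensions at all gives the vague overall error rate that a fixed dashboard would show. Grouping by `endpoint`, `version`, and `tenant_type` isolates the one combination that is actually failing.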
Key Takeaway: Monitoring is sufficient when you can enumerate failure modes in advance. Distributed cloud-native systems generate too many novel failure combinations for that approach, which is why observability — high-cardinality, correlated, queryable telemetry — has become a baseline requirement rather than a luxury.
Why dashboards alone are insufficient in distributed systems
Dashboards are fundamentally a summarization tool: they aggregate raw telemetry into a small number of charts chosen ahead of time. That works beautifully when the number of important questions is small and stable. It breaks down when the number of meaningful slices through your data explodes.
In a microservice deployment with 30 services, 5 versions in flight at any moment, dozens of tenants, multiple regions, feature flags, and Kubernetes-supplied attributes (namespace, deployment, pod, node), the dimensional space of interesting views runs into the millions. No fixed dashboard can pre-render every relevant slice [Source: https://www.cncf.io].
Figure 1.1: Monitoring dashboards vs. observability query interfaces
flowchart LR
subgraph Monitoring["Monitoring (pre-built dashboards)"]
direction TB
G1["Gauge: CPU %"]
G2["Gauge: Error rate"]
G3["Gauge: Latency p99"]
G4["Fixed thresholds<br/>and alerts"]
end
subgraph Observability["Observability (ad-hoc query interface)"]
direction TB
Q["Free-form query<br/>by service, version,<br/>tenant, region, endpoint"]
Q --> R1["Slice metrics"]
Q --> R2["Pivot to traces"]
Q --> R3["Drill into logs"]
end
Monitoring -. "answers known questions" .-> Known["Known unknowns"]
Observability -. "discovers new questions" .-> Unknown["Unknown unknowns"]
A second problem is that dashboards summarize signals, but rarely correlate across signals. When an alert fires on a latency metric, the human operator must mentally bridge from “the chart looks bad” to “let me find a representative trace” to “let me read the logs for that trace.” Observability tooling closes that loop with exemplars and trace IDs that make the pivot a single click rather than a sprawling investigation.
Key Takeaway: Dashboards are a useful summary surface, but they cannot anticipate every question a distributed system will provoke. Observability shifts the workflow from “look at pre-built charts” to “ask arbitrary questions of correlated telemetry.”
The Three Pillars: Metrics, Logs, and Traces
Strengths and weaknesses of each signal
The three pillars of observability — metrics, logs, and traces — are complementary because each has different strengths along the axes of cardinality, cost, and temporal granularity.
Metrics are numeric time series: counters, gauges, and histograms sampled at fixed intervals. They are cheap, summarizable, and ideal for alerting. A typical Kubernetes metric looks like http_requests_total{service="api",status="500"} or a latency histogram like request_duration_seconds_bucket{le="0.1",service="checkout"} [Source: https://prometheus.io]. Their weakness is that they are inherently aggregated: you trade per-event detail for compactness.
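To make the "aggregated, not per-event" trade-off concrete, here is a toy sketch of a Prometheus-style cumulative histogram. The class name `CumulativeHistogram` and its methods are hypothetical and written only for illustration; real client libraries handle concurrency, label sets, and registration, which this sketch ignores.

```python
import math


class CumulativeHistogram:
    """Toy histogram: each bucket counts observations <= its upper bound."""

    def __init__(self, upper_bounds):
        # Append +Inf so the last bucket always equals the total count.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        self.total += value
        self.count += 1
        # Cumulative semantics: bump every bucket whose bound covers the value.
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1

    def exposition_lines(self, name, service):
        """Render the bucket, sum, and count series as text lines."""
        lines = []
        for bound, n in zip(self.upper_bounds, self.bucket_counts):
            le = "+Inf" if math.isinf(bound) else str(bound)
            lines.append(f'{name}_bucket{{le="{le}",service="{service}"}} {n}')
        lines.append(f'{name}_sum{{service="{service}"}} {self.total}')
        lines.append(f'{name}_count{{service="{service}"}} {self.count}')
        return lines
```

However many requests are observed, the histogram exposes the same fixed number of series (one per bucket, plus the sum and count). The individual request durations are not recoverable from it.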
Logs are discrete events — usually structured JSON in cloud-native systems — emitted by applications and infrastructure. In Kubernetes this includes application logs from containers, kubelet logs, API server logs, and ingress controller logs. Logs are excellent for capturing what exactly happened in a single moment, including exception stack traces, parameter values, and business context. Their weakness is volume: at scale, indexing and storing every log line is expensive [Source: https://opentelemetry.io].
Traces are end-to-end records of a request as it flows through multiple services. A trace is composed of spans, each representing a unit of work in one service, with parent-child relationships representing the call hierarchy. Trace context is propagated across processes via HTTP headers like W3C traceparent. Traces are the only signal type that natively captures the structure of a distributed request. Their weakness is also volume, which is why traces are usually sampled rather than collected exhaustively [Source: https://opentelemetry.io].
| Signal | Best for | Cardinality tolerance | Typical cost driver |
|---|---|---|---|
| Metrics | Alerting, trends, SLOs | Low | Number of time series |
| Logs | Detailed per-event context | High | Storage and indexing volume |
| Traces | Cross-service request flow | High | Sample rate and span count |
Figure 1.2: The three pillars and their connective tissue
flowchart TD
Obs["Observability"]
Obs --> M["Metrics<br/>aggregated numeric<br/>time series"]
Obs --> L["Logs<br/>discrete structured<br/>events"]
Obs --> T["Traces<br/>causal request flow<br/>across services"]
M -- "exemplars<br/>(metric point to trace)" --> T
T -- "trace_id in log lines" --> L
L -- "span_id back to trace" --> T
M -. "resource attributes<br/>(k8s.namespace, service, version)" .- L
M -. "resource attributes" .- T
L -. "resource attributes" .- T
Key Takeaway: No single pillar is sufficient on its own. Metrics tell you that something is wrong at scale; logs tell you what exactly happened in one event; traces tell you where in the call chain the problem lives. A mature observability stack uses all three deliberately.
Correlation between signals via exemplars and trace IDs
The leap from “three separate pillars” to “one observability fabric” happens when the signals are correlated. Two mechanisms make this possible.
First, exemplars are pointers attached to metric data points that link a specific bucket of a histogram (for example, a latency point at 2.3 seconds) to one concrete trace that contributed to that bucket. When you see a tall bar in a latency histogram, an exemplar lets you click straight to a representative slow trace, rather than guessing which of millions of requests it was [Source: https://prometheus.io].
Second, trace IDs embedded in structured logs allow you to jump from any log line to the full trace it belongs to — and from any span in a trace to the exact log lines that span emitted. When a trace shows a span failing with a TimeoutError, you can pivot directly to that span’s logs to see parameters, exception details, and surrounding context [Source: https://opentelemetry.io].
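A minimal sketch of the log side of this correlation, using only the Python standard library. The formatter class, the logger name `checkout-service`, and the helper `logs_for_trace` are assumptions made for illustration; in practice an OpenTelemetry logging integration would usually inject the IDs for you rather than relying on `extra=`.

```python
import io
import json
import logging


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object that carries trace correlation IDs."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # trace_id / span_id arrive via `extra=`; they are None when the
            # log line was emitted outside any traced request.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(payload)


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout-service")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonTraceFormatter())
    logger.addHandler(handler)
    return logger


def logs_for_trace(lines, trace_id):
    """Filter JSON log lines down to those that belong to one trace."""
    parsed = [json.loads(line) for line in lines]
    return [entry for entry in parsed if entry.get("trace_id") == trace_id]
```

Because every line is structured and carries `trace_id`, the pivot from a failing span to its logs becomes a simple equality filter instead of a free-text search.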
Resource attributes such as k8s.namespace.name, k8s.deployment.name, and k8s.pod.name tie all three pillars to the underlying Kubernetes workload, so a single label can pivot you from a metric chart to a log query to a trace search without losing context.
Key Takeaway: Exemplars connect metrics to traces; trace IDs in logs connect logs to traces; resource attributes connect everything to the workload. Together these three correlations let an operator move fluidly between pillars during an incident.
Cardinality, sampling, and cost trade-offs
Cardinality is the number of unique combinations of label values in a telemetry dataset. It is the single most important operational concept in cloud-native observability, because each pillar handles it differently, and getting it wrong is expensive [Source: https://prometheus.io].
In a Prometheus-style metrics system, every unique combination of label values creates a new time series. A counter like http_requests_total{service, endpoint, status, pod} with 10 services × 50 endpoints × 5 statuses × 200 pods can produce up to 500,000 time series. Add user_id to the label set and that figure is multiplied by the number of distinct users, reaching millions or more and causing memory pressure, slow queries, and even out-of-memory crashes on the Prometheus server [Source: https://prometheus.io].
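The arithmetic above is a worst-case upper bound and is easy to reproduce. In the sketch below, the function name `series_count` is hypothetical and the figure of 10,000 distinct users is an assumption chosen only to show the effect of one extra high-cardinality label.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on time series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Label cardinalities taken from the example in the text.
base = {"service": 10, "endpoint": 50, "status": 5, "pod": 200}

# Assumed figure: 10,000 distinct users, added as one more label.
with_user_id = {**base, "user_id": 10_000}
```

In practice not every combination occurs (a given pod serves one service, not ten), so the real count is lower. The multiplicative growth is still what makes a single unbounded label dangerous.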
The practical rule is:
- Metrics should carry low-cardinality, operationally meaningful labels: service, namespace, deployment, version, region, endpoint (templated, not raw), status. Avoid user_id, request_id, raw URLs, and pod UIDs.
- Logs are built to handle high-cardinality data (request IDs, user IDs, dynamic fields) because they are indexed and stored differently than time series. Keep them structured so selected fields can be indexed.
- Traces are inherently high-cardinality (every span has many attributes) and are usually sampled: only a fraction of requests have their full trace retained.
Sampling strategies vary. Head-based sampling decides at the start of a request whether to keep it; it is cheap, but because the decision is made before the outcome is known, it cannot preferentially keep requests that later fail or run slowly. Tail-based sampling buffers spans and decides after the request completes, allowing you to retain 100% of errors and slow traces while sampling normal traffic at a low rate.
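The difference between the two strategies comes down to what information is available when the decision is made. The following is a simplified sketch, not how any particular Collector processor is implemented; the function names, the span dictionary shape (`error`, `duration_ms`), and the `slow_ms` threshold are all assumptions for illustration.

```python
import random


def head_sample(trace_id, rate):
    """Head-based: decide from the trace ID alone, before the request runs."""
    # Map the last 8 hex digits of the trace ID to a value in [0, 1) so that
    # every service reaches the same decision for the same trace.
    bucket = int(trace_id[-8:], 16) / 0x1_0000_0000
    return bucket < rate


def tail_sample(spans, baseline_rate, slow_ms, rng=random.random):
    """Tail-based: decide after all spans of the trace have been buffered."""
    # Always keep traces that contain an error.
    if any(span["error"] for span in spans):
        return True
    # Always keep traces with at least one slow span.
    if max((span["duration_ms"] for span in spans), default=0) >= slow_ms:
        return True
    # Otherwise keep only a random fraction of normal traffic.
    return rng() < baseline_rate
```

`head_sample` never sees whether the request failed, so at a 1% rate it drops roughly 99% of errors along with everything else. `tail_sample` keeps every error and slow trace, at the cost of buffering all spans until the trace completes.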
Key Takeaway: Cardinality is a first-class design concern. Push low-cardinality dimensions to metrics, high-cardinality detail to logs and traces, and sample aggressively to control cost. Misplacing a label can quietly bankrupt a metrics backend.
Cloud-native Operational Context
Kubernetes, containers, and ephemeral workloads
Traditional monitoring was built for a world of long-lived hosts with stable identities: hostnames, IPs, and static service inventories. Kubernetes turns those assumptions upside down. Pods are short-lived by design: autoscaling, rolling deployments, crash loops, and health-probe failures can create and destroy them within seconds. Pod names and IPs are not stable identifiers; each restart can produce new ones [Source: https://kubernetes.io].
Several practical pathologies follow from this. A dashboard keyed by pod name becomes unreadable as identities churn every few minutes. Historical time series for a specific pod are meaningless once that pod is gone. Alerts that fire on a pod that has already terminated produce “resource not found” errors when an operator clicks through. Batch jobs or crashing pods may exist for only seconds — too briefly for a legacy agent to discover them at all [Source: https://kubernetes.io].
Figure 1.3: Rolling deployment re-keys metrics from pods to the Deployment
flowchart LR
subgraph T0["t=0: v1 steady state"]
P1A["pod checkout-v1-a"]
P1B["pod checkout-v1-b"]
P1C["pod checkout-v1-c"]
end
subgraph T1["t=1: rolling update"]
P1B2["pod checkout-v1-b"]
P2A["pod checkout-v2-a"]
P2B["pod checkout-v2-b"]
end
subgraph T2["t=2: v2 steady state"]
P2A2["pod checkout-v2-a"]
P2B2["pod checkout-v2-b"]
P2C["pod checkout-v2-c"]
end
T0 --> T1 --> T2
T0 --> Q["Stable query:<br/>sum by (service, version)<br/>(rate(http_requests_total[5m]))"]
T1 --> Q
T2 --> Q
Q --> Dash["Continuous series<br/>keyed by Deployment +<br/>version, not pod name"]
The cloud-native response is to treat workloads as the unit of observability, not pods or hosts. Aggregate metrics across all pods in a Deployment using labels like service, namespace, and version. Use Prometheus’s built-in Kubernetes service discovery to find scrape targets dynamically via the Kubernetes API rather than maintaining a static target list. Send logs from containers via a node-level DaemonSet collector (Fluent Bit, Vector) to a centralized store so that records survive the pod that created them [Source: https://prometheus.io].
A concrete example: during a rolling deployment from v1 to v2, you should be able to issue the query sum(rate(http_requests_total{service="checkout", version="v2"}[5m])) and trust the answer even though the underlying pods that contributed to it may all have been replaced during the query window.
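The reason the answer stays trustworthy is that the aggregation discards the unstable `pod` label before summing. The sketch below imitates that behavior in plain Python; `sum_by` is a hypothetical helper that mimics the semantics of PromQL's `sum by (...)`, and the per-pod rates are invented numbers for illustration.

```python
from collections import defaultdict


def sum_by(samples, keep_labels):
    """Mimic PromQL `sum by (...)`: keep only the named labels and add values."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels[name]) for name in keep_labels)
        totals[key] += value
    return dict(totals)


# Invented per-pod request rates (req/s) at two moments of a rolling update.
mid_rollout = [
    ({"service": "checkout", "version": "v1", "pod": "checkout-v1-b"}, 40.0),
    ({"service": "checkout", "version": "v2", "pod": "checkout-v2-a"}, 30.0),
    ({"service": "checkout", "version": "v2", "pod": "checkout-v2-b"}, 30.0),
]
after_rollout = [
    ({"service": "checkout", "version": "v2", "pod": "checkout-v2-a"}, 35.0),
    ({"service": "checkout", "version": "v2", "pod": "checkout-v2-b"}, 35.0),
    ({"service": "checkout", "version": "v2", "pod": "checkout-v2-c"}, 30.0),
]
```

The key `(service=checkout, version=v2)` exists in both snapshots even though the set of pods behind it changed, which is what keeps the series continuous on a dashboard.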
Key Takeaway: Ephemeral pods break host-centric monitoring. Cloud-native observability aggregates at the workload level using stable labels (deployment, namespace, version) and relies on dynamic Kubernetes service discovery rather than static target lists.
Microservices and the death of the call stack
In a monolith, debugging a user request usually means reading a single process’s stack trace. Almost all interesting work happens in-process, in a single language runtime, so a JVM profiler or .NET APM agent can see the whole transaction.
Microservices destroy this comfortable assumption. A single user request can traverse dozens of services, each potentially written in a different language (Go, Java, Node.js, Python, Rust, .NET) and communicating over multiple protocols (HTTP, unary and streaming gRPC, Kafka, SQS) [Source: https://opentelemetry.io]. There is no single stack trace; there is only a distributed call hierarchy that lives in no single process’s memory.
Traditional APM tools were designed around proprietary agents tightly coupled to specific runtimes. They cannot reliably stitch a request together when it crosses language boundaries or hops onto a message queue. The result is that engineers see traces only within one service, and cross-team debugging becomes guesswork: “Is it my service or theirs?” without a shared trace ID [Source: https://opentelemetry.io].
The cloud-native answer is distributed tracing built on open standards:
- Use OpenTelemetry SDKs in every language so all services produce traces in the same data model.
- Adopt W3C Trace Context (traceparent, tracestate) so trace IDs propagate consistently across HTTP, gRPC, and (where possible) message queues.
- Configure HTTP clients, gRPC interceptors, and message producers/consumers to automatically inject and extract trace context; engineers should not have to remember to do this manually.
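To show what "inject and extract" means at the wire level, here is a minimal sketch that parses and re-emits a `traceparent` header by hand. It handles only version `00` of the W3C format and ignores `tracestate`; the function names are hypothetical, and real services should rely on the OpenTelemetry propagators rather than code like this.

```python
import re
import secrets

# Version "00": 32 hex trace-id, 16 hex parent-id, 2 hex trace-flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Extract trace context from an incoming header, or None if invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(ctx):
    """Build the outgoing header: same trace ID, a fresh span ID as parent-id."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{span_id}-{flags}"
```

The trace ID is carried through unchanged on every hop while the parent ID changes per span, which is how a backend can later reassemble the parent-child hierarchy.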
Key Takeaway: The call stack does not span service boundaries in a microservices architecture. Distributed tracing — with vendor-neutral SDKs and W3C Trace Context propagation — is the only practical mechanism for reconstructing a request’s full path across polyglot services.
Service mesh, sidecars, and infrastructure telemetry
A service mesh like Istio, Linkerd, or Consul Connect injects a sidecar proxy (commonly Envoy) next to each application pod. Critical request behaviors — mTLS termination, retries, timeouts, circuit breakers, traffic splitting, routing — execute in the proxy, not in the application code [Source: https://istio.io].
This creates an observability gap for any tool that only watches application processes. The app may report a healthy error rate while the mesh is silently retrying 503 responses, throttling traffic, or failing TLS handshakes. A misconfigured destination rule or an aggressive outlier-detection policy can degrade user experience without ever touching application metrics [Source: https://istio.io].
The implication is that mesh proxies must be treated as first-class observability targets. Scrape Envoy stats endpoints alongside application metrics. Collect mesh access logs to see per-request behavior, including retries and route decisions. Combine application spans with mesh spans (via OpenTelemetry or Envoy tracing) so that an end-to-end trace shows exactly where time was spent — in the app, in the sidecar, or on the wire between them.
Figure 1.4: Application and sidecar emit separate, correlated telemetry streams
flowchart LR
subgraph Pod["Kubernetes Pod"]
App["Application container<br/>(business logic)"]
Envoy["Envoy sidecar<br/>(mTLS, retries, routing)"]
App <-->|"localhost"| Envoy
end
App -->|"app metrics<br/>(/metrics)"| Prom["Prometheus"]
App -->|"app traces<br/>(OTLP spans)"| Backend["Tracing backend<br/>(Jaeger / Tempo)"]
Envoy -->|"mesh metrics<br/>(Envoy stats)"| Prom
Envoy -->|"mesh access logs"| Logs["Log backend<br/>(Loki)"]
Envoy -->|"mesh spans"| Backend
Prom --> Graf["Grafana<br/>correlated view"]
Backend --> Graf
Logs --> Graf
Beyond the mesh, cloud-native observability must cover infrastructure telemetry as well: kubelet and control-plane metrics scraped from those components directly, object-state metrics (for Deployments, Pods, Nodes, and so on) from kube-state-metrics, node-level metrics from the node exporter, and Kubernetes events that describe pod restarts, evictions, and scheduling decisions. Correlating these with application telemetry is what lets you say, “the latency spike at 03:14 UTC coincided with the autoscaler evicting three pods from node-7.”
Key Takeaway: Modern request behavior is distributed across application code, sidecar proxies, and Kubernetes control-plane components. Cloud-native observability must instrument all three layers — and correlate them through shared labels and trace IDs — to give a complete picture of what happened.
The CNCF Observability Landscape
Prometheus as the de-facto metrics standard
Prometheus is the dominant metrics monitoring system in the CNCF ecosystem. It is a CNCF Graduated project and was the second project to reach that maturity tier, after Kubernetes itself [Source: https://www.cncf.io]. Graduation signals widespread, production-grade adoption, strong governance, and a healthy ecosystem of integrations.
Architecturally, Prometheus is a pull-based system. It expects services to expose HTTP /metrics endpoints in the Prometheus exposition format (or the equivalent OpenMetrics format) and scrapes them on a schedule. The scraped data is stored in a purpose-built time-series database, and engineers query it with PromQL, a domain-specific language designed for metrics analysis. Alerts are evaluated as PromQL rules and routed through the Alertmanager component [Source: https://prometheus.io].
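The text exposition format that Prometheus scrapes is simple enough to render by hand, which helps when reading a raw /metrics response. The sketch below covers only a counter with string label values; the function names are hypothetical, and production services should use an official client library instead.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter family as text exposition lines.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"
```

Each scrape returns the current value of every series; Prometheus attaches the timestamp and stores the sample, so the application itself keeps no history.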
Prometheus is metrics-only. It does not natively handle logs or traces — and that single-responsibility focus is part of why it is so durable. Its strengths are:
- Fast, local metrics scraping inside a Kubernetes cluster.
- Real-time evaluation of alert rules.
- A mature ecosystem of exporters (for databases, queues, hardware, and cloud services) and Kubernetes-native integrations like the Prometheus Operator's ServiceMonitor resources, which ship with the kube-prometheus-stack.
- A query language (PromQL) that most SREs in cloud-native shops already know.
For scaling beyond a single Prometheus instance, the ecosystem provides compatible long-term storage and federation systems — Thanos, Cortex, Mimir, and VictoriaMetrics — all of which speak PromQL and the Prometheus remote-write protocol.
Key Takeaway: Prometheus is the de-facto metrics standard in cloud-native systems: a CNCF Graduated, pull-based metrics database with PromQL, Alertmanager, and a vast exporter ecosystem. It excels at metrics — and only metrics.
OpenTelemetry as the vendor-neutral instrumentation standard
OpenTelemetry (often abbreviated OTel) is the CNCF’s answer to the historical chaos of proprietary, vendor-specific instrumentation. It is a CNCF project, accepted at the Incubating level in 2021, and its specifications, SDKs, and Collector have matured substantially since then [Source: https://opentelemetry.io].
OpenTelemetry’s scope is fundamentally different from Prometheus’s. It is not a backend; it is an instrumentation and pipeline standard. Its components are:
- SDKs in every major language (Go, Java, Python, Node.js, Rust, .NET, Ruby, and more) for emitting traces, metrics, and logs.
- Auto-instrumentation packages that wrap common libraries (HTTP clients, gRPC, database drivers, message queues) so applications can be instrumented with minimal code changes.
- A unified data model and semantic conventions so that attributes like http.route, k8s.namespace.name, and service.version mean the same thing across languages and signals.
- The OpenTelemetry Collector, a vendor-neutral process that receives, processes, and exports telemetry, deployable as a sidecar, DaemonSet, or central gateway.
- OTLP (OpenTelemetry Protocol), the standard wire protocol for traces, metrics, and logs, transported over gRPC or HTTP [Source: https://opentelemetry.io].
The vendor-neutrality argument is OpenTelemetry’s signature value. Instrument once with OpenTelemetry, and the choice of backend becomes a configuration decision rather than a code rewrite. Migrating from one tracing vendor to another — or running multiple in parallel — requires changes only in Collector pipelines, not in application source [Source: https://opentelemetry.io].
Key Takeaway: OpenTelemetry is the vendor-neutral instrumentation standard for cloud-native telemetry. It provides SDKs, auto-instrumentation, semantic conventions, a Collector, and the OTLP wire protocol — covering all three pillars (traces, metrics, logs) with one consistent model.
Backends: Grafana, Jaeger, Tempo, Loki, Mimir, and commercial vendors
Once telemetry is instrumented (via OpenTelemetry) and metrics are scraped or collected (via Prometheus or the Collector), it has to land somewhere. The cloud-native backend ecosystem in 2025 is rich and specialized, mixing CNCF projects with other open-source tools such as the Grafana stack:
| Backend | Signal | Role |
|---|---|---|
| Prometheus | Metrics | Local scraping, TSDB, PromQL, alerting |
| Mimir / Cortex / Thanos / VictoriaMetrics | Metrics | Long-term, horizontally scalable Prometheus-compatible storage |
| Jaeger | Traces | Distributed tracing backend, open-source, CNCF Graduated |
| Tempo | Traces | Grafana-stack trace backend optimized for object storage |
| Loki | Logs | Grafana-stack log aggregation, label-indexed (cheap) storage |
| Grafana | Visualization | Dashboards, alerting UI, multi-backend query interface |
Beyond the open-source landscape, every major commercial observability vendor — Datadog, New Relic, Splunk, Honeycomb, Dynatrace, Chronosphere, Lightstep, and others — natively ingests OTLP and most also support Prometheus remote-write [Source: https://opentelemetry.io]. This is the practical payoff of vendor neutrality: a team can move between open-source and SaaS backends, or even run hybrid configurations, without rewriting instrumentation.
The dominant 2025 pattern is hybrid:
- Instrument with OpenTelemetry SDKs in application code.
- Collect and route with the OpenTelemetry Collector deployed as a DaemonSet or gateway.
- Store metrics in Prometheus (and a long-term Prometheus-compatible backend like Mimir or VictoriaMetrics for retention).
- Store traces in Jaeger or Tempo.
- Store logs in Loki, OpenSearch, or a commercial backend.
- Visualize and alert in Grafana (or a vendor UI), with PromQL for metrics queries and trace/log queries cross-linked via shared IDs.
Figure 1.5: End-to-end cloud-native observability stack
flowchart LR
Apps["Applications<br/>(Go, Java, Python, ...)"] --> SDK["OpenTelemetry SDK<br/>+ auto-instrumentation"]
SDK -->|"OTLP (gRPC/HTTP)"| Col["OpenTelemetry Collector<br/>(DaemonSet / gateway)"]
Scrape["Prometheus scrape<br/>(/metrics endpoints)"] --> Prom["Prometheus<br/>(local TSDB + PromQL)"]
Apps -. "expose /metrics" .-> Scrape
Col -->|"metrics<br/>(remote-write)"| Mimir["Mimir / Thanos /<br/>VictoriaMetrics<br/>(long-term metrics)"]
Prom -->|"remote-write"| Mimir
Col -->|"traces"| Tempo["Tempo / Jaeger<br/>(trace storage)"]
Col -->|"logs"| Loki["Loki / OpenSearch<br/>(log storage)"]
Prom --> Graf["Grafana<br/>(dashboards + alerts)"]
Mimir --> Graf
Tempo --> Graf
Loki --> Graf
Prom --> AM["Alertmanager"]
In this arrangement Prometheus and OpenTelemetry are complementary, not competitors. In the spirit of guidance from the CNCF TAG Observability community: use OpenTelemetry for how you instrument and move telemetry; use Prometheus for where metrics live and how you query and alert on them [Source: https://www.cncf.io].
Key Takeaway: The 2025 cloud-native observability stack pairs OpenTelemetry (for instrumentation and routing) with Prometheus (for metrics storage and PromQL alerting), feeding specialized backends (Jaeger or Tempo for traces, Loki for logs, Grafana for visualization) and remaining portable between open-source and commercial vendors through OTLP.
Chapter Summary
Observability is a property of a system: the degree to which an operator can infer its internal state from its external outputs. In cloud-native environments — where workloads are ephemeral, requests cross dozens of services in different languages, and infrastructure behavior lives partially in sidecar proxies — that property must be designed in deliberately. Traditional monitoring, which checks predefined signals against thresholds, remains valuable for catching known failure modes but cannot explain the novel, unknown-unknown incidents that distributed systems routinely produce.
The three pillars — metrics, logs, and traces — are the raw materials of observability, each with different cardinality tolerance and cost characteristics. Metrics are low-cardinality time series ideal for alerting and SLOs; logs are high-cardinality discrete events ideal for detailed per-request context; traces are end-to-end records of a request’s path across services. The killer feature is correlation: exemplars link metrics to traces, trace IDs in structured logs link logs to traces, and Kubernetes resource attributes tie everything to the underlying workload, letting an operator pivot from “this chart looks bad” to “this exact log line in this exact span” in seconds.
In the CNCF ecosystem of 2025, two projects anchor the stack. Prometheus provides metrics scraping, the PromQL query language, and Alertmanager: the battle-tested metrics brain. OpenTelemetry provides language SDKs, semantic conventions, the Collector, and the OTLP wire protocol: the vendor-neutral nervous system for all three pillars. The dominant pattern is to instrument with OpenTelemetry, store metrics in Prometheus (or a Prometheus-compatible long-term store), and route traces and logs to specialized backends like Jaeger, Tempo, and Loki, with Grafana stitching it all together. The rest of this book builds out the practical mechanics of that stack.
Key Terms
| Term | Definition |
|---|---|
| observability | The property of a system whereby its internal state can be inferred from its external outputs (telemetry); in software, achieved by emitting rich, correlated metrics, logs, and traces. |
| telemetry | The data emitted by a system that describes its behavior — metrics, logs, traces, and increasingly profiles — collected for monitoring and observability purposes. |
| signal | A category of telemetry data. In OpenTelemetry parlance, the canonical signals are metrics, logs, and traces, each with its own data model. |
| cardinality | The number of unique combinations of label or attribute values in a telemetry dataset. High cardinality is tolerable for logs and traces but expensive for metrics, where each combination creates a separate time series. |
| ephemeral workload | A short-lived compute unit — most commonly a Kubernetes pod — whose identity (name, IP) is not stable and that may be created and destroyed within seconds by deployments, autoscaling, or crashes. |
| service mesh | An infrastructure layer (e.g., Istio, Linkerd) that injects sidecar proxies next to application pods to handle mTLS, retries, timeouts, routing, and traffic policy at the network layer rather than in application code. |
| CNCF | The Cloud Native Computing Foundation, the open-source foundation that hosts Kubernetes, Prometheus, OpenTelemetry, Jaeger, and many other graduated and incubating cloud-native projects. |
| vendor neutrality | A design principle, central to OpenTelemetry, in which instrumentation and telemetry pipelines are independent of any specific backend so that operators can switch or combine vendors without rewriting application code. |
Chapter 2: The Three Signals: Metrics, Logs, and Traces in Depth
Learning Objectives
- Compare the data models of metrics, logs, and traces and identify when each is the right tool for a given investigative question.
- Explain how exemplars and trace IDs correlate signals into a single investigative narrative across Prometheus, OpenTelemetry, and trace backends like Tempo or Jaeger.
- Quantify the storage and cardinality cost of each signal type, including the trade-offs of Prometheus histograms versus summaries and classic versus native histograms.
Chapter 1 introduced observability as a property that emerges from three complementary signals: metrics, logs, and traces. That framing is useful, but it can be misleading if you treat the three as interchangeable buckets of “telemetry.” They are not. Each signal has a distinct data model, distinct ingestion economics, and distinct questions it can answer well. Choosing the wrong signal for a question is like trying to measure room temperature with a video camera — you can sort of do it, but you’re paying enormous storage and processing costs for information that a $5 thermometer would deliver instantly.
This chapter goes one level deeper. We dissect the wire format of each signal, examine the cardinality math that drives cost, and then show how exemplars and W3C Trace Context turn three independent data streams into a single navigable investigative narrative.
Figure 2.1: The three signals and their correlation hooks
graph TD
M["Metrics<br/>counts & aggregates<br/>'how many / how fast'"]
L["Logs<br/>discrete events<br/>'what happened'"]
T["Traces<br/>causal span tree<br/>'why was it slow'"]
M -- "exemplars<br/>(trace_id pointers)" --- T
T -- "trace_id / span_id<br/>in LogRecord" --- L
M -- "shared Resource<br/>(service.name, k8s.*)" --- L
Hub(("Unified<br/>investigation"))
M --> Hub
L --> Hub
T --> Hub
Metrics — Numbers Over Time
A metric is a numeric measurement of a system property recorded at a point in time. Metrics are the cheapest of the three signals because the storage system records aggregates, not individual events. A counter that has been incremented one billion times occupies the same space as one that has been incremented ten times: each is stored as one timestamped sample per scrape, whatever its current value.
Think of metrics as the dashboard gauges in a car: speed, RPM, fuel level. They tell you the state of the system at a glance, but they don’t tell you why the engine started knocking three miles back.
Counters, gauges, histograms, and summaries
Prometheus, which has become the de facto standard for cloud-native metrics, defines four core instrument types:
- Counter: a monotonically increasing value that only goes up (or resets to zero on process restart). Examples: http_requests_total, errors_total, bytes_sent_total. Counters answer “how many?” and “at what rate?” via PromQL’s rate() function.
- Gauge: a value that can go up or down. Examples: memory_bytes_in_use, queue_depth, temperature_celsius. Gauges represent instantaneous state.
- Histogram: a distribution recorded as a set of cumulative buckets. Exposes *_bucket{le="0.1"}, *_bucket{le="0.5"}, etc., plus _sum and _count. Quantiles such as p90 and p99 are computed server-side in PromQL using histogram_quantile() over these buckets [Source: https://community.openai.com/t/how-to-confirm-that-you-got-the-correct-value-from-a-text-other-than-repeating-the-same-prompt-over-and-over/922371].
- Summary: a distribution where the client process computes quantiles locally (using a sliding-window algorithm like CKMS) and exposes them directly as series like request_duration_seconds{quantile="0.99"} [Source: https://blog.codinghorror.com/the-problem-with-logging/].
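To make the counter semantics concrete, here is a minimal Python sketch of how a rate can be derived from raw counter samples while tolerating a restart. It is an illustration only: the function names and sample data are hypothetical, and PromQL's real rate() additionally extrapolates to the edges of the query window, which this sketch omits.

```python
def increase(samples):
    """Sum the growth of a counter across (timestamp, value) samples.

    A drop in value is treated as a reset to zero, so the post-reset
    value itself counts as growth since the restart.
    """
    total = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        total += (value - previous) if value >= previous else value
        previous = value
    return total


def per_second_rate(samples):
    """Average per-second increase between the first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


# Hypothetical scrapes 15 s apart; the process restarts before t=45.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(increase(scrapes))         # growth with the reset accounted for
print(per_second_rate(scrapes))  # requests per second over the window
```

Because only differences between samples matter, the absolute value of a counter is rarely interesting on its own; that is why counters are almost always wrapped in rate() or increase() in queries.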
The histogram-versus-summary choice is one of the most consequential decisions in Prometheus instrumentation, and it trips up nearly every team at least once. The critical difference: summary quantiles cannot be aggregated across instances. Averaging the p99 from each of your ten pods does not give you the true global p99. You can safely aggregate only _sum and _count from a summary. Histogram buckets, in contrast, are fully aggregatable — you can sum() buckets across instances, zones, or services and then call histogram_quantile() on the aggregate to get a meaningful service-wide tail latency [Source: https://natesnewsletter.substack.com/p/context-windows-are-a-lie-the-myth].
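The aggregation argument can be checked numerically. The following Python sketch uses two hypothetical pods and hypothetical bucket bounds: it averages per-pod p99 values (the summary-style mistake) and compares that with a quantile estimated from summed histogram buckets. The interpolation mirrors the general idea behind histogram_quantile() for classic histograms, but it is a simplified stand-in, not Prometheus's implementation.

```python
from bisect import bisect_left
from math import ceil, inf

BOUNDS = [0.1, 0.5, 1.0, 5.0, inf]  # hypothetical `le` bounds, in seconds


def exact_quantile(q, values):
    """Nearest-rank quantile over raw observations."""
    ordered = sorted(values)
    return ordered[max(ceil(q * len(ordered)) - 1, 0)]


def cumulative_buckets(values):
    """Count observations <= each bound, like *_bucket{le=...} series."""
    counts = [0] * len(BOUNDS)
    for value in values:
        for i in range(bisect_left(BOUNDS, value), len(BOUNDS)):
            counts[i] += 1
    return counts


def bucket_quantile(q, cumulative):
    """Estimate a quantile by linear interpolation inside one bucket."""
    rank = q * cumulative[-1]
    for i, count in enumerate(cumulative):
        if count >= rank and count > 0:
            if BOUNDS[i] == inf:
                return BOUNDS[i - 1]  # clamp to the highest finite bound
            lower = BOUNDS[i - 1] if i else 0.0
            below = cumulative[i - 1] if i else 0
            return lower + (BOUNDS[i] - lower) * (rank - below) / (count - below)
    return float("nan")  # empty histogram


# Hypothetical traffic: a busy fast pod and a nearly idle slow pod.
pod_a = [0.05] * 1000
pod_b = [3.0] * 10

naive_average = (exact_quantile(0.99, pod_a) + exact_quantile(0.99, pod_b)) / 2
true_p99 = exact_quantile(0.99, pod_a + pod_b)
merged = [a + b for a, b in zip(cumulative_buckets(pod_a), cumulative_buckets(pod_b))]
bucket_p99 = bucket_quantile(0.99, merged)

print(naive_average, true_p99, bucket_p99)
```

Averaging the two per-pod p99 values lands well above one second even though 99% of all requests finished in 50 ms. The estimate from the merged buckets stays inside the lowest bucket, which is as precise as the chosen bounds allow.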
| Aspect | Counter | Gauge | Histogram | Summary |
|---|---|---|---|---|
| Direction | Up only | Any | Distribution | Distribution |
| Quantile compute | N/A | N/A | Server-side (PromQL) | Client-side (in process) |
| Cross-instance aggregation | Trivial | Trivial | Yes (sum buckets) | Only _sum/_count |
| Cardinality per metric | 1 series | 1 series | N buckets + 2 | N quantiles + 2 |
| Best for | Rates, totals | Levels, sizes | SLO latency, fleet stats | Per-instance fixed quantiles |
The practical rule: for SLO-critical latency where you need fleet-wide p99 and the ability to choose new quantiles later without redeploying code, use histograms. For low-cardinality, per-process diagnostics where you only ever care about one box at a time, summaries are acceptable [Source: https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches].
Classic histograms have a well-known drawback: bucket count multiplies cardinality. Native histograms, introduced as an experimental feature in Prometheus 2.40, encode dynamic log-spaced buckets inside a single time series using a binary format, dramatically reducing series count while supporting high resolution [Source: https://pubmed.ncbi.nlm.nih.gov/help/]. If your stack supports them end-to-end (client library, server, remote storage), they are usually the better default for new metrics.
Time-series labels and dimensionality
A Prometheus time series is uniquely identified by its metric name plus the set of label key-value pairs:
http_requests_total{method="GET", path="/api/orders", status="200", service="checkout"}
Each unique combination of labels creates a new time series with its own storage allocation. This is the cardinality of a metric, and it is the single biggest cost driver in any Prometheus deployment.
The math is brutal. Suppose http_requests_total carries five labels with these cardinalities:
- method: 5 (GET, POST, PUT, DELETE, PATCH)
- path: 50 endpoints
- status: 6 (2xx, 3xx, 4xx grouped)
- service: 20
- region: 4
Total potential series: 5 × 50 × 6 × 20 × 4 = 120,000 series for one metric. Add a user_id label with even 10,000 unique values and you have 1.2 billion potential series. This is why veterans say “never label by user ID, request ID, or trace ID.” It’s also why exemplars exist — they let you reference a trace ID without making it a series-defining label.
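The multiplication is easy to reproduce. This short Python sketch recomputes the numbers above and, as an additional assumption not taken from the text, shows what happens if the same label set were applied to a classic histogram with 12 buckets (using the "N buckets + 2" rule for _sum and _count).

```python
from math import prod

# Label cardinalities from the example above.
label_cardinalities = {"method": 5, "path": 50, "status": 6, "service": 20, "region": 4}


def potential_series(cardinalities):
    """Worst case: every label value combines with every other."""
    return prod(cardinalities.values())


base = potential_series(label_cardinalities)
with_user_id = potential_series({**label_cardinalities, "user_id": 10_000})

# Hypothetical: a classic histogram with 12 buckets adds _sum and _count.
histogram_series = base * (12 + 2)

print(f"{base:,} / {with_user_id:,} / {histogram_series:,}")
```

These are upper bounds: real workloads rarely exercise every combination, but the worst case is what capacity planning has to survive.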
A useful analogy: labels are dimensions of a spreadsheet. Each dimension multiplies the number of cells. A 2D sheet of methods × paths is manageable; a 7D sheet quickly grows past anything you can afford to store or query.
Aggregation, downsampling, and retention
Once metrics are in the time-series database, the next concern is keeping them around without going broke. A typical Prometheus configuration scrapes every 15 seconds (the built-in default interval is 1 minute) and keeps raw samples locally for 15 days by default. For longer retention, the standard pattern is:
- Recording rules evaluate expensive PromQL queries periodically and store the result as a new, cheaper time series. For example, instance:cpu_usage:rate5m is far cheaper to query over a year than re-aggregating raw CPU samples each time.
- Remote write ships data to long-term storage backends like Thanos, Cortex, or Mimir, which handle compaction and downsampling.
- Downsampling reduces resolution as data ages: for example, raw 15-second samples might become 5-minute aggregates after 30 days and 1-hour aggregates after a year.
For histograms, downsampling is more nuanced. Classic histogram buckets compose cleanly: sum the buckets, then compute the quantile. Native histograms can also be merged when their resolutions differ, by reducing the finer one to the coarser bucket layout. Summary quantile series cannot be downsampled meaningfully at all, because they are already lossy projections; only a summary’s _sum and _count remain safe to aggregate.
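As a rough illustration of downsampling for a gauge-like series, the Python sketch below averages raw samples into fixed, aligned windows. The function name and data are hypothetical, and it keeps only the mean; production downsamplers typically retain more per window (for example min, max, sum, and count) so that later queries can still choose an aggregation.

```python
from collections import defaultdict


def downsample(samples, window_seconds):
    """Average (timestamp, value) samples into fixed, aligned windows."""
    sums = defaultdict(lambda: [0.0, 0])  # window start -> [total, count]
    for ts, value in samples:
        window_start = ts - (ts % window_seconds)
        sums[window_start][0] += value
        sums[window_start][1] += 1
    return [(start, total / count) for start, (total, count) in sorted(sums.items())]


# Hypothetical gauge: 40 samples, 15 s apart, stepping from 0.0 to 1.0 at t=300.
raw = [(t, float(t // 300)) for t in range(0, 600, 15)]
coarse = downsample(raw, 300)
print(len(raw), "->", len(coarse), coarse)
```

Forty raw points collapse into two, which is the storage win; the cost is that anything shorter than a window (a 30-second spike, say) is smoothed away.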
Key Takeaway: Metrics are cheap, aggregated, and ideal for “how many” and “how fast” questions. The cost is dominated by cardinality — label combinations multiply storage — and the histogram-versus-summary choice determines whether you can meaningfully ask fleet-wide tail-latency questions.
Logs — Structured Events
A log is a discrete event record emitted by an application at a specific moment. Where a metric says “23,481 requests in the last minute,” a log says “request #382749 from user 42 to /api/orders failed at 10:03:21 with NullPointerException.” Logs preserve the individual story that metrics aggregate away.
The car analogy: if metrics are the dashboard, logs are the stored diagnostic trouble codes that a mechanic reads out with a scan tool. They record specific events with their context.
Structured vs unstructured logging
For decades, application logs looked like this:
2024-05-10T10:00:00Z my-host app: ERROR user 42 not found
This is unstructured logging — a free-form string. Humans can read it; computers struggle. To search for “all errors involving user 42 across the fleet,” some downstream tool has to parse that string with regular expressions, hope every team used the same format, and reconstruct fields that the application already knew. It’s an information-destruction pipeline.
Structured logging flips this around: the application emits machine-readable key-value records from the start.
{
"ts": "2024-05-10T10:00:00Z",
"level": "info",
"service": "billing",
"request_id": "abc-123",
"user_id": 42,
"msg": "charged credit card",
"amount": 99.99
}
Now “all errors involving user 42” is an indexed field lookup, not a regex hunt. OpenTelemetry takes structured logging a step further by defining a formal LogRecord data model [Source: https://blog.codinghorror.com/the-problem-with-logging/]. A LogRecord includes:
- timestamp: when the event occurred (application time).
- observed_timestamp: when the collector saw it (pipeline time).
- severity_number: a normalized numeric severity on a 1–24 scale: TRACE (1–4), DEBUG (5–8), INFO (9–12), WARN (13–16), ERROR (17–20), FATAL (21–24).
- severity_text: the original severity string from the source (“WARN”, “Warning”, “warn”).
- body: the main content, typed as AnyValue (string, number, map, or list).
- attributes: structured key-value context (e.g., http.method = "GET", db.system = "postgresql").
- resource: service/host/cluster metadata shared with traces and metrics from the same process.
- trace_id, span_id, trace_flags: first-class trace context fields.
- instrumentation_scope: which library produced the record.
- dropped_attributes_count: bookkeeping for attribute limits.
The value of severity_number is that it normalizes severities across logging frameworks. A backend query like severity_number >= 17 (“ERROR or worse”) works whether the source used Python’s logging.ERROR, Log4j’s ERROR, Go’s slog.LevelError, or .NET’s LogLevel.Error [Source: https://natesnewsletter.substack.com/p/context-windows-are-a-lie-the-myth].
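A small Python sketch shows how this normalization might look from the consumer side. The range table comes from the severity scale described above; the mapping from Python's standard logging levels to the first number of each range is an assumption made for illustration, not a quotation of any SDK's code.

```python
import logging

# (low, high, name) ranges of the 1-24 severity scale.
SEVERITY_RANGES = [
    (1, 4, "TRACE"), (5, 8, "DEBUG"), (9, 12, "INFO"),
    (13, 16, "WARN"), (17, 20, "ERROR"), (21, 24, "FATAL"),
]

# Assumed mapping: each stdlib level maps to the base number of its range.
PY_LEVEL_TO_SEVERITY = {
    logging.DEBUG: 5,
    logging.INFO: 9,
    logging.WARNING: 13,
    logging.ERROR: 17,
    logging.CRITICAL: 21,
}


def severity_name(severity_number):
    """Return the range name for a normalized severity number."""
    for low, high, name in SEVERITY_RANGES:
        if low <= severity_number <= high:
            return name
    raise ValueError(f"severity_number out of range: {severity_number}")


def is_error_or_worse(severity_number):
    """The backend-side filter: one comparison, independent of framework."""
    return severity_number >= 17


for level in (logging.WARNING, logging.ERROR, logging.CRITICAL):
    number = PY_LEVEL_TO_SEVERITY[level]
    print(logging.getLevelName(level), number, severity_name(number), is_error_or_worse(number))
```

The original string still travels in severity_text, so nothing is lost by normalizing; the number simply gives every backend the same comparison to run.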
Figure 2.2: OpenTelemetry LogRecord schema
graph TD
LR["LogRecord"]
LR --> TS["timestamp<br/>application time"]
LR --> OTS["observed_timestamp<br/>pipeline time"]
LR --> SEV["Severity"]
SEV --> SN["severity_number<br/>1-24 normalized"]
SEV --> ST["severity_text<br/>'WARN', 'Warning'..."]
LR --> B["body : AnyValue<br/>string / num / map / list"]
LR --> A["attributes<br/>http.method, db.system..."]
LR --> R["resource<br/>service.name, k8s.*, host.*"]
LR --> TC["Trace Context"]
TC --> TID["trace_id"]
TC --> SID["span_id"]
TC --> TF["trace_flags"]
LR --> IS["instrumentation_scope"]
LR --> DA["dropped_attributes_count"]
Log levels, ingestion pipelines, and indexing strategies
Application code chooses a severity level at the log call site:
- DEBUG/TRACE — verbose internal state, off in production.
- INFO — normal lifecycle events: startup, request completed.
- WARN — degraded behavior that did not fail (retry succeeded, fallback used).
- ERROR — operation failed, often actionable.
- FATAL — process-ending or critical.
A typical pipeline looks like this:
- Application writes structured logs to stdout (the Twelve-Factor recommendation).
- Node agent (Fluent Bit, Vector, or OpenTelemetry Collector) tails container stdout, parses, enriches with Kubernetes metadata (pod name, namespace, node), and ships.
- Aggregator/gateway buffers, batches, and may sample or redact.
- Storage backend indexes for search. Common choices: Elasticsearch (full-text inverted index), Loki (label-only index over chunked raw lines), or vendor services.
Indexing strategy matters enormously for cost. Elasticsearch-style full-text indexing lets you search any word in any log instantly, but the index can be larger than the data itself and grows roughly linearly with log volume. Loki-style label-only indexing keeps the index tiny by only indexing a handful of labels (service, pod, namespace) and storing raw log lines compressed; queries then grep through chunks. The trade-off: Loki is dramatically cheaper to operate but slower for arbitrary substring searches.
Compared to traditional shippers like Fluentd or Logstash, the OpenTelemetry approach standardizes the data model itself. Fluentd and Logstash are pipelines — powerful at routing and transforming, but they impose no global schema. Each route defines its own JSON shape, severity semantics, and field names. The OpenTelemetry Collector plays a similar pipeline role but operates on the typed OTLP model across all three signals [Source: http://susandumais.com/CHI2012-12-tailanswers-chi2012.pdf].
| Aspect | OpenTelemetry Logs | Fluentd / Logstash |
|---|---|---|
| Data model | Standardized LogRecord schema | No global standard; per-pipeline JSON |
| Severity | Normalized severity_number + severity_text | String, ad-hoc normalization |
| Service context | Built-in Resource shared with traces/metrics | Plugin-specific conventions |
| Trace correlation | First-class trace_id/span_id fields | Manual injection + custom parsing |
| Multi-signal | Logs, traces, metrics share Resource | Logs only |
| Pipeline | OTel Collector (multi-signal) | Log-centric |
| Semantic conventions | Official spec (http.*, db.*, rpc.*) | None standardized |
Why logs alone are insufficient for distributed root cause analysis
A common misconception is that “good logs” suffice for observability. They don’t, for a fundamental architectural reason: logs are emitted per service, but root causes in distributed systems live in the relationships between services.
Consider a checkout request that times out. The checkout-service log shows “called inventory: timeout after 5s.” The inventory-service log shows “received call, query took 4.8s.” The postgres log shows “lock wait.” These three lines, scattered across three different log streams, describe one causal chain. Reconstructing it requires:
- Knowing the request ID and propagating it consistently through every service (most teams miss at least one boundary).
- Time-aligning logs whose clocks may differ by milliseconds or seconds.
- Guessing the causal order from timestamps that don’t capture parent/child relationships.
- Repeating this for every layer of the call graph.
This is exactly the problem traces solve, and it’s why logs alone — no matter how structured — cannot answer “what made this specific slow request slow?” in a microservices architecture [Source: https://arxiv.org/html/2501.11709v3]. Logs complement traces by carrying the rich context of individual events; they do not replace the causal graph.
Key Takeaway: Structured logs with a standardized data model (the OpenTelemetry LogRecord) capture rich per-event detail with normalized severity and shared resource context, but they cannot reconstruct causality across services on their own — that’s what traces are for.
Traces — Causally-linked Spans
A trace is the recorded journey of a single request through a distributed system. If metrics are the dashboard and logs are the event log, traces are the GPS track — they tell you not just that the trip happened but the exact route taken, which segments were slow, and where the detours were.
Trace, span, and span context
The core data model is a small hierarchy:
- A trace is identified by a 128-bit trace_id (32 hex characters), e.g. 4bf92f3577b34da6a3ce929d0e0e4736. All work performed in service of a single logical request shares this ID.
- A span is a single named unit of work within the trace, identified by a 64-bit span_id (16 hex characters), e.g. 00f067aa0ba902b7. Each span has a start time, end time, name (often the operation, like GET /api/orders), status, and attributes.
- A span context is the immutable propagation envelope that carries trace_id, current span_id, and trace_flags (such as the sampling bit) across process boundaries. The standard wire format is the W3C Trace Context traceparent HTTP header [Source: https://developers.openai.com/cookbook/examples/gpt4-1_prompting_guide]:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
The four dash-separated fields are, in order: version (00), trace_id, the caller’s span_id (named parent-id in the W3C specification), and trace_flags (01 means sampled).
When checkout-service calls inventory-service, it copies its current span context into a traceparent header on the outbound HTTP request. The receiving service reads that header, knows the parent’s trace_id and span_id, and starts a child span under the same trace. This is how a single trace stitches itself together across N services with no central coordinator.
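The propagation step can be sketched in a few lines of Python. This is a simplified illustration that only handles version 00 of the header; the function names are hypothetical, and a real service would rely on its OpenTelemetry SDK's propagator instead of hand-rolled parsing.

```python
import re
import secrets

# version 00: 2 hex, 32 hex trace_id, 16 hex parent span_id, 2 hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Extract the caller's context from an incoming traceparent header."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        raise ValueError("unsupported or malformed traceparent")
    trace_id, parent_span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        raise ValueError("all-zero IDs are invalid")
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming_header):
    """Start a child span: same trace_id, fresh span_id, same sampling bit."""
    context = parse_traceparent(incoming_header)
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if context["sampled"] else "00"
    return f"00-{context['trace_id']}-{span_id}-{flags}", span_id


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
context = parse_traceparent(incoming)
outgoing, child_span_id = child_traceparent(incoming)
print(context)
print(outgoing)
```

The receiving service records parent_span_id on its new span and sends the outgoing header on any calls it makes in turn, so each hop only needs the header it was handed.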
A concrete example: a user clicks “place order.” The browser hits checkout-service, which calls inventory-service (which in turn queries postgres) and then payment-service. Figure 2.3 shows the resulting six spans, all sharing one trace_id, with every span except the root pointing to its parent’s span_id. Drawn on a timeline, this is the familiar waterfall (or flame graph) view that tracing UIs render.
Figure 2.3: Span tree for a checkout request (one trace_id, parent-child via span_id)
graph TD
Root["POST /checkout<br/>SERVER · checkout-service<br/>span_id=a1 · 980ms"]
Root --> Inv["inventory.check<br/>CLIENT · checkout-service<br/>span_id=b2 · parent=a1 · 620ms"]
Inv --> InvS["GET /inventory<br/>SERVER · inventory-service<br/>span_id=c3 · parent=b2 · 600ms"]
InvS --> DB["db.query SELECT stock<br/>CLIENT · inventory-service<br/>span_id=d4 · parent=c3 · 540ms"]
Root --> Pay["payment.charge<br/>CLIENT · checkout-service<br/>span_id=e5 · parent=a1 · 320ms"]
Pay --> PayS["POST /charge<br/>SERVER · payment-service<br/>span_id=f6 · parent=e5 · 300ms"]
Parent-child relationships and span kinds
Each span (except the root) carries the span_id of its parent. This builds an explicit tree rather than the implicit one you’d have to reconstruct from log timestamps. The tree captures causality (the parent caused the child to happen), not merely temporal proximity.
OpenTelemetry classifies spans by span kind, which clarifies the role of the span in a network interaction:
- INTERNAL — work entirely within a single process (e.g., a slow function).
- SERVER — a span that receives an incoming RPC (the “callee” side of a request).
- CLIENT — a span that sends an outgoing RPC (the “caller” side).
- PRODUCER — emits a message to a queue (e.g., publish to Kafka).
- CONSUMER — receives a message from a queue.
A typical HTTP call generates two spans: a CLIENT span on the caller and a SERVER span on the callee, linked by the same trace_id and a parent-child relationship via the propagated context. Span kinds let analysis tools draw service maps automatically: any CLIENT span pointing to a SERVER span in a different service implies an edge between those services.
Spans also carry events (timestamped sub-points, e.g., “cache miss”) and links (references to other traces, useful for fan-out or queue-driven workflows where one input span causes work in many traces).
Service maps derived from trace topology
Because every cross-service call produces a CLIENT/SERVER pair tagged with service.name, a tracing backend can build a live service map by aggregating recent traces:
- Each service.name becomes a node.
- Each observed parent-child cross-service relationship becomes an edge.
- Edge weight = call rate; edge color = error rate or latency.
You no longer maintain a hand-drawn architecture diagram. The system draws its own current architecture from the traces it processes, automatically reflecting newly deployed services, retired ones, and unexpected dependencies (the classic “wait, why is checkout-service calling legacy-billing directly?” discovery).
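The derivation can be sketched in Python using the six spans from Figure 2.3. The dictionary layout and function name are hypothetical; real backends work on far larger windows of traces and often infer additional edges (for example to uninstrumented databases) from span attributes, which this sketch does not attempt.

```python
from collections import Counter

# The spans from Figure 2.3, reduced to the fields the service map needs.
spans = [
    {"span_id": "a1", "parent_id": None, "kind": "SERVER", "service": "checkout-service"},
    {"span_id": "b2", "parent_id": "a1", "kind": "CLIENT", "service": "checkout-service"},
    {"span_id": "c3", "parent_id": "b2", "kind": "SERVER", "service": "inventory-service"},
    {"span_id": "d4", "parent_id": "c3", "kind": "CLIENT", "service": "inventory-service"},
    {"span_id": "e5", "parent_id": "a1", "kind": "CLIENT", "service": "checkout-service"},
    {"span_id": "f6", "parent_id": "e5", "kind": "SERVER", "service": "payment-service"},
]


def service_edges(spans):
    """Count caller -> callee edges from CLIENT parent / SERVER child pairs."""
    by_id = {span["span_id"]: span for span in spans}
    edges = Counter()
    for span in spans:
        parent = by_id.get(span["parent_id"])
        if (
            parent is not None
            and parent["kind"] == "CLIENT"
            and span["kind"] == "SERVER"
            and parent["service"] != span["service"]
        ):
            edges[(parent["service"], span["service"])] += 1
    return edges


edges = service_edges(spans)
for (caller, callee), calls in sorted(edges.items()):
    print(f"{caller} -> {callee}: {calls} call(s)")
```

Note that the db.query CLIENT span (d4) produces no edge here because postgres emits no SERVER span of its own in this trace.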
Figure 2.4: Service map auto-derived from CLIENT→SERVER span pairs
graph LR
Web["web-frontend"]
Co["checkout-service"]
Inv["inventory-service"]
Pay["payment-service"]
PG[("postgres")]
Rd[("redis")]
Web -- "1200 rps<br/>err 0.1%" --> Co
Co -- "1100 rps<br/>err 0.4%" --> Inv
Co -- "900 rps<br/>err 2.1% (hot)" --> Pay
Inv -- "1100 rps<br/>p99 540ms" --> PG
Co -- "1200 rps<br/>p99 5ms" --> Rd
Pay -- "900 rps<br/>err 1.8%" --> PG
Key Takeaway: Traces capture causality across services via trace_id and parent span_id propagation through W3C Trace Context, enabling per-request flame graphs and auto-derived service maps — the one capability metrics and logs cannot deliver.
Correlating the Three
Each signal in isolation is useful; together they become an investigative narrative. The mechanics of correlation — how a dot on a Grafana chart links to a trace, which in turn surfaces the relevant log lines — rely on three standardized hooks.
Exemplars in Prometheus histograms
The naive way to link a latency spike to a slow request is to put trace_id in the metric labels. That blows up cardinality instantly: every unique trace_id creates a new series, and you’ll have billions in a day. Exemplars solve this by attaching a few representative trace_id/span_id pointers alongside metric samples without making them series-defining labels [Source: https://learn.microsoft.com/en-us/azure/foundry/openai/concepts/prompt-engineering].
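The OpenMetrics text exposition format appends an exemplar after the sample value, introduced by #. The exemplar carries its own label set, the observed value, and an optional timestamp in seconds (the 0.43 here is an illustrative observation that falls inside the le="0.5" bucket):
http_server_request_duration_seconds_bucket{le="0.5",method="GET",service="api"} 240 # {trace_id="4bf92f3577b34da6a3ce929d0e0e4736",span_id="00f067aa0ba902b7"} 0.43 1700000099.0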
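For readers who want to see the pieces assembled, here is a minimal Python sketch that builds such a line, assuming the OpenMetrics layout of sample, then "#", then the exemplar's label set, value, and optional timestamp. The helper names are hypothetical, label values are not escaped, and the 0.43 observation is made up; in practice a client library generates this output for you.

```python
MAX_EXEMPLAR_LABEL_CHARS = 128  # OpenMetrics limit on combined label names + values


def render_labels(labels):
    """Render a label dict as name="value" pairs (no escaping in this sketch)."""
    return ",".join(f'{name}="{value}"' for name, value in labels.items())


def exemplar_line(metric, labels, value, exemplar_labels, exemplar_value, exemplar_ts=None):
    """Build one sample line with an exemplar appended after '#'."""
    size = sum(len(name) + len(val) for name, val in exemplar_labels.items())
    if size > MAX_EXEMPLAR_LABEL_CHARS:
        raise ValueError("exemplar label set exceeds 128 characters")
    line = (
        f"{metric}{{{render_labels(labels)}}} {value}"
        f" # {{{render_labels(exemplar_labels)}}} {exemplar_value}"
    )
    if exemplar_ts is not None:
        line += f" {exemplar_ts}"
    return line


line = exemplar_line(
    "http_server_request_duration_seconds_bucket",
    {"le": "0.5", "method": "GET", "service": "api"},
    240,
    {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7"},
    0.43,            # hypothetical observed latency in seconds
    1700000099.0,    # hypothetical exemplar timestamp (seconds)
)
print(line)
```

The size check reflects why exemplars stay cheap: a trace_id plus span_id uses 63 of the 128 permitted characters, leaving little room to smuggle in extra high-cardinality context.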
Constraints worth memorizing:
- At most one exemplar per series per scrape interval is typically exported.
- Exemplars attach most commonly to histogram buckets (great for “find me a trace in this latency bucket”) and sometimes to counter increments (great for “find me a trace for this kind of error”).
- Exemplars live in a separate storage path from ordinary time series in Prometheus TSDB, specifically to avoid the cardinality explosion of storing high-cardinality IDs alongside normal series.
The trace ID format is exactly the W3C Trace Context format used by OpenTelemetry — 32 hex chars for trace_id, 16 hex chars for span_id — and the values must be the same IDs your tracer is shipping to Tempo or Jaeger. Mismatched propagation is the most common reason exemplars appear in Grafana but the “View trace” link returns “trace not found.”
The end-to-end workflow when a developer investigates a latency spike at 10:05:
- Grafana renders the p99 latency line from histogram_quantile(0.99, sum by (le) (rate(http_server_request_duration_seconds_bucket[5m]))).
- Small dots appear along the line wherever Prometheus has stored exemplars.
- Developer hovers a dot at the spike; popup shows trace_id=4bf92f3577…, latency 2.3s.
- Click “View trace.” Grafana queries the Tempo data source by trace_id.
- The trace opens: checkout-service called inventory-service, which sat on a database lock for 2.1s.
The dot was sampled — not every slow request gets an exemplar, just enough to give the investigator a starting point. This is the practical realization of the metrics-to-traces correlation that the three pillars metaphor promises.
Figure 2.5: Metric-to-trace exemplar drill-down workflow
sequenceDiagram
participant Dev as Developer
participant Graf as Grafana
participant Prom as Prometheus<br/>(TSDB + exemplar store)
participant Tempo as Tempo
Dev->>Graf: open p99 latency dashboard
Graf->>Prom: histogram_quantile(0.99, ...)
Prom-->>Graf: latency series + exemplar dots
Note over Graf: spike at 10:05<br/>dot shows trace_id=4bf92f35...
Dev->>Graf: hover dot, click "View trace"
Graf->>Tempo: GET /api/traces/4bf92f35...
Tempo-->>Graf: span tree (checkout → inventory → db)
Note over Dev,Tempo: db.query span = 2.1s<br/>(lock wait identified)
Trace-to-logs joins via trace_id and span_id
The second leg of the correlation triangle is trace-to-logs, made possible because the OpenTelemetry LogRecord schema reserves dedicated trace_id and span_id fields at the top level — not buried in attributes [Source: https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-prompting-best-practices].
When an application logs inside an active span, the OpenTelemetry logging integration automatically copies the current span’s trace_id and span_id into the LogRecord. In Java with log4j2, in Python with the OTel logging instrumentation, in Go with the otelslog handler — the SDK does it for you. You no longer have to remember to call logger.info("...", "trace_id", currentSpan.TraceID()) at every log site.
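The mechanism behind that automatic copy can be imitated with the Python standard library alone. In this sketch a context variable stands in for the SDK's notion of the active span, and a logging filter stamps every record with its IDs. All names here (current_span, TraceContextFilter, JsonFormatter) are hypothetical; the real OpenTelemetry integrations do the equivalent internally.

```python
import contextvars
import io
import json
import logging

# Hypothetical stand-in for the SDK's "currently active span" context.
current_span = contextvars.ContextVar("current_span", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active span's IDs onto every log record."""

    def filter(self, record):
        span = current_span.get()
        record.trace_id = span["trace_id"] if span else ""
        record.span_id = span["span_id"] if span else ""
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record with trace fields at the top level."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "msg": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(TraceContextFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout-demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Inside a span: the call site passes no trace arguments at all.
token = current_span.set({"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
                          "span_id": "00f067aa0ba902b7"})
logger.info("called inventory")
current_span.reset(token)

# Outside any span: the fields are present but empty.
logger.info("idle")

emitted = [json.loads(text) for text in stream.getvalue().splitlines()]
print(emitted)
```

The design point is that correlation lives in the logging pipeline rather than at each call site, so a forgotten argument can no longer break the join.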
From the trace view in Grafana or Tempo, “show logs for this span” is then a single derived field query:
{service="checkout"} |= "4bf92f3577b34da6a3ce929d0e0e4736"
The trace_id acts as the universal join key. The same query in reverse — start in logs, jump to the trace — gives you the bidirectional navigation that makes investigations feel fluid instead of archaeological.
A subtle pitfall: sampling mismatches. If you head-sample traces at 10% but log unconditionally on every request, 90% of your log trace_ids will point to traces that don’t exist in Tempo. To avoid this, use tail-based sampling (decide which traces to keep based on what happened in them — errors and slow ones always kept) or ensure that the sampling decision is made early and propagated, so logs from non-sampled traces don’t carry stale IDs.
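The size of the mismatch is easy to simulate. The Python sketch below makes a deterministic head-sampling decision from the low 64 bits of the trace ID (one common way to implement ratio-based sampling, assumed here for illustration) and then counts how many log records would point at traces that were never stored. The function names and the idea of gating trace links on trace_flags are illustrative assumptions.

```python
import random


def head_sample(trace_id_hex, ratio):
    """Keep a trace if the low 64 bits of its ID fall below ratio * 2**64."""
    threshold = int(ratio * (1 << 64))
    return int(trace_id_hex[16:], 16) < threshold


def log_record(trace_id, sampled):
    """A log line that carries the sampling decision next to the trace ID."""
    return {"trace_id": trace_id, "trace_flags": 1 if sampled else 0}


def linkable(record):
    """Only offer a 'view trace' link when the sampled bit is set."""
    return bool(record["trace_flags"] & 1)


rng = random.Random(42)  # seeded so the simulation is repeatable
trace_ids = [f"{rng.getrandbits(128):032x}" for _ in range(10_000)]

kept = [tid for tid in trace_ids if head_sample(tid, 0.10)]
logs = [log_record(tid, head_sample(tid, 0.10)) for tid in trace_ids]

dangling_share = 1 - len(kept) / len(trace_ids)
links = [record["trace_id"] for record in logs if linkable(record)]
print(f"dangling without the flag: {dangling_share:.1%}, links offered: {len(links)}")
```

Because the decision depends only on the trace ID, every service that sees the same ID reaches the same verdict, and a log backend that honors the flag never offers a link to a trace that was dropped.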
Unified resource attributes across all signals
The third correlation hook is the Resource — the OpenTelemetry concept of a small, fixed set of attributes describing where the telemetry came from. The same resource attributes are attached to every metric, log, and trace emitted by a process:
service.name = "checkout-service"
service.namespace = "payments"
service.instance.id = "pod-abc123"
service.version = "1.4.2"
deployment.environment = "prod"
k8s.pod.name = "checkout-7f8b9-abcde"
k8s.namespace.name = "payments-prod"
host.name = "ip-10-0-1-42.ec2.internal"
cloud.region = "us-east-1"
Because all three signals share these attributes verbatim, a query like “all telemetry for service.name = checkout-service in deployment.environment = prod” returns matching metrics, logs, and traces from a single filter — no vendor-specific tag mapping, no per-tool index naming conventions. This is the practical meaning of the OpenTelemetry promise of “vendor-neutral observability”: a service.name in one tool means the same thing in another.
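A toy Python example illustrates the single-filter idea. The records and the select function are hypothetical; each signal is just a list of dictionaries that carry the same resource block, which is the only thing the filter inspects.

```python
def matches(resource, wanted):
    """True when every wanted attribute is present with the same value."""
    return all(resource.get(key) == value for key, value in wanted.items())


def select(signals, wanted):
    """Apply one resource filter across metrics, logs, and traces alike."""
    return {
        kind: [record for record in records if matches(record["resource"], wanted)]
        for kind, records in signals.items()
    }


checkout_prod = {"service.name": "checkout-service", "deployment.environment": "prod"}
checkout_dev = {"service.name": "checkout-service", "deployment.environment": "dev"}
inventory_prod = {"service.name": "inventory-service", "deployment.environment": "prod"}

# Hypothetical telemetry; only the shared resource block matters here.
signals = {
    "metrics": [
        {"name": "http_requests_total", "resource": checkout_prod},
        {"name": "http_requests_total", "resource": inventory_prod},
    ],
    "logs": [
        {"body": "charged credit card", "resource": checkout_prod},
        {"body": "debug build started", "resource": checkout_dev},
    ],
    "traces": [
        {"span": "POST /checkout", "resource": checkout_prod},
    ],
}

result = select(signals, {"service.name": "checkout-service", "deployment.environment": "prod"})
print({kind: len(records) for kind, records in result.items()})
```

The filter never needs to know which signal it is looking at, which is the practical payoff of attaching identical resource attributes to all three.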
Semantic conventions extend this consistency to operation-level attributes. The OpenTelemetry spec defines official keys like http.request.method, http.response.status_code, db.system.name, messaging.system, rpc.service. When everyone uses these standard names, queries that are portable across backends become routine.
| Correlation hook | Joins | Mechanism |
|---|---|---|
| Exemplars | Metrics → Traces | trace_id/span_id appended to histogram samples in OpenMetrics |
| Trace context in LogRecord | Logs ↔ Traces | trace_id/span_id as first-class LogRecord fields, auto-populated by SDK |
| Shared resource attributes | All three | service.name, k8s.*, host.* identical across signals from same process |
| Semantic conventions | All three | Standard attribute keys (http.*, db.*, rpc.*) used uniformly |
Figure 2.6: Correlation hub — three standardized hooks unifying the signals
graph TB
subgraph Signals
M["Metrics<br/>Prometheus TSDB"]
L["Logs<br/>Loki / Elasticsearch"]
T["Traces<br/>Tempo / Jaeger"]
end
H["Shared Resource<br/>service.name<br/>k8s.pod.name<br/>deployment.environment"]
EX["Exemplars<br/>trace_id appended<br/>to histogram samples"]
TC["LogRecord trace context<br/>top-level trace_id / span_id<br/>auto-populated by SDK"]
M -- "drill via" --- EX
EX -- "to" --- T
T -- "join via" --- TC
TC -- "to" --- L
M --- H
L --- H
T --- H
H --- Q["query: service.name=checkout<br/>returns metrics + logs + traces"]
Key Takeaway: The three signals become one investigative narrative through three standardized hooks — exemplars connect metrics to traces without cardinality blowup, OpenTelemetry’s LogRecord puts trace_id/span_id as top-level fields for trace-to-log joins, and a shared Resource model means service.name and friends mean the same thing across all signals.
Chapter Summary
This chapter dissected the three observability signals at the data-model level and showed how they interlock.
Metrics are cheap aggregates that answer “how many” and “how fast.” Prometheus offers four instrument types: counters, gauges, histograms, and summaries. The critical operational decision is histogram-vs-summary: summaries compute quantiles per-instance and cannot be aggregated across the fleet, while histogram buckets are fully aggregatable and let you compute new quantiles at query time. Native histograms (Prometheus 2.40+) further reduce series count via dynamic log-spaced buckets in a single time series. Cardinality — the product of unique label combinations — is the dominant cost, so labels like user_id or trace_id must never be used as series-defining metric labels.
Logs are discrete events with rich per-event context. The OpenTelemetry LogRecord defines a typed schema — timestamp, severity_number (normalized 1–24), severity_text, body (AnyValue), attributes, resource, trace context — that turns “lines of text” into structured telemetry events. This contrasts with Fluentd/Logstash pipelines, which transport JSON or syslog without a global data model. Despite their richness, logs alone cannot reconstruct causality across services in microservices systems.
Traces capture causality. A trace is identified by a 128-bit trace_id; each unit of work is a span with a 64-bit span_id and a parent pointer. W3C Trace Context propagates these IDs across process boundaries via the traceparent header. Span kinds (SERVER/CLIENT/PRODUCER/CONSUMER/INTERNAL) let tracing backends auto-derive service maps from the topology of observed spans.
Correlation is the payoff. Exemplars attach trace_id/span_id pointers to Prometheus histogram samples in OpenMetrics format, stored separately from normal series to avoid cardinality explosion. Trace IDs in LogRecord as top-level fields make trace-to-logs navigation a single join. A shared Resource — service.name, k8s.pod.name, deployment.environment — ties all three signals to the same process identity. Semantic conventions standardize the rest. Together these hooks turn three independent data streams into one coherent investigative narrative: spike on a graph → hover an exemplar dot → open the trace → see the slow span → click through to its logs → identify the root cause.
Key Terms
| Term | Definition |
|---|---|
| counter | A Prometheus metric instrument type representing a monotonically increasing value (or one that resets on process restart); used for rates and totals via rate(). |
| gauge | A Prometheus metric instrument type representing a value that can go up or down; used for levels and sizes like memory in use or queue depth. |
| histogram | A Prometheus distribution metric that exposes cumulative bucket counters (_bucket{le="..."}), plus _sum and _count; quantiles are computed server-side in PromQL with histogram_quantile(), and buckets are fully aggregatable across instances. |
| summary | A Prometheus distribution metric that computes quantiles client-side via a sliding-window algorithm and exposes them as {quantile="0.9"} series; quantile series cannot be aggregated across instances. |
| span | A single named unit of work within a trace, identified by a 64-bit span_id, with start time, end time, attributes, events, and a link to its parent span_id. |
| trace context | The propagation envelope (typically a W3C traceparent HTTP header) that carries trace_id, current span_id, and trace_flags across process boundaries to stitch distributed work into one trace. |
| exemplar | A small pointer (typically a trace_id and span_id) attached to a metric sample in OpenMetrics format, stored separately from normal series, enabling navigation from a metric spike to a specific trace without cardinality explosion. |
| structured logging | The practice of emitting logs as typed key-value records rather than free-form strings, enabling indexed queries; OpenTelemetry formalizes this via the LogRecord schema. |
| trace_id | A 128-bit identifier (32 hex characters in W3C Trace Context format) that uniquely identifies a single distributed request and is shared by every span and log emitted in service of that request. |
Chapter 3: Prometheus Architecture and Data Model
If Chapter 2 explained why metrics matter, this chapter is about how one of the most influential metrics systems in the world actually works. Prometheus is more than a “metrics database.” It is an opinionated bundle of design choices: a pull-based scraper, a multi-dimensional data model, a custom time-series database (TSDB), and a query engine. Together they produce a system that is operationally simple at small scale and powerful at large scale.
Understanding Prometheus internals matters for two reasons. First, almost every cloud-native observability stack you will encounter borrows Prometheus’s conventions: the OpenMetrics exposition format, label-based identity, PromQL, and remote write. Second, when something goes wrong — slow queries, OOMing pods, missing data after a crash, runaway cardinality — you cannot debug it without a mental model of the scrape loop, the TSDB block layout, and the WAL.
This chapter walks through Prometheus from the outside in: the components, the pull model, the data model, and finally how data is stored, retained, and shipped to long-term storage.
Learning Objectives
By the end of this chapter, you will be able to:
- Diagram the Prometheus server, the scrape loop, the TSDB, and Alertmanager, and describe how they interact during a normal scrape and alert cycle.
- Explain the pull-based scraping model, the rationale behind it, and the specific trade-offs it makes versus a push model — including where the Pushgateway fits in.
- Map a metric name, its label set, and its timestamp onto Prometheus’s multi-dimensional data model, and read or write a valid OpenMetrics exposition payload.
- Reason about TSDB blocks, the WAL, retention, and remote-write integrations like Thanos, Cortex, and Mimir well enough to plan capacity and debug failures.
Section 1: Server Components
A single prometheus binary contains several cooperating subsystems. Architecturally, you can think of Prometheus as a small distributed system that happens to run inside one process: a retrieval subsystem pulls data, a storage subsystem persists it, a query subsystem answers PromQL questions, and a rules subsystem fires alerts to an external Alertmanager.
Figure 3.1: Prometheus server components and external integrations
flowchart LR
SD[Service Discovery<br/>K8s, Consul, EC2, DNS] --> R[Retrieval<br/>Scrape Loop]
R -->|HTTP GET /metrics| Targets[(Scrape Targets)]
R --> TSDB[(TSDB<br/>head + WAL + blocks)]
TSDB --> API[HTTP API + Web UI]
API --> Grafana[Grafana / Clients]
TSDB --> RE[Rule Engine<br/>recording + alerting]
RE --> AM[Alertmanager<br/>external]
TSDB --> RW[Remote Write]
RW --> LTS[(Long-term Storage<br/>Thanos / Mimir / Cortex)]
Retrieval (scrape) loop
The retrieval subsystem is the heart of Prometheus’s interaction with the outside world. It maintains a set of scrape targets, each of which is essentially a URL like http://10.0.4.17:9100/metrics, a scrape interval (commonly 15s or 30s), and an optional set of labels.
For each target, Prometheus runs a small state machine on a timer:
- Resolve the target’s address (service discovery may have changed it).
- Open an HTTP GET to /metrics with a configured timeout.
- Parse the response as OpenMetrics or the legacy Prometheus text format.
- Apply relabeling rules to drop or rewrite the resulting samples.
- Append the samples to the TSDB’s head block, tagged with the scrape timestamp.
If the scrape fails (timeout, 5xx, parse error), Prometheus still writes a synthetic series called up{job="...",instance="..."} with the value 0; on success, it writes up = 1. Either way, it also records several built-in scrape_* metrics such as scrape_duration_seconds and scrape_samples_scraped. These “meta” series are how you alert on monitoring itself.
A useful analogy: think of the scrape loop as a postal carrier who walks the same route every 15 seconds. The carrier does not wait for anyone to mail a letter; they pick up whatever is in the mailbox at that moment. If the mailbox is missing — the house has been demolished — the carrier files a report (up = 0) and moves on.
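The scrape steps above can be sketched with nothing but the Go standard library. This is a deliberately simplified illustration, not Prometheus's actual retrieval code: scrapeOnce and newFakeTarget are hypothetical names, the "parser" only counts non-comment lines, and an in-process test server stands in for a real target.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// scrapeOnce performs one simplified scrape: GET the URL with a timeout,
// count the non-comment sample lines, and report up=1 on success or up=0
// on any failure (connection error, timeout, non-200 status, read error).
func scrapeOnce(url string, timeout time.Duration) (up int, samples int) {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return 0, 0
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, 0
	}
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE/comment lines
		}
		samples++
	}
	if sc.Err() != nil {
		return 0, 0
	}
	return 1, samples
}

// newFakeTarget starts an in-process HTTP server standing in for a real target.
func newFakeTarget(status int, body string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
		fmt.Fprint(w, body)
	}))
}

func main() {
	healthy := newFakeTarget(http.StatusOK, "# TYPE demo_total counter\ndemo_total 3\ndemo_other 1\n")
	defer healthy.Close()
	broken := newFakeTarget(http.StatusInternalServerError, "boom")
	defer broken.Close()

	for _, url := range []string{healthy.URL, broken.URL} {
		up, n := scrapeOnce(url, 2*time.Second)
		fmt.Printf("up=%d scrape_samples_scraped=%d\n", up, n)
	}
}
```

The point of the sketch is the shape of the outcome: every scrape attempt produces an up value, so a failed scrape is itself a data point rather than a gap.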
TSDB on-disk format and blocks
The TSDB (time-series database) is where samples land after the scrape loop. Prometheus does not use Postgres, RocksDB, or any general-purpose database for time series — it ships its own purpose-built engine designed for one workload: many millions of monotonically advancing numeric series.
The on-disk layout is straightforward. Inside the data directory you will find:
- A wal/ directory containing append-only write-ahead-log segments (00000001, 00000002, …) and possibly checkpoint.N/ directories.
- One directory per block, each named with a ULID (e.g., 01J1H1Q2K3V3Y2...). Each block contains chunks/, index, meta.json, and possibly tombstones [Source: https://natesnewsletter.substack.com/p/context-windows-are-a-lie-the-myth].
Conceptually there are two regions:
- The head block holds the most recent ~2 hours of data, lives in memory, and is protected by the WAL.
- Older data lives in immutable blocks on disk; each block covers a contiguous time range and is never modified after it is written.
We will return to blocks, the WAL, and compaction in section 4. The important takeaway here is that the storage engine is not a free-for-all key-value store — it is shaped around time-bounded immutable units, which is what makes 90-day retention with sub-second queries feasible on a single node.
HTTP API and web UI
Everything you can do to Prometheus (query it, list targets, see the configuration, inspect rules and active alerts) goes through its HTTP API. Notable endpoints:
| Endpoint | Purpose |
|---|---|
| /api/v1/query | Instant PromQL query |
| /api/v1/query_range | Range PromQL query (what Grafana uses) |
| /api/v1/series | Series matching label selectors |
| /api/v1/labels, /api/v1/label/<n>/values | Discover label names and values |
| /api/v1/targets | Active and dropped scrape targets |
| /api/v1/rules, /api/v1/alerts | Configured rules and active alerts |
| /-/reload, /-/healthy, /-/ready | Lifecycle endpoints |
| /metrics | Prometheus’s own metrics (Prometheus scrapes itself) |
The web UI bundled with the server is intentionally minimal: a query box, a graph view, a targets page, and an alerts page. It exists so an operator can debug from a fresh laptop with no Grafana. For day-to-day dashboards, you point Grafana (or another visualization tool) at the same HTTP API.
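To make the API concrete, here is a minimal sketch of building an instant-query URL and decoding the JSON envelope that /api/v1/query returns for a vector result. The helper names (instantQueryURL, decodeInstant), the localhost address, and the hand-written sample body are assumptions for illustration; a real client would also handle error fields, warnings, and other result types.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// instantResult is one element of an instant-vector response.
type instantResult struct {
	Metric map[string]string `json:"metric"`
	Value  [2]interface{}    `json:"value"` // [unix seconds, "sample value as string"]
}

// queryResponse mirrors the JSON envelope of /api/v1/query.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string          `json:"resultType"`
		Result     []instantResult `json:"result"`
	} `json:"data"`
}

// instantQueryURL builds the request URL for an instant query.
func instantQueryURL(base, promql string) string {
	q := url.Values{}
	q.Set("query", promql)
	return base + "/api/v1/query?" + q.Encode()
}

// decodeInstant parses a response body and returns the series it contains.
func decodeInstant(body []byte) ([]instantResult, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("query failed with status %q", resp.Status)
	}
	return resp.Data.Result, nil
}

func main() {
	fmt.Println(instantQueryURL("http://localhost:9090", "up"))

	// Hand-written sample body in the documented response shape.
	body := []byte(`{"status":"success","data":{"resultType":"vector","result":[
	  {"metric":{"__name__":"up","job":"api","instance":"10.0.4.17:9100"},"value":[1717520400,"1"]}]}}`)
	results, err := decodeInstant(body)
	if err != nil {
		panic(err)
	}
	for _, r := range results {
		fmt.Printf("%s{job=%q} = %v\n", r.Metric["__name__"], r.Metric["job"], r.Value[1])
	}
}
```

Note that sample values arrive as strings inside the JSON array, which is how the API preserves values such as NaN and +Inf that JSON numbers cannot express.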
Service discovery integrations
Static configuration — listing IPs in a YAML file — falls over the moment your environment becomes dynamic. Prometheus solves this with service discovery (SD), where the list of scrape targets is generated from an external source of truth.
Built-in SD integrations include Kubernetes, Consul, EC2, GCE, Azure, DNS SRV records, file-based SD, and many more. A Kubernetes SD configuration, for example, asks the API server for all pods matching a selector, then turns each pod into a target with labels like __meta_kubernetes_pod_name, __meta_kubernetes_namespace, and __meta_kubernetes_pod_label_app. Relabeling rules (covered in section 3) then promote these “underscore” labels into permanent target labels or drop them entirely.
Think of SD as the address book: Prometheus’s scrape loop is the postal carrier, but SD is what tells the carrier which houses exist this morning.
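As a small illustration of the "address book" idea, the sketch below expands a file-based SD document (a JSON list of target groups, each with targets and shared labels) into one label set per target. The function name expandTargets and the sample addresses are hypothetical, and real service discovery does considerably more (watching for changes, attaching __meta_* labels, and so on).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// targetGroup matches one entry of a file-based SD JSON document.
type targetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

// expandTargets turns target groups into one label set per target. The
// address is carried in the special __address__ label, as Prometheus does
// before relabeling.
func expandTargets(doc []byte) ([]map[string]string, error) {
	var groups []targetGroup
	if err := json.Unmarshal(doc, &groups); err != nil {
		return nil, err
	}
	var out []map[string]string
	for _, g := range groups {
		for _, addr := range g.Targets {
			labels := map[string]string{"__address__": addr}
			for k, v := range g.Labels {
				labels[k] = v
			}
			out = append(out, labels)
		}
	}
	return out, nil
}

func main() {
	doc := []byte(`[
	  {"targets": ["10.0.4.17:9100", "10.0.4.18:9100"], "labels": {"job": "node", "env": "prod"}},
	  {"targets": ["10.0.5.2:8080"], "labels": {"job": "api", "env": "prod"}}
	]`)
	targets, err := expandTargets(doc)
	if err != nil {
		panic(err)
	}
	for _, t := range targets {
		fmt.Printf("%s job=%s env=%s\n", t["__address__"], t["job"], t["env"])
	}
}
```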
Key Takeaway: Prometheus is a small distributed system in a single binary — a scrape loop driven by service discovery, a purpose-built TSDB protected by a WAL, an HTTP query API, and a rules engine that talks to an external Alertmanager. Operating Prometheus well means understanding each of those subsystems independently.
Section 2: The Pull Model
Of all the design decisions in Prometheus, the choice to pull metrics rather than have services push them is the most contentious — and the most consequential. Understanding why pull was chosen helps you reason about when push is genuinely better and when it just feels easier because it is what you already know.
Why pull was chosen over push
In a pull model, targets expose a /metrics HTTP endpoint and Prometheus periodically calls it. In a push model, targets connect to a central collector and stream samples as they are produced.
The pull design buys Prometheus several properties that are surprisingly hard to replicate in push systems [Source: https://learn.microsoft.com/en-us/azure/foundry/openai/concepts/prompt-engineering]:
- Liveness is implicit. If Prometheus cannot reach the target, the target’s up series becomes 0 and existing series are marked stale. There is no “this metric is from a dead pod that hasn’t told us yet” ambiguity.
- Prometheus controls load. The scrape interval, timeout, and concurrency are set on the server side. A misbehaving target cannot flood Prometheus with samples.
- Operators can simulate Prometheus with curl. Anyone with shell access to the target can call /metrics and see exactly what Prometheus would see. There is no special protocol to inspect.
- No client-side buffering or retries. The target does not need to know about Prometheus’s availability or implement a queue. The carrier shows up; if no one is home, no one is home.
- Horizontal scaling by sharding targets. Splitting load across multiple Prometheus servers is “give server A half the targets, server B the other half.”
The trade-offs are real, however. Pull is awkward when:
- Targets live behind NAT or firewalls and Prometheus cannot reach them.
- Targets are very short-lived (a CI job that runs for 3 seconds with a 15s scrape interval).
- The “target” is conceptually a thing that emits events, not a thing with a current state (a webhook delivery, a batch ingest event).
Pull is therefore not universally better — it is better for the steady-state of long-lived services in trusted networks, which happens to describe most of what runs inside a Kubernetes cluster.
Figure 3.2: Scrape loop sequence (pull model with implicit liveness)
sequenceDiagram
participant SD as Service Discovery
participant P as Prometheus<br/>Scrape Loop
participant T as Target /metrics
participant DB as TSDB Head
SD->>P: target list (IPs, labels)
loop every 15s
P->>T: HTTP GET /metrics
alt target healthy
T-->>P: 200 OK + OpenMetrics body
P->>P: parse + relabel samples
P->>DB: append samples + "up=1"
else target unreachable
T--xP: timeout / 5xx / parse error
P->>DB: append "up=0" (stale marker)
end
end
The Pushgateway for short-lived jobs
For the cases pull cannot reach naturally — most importantly short-lived batch jobs — Prometheus offers a companion service called the Pushgateway. It is a small HTTP server with a deceptively simple job: accept pushed metrics, store the current value in memory, and expose them on its own /metrics endpoint so Prometheus can scrape the Pushgateway like any other target.
Used correctly, the pattern looks like this:
- A nightly batch job starts, runs for two minutes, then finishes.
- Right before exiting, the job pushes nightly_import_last_run_status (0 for success), nightly_import_last_run_duration_seconds, and nightly_import_last_run_timestamp_seconds to the Pushgateway, grouped by job="nightly_import".
- Prometheus continues to scrape the Pushgateway every 15 seconds, so these “last run” metrics are visible to PromQL and Alertmanager whether or not the job is currently running.
A simple Go example illustrates the shape of this pattern:
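package main

import (
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

var (
	jobDuration  = prometheus.NewGauge(prometheus.GaugeOpts{Name: "nightly_import_last_run_duration_seconds", Help: "Duration of the last run in seconds."})
	jobStatus    = prometheus.NewGauge(prometheus.GaugeOpts{Name: "nightly_import_last_run_status", Help: "Status of the last run (0 = success)."})
	jobTimestamp = prometheus.NewGauge(prometheus.GaugeOpts{Name: "nightly_import_last_run_timestamp_seconds", Help: "Unix time of the last run."})
)

// runImport is a placeholder for the real batch work.
func runImport() error { return nil }

func main() {
	start := time.Now()
	err := runImport()
	jobDuration.Set(time.Since(start).Seconds())
	jobTimestamp.Set(float64(time.Now().Unix()))
	if err != nil {
		jobStatus.Set(1) // non-zero signals a failed run
	} else {
		jobStatus.Set(0)
	}
	// Push replaces every metric previously pushed under job="nightly_import".
	pusher := push.New("http://pushgateway:9091", "nightly_import").
		Collector(jobDuration).Collector(jobStatus).Collector(jobTimestamp)
	if pushErr := pusher.Push(); pushErr != nil {
		log.Printf("pushing metrics to Pushgateway failed: %v", pushErr)
	}
}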
The anti-patterns are at least as important to know [Source: https://blog.codinghorror.com/the-problem-with-logging/]:
- Do not use Pushgateway for long-running services. There is no built-in staleness; if the service dies, its last pushed values sit in Pushgateway forever and dashboards look healthy.
- Do not push per-instance labels for ephemeral pods. Each pod UUID creates a new label combination that is never cleaned up.
- Do not treat Pushgateway as an event stream. Each push replaces the current value; it is a cache, not a queue.
- Do not use it for SLO metrics like latency or error rate. SLOs require continuous, high-resolution series that reflect live behavior — Pushgateway hides outages behind stale cached values.
For most modern push needs, the OpenTelemetry Collector is a better answer (chapter 5 introduces it and chapter 10 covers it in depth): it accepts OTLP push from applications, handles retries and batching, and can expose a Prometheus-compatible scrape endpoint or remote-write to a backend.
Federation and hierarchical scraping
What if you have too many targets for one Prometheus, or you want a “global” view across regions without one giant server scraping the world? Prometheus supports federation: one Prometheus scrapes another’s /federate endpoint and pulls a subset of its series.
A typical hierarchy looks like:
| Tier | Role | Retention | Scrape sources |
|---|---|---|---|
| Leaf | Per-cluster, per-region; scrape everything | Short (1-7 days) | Pods, nodes, exporters |
| Aggregator | Pull recording-rule outputs from leaves | Medium (15-30 days) | Leaf /federate endpoints |
| Global | Cross-region dashboards and alerting | Long (via remote write to Thanos/Mimir) | Aggregator /federate endpoints |
Federation is best used to pull aggregated series (the output of recording rules) rather than raw metrics — pulling raw, high-cardinality data through federation will bottleneck on a single HTTP scrape. For “ship everything to long-term storage,” remote write (section 4) is almost always a better fit.
Key Takeaway: Pull gives Prometheus implicit liveness, server-side load control, and a debuggable wire protocol; push (via Pushgateway or OTel Collector) is the escape hatch for short-lived jobs and firewalled targets. Use federation for aggregated views, remote write for raw data shipping.
Section 3: Data Model and Exposition Format
Prometheus’s data model is small enough to fit on an index card, yet it underpins every PromQL query you will ever write. Internalize it once and the rest of the system stops feeling magical.
Metric name + label set = unique time series
In Prometheus, a time series is uniquely identified by its metric name plus its set of labels. Every other property — the metric type, the help text, the unit — is metadata; the identity is the name and the labels.
Consider this single sample:
http_requests_total{job="api", method="GET", status="200", path="/users"} 1873 1717520400000
That has three parts:
- Metric name: http_requests_total. Conventionally _total suffixes counters; _seconds, _bytes, _ratio suffixes give units.
- Label set: {job="api", method="GET", status="200", path="/users"}. Each label is a (name, value) pair; values are arbitrary UTF-8 strings.
- Sample: a numeric value (1873) and a Unix-millisecond timestamp (1717520400000).
Change any label value and you have a different series. http_requests_total{...,method="GET"} and http_requests_total{...,method="POST"} are entirely separate time series, stored separately, indexed separately, and counted separately against cardinality budgets.
This is the multi-dimensional data model: instead of inventing one metric per dimension (http_requests_GET_200_total, http_requests_POST_500_total, …) you have one metric name with multiple label dimensions, and PromQL lets you slice along those dimensions at query time:
sum by (status) (rate(http_requests_total{job="api"}[5m]))
A practical analogy: a metric name is a spreadsheet, labels are the columns, label values are the cell contents, and each unique row is a time series. Adding a column with high cardinality (say, user_id) is like adding a column to a spreadsheet that has one row per user — your data grows linearly in the number of distinct values, and so does Prometheus’s memory.
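The identity rule can be sketched in a few lines. The code below is an illustration only: seriesKey and worstCaseSeries are hypothetical helpers, and the string key is a stand-in for whatever internal representation a real TSDB uses. It shows that label order does not matter, that any changed value yields a new series, and that the worst-case series count is the product of each label's distinct values.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey builds a canonical identity string from a metric name and its
// labels. Labels are sorted so that insertion order never affects identity.
func seriesKey(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	b.WriteString(name)
	b.WriteByte('{')
	for i, k := range keys {
		if i > 0 {
			b.WriteByte(',')
		}
		fmt.Fprintf(&b, "%s=%q", k, labels[k])
	}
	b.WriteByte('}')
	return b.String()
}

// worstCaseSeries multiplies the number of distinct values per label,
// giving an upper bound on the series one metric name can produce.
func worstCaseSeries(distinctValues map[string]int) int {
	total := 1
	for _, n := range distinctValues {
		total *= n
	}
	return total
}

func main() {
	get := seriesKey("http_requests_total", map[string]string{"job": "api", "method": "GET"})
	post := seriesKey("http_requests_total", map[string]string{"job": "api", "method": "POST"})
	fmt.Println(get)
	fmt.Println(post)
	fmt.Println("same series:", get == post)

	// Adding a user_id label with 10,000 values multiplies the bound by 10,000.
	fmt.Println(worstCaseSeries(map[string]int{"method": 4, "status": 5, "path": 20}))
	fmt.Println(worstCaseSeries(map[string]int{"method": 4, "status": 5, "path": 20, "user_id": 10000}))
}
```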
OpenMetrics text exposition format
Targets expose metrics in a deliberately simple line-based format. The original Prometheus text format was standardized as OpenMetrics in 2020 and is now an IETF-tracked specification. A typical /metrics payload looks like:
# HELP http_requests_total Total HTTP requests received.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1873
http_requests_total{method="GET",status="500"} 4
http_requests_total{method="POST",status="200"} 219
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.2848256e+07
# HELP http_request_duration_seconds HTTP request latency.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 1500
http_request_duration_seconds_bucket{le="0.5"} 1860
http_request_duration_seconds_bucket{le="1.0"} 1872
http_request_duration_seconds_bucket{le="+Inf"} 1873
http_request_duration_seconds_sum 92.7
http_request_duration_seconds_count 1873
# EOF
Key rules:
- Lines starting with # are comments, except for # HELP <metric> <text> and # TYPE <metric> <counter|gauge|histogram|summary>, which are semantically meaningful.
- Each sample line is metric_name{labels} value [timestamp]. The timestamp is optional; if omitted, Prometheus uses the scrape time.
- Histograms expand into multiple lines: one bucket per le boundary, plus _sum and _count. Each bucket is its own series.
- OpenMetrics ends with a literal # EOF line.
The format’s simplicity is the point: anyone can write an exporter in a few dozen lines of code in any language, and anyone can debug one with curl.
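To back up the "few dozen lines" claim, here is a minimal exporter sketch that uses only the Go standard library. It is an assumption-laden toy: demo_requests_total, renderMetrics, and metricsHandler are invented names, the counter value is passed in rather than tracked, and the output is the legacy Prometheus text format (so there is no # EOF line). Production exporters would normally use a client library instead.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// renderMetrics formats one counter in the Prometheus text exposition format.
func renderMetrics(requests int64) string {
	var b strings.Builder
	b.WriteString("# HELP demo_requests_total Total requests handled.\n")
	b.WriteString("# TYPE demo_requests_total counter\n")
	fmt.Fprintf(&b, "demo_requests_total{method=\"GET\"} %d\n", requests)
	return b.String()
}

// metricsHandler serves the rendered payload with the text-format content type.
func metricsHandler(requests int64) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
		io.WriteString(w, renderMetrics(requests))
	})
}

func main() {
	// An in-process server stands in for a real listener on :8080.
	srv := httptest.NewServer(metricsHandler(1873))
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(body))
}
```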
Honor labels, target labels, and relabeling
When Prometheus scrapes a target, it sees two sets of labels: the labels the target exposed in its /metrics output, and target labels that Prometheus attaches automatically (most importantly job and instance, plus anything service discovery contributed).
By default, if the target’s exposed labels conflict with target labels, the target labels win and the conflicting exposed label is renamed with an exported_ prefix (for example, exported_job). The honor_labels: true setting in a scrape config flips this: the target’s labels take precedence. This matters mostly for the Pushgateway (where the pushing job has set its own job label intentionally) and for federation (where you want to preserve the original job label from the upstream Prometheus).
Relabeling is the small declarative DSL Prometheus uses to transform labels before storing samples. There are two flavors:
- relabel_configs: run before the scrape, against target metadata labels (the __meta_* labels from service discovery). Used to select which targets to scrape and to rewrite their labels.
- metric_relabel_configs: run after the scrape, against each sample’s labels. Used to drop high-cardinality samples or rename labels.
A worked example: Kubernetes SD discovers thousands of pods. You want to scrape only pods with the annotation prometheus.io/scrape=true, set their job label from prometheus.io/job, and drop a noisy histogram bucket:
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that opt in.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Set the job label from the pod's annotation, but only when the
      # annotation is present; otherwise an empty value would remove the label.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_job]
        regex: "(.+)"
        target_label: job
      # Promote namespace and pod into permanent labels.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
    metric_relabel_configs:
      # Drop a known cardinality-bomb metric.
      - source_labels: [__name__]
        regex: "go_gc_pauses_seconds_bucket"
        action: drop
Figure 3.3: Relabeling pipeline (target metadata to stored series)
flowchart LR
SD[Service Discovery] -->|"__meta_kubernetes_*<br/>__meta_consul_*"| RC[relabel_configs<br/>keep / drop / rewrite]
RC --> TL[Final target labels<br/>"job, instance, namespace, pod"]
TL --> SCR[Scrape /metrics]
SCR --> RAW[Raw samples<br/>"name + labels + value"]
RAW --> MRC[metric_relabel_configs<br/>drop high-cardinality]
MRC --> TSDB[(TSDB head block)]
Relabeling is the place where most operational bugs hide. Two rules of thumb: (1) always test relabeling against a known target with --log.level=debug, and (2) drop unbounded labels (user IDs, request IDs, HTTP paths with query strings) before they enter the TSDB — once they are in, they cost memory until retention removes them.
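One way to build intuition for relabeling is to run a cut-down version of it by hand. The sketch below is a simplified model, not Prometheus's implementation: the rule type and applyRules function are hypothetical, only keep, drop, and replace are supported, and source label values are joined with ";" and matched against a fully anchored regex, mirroring the documented behavior.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// rule is a cut-down relabel rule supporting keep, drop, and replace.
type rule struct {
	SourceLabels []string
	Regex        string // empty means "(.*)"; always fully anchored
	Action       string // "keep", "drop", or "replace"
	TargetLabel  string // used by replace
	Replacement  string // used by replace; empty means "$1"
}

// applyRules runs rules in order against a copy of labels. The boolean is
// false when a keep or drop rule discards the target or sample.
func applyRules(labels map[string]string, rules []rule) (map[string]string, bool) {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		out[k] = v
	}
	for _, r := range rules {
		values := make([]string, len(r.SourceLabels))
		for i, name := range r.SourceLabels {
			values[i] = out[name]
		}
		joined := strings.Join(values, ";")

		pattern := r.Regex
		if pattern == "" {
			pattern = "(.*)"
		}
		re := regexp.MustCompile("^(?:" + pattern + ")$")

		switch r.Action {
		case "keep":
			if !re.MatchString(joined) {
				return nil, false
			}
		case "drop":
			if re.MatchString(joined) {
				return nil, false
			}
		case "replace":
			match := re.FindStringSubmatchIndex(joined)
			if match == nil {
				continue // no match: leave labels untouched
			}
			replacement := r.Replacement
			if replacement == "" {
				replacement = "$1"
			}
			result := string(re.ExpandString(nil, replacement, joined, match))
			if result == "" {
				delete(out, r.TargetLabel) // an empty result removes the label
			} else {
				out[r.TargetLabel] = result
			}
		}
	}
	return out, true
}

func main() {
	rules := []rule{
		{SourceLabels: []string{"__meta_kubernetes_pod_annotation_prometheus_io_scrape"}, Regex: "true", Action: "keep"},
		{SourceLabels: []string{"__meta_kubernetes_pod_annotation_prometheus_io_job"}, Regex: "(.+)", Action: "replace", TargetLabel: "job"},
		{SourceLabels: []string{"__meta_kubernetes_namespace"}, Action: "replace", TargetLabel: "namespace"},
	}
	pod := map[string]string{
		"job": "kubernetes-pods",
		"__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
		"__meta_kubernetes_pod_annotation_prometheus_io_job":    "billing",
		"__meta_kubernetes_namespace":                           "payments",
	}
	out, kept := applyRules(pod, rules)
	fmt.Println(kept, out["job"], out["namespace"])
}
```

Feeding a handful of representative label sets through a model like this before deploying a config change is a cheap way to catch rules that silently drop targets or blank out labels.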
Key Takeaway: A time series is uniquely identified by metric_name + label_set; labels are how you slice data at query time but also how cardinality explodes. The OpenMetrics text format is intentionally readable, and relabeling is your control plane for shaping labels before and after each scrape.
Section 4: Storage, Retention, and Remote Write
Now we follow a sample from the moment Prometheus accepts it from the scrape loop, through the head block and WAL, into a 2-hour block on disk, through compaction, and finally out to a long-term storage system over remote write.
Local TSDB block compaction and WAL
When the scrape loop produces a sample, the TSDB does two things, in order [Source: https://natesnewsletter.substack.com/p/context-windows-are-a-lie-the-myth]:
- Append a record to the WAL. The sample (with its series ID, timestamp, and value) is encoded as a CRC-checksummed record and written to the current WAL segment file (typically ~128 MB segments named 00000001, 00000002, …).
- Update the in-memory head block. The TSDB looks up (or creates) an in-memory series for the label set and appends the sample to that series’ current chunk.
The WAL ensures durability: if Prometheus crashes after step 1 but before the head block is persisted as a finished block, the sample can be recovered on restart by replaying the WAL.
Every ~2 hours, the head’s accumulated samples are cut into a new immutable on-disk block. Each block directory looks like:
01J1H1Q2K3V3Y2.../
  meta.json       # ULID, minTime, maxTime, compaction level, sources, stats
  index           # symbol table + postings lists (label=value -> series IDs)
  chunks/
    000001        # concatenated XOR-compressed chunks (mmapped at query time)
    000002
  tombstones      # optional, deletion intervals
The index file is the crown jewel: it interns every label name and value into a symbol table, assigns each series an integer ID, and stores postings lists — sorted lists of series IDs for each label=value pair. When you run sum by (status) (rate(http_requests_total{job="api"}[5m])), the index lets Prometheus find every series matching __name__="http_requests_total" and job="api" by intersecting two postings lists in milliseconds, no table scan required.
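The postings-list idea is easy to sketch. The code below is an illustrative model rather than the on-disk index format: the postings type, the "label=value" string keys, and the series IDs are all made up for the example. What it does show faithfully is the core trick, a linear-time intersection of sorted ID lists.

```go
package main

import "fmt"

// postings maps a "label=value" pair to a sorted list of series IDs.
type postings map[string][]uint64

// intersect merges two sorted ID lists, keeping only IDs present in both.
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}

// lookup returns the series IDs matching every given label=value pair.
func (p postings) lookup(pairs ...string) []uint64 {
	if len(pairs) == 0 {
		return nil
	}
	result := p[pairs[0]]
	for _, pair := range pairs[1:] {
		result = intersect(result, p[pair])
	}
	return result
}

func main() {
	idx := postings{
		"__name__=http_requests_total": {1, 2, 3, 7, 9},
		"job=api":                      {2, 3, 4, 9, 12},
		"status=500":                   {3, 12},
	}
	fmt.Println(idx.lookup("__name__=http_requests_total", "job=api"))
	fmt.Println(idx.lookup("__name__=http_requests_total", "job=api", "status=500"))
}
```

Because both lists are sorted, the intersection never revisits an element, which is why selector matching stays fast even when individual postings lists are long.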
Chunks use Gorilla-style compression (delta-of-delta encoding for timestamps, XOR encoding for float values), which is brutally efficient: typical compressed sizes are ~1-2 bytes per sample. Chunks are read via memory-mapped I/O rather than copied into the heap, which is why Prometheus’s working set often shows modest RSS but a large OS page cache.
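A short sketch shows why this style of encoding works so well on scraped data. It computes only the intermediate values (delta-of-delta for timestamps, XOR of float bit patterns for values) and does not perform the variable-length bit packing that a real chunk encoder applies afterwards; deltaOfDelta and xorBits are hypothetical helper names.

```go
package main

import (
	"fmt"
	"math"
)

// deltaOfDelta returns the first timestamp, the first delta, and then the
// change in delta for every later timestamp.
func deltaOfDelta(ts []int64) []int64 {
	if len(ts) < 2 {
		return append([]int64(nil), ts...)
	}
	prevDelta := ts[1] - ts[0]
	out := []int64{ts[0], prevDelta}
	for i := 2; i < len(ts); i++ {
		delta := ts[i] - ts[i-1]
		out = append(out, delta-prevDelta)
		prevDelta = delta
	}
	return out
}

// xorBits XORs the IEEE 754 bit patterns of two consecutive sample values.
func xorBits(prev, cur float64) uint64 {
	return math.Float64bits(prev) ^ math.Float64bits(cur)
}

func main() {
	// Four scrapes 15s apart; the last one arrives 1 ms late.
	ts := []int64{1717520400000, 1717520415000, 1717520430000, 1717520445001}
	fmt.Println(deltaOfDelta(ts))

	// An unchanged value XORs to all zero bits; a small change flips few bits.
	fmt.Printf("%064b\n", xorBits(1873, 1873))
	fmt.Printf("%064b\n", xorBits(1873, 1874))
}
```

With a steady scrape interval, almost every delta-of-delta is zero or close to it, and slowly changing values XOR to mostly zero bits, so both streams can be stored in very few bits per sample.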
Background compaction periodically merges adjacent 2-hour blocks into larger ones; by default each level spans three times the previous one (6h, then 18h, and so on), capped at 10% of the retention time or 31 days, whichever is smaller [Source: https://www.youtube.com/watch?v=MqqKT6etxpQ]. Larger blocks mean fewer index files to open per query and better compression. Retention is enforced at the block level: when --storage.tsdb.retention.time=30d is set, any block whose maxTime is older than 30 days ago is deleted in one shot.
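Block-level retention, as described above, reduces to a comparison against each block's maxTime. The sketch below follows that description literally; blockMeta, expiredBlocks, and the block names are hypothetical, and the real implementation has additional details (size-based retention, for example) that are not modeled here.

```go
package main

import "fmt"

// blockMeta holds the fields of meta.json that matter for retention.
type blockMeta struct {
	ULID    string
	MinTime int64 // milliseconds since the Unix epoch
	MaxTime int64
}

// expiredBlocks returns the IDs of blocks that lie entirely before the
// retention cutoff. Blocks are removed whole; nothing is trimmed.
func expiredBlocks(blocks []blockMeta, nowMs, retentionMs int64) []string {
	cutoff := nowMs - retentionMs
	var out []string
	for _, b := range blocks {
		if b.MaxTime < cutoff {
			out = append(out, b.ULID)
		}
	}
	return out
}

func main() {
	const dayMs = int64(24 * 60 * 60 * 1000)
	now := 100 * dayMs
	blocks := []blockMeta{
		{ULID: "block-old", MinTime: 60 * dayMs, MaxTime: 69 * dayMs},
		{ULID: "block-edge", MinTime: 69 * dayMs, MaxTime: 71 * dayMs},
		{ULID: "block-new", MinTime: 98 * dayMs, MaxTime: 100 * dayMs},
	}
	// With 30 days of retention the cutoff is day 70.
	fmt.Println(expiredBlocks(blocks, now, 30*dayMs))
}
```

One consequence worth noticing: a block that straddles the cutoff (block-edge here) survives in full, so effective retention is always somewhat longer than the configured value.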
To keep WAL replay fast after a crash, Prometheus periodically writes a checkpoint (wal/checkpoint.N/) that condenses the older WAL segments, keeping only the records still relevant to the current head. After a successful checkpoint, all WAL segments numbered up to and including N can be deleted. On restart, recovery loads the latest checkpoint and replays only segments after it, stopping at the first record with an invalid CRC (which represents a torn write at the moment of crash).
Figure 3.4: TSDB block lifecycle from scrape to retention
stateDiagram-v2
[*] --> Scrape: sample produced
Scrape --> WAL: append CRC record
WAL --> Head: update in-memory head chunk
Head --> Head: accumulate ~2h of samples
Head --> Block2h: cut immutable block<br/>(meta.json + index + chunks)
Block2h --> Block6h: compact adjacent blocks
Block6h --> Block18h: compact further
Block18h --> BlockMulti: compact multi-day
BlockMulti --> Deleted: retention horizon passed
Deleted --> [*]
Head --> Recovery: crash
Recovery --> Head: replay WAL from checkpoint.N
Worked example — surviving a crash. Suppose Prometheus has wal/checkpoint.10/ plus segments 00000011, 00000012, 00000013, and crashes while writing to segment 13. On startup:
- All existing immutable blocks load normally; they need no WAL [Source: http://susandumais.com/CHI2012-12-tailanswers-chi2012.pdf].
- The head is rebuilt from checkpoint.10.
- Segments 11, 12, and 13 are replayed; recovery stops at the first record in segment 13 whose CRC fails.
- Any samples that did not complete their write are lost, but the head is consistent and queries work immediately.
If your Prometheus startup is slow, it almost always means either that checkpoints are not happening (frequent restarts, repeated OOM kills) or that the WAL has grown because of an unusually long outage. The fix is to keep Prometheus healthy long enough to checkpoint.
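The "stop at the first bad CRC" rule can be demonstrated with a toy log. The framing used here (4-byte length, 4-byte CRC32, payload) is an assumption made for the example and is not Prometheus's actual WAL record format; appendRecord, buildLog, and replay are hypothetical names. The checksum uses the Castagnoli polynomial from the standard library.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// appendRecord writes [4-byte length][4-byte CRC32][payload] to the log.
func appendRecord(log *bytes.Buffer, payload []byte) {
	var hdr [8]byte
	binary.BigEndian.PutUint32(hdr[0:4], uint32(len(payload)))
	binary.BigEndian.PutUint32(hdr[4:8], crc32.Checksum(payload, castagnoli))
	log.Write(hdr[:])
	log.Write(payload)
}

// buildLog encodes the given payloads as consecutive records.
func buildLog(payloads ...string) []byte {
	var log bytes.Buffer
	for _, p := range payloads {
		appendRecord(&log, []byte(p))
	}
	return log.Bytes()
}

// replay returns every intact record, stopping at the first record that is
// truncated or whose checksum does not match (a torn write).
func replay(data []byte) []string {
	var out []string
	for len(data) >= 8 {
		n := int(binary.BigEndian.Uint32(data[0:4]))
		sum := binary.BigEndian.Uint32(data[4:8])
		if len(data) < 8+n {
			break // record was cut off mid-write
		}
		payload := data[8 : 8+n]
		if crc32.Checksum(payload, castagnoli) != sum {
			break // corrupted record: nothing after it is trusted
		}
		out = append(out, string(payload))
		data = data[8+n:]
	}
	return out
}

func main() {
	data := buildLog("sample-1", "sample-2", "sample-3")
	torn := data[:len(data)-3] // simulate a crash during the last write

	fmt.Println(replay(data))
	fmt.Println(replay(torn))
}
```

Replaying the truncated log recovers the first two records and silently discards the partial third, which is the same trade-off described in the worked example: a little recent data is lost, but what remains is consistent.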
Remote write protocol
Local TSDB is intentionally short-term storage — a few weeks at most. For longer retention, multi-cluster aggregation, or “metrics as a service,” Prometheus supports remote write: an HTTP-based protocol for shipping every sample Prometheus ingests to a remote backend.
The protocol is conceptually simple. Prometheus batches samples into Snappy-compressed protobuf messages and POSTs them to a configured URL:
remote_write:
  - url: https://mimir.example.com/api/v1/push
    basic_auth:
      username: tenant-42
      password_file: /etc/prometheus/mimir-token
    queue_config:
      capacity: 10000
      max_shards: 30
      min_backoff: 30ms
      max_backoff: 5s
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop   # don't ship Go runtime metrics
Important properties:
- Remote write is lossy under sustained backpressure: each queue reads from the WAL and retries failed sends, so short outages are absorbed, but if the remote endpoint stays down or too slow until the WAL segments holding unsent samples are truncated, those samples are dropped. The prometheus_remote_storage_* metrics tell you what is happening.
- Each sample is sent once; remote write is not a replay protocol. If you start it on day 30, the backend gets samples from day 30 onward, not the previous 30 days from local TSDB.
- write_relabel_configs lets you drop or rewrite labels on the way out, which is useful for sampling expensive metrics or stripping cardinality before it leaves your network.
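The min_backoff and max_backoff settings in the configuration above bound how long a shard waits between retries. A minimal sketch of that schedule, assuming the delay doubles after each failed attempt until it reaches the cap (backoffSchedule is a hypothetical helper, and real senders may also honor server hints such as Retry-After):

```go
package main

import (
	"fmt"
	"time"
)

// backoffSchedule lists the wait before each retry: start at minBackoff,
// double after every failure, and never exceed maxBackoff.
func backoffSchedule(minBackoff, maxBackoff time.Duration, retries int) []time.Duration {
	out := make([]time.Duration, 0, retries)
	cur := minBackoff
	for i := 0; i < retries; i++ {
		out = append(out, cur)
		cur *= 2
		if cur > maxBackoff {
			cur = maxBackoff
		}
	}
	return out
}

func main() {
	// Values taken from the queue_config example: 30ms up to 5s.
	for i, d := range backoffSchedule(30*time.Millisecond, 5*time.Second, 10) {
		fmt.Printf("retry %d: wait %v\n", i+1, d)
	}
}
```

With these values the delay reaches the 5s cap on the ninth retry, so a backend that is down for minutes is polled roughly every five seconds per shard rather than hammered continuously.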
Long-term storage with Thanos, Cortex, and Mimir
Three open-source projects dominate the “Prometheus at scale” space. Each makes different architectural trade-offs.
| Aspect | Thanos | Cortex | Grafana Mimir |
|---|---|---|---|
| Ingest model | Sidecar uploads TSDB blocks; optional Receive for remote write | Remote write only (distributor -> ingester -> blocks) | Remote write only; simplified Cortex |
| Multi-tenancy | Label-based; basic | First-class tenant IDs, per-tenant limits | First-class, with shuffle-sharding |
| HA dedup | Query-time, stores all replicas | Ingest-time, stores one replica | Ingest-time, improved |
| Downsampling | Native (5m, 1h) via compactor | None at storage layer | None at storage layer |
| Object storage | S3/GCS/Azure/Swift | Blocks or chunks engine | Blocks engine only |
| Best fit (2024-2025) | Bolt long-term storage onto existing Prometheus | Legacy deployments | New large-scale, multi-tenant platform |
Thanos is the natural choice when you already operate Prometheus clusters and want to bolt on object-store-backed long-term storage with minimal disruption [Source: https://blog.codinghorror.com/the-problem-with-logging/]. A small sidecar runs next to each Prometheus, uploads each finalized 2-hour block to S3 (or GCS, Azure, MinIO), and exposes the local TSDB to a central Querier for “recent data.” A Store Gateway serves historical blocks from object storage; the Compactor merges and downsamples them to 5-minute and 1-hour resolutions, which is what makes long-range queries (90 days, 1 year) fast.
Cortex and Grafana Mimir take a different approach: they are remote-write-native, horizontally scalable, multi-tenant metric backends. Prometheus (or any OTel Collector or Agent) ships samples via remote write to a load-balanced endpoint that fans them out to distributors, then to ingesters that keep recent data in memory and a WAL before flushing TSDB blocks to object storage. Store-gateways serve historical reads, and a query-frontend / query-scheduler layer parallelizes and caches PromQL queries.
Cortex was the original of these two; Mimir is its evolved successor from Grafana Labs, with simplified storage (blocks only — no legacy chunks engine), better defaults, and significantly improved query performance at scale. In 2024-2025, most new “central metrics platform” deployments choose Mimir over Cortex, and existing Cortex shops are gradually migrating.
When choosing between them, a useful rule of thumb:
- Use Thanos if your topology is “many independent Prometheus servers, give us long-term storage and a global view.” Thanos’s sidecar pattern adds minimal moving parts and its downsampling is uniquely useful for year-long dashboards.
- Use Mimir if you are building a multi-tenant metrics service where teams or customers ship metrics in via remote write and expect per-tenant quotas, isolation, and SLOs.
- Use Cortex mostly if you already run it; for new deployments, Mimir is usually the better choice.
- VictoriaMetrics is worth mentioning as a simpler single-binary alternative, especially for teams without dedicated SREs.
Figure 3.5: Long-term storage architectures — Thanos vs. Mimir/Cortex
flowchart TB
subgraph Thanos["Thanos (sidecar pattern)"]
direction LR
TP[Prometheus] --> TS[Thanos Sidecar]
TS -->|upload 2h blocks| TOS[(Object Store<br/>S3 / GCS)]
TQ[Thanos Querier] --> TS
TQ --> TSG[Store Gateway]
TSG --> TOS
TC[Compactor<br/>downsample 5m / 1h] --> TOS
end
subgraph Mimir["Mimir / Cortex (remote-write platform)"]
direction LR
MP[Prometheus / Agent] -->|remote_write| MD[Distributor]
MD --> MI[Ingester<br/>head + WAL]
MI -->|flush blocks| MOS[(Object Store)]
MQF[Query Frontend<br/>+ cache] --> MQ[Querier]
MQ --> MI
MQ --> MSG[Store Gateway]
MSG --> MOS
end
Key Takeaway: Local TSDB is fast, durable, and short-term: WAL + head + 2-hour blocks + compaction + retention. For longer horizons and multi-cluster scale, either attach Thanos (sidecar block uploads + downsampling for existing Prometheus) or remote-write samples to a backend like Mimir (multi-tenant remote-write platform) or Cortex (legacy predecessor of Mimir).
Chapter Summary
Prometheus is a small distributed system bundled into a single binary. Its retrieval subsystem pulls metrics over HTTP from targets discovered via service discovery; its TSDB persists those samples through a WAL into 2-hour blocks that compact into larger blocks over time; its HTTP API answers PromQL queries; and its rules engine evaluates recording rules and fires alerts to an external Alertmanager.
The pull model gives Prometheus implicit liveness, server-controlled load, and a debuggable wire protocol, at the cost of awkwardness around short-lived jobs and firewalled targets. The Pushgateway is a narrow escape hatch — appropriate for job-level batch metrics, dangerous when used for service-level SLOs. The OpenTelemetry Collector is increasingly the modern answer for genuinely push-shaped workloads.
The multi-dimensional data model — metric name plus a label set defines a unique time series — is the conceptual key to PromQL and to the operational pitfalls (cardinality explosion) that come with mishandling labels. The OpenMetrics text format is intentionally simple enough to debug with curl and write with printf.
Underneath, the TSDB is one of the more elegant pieces of open-source storage engineering: immutable time-bounded blocks, an index built around label postings lists, XOR-compressed memory-mapped chunks, and a WAL with checkpoints for crash recovery. Remote write, alongside Thanos’s sidecar block uploads, is how Prometheus integrates with Thanos, Cortex, and Mimir to extend retention from weeks to years and to scale to many tenants and clusters.
When something goes wrong in production — slow queries, OOMing servers, missing data, runaway cardinality — the mental model from this chapter is the map. The next chapter builds on it by exploring PromQL itself, the query language that makes all of this storage useful.
Key Terms
| Term | Definition |
|---|---|
| TSDB | Prometheus’s purpose-built Time-Series Database: an in-memory head block plus immutable on-disk blocks (chunks + index + meta.json), protected by a write-ahead log. |
| Scrape | A single HTTP GET against a target’s /metrics endpoint by Prometheus’s retrieval loop, which parses the response and appends samples to the TSDB. |
| Pushgateway | A standalone server that caches pushed metrics for short-lived batch jobs; Prometheus scrapes the Pushgateway like any other target. Misused as a general push ingest, it causes stale and ambiguous data. |
| Service discovery | The mechanism by which Prometheus learns the current list of scrape targets — from Kubernetes, Consul, cloud APIs, DNS, or files — rather than from static config. |
| OpenMetrics | The IETF-tracked text exposition format Prometheus uses on /metrics endpoints; the successor to the original Prometheus text format. |
| Relabeling | A declarative pipeline (relabel_configs and metric_relabel_configs) for selecting targets, rewriting labels, and dropping samples before and after a scrape. |
| Remote write | The HTTP-based protocol Prometheus uses to ship every ingested sample to a remote backend such as Thanos Receive, Cortex, Mimir, or VictoriaMetrics. |
| Federation | A pattern where one Prometheus scrapes a subset of another Prometheus’s series via the /federate endpoint, typically to aggregate recording-rule outputs across clusters. |
Chapter 4: PromQL: Querying Time-Series Data
PromQL (the Prometheus Query Language) is the lens through which every metric in a Prometheus-based observability stack becomes operational insight. Without PromQL, the millions of samples Prometheus diligently scrapes are just numbers on a disk. With it, you can express questions like “what is the 99th-percentile checkout latency per region, excluding canary pods, over the last five minutes?” in a single line. That power, however, comes with sharp edges: a misplaced label, a counter that resets at the wrong moment, or a quantile computed in the wrong order can turn a confident dashboard into a comforting lie.
Think of PromQL as a spreadsheet formula language for time. Where a spreadsheet operates on rows and columns of static values, PromQL operates on labeled streams of timestamps and floats. Every expression returns a vector, and every vector has a shape — number of series, set of labels, and a temporal extent. Master the shape, and the language follows.
Learning Objectives
By the end of this chapter, you will be able to:
- Write instant, range, and subquery PromQL expressions to answer operational questions about service health and capacity.
- Apply rate, histogram_quantile, and aggregation operators (sum, avg, topk, by, without) correctly, including the subtle ordering rules that govern quantile and ratio computations.
- Diagnose common PromQL pitfalls (counter resets across restarts, staleness markers and missing samples, and cardinality explosions caused by high-dimensional labels) and refactor queries to avoid them.
Figure 4.1: PromQL data-shape transitions
graph TD
RAW["Raw samples in TSDB<br/>(timestamp, value, labels)"]
SEL["Selector<br/>http_requests_total{job='api'}"]
IV["Instant Vector<br/>one sample per series at eval time"]
RV["Range Vector<br/>append [5m] — many samples per series"]
AGG["Aggregated Instant Vector<br/>sum/avg/topk by labels"]
S["Scalar<br/>single number, no labels"]
RAW --> SEL
SEL --> IV
IV -->|"append [duration]"| RV
RV -->|"rate, increase, *_over_time"| IV
IV -->|"sum by, avg by, topk"| AGG
AGG -->|"scalar()"| S
S -->|"comparison, arithmetic"| IV
4.1 PromQL Fundamentals
Before writing useful queries, you need a precise mental model of the data types PromQL manipulates. Confusion about these types is the single largest source of “why isn’t my query returning anything?” tickets.
Instant Vectors, Range Vectors, and Scalars
PromQL has four expression types, but for day-to-day work three of them dominate:
| Type | What it is | Example | When you use it |
|---|---|---|---|
| Instant vector | A set of time series, each containing one sample at the evaluation timestamp | http_requests_total | Most query results, dashboard panels, alert conditions |
| Range vector | A set of time series, each containing a range of samples going back in time | http_requests_total[5m] | Input to rate, increase, *_over_time functions |
| Scalar | A single numeric value (no labels, no time series) | 0.99, time() | Thresholds, quantile arguments, arithmetic constants |
| String | A literal string (rarely used outside label_replace) | "prod" | Function arguments only |
The crucial rule: most functions and operators that you think of as “PromQL math” require an instant vector. The comparison operator >, the binary operator +, and aggregation operators all reject range vectors. Range vectors exist almost exclusively to feed time-windowed functions like rate(), avg_over_time(), or increase().
# Instant vector: one sample per series at "now"
http_requests_total
# Range vector: every sample in the past 5 minutes per series
http_requests_total[5m]
# Scalar: just a number
0.95
# Functions transform range vectors back into instant vectors
rate(http_requests_total[5m]) # instant vector again
Analogy: an instant vector is a single Polaroid snapshot of all your series right now. A range vector is a flip-book of snapshots covering the last five minutes. Functions like rate() are the flip-book reader that summarizes the motion into a single number per series, handing you back a new Polaroid.
Selectors and Label Matchers
A selector chooses which series to retrieve. Every PromQL query starts with one. The selector has two parts: the metric name and an optional set of label matchers in {...} braces.
# All series for this metric, across every label combination
http_requests_total
# Filter by exact label match
http_requests_total{job="api", method="GET"}
# Regex match (note the =~ operator)
http_requests_total{code=~"5.."}
# Negative regex match
http_requests_total{code!~"2..|3.."}
# Negative exact match
http_requests_total{environment!="canary"}
Four matcher operators exist: = (equals), != (not equals), =~ (regex match), and !~ (regex doesn’t match). Regexes are anchored on both ends automatically — code=~"5.." matches 500, 503, and 599 but not 5000.
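The anchoring rule is easy to check outside Prometheus. The sketch below mimics the four matcher operators for a single label value. It is only an approximation: Prometheus uses RE2 regex syntax while Python's `re` module is a different engine, and `matches` is a hypothetical helper name, not part of any Prometheus library.

```python
import re

def matches(value: str, op: str, pattern: str) -> bool:
    """Mimic a PromQL label matcher against one label value.

    Regex matchers are fully anchored, so the pattern must cover the
    whole value; re.fullmatch gives the same effect here.
    """
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    if op == "=~":
        return re.fullmatch(pattern, value) is not None
    if op == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError(f"unknown matcher operator: {op}")

# "5.." matches three-character codes starting with 5, and nothing longer.
for code in ["500", "503", "599", "5000", "200"]:
    print(code, matches(code, "=~", "5.."))
```

Because the match is anchored, a pattern meant to match a substring needs explicit wildcards on both sides, for example `.*timeout.*`.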
You can also select by the special __name__ label, which is how PromQL internally represents the metric name. This trick is occasionally useful when the metric name itself needs filtering:
{__name__=~"http_.*", job="api"}
Offset and @ Modifiers
Two modifiers let you shift queries through time. They’re the difference between “requests right now” and “requests at exactly 9:00 AM yesterday.”
The offset modifier shifts the query backwards by a relative duration:
# Current request rate
rate(http_requests_total[5m])
# Request rate from one week ago, same lookback
rate(http_requests_total[5m] offset 1w)
# Week-over-week ratio
rate(http_requests_total[5m])
/
rate(http_requests_total[5m] offset 1w)
The @ modifier (introduced in Prometheus 2.25) pins the query to an absolute Unix timestamp, which is invaluable for reproducible alerts and post-incident analysis:
# Request rate as observed at 2025-01-15 14:30:00 UTC
rate(http_requests_total[5m] @ 1736951400)
# Combined: 5-minute rate, 1 hour before a fixed timestamp
rate(http_requests_total[5m] @ 1736951400 offset 1h)
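Since the @ modifier takes Unix seconds, it helps to convert wall-clock times explicitly instead of by hand. A minimal sketch using only the Python standard library follows; the helper names `to_unix_seconds` and `from_unix_seconds` are illustrative, not part of any Prometheus tooling.

```python
from datetime import datetime, timezone

def to_unix_seconds(year, month, day, hour=0, minute=0, second=0) -> int:
    """Return Unix seconds for a UTC wall-clock time."""
    dt = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(dt.timestamp())

def from_unix_seconds(ts: int) -> str:
    """Render Unix seconds as an ISO 8601 UTC string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()

print(to_unix_seconds(2025, 1, 15, 14, 30))  # value to paste after "@"
print(from_unix_seconds(1736951400))         # decode a timestamp found in a query
```

Decoding a timestamp before trusting a query comment is a cheap sanity check during post-incident analysis.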
Subqueries extend this further by letting you build a range vector out of an instant-vector expression on the fly, evaluated at a step you specify:
# Max 5-minute request rate observed over the last hour, sampled every 1m
max_over_time(
rate(http_requests_total[5m])[1h:1m]
)
The [1h:1m] syntax says: “evaluate the inner expression every 1 minute over the last 1 hour and assemble those results into a range vector.” Subqueries are powerful but expensive — they multiply query work — so prefer recording rules for anything you run repeatedly.
Key Takeaway: Every PromQL expression has a shape: instant vector, range vector, or scalar. Most operators require instant vectors; range vectors exist to feed time-window functions like rate(). Master the shape transitions and most “why doesn’t this work?” errors disappear.
4.2 Functions and Operators
This is where PromQL goes from “selecting data” to “answering questions.” The functions in this section are the workhorses of every production dashboard and alert.
rate, irate, and increase on Counters
Counters are the most common Prometheus metric type: monotonically increasing numbers like http_requests_total or node_network_transmit_bytes_total. Raw counter values are almost never useful on their own; what you care about is how fast they’re growing. Three functions answer that question with subtly different semantics.
| Function | Output | Computation | Best for | Avoid for |
|---|---|---|---|---|
| rate(v[w]) | Per-second average rate over window w | Increase between the first and last samples in the range (adjusted for counter resets), extrapolated to the window edges and divided by w | Dashboards, most alerts, trend analysis | Detecting brief spikes; very short windows |
| irate(v[w]) | Instantaneous per-second rate | Uses only the last two samples in the range | Spike detection, short-duration alerts | Long-term graphs (too noisy); alerts with a for clause, which a single quiet sample can reset |
| increase(v[w]) | Total increments across window w | Equivalent to rate(v[w]) * w_seconds, with the same extrapolation | SLO budgets, “how many events” questions | When you actually want a rate; expecting integers |
All three operate only on counters and automatically handle counter resets — when the value drops (say, from 12345 back to 0 after a pod restart), the negative jump is ignored and only post-reset increments count. Critically, none of them work correctly on gauges; for gauges, reach for avg_over_time, max_over_time, delta, or deriv.
# Smooth requests-per-second dashboard, aggregated by service
sum by (service) (
rate(http_requests_total[5m])
)
# Spike-detection alert: instantaneous error rate above 10%
sum by (service) (irate(http_requests_total{code=~"5.."}[1m]))
/
sum by (service) (irate(http_requests_total[1m]))
> 0.1
# SLO accounting: total 5xx errors and total requests over 30 days
sum(increase(http_requests_total{code=~"5.."}[30d]))
/
sum(increase(http_requests_total[30d]))
A common stumble is dividing before aggregating. A sum of per-series ratios is not the ratio of the sums. Always aggregate the numerator and denominator separately, then divide:
# WRONG: the division matches series label-for-label, so the summed result is not a service-wide error ratio
sum by (service) (
rate(http_requests_total{code=~"5.."}[5m])
/
rate(http_requests_total[5m])
)
# RIGHT — aggregate first, then divide
sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
/
sum by (service) (rate(http_requests_total[5m]))
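The arithmetic behind the "aggregate first, then divide" rule can be seen with two instances. The numbers below are hypothetical per-pod rates chosen for illustration; the point is only that combining per-pod ratios weights every pod equally, regardless of traffic.

```python
# Hypothetical per-instance rates (requests/second) for one service.
errors = {"pod-a": 1.0, "pod-b": 9.0}
total = {"pod-a": 100.0, "pod-b": 10.0}

# Ratio per instance, then combined: every pod counts equally,
# no matter how little traffic it serves.
per_pod = {pod: errors[pod] / total[pod] for pod in total}
sum_of_ratios = sum(per_pod.values())
mean_of_ratios = sum_of_ratios / len(per_pod)

# Aggregate first, then divide: the service-wide error ratio.
ratio_of_sums = sum(errors.values()) / sum(total.values())

print(f"sum of ratios:  {sum_of_ratios:.3f}")
print(f"mean of ratios: {mean_of_ratios:.3f}")
print(f"ratio of sums:  {ratio_of_sums:.3f}")
```

Only the last figure answers "what fraction of this service's requests failed"; the other two are dominated by the low-traffic pod.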
Choosing the window size matters too. A good rule of thumb is 3–5× the scrape interval: with a 15s scrape, use at least [1m] for rate to have meaningful samples; [5m] is the standard dashboard window because it smooths jitter without hiding real outages. Going too short produces noisy graphs; going too long ([1h]) hides brief incidents.
Figure 4.2: rate vs irate vs increase over a range vector
flowchart LR
C["Counter samples in [5m] window<br/>t-5m … t-4m … t-3m … t-2m … t-1m … t"]
C --> R["rate(v[5m])<br/>first-to-last increase in window<br/>extrapolated to window edges"]
C --> I["irate(v[5m])<br/>uses only the LAST<br/>two samples"]
C --> N["increase(v[5m])<br/>= rate(v[5m]) * 300s<br/>total increments in window"]
R --> RO["Smooth per-second rate<br/>good for dashboards and alerts"]
I --> IO["Spiky per-second rate<br/>good for short spike detection"]
N --> NO["Total event count over window<br/>good for SLO budgets"]
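To make the differences concrete, here is a deliberately simplified model of the three functions. It handles counter resets the way the text describes but omits the extrapolation to window edges that Prometheus applies, so its numbers will not match a real server exactly. The function names are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix_seconds, counter_value)

def raw_increase(samples: List[Sample]) -> float:
    """Sum the increments between consecutive samples.

    A drop in value is treated as a counter reset: the counter is assumed
    to have restarted from zero, so the post-reset value counts in full.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur if cur < prev else cur - prev
    return total

def simple_rate(samples: List[Sample]) -> float:
    """Per-second average between first and last sample (no extrapolation)."""
    elapsed = samples[-1][0] - samples[0][0]
    return raw_increase(samples) / elapsed

def simple_irate(samples: List[Sample]) -> float:
    """Per-second rate from the last two samples only."""
    return simple_rate(samples[-2:])

# 15s scrapes; the process restarts between t=30 and t=45.
window = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 5.0), (60, 65.0)]
print("increase:", raw_increase(window))
print("rate:    ", simple_rate(window))
print("irate:   ", simple_irate(window))
```

The irate result reflects only the final 15 seconds, which is why it reacts to spikes quickly and why it is noisy on long graphs.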
histogram_quantile and Bucket Math
Latency, response sizes, and queue depths are typically tracked with histograms, not gauges or counters. A classic Prometheus histogram exposes three families of series for each underlying metric:
- *_bucket{le="0.1", ...} through le="+Inf": cumulative counters of observations falling at or below each upper bound.
- *_sum: total of all observed values.
- *_count: total observation count, equal to the +Inf bucket.
The histogram_quantile() function reconstructs an approximate distribution from these buckets and linearly interpolates within the bucket that contains your target quantile.
The math, briefly: for quantile q with cumulative counts b_i at upper bounds u_i and total count N = b_n (taking b_0 = 0 and u_0 = 0 as the lower edge of the first bucket),
- Compute rank r = q × N.
- Find the first bucket k where b_k ≥ r.
- Interpolate: x_q = u_{k-1} + ((r − b_{k-1}) / (b_k − b_{k-1})) × (u_k − u_{k-1}).
This assumes a uniform distribution within each bucket, which is why bucket boundary design matters so much.
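The three steps above can be sketched in a few lines. This is an illustrative reimplementation of the interpolation for classic buckets, not Prometheus's actual code: `bucket_quantile` is a hypothetical name, the bucket counts are invented, and edge cases such as negative bounds or q outside [0, 1] are ignored.

```python
import math
from typing import List, Tuple

def bucket_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with (+Inf, total).
    The lower edge of the first bucket is taken to be 0.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank lands in the +Inf bucket: there is no upper edge to
                # interpolate toward, so report the last finite bound.
                return prev_bound
            if count == prev_count:
                return bound  # empty bucket, nothing to interpolate
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + fraction * (bound - prev_bound)
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical cumulative counts for a latency histogram (seconds).
example = [(0.1, 50), (0.25, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]
print(bucket_quantile(0.50, example))
print(bucket_quantile(0.95, example))
print(bucket_quantile(0.99, example))
```

Note how coarse the answer is near the tail: the p99 estimate depends entirely on where the 0.5 and 1.0 boundaries sit, which is the practical reason bucket design matters.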
The canonical p99 latency pattern looks like this:
histogram_quantile(
0.99,
sum by (le, service) (
rate(http_request_duration_seconds_bucket[5m])
)
)
Four things must be right for this to produce a meaningful answer:
- Apply rate() to the buckets first. Buckets are counters, so you almost always want a rate of observations (or increase() for a fixed window), not the raw cumulative count since process start.
- Preserve the le label in aggregation. If you write sum by (service) without including le, you collapse all buckets into a single value and destroy the histogram structure.
- Aggregate before taking the quantile, never after. Quantiles are not linear; you cannot average p99s across instances.
- Ensure all instances share the same bucket boundaries. Mixing layouts produces meaningless interpolation.
Here is the wrong-vs-right comparison in code:
# WRONG — compute per-instance p99, then sum/avg quantiles (statistically invalid)
sum by (service) (
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
)
# WRONG — drops the `le` label, breaking histogram reconstruction
histogram_quantile(
0.99,
sum by (service) (rate(http_request_duration_seconds_bucket[5m]))
)
# RIGHT — aggregate buckets preserving `le`, then take one quantile
histogram_quantile(
0.99,
sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
)
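Why averaging per-instance quantiles fails is easiest to see with raw numbers. The sketch below uses invented latencies and a plain nearest-rank quantile on raw values rather than histogram buckets, so it illustrates the statistical point only.

```python
import math
from typing import List

def nearest_rank(q: float, values: List[float]) -> float:
    """Nearest-rank quantile: smallest value with at least q of the data at or below it."""
    ordered = sorted(values)
    index = max(math.ceil(q * len(ordered)) - 1, 0)
    return ordered[index]

# Hypothetical request latencies in seconds.
busy_pod = [0.1] * 1000   # fast, serves almost all traffic
idle_pod = [2.0] * 10     # slow, serves very little

p99_busy = nearest_rank(0.99, busy_pod)
p99_idle = nearest_rank(0.99, idle_pod)
mean_of_p99s = (p99_busy + p99_idle) / 2

# The fleet-wide p99 comes from the merged observations.
fleet_p99 = nearest_rank(0.99, busy_pod + idle_pod)

print("mean of per-pod p99s:", mean_of_p99s)
print("fleet p99:           ", fleet_p99)
```

The averaged figure is roughly ten times the fleet-wide value because the idle pod's ten requests get the same weight as the busy pod's thousand. Summing buckets before taking the quantile is the histogram equivalent of merging the observations.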
A related gotcha: if the true p99 falls within the +Inf bucket, histogram_quantile() has no finite upper edge to interpolate toward, so it returns the upper bound of the highest finite bucket and the graph flattens at that value. The fix is better bucket design: cluster boundaries tightly around your SLO threshold. If your SLO is 300ms p99, you want buckets like 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 1.0 rather than the default 0.005, 0.01, ..., 10.
For average latency, combine _sum and _count:
sum by (service) (rate(http_request_duration_seconds_sum[5m]))
/
sum by (service) (rate(http_request_duration_seconds_count[5m]))
Native histograms (introduced as experimental in Prometheus 2.40+ and stabilized later) eliminate the le label entirely by storing the bucket structure compactly inside a single series. Querying them is simpler:
# Native histogram p99 — no `le`, no _bucket suffix
histogram_quantile(
0.99,
sum by (service) (
rate(http_request_duration_seconds[5m])
)
)
| Aspect | Classic histogram | Native histogram |
|---|---|---|
| Bucket representation | One series per le value | One compact value per series |
| Cardinality cost | High (n buckets × m label combinations) | Low (single series per label combination) |
| Aggregation for quantiles | sum by (le, ...) | sum by (...) |
| Tooling maturity | Universal | Newer, some tools assume le exists |
Aggregation Operators
Aggregation operators collapse many series into fewer series. They’re the workhorse of every dashboard panel. PromQL has a fixed set: sum, avg, min, max, count, count_values, stddev, stdvar, topk, bottomk, quantile, and group.
Each can be modified by by (label_list) (keep only these labels) or without (label_list) (drop these labels, keep the rest).
# Total requests per second across the whole fleet
sum(rate(http_requests_total[5m]))
# Grouped by service — one result series per service
sum by (service) (rate(http_requests_total[5m]))
# Different grouping: drop only the instance and pod labels, keep everything else
sum without (instance, pod) (rate(http_requests_total[5m]))
# Top 5 noisiest services by error rate
topk(5,
sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
)
# Average CPU utilization per node: 1 minus the mean idle fraction across cores
1 - avg by (node) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
A subtle but important point: by and without are complementary. sum by (service) keeps only the service label and drops everything else. sum without (pod, instance) drops just those two labels and keeps everything else. In high-cardinality environments, without is often safer — it doesn’t accidentally hide labels you forgot to mention.
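The difference between the two modifiers is just which labels survive into the grouping key. A small sketch of that idea follows, using hypothetical series represented as (labels, value) pairs; `aggregate_sum` is an illustrative name, and the metric-name handling of real PromQL is not modeled.

```python
from collections import defaultdict

def aggregate_sum(series, by=None, without=None):
    """Sum (labels, value) pairs, grouping like PromQL's by/without.

    by=[...]      keeps only the listed labels.
    without=[...] drops the listed labels and keeps every other one.
    """
    groups = defaultdict(float)
    for labels, value in series:
        if by is not None:
            kept = {k: v for k, v in labels.items() if k in by}
        else:
            dropped = set(without or [])
            kept = {k: v for k, v in labels.items() if k not in dropped}
        # A sorted tuple of label pairs serves as a hashable group key.
        groups[tuple(sorted(kept.items()))] += value
    return dict(groups)

series = [
    ({"service": "cart", "pod": "cart-1", "code": "200"}, 40.0),
    ({"service": "cart", "pod": "cart-2", "code": "200"}, 35.0),
    ({"service": "cart", "pod": "cart-1", "code": "500"}, 2.0),
]

by_service = aggregate_sum(series, by=["service"])
without_pod = aggregate_sum(series, without=["pod"])
print(by_service)
print(without_pod)
```

With by, the code label disappears along with pod; with without, code survives and the result has one series per status code.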
Key Takeaway: Use rate() for smooth dashboards, irate() for spike alerts, and increase() for “how many events” questions. For histograms, always rate() the buckets first, preserve le in aggregation, and compute the quantile after aggregating. For ratios, aggregate numerator and denominator separately, then divide.
4.3 Recording and Alerting Rules
PromQL queries can be slow. A histogram_quantile() over a 30-day window touching millions of series can take seconds to evaluate, far too slow for a dashboard that refreshes every 10s or an alert rule that is evaluated every 30s. The fix is rules: queries that Prometheus evaluates on a fixed schedule, storing the results either as new metrics (recording rules) or as alert states (alerting rules) [Source: https://sre.google/sre-book/service-best-practices/].
Recording Rules for Expensive Queries
A recording rule pre-computes a query and writes the result back into Prometheus as a new time series. Prometheus pays the evaluation cost once per interval in the background; every dashboard load then reads a single pre-computed series.
groups:
  - name: http_slos
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
      - record: job:http_error_ratio:5m
        expr: |
          job:http_errors:rate5m
          /
          job:http_requests:rate5m
Notice the naming convention: level:metric:operation. The level (job, namespace, cluster) identifies the aggregation scope; the metric identifies what’s being measured; the operation describes the transformation (rate5m, histogram_quantile99). This convention is part of the Prometheus operational vocabulary and pays dividends in dashboards and downstream rules [Source: https://www.dynatrace.com/news/blog/site-reliability-done-right/].
# Without recording rule — slow, runs every dashboard refresh
histogram_quantile(0.99,
sum by (le, service) (
rate(http_request_duration_seconds_bucket[5m])
)
)
# With recording rule — fast lookup of a single pre-computed series
service:http_request_duration_seconds:histogram_quantile99_5m
A few rules-of-thumb for rule chains:
- Keep chains 2–3 levels deep. Pipelines like raw → base → service → cluster work well; deeper chains become hard to debug.
- Don’t rate a rate. If job:http_requests:rate5m is already a per-second rate, writing rate(job:http_requests:rate5m[5m]) is meaningless.
- Group rules by data source and scrape interval. Co-locate rules that depend on each other so they evaluate consistently.
Alerting Rule Syntax and the for Clause
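Alerting rules look like recording rules but produce alert states instead of new time series. The expr field must return an instant vector; every series in that vector becomes an active alert, whatever its sample value. That is why alert expressions end in a comparison such as > 0.05, which filters out the series that should not alert.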
groups:
  - name: http_alerts
    interval: 30s
    rules:
      - alert: HighRequestErrorRate
        expr: job:http_error_ratio:5m > 0.05
        for: 5m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: |
            Error ratio is {{ $value | humanizePercentage }} on job
            {{ $labels.job }} (threshold 5%). See runbook for triage.
          runbook_url: "https://runbooks.example.com/HighRequestErrorRate"
The for clause is the single most important alerting-rule feature for reducing noise. It requires the alert condition to be continuously true for the specified duration before the alert fires. With for: 5m, a one-minute blip in error rate won’t page anyone; a sustained five-minute problem will.
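The pending-to-firing behavior can be modeled as a small state machine. The sketch below is an approximation for building intuition, assuming one boolean per rule evaluation; `alert_states` is a hypothetical helper, and details of a real server (such as restoring state after a restart) are left out.

```python
def alert_states(condition, interval_s, for_s):
    """Map per-evaluation booleans to alert states.

    condition  : one boolean per rule evaluation, in time order
    interval_s : seconds between evaluations
    for_s      : the rule's `for` duration in seconds
    """
    states, active_since = [], None
    for step, is_true in enumerate(condition):
        now = step * interval_s
        if not is_true:
            active_since = None          # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_s else "pending")
    return states

# One-minute evaluations with for: 5m.
print(alert_states([True, False, True], 60, 300))   # a blip never fires
print(alert_states([True] * 7, 60, 300))            # sustained problem fires
```

The reset on any false evaluation is what makes a flapping expression so damaging: it can stay pending indefinitely without ever paging.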
Guidelines for for durations [Source: https://www.splunk.com/en_us/blog/learn/sre-metrics-four-golden-signals-of-monitoring.html]:
| Alert category | Typical for | Reasoning |
|---|---|---|
| User-visible symptom (errors, slow latency) | 1–5m | Fast page; users already see it |
| Resource saturation (CPU, memory, disk) | 5–15m | Avoid paging on transient spikes |
| SLO burn rate (fast window) | 2–5m | Catches rapid budget burn |
| SLO burn rate (slow window) | 30–60m | Long, sustained budget drift |
| Capacity / “filling up” trends | 1h+ | Days-ahead warnings, not pages |
A critical anti-pattern: don’t use for to mask noisy query design. If your expression is flapping because the underlying rate window is too short, fix the query (longer rate window, more aggregation) rather than lengthening for. The for clause delays alerts; it doesn’t make them more accurate.
Best Practices for Rule Organization
At scale, rules become code. Treating them like code is the single biggest lever for keeping alerting sane [Source: https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started].
- Version control everything. Rules live in Git, reviewed by service owners and SREs, deployed by CI/CD.
- Lint in CI. promtool check rules validates syntax; custom linters enforce naming conventions, required labels (severity, team, service, runbook_url), and forbidden patterns (no topk in alerts, no high-cardinality grouping).
- Standard libraries. Share templates for common patterns: SLO burn-rate alerts with FAST and SLOW windows, four-golden-signals dashboards, exporter-specific rule packs.
- Tier alerts by severity. critical pages a human; warning opens a ticket; info logs to a channel. Every critical alert must have a runbook URL in annotations.
- Safe rollout. New rules ship as recording rules first, then as warning alerts to a non-paging channel with a long for, and only graduate to critical after they prove reliable.
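A custom linter of the kind mentioned above does not need to be elaborate. The following sketch checks rules that have already been parsed into dictionaries (for example by a YAML loader, not shown here). The required-label set and the exact checks are assumptions chosen for illustration, not a standard policy.

```python
import re

# level:metric:operation, each part lowercase letters, digits, underscores.
RECORD_NAME = re.compile(r"^[a-z0-9_]+:[a-z0-9_]+:[a-z0-9_]+$")
# Assumed team policy; adjust to whatever labels your routing depends on.
REQUIRED_ALERT_LABELS = {"severity", "team"}

def lint_rule(rule: dict) -> list:
    """Return a list of human-readable problems for one parsed rule."""
    problems = []
    if "record" in rule:
        if not RECORD_NAME.match(rule["record"]):
            problems.append(f"{rule['record']}: name is not level:metric:operation")
    elif "alert" in rule:
        name = rule["alert"]
        labels = rule.get("labels", {})
        missing = sorted(REQUIRED_ALERT_LABELS - set(labels))
        if missing:
            problems.append(f"{name}: missing labels {missing}")
        is_critical = labels.get("severity") == "critical"
        if is_critical and "runbook_url" not in rule.get("annotations", {}):
            problems.append(f"{name}: critical alert has no runbook_url")
    return problems

rules = [
    {"record": "job:http_requests:rate5m", "expr": "..."},
    {"record": "http_requests_rate5m", "expr": "..."},
    {"alert": "HighRequestErrorRate",
     "labels": {"severity": "critical", "team": "payments"},
     "annotations": {}},
]
for rule in rules:
    for problem in lint_rule(rule):
        print(problem)
```

A check like this complements promtool check rules, which validates syntax but knows nothing about a team's conventions.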
A sample rule-organization layout might look like:
rules/
├── recording/
│ ├── http_aggregates.yaml # job:http_requests:rate5m, etc.
│ ├── slo_aggregates.yaml # service:availability:ratio_rate30d
│ └── infra_aggregates.yaml # node:cpu:usage_ratio_avg5m
├── alerts/
│ ├── slo_burn.yaml # multi-window SLO alerts
│ ├── infra_saturation.yaml # disk, memory, CPU
│ └── platform_health.yaml # control-plane symptoms
└── tests/
└── *.test.yaml # promtool test rules fixtures
Key Takeaway: Recording rules turn expensive queries into single-series lookups; alerting rules turn those queries into pages. Use the level:metric:operation naming convention, keep rule chains shallow, and tune the for clause to alert category, never to mask noisy queries. Treat rules as code: Git, CI lint, runbook annotations.
Figure 4.3: Recording-rule chain feeding an alert
sequenceDiagram
participant T as "Target /metrics"
participant P as "Prometheus scraper"
participant R1 as "Recording rule (base)<br/>job:http_requests:rate5m"
participant R2 as "Recording rule (ratio)<br/>job:http_error_ratio:5m"
participant A as "Alerting rule<br/>HighRequestErrorRate"
participant AM as "Alertmanager"
T->>P: "scrape every 15s"
P->>P: "write samples to TSDB"
Note over R1: "every 30s eval interval"
P->>R1: "read raw counters"
R1->>R1: "sum by (job) (rate(...))"
R1->>P: "write new series"
Note over R2: "every 30s eval interval"
R1->>R2: "read rate5m series"
R2->>R2: "errors / requests"
R2->>P: "write ratio series"
Note over A: "every 30s eval interval"
R2->>A: "read ratio series"
A->>A: "expr > 0.05 sustained for 5m"
A->>AM: "fire alert with labels + annotations"
4.4 Common Pitfalls
PromQL is a small language with a lot of footguns. The pitfalls below are the ones that show up most often in incident postmortems.
Counter Resets Across Restarts
Counters should never decrease, but processes restart. When a counter resets from 12345 back to 0, every rate-family function (rate, irate, increase) detects the drop and treats it as a reset, counting only the post-reset increase. This works automatically and is one of PromQL’s most pleasant surprises.
But there are edge cases:
- Frequent restarts within a short window. A pod that restarts every 30 seconds inside a 1-minute rate window produces noisy, misleading rates. Investigate the restart loop; the metric is correctly reflecting the chaos.
- Counters that aren’t actually counters. Some metrics named _total are gauges in disguise (badly designed exporters). Rate functions silently produce nonsense. Always verify metric type via the Prometheus UI or /metrics endpoint.
- Aggregating across reset boundaries. If you sum two counter series and one resets, the sum jumps. Always apply rate() or increase() before aggregating counters:
# WRONG — sum first, then rate. Reset on one instance corrupts the sum.
rate(sum by (job)(http_requests_total)[5m:])
# RIGHT — rate per series first, then aggregate
sum by (job)(rate(http_requests_total[5m]))
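The size of the error is worth seeing once. This sketch uses invented counter values and a simplified, reset-aware increase (no extrapolation) to compare the two orderings; the `increase` function here is a stand-in for illustration, not the PromQL implementation.

```python
def increase(values):
    """Total increase across consecutive counter samples, reset-aware.

    A drop is read as a restart from zero, so the new value counts in full.
    """
    return sum(cur if cur < prev else cur - prev
               for prev, cur in zip(values, values[1:]))

# Hypothetical counters from two instances; instance "b" restarts.
instance_a = [100, 110, 120]
instance_b = [500, 5, 15]

# RIGHT: per-series increase first, then sum.
per_series_then_sum = increase(instance_a) + increase(instance_b)

# WRONG: sum the raw counters, then look for increases. The restart of "b"
# makes the summed series drop, which looks like a reset of the whole sum.
summed = [a + b for a, b in zip(instance_a, instance_b)]
sum_then_increase = increase(summed)

print("increase then sum:", per_series_then_sum)
print("sum then increase:", sum_then_increase)
```

In the wrong ordering, the untouched counter from instance "a" is counted again from zero after the apparent reset, which inflates the result several times over.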
Staleness Markers and Missing Samples
Prometheus 2.0+ writes an explicit staleness marker when a target disappears or a series stops being reported, and a series marked stale drops out of instant-query results right away. Without a marker, a series lingers until its last sample is older than the lookback delta (5 minutes by default); after that, instant queries return no result for it rather than the last known value. This is usually what you want, but it has consequences:
- Alerts can silently disable themselves. If your alert is up == 0 and the up series stops being reported (e.g., because the scrape config no longer targets that job), the alert never fires.
- Dashboards show gaps. Sparse metrics (events that only occur occasionally) appear missing rather than zero. Use or vector(0) or absent() to make missing data explicit:
# Detect when a critical metric is missing
absent(up{job="payments-api"})
# Default to zero when no data (the aggregate must be label-less to match vector(0))
sum(rate(payment_failures_total[5m])) or vector(0)
- Subqueries cross staleness boundaries. A [1h:30s] subquery includes 30-second snapshots; if data is missing for part of that hour, the inner expression evaluates to no result for those steps. Functions like avg_over_time skip those steps rather than treating them as zero.
The lookback-delta setting (default 5 minutes) controls how far back PromQL searches for the most recent sample of a series. Don’t change this without strong reason — it has cascading effects on every query in your environment.
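The way an instant query picks a sample can be approximated in a few lines. This sketch makes two simplifying assumptions that differ from a real server: any NaN value stands in for a staleness marker, and the lookback window is treated as open on its older edge. `instant_value` is a hypothetical name.

```python
import math

LOOKBACK_DELTA_S = 300  # the 5-minute default described above

def instant_value(samples, eval_ts, lookback_s=LOOKBACK_DELTA_S):
    """Return the value an instant query would see at eval_ts, or None.

    samples is a time-ordered list of (unix_seconds, value); a NaN value
    stands in for a staleness marker in this sketch.
    """
    candidates = [(ts, v) for ts, v in samples
                  if eval_ts - lookback_s < ts <= eval_ts]
    if not candidates:
        return None                      # nothing recent enough
    _, value = candidates[-1]
    if isinstance(value, float) and math.isnan(value):
        return None                      # series was explicitly marked stale
    return value

healthy = [(0, 1.0), (15, 2.0)]
marked_stale = [(0, 1.0), (15, float("nan"))]

print(instant_value(healthy, 100))       # recent sample found
print(instant_value(healthy, 400))       # last sample is older than 5 minutes
print(instant_value(marked_stale, 20))   # marker hides the series immediately
```

The third case is the one that surprises people: the series vanishes from results seconds after the target goes away, not five minutes later.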
Cardinality Explosions from High-Dimensional Labels
Cardinality — the number of unique time series — is the single largest scalability constraint in Prometheus. Each unique combination of metric name and label values is one series; each series consumes memory, disk, and query CPU. A metric with 10 labels each having 100 possible values can in principle produce 10^20 series. In practice, a few hundred thousand series per Prometheus instance is healthy; a few million is painful; tens of millions usually crashes.
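The worst case is simply a product, which makes it easy to estimate before a new label ships. A minimal sketch follows; the label names and value counts are illustrative assumptions, and real series counts are usually lower because not every combination occurs.

```python
from math import prod

def worst_case_series(label_value_counts: dict) -> int:
    """Upper bound on series for one metric: the product of the number
    of distinct values each label can take."""
    return prod(label_value_counts.values())

# Ten labels with 100 values each.
print(worst_case_series({f"label_{i}": 100 for i in range(10)}))

# A more typical HTTP metric, before and after adding a per-pod label.
print(worst_case_series({"method": 10, "status_code": 40}))
print(worst_case_series({"method": 10, "status_code": 40, "pod": 500}))
```

Running this estimate during code review is a cheap way to catch a label that would multiply an existing metric by hundreds.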
Cardinality explodes when a label can take values from an unbounded set:
| Label | Cardinality | Safe in metrics? |
|---|---|---|
| method (GET, POST, PUT, …) | ~10 | Yes |
| status_code (200, 404, 500, …) | ~40 | Yes |
| service | tens | Yes |
| pod | hundreds–thousands; changes constantly | Risky |
| request_path (raw URL) | unbounded | No |
| user_id | millions | No |
| trace_id | every request | Absolutely not |
| email | millions | Absolutely not |
A common real-world failure: an exporter that uses raw request paths as labels generates one series per (method, status, path) tuple. Add /users/{id}/posts/{id} and you have one series per user per post — millions of series before lunch. The fix is to normalize at the source: bucket paths into templates (/users/:id/posts/:id), drop high-cardinality labels before exposing them, or move that information into logs and traces (where cardinality is cheap) instead of metrics.
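Normalizing at the source can be as simple as rewriting identifier-like path segments before the value is used as a label. The sketch below assumes identifiers are either numeric or UUID-shaped; `normalize_path` is a hypothetical helper, and where a web framework exposes the matched route template, using that directly is usually more reliable.

```python
import re

# Assumption: identifiers are numeric or UUID-shaped path segments.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)

def normalize_path(path: str) -> str:
    """Replace identifier-like segments with ':id' and drop the query string."""
    path = path.split("?", 1)[0]
    segments = [":id" if _ID_SEGMENT.match(seg) else seg
                for seg in path.split("/")]
    return "/".join(segments)

for raw in ["/users/42/posts/9001",
            "/users/42/posts/9001?draft=true",
            "/orders/123e4567-e89b-12d3-a456-426614174000",
            "/healthz"]:
    print(raw, "->", normalize_path(raw))
```

Every user and post now maps to the same label value, so the series count depends on the number of routes rather than the number of users.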
For recording rules, the discipline tightens. A recording rule that retains pod as a grouping label persists a new series for every pod, and pod churn keeps adding more: a hidden multiplier. Standard practice [Source: https://www.dynatrace.com/news/blog/site-reliability-done-right/]:
# RISKY: preserves pod label, creates pod-cardinality series
- record: pod:http_requests:rate5m
  expr: sum by (pod, service)(rate(http_requests_total[5m]))
# SAFER: aggregate pod away in the recording rule
- record: service:http_requests:rate5m
  expr: sum by (service)(rate(http_requests_total[5m]))
Tools for diagnosis:
- topk(20, count by (__name__)({__name__=~".+"})): top 20 metrics by series count.
- count(count by (label_name)(metric_name)): distinct values for a specific label.
- prometheus_tsdb_head_series: total active series; alert when this grows fast.
Key Takeaway: Three pitfalls dominate PromQL incidents: counter resets handled by aggregating after rate() (not before); staleness handled by absent() / or vector(0) for missing data; and cardinality controlled by aggregating away unbounded labels in recording rules and never exposing raw user/path/trace IDs as label values.
Figure 4.4: How label dimensions multiply series count
graph LR
M["http_requests_total<br/>1 metric, no labels<br/>1 series"]
M --> L1["+ method<br/>(~10 values)<br/>10 series"]
L1 --> L2["+ status_code<br/>(~40 values)<br/>400 series"]
L2 --> L3["+ pod<br/>(~500 churning values)<br/>200,000 series"]
L3 --> L4["+ request_path raw URL<br/>(unbounded, 50k+)<br/>10,000,000+ series<br/>Prometheus OOM"]
L4 --> FIX["Fix: normalize path to template<br/>route='/users/:id/posts/:id'<br/>or drop label entirely"]
Chapter Summary
PromQL is the language of operational truth in a Prometheus-based observability stack. In this chapter you learned to:
- Reason about data shape. Every PromQL expression returns an instant vector, range vector, or scalar; the shape determines which operators are valid. Range vectors exist to feed time-window functions; everything else operates on instant vectors.
- Pick the right rate function. rate(m[5m]) for dashboards and most alerts; irate(m[1m]) for spike detection; increase(m[30d]) for SLO budgets. All three handle counter resets automatically and demand counter (not gauge) inputs.
- Compute histogram quantiles correctly. rate() the buckets, sum by (le, ...) preserving the le label, then histogram_quantile() on the aggregated result. Never aggregate after quantiles, never drop le, never mix bucket layouts.
- Aggregate with intent. Use by to keep a small label set, without to drop specific labels in high-cardinality environments. Always aggregate the numerator and denominator of a ratio separately, then divide.
- Move expensive work into recording rules. Apply the level:metric:operation naming convention, keep rule chains 2–3 levels deep, and never rate a rate.
- Tune alerting rules. Use for clauses sized to alert category (1–5m for symptoms, 5–15m for saturation, 2–5m for fast-window and 30–60m for slow-window SLO burn). Treat rules as code: Git, CI lint, runbooks linked from annotations.
- Avoid the big pitfalls. Counter resets (rate first, then aggregate), staleness (use absent and explicit defaults), and cardinality (drop unbounded labels in recording rules; never expose user/path/trace IDs as labels).
The next chapter shifts from querying telemetry to producing it: it introduces the OpenTelemetry architecture (the API, the SDK, and the Collector), the layers that generate and route telemetry, including the metrics that PromQL queries.
Key Terms
| Term | Definition |
|---|---|
| instant vector | A set of time series each containing a single sample at the evaluation timestamp; the default result type of most PromQL expressions. |
| range vector | A set of time series each containing a range of samples going back in time; produced by appending [duration] to a selector and consumed by functions like rate(). |
| rate | A function returning the per-second average rate of increase of a counter across a range vector, computed from the first and last samples in the window (adjusted for counter resets) and extrapolated to the window edges. |
| histogram_quantile | A function that estimates a quantile by linearly interpolating within the bucket of a classic or native histogram that contains the target rank. |
| recording rule | A configured PromQL expression that Prometheus evaluates on a fixed schedule, writing the result back as a new time series for fast retrieval. |
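| alerting rule | A configured PromQL expression evaluated on a schedule; every series it returns becomes an alert regardless of sample value, so conditions are usually written as comparison filters. A for duration requires the series to keep being returned before the alert fires. |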
| staleness | The Prometheus behavior of treating a series as gone after a lookback delta (default 5m) with no new samples or after an explicit staleness marker. |
| cardinality | The total number of unique time series; the primary scaling constraint of Prometheus, driven mostly by the cross-product of label-value combinations. |
[Source: https://sre.google/sre-book/service-best-practices/] [Source: https://www.dynatrace.com/news/blog/site-reliability-done-right/] [Source: https://natesnewsletter.substack.com/p/context-windows-are-a-lie-the-myth]
Chapter 5: OpenTelemetry Architecture: API, SDK, and Collector
OpenTelemetry is often described as a single project, but in practice it is three loosely coupled layers working in concert: a small, stable API that instrumentation code calls; a configurable SDK that turns those calls into real spans, metrics, and log records; and an out-of-process Collector that receives, processes, and forwards telemetry. Understanding how these layers fit together — and where the seams are — is the foundation for every decision you will make later about samplers, exporters, deployment topologies, and vendor choice.
This chapter zooms out from individual signals (Chapters 2–4) and looks at OpenTelemetry as a system. By the end, you should be able to read an OpenTelemetry architecture diagram, predict where a given piece of configuration belongs, and choose a Collector deployment topology appropriate for the workload in front of you.
Learning Objectives
By the end of this chapter, you will be able to:
- Differentiate the OpenTelemetry API, SDK, and Collector and explain why the API/SDK split is a deliberate architectural decision rather than an accident of history.
- Trace telemetry data from an instrumented application through SDK exporters and the Collector to a backend, and identify which configuration knob lives at each layer.
- Select an appropriate Collector deployment topology — agent (sidecar or DaemonSet), gateway (centralized service), or a hybrid of both — for a given workload, citing concrete trade-offs.
- Compare OTLP transport variants (gRPC, HTTP/protobuf, HTTP/JSON) and pick one based on infrastructure constraints.
- Distinguish OpenTelemetry distributions (core, contrib, vendor, custom) and know when to build your own with the OpenTelemetry Collector Builder (ocb).
Figure 5.1: OpenTelemetry three-layer architecture and OTLP data flow
flowchart TD
subgraph App["Application Process"]
Lib[Library Code<br/>depends on API only]
Code[Application Code<br/>depends on API only]
API[OpenTelemetry API<br/>Tracer / Meter / Logger interfaces]
SDK[OpenTelemetry SDK<br/>samplers + processors + exporters + resource]
Lib --> API
Code --> API
API --> SDK
end
SDK -->|"OTLP/gRPC :4317 or OTLP/HTTP :4318"| Col
subgraph Col["OpenTelemetry Collector (out-of-process)"]
Recv[Receivers<br/>OTLP, Prometheus, filelog]
Proc[Processors<br/>batch, memory_limiter, k8sattributes, tail_sampling]
Exp[Exporters<br/>OTLP, vendor-specific]
Recv --> Proc --> Exp
end
Exp -->|"OTLP or vendor protocol"| BE[("Backend<br/>Prometheus / Tempo / Loki / Vendor SaaS")]
1. API vs SDK vs Collector
OpenTelemetry’s most important architectural decision is splitting instrumentation surface from pipeline implementation, and then separating both from the out-of-process telemetry agent. Each layer has a distinct audience, a distinct release cadence, and a distinct dependency footprint [Source: https://opentelemetry.io/docs/concepts/components/].
1.1 The API: a stable interface for instrumentation
The API is what library authors and application code import. It defines types like TracerProvider, Tracer, Span, MeterProvider, Meter, LoggerProvider, and Logger, along with global access points (GlobalOpenTelemetry.getTracer(...) in Java, opentelemetry.trace.get_tracer(...) in Python). Crucially, the API defines interfaces only — it does not know about exporters, samplers, batching, OTLP, or any backend [Source: https://opentelemetry.io/docs/specs/otel/].
Think of the API as the plug on the back of an appliance. The shape of the plug is standardized and changes very slowly. Whether the wall socket is connected to a hydroelectric dam, a solar panel, or nothing at all is not the appliance’s problem.
A library that emits a span looks like this in Java:
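import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;

// Hypothetical wrapper class so the snippet compiles on its own.
public final class Worker {
    // The scope name and version identify the instrumenting library, not the application.
    private static final Tracer tracer =
        GlobalOpenTelemetry.getTracer("com.example.library", "1.0.0");

    void doWork() {
        Span span = tracer.spanBuilder("doWork").startSpan();
        try {
            span.setAttribute("foo", "bar");
            // library logic
        } finally {
            span.end(); // always end the span, even if the logic throws
        }
    }
}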
Note what is missing: no exporter, no endpoint, no sampling rate, no environment variable parsing. The library cannot accidentally pull in a vendor SDK, gRPC client, or HTTP exporter as a transitive dependency [Source: https://opentelemetry.io/docs/languages/java/instrumentation/].
If no SDK is registered, the API returns no-op implementations. Spans are created but never recorded; metric updates evaporate; log records are dropped. The cost is a few function calls and a small allocation — safe to leave instrumentation enabled even in latency-sensitive paths [Source: https://opentelemetry.io/docs/specs/otel/].
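The no-op default can be illustrated with a toy model. The sketch below is not the OpenTelemetry API: NoOpTracer, RecordingTracer, set_tracer, and get_tracer are hypothetical names chosen only to mirror the shape of the split, in which library code asks for a tracer and never learns which implementation it received.

```python
"""Toy model of the API/SDK split; every name here is illustrative, not OpenTelemetry's."""
from contextlib import contextmanager


class NoOpTracer:
    """What library code gets when the application never installs an SDK."""

    @contextmanager
    def start_span(self, name):
        yield None  # nothing is recorded


class RecordingTracer:
    """Stand-in for an SDK tracer: it keeps what it is given."""

    def __init__(self):
        self.finished = []

    @contextmanager
    def start_span(self, name):
        try:
            yield name
        finally:
            self.finished.append(name)  # recorded when the span ends


_tracer = NoOpTracer()  # the "API" default


def set_tracer(tracer):
    """Called once by the application at startup (the SDK's role)."""
    global _tracer
    _tracer = tracer


def get_tracer():
    """Called by library code; the caller cannot tell which implementation it got."""
    return _tracer


def do_work():
    """Library code: identical whether or not an SDK is installed."""
    with get_tracer().start_span("doWork"):
        return 42
```

Calling do_work() before set_tracer() runs the same code path and simply records nothing, which is the property that makes it reasonable to leave instrumentation in place everywhere.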
1.2 The SDK: a configurable pipeline implementation
The SDK is what application developers wire up at startup. It replaces the API’s no-op providers with concrete implementations that actually record data, apply samplers, run processors, and call exporters [Source: https://opentelemetry.io/docs/specs/otel/].
A minimal Java SDK initialization looks like this:
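import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public final class Telemetry { // hypothetical wrapper so the snippet compiles
    static void init() {
        // Export spans over OTLP/gRPC to a Collector on port 4317.
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
            .setEndpoint("http://collector:4317")
            .build();
        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
            .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
            .setResource(Resource.getDefault().toBuilder()
                .put("service.name", "my-service")
                .build())
            .build();
        // Replace the API's no-op providers process-wide.
        OpenTelemetrySdk.builder()
            .setTracerProvider(tracerProvider)
            .buildAndRegisterGlobal();
    }
}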
The SDK owns four moving parts:
- Samplers (e.g., AlwaysOn, ParentBased, TraceIdRatioBased) decide whether to record a span.
- Processors (SimpleSpanProcessor, BatchSpanProcessor, metric readers, log record processors) buffer, batch, enrich, or filter data.
- Exporters serialize data and send it somewhere: OTLP, Jaeger, Zipkin, Prometheus, or a vendor-specific endpoint.
- Resource describes the entity producing telemetry (service.name, host.name, deployment.environment).
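To make the four parts concrete, here is a minimal sketch of how they might cooperate. It is a toy model in plain Python, not SDK code: ratio_sampler, BatchProcessor, InMemoryExporter, and record are hypothetical names, and the sampler only approximates the idea behind a trace-ID-ratio sampler by comparing the low 64 bits of the trace ID against a threshold.

```python
"""Toy SDK pipeline: sampler -> batching processor -> exporter. Illustrative only."""

TRACE_ID_MASK = (1 << 64) - 1  # look at the low 64 bits of the 128-bit trace ID


def ratio_sampler(ratio):
    """Return a sampler that keeps roughly `ratio` of traces, decided by trace ID."""
    bound = round(ratio * (1 << 64))

    def should_sample(trace_id):
        return (trace_id & TRACE_ID_MASK) < bound

    return should_sample


class InMemoryExporter:
    """Stands in for an OTLP exporter; it just remembers each batch."""

    def __init__(self):
        self.batches = []

    def export(self, batch):
        self.batches.append(list(batch))


class BatchProcessor:
    """Buffers finished spans and hands them to the exporter in groups."""

    def __init__(self, exporter, max_batch_size):
        self.exporter = exporter
        self.max_batch_size = max_batch_size
        self.buffer = []

    def on_end(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.max_batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.exporter.export(self.buffer)
            self.buffer = []


def record(spans, sampler, processor, resource):
    """Apply the sampler, stamp the resource, and pass spans to the processor."""
    for span in spans:
        if sampler(span["trace_id"]):
            processor.on_end({**span, "resource": resource})
    processor.flush()
```

Because the sampling decision depends only on the trace ID, every service that sees the same trace reaches the same decision without coordination.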
1.3 Why the split matters
The API/SDK split is not bureaucratic — it directly enables vendor neutrality:
| Concern | API only | SDK |
|---|---|---|
| Audience | Library authors, app code | Application operators, platform teams |
| Release cadence | Slow, stable | Faster, more features |
| Dependencies | Tiny (interfaces) | Heavy (exporters, gRPC, processors) |
| Default behavior | No-op | Records and exports data |
| Vendor coupling | None | Choose any exporter |
A widely used HTTP client library depending only on opentelemetry-api adds essentially zero weight and zero opinion about your backend. The same library can be used in an app that exports to Jaeger, in an app that exports to a SaaS vendor, and in an app that disables telemetry entirely — without recompilation [Source: https://opentelemetry.io/docs/concepts/components/].
1.4 The Collector: an out-of-process pipeline
The Collector is a separate binary written in Go that runs outside your application. It speaks OTLP (and many other protocols) on its receivers, runs processors on the in-memory pipeline, and sends data out through exporters [Source: https://opentelemetry.io/docs/collector/].
receivers → processors → exporters
Figure 5.4: Collector pipeline anatomy
flowchart LR
subgraph Receivers
R1[OTLP gRPC :4317]
R2[OTLP HTTP :4318]
R3[Prometheus scrape]
R4[filelog tail]
end
subgraph Processors
P1[memory_limiter<br/>backpressure]
P2[k8sattributes<br/>add pod/namespace]
P3[tail_sampling<br/>keep errors + slow]
P4[batch<br/>group for efficiency]
P1 --> P2 --> P3 --> P4
end
subgraph Exporters
E1[OTLP to backend]
E2[Vendor exporter]
E3[debug / logging]
end
R1 --> P1
R2 --> P1
R3 --> P1
R4 --> P1
P4 --> E1
P4 --> E2
P4 --> E3
Why move logic out of the application?
- Decoupling deploys. Sampling, redaction, and backend routing change without rebuilding every service.
- Aggregation and batching. A Collector seeing traffic from hundreds of pods can build larger, more efficient batches than any single SDK exporter.
- Vendor portability. Apps export OTLP; the Collector translates to whatever backend you happen to use this quarter.
- Centralized policy. Drop secrets, rewrite attributes, enforce per-tenant limits in one place.
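The receivers → processors → exporters flow, and the idea of enforcing policy in one place, can be sketched as a chain of functions. This is a toy model rather than the Collector's actual Go interfaces; SENSITIVE_KEYS, redact, add_cluster, and run_pipeline are hypothetical names, and the cluster name is an assumed value.

```python
"""Toy Collector pipeline: received data -> processors -> exporters. Not the Collector's API."""

# Hypothetical central policy: attribute keys that must never leave the cluster.
SENSITIVE_KEYS = {"http.request.header.authorization", "db.password"}


def redact(records):
    """Processor: mask attributes the central policy marks as sensitive."""
    return [
        {k: ("[REDACTED]" if k in SENSITIVE_KEYS else v) for k, v in r.items()}
        for r in records
    ]


def add_cluster(records):
    """Processor: enrich every record with a shared attribute (assumed value)."""
    return [{**r, "k8s.cluster.name": "demo"} for r in records]


def run_pipeline(received, processors, exporters):
    """Run each processor in order, then fan the result out to every exporter."""
    data = received
    for process in processors:
        data = process(data)
    for export in exporters:
        export(data)
    return data
```

Changing the redaction policy here means editing one list in one place; none of the services that produced the records needs to be rebuilt.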
Key Takeaway: OpenTelemetry deliberately separates a stable instrumentation surface (API) from a configurable in-process pipeline (SDK) from an out-of-process aggregator (Collector). Libraries depend only on the API, applications own the SDK, and operations teams own the Collector — each layer can evolve independently without breaking the others.
2. Cross-Language Architecture
OpenTelemetry promises a consistent mental model across more than a dozen languages. The architecture above repeats almost identically in Java, Python, Go, .NET, Node.js, Ruby, PHP, Rust, C++, Swift, and others. What ties them together is a shared specification, a shared set of semantic conventions, and a shared wire protocol (OTLP) [Source: https://opentelemetry.io/docs/specs/otel/].
2.1 Language support matrix and stability levels
Each language SIG (Special Interest Group) implements the spec at its own pace. The OpenTelemetry project tracks stability per signal per language: a language might be GA for traces, beta for metrics, and experimental for logs. This matters when you adopt OpenTelemetry in a polyglot environment — a Java service might emit fully GA telemetry while a sibling Node.js service is still using a beta logs SDK [Source: https://opentelemetry.io/docs/languages/].
A high-level snapshot (consult the docs for current status):
| Signal | Java | Python | Go | .NET | Node.js |
|---|---|---|---|---|---|
| Traces | Stable | Stable | Stable | Stable | Stable |
| Metrics | Stable | Stable | Stable | Stable | Stable |
| Logs | Stable | Beta/Stable | Beta | Stable | Beta |
The pattern: traces stabilized first, metrics followed, logs are the most recent and still maturing in some languages.
2.2 Semantic conventions: the lingua franca
If every team picks its own attribute names — http.statusCode vs http_status vs httpResponse.code — your “vendor-neutral” telemetry becomes useless. Semantic conventions are the OpenTelemetry project’s standardized vocabulary for resource and span attributes [Source: https://opentelemetry.io/docs/concepts/semantic-conventions/].
Examples:
- service.name, service.namespace, service.instance.id
- host.name, host.id, host.arch
- deployment.environment (e.g., production, staging)
- http.request.method, http.response.status_code, url.full
- db.system, db.statement, db.name
- messaging.system, messaging.destination.name
- cloud.provider, cloud.region, k8s.namespace.name, k8s.pod.name
The payoff: a dashboard, alert, or query written against http.response.status_code works identically whether the data came from a Java service, a Go service, or a Python service. Backends like Grafana, Datadog, Honeycomb, and Tempo can build out-of-the-box visualizations because they know exactly what to look for.
Analogy: semantic conventions are to telemetry what HTTP status codes are to the web. Without them, every server could invent its own “the page worked” signal; with them, every browser, proxy, and dashboard knows what 200 means.
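A small sketch shows what the shared vocabulary buys you. The ALIASES mapping below is illustrative: the left-hand names are the ad-hoc spellings mentioned earlier (plus a hypothetical bare method key), and normalize and is_server_error are hypothetical helpers, not part of any OpenTelemetry package.

```python
"""Normalize ad-hoc attribute names to semantic-convention names. Mapping is illustrative."""

# Left: names a team might have invented. Right: semantic-convention names.
ALIASES = {
    "http.statusCode": "http.response.status_code",
    "http_status": "http.response.status_code",
    "httpResponse.code": "http.response.status_code",
    "method": "http.request.method",
}


def normalize(attributes):
    """Return a copy of `attributes` with known aliases renamed."""
    return {ALIASES.get(key, key): value for key, value in attributes.items()}


def is_server_error(attributes):
    """One check that works for every service once names are normalized."""
    return normalize(attributes).get("http.response.status_code", 0) >= 500
```

Without the normalization step, is_server_error would need one branch per team's spelling; with conventions applied at the source, the mapping table disappears entirely.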
2.3 OTLP: the wire protocol that ties it together
OTLP (OpenTelemetry Protocol) is the bridge between SDKs, Collectors, and OTLP-compatible backends. It defines:
- A protobuf schema (ExportTraceServiceRequest, ResourceSpans, ScopeSpans, Span, plus equivalents for metrics and logs).
- A small set of service methods (TraceService.Export, MetricsService.Export, LogsService.Export).
- Three transport bindings: OTLP/gRPC, OTLP/HTTP/protobuf, and OTLP/HTTP/JSON [Source: https://opentelemetry.io/docs/specs/otlp/].
The structure of a trace export request is layered:
ExportTraceServiceRequest
└── ResourceSpans (one per Resource, e.g., per service)
    ├── Resource (attributes like service.name)
    └── ScopeSpans (one per instrumentation library)
        ├── InstrumentationScope (name, version)
        └── Span[] (trace_id, span_id, attributes, events, status)
The same proto messages are used across all three transports — only the encoding and HTTP/gRPC framing differ [Source: https://github.com/open-telemetry/opentelemetry-proto].
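The layering is easier to see when built by hand. The sketch below groups flat span records into an OTLP/JSON-shaped dictionary. The input keys (service, scope, trace_id, span_id, name) are hypothetical, and the output is deliberately incomplete: a real span also carries timestamps, kind, status, and more.

```python
"""Build an OTLP/JSON-shaped trace export request from flat span records (partial sketch)."""
import json
from collections import defaultdict


def to_export_request(spans):
    """Group spans by service (Resource), then by instrumentation scope."""
    grouped = defaultdict(lambda: defaultdict(list))
    for s in spans:
        grouped[s["service"]][s["scope"]].append(
            {"traceId": s["trace_id"], "spanId": s["span_id"], "name": s["name"]}
        )
    return {
        "resourceSpans": [
            {
                "resource": {
                    "attributes": [
                        {"key": "service.name", "value": {"stringValue": service}}
                    ]
                },
                "scopeSpans": [
                    {"scope": {"name": scope}, "spans": scope_spans}
                    for scope, scope_spans in scopes.items()
                ],
            }
            for service, scopes in grouped.items()
        ]
    }
```

Passing the result through json.dumps gives a body with the same nesting an OTLP/HTTP/JSON request uses; the protobuf transports carry the same structure in binary form.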
2.4 gRPC vs HTTP/protobuf vs HTTP/JSON
| Aspect | OTLP/gRPC | OTLP/HTTP/protobuf | OTLP/HTTP/JSON |
|---|---|---|---|
| Default port | 4317 | 4318 | 4318 |
| Encoding | Protobuf (binary) | Protobuf (binary) | JSON (text) |
| Transport | gRPC over HTTP/2 | HTTP/1.1 or HTTP/2 | HTTP/1.1 or HTTP/2 |
| Multiplexing | Yes (HTTP/2 streams) | Depends on HTTP version | Depends |
| Wire overhead | Lowest | Low | Highest (text, verbose) |
| Proxy/LB friendliness | Needs HTTP/2 + gRPC-aware LBs | Standard HTTP infra | Standard HTTP infra |
| Debuggability | Hardest (binary + gRPC) | Medium (binary) | Easiest (curl-able) |
| Browser support | No (gRPC-Web is different) | Yes | Yes |
Practical guidance:
- Default to OTLP/gRPC on port 4317 in Kubernetes and other modern infra where you control the network path [Source: https://opentelemetry.io/docs/specs/otel/protocol/exporter/].
- Switch to OTLP/HTTP/protobuf on port 4318 when traversing proxies or load balancers that don’t handle gRPC well, or backends that only expose HTTP endpoints.
- Use OTLP/HTTP/JSON for browsers (front-end telemetry), debugging with curl, or low-volume use cases where wire overhead doesn’t matter.
Default endpoints for OTLP/HTTP:
- POST /v1/traces
- POST /v1/metrics
- POST /v1/logs
A common configuration mistake is mismatching protocol and port:
# WRONG: gRPC port with HTTP exporter
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
This produces errors like “unexpected response code” or “transport error.” Use grpc with 4317 or http/protobuf with 4318; do not mix the two [Source: https://opentelemetry.io/docs/specs/otlp/].
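A startup check can catch this class of mistake early. The following sketch is an assumption-laden helper, not part of any SDK: check_otlp_settings is a hypothetical function, and it treats 4317 and 4318 as conventions only, so it warns only when the endpoint uses the other protocol's conventional port and stays silent for custom ports.

```python
"""Flag a likely OTLP protocol/port mismatch before the exporter starts."""
from urllib.parse import urlparse

# Conventional ports from the comparison table; a deployment may remap them.
CONVENTIONAL_PORT = {"grpc": 4317, "http/protobuf": 4318, "http/json": 4318}


def check_otlp_settings(endpoint, protocol):
    """Return a warning string, or None if the pair looks consistent."""
    if protocol not in CONVENTIONAL_PORT:
        return f"unknown protocol {protocol!r}"
    port = urlparse(endpoint).port
    expected = CONVENTIONAL_PORT[protocol]
    # Warn only when the port is the *other* protocol's conventional port.
    if port is not None and port != expected and port in CONVENTIONAL_PORT.values():
        return f"{protocol} usually listens on {expected}, but endpoint uses {port}"
    return None
```

Applied to the WRONG example, check_otlp_settings("http://collector:4317", "http/protobuf") returns a warning, while the matching pairs return None.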
Figure 5.3: OTLP export request flow across transport variants
sequenceDiagram
participant SDK as SDK Exporter
participant Col as Collector OTLP Receiver
Note over SDK,Col: OTLP/gRPC on :4317
SDK->>Col: HTTP/2 frame: TraceService.Export(ExportTraceServiceRequest, protobuf)
Col-->>SDK: ExportTraceServiceResponse (may include partial_success)
Note over SDK,Col: OTLP/HTTP/protobuf on :4318
SDK->>Col: "POST /v1/traces Content-Type: application/x-protobuf"
Col-->>SDK: "200 OK protobuf body (partial_success)"
Note over SDK,Col: OTLP/HTTP/JSON on :4318
SDK->>Col: "POST /v1/traces Content-Type: application/json"
Col-->>SDK: "200 OK JSON body (partial_success)"
Note over SDK,Col: Retry only on UNAVAILABLE / 429 / 502 / 503 / 504 with backoff
2.5 Partial success and retry semantics
Both gRPC and HTTP OTLP support partial success: the response carries a partial_success field with rejected_spans (or rejected data points / log records) and an error_message. The transport-level status is still success — rejected items are usually permanently bad (e.g., rate-limited, malformed) and should not be retried [Source: https://opentelemetry.io/docs/specs/otlp/].
Retry policy is signaled by the transport status:
- Retry on UNAVAILABLE, DEADLINE_EXCEEDED, or RESOURCE_EXHAUSTED (gRPC; the last only when the server indicates it can recover) or on HTTP 429, 502, 503, and 504. Use exponential backoff with jitter, and honor any retry delay the server supplies.
- Do not retry on INVALID_ARGUMENT, UNAUTHENTICATED, PERMISSION_DENIED (gRPC) or HTTP 400, 401, 403. The request itself is the problem; retrying won’t help.
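The retry rules reduce to a small classifier plus a backoff schedule. The sketch below is illustrative rather than any exporter's real implementation: should_retry and backoff_delays are hypothetical names, the HTTP set follows the retryable codes listed in the OTLP specification (429, 502, 503, 504), which is narrower than "any 5xx", and the base and cap values are assumptions.

```python
"""Classify OTLP export failures and compute backoff delays. A sketch, not an SDK API."""
import random

# RESOURCE_EXHAUSTED is retryable only if the server signals it can recover;
# this sketch simplifies by always treating it as retryable.
RETRYABLE_GRPC = {"UNAVAILABLE", "DEADLINE_EXCEEDED", "RESOURCE_EXHAUSTED"}
RETRYABLE_HTTP = {429, 502, 503, 504}


def should_retry(status):
    """`status` is a gRPC status name (str) or an HTTP status code (int)."""
    if isinstance(status, str):
        return status in RETRYABLE_GRPC
    return status in RETRYABLE_HTTP


def backoff_delays(attempts, base=1.0, cap=30.0, rng=random.random):
    """Exponential backoff with full jitter: delay n is rng() * min(cap, base * 2**n)."""
    return [rng() * min(cap, base * (2 ** n)) for n in range(attempts)]
```

Full jitter spreads retries from many exporters across the whole interval, which avoids synchronized retry storms against a recovering Collector.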
Key Takeaway: OpenTelemetry’s cross-language consistency rests on three pillars: the specification (which defines API/SDK semantics per signal), semantic conventions (a shared vocabulary so any backend can interpret data from any language), and OTLP (the wire protocol that carries telemetry between SDK, Collector, and backend across three transport variants).
3. Collector Deployment Topologies
The Collector is the most operationally flexible piece of OpenTelemetry. The same binary, with different configuration, can be deployed as a sidecar inside a pod, as a DaemonSet per node, or as a centralized fleet of pods behind a service. Most production Kubernetes environments combine more than one of these patterns [Source: https://opentelemetry.io/docs/collector/deployment/].
Figure 5.2: Agent, gateway, and hybrid Collector topologies
flowchart TD
subgraph AgentMode["Agent topology (DaemonSet)"]
direction TB
A_App1[App Pod<br/>Node 1]
A_App2[App Pod<br/>Node 2]
A_Ag1[Collector Agent<br/>Node 1]
A_Ag2[Collector Agent<br/>Node 2]
A_BE[(Backend)]
A_App1 -->|"OTLP localhost"| A_Ag1
A_App2 -->|"OTLP localhost"| A_Ag2
A_Ag1 --> A_BE
A_Ag2 --> A_BE
end
subgraph GatewayMode["Gateway topology (Deployment)"]
direction TB
G_App1[App Pod<br/>Node 1]
G_App2[App Pod<br/>Node 2]
G_GW[Gateway Collectors<br/>centralized Deployment + Service]
G_BE[(Backend)]
G_App1 -->|"OTLP cluster Service"| G_GW
G_App2 -->|"OTLP cluster Service"| G_GW
G_GW --> G_BE
end
subgraph HybridMode["Hybrid topology (recommended)"]
direction TB
H_App1[App Pod<br/>Node 1]
H_App2[App Pod<br/>Node 2]
H_Ag1[Agent<br/>Node 1]
H_Ag2[Agent<br/>Node 2]
H_GW[Gateway Collectors<br/>tail sampling + tenant routing + auth]
H_BE[(Backend)]
H_App1 -->|"OTLP localhost"| H_Ag1
H_App2 -->|"OTLP localhost"| H_Ag2
H_Ag1 -->|"OTLP cross-node"| H_GW
H_Ag2 -->|"OTLP cross-node"| H_GW
H_GW --> H_BE
end
3.1 Agent mode (sidecar or DaemonSet)
In agent mode, a Collector lives next to the workload:
- Sidecar: one Collector container per application pod, sharing localhost. Useful for per-service custom pipelines or strict tenant isolation between pods on the same node. Expensive at scale.
- DaemonSet: one Collector per Kubernetes node, exposed on the node’s IP or a host-network port. Apps on that node send telemetry to the local Collector.
Agents excel at:
- Node-local enrichment: adding k8s.pod.name, k8s.namespace.name, host.name from a position where that information is cheap to obtain.
- Host-level signals: scraping kubelet/cAdvisor metrics, reading container logs from /var/log/containers/*.log, collecting host CPU and memory.
- Minimal application latency: apps export to localhost or a node-local address with no cross-node hop.
- Isolated failure domain: if one agent dies, only its node’s telemetry is affected.
A minimal agent receivers/processors/exporters chain:
receivers:
  otlp:
    protocols:
      grpc:
      http:
  filelog:
    include: [ /var/log/containers/*.log ]
  kubeletstats:
    collection_interval: 10s
processors:
  memory_limiter:
    check_interval: 1s     # required: how often memory usage is measured
    limit_percentage: 75
  k8sattributes:
    auth_type: serviceAccount
  batch:
    timeout: 5s
    send_batch_size: 8192
exporters:
  otlp:
    endpoint: otel-gateway.observability.svc.cluster.local:4317
    tls:
      insecure: true       # plaintext inside the cluster; enable TLS in production
service:
  pipelines:
    # filelog and kubeletstats take effect only once added to logs/metrics pipelines
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp]
3.2 Gateway mode (centralized service)
In gateway mode, one or more Collectors run as a Deployment behind a Kubernetes Service (or external load balancer). Apps — or, more commonly, agents — send telemetry to this central fleet [Source: https://opentelemetry.io/docs/collector/deployment/].
Gateways excel at:
- Tail-based sampling. Sampling that depends on the full trace (e.g., “keep all traces with errors” or “keep all traces over 500 ms”) requires seeing every span in the trace. A node-local agent only sees spans for pods on its node; a gateway can see the whole trace [Source: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor].
- Efficient batching and compression. Aggregating from many sources lets the gateway build large batches that compress well and amortize backend connection overhead.
- Centralized auth and secrets. API keys, OAuth tokens, mTLS certs for the backend live in one place rather than on every node.
- Multi-tenant routing. The gateway is the natural place to route per-team or per-tenant traffic to different backends, apply per-tenant quotas, and validate tenant tokens.
For correct tail sampling across multiple gateway replicas, you need trace-ID-aware load balancing so that all spans for a given trace ID hit the same gateway pod. The loadbalancing exporter or an L7 load balancer with consistent hashing is the typical solution.
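Trace-ID-aware routing is usually built on consistent hashing. The sketch below shows the general technique with a small hash ring; it is not the loadbalancing exporter's actual algorithm, and HashRing, the replica names, and the vnodes count are all hypothetical.

```python
"""Route every span of a trace to the same gateway replica via a hash ring."""
import bisect
import hashlib


def _hash(value):
    """Stable 64-bit hash (Python's built-in hash() is randomized per process)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, replicas, vnodes=64):
        # Each replica owns several points on the ring to even out the load.
        self._points = sorted(
            (_hash(f"{replica}#{i}"), replica)
            for replica in replicas
            for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def route(self, trace_id):
        """Return the replica owning the first ring point at or after the trace's hash."""
        index = bisect.bisect_left(self._keys, _hash(trace_id)) % len(self._points)
        return self._points[index][1]
```

The benefit over a plain hash-modulo scheme appears when the gateway scales: adding a replica moves only the traces that now belong to the new replica, so most in-flight traces keep landing on the pod that already holds their earlier spans.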
A gateway pipeline with tail sampling:
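processors:
  memory_limiter:
    check_interval: 1s     # required: how often memory usage is measured
    limit_percentage: 75
  batch:
  tail_sampling:
    decision_wait: 5s      # how long to wait for a trace's spans before deciding
    num_traces: 50000      # traces held in memory while awaiting a decision
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: default
        type: probabilistic
        probabilistic:
          sampling_percentage: 10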
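The decision logic those two policies describe can be sketched in a few lines. This is an illustration of the idea, not the tail_sampling processor's code: keep_trace is a hypothetical function, spans are modeled as plain dictionaries with a status key, and the hash-based bucket is an assumed way to get a deterministic percentage.

```python
"""Tail-sampling decision mirroring an errors policy plus a probabilistic default."""
import hashlib


def keep_trace(trace_id, spans, sampling_percentage=10):
    """Decide once per trace, after its spans have been collected."""
    # Policy "errors": keep the whole trace if any span ended with ERROR status.
    if any(span.get("status") == "ERROR" for span in spans):
        return True
    # Policy "default": keep a deterministic slice, keyed on the trace ID.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < sampling_percentage
```

The function needs the complete span list to answer correctly, which is why this decision belongs on a gateway that sees the whole trace rather than on a node-local agent.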
3.3 Agent vs gateway: trade-off summary
| Concern | Agent (sidecar / DaemonSet) | Gateway (centralized) |
|---|---|---|
| Resource overhead | Linear with nodes/pods; isolated impact | Fewer instances; better CPU/memory efficiency |
| Batching efficiency | Smaller per-instance volume | Aggregates → large batches, better compression |
| Tail-based sampling | Limited to local view → broken traces | Global view → correct decisions |
| Network topology | App → localhost or node-local | App → cluster Service (cross-node traffic) |
| Auth to backends | Secrets distributed across nodes | Secrets centralized at gateway |
| Multi-tenancy | Hard to enforce centrally | Natural policy enforcement point |
| Scalability | Scales naturally with nodes | Needs HPA / sharding for stateful processors |
| Reliability | No central SPOF; per-node blast radius | Choke point; mitigated by replicas + HPA |
3.4 Hybrid: agent + gateway
For most non-trivial Kubernetes clusters, the recommended pattern is DaemonSet agent + Deployment gateway:
- DaemonSet agent receives OTLP from local pods, scrapes Prometheus and kubelet endpoints, tails container logs, adds Kubernetes metadata, and forwards OTLP to the gateway. Resilient, node-local, and cheap.
- Deployment gateway receives OTLP from agents (and any clients that prefer to bypass the agent), runs tail sampling and tenant routing, authenticates to backends, and handles retries and backpressure.
This hybrid topology gives you the best of both worlds: node-local enrichment plus host signals from the agent, and centralized sampling, auth, and routing from the gateway. It also gives you a natural choice of where to bound failure: if the gateway is overloaded, agents buffer locally (memory or file_storage extension) until pressure subsides [Source: https://opentelemetry.io/docs/collector/deployment/].
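The buffering behavior can be modeled with a bounded queue. The sketch below is a simplification under stated assumptions: BoundedBuffer is a hypothetical class, and refusing new batches when full is one possible overflow policy, chosen here for illustration rather than taken from the Collector's source.

```python
"""Bounded agent-side buffer that absorbs a gateway outage. A sketch under stated assumptions."""
from collections import deque


class BoundedBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()
        self.dropped = 0

    def offer(self, batch):
        """Queue a batch; refuse it (and count the drop) when the buffer is full."""
        if len(self.items) >= self.capacity:
            self.dropped += 1
            return False
        self.items.append(batch)
        return True

    def drain(self, send):
        """Send queued batches oldest-first; stop at the first failure."""
        sent = 0
        while self.items and send(self.items[0]):
            self.items.popleft()
            sent += 1
        return sent
```

The capacity is the failure bound: it caps how much memory or disk an agent spends on an unreachable gateway, and the dropped counter is the number you would want to alert on.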
3.5 The OpenTelemetry Operator
On Kubernetes, the OpenTelemetry Operator provides CRDs that let you declare topology declaratively [Source: https://github.com/open-telemetry/opentelemetry-operator]:
- OpenTelemetryCollector with mode: daemonset → agent per node.
- OpenTelemetryCollector with mode: sidecar → injected into matching pods.
- OpenTelemetryCollector with mode: deployment → centralized gateway.
- Instrumentation CR → auto-injects SDK and configures endpoint for Java, Python, Node.js, .NET, and Go workloads.
The Operator lets you treat Collector topology as declarative configuration, just like Deployments and Services.
Key Takeaway: Agent mode optimizes for node-local collection, host signals, and isolated failure domains; gateway mode optimizes for tail sampling, centralized auth, and multi-tenant routing. Most production Kubernetes clusters end up running both — a DaemonSet agent for cheap node-local work and a Deployment gateway for heavy centralized work.
4. Distributions and Builds
The Collector is more than one binary. The OpenTelemetry project ships distributions — pre-built bundles of receivers, processors, exporters, and extensions — and provides a builder tool for assembling your own [Source: https://opentelemetry.io/docs/collector/].
4.1 otelcol vs otelcol-contrib
The two flagship distributions:
- otelcol (core): the minimal, stable distribution. Includes the most common, well-tested receivers (OTLP, Prometheus), processors (batch, memory_limiter, attributes), and exporters (OTLP, OTLPHTTP, debug). Recommended for production pipelines that don’t need exotic components.
- otelcol-contrib: the kitchen-sink distribution. Includes hundreds of components: vendor-specific exporters (Datadog, Splunk, New Relic, Honeycomb, AWS, GCP, Azure), specialized receivers (kubeletstats, filelog, hostmetrics, redis, mysql), advanced processors (tail_sampling, transform, routing, k8sattributes), and many more.
| Distribution | Components | Image size | Use case |
|---|---|---|---|
| otelcol | Minimal, core only | Small | Stable production, OTLP-only |
| otelcol-contrib | Hundreds | Large | Most real-world deployments needing vendor or specialty components |
| Vendor (e.g., Datadog Agent, AWS Distro for OpenTelemetry) | Curated for vendor | Varies | Tight vendor integration, vendor support contracts |
| Custom (built with ocb) | Exactly what you choose | Smallest possible | Production hardening, supply-chain control |
For most teams getting started, otelcol-contrib is the practical default — almost every production pipeline ends up needing at least one component that lives in contrib (k8sattributes, tail_sampling, filelog, etc.).
4.2 Building custom distributions with ocb
The OpenTelemetry Collector Builder (ocb) is a CLI that lets you assemble your own distribution from a manifest. The motivation:
- Supply-chain hygiene. Ship only the components you actually use. Smaller binary, smaller attack surface, faster startup.
- Compliance. Some organizations require an inventory of every Go module in production binaries.
- Performance. Fewer registered components = less startup overhead, smaller config schema, fewer plugins competing for the same resources.
A builder manifest (manifest.yaml) looks like this:
dist:
  name: my-otelcol
  description: Custom OpenTelemetry Collector for ACME Corp
  output_path: ./dist
  otelcol_version: 0.95.0
receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.95.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver v0.95.0
processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.95.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor v0.95.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor v0.95.0
exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.95.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter v0.95.0
Then:
ocb --config manifest.yaml
The result is a single Go binary containing exactly the components listed — nothing more.
Analogy: otelcol-contrib is like a Linux distribution shipped with every package preinstalled. Convenient, but most servers only need a handful. A custom build with ocb is the equivalent of starting from a minimal base image and running apt install for only the packages you need.
4.3 Vendor-specific distributions
Several vendors ship their own Collector distributions, sometimes called agents or contribs:
- AWS Distro for OpenTelemetry (ADOT) — bundles upstream Collector with AWS-specific exporters (X-Ray, CloudWatch, AMP) and is supported by AWS.
- Splunk OpenTelemetry Collector — pre-configured for Splunk Observability Cloud, with Splunk-tuned defaults.
- Datadog Agent (OpenTelemetry mode) — Datadog’s agent can ingest OTLP and forward to Datadog with its own pipeline.
- Grafana Agent / Alloy — Grafana Labs’ distribution focused on Grafana Cloud and the LGTM stack.
Vendor distributions still respect the API/SDK/OTLP boundary. Your applications continue to emit standard OTLP; only the Collector itself is vendor-flavored. Switching vendors is largely a Collector configuration change — instrumentation code stays put. This is the practical payoff of the architecture in section 1.
4.4 Choosing a distribution
A decision matrix for picking a distribution:
| If you need… | Start with |
|---|---|
| OTLP-only, simple pipeline | otelcol (core) |
| Most real-world Kubernetes pipelines | otelcol-contrib |
| Tight vendor integration + support contract | Vendor distribution |
| Minimal binary, supply-chain control | Custom build with ocb |
| Quick local experimentation | otelcol-contrib Docker image |
Key Takeaway: OpenTelemetry distributions are pre-assembled bundles of Collector components. Use otelcol-contrib for most production pipelines, vendor distributions when you want first-party support, and ocb-built custom distributions when supply-chain hygiene or binary size matter. In all cases, applications continue to emit standard OTLP; the distribution choice is a Collector concern, not an instrumentation concern.
Chapter Summary
OpenTelemetry is three layers, not one: a stable API that instrumentation code calls, a configurable SDK that turns those calls into real telemetry, and an out-of-process Collector that aggregates, processes, and forwards data. The API/SDK split is what makes vendor-neutral instrumentation possible — libraries can depend only on the API, and applications choose backends at deployment time without recompiling.
Across more than a dozen languages, three pillars give OpenTelemetry consistency: the specification that defines API/SDK semantics, semantic conventions that standardize attribute names so any backend can interpret any signal, and OTLP — the wire protocol with three transport variants (gRPC on 4317, HTTP/protobuf on 4318, HTTP/JSON on 4318) — that carries telemetry between SDKs, Collectors, and backends.
Collector topology is the most operationally flexible knob. Agents (sidecars or DaemonSets) excel at node-local enrichment, host signals, and isolated failures. Gateways (centralized Deployments) excel at tail-based sampling, centralized auth, and multi-tenant routing. Most non-trivial Kubernetes deployments combine both — a DaemonSet agent for cheap local work and a Deployment gateway for heavy centralized work, often managed by the OpenTelemetry Operator.
Finally, distributions package the Collector for different audiences: otelcol for minimal stable pipelines, otelcol-contrib for the practical default, vendor distributions for first-party support, and ocb-built custom distributions for supply-chain hygiene. Regardless of distribution, applications emit standard OTLP — keeping the seam between application code and operations exactly where the architecture intends.
In the next chapter, we’ll dive deeper into instrumenting applications: auto-instrumentation versus manual instrumentation, language-specific patterns, and how to evolve from zero-touch coverage to high-value custom spans and metrics.
Key Terms
| Term | Definition |
|---|---|
| OpenTelemetry API | The stable, vendor-neutral interface (Tracer/Meter/Logger and their providers) that libraries and application code use to emit telemetry. Default behavior is no-op when no SDK is configured. |
| OpenTelemetry SDK | The concrete in-process implementation of the API that adds samplers, processors, exporters, and resource detection. Applications own SDK configuration. |
| OpenTelemetry Collector | An out-of-process Go binary that receives, processes, and exports telemetry. Pipeline = receivers → processors → exporters. |
| OTLP | OpenTelemetry Protocol; the vendor-neutral wire protocol between SDKs, Collectors, and backends. Defined in protobuf with three transport variants. |
| OTLP/gRPC | Default OTLP transport: protobuf over gRPC/HTTP2 on port 4317. Lowest overhead; preferred in modern infrastructure. |
| OTLP/HTTP/protobuf | OTLP over HTTP/1.1 or HTTP/2 with binary protobuf body, port 4318. Friendly to traditional HTTP proxies and load balancers. |
| OTLP/HTTP/JSON | OTLP over HTTP with JSON body. Easiest to debug; required for browser telemetry. |
| Semantic conventions | Standardized attribute names (e.g., service.name, http.response.status_code, k8s.pod.name) that let any backend interpret telemetry from any language uniformly. |
| Agent deployment | Collector running per-node (DaemonSet) or per-pod (sidecar). Node-local collection, host signals, isolated failures. |
| Gateway deployment | Collector running as a centralized Deployment behind a Service. Enables tail sampling, centralized auth, multi-tenant routing. |
| Distribution | A pre-built bundle of Collector components (otelcol, otelcol-contrib, vendor builds, or custom ocb builds). |
| ocb (OpenTelemetry Collector Builder) | A CLI that builds a custom Collector binary from a manifest listing exactly the receivers, processors, and exporters you want. |
| Resource | OTLP/SDK construct describing the entity producing telemetry (service.name, host.name, deployment.environment, cloud attributes). |
| Partial success | OTLP response indicating that some items were rejected by the backend (e.g., rate-limited). Rejected items should not be retried. |
Chapter 6: Instrumentation: Manual, Automatic, and Zero-Code
Instrumentation is the act of teaching your code to talk about itself. Before any dashboard can be drawn, before any alert can fire, before any trace can be visualized, some agent — your code, a runtime hook, or the Linux kernel — must produce a signal. OpenTelemetry recognizes three broad strategies for producing those signals: manual instrumentation, where developers explicitly emit spans, metrics, and logs; automatic instrumentation, where libraries are patched at runtime to emit telemetry on your behalf; and zero-code instrumentation, where an external observer (often eBPF in the kernel, or a Kubernetes Operator injecting agents) generates telemetry with no awareness from the application itself.
Think of the three approaches as the difference between an author writing a memoir, a transcriptionist sitting beside them, and a hidden microphone in the ceiling. Each captures a story; each captures it differently; each has a place.
Learning Objectives
By the end of this chapter, you will be able to:
- Choose between manual, automatic, and zero-code instrumentation for a given language and runtime.
- Add custom spans, metrics, and attributes that follow OpenTelemetry semantic conventions.
- Configure auto-instrumentation in Kubernetes using the OpenTelemetry Operator and its Instrumentation CRD.
Figure 6.1: Three instrumentation approaches compared
graph TD
I[Application Telemetry]
I --> M[Manual<br/>developer writes<br/>tracer.start_span]
I --> A[Automatic<br/>runtime agent<br/>wraps libraries]
I --> Z[Zero-Code<br/>eBPF kernel probes<br/>or Operator injection]
M -->|effort: high| Q1[business attributes<br/>tenant.id, order.id]
A -->|effort: low| Q2[broad HTTP/DB/RPC<br/>library coverage]
Z -->|effort: none| Q3[polyglot, no rebuild<br/>kernel-wide visibility]
Section 1: Manual Instrumentation
Manual instrumentation puts the developer in direct control. You acquire a Tracer, Meter, or Logger from the OpenTelemetry SDK, then explicitly call methods to start spans, record measurements, or write structured log events. Manual code is verbose, but it is the only way to express domain context — concepts like order_id, tenant_id, payment.status, and feature_flag.variant that no auto-instrumenter could ever guess.
Acquiring Tracers, Meters, and Loggers
The OpenTelemetry SDK exposes three top-level provider objects: a TracerProvider, a MeterProvider, and a LoggerProvider. From each provider you obtain a named, versioned instance scoped to your library or module. The name is conventionally the import path of the instrumented package; it is recorded as the instrumentation scope name on every signal you emit (many exporters surface it as otel.scope.name), letting backends filter by which code produced the data [Source: https://opentelemetry.io/docs/specs/semconv/db/database-spans/].
// Java: acquire scoped tracer and meter
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.api.metrics.Meter;
Tracer tracer = GlobalOpenTelemetry.getTracer("com.acme.payments", "1.4.0");
Meter meter = GlobalOpenTelemetry.getMeter("com.acme.payments");
# Python: acquire scoped tracer and meter
from opentelemetry import trace, metrics
tracer = trace.get_tracer("acme.payments", "1.4.0")
meter = metrics.get_meter("acme.payments")
// Node.js: acquire scoped tracer and meter
const { trace, metrics } = require('@opentelemetry/api');
const tracer = trace.getTracer('acme-payments', '1.4.0');
const meter = metrics.getMeter('acme-payments');
Creating Spans and Recording Attributes
A span is the basic building block of a trace: a named, timed operation with a start, an end, attributes, events, and a status. The idiomatic pattern is to wrap a unit of work in a span block so it closes automatically, even on exceptions.
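// Java: a span around a business operation
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.context.Scope;

Span span = tracer.spanBuilder("authorize_payment")
    .setSpanKind(SpanKind.INTERNAL)
    .setAttribute("payment.method", "card")
    .setAttribute("tenant.id", tenantId)
    .startSpan();
// makeCurrent() puts the span in context so child spans attach to it
try (Scope scope = span.makeCurrent()) {
    boolean ok = gateway.authorize(amount);
    span.setAttribute("payment.outcome", ok ? "approved" : "declined");
    if (!ok) span.setStatus(StatusCode.ERROR, "gateway declined");
} catch (Exception e) {
    span.recordException(e);
    span.setStatus(StatusCode.ERROR);
    throw e;
} finally {
    span.end();  // always end the span, even on failure
}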
# Python: idiomatic context-manager span
with tracer.start_as_current_span("authorize_payment") as span:
    span.set_attribute("payment.method", "card")
    span.set_attribute("tenant.id", tenant_id)
    try:
        approved = gateway.authorize(amount)
        span.set_attribute("payment.outcome",
                           "approved" if approved else "declined")
    except Exception as exc:
        span.record_exception(exc)
        span.set_status(trace.StatusCode.ERROR, str(exc))
        raise
// Node.js: span around an async function
const { SpanStatusCode } = require('@opentelemetry/api');

await tracer.startActiveSpan('authorize_payment', async (span) => {
  span.setAttribute('payment.method', 'card');
  span.setAttribute('tenant.id', tenantId);
  try {
    const approved = await gateway.authorize(amount);
    span.setAttribute('payment.outcome', approved ? 'approved' : 'declined');
  } catch (err) {
    span.recordException(err);
    span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
    throw err;
  } finally {
    span.end(); // startActiveSpan does not end the span for you
  }
});
Attributes are key–value pairs. Their job is twofold: provide search and grouping handles in your trace UI, and feed dashboards through trace-to-metric pipelines. Use dot-namespaced keys (payment.method, not paymentMethod), follow semantic conventions when one exists, and treat your custom namespace (acme.*, payment.*) the way you would treat a public API — once dashboards depend on it, you cannot freely rename it.
Custom Metric Instruments
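Where spans tell stories of individual requests, metrics tell stories of aggregate behavior. OpenTelemetry provides synchronous instruments that you call inline with the work being measured, and asynchronous (observable) instruments that the SDK reads through a callback at collection time; the table lists the four you will reach for most often. Choose the instrument that matches the semantics of what you are counting, not just the dashboard panel you want.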
| Instrument | Direction | Aggregation | Typical Use |
|---|---|---|---|
| Counter | Monotonic up | Sum | Total requests, errors, bytes sent |
| UpDownCounter | Up or down | Sum | Active connections, queue depth, pool size |
| Histogram | Records observations | Bucketed distribution | Request latency, payload size |
| Gauge (observable) | Sampled | Last value | CPU utilization, current temperature |
# Python: a Counter and a Histogram for HTTP server work
import time

http_requests = meter.create_counter(
    name="http.server.requests",
    description="Number of HTTP requests served",
    unit="1",
)
http_latency = meter.create_histogram(
    name="http.server.request.duration",
    description="HTTP server request latency",
    unit="s",
)

# Inside the request handler:
start = time.monotonic()
try:
    response = handle(request)
    http_requests.add(1, {
        "http.request.method": request.method,
        "http.response.status_code": response.status,
        "http.route": request.matched_route,
    })
finally:
    # Record latency even when handle() raises
    http_latency.record(time.monotonic() - start, {
        "http.request.method": request.method,
        "http.route": request.matched_route,
    })
Three rules deserve special emphasis. First, units matter: declare seconds (s), bytes (By), or 1 for dimensionless counts so backends can convert and label correctly. Second, histogram bucket boundaries are usually picked by the SDK — override them only when you know the latency profile of your service. Third, every attribute you attach to a metric multiplies cardinality; we return to that in Section 4.
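To make the second rule concrete, here is a minimal sketch in plain Python (not the OpenTelemetry SDK) of how an explicit-bucket histogram assigns observations to buckets. It assumes the SDK falls back to the default boundaries listed in the metrics SDK specification, and compares them with the finer boundaries the HTTP semantic conventions advise for durations recorded in seconds. The helper `bucket_counts` and the sample latencies are hypothetical.

```python
from bisect import bisect_left

# Default explicit-bucket boundaries from the OpenTelemetry metrics SDK specification.
DEFAULT_BOUNDARIES = [0, 5, 10, 25, 50, 75, 100, 250, 500, 750,
                      1000, 2500, 5000, 7500, 10000]
# Boundaries advised by the HTTP semantic conventions for durations in seconds.
SECONDS_BOUNDARIES = [0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25,
                      0.5, 0.75, 1, 2.5, 5, 7.5, 10]


def bucket_counts(values, boundaries):
    """Count observations per bucket; each bucket's upper bound is inclusive."""
    counts = [0] * (len(boundaries) + 1)  # one extra bucket for overflow
    for value in values:
        counts[bisect_left(boundaries, value)] += 1
    return counts


latencies_s = [0.003, 0.02, 0.2, 1.5]  # hypothetical request durations in seconds
coarse = bucket_counts(latencies_s, DEFAULT_BOUNDARIES)
fine = bucket_counts(latencies_s, SECONDS_BOUNDARIES)
print(coarse)  # every observation lands in the single (0, 5] bucket
print(fine)    # the same observations spread across four buckets
```

With the coarse boundaries every sub-five-second request is indistinguishable, which is why a histogram measured in seconds usually needs boundaries chosen for that unit.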
Key Takeaway: Manual instrumentation gives you the only path to business-meaningful telemetry. Acquire scoped tracers and meters once per module, wrap units of work in spans with carefully chosen attributes, and pick metric instruments by semantics: Counter for monotonic totals, UpDownCounter for things that ebb and flow, Histogram for distributions, Gauge for current values.
Section 2: Automatic Instrumentation
Automatic instrumentation is the OpenTelemetry community’s answer to a hard question: “How do I get traces from libraries I did not write?” The answer differs by runtime, because each language exposes different hooks for intercepting library code without changing it.
Bytecode Injection: Java Agent and .NET Profiler
In Java, the opentelemetry-javaagent.jar attaches to the JVM at startup using the -javaagent flag. Internally, it registers a premain method via the Java Instrumentation API and uses a bytecode library such as ByteBuddy to rewrite classes as the classloader loads them, weaving span start/end logic around methods of interest [Source: https://javapro.io/wp-content/uploads/2026/02/JAVAPRO_01-2026.pdf].
OTEL_SERVICE_NAME=checkout-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
OTEL_EXPORTER_OTLP_PROTOCOL=grpc \
OTEL_TRACES_EXPORTER=otlp \
OTEL_LOGS_EXPORTER=otlp \
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=prod,service.version=2.3.1 \
java -javaagent:/opt/otel/opentelemetry-javaagent.jar -jar /app/checkout.jar
The agent ships with dozens of instrumentation modules for Servlet, Spring MVC and WebFlux, JAX-RS, gRPC, OkHttp, Apache HttpClient, JDBC, R2DBC, Hibernate, Mongo, Cassandra, Kafka, RabbitMQ, JMS, and many more. Because the rewriting happens at class load time, it works without source changes; however, the agent must attach at JVM start — you generally cannot retrofit a running JVM — and exotic custom classloaders sometimes need extra configuration [Source: https://javapro.io/wp-content/uploads/2026/02/JAVAPRO_01-2026.pdf].
.NET uses a conceptually similar mechanism via a CLR profiler, registered through CORECLR_ENABLE_PROFILING=1 and a companion DLL that injects IL into managed methods at JIT time.
Monkey-Patching: Python and Node.js
Dynamic languages do not need bytecode rewriting — they let you replace functions at runtime. Python’s auto-instrumentation uses the opentelemetry-instrument CLI to run a bootstrap before your application code; that bootstrap loads every installed opentelemetry-instrumentation-* package, each of which monkey-patches the relevant library at import time [Source: https://lumigo.io/opentelemetry/].
pip install opentelemetry-distro opentelemetry-exporter-otlp \
opentelemetry-instrumentation-requests \
opentelemetry-instrumentation-psycopg2 \
opentelemetry-instrumentation-flask
OTEL_SERVICE_NAME=orders-api \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
opentelemetry-instrument gunicorn orders.wsgi:application
For example, the requests instrumentation replaces the session's send path (requests.Session.send in recent releases) with a wrapper that opens a client span, records HTTP attributes, calls the original function, captures the response, and ends the span. Run the instrumentation before application code imports the patched library; any reference to the original function that was captured earlier stays unpatched.
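The wrap-and-replace pattern is easy to see in miniature. The sketch below is plain Python and does not use the real instrumentation package: `FakeSession`, `instrument`, and `recorded_spans` are hypothetical stand-ins for an HTTP client class, an instrumentor, and an exporter. It also shows why ordering matters: a reference captured before patching never passes through the wrapper.

```python
import functools
import time

recorded_spans = []  # stand-in for an exporter


class FakeSession:
    """Hypothetical stand-in for an HTTP client class."""

    def send(self, method, url):
        return 200  # pretend every call succeeds


def instrument(cls):
    """Replace cls.send with a wrapper that records a span-like dict."""
    original = cls.send

    @functools.wraps(original)
    def traced_send(self, method, url):
        start = time.monotonic()
        status = None
        try:
            status = original(self, method, url)  # call the unpatched function
            return status
        finally:
            recorded_spans.append({
                "name": method,
                "http.request.method": method,
                "url.full": url,
                "http.response.status_code": status,
                "duration_s": time.monotonic() - start,
            })

    cls.send = traced_send


cached_send = FakeSession.send  # reference captured before patching
instrument(FakeSession)

FakeSession().send("GET", "https://example.test/orders")          # traced
cached_send(FakeSession(), "GET", "https://example.test/orders")  # bypasses the wrapper
print(len(recorded_spans))  # 1
```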
Node.js relies on require hooks. @opentelemetry/auto-instrumentations-node registers handlers for require() (via require-in-the-middle or similar) and patches each module’s exports as it is loaded.
// tracing.js — must be required before any other module
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({}),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

NODE_OPTIONS="--require ./tracing.js" node app.js
Coverage and the Cross-Language Comparison
All three approaches expose the same observable surface — HTTP servers and clients, gRPC, SQL and NoSQL clients, message queues — but each has a different “blast radius” and quirks.
Figure 6.2: Java agent bytecode injection lifecycle
sequenceDiagram
participant OS as OS / shell
participant JVM as JVM
participant Agent as otel-javaagent.jar
participant CL as ClassLoader
participant App as Application code
participant Col as OTLP Collector
OS->>JVM: java -javaagent:otel.jar -jar app.jar
JVM->>Agent: invoke premain(Instrumentation)
Agent->>JVM: register ClassFileTransformer (ByteBuddy)
JVM->>App: start main()
App->>CL: load HttpServlet, JdbcDriver, ...
CL->>Agent: transform(class bytes)
Agent-->>CL: rewritten bytes with span hooks
App->>App: first request enters Servlet.service()
App->>Col: OTLP span exported<br/>(http.server.request)
| Aspect | Java | Python | Node.js |
|---|---|---|---|
| Primary mechanism | -javaagent bytecode rewrite | Monkey-patching at import | require hook + export patching |
| Entry point | JVM flag | opentelemetry-instrument CLI | NODE_OPTIONS=--require |
| Runtime hook | Java Instrumentation API + ByteBuddy | Dynamic attribute assignment | require-in-the-middle |
| Code changes | None | None | One bootstrap file |
| Context propagation | Thread-locals + executor wrappers | contextvars + async wrappers | Async hooks integrated per library |
| Configuration | OTEL_* env vars | OTEL_* env vars + CLI flags | OTEL_* env vars + NodeSDK options |
| Common pitfall | Custom classloaders | Import order before patch | Bundlers/serverless hide require |
A shared environment-variable contract spans every language: OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_TRACES_EXPORTER, OTEL_METRICS_EXPORTER, OTEL_LOGS_EXPORTER, OTEL_EXPORTER_OTLP_PROTOCOL, OTEL_TRACES_SAMPLER, OTEL_PROPAGATORS, and OTEL_RESOURCE_ATTRIBUTES. Operators love this: one ConfigMap, one set of variables, and every workload — whether Java, Python, or Node — speaks the same dialect [Source: https://lumigo.io/opentelemetry/].
When something goes wrong, two debugging axioms apply. No traces at all usually means an exporter is set to none, the endpoint protocol is mismatched (grpc vs. http/protobuf), or the bootstrap is loading too late. Duplicate spans almost always mean a library is being captured by both auto- and manual instrumentation; disable one for that library.
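The "no traces at all" checklist can be partly automated. The following is a rough sketch, assuming the conventional port assignments (gRPC on 4317, HTTP on 4318); `diagnose_otel_env` is a hypothetical helper, not part of any OpenTelemetry package. Because the default protocol differs between SDKs, it only flags a mismatch when the protocol is set explicitly, and a non-standard port is a hint rather than proof of a problem.

```python
from urllib.parse import urlparse

# Conventional OTLP ports: gRPC on 4317, HTTP variants on 4318.
EXPECTED_PORT = {"grpc": 4317, "http/protobuf": 4318, "http/json": 4318}


def diagnose_otel_env(env):
    """Return warnings for common misconfigurations that yield no traces."""
    warnings = []
    if env.get("OTEL_TRACES_EXPORTER", "otlp").lower() == "none":
        warnings.append("OTEL_TRACES_EXPORTER is 'none': spans are never exported")
    protocol = env.get("OTEL_EXPORTER_OTLP_PROTOCOL")
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if protocol and endpoint:
        port = urlparse(endpoint).port
        expected = EXPECTED_PORT.get(protocol)
        if port and expected and port != expected:
            warnings.append(
                f"protocol {protocol} usually listens on {expected}, "
                f"but the endpoint uses port {port}"
            )
    return warnings


bad_env = {
    "OTEL_TRACES_EXPORTER": "none",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector:4317",
}
good_env = {
    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector:4317",
}
for line in diagnose_otel_env(bad_env):
    print(line)
```

In practice you would pass `dict(os.environ)` from inside the affected container.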
Key Takeaway: Auto-instrumentation gives you HTTP, gRPC, database, and queue spans for free, but how it gets injected depends on the runtime. Java rewrites bytecode at class load; Python and Node monkey-patch at import or require. All three share the same OTEL_* configuration vocabulary, which is what makes mixed-language fleets tractable.
Section 3: Zero-Code Instrumentation
“Zero-code” is the marketing label for a stronger promise than auto-instrumentation: not only does the developer not write tracing code, the developer’s build artifact is not modified at all. Two distinct technologies live under this umbrella: eBPF agents that observe processes from the Linux kernel, and the OpenTelemetry Operator that injects auto-instrumentation agents into Kubernetes pods at admission time without changing container images.
eBPF-Based Auto-Instrumentation
eBPF (extended Berkeley Packet Filter) lets you load safe, sandboxed bytecode into the Linux kernel and attach it to kernel events — syscalls, function entry/exit, tracepoints, network events — at runtime, without recompiling the kernel [Source: https://ebpf.io/what-is-ebpf/]. An eBPF observability agent typically does the following [Source: https://logz.io/glossary/what-is-ebpf/] [Source: https://www.sysdig.com/blog/the-art-of-writing-ebpf-programs-a-primer]:
- Attaches kprobes to kernel networking functions such as tcp_sendmsg and tcp_cleanup_rbuf, and tracepoints such as sys_enter_sendto and sys_enter_recvfrom, to observe every byte that crosses TCP.
- Attaches uprobes to user-space functions in shared libraries (SSL_read/SSL_write in libssl, the Go runtime’s HTTP handlers, JVM JNI entry points) to see data before encryption or after decryption.
- Writes structured records into eBPF maps that a user-space agent drains at high frequency.
- Reconstructs requests in user space, matching send/recv into request–response pairs and parsing HTTP headers, gRPC HTTP/2 framing, and SQL handshakes, to produce L7 metrics and OTLP spans [Source: https://www.groundcover.com/ebpf].
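The last step, request reconstruction, is where most of the user-space work happens. The sketch below illustrates the idea for plaintext HTTP/1.1 only. The `SocketEvent` shape and `pair_http_exchanges` are assumptions made for illustration, not the data model of any real agent, and production agents also handle pipelining, partial reads, HTTP/2, and TLS.

```python
from dataclasses import dataclass


@dataclass
class SocketEvent:
    """One record drained from an eBPF map (hypothetical shape)."""
    conn_id: int
    direction: str   # "recv" = bytes arriving at the server, "send" = bytes leaving
    timestamp_ns: int
    payload: bytes


def pair_http_exchanges(events):
    """Match each inbound request with the next outbound response on the same connection."""
    pending = {}  # conn_id -> (method, path, start_ns)
    spans = []
    for ev in sorted(events, key=lambda e: e.timestamp_ns):
        first_line = ev.payload.split(b"\r\n", 1)[0].decode("ascii", "replace")
        parts = first_line.split(" ")
        is_request = len(parts) == 3 and parts[2].startswith("HTTP/")
        is_response = (len(parts) >= 2 and parts[0].startswith("HTTP/")
                       and parts[1].isdigit())
        if ev.direction == "recv" and is_request:
            pending[ev.conn_id] = (parts[0], parts[1], ev.timestamp_ns)
        elif ev.direction == "send" and is_response and ev.conn_id in pending:
            method, path, start_ns = pending.pop(ev.conn_id)
            spans.append({
                "http.request.method": method,
                "url.path": path,
                "http.response.status_code": int(parts[1]),
                "duration_ms": (ev.timestamp_ns - start_ns) / 1e6,
            })
    return spans


events = [
    SocketEvent(1, "recv", 1_000_000, b"GET /orders/42 HTTP/1.1\r\nHost: shop\r\n\r\n"),
    SocketEvent(2, "recv", 2_000_000, b"POST /pay HTTP/1.1\r\nHost: shop\r\n\r\n"),
    SocketEvent(1, "send", 4_000_000, b"HTTP/1.1 200 OK\r\n\r\n"),
    SocketEvent(2, "send", 9_000_000, b"HTTP/1.1 503 Service Unavailable\r\n\r\n"),
]
spans = pair_http_exchanges(events)
for span in spans:
    print(span)
```

Note what the reconstructed spans contain: method, path, status, and timing, but nothing the application did not put on the wire.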
Figure 6.3: eBPF zero-code instrumentation dataflow
flowchart TD
subgraph US[User space]
A1[App A<br/>Go binary]
A2[App B<br/>Java JVM]
A3[App C<br/>Python]
SSL[libssl.so]
DS[Beyla / Pixie DaemonSet<br/>user-space agent]
end
subgraph K[Linux kernel]
KP1[kprobe: tcp_sendmsg]
KP2[kprobe: tcp_cleanup_rbuf]
KP3[tracepoint: sys_enter_sendto]
UP[uprobe: SSL_read / SSL_write]
MAP[(eBPF map<br/>ring buffer)]
end
A1 -->|syscalls| KP1
A2 -->|syscalls| KP3
A3 -->|TLS calls| SSL
SSL --> UP
KP1 --> MAP
KP2 --> MAP
KP3 --> MAP
UP --> MAP
MAP --> DS
DS -->|OTLP spans + RED metrics| COL[OpenTelemetry Collector]
Because the hook points are in the kernel and in shared libraries, eBPF works for every language on the host — Go, Rust, Java, Python, Node, C++, even closed-source binaries — without touching their code [Source: https://www.contrastsecurity.com/glossary/ebpf]. The output is typically the four golden signals per service (latency, traffic, errors, saturation) plus distributed traces for common protocols [Source: https://newrelic.com/blog/observability/what-is-ebpf].
Tool landscape:
| Tool | Focus | Output |
|---|---|---|
| Grafana Beyla | Zero-code OTel auto-instrumentation for HTTP/gRPC/DB | OTLP traces + RED metrics |
| Pixie | K8s deep debugging, full request bodies, PxL scripts | In-cluster live data, dashboards |
| Cilium Tetragon | Runtime security and policy enforcement | Process/file/network events; can block |
| Odigos | eBPF + SDK hybrid OTel platform | OTLP routed by policy |
OpenTelemetry Operator and Auto-Instrumentation CRDs
For Kubernetes workloads, the OpenTelemetry Operator offers a different flavor of zero-code: it lets the cluster itself inject the auto-instrumentation agents we saw in Section 2, with no changes to your container images [Source: https://lumigo.io/opentelemetry/]. The Operator defines an Instrumentation Custom Resource that describes how to instrument, then a mutating admission webhook applies that recipe when pods are annotated for injection.
# 1. The Instrumentation CRD: a reusable recipe per language
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
  namespace: production
spec:
  exporter:
    endpoint: http://otel-collector.observability:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"
  resource:
    resourceAttributes:
      deployment.environment: prod
      service.namespace: payments
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
    env:
      - name: OTEL_INSTRUMENTATION_JDBC_ENABLED
        value: "true"
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest
---
# 2. The Deployment opts in via pod annotations
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "production/default-instrumentation"
    spec:
      containers:
        - name: app
          image: registry.example.com/checkout:2.3.1
When a pod with that annotation is created, the webhook injects an init container that copies the Java agent JAR into a shared volume, then patches the application container with JAVA_TOOL_OPTIONS=-javaagent:/otel-auto-instrumentation/javaagent.jar and the appropriate OTEL_* environment variables. Python and Node.js use analogous mechanisms — wrapping the entry command or injecting startup hooks — so the application image stays untouched [Source: https://lumigo.io/opentelemetry/].
This is the easiest path to fleet-wide instrumentation in Kubernetes: write the Instrumentation resource once, annotate deployments by language, and every new pod is born observable.
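Conceptually, the webhook's mutation is a pure function from one pod spec to another. The sketch below models it on plain dictionaries to show which parts of the pod change. It is a simplified illustration, not the Operator's implementation: the volume name, init-container name, and copy command are assumptions, and the real webhook also merges with any JAVA_TOOL_OPTIONS already set.

```python
import copy

AGENT_MOUNT = "/otel-auto-instrumentation"  # mount path named in the text
VOLUME_NAME = "otel-agent"                  # hypothetical volume name


def inject_java_agent(pod, agent_image, otel_env):
    """Return a patched copy of a pod dict; the input is left untouched."""
    patched = copy.deepcopy(pod)
    spec = patched["spec"]
    spec.setdefault("volumes", []).append({"name": VOLUME_NAME, "emptyDir": {}})
    # Init container copies the agent JAR into the shared volume before the app starts.
    spec.setdefault("initContainers", []).append({
        "name": "otel-agent-init",  # hypothetical name
        "image": agent_image,
        "command": ["cp", "/javaagent.jar", f"{AGENT_MOUNT}/javaagent.jar"],
        "volumeMounts": [{"name": VOLUME_NAME, "mountPath": AGENT_MOUNT}],
    })
    for container in spec["containers"]:
        container.setdefault("volumeMounts", []).append(
            {"name": VOLUME_NAME, "mountPath": AGENT_MOUNT})
        env = container.setdefault("env", [])
        env.append({"name": "JAVA_TOOL_OPTIONS",
                    "value": f"-javaagent:{AGENT_MOUNT}/javaagent.jar"})
        env.extend({"name": k, "value": v} for k, v in otel_env.items())
    return patched


pod = {"spec": {"containers": [
    {"name": "app", "image": "registry.example.com/checkout:2.3.1"}]}}
patched = inject_java_agent(
    pod,
    agent_image="example.test/autoinstrumentation-java:1.0",  # placeholder image
    otel_env={"OTEL_SERVICE_NAME": "checkout"},
)
env_names = [e["name"] for e in patched["spec"]["containers"][0]["env"]]
print(env_names)
```

The application image reference is identical before and after the patch, which is the whole point of this flavor of zero-code.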
Limits of Zero-Code Approaches
Zero-code is not a free lunch. Compare the three strategies:
| Capability | Manual | Auto (SDK) | eBPF Zero-Code |
|---|---|---|---|
| Captures HTTP/gRPC/DB calls | If coded | Yes, broad | Yes, broad |
| Captures business attributes (order_id) | Yes | No | No |
| Works on closed-source binaries | No | No | Yes |
| Sees TLS-encrypted in-process traffic | Yes | Yes | Only via libssl uprobes |
| Works on Windows/macOS | Yes | Yes | Linux only |
| Custom binary protocols | Yes | Sometimes | Rarely |
| Operational rollout effort | High | Low | Very low (DaemonSet) |
| Privilege required | App identity | App identity | CAP_SYS_ADMIN/CAP_BPF |
eBPF loses when business context matters (it has no idea that the request it just saw belongs to tenant=acme-corp paying order=ORD-9182), when traffic is encrypted with libraries the agent doesn’t know how to probe, when protocols are custom or binary, or when the platform isn’t Linux [Source: https://www.stackrox.io/blog/what-is-ebpf/]. The Operator-based path inherits all the limits of the SDK auto-instrumentation it ships — no domain attributes, library-version compatibility risk — but it is fantastic for getting wide coverage quickly.
Hybrid is the production-grade answer. Run eBPF for horizontal, language-agnostic baseline coverage of every workload on every node; use the Operator to inject SDK auto-instrumentation on every K8s pod; and add manual instrumentation on the critical business flows where you need tenant_id, feature_flag, payment.outcome, and the like to debug or to model SLOs.
Key Takeaway: Zero-code means two different things: eBPF probes in the kernel that see every process on a node, and the OpenTelemetry Operator injecting SDK agents into K8s pods at admission. Both eliminate code changes; neither captures business meaning. Combine them with manual instrumentation on the flows that matter.
Section 4: Semantic Conventions in Practice
Instrumentation that nobody can query is just expensive noise. Semantic conventions are OpenTelemetry’s contract that names things the same way everywhere, so a dashboard written against http.response.status_code works whether the data came from a Java agent, a Python monkey-patch, a Beyla eBPF probe, or your own manual code [Source: https://opentelemetry.io/docs/specs/semconv/db/database-spans/].
Attributes for HTTP, RPC, Database, Messaging
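The conventions divide into stable attributes (safe to anchor dashboards on) and experimental ones (subject to change). The most heavily used attributes are listed below; the HTTP names are stable, while the RPC, database, and messaging names shown have changed across semantic-convention releases, so confirm them against the version your SDKs emit: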
| Domain | Attribute | Value | Use |
|---|---|---|---|
| HTTP | http.request.method | GET, POST, … | Method dimension on RED metrics |
| HTTP | http.response.status_code | 200, 404, 503 | Error-rate panels, SLO burn |
| HTTP | http.route | /orders/{id} | Path grouping without ID explosion |
| HTTP | url.full | https://api/...?token=… | Debugging (sensitive — see hygiene below) |
| HTTP | server.address | api.acme.io | Backend grouping |
| HTTP | user_agent.original | raw UA string | Client breakdown |
| RPC | rpc.system | grpc, connect_rpc | Filter by RPC family |
| RPC | rpc.service / rpc.method | PaymentService / Authorize | Endpoint heatmaps |
| DB | db.system | postgresql, mysql, mongodb | Engine breakdown |
| DB | db.operation | SELECT, INSERT | Latency by operation |
| DB | db.statement | full text | Slow-query debugging (sensitive) |
| DB | db.name | logical DB/schema | Per-schema metrics |
| Messaging | messaging.system | kafka, rabbitmq | Broker breakdown |
| Messaging | messaging.destination.name | topic/queue name | Per-topic throughput |
| Messaging | messaging.operation | publish, receive, process | Lifecycle staging |
A clean dashboard query — “p95 HTTP server latency by route and status, last 30 minutes” — is just groupby(http.route, http.response.status_code) of histogram(http.server.request.duration). Because every service emits those attribute keys, the same panel works across the entire fleet, and across vendors that ingest OTLP natively [Source: https://lumigo.io/opentelemetry/].
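To show what that grouping means, here is a small sketch that computes a nearest-rank p95 per (http.route, http.response.status_code) pair from raw duration samples. It is illustrative only: real backends usually estimate percentiles from histogram buckets rather than raw samples, and `p95_by_route_and_status` and the sample data are hypothetical.

```python
from collections import defaultdict


def p95_by_route_and_status(samples):
    """Group durations by (route, status) and return the nearest-rank p95 per group."""
    groups = defaultdict(list)
    for s in samples:
        key = (s["http.route"], s["http.response.status_code"])
        groups[key].append(s["duration_ms"])
    result = {}
    for key, durations in groups.items():
        durations.sort()
        rank = (95 * len(durations) + 99) // 100  # ceil(0.95 * n), 1-based
        result[key] = durations[rank - 1]
    return result


# Twenty successful requests (10 ms .. 200 ms) and one slow failure.
samples = [
    {"http.route": "/orders/{id}", "http.response.status_code": 200,
     "duration_ms": 10 * i}
    for i in range(1, 21)
]
samples.append({"http.route": "/orders/{id}", "http.response.status_code": 503,
                "duration_ms": 900})
p95 = p95_by_route_and_status(samples)
print(p95)
```

Because the grouping keys are semantic-convention names, the same function works on telemetry from any service that follows them.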
Resource Attributes for Service Identity
Attributes describe a single signal; resource attributes describe the emitter of every signal it produces. They live in the OTLP Resource and are typically set once, at SDK initialization, via OTEL_RESOURCE_ATTRIBUTES or runtime resource detectors.
Stable resource attributes you should always set:
OTEL_SERVICE_NAME=checkout-api
OTEL_RESOURCE_ATTRIBUTES=\
service.namespace=payments,\
service.version=2.3.1,\
service.instance.id=checkout-api-7d4f-x9w2,\
deployment.environment=prod,\
k8s.namespace.name=production,\
k8s.deployment.name=checkout-api,\
k8s.pod.name=checkout-api-7d4f-x9w2,\
cloud.provider=aws,\
cloud.region=us-east-1
The OpenTelemetry Operator can fill many of these for you automatically by reading the pod’s downward API; in Kubernetes you should rarely need to set Kubernetes resource attributes by hand.
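For intuition about how those two variables combine, the sketch below parses them the way the specification describes: comma-separated key=value pairs with percent-encoded values, and OTEL_SERVICE_NAME taking precedence over any service.name in the list. `resource_from_env` is a hypothetical helper written for illustration; real SDKs perform this step in their resource detectors.

```python
from urllib.parse import unquote


def resource_from_env(env):
    """Build a resource-attribute dict from the two standard environment variables."""
    attributes = {}
    raw = env.get("OTEL_RESOURCE_ATTRIBUTES", "")
    for pair in raw.split(","):
        if "=" not in pair:
            continue  # skip empty or malformed entries
        key, value = pair.split("=", 1)
        attributes[key.strip()] = unquote(value.strip())
    # OTEL_SERVICE_NAME wins over a service.name set in the list above.
    if env.get("OTEL_SERVICE_NAME"):
        attributes["service.name"] = env["OTEL_SERVICE_NAME"]
    return attributes


env = {
    "OTEL_SERVICE_NAME": "checkout-api",
    "OTEL_RESOURCE_ATTRIBUTES": (
        "service.name=ignored,service.version=2.3.1,"
        "deployment.environment=prod,team.name=core%20payments"
    ),
}
resource = resource_from_env(env)
print(resource)
```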
Avoiding Label and Attribute Cardinality Bombs
Cardinality is the silent killer of observability platforms. Every unique combination of attribute values produces a distinct time series for metrics and a distinct index entry for traces. Pricing, retention, query speed, and even cluster stability degrade with cardinality. The single most important instrumentation discipline is asking, before you attach an attribute, “How many distinct values can this take?”
Figure 6.4: OpenTelemetry Operator pod injection workflow
sequenceDiagram
participant Dev as Developer
participant API as kube-apiserver
participant Op as OTel Operator
participant WH as Mutating Webhook
participant Init as Init container
participant App as App container
participant Col as Collector
Dev->>API: kubectl apply Instrumentation CR<br/>(java/python/nodejs recipes)
Op->>API: watch + cache Instrumentation CR
Dev->>API: apply Deployment with annotation<br/>inject-java: "ns/instr-name"
API->>WH: AdmissionReview (Pod create)
WH->>WH: read annotation + CR recipe
WH-->>API: patched Pod spec<br/>(init container + JAVA_TOOL_OPTIONS + OTEL_* env)
API->>Init: schedule init container
Init->>App: copy javaagent.jar to shared volume
App->>App: JVM starts with -javaagent
App->>Col: OTLP spans, metrics, logs
Figure 6.5: Cardinality explosion from a single attribute
graph LR
subgraph BASE[Safe baseline]
B1[method ~10]
B2[status ~60]
B3[route ~200]
B1 --> BX[10 x 60 x 200<br/>= 120K series]
B2 --> BX
B3 --> BX
end
subgraph BAD[Add user.id]
U[user.id<br/>~1,000,000]
BX --> E[120K x 1M<br/>= 120 billion series]
U --> E
end
E -->|TSDB OOM, cost spike| X[Cardinality bomb]
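The arithmetic behind Figure 6.5 is a worst-case product of per-attribute cardinalities. The sketch below reproduces the figure's numbers; `series_count` is a hypothetical helper, and real series counts are usually lower because not every combination occurs.

```python
import math


def series_count(cardinalities):
    """Worst-case number of time series: the product of per-attribute cardinalities."""
    return math.prod(cardinalities.values())


baseline = {
    "http.request.method": 10,
    "http.response.status_code": 60,
    "http.route": 200,
}
with_user = {**baseline, "user.id": 1_000_000}

print(f"{series_count(baseline):,}")   # 120,000
print(f"{series_count(with_user):,}")  # 120,000,000,000
```

Running the same calculation before adding any new attribute is a cheap way to answer the "how many distinct values" question.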
Rules of thumb:
| Attribute candidate | Cardinality | Use? |
|---|---|---|
| http.request.method | ~10 | Yes |
| http.response.status_code | ~60 | Yes |
| http.route (templated) | ~hundreds | Yes |
| db.operation | ~10 | Yes |
| service.version | ~tens | Yes |
| tenant.id (large SaaS) | ~thousands+ | Carefully — often spans only |
| url.full with raw path | ~unbounded | No on metrics; redact on spans |
| user.id | ~unbounded | Span attribute only; not on metrics |
| request.id / trace.id | per-request | Span only — never a metric label |
| db.statement raw | per-call | Span only, redacted/parameterized |
Three practical mitigations:
- Use templates, not raw values. Push http.route=/orders/{id} to your metric labels, leaving url.full for span-only attributes you debug with.
- Drop or hash at the Collector. If you cannot prevent a high-cardinality attribute at the source, the OpenTelemetry Collector’s attributes, transform, and redaction processors can drop, truncate, or one-way-hash before export [Source: https://www.honeycomb.io/blog/opentelemetry-best-practices-data-prep-cleansing].
- Separate metric and span schemas. It is fine, and even desirable, for a span to carry tenant.id and order.id while the metric derived from those spans carries only tenant.tier and payment.method. Spans are sampled and indexed; metrics are aggregated forever.
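The three mitigations can be sketched in a few lines of plain Python. Everything here is illustrative: `template_path`, `hash_value`, `metric_attributes`, the regular expression, and the allow-list are assumptions, not Collector processor behavior. In a real service, http.route should come from the framework's matched route; the regex is only a fallback idea.

```python
import hashlib
import re

# Numeric or UUID-shaped path segments are treated as identifiers.
ID_SEGMENT = re.compile(r"/(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F-]{27})(?=/|$)")
# Attributes considered safe as metric dimensions (hypothetical allow-list).
METRIC_ALLOW_LIST = {"http.request.method", "http.route",
                     "tenant.tier", "payment.method"}


def template_path(path):
    """Mitigation 1: replace identifier segments with a {id} placeholder."""
    return ID_SEGMENT.sub("/{id}", path)


def hash_value(value, salt):
    """Mitigation 2: one-way hash that hides the raw value but keeps it groupable."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def metric_attributes(span_attributes):
    """Mitigation 3: keep only allow-listed attributes on derived metrics."""
    return {k: v for k, v in span_attributes.items() if k in METRIC_ALLOW_LIST}


span_attrs = {
    "http.request.method": "GET",
    "http.route": template_path("/orders/9182/items/7"),
    "tenant.id": hash_value("acme-corp", salt="example-salt"),
    "tenant.tier": "enterprise",
    "order.id": "ORD-9182",
}
print(span_attrs["http.route"])       # /orders/{id}/items/{id}
print(metric_attributes(span_attrs))  # tenant.id and order.id are dropped
```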
The same mitigations double as PII hygiene controls. url.full, url.query, client.address, network.peer.address, and db.statement may all contain personal data — email addresses, search terms, session tokens. The Honeycomb best-practices guide recommends a layered strategy: redact at the SDK where possible, allow-list at the Collector for everything you cannot vouch for, and hash where you need to preserve cardinality without preserving identity [Source: https://www.honeycomb.io/blog/opentelemetry-best-practices-data-prep-cleansing].
Stability and Evolution
OpenTelemetry semantic conventions evolve under a three-state model: Experimental, Stable, and Deprecated. The migration from http.method (old) to http.request.method (new) is a real-world example: both names exist for a period, the new name is preferred, and Collector transform processors can normalize older signals so your dashboards survive the transition [Source: https://opentelemetry.io/docs/specs/semconv/db/database-spans/]. Anchor dashboards on Stable attributes; treat Experimental ones as opt-in extras.
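The normalization step amounts to a key-renaming pass over each signal's attributes. The sketch below shows the logic in plain Python for three HTTP renames; `normalize_http_attributes` is a hypothetical helper, and in a real pipeline the equivalent rule would live in a Collector processor rather than in application code.

```python
# Old HTTP attribute names mapped to their current equivalents.
RENAMES = {
    "http.method": "http.request.method",
    "http.status_code": "http.response.status_code",
    "http.url": "url.full",
}


def normalize_http_attributes(attributes):
    """Rename legacy keys; when both names are present, the new name's value wins."""
    normalized = {}
    for key, value in attributes.items():
        new_key = RENAMES.get(key, key)
        if key in RENAMES and new_key in normalized:
            continue  # the new-style key was already seen; keep its value
        normalized[new_key] = value
    return normalized


legacy = {"http.method": "GET", "http.status_code": 200, "tenant.tier": "pro"}
mixed = {"http.method": "legacy-value", "http.request.method": "POST"}
print(normalize_http_attributes(legacy))
print(normalize_http_attributes(mixed))
```

After this pass a dashboard keyed on http.request.method sees data from both old and new emitters.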
Key Takeaway: Semantic conventions are what make OpenTelemetry portable. Use stable attribute names (http.request.method, http.response.status_code, db.system, db.operation) on every signal, set resource attributes once for service identity, and treat cardinality and PII as instrumentation-time concerns: once they reach the Collector, the damage is harder to contain.
Chapter Summary
This chapter mapped the three pillars of OpenTelemetry instrumentation. Manual instrumentation — acquiring named tracers and meters, creating spans, choosing the right metric instrument — is where business meaning enters telemetry; nothing else can capture tenant_id, feature_flag, or payment.outcome. Automatic instrumentation runs in three flavors depending on the runtime: Java’s -javaagent bytecode rewriting, Python’s opentelemetry-instrument monkey-patching, and Node.js’s require-hook patching, all sharing a common OTEL_* environment-variable contract. Zero-code instrumentation comes in two forms: eBPF agents that watch kernel and library hooks for every process on a host, and the OpenTelemetry Operator’s Instrumentation CRD that injects SDK agents into Kubernetes pods at admission. Each strategy has a sweet spot: eBPF for fast, broad, polyglot coverage; the Operator for fleet-wide K8s rollouts; manual for the business-critical paths. Finally, semantic conventions — stable attribute names for HTTP, RPC, database, and messaging, plus resource attributes for service identity, plus disciplined cardinality and PII hygiene — are what turn instrumentation into vendor-portable, durable observability.
In Chapter 7 we turn to distributed tracing with OpenTelemetry: how a single request is stitched into one trace as it crosses services, message brokers, and language runtimes, and how to read that trace to see who called whom, how long each hop took, and where things went wrong. The Collector's receivers, processors, and exporters, and the pipelines that normalize, sample, and route telemetry, are covered in depth in Chapter 10.
Key Terms
| Term | Definition |
|---|---|
| Tracer | Named, versioned SDK object used to create spans for a given library or module. |
| Meter | Named, versioned SDK object used to create metric instruments (Counter, UpDownCounter, Histogram, Gauge). |
| Instrument | A typed metric primitive (Counter for monotonic sums, UpDownCounter for ebbing values, Histogram for distributions, Gauge for sampled current values). |
| Auto-instrumentation | Runtime patching of common libraries to emit telemetry without source changes — Java agent, Python monkey-patching, Node.js require hooks. |
| OpenTelemetry Operator | Kubernetes operator that manages Collectors and uses an Instrumentation CRD + mutating webhook to inject auto-instrumentation agents into annotated pods. |
| eBPF | Linux kernel facility for loading sandboxed bytecode attached to kprobes, uprobes, and tracepoints; enables language-agnostic zero-code observability. |
| Semantic Conventions | OpenTelemetry-defined standard attribute names and meanings for HTTP, RPC, database, messaging, and other domains; the basis of vendor-portable dashboards. |
| Resource Attributes | Attributes describing the emitter of telemetry — service.name, service.version, deployment.environment, Kubernetes identifiers — set once per SDK instance. |
Chapter 7: Distributed Tracing with OpenTelemetry
In a monolithic application, when a user clicks “Place Order” and the page hangs, a developer can attach a debugger and walk through the call stack. The execution path is linear, the variables are local, and the entire story of the request lives in one process. In a cloud-native system, that same click might cross a dozen services, two message brokers, three databases, and a handful of language runtimes — each with its own logs, clocks, and failure modes. The stack trace is gone. What replaces it is the distributed trace: a stitched-together view of how a single request flowed through the system, who called whom, how long each hop took, and where things went wrong.
OpenTelemetry (OTel) is the open standard that makes those traces portable. It defines a data model for traces, a set of propagation formats for carrying trace identity over the wire, and APIs/SDKs for emitting trace data from instrumented applications. This chapter explains how OpenTelemetry traces are structured, how trace context is propagated across service and protocol boundaries, how to instrument code so traces are useful (and not just voluminous), and how to visualize and analyze trace data in tools like Jaeger and Grafana Tempo.
Learning Objectives
By the end of this chapter, you should be able to:
- Construct and propagate trace context across service and protocol boundaries using W3C Trace Context, B3, and Jaeger formats.
- Build readable, debuggable traces with appropriate span hierarchy, naming, attributes, status, and events.
- Visualize trace data in Jaeger or Tempo to diagnose latency and error patterns, and derive RED metrics from spans.
7.1 Trace Data Model
A trace is, formally, a directed acyclic graph of spans that share a common TraceId. Each span represents one unit of work — an HTTP handler, a database query, a queue publish — and carries a name, a start/end timestamp, a parent reference, attributes, events, status, and a kind. A trace is what you get when you collect all the spans for one request and arrange them in causal order.
The mental model worth carrying is this: a span is to a trace what a stack frame is to a stack trace, except spans cross process boundaries and overlap in time when work happens in parallel.
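To make the data model concrete, here is a minimal sketch in plain Python (not the OpenTelemetry SDK) that arranges span records into the parent-child order a waterfall view would show. The Span dataclass, its field names, and render_tree are illustrative assumptions; the sample values are borrowed from Figure 7.1.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Illustrative span record; field names are assumptions, not SDK types."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int


def render_tree(spans: list[Span]) -> list[str]:
    """Return one indented line per span, parents before children."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        # Siblings are ordered by start time, as in a waterfall view.
        for span in sorted(children[parent_id], key=lambda s: s.start_ms):
            duration = span.end_ms - span.start_ms
            lines.append(f"{'  ' * depth}{span.name} ({duration}ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


spans = [
    Span("a", None, "POST /checkout", 0, 480),
    Span("b", "a", "payment.charge", 20, 310),
    Span("c", "a", "inventory.reserve", 20, 180),
    Span("d", "b", "POST /charge", 30, 300),
]
tree = render_tree(spans)
```

Because siblings can overlap in time (payment.charge and inventory.reserve both start at 20ms), the tree captures causality while the timestamps capture concurrency.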
Figure 7.1: Parent-child span tree for a single trace
flowchart TD
A["SERVER<br/>POST /checkout<br/>checkout-svc<br/>0 - 480ms"] --> B["CLIENT<br/>payment.charge<br/>checkout-svc<br/>20 - 310ms"]
A --> C["CLIENT<br/>inventory.reserve<br/>checkout-svc<br/>20 - 180ms"]
B --> D["SERVER<br/>POST /charge<br/>payments-svc<br/>30 - 300ms"]
C --> E["SERVER<br/>POST /reserve<br/>inventory-svc<br/>30 - 170ms"]
D --> F["CLIENT<br/>db.query users<br/>payments-svc<br/>50 - 110ms"]
D --> G["CLIENT<br/>POST gateway<br/>payments-svc<br/>120 - 290ms"]
E --> H["CLIENT<br/>db.update stock<br/>inventory-svc<br/>40 - 160ms"]
classDef server fill:#1f3a5f,stroke:#58a6ff,color:#fff
classDef client fill:#3a2f5f,stroke:#a78bfa,color:#fff
class A,D,E server
class B,C,F,G,H client
TraceId, SpanId, and TraceFlags
Three identifiers form the backbone of every span context:
- TraceId: A 128-bit value (32 lowercase hex characters when serialized) that is globally unique per trace. Every span in a single logical request shares the same TraceId. It must not be all zeros [Source: https://www.w3.org/TR/trace-context/].
- SpanId: A 64-bit value (16 lowercase hex characters) that uniquely identifies a single span within a trace. Each new span gets a fresh SpanId; the parent’s SpanId is recorded in the child span as its parent_span_id.
- TraceFlags: An 8-bit field where only bit 0 (LSB) is currently defined: 01 means the trace is sampled/recorded, 00 means it is not [Source: https://www.w3.org/TR/trace-context/].
Together, these three fields form the SpanContext, the minimal envelope of identity that must travel between services for a trace to remain coherent. Everything else — names, attributes, events, status — is local to each span and is exported to a backend; only the SpanContext crosses the wire.
A useful analogy: think of TraceId as the conference badge color (everyone at the same event shares it), SpanId as the individual badge number (each person is unique), and TraceFlags as whether the photographer is allowed to publish your photo (sampled or not).
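The three identifiers can be sketched as a small immutable record. This is a stand-in written for illustration, not the SDK's own SpanContext type; the helper names and the use of the secrets module for random IDs are assumptions.

```python
import secrets
from dataclasses import dataclass


def _is_lower_hex(value: str, length: int) -> bool:
    return len(value) == length and all(c in "0123456789abcdef" for c in value)


@dataclass(frozen=True)
class SpanContext:
    """Illustrative stand-in for the identity that crosses the wire."""
    trace_id: str     # 32 lowercase hex chars (128 bits)
    span_id: str      # 16 lowercase hex chars (64 bits)
    trace_flags: int  # 8-bit field; bit 0 is the sampled flag

    def is_valid(self) -> bool:
        """Well-formed IDs only; all-zero IDs are invalid."""
        return (
            _is_lower_hex(self.trace_id, 32) and self.trace_id != "0" * 32
            and _is_lower_hex(self.span_id, 16) and self.span_id != "0" * 16
        )

    @property
    def sampled(self) -> bool:
        return bool(self.trace_flags & 0x01)


def new_root_context(sampled: bool = True) -> SpanContext:
    """Start a new trace with random IDs."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), int(sampled))


def child_of(parent: SpanContext) -> SpanContext:
    """A child keeps the TraceId and flags but gets a fresh SpanId."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.trace_flags)
```

The record is frozen on purpose: a span context identifies a span and should not change after the span has started.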
Span Kinds: SERVER, CLIENT, PRODUCER, CONSUMER, INTERNAL
SpanKind is a small enum that tells backends what role the span plays in a distributed conversation. Without it, a backend cannot tell whether a span represents the inbound side or the outbound side of an RPC, and dependency graphs become guesswork.
| Span Kind | Role | Typical Example | Pairs With |
|---|---|---|---|
| SERVER | Synchronous inbound request handler | HTTP handler, gRPC server method | CLIENT (caller) |
| CLIENT | Synchronous outbound call to a remote service | http.Client.Do, gRPC client stub, JDBC query | SERVER (callee) |
| PRODUCER | Asynchronous send onto a queue/topic | kafka.Producer.Send, SQS publish | CONSUMER (later) |
| CONSUMER | Asynchronous receive/process from a queue/topic | Kafka consumer loop, SQS poll-and-process | PRODUCER (earlier) |
| INTERNAL | Local work, not a network hop | Business-logic function, JSON parsing, an expensive loop | n/a |
The default kind is INTERNAL. Setting the kind correctly is what allows Tempo’s service-graph processor (covered in §7.4) to pair CLIENT spans with downstream SERVER spans and build a dependency map [Source: https://grafana.com/docs/loki/latest/query/log_queries/].
Span Status, Events, and Links
Beyond identity and kind, a span carries three additional structures that turn raw timing data into a diagnosis:
- Status: one of UNSET (default), OK, or ERROR. Crucially, OTel does not treat every HTTP error code as a span error: on SERVER spans a 4xx is generally not an error from the server’s perspective (the client made a bad request), but it is an error from a CLIENT span’s perspective. The instrumentation must set status deliberately. A span with ERROR status should usually also have a status_description string.
- Events: timestamped, named annotations within a span, each with its own attributes. Events are the OTel-native way to record exceptions: instrumentation typically calls span.recordException(e), which adds an event named exception with exception.type, exception.message, and exception.stacktrace attributes. Events let you mark intra-span moments (“cache miss,” “retry attempt 2,” “circuit breaker opened”) without creating new spans.
- Links: a reference from one span to one or more other SpanContext values that are causally related but not the strict parent. Links are the right tool for fan-in patterns: a batch job processing 1,000 messages should have one span with links to the producing spans, not 1,000 parent references that would force one giant trace. SDKs cap the number of links per span (128 by default), so very large batches need a raised limit or a subset of links.
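The kind-dependent treatment of HTTP codes described under Status can be written down as a small decision function. This is a sketch under the rules stated above; StatusCode here is a local enum for illustration and status_for_http is a hypothetical helper, not an SDK call.

```python
from enum import Enum


class StatusCode(Enum):
    """Local stand-in for the three span status values."""
    UNSET = 0
    OK = 1
    ERROR = 2


def status_for_http(span_kind: str, http_status: int) -> StatusCode:
    """Map an HTTP response code to a span status, by span kind.

    SERVER spans: only 5xx is an error; 4xx means the caller was wrong.
    CLIENT spans: 4xx and 5xx are both errors from the caller's view.
    Everything else stays UNSET, leaving OK for explicit application use.
    """
    if span_kind not in ("SERVER", "CLIENT"):
        raise ValueError(f"unsupported span kind: {span_kind}")
    threshold = 500 if span_kind == "SERVER" else 400
    return StatusCode.ERROR if http_status >= threshold else StatusCode.UNSET
```

Returning UNSET rather than OK for successful responses is deliberate: OK is left for application code that wants to assert success explicitly.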
Figure 7.2: Span kinds and how they pair across a distributed call
graph LR
subgraph svcA["Service A"]
A1["SERVER<br/>POST /checkout"]
A2["INTERNAL<br/>validate_cart"]
A3["CLIENT<br/>POST /charge"]
A4["PRODUCER<br/>orders.created publish"]
A1 --> A2
A1 --> A3
A1 --> A4
end
subgraph svcB["Service B"]
B1["SERVER<br/>POST /charge"]
B2["CLIENT<br/>db.query"]
B1 --> B2
end
subgraph svcC["Kafka + Consumer"]
C1["CONSUMER<br/>orders.created process"]
end
A3 -.->|"HTTP<br/>traceparent"| B1
A4 -.->|"Kafka<br/>traceparent header"| C1
classDef server fill:#1f3a5f,stroke:#58a6ff,color:#fff
classDef client fill:#3a2f5f,stroke:#a78bfa,color:#fff
classDef producer fill:#1f5f3a,stroke:#34d399,color:#fff
classDef consumer fill:#5f3a1f,stroke:#fbbf24,color:#fff
classDef internal fill:#2a2a2a,stroke:#888,color:#fff
class A1,B1 server
class A3,B2 client
class A4 producer
class C1 consumer
class A2 internal
Key Takeaway: A trace is a graph of spans tied together by a shared TraceId; each span carries a SpanId, a kind (SERVER/CLIENT/PRODUCER/CONSUMER/INTERNAL), a status, attributes, events, and optional links. Setting these correctly is what turns raw timing into a diagnosable picture of a request.
7.2 Context Propagation
A trace only works if every service in the request path reads, preserves, and forwards the SpanContext. That cross-process handoff is called context propagation, and it is implemented by propagators — small objects that know how to inject context into outbound carriers (HTTP headers, gRPC metadata, Kafka record headers) and extract context from inbound carriers.
OpenTelemetry’s default wire format for HTTP is the W3C Trace Context standard, but the SDKs also ship propagators for B3 (Zipkin) and Jaeger to interoperate with legacy systems.
W3C Trace Context: traceparent and tracestate
The W3C spec defines two HTTP headers [Source: https://www.w3.org/TR/trace-context/]:
The traceparent header is mandatory and has a fixed, dash-separated format:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             |  |                                |                |
             |  |                                |                +-- trace-flags (01 = sampled)
             |  |                                +-- parent span-id (16 hex)
             |  +-- trace-id (32 hex, globally unique per trace)
             +-- version (currently "00")
The four fields are:
- version: 00 today. A receiver that sees a higher version should still try to parse the four fields below and ignore anything after them; a version 00 header with extra trailing content is invalid.
- trace-id: 32 lowercase hex characters (16 bytes). Maps to OTel’s TraceId.
- span-id (called parent-id in the spec): 16 lowercase hex characters (8 bytes). This is the sender’s span, which becomes the parent of any span the receiver creates.
- trace-flags: 2 hex characters (one byte). Only bit 0 is defined; 01 is sampled, 00 is not.
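A short parser makes the field layout tangible. The sketch below handles only version 00 and is not the OpenTelemetry propagator; ParentContext, parse_traceparent, and format_traceparent are names invented for this example.

```python
import re
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, all lowercase hex, for version "00" only.
_TRACEPARENT_00 = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class ParentContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[ParentContext]:
    """Return the parsed context, or None if the header is invalid."""
    match = _TRACEPARENT_00.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are explicitly invalid
    return ParentContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Build the header a service sends downstream with its own span-id."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


incoming = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
```

A receiver that gets None back would typically start a new trace rather than reject the request.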
The tracestate header is optional and carries an ordered list of vendor-specific key–value pairs:
tracestate: ot=foo:bar,ro=1,congo=t61rcWkgMzE
Rules worth knowing [Source: https://www.w3.org/TR/trace-context/]:
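- Order matters: a vendor that adds or updates its entry moves it to the left end of the list, so the leftmost entry is the most recent.
- At most 32 entries; implementations should be able to propagate at least 512 characters of combined header value.
- Keys are lowercase letters, digits, and limited punctuation; multi-tenant vendor keys use tenant@vendor syntax.
- Values are printable ASCII; no commas or equals signs in values.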
tracestate is where vendors stash routing hints, custom sampling decisions, or legacy correlation IDs without disturbing the standardized traceparent.
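The ordering and size rules can be illustrated with two helpers. This is a simplified sketch: it does no character validation of keys or values, and parse_tracestate and upsert_tracestate are hypothetical names, not SDK functions.

```python
def parse_tracestate(header: str) -> list[tuple[str, str]]:
    """Split a tracestate header into ordered (key, value) pairs."""
    entries = []
    for member in header.split(","):
        member = member.strip()
        if not member:
            continue  # empty list members are skipped
        key, sep, value = member.partition("=")
        if sep and key and value:
            entries.append((key, value))
    return entries


def upsert_tracestate(header: str, key: str, value: str, max_entries: int = 32) -> str:
    """Set our entry, move it to the front, and cap the list length."""
    others = [(k, v) for k, v in parse_tracestate(header) if k != key]
    updated = [(key, value)] + others
    # Entries beyond the cap are dropped from the right end of the list.
    return ",".join(f"{k}={v}" for k, v in updated[:max_entries])


state = upsert_tracestate("ot=foo:bar,ro=1", "ro", "0")
```

Here the ro entry is rewritten and moved to the left, while the untouched ot entry keeps its value and shifts right.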
B3 and Jaeger Legacy Formats
Many production systems pre-date W3C Trace Context. OpenTelemetry includes propagators for two legacy formats so new W3C-aware services can interoperate with them.
B3 (Zipkin) has two variants. The multi-header form uses separate headers:
X-B3-TraceId: 4bf92f3577b34da6a3ce929d0e0e4736
X-B3-SpanId: 00f067aa0ba902b7
X-B3-ParentSpanId: 5e0c63257de34c92
X-B3-Sampled: 1
X-B3-Flags: 0
The single-header form packs everything into one header:
b3: 4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-1-5e0c63257de34c92
The X-B3-Flags: 1 value (or the equivalent in the single-header form) signals debug, which forces the trace to be sampled regardless of X-B3-Sampled.
Jaeger uses a single header named uber-trace-id:
uber-trace-id: 4bf92f3577b34da6a3ce929d0e0e4736:00f067aa0ba902b7:5e0c63257de34c92:1
The fields are colon-separated: trace-id : span-id : parent-span-id : flags, where flags is a one-byte bit field written as hex digits: 0x01 = sampled, 0x02 = debug.
The three formats encode the same logical SpanContext but differ in surface syntax, sampling semantics, and how they handle debug traces:
| Aspect | W3C Trace Context | B3 Multi-header | B3 Single-header | Jaeger (uber-trace-id) |
|---|---|---|---|---|
| Header(s) | traceparent, tracestate | X-B3-TraceId, X-B3-SpanId, X-B3-Sampled, X-B3-Flags, X-B3-ParentSpanId | b3 | uber-trace-id |
| TraceId length | 128-bit (32 hex) | 64 or 128-bit | 64 or 128-bit | 64 or 128-bit |
| Field separator | - | (separate headers) | - | : |
| Sampled flag | trace-flags bit 0 (0x01) | X-B3-Sampled: 1/0 | 3rd field: 1/0/d | flags 0x01 |
| Debug flag | not defined | X-B3-Flags: 1 → force sampled | 3rd field: d → force sampled | flags 0x02 |
| Vendor extensions | tracestate | none | none | none |
| OTel default? | Yes (paired with Baggage) | Opt-in propagator | Opt-in propagator | Opt-in propagator |
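To see that the legacy formats carry the same identity, the following sketch parses the b3 single header and uber-trace-id values into one common record. It is plain Python rather than the OpenTelemetry propagator code; LegacyContext and both function names are made up for illustration, and edge cases (a bare deny-only b3 value, an absent sampling field) are simplified.

```python
from typing import NamedTuple, Optional


class LegacyContext(NamedTuple):
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    sampled: bool
    debug: bool


def parse_b3_single(header: str) -> LegacyContext:
    """Parse {trace}-{span}-{sampling}-{parent}; the last two are optional."""
    parts = header.split("-")
    if len(parts) < 2:
        raise ValueError("b3 header needs at least trace-id and span-id")
    # Simplification: an absent sampling field is treated as not sampled.
    state = parts[2] if len(parts) > 2 else "0"
    parent = parts[3] if len(parts) > 3 else None
    debug = state == "d"
    return LegacyContext(parts[0], parts[1], parent, debug or state == "1", debug)


def parse_uber_trace_id(header: str) -> LegacyContext:
    """Parse {trace}:{span}:{parent}:{flags}."""
    trace_id, span_id, parent, flags_text = header.split(":")
    flags = int(flags_text, 16)  # bit field: 0x01 sampled, 0x02 debug
    debug = bool(flags & 0x02)
    return LegacyContext(
        trace_id, span_id, None if parent == "0" else parent,
        debug or bool(flags & 0x01), debug,
    )


from_b3 = parse_b3_single(
    "4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-1-5e0c63257de34c92"
)
from_jaeger = parse_uber_trace_id(
    "4bf92f3577b34da6a3ce929d0e0e4736:00f067aa0ba902b7:5e0c63257de34c92:1"
)
```

Both calls yield the same record, which is the point: only the surface syntax differs.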
OpenTelemetry SDKs let you compose multiple propagators so a single service can accept and emit several formats at once. In Go, that looks like [Source: https://www.w3.org/TR/trace-context/]:
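// Imports assumed:
//   "go.opentelemetry.io/otel"
//   "go.opentelemetry.io/otel/propagation"
//   "go.opentelemetry.io/contrib/propagators/b3"
//   "go.opentelemetry.io/contrib/propagators/jaeger"
otel.SetTextMapPropagator(
    propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{}, // W3C traceparent + tracestate
        propagation.Baggage{},      // W3C baggage
        b3.New(b3.WithInjectEncoding(b3.B3SingleHeader)), // b3 single header
        jaeger.Jaeger{},            // uber-trace-id
    ),
)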
On extract, the propagators run in the order listed, each looking for its own header. If a request carries more than one format, a later propagator that finds a valid context overwrites the result of an earlier one, so list order decides which format wins. On inject, every enabled propagator writes its format, so the outbound request carries traceparent, b3, and uber-trace-id simultaneously. That redundancy is the migration trick: roll out W3C alongside B3/Jaeger, let downstream services read whichever they understand, then remove legacy formats once the fleet is fully W3C-aware.
The mapping between sampled flags is straightforward but must be preserved: B3 X-B3-Flags=1 (debug) and Jaeger flags & 0x02 both force-sample and should map to W3C trace-flags=01; B3 X-B3-Sampled=1 and Jaeger flags & 0x01 map directly to W3C bit 0.
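The mapping rules above can be sketched as two small functions. They are illustrative helpers with invented names, not part of any propagator library, and they assume the header values have already been read from the request.

```python
from typing import Optional


def w3c_flags_from_b3(x_b3_sampled: Optional[str], x_b3_flags: Optional[str]) -> str:
    """Translate B3 multi-header sampling fields into W3C trace-flags."""
    debug = x_b3_flags == "1"
    sampled = x_b3_sampled in ("1", "true")  # some older tracers sent "true"
    return "01" if (debug or sampled) else "00"


def w3c_flags_from_jaeger(flags: int) -> str:
    """Translate the Jaeger flags bit field into W3C trace-flags."""
    sampled = bool(flags & 0x01)
    debug = bool(flags & 0x02)
    return "01" if (debug or sampled) else "00"
```

Note what is lost in translation: W3C trace-flags has no debug bit, so a debug trace arrives downstream as an ordinary sampled trace.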
Figure 7.3: traceparent propagation across a composite-propagator chain
sequenceDiagram
participant C as Client
participant A as Service A<br/>(W3C + B3)
participant B as Service B<br/>(W3C only)
participant D as Service C<br/>(B3 only)
C->>A: HTTP request<br/>traceparent: 00-{trace-id}-{span-C}-01
A->>A: extract context,<br/>start SERVER span {span-A}
A->>B: HTTP request<br/>traceparent: 00-{trace-id}-{span-A}-01<br/>b3: {trace-id}-{span-A}-1<br/>baggage: user.id=12345
B->>B: extract traceparent,<br/>start SERVER span {span-B}
B->>D: HTTP request<br/>traceparent: 00-{trace-id}-{span-B}-01<br/>b3: {trace-id}-{span-B}-1
D->>D: extract b3 header,<br/>start SERVER span {span-D}
D-->>B: response
B-->>A: response
A-->>C: response
Note over C,D: Same trace-id flows through<br/>all four hops despite mixed formats
Baggage for Cross-Cutting Attributes
Baggage is a separate W3C specification that travels alongside (but independently of) trace context. It is a set of key–value pairs stored on the Context — not on any one span — and propagated via the baggage HTTP header [Source: https://blog.nimblepros.com/blogs/otel/].
baggage: user.id=12345, tenant=acme-corp, feature.checkout_v2=enabled
Baggage is for cross-cutting, request-scoped context that every service might want to attach to its own spans, logs, or metrics: user ID, tenant ID, feature-flag variant, geographic region, support-case ID. Set it once at the edge, and every downstream hop can read it.
The distinction from span attributes is important:
| Aspect | Span attributes | Baggage |
|---|---|---|
| Lives on | A single span | The context (independent of any span) |
| Propagated downstream | No — only the span’s own service sees it | Yes — automatically injected into every outbound request |
| Typical use | Describe this operation (db.statement) | Cross-cutting request data (user.id, tenant.id) |
| Visible to | Only that span in the trace | All spans, logs, and metrics in the request flow |
| Auto-copied to spans? | n/a | No — instrumentation must opt-in to copy baggage to spans |
A critical security note: untrusted clients can send any baggage header they want. Edge services should sanitize incoming baggage and apply an allowlist of accepted keys, and outbound calls to third parties should strip internal baggage to avoid leaking identifiers [Source: https://blog.nimblepros.com/blogs/otel/]. Never put secrets, tokens, or PII into baggage; treat it as data that may be stored long-term and visible to any service in the path.
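An allowlist filter at the edge might look like the following sketch. The allowlisted keys are hypothetical, the parser ignores member properties and does not enforce size limits, and parse_baggage and sanitize_baggage are illustrative names rather than SDK functions.

```python
from urllib.parse import quote, unquote

# Hypothetical allowlist; choose keys that fit your own system.
ALLOWED_BAGGAGE_KEYS = frozenset({"tenant", "feature.checkout_v2"})


def parse_baggage(header: str) -> dict[str, str]:
    """Parse 'k1=v1, k2=v2;prop' into a dict, ignoring member properties."""
    items: dict[str, str] = {}
    for member in header.split(","):
        key_value = member.split(";", 1)[0]  # drop optional ;properties
        key, sep, value = key_value.partition("=")
        if sep and key.strip():
            items[key.strip()] = unquote(value.strip())
    return items


def sanitize_baggage(header: str, allowed: frozenset = ALLOWED_BAGGAGE_KEYS) -> str:
    """Keep only allowlisted keys and re-encode the header."""
    kept = {k: v for k, v in parse_baggage(header).items() if k in allowed}
    return ",".join(f"{k}={quote(v, safe='')}" for k, v in kept.items())


clean = sanitize_baggage(
    "user.id=12345, tenant=acme-corp, feature.checkout_v2=enabled"
)
```

An allowlist is preferable to a denylist here because unknown keys are dropped by default, which is the safe failure mode at a trust boundary.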
Key Takeaway: Propagation is what stitches local spans into a global trace. W3C Trace Context (traceparent + tracestate) is OpenTelemetry’s default; composite propagators let you co-emit B3 and Jaeger headers for backward compatibility; and W3C Baggage carries cross-cutting request data (but never secrets) alongside the trace.
7.3 Building Useful Traces
Instrumenting a codebase is the easy part: auto-instrumentation libraries will produce spans for every HTTP request and database call out of the box. Producing traces that engineers actually use during an outage takes more thought. The difference between a noisy trace and a debuggable one usually comes down to span names, attribute hygiene, error recording, and knowing when not to create a span.
Naming Spans for Searchability
A span name is the primary identifier users see in Jaeger or Tempo. It should be low-cardinality (so backends can index and aggregate it) but descriptive enough to identify the operation.
The OpenTelemetry semantic conventions provide good defaults:
- HTTP server spans: the route template, not the raw URL. GET /users/{id} is right; GET /users/12345 is wrong because it creates a unique span name per user, a classic cardinality bomb. The numeric ID belongs in a path attribute (http.target, or url.path in the current conventions) or a custom attribute.
- HTTP client spans: the method plus the route or host: GET /users/{id} or POST api.payments.svc.
- Database spans: the operation and target: SELECT users, INSERT orders. Put the full statement in the db.statement attribute, not the name.
- Messaging spans: <destination> <operation>, for example orders.created publish or orders.created process.
- Internal spans: a stable verb-noun describing the work: validate_cart, compute_shipping_quote.
A simple test: if you imagine 10,000 spans being created, how many distinct names should appear? Tens or low hundreds, not millions. If your span name embeds an order ID or a UUID, it is too specific.
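The 10,000-span test can be run literally. The sketch below is a fallback heuristic for cases where the framework does not expose a route template (most do, and that template is the better source); low_cardinality_name and the {id} and {uuid} placeholders are assumptions made for this example.

```python
import re

_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.IGNORECASE
)


def low_cardinality_name(method: str, path: str) -> str:
    """Replace ID-like path segments with placeholders and drop the query."""
    segments = []
    for segment in path.split("?", 1)[0].split("/"):
        if segment.isdigit():
            segments.append("{id}")
        elif _UUID.match(segment):
            segments.append("{uuid}")
        else:
            segments.append(segment)
    return f"{method.upper()} {'/'.join(segments)}"


# 10,000 different users collapse into a single span name.
names = {low_cardinality_name("GET", f"/users/{i}") for i in range(10_000)}
```

The heuristic will miss opaque IDs such as slugs or base64 tokens, which is one more reason to prefer the framework's own route template when it is available.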
Attributes vs. Events vs. Status
Once a span is named, the question becomes what to attach to it. The three OTel facilities serve different purposes:
| Information shape | Use | Example |
|---|---|---|
| Stable property of the operation | Attribute | http.request.method=GET, db.system=postgresql, messaging.system=kafka |
| Cardinality is bounded and useful for filtering | Attribute | http.response.status_code=503, feature_flag.checkout_v2=enabled |
| Timestamped moment within the span | Event | cache.miss, retry.attempt, circuit_breaker.opened |
| Exception / error | Event + status | recordException(e) + setStatus(ERROR, "payment declined") |
| Pass/fail outcome | Status | OK, ERROR |
Follow the OTel semantic conventions for attribute names religiously: http.request.method, http.route, http.response.status_code, db.system, db.statement, rpc.system, rpc.service, rpc.method, messaging.system, messaging.destination. Older instrumentation may still emit the pre-stabilization HTTP names (http.method, http.status_code), so check which generation your libraries produce before building queries on them. Consistent naming is what allows backends to render request panels, build service graphs, and correlate signals across the LGTM stack [Source: https://newrelic.com/blog/log/enrich-logs-with-opentelemetry-collector].
Recording Exceptions and Error Status
When an instrumented function throws, two things should happen: the exception is recorded as an event, and the span’s status is set to ERROR. Most language SDKs provide a single helper:
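from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# gateway, card, amount, and PaymentDeclined come from the application's payment code.
with tracer.start_as_current_span("charge_payment") as span:
    span.set_attribute("payment.amount_cents", amount)
    try:
        gateway.charge(card, amount)
    except PaymentDeclined as e:
        span.record_exception(e)  # adds an "exception" event to the span
        span.set_status(trace.StatusCode.ERROR, "payment declined")
        raise  # re-raise so callers still see the failure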
Three habits to internalize:
- Record then re-raise unless you are deliberately swallowing the exception. Recording without re-raising can hide bugs.
- Status ERROR is the signal that backends use to color spans red and that Tempo’s metrics-generator uses to count errors in RED metrics (§7.4). If you forget to set it, your error rate dashboards will lie.
- HTTP 4xx is not automatically an error on SERVER spans: the server worked correctly; the client sent a bad request. Reserve ERROR for 5xx or unhandled exceptions on the server side. On CLIENT spans, both 4xx and 5xx are typically errors from the caller’s perspective.
Avoiding Span Explosion in Tight Loops
The most common instrumentation mistake is creating a span per iteration of a loop. Imagine processing 5,000 Kafka messages in a single poll:
# Wrong: 5,001 spans per batch, blows up trace storage and indexing
with tracer.start_as_current_span("process_batch") as batch_span:
    for msg in batch:
        with tracer.start_as_current_span("process_message") as msg_span:
            msg_span.set_attribute("messaging.message_id", msg.id)
            handle(msg)
Most messages are uninteresting, but they all get the same span treatment. Better patterns:
- One span per batch, events for noteworthy items: the parent process_batch span captures the whole loop; add_event("message.failed", attributes={"message_id": msg.id}) records only the messages that erred.
- Sample inside the loop: create child spans only for messages that fail, or for 1-in-N successes.
- Use metrics for counts, not spans: if you want to know “how many messages were processed,” a counter is the right tool, not 5,000 spans.
- Use links for fan-in: if a downstream batch span needs to reference the producers of its inputs, use links rather than parent relationships so the trace graph stays bounded.
A useful guideline: a span should represent a unit of work big enough that you might one day look at it in a UI. If you would never click on it, do not create it.
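The disciplined pattern can be sketched without any SDK at all. RecordedSpan below is a hypothetical stand-in that only records what it is given, and the processed integer stands in for a real counter instrument; with the OpenTelemetry API the same shape applies, using a real span and meter instead.

```python
from dataclasses import dataclass, field


@dataclass
class RecordedSpan:
    """Hypothetical stand-in for an SDK span; it only records what it is given."""
    name: str
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    def add_event(self, name: str, attributes: dict) -> None:
        self.events.append((name, attributes))


def process_batch(batch: list[dict], handle) -> tuple[RecordedSpan, int]:
    """One span per batch, one event per failure, and a plain counter."""
    span = RecordedSpan("process_batch", {"messaging.batch.size": len(batch)})
    processed = 0  # stands in for a messages_processed_total counter
    for msg in batch:
        try:
            handle(msg)
            processed += 1
        except Exception as exc:  # record the failure, keep draining the batch
            span.add_event(
                "message.failed",
                {"message_id": msg["id"], "exception.type": type(exc).__name__},
            )
    return span, processed
```

For a 5,000-message batch with a handful of failures, this yields one span, a handful of events, and one counter increment per success, instead of 5,001 spans.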
Figure 7.4: Span explosion vs. disciplined instrumentation
flowchart TB
subgraph wrong["Wrong: 5001 spans per batch"]
W1["process_batch CONSUMER"]
W2["process_message x 5000<br/>uniform child spans<br/>blows up trace storage"]
W1 --> W2
end
subgraph right["Right: 1 span + events + metrics"]
R1["process_batch CONSUMER<br/>messaging.batch.size=5000"]
R2["handle_failed_message<br/>(child span, only on error) x 3"]
R3["events: cache.miss,<br/>retry.attempt, dlq.send"]
R4["counter: messages_processed_total<br/>(metrics, not spans)"]
R1 --> R2
R1 -.-> R3
R1 -.-> R4
end
classDef bad fill:#5f1f1f,stroke:#f87171,color:#fff
classDef good fill:#1f5f3a,stroke:#34d399,color:#fff
classDef neutral fill:#1f3a5f,stroke:#58a6ff,color:#fff
class W1,W2 bad
class R1,R2 good
class R3,R4 neutral
Key Takeaway: Useful traces follow the semantic conventions, use low-cardinality span names, distinguish attributes (stable properties) from events (timestamped moments), set ERROR status deliberately, and avoid emitting a span per loop iteration; events, counters, or links usually serve better.
7.4 Trace Visualization and Analysis
Generated traces are only valuable if engineers can find and read them. Two open-source backends dominate the OpenTelemetry ecosystem: Jaeger, the original CNCF tracing project, and Grafana Tempo, the high-scale, object-storage-backed tracing backend in the Grafana LGTM stack. Both ingest OTLP, both render traces as Gantt-style waterfalls, and both produce RED-style metrics from spans — but the way they store, scale, and integrate differs.
Jaeger UI
Jaeger’s UI is a focused trace explorer. The core views are:
- Search: filter by service, operation, time range, duration, tags, and a free-form trace ID lookup. Tag search (http.status_code=500, error=true) is how on-call engineers narrow down to relevant traces during an incident.
- Trace timeline: a Gantt chart of spans, indented by parent-child relationship, with duration bars and color-coded status. Selecting a span reveals its attributes, events (including stack traces from record_exception), and SpanContext IDs.
- Trace graph / topology: a node-and-edge view of one trace, useful for understanding which services contributed to that specific request.
- System architecture / dependency graph: an aggregated view across many traces showing which services call which.
- Service Performance Monitoring (SPM): derived RED metrics per service and operation, typically powered by the OpenTelemetry Collector’s spanmetrics connector (the successor to the older spanmetrics processor) sitting in front of Jaeger.
Jaeger stores traces in Cassandra, Elasticsearch, or OpenSearch (with experimental support for object storage), and is well-suited to mid-scale deployments.
Grafana Tempo
Tempo takes a different design tack: it stores spans in object storage (S3, GCS, Azure Blob) and indexes only the TraceId, making per-trace lookup cheap but full-text span search expensive. Tempo’s bet is that most trace queries come from exemplars — a metric or log line that already gives you the TraceId — and that for ad-hoc search you can use a separate index or the TraceQL query language.
Tempo’s signature feature is the metrics-generator, a component that reads spans from the ingest pipeline and emits Prometheus metrics in real time [Source: https://grafana.com/docs/loki/latest/query/log_queries/]. It runs two processors:
- span_metrics: Per-service, per-operation counters and duration histograms.
- service_graphs: Per-edge (caller → callee) metrics built by pairing CLIENT spans with SERVER spans within a configurable wait window (typically 10s).
A minimal Tempo configuration:
metrics_generator:
  processor:
    service_graphs:
      enabled: true
      wait: 10s
      max_items: 10000
      peer_attributes:
        - peer.service
        - db.name
        - messaging.system
    span_metrics:
      enabled: true
      dimensions:
        - http.method
        - http.status_code
        - rpc.system
      include_span_kinds:
        - server
        - consumer
Service Maps and Dependency Graphs
A service map is the aggregate dependency graph derived from many traces. Both Jaeger and Tempo can render one. The accuracy of the map depends entirely on the correctness of:
- service.name resource attribute on every span (consistent across services).
- SpanKind set to CLIENT/SERVER/PRODUCER/CONSUMER rather than the default INTERNAL.
- peer.service (or db.name, messaging.system) attribute on outbound spans for inferring the callee when the downstream service is not instrumented.
When all three are right, the service map shows a directed graph with edges colored by error rate, throughput, or p95 latency — a near-real-time view of system topology that is impossible to maintain by hand.
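The pairing logic behind a service map can be sketched in a few lines. This is a simplified illustration for the spans of a single trace, not Tempo's or Jaeger's implementation: a real processor keys spans by trace ID as well, waits for late arrivals, and falls back to peer attributes for uninstrumented callees. SpanRecord and service_edges are invented names.

```python
from collections import Counter
from typing import NamedTuple, Optional


class SpanRecord(NamedTuple):
    service: str  # value of the service.name resource attribute
    span_id: str
    parent_id: Optional[str]
    kind: str


def service_edges(spans: list[SpanRecord]) -> Counter:
    """Count caller -> callee edges by pairing CLIENT parents with SERVER children."""
    by_id = {span.span_id: span for span in spans}
    edges: Counter = Counter()
    for span in spans:
        if span.kind != "SERVER" or span.parent_id is None:
            continue
        parent = by_id.get(span.parent_id)
        if parent is not None and parent.kind == "CLIENT":
            edges[(parent.service, span.service)] += 1
    return edges


trace_spans = [
    SpanRecord("checkout-svc", "a", None, "SERVER"),
    SpanRecord("checkout-svc", "b", "a", "CLIENT"),
    SpanRecord("checkout-svc", "c", "a", "CLIENT"),
    SpanRecord("payments-svc", "d", "b", "SERVER"),
    SpanRecord("inventory-svc", "e", "c", "SERVER"),
    SpanRecord("payments-svc", "f", "d", "CLIENT"),  # callee not instrumented
]
edges = service_edges(trace_spans)
```

Span f produces no edge because no SERVER span answers it, which is exactly the gap that peer.service is meant to fill. If the kinds were left at INTERNAL, no edges would be found at all.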
Figure 7.5: Service map derived from trace data with RED metrics on each edge
flowchart LR
web["web<br/>SERVER"]
gw["api-gateway<br/>SERVER + CLIENT"]
orders["orders<br/>SERVER + CLIENT"]
pay["payments<br/>SERVER + CLIENT"]
inv["inventory<br/>SERVER + CLIENT"]
db[("db<br/>peer.service")]
web -->|"rate 920/s<br/>err 0.1%<br/>p95 95ms"| gw
gw -->|"rate 880/s<br/>err 0.2%<br/>p95 180ms"| orders
gw ===>|"rate 412/s<br/>err 3.4%<br/>p95 820ms"| pay
orders -->|"rate 720/s<br/>err 0.1%<br/>p95 60ms"| inv
orders -->|"rate 720/s<br/>err 0.0%<br/>p95 40ms"| db
pay -->|"rate 410/s<br/>err 0.1%<br/>p95 35ms"| db
inv -->|"rate 720/s<br/>err 0.0%<br/>p95 30ms"| db
classDef ok fill:#1f5f3a,stroke:#34d399,color:#fff
classDef hot fill:#5f1f1f,stroke:#f87171,color:#fff
classDef store fill:#3a2f5f,stroke:#a78bfa,color:#fff
class web,gw,orders,inv ok
class pay hot
class db store
Trace-Based Metrics: RED and USE Generation
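The RED method (Rate, Errors, Duration) is the de facto SLI vocabulary for request-driven services. Tempo’s span_metrics processor emits the time series needed to compute RED via PromQL. The queries below use the metric and label names the metrics-generator emits by default (traces_spanmetrics_calls_total and traces_spanmetrics_latency_bucket, labeled by service, span_kind, and status_code); check them against your Tempo version and configuration.
Rate per service:
sum by (service) (
  rate(traces_spanmetrics_calls_total{span_kind="SPAN_KIND_SERVER"}[5m])
)
Error rate per service:
sum by (service) (
  rate(
    traces_spanmetrics_calls_total{
      span_kind="SPAN_KIND_SERVER",
      status_code="STATUS_CODE_ERROR"
    }[5m]
  )
)
p95 duration per service:
histogram_quantile(
  0.95,
  sum by (service, le) (
    rate(traces_spanmetrics_latency_bucket{span_kind="SPAN_KIND_SERVER"}[5m])
  )
)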
The service_graphs processor emits parallel metrics keyed by (client, server) so you can ask the same questions per edge rather than per service — useful when a problem isn’t in a service but in a particular dependency between two services.
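For intuition about what these metrics summarize, the RED arithmetic can be done by hand on a list of span records. This is a sketch with an invented record shape (duration_ms and status keys) and an exact nearest-rank percentile; histogram-based p95 values interpolate within buckets and will differ slightly.

```python
def red_summary(spans: list[dict], window_seconds: float) -> dict:
    """Compute rate, error ratio, and p95 duration for one service's spans."""
    if not spans:
        raise ValueError("need at least one span")
    durations = sorted(span["duration_ms"] for span in spans)
    errors = sum(1 for span in spans if span["status"] == "ERROR")
    # Nearest-rank percentile: ceil(0.95 * n), computed with integers.
    rank = (95 * len(durations) + 99) // 100
    return {
        "rate_per_s": len(spans) / window_seconds,
        "error_ratio": errors / len(spans),
        "p95_ms": durations[rank - 1],
    }


# 100 server spans observed over a 50-second window; the three slowest failed.
sample = [
    {"duration_ms": d, "status": "ERROR" if d > 97 else "OK"}
    for d in range(1, 101)
]
summary = red_summary(sample, window_seconds=50)
```

Running the same function over spans grouped by (caller, callee) instead of by service gives the per-edge view.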
The USE method (Utilization, Saturation, Errors) applies to resources rather than requests, but trace spans can contribute to USE too. A span on a database client carries db.system and peer.service attributes; aggregating its duration and error counts gives you per-database errors and saturation indicators. Resource-level utilization (CPU, memory) still comes from Prometheus exporters, but traces give you USE from the consumer’s perspective: how much of a downstream resource each caller is using.
Caveats for Trace-Derived Metrics
Two pitfalls deserve emphasis [Source: https://grafana.com/docs/loki/latest/query/log_queries/]:
- Sampling distorts rate. Head-based sampling at 10% means trace-derived metrics report roughly 1/10 of true request rate. Tail-based sampling that preferentially keeps errors and slow traces over-represents errors in the metric stream. Many teams treat trace-derived metrics as a correlation tool and keep direct application metrics as the SLO source of truth.
- Cardinality. Each dimension (http.method, http.status_code) becomes a Prometheus label. Adding user_id or request_id to span metrics will blow up cardinality and can overwhelm your TSDB. Stick to bounded labels: service, operation, status, method, coarse path.
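Both caveats reduce to simple arithmetic, sketched below. The correction only holds for uniform head-based sampling with a known probability; it does not apply to tail-based sampling, where the kept traces are biased on purpose. Function names and the label counts in the usage are illustrative assumptions.

```python
from math import prod


def estimate_true_rate(observed_rate: float, sampling_probability: float) -> float:
    """Scale a rate measured on head-sampled spans back to the full population."""
    if not 0 < sampling_probability <= 1:
        raise ValueError("sampling_probability must be in (0, 1]")
    return observed_rate / sampling_probability


def worst_case_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on time series: the product of distinct values per label."""
    return prod(label_value_counts.values())


bounded = worst_case_series({"service": 20, "method": 5, "status_code": 3})
unbounded = worst_case_series(
    {"service": 20, "method": 5, "status_code": 3, "user_id": 100_000}
)
```

With bounded labels the upper limit is 300 series per metric; adding a user_id label with 100,000 values raises it to 30 million, which is why the guideline is to keep identifiers on spans and out of span metrics.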
Jaeger SPM vs. Tempo Metrics-Generator
The two systems converge on the same goal — RED metrics from spans — but differ in placement and tightness of integration:
| Aspect | Jaeger SPM | Tempo Metrics-Generator |
|---|---|---|
| Implementation | OTel Collector spanmetrics connector (formerly the spanmetrics processor) in front of Jaeger | Built into Tempo as a first-class component |
| Service graphs | Often a separate processor or external tool | Native service_graphs processor in metrics-generator |
| Storage backend | Cassandra, Elasticsearch, OpenSearch | Object storage (S3/GCS/Azure Blob) |
| Metrics destination | Prometheus | Prometheus / Mimir |
| Grafana integration | Good | Native — designed alongside Grafana |
| Multi-tenancy | Limited | First-class per-tenant isolation |
| Best fit | Existing Jaeger deployments, on-prem Cassandra/ES stacks | Cloud-native, object-storage-backed, LGTM stack adopters |
Either choice still benefits from running an OTel Collector in front of the tracing backend: the Collector handles batching, retries, tail-based sampling, attribute scrubbing, and multi-backend fan-out, leaving the backend to do storage and query.
Key Takeaway: Jaeger and Tempo both render traces as Gantt waterfalls and derive RED metrics from spans, but Tempo couples object storage with a built-in metrics-generator that emits both per-service and per-edge metrics into Prometheus. Trace-derived metrics are excellent for exploration and service maps, but sampling and cardinality limits mean teams should keep direct application metrics as the SLO source of truth.
Chapter Summary
Distributed tracing makes the invisible visible: it stitches the journey of one request across services, queues, and databases into a single causal graph. OpenTelemetry defines that graph as a tree of spans sharing a TraceId, each span carrying a SpanId, a kind (SERVER/CLIENT/PRODUCER/CONSUMER/INTERNAL), a status, attributes, events, and optional links. Setting the kind correctly is what enables backends to build dependency maps; setting status and recording exceptions is what makes errors searchable.
Context propagation is the wire-level mechanism that keeps spans in the same trace. The W3C Trace Context spec defines the traceparent header (version, trace-id, parent span-id, sampled flag) and the optional tracestate for vendor-specific data, and is OpenTelemetry’s default propagator. Legacy B3 (multi-header or single-header b3:) and Jaeger (uber-trace-id) propagators are available for interoperability, and composite propagators let services emit and accept multiple formats simultaneously — the foundation of a gradual migration to W3C. Baggage is a separate, complementary propagator for cross-cutting request data (user.id, tenant, feature flags) that lives on the context rather than on any one span; it must never carry secrets and should be sanitized at trust boundaries.
Useful traces require discipline: low-cardinality span names following the OTel semantic conventions, attributes for stable properties, events for timestamped moments, deliberate ERROR status on real failures, and the willingness to not emit a span per loop iteration. Events, counters, and links usually serve the high-cardinality cases better than per-iteration spans.
Visualization closes the loop. Jaeger is the classic, focused trace explorer with strong search and dependency-graph features. Grafana Tempo stores spans cheaply in object storage and ships with a metrics-generator that turns spans into Prometheus RED metrics — per service via span_metrics and per dependency edge via service_graphs. Both tools support the RED method directly and contribute to USE-style resource views. Trace-derived metrics are powerful for correlation and dependency analysis but should not replace direct application metrics for SLO accounting, because sampling and cardinality choices distort the signal.
A well-instrumented system gives an on-call engineer three things at 3 a.m.: a metric that says “errors are up,” a log line with a TraceId, and a trace that points at the exact span where the request died. Everything in this chapter exists to make that handoff work.
Key Terms
| Term | Definition |
|---|---|
| TraceId | 128-bit identifier (32 hex chars) shared by every span in a single logical request; globally unique per trace. |
| SpanId | 64-bit identifier (16 hex chars) that uniquely identifies one span within a trace. |
| span kind | Enum (SERVER, CLIENT, PRODUCER, CONSUMER, INTERNAL) describing the span’s role in a distributed conversation; required for accurate service graphs. |
| traceparent | W3C header carrying version, trace-id, parent span-id, and trace-flags: 00-<trace-id>-<span-id>-<flags>. Mandatory for W3C Trace Context. |
| tracestate | Optional W3C header carrying an ordered list of vendor-specific key=value entries; up to ~32 entries, ~512 chars total. |
| trace-flags | 8-bit field in traceparent; bit 0 is the sampled/recorded flag (01 = sampled). |
| baggage | Key–value data on the distributed context (W3C baggage header) propagated across services for cross-cutting request data like user.id, tenant.id, feature flags. |
| W3C Trace Context | The W3C standard for HTTP trace propagation defining traceparent and tracestate; OpenTelemetry’s default propagator. |
| B3 | Legacy Zipkin propagation format using either multi-headers (X-B3-TraceId, X-B3-SpanId, X-B3-Sampled, X-B3-Flags) or a single b3: header. |
| uber-trace-id | Legacy Jaeger propagation header: <trace>:<span>:<parent>:<flags>, where flags is a bitmask encoding sampled (0x01, the least significant bit) and debug (0x02). |
| composite propagator | OTel construct that runs multiple propagators in sequence; first to extract wins, all enabled formats inject simultaneously — enables gradual format migration. |
| span attributes | Key–value pairs attached to a single span; describe that operation; not propagated downstream. |
| span events | Timestamped, named annotations within a span; the OTel-native way to record exceptions and intra-span moments. |
| span links | References from one span to other related SpanContext values; the right tool for fan-in patterns like batch processing. |
| Jaeger | CNCF open-source tracing backend with a trace explorer, dependency graph view, and Service Performance Monitoring (SPM) via OTel Collector spanmetrics. |
| Tempo | Grafana’s object-storage-backed tracing backend with a built-in metrics-generator producing per-service span_metrics and per-edge service_graphs Prometheus metrics. |
| RED method | Rate, Errors, Duration — the standard SLI vocabulary for request-driven services, derivable from span metrics via PromQL. |
| metrics-generator | Tempo component that reads spans and emits Prometheus metrics in real time; runs span_metrics and service_graphs processors. |
Chapter 8: Metrics Pipeline: Bridging OpenTelemetry and Prometheus
OpenTelemetry (OTel) and Prometheus were designed by different communities to solve overlapping but distinct problems. Prometheus grew up around pull-based scraping of cumulative counters with a strict text exposition format. OpenTelemetry was conceived as a vendor-neutral push pipeline with a rich metric data model that includes deltas, exponential histograms, and full resource attribution. In production observability stacks, these two worlds must coexist — usually because teams already operate Prometheus dashboards and alerts, but want to instrument applications with OTel’s polyglot SDKs.
This chapter walks the metric all the way from the instrument call inside your application through the OpenTelemetry SDK, across the wire (OTLP or Prometheus exposition), and into a Prometheus-compatible backend. By the end, you will be able to choose between the three common bridging patterns, configure aggregation temporality correctly for each, and translate OTel attributes into Prometheus labels without quietly losing data.
Learning Objectives
By the end of this chapter, you will be able to:
- Configure the OpenTelemetry SDK to emit metrics that Prometheus can scrape, or that the Collector can forward to a Prometheus-compatible backend.
- Compare cumulative versus delta aggregation temporality and pick the right one for each backend in a multi-destination pipeline.
- Translate OpenTelemetry attributes to Prometheus labels — including the metric name and unit transformations — without losing semantic meaning.
- Decide between the Prometheus SDK exporter, the Collector’s prometheus exporter, the prometheusremotewrite exporter, and Prometheus’ native OTLP receiver.
8.1 OpenTelemetry Metrics Data Model
Before you can bridge OTel metrics into Prometheus, you have to understand what OTel actually produces. OTel’s metrics data model is richer than Prometheus’ exposition format, which is precisely why the bridge is nontrivial.
Figure 8.1: OpenTelemetry metrics data model from Meter to Exporter
flowchart TD
Meter[Meter]
Meter --> Sync[Synchronous Instruments]
Meter --> Obs[Observable Instruments]
Sync --> C[Counter]
Sync --> UDC[UpDownCounter]
Sync --> H[Histogram]
Obs --> OC[ObservableCounter]
Obs --> OUDC[ObservableUpDownCounter]
Obs --> OG[ObservableGauge]
C --> View[View<br/>rename, filter,<br/>change aggregation]
UDC --> View
H --> View
OC --> View
OUDC --> View
OG --> View
View --> Agg[Aggregation<br/>Sum / LastValue /<br/>Histogram / ExpHistogram]
Agg --> DP[Data Point<br/>value + attributes +<br/>timestamp + temporality]
DP --> Exp[Exporter<br/>OTLP / Prometheus / stdout]
Instruments: The Six Core Shapes
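An instrument is the API surface your application code calls. This chapter works with six standard OpenTelemetry instrument types, organized along two axes: synchronous versus observable, and monotonic versus non-monotonic [Source: https://www.groundcover.com/opentelemetry/opentelemetry-metrics]. Recent versions of the specification also define a synchronous Gauge, which reaches Prometheus as a gauge just as ObservableGauge does.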
| Instrument | Synchronous? | Monotonic? | Typical use case |
|---|---|---|---|
| Counter | Yes | Yes (only adds) | Requests served, bytes sent, errors raised |
| UpDownCounter | Yes | No (can go down) | In-flight requests, queue depth, open connections |
| Histogram | Yes | n/a (distribution) | Request duration, payload size |
| ObservableCounter | No (callback) | Yes | CPU seconds, GC bytes: anything you read from an OS counter |
| ObservableUpDownCounter | No (callback) | No | Memory in use, thread pool size |
| ObservableGauge | No (callback) | n/a (last value) | Temperature, queue saturation, current load average |
Synchronous instruments are recorded inline on the hot path: requestCounter.Add(ctx, 1, attrs). Observable instruments register a callback that the SDK invokes during each collection cycle, which is useful when the value already exists somewhere (an OS counter, a pool size) and you would rather read it once per collection than update an instrument on every request.
By analogy: synchronous instruments are like ringing a bell every time something happens, while observable instruments are like a thermostat that the SDK reads on a schedule. Both produce time series, but the cost model is very different.
Views: Customizing Aggregation at the SDK
A View is an SDK configuration mechanism that lets you intercept measurements from a specific instrument and change how they are aggregated, named, or attributed before export. Views are how you do things like:
- Rename a metric (http.server.request.duration → request_latency).
- Drop high-cardinality attributes (e.g., remove user_id from a counter).
- Change the aggregation type (e.g., explicit-bucket histogram → exponential histogram).
- Configure custom bucket boundaries on a histogram.
Views are essential for the OTel-to-Prometheus bridge because they let you operate two pipelines from one instrument: an explicit-bucket histogram for the Prometheus scrape path, and an exponential histogram for an OTLP backend that supports them [Source: https://www.groundcover.com/opentelemetry/opentelemetry-metrics].
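// Go SDK: a View that switches one instrument to an exponential histogram.
// Register it with sdkmetric.NewMeterProvider(sdkmetric.WithView(view)).
view := sdkmetric.NewView(
    sdkmetric.Instrument{Name: "request_duration_seconds"},
    sdkmetric.Stream{
        // The Go SDK names this aggregation Base2ExponentialHistogram.
        Aggregation: sdkmetric.AggregationBase2ExponentialHistogram{
            MaxSize:  160, // upper bound on buckets kept per series
            MaxScale: 10,  // starting (finest) resolution
        },
    },
)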
Exponential Histograms
OTel ExponentialHistogram is a compact, base-2 exponential bucket representation. Instead of fixing bucket boundaries up front, it uses a scale parameter s such that bucket i covers (2^(i/2^s), 2^((i+1)/2^s)]: the lower bound is exclusive and the upper bound inclusive. A higher scale gives more buckets per power of 2, that is, more resolution.
Exponential histograms have three useful properties:
- Automatic dynamic range. If observed values span microseconds to minutes, you do not need to pre-pick boundaries. The aggregator chooses scale automatically.
- Bounded memory. When the bucket count would exceed MaxSize, the aggregator downscales, lowering s and merging neighboring buckets. Memory stays bounded; precision degrades gracefully.
- Separate positive, negative, and zero buckets. Useful for instruments that can take negative values (rare, but well-defined).
Compare this to a classic explicit-bucket histogram, where if your latency suddenly grows tenfold past your last bucket boundary, everything piles into the +Inf overflow bucket and your quantiles become useless. Exponential histograms degrade much more gracefully.
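The scale-to-bucket relationship and the downscaling step can be sketched in a few lines of Go. This is a minimal illustration using the logarithm method and the OpenTelemetry data model's convention that a bucket's upper bound is inclusive; the function names bucketIndex and downscale are hypothetical, and production SDKs use faster variants that guard against floating-point error near bucket boundaries.

```go
package main

import (
	"fmt"
	"math"
)

// bucketIndex returns the exponential-histogram bucket index for a positive
// value at the given scale. Bucket i covers (base^i, base^(i+1)], where
// base = 2^(2^-scale). Hypothetical helper for illustration only.
func bucketIndex(value float64, scale int) int {
	scaleFactor := math.Ldexp(1, scale) // 2^scale
	return int(math.Ceil(math.Log2(value)*scaleFactor)) - 1
}

// downscale maps a bucket index to the index it merges into when the scale
// is lowered by `by` steps. Go's >> on signed integers is an arithmetic
// shift, so negative indexes round toward negative infinity.
func downscale(index, by int) int {
	return index >> by
}

func main() {
	for _, s := range []int{0, 1, 3} {
		fmt.Printf("scale=%d value=3.0 -> bucket %d\n", s, bucketIndex(3.0, s))
	}
	fmt.Println("index 3 at scale 1 merges into", downscale(3, 1), "at scale 0")
}
```

Each step down in scale halves the number of buckets per power of two, which is why merging is a single right shift: two neighboring buckets at scale s become one bucket at scale s-1.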
Key Takeaway: OpenTelemetry’s metric model expresses the shape of a measurement (counter, gauge, histogram) and the cadence (synchronous vs. observable) separately from how it is aggregated for export. Views are the configuration knob that lets you serve multiple backends — including Prometheus — from a single instrumentation point.
8.2 Aggregation Temporality
Aggregation temporality is the single most common source of OTel-to-Prometheus bugs. It is the difference between a counter that goes up forever and one that resets every export — and Prometheus’ query language assumes the former.
What Temporality Means
Temporality describes what time window each exported data point represents [Source: https://www.groundcover.com/opentelemetry/opentelemetry-metrics]. There are two options:
- Cumulative: Each point is the total since a fixed start time (usually process start).
- Delta: Each point is the change since the previous export.
Temporality applies to sums (counters) and histograms. Gauges and last-value metrics are instantaneous snapshots and do not use temporality in the same way — they are simply “the current value at observation time.”
Crucially, temporality is a property of the exported time series, not the instrument. The same Counter can be exported as cumulative to Prometheus and delta to an OTLP backend, possibly through the same Collector.
Cumulative Temporality
A cumulative Counter at time T reports the total events since the SDK started. A cumulative Histogram bucket at time T reports the total observations that fell into that bucket since the SDK started.
The backend computes deltas by subtracting successive samples:
delta = value(T2) - value(T1)
This is the core of how Prometheus’ rate() and increase() work; both also handle counter resets and extrapolate to the edges of the query window. Behavior under failure:
- Process restart: Counter drops to zero. Backends that understand cumulative semantics detect the reset (a counter that decreased) and treat subsequent samples as a new run.
- Missed export interval: No data loss. The backend just computes the rate over a longer window when the next sample arrives.
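The subtraction and reset handling described above can be sketched as a small Go function. It is a simplified illustration, not PromQL's implementation: the helper name increase is borrowed for readability, and unlike the real function it does not extrapolate to window boundaries.

```go
package main

import "fmt"

// increase returns the total growth of a cumulative counter across samples,
// treating any decrease as a counter reset (the process restarted at zero).
// Simplified sketch: no extrapolation to the query window edges.
func increase(samples []float64) float64 {
	if len(samples) < 2 {
		return 0
	}
	total := 0.0
	prev := samples[0]
	for _, cur := range samples[1:] {
		if cur < prev {
			// Reset: everything counted since the restart is new growth.
			total += cur
		} else {
			total += cur - prev
		}
		prev = cur
	}
	return total
}

func main() {
	fmt.Println(increase([]float64{5, 12, 18, 25})) // steady growth
	fmt.Println(increase([]float64{5, 12, 25}))     // one sample missed
	fmt.Println(increase([]float64{5, 12, 3, 10}))  // restart after 12
}
```

The first two calls both yield 20: dropping the middle sample changes nothing, because the next cumulative value still carries the full total. The third yields 17, since the drop from 12 to 3 is read as a restart.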
Delta Temporality
A delta Counter at time T reports events since the previous export. Each exported point is already a per-interval increment.
- Process restart: Transparent. Each export is independent; there is nothing to reset.
- Missed export interval: Irrecoverable data loss. If the export for window [T1, T2] never arrives, those events are gone: there is no later cumulative sample from which to reconstruct them.
Side-by-Side Comparison
| Aspect | Cumulative | Delta |
|---|---|---|
| Value at time T | Total since start | Change since last export |
| Lost export | Backend just computes a longer-window rate | Data for that interval is lost forever |
| Process restart | Backend must detect reset | Each export independent — no reset concept |
| Backend rate computation | Backend differences samples | Per-interval value already given |
| Aligns with Prometheus | Natural fit | Not supported natively |
| Typical OTLP push guidance | Supported, less common | Strongly favored for many OTel backends |
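The "lost export" row is the easiest one to verify by hand. The sketch below encodes the same event stream both ways, drops one export from each, and recomputes the total; the helper names (totalFromCumulative, totalFromDelta, dropAt) and the sample numbers are illustrative assumptions.

```go
package main

import "fmt"

// totalFromCumulative recovers the event count from cumulative points:
// only the last point that arrived matters.
func totalFromCumulative(received []float64) float64 {
	if len(received) == 0 {
		return 0
	}
	return received[len(received)-1]
}

// totalFromDelta recovers the event count from delta points: every point
// must arrive, because each carries events found nowhere else.
func totalFromDelta(received []float64) float64 {
	sum := 0.0
	for _, d := range received {
		sum += d
	}
	return sum
}

// dropAt returns a copy of points with the export at index i lost in transit.
func dropAt(points []float64, i int) []float64 {
	out := make([]float64, 0, len(points)-1)
	out = append(out, points[:i]...)
	return append(out, points[i+1:]...)
}

func main() {
	cumulative := []float64{5, 12, 18, 25} // totals since start
	delta := []float64{5, 7, 6, 7}         // same events, per interval

	fmt.Println(totalFromCumulative(cumulative), totalFromDelta(delta))
	// Lose the second export on each path.
	fmt.Println(totalFromCumulative(dropAt(cumulative, 1)), totalFromDelta(dropAt(delta, 1)))
}
```

With all four exports delivered, both paths report 25 events. After losing the second export, the cumulative path still reports 25 while the delta path reports 18: the 7 events in the lost interval cannot be reconstructed.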
When Each Is the Right Choice
Use cumulative when:
- Your backend is Prometheus or any system that assumes cumulative monotonic counters.
- You want robustness to transient network issues — missing one export does not lose information.
- You want behavior ops teams already know: counters that monotonically rise.
Use delta when:
- You are pushing OTLP into a backend that expects deltas or aggregates server-side.
- You want exact per-interval statistics and accept the data-loss tradeoff.
- You have many ephemeral producers (serverless, batch jobs) where cumulative reset detection is messy.
The Collector as Temporality Translator
In a real pipeline, you rarely want to pick one and force every backend to live with it. The dominant 2025 pattern is to let the OpenTelemetry Collector convert temporality per exporter:
- Application SDK exports OTLP with AggregationTemporality = DELTA for sums and histograms.
- Collector receives delta points.
- Collector forwards delta to an OTLP vendor backend that prefers deltas.
- Collector accumulates delta into cumulative for the prometheusremotewrite exporter or the /metrics endpoint Prometheus scrapes.
Figure 8.2: Collector as temporality translator — one input, two temporalities out
flowchart LR
App[Application SDK]
App -->|OTLP delta| Coll[OpenTelemetry Collector]
Coll -->|cumulative| PromExp[prometheus exporter]
Coll -->|delta| OTLPExp[otlp exporter]
PromExp -->|scrape| Prom[(Prometheus)]
OTLPExp -->|push| Vendor[(Vendor OTLP backend)]
Cumulative versus delta in the time domain — same underlying event stream, two export shapes:
Figure 8.3: Cumulative vs delta temporality over time
graph LR
subgraph Cumulative
C1["t1: 5"] --> C2["t2: 12"] --> C3["t3: 18"] --> C4["t4: 25"]
end
subgraph Delta
D1["t1: +5"] --> D2["t2: +7"] --> D3["t3: +6"] --> D4["t4: +7"]
end
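The “delta-from-apps, cumulative-to-Prometheus” pattern works end to end only if something in the Collector keeps per-series state and sums deltas into a running total. Support in the exporters themselves is partial (the prometheusremotewrite exporter, for instance, drops delta sums and histograms), so the safer design is to place the contrib deltatocumulative processor in the pipeline ahead of the Prometheus-facing exporter. That state lives in memory: a Collector restart looks like a counter reset to Prometheus, and every delta stream for a given series must reach the same Collector instance.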
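The per-series state that turns deltas into a running total can be sketched as a map keyed by series identity. This is a minimal illustration of the idea, not the Collector's implementation: the Accumulator type, seriesKey helper, and metric names are hypothetical, and the sketch ignores concurrency, start timestamps, and expiry of stale series.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey builds a stable identity from a metric name and its labels,
// independent of map iteration order.
func seriesKey(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	b.WriteString(name)
	for _, k := range keys {
		fmt.Fprintf(&b, ",%s=%q", k, labels[k])
	}
	return b.String()
}

// Accumulator keeps one running total per series (hypothetical type).
type Accumulator struct {
	totals map[string]float64
}

func NewAccumulator() *Accumulator {
	return &Accumulator{totals: make(map[string]float64)}
}

// AddDelta folds a delta point into the series' running total and returns
// the cumulative value to expose to Prometheus.
func (a *Accumulator) AddDelta(name string, labels map[string]string, delta float64) float64 {
	key := seriesKey(name, labels)
	a.totals[key] += delta
	return a.totals[key]
}

func main() {
	acc := NewAccumulator()
	get := map[string]string{"method": "GET"}
	post := map[string]string{"method": "POST"}
	for _, d := range []float64{5, 7, 6, 7} {
		fmt.Println("GET cumulative:", acc.AddDelta("http_requests_total", get, d))
	}
	fmt.Println("POST cumulative:", acc.AddDelta("http_requests_total", post, 2))
}
```

Because the totals live only in this map, losing the process loses the running total, and splitting one series' deltas across two accumulators produces two partial counters instead of one correct one.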
Common Temporality Bugs
If you misconfigure temporality, the symptoms in Prometheus are distinctive:
- sum(rate(my_requests_total[5m])) returns erratic values that do not track real traffic: your “counter” is actually reaching Prometheus as delta, so successive samples fall as often as they rise, and rate() treats every decrease as a counter reset. (rate() never returns a negative value for this reason.)
- Inspecting the raw series shows a counter that drops between scrapes. Same root cause.
- increase() over a long window returns a value much smaller than the true event count: a rise between two samples contributes only the difference between two deltas, not the events themselves.
The fix is always the same: ensure the exporter feeding Prometheus is configured to produce cumulative, regardless of what temporality the SDK and Collector use internally.
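One quick diagnostic is to count how often a raw series decreases. A true cumulative counter drops only when a process restarts, while a delta stream mistaken for a counter drops constantly. The sketch below shows the check on two illustrative sample sets; countDrops is a hypothetical helper, and PromQL's resets() function computes the same count over a range vector.

```go
package main

import "fmt"

// countDrops reports how many times a series decreased between consecutive
// samples. Hypothetical helper mirroring what PromQL's resets() counts.
func countDrops(samples []float64) int {
	drops := 0
	for i := 1; i < len(samples); i++ {
		if samples[i] < samples[i-1] {
			drops++
		}
	}
	return drops
}

func main() {
	healthy := []float64{5, 12, 18, 25, 31, 40} // genuinely cumulative
	suspect := []float64{5, 7, 6, 7, 4, 9}      // deltas scraped as if cumulative
	fmt.Println("healthy drops:", countDrops(healthy))
	fmt.Println("suspect drops:", countDrops(suspect))
}
```

A series that shows drops every few samples without any matching restarts is a strong hint that delta points are reaching Prometheus.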
Key Takeaway: Prometheus is built around cumulative monotonic counters; OTLP push pipelines often favor delta. Pick temporality per exporter — usually delta from the SDK, cumulative at the Prometheus boundary — and let the Collector translate between them.
8.3 Bridging to Prometheus
There are four practical ways to get OpenTelemetry metrics into a Prometheus-based observability stack. Each has tradeoffs around coupling, push-versus-pull semantics, and how many moving parts you operate.
Figure 8.4: Four bridge pipelines from OTel-instrumented apps to Prometheus-compatible storage
flowchart TD
App1[App with OTel SDK<br/>+ Prometheus exporter]
App2[App with OTel SDK]
App3[App with OTel SDK]
App4[App with OTel SDK]
Coll2[OTel Collector<br/>prometheus exporter]
Coll3[OTel Collector<br/>prometheusremotewrite]
App1 -->|expose /metrics| P1Slash[/metrics endpoint/]
P1Slash -->|scrape| Prom1[(Prometheus)]
App2 -->|OTLP push| Coll2
Coll2 -->|expose /metrics| P2Slash[/metrics endpoint/]
P2Slash -->|scrape| Prom2[(Prometheus)]
App3 -->|OTLP push| Coll3
Coll3 -->|remote_write push| Remote[(Mimir / Cortex /<br/>Thanos / VictoriaMetrics)]
App4 -->|OTLP push| Prom4[(Prometheus<br/>OTLP receiver)]
Option 1: Prometheus Exporter in the SDK
The simplest path is to attach a Prometheus exporter directly to the OTel SDK inside your application. The SDK accumulates measurements internally and exposes them on an HTTP /metrics endpoint in the Prometheus text format.
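import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/prometheus"
    sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func initMeter() {
    // The exporter is a pull-based Reader; it registers its collector with
    // the default Prometheus registry.
    exporter, err := prometheus.New()
    if err != nil {
        log.Fatal(err)
    }
    provider := sdkmetric.NewMeterProvider(
        sdkmetric.WithReader(exporter),
    )
    otel.SetMeterProvider(provider)

    // The exporter is not an http.Handler itself; promhttp.Handler serves
    // the default registry in Prometheus text format.
    http.Handle("/metrics", promhttp.Handler())
    go func() {
        if err := http.ListenAndServe(":9464", nil); err != nil {
            log.Printf("metrics server stopped: %v", err)
        }
    }()
}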
Prometheus scrapes the app directly:
scrape_configs:
  - job_name: 'myapp'
    static_configs:
      - targets: ['myapp:9464']
Pros: Native Prometheus experience, no Collector required, operationally familiar.
Cons: You are not really using OTLP — resource attributes get flattened into labels (or lost), and you couple every app to Prometheus’ wire format. Exponential histograms get down-converted to explicit-bucket form.
Best for: Small Prometheus-first shops; gradual migrations.
Option 2: Prometheus Receiver in the Collector
A subtle but useful inversion: the OpenTelemetry Collector can scrape existing Prometheus exporters (anything exposing /metrics) and ingest the result into OTLP pipelines. This is the prometheus receiver, not the prometheus exporter shown as the second pipeline in Figure 8.4; it brings data into the Collector rather than out to Prometheus. It uses an embedded Prometheus scrape engine [Source: https://www.groundcover.com/opentelemetry/opentelemetry-metrics].
This receiver is the bridge that lets you bring legacy Prometheus-instrumented services (node_exporter, kube-state-metrics, third-party apps) into a unified OTLP pipeline alongside OTel-native apps.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kube-state-metrics'
          static_configs:
            - targets: ['kube-state-metrics:8080']
The Collector then routes these scraped metrics to any exporter — OTLP, prometheusremotewrite, or back out via the prometheus exporter for re-scraping.
Option 3: Prometheus Remote Write Exporter
The Collector’s prometheusremotewrite exporter pushes metrics over the Prometheus remote-write protocol. This is the dominant pattern for shipping metrics into Prometheus-compatible long-term stores like Cortex, Mimir, Thanos Receive, and VictoriaMetrics [Source: https://www.groundcover.com/opentelemetry/opentelemetry-metrics].
Important: a vanilla Prometheus server is a remote-write client by default, not a receiver. Stock Prometheus accepts remote-write pushes only when started with --web.enable-remote-write-receiver; without that flag, the prometheusremotewrite exporter is for remote-write-capable backends.
receivers:
  otlp:
    protocols:
      http:
      grpc:
processors:
  batch:
    timeout: 5s
    send_batch_size: 10000
exporters:
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
    external_labels:
      cluster: prod-cluster
      source: otel
    send_metadata: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
This path can preserve OTel exponential histograms by mapping them to Prometheus native histograms over the wire — the highest-fidelity OTel-to-Prometheus path available in 2025.
Option 4: OTLP-Native Ingestion into Prometheus
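Modern Prometheus versions (2.47+, more complete in 3.x) include an OTLP receiver that accepts pushed OTLP metrics directly into the Prometheus TSDB. It is off by default: start Prometheus with --web.enable-otlp-receiver (3.x) or --enable-feature=otlp-write-receiver (2.x). The receiver speaks OTLP over HTTP only, on Prometheus’ normal web port at /api/v1/otlp/v1/metrics; there is no separate gRPC listener. The optional otlp block in prometheus.yml controls translation, for example which resource attributes are promoted to labels:
global:
  scrape_interval: 15s
otlp:
  promote_resource_attributes:
    - service.namespace
    - k8s.namespace.name
Configure the OTel SDK to push to Prometheus’ OTLP endpoint:
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://prometheus:9090/api/v1/otlp/v1/metrics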
This collapses the pipeline to two components — but you lose Prometheus’ pull-based service-discovery model, and OTLP-specific semantics (resource attributes, exemplars, exponential histograms) are mapped to Prometheus equivalents with varying maturity.
Comparison of the Four Options
| Option | Direction | Extra component | Model | Best for | Main limitation |
|---|---|---|---|---|---|
| SDK Prometheus exporter | App → Prom | None | Pull | Small/medium Prometheus shops | Couples apps to Prometheus wire format |
| Collector prometheus exporter | App → Collector → Prom | Collector | Push-to-Collector, pull-from-Collector | Prom shops adopting OTel | Extra hop |
| Collector prometheusremotewrite | App → Collector → remote store | Collector | Push to backend | Large multi-cluster Cortex/Mimir | Target must accept remote write (off by default in Prometheus) |
| Prometheus OTLP receiver | App → Prom (OTLP) | None | Push into Prom | OTel-first, fewer components | Newer; push semantics; less mature mappings |
Recommended Pattern for 2025
The dominant recommendation for teams adopting OTel while keeping Prometheus is the Collector prometheus exporter path, the second pipeline in Figure 8.4 (not the prometheus receiver described under Option 2): applications push OTLP to a Collector, the Collector hosts a prometheus exporter on a port, and Prometheus scrapes the Collector. You get OTLP from your apps, Prometheus’ familiar pull model at the storage boundary, and a central place to filter, rename, and rate-limit metrics.
# Collector
exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
    namespace: otel
    const_labels:
      source: otel-collector
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
# Prometheus
scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:9464']
Key Takeaway: For most teams in 2025, the recommended bridge is App (OTLP) → OpenTelemetry Collector → prometheus exporter → Prometheus scrape. It combines OTel-native push from applications with Prometheus-native pull at the storage layer, and gives you a Collector chokepoint to fix temporality, naming, and cardinality.
8.4 Naming and Label Mapping
Once the wire path is sorted out, the next failure mode is semantic: OTel and Prometheus have different naming conventions, different valid character sets, and different ways of expressing units. A faithful bridge has to translate without quietly dropping meaning.
OTel Metric Name → Prometheus Metric Name
OpenTelemetry metric names are dotted and case-sensitive: http.server.request.duration. Prometheus metric names traditionally allow only [a-zA-Z_:][a-zA-Z0-9_:]* — no dots. The standard mapping is:
- Replace . with _: http.server.request.duration → http_server_request_duration.
- Replace other invalid characters with _: hyphens, slashes, etc.
- Append unit suffix based on the instrument’s unit (next subsection).
Worked example:
| OTel metric name | Unit | Instrument | Prometheus name |
|---|---|---|---|
| http.server.request.duration | s | Histogram | http_server_request_duration_seconds |
| http.server.active_requests | {request} | UpDownCounter | http_server_active_requests |
| http.client.request.body.size | By | Histogram | http_client_request_body_size_bytes |
| process.cpu.time | s | ObservableCounter | process_cpu_time_seconds_total |
| system.memory.usage | By | ObservableUpDownCounter | system_memory_usage_bytes |
Note _total is appended to monotonic counters per Prometheus convention; the exporter does this automatically based on the instrument type.
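The mapping steps above can be expressed as a short Go function and checked against the worked example. This is a simplified sketch, not any exporter's actual code: promName, sanitize, and unitSuffix are hypothetical names, the unit table covers only the units in the worked example, and the guard against doubling an existing suffix is an assumption about sensible behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// unitSuffix covers only the units in the worked example; real exporters
// carry a longer table.
var unitSuffix = map[string]string{
	"s":  "seconds",
	"By": "bytes",
}

// sanitize replaces every character outside [a-zA-Z0-9_:] with '_' and
// guards against a leading digit.
func sanitize(name string) string {
	var b strings.Builder
	for _, r := range name {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9', r == '_', r == ':':
			b.WriteRune(r)
		default:
			b.WriteByte('_')
		}
	}
	out := b.String()
	if out != "" && out[0] >= '0' && out[0] <= '9' {
		out = "_" + out
	}
	return out
}

// promName applies the steps in order: sanitize, unit suffix, then _total
// for monotonic counters. Annotation units such as {request} add no suffix.
func promName(otelName, unit string, monotonicCounter bool) string {
	name := sanitize(otelName)
	if suffix, ok := unitSuffix[unit]; ok && !strings.HasSuffix(name, "_"+suffix) {
		name += "_" + suffix
	}
	if monotonicCounter && !strings.HasSuffix(name, "_total") {
		name += "_total"
	}
	return name
}

func main() {
	fmt.Println(promName("http.server.request.duration", "s", false))
	fmt.Println(promName("http.server.active_requests", "{request}", false))
	fmt.Println(promName("process.cpu.time", "s", true))
}
```

The order matters: the unit suffix goes on before _total, which is why the CPU counter ends in _seconds_total rather than _total_seconds.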
Figure 8.5: OTel metric name to Prometheus name conversion pipeline
flowchart LR
Start["OTel name:<br/>http.server.request.duration<br/>unit: s<br/>type: Histogram"]
Start --> S1[Step 1<br/>Replace dots<br/>with underscores]
S1 --> N1["http_server_request_duration"]
N1 --> S2[Step 2<br/>Sanitize other<br/>invalid characters]
S2 --> N2["http_server_request_duration"]
N2 --> S3[Step 3<br/>Map unit code to<br/>a name suffix]
S3 --> N3["suffix: _seconds<br/>values unchanged"]
N3 --> S4[Step 4<br/>Append _total<br/>if monotonic counter]
S4 --> Final["http_server_request_duration_seconds"]
Unit Suffixes and Base Units
OpenTelemetry uses UCUM unit codes: s (seconds), ms (milliseconds), By (bytes), KiBy (kibibytes), 1 (dimensionless ratio), {request} (annotation, no physical unit). The Prometheus exporters spell each code out as a name suffix and leave the values untouched:
| OTel unit | Prometheus suffix | Value conversion at exporter |
|---|---|---|
| s | _seconds | None |
| ms | _milliseconds | None |
| us | _microseconds | None |
| By | _bytes | None |
| KiBy | _kibibytes | None |
| 1 | (none), or _ratio on gauges | None |
| {request}, {job}, … | (none) | None; the annotation is dropped |
The exporter applies the suffix but does not rescale values. Prometheus convention prefers base units (seconds, bytes), so if your dashboards expect _seconds, record seconds at the instrument and declare the unit as s; a histogram recorded in milliseconds with unit ms arrives as a _milliseconds series. Setting the wrong unit (or none) is a common silent bug: milliseconds recorded under unit s yield a _seconds metric that looks right but is off by 1,000×.
Attributes to Labels
OTel attributes (key-value pairs on each measurement) and OTel resource attributes (key-value pairs on the entire SDK, like service.name and service.version) both become Prometheus labels — but with caveats.
Attribute keys go through the same dot-to-underscore conversion: http.response.status_code becomes the label http_response_status_code.
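By default, the Prometheus exporters do not copy every resource attribute onto every series. They map service.name (prefixed with service.namespace when it is set) to the job label and service.instance.id to instance, mirroring Prometheus’ service-discovery model, and publish the remaining resource attributes on a separate target_info series. Other resource attributes (k8s.namespace.name, cloud.region, etc.) become labels on every series only when you opt in, for example with resource_to_telemetry_conversion in the Collector’s Prometheus exporters, and doing so can explode cardinality if you are not careful.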
| OTel attribute | Prometheus label |
|---|---|
| service.name | job (and/or service_name) |
| service.instance.id | instance |
| http.response.status_code | http_response_status_code |
| k8s.namespace.name | k8s_namespace_name |
| net.peer.name | net_peer_name |
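The table above can be read as a small merge function. The sketch below is a simplified model, not an exporter's real logic: toLabels and labelName are hypothetical helpers, the promote allowlist is an assumed way to make resource-attribute promotion an explicit choice, and only the dot-to-underscore rule is implemented.

```go
package main

import (
	"fmt"
	"strings"
)

// labelName converts an OTel attribute key to a Prometheus label name.
// Simplified: it handles dots only, not other invalid characters.
func labelName(key string) string {
	return strings.ReplaceAll(key, ".", "_")
}

// toLabels merges resource and measurement attributes into one label set.
// service.name and service.instance.id map to job and instance; any other
// resource attribute is kept only if it appears in promote.
func toLabels(resource, attrs map[string]string, promote []string) map[string]string {
	labels := make(map[string]string)
	if v, ok := resource["service.name"]; ok {
		labels["job"] = v
	}
	if v, ok := resource["service.instance.id"]; ok {
		labels["instance"] = v
	}
	for _, key := range promote {
		if v, ok := resource[key]; ok {
			labels[labelName(key)] = v
		}
	}
	for k, v := range attrs {
		name := labelName(k)
		if strings.HasPrefix(name, "__") {
			continue // names starting with __ are reserved by Prometheus
		}
		labels[name] = v
	}
	return labels
}

func main() {
	resource := map[string]string{
		"service.name":        "checkout",
		"service.instance.id": "pod-7f9c",
		"k8s.namespace.name":  "shop",
		"k8s.pod.uid":         "a1b2c3",
	}
	attrs := map[string]string{"http.response.status_code": "200"}
	fmt.Println(toLabels(resource, attrs, []string{"k8s.namespace.name"}))
}
```

Here k8s.pod.uid is present on the resource but never becomes a label, because it is not in the allowlist. Making promotion explicit is one way to keep per-pod identifiers from multiplying series.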
UTF-8 Metric Names in Prometheus 3.x
A significant 2024–2025 development: Prometheus 3.x supports UTF-8 metric and label names when the OpenMetrics or native protocols are used. This means the OTel .-separated form can in principle be preserved end-to-end without the underscore mangling, by quoting the metric name:
{"http.server.request.duration", method="GET"} 1.23
In practice, dashboards, alert rules, and recording rules written against legacy underscored names mean most teams still use the classic conversion. Treat UTF-8 names as a forward-looking option: useful for greenfield deployments where every component (Prometheus, query layer, dashboards) supports them, but expect translation back to underscored names anywhere a tool was written before 2024.
Common Naming and Mapping Pitfalls
- Forgetting to set the unit. Histogram values land in Prometheus as a dimensionless number with no _seconds or _bytes suffix. Dashboards silently misinterpret them.
- High-cardinality resource attributes. Attributes like host.id or k8s.pod.uid produce one Prometheus series per host or pod. Multiply by every metric and Prometheus’ memory pressure becomes a real problem. Use Views to drop attributes you do not need.
- Casing. OTel is case-sensitive; Prometheus labels are too, but conventions almost always use snake_case. Avoid httpStatusCode style; use http_status_code.
- Reserved labels. Do not produce attributes that collide with Prometheus’ reserved labels (__name__, anything starting with __).
- Histogram bucket labels. Classic histograms produce _bucket{le="0.005"} series. The le label is added by the exporter, not something you can produce as an OTel attribute.
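The cardinality pitfall is easier to judge with numbers. The sketch below estimates an upper bound on the series one metric produces as the product of each label's distinct values; seriesCount and the label counts are illustrative assumptions, and real counts are lower when some label combinations never occur.

```go
package main

import "fmt"

// seriesCount estimates an upper bound on the time series one metric
// produces: the product of the distinct values of each label, times the
// series emitted per label combination (1 for a counter or gauge; for a
// classic histogram, one per explicit boundary plus +Inf, _sum, and _count).
func seriesCount(distinctValues map[string]int, seriesPerCombination int) int {
	total := seriesPerCombination
	for _, n := range distinctValues {
		total *= n
	}
	return total
}

func main() {
	labels := map[string]int{
		"http_request_method":       3,
		"http_response_status_code": 5,
	}
	fmt.Println("counter, no pod label:", seriesCount(labels, 1))

	labels["k8s_pod_uid"] = 200 // one value per pod
	fmt.Println("counter, per pod:", seriesCount(labels, 1))
	fmt.Println("histogram, 10 boundaries, per pod:", seriesCount(labels, 10+3))
}
```

Adding a single per-pod label takes the counter from 15 series to 3,000, and the same label on a ten-boundary histogram reaches 39,000 series for one metric.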
Key Takeaway: OTel-to-Prometheus naming is a deterministic transform: dots become underscores, units become name suffixes, attributes become labels, and monotonic counters get _total. Get the unit right at the instrument level; everything downstream depends on it.
Chapter Summary
The metrics pipeline between OpenTelemetry and Prometheus has to reconcile two designs: OTel’s flexible push-oriented data model and Prometheus’ cumulative pull-oriented exposition format. The bridge has four moving parts.
The data model. OpenTelemetry exposes six instruments (Counter, UpDownCounter, Histogram, ObservableCounter, ObservableUpDownCounter, ObservableGauge) plus an aggregation layer customizable via Views. Exponential histograms give you compact, dynamic-range distributions that gracefully degrade memory through auto-downscaling. Prometheus’ wire format is less expressive, which is why bridging requires deliberate choices about which OTel features to preserve and which to flatten.
Aggregation temporality. Cumulative means “total since start”; delta means “change since last export.” Prometheus only natively supports cumulative — its rate() and increase() functions assume it. OTel push pipelines often favor delta because deltas aggregate cleanly across many producers. The dominant pattern is to use delta from apps to the Collector, and have the Collector accumulate to cumulative before exposing metrics for Prometheus scraping.
The wire bridge. Four options exist: SDK Prometheus exporter, Collector prometheus exporter, Collector prometheusremotewrite exporter, and Prometheus’ native OTLP receiver. For most Prometheus shops adopting OTel, the App → OTLP → Collector → prometheus exporter → Prometheus scrape pattern is the recommended path. It preserves OTel push from apps and Prometheus pull at the storage layer, and gives you a single chokepoint to handle temporality, naming, and cardinality.
Naming and labels. OTel metric names use dots and are paired with explicit units; Prometheus uses underscores, base units, and the conventional _total suffix on counters. Resource attributes become labels — be deliberate about which ones, or your Prometheus cardinality will explode. Prometheus 3.x’s UTF-8 name support offers a forward-looking simplification but is not yet universally compatible with existing tooling.
Get these four layers right and OTel and Prometheus coexist smoothly: applications speak OTel, dashboards keep working, and you preserve the option to add OTLP-native backends without re-instrumenting.
Key Terms
| Term | Definition |
|---|---|
| Aggregation temporality | Whether each exported metric point represents a total since process start (cumulative) or a change since the last export (delta). |
| Cumulative | Temporality in which each data point is the total since a fixed start time; the natural fit for Prometheus. |
| Delta | Temporality in which each data point is the change since the previous export; favored for many OTLP push pipelines. |
| Exponential histogram | OTel histogram aggregation using base-2 exponential buckets controlled by a scale parameter, with automatic downscaling to bound memory. |
| View | OTel SDK configuration that customizes how measurements from an instrument are aggregated, named, attributed, or filtered before export. |
| Instrument | API surface for recording measurements: Counter, UpDownCounter, Histogram, and their Observable variants. |
| Prometheus exporter | An OTel component (in the SDK or Collector) that exposes a /metrics endpoint in Prometheus text format for scraping. |
| Prometheus receiver | A Collector component that scrapes existing Prometheus /metrics endpoints and ingests the result into the OTLP pipeline. |
| Remote write | Prometheus’ push protocol for shipping samples to remote storage backends like Cortex, Mimir, Thanos, and VictoriaMetrics. |
| prometheusremotewrite exporter | Collector exporter that pushes metrics over the Prometheus remote-write protocol; a stock Prometheus server accepts these writes only when its remote-write receiver is explicitly enabled. |
| OTLP | OpenTelemetry Protocol; the push-based gRPC/HTTP wire format for traces, metrics, and logs. |
| OTLP receiver (Prometheus) | A feature in recent Prometheus versions that accepts pushed OTLP metrics directly into the TSDB. |
| Resource attributes | Key-value pairs that describe the entity producing telemetry (e.g., service.name); typically promoted to Prometheus labels. |
| Native histogram (Prometheus) | Prometheus’ base-2 exponential histogram type, transported via remote-write and roughly analogous to OTel ExponentialHistogram. |
| Exemplar | Sampled raw measurement attached to a histogram bucket, optionally carrying trace/span context for telemetry correlation. |
Chapter 9: Logs, Events, and Cross-Signal Correlation
Logs are the oldest and most stubborn signal in observability. Long before metrics dashboards and distributed traces, engineers grepped through /var/log to figure out what went wrong. That habit has not gone away — it has merely accumulated layers. Today a typical production stack carries application logs in JSON, container stdout collected by Fluent Bit, kernel messages in journald, audit events going to a SIEM, and increasingly OpenTelemetry log records flowing as OTLP. The interesting question is no longer “how do I store logs?” but “how do I make logs participate in observability alongside traces and metrics?”
This chapter walks through what OpenTelemetry says a log is, how to get logs into a collector pipeline whether your application is brand-new or twenty years old, and — most importantly — how to make a trace_id in a log line behave like a hyperlink to the actual trace.
Learning Objectives
By the end of this chapter you will be able to:
- Emit OpenTelemetry-compatible structured logs from applications using log bridges or native SDKs, without binding application code to a specific vendor.
- Correlate logs to traces using trace_id and span_id so engineers can pivot between signals during an incident.
- Design a log pipeline that mixes the OpenTelemetry Collector, Loki, and existing log shippers (Fluent Bit, Fluentd, Vector) without forcing a big-bang migration.
- Decide when a piece of information belongs in a LogRecord, a span event, or both.
9.1 OpenTelemetry Logs Data Model
If traces describe what one operation did and metrics describe what is true in aggregate, logs describe what happened, at this exact moment, in detail. OpenTelemetry’s logs signal formalizes that intuition into a portable data structure so logs can be shipped, transformed, and queried the same way regardless of language or backend [Source: https://opentelemetry.io/docs/specs/otel/logs/].
9.1.1 The LogRecord Structure
A LogRecord is the atomic unit of the OpenTelemetry logs signal. It lives inside a hierarchy familiar from traces and metrics: a Resource (the entity that produced the data — a service, host, or pod) contains one or more InstrumentationScopes (the library or module that emitted the log), which in turn contain LogRecords [Source: https://opentelemetry.io/docs/specs/otel/logs/data-model/].
Each LogRecord carries the following core fields:
| Field | Purpose | Example |
|---|---|---|
| timestamp | When the event actually occurred (nanoseconds since epoch) | 1717603200123456789 |
| observed_timestamp | When the collector/agent first saw the event | 1717603200223456789 |
| severity_number | Numeric severity (1-24, normalized across systems) | 17 (ERROR) |
| severity_text | Original textual severity | "ERROR" |
| body | The main payload, string or structured | "Payment failed" or {"message": "...", "reason": "card_declined"} |
| attributes | Key-value dimensions for filtering and grouping | {"http.route": "/pay", "user.id": "12345"} |
| trace_id | 32-hex-character correlation ID matching the trace | "8f1b5fe2d5de4a51b8884f8f4cdde3f5" |
| span_id | 16-hex-character correlation ID matching the span | "d2a41c3ff7a1b0ce" |
| trace_flags | Sampling/flag bits inherited from trace context | 01 |
| resource | Service-level attributes (inherited) | {"service.name": "payments-api"} |
The separation between timestamp and observed_timestamp is subtle but useful: if a log line sits in a buffer for thirty seconds before reaching the Collector, both moments are preserved. This makes it possible to diagnose lag in the log pipeline itself.
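As a small illustration of that point, the following Python sketch computes pipeline lag from the two fields. The function name is hypothetical; the input values are the example nanosecond timestamps from the table above.

```python
def pipeline_lag_ms(timestamp_ns: int, observed_timestamp_ns: int) -> float:
    """Return how long a record waited before it was first observed, in milliseconds."""
    return (observed_timestamp_ns - timestamp_ns) / 1_000_000


# Example values from the field table: event time vs. first-observed time.
lag = pipeline_lag_ms(1717603200123456789, 1717603200223456789)
print(f"log pipeline lag: {lag} ms")
```

Tracking this difference as a histogram over time is one way to notice a shipper that is buffering or falling behind.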
Figure 9.1: LogRecord hierarchy — Resource, InstrumentationScope, and LogRecord fields
graph TD
R["Resource<br/>service.name=payments-api<br/>deployment.environment=prod<br/>k8s.pod.name=payments-7c584fd87f-jc6xg"]
R --> S1["InstrumentationScope<br/>com.example.payments"]
R --> S2["InstrumentationScope<br/>runtime"]
S1 --> L1["LogRecord<br/>severity=ERROR<br/>body=Payment charge failed<br/>trace_id=8f1b...e3f5<br/>span_id=d2a4...b0ce"]
S1 --> L2["LogRecord<br/>severity=INFO<br/>body=Charge created<br/>attributes.amount_cents=4200"]
S2 --> L3["LogRecord<br/>severity=WARN<br/>body=GC pause 250ms"]
S2 --> L4["LogRecord<br/>severity=INFO<br/>body=Heap resized"]
9.1.2 Severity, Body, and Attributes
OpenTelemetry maps the chaotic world of log levels onto a single severity scale of 1-24, divided into ranges: TRACE (1-4), DEBUG (5-8), INFO (9-12), WARN (13-16), ERROR (17-20), FATAL (21-24). The mapping lets you ask “show me everything WARN or above” across services that use Python’s logging, Java’s Logback, Node’s pino, and .NET’s ILogger — even though each of those frameworks invents its own level names.
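The range table translates directly into a lookup. The sketch below is illustrative only: the function names are hypothetical, and the mapping from Python logging levels to the first number of each range is an assumption about a reasonable bridge, not a quotation of any SDK.

```python
import logging

# (low, high, name) ranges exactly as listed in the text.
SEVERITY_RANGES = [
    (1, 4, "TRACE"), (5, 8, "DEBUG"), (9, 12, "INFO"),
    (13, 16, "WARN"), (17, 20, "ERROR"), (21, 24, "FATAL"),
]

# Assumed mapping: each Python level lands on the first number of its range.
PYTHON_LEVEL_TO_SEVERITY = {
    logging.DEBUG: 5,
    logging.INFO: 9,
    logging.WARNING: 13,
    logging.ERROR: 17,
    logging.CRITICAL: 21,
}


def severity_range(severity_number: int) -> str:
    """Return the named range that a severity_number falls into."""
    for low, high, name in SEVERITY_RANGES:
        if low <= severity_number <= high:
            return name
    raise ValueError(f"severity_number out of range: {severity_number}")


def is_warn_or_above(severity_number: int) -> bool:
    """The cross-framework filter: WARN starts at 13."""
    return severity_number >= 13


print(severity_range(PYTHON_LEVEL_TO_SEVERITY[logging.ERROR]))
```

Filtering on the number rather than the text is what makes "WARN or above" work across frameworks whose level names differ.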
The body field deserves attention. It can be a plain string (for legacy apps), but it can also hold a structured object. The recommendation in 2025 is:
- Plain message in body for human-readable context.
- Dimensional fields in attributes: anything you might filter or aggregate on.
Think of body as the headline and attributes as the metadata you’d want to slice by. A useful analogy: if logs were emails, body is the subject line and attributes are the headers (From, To, Date) that make the email searchable.
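Here is a complete LogRecord rendered as simplified JSON. This is a logical view of the fields; the actual OTLP/JSON wire encoding uses different field names (for example timeUnixNano and severityNumber) and wraps values in typed objects: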
{
"timestamp": "2025-03-10T10:15:30.123Z",
"observed_timestamp": "2025-03-10T10:15:30.156Z",
"severity_number": 17,
"severity_text": "ERROR",
"body": {
"message": "Payment charge failed",
"reason": "card_declined"
},
"attributes": {
"http.method": "POST",
"http.route": "/pay",
"http.status_code": 402,
"user.id": "12345",
"payment.amount_cents": 4200,
"exception.type": "StripeCardException"
},
"trace_id": "8f1b5fe2d5de4a51b8884f8f4cdde3f5",
"span_id": "d2a41c3ff7a1b0ce",
"trace_flags": "01",
"resource": {
"service.name": "payments-api",
"service.namespace": "checkout",
"deployment.environment": "prod",
"k8s.namespace.name": "payments",
"k8s.pod.name": "payments-7c584fd87f-jc6xg"
}
}
That record carries enough context to answer three different questions: what happened (body), who did it (resource + user attribute), and which trace it belongs to (trace_id/span_id).
9.1.3 Maturity: Logs Compared to Traces and Metrics
The honest 2025 picture: the logs data model is stable, OTLP logs are stable, and the Collector’s log pipeline is mature. What still varies is native SDK support per language [Source: https://opentelemetry.io/docs/specs/otel/logs/].
| Language | Logs API/SDK status | Production posture |
|---|---|---|
| Java | Stable Logs API/SDK; Logback & Log4j2 appenders; agent integration | Production-ready |
| .NET | Stable via ILogger provider; first-class OTLP exporter | Production-ready |
| Python | SDK exists; some surface area still volatile | Production-usable with caution |
| Go | SDK in experimental form; many teams still inject IDs manually | Early-adopter |
| Node.js/JS | No unified mature logs SDK; use existing logger + manual injection | Hybrid approach |
| C++/Rust | Partial/experimental; varies by library | Evaluate per project |
The practical implication is that for Java and .NET services you can confidently say “logs go through OpenTelemetry.” For Python, you can do it but should isolate the OTel logging setup behind your own thin abstraction so SDK churn does not propagate. For Go and Node.js, the realistic posture is: keep using your favorite logger (zap, zerolog, pino, winston) and ensure it includes trace_id/span_id fields — the OTel Collector will accept the resulting JSON lines just as happily.
9.1.4 Log Bridges for Popular Frameworks
A log bridge is a small adapter that listens to log events from an existing framework and converts them into OpenTelemetry LogRecords. The framework keeps its familiar API (developers still write logger.error("...")); the bridge handles the translation, enriches the record with resource attributes and current trace context, and ships it over OTLP.
The pattern, in three steps regardless of language:
- Application code logs as it always has.
- A bridge — an appender, handler, or logger provider — translates each event into an OTel LogRecord.
- The OTel SDK exports LogRecords to the Collector via OTLP.
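To make the three steps concrete, here is a toy bridge built only on Python's standard logging module. It is a sketch of the pattern, not the OpenTelemetry SDK: the class name, the get_trace_context callback, and the export callback are all hypothetical stand-ins for what a real bridge obtains from the SDK and sends over OTLP.

```python
import logging
import time


class DictBridgeHandler(logging.Handler):
    """Toy bridge: converts stdlib log events into LogRecord-shaped dicts."""

    # Assumed level mapping onto the OTel severity scale.
    LEVEL_TO_SEVERITY = {
        logging.DEBUG: 5, logging.INFO: 9, logging.WARNING: 13,
        logging.ERROR: 17, logging.CRITICAL: 21,
    }

    def __init__(self, resource, get_trace_context, export):
        super().__init__()
        self.resource = resource                    # shared resource attributes
        self.get_trace_context = get_trace_context  # returns (trace_id, span_id)
        self.export = export                        # stand-in for an OTLP exporter

    def emit(self, record):
        trace_id, span_id = self.get_trace_context()
        self.export({
            "timestamp": int(record.created * 1e9),
            "observed_timestamp": time.time_ns(),
            "severity_number": self.LEVEL_TO_SEVERITY.get(record.levelno, 9),
            "severity_text": record.levelname,
            "body": record.getMessage(),
            "trace_id": trace_id,
            "span_id": span_id,
            "resource": self.resource,
        })


exported = []
handler = DictBridgeHandler(
    resource={"service.name": "payments-api"},
    get_trace_context=lambda: ("8f1b5fe2d5de4a51b8884f8f4cdde3f5", "d2a41c3ff7a1b0ce"),
    export=exported.append,
)
logger = logging.getLogger("payments-demo")
logger.propagate = False
logger.setLevel(logging.INFO)
logger.addHandler(handler)

# Step 1: application code logs exactly as it always has.
logger.error("Payment failed for order %s", "ord_8f3a")
```

The application line at the bottom never mentions OpenTelemetry, which is the point of the bridge pattern: the translation and enrichment live entirely in the handler.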
Java with Logback — Add the opentelemetry-logback-appender dependency, declare an OTel appender in logback.xml, and attach it to the root logger. The appender copies MDC entries to attributes and grabs trace_id/span_id from the current span automatically [Source: https://opentelemetry.io/docs/languages/java/instrumentation/].
<appender name="OTEL" class="io.opentelemetry.instrumentation.logback.appender.v1_0.OpenTelemetryAppender">
<captureMdcAttributes>*</captureMdcAttributes>
<captureCodeAttributes>true</captureCodeAttributes>
</appender>
<root level="INFO">
<appender-ref ref="OTEL"/>
<appender-ref ref="CONSOLE"/>
</root>
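Python with logging: install opentelemetry-sdk and the OTLP gRPC exporter package (opentelemetry-exporter-otlp-proto-grpc), then wire up a LoggerProvider, a BatchLogRecordProcessor, and the OTLP exporter. The underscore-prefixed module paths (_logs) signal that this API surface may still change: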
import logging
from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter
provider = LoggerProvider()
set_logger_provider(provider)
provider.add_log_record_processor(
BatchLogRecordProcessor(OTLPLogExporter(endpoint="http://otel-collector:4317", insecure=True))
)
handler = LoggingHandler(level=logging.INFO, logger_provider=provider)
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)
Wrap that setup in your own observability/logging.py module so the inevitable SDK version bump only changes one file.
.NET with ILogger — Configuration is one fluent call:
builder.Logging.AddOpenTelemetry(options =>
{
options.IncludeFormattedMessage = true;
options.IncludeScopes = true;
options.SetResourceBuilder(ResourceBuilder.CreateDefault().AddService("payments-api"));
options.AddOtlpExporter(o => o.Endpoint = new Uri("http://otel-collector:4317"));
});
After that, any _logger.LogInformation("Order {OrderId} placed", id) produces a LogRecord with OrderId as a structured attribute and the current trace_id/span_id already attached.
Node.js with winston / pino — Because the JS logs SDK is still maturing, the common pattern is to keep winston or pino, configure JSON output, and add trace_id/span_id from the active context:
const { trace, context } = require('@opentelemetry/api');
const pino = require('pino');
const logger = pino({
formatters: {
log(obj) {
const span = trace.getSpan(context.active());
if (span) {
const ctx = span.spanContext();
obj.trace_id = ctx.traceId;
obj.span_id = ctx.spanId;
}
return obj;
}
}
});
The resulting JSON lines flow to stdout, are picked up by the Collector’s filelog receiver, and join the OTLP pipeline.
Key Takeaway: OpenTelemetry’s LogRecord gives you a single, stable schema (timestamp, severity, body, attributes, trace_id/span_id, resource) that any logging framework can be bridged into — keeping developer ergonomics in each language while normalizing the wire format downstream.
9.2 Collecting Logs at the Edge
The data model is portable; the act of capturing logs is not. Some logs come from applications that speak OTLP. Some come from third-party software writing to files. Some come from systemd via journald. Some are already being collected by a Fluent Bit DaemonSet your platform team installed five years ago. The OpenTelemetry Collector is designed to absorb all of these without forcing a single pattern.
9.2.1 The filelog Receiver
The filelog receiver is the Collector’s answer to legacy log files. It tails files on disk, optionally parses each line (regex, JSON, CSV), and emits OTel LogRecords downstream [Source: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver].
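receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    include_file_path: true
    start_at: beginning
    operators:
      # Unwrap the container runtime's line format (CRI or Docker JSON) first
      - type: container
      # Then parse the application's own JSON payload
      - type: json_parser
        parse_from: body
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
        severity:
          parse_from: attributes.level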
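The operators list is a small log-transformation pipeline that runs per receiver. The json_parser operator parses each line as JSON and promotes fields to LogRecord attributes. The timestamp and severity sub-blocks tell the operator which attributes to use to populate the canonical fields. On Kubernetes nodes the container runtime wraps every line in its own format, so that wrapper has to be unwrapped (for example with the container operator) before the JSON payload can be parsed. Other useful tools include the regex_parser operator for unstructured logs, the receiver-level multiline setting for stack traces that span multiple lines, and the recombine operator for entries split by a logging framework.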
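What the json_parser step does to a single line can be approximated in a few lines of Python. This is a simplified model under stated assumptions, not the Collector's implementation: the function name and the level-to-number table are hypothetical, and the time and level keys mirror the attributes.time and attributes.level references in the configuration above.

```python
import json
from datetime import datetime, timezone

# Assumed mapping from textual levels to OTel severity numbers.
SEVERITY_BY_LEVEL = {"debug": 5, "info": 9, "warn": 13, "warning": 13, "error": 17, "fatal": 21}


def parse_json_line(line: str) -> dict:
    """Parse one JSON log line and promote time/level to canonical fields."""
    attributes = json.loads(line)
    # The original line stays in body; parsed keys become attributes.
    record = {"body": line, "attributes": attributes}

    raw_time = attributes.get("time")
    if raw_time is not None:
        parsed = datetime.strptime(raw_time, "%Y-%m-%dT%H:%M:%S.%fZ")
        record["timestamp"] = parsed.replace(tzinfo=timezone.utc)

    level = attributes.get("level")
    if level is not None:
        record["severity_text"] = level
        record["severity_number"] = SEVERITY_BY_LEVEL.get(level.lower(), 0)
    return record


line = '{"time": "2025-03-10T10:15:30.123Z", "level": "ERROR", "msg": "Payment charge failed"}'
record = parse_json_line(line)
print(record["timestamp"].isoformat(), record["severity_number"])
```

Lines that are not valid JSON raise an exception here; in a real pipeline that case is usually routed to a fallback parser rather than dropped.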
9.2.2 The journald Receiver
For host-level events on systemd-based Linux machines — sshd logins, OOM kills, cron runs — the journald receiver reads directly from the systemd journal binary format, preserving all structured fields:
receivers:
  journald:
    directory: /var/log/journal
    units:
      - sshd
      - cron
    priority: info
This is invaluable for SRE teams who want a single observability backend for both application telemetry and host events.
9.2.3 Migrating from Fluent Bit, Fluentd, or Vector
Most clusters running for more than a year already have a log shipper deployed. The pragmatic migration path is not rip-and-replace. It is layered coexistence.
Figure 9.2: Two-track log pipeline during Fluent Bit to OTel Collector migration
flowchart LR
subgraph Legacy["Legacy track (Phase 1 - keep running)"]
A1["Legacy services<br/>stdout/files"] --> FB["Fluent Bit<br/>DaemonSet"]
FB --> Splunk["Splunk / ELK"]
end
subgraph New["OTel track (Phase 2-3 - growing)"]
A2["New services<br/>OTLP logs"] --> OC["OTel Collector<br/>DaemonSet<br/>filelog + OTLP receiver"]
A3["JSON file logs"] --> OC
OC --> Loki["Loki"]
OC --> Tempo["Tempo / OTLP backend"]
end
A1 -. "service adopts OTel SDK<br/>or JSON logs" .-> A2
FB -. "optional: Fluent Forward<br/>during transition" .-> OC
Concretely:
- Phase 1: Leave Fluent Bit in place. Deploy the OTel Collector alongside it. Have new services emit OTLP logs directly to the Collector. Existing services keep flowing through Fluent Bit.
- Phase 2: Standardize log format. Whether your shipper is Fluent Bit, Fluentd, or Vector, configure it to emit JSON with trace_id, span_id, and OTel-style resource attributes. This makes the eventual switchover lossless.
- Phase 3: Either reconfigure Fluent Bit to forward to the OTel Collector via OTLP (Fluent Bit 2.0+ supports this), or replace it with the Collector’s filelog receiver. The destination backend (Loki, ELK, Splunk) can stay the same.
The OTel Collector also speaks the Fluent Forward protocol natively through its fluentforward receiver, so during the transition an existing Fluent Bit can ship to the Collector using its standard forward output rather than a new OTLP setup.
9.2.4 Kubernetes Log Collection Patterns
In Kubernetes, container stdout/stderr is written to /var/log/pods/<namespace>_<pod>_<uid>/<container>/0.log on each node. There are three established patterns for collecting these:
| Pattern | Mechanism | Pros | Cons |
|---|---|---|---|
| DaemonSet with filelog | One Collector per node tailing /var/log/pods | No app changes; works for any language | Requires hostPath mount; parsing burden in Collector |
| Sidecar OTel SDK | Each pod sends OTLP directly | Structured at source; trace correlation automatic | Pod resource overhead; harder for 3rd-party images |
| Stdout + Kubernetes API enrichment | DaemonSet tails stdout, calls k8s API for pod metadata | Rich Kubernetes attributes (labels, owners) | Extra API load; permissions complexity |
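The pod log path layout described above already encodes several Kubernetes attributes, which is part of why the DaemonSet pattern needs no application changes. The following sketch extracts them with a regular expression. The function name and the example pod UID are hypothetical, and treating the numeric file name as the container restart count is an assumption.

```python
import re

# /var/log/pods/<namespace>_<pod>_<uid>/<container>/<n>.log
POD_LOG_PATH = re.compile(
    r"^/var/log/pods/"
    r"(?P<namespace>[^_/]+)_(?P<pod>[^_/]+)_(?P<uid>[^/]+)/"
    r"(?P<container>[^/]+)/(?P<restart>\d+)\.log$"
)


def parse_pod_log_path(path: str):
    """Return Kubernetes attributes encoded in a pod log path, or None if it does not match."""
    match = POD_LOG_PATH.match(path)
    if match is None:
        return None
    return {
        "k8s.namespace.name": match.group("namespace"),
        "k8s.pod.name": match.group("pod"),
        "k8s.pod.uid": match.group("uid"),
        "k8s.container.name": match.group("container"),
        "k8s.container.restart_count": int(match.group("restart")),
    }


# The UID below is made up for the example.
example = "/var/log/pods/payments_payments-7c584fd87f-jc6xg_0f1e2d3c-aaaa-bbbb-cccc-1234567890ab/payments/0.log"
attrs = parse_pod_log_path(example)
print(attrs)
```

Anything the path does not carry, such as labels or the owning deployment, has to come from the Kubernetes API instead.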
The most common production layout is a DaemonSet + filelog + the k8sattributes processor, which enriches every LogRecord with pod, namespace, deployment, and node attributes pulled from the Kubernetes API:
processors:
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.node.name
      labels:
        - tag_name: app
          key: app.kubernetes.io/name
That processor alone is often the difference between “logs I can search” and “logs I can join with metrics and traces.”
Key Takeaway: The Collector’s filelog, journald, and fluentforward receivers, plus the k8sattributes processor, let you ingest logs from new OTLP-native services, legacy file-based apps, host daemons, and existing Fluent Bit deployments, all into one normalized OTel log pipeline and without a big-bang migration.
9.3 Cross-Signal Correlation
Structured logs and OTLP transport are means to an end. The end is correlation — the ability to be looking at a trace in Tempo, click a button, and land in the precise log lines that the failing span produced; or to find an ERROR log in Loki, click the trace_id, and land in the corresponding distributed trace.
9.3.1 Stamping trace_id and span_id on Log Records
There are two ways to get trace IDs onto a log record [Source: https://grafana.com/docs/grafana/latest/datasources/loki/configure-loki-data-source/#derived-fields]:
- Automatic, via SDK or bridge. The OTel logging API reads the active span from context (the same context used by tracing instrumentation) and copies its trace_id and span_id onto the LogRecord. This is the path Logback, Log4j2, ILogger, and the Python OTel handler all take.
- Manual, via logger enrichment. In languages where the logs SDK is immature, you fetch the active span yourself and inject the IDs as structured fields on every log. Patterns: a pino formatter (Node.js), a zap option (Go), a Serilog enricher (.NET classic), a logrus hook.
Either way the result must be the same: every log line carries the exact 32-hex-character trace_id and 16-hex-character span_id that the tracer would have sent to the trace backend. A mismatch — extra dashes, wrong case, truncation — silently breaks correlation.
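A defensive check for these failure modes might look like the sketch below. The function name is hypothetical, and the choice to repair dashes and casing while rejecting truncated or all-zero values is one possible policy, not a requirement from the specification.

```python
HEX_DIGITS = set("0123456789abcdef")


def normalize_id(raw: str, hex_len: int):
    """Return a lowercase hex ID of exactly hex_len characters, or None if unusable.

    Use hex_len=32 for trace_id and hex_len=16 for span_id.
    """
    candidate = raw.strip().replace("-", "").lower()
    if len(candidate) != hex_len:
        return None  # truncated or padded IDs cannot be repaired
    if not set(candidate) <= HEX_DIGITS:
        return None
    if set(candidate) == {"0"}:
        return None  # an all-zero ID marks an invalid trace context
    return candidate


# Uppercase with dashes is repairable; a truncated value is not.
print(normalize_id("8F1B5FE2-D5DE-4A51-B888-4F8F4CDDE3F5", 32))
print(normalize_id("8f1b5fe2", 32))
```

Running such a check in a test suite against real log output catches correlation-breaking formats before they reach production.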
A common 2025 pitfall is mismatched casing: some loggers serialize trace IDs as uppercase, but Tempo expects lowercase. The Collector’s transform processor can normalize this:
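processors:
  transform/normalize-trace-ids:
    log_statements:
      - context: log
        statements:
          # Only touch records that actually carry the attribute
          - set(attributes["trace_id"], ConvertCase(attributes["trace_id"], "lower")) where attributes["trace_id"] != nil
          - set(attributes["span_id"], ConvertCase(attributes["span_id"], "lower")) where attributes["span_id"] != nil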
9.3.2 Unified Resource Attributes
Trace-log linking via trace_id is half the story. The other half is resource attribute consistency. Both your traces and your logs should carry the same service.name, service.namespace, deployment.environment, and k8s.namespace.name. Without that, Grafana cannot construct a sensible “Show logs” query when you click on a span.
In an OTel SDK setup, resource attributes are configured once and shared by all three signals:
# Environment / config
OTEL_RESOURCE_ATTRIBUTES=service.name=payments-api,service.namespace=checkout,deployment.environment=prod
That single environment variable populates the Resource section of every trace, metric, and LogRecord emitted by the process. The Collector can further upsert cluster-level attributes that the application cannot know about:
processors:
  resource/cluster:
    attributes:
      - key: k8s.cluster.name
        value: prod-us-east-1
        action: upsert
      - key: cloud.region
        value: us-east-1
        action: upsert
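The two layers, process-level attributes from the environment variable and cluster-level attributes upserted by the Collector, can be modeled in a few lines of Python. Both function names are hypothetical, and the parser is deliberately simplified: it assumes plain comma-separated key=value pairs and does not handle percent-encoded values.

```python
def parse_resource_attributes(raw: str) -> dict:
    """Parse an OTEL_RESOURCE_ATTRIBUTES-style string into a dict (simplified)."""
    attributes = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue
        key, _, value = pair.partition("=")
        attributes[key.strip()] = value.strip()
    return attributes


def upsert(resource: dict, extra: dict) -> dict:
    """Model the upsert action: insert missing keys, overwrite existing ones."""
    merged = dict(resource)
    merged.update(extra)
    return merged


app_resource = parse_resource_attributes(
    "service.name=payments-api,service.namespace=checkout,deployment.environment=prod"
)
final_resource = upsert(
    app_resource,
    {"k8s.cluster.name": "prod-us-east-1", "cloud.region": "us-east-1"},
)
print(final_resource)
```

Since every signal from the process starts from the same parsed resource, traces, metrics, and logs end up with identical values for the attributes Grafana later joins on.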
9.3.3 The Loki + Tempo + Grafana Pivot
Grafana Loki and Tempo are the open-source duo that pioneered the trace-to-logs UX. The high-level flow:
Figure 9.3: Trace-to-logs and logs-to-trace pivot in Grafana
sequenceDiagram
participant App as Application
participant OC as OTel Collector
participant Loki as Loki (Logs)
participant Tempo as Tempo (Traces)
participant G as Grafana
participant U as User
App->>OC: OTLP logs + traces (shared trace_id)
OC->>Loki: Logs pipeline
OC->>Tempo: Traces pipeline
Note over U,G: Trace -> Logs pivot
U->>G: Open trace in Tempo
G->>Tempo: Fetch spans
Tempo-->>G: Spans with trace_id
U->>G: Click "Show logs"
G->>Loki: LogQL {service="payments-api"} <br/>| json | trace_id="8f1b..."
Loki-->>G: Matching log lines
G-->>U: Render trace + logs timeline
Note over U,G: Logs -> Trace pivot
U->>G: Click trace_id in log line
G->>Tempo: Lookup trace by trace_id
Tempo-->>G: Full trace
G-->>U: Render trace view
The Collector configuration:
exporters:
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    default_labels_enabled:
      exporter: false
      job: false
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource/cluster]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp, filelog]
      processors: [k8sattributes, resource/cluster, transform/normalize-trace-ids, batch]
      exporters: [loki]
Then on the Grafana side, the Loki data source defines a derived field that converts the trace_id attribute of any log line into a clickable link to the Tempo data source:
- Name: trace_id
- Regex: "trace_id"\s*:\s*"([a-f0-9]{32})"
- URL/Internal link: Tempo
- Query: ${__value.raw}
The Tempo data source gets the reverse mapping under Trace to logs: pick Loki as the logs source, list the tags whose values should match (service.name, k8s.namespace.name, deployment.environment), and define the label mapping (service.name → service).
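The derived-field regex is easy to try out before putting it into Grafana. The sketch below applies the same pattern with Python's re module; the helper name and the sample log lines are made up for illustration.

```python
import json
import re

# Same pattern as the derived field: only lowercase, exactly 32 hex characters.
TRACE_ID_PATTERN = re.compile(r'"trace_id"\s*:\s*"([a-f0-9]{32})"')


def extract_trace_id(log_line: str):
    """Return the trace_id captured by the derived-field regex, or None."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


good_line = json.dumps({
    "level": "ERROR",
    "msg": "Payment charge failed",
    "trace_id": "8f1b5fe2d5de4a51b8884f8f4cdde3f5",
})
upper_line = good_line.replace(
    "8f1b5fe2d5de4a51b8884f8f4cdde3f5", "8F1B5FE2D5DE4A51B8884F8F4CDDE3F5"
)

print(extract_trace_id(good_line))   # link would be rendered
print(extract_trace_id(upper_line))  # no match, so no link
```

The second case shows why casing matters: an uppercase ID does not match the pattern, and the log line simply shows no link.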
A critical performance note: never index trace_id as a Loki label. Each unique trace becomes a label value, exploding cardinality and devastating Loki’s index. Keep trace_id as a structured field within the log body, and rely on derived fields plus | json filtering at query time [Source: https://grafana.com/docs/tempo/latest/configuration/grafana-agent/]. Loki labels should be bounded-cardinality attributes only: service, environment, namespace, pod.
The resulting Grafana query, generated automatically when a user clicks “Show logs” on a span:
{service="payments-api", env="prod"} | json | trace_id = "8f1b5fe2d5de4a51b8884f8f4cdde3f5"
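How such a query could be assembled from a span's resource attributes is sketched below. This is an illustration of the idea, not Grafana's implementation: the function name is hypothetical, and the mapping of deployment.environment to an env label is an assumption chosen to reproduce the query above.

```python
def build_trace_to_logs_query(resource: dict, label_mapping: dict, trace_id: str) -> str:
    """Build a LogQL query: bounded-cardinality labels in the selector,
    trace_id as a parsed JSON field filter."""
    matchers = [
        f'{label}="{resource[attr]}"'
        for attr, label in label_mapping.items()
        if attr in resource
    ]
    selector = "{" + ", ".join(matchers) + "}"
    return f'{selector} | json | trace_id = "{trace_id}"'


query = build_trace_to_logs_query(
    resource={"service.name": "payments-api", "deployment.environment": "prod"},
    label_mapping={"service.name": "service", "deployment.environment": "env"},
    trace_id="8f1b5fe2d5de4a51b8884f8f4cdde3f5",
)
print(query)
```

Note where each value ends up: the service and environment go into the indexed label selector, while the trace_id is only evaluated after the json stage, which keeps it out of Loki's index.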
Key Takeaway: Cross-signal correlation requires three aligned pieces: every log carries the same trace_id/span_id as the trace, both signals share service.name-style resource attributes, and Grafana is configured with Loki derived fields and Tempo trace-to-logs. Without all three, the UX silently degrades to “open two tabs and grep.”
9.4 Events and Span Events
The final corner of the logs story is the most confusing for newcomers: OpenTelemetry has two things both called “events.” A LogRecord in the logs signal is one. A span event in the traces signal is the other. They look superficially similar — a timestamp, a name or message, attributes — but they live in different pipelines and obey different rules.
9.4.1 Span Events vs. LogRecords
A span event is a timestamped annotation attached inside a span. It has no independent identity; it is shipped as part of its parent span over the trace pipeline, inheriting that span’s trace_id and span_id automatically. Span events have a name (such as "exception" or "retry") and attributes, but no severity level [Source: https://opentelemetry.io/docs/specs/otel/logs/data-model/].
A LogRecord is a first-class log entry. It has its own severity, its own body, and may exist without any span context at all (startup messages, cron job output, background workers). When it does have trace context, that context is carried as explicit trace_id/span_id fields.
Here is a comparison table that captures the distinctions most likely to trip people up:
| Aspect | LogRecord | Span Event |
|---|---|---|
| Signal | Logs | Traces |
| Independent identity | Yes — exists on its own | No — lives inside a span |
| Severity | Yes (severity_number, severity_text) | No |
| Body | Yes (text or structured) | No (just name + attributes) |
| Trace correlation | Optional, via explicit trace_id/span_id | Automatic — inherits from parent span |
| Pipeline | Logs pipeline (OTLP logs, Loki, ELK, Splunk) | Trace pipeline (OTLP traces, Tempo, Jaeger) |
| Affected by trace sampling | No (separate sampling) | Yes — dropped if span is dropped |
| Volume profile | High; designed for log-scale backends | Low; embedded in spans |
| Retention | Typically days to months | Typically hours to days |
| Best for | “What is the app doing over time?” | “What happened inside this operation?” |
The sampling row is the most operationally important. If your tracing pipeline samples at 1%, 99% of span events disappear. If you need to keep an event around for postmortems no matter what, it must be a LogRecord.
Figure 9.4: Span events vs. LogRecords — different pipelines, joined by trace_id
flowchart LR
subgraph Traces["Trace pipeline (sampled)"]
SP["Span: GET /orders/{id}<br/>trace_id=8f1b...e3f5<br/>span_id=d2a4...b0ce"]
SP -.- E1["Span event: exception<br/>exception.type=NPE"]
SP -.- E2["Span event: cache.miss<br/>cache.key=user:42"]
SP --> Tempo["Tempo"]
end
subgraph Logs["Logs pipeline (unsampled)"]
L1["LogRecord<br/>severity=ERROR<br/>body=Order lookup failed<br/>trace_id=8f1b...e3f5"]
L2["LogRecord<br/>severity=INFO<br/>body=checkout.completed"]
L1 --> Loki["Loki"]
L2 --> Loki
end
L1 -. "shared trace_id<br/>(cross-signal join)" .-> SP
Figure 9.5: Decision tree — LogRecord vs. span event
flowchart TD
Start["New piece of information<br/>to capture"] --> Q1{"Must survive even if<br/>trace is sampled out?"}
Q1 -->|Yes| LR["Emit as LogRecord"]
Q1 -->|No| Q2{"Describes a moment<br/>inside one span's operation?"}
Q2 -->|Yes| Q3{"Searched in logs backend?<br/>(audit, business, security)"}
Q2 -->|No| LR
Q3 -->|Yes| Both["Emit BOTH:<br/>LogRecord + span event"]
Q3 -->|No| SE["Add as span event<br/>(exception / retry / state)"]
LR --> Loki2["Logs pipeline -> Loki"]
SE --> Tempo2["Trace pipeline -> Tempo"]
Both --> Loki2
Both --> Tempo2
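The decision tree in Figure 9.5 can also be written as a small function, which is handy for encoding the rule in a shared logging helper. The function and parameter names are hypothetical; the branch order follows the figure.

```python
def choose_signal(must_survive_sampling: bool,
                  inside_span_operation: bool,
                  searched_in_logs_backend: bool) -> set:
    """Return which signal(s) to emit, following the three questions in order."""
    if must_survive_sampling:
        return {"log_record"}
    if not inside_span_operation:
        return {"log_record"}
    if searched_in_logs_backend:
        return {"log_record", "span_event"}
    return {"span_event"}


# A retry inside a span that nobody searches for in the logs backend.
print(choose_signal(False, True, False))
# An in-span moment that audit or security teams also search for in logs.
print(choose_signal(False, True, True))
```

The first question dominates: anything that must survive trace sampling goes to the logs pipeline regardless of the other answers.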
9.4.2 Span Events: Annotating Operations
Span events shine when you want to capture moments within an operation without inflating your tracing schema with extra spans. Typical uses:
- Exceptions: The OTel semantic convention defines an "exception" span event with exception.type, exception.message, exception.stacktrace, and exception.escaped [Source: https://opentelemetry.io/docs/specs/semconv/exceptions/exceptions-spans/]. Most language SDKs add this automatically when an unhandled exception propagates out of a span.
- Retries: A "retry" event with retry.count, retry.delay_ms, retry.reason lets you see the sub-rhythm of a flaky downstream call without each retry becoming its own span.
- State transitions: "connection.established", "circuit_breaker.opened", "feature_flag.evaluated".
- Cache hits/misses: A "cache.hit" or "cache.miss" event on the database span is more compact than a separate cache span.
A Java example, adding a retry event:
Span span = Span.current();
span.addEvent("retry", Attributes.builder()
.put("retry.count", attempt)
.put("retry.delay_ms", backoffMs)
.put("retry.reason", "timeout")
.build());
9.4.3 Domain Events: Product Analytics in the Logs Pipeline
A third use of the term “event” comes from product analytics: a “user.signed_up” or “checkout.completed” record meant for product dashboards. In OpenTelemetry, these are best modeled as LogRecords with a specific naming convention (event.name attribute, often event.domain to namespace them):
{
"timestamp": "2025-03-10T10:15:30.123Z",
"severity_number": 9,
"severity_text": "INFO",
"body": "Checkout completed",
"attributes": {
"event.name": "checkout.completed",
"event.domain": "commerce",
"order.id": "ord_8f3a",
"order.total_cents": 4200,
"user.id": "12345",
"service.name": "checkout-api"
},
"trace_id": "8f1b5fe2d5de4a51b8884f8f4cdde3f5",
"span_id": "d2a41c3ff7a1b0ce"
}
The advantage: domain events ride the logs pipeline (long retention, durable, unsampled) while still carrying trace context for forensic analysis when something goes wrong.
9.4.4 Decision Heuristic: Which Signal?
For a given piece of information, ask:
| Question | Lean toward |
|---|---|
| “Will I search this in the logs backend?” | LogRecord |
| “Does it describe something inside a span’s operation?” | Span event |
| “Do I need this even if the trace is sampled out?” | LogRecord |
| “Is it a retry, exception, or state transition in a specific operation?” | Span event |
| “Is it a high-volume cross-cutting signal (audit, security, business)?” | LogRecord |
| “Is it a rare, notable moment only meaningful within a single trace?” | Span event |
For the most important incidents — a 500 from a payment service — emit both: a LogRecord captures the cross-service debugging detail and survives sampling; a span event on the relevant span keeps the in-trace view rich for engineers diving into Tempo. The two are correlated automatically by shared trace_id/span_id.
Key Takeaway: Use span events for in-operation milestones (exceptions, retries, state changes) that benefit from automatic trace correlation, and use LogRecords for cross-cutting, high-volume, or sampling-resistant signals — and for the most important events, emit both, knowing they will join up automatically via shared trace context.
Chapter Summary
Logs in 2025 are no longer a separate world from traces and metrics — they are the third stable signal in OpenTelemetry, with a portable LogRecord schema and a mature Collector pipeline. The headline shifts from this chapter are:
- The LogRecord schema (timestamp, severity, body, attributes, trace_id/span_id, resource) is stable. Native SDK support is strong in Java and .NET, usable in Python, and still experimental in Go and Node.js; for the latter, inject trace_id/span_id manually into your existing logger.
- Log bridges keep developer ergonomics intact. Code keeps using Logback, logging, ILogger, or winston; the bridge converts events to OTel LogRecords with resource and trace context attached.
- Edge collection does not require a rip-and-replace. The Collector’s filelog, journald, and fluentforward receivers, plus the k8sattributes processor, let new OTLP-native services coexist with Fluent Bit/Fluentd/Vector for as long as needed.
- Cross-signal correlation demands three aligned things: identical trace_id/span_id on every log line, shared resource attributes across signals, and Grafana data sources configured with Loki derived fields and Tempo “Show logs.”
- Cardinality discipline: keep trace_id as a log field, never a Loki label. Labels are for bounded-cardinality dimensions (service, environment, namespace).
- Span events vs LogRecords: span events are in-trace annotations that disappear under sampling; LogRecords are durable, queryable, severity-bearing entries. Use both for important incidents.
The recurring theme is that observability is not three siloed pipelines but one cross-referenced data graph, and the trace_id/span_id pair is the glue that holds it together.
Key Terms
| Term | Definition |
|---|---|
| LogRecord | The atomic unit of the OpenTelemetry logs signal, with fields for timestamp, severity, body, attributes, trace_id/span_id, and resource context |
| Log bridge | An adapter (appender, handler, provider) that converts a logging framework’s events into OpenTelemetry LogRecords and exports them via OTLP |
| Filelog receiver | An OpenTelemetry Collector receiver that tails log files on disk, parses each line, and emits LogRecords — essential for legacy or third-party applications |
| Span event | A timestamped annotation inside a span (no independent identity, no severity), used to capture milestones such as retries or exceptions within an operation |
| Trace ID | A 128-bit identifier (32 hex characters) that ties all signals related to a single distributed operation together |
| Span ID | A 64-bit identifier (16 hex characters) for a single span within a trace, used along with trace_id to correlate logs to specific operations |
| Loki | Grafana’s log aggregation system, designed around indexed labels (bounded cardinality) plus full-text and JSON field search at query time |
| Structured logging | The practice of emitting logs as machine-parseable key-value records (typically JSON) rather than free-form text, enabling reliable filtering and correlation |
| Resource attributes | Service-level metadata (service.name, deployment.environment, k8s.namespace.name) shared across traces, metrics, and logs to make cross-signal queries possible |
| Derived field | A Grafana Loki data source feature that uses a regex to extract a value (like trace_id) from log lines and turn it into a clickable link to another data source |
| Trace-to-logs | A Grafana Tempo data source feature that builds a Loki query from a span’s resource attributes and trace_id, allowing one-click pivot from trace to logs |
Chapter 10: The OpenTelemetry Collector in Depth
The OpenTelemetry Collector is the centerpiece of any non-trivial observability deployment. It is a vendor-neutral, pluggable data plane that receives telemetry from applications and infrastructure, transforms it on the fly, and exports it to one or more backends — Prometheus, Loki, Tempo, Jaeger, SaaS observability vendors, or all of the above simultaneously. Think of the Collector as the USB-C hub of observability: a single device that lets dozens of input cables (receivers) flow through programmable adapters (processors) into dozens of output ports (exporters), without forcing applications or backends to know about one another.
This chapter takes a deep look at how the Collector is composed, how to shape data with processors like transform and tail_sampling, which receivers and exporters you will reach for most often, and how to operate it reliably under load.
Learning Objectives
By the end of this chapter you will be able to:
- Compose a Collector pipeline from receivers, processors, exporters, and extensions
- Apply transform, filter, batch, and tail_sampling processors to control cost and shape data
- Operate the Collector reliably with health checks, queuing, and back-pressure tuning
10.1 Pipeline Architecture
The Collector is not a single black box; it is a configurable pipeline engine built from four kinds of components plus optional extensions. Every piece of telemetry flowing through the Collector takes the same conceptual journey: it enters through a receiver, passes through a chain of processors, and leaves through one or more exporters. Pipelines are declared per signal type (traces, metrics, logs), and the service block is what actually wires the components together into runnable pipelines.
Figure 10.0: Canonical Collector pipeline anatomy
flowchart LR
R1[OTLP receiver] --> P1[memory_limiter]
R2[Prometheus receiver] --> P1
P1 --> P2[k8sattributes / resource]
P2 --> P3[filter / tail_sampling]
P3 --> P4[transform OTTL]
P4 --> P5[batch]
P5 --> E1[OTLP exporter]
P5 --> E2[prometheusremotewrite]
P5 --> E3[debug]
Figure 10.1: Multi-signal pipeline topology
graph TD
subgraph Extensions
HC[health_check]
PP[pprof]
ZP[zpages]
end
subgraph "traces pipeline"
TR[otlp receiver] --> TML[memory_limiter]
TML --> TK8S[k8sattributes]
TK8S --> TB[batch]
TB --> TE[otlp/tempo exporter]
end
subgraph "metrics pipeline"
MR[otlp receiver] --> MML[memory_limiter]
MML --> MK8S[k8sattributes]
MK8S --> MB[batch]
MB --> ME[prometheusremotewrite exporter]
end
subgraph "logs pipeline"
LR[otlp receiver] --> LML[memory_limiter]
LML --> LK8S[k8sattributes]
LK8S --> LB[batch]
LB --> LE[loki exporter]
end
10.1.1 The four component types (plus extensions)
| Component | Role | Examples |
|---|---|---|
| Receiver | Accepts data in (push) or pulls data from a source | otlp, prometheus, hostmetrics, filelog, kafka |
| Processor | Mutates, filters, batches, samples, or enriches data in flight | memory_limiter, batch, transform, tail_sampling |
| Exporter | Sends data to one or more backends | otlp, prometheusremotewrite, loki, debug |
| Connector | Joins two pipelines: acts as an exporter on one side and a receiver on the other | spanmetrics, routing, forward |
| Extension | Non-pipeline capabilities (health, profiling, debugging) | health_check, pprof, zpages, file_storage |
Connectors are a relatively recent addition and are the cleanest way to derive one signal from another — for example, generating RED metrics (Rate, Errors, Duration) from spans by piping traces into a spanmetrics connector and out as metrics on a separate pipeline.
10.1.2 Pipelines per signal type
Each pipeline is strictly typed: a traces pipeline can only contain receivers, processors, and exporters that understand spans. The same is true for metrics and logs. You can declare multiple pipelines of the same type (traces/internal, traces/external) to apply different processing to different streams.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 800
    spike_limit_mib: 200
  batch:
    timeout: 5s
    send_batch_size: 512
exporters:
  otlp/tempo:
    endpoint: tempo-distributor.observability:4317
    tls: { insecure: true }
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  loki:
    endpoint: http://loki-gateway/loki/api/v1/push
extensions:
  health_check:
  pprof:
  zpages:
service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
The service.pipelines section is the contract that turns isolated component definitions into a running data plane. A component declared in receivers, processors, or exporters but not referenced in service.pipelines is silently ignored.
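Because unreferenced components fail silently, a small lint step can catch them before deployment. The following is a minimal sketch, not part of the Collector: it assumes the configuration has already been parsed into a Python dict, and the helper name `unreferenced_components` is hypothetical.

```python
def unreferenced_components(config: dict) -> dict:
    """Return components that are declared but used by no pipeline."""
    pipelines = config.get("service", {}).get("pipelines", {})
    unused = {}
    for kind in ("receivers", "processors", "exporters"):
        declared = set(config.get(kind) or {})
        used = {name for p in pipelines.values() for name in p.get(kind, [])}
        leftover = declared - used
        if leftover:
            unused[kind] = leftover
    return unused


# Illustrative config: "prometheus" and "debug" are declared but never wired in.
config = {
    "receivers": {"otlp": {}, "prometheus": {}},
    "processors": {"memory_limiter": {}, "batch": {}},
    "exporters": {"otlp/tempo": {}, "debug": {}},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["otlp/tempo"],
            }
        }
    },
}

print(unreferenced_components(config))
```

Running a check like this in CI is one way to notice a receiver or exporter that someone defined but forgot to add to a pipeline.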
10.1.3 Processor order matters
Processors run in the order listed in the pipeline. This is one of the most consequential, and most commonly overlooked, properties of Collector configuration. As a rule of thumb [Source: https://kodekloud.com/blog/kubernetes-best-practices-2025/]:
- memory_limiter always first, so back-pressure kicks in before later, more expensive processors waste CPU.
- Enrichment processors next (e.g., k8sattributes, resource), so downstream filters can see the full context.
- Filter / sampling next: drop unwanted data before transforms touch it.
- transform / scrubbing: reshape what’s left.
- batch last: coalesce into large outbound batches just before the exporter.
Key Takeaway: A Collector pipeline is a typed chain of receivers, processors, and exporters wired together in service.pipelines. Component order, not just component choice, is what defines behavior.
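The ordering rule of thumb can be turned into a simple check. This is a sketch under stated assumptions: the rank table and the function `order_violations` are illustrative, not a Collector feature, and a flagged pair is only a hint, since some deviations (such as transform before filter) can be deliberate.

```python
# Rank of each processor in the rule-of-thumb order; lower runs earlier.
# Processors that are not listed are ignored by the check.
RECOMMENDED_RANK = {
    "memory_limiter": 0,
    "k8sattributes": 1,
    "resource": 1,
    "filter": 2,
    "tail_sampling": 2,
    "probabilistic_sampler": 2,
    "transform": 3,
    "batch": 4,
}


def order_violations(processors: list) -> list:
    """Return (earlier, later) pairs whose relative order breaks the rule of thumb."""
    ranked = []
    for name in processors:
        base = name.split("/", 1)[0]  # "filter/web" -> "filter"
        if base in RECOMMENDED_RANK:
            ranked.append((name, RECOMMENDED_RANK[base]))
    violations = []
    for i, (first, first_rank) in enumerate(ranked):
        for second, second_rank in ranked[i + 1:]:
            if first_rank > second_rank:
                violations.append((first, second))
    return violations


print(order_violations(["batch", "memory_limiter", "transform"]))
```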
10.2 Key Processors
Processors are where most of the Collector’s intelligence lives. Two of them — batch and memory_limiter — are effectively mandatory in production; the rest you reach for as your needs grow.
10.2.1 memory_limiter and batch — the mandatory pair
The memory_limiter processor measures the Collector’s own memory usage on a fixed interval and, when usage crosses configurable thresholds, refuses new data (returning errors to receivers). This is what creates back-pressure: instead of the Collector silently dying from an out-of-memory kill, upstream senders see failures, retry, and slow down [Source: https://kubernetes.io/docs/setup/best-practices/cluster-large/].
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 800 # ~80% of container memory limit
    spike_limit_mib: 200 # tolerance for short bursts
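To see how the two thresholds interact, here is a simplified model of the decision the processor makes on each check. It assumes that refusal starts at a soft limit of limit_mib minus spike_limit_mib and that crossing limit_mib additionally triggers garbage collection; the class and the return labels are hypothetical, and the real processor measures Go heap usage rather than taking a number as input.

```python
from dataclasses import dataclass


@dataclass
class MemoryLimiterModel:
    limit_mib: int
    spike_limit_mib: int

    @property
    def soft_limit_mib(self) -> int:
        # Assumption: data is refused once usage passes limit_mib - spike_limit_mib.
        return self.limit_mib - self.spike_limit_mib

    def decide(self, used_mib: int) -> str:
        if used_mib >= self.limit_mib:
            return "refuse+gc"  # hard limit: refuse and force a garbage collection
        if used_mib >= self.soft_limit_mib:
            return "refuse"     # soft limit: push back on receivers
        return "accept"


limiter = MemoryLimiterModel(limit_mib=800, spike_limit_mib=200)
for used in (500, 650, 820):
    print(used, limiter.decide(used))
```

With the values from the config above, the model starts refusing data at 600 MiB, which leaves 200 MiB of headroom for a burst that arrives between two checks.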
The batch processor groups telemetry into larger payloads before they reach an exporter. Larger batches mean fewer gRPC/HTTP calls and dramatically better throughput, at the cost of a few seconds of added latency.
processors:
  batch:
    timeout: 5s
    send_batch_size: 512
    send_batch_max_size: 4096
Analogy: batch is a hotel shuttle that waits up to five minutes (or until the seats are full) before driving to the airport — much more efficient than running an empty taxi for every guest. memory_limiter is the bouncer at the lobby door who turns guests away when the lobby is full, so the building never collapses.
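The shuttle behavior (leave when full, or when the wait is over) can be sketched as a toy batcher. This is an illustration of the flush-on-size-or-timeout idea only; the class `Batcher` and its methods are hypothetical, the clock is passed in explicitly to keep the example deterministic, and send_batch_max_size is not modeled.

```python
from typing import Optional


class Batcher:
    """Toy batcher: flush on size or on timeout, whichever comes first."""

    def __init__(self, send_batch_size: int, timeout_s: float):
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self.items: list = []
        self.first_item_at: Optional[float] = None
        self.flushed: list = []  # batches handed to the (imaginary) exporter

    def add(self, item, now: float) -> None:
        if not self.items:
            self.first_item_at = now  # the timeout clock starts with the first item
        self.items.append(item)
        if len(self.items) >= self.send_batch_size:
            self._flush()

    def tick(self, now: float) -> None:
        """Called periodically; flushes a partial batch once the timeout has passed."""
        if self.items and now - self.first_item_at >= self.timeout_s:
            self._flush()

    def _flush(self) -> None:
        self.flushed.append(self.items)
        self.items = []
        self.first_item_at = None


b = Batcher(send_batch_size=3, timeout_s=5.0)
for i in range(4):
    b.add(f"span-{i}", now=0.0)  # the third span fills the batch
b.tick(now=1.0)                  # too early for the leftover span
b.tick(now=5.0)                  # timeout reached, partial batch goes out
print([len(batch) for batch in b.flushed])
```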
10.2.2 attributes, resource, and transform (OTTL)
The attributes and resource processors handle straightforward add/update/delete operations on attribute keys. For anything richer — conditional logic, regex substitution, cross-field arithmetic — reach for the transform processor, which uses the OpenTelemetry Transformation Language (OTTL) [Source: https://arxiv.org/html/2501.11709v3].
OTTL is an expression-based DSL. A statement looks like a function call with an optional where clause:
set(target, value) where <boolean condition>
Statements run inside a context — span, metric, datapoint, log, resource, or scope — and can read or modify fields like attributes["key"], name, body, severity_text, and resource.attributes["key"].
Here is a transform block that normalizes HTTP routes (so /users/42 and /users/9000 are aggregated together), scrubs PII, and tags every span emitted by the checkout service:
processors:
  transform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # Collapse user IDs in URL paths so cardinality stays bounded
          - replace_pattern(attributes["http.target"], "/users/[0-9]+", "/users/:id") where attributes["http.target"] != nil
          # Remove PII before exporting
          - delete_key(attributes, "user.email")
          - delete_key(attributes, "user.id")
          # Whitelist what's allowed to leave
          - keep_keys(attributes, ["http.method", "http.target", "http.status_code", "service.name"])
          # Mark anything from the checkout service
          - set(attributes["env"], "prod") where resource.attributes["service.name"] == "checkout-service"
The error_mode knob is important: propagate (the default in some versions) can fail an entire batch when a single statement errors; ignore is the safer choice in most production pipelines.
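If OTTL is new to you, it can help to see what the route normalization and the whitelist amount to in ordinary code. The sketch below mirrors the effect of the replace_pattern and keep_keys statements on a plain dict of span attributes; it is an analogy in Python, not how the Collector executes OTTL, and the function names are hypothetical.

```python
import re

USER_ID = re.compile(r"/users/[0-9]+")
ALLOWED = {"http.method", "http.target", "http.status_code", "service.name"}


def normalize_route(target: str) -> str:
    """Collapse numeric user IDs so /users/42 and /users/9000 share one value."""
    return USER_ID.sub("/users/:id", target)


def scrub(attributes: dict) -> dict:
    """Keep only whitelisted keys, then normalize the route if present."""
    kept = {k: v for k, v in attributes.items() if k in ALLOWED}
    if "http.target" in kept:
        kept["http.target"] = normalize_route(kept["http.target"])
    return kept


span_attributes = {
    "http.method": "GET",
    "http.target": "/users/42/orders",
    "user.email": "someone@example.com",
}
print(scrub(span_attributes))
```

Note that the whitelist alone already removes user.email and user.id; the explicit delete_key statements in the config are a belt-and-braces measure.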
A close cousin is the filter processor, which uses OTTL conditions to drop data outright. Filtering health-check spans is a classic use:
processors:
  filter/web:
    error_mode: ignore
    traces:
      span:
        # Any span for which this OTTL condition is true is dropped
        - 'name == "/healthz" or attributes["http.target"] == "/healthz"'
Order matters here too. Putting filter before transform saves CPU because dropped data never gets reshaped. Putting transform before filter lets filters operate on normalized values. Pick the order that matches your intent.
10.2.3 tail_sampling and probabilistic_sampler
Sampling is the cost-control lever for traces. The probabilistic_sampler makes a quick, stateless decision (e.g., “keep 5%”) based on the trace ID. It is cheap and predictable but blind — it cannot prefer error or slow traces over normal ones.
The tail_sampling processor is fundamentally different: it buffers all spans for a trace in memory, keyed by trace ID, and decides whether to keep or drop the entire trace once decision_wait has elapsed since the trace’s first span arrived [Source: https://news.ycombinator.com/item?id=44095189]. The Collector has no way of knowing that a trace is complete, so decision_wait must be long enough for late spans to show up. Because it sees the whole trace, it can sample based on end-to-end latency, final status code, or attributes that only appear on a leaf span.
A production-grade tail sampling policy usually combines several rules through a composite policy with priority ordering:
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 2000
    policies:
      - name: main
        type: composite
        composite:
          max_total_spans_per_second: 1000
          policy_order: [error-traces, slow-traces, premium-tenants, baseline]
          # Sub-policies are a list of named policies; a rate_allocation list
          # (policy + percent) can split the span budget between them.
          composite_sub_policy:
            - name: error-traces
              type: status_code
              status_code:
                status_codes: [ERROR]
            - name: slow-traces
              type: latency
              latency:
                threshold_ms: 4000
            - name: premium-tenants
              type: string_attribute
              string_attribute:
                key: tenant.tier
                values: ["gold", "platinum"]
            - name: baseline
              type: probabilistic
              probabilistic:
                sampling_percentage: 1.0
This config prioritizes error traces first, then traces slower than 4 seconds, then everything from premium tenants, and falls back to 1% random sampling for the rest. All of it is subject to an overall budget of 1,000 spans/sec, so under heavy load even high-priority traces can be trimmed once the budget is spent.
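The priority order is easier to reason about when written as plain code. The following is a minimal sketch of a per-trace decision, under several assumptions: it ignores the spans-per-second budget, it approximates trace latency by the longest single span, and the `Span` type and function names are hypothetical rather than anything the Collector exposes.

```python
import random
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    status: str = "OK"
    attributes: dict = field(default_factory=dict)


def group_by_trace(spans: list) -> dict:
    """Buffer spans by trace ID, as the processor does while waiting."""
    buffer = defaultdict(list)
    for span in spans:
        buffer[span.trace_id].append(span)
    return dict(buffer)


def matching_policy(spans: list, baseline_pct: float, rng: random.Random) -> Optional[str]:
    """Return the first policy (in priority order) that keeps the trace, or None."""
    if any(s.status == "ERROR" for s in spans):
        return "error-traces"
    if max(s.duration_ms for s in spans) >= 4000:  # simplification of trace latency
        return "slow-traces"
    if any(s.attributes.get("tenant.tier") in ("gold", "platinum") for s in spans):
        return "premium-tenants"
    if rng.random() * 100 < baseline_pct:
        return "baseline"
    return None  # no policy matched: the whole trace is dropped


rng = random.Random(7)
spans = [
    Span("t1", 120), Span("t1", 35, status="ERROR"),
    Span("t2", 4500),
    Span("t3", 80, attributes={"tenant.tier": "gold"}),
    Span("t4", 60),
]
for trace_id, trace_spans in group_by_trace(spans).items():
    print(trace_id, matching_policy(trace_spans, baseline_pct=1.0, rng=rng))
```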
A crucial gotcha: tail sampling only works if SDKs export spans unsampled (always_on or parentbased(always_on)). If the SDK already dropped the spans before they reached the Collector, no policy can resurrect them [Source: http://susandumais.com/CHI2012-12-tailanswers-chi2012.pdf].
Figure 10.2: Tail sampling decision flow
sequenceDiagram
participant SDK as "SDK always_on"
participant Col as "Collector tail_sampling"
participant Buf as "Trace buffer"
participant Pol as "Policy evaluator"
participant BE as "Backend Tempo"
SDK->>Col: Span A trace T1 root
Col->>Buf: Buffer T1 spans
SDK->>Col: Span B trace T1 child
Col->>Buf: Buffer T1 spans
SDK->>Col: Span C trace T1 error
Col->>Buf: Buffer T1 spans
Note over Col,Buf: Wait decision_wait 10s
Col->>Pol: Evaluate composite policies
Pol->>Pol: error-traces matches? YES
Pol-->>Col: KEEP trace T1
Col->>BE: Export all T1 spans
Note over Col: Traces matching no policy<br/>are dropped before export
| Aspect | Head/parent-based sampler | tail_sampling processor |
|---|---|---|
| Decision point | Root span start | After decision_wait in Collector |
| Sees full trace | No | Yes |
| Can prefer errors / slow traces | No | Yes |
| SDK overhead | Low (drops at source) | High (must export everything) |
| Collector memory & CPU | Minimal | Substantial (buffers spans) |
10.2.4 k8sattributes — Kubernetes enrichment
The k8sattributes processor watches the Kubernetes API and decorates incoming telemetry with metadata about the pod that sent it: namespace, deployment name, node, labels, annotations. It identifies the sender either by inbound connection IP or by an explicit k8s.pod.ip resource attribute.
processors:
  k8sattributes:
    auth_type: serviceAccount
    filter:
      node_from_env_var: K8S_NODE_NAME
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.node.name
        - k8s.pod.uid
        - k8s.deployment.name
    pod_association:
      # Each entry is tried in order; the first one that resolves a pod wins
      - sources:
          - from: connection
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
Run k8sattributes on the agent (DaemonSet) Collector — the one that receives data directly from local pods — where connection-based association still works. Running it on a central gateway is rarely useful because the source IP it sees is the agent’s, not the application pod’s. Limit the extract.metadata list to what you actually query on; every additional field multiplies cardinality and API-server load.
Key Takeaway: memory_limiter and batch are non-negotiable in production; transform (with OTTL), filter, and tail_sampling reshape and sample data; k8sattributes enriches with Kubernetes context, but only when run close to the source.
10.3 Key Receivers and Exporters
The Collector’s strength is its plug-and-play library of receivers and exporters. A handful cover the vast majority of real-world deployments.
10.3.1 Workhorse receivers
| Receiver | What it ingests | Typical use |
|---|---|---|
| otlp | OTLP/gRPC and OTLP/HTTP (traces, metrics, logs) | Default for SDK and Collector-to-Collector traffic |
| prometheus | Scrapes Prometheus /metrics endpoints | Migrating from Prometheus, scraping exporters |
| hostmetrics | OS-level CPU, memory, disk, network, filesystem, process | Node-agent monitoring (CPU%, memory, load average) |
| filelog | Tails log files with multiline and parser support | Collecting container logs from /var/log/pods on the node |
| kafka | Reads OTLP-encoded data from Kafka topics | Decoupling ingest from processing |
| jaeger, zipkin | Legacy formats for migration | Brownfield environments still emitting Jaeger/Zipkin |
A common DaemonSet receiver block looks like this:
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      filesystem:
      network:
      load:
  filelog:
    include: ["/var/log/pods/*/*/*.log"]
    start_at: end
    include_file_path: true
    operators:
      - type: container
The prometheus receiver is worth a special mention: it accepts native Prometheus scrape config, which means an existing prometheus.yml can be lifted into the Collector almost verbatim — a powerful migration path when teams want to start consolidating their data plane without rewriting their scrape rules.
10.3.2 Workhorse exporters
| Exporter | Destination | Notes |
|---|---|---|
| otlp | Any OTLP-compatible backend (Tempo, Jaeger, vendors) | Default; speaks OTLP/gRPC (use otlphttp for OTLP/HTTP) |
| prometheusremotewrite | Prometheus, Mimir, Cortex, Thanos | For metrics fan-out into the Prometheus ecosystem |
| loki | Grafana Loki | Logs only; attribute-to-label mapping is configurable |
| debug | stdout of the Collector itself | Replaces the older logging exporter; invaluable for dev |
| kafka | Kafka topic, OTLP-encoded | Pairs with the kafka receiver to build buffered pipelines |
| file | Local file (JSON) | Disaster-recovery sink, offline replay |
A production gateway exporter block typically configures both queueing and retry on the otlp exporter:
exporters:
  otlp/tempo:
    endpoint: tempo-distributor.observability:4317
    tls: { insecure: true }
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 2000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 60s
      max_elapsed_time: 0 # retry forever
  prometheusremotewrite:
    endpoint: http://mimir.observability:8080/api/v1/push
    resource_to_telemetry_conversion: { enabled: true }
  loki:
    endpoint: http://loki-gateway/loki/api/v1/push
  debug:
    verbosity: basic
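Setting resource_to_telemetry_conversion.enabled to true is a handy switch on prometheusremotewrite that promotes OTLP resource attributes (like service.name, k8s.pod.name) into Prometheus labels so they become queryable in PromQL. Every promoted attribute becomes a label on every series, so keep an eye on the cardinality this adds.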
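Conceptually, the conversion merges resource attributes into each data point's label set and rewrites names into a Prometheus-safe form. The sketch below illustrates that idea only. It assumes that characters outside letters, digits, and underscores become underscores and that data point attributes win on a name clash; the exporter's exact rules may differ, and `to_prometheus_labels` is a hypothetical helper.

```python
import re

INVALID_LABEL_CHARS = re.compile(r"[^a-zA-Z0-9_]")


def to_prometheus_labels(resource_attrs: dict, point_attrs: dict) -> dict:
    """Merge resource and data point attributes into Prometheus-style labels."""
    merged = {**resource_attrs, **point_attrs}  # assumption: point attributes win
    return {INVALID_LABEL_CHARS.sub("_", key): str(value) for key, value in merged.items()}


labels = to_prometheus_labels(
    {"service.name": "checkout-service", "k8s.pod.name": "checkout-abc"},
    {"http.method": "GET"},
)
print(labels)
```

After a conversion like this, a PromQL selector such as `{k8s_pod_name="checkout-abc"}` can filter on what started life as a resource attribute.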
10.3.3 Connectors — the inter-pipeline glue
A connector behaves as an exporter on one pipeline and as a receiver on another. The canonical example is spanmetrics, which consumes spans and emits aggregated RED metrics — exactly the kind of derived signal you want produced once, near the source, instead of repeatedly in each backend.
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s]
    dimensions:
      - name: http.method
      - name: http.status_code
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [spanmetrics, otlp/tempo]
    metrics/spans:
      receivers: [spanmetrics]
      processors: [batch]
      exporters: [prometheusremotewrite]
Notice how the same spanmetrics instance appears as an exporter under the traces pipeline and as a receiver under metrics/spans. Other useful connectors include routing (split traffic by attribute) and forward (chain pipelines together).
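To make the span-to-metrics derivation concrete, here is a rough sketch of the aggregation a connector like spanmetrics performs: count calls and errors and bucket durations per dimension set. The tuple layout, the function `red_metrics`, and the per-bucket (non-cumulative) counts are illustrative assumptions; the real connector emits OTLP metrics with its own naming.

```python
from bisect import bisect_left
from collections import defaultdict

# Upper bounds in milliseconds, matching the buckets in the config above.
BUCKETS_MS = [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000]


def red_metrics(spans) -> dict:
    """spans: iterable of (service, http_method, status_code, duration_ms, is_error)."""
    metrics = defaultdict(
        lambda: {"calls": 0, "errors": 0, "buckets": [0] * (len(BUCKETS_MS) + 1)}
    )
    for service, method, status, duration_ms, is_error in spans:
        series = metrics[(service, method, status)]
        series["calls"] += 1                 # Rate
        series["errors"] += int(is_error)    # Errors
        # Duration: index of the first bucket whose bound is >= the duration;
        # the extra last slot catches anything slower than 5 s.
        series["buckets"][bisect_left(BUCKETS_MS, duration_ms)] += 1
    return dict(metrics)


spans = [
    ("checkout", "GET", 200, 7, False),
    ("checkout", "GET", 200, 120, False),
    ("checkout", "POST", 500, 6000, True),
]
for key, series in red_metrics(spans).items():
    print(key, series)
```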
Key Takeaway: OTLP and Prometheus receivers cover most ingest; OTLP, Prometheus remote-write, and Loki exporters cover most egress; connectors like spanmetrics cleanly derive one signal from another inside the same Collector.
10.4 Reliability and Operations
Once a Collector is the single conduit for an organization’s telemetry, it becomes a critical piece of infrastructure. Reliability hinges on three things: durable queues, fast feedback through extensions, and resource sizing that matches load.
10.4.1 Persistent queue and retry on failure
Every OTLP exporter — and most other exporters — supports a sending_queue and retry_on_failure. By default the sending queue is in memory: fast, but lost on restart. Pairing it with the file_storage extension makes the queue persistent across restarts, so an OOM kill or rolling deployment doesn’t drop telemetry already accepted from upstream [Source: https://www.pulumi.com/blog/kubernetes-best-practices-i-wish-i-had-known-before/].
extensions:
  file_storage:
    directory: /var/lib/otelcol/storage
    timeout: 1s
exporters:
  otlp/tempo:
    endpoint: tempo-distributor.observability:4317
    sending_queue:
      enabled: true
      storage: file_storage # makes the queue durable
      num_consumers: 10
      queue_size: 2000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 60s
      max_elapsed_time: 0
Figure 10.4: memory_limiter, batch, sending queue, and retry interplay
flowchart LR
IN[Incoming spans / metrics / logs] --> ML{memory_limiter<br/>over threshold?}
ML -- "yes: refuse" --> REJ[Return error to receiver<br/>upstream retries / backs off]
ML -- "no: accept" --> BAT[batch<br/>timeout or send_batch_size]
BAT --> SQ[(sending_queue<br/>file_storage backed)]
SQ --> EXP[Exporter consumer pool<br/>num_consumers]
EXP -- "success" --> BE[Backend]
EXP -- "failure" --> RET{retry_on_failure<br/>exponential backoff}
RET -- "retry" --> SQ
RET -- "queue full" --> DROP[Drop oldest<br/>otelcol_exporter_send_failed]
Sizing tips:
- queue_size: start at 1,000–5,000 batches per exporter. Each batch can hold thousands of spans/metrics/log records.
- num_consumers: 5–10 is typical. Increase only if the backend can handle more parallelism without rejecting.
- retry_on_failure: exponential backoff with initial_interval: 5s and max_interval: 60s; set max_elapsed_time: 0 (retry indefinitely) for critical data, and rely on the bounded queue_size together with memory_limiter to shed load if the backend stays down.
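Two quick calculations help when choosing these numbers: what the retry intervals look like, and how long an outage the queue can absorb. The sketch below assumes a growth multiplier of 1.5 and omits the random jitter a real backoff adds; both function names and the 20 batches/sec figure are hypothetical.

```python
def backoff_schedule(initial_s: float, max_s: float, attempts: int, multiplier: float = 1.5) -> list:
    """Retry intervals that grow geometrically and are capped at max_s."""
    intervals = []
    interval = initial_s
    for _ in range(attempts):
        intervals.append(min(interval, max_s))
        interval *= multiplier
    return intervals


def outage_seconds_absorbed(queue_size: int, batches_per_sec: float) -> float:
    """How long the queue can buffer if nothing is being delivered."""
    return queue_size / batches_per_sec


print(backoff_schedule(initial_s=5, max_s=60, attempts=8))
print(outage_seconds_absorbed(queue_size=2000, batches_per_sec=20))
```

If the second number is shorter than the backend outages you expect to ride out, either raise queue_size (and the disk or memory behind it) or accept that data will be lost during longer outages.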
10.4.2 Extensions: health_check, pprof, zpages
Extensions don’t participate in pipelines, but they are how you operate the Collector day-to-day.
| Extension | Endpoint (default) | Use |
|---|---|---|
| health_check | :13133/ | Kubernetes liveness/readiness probes |
| pprof | :1777/debug/pprof/ | CPU and heap profiling under load |
| zpages | :55679/debug/ | Live in-process views: pipelines, exporter queues, recent spans |
| file_storage | n/a (filesystem) | Backing store for persistent queues |
A reasonable Kubernetes probe configuration looks like:
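extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1777
  zpages:
    endpoint: 0.0.0.0:55679
  file_storage: # must be defined because service.extensions references it
    directory: /var/lib/otelcol/storage
service:
  extensions: [health_check, pprof, zpages, file_storage]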
In your pod spec:
livenessProbe:
  httpGet: { path: /, port: 13133 }
  initialDelaySeconds: 10
readinessProbe:
  httpGet: { path: /, port: 13133 }
  periodSeconds: 5
zpages is especially useful during incidents — it exposes per-component counters and sampled recent traces directly from the Collector’s process, so you can answer “Is data flowing? Is anything dropping?” without leaving the cluster.
Figure 10.3: Two-tier agent + gateway Collector topology
flowchart LR
subgraph "Node 1"
P1[App pod] --> A1["Agent Collector<br/>DaemonSet<br/>memory_limiter<br/>k8sattributes<br/>light batch"]
P2[App pod] --> A1
end
subgraph "Node 2"
P3[App pod] --> A2["Agent Collector<br/>DaemonSet<br/>memory_limiter<br/>k8sattributes<br/>light batch"]
P4[App pod] --> A2
end
subgraph "Node N"
P5[App pod] --> A3["Agent Collector<br/>DaemonSet<br/>memory_limiter<br/>k8sattributes<br/>light batch"]
end
A1 --> GW["Gateway Collector<br/>Deployment + HPA<br/>tail_sampling<br/>transform<br/>heavy batch<br/>persistent queue"]
A2 --> GW
A3 --> GW
GW --> TEMPO[(Tempo)]
GW --> MIMIR[(Mimir)]
GW --> LOKI[(Loki)]
10.4.3 Sizing, throughput, and memory tuning
In Kubernetes, the recommended pattern is a two-tier topology [Source: https://kodekloud.com/blog/kubernetes-best-practices-2025/]:
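- Agent (DaemonSet): one Collector per node. Handles OTLP from local pods, tails container logs via filelog, scrapes hostmetrics, applies k8sattributes, does light batching, forwards to the gateway.
- Gateway (Deployment + HPA): shared by all nodes. Hosts CPU-heavy processors (tail_sampling, transform, large batch), holds the durable sending queues, and centralizes egress to backends. With more than one gateway replica, tail sampling also needs trace-ID-aware routing (for example, the loadbalancing exporter on the agents) so that all spans of a trace reach the same replica.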
Recommended starting resources:
| Role | CPU request | CPU limit | Memory request | Memory limit |
|---|---|---|---|---|
| Agent (DaemonSet) | 100–250m | 500–750m | 256–512 Mi | 512 Mi–1 Gi |
| Gateway (Deployment) | 500m–1 vCPU | 2–4 vCPU | 1–2 Gi | 2–4 Gi |
Tune from there using observed peaks. The memory_limiter should target 70–80% of the container memory limit, with spike_limit_mib covering the largest plausible batch:
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1600 # for a 2 GiB-limit gateway pod
    spike_limit_mib: 400
For the gateway, an HPA on CPU (target 60–70% utilization) and a minReplicas: 2 gives you graceful scaling and high availability [Source: https://www.gravitee.io/blog/top-5-kubernetes-deployment-strategies]:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
Agents are not HPA-scaled; they scale with node count via the DaemonSet, so adjust per-agent resources instead. Tail sampling capacity follows a simple rule of thumb: num_traces ≥ expected_new_traces_per_sec × decision_wait × 2. At 2,000 traces/sec and a 10-second decision_wait, that’s num_traces: 40000 minimum — round up to 50,000 for headroom.
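The sizing rule is simple enough to keep in a small helper next to your capacity notes. The first function restates the rule of thumb above; the second is a rough buffer-size estimate whose inputs (20 spans per trace, 1 KiB per span) are purely hypothetical and should be replaced with measured values.

```python
import math


def min_num_traces(new_traces_per_sec: float, decision_wait_s: float, safety_factor: float = 2.0) -> int:
    """num_traces >= expected_new_traces_per_sec * decision_wait * 2."""
    return math.ceil(new_traces_per_sec * decision_wait_s * safety_factor)


def buffer_mib(num_traces: int, spans_per_trace: float, bytes_per_span: float) -> float:
    """Very rough estimate of memory held by buffered spans, in MiB."""
    return num_traces * spans_per_trace * bytes_per_span / (1024 * 1024)


print(min_num_traces(2000, 10))
print(buffer_mib(50_000, spans_per_trace=20, bytes_per_span=1024))
```

Under those hypothetical inputs the buffer alone approaches 1 GiB, which is why tail sampling belongs on the larger gateway tier rather than on node agents.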
Monitor the Collector with… itself: every Collector exposes its own internal metrics on port 8888 (e.g., otelcol_exporter_queue_size, otelcol_processor_dropped_spans, otelcol_receiver_refused_spans). Alert on:
- Memory near the limit_mib for sustained periods
- Sustained growth of otelcol_exporter_queue_size
- Any non-zero otelcol_processor_refused_* or otelcol_exporter_send_failed_* counters
- HPA at maxReplicas while queues continue to grow
Key Takeaway: Run a two-tier (agent + gateway) topology, give the gateway a durable sending queue with retry, expose health_check/pprof/zpages for operability, and size memory_limiter, num_traces, and HPA targets from observed load rather than guesses.
Chapter Summary
The OpenTelemetry Collector is a pluggable, vendor-neutral data plane composed of receivers, processors, exporters, connectors, and extensions, with pipelines declared per signal type in the service block. Order is decisive: memory_limiter belongs first, enrichment and filtering come next, transforms reshape what survives, and batch is the final stop before an exporter. OTTL — the OpenTelemetry Transformation Language — gives the transform and filter processors a small, context-aware DSL for setting, deleting, whitelisting, and pattern-replacing attributes, with where clauses to scope each statement.
For cost control on traces, tail_sampling buffers spans by trace ID and decides per-trace after decision_wait, enabling policies that prefer errors, slow traces, and key tenants while a probabilistic backstop gives baseline coverage. Workhorse components — otlp, prometheus, hostmetrics, filelog, kafka receivers; otlp, prometheusremotewrite, loki, debug exporters; and connectors like spanmetrics — cover the overwhelming majority of real deployments.
Reliability comes from three habits: configuring memory_limiter and batch everywhere, enabling persistent sending queues with retry on the gateway, and exposing health_check, pprof, and zpages for fast operational feedback. In Kubernetes, deploy a DaemonSet agent for local concerns (logs, host metrics, k8sattributes) and a horizontally-scaled gateway Deployment for tail sampling, transforms, and egress. Size everything from real metrics, not hope — the Collector’s own :8888 endpoint is your honest source of truth.
Key Terms
| Term | Definition |
|---|---|
| Receiver | Component that ingests telemetry into the Collector via push (OTLP) or pull (Prometheus scrape, file tail, etc.). |
| Processor | Component that mutates, filters, batches, samples, or enriches telemetry as it flows through a pipeline. |
| Exporter | Component that sends telemetry from the Collector to one or more downstream backends. |
| Connector | Hybrid component acting as an exporter on one pipeline and a receiver on another — used to derive signals (e.g., metrics from spans). |
| OTTL | OpenTelemetry Transformation Language — context-aware DSL used by transform (statements) and filter (conditions). |
| Tail sampling | Per-trace sampling decision made after spans have been buffered, enabling policies based on full-trace properties. |
| memory_limiter | Processor that monitors Collector memory and applies back-pressure (refuses new data) above configured thresholds. |
| k8sattributes | Processor that enriches telemetry with Kubernetes pod metadata (namespace, deployment, labels) via the Kubernetes API. |
Chapter 11: Sampling, Performance, and Cost Control
Observability is one of those engineering disciplines where the cure can become as expensive as the disease. A team gets a P1 incident, blames lack of telemetry, and overcorrects by instrumenting everything at 100% sampling, adding every available label, and shipping every line of structured log to a hot index. Three months later, the observability bill rivals the infrastructure bill, the ingestion pipeline is back-pressuring, and engineers are still no faster at finding the root cause.
The discipline this chapter teaches is the opposite reflex: deliberately keeping the right telemetry and dropping the rest, in the right place, at the right cost. We will look at sampling strategies (head vs tail), cardinality management for metrics, the real performance overhead of instrumentation in Java and Go, and how to design a cost-aware architecture that scales.
Think of telemetry like a city’s water system. You do not need to bottle and refrigerate every drop that flows through a pipe — but you do need flow sensors at junctions, alarms when a main bursts, and a sampled chemistry test now and then. Get the sampling right, and a small storage tank tells you everything you need. Get it wrong, and you are paying to refrigerate the entire reservoir.
Learning Objectives
By the end of this chapter, you will be able to:
- Compare head-based and tail-based sampling and pick the right strategy per workload.
- Tune cardinality and ingestion rates to control observability spend.
- Measure and reduce the performance overhead of instrumentation in production services.
- Design tiered storage and retention policies that match telemetry value to cost.
Sampling Strategies
Sampling is the single most powerful lever you have for controlling telemetry cost and overhead. It is also the most misunderstood. Engineers reach for “100% sampling, we want to see everything,” not realizing that at high throughput this can cost more than the application itself. The trick is to sample in a way that preserves what matters — errors, slow traces, unusual tenants — while discarding the redundant majority.
In OpenTelemetry, two sampling families dominate: head-based sampling (decided in the SDK at the start of a trace) and tail-based sampling (decided in the Collector after most of the trace has been seen) [Source: https://opentelemetry.io/docs/zero-code/java/agent/performance/].
Head-based: ParentBased, TraceIdRatio, AlwaysOn/Off
Head-based sampling makes the keep-or-drop decision at the first service that creates the root span — typically the SDK in your front-door API. Because the decision is made up front, downstream services never have to record, store, or ship the spans of a dropped trace. This is the cheapest possible form of sampling.
The canonical head sampler in OTel is TraceIdRatioBased. It derives its decision deterministically from the trace ID, so two services configured with the same ratio p will always agree on the same trace. Set p = 0.1 and you keep roughly 10% of traces, selected uniformly [Source: https://opentelemetry.io/docs/zero-code/java/agent/performance/].
In isolation, TraceIdRatioBased is not enough. If Service A and Service B run different ratios, or Service B re-decides without regard to what Service A chose, you get incoherent traces where the parent is sampled but children are missing. The fix is to wrap it in ParentBased: child spans honor the parent’s decision regardless of their own configured rate. The idiomatic OTel sampler is therefore:
ParentBased(root=TraceIdRatioBased(0.1))
This means: “If I am a root span, decide using a 10% ratio. If I have a parent, do whatever the parent did.” This single line is what makes distributed sampling coherent.
The other head samplers are extremes: AlwaysOn keeps every trace (good for dev/staging), AlwaysOff drops every trace (useful as a kill switch or for ephemeral services where traces are noise).
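To make the decision logic concrete, here is a minimal pure-Python sketch of a ratio sampler wrapped in a parent-based rule. It is not the SDK's implementation: the names `ratio_keep` and `parent_based_keep` are hypothetical, and comparing the low 64 bits of the trace ID against a threshold is an assumption about one plausible way to get a deterministic decision.

```python
import random

MASK_64 = (1 << 64) - 1  # assumption: decide on the low 64 bits of the 128-bit trace ID


def ratio_keep(trace_id: int, ratio: float) -> bool:
    """Deterministic keep/drop: the same trace ID and ratio always give the same answer."""
    bound = round(ratio * (MASK_64 + 1))
    return (trace_id & MASK_64) < bound


def parent_based_keep(trace_id: int, ratio: float, parent_sampled=None) -> bool:
    """Root spans use the ratio; spans with a parent copy the parent's decision."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_keep(trace_id, ratio)


# Rough check on random trace IDs: about 10% of roots should be kept.
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept_fraction = sum(parent_based_keep(t, 0.1) for t in trace_ids) / len(trace_ids)
print(f"kept {kept_fraction:.1%} of root traces")
```

Because the decision depends only on the trace ID and the ratio, any service that evaluates `ratio_keep` with the same inputs reaches the same verdict, and `parent_based_keep` removes the need to re-evaluate at all once a parent decision exists.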
Figure 11.1: Head sampling KEEP vs DROP across services
sequenceDiagram
participant C as Client
participant A as Service A<br/>(root span)
participant B as Service B
participant D as Service D
participant Col as Collector
Note over A: KEEP path (sampled in)
C->>A: request 1
A->>A: TraceIdRatio: keep
A->>B: call (sampled=true)
B->>D: call (sampled=true)
A-->>Col: export spans
B-->>Col: export spans
D-->>Col: export spans
Note over A: DROP path (sampled out)
C->>A: request 2
A->>A: TraceIdRatio: drop
A->>B: call (sampled=false)
B->>D: call (sampled=false)
Note over A,D: no span recording,<br/>no export, no cost
Tail-based at the Collector
Head sampling is statistically unbiased but stupid: it will happily throw away the one trace where the payment service threw a 500, just because the dice rolled the wrong way. For rare-but-important events, you want to decide after you have seen the whole trace.
Tail-based sampling lives in the OpenTelemetry Collector, in the tail_sampling processor. The mechanics are:
- SDKs run AlwaysOn (or a generous head sample like 50%) and ship every span.
- The Collector buffers spans in memory, grouped by trace ID.
- It waits a decision_wait window (often 5–30 seconds) for the trace to “complete.”
- It evaluates policies: keep if any span has error=true, keep if duration > 3s, keep if tenant_id=enterprise-1234, otherwise sample 10% randomly.
- Selected traces are flushed to the backend; the rest are dropped.
The advantage is huge: you get full knowledge of the trace, including attributes that downstream services added (status codes, latency, tenant). The disadvantage is equally large: you must buffer everything in collector memory until the decision window closes.
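The policy evaluation step can be sketched in a few lines of Python. This is an illustration of the decision order described above, not the Collector's code: the `Span` type, the `keep_trace` function, and the use of the longest span as a stand-in for trace duration are all assumptions made for the example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Span:
    duration_s: float
    error: bool = False
    attributes: dict = field(default_factory=dict)


def keep_trace(spans, rng=random.random, baseline=0.10, slow_s=3.0,
               vip_tenants=frozenset({"enterprise-1234"})):
    """Return (keep, reason) for one fully buffered trace."""
    if any(s.error for s in spans):
        return True, "error"
    # Assumption: the longest span approximates the duration of the whole trace.
    if max((s.duration_s for s in spans), default=0.0) > slow_s:
        return True, "slow"
    if any(s.attributes.get("tenant_id") in vip_tenants for s in spans):
        return True, "tenant"
    if rng() < baseline:
        return True, "baseline"
    return False, "dropped"
```

Passing `rng` as a parameter keeps the random fallback testable; the first three policies are deterministic, which is exactly why they can only run after the whole trace has been buffered.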
The collector’s memory cost scales linearly with throughput and the wait window:
buffered_bytes ≈ traces/sec × avg_spans/trace × avg_span_size × wait_window_seconds
Plug in realistic numbers: 10,000 traces/sec × 20 spans/trace × 1 KB/span × 10s window = 2 GB of in-memory spans [Source: https://opentelemetry.io/docs/zero-code/java/agent/performance/]. Double the window, double the memory. This is why tail sampling typically requires sharding collectors by trace ID (so all spans of a given trace land on the same collector) and aggressive timeout tuning.
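The same estimate as a small Python helper, assuming throughput is measured in traces per second and 1 KB is taken as 1,000 bytes. The function name `tail_buffer_bytes` is hypothetical, and the result ignores per-span bookkeeping overhead in the Collector, so treat it as a lower bound.

```python
def tail_buffer_bytes(traces_per_sec: float, spans_per_trace: float,
                      span_bytes: float, wait_s: float) -> float:
    """Approximate bytes held in memory while traces wait for a decision."""
    return traces_per_sec * spans_per_trace * span_bytes * wait_s


estimate = tail_buffer_bytes(10_000, 20, 1_000, 10)
print(f"{estimate / 1e9:.1f} GB")  # prints "2.0 GB"
```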
Tail sampling also adds observability latency: the time between a trace happening and the trace appearing in your backend. If your SREs depend on traces for live incident detection, a 30-second decision window means a 30-second blind spot.
Probabilistic vs Adaptive Sampling
Both head and tail sampling can be either probabilistic (fixed rate) or adaptive (rate adjusts to volume).
Probabilistic is the default and the simplest: “keep 10% of traces, forever.” It is deterministic, predictable, and easy to reason about for SLO math (your sampled error rate is an unbiased estimate of the true error rate). Most teams should start here.
Adaptive sampling dynamically adjusts the rate to hit a target traces-per-second budget. If traffic spikes 5×, the sampler drops the rate from 10% to 2% to keep the ingest pipeline steady. This is great for cost control but breaks naive statistical extrapolation: a 1% error rate in a 2%-sampled bucket means something different than in a 50%-sampled bucket. Note that the OTel Collector probabilistic_sampler processor applies a fixed sampling_percentage; adaptive behavior needs a component that changes the rate over time, such as a rate-limiting tail-sampling policy or a remotely controlled sampler.
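A minimal sketch of the arithmetic behind a traces-per-second budget, assuming a hypothetical `adaptive_rate` helper that is re-evaluated periodically against observed traffic. The second helper shows why each kept trace needs its sampling rate attached if you want to extrapolate counts afterwards.

```python
def adaptive_rate(target_tps: float, observed_tps: float) -> float:
    """Sampling probability that keeps roughly target_tps traces per second."""
    if observed_tps <= 0:
        return 1.0
    return min(1.0, target_tps / observed_tps)


def estimate_true_count(kept: int, rate: float) -> float:
    """Each kept trace stands for 1/rate original traces."""
    return kept / rate


normal = adaptive_rate(target_tps=100, observed_tps=1_000)  # 10% at normal load
spike = adaptive_rate(target_tps=100, observed_tps=5_000)   # 2% during a 5x spike
print(normal, spike)
```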
A common production pattern is the hybrid: a light head sample (say, ParentBased(TraceIdRatioBased(0.2)) for 20%) plus a tail sampler that further refines the 20% to keep errors and slow traces. This caps collector memory at a fifth of full firehose, while still letting the tail layer find interesting traces. The catch: errors dropped by the head sampler are lost forever. “Keep all errors” becomes “keep all errors among the 20% that survived head sampling” [Source: https://github.com/open-telemetry/opentelemetry-java-instrumentation/discussions/2104].
Figure 11.2: Hybrid head + tail sampling decision pipeline
flowchart TD
Start[Trace begins<br/>at SDK]
Start --> Head{Head sampler<br/>ParentBased<br/>TraceIdRatio 20%}
Head -->|80% drop| Drop1[Discard at SDK<br/>no spans recorded]
Head -->|20% keep| Export[Export spans<br/>to Collector]
Export --> Buffer[Buffer by trace_id<br/>decision_wait window]
Buffer --> Tail{Tail sampling<br/>policies}
Tail -->|error=true| Keep1[Persist to backend]
Tail -->|latency > 3s| Keep2[Persist to backend]
Tail -->|tenant=enterprise| Keep3[Persist to backend]
Tail -->|10% random| Keep4[Persist to backend]
Tail -->|none matched| Drop2[Discard at Collector]
Comparison: When to Use Which
| Dimension | Head-based (TraceIdRatioBased) | Tail-based (collector tail_sampling) |
|---|---|---|
| Where decided | SDK, at first span | Collector, after buffering |
| Decision timing | Immediate, pre-export | After decision_wait (5–30 s) |
| Latency impact (request) | Negligible | None on request, but observability lag |
| Memory cost | Very low | High; scales with spans/sec × wait |
| Network cost | Low (drops never leave service) | High (all spans cross the wire) |
| Accuracy for rare events | Poor (random drops) | Excellent (sees full trace) |
| Statistical properties | Unbiased random sample | Biased toward “interesting” |
| Configuration complexity | Simple, per-SDK setting | Complex; tune buffers, timeouts, policies |
| Best for | High-QPS, cost-sensitive APIs | Rare-error capture, targeted debugging |
| Example use case | 50k RPS API at 1% sample | Payment service, keep all errors + slow |
[Source: https://opentelemetry.io/docs/zero-code/java/agent/performance/]
Key Takeaway: Head-based sampling is cheap, deterministic, and ideal for high-volume systems where statistical aggregates matter more than every individual trace. Tail-based sampling is expensive and complex, but it is the only way to reliably catch rare-but-important traces — use it where the cost of missing a trace exceeds the cost of buffering all traces.
Cardinality Management
If sampling controls trace cost, cardinality controls metric cost. In a Prometheus-style time-series database, the unit of storage is not the metric — it is the unique combination of metric name plus label set. Each unique combination is one time series, with its own samples, its own memory footprint in the head block, and its own row in long-term storage [Source: https://blog.codinghorror.com/the-problem-with-logging/].
A metric like http_requests_total with labels {method="GET", status="200"} is one series. Add a user_id label with 100,000 distinct users and you have potentially 100,000 series per method-status combination. The math is multiplicative, not additive, and the explosion is brutal.
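The multiplication is easy to verify with a short Python sketch. The counts of 5 methods and 10 status codes are assumptions chosen for illustration, and the result is a worst case, since not every combination necessarily occurs in practice.

```python
from math import prod


def worst_case_series(distinct_values_per_label: dict) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(distinct_values_per_label.values())


base = worst_case_series({"method": 5, "status": 10})
with_user = worst_case_series({"method": 5, "status": 10, "user_id": 100_000})
print(base, with_user)  # prints "50 5000000"
```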
Identifying High-Cardinality Labels
The first job is forensic: figuring out which metrics and labels are causing the blow-up. Prometheus has a number of built-in tools for this.
The most useful PromQL query is the “top-N by series count”:
topk(20, count by (__name__)({__name__=~".+"}))
This returns the 20 metric names with the most series. Once you have a suspect, drill into its labels, replacing label_name with the label you suspect (for example, uri):
topk(10, count by (label_name) (http_server_requests_seconds_count))
For deeper analysis, promtool tsdb analyze reads the on-disk block format and produces a report of the heaviest labels and series:
promtool tsdb analyze /path/to/data > tsdb-report.txt
The TSDB also exposes meta-metrics that should be on every Prometheus operator’s dashboard: prometheus_tsdb_head_series (live series count), prometheus_tsdb_head_series_created_total (churn), and prometheus_tsdb_wal_fsync_duration_seconds (early sign of disk pressure).
For users running Grafana Mimir, the Cardinality Explorer UI and a set of cardinality API endpoints make this even easier, for example /api/v1/cardinality/label_names and /api/v1/cardinality/label_values (check the Mimir HTTP API reference for the exact paths and parameters in your version). You can script these into CI/CD so a build fails if a new metric crosses, say, 50,000 series [Source: https://blog.codinghorror.com/the-problem-with-logging/].
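A CI gate of this kind reduces to a small pure function once the per-metric series counts have been fetched. The sketch below assumes the counts are already available as a dictionary; how you obtain them (a PromQL query or a cardinality API call) is left out, and the names `over_budget` and `exit_code` are hypothetical.

```python
def over_budget(series_by_metric: dict, limit: int = 50_000) -> list:
    """Return (metric, series) pairs above the limit, largest first."""
    offenders = [(m, n) for m, n in series_by_metric.items() if n > limit]
    return sorted(offenders, key=lambda item: item[1], reverse=True)


def exit_code(series_by_metric: dict, limit: int = 50_000) -> int:
    """Non-zero when any metric exceeds the budget, suitable for failing a build."""
    return 1 if over_budget(series_by_metric, limit) else 0


# Hypothetical counts, for illustration only.
counts = {"http_requests_total": 1_200, "cart_events_total": 73_000}
for metric, n in over_budget(counts):
    print(f"{metric}: {n} series exceeds budget")
```

A CI job would pass the result of `exit_code` to its process exit status so that the pipeline fails before the offending metric reaches production.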
Allow/Deny Lists and Attribute Drops
Once you know the offenders, the fastest fix is to drop them at scrape time using metric_relabel_configs. This Prometheus config runs after the sample has been scraped but before it is committed to storage, making it the right place for cardinality surgery.
Drop a single noisy label:
metric_relabel_configs:
  # labeldrop matches the regex against label names; source_labels is not used
  - regex: "uri"
    action: labeldrop
Drop several at once:
  - regex: "user_id|session_id|request_id"
    action: labeldrop
Keep only an allow-listed set (paranoid mode):
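  # labelkeep removes every label that does not match, so __name__ (and le for histograms) must be listed
  - regex: "__name__|job|instance|method|status_code|le"
    action: labelkeep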
Drop entire metric families:
  - source_labels: [__name__]
    regex: ".*_per_user_.*"
    action: drop
Normalize dynamic path segments:
  - source_labels: [path]
    regex: "/api/v1/users/[0-9]+/(.*)"
    target_label: path
    replacement: "/api/v1/users/:id/$1"
    action: replace
That last pattern — normalizing /users/12345/orders into /users/:id/orders — is one of the most useful tricks in the playbook. It collapses unbounded user-ID-laden paths into a bounded set of route templates without losing the structural information.
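The same normalization can be prototyped in Python before committing it to a relabel rule. This sketch assumes the rule's regex is fully anchored (hence `fullmatch`) and that a non-matching path is left unchanged; the function name `normalize_path` is hypothetical.

```python
import re

USER_PATH = re.compile(r"/api/v1/users/[0-9]+/(.*)")


def normalize_path(path: str) -> str:
    """Collapse numeric user IDs into a :id placeholder; leave other paths alone."""
    match = USER_PATH.fullmatch(path)
    if match is None:
        return path
    return f"/api/v1/users/:id/{match.group(1)}"


print(normalize_path("/api/v1/users/12345/orders"))  # prints "/api/v1/users/:id/orders"
```

Running a sample of real request paths through a function like this is a cheap way to confirm that the set of outputs is actually bounded before the rule goes live.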
In the OpenTelemetry Collector, the equivalent surgery happens in the attributes and transform processors, which can drop, rename, or hash attributes before they reach the metrics exporter.
Aggregating Away Unbounded Dimensions
Sometimes a label genuinely matters for some questions but is too expensive to keep on the raw metric. The pattern here is recording rules: pre-compute and persist a lower-cardinality aggregation, and query the rolled-up series instead of the raw one.
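groups:
  - name: myapp_aggregates
    interval: 30s
    rules:
      - record: service:myapp_http_request_duration_seconds_bucket:rate5m
        expr: |
          sum by (service, le) (
            rate(myapp_http_request_duration_seconds_bucket[5m])
          )
This rule takes the per-series rate of the raw histogram and then sums across all labels except service and le (bucket boundary). Taking rate() before sum() matters: summing raw counters first would make a single instance restart look like a counter reset for the whole aggregate. Quantile queries then become trivial:
histogram_quantile(0.95, service:myapp_http_request_duration_seconds_bucket:rate5m)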
You can keep the high-cardinality raw metric at short retention (say, 2 hours) for live debugging, while the rolled-up recording rule retains for 90 days at a tiny fraction of the storage.
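To see why the rollup loses nothing that a quantile query needs, the sketch below sums bucket values by (service, le) and then estimates a quantile by linear interpolation within the matching bucket. It is a simplified model of what a sum by (service, le) followed by a histogram quantile does, not Prometheus code; the function names and the sample data are hypothetical.

```python
import math
from collections import defaultdict


def rollup(samples):
    """Sum values by (service, le), discarding every other label."""
    out = defaultdict(lambda: defaultdict(float))
    for labels, value in samples:
        out[labels["service"]][float(labels["le"])] += value
    return {service: dict(buckets) for service, buckets in out.items()}


def quantile(q, buckets):
    """Estimate a quantile from cumulative buckets {upper_bound: value}."""
    if not buckets:
        return math.nan
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper in bounds:
        count = buckets[upper]
        if count >= rank:
            if math.isinf(upper):
                return lower_bound  # cannot interpolate into the +Inf bucket
            width = count - lower_count
            if width == 0:
                return upper
            return lower_bound + (upper - lower_bound) * (rank - lower_count) / width
        lower_bound, lower_count = upper, count
    return lower_bound


# Two pods of one service; the pod label disappears in the rollup.
samples = [
    ({"service": "orders", "pod": "a", "le": "0.1"}, 30.0),
    ({"service": "orders", "pod": "b", "le": "0.1"}, 20.0),
    ({"service": "orders", "pod": "a", "le": "0.5"}, 50.0),
    ({"service": "orders", "pod": "b", "le": "0.5"}, 40.0),
    ({"service": "orders", "pod": "a", "le": "1.0"}, 55.0),
    ({"service": "orders", "pod": "b", "le": "1.0"}, 45.0),
    ({"service": "orders", "pod": "a", "le": "+Inf"}, 55.0),
    ({"service": "orders", "pod": "b", "le": "+Inf"}, 45.0),
]
rolled = rollup(samples)
p95 = quantile(0.95, rolled["orders"])
```

Eight input series collapse to four, and the quantile is computed from the four alone, which is why the raw series can expire early without affecting long-range latency queries.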
The best long-term fix, however, is always application-level: do not emit unbounded labels in the first place. Use templated route names. Use stable error_code enums, not raw exception messages. Never put user_id, session_id, request_id, email, or cart_id on a metric — those belong on traces or logs, where the storage model can handle high cardinality [Source: https://blog.codinghorror.com/the-problem-with-logging/].
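If you already have bad data in storage, Prometheus exposes TSDB admin APIs (available only when it is started with --web.enable-admin-api): POST /api/v1/admin/tsdb/delete_series with match[] selectors and an optional time range, followed by POST /api/v1/admin/tsdb/clean_tombstones to actually reclaim disk. Other backends such as Mimir handle deletion differently, so check their documentation before relying on the same calls.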
Figure 11.3: Cardinality reduction funnel
flowchart TD
Raw["Raw exposition<br/>http_requests_total{method, status,<br/>user_id, request_id, path}<br/>~1,000,000 series<br/>(cost: 1000x)"]
Raw -->|"metric_relabel_configs:<br/>labeldrop user_id, request_id<br/>normalize /users/:id/"| Mid
Mid["After scrape-time relabel<br/>http_requests_total{method, status, route}<br/>~50,000 series<br/>(cost: 50x)"]
Mid -->|"recording rule:<br/>sum by (service, le)"| Roll
Roll["Recording-rule rollup<br/>myapp:http_request_duration:service<br/>~200 series<br/>(cost: 1x)"]
Key Takeaway: Each unique label combination is its own time series, and high-cardinality labels multiply storage cost. Identify offenders with topk queries and Mimir’s Cardinality Explorer, drop them at scrape time with metric_relabel_configs, aggregate via recording rules, and, above all, never put unbounded IDs on metric labels at the application level.
Performance Overhead
The third lever is the cost of generating telemetry — the CPU and memory the instrumentation itself consumes inside your application. This is the cost engineers worry about most, but it is usually the cost that matters least if you have sampling and cardinality under control. Still, knowing the numbers — and how to tune them — is part of being a competent observability operator.
CPU and Memory Cost of SDK Instrumentation
The honest answer is “it depends on the runtime, the agent style, and the workload.” But we can give meaningful ranges.
Java auto-instrumentation (the OpenTelemetry Java Agent) is the heaviest-weight common case because it uses bytecode instrumentation. The Elastic EDOT Java benchmark on a sample JVM service gives concrete numbers [Source: https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/java/overhead]:
| Metric | No agent | With EDOT Java | Relative impact |
|---|---|---|---|
| Startup time | 5.55 s | 6.82 s | ~+23% (+1.3 s) |
| p95 request latency | 1.96 ms | 2.06 ms | ~+5% |
| Total system CPU | 53.82% | 54.25% | ~+0.8% relative (+0.43 percentage points) |
For most well-tuned JVM microservices, expect 1–5% additional CPU and tens of MB of extra heap [Source: https://opentelemetry.io/docs/zero-code/java/agent/performance/]. The OpenTelemetry community discussion on this topic puts the worst-case at up to ~20% CPU and ≥0.5 ms extra latency per instrumented hop for very high-throughput, fully instrumented apps without sampling [Source: https://github.com/open-telemetry/opentelemetry-java-instrumentation/discussions/2104]. That worst case is largely driven by GC pressure: every span allocates objects, and at high QPS the allocation rate dominates.
Go instrumentation has a different architecture. The Go SDK is library-based: you import packages and add middleware, rather than attaching a runtime bytecode agent. There is no JVM-style startup penalty and no class-loading hit. In practice, Go OpenTelemetry overhead at sampled rates is typically in the low single-digit CPU percent range with a few to tens of MB of additional RSS [Source: https://opentelemetry.io/docs/zero-code/java/agent/performance/].
Across both runtimes, the dominant cost is almost never the trace API itself — it is the attributes you collect and the exporter you use. Avoid expensive attribute collection in hot loops (don’t call getName() on a reflection-heavy object for every span), keep attribute lists short, and prefer the async batched exporter.
Batching, Async Export, and Buffer Tuning
The OpenTelemetry BatchSpanProcessor (BSP) is the workhorse that decouples span generation from span export. Spans go into an in-memory queue, a background thread pulls them in batches, and the exporter ships them over HTTP/gRPC. The tunable knobs all trade memory for CPU and loss-resistance:
| Knob | Effect of larger value | Effect of smaller value |
|---|---|---|
| Batch size | Better CPU/network efficiency, more memory | Lower memory, more frequent exports |
| Schedule delay | Fewer export calls, spans linger in memory | Lower loss risk, more even export |
| Max queue size | Survives bigger spikes without dropping | Smaller memory footprint, drops earlier |
| Sampling rate | More traces shipped, higher overhead | Cheaper, less coverage |
[Source: https://opentelemetry.io/docs/zero-code/java/agent/performance/]
Sampling is by far the most powerful knob. Going from 100% to 10% sampling reduces per-request export work by roughly 10×, and that compound saving cascades through the entire pipeline — fewer spans allocated, fewer batches, less network, less collector CPU, less backend ingest cost. Most production systems run at 1–10% probabilistic sampling, sometimes augmented with tail sampling for rare-error coverage.
Two other patterns matter at scale. First, back-pressure: when the queue fills, BSP drops spans rather than blocking the application. You must monitor otel_sdk_span_processor_dropped_spans (or the equivalent backend metric); a sustained nonzero rate means you are losing trace data and need to either sample harder or grow the queue. Second, async-only exporters in production: the synchronous SimpleSpanProcessor blocks the application thread on every export and should be reserved for dev/test.
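The drop-on-full behavior is easier to reason about with a toy model. The sketch below is a single-threaded illustration, not the SDK's BatchSpanProcessor: the class name `BoundedSpanQueue` is hypothetical, the background export thread is omitted, and the default sizes of 2048 and 512 are assumptions chosen to resemble commonly documented defaults.

```python
from collections import deque


class BoundedSpanQueue:
    """Toy model of a span queue that drops instead of blocking."""

    def __init__(self, max_queue_size: int = 2048, max_batch_size: int = 512):
        self._queue = deque()
        self.max_queue_size = max_queue_size
        self.max_batch_size = max_batch_size
        self.dropped = 0  # the counter you would export and alert on

    def enqueue(self, span) -> bool:
        """Called on span end; never blocks the caller."""
        if len(self._queue) >= self.max_queue_size:
            self.dropped += 1
            return False
        self._queue.append(span)
        return True

    def next_batch(self) -> list:
        """Called by the export worker; returns up to max_batch_size spans."""
        size = min(self.max_batch_size, len(self._queue))
        return [self._queue.popleft() for _ in range(size)]


queue = BoundedSpanQueue(max_queue_size=3, max_batch_size=2)
accepted = [queue.enqueue(i) for i in range(5)]
print(accepted, queue.dropped)
```

The application thread only ever pays for an append or a counter increment, which is the property that keeps export problems from turning into request latency.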
Benchmarking Instrumented vs Uninstrumented Code
The right way to make capacity decisions is to measure your own application, not trust generic numbers. A minimal benchmark protocol:
- Baseline run: same hardware, same load generator, instrumentation disabled. Record p50/p95/p99 latency, CPU%, RSS, GC pause time, throughput.
- Instrumented run: same workload, instrumentation enabled at the sampling rate you intend to ship. Record the same metrics.
- Compute deltas: percentage CPU increase, absolute latency increase per request, RSS delta in MB.
- Stress run: drive load 2–3× above expected peak to find the breakpoint where instrumentation overhead becomes nonlinear (usually GC-driven on JVM).
- Tune and re-measure: lower sampling rate, drop unnecessary instrumentations, raise queue size, and re-run.
Plan capacity with headroom: +5–20% CPU for Java auto-instrumentation, low single-digit for Go, and budget heap growth on the JVM side [Source: https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/java/overhead].
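The "compute deltas" step is simple enough to script so that every benchmark run is reported the same way. The sketch below uses the figures from the EDOT Java table earlier in this section as sample input; the `overhead` helper and the dictionary keys are hypothetical names.

```python
def overhead(baseline: float, instrumented: float) -> tuple:
    """Return (absolute_delta, relative_delta_percent) for one measurement."""
    delta = instrumented - baseline
    return delta, 100.0 * delta / baseline


runs = {
    "startup_s": (5.55, 6.82),
    "p95_latency_ms": (1.96, 2.06),
    "cpu_percent": (53.82, 54.25),
}
for name, (base, inst) in runs.items():
    delta, rel = overhead(base, inst)
    print(f"{name}: {delta:+.2f} absolute, {rel:+.1f}% relative")
```

Reporting both columns avoids a common confusion with CPU figures, where a change in percentage points and a relative change in percent are easy to mix up.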
Figure 11.4: Async export pipeline (BatchSpanProcessor)
flowchart LR
App["Application thread<br/>span.end()"]
App -->|enqueue<br/>non-blocking| Queue["In-memory queue<br/>max_queue_size"]
Queue -->|"drop on full<br/>(dropped_spans counter)"| DropPath[Span dropped]
Queue -->|background worker<br/>pulls batches| Batcher["Batcher<br/>batch_size /<br/>schedule_delay"]
Batcher -->|OTLP gRPC/HTTP| Exp[Exporter]
Exp --> Col[OTel Collector]
Col --> Backend["Backend<br/>(Tempo / Mimir / Loki)"]
Key Takeaway: OpenTelemetry instrumentation typically costs 1–5% CPU and tens of MB of memory in well-tuned services, with Java agents at the higher end and Go SDKs at the lower. The most powerful overhead knob is the sampling rate, followed by batching configuration. Always benchmark against your own workload — generic numbers are a starting point, not a substitute for measurement.
Cost-aware Architecture
Sampling, cardinality, and overhead controls are tactical. The strategic layer is architecture: where in the pipeline you do what work, how telemetry flows from edge to backend, and how long each tier retains data. A well-architected observability stack costs a fraction of a poorly architected one for the same operational value.
Think of it as cold-chain logistics for telemetry. Fresh data needs fast, expensive storage. As it ages, you transfer it to progressively cheaper tiers, eventually freezing it into cold archive or deleting it entirely. The trick is matching the temperature curve of value-over-time to the cost curve of storage tiers.
Metric Aggregation at the Edge
The further telemetry travels before it is reduced, the more expensive it becomes. Network bytes, collector CPU, backend ingestion fees, hot-storage GB-hours — all scale with the volume that crosses each boundary.
Edge aggregation means doing reduction work as close to the source as possible:
- In the application: use OpenTelemetry SDK views to change histogram aggregation or bucket boundaries and to drop unwanted attributes before the export call. This is the cheapest possible place to do the work.
- In a sidecar or node agent: a per-node OTel Collector or Prometheus Agent does the first round of relabeling, aggregation, and sampling. Network egress from that node is already reduced.
- In a regional collector tier: a second collector tier consolidates across nodes, applies cross-cutting policies (tenant routing, tail sampling), and forwards to the central backend.
This “fan-in funnel” is the dominant production pattern. A typical Kubernetes observability stack might run a Prometheus Agent or OTel Collector DaemonSet (one per node), feeding a regional collector StatefulSet (one per cluster), feeding the central backend. Each tier reduces volume by 2–10×.
Figure 11.5: Fan-in collector tiers and tiered storage
flowchart LR
subgraph Apps["App pods (many)"]
A1[App + SDK]
A2[App + SDK]
A3[App + SDK]
end
subgraph Node["Node-level Collector<br/>(DaemonSet)"]
N1[Relabel + sample]
end
subgraph Regional["Regional Collector tier<br/>(StatefulSet)"]
R1["Tail sampling<br/>cardinality enforcement"]
end
subgraph Backend["Observability backend<br/>(Mimir / Tempo / Loki)"]
Hot["Hot tier<br/>SSD, 2–24 h"]
Warm["Warm tier<br/>S3, 7–30 d"]
Cold["Cold tier<br/>Glacier, 90 d – 1 y"]
end
A1 --> N1
A2 --> N1
A3 --> N1
N1 -->|2–10x reduction| R1
R1 -->|2–10x reduction| Hot
Hot -->|age out| Warm
Warm -->|age out| Cold
Log Volume Reduction Strategies
Logs are the wildest tier in cost. Unlike metrics (bounded by cardinality) and traces (bounded by sampling), logs are a free-form firehose that engineers reflexively crank up under stress. A few patterns to keep log spend in check:
- Structured logging only: JSON or equivalent. This lets the pipeline filter and route deterministically rather than running expensive regex over free-text.
- Severity-based routing: keep DEBUG and INFO in a short-retention warm tier (3–7 days). Send WARN and above to longer retention. Send only ERROR+ to the always-on alerting pipeline.
- Sample DEBUG/INFO in production: 1–10% sampling of high-volume application logs, similar to traces. Per-tenant or per-route exceptions for active investigations.
- Drop or hash unbounded fields: stack traces are large and often duplicate — deduplicate by hash, store the hash with the log, keep the full trace only on the first occurrence per N-minute window.
- Convert logs to metrics where possible: a counter of db_timeout_total{service="orders"} is vastly cheaper than ten thousand log lines of “DB timeout in orders.” Use log-to-metric processors in the collector.
- Suppress duplicate spam: a “circuit breaker open” log line at 1000 lines/sec is not telling you anything new after the first one. Use a rate-limiting processor.
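The last two patterns combine naturally: let the first occurrence of a message through, swallow repeats inside a window, and keep a count of what was swallowed so the information survives as a number. The sketch below is a minimal illustration with a hypothetical `DuplicateSuppressor` class; timestamps are passed in explicitly to keep it deterministic.

```python
class DuplicateSuppressor:
    """Allow one copy of each message per window and count the rest."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self._last_allowed = {}  # message -> timestamp of the line we let through
        self.suppressed = {}     # message -> number of lines swallowed

    def allow(self, message: str, now: float) -> bool:
        last = self._last_allowed.get(message)
        if last is None or now - last >= self.window_s:
            self._last_allowed[message] = now
            return True
        self.suppressed[message] = self.suppressed.get(message, 0) + 1
        return False


suppressor = DuplicateSuppressor(window_s=60.0)
decisions = [suppressor.allow("circuit breaker open", t) for t in (0.0, 1.0, 59.9, 60.0)]
print(decisions, suppressor.suppressed)
```

In a real pipeline the `suppressed` counts would be exported as a counter, which is the log-to-metric conversion from the previous bullet in miniature.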
Tiered Storage and Retention Policies
A modern observability backend like Mimir, Tempo, or Loki splits storage into tiers:
| Tier | Latency | Cost | Typical use |
|---|---|---|---|
| Hot (memory / SSD) | ms | High | Last 2–24 h, live alerting, on-call investigation |
| Warm (object storage, e.g. S3) | seconds | Medium | 1–30 days, recent incident review, capacity planning |
| Cold (archive / Glacier) | minutes–hours | Very low | 30 d – years, compliance, long-term trend analysis |
A reasonable default policy for each pillar:
| Pillar | Hot | Warm | Cold |
|---|---|---|---|
| Metrics (raw) | 2 h | 15 d | — |
| Metrics (recording rules) | 24 h | 90 d | 1 y |
| Traces (sampled) | 24 h | 7 d | — |
| Traces (interesting, tail-sampled) | 24 h | 30 d | 90 d |
| Logs (ERROR+) | 7 d | 30 d | 1 y |
| Logs (INFO/DEBUG, sampled) | 24 h | 7 d | — |
These numbers are illustrative: real retention should be driven by your incident-postmortem cadence, audit/compliance requirements, and how often investigations actually reach back beyond 7, 30, or 90 days. The principle is to measure how often queries hit each retention bucket and prune accordingly. If only a small fraction of queries reach beyond 30 days, as many teams find, that is a strong signal that anything older should move to cold storage.
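Measuring query reach can start from nothing more than a list of how far back each query looked. The sketch below assumes you can extract those lookback ages (in days) from your query logs; the function name `share_beyond` and the sample data are hypothetical.

```python
def share_beyond(lookbacks_days, thresholds=(7, 30, 90)) -> dict:
    """Fraction of queries whose lookback exceeds each threshold, in days."""
    total = len(lookbacks_days)
    if total == 0:
        return {t: 0.0 for t in thresholds}
    return {t: sum(d > t for d in lookbacks_days) / total for t in thresholds}


# Hypothetical sample: most queries look at the last few days.
sample = [0.5, 1, 1, 2, 3, 3, 3, 8, 45, 100]
print(share_beyond(sample))
```

If the share beyond a threshold stays small over several weeks of query logs, that threshold is a reasonable candidate for the boundary between two storage tiers.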
The architectural lever that ties this all together is back-pressure handling. When ingest spikes above capacity — whether from a traffic surge, a noisy deploy, or a runaway debug log — every component in the pipeline must have a documented behavior: drop, buffer, downsample, or block. The wrong default (block) cascades into application latency. The right default (drop with metric) keeps the application healthy and surfaces the overload to operators via clear telemetry-about-telemetry counters.
Key Takeaway: Cost-aware observability architecture treats telemetry as a tiered logistics problem: reduce volume as early as possible (edge aggregation), match retention to query frequency (hot/warm/cold tiers), and design every component for graceful degradation under back-pressure. The cheapest byte is the one you never shipped.
Chapter Summary
Sampling, cardinality, performance overhead, and architecture are the four levers that determine whether your observability stack is a tool or a cost center.
Sampling is your primary cost lever. Use ParentBased(TraceIdRatioBased(p)) for cheap, coherent head sampling at 1–10% on high-volume APIs. Add tail sampling at the collector when you must guarantee capture of rare events like errors or slow traces, but budget memory carefully (traces/sec × spans/trace × span_size × wait_seconds) and accept 5–30 seconds of observability latency.
Cardinality is the metric-side equivalent of sampling. Each unique label combination is a series, and unbounded labels (user_id, request_id, raw paths) explode cost multiplicatively. Identify offenders with topk queries and Mimir’s Cardinality Explorer, drop or normalize them at scrape time with metric_relabel_configs, aggregate via recording rules, and fix the application code so they never reach the pipeline.
Performance overhead of instrumentation is usually 1–5% CPU and tens of MB of memory in well-tuned services, with Java agents at the higher end (up to ~20% in worst cases) and Go SDKs at the lower end. The sampling rate is the most powerful overhead knob; batching tuning is second. Always benchmark against your own workload.
Architecture ties it all together. Reduce volume as early as possible (edge aggregation in app, node, regional tiers), match retention to query frequency (hot/warm/cold tiers), convert logs to metrics where possible, and design every stage for back-pressure with documented drop/buffer behavior. The cheapest, fastest, most accurate telemetry is the telemetry you correctly chose not to keep.
The discipline these four levers share is intentionality: every byte of telemetry should exist because you made a deliberate decision that its value exceeds its cost.
Key Terms
| Term | Definition |
|---|---|
| Head sampling | Sampling decision made at the start of a trace in the SDK, before spans are recorded. Cheap and deterministic but may miss rare events. |
| Tail sampling | Sampling decision made in the collector after the trace has (mostly) completed. Accurate for rare events but expensive in memory and adds observability latency. |
| TraceIdRatioBased | OpenTelemetry head sampler that hashes the trace ID to make a deterministic keep-or-drop decision at probability p. Usually wrapped in ParentBased. |
| ParentBased | OpenTelemetry sampler wrapper that makes child spans inherit their parent’s sampling decision, ensuring coherent traces across services. |
| Adaptive sampling | Sampling whose rate adjusts dynamically to traffic volume or a target traces-per-second budget. Useful for cost control but complicates statistical extrapolation. |
| Cardinality | The number of unique time series produced by a metric, bounded above by the product of the number of distinct values of each label. Unbounded labels cause cardinality explosion. |
| metric_relabel_configs | Prometheus configuration that operates on individual scraped samples to drop, rename, or normalize labels. The primary scrape-time cardinality control. |
| Recording rule | A Prometheus rule that pre-computes and persists an aggregation (e.g., sum by (service, le)) so queries hit a lower-cardinality derived series. |
| BatchSpanProcessor (BSP) | OpenTelemetry SDK component that buffers spans and exports them asynchronously in batches. Tunable via batch size, schedule delay, and queue size. |
| Batching | The practice of accumulating telemetry into groups before export to amortize per-call overhead and improve network efficiency, at the cost of slightly increased memory and latency. |
| Back-pressure | The condition where a downstream component cannot accept telemetry as fast as the upstream produces it. Must be handled by dropping, buffering, downsampling, or blocking — with the chosen behavior documented. |
| Retention | The duration that telemetry is kept queryable in each storage tier. Hot tiers (ms-latency) are short and expensive; cold tiers (minute-latency) are long and cheap. |
| Tiered storage | Architecture that automatically migrates telemetry between hot, warm, and cold tiers based on age, matching cost to query frequency. |
| Edge aggregation | The practice of reducing telemetry volume (sampling, aggregating, dropping attributes) as close to the source as possible — in the SDK, sidecar, or node agent — to minimize downstream cost. |
Chapter 12: SLOs, Alerting, and Operational Excellence
This is the capstone chapter. Earlier chapters gave you the signals — metrics from Prometheus, traces and logs from OpenTelemetry, dashboards from Grafana. This chapter gives you the discipline that turns those signals into a reliable production practice. You will move from “we collect telemetry” to “we run services to a contract with our users, and we know — quantitatively — when that contract is at risk.”
We will build the SLI/SLO model on top of PromQL, design multi-window multi-burn-rate (MWMBR) alerts that respect error budgets, configure Alertmanager so the on-call rotation is humane instead of corrosive, sketch a reference architecture for a Kubernetes platform, and finish with a look at where observability is heading — profiles as a fourth signal, AI-assisted root cause analysis, and the convergence of OpenTelemetry semantic conventions.
Learning Objectives
By the end of this chapter, you will be able to:
- Define SLIs and SLOs using PromQL and OpenTelemetry-sourced metrics, including latency, availability, freshness, and correctness indicators.
- Build actionable, low-noise alerting based on multi-window multi-burn-rate logic and error-budget policy.
- Configure Alertmanager routing, grouping, inhibition, and silencing to reduce on-call fatigue without losing real signal.
- Design a complete observability platform that evolves from greenfield through a mature production deployment, including capacity planning for the platform itself.
- Recognize emerging trends — continuous profiling, AI-assisted RCA, and semantic convention convergence — and decide when to invest in them.
Service Level Indicators and Objectives
A production service is a promise to its users. Service Level Indicators (SLIs) measure how well you keep that promise; Service Level Objectives (SLOs) are the targets you commit to; and the error budget is the contractually allowed slack between perfect and “good enough.” This is the conceptual machinery that lets engineering and product organizations talk about reliability without arguing about anecdotes.
Choosing Good SLIs
A good SLI is a ratio between good events and valid events — for example, “the fraction of HTTP requests that returned a non-5xx within 300 ms” — measured from the user’s perspective. Ratios are powerful because they normalize naturally with traffic: a 1% error rate is the same whether you serve 10 RPS or 10,000 RPS [Source: https://www.dash0.com/guides/prometheus-monitoring].
Four families of SLIs cover most services:
| SLI Family | Question It Answers | Example PromQL Shape |
|---|---|---|
| Availability | Did the request succeed? | sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m])) |
| Latency | Was the response fast enough? | sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) / sum(rate(http_request_duration_seconds_count[5m])) |
| Freshness | Is the data recent enough? | time() - max(pipeline_last_success_timestamp_seconds) < 600 |
| Correctness | Did the system compute the right answer? | Domain-specific — e.g., reconciliation counters |
Notice that two of these, availability and latency, are directly expressible from standard OpenTelemetry HTTP server instrumentation: the http.server.request.duration histogram (http.server.duration in older, pre-stable instrumentation) carries the http.response.status_code attribute, and its count series doubles as a request counter. Once exported to Prometheus, that attribute becomes a label that PromQL can filter and group on. This is why aligning on OTel semantic conventions (Chapter 8) pays off here: when every service in your fleet uses the same metric names, a single SLO recording-rule template applies everywhere.
Analogy: think of SLIs as the dashboard of a car. Availability is “does the engine start?”, latency is “how fast can it accelerate?”, freshness is “how old is the GPS reading?”, and correctness is “does the odometer read what we actually drove?” The car can be running while still failing on any of these dimensions.
Figure 12.1: SLI → SLO → Error Budget → Burn Rate
graph TD
SLI[Service Level Indicator<br/>good events / valid events<br/>e.g. non-5xx requests under 300ms]
SLO[Service Level Objective<br/>target on the SLI<br/>e.g. 99.9% over 30 days]
EB[Error Budget<br/>= 1 - SLO<br/>e.g. 0.1% ~ 43.2 min / month]
BR[Burn Rate<br/>observed_error_fraction / error_budget<br/>how fast budget is consumed]
A[Burn-Rate Alert<br/>fires when consumption is unsustainable<br/>page or ticket by severity]
SLI --> SLO
SLO --> EB
EB --> BR
BR --> A
Error Budgets
Once you commit to an SLO, simple arithmetic gives you the error budget:
error_budget = 1 - SLO
For a 99.9% availability SLO over 30 days:
- error_budget = 1 - 0.999 = 0.001 = 0.1%
- 30 days = 43,200 minutes
- Allowed downtime ≈ 0.001 × 43,200 = 43.2 minutes per month [Source: https://prometheus.io/docs/prometheus/latest/getting_started/]
That number is the most useful artifact in your reliability program. It converts an abstract percentage into a concrete budget that engineers, product managers, and on-call leads can reason about. If you have already burned 30 minutes this month, you have 13 minutes left — and that knowledge changes deployment risk decisions immediately.
A common error-budget policy reads: “While budget remains, ship features aggressively. When budget is exhausted, freeze risky deploys until the next window.” This makes reliability a forcing function on engineering priorities rather than a vague aspiration [Source: https://www.dash0.com/guides/prometheus-monitoring].
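The arithmetic and the policy above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names and the two-state "ship" / "freeze-risky-deploys" outcome are assumptions made for the example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes in the window: (1 - SLO) * window length."""
    return (1.0 - slo) * window_days * 24 * 60


def remaining_budget_minutes(slo: float, burned_minutes: float,
                             window_days: int = 30) -> float:
    """Budget left after subtracting what was already burned (may go negative)."""
    return error_budget_minutes(slo, window_days) - burned_minutes


def deploy_policy(slo: float, burned_minutes: float, window_days: int = 30) -> str:
    """Hypothetical two-state policy: ship while budget remains, freeze once it is gone."""
    remaining = remaining_budget_minutes(slo, burned_minutes, window_days)
    return "ship" if remaining > 0 else "freeze-risky-deploys"


if __name__ == "__main__":
    print(f"budget:    {error_budget_minutes(0.999):.1f} min")
    print(f"remaining: {remaining_budget_minutes(0.999, 30):.1f} min")
    print("after 30 min burned:", deploy_policy(0.999, 30))
    print("after 45 min burned:", deploy_policy(0.999, 45))
```

A real policy usually has more than two states (for example, requiring extra review as the budget runs low), but the comparison against the remaining budget stays the same.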
The table below shows budgets for common SLO targets. Note how each additional nine shrinks the budget by a factor of ten (compare 99%, 99.9%, and 99.99%):
| SLO | Allowed Bad Fraction | Monthly Budget (30d) | Quarterly Budget (90d) |
|---|---|---|---|
| 99% | 1.0% | 7 h 12 m | 21 h 36 m |
| 99.5% | 0.5% | 3 h 36 m | 10 h 48 m |
| 99.9% | 0.1% | 43.2 m | 2 h 9.6 m |
| 99.95% | 0.05% | 21.6 m | 1 h 4.8 m |
| 99.99% | 0.01% | 4.32 m | 12.96 m |
Burn Rate and Multi-Window Multi-Burn-Rate Alerts
The burn rate B measures how fast you are consuming the error budget relative to a steady-state pace:
B = observed_error_fraction / error_budget
If B = 1, you will consume exactly the entire budget over the SLO window. If B = 14.4, you’ll exhaust a 30-day budget in roughly 30/14.4 ≈ 2 days. Burn rate is the right abstraction for alerting because it answers the only question that matters: at the current pace, when do we run out of budget? [Source: https://www.dash0.com/guides/prometheus-monitoring]
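As a quick check of the numbers above, the burn-rate definition and the time-to-exhaustion estimate can be written as two small functions. The names are illustrative; the only inputs are the observed error fraction, the SLO, and the SLO window.

```python
def burn_rate(observed_error_fraction: float, slo: float) -> float:
    """B = observed_error_fraction / error_budget, with error_budget = 1 - SLO."""
    return observed_error_fraction / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is gone if the current burn rate persists."""
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return window_days / rate


if __name__ == "__main__":
    b = burn_rate(observed_error_fraction=0.0144, slo=0.999)
    print(f"burn rate: {b:.1f}")
    print(f"days to exhaustion: {days_to_exhaustion(b):.2f}")
```

Note that the estimate assumes a full budget at the start; if part of the budget is already spent, exhaustion arrives proportionally sooner.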
The naive approach, alerting when the error rate exceeds some constant threshold, has two problems: it pages on transient blips (false positives), and it misses slow, steady erosion of the budget (false negatives). The multi-window, multi-burn-rate (MWMBR) pattern from the Google SRE Workbook addresses both by defining several burn-rate tiers and requiring the short and long window of each tier to agree [Source: https://prometheus.io/docs/prometheus/latest/getting_started/]:
- The long window establishes that a meaningful share of the budget has actually been consumed, so a one-off blip does not page anyone.
- The short window confirms that the burn is still happening, so the alert fires only on a live problem and resets quickly once it stops.
For a 99.9% / 30-day SLO, the canonical thresholds are:
| Alert Type | Short Window | Long Window | Burn Rate | Error Threshold | Severity |
|---|---|---|---|---|---|
| Fast-burn page | 5 m | 1 h | 14.4 | 1.44% | page |
| Medium-burn page | 30 m | 6 h | 6 | 0.6% | page |
| Slow-burn ticket | 2 h | 24 h | 3 | 0.3% | ticket |
| Slowest-burn ticket | 6 h | 3 d | 1 | 0.1% | ticket |
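The Error Threshold column is simply burn rate × error budget, and each tier also implies how much of the 30-day budget is gone by the time the long window fills at that burn rate. The sketch below recomputes both from the table's burn rates and long windows; the function and variable names are assumptions made for illustration.

```python
SLO = 0.999
SLO_WINDOW_HOURS = 30 * 24  # 30-day SLO window

# (tier, long window in hours, burn rate), taken from the table above
TIERS = [
    ("fast-burn page", 1, 14.4),
    ("medium-burn page", 6, 6.0),
    ("slow-burn ticket", 24, 3.0),
    ("slowest-burn ticket", 72, 1.0),
]


def error_threshold(burn: float, slo: float = SLO) -> float:
    """Error ratio at which a tier fires: burn rate times the error budget."""
    return burn * (1.0 - slo)


def budget_consumed(burn: float, long_window_hours: float,
                    slo_window_hours: float = SLO_WINDOW_HOURS) -> float:
    """Fraction of the whole budget spent if `burn` holds for the long window."""
    return burn * long_window_hours / slo_window_hours


if __name__ == "__main__":
    for name, hours, burn in TIERS:
        print(f"{name:20s} threshold={error_threshold(burn):.2%} "
              f"budget consumed={budget_consumed(burn, hours):.0%}")
```

Seen this way, the fast-burn tier pages after about 2% of the budget is gone in an hour, while the two ticket tiers each correspond to roughly 10% of the budget.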
Implement the ratios as recording rules so they evaluate once and can be reused by alerts, dashboards, and ad-hoc queries:
groups:
  - name: slo-recording-rules
    interval: 30s
    rules:
      - record: slo:http_errors:ratio_rate5m
        expr: |
          sum by (service, env) (rate(http_requests_total{code=~"5.."}[5m]))
          /
          sum by (service, env) (rate(http_requests_total[5m]))
      - record: slo:http_errors:ratio_rate1h
        expr: |
          sum by (service, env) (rate(http_requests_total{code=~"5.."}[1h]))
          /
          sum by (service, env) (rate(http_requests_total[1h]))
      - record: slo:http_errors:ratio_rate6h
        expr: |
          sum by (service, env) (rate(http_requests_total{code=~"5.."}[6h]))
          /
          sum by (service, env) (rate(http_requests_total[6h]))
      - record: slo:http_errors:ratio_rate3d
        expr: |
          sum by (service, env) (rate(http_requests_total{code=~"5.."}[3d]))
          /
          sum by (service, env) (rate(http_requests_total[3d]))
Then the MWMBR alert rules:
- alert: SLOErrorBudgetBurnFast
  expr: |
    (
      slo:http_errors:ratio_rate5m{service="checkout"} > (14.4 * 0.001)
      and
      slo:http_errors:ratio_rate1h{service="checkout"} > (14.4 * 0.001)
    )
    and
    sum by (service, env) (rate(http_requests_total{service="checkout"}[5m])) > 1
  for: 2m
  labels:
    severity: page
    slo: availability-99.9-30d
    team: payments
  annotations:
    summary: "Checkout burning error budget at >14.4x in {{ $labels.env }}"
    description: |
      5m and 1h error ratios both exceed 1.44%. At this rate the 30-day
      budget will be exhausted in ~2 days. Investigate immediately.
    runbook_url: "https://runbooks.example.com/checkout/slo-fast-burn"
- alert: SLOErrorBudgetBurnSlow
  expr: |
    (
      slo:http_errors:ratio_rate6h{service="checkout"} > (1 * 0.001)
      and
      slo:http_errors:ratio_rate3d{service="checkout"} > (1 * 0.001)
    )
  for: 15m
  labels:
    severity: ticket
    slo: availability-99.9-30d
    team: payments
  annotations:
    summary: "Checkout slow burn: budget exhaustion within SLO window"
    runbook_url: "https://runbooks.example.com/checkout/slo-slow-burn"
Two production details worth noting. First, the sum(rate(...)) > 1 gate suppresses spurious 100% error ratios from a single failed request in a low-traffic window [Source: https://www.sysdig.com/blog/prometheus-exporters-best-practices]. Second, latency SLOs reuse the same machinery — just substitute a histogram-based “good ratio”:
slo:http_latency:good_ratio_5m =
sum by (service) (rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum by (service) (rate(http_request_duration_seconds_count[5m]))
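To make the histogram arithmetic concrete, here is a small Python sketch of the same "good ratio" computed from cumulative bucket counts, as Prometheus histograms store them. The bucket values are invented sample data, and the function names are assumptions. Because burn rate is defined on the bad fraction, a latency SLO feeds 1 - good_ratio into the burn-rate formula.

```python
def latency_good_ratio(cumulative_buckets: dict, threshold_le: str) -> float:
    """Share of requests at or under the threshold.

    `cumulative_buckets` maps each `le` bound to the number of requests
    observed in the window with duration <= that bound; "+Inf" is the total.
    """
    total = cumulative_buckets["+Inf"]
    if total == 0:
        raise ValueError("no requests observed in the window")
    return cumulative_buckets[threshold_le] / total


def latency_burn_rate(cumulative_buckets: dict, threshold_le: str, slo: float) -> float:
    """Burn rate for a latency SLO: slow fraction divided by the error budget."""
    bad_fraction = 1.0 - latency_good_ratio(cumulative_buckets, threshold_le)
    return bad_fraction / (1.0 - slo)


if __name__ == "__main__":
    # Invented sample: 1,000 requests in the window, 990 of them within 300 ms.
    window = {"0.1": 700, "0.3": 990, "1": 999, "+Inf": 1000}
    print(f"good ratio: {latency_good_ratio(window, '0.3'):.3f}")
    print(f"burn rate vs 99.9% target: {latency_burn_rate(window, '0.3', 0.999):.1f}")
```

The threshold must coincide with an existing bucket boundary (here le="0.3"); a threshold between boundaries cannot be measured exactly from bucket counts.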
Figure 12.2: Multi-Window Multi-Burn-Rate Alert Escalation
graph TD
ER[Observed Error Ratio<br/>recording rules over 5m, 1h, 6h, 3d]
F1{5m AND 1h<br/>above 14.4x burn?}
M1{30m AND 6h<br/>above 6x burn?}
S1{2h AND 24h<br/>above 3x burn?}
SS1{6h AND 3d<br/>above 1x burn?}
P1[Fast-Burn PAGE<br/>budget gone in ~2 days<br/>wake on-call now]
P2[Medium-Burn PAGE<br/>budget gone in ~5 days]
T1[Slow-Burn TICKET<br/>budget gone in ~10 days]
T2[Slowest-Burn TICKET<br/>budget exhaustion in window]
OK[No alert<br/>budget healthy]
ER --> F1
F1 -->|yes| P1
F1 -->|no| M1
M1 -->|yes| P2
M1 -->|no| S1
S1 -->|yes| T1
S1 -->|no| SS1
SS1 -->|yes| T2
SS1 -->|no| OK
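The escalation in Figure 12.2 can be sketched as a first-match ladder over precomputed error ratios. This is an illustration of the decision logic only: in Prometheus each tier is a separate alert rule, so several tiers can fire at once, and the sample ratios below are invented.

```python
from typing import Optional, Tuple

ERROR_BUDGET = 0.001  # 1 - 0.999

# (tier, short window, long window, burn rate, severity), most severe first
LADDER = [
    ("fast-burn", "5m", "1h", 14.4, "page"),
    ("medium-burn", "30m", "6h", 6.0, "page"),
    ("slow-burn", "2h", "24h", 3.0, "ticket"),
    ("slowest-burn", "6h", "3d", 1.0, "ticket"),
]


def evaluate(ratios: dict) -> Optional[Tuple[str, str]]:
    """Return (tier, severity) for the first tier whose two windows both exceed
    burn_rate * error_budget, or None when the budget is healthy."""
    for tier, short, long_, burn, severity in LADDER:
        threshold = burn * ERROR_BUDGET
        if ratios[short] > threshold and ratios[long_] > threshold:
            return tier, severity
    return None


if __name__ == "__main__":
    blip = {"5m": 0.05, "30m": 0.01, "1h": 0.002, "2h": 0.001,
            "6h": 0.0005, "24h": 0.0002, "3d": 0.0001}
    outage = {"5m": 0.2, "30m": 0.2, "1h": 0.1, "2h": 0.05,
              "6h": 0.02, "24h": 0.005, "3d": 0.002}
    erosion = {w: 0.002 for w in ("5m", "30m", "1h", "2h", "6h", "24h", "3d")}
    for name, sample in (("blip", blip), ("outage", outage), ("erosion", erosion)):
        print(name, "->", evaluate(sample))
```

The blip exceeds every short-window threshold it touches but never its paired long window, so nothing fires; the steady 0.2% erosion is invisible to the page tiers yet still trips the slowest ticket tier.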
Key Takeaway: SLOs convert reliability from opinion into arithmetic. Express SLIs as good/valid ratios using PromQL on OTel-conformant metrics, derive an error budget, and alert on burn rate across paired short and long windows so you page on spikes and catch slow erosion — without crying wolf.
Alerting Architecture
A well-tuned SLO is wasted if the alert it produces lands in a flood of unrelated noise at 3 a.m. The alerting architecture — how alerts route, group, inhibit, and reach humans — determines whether on-call is a sustainable practice or a route to burnout.
Alertmanager Routing, Grouping, and Inhibition
Alertmanager sits downstream of every Prometheus and routes alerts to receivers (PagerDuty, Slack, email). Three primitives shape its behavior: grouping, routing trees, and inhibition.
Grouping collapses multiple related alerts into a single notification. The right group_by labels are stable, low-cardinality identifiers of an incident, not of individual resources. Grouping by pod or container will explode notifications during every rolling deploy; grouping by alertname, service, severity, env produces one notification per actual problem [Source: https://institute.sfeir.com/en/kubernetes-training/deploy-kube-prometheus-stack-production-kubernetes/].
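The effect of the group_by choice is easy to see in a toy model. The sketch below is not Alertmanager code; it only mimics the idea that alerts sharing the same values for the group_by labels end up in one notification, using invented alert data.

```python
from collections import defaultdict


def group_alerts(alerts: list, group_by: list) -> dict:
    """Bucket alerts (label dicts) by the tuple of their group_by label values."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


if __name__ == "__main__":
    # Twenty pods of one service failing during a rolling deploy (invented data).
    alerts = [
        {"alertname": "HighErrorRate", "service": "checkout",
         "severity": "critical", "env": "prod", "pod": f"checkout-{i}"}
        for i in range(20)
    ]
    by_incident = group_alerts(alerts, ["alertname", "service", "severity", "env"])
    by_pod = group_alerts(alerts, ["alertname", "pod"])
    print("notifications when grouping by incident labels:", len(by_incident))
    print("notifications when grouping by pod:", len(by_pod))
```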
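Routing trees branch by severity, environment, and team ownership. Critical prod alerts go to PagerDuty; warnings go to email or a team Slack channel; info goes to a low-priority channel or nowhere [Source: https://www.plural.sh/blog/prometheus-operator-kubernetes-guide/]. The severity values in the tree must match the labels your rules actually emit: the burn-rate rules earlier in this chapter use page and ticket, so either relabel them to critical and warning or add matchers for those values. Otherwise they fall through to the default receiver.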
Inhibition rules suppress symptom alerts when a known root-cause alert is firing — for instance, a node-down alert silences every per-pod alert on the same node.
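The matching logic behind an inhibition rule can be sketched as follows. This is a simplified model under stated assumptions: it supports only exact-match label matchers (Alertmanager also accepts regex matchers), and the rule is represented as a plain dictionary invented for the example.

```python
def matches(alert: dict, matchers: dict) -> bool:
    """True when the alert carries every label value the matchers require."""
    return all(alert.get(label) == value for label, value in matchers.items())


def is_inhibited(target: dict, firing: list, rule: dict) -> bool:
    """A target is muted when it matches the rule's target matchers and some
    firing alert matches the source matchers with identical `equal` labels."""
    if not matches(target, rule["target"]):
        return False
    for source in firing:
        if source is target or not matches(source, rule["source"]):
            continue
        if all(source.get(label) == target.get(label) for label in rule["equal"]):
            return True
    return False


if __name__ == "__main__":
    rule = {
        "source": {"alertname": "KubernetesNodeDown", "severity": "critical"},
        "target": {"alertname": "KubePodNotReady"},
        "equal": ["node", "env"],
    }
    node_down = {"alertname": "KubernetesNodeDown", "severity": "critical",
                 "node": "node-3", "env": "prod"}
    pod_same_node = {"alertname": "KubePodNotReady", "node": "node-3", "env": "prod"}
    pod_other_node = {"alertname": "KubePodNotReady", "node": "node-4", "env": "prod"}
    print("same node muted: ", is_inhibited(pod_same_node, [node_down], rule))
    print("other node muted:", is_inhibited(pod_other_node, [node_down], rule))
```

The equal list is what keeps inhibition scoped: without it, one dead node would mute pod alerts across the whole fleet.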
A production-shaped Alertmanager configuration:
global:
  resolve_timeout: 5m
route:
  receiver: 'default-slack'
  group_by: ['alertname', 'service', 'severity', 'env']
  group_wait: 60s        # collect bursts before first send
  group_interval: 10m    # how often to send updates for an active group
  repeat_interval: 4h    # re-notify cadence for unresolved pages
  routes:
    - matchers:
        - severity="critical"
      routes:
        - matchers: [ env="prod" ]
          receiver: 'pagerduty-prod'
          continue: true
          routes:
            - matchers: [ team="payments" ]
              receiver: 'pagerduty-payments'
            - matchers: [ team="platform" ]
              receiver: 'pagerduty-platform'
        - matchers: [ env=~"staging|dev" ]
          receiver: 'slack-nonprod'
    - matchers:
        - severity="warning"
      receiver: 'slack-warnings'
      group_interval: 30m
      repeat_interval: 12h
    - matchers:
        - severity="info"
      receiver: 'slack-info'
      repeat_interval: 24h
inhibit_rules:
  - source_matchers: [ severity="critical", alertname="KubernetesNodeDown" ]
    target_matchers: [ alertname=~"KubePodCrashLooping|KubePodNotReady|InstanceDown" ]
    equal: ['node', 'env']
  - source_matchers: [ severity="critical", alertname="DatabaseUnavailable" ]
    target_matchers: [ alertname=~"SLOErrorBudgetBurn.*" ]
    equal: ['env', 'service']
receivers:
  - name: 'pagerduty-prod'
    pagerduty_configs:
      - service_key: '<KEY>'
  - name: 'pagerduty-payments'
    pagerduty_configs:
      - service_key: '<KEY>'
  - name: 'slack-warnings'
    slack_configs:
      - api_url: '<URL>'
        channel: '#alerts-warnings'
        send_resolved: true
  # default-slack, pagerduty-platform, slack-nonprod, and slack-info are defined
  # the same way (omitted here); Alertmanager rejects a configuration that
  # references an undefined receiver.
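To see which receiver a given alert reaches, it helps to walk the routing tree by hand. The sketch below models the tree from the configuration above as nested dictionaries and resolves an alert depth-first: the deepest matching route wins, a parent handles the alert when none of its children match, and continue lets evaluation proceed to later siblings. It is a simplified model, with regex matchers approximated as sets of allowed values.

```python
TREE = {
    "receiver": "default-slack",
    "routes": [
        {"match": {"severity": {"critical"}}, "routes": [
            {"match": {"env": {"prod"}}, "receiver": "pagerduty-prod",
             "continue": True, "routes": [
                 {"match": {"team": {"payments"}}, "receiver": "pagerduty-payments"},
                 {"match": {"team": {"platform"}}, "receiver": "pagerduty-platform"},
             ]},
            {"match": {"env": {"staging", "dev"}}, "receiver": "slack-nonprod"},
        ]},
        {"match": {"severity": {"warning"}}, "receiver": "slack-warnings"},
        {"match": {"severity": {"info"}}, "receiver": "slack-info"},
    ],
}


def resolve(alert: dict, node: dict, inherited: str = "") -> list:
    """Return the receivers that would be notified for `alert` under `node`."""
    for label, allowed in node.get("match", {}).items():
        if alert.get(label) not in allowed:
            return []
    receiver = node.get("receiver", inherited)  # children inherit the parent's receiver
    matched = []
    for child in node.get("routes", []):
        hits = resolve(alert, child, receiver)
        matched.extend(hits)
        if hits and not child.get("continue", False):
            break
    return matched or [receiver]


if __name__ == "__main__":
    samples = [
        {"severity": "critical", "env": "prod", "team": "payments"},
        {"severity": "critical", "env": "prod", "team": "search"},
        {"severity": "warning", "env": "prod"},
        {"severity": "page", "env": "prod", "team": "payments"},
    ]
    for alert in samples:
        print(alert, "->", resolve(alert, TREE))
```

The last sample is worth noting: an alert labeled severity=page matches none of the severity branches and lands on the default receiver.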
Pay attention to the timing knobs:
| Setting | Typical Range | Effect |
|---|---|---|
| group_wait | 30s – 2m | Hold first notification to collect siblings |
| group_interval | 5m – 30m | Cadence of updates while group active |
| repeat_interval | 2h–6h pages / 12h–24h tickets | Cadence of re-pages for unresolved issues |
For high availability, run three Alertmanager replicas with gossip clustering so notifications are deduplicated even during rolling restarts; otherwise a restart can double-fire every active alert [Source: https://institute.sfeir.com/en/kubernetes-training/deploy-kube-prometheus-stack-production-kubernetes/].
On-Call Ergonomics: Actionability and Runbook Links
Every page must be actionable: the on-call engineer should know within 60 seconds what to do. The hard rule is no page without a runbook URL. A runbook entry should answer five questions:
- How do I confirm this is real? (PromQL, dashboard, log query)
- What is the most likely cause?
- What are the safe remediation steps?
- When do I escalate, and to whom?
- What is the rollback criterion?
Annotate every alert rule:
annotations:
  summary: "Checkout SLO fast burn in {{ $labels.env }}"
  description: |
    5m/1h error ratio > 1.44% for service={{ $labels.service }}.
    Budget exhaustion in ~2 days at current rate.
  dashboard_url: "https://grafana.example.com/d/checkout"
  runbook_url: "https://runbooks.example.com/checkout/slo-fast-burn"
Track three on-call hygiene metrics monthly: alerts per incident, percentage of pages outside business hours, and percentage of alerts with valid runbooks. Any number trending in the wrong direction is a fixable problem [Source: https://community.grafana.com/t/best-practices-we-can-implement-in-production/79356].
Anti-Patterns: Paging on Causes vs Symptoms
The most common failure mode in alerting is paging on causes (“CPU > 90%”, “disk I/O latency > 50 ms”, “garbage collector pauses > 200 ms”) instead of symptoms that affect users. A CPU at 99% with users perfectly happy is not an incident; the same CPU at 60% during a brownout is. Anchor pages on SLI burn rates, which fire only when users are actually affected. Reserve cause-based alerts as warning-tier signals or as inputs to ticketing automation [Source: https://www.dash0.com/guides/prometheus-monitoring].
Other anti-patterns worth naming:
- Static thresholds on dynamic systems. “Latency > 500 ms” pages every Black Friday. Burn-rate alerts are the cure.
- Alerts without owners. Every rule must carry a team label; orphan alerts decay into noise.
- Per-pod alerts. Kubernetes is designed to churn pods. Alert on aggregates (Deployment, Service), not individual replicas.
- Silences that never expire. Always set bounded durations and require comments referencing a ticket [Source: https://prometheus.io/docs/prometheus/latest/getting_started/].
- No for: clause. Without a minimum duration, every blip becomes a page.
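Several of these anti-patterns can be caught mechanically before a rule ships. The following is a minimal lint sketch, assuming alert rules have already been parsed into dictionaries shaped like Prometheus rule entries; the check names and the convention that severity: page marks a paging alert are assumptions for illustration, not a standard tool.

```python
def lint_rule(rule: dict) -> list:
    """Return a list of hygiene problems found in one alerting rule."""
    problems = []
    labels = rule.get("labels", {})
    annotations = rule.get("annotations", {})
    if "for" not in rule:
        problems.append("missing for: clause")
    if "team" not in labels:
        problems.append("missing team label")
    if labels.get("severity") == "page" and not annotations.get("runbook_url"):
        problems.append("page without runbook_url")
    return problems


if __name__ == "__main__":
    rules = [
        {"alert": "SLOErrorBudgetBurnFast", "for": "2m",
         "labels": {"severity": "page", "team": "payments"},
         "annotations": {"runbook_url": "https://runbooks.example.com/checkout/slo-fast-burn"}},
        {"alert": "HighCPU",  # hypothetical rule that violates all three checks
         "labels": {"severity": "page"},
         "annotations": {}},
    ]
    for rule in rules:
        print(rule["alert"], "->", lint_rule(rule) or "ok")
```

Run as a CI step, a check like this turns the "no page without a runbook" rule from a convention into a gate.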
Figure 12.3: Alertmanager Routing Tree with Grouping and Inhibition
flowchart TD
P[Prometheus<br/>alert rules fire] --> AM[Alertmanager<br/>HA cluster x3]
AM --> G[Group by<br/>alertname, service,<br/>severity, env]
G --> I{Inhibition rules<br/>root-cause active?}
I -->|suppressed| X[Drop symptom alert]
I -->|allowed| R{Route by severity}
R -->|critical| RE{env?}
R -->|warning| SW[Slack #alerts-warnings<br/>repeat 12h]
R -->|info| SI[Slack #alerts-info<br/>repeat 24h]
RE -->|prod| RT{team?}
RE -->|staging or dev| SN[Slack #nonprod]
RT -->|payments| PDP[PagerDuty<br/>payments rotation]
RT -->|platform| PDF[PagerDuty<br/>platform rotation]
PDP --> RB[Runbook URL<br/>+ dashboard link<br/>+ silence controls]
PDF --> RB
Key Takeaway: A humane on-call rotation depends on Alertmanager doing aggressive grouping and inhibition, a routing tree that respects severity and ownership, and a strict rule that every page links to a runbook and points at a user-visible symptom — never a raw cause metric.
Putting It All Together
You have all the pieces. The remaining job is integration — designing a platform that holds together at scale, can be rolled out incrementally, and is itself observable.
Reference Architecture for a Kubernetes Platform
A mature cloud-native observability stack on Kubernetes has six logical layers:
- Instrumentation layer — OpenTelemetry SDKs and auto-instrumentation in application processes, emitting OTLP for metrics, traces, logs, and (increasingly) profiles.
- Collection layer — OpenTelemetry Collector DaemonSets (per node) for host/pod metrics and OTLP receivers, plus Collector Deployments for fan-in, processing, and routing. Prometheus is deployed via the Prometheus Operator for scrape-based metrics (kube-state-metrics, node-exporter, cAdvisor) [Source: https://www.plural.sh/blog/prometheus-operator-kubernetes-guide/].
- Storage layer — Prometheus for short-term metric storage; remote-write to a long-term store (Thanos, Mimir, Cortex); Tempo or Jaeger for traces; Loki or Elasticsearch for logs; Pyroscope or Parca for profiles.
- Rule and alerting layer — Prometheus rule files (recording + alerting), Alertmanager HA cluster, runbook hosting (often in the same docs site as user-facing documentation).
- Visualization & exploration layer — Grafana with datasource configuration spanning all signal stores, dashboards organized by domain (service, infrastructure, business KPIs).
- Meta-observability layer — A small, segregated Prometheus and Alertmanager whose only job is monitoring the observability stack itself.
Figure 12.4: Reference Observability Platform on Kubernetes
flowchart LR
subgraph L1[1. Instrumentation]
APP[Application Pods<br/>OTel SDKs<br/>auto-instrumentation]
end
subgraph L2[2. Collection]
DS[OTel Collector<br/>DaemonSet]
DEP[OTel Collector<br/>Deployment fan-in]
PROM[Prometheus Operator<br/>ServiceMonitors]
end
subgraph L3[3. Storage]
TSDB[Prometheus TSDB<br/>short-term]
LTM[Thanos / Mimir<br/>long-term metrics]
TR[Tempo / Jaeger<br/>traces]
LG[Loki<br/>logs]
PR[Pyroscope / Parca<br/>profiles]
end
subgraph L4[4. Rules & Alerting]
RR[Recording Rules]
AR[Alerting Rules<br/>MWMBR]
AM[Alertmanager HA<br/>routing + inhibition]
end
subgraph L5[5. Visualization]
GR[Grafana<br/>dashboards + Explore]
end
subgraph L6[6. Meta-Observability]
WD[Watchdog Prometheus<br/>+ Alertmanager<br/>monitors the monitors]
end
APP --> DS
APP --> DEP
APP --> PROM
DS --> TSDB
DS --> TR
DS --> LG
DEP --> LTM
PROM --> TSDB
TSDB --> LTM
TSDB --> RR
RR --> AR
AR --> AM
TSDB --> GR
LTM --> GR
TR --> GR
LG --> GR
PR --> GR
AM --> GR
WD -.watches.-> L2
WD -.watches.-> L3
WD -.watches.-> L4
The Prometheus Operator simplifies operations because rule files, scrape configs, and Alertmanager configs become Kubernetes CRDs (PrometheusRule, ServiceMonitor, AlertmanagerConfig) managed by the same GitOps tooling as the applications they observe [Source: https://www.plural.sh/blog/prometheus-operator-kubernetes-guide/].
Greenfield Rollout Strategy
If you are starting from scratch, the rollout sequence that minimizes risk while maximizing early value is:
| Week | Action | Outcome |
|---|---|---|
| 1 | Deploy kube-prometheus-stack with default rules | Cluster-level metrics, basic alerts, Grafana dashboards |
| 2 | Add OTel Collector DaemonSet, route to Prometheus + Tempo | First traces from auto-instrumented services |
| 3 | Define 2–3 critical SLOs with recording rules + MWMBR alerts | First user-anchored pages |
| 4 | Configure Alertmanager routing tree, inhibition, runbook URLs | Reduced noise; team ownership in place |
| 5–6 | Long-term metric store (Thanos/Mimir) via remote-write | Multi-month retention for capacity reviews |
| 7–8 | Add logs (Loki) and continuous profiling (Pyroscope) | Full four-signal stack |
| Ongoing | Per-team SLO definition workshops | Reliability conversations become a team practice |
Migration from Prometheus-only stacks follows a “wrap, don’t replace” rule. Keep Prometheus where it works (scrape-based infra metrics, alerting), and add OpenTelemetry Collectors as the entry point for new signals. The Collector’s prometheusremotewrite exporter ships OTel metrics into Prometheus or Mimir, while the prometheus receiver lets the Collector scrape existing exporters [Source: https://www.dash0.com/guides/prometheus-monitoring]. The result is a unified pipeline without a forklift migration.
Capacity Planning for the Observability Platform Itself (Meta-Observability)
Observability platforms tend to grow until they consume a meaningful fraction of cluster capacity. Treat the platform like any other service with SLOs of its own. The most important meta-metrics to alert on:
| Component | Meta-Metric | Why It Matters |
|---|---|---|
| Prometheus | prometheus_rule_evaluation_duration_seconds | Rule eval lag means alerts arrive late |
| Prometheus | prometheus_tsdb_head_series | Cardinality runaway is the #1 outage cause |
| Prometheus | up == 0, scrape_duration_seconds | Failed or slow scrapes = blind spots |
| Alertmanager | alertmanager_notifications_failed_total | Pages may not reach humans |
| Alertmanager | alertmanager_cluster_members | Cluster below expected size → deduplication breaks, duplicate notifications |
| OTel Collector | otelcol_exporter_queue_size / otelcol_exporter_queue_capacity | Backpressure → telemetry loss |
| OTel Collector | otelcol_processor_dropped_spans | Sampling/budgets working too aggressively |
| Tempo/Loki | Ingestion rate vs. configured limits | Quota violations drop data silently |
A small secondary Prometheus with a separate Alertmanager — sometimes called a “watchdog” — scrapes the primary stack and pages the platform team if it goes blind. The pattern is “who watches the watchmen?” implemented as a few hundred lines of YAML [Source: https://prometheus.io/docs/prometheus/latest/getting_started/].
Capacity sizing rules of thumb:
- Prometheus memory: ~3 KB per active series in steady state; budget 25% headroom for query bursts.
- Prometheus disk: ~1.5 bytes per sample after compression × samples/sec × retention seconds.
- Tempo/Jaeger: dominated by trace data volume; head-based sampling at 1–10% is typical, with tail-based sampling for error/slow traces [Source: https://www.dash0.com/guides/prometheus-monitoring].
- OTel Collector: 1 GB RAM per ~20k spans/sec with batching; CPU scales linearly with attribute count.
- Loki: log volume dominates; structured logs at WARN+ in prod are usually sustainable, DEBUG in prod is not.
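The two Prometheus rules of thumb above translate directly into a sizing calculator. The sketch below is a back-of-the-envelope estimate only: the per-series and per-sample constants are the approximations quoted in the list, 1 KB is treated as 1,024 bytes, and the 2-million-series workload is an invented example.

```python
KIB = 1024
GIB = 1024 ** 3


def prometheus_memory_gib(active_series: int, kib_per_series: float = 3.0,
                          headroom: float = 0.25) -> float:
    """Memory estimate: series x ~3 KB, plus headroom for query bursts."""
    return active_series * kib_per_series * KIB * (1.0 + headroom) / GIB


def prometheus_disk_gib(samples_per_sec: float, retention_days: float,
                        bytes_per_sample: float = 1.5) -> float:
    """Disk estimate: bytes per sample x samples/sec x retention in seconds."""
    return samples_per_sec * retention_days * 86_400 * bytes_per_sample / GIB


if __name__ == "__main__":
    series = 2_000_000           # invented workload
    scrape_interval_s = 15
    samples_per_sec = series / scrape_interval_s
    print(f"memory: {prometheus_memory_gib(series):.1f} GiB")
    print(f"disk (15d retention): {prometheus_disk_gib(samples_per_sec, 15):.0f} GiB")
```

Treat the output as a starting point for load testing rather than a commitment; real usage depends heavily on label cardinality, churn, and query patterns.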
Key Takeaway: A production observability platform is six layers — instrument, collect, store, rule, visualize, and observe itself. Roll out incrementally, lean on the Prometheus Operator and OTel Collector to keep configuration as code, and treat capacity planning for the platform as seriously as you would any user-facing service.
Where Observability Is Heading
The fundamentals — SLOs, MWMBR, Alertmanager hygiene — have been stable for nearly a decade. The interesting changes are happening at the edges, and they will reshape the platform you just built.
Profiles as a Fourth Signal
For most of observability’s history there have been “three pillars”: metrics, logs, traces. Continuous profiling — always-on sampling of stack traces with associated resource usage (CPU cycles, allocated bytes, off-CPU wait time) at 50–200 Hz with 1–3% overhead — is becoming the fourth [Source: https://www.dash0.com/guides/prometheus-monitoring].
What it answers, that the other three cannot:
| Signal | Question |
|---|---|
| Metrics | Is something wrong? |
| Logs | What happened? |
| Traces | Where in the request path? |
| Profiles | Exactly which code is consuming resources, and how did that change? |
A concrete example: metrics show p99 latency rising from 200 ms to 500 ms after a deploy. Traces narrow the slowness to CalculateDiscounts spans. Profiles reveal that 40% of CPU is now in a new calculate_rewards_v2 function, dominated by hashmap operations on a high-cardinality in-memory map. Without profiles, you have a suspect; with profiles, you have the exact lines of code.
Two open-source projects lead the space:
- Grafana Pyroscope — Prometheus-like label model, SDKs for Go/Java/.NET/Python/Ruby, eBPF support, native integration with the rest of the Grafana stack and a “drill from metric to profile” UX.
- Parca / Polar Signals — Kubernetes-native, primarily eBPF-based DaemonSet that profiles every process on a node with no code changes.
The OpenTelemetry profiling signal (currently maturing) adds profiles as a first-class OTLP signal alongside metrics/logs/traces, sharing resource attributes (service.name, k8s.pod.name) and trace/span IDs. Once stable, the same Collector pipeline that already routes your other signals will route profiles too [Source: https://prometheus.io/docs/prometheus/latest/getting_started/].
Adopt profiling when you have at least two of: CPU/memory bills you cannot easily attribute, mysterious tail latency, or performance regressions that are hard to bisect.
Continuous AI-Assisted Root Cause Analysis
The next pragmatic shift is in RCA workflows. Vendors and open-source projects are building systems that:
- Correlate signals automatically. When an SLO burn-rate alert fires, the system pulls the matching traces, the topology of upstream and downstream services, profiles taken during the same window, and recently merged commits and feature flags.
- Summarize hypotheses. An LLM synthesizes “What changed?”, “Where is the load concentrated?”, and “What is similar to past incidents?” into a draft RCA the on-call engineer can edit.
- Suggest queries. Instead of free-text PromQL or LogQL, engineers describe an investigation in natural language and the assistant proposes queries grounded in the available metric and log schemas.
This is not a replacement for engineering judgment. It is an accelerator that reduces median time-to-diagnose by removing the manual query-writing tax during incidents. Two prerequisites must be in place for it to work well: OTel semantic conventions so the assistant has a consistent vocabulary, and structured runbooks that the assistant can quote as remediation playbooks.
Convergence of OpenTelemetry Semantic Conventions
The longest-running success of the OTel project is that vendors are converging on the same names for the same things — service.name, http.response.status_code, db.system, k8s.pod.name, deployment.environment — so dashboards, alerts, and runbooks written against one backend can be ported to another with relatively little work [Source: https://www.dash0.com/guides/prometheus-monitoring].
For platform teams, the practical implications are:
- Vendor lock-in shrinks. Switching from one APM to another becomes a Collector reconfiguration rather than a re-instrumentation project.
- Cross-team analytics work. When every service uses service.name, you can compute fleet-wide SLO compliance with a single query.
- AI assistants become more useful. Standard attribute names mean models trained on one organization’s telemetry can generalize to another’s.
The strategic takeaway: invest in semantic-convention conformance now. Add CI checks that reject instrumentation using non-standard attribute names, and populate the standard resource attributes on every service's telemetry, for example with the Collector’s resourcedetection processor.
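A CI check of this kind can start very small. The sketch below flags attribute names that are neither on an allowlist nor under an organization-specific prefix. Both are assumptions for illustration: the allowlist contains only the attribute names mentioned in this section (a real check would load the full published conventions), and the app. prefix is a hypothetical namespace for custom attributes.

```python
# Illustrative allowlist: only the names mentioned in this section.
STANDARD_ATTRIBUTES = {
    "service.name",
    "http.response.status_code",
    "db.system",
    "k8s.pod.name",
    "deployment.environment",
}

CUSTOM_PREFIX = "app."  # hypothetical namespace reserved for in-house attributes


def nonstandard_attributes(attributes: dict,
                           allowed: set = STANDARD_ATTRIBUTES,
                           custom_prefix: str = CUSTOM_PREFIX) -> list:
    """Return attribute names that are neither standard nor properly namespaced."""
    return sorted(
        name for name in attributes
        if name not in allowed and not name.startswith(custom_prefix)
    )


if __name__ == "__main__":
    span_attributes = {
        "service.name": "checkout",
        "http.response.status_code": 200,
        "statusCode": 200,          # non-standard duplicate of the line above
        "app.cart.size": 3,         # custom, but correctly namespaced
        "svc": "checkout",          # non-standard alias for service.name
    }
    offenders = nonstandard_attributes(span_attributes)
    print("non-standard attributes:", offenders)
    raise SystemExit(1 if offenders else 0)  # non-zero exit fails the CI job
```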
Figure 12.5: The Four Pillars and the Path Forward
graph TD
subgraph PILLARS[Signals of Modern Observability]
M[Metrics<br/>Is something wrong?<br/>Prometheus, OTLP]
L[Logs<br/>What happened?<br/>Loki, Elasticsearch]
T[Traces<br/>Where in the request path?<br/>Tempo, Jaeger]
PF[Profiles<br/>Which lines of code?<br/>Pyroscope, Parca]
end
SC[OTel Semantic Conventions<br/>service.name, http.response.status_code,<br/>k8s.pod.name, deployment.environment]
AI[AI-Assisted RCA<br/>correlate signals, summarize hypotheses,<br/>suggest queries from runbooks]
INC[Incident<br/>faster MTTD & MTTR<br/>portable across vendors]
M --> SC
L --> SC
T --> SC
PF --> SC
SC --> AI
AI --> INC
Key Takeaway: The next decade of observability adds profiles as a fully supported signal, weaves AI assistance into the incident response loop, and accelerates semantic-convention convergence that turns vendor switching into a configuration exercise. Build the foundation on OTel today and you will inherit those benefits without re-platforming.
Chapter Summary
You have crossed the bridge from “we collect telemetry” to “we operate to a contract.” The model is:
- SLIs measure user-visible behavior as good/valid ratios.
- SLOs are targets on SLIs; their inverse is the error budget.
- Burn-rate alerts using multi-window multi-burn-rate (MWMBR) logic page only when the budget is genuinely at risk, not on every blip.
- Alertmanager turns alerts into humane notifications via grouping, severity-aware routing, inhibition, runbook-linked annotations, and HA clustering.
- A reference architecture layers OTel instrumentation, Prometheus + Operator, long-term storage, rules and alerting, Grafana, and meta-observability of the platform itself.
- The next frontier is continuous profiling as a fourth signal, AI-assisted RCA built on standardized telemetry, and OTel semantic convention convergence that erodes vendor lock-in.
The discipline embedded in these patterns is what separates teams that run reliable services from teams that fight fires. The arithmetic of error budgets gives you a shared language with product and engineering leadership. The grouping and inhibition rules give you a sustainable on-call rotation. The meta-observability layer gives you confidence that your monitoring still works when everything else doesn’t. And the trajectory toward profiles, AI assistance, and convention-conformant telemetry means the investment compounds: the foundation you build today gets more powerful with every new capability the ecosystem ships.
If you have made it through this textbook and applied even half of these practices, you are running a cloud-native observability stack that would have been state of the art at most major tech companies five years ago. The work from here is iteration: refining SLOs as you learn what users actually care about, tightening alert rules after every incident review, and adopting new signals as they mature. That is the operational excellence loop.
Key Terms
| Term | Definition |
|---|---|
| SLI | Service Level Indicator. A quantitative measure of a service property — usually a ratio of good events to valid events — that reflects user experience (availability, latency, freshness, correctness). |
| SLO | Service Level Objective. A target value or range for an SLI over a defined window (e.g., 99.9% availability over 30 days). |
| Error budget | The allowed amount of “bad” behavior under an SLO, computed as 1 - SLO over the SLO window. For 99.9% over 30 days, ≈ 43 minutes of downtime per month. |
| Burn rate | The ratio of observed error fraction to error budget. A burn rate of 1 exhausts budget exactly over the SLO window; 14.4 exhausts it in ~2 days for a 30-day window. |
| MWMBR | Multi-window multi-burn-rate. An alerting pattern that requires both a short and a long window to exceed a burn-rate threshold before firing, balancing fast detection against false positives. |
| Alertmanager | The Prometheus-ecosystem component that receives alerts, deduplicates and groups them, applies inhibition rules and silences, and routes notifications to receivers like PagerDuty or Slack. |
| Inhibition rule | An Alertmanager rule that suppresses a target alert while a matching source alert is firing with the same values for the labels listed in equal; used to silence symptom alerts when a root-cause alert exists. |
| Runbook | An operational document linked from every paging alert that explains how to confirm the issue, likely causes, remediation steps, escalation paths, and rollback criteria. |
| Meta-observability | The practice of monitoring the observability platform itself — Prometheus rule eval duration, Alertmanager notification success, Collector queue saturation, exporter scrape health — typically via a separate, isolated stack. |
| Continuous profiling | Always-on sampling of stack traces with associated resource usage (CPU, memory, off-CPU time) at 50–200 Hz with 1–3% overhead, visualized as flame graphs and diff views — the emerging fourth signal of observability. |
| OTel semantic conventions | The standardized set of attribute names (service.name, http.response.status_code, k8s.pod.name, etc.) defined by the OpenTelemetry project that enable cross-vendor portability of dashboards, alerts, and AI-assisted tooling. |