Cloud-native Observability with Prometheus and OpenTelemetry

An intermediate, hands-on guide to instrumenting, collecting, querying, and operationalizing telemetry in cloud-native systems using Prometheus and OpenTelemetry.

Chapter 1: Foundations of Cloud-native Observability

Learning Objectives

From Monitoring to Observability

Definition of observability borrowed from control theory

The word observability originated in control theory, where it describes the degree to which the internal state of a system can be inferred from its external outputs. Modern software borrows that idea: a system is observable when its emitted telemetry — metrics, logs, and traces — is rich enough that engineers can reconstruct what is happening inside it, even for situations no one anticipated [Source: https://opentelemetry.io].

This stands in contrast to monitoring, which is fundamentally about checking known signals against predefined thresholds. Monitoring answers questions you wrote down ahead of time: “Is checkout-service returning too many 5xx errors?” or “Is CPU usage on node-3 above 90%?” [Source: https://prometheus.io]. It is a detection mechanism, a tripwire built around failure modes you already understand.

A useful analogy: monitoring is like the dashboard of a car. Speed, fuel level, engine temperature, and a few warning lights cover the most common failure cases. Observability is more like having a full diagnostic port plus a mechanic’s toolkit on hand. When the car behaves strangely in a way the dashboard cannot explain — a vibration only at 73 mph in the rain — you need the ability to probe, query, and correlate signals that were never wired up to a warning light.

Key Takeaway: Monitoring tells you that something is wrong by checking predefined signals; observability gives you the raw material to figure out what and why by enabling ad-hoc exploration of correlated telemetry. Both are necessary, but observability is what makes complex cloud-native systems debuggable.

Known-unknowns vs unknown-unknowns

The classic framing for this distinction is known unknowns versus unknown unknowns. A known unknown is a failure mode you can name in advance: “the database might run out of connections,” “the disk might fill up,” “a pod might enter CrashLoopBackOff.” For each of these, you can pre-build a metric, threshold, and alert [Source: https://prometheus.io].

Unknown unknowns are the failures that nobody on the team has imagined yet. Consider a real-world scenario: sporadic 500 errors appear from checkout-service only during peak traffic, only for the /checkout/card endpoint, only on version v2, and only for tenant_type="enterprise". No engineer wrote an alert for that combination of conditions. With pure monitoring you might see only a vague error-rate spike. With observability, you can slice the metric by labels, pivot to the traces matching the failing requests, and finally jump to the logs for the specific failing spans to discover that a misconfigured PAYMENT_TIMEOUT=200ms environment variable was deployed in v2 [Source: https://opentelemetry.io].

That investigative path — metric → trace → log — is the practical essence of observability. It lets you discover the question you didn’t know to ask.
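The first step of that path, slicing an error signal by labels until one combination stands out, can be sketched in a few lines. This is an illustration only, not a real query engine: the in-memory records, the field names, and the `worst_slice` helper are hypothetical and simply mirror the checkout-service scenario.

```python
from collections import Counter

# Hypothetical request records; a real system would query a metrics or
# tracing backend instead of an in-memory list.
requests = [
    {"endpoint": "/checkout/card", "version": "v2", "tenant_type": "enterprise", "status": 500},
    {"endpoint": "/checkout/card", "version": "v2", "tenant_type": "enterprise", "status": 500},
    {"endpoint": "/checkout/card", "version": "v2", "tenant_type": "free", "status": 200},
    {"endpoint": "/checkout/card", "version": "v1", "tenant_type": "enterprise", "status": 200},
    {"endpoint": "/cart", "version": "v2", "tenant_type": "enterprise", "status": 200},
    {"endpoint": "/cart", "version": "v1", "tenant_type": "free", "status": 500},
]


def worst_slice(records, dimensions):
    """Return (label combination, error count) for the slice with the most 5xx responses."""
    errors = Counter(
        tuple(record[dim] for dim in dimensions)
        for record in records
        if record["status"] >= 500
    )
    return errors.most_common(1)[0]


combo, count = worst_slice(requests, ("endpoint", "version", "tenant_type"))
```

Here `combo` identifies the `/checkout/card`, `v2`, `enterprise` slice, which is the point at which an operator would pivot to matching traces and then to their logs.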

Key Takeaway: Monitoring is sufficient when you can enumerate failure modes in advance. Distributed cloud-native systems generate too many novel failure combinations for that approach, which is why observability — high-cardinality, correlated, queryable telemetry — has become a baseline requirement rather than a luxury.

Why dashboards alone are insufficient in distributed systems

Dashboards are fundamentally a summarization tool: they aggregate raw telemetry into a small number of charts chosen ahead of time. That works beautifully when the number of important questions is small and stable. It breaks down when the number of meaningful slices through your data explodes.

In a microservice deployment with 30 services, 5 versions in flight at any moment, dozens of tenants, multiple regions, feature flags, and Kubernetes-supplied attributes (namespace, deployment, pod, node), the dimensional space of interesting views runs into the millions. No fixed dashboard can pre-render every relevant slice [Source: https://www.cncf.io].

Figure 1.1: Monitoring dashboards vs. observability query interfaces

flowchart LR
    subgraph Monitoring["Monitoring (pre-built dashboards)"]
        direction TB
        G1["Gauge: CPU %"]
        G2["Gauge: Error rate"]
        G3["Gauge: Latency p99"]
        G4["Fixed thresholds<br/>and alerts"]
    end
    subgraph Observability["Observability (ad-hoc query interface)"]
        direction TB
        Q["Free-form query<br/>by service, version,<br/>tenant, region, endpoint"]
        Q --> R1["Slice metrics"]
        Q --> R2["Pivot to traces"]
        Q --> R3["Drill into logs"]
    end
    Monitoring -. "answers known questions" .-> Known["Known unknowns"]
    Observability -. "discovers new questions" .-> Unknown["Unknown unknowns"]

A second problem is that dashboards summarize signals, but rarely correlate across signals. When an alert fires on a latency metric, the human operator must mentally bridge from “the chart looks bad” to “let me find a representative trace” to “let me read the logs for that trace.” Observability tooling closes that loop with exemplars and trace IDs that make the pivot a single click rather than a sprawling investigation.

Key Takeaway: Dashboards are a useful summary surface, but they cannot anticipate every question a distributed system will provoke. Observability shifts the workflow from “look at pre-built charts” to “ask arbitrary questions of correlated telemetry.”

The Three Pillars: Metrics, Logs, and Traces

Strengths and weaknesses of each signal

The three pillars of observability — metrics, logs, and traces — are complementary because each has different strengths along the axes of cardinality, cost, and temporal granularity.

Metrics are numeric time series: counters, gauges, and histograms sampled at fixed intervals. They are cheap, summarizable, and ideal for alerting. A typical Kubernetes metric looks like http_requests_total{service="api",status="500"} or a latency histogram like request_duration_seconds_bucket{le="0.1",service="checkout"} [Source: https://prometheus.io]. Their weakness is that they are inherently aggregated: you trade per-event detail for compactness.

Logs are discrete events — usually structured JSON in cloud-native systems — emitted by applications and infrastructure. In Kubernetes this includes application logs from containers, kubelet logs, API server logs, and ingress controller logs. Logs are excellent for capturing what exactly happened in a single moment, including exception stack traces, parameter values, and business context. Their weakness is volume: at scale, indexing and storing every log line is expensive [Source: https://opentelemetry.io].

Traces are end-to-end records of a request as it flows through multiple services. A trace is composed of spans, each representing a unit of work in one service, with parent-child relationships representing the call hierarchy. Trace context is propagated across processes via HTTP headers like W3C traceparent. Traces are the only signal type that natively captures the structure of a distributed request. Their weakness is also volume, which is why traces are usually sampled rather than collected exhaustively [Source: https://opentelemetry.io].

| Signal | Best for | Cardinality tolerance | Typical cost driver |
| --- | --- | --- | --- |
| Metrics | Alerting, trends, SLOs | Low | Number of time series |
| Logs | Detailed per-event context | High | Storage and indexing volume |
| Traces | Cross-service request flow | High | Sample rate and span count |

Figure 1.2: The three pillars and their connective tissue

flowchart TD
    Obs["Observability"]
    Obs --> M["Metrics<br/>aggregated numeric<br/>time series"]
    Obs --> L["Logs<br/>discrete structured<br/>events"]
    Obs --> T["Traces<br/>causal request flow<br/>across services"]
    M -- "exemplars<br/>(metric point to trace)" --> T
    T -- "trace_id in log lines" --> L
    L -- "span_id back to trace" --> T
    M -. "resource attributes<br/>(k8s.namespace, service, version)" .- L
    M -. "resource attributes" .- T
    L -. "resource attributes" .- T

Key Takeaway: No single pillar is sufficient on its own. Metrics tell you that something is wrong at scale; logs tell you what exactly happened in one event; traces tell you where in the call chain the problem lives. A mature observability stack uses all three deliberately.

Correlation between signals via exemplars and trace IDs

The leap from “three separate pillars” to “one observability fabric” happens when the signals are correlated. Two mechanisms make this possible.

First, exemplars are pointers attached to metric data points that link a specific bucket of a histogram (for example, a latency point at 2.3 seconds) to one concrete trace that contributed to that bucket. When you see a tall bar in a latency histogram, an exemplar lets you click straight to a representative slow trace, rather than guessing which of millions of requests it was [Source: https://prometheus.io].
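To make the idea concrete, the sketch below renders a histogram bucket sample with an exemplar attached and then recovers the trace ID from it. The line layout follows the OpenMetrics text format as I understand it (sample, then `#`, then exemplar labels and the observed value). The helper names and the sample values are hypothetical.

```python
import re


def bucket_line_with_exemplar(metric, le, count, trace_id, observed_value):
    """Render one histogram bucket sample followed by an OpenMetrics-style exemplar."""
    return (
        f'{metric}_bucket{{le="{le}"}} {count} '
        f'# {{trace_id="{trace_id}"}} {observed_value}'
    )


_EXEMPLAR = re.compile(r'# \{trace_id="([0-9a-f]+)"\}')


def trace_id_from_line(line):
    """Extract the exemplar's trace ID, or None if the sample has no exemplar."""
    match = _EXEMPLAR.search(line)
    return match.group(1) if match else None


line = bucket_line_with_exemplar(
    "request_duration_seconds", "2.5", 42,
    "4bf92f3577b34da6a3ce929d0e0e4736", 2.3,
)
linked_trace = trace_id_from_line(line)
```

The trace ID travels alongside the sample rather than inside its label set, so it does not create a new time series.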

Second, trace IDs embedded in structured logs allow you to jump from any log line to the full trace it belongs to — and from any span in a trace to the exact log lines that span emitted. When a trace shows a span failing with a TimeoutError, you can pivot directly to that span’s logs to see parameters, exception details, and surrounding context [Source: https://opentelemetry.io].
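A minimal sketch of that pivot, assuming logs are emitted as JSON lines with `trace_id` and `span_id` fields. The field names, helper functions, and sample IDs are illustrative assumptions, not a prescribed schema.

```python
import json


def log_record(message, trace_id, span_id, **fields):
    """Build one structured (JSON) log line carrying trace correlation IDs."""
    return json.dumps(
        {"message": message, "trace_id": trace_id, "span_id": span_id, **fields},
        sort_keys=True,
    )


def logs_for_span(lines, trace_id, span_id):
    """Return the parsed log records emitted by one span of one trace."""
    parsed = (json.loads(line) for line in lines)
    return [r for r in parsed if r["trace_id"] == trace_id and r["span_id"] == span_id]


TRACE = "4bf92f3577b34da6a3ce929d0e0e4736"
lines = [
    log_record("charging card", TRACE, "00f067aa0ba902b7", amount_cents=1999),
    log_record("payment failed", TRACE, "53995c3f42cd8ad8", error="TimeoutError"),
    log_record("unrelated request", "0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331"),
]
matches = logs_for_span(lines, TRACE, "53995c3f42cd8ad8")
```

Filtering on both IDs narrows the log stream to the single failing span, which is the pivot a trace viewer performs when it links a span to its logs.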

Resource attributes such as k8s.namespace.name, k8s.deployment.name, and k8s.pod.name tie all three pillars to the underlying Kubernetes workload, so a single label can pivot you from a metric chart to a log query to a trace search without losing context.

Key Takeaway: Exemplars connect metrics to traces; trace IDs in logs connect logs to traces; resource attributes connect everything to the workload. Together these three correlations let an operator move fluidly between pillars during an incident.

Cardinality, sampling, and cost trade-offs

Cardinality is the number of unique combinations of labels in a telemetry dataset. It is the single most important operational concept in cloud-native observability, because each pillar handles it differently — and getting it wrong is expensive [Source: https://prometheus.io].

In a Prometheus-style metrics system, every unique combination of label values creates a new time series. A counter like http_requests_total{service, endpoint, status, pod} with 10 services × 50 endpoints × 5 statuses × 200 pods produces 500,000 time series. Add user_id to the label set and the count explodes to millions, causing memory pressure, slow queries, and even out-of-memory crashes on the Prometheus server [Source: https://prometheus.io].
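The arithmetic is worth checking by hand. The sketch below multiplies per-label cardinalities to get the worst-case series count. The `user_id` cardinality of 10,000 is an assumed figure for illustration, and the product is an upper bound because not every label combination occurs in practice.

```python
from math import prod


def potential_series(label_cardinalities):
    """Worst-case number of time series: the product of per-label cardinalities."""
    return prod(label_cardinalities.values())


labels = {"service": 10, "endpoint": 50, "status": 5, "pod": 200}
baseline = potential_series(labels)

# Assumed figure: 10,000 distinct users, each becoming a label value.
with_user_id = potential_series({**labels, "user_id": 10_000})
```

Every additional label multiplies the total instead of adding to it, which is why a single unbounded label dominates the cost.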

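The practical rule is to keep low-cardinality dimensions (service, endpoint, status) in metric labels and push high-cardinality detail such as user IDs into logs and traces, where sampling keeps the volume under control.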

Sampling strategies vary. Head-based sampling decides at the start of a request whether to keep it; it is cheap but blind to errors. Tail-based sampling buffers spans and decides after the request completes, allowing you to retain 100% of errors and slow traces while sampling normal traffic at a low rate.
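The difference between the two strategies comes down to when the decision is made and what information is available at that moment. The following sketch assumes a trace is summarized as a dict with `error` and `duration_s` fields. Those fields, the function names, and the one-second slow threshold are hypothetical.

```python
import random


def head_sample(rate, rng=random.random):
    """Head-based: decide at request start, before the outcome is known."""
    return rng() < rate


def tail_sample(trace, baseline_rate, slow_threshold_s=1.0, rng=random.random):
    """Tail-based: decide after completion; always keep errors and slow traces."""
    if trace["error"] or trace["duration_s"] >= slow_threshold_s:
        return True
    return rng() < baseline_rate


failed = {"error": True, "duration_s": 0.08}
slow = {"error": False, "duration_s": 2.4}
normal = {"error": False, "duration_s": 0.05}

# With a baseline rate of 0.0, only the interesting traces survive.
kept = [tail_sample(t, baseline_rate=0.0) for t in (failed, slow, normal)]
```

The head-based function never sees the trace, so it cannot favor errors. The tail-based function can, at the cost of buffering spans until the request finishes.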

Key Takeaway: Cardinality is a first-class design concern. Push low-cardinality dimensions to metrics, high-cardinality detail to logs and traces, and sample aggressively to control cost. Misplacing a label can quietly bankrupt a metrics backend.

Cloud-native Operational Context

Kubernetes, containers, and ephemeral workloads

Traditional monitoring was built for a world of long-lived hosts with stable identities: hostnames, IPs, and static service inventories. Kubernetes turns those assumptions upside down. Pods are short-lived by design — they are created and destroyed in seconds due to autoscaling, rolling deployments, crash loops, and health-probe failures. Pod names and IPs are not stable identifiers; each restart can produce new ones [Source: https://kubernetes.io].

Several practical pathologies follow from this. A dashboard keyed by pod name becomes unreadable as identities churn every few minutes. Historical time series for a specific pod are meaningless once that pod is gone. Alerts that fire on a pod that has already terminated produce “resource not found” errors when an operator clicks through. Batch jobs or crashing pods may exist for only seconds — too briefly for a legacy agent to discover them at all [Source: https://kubernetes.io].

Figure 1.3: Rolling deployment re-keys metrics from pods to the Deployment

flowchart LR
    subgraph T0["t=0: v1 steady state"]
        P1A["pod checkout-v1-a"]
        P1B["pod checkout-v1-b"]
        P1C["pod checkout-v1-c"]
    end
    subgraph T1["t=1: rolling update"]
        P1B2["pod checkout-v1-b"]
        P2A["pod checkout-v2-a"]
        P2B["pod checkout-v2-b"]
    end
    subgraph T2["t=2: v2 steady state"]
        P2A2["pod checkout-v2-a"]
        P2B2["pod checkout-v2-b"]
        P2C["pod checkout-v2-c"]
    end
    T0 --> T1 --> T2
    T0 --> Q["Stable query:<br/>sum by (service, version)<br/>(rate(http_requests_total))"]
    T1 --> Q
    T2 --> Q
    Q --> Dash["Continuous series<br/>keyed by Deployment +<br/>version, not pod name"]

The cloud-native response is to treat workloads as the unit of observability, not pods or hosts. Aggregate metrics across all pods in a Deployment using labels like service, namespace, and version. Use Prometheus’s built-in Kubernetes service discovery to find scrape targets dynamically via the Kubernetes API rather than maintaining a static target list. Send logs from containers via a node-level DaemonSet collector (Fluent Bit, Vector) to a centralized store so that records survive the pod that created them [Source: https://prometheus.io].

A concrete example: during a rolling deployment from v1 to v2, you should be able to issue the query sum(rate(http_requests_total{service="checkout", version="v2"}[5m])) and trust the answer even though the underlying pods that contributed to it may all have been replaced during the query window.
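What that aggregation does can be mimicked outside PromQL. In the sketch below, the per-pod rates are made-up numbers and `sum_by` is a hypothetical helper that drops the pod label and sums by the stable workload labels, roughly what `sum by (service, version)` does.

```python
from collections import defaultdict

# Hypothetical per-pod request rates (requests/second) mid-rollout.
pod_rates = [
    {"pod": "checkout-v1-b", "service": "checkout", "version": "v1", "rate": 12.0},
    {"pod": "checkout-v2-a", "service": "checkout", "version": "v2", "rate": 30.0},
    {"pod": "checkout-v2-b", "service": "checkout", "version": "v2", "rate": 28.5},
]


def sum_by(samples, keys):
    """Sum per-pod rates into one value per combination of the given labels."""
    totals = defaultdict(float)
    for sample in samples:
        totals[tuple(sample[key] for key in keys)] += sample["rate"]
    return dict(totals)


by_workload = sum_by(pod_rates, ("service", "version"))
```

Because the pod name never appears in the result key, replacing every pod leaves the `("checkout", "v2")` series continuous.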

Key Takeaway: Ephemeral pods break host-centric monitoring. Cloud-native observability aggregates at the workload level using stable labels (deployment, namespace, version) and relies on dynamic Kubernetes service discovery rather than static target lists.

Microservices and the death of the call stack

In a monolith, debugging a user request usually means reading a single process’s stack trace. Almost all interesting work happens in-process, in a single language runtime, so a JVM profiler or .NET APM agent can see the whole transaction.

Microservices destroy this comfortable assumption. A single user request can traverse dozens of services, each potentially written in a different language (Go, Java, Node.js, Python, Rust, .NET) and communicating over multiple protocols (HTTP, gRPC, Kafka, SQS) [Source: https://opentelemetry.io]. There is no single stack trace; there is only a distributed call hierarchy that lives in no one process’s memory.

Traditional APM tools were designed around proprietary agents tightly coupled to specific runtimes. They cannot reliably stitch a request together when it crosses language boundaries or hops onto a message queue. The result is that engineers see traces only within one service, and cross-team debugging becomes guesswork: “Is it my service or theirs?” without a shared trace ID [Source: https://opentelemetry.io].

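The cloud-native answer is distributed tracing built on open standards: vendor-neutral SDKs that instrument every service the same way, and W3C Trace Context propagation that carries a shared trace ID across process boundaries.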
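Context propagation is easier to reason about once the header itself is visible. The sketch below parses a W3C `traceparent` header (version, trace ID, parent span ID, flags) and derives the header a service would send downstream. It is a simplified illustration: the function names are hypothetical, only the sampled flag is preserved, and real SDKs handle `tracestate` and version negotiation as well.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Split a traceparent header into its four fields, rejecting malformed values."""
    match = _TRACEPARENT.match(header)
    if not match:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace or span IDs are invalid")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming):
    """Keep the trace ID, mint a new span ID, and carry the sampled flag forward."""
    ctx = parse_traceparent(incoming)
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if ctx["sampled"] else "00"
    return f'00-{ctx["trace_id"]}-{new_span_id}-{flags}'


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

Every hop repeats this step, so all spans share one trace ID regardless of the language or runtime that produced them.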

Key Takeaway: The call stack does not span service boundaries in a microservices architecture. Distributed tracing — with vendor-neutral SDKs and W3C Trace Context propagation — is the only practical mechanism for reconstructing a request’s full path across polyglot services.

Service mesh, sidecars, and infrastructure telemetry

A service mesh like Istio, Linkerd, or Consul Connect injects a sidecar proxy (commonly Envoy) next to each application pod. Critical request behaviors — mTLS termination, retries, timeouts, circuit breakers, traffic splitting, routing — execute in the proxy, not in the application code [Source: https://istio.io].

This creates an observability gap for any tool that only watches application processes. The app may report a healthy error rate while the mesh is silently retrying 503 responses, throttling traffic, or failing TLS handshakes. A misconfigured destination rule or an aggressive outlier-detection policy can degrade user experience without ever touching application metrics [Source: https://istio.io].

The implication is that mesh proxies must be treated as first-class observability targets. Scrape Envoy stats endpoints alongside application metrics. Collect mesh access logs to see per-request behavior, including retries and route decisions. Combine application spans with mesh spans (via OpenTelemetry or Envoy tracing) so that an end-to-end trace shows exactly where time was spent — in the app, in the sidecar, or on the wire between them.

Figure 1.4: Application and sidecar emit separate, correlated telemetry streams

flowchart LR
    subgraph Pod["Kubernetes Pod"]
        App["Application container<br/>(business logic)"]
        Envoy["Envoy sidecar<br/>(mTLS, retries, routing)"]
        App <-->|"localhost"| Envoy
    end
    App -->|"app metrics<br/>(/metrics)"| Prom["Prometheus"]
    App -->|"app traces<br/>(OTLP spans)"| Backend["Tracing backend<br/>(Jaeger / Tempo)"]
    Envoy -->|"mesh metrics<br/>(Envoy stats)"| Prom
    Envoy -->|"mesh access logs"| Logs["Log backend<br/>(Loki)"]
    Envoy -->|"mesh spans"| Backend
    Prom --> Graf["Grafana<br/>correlated view"]
    Backend --> Graf
    Logs --> Graf

Beyond the mesh, cloud-native observability must cover infrastructure telemetry as well: kubelet and control-plane metrics, Kubernetes object-state metrics from kube-state-metrics, node-level metrics from the node exporter, and Kubernetes events that describe pod restarts, evictions, and scheduling decisions. Correlating these with application telemetry is what lets you say, “the latency spike at 03:14 UTC coincided with the autoscaler evicting three pods from node-7.”

Key Takeaway: Modern request behavior is distributed across application code, sidecar proxies, and Kubernetes control-plane components. Cloud-native observability must instrument all three layers — and correlate them through shared labels and trace IDs — to give a complete picture of what happened.

The CNCF Observability Landscape

Prometheus as the de-facto metrics standard

Prometheus is the dominant metrics monitoring system in the CNCF ecosystem. It is a CNCF Graduated project — among the earliest to reach that maturity tier, alongside Kubernetes itself [Source: https://www.cncf.io]. Graduation signals widespread, production-grade adoption, strong governance, and a healthy ecosystem of integrations.

Architecturally, Prometheus is a pull-based system. It expects services to expose HTTP /metrics endpoints in the Prometheus exposition format (or the equivalent OpenMetrics format) and scrapes them on a schedule. The scraped data is stored in a purpose-built time-series database, and engineers query it with PromQL, a domain-specific language designed for metrics analysis. Alerts are evaluated as PromQL rules and routed through the Alertmanager component [Source: https://prometheus.io].
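For a sense of what a scrape actually returns, the sketch below renders a counter in the text exposition format using only the standard library. The `render_metric` helper and the sample values are hypothetical. A real service would normally use an official client library, which also handles label-value escaping that this sketch skips.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family as Prometheus text exposition lines."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


body = render_metric(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [
        ({"service": "api", "status": "200"}, 1027),
        ({"service": "api", "status": "500"}, 3),
    ],
)
```

Serving `body` from an HTTP `/metrics` handler is all a target has to do. Scheduling, storage, and querying happen on the Prometheus side.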

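Prometheus is metrics-only. It does not natively handle logs or traces, and that single-responsibility focus is part of why it is so durable. Its strengths are a simple pull model, the PromQL query language, alerting through Alertmanager, and a vast ecosystem of exporters.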

For scaling beyond a single Prometheus instance, the ecosystem provides compatible long-term storage and federation systems — Thanos, Cortex, Mimir, and VictoriaMetrics — all of which speak PromQL and the Prometheus remote-write protocol.

Key Takeaway: Prometheus is the de-facto metrics standard in cloud-native systems: a CNCF Graduated, pull-based metrics database with PromQL, Alertmanager, and a vast exporter ecosystem. It excels at metrics — and only metrics.

OpenTelemetry as the vendor-neutral instrumentation standard

OpenTelemetry (often abbreviated OTel) is the CNCF’s answer to the historical chaos of proprietary, vendor-specific instrumentation. It is also a CNCF Graduated project as of 2025, reflecting the maturity of its specifications, SDKs, and Collector [Source: https://opentelemetry.io].

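OpenTelemetry’s scope is fundamentally different from Prometheus’s. It is not a backend; it is an instrumentation and pipeline standard. Its components are language SDKs, auto-instrumentation, semantic conventions, the Collector, and the OTLP wire protocol.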

The vendor-neutrality argument is OpenTelemetry’s signature value. Instrument once with OpenTelemetry, and the choice of backend becomes a configuration decision rather than a code rewrite. Migrating from one tracing vendor to another — or running multiple in parallel — requires changes only in Collector pipelines, not in application source [Source: https://opentelemetry.io].

Key Takeaway: OpenTelemetry is the vendor-neutral instrumentation standard for cloud-native telemetry. It provides SDKs, auto-instrumentation, semantic conventions, a Collector, and the OTLP wire protocol — covering all three pillars (traces, metrics, logs) with one consistent model.

Backends: Grafana, Jaeger, Tempo, Loki, Mimir, and commercial vendors

Once telemetry is instrumented (via OpenTelemetry) and metrics are scraped or collected (via Prometheus or the Collector), it has to land somewhere. The CNCF backend ecosystem in 2025 is rich and specialized:

| Backend | Signal | Role |
| --- | --- | --- |
| Prometheus | Metrics | Local scraping, TSDB, PromQL, alerting |
| Mimir / Cortex / Thanos / VictoriaMetrics | Metrics | Long-term, horizontally scalable Prometheus-compatible storage |
| Jaeger | Traces | Distributed tracing backend, open-source, CNCF Graduated |
| Tempo | Traces | Grafana-stack trace backend optimized for object storage |
| Loki | Logs | Grafana-stack log aggregation, label-indexed (cheap) storage |
| Grafana | Visualization | Dashboards, alerting UI, multi-backend query interface |

Beyond the open-source landscape, every major commercial observability vendor — Datadog, New Relic, Splunk, Honeycomb, Dynatrace, Chronosphere, Lightstep, and others — natively ingests OTLP and most also support Prometheus remote-write [Source: https://opentelemetry.io]. This is the practical payoff of vendor neutrality: a team can move between open-source and SaaS backends, or even run hybrid configurations, without rewriting instrumentation.

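The dominant 2025 pattern is hybrid: instrument with OpenTelemetry, store metrics in Prometheus or a Prometheus-compatible long-term store, route traces and logs to specialized backends such as Jaeger, Tempo, and Loki, and query across all of them from Grafana.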

Figure 1.5: End-to-end cloud-native observability stack

flowchart LR
    Apps["Applications<br/>(Go, Java, Python, ...)"] --> SDK["OpenTelemetry SDK<br/>+ auto-instrumentation"]
    SDK -->|"OTLP (gRPC/HTTP)"| Col["OpenTelemetry Collector<br/>(DaemonSet / gateway)"]
    Scrape["Prometheus scrape<br/>(/metrics endpoints)"] --> Prom["Prometheus<br/>(local TSDB + PromQL)"]
    Apps -. "expose /metrics" .-> Scrape
    Col -->|"metrics<br/>(remote-write)"| Mimir["Mimir / Thanos /<br/>VictoriaMetrics<br/>(long-term metrics)"]
    Prom -->|"remote-write"| Mimir
    Col -->|"traces"| Tempo["Tempo / Jaeger<br/>(trace storage)"]
    Col -->|"logs"| Loki["Loki / OpenSearch<br/>(log storage)"]
    Prom --> Graf["Grafana<br/>(dashboards + alerts)"]
    Mimir --> Graf
    Tempo --> Graf
    Loki --> Graf
    Prom --> AM["Alertmanager"]

In this arrangement Prometheus and OpenTelemetry are complementary, not competitors. As the CNCF SIG Observability community puts it in spirit: use OpenTelemetry for how you instrument and move telemetry; use Prometheus for where metrics live and how you query and alert on them [Source: https://www.cncf.io].

Key Takeaway: The 2025 CNCF observability stack pairs OpenTelemetry (for instrumentation and routing) with Prometheus (for metrics storage and PromQL alerting), feeding specialized backends — Jaeger or Tempo for traces, Loki for logs, Grafana for visualization — and remaining portable between open-source and commercial vendors through OTLP.

Chapter Summary

Observability is a property of a system: the degree to which an operator can infer its internal state from its external outputs. In cloud-native environments — where workloads are ephemeral, requests cross dozens of services in different languages, and infrastructure behavior lives partially in sidecar proxies — that property must be designed in deliberately. Traditional monitoring, which checks predefined signals against thresholds, remains valuable for catching known failure modes but cannot explain the novel, unknown-unknown incidents that distributed systems routinely produce.

The three pillars — metrics, logs, and traces — are the raw materials of observability, each with different cardinality tolerance and cost characteristics. Metrics are low-cardinality time series ideal for alerting and SLOs; logs are high-cardinality discrete events ideal for detailed per-request context; traces are end-to-end records of a request’s path across services. The killer feature is correlation: exemplars link metrics to traces, trace IDs in structured logs link logs to traces, and Kubernetes resource attributes tie everything to the underlying workload, letting an operator pivot from “this chart looks bad” to “this exact log line in this exact span” in seconds.

In the CNCF ecosystem of 2025, two graduated projects anchor the stack. Prometheus provides metrics scraping, the PromQL query language, and Alertmanager — the battle-tested metrics brain. OpenTelemetry provides language SDKs, semantic conventions, the Collector, and the OTLP wire protocol — the vendor-neutral nervous system for all three pillars. The dominant pattern is to instrument with OpenTelemetry, store metrics in Prometheus (or a Prometheus-compatible long-term store), and route traces and logs to specialized backends like Jaeger, Tempo, and Loki, with Grafana stitching it all together. The rest of this book builds out the practical mechanics of that stack.

Key Terms

| Term | Definition |
| --- | --- |
| observability | The property of a system whereby its internal state can be inferred from its external outputs (telemetry); in software, achieved by emitting rich, correlated metrics, logs, and traces. |
| telemetry | The data emitted by a system that describes its behavior (metrics, logs, traces, and increasingly profiles), collected for monitoring and observability purposes. |
| signal | A category of telemetry data. In OpenTelemetry parlance, the canonical signals are metrics, logs, and traces, each with its own data model. |
| cardinality | The number of unique combinations of label or attribute values in a telemetry dataset. High cardinality is cheap for logs/traces but expensive for metrics, where each combination creates a separate time series. |
| ephemeral workload | A short-lived compute unit, most commonly a Kubernetes pod, whose identity (name, IP) is not stable and that may be created and destroyed within seconds by deployments, autoscaling, or crashes. |
| service mesh | An infrastructure layer (e.g., Istio, Linkerd) that injects sidecar proxies next to application pods to handle mTLS, retries, timeouts, routing, and traffic policy at the network layer rather than in application code. |
| CNCF | The Cloud Native Computing Foundation, the open-source foundation that hosts Kubernetes, Prometheus, OpenTelemetry, Jaeger, and many other graduated and incubating cloud-native projects. |
| vendor neutrality | A design principle, central to OpenTelemetry, in which instrumentation and telemetry pipelines are independent of any specific backend so that operators can switch or combine vendors without rewriting application code. |

Chapter 2: The Three Signals: Metrics, Logs, and Traces in Depth

Learning Objectives

Chapter 1 introduced observability as a property that emerges from three complementary signals: metrics, logs, and traces. That framing is useful, but it can be misleading if you treat the three as interchangeable buckets of “telemetry.” They are not. Each signal has a distinct data model, distinct ingestion economics, and distinct questions it can answer well. Choosing the wrong signal for a question is like trying to measure room temperature with a video camera — you can sort of do it, but you’re paying enormous storage and processing costs for information that a $5 thermometer would deliver instantly.

This chapter goes one level deeper. We dissect the wire format of each signal, examine the cardinality math that drives cost, and then show how exemplars and W3C Trace Context turn three independent data streams into a single navigable investigative narrative.

Figure 2.1: The three signals and their correlation hooks

graph TD
    M["Metrics<br/>counts & aggregates<br/>'how many / how fast'"]
    L["Logs<br/>discrete events<br/>'what happened'"]
    T["Traces<br/>causal span tree<br/>'why was it slow'"]
    M -- "exemplars<br/>(trace_id pointers)" --- T
    T -- "trace_id / span_id<br/>in LogRecord" --- L
    M -- "shared Resource<br/>(service.name, k8s.*)" --- L
    Hub(("Unified<br/>investigation"))
    M --> Hub
    L --> Hub
    T --> Hub

Metrics — Numbers Over Time

A metric is a numeric measurement of a system property recorded at a point in time. Metrics are the cheapest of the three signals because the storage system records aggregates, not individual events. A counter that has been incremented one billion times occupies the same space as one that has been incremented ten times — both are just a current value plus a timestamp series.

Think of metrics as the dashboard gauges in a car: speed, RPM, fuel level. They tell you the state of the system at a glance, but they don’t tell you why the engine started knocking three miles back.

Counters, gauges, histograms, and summaries

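Prometheus, which has become the de facto standard for cloud-native metrics, defines four core metric types: counters, which only go up; gauges, which can move in either direction; histograms, which record a distribution in buckets so that quantiles can be computed server-side; and summaries, which compute quantiles client-side inside the instrumented process.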

The histogram-versus-summary choice is one of the most consequential decisions in Prometheus instrumentation, and it trips up nearly every team at least once. The critical difference: summary quantiles cannot be aggregated across instances. Averaging the p99 from each of your ten pods does not give you the true global p99. You can safely aggregate only _sum and _count from a summary. Histogram buckets, in contrast, are fully aggregatable — you can sum() buckets across instances, zones, or services and then call histogram_quantile() on the aggregate to get a meaningful service-wide tail latency [Source: https://natesnewsletter.substack.com/p/context-windows-are-a-lie-the-myth].

| Aspect | Counter | Gauge | Histogram | Summary |
| --- | --- | --- | --- | --- |
| Direction | Up only | Any | Distribution | Distribution |
| Quantile compute | N/A | N/A | Server-side (PromQL) | Client-side (in process) |
| Cross-instance aggregation | Trivial | Trivial | Yes (sum buckets) | Only _sum/_count |
| Cardinality per metric | 1 series | 1 series | N buckets + 2 | N quantiles + 2 |
| Best for | Rates, totals | Levels, sizes | SLO latency, fleet stats | Per-instance fixed quantiles |
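The aggregation difference between summaries and histograms is easy to demonstrate numerically. The sketch below uses a made-up two-pod fleet. `nearest_rank_quantile` stands in for a summary's client-side quantile, and `histogram_quantile` is a simplified imitation of the linear interpolation that PromQL's function of the same name performs, not its actual implementation.

```python
import math


def nearest_rank_quantile(values, q):
    """Exact quantile of raw observations using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def histogram_quantile(q, upper_bounds, cumulative_counts):
    """Estimate a quantile from cumulative bucket counts by linear interpolation."""
    rank = q * cumulative_counts[-1]
    lower_bound, lower_count = 0.0, 0
    for upper, count in zip(upper_bounds, cumulative_counts):
        if count >= rank:
            if math.isinf(upper):
                return lower_bound  # cannot interpolate into the +Inf bucket
            span = count - lower_count
            return lower_bound + (upper - lower_bound) * (rank - lower_count) / span
        lower_bound, lower_count = upper, count
    return lower_bound


# Hypothetical fleet: pod A serves 900 fast requests, pod B serves 100 slow ones.
pod_a = [0.1] * 900
pod_b = [1.0] * 100

# Summary view: one client-side p99 per pod, averaged after the fact.
averaged_p99 = (nearest_rank_quantile(pod_a, 0.99) + nearest_rank_quantile(pod_b, 0.99)) / 2
true_p99 = nearest_rank_quantile(pod_a + pod_b, 0.99)

# Histogram view: cumulative bucket counts that can be summed across pods.
bounds = [0.1, 0.5, 1.0, math.inf]
buckets_a = [900, 900, 900, 900]
buckets_b = [0, 0, 100, 100]
fleet = [a + b for a, b in zip(buckets_a, buckets_b)]
estimated_p99 = histogram_quantile(0.99, bounds, fleet)
```

Averaging the per-pod values gives about 0.55 s, well below the true fleet-wide p99 of 1.0 s. Summing the buckets first gives about 0.95 s, with the remaining error set by bucket width and not by the aggregation.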

The practical rule: for SLO-critical latency where you need fleet-wide p99 and the ability to choose new quantiles later without redeploying code, use histograms. For low-cardinality, per-process diagnostics where you only ever care about one box at a time, summaries are acceptable [Source: https://magazine.sebastianraschka.com/p/llm-evaluation-4-approaches].

Classic histograms have a well-known drawback: bucket count multiplies cardinality. Native histograms, introduced as an experimental feature in Prometheus 2.40, encode dynamic log-spaced buckets inside a single time series using a binary format, dramatically reducing series count while supporting high resolution [Source: https://pubmed.ncbi.nlm.nih.gov/help/]. If your stack supports them end-to-end (client library, server, remote storage), they are usually the better default for new metrics.

Time-series labels and dimensionality

A Prometheus time series is uniquely identified by its metric name plus the set of label key-value pairs:

http_requests_total{method="GET", path="/api/orders", status="200", service="checkout"}

Each unique combination of labels creates a new time series with its own storage allocation. This is the cardinality of a metric, and it is the single biggest cost driver in any Prometheus deployment.
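One way to picture series identity is as a dictionary key built from the metric name and the sorted label pairs. The `series_key` helper below is a hypothetical model of that idea, not how Prometheus stores series internally.

```python
def series_key(name, labels):
    """Model a series identity: metric name plus the full, order-independent label set."""
    return (name, tuple(sorted(labels.items())))


samples = [
    ("http_requests_total",
     {"method": "GET", "path": "/api/orders", "status": "200", "service": "checkout"}),
    # Same labels in a different order: still the same series.
    ("http_requests_total",
     {"service": "checkout", "status": "200", "path": "/api/orders", "method": "GET"}),
    # One label value changes: a brand-new series.
    ("http_requests_total",
     {"method": "GET", "path": "/api/orders", "status": "500", "service": "checkout"}),
]

distinct_series = {series_key(name, labels) for name, labels in samples}
```

Three samples collapse into two series here, because only a change in a label value, not label order, produces a new key.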

The math is brutal. Suppose http_requests_total carries five labels with these cardinalities:

Total potential series: 5 × 50 × 6 × 20 × 4 = 120,000 series for one metric. Add a user_id label with even 10,000 unique values and you have 1.2 billion potential series. This is why veterans say “never label by user ID, request ID, or trace ID.” It’s also why exemplars exist — they let you reference a trace ID without making it a series-defining label.
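The multiplication can be checked with a few lines of Go. The helper name `potentialSeries` is illustrative; the figures are the ones quoted above.

```go
package main

import "fmt"

// potentialSeries multiplies per-label cardinalities to get the worst-case
// number of time series for a single metric name.
func potentialSeries(labelCardinalities ...uint64) uint64 {
	total := uint64(1)
	for _, c := range labelCardinalities {
		total *= c
	}
	return total
}

func main() {
	base := potentialSeries(5, 50, 6, 20, 4)
	withUserID := potentialSeries(5, 50, 6, 20, 4, 10_000)
	fmt.Println(base)       // 120000
	fmt.Println(withUserID) // 1200000000
}
```

This is an upper bound: real deployments rarely see every combination, but the bound is what you must budget for when a label's values are unbounded.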

A useful analogy: labels are dimensions of a spreadsheet. Each dimension multiplies the number of cells. A 2D sheet of methods × paths is manageable; a 7D sheet quickly grows beyond anything you can afford to store or query.

Aggregation, downsampling, and retention

Once metrics are in the time-series database, the next concern is keeping them around without going broke. Prometheus is commonly configured to scrape every 15 seconds (its built-in default is one minute) and, by default, keeps raw samples locally for 15 days. For longer retention, the standard pattern is:

  1. Recording rules evaluate expensive PromQL queries periodically and store the result as a new, cheaper time series. For example, instance:cpu_usage:rate5m is far cheaper to query a year of than re-aggregating raw CPU samples each time.
  2. Remote write ships data to long-term storage backends like Thanos, Cortex, or Mimir, which handle compaction and, depending on the backend, downsampling.
  3. Downsampling reduces resolution as data ages: for example, raw 15-second samples might be rolled up into 5-minute aggregates after 30 days and 1-hour aggregates after a year.
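As a rough illustration of the downsampling step, the following Go sketch averages raw samples into fixed windows. The `Sample` type and `downsampleAvg` function are hypothetical names, and a plain average is an assumption made for brevity; real backends may keep additional aggregates such as min, max, and count per window.

```go
package main

import "fmt"

// Sample is one raw data point: a Unix timestamp in seconds and a value.
type Sample struct {
	TS    int64
	Value float64
}

// downsampleAvg groups samples into fixed windows of windowSec seconds and
// returns one averaged sample per non-empty window, stamped with the window
// start. Input must be sorted by timestamp.
func downsampleAvg(samples []Sample, windowSec int64) []Sample {
	var out []Sample
	var sum float64
	var n int
	var windowStart int64
	for _, s := range samples {
		start := s.TS - s.TS%windowSec
		if n > 0 && start != windowStart {
			// The window changed: flush the previous one.
			out = append(out, Sample{TS: windowStart, Value: sum / float64(n)})
			sum, n = 0, 0
		}
		windowStart = start
		sum += s.Value
		n++
	}
	if n > 0 {
		out = append(out, Sample{TS: windowStart, Value: sum / float64(n)})
	}
	return out
}

func main() {
	// 40 samples at 15-second spacing: ten minutes of raw data.
	var raw []Sample
	for i := int64(0); i < 40; i++ {
		raw = append(raw, Sample{TS: i * 15, Value: float64(i)})
	}
	for _, s := range downsampleAvg(raw, 300) {
		fmt.Printf("t=%d avg=%.1f\n", s.TS, s.Value)
	}
}
```

Forty raw points collapse into two, which is the storage saving; the lost detail is everything that happened inside each five-minute window.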

For histograms, downsampling is more nuanced. Classic histogram buckets compose cleanly: sum the buckets, then compute the quantile. Native histograms can merge layouts more flexibly. Summary quantile series cannot be downsampled meaningfully at all, because they are already lossy projections; only a summary’s _sum and _count survive aggregation.

Key Takeaway: Metrics are cheap, aggregated, and ideal for “how many” and “how fast” questions. The cost is dominated by cardinality — label combinations multiply storage — and the histogram-versus-summary choice determines whether you can meaningfully ask fleet-wide tail-latency questions.

Logs — Structured Events

A log is a discrete event record emitted by an application at a specific moment. Where a metric says “23,481 requests in the last minute,” a log says “request #382749 from user 42 to /api/orders failed at 10:03:21 with NullPointerException.” Logs preserve the individual story that metrics aggregate away.

The car analogy: if metrics are the dashboard, logs are the stored diagnostic trouble codes that the mechanic reads out. They record specific events with their context.

Structured vs unstructured logging

For decades, application logs looked like this:

2024-05-10T10:00:00Z my-host app: ERROR user 42 not found

This is unstructured logging — a free-form string. Humans can read it; computers struggle. To search for “all errors involving user 42 across the fleet,” some downstream tool has to parse that string with regular expressions, hope every team used the same format, and reconstruct fields that the application already knew. It’s an information-destruction pipeline.

Structured logging flips this around: the application emits machine-readable key-value records from the start.

{
  "ts": "2024-05-10T10:00:00Z",
  "level": "info",
  "service": "billing",
  "request_id": "abc-123",
  "user_id": 42,
  "msg": "charged credit card",
  "amount": 99.99
}
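A record of this shape can be produced in Go with the standard library's log/slog package (Go 1.21 or later). The sketch below is illustrative: the helper names `newLogger` and `chargeRecord` are invented, and slog's JSON handler writes `time` and an upper-case `level` rather than the `ts` and lower-case level shown above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
)

// newLogger returns a JSON logger that stamps every record with the service name.
func newLogger(w io.Writer) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, nil)).With("service", "billing")
}

// chargeRecord logs one event into a buffer and decodes it again, the way a
// downstream pipeline would, to show that every field survives as typed data.
func chargeRecord(requestID string, userID int, amount float64) (map[string]any, error) {
	var buf bytes.Buffer
	newLogger(&buf).Info("charged credit card",
		"request_id", requestID,
		"user_id", userID,
		"amount", amount,
	)
	var rec map[string]any
	err := json.Unmarshal(buf.Bytes(), &rec)
	return rec, err
}

func main() {
	rec, err := chargeRecord("abc-123", 42, 99.99)
	if err != nil {
		panic(err)
	}
	// Field lookups replace regex parsing of a free-form string.
	fmt.Println(rec["service"], rec["user_id"], rec["msg"])
}
```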

Now “all errors involving user 42” is an indexed field lookup, not a regex hunt. OpenTelemetry takes structured logging a step further by defining a formal LogRecord data model [Source: https://blog.codinghorror.com/the-problem-with-logging/]. As Figure 2.2 shows, a LogRecord carries two timestamps, a severity, a body, attributes, a resource, trace context, and an instrumentation scope.

The genius of severity_number is that it normalizes severities across logging frameworks. A backend query like severity_number >= 17 (“ERROR or worse”) works whether the source used Python’s logging.ERROR, Java’s ERROR level, Go’s slog.LevelError, or .NET’s LogLevel.Error [Source: https://natesnewsletter.substack.com/p/context-windows-are-a-lie-the-myth].
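The numeric ranges behind that query are small enough to sketch. In the OpenTelemetry log data model each named level spans four consecutive numbers; the Go helper names below are illustrative and not part of any SDK.

```go
package main

import "fmt"

// severityRange maps an OpenTelemetry severity_number (1-24) to the name of
// its range: 1-4 TRACE, 5-8 DEBUG, 9-12 INFO, 13-16 WARN, 17-20 ERROR,
// 21-24 FATAL. Anything outside 1-24 is reported as UNSPECIFIED.
func severityRange(n int) string {
	names := []string{"TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"}
	if n < 1 || n > 24 {
		return "UNSPECIFIED"
	}
	return names[(n-1)/4]
}

// isErrorOrWorse is the predicate behind a "severity_number >= 17" filter.
func isErrorOrWorse(n int) bool {
	return n >= 17
}

func main() {
	for _, n := range []int{9, 13, 17, 21} {
		fmt.Printf("%2d %-5s errorOrWorse=%v\n", n, severityRange(n), isErrorOrWorse(n))
	}
}
```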

Figure 2.2: OpenTelemetry LogRecord schema

graph TD
    LR["LogRecord"]
    LR --> TS["timestamp<br/>application time"]
    LR --> OTS["observed_timestamp<br/>pipeline time"]
    LR --> SEV["Severity"]
    SEV --> SN["severity_number<br/>1-24 normalized"]
    SEV --> ST["severity_text<br/>'WARN', 'Warning'..."]
    LR --> B["body : AnyValue<br/>string / num / map / list"]
    LR --> A["attributes<br/>http.method, db.system..."]
    LR --> R["resource<br/>service.name, k8s.*, host.*"]
    LR --> TC["Trace Context"]
    TC --> TID["trace_id"]
    TC --> SID["span_id"]
    TC --> TF["trace_flags"]
    LR --> IS["instrumentation_scope"]
    LR --> DA["dropped_attributes_count"]

Log levels, ingestion pipelines, and indexing strategies

Application code chooses a severity level at the log call site:

A typical pipeline looks like this:

  1. Application writes structured logs to stdout (the Twelve-Factor recommendation).
  2. Node agent (Fluent Bit, Vector, or OpenTelemetry Collector) tails container stdout, parses, enriches with Kubernetes metadata (pod name, namespace, node), and ships.
  3. Aggregator/gateway buffers, batches, and may sample or redact.
  4. Storage backend indexes for search. Common choices: Elasticsearch (full-text inverted index), Loki (label-only index over chunked raw lines), or vendor services.

Indexing strategy matters enormously for cost. Elasticsearch-style full-text indexing lets you search any word in any log instantly, but the index can be larger than the data itself and grows roughly linearly with log volume. Loki-style label-only indexing keeps the index tiny by only indexing a handful of labels (service, pod, namespace) and storing raw log lines compressed; queries then grep through chunks. The trade-off: Loki is dramatically cheaper to operate but slower for arbitrary substring searches.

Compared to traditional shippers like Fluentd or Logstash, the OpenTelemetry approach standardizes the data model itself. Fluentd and Logstash are pipelines — powerful at routing and transforming, but they impose no global schema. Each route defines its own JSON shape, severity semantics, and field names. The OpenTelemetry Collector plays a similar pipeline role but operates on the typed OTLP model across all three signals [Source: http://susandumais.com/CHI2012-12-tailanswers-chi2012.pdf].

| Aspect | OpenTelemetry Logs | Fluentd / Logstash |
| --- | --- | --- |
| Data model | Standardized LogRecord schema | No global standard; per-pipeline JSON |
| Severity | Normalized severity_number + severity_text | String, ad-hoc normalization |
| Service context | Built-in Resource shared with traces/metrics | Plugin-specific conventions |
| Trace correlation | First-class trace_id/span_id fields | Manual injection + custom parsing |
| Multi-signal | Logs, traces, metrics share Resource | Logs only |
| Pipeline | OTel Collector (multi-signal) | Log-centric |
| Semantic conventions | Official spec (http.*, db.*, rpc.*) | None standardized |

Why logs alone are insufficient for distributed root cause analysis

A common misconception is that “good logs” suffice for observability. They don’t, for a fundamental architectural reason: logs are emitted per service, but root causes in distributed systems live in the relationships between services.

Consider a checkout request that times out. The checkout-service log shows “called inventory: timeout after 5s.” The inventory-service log shows “received call, query took 4.8s.” The postgres log shows “lock wait.” These three lines, scattered across three different log streams, describe one causal chain. Reconstructing it requires:

  1. Knowing the request ID and propagating it consistently through every service (most teams miss at least one boundary).
  2. Time-aligning logs whose clocks may differ by milliseconds or seconds.
  3. Guessing the causal order from timestamps that don’t capture parent/child relationships.
  4. Repeating this for every layer of the call graph.

This is exactly the problem traces solve, and it’s why logs alone — no matter how structured — cannot answer “what made this specific slow request slow?” in a microservices architecture [Source: https://arxiv.org/html/2501.11709v3]. Logs complement traces by carrying the rich context of individual events; they do not replace the causal graph.

Key Takeaway: Structured logs with a standardized data model (the OpenTelemetry LogRecord) capture rich per-event detail with normalized severity and shared resource context, but they cannot reconstruct causality across services on their own — that’s what traces are for.

Traces — Causally-linked Spans

A trace is the recorded journey of a single request through a distributed system. If metrics are the dashboard and logs are the event log, traces are the GPS track — they tell you not just that the trip happened but the exact route taken, which segments were slow, and where the detours were.

Trace, span, and span context

The core data model is a small hierarchy:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ ^^
             |  trace_id (32 hex)                span_id (16 hex) flags
             version

When checkout-service calls inventory-service, it copies its current span context into a traceparent header on the outbound HTTP request. The receiving service reads that header, knows the parent’s trace_id and span_id, and starts a child span under the same trace. This is how a single trace stitches itself together across N services with no central coordinator.
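The header handling can be sketched in a few lines of Go. This is a simplified illustration, not the OpenTelemetry propagator: it accepts only version 00, skips some checks the W3C specification requires (for example rejecting all-zero IDs), and takes the new span ID as a parameter where a real tracer would generate it randomly. The type and function names are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// SpanContext holds the fields carried by a version-00 traceparent header.
type SpanContext struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
	Flags   string // 2 hex characters; "01" means sampled
}

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, r := range s {
		if !((r >= '0' && r <= '9') || (r >= 'a' && r <= 'f')) {
			return false
		}
	}
	return true
}

// parseTraceparent extracts the span context from an inbound header value.
func parseTraceparent(header string) (SpanContext, error) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return SpanContext{}, errors.New("unsupported traceparent version or shape")
	}
	if !isLowerHex(parts[1], 32) || !isLowerHex(parts[2], 16) || !isLowerHex(parts[3], 2) {
		return SpanContext{}, errors.New("malformed traceparent field")
	}
	return SpanContext{TraceID: parts[1], SpanID: parts[2], Flags: parts[3]}, nil
}

// childHeader builds the header for an outbound call: same trace_id, the
// caller's own span_id in the parent position, flags carried over unchanged.
func childHeader(current SpanContext, callerSpanID string) string {
	return fmt.Sprintf("00-%s-%s-%s", current.TraceID, callerSpanID, current.Flags)
}

func main() {
	in := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	sc, err := parseTraceparent(in)
	if err != nil {
		panic(err)
	}
	fmt.Println("trace_id:", sc.TraceID)
	fmt.Println("parent span_id:", sc.SpanID)
	// "b7ad6b7169203331" stands in for a freshly generated span ID.
	fmt.Println("outbound:", childHeader(sc, "b7ad6b7169203331"))
}
```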

A concrete example: a user clicks “place order.” The browser hits checkout-service, which calls inventory-service (which in turn queries postgres) and then payment-service. Figure 2.3 shows the six resulting spans, all sharing one trace_id, each pointing to its parent’s span_id. Drawn on a timeline, this is the familiar flame graph view that tracing UIs render.

Figure 2.3: Span tree for a checkout request (one trace_id, parent-child via span_id)

graph TD
    Root["POST /checkout<br/>SERVER · checkout-service<br/>span_id=a1 · 980ms"]
    Root --> Inv["inventory.check<br/>CLIENT · checkout-service<br/>span_id=b2 · parent=a1 · 620ms"]
    Inv --> InvS["GET /inventory<br/>SERVER · inventory-service<br/>span_id=c3 · parent=b2 · 600ms"]
    InvS --> DB["db.query SELECT stock<br/>CLIENT · inventory-service<br/>span_id=d4 · parent=c3 · 540ms"]
    Root --> Pay["payment.charge<br/>CLIENT · checkout-service<br/>span_id=e5 · parent=a1 · 320ms"]
    Pay --> PayS["POST /charge<br/>SERVER · payment-service<br/>span_id=f6 · parent=e5 · 300ms"]

Parent-child relationships and span kinds

Each span (except the root) carries the span_id of its parent. This builds an explicit directed acyclic graph rather than the implicit one you’d have to reconstruct from log timestamps. The tree captures causality — the parent caused the child to happen — not merely temporal proximity.
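Reassembling the tree from parent pointers is mechanical, as the Go sketch below suggests. It uses the span IDs from Figure 2.3; the `Span` struct and helper names are simplified stand-ins, not an OpenTelemetry type.

```go
package main

import (
	"fmt"
	"strings"
)

// Span is a minimal stand-in for a recorded span.
type Span struct{ ID, ParentID, Name string }

// checkoutTrace lists the spans of Figure 2.3 in arbitrary arrival order.
var checkoutTrace = []Span{
	{"d4", "c3", "db.query SELECT stock"},
	{"a1", "", "POST /checkout"},
	{"b2", "a1", "inventory.check"},
	{"f6", "e5", "POST /charge"},
	{"c3", "b2", "GET /inventory"},
	{"e5", "a1", "payment.charge"},
}

// buildTree returns the root span and a parent-ID to children index.
func buildTree(spans []Span) (root Span, children map[string][]Span) {
	children = make(map[string][]Span)
	for _, s := range spans {
		if s.ParentID == "" {
			root = s
			continue
		}
		children[s.ParentID] = append(children[s.ParentID], s)
	}
	return root, children
}

// render writes the subtree under s as an indented outline.
func render(s Span, children map[string][]Span, depth int, sb *strings.Builder) {
	fmt.Fprintf(sb, "%s%s (%s)\n", strings.Repeat("  ", depth), s.Name, s.ID)
	for _, c := range children[s.ID] {
		render(c, children, depth+1, sb)
	}
}

func main() {
	root, children := buildTree(checkoutTrace)
	var sb strings.Builder
	render(root, children, 0, &sb)
	fmt.Print(sb.String())
}
```

No timestamps are consulted: the structure comes entirely from the explicit parent IDs, which is the point of the paragraph above.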

OpenTelemetry classifies spans by span kind, which clarifies the role of the span in a network interaction:

A typical HTTP call generates two spans: a CLIENT span on the caller and a SERVER span on the callee, linked by the same trace_id and a parent-child relationship via the propagated context. Span kinds let analysis tools draw service maps automatically: any CLIENT span pointing to a SERVER span in a different service implies an edge between those services.

Spans also carry events (timestamped sub-points, e.g., “cache miss”) and links (references to other traces, useful for fan-out or queue-driven workflows where one input span causes work in many traces).

Service maps derived from trace topology

Because every cross-service call produces a CLIENT/SERVER pair tagged with service.name, a tracing backend can build a live service map by aggregating recent traces:

You no longer maintain a hand-drawn architecture diagram. The system draws its own current architecture from the traces it processes, automatically reflecting newly deployed services, retired ones, and unexpected dependencies (the classic “wait, why is checkout-service calling legacy-billing directly?” discovery).
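A minimal Go sketch of that aggregation follows. It counts CLIENT-to-SERVER parent/child pairs that cross a service boundary, using the spans from Figure 2.3. The types are illustrative, and the sketch deliberately ignores uninstrumented dependencies such as the database, which real backends typically infer from span attributes instead.

```go
package main

import (
	"fmt"
	"sort"
)

// Span is a minimal stand-in for a recorded span.
type Span struct{ ID, ParentID, Kind, Service string }

// Edge is one directed dependency between two services.
type Edge struct{ From, To string }

// checkoutSpans mirrors the spans of Figure 2.3.
var checkoutSpans = []Span{
	{"a1", "", "SERVER", "checkout-service"},
	{"b2", "a1", "CLIENT", "checkout-service"},
	{"c3", "b2", "SERVER", "inventory-service"},
	{"d4", "c3", "CLIENT", "inventory-service"},
	{"e5", "a1", "CLIENT", "checkout-service"},
	{"f6", "e5", "SERVER", "payment-service"},
}

// serviceEdges counts calls per service pair by looking for a SERVER span
// whose parent is a CLIENT span in a different service.
func serviceEdges(spans []Span) map[Edge]int {
	byID := make(map[string]Span, len(spans))
	for _, s := range spans {
		byID[s.ID] = s
	}
	edges := make(map[Edge]int)
	for _, s := range spans {
		parent, ok := byID[s.ParentID]
		if !ok {
			continue
		}
		if parent.Kind == "CLIENT" && s.Kind == "SERVER" && parent.Service != s.Service {
			edges[Edge{From: parent.Service, To: s.Service}]++
		}
	}
	return edges
}

func main() {
	var lines []string
	for e, n := range serviceEdges(checkoutSpans) {
		lines = append(lines, fmt.Sprintf("%s -> %s (calls: %d)", e.From, e.To, n))
	}
	sort.Strings(lines)
	for _, l := range lines {
		fmt.Println(l)
	}
}
```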

Figure 2.4: Service map auto-derived from CLIENT→SERVER span pairs

graph LR
    Web["web-frontend"]
    Co["checkout-service"]
    Inv["inventory-service"]
    Pay["payment-service"]
    PG[("postgres")]
    Rd[("redis")]
    Web -- "1200 rps<br/>err 0.1%" --> Co
    Co -- "1100 rps<br/>err 0.4%" --> Inv
    Co -- "900 rps<br/>err 2.1% (hot)" --> Pay
    Inv -- "1100 rps<br/>p99 540ms" --> PG
    Co -- "1200 rps<br/>p99 5ms" --> Rd
    Pay -- "900 rps<br/>err 1.8%" --> PG

Key Takeaway: Traces capture causality across services via trace_id and parent span_id propagation through W3C Trace Context, enabling per-request flame graphs and auto-derived service maps — the one capability metrics and logs cannot deliver.

Correlating the Three

Each signal in isolation is useful; together they become an investigative narrative. The mechanics of correlation — how a dot on a Grafana chart links to a trace, which in turn surfaces the relevant log lines — rely on three standardized hooks.

Exemplars in Prometheus histograms

The naive way to link a latency spike to a slow request is to put trace_id in the metric labels. That blows up cardinality instantly: every unique trace_id creates a new series, and you’ll have billions in a day. Exemplars solve this by attaching a few representative trace_id/span_id pointers alongside metric samples without making them series-defining labels [Source: https://learn.microsoft.com/en-us/azure/foundry/openai/concepts/prompt-engineering].

The OpenMetrics text exposition format appends an exemplar after the sample, introduced by #, as a label set followed by the observed value and an optional timestamp (the line is wrapped here for readability):

http_server_request_duration_seconds_bucket{le="0.5",method="GET",service="api"} 240 1700000100 \
  # {trace_id="4bf92f3577b34da6a3ce929d0e0e4736",span_id="00f067aa0ba902b7"} 0.43 1.700000099e+09

Constraints worth memorizing:

The trace ID format is exactly the W3C Trace Context format used by OpenTelemetry — 32 hex chars for trace_id, 16 hex chars for span_id — and the values must be the same IDs your tracer is shipping to Tempo or Jaeger. Mismatched propagation is the most common reason exemplars appear in Grafana but the “View trace” link returns “trace not found.”
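To show how little an exemplar carries, here is a hedged Go sketch that renders the exemplar suffix by hand and enforces one limit from the OpenMetrics specification: the combined length of exemplar label names and values may not exceed 128 characters. In practice a client library does this for you; `formatExemplar` is a hypothetical helper, and the observed value and timestamp are illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"unicode/utf8"
)

// maxExemplarLabelRunes is the OpenMetrics cap on the combined length of
// exemplar label names and values.
const maxExemplarLabelRunes = 128

// formatExemplar renders the "# {...} value timestamp" suffix that follows a
// sample line in the OpenMetrics text format.
func formatExemplar(traceID, spanID string, value float64, tsSeconds int64) (string, error) {
	n := utf8.RuneCountInString("trace_id" + traceID + "span_id" + spanID)
	if n > maxExemplarLabelRunes {
		return "", errors.New("exemplar labels exceed 128 characters")
	}
	return fmt.Sprintf("# {trace_id=%q,span_id=%q} %s %d",
		traceID, spanID, strconv.FormatFloat(value, 'f', -1, 64), tsSeconds), nil
}

func main() {
	suffix, err := formatExemplar("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", 0.43, 1700000099)
	if err != nil {
		panic(err)
	}
	fmt.Println(suffix)
}
```

A trace_id plus span_id pair uses 63 of the 128 characters, which is why exemplars have room for IDs but not for arbitrary request metadata.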

The end-to-end workflow when a developer investigates a latency spike at 10:05:

  1. Grafana renders the p99 latency line from histogram_quantile(0.99, sum by (le) (rate(http_server_request_duration_seconds_bucket[5m]))).
  2. Small dots appear along the line wherever Prometheus has stored exemplars.
  3. Developer hovers a dot at the spike; popup shows trace_id=4bf92f3577…, latency 2.3s.
  4. Click “View trace.” Grafana queries the Tempo data source by trace_id.
  5. The trace opens: checkout-service called inventory-service which sat on a database lock for 2.1s.

The dot was sampled — not every slow request gets an exemplar, just enough to give the investigator a starting point. This is the practical realization of the metrics-to-traces correlation that the three pillars metaphor promises.

Figure 2.5: Metric-to-trace exemplar drill-down workflow

sequenceDiagram
    participant Dev as Developer
    participant Graf as Grafana
    participant Prom as Prometheus<br/>(TSDB + exemplar store)
    participant Tempo as Tempo
    Dev->>Graf: open p99 latency dashboard
    Graf->>Prom: histogram_quantile(0.99, ...)
    Prom-->>Graf: latency series + exemplar dots
    Note over Graf: spike at 10:05<br/>dot shows trace_id=4bf92f35...
    Dev->>Graf: hover dot, click "View trace"
    Graf->>Tempo: GET /api/traces/4bf92f35...
    Tempo-->>Graf: span tree (checkout → inventory → db)
    Note over Dev,Tempo: db.query span = 2.1s<br/>(lock wait identified)

Trace-to-logs joins via trace_id and span_id

The second leg of the correlation triangle is trace-to-logs, made possible because the OpenTelemetry LogRecord schema reserves dedicated trace_id and span_id fields at the top level — not buried in attributes [Source: https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-prompting-best-practices].

When an application logs inside an active span, the OpenTelemetry logging integration automatically copies the current span’s trace_id and span_id into the LogRecord. In Java with log4j2, in Python with the OTel logging instrumentation, in Go with the otelslog handler — the SDK does it for you. You no longer have to remember to call logger.info("...", "trace_id", currentSpan.TraceID()) at every log site.
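The mechanism is easy to picture as a logging handler that reads the active span from the request context. The Go sketch below imitates it with the standard library only: `spanContext`, `withSpan`, and `traceHandler` are hypothetical stand-ins for what an OpenTelemetry log bridge provides, not its actual API.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
)

// spanContext is a stand-in for the SDK's notion of the active span.
type spanContext struct{ TraceID, SpanID string }

type ctxKey struct{}

// withSpan returns a context that carries the given span identity.
func withSpan(ctx context.Context, sc spanContext) context.Context {
	return context.WithValue(ctx, ctxKey{}, sc)
}

// traceHandler wraps another slog.Handler and copies trace_id and span_id
// from the context onto every record that is logged inside a span.
type traceHandler struct{ inner slog.Handler }

func (h traceHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return h.inner.Enabled(ctx, l)
}

func (h traceHandler) Handle(ctx context.Context, r slog.Record) error {
	if sc, ok := ctx.Value(ctxKey{}).(spanContext); ok {
		r.AddAttrs(slog.String("trace_id", sc.TraceID), slog.String("span_id", sc.SpanID))
	}
	return h.inner.Handle(ctx, r)
}

func (h traceHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return traceHandler{inner: h.inner.WithAttrs(attrs)}
}

func (h traceHandler) WithGroup(name string) slog.Handler {
	return traceHandler{inner: h.inner.WithGroup(name)}
}

// logInSpan logs one message inside the given span and returns the decoded record.
func logInSpan(sc spanContext, msg string) (map[string]any, error) {
	var buf bytes.Buffer
	logger := slog.New(traceHandler{inner: slog.NewJSONHandler(&buf, nil)})
	logger.InfoContext(withSpan(context.Background(), sc), msg)
	var rec map[string]any
	err := json.Unmarshal(buf.Bytes(), &rec)
	return rec, err
}

func main() {
	rec, err := logInSpan(spanContext{"4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"}, "called inventory")
	if err != nil {
		panic(err)
	}
	fmt.Println(rec["msg"], rec["trace_id"], rec["span_id"])
}
```

The call site never mentions trace IDs; passing the context is enough, which is the ergonomic win the paragraph above describes.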

From the trace view in Grafana or Tempo, “show logs for this span” is then a single LogQL query against Loki:

{service="checkout"} |= "4bf92f3577b34da6a3ce929d0e0e4736"

The trace_id acts as the universal join key. The same query in reverse — start in logs, jump to the trace — gives you the bidirectional navigation that makes investigations feel fluid instead of archaeological.

A subtle pitfall: sampling mismatches. If you head-sample traces at 10% but log unconditionally on every request, 90% of your log trace_ids will point to traces that don’t exist in Tempo. To avoid this, use tail-based sampling (decide which traces to keep based on what happened in them — errors and slow ones always kept) or ensure that the sampling decision is made early and propagated, so logs from non-sampled traces don’t carry stale IDs.

Unified resource attributes across all signals

The third correlation hook is the Resource — the OpenTelemetry concept of a small, fixed set of attributes describing where the telemetry came from. The same resource attributes are attached to every metric, log, and trace emitted by a process:

service.name = "checkout-service"
service.namespace = "payments"
service.instance.id = "pod-abc123"
service.version = "1.4.2"
deployment.environment = "prod"
k8s.pod.name = "checkout-7f8b9-abcde"
k8s.namespace.name = "payments-prod"
host.name = "ip-10-0-1-42.ec2.internal"
cloud.region = "us-east-1"

Because all three signals share these attributes verbatim, a query like “all telemetry for service.name = checkout-service in deployment.environment = prod” returns matching metrics, logs, and traces from a single filter, with no vendor-specific tag mapping and no per-tool index naming conventions. This is the practical meaning of the OpenTelemetry promise of “vendor-neutral observability”: a service.name in one tool means the same thing in another.

Semantic conventions extend this consistency to operation-level attributes. The OpenTelemetry spec defines official keys like http.request.method, http.response.status_code, db.system.name, messaging.system, rpc.service. When everyone uses these standard names, queries that are portable across backends become routine.

| Correlation hook | Joins | Mechanism |
| --- | --- | --- |
| Exemplars | Metrics → Traces | trace_id/span_id appended to histogram samples in OpenMetrics |
| Trace context in LogRecord | Logs ↔ Traces | trace_id/span_id as first-class LogRecord fields, auto-populated by SDK |
| Shared resource attributes | All three | service.name, k8s.*, host.* identical across signals from same process |
| Semantic conventions | All three | Standard attribute keys (http.*, db.*, rpc.*) used uniformly |

Figure 2.6: Correlation hub — three standardized hooks unifying the signals

graph TB
    subgraph Signals
        M["Metrics<br/>Prometheus TSDB"]
        L["Logs<br/>Loki / Elasticsearch"]
        T["Traces<br/>Tempo / Jaeger"]
    end
    H["Shared Resource<br/>service.name<br/>k8s.pod.name<br/>deployment.environment"]
    EX["Exemplars<br/>trace_id appended<br/>to histogram samples"]
    TC["LogRecord trace context<br/>top-level trace_id / span_id<br/>auto-populated by SDK"]
    M -- "drill via" --- EX
    EX -- "to" --- T
    T -- "join via" --- TC
    TC -- "to" --- L
    M --- H
    L --- H
    T --- H
    H --- Q["query: service.name=checkout<br/>returns metrics + logs + traces"]

Key Takeaway: The three signals become one investigative narrative through three standardized hooks — exemplars connect metrics to traces without cardinality blowup, OpenTelemetry’s LogRecord puts trace_id/span_id as top-level fields for trace-to-log joins, and a shared Resource model means service.name and friends mean the same thing across all signals.

Chapter Summary

This chapter dissected the three observability signals at the data-model level and showed how they interlock.

Metrics are cheap aggregates that answer “how many” and “how fast.” Prometheus offers four instrument types: counters, gauges, histograms, and summaries. The critical operational decision is histogram-vs-summary: summaries compute quantiles per-instance and cannot be aggregated across the fleet, while histogram buckets are fully aggregatable and let you compute new quantiles at query time. Native histograms (Prometheus 2.40+) further reduce series count via dynamic log-spaced buckets in a single time series. Cardinality — the product of unique label combinations — is the dominant cost, so labels like user_id or trace_id must never be used as series-defining metric labels.

Logs are discrete events with rich per-event context. The OpenTelemetry LogRecord defines a typed schema — timestamp, severity_number (normalized 1–24), severity_text, body (AnyValue), attributes, resource, trace context — that turns “lines of text” into structured telemetry events. This contrasts with Fluentd/Logstash pipelines, which transport JSON or syslog without a global data model. Despite their richness, logs alone cannot reconstruct causality across services in microservices systems.

Traces capture causality. A trace is identified by a 128-bit trace_id; each unit of work is a span with a 64-bit span_id and a parent pointer. W3C Trace Context propagates these IDs across process boundaries via the traceparent header. Span kinds (SERVER/CLIENT/PRODUCER/CONSUMER/INTERNAL) let tracing backends auto-derive service maps from the topology of observed spans.

Correlation is the payoff. Exemplars attach trace_id/span_id pointers to Prometheus histogram samples in OpenMetrics format, stored separately from normal series to avoid cardinality explosion. Trace IDs in LogRecord as top-level fields make trace-to-logs navigation a single join. A shared Resource — service.name, k8s.pod.name, deployment.environment — ties all three signals to the same process identity. Semantic conventions standardize the rest. Together these hooks turn three independent data streams into one coherent investigative narrative: spike on a graph → hover an exemplar dot → open the trace → see the slow span → click through to its logs → identify the root cause.

Key Terms

| Term | Definition |
| --- | --- |
| counter | A Prometheus metric instrument type representing a monotonically increasing value (or one that resets on process restart); used for rates and totals via rate(). |
| gauge | A Prometheus metric instrument type representing a value that can go up or down; used for levels and sizes like memory in use or queue depth. |
| histogram | A Prometheus distribution metric that exposes cumulative bucket counters (_bucket{le="..."}), plus _sum and _count; quantiles are computed server-side in PromQL with histogram_quantile(), and buckets are fully aggregatable across instances. |
| summary | A Prometheus distribution metric that computes quantiles client-side via a sliding-window algorithm and exposes them as {quantile="0.9"} series; quantile series cannot be aggregated across instances. |
| span | A single named unit of work within a trace, identified by a 64-bit span_id, with start time, end time, attributes, events, and a link to its parent span_id. |
| trace context | The propagation envelope (typically a W3C traceparent HTTP header) that carries trace_id, current span_id, and trace_flags across process boundaries to stitch distributed work into one trace. |
| exemplar | A small pointer (typically a trace_id and span_id) attached to a metric sample in OpenMetrics format, stored separately from normal series, enabling navigation from a metric spike to a specific trace without cardinality explosion. |
| structured logging | The practice of emitting logs as typed key-value records rather than free-form strings, enabling indexed queries; OpenTelemetry formalizes this via the LogRecord schema. |
| trace_id | A 128-bit identifier (32 hex characters in W3C Trace Context format) that uniquely identifies a single distributed request and is shared by every span and log emitted in service of that request. |

Chapter 3: Prometheus Architecture and Data Model

If chapter 2 explained why metrics matter, this chapter is about how one of the most influential metrics systems in the world actually works. Prometheus is more than a “metrics database.” It is an opinionated bundle of design choices — a pull-based scraper, a multi-dimensional data model, a custom time-series database (TSDB), and a query engine — that together produce a system that is operationally simple at small scale and powerful at large scale.

Understanding Prometheus internals matters for two reasons. First, almost every cloud-native observability stack you will encounter borrows Prometheus’s conventions: the OpenMetrics exposition format, label-based identity, PromQL, and remote write. Second, when something goes wrong — slow queries, OOMing pods, missing data after a crash, runaway cardinality — you cannot debug it without a mental model of the scrape loop, the TSDB block layout, and the WAL.

This chapter walks through Prometheus from the outside in: the components, the pull model, the data model, and finally how data is stored, retained, and shipped to long-term storage.

Learning Objectives

By the end of this chapter, you will be able to:


Section 1: Server Components

A single prometheus binary contains several cooperating subsystems. Architecturally, you can think of Prometheus as a small distributed system that happens to run inside one process: a retrieval subsystem pulls data, a storage subsystem persists it, a query subsystem answers PromQL questions, and a rules subsystem fires alerts to an external Alertmanager.

Figure 3.1: Prometheus server components and external integrations

flowchart LR
    SD[Service Discovery<br/>K8s, Consul, EC2, DNS] --> R[Retrieval<br/>Scrape Loop]
    R -->|HTTP GET /metrics| Targets[(Scrape Targets)]
    R --> TSDB[(TSDB<br/>head + WAL + blocks)]
    TSDB --> API[HTTP API + Web UI]
    API --> Grafana[Grafana / Clients]
    TSDB --> RE[Rule Engine<br/>recording + alerting]
    RE --> AM[Alertmanager<br/>external]
    TSDB --> RW[Remote Write]
    RW --> LTS[(Long-term Storage<br/>Thanos / Mimir / Cortex)]

Retrieval (scrape) loop

The retrieval subsystem is the heart of Prometheus’s interaction with the outside world. It maintains a set of scrape targets, each of which is essentially a URL like http://10.0.4.17:9100/metrics, a scrape interval (commonly 15s or 30s), and an optional set of labels.

For each target, Prometheus runs a small state machine on a timer:

  1. Resolve the target’s address (service discovery may have changed it).
  2. Open an HTTP GET to /metrics with a configured timeout.
  3. Parse the response as OpenMetrics or the legacy Prometheus text format.
  4. Apply relabeling rules to drop or rewrite the resulting samples.
  5. Append the samples to the TSDB’s head block, tagged with the scrape timestamp.

If the scrape fails (timeout, 5xx, parse error), Prometheus still writes a synthetic series called up{job="...",instance="..."} with the value 0. On success, it writes up = 1 plus several built-in scrape_* metrics like scrape_duration_seconds and scrape_samples_scraped. These “meta” series are how you alert on monitoring itself.
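One iteration of that state machine can be sketched in Go with the standard library. This is an illustration of the control flow only: `scrapeOnce`, `scrapeResult`, and the `fakeTarget` test server are invented names, and parsing, relabeling, and TSDB appends are left out.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// scrapeTimeout plays the role of the configured scrape timeout.
var scrapeTimeout = 2 * time.Second

// scrapeResult mirrors the synthetic series recorded for every scrape.
type scrapeResult struct {
	Up       int           // 1 on success, 0 on any failure
	Duration time.Duration // analogous to scrape_duration_seconds
	Body     string        // raw exposition text, empty on failure
}

// scrapeOnce performs a single HTTP GET and reports success only for a
// fully readable 200 response.
func scrapeOnce(url string) scrapeResult {
	client := &http.Client{Timeout: scrapeTimeout}
	start := time.Now()
	var res scrapeResult
	resp, err := client.Get(url)
	if err == nil {
		defer resp.Body.Close()
		body, readErr := io.ReadAll(resp.Body)
		if readErr == nil && resp.StatusCode == http.StatusOK {
			res.Up, res.Body = 1, string(body)
		}
	}
	res.Duration = time.Since(start)
	return res
}

// fakeTarget starts a local HTTP server standing in for a /metrics endpoint.
func fakeTarget(status int, body string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
		fmt.Fprint(w, body)
	}))
}

func main() {
	healthy := fakeTarget(http.StatusOK, "demo_requests_total 7\n")
	defer healthy.Close()
	broken := fakeTarget(http.StatusInternalServerError, "boom")
	defer broken.Close()

	for _, t := range []*httptest.Server{healthy, broken} {
		r := scrapeOnce(t.URL + "/metrics")
		fmt.Printf("up=%d body_bytes=%d\n", r.Up, len(r.Body))
	}
}
```

Note that the failure path still produces a result: recording `up=0` is what lets you alert on a target that has stopped answering.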

A useful analogy: think of the scrape loop as a postal carrier who walks the same route every 15 seconds. The carrier does not wait for anyone to mail a letter; they pick up whatever is in the mailbox at that moment. If the mailbox is missing — the house has been demolished — the carrier files a report (up = 0) and moves on.

TSDB on-disk format and blocks

The TSDB (time-series database) is where samples land after the scrape loop. Prometheus does not use Postgres, RocksDB, or any general-purpose database for time series. Instead, it ships its own purpose-built engine designed for one workload: many millions of numeric series that are appended to in timestamp order.

The on-disk layout is straightforward. Inside the data directory you will find:

Conceptually there are two regions:

We will return to blocks, the WAL, and compaction in section 4. The important takeaway here is that the storage engine is not a free-for-all key-value store — it is shaped around time-bounded immutable units, which is what makes 90-day retention with sub-second queries feasible on a single node.

HTTP API and web UI

Everything you can do with a running Prometheus (query it, list targets, see the configuration, inspect rules and active alerts) goes through its HTTP API. Notable endpoints:

| Endpoint | Purpose |
| --- | --- |
| /api/v1/query | Instant PromQL query |
| /api/v1/query_range | Range PromQL query (what Grafana uses) |
| /api/v1/series | Series matching label selectors |
| /api/v1/labels, /api/v1/label/<n>/values | Discover label names and values |
| /api/v1/targets | Active and dropped scrape targets |
| /api/v1/rules, /api/v1/alerts | Configured rules and active alerts |
| /-/reload, /-/healthy, /-/ready | Lifecycle endpoints |
| /metrics | Prometheus’s own metrics (Prometheus scrapes itself) |

The web UI bundled with the server is intentionally minimal: a query box, a graph view, a targets page, and an alerts page. It exists so an operator can debug from a fresh laptop with no Grafana. For day-to-day dashboards, you point Grafana (or another visualization tool) at the same HTTP API.
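Any HTTP client can use the same API. The Go sketch below issues an instant query and decodes the JSON envelope. To stay self-contained it runs against a local stub (`fakePrometheus`, an invented helper returning a canned response) rather than a real server; the response shape follows the documented `status`/`data`/`resultType`/`result` layout.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"time"
)

// queryResponse models the envelope of an instant-vector query result.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Value  [2]any            `json:"value"` // [unix timestamp, value as string]
		} `json:"result"`
	} `json:"data"`
}

// instantQuery calls /api/v1/query on baseURL with the given PromQL expression.
func instantQuery(baseURL, promQL string) (queryResponse, error) {
	var out queryResponse
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(baseURL + "/api/v1/query?query=" + url.QueryEscape(promQL))
	if err != nil {
		return out, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return out, fmt.Errorf("unexpected status %s", resp.Status)
	}
	err = json.NewDecoder(resp.Body).Decode(&out)
	return out, err
}

// fakePrometheus serves one canned response so the example runs offline.
func fakePrometheus() *httptest.Server {
	const body = `{"status":"success","data":{"resultType":"vector","result":[` +
		`{"metric":{"__name__":"up","job":"demo","instance":"10.0.4.17:9100"},"value":[1700000100,"1"]}]}}`
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/api/v1/query" {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, body)
	}))
}

func main() {
	srv := fakePrometheus()
	defer srv.Close()

	res, err := instantQuery(srv.URL, `up{job="demo"}`)
	if err != nil {
		panic(err)
	}
	for _, r := range res.Data.Result {
		fmt.Println(r.Metric["instance"], "up =", r.Value[1])
	}
}
```

Sample values arrive as strings so that special values such as NaN and +Inf survive JSON encoding; callers parse them explicitly.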

Service discovery integrations

Static configuration — listing IPs in a YAML file — falls over the moment your environment becomes dynamic. Prometheus solves this with service discovery (SD), where the list of scrape targets is generated from an external source of truth.

Built-in SD integrations include Kubernetes, Consul, EC2, GCE, Azure, DNS SRV records, file-based SD, and many more. A Kubernetes SD configuration, for example, asks the API server for all pods matching a selector, then turns each pod into a target with labels like __meta_kubernetes_pod_name, __meta_kubernetes_namespace, and __meta_kubernetes_pod_label_app. Relabeling rules (covered in section 3) then promote these “underscore” labels into permanent target labels or drop them entirely.
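The promote-or-drop behaviour can be approximated in a few lines of Go. This is a conceptual sketch, not Prometheus's relabeling engine (which is configured in YAML and supports regex matching and several actions): the `relabel` function and its `promote` mapping are invented for illustration. The one behaviour borrowed from Prometheus is that labels whose names start with a double underscore are removed once target relabeling completes.

```go
package main

import (
	"fmt"
	"strings"
)

// relabel copies selected discovery labels to permanent target labels and
// drops every label whose name starts with "__".
// promote maps a source discovery label to the target label name it becomes.
func relabel(discovered, promote map[string]string) map[string]string {
	out := make(map[string]string)
	for name, value := range discovered {
		if !strings.HasPrefix(name, "__") {
			out[name] = value
		}
	}
	for src, dst := range promote {
		if value, ok := discovered[src]; ok && value != "" {
			out[dst] = value
		}
	}
	return out
}

func main() {
	discovered := map[string]string{
		"__address__":                     "10.0.4.17:9100",
		"__meta_kubernetes_pod_name":      "checkout-7f8b9-abcde",
		"__meta_kubernetes_namespace":     "payments-prod",
		"__meta_kubernetes_pod_label_app": "checkout",
		"job":                             "kubernetes-pods",
	}
	promote := map[string]string{
		"__meta_kubernetes_pod_name":      "pod",
		"__meta_kubernetes_namespace":     "namespace",
		"__meta_kubernetes_pod_label_app": "app",
	}
	fmt.Println(relabel(discovered, promote))
}
```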

Think of SD as the address book: Prometheus’s scrape loop is the postal carrier, but SD is what tells the carrier which houses exist this morning.

Key Takeaway: Prometheus is a small distributed system in a single binary — a scrape loop driven by service discovery, a purpose-built TSDB protected by a WAL, an HTTP query API, and a rules engine that talks to an external Alertmanager. Operating Prometheus well means understanding each of those subsystems independently.


Section 2: The Pull Model

Of all the design decisions in Prometheus, the choice to pull metrics rather than have services push them is the most contentious — and the most consequential. Understanding why pull was chosen helps you reason about when push is genuinely better and when it just feels easier because it is what you already know.

Why pull was chosen over push

In a pull model, targets expose a /metrics HTTP endpoint and Prometheus periodically calls it. In a push model, targets connect to a central collector and stream samples as they are produced.

The pull design buys Prometheus several properties that are surprisingly hard to replicate in push systems [Source: https://learn.microsoft.com/en-us/azure/foundry/openai/concepts/prompt-engineering]:

The trade-offs are real, however. Pull is awkward when:

Pull is therefore not universally better — it is better for the steady-state of long-lived services in trusted networks, which happens to describe most of what runs inside a Kubernetes cluster.

Figure 3.2: Scrape loop sequence (pull model with implicit liveness)

sequenceDiagram
    participant SD as Service Discovery
    participant P as Prometheus<br/>Scrape Loop
    participant T as Target /metrics
    participant DB as TSDB Head

    SD->>P: target list (IPs, labels)
    loop every 15s
        P->>T: HTTP GET /metrics
        alt target healthy
            T-->>P: 200 OK + OpenMetrics body
            P->>P: parse + relabel samples
            P->>DB: append samples + "up=1"
        else target unreachable
            T--xP: timeout / 5xx / parse error
            P->>DB: append "up=0" + staleness markers
        end
    end

The Pushgateway for short-lived jobs

For the cases pull cannot reach naturally — most importantly short-lived batch jobs — Prometheus offers a companion service called the Pushgateway. It is a small HTTP server with a deceptively simple job: accept pushed metrics, store the current value in memory, and expose them on its own /metrics endpoint so Prometheus can scrape the Pushgateway like any other target.

Used correctly, the pattern looks like this:

  1. A nightly batch job starts, runs for two minutes, then finishes.
  2. Right before exiting, the job pushes nightly_import_last_run_status (0 for success), nightly_import_last_run_duration_seconds, and nightly_import_last_run_timestamp_seconds to the Pushgateway, grouped by job="nightly_import".
  3. Prometheus continues to scrape the Pushgateway every 15 seconds, so these “last run” metrics are visible to PromQL and Alertmanager whether or not the job is currently running.

A simple Go example illustrates the shape of this pattern:

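package main

import (
    "log"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/push"
)

// Gauges describing the most recent run; the Pushgateway keeps the last pushed values.
var (
    jobDuration  = prometheus.NewGauge(prometheus.GaugeOpts{Name: "nightly_import_last_run_duration_seconds"})
    jobStatus    = prometheus.NewGauge(prometheus.GaugeOpts{Name: "nightly_import_last_run_status"})
    jobTimestamp = prometheus.NewGauge(prometheus.GaugeOpts{Name: "nightly_import_last_run_timestamp_seconds"})
)

// runImport stands in for the real batch work (placeholder for this example).
func runImport() error { return nil }

func main() {
    start := time.Now()
    err := runImport()
    jobDuration.Set(time.Since(start).Seconds())
    jobTimestamp.Set(float64(time.Now().Unix()))
    if err != nil {
        jobStatus.Set(1) // non-zero means failure
    } else {
        jobStatus.Set(0) // 0 means success
    }

    // Push replaces every metric in the group job="nightly_import".
    pushErr := push.New("http://pushgateway:9091", "nightly_import").
        Collector(jobDuration).Collector(jobStatus).Collector(jobTimestamp).Push()
    if pushErr != nil {
        log.Printf("could not push to Pushgateway: %v", pushErr)
    }
}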

The anti-patterns are at least as important to know [Source: https://blog.codinghorror.com/the-problem-with-logging/]:

For most modern push needs, the OpenTelemetry Collector is a better answer (chapter 5 covers it in depth): it accepts OTLP push from applications, handles retries and batching, and can expose a Prometheus-compatible scrape endpoint or remote-write to a backend.

Federation and hierarchical scraping

What if you have too many targets for one Prometheus, or you want a “global” view across regions without one giant server scraping the world? Prometheus supports federation: one Prometheus scrapes another’s /federate endpoint and pulls a subset of its series.

A typical hierarchy looks like:

| Tier | Role | Retention | Scrape sources |
| --- | --- | --- | --- |
| Leaf | Per-cluster, per-region; scrape everything | Short (1-7 days) | Pods, nodes, exporters |
| Aggregator | Pull recording-rule outputs from leaves | Medium (15-30 days) | Leaf /federate endpoints |
| Global | Cross-region dashboards and alerting | Long (via remote write to Thanos/Mimir) | Aggregator /federate endpoints |

Federation is best used to pull aggregated series (the output of recording rules) rather than raw metrics — pulling raw, high-cardinality data through federation will bottleneck on a single HTTP scrape. For “ship everything to long-term storage,” remote write (section 4) is almost always a better fit.

Key Takeaway: Pull gives Prometheus implicit liveness, server-side load control, and a debuggable wire protocol; push (via Pushgateway or OTel Collector) is the escape hatch for short-lived jobs and firewalled targets. Use federation for aggregated views, remote write for raw data shipping.


Section 3: Data Model and Exposition Format

Prometheus’s data model is small enough to fit on an index card, yet it underpins every PromQL query you will ever write. Internalize it once and the rest of the system stops feeling magical.

Metric name + label set = unique time series

In Prometheus, a time series is uniquely identified by its metric name plus its set of labels. Every other property — the metric type, the help text, the unit — is metadata; the identity is the name and the labels.

Consider this single sample:

http_requests_total{job="api", method="GET", status="200", path="/users"}  1873  1717520400000

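That has three parts: the series identity (the metric name http_requests_total plus its four labels), the sample value (1873), and the timestamp in milliseconds since the Unix epoch (1717520400000).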

Change any label value and you have a different series. http_requests_total{...,method="GET"} and http_requests_total{...,method="POST"} are entirely separate time series, stored separately, indexed separately, and counted separately against cardinality budgets.

This is the multi-dimensional data model: instead of inventing one metric per dimension (http_requests_GET_200_total, http_requests_POST_500_total, …) you have one metric name with multiple label dimensions, and PromQL lets you slice along those dimensions at query time:

sum by (status) (rate(http_requests_total{job="api"}[5m]))

A practical analogy: a metric name is a spreadsheet, labels are the columns, label values are the cell contents, and each unique row is a time series. Adding a column with high cardinality (say, user_id) is like adding a column to a spreadsheet that has one row per user — your data grows linearly in the number of distinct values, and so does Prometheus’s memory.
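The identity rule and the cardinality warning can be sketched in a few lines of Go. This is an illustration only: seriesKey and maxSeries are hypothetical helpers invented for this example, not Prometheus internals, and the label counts are made-up numbers.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey builds a canonical identity string from a metric name and its
// labels. Labels are sorted by name so map iteration order does not matter.
func seriesKey(name string, labels map[string]string) string {
	names := make([]string, 0, len(labels))
	for k := range labels {
		names = append(names, k)
	}
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, k := range names {
		parts = append(parts, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return name + "{" + strings.Join(parts, ",") + "}"
}

// maxSeries returns the worst-case series count for one metric name: the
// product of the number of distinct values each label can take.
func maxSeries(distinctValues map[string]int) int {
	total := 1
	for _, n := range distinctValues {
		total *= n
	}
	return total
}

func main() {
	get := seriesKey("http_requests_total", map[string]string{"job": "api", "method": "GET"})
	post := seriesKey("http_requests_total", map[string]string{"job": "api", "method": "POST"})
	fmt.Println(get)
	fmt.Println(post)
	fmt.Println(get == post) // false: one label value differs, so the series differ

	// Hypothetical label cardinalities.
	fmt.Println(maxSeries(map[string]int{"method": 5, "status": 10}))                    // 50
	fmt.Println(maxSeries(map[string]int{"method": 5, "status": 10, "user_id": 100000})) // 5000000
}
```

The second maxSeries call shows why a single unbounded label matters: it multiplies the worst-case series count of everything it is attached to.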

OpenMetrics text exposition format

Targets expose metrics in a deliberately simple line-based format. The original Prometheus text format was standardized as OpenMetrics in 2020 and is now an IETF-tracked specification. A typical /metrics payload looks like:

# HELP http_requests_total Total HTTP requests received.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1873
http_requests_total{method="GET",status="500"} 4
http_requests_total{method="POST",status="200"} 219

# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.2848256e+07

# HELP http_request_duration_seconds HTTP request latency.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 1500
http_request_duration_seconds_bucket{le="0.5"} 1860
http_request_duration_seconds_bucket{le="1.0"} 1872
http_request_duration_seconds_bucket{le="+Inf"} 1873
http_request_duration_seconds_sum 92.7
http_request_duration_seconds_count 1873
# EOF

Key rules:

The format’s simplicity is the point: anyone can write an exporter in a few dozen lines of code in any language, and anyone can debug one with curl.

Honor labels, target labels, and relabeling

When Prometheus scrapes a target, it sees two sets of labels: the labels the target exposed in its /metrics output, and target labels that Prometheus attaches automatically (most importantly job and instance, plus anything service discovery contributed).

By default, if the target’s exposed labels conflict with target labels, the target labels win. The honor_labels: true setting in a scrape config flips this: the target’s labels take precedence. This matters mostly for the Pushgateway (where the pushing job has set its own job label intentionally) and for federation (where you want to preserve the original job label from the upstream Prometheus).
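The conflict rule can be modelled as a small merge function. This is a simplified sketch, not Prometheus's code: mergeLabels is a hypothetical helper, and it relies on the documented default of keeping a conflicting exposed label under an exported_ prefix rather than discarding it.

```go
package main

import "fmt"

// mergeLabels combines the labels a target exposed with the labels Prometheus
// attaches to that target, following the honor_labels setting.
func mergeLabels(exposed, target map[string]string, honorLabels bool) map[string]string {
	out := make(map[string]string, len(exposed)+len(target))
	for k, v := range exposed {
		out[k] = v
	}
	for k, v := range target {
		if old, conflict := exposed[k]; conflict {
			if honorLabels {
				continue // keep the exposed value and ignore the target label
			}
			out["exported_"+k] = old // preserve the exposed value under a new name
		}
		out[k] = v
	}
	return out
}

func main() {
	// Hypothetical labels for a metric pushed through the Pushgateway.
	exposed := map[string]string{"job": "nightly_import"}
	target := map[string]string{"job": "pushgateway", "instance": "pushgateway:9091"}

	def := mergeLabels(exposed, target, false)
	fmt.Println(def["job"], def["exported_job"]) // pushgateway nightly_import

	honored := mergeLabels(exposed, target, true)
	fmt.Println(honored["job"]) // nightly_import
}
```

With honor_labels left at its default, queries on job="nightly_import" find nothing, which is the usual symptom of forgetting the setting on a Pushgateway scrape job.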

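Relabeling is the small declarative DSL Prometheus uses to transform labels before storing samples. There are two flavors: relabel_configs, which runs on discovered targets before the scrape, and metric_relabel_configs, which runs on individual samples after the scrape and before they are stored.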

A worked example: Kubernetes SD discovers thousands of pods. You want to scrape only pods with the annotation prometheus.io/scrape=true, set their job label from prometheus.io/job, and drop a noisy histogram bucket:

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that opt in.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Set the job label from the pod's annotation, but only when the annotation is non-empty.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_job]
        regex: (.+)
        target_label: job
      # Promote namespace and pod into permanent labels.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
    metric_relabel_configs:
      # Drop a known cardinality-bomb metric.
      - source_labels: [__name__]
        regex: "go_gc_pauses_seconds_bucket"
        action: drop

Figure 3.3: Relabeling pipeline (target metadata to stored series)

flowchart LR
    SD[Service Discovery] -->|"__meta_kubernetes_*<br/>__meta_consul_*"| RC[relabel_configs<br/>keep / drop / rewrite]
    RC --> TL[Final target labels<br/>"job, instance, namespace, pod"]
    TL --> SCR[Scrape /metrics]
    SCR --> RAW[Raw samples<br/>"name + labels + value"]
    RAW --> MRC[metric_relabel_configs<br/>drop high-cardinality]
    MRC --> TSDB[(TSDB head block)]

Relabeling is the place where most operational bugs hide. Two rules of thumb: (1) always test relabeling against a known target with --log.level=debug, and (2) drop unbounded labels (user IDs, request IDs, HTTP paths with query strings) before they enter the TSDB — once they are in, they cost memory until retention removes them.

Key Takeaway: A time series is uniquely identified by metric_name + label_set; labels are how you slice data at query time but also how cardinality explodes. The OpenMetrics text format is intentionally readable, and relabeling is your control plane for shaping labels before and after each scrape.


Section 4: Storage, Retention, and Remote Write

Now we follow a sample from the moment Prometheus accepts it from the scrape loop, through the head block and WAL, into a 2-hour block on disk, through compaction, and finally out to a long-term storage system over remote write.

Local TSDB block compaction and WAL

When the scrape loop produces a sample, the TSDB does two things, in order [Source: https://natesnewsletter.substack.com/p/context-windows-are-a-lie-the-myth]:

  1. Append a record to the WAL. The sample (with its series ID, timestamp, and value) is encoded as a CRC-checksummed record and written to the current WAL segment file (typically ~128 MB segments named 00000001, 00000002, …).
  2. Update the in-memory head block. The TSDB looks up (or creates) an in-memory series for the label set and appends the sample to that series’ current chunk.

The WAL ensures durability: if Prometheus crashes after step 1 but before the head block is persisted as a finished block, the sample can be recovered on restart by replaying the WAL.

Every ~2 hours, the head’s accumulated samples are cut into a new immutable on-disk block. Each block directory looks like:

01J1H1Q2K3V3Y2.../
  meta.json       # ULID, minTime, maxTime, compaction level, sources, stats
  index           # symbol table + postings lists (label=value -> series IDs)
  chunks/
    000001        # concatenated XOR-compressed chunks (mmapped at query time)
    000002
  tombstones      # optional, deletion intervals

The index file is the crown jewel: it interns every label name and value into a symbol table, assigns each series an integer ID, and stores postings lists — sorted lists of series IDs for each label=value pair. When you run sum by (status) (rate(http_requests_total{job="api"}[5m])), the index lets Prometheus find every series matching __name__="http_requests_total" and job="api" by intersecting two postings lists in milliseconds, no table scan required.

Chunks themselves use delta-of-delta encoding for timestamps and Gorilla-style XOR compression for float values, which is brutally efficient: typical compressed sizes are ~1-2 bytes per sample. Chunks are read via memory-mapped I/O rather than copied into the heap, which is why Prometheus’s working set often shows modest RSS but a large OS page cache.

Background compaction periodically merges adjacent 2-hour blocks into larger ones (by default 6h, then 18h, then multi-day, with each level spanning three times the previous one and the largest block capped at 10% of the retention period or 31 days, whichever is smaller) [Source: https://www.youtube.com/watch?v=MqqKT6etxpQ]. Larger blocks mean fewer index files to open per query and better compression. Retention is enforced at the block level: when --storage.tsdb.retention.time=30d is set, any block whose maxTime is more than 30 days in the past is deleted whole.

To keep WAL replay fast after a crash, Prometheus periodically writes a checkpoint (wal/checkpoint.N/) that condenses the older WAL segments, keeping only the records the head still needs. A checkpoint numbered N covers segments up to and including N, so once it has been written successfully those segments can be deleted. On restart, recovery loads the latest checkpoint and replays only the segments after it, stopping at the first record with an invalid CRC (which represents a torn write at the moment of crash).

Figure 3.4: TSDB block lifecycle from scrape to retention

stateDiagram-v2
    [*] --> Scrape: sample produced
    Scrape --> WAL: append CRC record
    WAL --> Head: update in-memory head chunk
    Head --> Head: accumulate ~2h of samples
    Head --> Block2h: cut immutable block<br/>(meta.json + index + chunks)
    Block2h --> Block6h: compact adjacent blocks
    Block6h --> Block18h: compact further
    Block18h --> BlockMulti: compact multi-day
    BlockMulti --> Deleted: retention horizon passed
    Deleted --> [*]
    Head --> Recovery: crash
    Recovery --> Head: replay WAL from checkpoint.N

Worked example — surviving a crash. Suppose Prometheus has wal/checkpoint.10/ plus segments 00000011, 00000012, 00000013, and crashes while writing to segment 13. On startup:

  1. All existing immutable blocks load normally — they need no WAL [Source: http://susandumais.com/CHI2012-12-tailanswers-chi2012.pdf].
  2. The head is rebuilt from checkpoint.10.
  3. Segments 11, 12, and 13 are replayed; recovery stops at the first record in segment 13 whose CRC fails.
  4. Any samples that did not complete their write are lost — but the head is consistent and queries work immediately.

If your Prometheus startup is slow, it almost always means either checkpoints are not happening (frequent restarts, repeated OOM kills) or WAL has grown because of an unusually long outage. The fix is to keep Prometheus healthy long enough to checkpoint.

Remote write protocol

Local TSDB is intentionally short-term storage — a few weeks at most. For longer retention, multi-cluster aggregation, or “metrics as a service,” Prometheus supports remote write: an HTTP-based protocol for shipping every sample Prometheus ingests to a remote backend.

The protocol is conceptually simple. Prometheus batches samples into Snappy-compressed protobuf messages and POSTs them to a configured URL:

remote_write:
  - url: https://mimir.example.com/api/v1/push
    basic_auth:
      username: tenant-42
      password_file: /etc/prometheus/mimir-token
    queue_config:
      capacity: 10000
      max_shards: 30
      min_backoff: 30ms
      max_backoff: 5s
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop   # don't ship Go runtime metrics

Important properties:

Long-term storage with Thanos, Cortex, and Mimir

Three open-source projects dominate the “Prometheus at scale” space. Each makes different architectural trade-offs.

| Aspect | Thanos | Cortex | Grafana Mimir |
| --- | --- | --- | --- |
| Ingest model | Sidecar uploads TSDB blocks; optional Receive for remote write | Remote write only (distributor -> ingester -> blocks) | Remote write only; simplified Cortex |
| Multi-tenancy | Label-based; basic | First-class tenant IDs, per-tenant limits | First-class, with shuffle-sharding |
| HA dedup | Query-time, stores all replicas | Ingest-time, stores one replica | Ingest-time, improved |
| Downsampling | Native (5m, 1h) via compactor | None at storage layer | None at storage layer |
| Object storage | S3/GCS/Azure/Swift | Blocks or chunks engine | Blocks engine only |
| Best fit (2024-2025) | Bolt long-term storage onto existing Prometheus | Legacy deployments | New large-scale, multi-tenant platform |

Thanos is the natural choice when you already operate Prometheus clusters and want to bolt on object-store-backed long-term storage with minimal disruption [Source: https://blog.codinghorror.com/the-problem-with-logging/]. A small sidecar runs next to each Prometheus, uploads each finalized 2-hour block to S3 (or GCS, Azure, MinIO), and exposes the local TSDB to a central Querier for “recent data.” A Store Gateway serves historical blocks from object storage; the Compactor merges and downsamples them to 5-minute and 1-hour resolutions, which is what makes long-range queries (90 days, 1 year) fast.

Cortex and Grafana Mimir take a different approach: they are remote-write-native, horizontally scalable, multi-tenant metric backends. Prometheus (or any OTel Collector or Agent) ships samples via remote write to a load-balanced endpoint that fans them out to distributors, then to ingesters that keep recent data in memory and a WAL before flushing TSDB blocks to object storage. Store-gateways serve historical reads, and a query-frontend / query-scheduler layer parallelizes and caches PromQL queries.

Cortex was the original of these two; Mimir is its evolved successor from Grafana Labs, with simplified storage (blocks only — no legacy chunks engine), better defaults, and significantly improved query performance at scale. In 2024-2025, most new “central metrics platform” deployments choose Mimir over Cortex, and existing Cortex shops are gradually migrating.

When choosing between them, a useful rule of thumb:

Figure 3.5: Long-term storage architectures — Thanos vs. Mimir/Cortex

flowchart TB
    subgraph Thanos["Thanos (sidecar pattern)"]
        direction LR
        TP[Prometheus] --> TS[Thanos Sidecar]
        TS -->|upload 2h blocks| TOS[(Object Store<br/>S3 / GCS)]
        TQ[Thanos Querier] --> TS
        TQ --> TSG[Store Gateway]
        TSG --> TOS
        TC[Compactor<br/>downsample 5m / 1h] --> TOS
    end

    subgraph Mimir["Mimir / Cortex (remote-write platform)"]
        direction LR
        MP[Prometheus / Agent] -->|remote_write| MD[Distributor]
        MD --> MI[Ingester<br/>head + WAL]
        MI -->|flush blocks| MOS[(Object Store)]
        MQF[Query Frontend<br/>+ cache] --> MQ[Querier]
        MQ --> MI
        MQ --> MSG[Store Gateway]
        MSG --> MOS
    end

Key Takeaway: Local TSDB is fast, durable, and short-term: WAL + head + 2-hour blocks + compaction + retention. For longer horizons and multi-cluster scale, remote write ships samples to a backend like Thanos (sidecar + downsampling for existing Prometheus), Mimir (multi-tenant remote-write platform), or Cortex (legacy predecessor of Mimir).


Chapter Summary

Prometheus is a small distributed system bundled into a single binary. Its retrieval subsystem pulls metrics over HTTP from targets discovered via service discovery; its TSDB persists those samples through a WAL into 2-hour blocks that compact into larger blocks over time; its HTTP API answers PromQL queries; and its rules engine evaluates recording rules and fires alerts to an external Alertmanager.

The pull model gives Prometheus implicit liveness, server-controlled load, and a debuggable wire protocol, at the cost of awkwardness around short-lived jobs and firewalled targets. The Pushgateway is a narrow escape hatch — appropriate for job-level batch metrics, dangerous when used for service-level SLOs. The OpenTelemetry Collector is increasingly the modern answer for genuinely push-shaped workloads.

The multi-dimensional data model — metric name plus a label set defines a unique time series — is the conceptual key to PromQL and to the operational pitfalls (cardinality explosion) that come with mishandling labels. The OpenMetrics text format is intentionally simple enough to debug with curl and write with printf.

Underneath, the TSDB is one of the more elegant pieces of open-source storage engineering: immutable time-bounded blocks, an index built around label postings lists, XOR-compressed memory-mapped chunks, and a WAL with checkpoints for crash recovery. Remote write is how Prometheus integrates with Thanos, Cortex, and Mimir to extend retention from weeks to years and to scale to many tenants and clusters.

When something goes wrong in production — slow queries, OOMing servers, missing data, runaway cardinality — the mental model from this chapter is the map. The next chapter builds on it by exploring PromQL itself, the query language that makes all of this storage useful.


Key Terms

| Term | Definition |
| --- | --- |
| TSDB | Prometheus’s purpose-built Time-Series Database: an in-memory head block plus immutable on-disk blocks (chunks + index + meta.json), protected by a write-ahead log. |
| Scrape | A single HTTP GET against a target’s /metrics endpoint by Prometheus’s retrieval loop, which parses the response and appends samples to the TSDB. |
| Pushgateway | A standalone server that caches pushed metrics for short-lived batch jobs; Prometheus scrapes the Pushgateway like any other target. Misused as a general push ingest, it causes stale and ambiguous data. |
| Service discovery | The mechanism by which Prometheus learns the current list of scrape targets (from Kubernetes, Consul, cloud APIs, DNS, or files) rather than from static config. |
| OpenMetrics | The IETF-tracked text exposition format Prometheus uses on /metrics endpoints; the successor to the original Prometheus text format. |
| Relabeling | A declarative pipeline (relabel_configs and metric_relabel_configs) for selecting targets, rewriting labels, and dropping samples before and after a scrape. |
| Remote write | The HTTP-based protocol Prometheus uses to ship every ingested sample to a remote backend such as Thanos Receive, Cortex, Mimir, or VictoriaMetrics. |
| Federation | A pattern where one Prometheus scrapes a subset of another Prometheus’s series via the /federate endpoint, typically to aggregate recording-rule outputs across clusters. |

Chapter 4: PromQL — Querying Time-Series Data

PromQL (the Prometheus Query Language) is the lens through which every metric in a Prometheus-based observability stack becomes operational insight. Without PromQL, the millions of samples Prometheus diligently scrapes are just numbers on a disk. With it, you can express questions like “what is the 99th-percentile checkout latency per region, excluding canary pods, over the last five minutes?” in a single line. That power, however, comes with sharp edges: a misplaced label, a counter that resets at the wrong moment, or a quantile computed in the wrong order can turn a confident dashboard into a comforting lie.

Think of PromQL as a spreadsheet formula language for time. Where a spreadsheet operates on rows and columns of static values, PromQL operates on labeled streams of timestamps and floats. Every expression returns a vector, and every vector has a shape — number of series, set of labels, and a temporal extent. Master the shape, and the language follows.

Learning Objectives

By the end of this chapter, you will be able to:

Figure 4.1: PromQL data-shape transitions

graph TD
    RAW["Raw samples in TSDB<br/>(timestamp, value, labels)"]
    SEL["Selector<br/>http_requests_total{job=&quot;api&quot;}"]
    IV["Instant Vector<br/>one sample per series at eval time"]
    RV["Range Vector<br/>append [5m] — many samples per series"]
    AGG["Aggregated Instant Vector<br/>sum/avg/topk by labels"]
    S["Scalar<br/>single number, no labels"]

    RAW --> SEL
    SEL --> IV
    IV -->|"append [duration]"| RV
    RV -->|"rate, increase, *_over_time"| IV
    IV -->|"sum by, avg by, topk"| AGG
    AGG -->|"scalar()"| S
    S -->|"comparison, arithmetic"| IV

4.1 PromQL Fundamentals

Before writing useful queries, you need a precise mental model of the data types PromQL manipulates. Confusion about these types is the single largest source of “why isn’t my query returning anything?” tickets.

Instant Vectors, Range Vectors, and Scalars

PromQL has four expression types, but for day-to-day work three of them dominate:

| Type | What it is | Example | When you use it |
| --- | --- | --- | --- |
| Instant vector | A set of time series, each containing one sample at the evaluation timestamp | http_requests_total | Most query results, dashboard panels, alert conditions |
| Range vector | A set of time series, each containing a range of samples going back in time | http_requests_total[5m] | Input to rate, increase, *_over_time functions |
| Scalar | A single numeric value (no labels, no time series) | 0.99, time() | Thresholds, quantile arguments, arithmetic constants |
| String | A literal string (rarely used outside label_replace) | "prod" | Function arguments only |

The crucial rule: most functions and operators that you think of as “PromQL math” require an instant vector. The comparison operator >, the binary operator +, and aggregation operators all reject range vectors. Range vectors exist almost exclusively to feed time-windowed functions like rate(), avg_over_time(), or increase().

# Instant vector: one sample per series at "now"
http_requests_total

# Range vector: every sample in the past 5 minutes per series
http_requests_total[5m]

# Scalar: just a number
0.95

# Functions transform range vectors back into instant vectors
rate(http_requests_total[5m])    # instant vector again

Analogy: an instant vector is a single Polaroid snapshot of all your series right now. A range vector is a flip-book of snapshots covering the last five minutes. Functions like rate() are the flip-book reader that summarizes the motion into a single number per series, handing you back a new Polaroid.

Selectors and Label Matchers

A selector chooses which series to retrieve. Every PromQL query starts with one. The selector has two parts: the metric name and an optional set of label matchers in {...} braces.

# All series for this metric, across every label combination
http_requests_total

# Filter by exact label match
http_requests_total{job="api", method="GET"}

# Regex match (note the =~ operator)
http_requests_total{code=~"5.."}

# Negative regex match
http_requests_total{code!~"2..|3.."}

# Negative exact match
http_requests_total{environment!="canary"}

Four matcher operators exist: = (equals), != (not equals), =~ (regex match), and !~ (regex doesn’t match). Regexes are anchored on both ends automatically — code=~"5.." matches 500, 503, and 599 but not 5000.
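Since PromQL regexes use Go's RE2 syntax, the anchoring rule is easy to reproduce with the standard regexp package. The helper matches below is a hypothetical illustration of full anchoring, not code taken from Prometheus.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether value matches pattern when the pattern is anchored
// on both ends. The non-capturing group keeps alternations such as "2..|3.."
// fully anchored instead of anchoring only their first and last branches.
func matches(pattern, value string) bool {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	return re.MatchString(value)
}

func main() {
	for _, code := range []string{"500", "503", "599", "5000", "404"} {
		fmt.Printf("code=~\"5..\" vs %s: %v\n", code, matches("5..", code))
	}
	// A negative matcher (!~) simply inverts the result.
	fmt.Println(!matches("2..|3..", "404")) // true: 404 passes code!~"2..|3.."
}
```

An unanchored search for 5.. would also accept 5000, which is exactly the surprise the anchoring rule prevents.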

You can also select by the special __name__ label, which is how PromQL internally represents the metric name. This trick is occasionally useful when the metric name itself needs filtering:

{__name__=~"http_.*", job="api"}

Offset and @ Modifiers

Two modifiers let you shift queries through time. They’re the difference between “requests right now” and “requests at exactly 9:00 AM yesterday.”

The offset modifier shifts the query backwards by a relative duration:

# Current request rate
rate(http_requests_total[5m])

# Request rate from one week ago, same lookback
rate(http_requests_total[5m] offset 1w)

# Week-over-week ratio
rate(http_requests_total[5m])
  /
rate(http_requests_total[5m] offset 1w)

The @ modifier (introduced in Prometheus 2.25) pins the query to an absolute Unix timestamp, which is invaluable for reproducible alerts and post-incident analysis:

# Request rate as observed at 2025-01-15 14:30:00 UTC
rate(http_requests_total[5m] @ 1736951400)

# Combined: 5-minute rate, 1 hour before a fixed timestamp
rate(http_requests_total[5m] @ 1736951400 offset 1h)

Subqueries extend this further by letting you build a range vector out of an instant-vector expression on the fly, evaluated at a step you specify:

# Max 5-minute request rate observed over the last hour, sampled every 1m
max_over_time(
  rate(http_requests_total[5m])[1h:1m]
)

The [1h:1m] syntax says: “evaluate the inner expression every 1 minute over the last 1 hour and assemble those results into a range vector.” Subqueries are powerful but expensive — they multiply query work — so prefer recording rules for anything you run repeatedly.

Key Takeaway: Every PromQL expression has a shape: instant vector, range vector, or scalar. Most operators require instant vectors; range vectors exist to feed time-window functions like rate(). Master the shape transitions and most “why doesn’t this work?” errors disappear.


4.2 Functions and Operators

This is where PromQL goes from “selecting data” to “answering questions.” The functions in this section are the workhorses of every production dashboard and alert.

rate, irate, and increase on Counters

Counters are the most common Prometheus metric type — monotonically increasing numbers like http_requests_total or node_network_transmit_bytes_total. Raw counter values are almost never useful on their own; what you care about is how fast they’re growing. Three functions answer that question with subtly different semantics [Source: https://developers.openai.com/cookbook/examples/gpt4-1_prompting_guide].

| Function | Output | Computation | Best for | Avoid for |
| --- | --- | --- | --- | --- |
| rate(v[w]) | Per-second average rate over window w | Difference between the first and last samples in the range, adjusted for counter resets, extrapolated to the window edges, and divided by the window length | Dashboards, most alerts, trend analysis | Detecting brief spikes; very short windows |
| irate(v[w]) | Instantaneous per-second rate | Uses only the last two samples in the range | Zoomed-in graphs of volatile, fast-moving counters | Alerts and long-term graphs (too noisy) |
| increase(v[w]) | Total increments across window w | Equivalent to rate(v[w]) * w_seconds, with the same extrapolation | SLO budgets, “how many events” questions | When you actually want a rate; expecting integers |

All three operate only on counters and automatically handle counter resets — when the value drops (say, from 12345 back to 0 after a pod restart), the negative jump is ignored and only post-reset increments count. Critically, none of them work correctly on gauges; for gauges, reach for avg_over_time, max_over_time, delta, or deriv.

# Smooth requests-per-second dashboard, aggregated by service
sum by (service) (
  rate(http_requests_total[5m])
)

# Spike detection on a zoomed-in graph: instantaneous error ratio per service
# (for alert rules prefer rate(); irate()'s jitter can keep resetting a "for" clause)
sum by (service) (irate(http_requests_total{code=~"5.."}[1m]))
  /
sum by (service) (irate(http_requests_total[1m]))

# SLO accounting: total 5xx errors and total requests over 30 days
sum(increase(http_requests_total{code=~"5.."}[30d]))
  /
sum(increase(http_requests_total[30d]))

A common stumble is dividing before aggregating. A sum of per-series ratios is not the ratio of the sums. Always aggregate the numerator and denominator separately, then divide [Source: https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents]:

# WRONG — produces per-instance ratios that get summed nonsensically
sum by (service) (
  rate(http_requests_total{code=~"5.."}[5m])
    /
  rate(http_requests_total[5m])
)

# RIGHT — aggregate first, then divide
sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
  /
sum by (service) (rate(http_requests_total[5m]))

Choosing the window size matters too. A good rule of thumb is 3–5× the scrape interval: with a 15s scrape, use at least [1m] for rate to have meaningful samples; [5m] is the standard dashboard window because it smooths jitter without hiding real outages. Going too short produces noisy graphs; going too long ([1h]) hides brief incidents [Source: https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-prompting-best-practices].

Figure 4.2: rate vs irate vs increase over a range vector

flowchart LR
    C["Counter samples in [5m] window<br/>t-5m … t-4m … t-3m … t-2m … t-1m … t"]

    C --> R["rate(v[5m])<br/>first and last samples in window,<br/>reset-adjusted and extrapolated"]
    C --> I["irate(v[5m])<br/>uses only the LAST<br/>two samples"]
    C --> N["increase(v[5m])<br/>= rate(v[5m]) * 300s<br/>total increments in window"]

    R --> RO["Smooth per-second rate<br/>good for dashboards and alerts"]
    I --> IO["Spiky per-second rate<br/>good for short spike detection"]
    N --> NO["Total event count over window<br/>good for SLO budgets"]

histogram_quantile and Bucket Math

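Latency, response sizes, and queue depths are typically tracked with histograms, not gauges or counters. A classic Prometheus histogram exposes three families of series for each underlying metric: cumulative bucket counters (<name>_bucket, one series per le upper bound, ending with le="+Inf"), a running total of all observed values (<name>_sum), and a count of observations (<name>_count).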

The histogram_quantile() function reconstructs an approximate distribution from these buckets and linearly interpolates within the bucket that contains your target quantile [Source: https://natesnewsletter.substack.com/p/context-windows-are-a-lie-the-myth].

The math, briefly: for quantile q with cumulative counts b_i at upper bounds u_i and total count N = b_n,

  1. Compute rank r = q × N.
  2. Find the first bucket k where b_k ≥ r.
  3. Interpolate: x_q = u_{k-1} + ((r − b_{k-1}) / (b_k − b_{k-1})) × (u_k − u_{k-1}).

This assumes a uniform distribution within each bucket, which is why bucket boundary design matters so much.
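The three steps above can be sketched in Python. This is an illustrative reimplementation under stated assumptions, not the Prometheus source: buckets arrive sorted and end with +Inf, the lowest bucket is assumed to start at 0, and q is restricted to the range (0, 1]. The bucket counts are invented for the example.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper bound,
    with the last entry being (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total                      # step 1: target rank
    prev_bound, prev_count = 0.0, 0.0     # assumed lower edge of first bucket
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:                 # step 2: first bucket reaching the rank
            if math.isinf(bound):
                # No upper edge to interpolate towards: fall back to the
                # largest finite bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            # step 3: linear interpolation inside the bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + fraction * (bound - prev_bound)
        prev_bound, prev_count = bound, count
    return math.nan

latency_buckets = [(0.1, 50), (0.25, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.99, latency_buckets))
```

With these counts the rank is 99, which lands halfway through the 0.5 to 1.0 bucket, so the estimate is about 0.75 seconds. The real p99 could be anywhere in that bucket, which is the uniform-distribution assumption at work.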

The canonical p99 latency pattern looks like this:

histogram_quantile(
  0.99,
  sum by (le, service) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

Four things must be right for this to produce a meaningful answer [Source: https://learn.microsoft.com/en-us/azure/foundry/openai/concepts/prompt-engineering]:

  1. Apply rate() to the buckets first. Buckets are counters, so you almost always want a rate of observations (or increase() for a fixed window), not the raw cumulative count since process start.
  2. Preserve the le label in aggregation. If you write sum by (service) without including le, you collapse all buckets into a single value and destroy the histogram structure.
  3. Aggregate before taking the quantile, never after. Quantiles are not linear; you cannot average p99s across instances.
  4. Ensure all instances share the same bucket boundaries. Mixing layouts produces meaningless interpolation.

Here is the wrong-vs-right comparison in code:

# WRONG — compute per-instance p99, then sum/avg quantiles (statistically invalid)
sum by (service) (
  histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
)

# WRONG — drops the `le` label, breaking histogram reconstruction
histogram_quantile(
  0.99,
  sum by (service) (rate(http_request_duration_seconds_bucket[5m]))
)

# RIGHT — aggregate buckets preserving `le`, then take one quantile
histogram_quantile(
  0.99,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
)

A related gotcha: if the true p99 falls within the +Inf bucket, histogram_quantile() has no upper edge to interpolate towards and returns the upper bound of the largest finite bucket, so the reported p99 flattens at that value no matter how slow requests really are. The fix is better bucket design: cluster boundaries tightly around your SLO threshold. If your SLO is 300ms p99, you want buckets like 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 1.0 rather than the default 0.005, 0.01, ..., 10.

For average latency, combine _sum and _count:

sum by (service) (rate(http_request_duration_seconds_sum[5m]))
  /
sum by (service) (rate(http_request_duration_seconds_count[5m]))

Native histograms (introduced as experimental in Prometheus 2.40+ and stabilized later) eliminate the le label entirely by storing the bucket structure compactly inside a single series. Querying them is simpler:

# Native histogram p99 — no `le`, no _bucket suffix
histogram_quantile(
  0.99,
  sum by (service) (
    rate(http_request_duration_seconds[5m])
  )
)

| Aspect | Classic histogram | Native histogram |
| --- | --- | --- |
| Bucket representation | One series per le value | One compact value per series |
| Cardinality cost | High (n buckets × m label combinations) | Low (single series per label combination) |
| Aggregation for quantiles | sum by (le, ...) | sum by (...) |
| Tooling maturity | Universal | Newer, some tools assume le exists |

Aggregation Operators

Aggregation operators collapse many series into fewer series. They’re the workhorse of every dashboard panel. PromQL has a fixed set: sum, avg, min, max, count, count_values, stddev, stdvar, topk, bottomk, quantile, and group.

Each can be modified by by (label_list) (keep only these labels) or without (label_list) (drop these labels, keep the rest).

# Total requests per second across the whole fleet
sum(rate(http_requests_total[5m]))

# Grouped by service — one result series per service
sum by (service) (rate(http_requests_total[5m]))

# Alternative: drop the instance and pod labels, keep everything else
# (matches the query above only when service is the sole remaining label)
sum without (instance, pod) (rate(http_requests_total[5m]))

# Top 5 noisiest services by error rate
topk(5,
  sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
)

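# Average CPU usage per node: 1 minus the idle fraction, averaged across cores.
# Assumes a `node` label added by relabeling; node_exporter exposes `instance` by default.
1 - avg by (node) (rate(node_cpu_seconds_total{mode="idle"}[5m]))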

A subtle but important point: by and without are complementary. sum by (service) keeps only the service label and drops everything else. sum without (pod, instance) drops just those two labels and keeps everything else. In high-cardinality environments, without is often safer — it doesn’t accidentally hide labels you forgot to mention.
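The difference between by and without is easiest to see on a concrete label set. The sketch below models a sum aggregation in Python; the function name, the series, and the label values are all hypothetical, and real PromQL also drops the metric name during aggregation, which is not modeled here.

```python
from collections import defaultdict

def aggregate_sum(series, by=None, without=None):
    """Sum series values, grouping on the labels that survive.

    series: list of (labels_dict, value). Pass either `by` or `without`.
    """
    groups = defaultdict(float)
    for labels, value in series:
        if by is not None:
            kept = {k: v for k, v in labels.items() if k in by}
        else:
            kept = {k: v for k, v in labels.items() if k not in (without or ())}
        groups[tuple(sorted(kept.items()))] += value
    return dict(groups)

series = [
    ({"service": "checkout", "region": "eu", "pod": "a"}, 1.0),
    ({"service": "checkout", "region": "us", "pod": "b"}, 2.0),
    ({"service": "cart", "region": "eu", "pod": "c"}, 4.0),
]

by_service = aggregate_sum(series, by=("service",))
without_pod = aggregate_sum(series, without=("pod",))
print(len(by_service))   # 2
print(len(without_pod))  # 3
```

Grouping by service silently merges the eu and us regions, while dropping only pod keeps region as a dimension. That preserved label is what the "safer" argument for without refers to.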

Key Takeaway: Use rate() for dashboards and alerts, irate() for zoomed-in graphs of fast-moving counters, and increase() for “how many events” questions. For histograms, always rate() the buckets first, preserve le in aggregation, and compute the quantile after aggregating. For ratios, aggregate numerator and denominator separately, then divide.


4.3 Recording and Alerting Rules

PromQL queries can be slow. A histogram_quantile() over a 30-day window touching millions of series can take seconds to evaluate, far too slow for a dashboard that refreshes every 10s or an alerting rule that is evaluated every 30s. The fix is rules: queries that Prometheus evaluates on a fixed schedule, storing the results as either new metrics (recording rules) or alert states (alerting rules) [Source: https://sre.google/sre-book/service-best-practices/].

Recording Rules for Expensive Queries

A recording rule pre-computes a query and writes the result back into Prometheus as a new time series. Prometheus pays the query cost once per evaluation interval; every dashboard load then reads a single pre-computed series.

groups:
  - name: http_slos
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))

      - record: job:http_error_ratio:5m
        expr: |
          job:http_errors:rate5m
            /
          job:http_requests:rate5m

Notice the naming convention: level:metric:operation. The level (job, namespace, cluster) identifies the aggregation scope; the metric identifies what’s being measured; the operation describes the transformation (rate5m, histogram_quantile99). This convention is part of the Prometheus operational vocabulary and pays dividends in dashboards and downstream rules [Source: https://www.dynatrace.com/news/blog/site-reliability-done-right/].

# Without recording rule — slow, runs every dashboard refresh
histogram_quantile(0.99,
  sum by (le, service) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

# With recording rule — fast lookup of a single pre-computed series
service:http_request_duration_seconds:histogram_quantile99_5m

A few rules-of-thumb for rule chains:

Alerting Rule Syntax and the for Clause

Alerting rules look like recording rules but produce alert states instead of new time series. The expr field must return an instant vector; every series in that vector becomes an active alert, whatever its value, so the expression normally ends in a comparison (such as > 0.05) that filters out healthy series.

groups:
  - name: http_alerts
    interval: 30s
    rules:
      - alert: HighRequestErrorRate
        expr: job:http_error_ratio:5m > 0.05
        for: 5m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: |
            Error ratio is {{ $value | humanizePercentage }} on job
            {{ $labels.job }} (threshold 5%). See runbook for triage.
          runbook_url: "https://runbooks.example.com/HighRequestErrorRate"

The for clause is the single most important alerting-rule feature for reducing noise. It requires the alert condition to be continuously true for the specified duration before the alert fires. With for: 5m, a one-minute blip in error rate won’t page anyone; a sustained five-minute problem will.
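The pending-then-firing behavior can be modeled as a small state machine. The Python sketch below is an illustration of the idea, not Prometheus's rule evaluator; the function name is hypothetical and the inputs are one boolean per evaluation cycle.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_by_eval: list of booleans, True when the expr returned the series.
    States are "inactive", "pending" (true, but not yet for `for_s`), or "firing".
    """
    states = []
    active_since = None
    for i, is_true in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not is_true:
            active_since = None          # any gap resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states

# 30s evaluation interval, for: 5m
blip = alert_states([True, True, False], eval_interval_s=30, for_s=300)
sustained = alert_states([True] * 11, eval_interval_s=30, for_s=300)
print(blip)           # ['pending', 'pending', 'inactive']
print(sustained[-1])  # firing
```

The one-minute blip never leaves the pending state, while the sustained condition fires on the evaluation that falls 300 seconds after it first became true.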

Guidelines for for durations [Source: https://www.splunk.com/en_us/blog/learn/sre-metrics-four-golden-signals-of-monitoring.html]:

| Alert category | Typical for | Reasoning |
| --- | --- | --- |
| User-visible symptom (errors, slow latency) | 1–5m | Fast page; users already see it |
| Resource saturation (CPU, memory, disk) | 5–15m | Avoid paging on transient spikes |
| SLO burn rate (fast window) | 2–5m | Catches rapid budget burn |
| SLO burn rate (slow window) | 30–60m | Long, sustained budget drift |
| Capacity / “filling up” trends | 1h+ | Days-ahead warnings, not pages |

A critical anti-pattern: don’t use for to mask noisy query design. If your expression is flapping because the underlying rate window is too short, fix the query (longer rate window, more aggregation) rather than lengthening for. The for clause delays alerts; it doesn’t make them more accurate.

Best Practices for Rule Organization

At scale, rules become code. Treating them like code is the single biggest lever for keeping alerting sane [Source: https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started].

A sample rule-organization layout might look like:

rules/
├── recording/
│   ├── http_aggregates.yaml      # job:http_requests:rate5m, etc.
│   ├── slo_aggregates.yaml       # service:availability:ratio_rate30d
│   └── infra_aggregates.yaml     # node:cpu:usage_ratio_avg5m
├── alerts/
│   ├── slo_burn.yaml             # multi-window SLO alerts
│   ├── infra_saturation.yaml     # disk, memory, CPU
│   └── platform_health.yaml      # control-plane symptoms
└── tests/
    └── *.test.yaml                # promtool test rules fixtures

Key Takeaway: Recording rules turn expensive queries into single-series lookups; alerting rules turn those queries into pages. Use the level:metric:operation naming convention, keep rule chains shallow, and tune the for clause to alert category — never to mask noisy queries. Treat rules as code: Git, CI lint, runbook annotations.

Figure 4.3: Recording-rule chain feeding an alert

sequenceDiagram
    participant T as "Target /metrics"
    participant P as "Prometheus scraper"
    participant R1 as "Recording rule (base)<br/>job:http_requests:rate5m"
    participant R2 as "Recording rule (ratio)<br/>job:http_error_ratio:5m"
    participant A as "Alerting rule<br/>HighRequestErrorRate"
    participant AM as "Alertmanager"

    T->>P: "scrape every 15s"
    P->>P: "write samples to TSDB"
    Note over R1: "every 30s eval interval"
    P->>R1: "read raw counters"
    R1->>R1: "sum by (job) (rate(...))"
    R1->>P: "write new series"
    Note over R2: "every 30s eval interval"
    R1->>R2: "read rate5m series"
    R2->>R2: "errors / requests"
    R2->>P: "write ratio series"
    Note over A: "every 30s eval interval"
    R2->>A: "read ratio series"
    A->>A: "expr > 0.05 sustained for 5m"
    A->>AM: "fire alert with labels + annotations"

4.4 Common Pitfalls

PromQL is a small language with a lot of footguns. The pitfalls below are the ones that show up most often in incident postmortems.

Counter Resets Across Restarts

Counters should never decrease, but processes restart. When a counter resets from 12345 back to 0, every rate-family function (rate, irate, increase) detects the drop and treats it as a reset, counting only the post-reset increase. This works automatically and is one of PromQL’s most pleasant surprises.

But there are edge cases:

# WRONG — sum first, then rate. Reset on one instance corrupts the sum.
rate(sum by (job)(http_requests_total)[5m:])

# RIGHT — rate per series first, then aggregate
sum by (job)(rate(http_requests_total[5m]))
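A small numeric sketch shows how summing first corrupts the result. The Python below uses invented counter values for two instances, one of which restarts mid-window, and a simplified reset-aware increase (no extrapolation) standing in for the rate family.

```python
def increase_with_resets(values):
    """Counter increase across consecutive samples, treating any drop as a reset."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += cur - prev if cur >= prev else cur
    return total

instance_a = [1000, 1100, 1200, 1300]  # steady: +100 per scrape
instance_b = [5000, 5100, 50, 150]     # restarts between the 2nd and 3rd scrape

# RIGHT: reset handling per series, then aggregate.
per_series_then_sum = increase_with_resets(instance_a) + increase_with_resets(instance_b)

# WRONG: aggregate raw counters, then apply reset handling to the sum.
summed = [a + b for a, b in zip(instance_a, instance_b)]
sum_then_increase = increase_with_resets(summed)

print(per_series_then_sum)  # 550.0
print(sum_then_increase)    # 1650.0
```

After summing, the drop from 6200 to 1250 looks like a reset of the whole aggregate, so instance A's entire accumulated count gets re-added as if it were new traffic. The true increase is 550; the sum-first version reports three times that.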

Staleness Markers and Missing Samples

Prometheus 2.0+ writes an explicit staleness marker when a target disappears or a series stops being reported, and instant queries stop returning that series from the marker onward. Where no marker exists (for example, samples exposed with explicit timestamps), the series drops out five minutes after its last sample (the default lookback delta) instead of holding the last known value. This is usually what you want, but it has consequences:

# Detect when a critical metric is missing
absent(up{job="payments-api"})

# Default to zero when no data
sum by (service)(rate(payment_failures_total[5m])) or vector(0)

The lookback-delta setting (default 5 minutes) controls how far back PromQL searches for the most recent sample of a series. Don’t change this without strong reason — it has cascading effects on every query in your environment.
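The lookback rule can be sketched as a lookup function. This Python model is an approximation for intuition only: the STALE sentinel stands in for Prometheus's internal staleness marker, the function name is hypothetical, and exact behavior at the window boundary is not modeled.

```python
import bisect

LOOKBACK_DELTA_S = 300   # default lookback delta: 5 minutes
STALE = object()         # stand-in for an explicit staleness marker

def instant_value(samples, t, lookback_s=LOOKBACK_DELTA_S):
    """Value an instant query at time t would see, or None if there is none.

    samples: list of (timestamp_seconds, value) sorted by timestamp.
    """
    timestamps = [ts for ts, _ in samples]
    i = bisect.bisect_right(timestamps, t) - 1   # most recent sample at or before t
    if i < 0:
        return None
    ts, value = samples[i]
    if value is STALE or t - ts > lookback_s:
        return None
    return value

scraped = [(0, 1.0), (15, 2.0)]
print(instant_value(scraped, 100))                  # 2.0
print(instant_value(scraped, 400))                  # None
print(instant_value(scraped + [(30, STALE)], 100))  # None
```

The second query returns nothing because the newest sample is older than the lookback delta; the third returns nothing immediately because a marker was written. In both cases a dashboard shows a gap, which is why absent() and or vector(0) are needed to turn missing data into something an alert can act on.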

Cardinality Explosions from High-Dimensional Labels

Cardinality — the number of unique time series — is the single largest scalability constraint in Prometheus. Each unique combination of metric name and label values is one series; each series consumes memory, disk, and query CPU. A metric with 10 labels each having 100 possible values can in principle produce 10^20 series. In practice, a few hundred thousand series per Prometheus instance is healthy; a few million is painful; tens of millions usually crashes.

Cardinality explodes when a label can take values from an unbounded set:

| Label | Cardinality | Safe in metrics? |
| --- | --- | --- |
| method (GET, POST, PUT, …) | ~10 | Yes |
| status_code (200, 404, 500, …) | ~40 | Yes |
| service | tens | Yes |
| pod | hundreds–thousands; changes constantly | Risky |
| request_path (raw URL) | unbounded | No |
| user_id | millions | No |
| trace_id | every request | Absolutely not |
| email | millions | Absolutely not |

A common real-world failure: an exporter that uses raw request paths as labels generates one series per (method, status, path) tuple. Add /users/{id}/posts/{id} and you have one series per user per post — millions of series before lunch. The fix is to normalize at the source: bucket paths into templates (/users/:id/posts/:id), drop high-cardinality labels before exposing them, or move that information into logs and traces (where cardinality is cheap) instead of metrics.
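Both the multiplication and the normalization fix are easy to sketch in Python. The helper names are hypothetical, the label counts are the rough figures used in this section, and the path normalizer only handles purely numeric segments; a production version would more likely take the route template from the web framework.

```python
import math
import re

def worst_case_series(label_cardinalities):
    """Upper bound on series for one metric: product of per-label value counts."""
    return math.prod(label_cardinalities.values())

def normalize_path(path):
    """Replace purely numeric path segments with a :id placeholder."""
    return re.sub(r"/\d+(?=/|$)", "/:id", path)

labels = {"method": 10, "status_code": 40}
print(worst_case_series(labels))                  # 400
print(worst_case_series({**labels, "pod": 500}))  # 200000
print(normalize_path("/users/42/posts/7"))        # /users/:id/posts/:id
```

The product is a worst case, since not every combination occurs in practice, but it shows how one extra label multiplies everything before it. After normalization, every user and post collapses into a single route value.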

For recording rules, the discipline tightens. A recording rule that retains pod as a grouping label persists one additional series per pod, with a new sample written at every evaluation interval, and pod churn keeps creating more: a hidden multiplier. Standard practice [Source: https://www.dynatrace.com/news/blog/site-reliability-done-right/]:

# RISKY — preserves pod label, creates pod-cardinality series
- record: pod:http_requests:rate5m
  expr: sum by (pod, service)(rate(http_requests_total[5m]))

# SAFER — aggregate pod away in the recording rule
- record: service:http_requests:rate5m
  expr: sum by (service)(rate(http_requests_total[5m]))

Tools for diagnosis:

Key Takeaway: Three pitfalls dominate PromQL incidents: counter resets handled by aggregating after rate() (not before); staleness handled by absent()/or vector(0) for missing data; and cardinality controlled by aggregating away unbounded labels in recording rules and never exposing raw user/path/trace IDs as label values.

Figure 4.4: How label dimensions multiply series count

graph LR
    M["http_requests_total<br/>1 metric, no labels<br/>1 series"]
    M --> L1["+ method<br/>(~10 values)<br/>10 series"]
    L1 --> L2["+ status_code<br/>(~40 values)<br/>400 series"]
    L2 --> L3["+ pod<br/>(~500 churning values)<br/>200,000 series"]
    L3 --> L4["+ request_path raw URL<br/>(unbounded, 50k+)<br/>10,000,000+ series<br/>Prometheus OOM"]
    L4 --> FIX["Fix: normalize path to template<br/>route=&quot;/users/:id/posts/:id&quot;<br/>or drop label entirely"]

Chapter Summary

PromQL is the language of operational truth in a Prometheus-based observability stack. In this chapter you learned to:

The next chapter takes the data we now know how to query and shows how to push it into dashboards, alert receivers, and downstream systems — bringing PromQL into the operational loop of the SRE team.


Key Terms

| Term | Definition |
| --- | --- |
| instant vector | A set of time series each containing a single sample at the evaluation timestamp; the default result type of most PromQL expressions. |
| range vector | A set of time series each containing a range of samples going back in time; produced by appending [duration] to a selector and consumed by functions like rate(). |
| rate | A function returning the per-second average rate of increase of a counter across a range vector, computed from the first and last samples in the window (adjusted for counter resets) and extrapolated to the window edges. |
| histogram_quantile | A function that estimates a quantile by linearly interpolating within the bucket of a classic or native histogram that contains the target rank. |
| recording rule | A configured PromQL expression that Prometheus evaluates on a fixed schedule, writing the result back as a new time series for fast retrieval. |
| alerting rule | A configured PromQL expression whose every returned series becomes an alert, whatever its value; tunable with a for duration to require sustained truth. |
| staleness | The Prometheus behavior of treating a series as gone after a lookback delta (default 5m) with no new samples or after an explicit staleness marker. |
| cardinality | The total number of unique time series; the primary scaling constraint of Prometheus, driven mostly by the cross-product of label-value combinations. |

[Source: https://sre.google/sre-book/service-best-practices/] [Source: https://www.dynatrace.com/news/blog/site-reliability-done-right/] [Source: https://natesnewsletter.substack.com/p/context-windows-are-a-lie-the-myth]


Chapter 5: OpenTelemetry Architecture: API, SDK, and Collector

OpenTelemetry is often described as a single project, but in practice it is three loosely coupled layers working in concert: a small, stable API that instrumentation code calls; a configurable SDK that turns those calls into real spans, metrics, and log records; and an out-of-process Collector that receives, processes, and forwards telemetry. Understanding how these layers fit together — and where the seams are — is the foundation for every decision you will make later about samplers, exporters, deployment topologies, and vendor choice.

This chapter zooms out from individual signals (Chapters 2–4) and looks at OpenTelemetry as a system. By the end, you should be able to read an OpenTelemetry architecture diagram, predict where a given piece of configuration belongs, and choose a Collector deployment topology appropriate for the workload in front of you.

Learning Objectives

By the end of this chapter, you will be able to:

Figure 5.1: OpenTelemetry three-layer architecture and OTLP data flow

flowchart TD
    subgraph App["Application Process"]
        Lib[Library Code<br/>depends on API only]
        Code[Application Code<br/>depends on API only]
        API[OpenTelemetry API<br/>Tracer / Meter / Logger interfaces]
        SDK[OpenTelemetry SDK<br/>samplers + processors + exporters + resource]
        Lib --> API
        Code --> API
        API --> SDK
    end
    SDK -->|"OTLP/gRPC :4317 or OTLP/HTTP :4318"| Col
    subgraph Col["OpenTelemetry Collector (out-of-process)"]
        Recv[Receivers<br/>OTLP, Prometheus, filelog]
        Proc[Processors<br/>batch, memory_limiter, k8sattributes, tail_sampling]
        Exp[Exporters<br/>OTLP, vendor-specific]
        Recv --> Proc --> Exp
    end
    Exp -->|"OTLP or vendor protocol"| BE[("Backend<br/>Prometheus / Tempo / Loki / Vendor SaaS")]

1. API vs SDK vs Collector

OpenTelemetry’s most important architectural decision is splitting instrumentation surface from pipeline implementation, and then separating both from the out-of-process telemetry agent. Each layer has a distinct audience, a distinct release cadence, and a distinct dependency footprint [Source: https://opentelemetry.io/docs/concepts/components/].

1.1 The API: a stable interface for instrumentation

The API is what library authors and application code import. It defines types like TracerProvider, Tracer, Span, MeterProvider, Meter, LoggerProvider, and Logger, along with global access points (GlobalOpenTelemetry.getTracer(...) in Java, opentelemetry.trace.get_tracer(...) in Python). Crucially, the API defines interfaces only — it does not know about exporters, samplers, batching, OTLP, or any backend [Source: https://opentelemetry.io/docs/specs/otel/].

Think of the API as the plug on the back of an appliance. The shape of the plug is standardized and changes very slowly. Whether the wall socket is connected to a hydroelectric dam, a solar panel, or nothing at all is not the appliance’s problem.

A library that emits a span looks like this in Java:

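import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;

// ExampleLibrary is an illustrative class name.
public final class ExampleLibrary {
    // The scope name and version identify the instrumented library, not the application.
    private static final Tracer tracer =
        GlobalOpenTelemetry.getTracer("com.example.library", "1.0.0");

    void doWork() {
        Span span = tracer.spanBuilder("doWork").startSpan();
        try {
            span.setAttribute("foo", "bar");
            // library logic
        } finally {
            // Always end the span, even when the library logic throws.
            span.end();
        }
    }
}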

Note what is missing: no exporter, no endpoint, no sampling rate, no environment variable parsing. The library cannot accidentally pull in a vendor SDK, gRPC client, or HTTP exporter as a transitive dependency [Source: https://opentelemetry.io/docs/languages/java/instrumentation/].

If no SDK is registered, the API returns no-op implementations. Spans are created but never recorded; metric updates evaporate; log records are dropped. The cost is a few function calls and a small allocation — safe to leave instrumentation enabled even in latency-sensitive paths [Source: https://opentelemetry.io/docs/specs/otel/].

1.2 The SDK: a configurable pipeline implementation

The SDK is what application developers wire up at startup. It replaces the API’s no-op providers with concrete implementations that actually record data, apply samplers, run processors, and call exporters [Source: https://opentelemetry.io/docs/specs/otel/].

A minimal Java SDK initialization looks like this:

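import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

// Exporter: sends finished spans to the Collector over OTLP/gRPC.
OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
    .setEndpoint("http://collector:4317")
    .build();

// Tracer provider: batches spans and attaches the service's resource attributes.
SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
    .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
    .setResource(Resource.getDefault().toBuilder()
        .put("service.name", "my-service")
        .build())
    .build();

// Replace the API's no-op providers with the configured SDK.
OpenTelemetrySdk.builder()
    .setTracerProvider(tracerProvider)
    .buildAndRegisterGlobal();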

The SDK owns four moving parts: samplers, processors, exporters, and the resource that describes the emitting service.

1.3 Why the split matters

The API/SDK split is not bureaucratic — it directly enables vendor neutrality:

| Concern | API only | SDK |
| --- | --- | --- |
| Audience | Library authors, app code | Application operators, platform teams |
| Release cadence | Slow, stable | Faster, more features |
| Dependencies | Tiny (interfaces) | Heavy (exporters, gRPC, processors) |
| Default behavior | No-op | Records and exports data |
| Vendor coupling | None | Choose any exporter |

A widely used HTTP client library depending only on opentelemetry-api adds essentially zero weight and zero opinion about your backend. The same library can be used in an app that exports to Jaeger, in an app that exports to a SaaS vendor, and in an app that disables telemetry entirely — without recompilation [Source: https://opentelemetry.io/docs/concepts/components/].
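The mechanism behind this can be shown with a toy model. The Python sketch below is not the OpenTelemetry API: every class and function name is hypothetical, and it exists only to show how a library written against an interface does nothing by default and starts recording once the application registers an implementation.

```python
class NoopSpan:
    """'API' default: accepts every call and records nothing."""
    def set_attribute(self, key, value):
        pass
    def end(self):
        pass

class NoopTracer:
    def start_span(self, name):
        return NoopSpan()

class RecordingSpan:
    """'SDK' implementation: stores the span in a sink when it ends."""
    def __init__(self, name, sink):
        self.name, self.attributes, self.sink = name, {}, sink
    def set_attribute(self, key, value):
        self.attributes[key] = value
    def end(self):
        self.sink.append((self.name, dict(self.attributes)))

class RecordingTracer:
    def __init__(self, sink):
        self.sink = sink
    def start_span(self, name):
        return RecordingSpan(name, self.sink)

_tracer = NoopTracer()           # global access point, no-op until registered

def get_tracer():
    return _tracer

def register_tracer(tracer):     # called once by the application at startup
    global _tracer
    _tracer = tracer

def library_do_work():
    """Library code: knows only get_tracer(), nothing about sinks or exporters."""
    span = get_tracer().start_span("doWork")
    try:
        span.set_attribute("foo", "bar")
    finally:
        span.end()

library_do_work()                # nothing registered: the call is a no-op
recorded = []
register_tracer(RecordingTracer(recorded))
library_do_work()                # same library code, now recorded
print(recorded)                  # [('doWork', {'foo': 'bar'})]
```

The library function is identical in both calls; only the application's registration step changed what happens to the span.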

1.4 The Collector: an out-of-process pipeline

The Collector is a separate binary written in Go that runs outside your application. It speaks OTLP (and many other protocols) on its receivers, runs processors on the in-memory pipeline, and sends data out through exporters [Source: https://opentelemetry.io/docs/collector/].

receivers  →  processors  →  exporters

Figure 5.4: Collector pipeline anatomy

flowchart LR
    subgraph Receivers
        R1[OTLP gRPC :4317]
        R2[OTLP HTTP :4318]
        R3[Prometheus scrape]
        R4[filelog tail]
    end
    subgraph Processors
        P1[memory_limiter<br/>backpressure]
        P2[k8sattributes<br/>add pod/namespace]
        P3[tail_sampling<br/>keep errors + slow]
        P4[batch<br/>group for efficiency]
        P1 --> P2 --> P3 --> P4
    end
    subgraph Exporters
        E1[OTLP to backend]
        E2[Vendor exporter]
        E3[debug / logging]
    end
    R1 --> P1
    R2 --> P1
    R3 --> P1
    R4 --> P1
    P4 --> E1
    P4 --> E2
    P4 --> E3

Why move logic out of the application?

Key Takeaway: OpenTelemetry deliberately separates a stable instrumentation surface (API) from a configurable in-process pipeline (SDK) from an out-of-process aggregator (Collector). Libraries depend only on the API, applications own the SDK, and operations teams own the Collector — each layer can evolve independently without breaking the others.

2. Cross-Language Architecture

OpenTelemetry promises a consistent mental model across more than a dozen languages. The architecture above repeats almost identically in Java, Python, Go, .NET, Node.js, Ruby, PHP, Rust, C++, Swift, and others. What ties them together is a shared specification, a shared set of semantic conventions, and a shared wire protocol (OTLP) [Source: https://opentelemetry.io/docs/specs/otel/].

2.1 Language support matrix and stability levels

Each language SIG (Special Interest Group) implements the spec at its own pace. The OpenTelemetry project tracks stability per signal per language: a language might be GA for traces, beta for metrics, and experimental for logs. This matters when you adopt OpenTelemetry in a polyglot environment — a Java service might emit fully GA telemetry while a sibling Node.js service is still using a beta logs SDK [Source: https://opentelemetry.io/docs/languages/].

A high-level snapshot (consult the docs for current status):

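| Signal | Java | Python | Go | .NET | Node.js |
| --- | --- | --- | --- | --- | --- |
| Traces | Stable | Stable | Stable | Stable | Stable |
| Metrics | Stable | Stable | Stable | Stable | Stable |
| Logs | Stable | Beta/Stable | Beta | Stable | Beta |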

The pattern: traces stabilized first, metrics followed, logs are the most recent and still maturing in some languages.

2.2 Semantic conventions: the lingua franca

If every team picks its own attribute names — http.statusCode vs http_status vs httpResponse.code — your “vendor-neutral” telemetry becomes useless. Semantic conventions are the OpenTelemetry project’s standardized vocabulary for resource and span attributes [Source: https://opentelemetry.io/docs/concepts/semantic-conventions/].

Examples:

The payoff: a dashboard, alert, or query written against http.response.status_code works identically whether the data came from a Java service, a Go service, or a Python service. Backends like Grafana, Datadog, Honeycomb, and Tempo can build out-of-the-box visualizations because they know exactly what to look for.

Analogy: semantic conventions are to telemetry what HTTP status codes are to the web. Without them, every server could invent its own “the page worked” signal; with them, every browser, proxy, and dashboard knows what 200 means.

2.3 OTLP: the wire protocol that ties it together

OTLP (OpenTelemetry Protocol) is the bridge between SDKs, Collectors, and OTLP-compatible backends. It defines:

The structure of a trace export request is layered:

ExportTraceServiceRequest
  └── ResourceSpans            (one per Resource, e.g., per service)
       ├── Resource            (attributes like service.name)
       └── ScopeSpans          (one per instrumentation library)
            ├── InstrumentationScope (name, version)
            └── Span[]         (trace_id, span_id, attributes, events, status)

The same proto messages are used across all three transports — only the encoding and HTTP/gRPC framing differ [Source: https://github.com/open-telemetry/opentelemetry-proto].
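As a concrete illustration of that nesting, the Python sketch below assembles an OTLP/JSON trace payload as a plain dictionary. The helper name is hypothetical, the IDs and timestamps are made-up values, and only a handful of span fields are shown; the field names follow the JSON mapping of the proto messages as I understand it (camelCase keys, hex-encoded IDs, 64-bit integers as strings), so check them against the proto definitions before relying on them.

```python
import json

def build_trace_request(service_name, scope_name, scope_version, spans):
    """Wrap spans in the Resource -> Scope -> Span layering of a trace export."""
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": scope_name, "version": scope_version},
                "spans": spans,
            }],
        }]
    }

span = {
    "traceId": "0af7651916cd43dd8448eb211c80319c",   # 16 bytes, hex-encoded
    "spanId": "b7ad6b7169203331",                     # 8 bytes, hex-encoded
    "name": "doWork",
    "kind": 1,                                        # SPAN_KIND_INTERNAL
    "startTimeUnixNano": "1700000000000000000",
    "endTimeUnixNano": "1700000000050000000",
    "attributes": [{"key": "foo", "value": {"stringValue": "bar"}}],
}

request = build_trace_request("my-service", "com.example.library", "1.0.0", [span])
print(json.dumps(request, indent=2))
```

A payload of this shape is what an OTLP/HTTP/JSON exporter would POST to the traces endpoint; the protobuf transports carry the same structure in binary form.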

2.4 gRPC vs HTTP/protobuf vs HTTP/JSON

| Aspect | OTLP/gRPC | OTLP/HTTP/protobuf | OTLP/HTTP/JSON |
| --- | --- | --- | --- |
| Default port | 4317 | 4318 | 4318 |
| Encoding | Protobuf (binary) | Protobuf (binary) | JSON (text) |
| Transport | gRPC over HTTP/2 | HTTP/1.1 or HTTP/2 | HTTP/1.1 or HTTP/2 |
| Multiplexing | Yes (HTTP/2 streams) | Depends on HTTP version | Depends |
| Wire overhead | Lowest | Low | Highest (text, verbose) |
| Proxy/LB friendliness | Needs HTTP/2 + gRPC-aware LBs | Standard HTTP infra | Standard HTTP infra |
| Debuggability | Hardest (binary + gRPC) | Medium (binary) | Easiest (curl-able) |
| Browser support | No (gRPC-Web is different) | Yes | Yes |

Practical guidance:

Default endpoints for OTLP/HTTP are per-signal paths on port 4318: /v1/traces, /v1/metrics, and /v1/logs.

A common configuration mistake is mismatching protocol and port:

# WRONG: gRPC port with HTTP exporter
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf

This produces errors like “unexpected response code” or “transport error.” Pair grpc with port 4317 or http/protobuf with port 4318; do not mix them [Source: https://opentelemetry.io/docs/specs/otlp/].
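A startup check can catch this mismatch before the first export fails. The Python sketch below is a hypothetical helper, not part of any OpenTelemetry SDK: it only warns when the endpoint uses the other protocol's default port, because custom ports are legitimate.

```python
from urllib.parse import urlparse

# Conventional default ports per OTLP protocol value.
DEFAULT_PORTS = {"grpc": 4317, "http/protobuf": 4318, "http/json": 4318}

def check_otlp_endpoint(endpoint, protocol):
    """Return a warning string if the port looks mismatched, else None."""
    if protocol not in DEFAULT_PORTS:
        raise ValueError(f"unknown OTLP protocol: {protocol!r}")
    port = urlparse(endpoint).port
    expected = DEFAULT_PORTS[protocol]
    # Only flag the classic mix-up: the *other* transport's default port.
    if port is not None and port != expected and port in DEFAULT_PORTS.values():
        return (f"endpoint port {port} is the default for a different OTLP "
                f"transport; protocol {protocol!r} normally uses {expected}")
    return None

print(check_otlp_endpoint("http://collector:4317", "http/protobuf"))
print(check_otlp_endpoint("http://collector:4318", "http/protobuf"))  # None
```

In practice the two arguments would come from OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_PROTOCOL, read once when the process starts.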

Figure 5.3: OTLP export request flow across transport variants

sequenceDiagram
    participant SDK as SDK Exporter
    participant Col as Collector OTLP Receiver
    Note over SDK,Col: OTLP/gRPC on :4317
    SDK->>Col: HTTP/2 frame: TraceService.Export(ExportTraceServiceRequest, protobuf)
    Col-->>SDK: ExportTraceServiceResponse (may include partial_success)
    Note over SDK,Col: OTLP/HTTP/protobuf on :4318
    SDK->>Col: "POST /v1/traces  Content-Type: application/x-protobuf"
    Col-->>SDK: "200 OK  protobuf body (partial_success)"
    Note over SDK,Col: OTLP/HTTP/JSON on :4318
    SDK->>Col: "POST /v1/traces  Content-Type: application/json"
    Col-->>SDK: "200 OK  JSON body (partial_success)"
    Note over SDK,Col: Retry only on UNAVAILABLE / 5xx / 429 with backoff

2.5 Partial success and retry semantics

Both gRPC and HTTP OTLP support partial success: the response carries a partial_success field with rejected_spans (or rejected data points / log records) and an error_message. The transport-level status is still success, and the client must not retry the request: rejected items are permanently bad (e.g., malformed or otherwise invalid), so resending them will not help [Source: https://opentelemetry.io/docs/specs/otlp/].
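On the client side, handling a partial success mostly means surfacing the rejection count instead of silently discarding it. The Python sketch below parses an already-decoded OTLP/HTTP/JSON response body; the function name is hypothetical, and the camelCase keys are my reading of the JSON mapping of partial_success, so treat them as an assumption.

```python
def summarize_partial_success(body, rejected_field="rejectedSpans"):
    """Return (rejected_count, error_message) from a decoded export response.

    Use rejected_field="rejectedDataPoints" for metrics or
    "rejectedLogRecords" for logs.
    """
    partial = body.get("partialSuccess") or {}
    # 64-bit integers may arrive as JSON strings, so normalize with int().
    rejected = int(partial.get(rejected_field, 0))
    message = partial.get("errorMessage", "")
    return rejected, message

full = summarize_partial_success({})
partial = summarize_partial_success(
    {"partialSuccess": {"rejectedSpans": "3", "errorMessage": "invalid span"}}
)
print(full)     # (0, '')
print(partial)  # (3, 'invalid span')
```

A non-zero count is worth logging and counting as a metric of its own, since the transport status alone will keep reporting success.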

Retry policy is signaled by the transport status:

Key Takeaway: OpenTelemetry’s cross-language consistency rests on three pillars: the specification (which defines API/SDK semantics per signal), semantic conventions (a shared vocabulary so any backend can interpret data from any language), and OTLP (the wire protocol that carries telemetry between SDK, Collector, and backend across three transport variants).

3. Collector Deployment Topologies

The Collector is the most operationally flexible piece of OpenTelemetry. The same binary, with different configuration, can be deployed as a sidecar inside a pod, as a DaemonSet per node, or as a centralized fleet of pods behind a service. Most production Kubernetes environments combine more than one of these patterns [Source: https://opentelemetry.io/docs/collector/deployment/].

Figure 5.2: Agent, gateway, and hybrid Collector topologies

flowchart TD
    subgraph AgentMode["Agent topology (DaemonSet)"]
        direction TB
        A_App1[App Pod<br/>Node 1]
        A_App2[App Pod<br/>Node 2]
        A_Ag1[Collector Agent<br/>Node 1]
        A_Ag2[Collector Agent<br/>Node 2]
        A_BE[(Backend)]
        A_App1 -->|"OTLP localhost"| A_Ag1
        A_App2 -->|"OTLP localhost"| A_Ag2
        A_Ag1 --> A_BE
        A_Ag2 --> A_BE
    end
    subgraph GatewayMode["Gateway topology (Deployment)"]
        direction TB
        G_App1[App Pod<br/>Node 1]
        G_App2[App Pod<br/>Node 2]
        G_GW[Gateway Collectors<br/>centralized Deployment + Service]
        G_BE[(Backend)]
        G_App1 -->|"OTLP cluster Service"| G_GW
        G_App2 -->|"OTLP cluster Service"| G_GW
        G_GW --> G_BE
    end
    subgraph HybridMode["Hybrid topology (recommended)"]
        direction TB
        H_App1[App Pod<br/>Node 1]
        H_App2[App Pod<br/>Node 2]
        H_Ag1[Agent<br/>Node 1]
        H_Ag2[Agent<br/>Node 2]
        H_GW[Gateway Collectors<br/>tail sampling + tenant routing + auth]
        H_BE[(Backend)]
        H_App1 -->|"OTLP localhost"| H_Ag1
        H_App2 -->|"OTLP localhost"| H_Ag2
        H_Ag1 -->|"OTLP cross-node"| H_GW
        H_Ag2 -->|"OTLP cross-node"| H_GW
        H_GW --> H_BE
    end

3.1 Agent mode (sidecar or DaemonSet)

In agent mode, a Collector lives next to the workload:

Agents excel at:

A minimal agent receivers/processors/exporters chain:

receivers:
  otlp:
    protocols:
      grpc:
      http:
  filelog:
    include: [ /var/log/containers/*.log ]
  kubeletstats:
    collection_interval: 10s
processors:
  memory_limiter:
    check_interval: 1s        # required; the processor rejects a zero interval
    limit_percentage: 75
  k8sattributes:
    auth_type: serviceAccount
  batch:
    timeout: 5s
    send_batch_size: 8192
exporters:
  otlp:
    endpoint: otel-gateway.observability.svc.cluster.local:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp]

3.2 Gateway mode (centralized service)

In gateway mode, one or more Collectors run as a Deployment behind a Kubernetes Service (or external load balancer). Apps — or, more commonly, agents — send telemetry to this central fleet [Source: https://opentelemetry.io/docs/collector/deployment/].

Gateways excel at tail sampling, centralized auth, and multi-tenant routing.

For correct tail sampling across multiple gateway replicas, you need trace-ID-aware load balancing so that all spans for a given trace ID hit the same gateway pod. The loadbalancing exporter or an L7 load balancer with consistent hashing is the typical solution.
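The routing requirement can be illustrated with a small sketch. It uses rendezvous (highest-random-weight) hashing, chosen here only because it is short; it is not a description of how the loadbalancing exporter is implemented, and the gateway addresses are hypothetical.

```python
import hashlib

def pick_gateway(trace_id: str, gateways: list) -> str:
    """Return the gateway with the highest hash score for this trace ID."""
    def score(gateway: str) -> int:
        digest = hashlib.sha256(f"{gateway}|{trace_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(gateways, key=score)

# Hypothetical gateway pod addresses.
GATEWAYS = ["otel-gateway-0:4317", "otel-gateway-1:4317", "otel-gateway-2:4317"]

spans = [
    {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "name": "GET /checkout"},
    {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "name": "SELECT orders"},
    {"trace_id": "00f067aa0ba902b7a3ce929d0e0e4736", "name": "POST /pay"},
]

# Every span is routed by its trace ID alone, never by span name or source node.
routes = {s["name"]: pick_gateway(s["trace_id"], GATEWAYS) for s in spans}
```

Spans that share a trace ID always map to the same gateway, no matter which agent forwards them. Removing a gateway that a trace was not assigned to leaves that trace's assignment unchanged, which keeps in-flight sampling decisions stable during scale-down.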

A gateway pipeline with tail sampling:

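processors:
  memory_limiter:
    check_interval: 1s       # required: how often memory usage is checked
    limit_percentage: 75
  tail_sampling:
    decision_wait: 5s        # how long to wait for a trace's spans before deciding
    num_traces: 50000        # traces held in memory while waiting
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: default
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
  batch:                     # place after tail_sampling in the traces pipeline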
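To make the two policies concrete, here is a minimal sketch of the decision a gateway makes once all spans of a trace have arrived. It assumes a trace is kept when any policy votes to sample it; the span dictionaries and the hash-based bucketing are illustrative choices, not the processor's internal data model.

```python
import hashlib

def keep_trace(trace_id: str, spans: list, sampling_percentage: float = 10.0) -> bool:
    """Decide whether to keep a complete trace."""
    # Policy "errors": keep any trace that contains a span with ERROR status.
    if any(span.get("status") == "ERROR" for span in spans):
        return True
    # Policy "default": keep a deterministic fraction of the remaining traces.
    # Hashing the trace ID gives the same answer every time the trace is evaluated.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < sampling_percentage * 100

failed = keep_trace("trace-a", [{"status": "OK"}, {"status": "ERROR"}])
kept = sum(keep_trace(f"trace-{i}", [{"status": "OK"}]) for i in range(10_000))
```

Error traces are always kept, while roughly one in ten healthy traces survives. The decision needs every span of the trace in one place, which is why this logic belongs in the gateway rather than in per-node agents.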

3.3 Agent vs gateway: trade-off summary

| Concern | Agent (sidecar / DaemonSet) | Gateway (centralized) |
| --- | --- | --- |
| Resource overhead | Linear with nodes/pods; isolated impact | Fewer instances; better CPU/memory efficiency |
| Batching efficiency | Smaller per-instance volume | Aggregates → large batches, better compression |
| Tail-based sampling | Limited to local view → broken traces | Global view → correct decisions |
| Network topology | App → localhost or node-local | App → cluster Service (cross-node traffic) |
| Auth to backends | Secrets distributed across nodes | Secrets centralized at gateway |
| Multi-tenancy | Hard to enforce centrally | Natural policy enforcement point |
| Scalability | Scales naturally with nodes | Needs HPA / sharding for stateful processors |
| Reliability | No central SPOF; per-node blast radius | Choke point; mitigated by replicas + HPA |

3.4 Hybrid: agent + gateway

For most non-trivial Kubernetes clusters, the recommended pattern is DaemonSet agent + Deployment gateway:

  1. DaemonSet agent receives OTLP from local pods, scrapes Prometheus and kubelet endpoints, tails container logs, adds Kubernetes metadata, and forwards OTLP to the gateway. Resilient, node-local, and cheap.
  2. Deployment gateway receives OTLP from agents (and any clients that prefer to bypass the agent), runs tail sampling and tenant routing, authenticates to backends, and handles retries and backpressure.

This hybrid topology combines the strengths of both tiers: node-local enrichment and host signals from the agent, and centralized sampling, auth, and routing from the gateway. It also lets you choose where failure is absorbed: if the gateway is overloaded, agents buffer locally (in the exporter's in-memory sending queue, or on disk through the file_storage extension) until pressure subsides [Source: https://opentelemetry.io/docs/collector/deployment/].
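The buffering behaviour can be pictured as a bounded queue in front of the gateway connection. The sketch below is a simplified model under stated assumptions: the class name, the reject-when-full policy, and the send_to_gateway transport are all hypothetical, and the real Collector queue has more settings than shown here.

```python
from collections import deque

class BoundedSendQueue:
    """Holds batches while the downstream gateway is unavailable."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.batches = deque()
        self.dropped = 0

    def offer(self, batch: list) -> bool:
        """Enqueue a batch; reject it and count the drop when the queue is full."""
        if len(self.batches) >= self.capacity:
            self.dropped += 1
            return False
        self.batches.append(batch)
        return True

    def drain(self, send) -> int:
        """Send queued batches in order; stop at the first failure, keep the rest."""
        sent = 0
        while self.batches and send(self.batches[0]):
            self.batches.popleft()
            sent += 1
        return sent

gateway_up = False
delivered = []

def send_to_gateway(batch: list) -> bool:
    # Hypothetical transport: succeeds only while the gateway accepts data.
    if gateway_up:
        delivered.append(batch)
    return gateway_up

queue = BoundedSendQueue(capacity=2)
accepted = [queue.offer([f"span-{i}"]) for i in range(3)]  # third batch is rejected
sent_while_down = queue.drain(send_to_gateway)             # gateway overloaded
gateway_up = True
sent_after_recovery = queue.drain(send_to_gateway)         # backlog flushes in order
```

The capacity bound is the important design choice: it caps agent memory during a long gateway outage, at the cost of dropping data once the buffer is full.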

3.5 The OpenTelemetry Operator

On Kubernetes, the OpenTelemetry Operator provides CRDs that let you manage Collector topology as declarative configuration, just like Deployments and Services [Source: https://github.com/open-telemetry/opentelemetry-operator].

Key Takeaway: Agent mode optimizes for node-local collection, host signals, and isolated failure domains; gateway mode optimizes for tail sampling, centralized auth, and multi-tenant routing. Most production Kubernetes clusters end up running both — a DaemonSet agent for cheap node-local work and a Deployment gateway for heavy centralized work.

4. Distributions and Builds

The Collector is more than one binary. The OpenTelemetry project ships distributions — pre-built bundles of receivers, processors, exporters, and extensions — and provides a builder tool for assembling your own [Source: https://opentelemetry.io/docs/collector/].

4.1 otelcol vs otelcol-contrib

The two flagship distributions, alongside vendor and custom builds:

| Distribution | Components | Image size | Use case |
| --- | --- | --- | --- |
| otelcol | Minimal, core only | Small | Stable production, OTLP-only |
| otelcol-contrib | Hundreds | Large | Most real-world deployments needing vendor or specialty components |
| Vendor (e.g., Datadog Agent, AWS Distro for OpenTelemetry) | Curated for vendor | Varies | Tight vendor integration, vendor support contracts |
| Custom (built with ocb) | Exactly what you choose | Smallest possible | Production hardening, supply-chain control |

For most teams getting started, otelcol-contrib is the practical default — almost every production pipeline ends up needing at least one component that lives in contrib (k8sattributes, tail_sampling, filelog, etc.).

4.2 Building custom distributions with ocb

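The OpenTelemetry Collector Builder (ocb) is a CLI that lets you assemble your own distribution from a manifest. The motivation is a smaller binary and tighter supply-chain control, because you ship only the components you actually use.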

A builder manifest (manifest.yaml) looks like this:

dist:
  name: my-otelcol
  description: Custom OpenTelemetry Collector for ACME Corp
  output_path: ./dist
  otelcol_version: 0.95.0

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.95.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver v0.95.0

processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.95.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor v0.95.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor v0.95.0

exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.95.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter v0.95.0

Then:

ocb --config manifest.yaml

The result is a single Go binary containing exactly the components listed — nothing more.
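In the manifest above, every gomod line pins the same version as otelcol_version. A small pre-build check, sketched here with a shortened copy of that manifest, could catch an accidental mismatch before ocb runs. The helper name mismatched_modules is hypothetical, and the regular expressions assume the simple one-module-per-line layout shown in this chapter rather than full YAML parsing.

```python
import re

MANIFEST = """\
dist:
  name: my-otelcol
  otelcol_version: 0.95.0

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.95.0
processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.95.0
exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.95.0
"""

def mismatched_modules(manifest_text: str) -> list:
    """Return module paths whose pinned version differs from otelcol_version."""
    core = re.search(r"^[ \t]*otelcol_version:[ \t]*(\S+)", manifest_text, re.MULTILINE)
    if core is None:
        raise ValueError("manifest has no otelcol_version")
    expected = "v" + core.group(1)
    modules = re.findall(
        r"^[ \t]*-[ \t]*gomod:[ \t]*(\S+)[ \t]+(\S+)", manifest_text, re.MULTILINE
    )
    return [module for module, version in modules if version != expected]

problems = mismatched_modules(MANIFEST)
```

An empty result means all components line up with the Collector version; any module path returned is worth reviewing before building.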

Analogy: otelcol-contrib is like a Linux distribution shipped with every package preinstalled. Convenient, but most servers only need a handful. A custom build with ocb is the equivalent of apt install only the packages you need, on a minimal base image.

4.3 Vendor-specific distributions

Several vendors ship their own Collector distributions, sometimes branded as agents; the vendor row of the table in 4.1 names two examples.

Vendor distributions still respect the API/SDK/OTLP boundary. Your applications continue to emit standard OTLP; only the Collector itself is vendor-flavored. Switching vendors is largely a Collector configuration change — instrumentation code stays put. This is the practical payoff of the architecture in section 1.

4.4 Choosing a distribution

A decision matrix for picking a distribution:

| If you need… | Start with |
| --- | --- |
| OTLP-only, simple pipeline | otelcol (core) |
| Most real-world Kubernetes pipelines | otelcol-contrib |
| Tight vendor integration + support contract | Vendor distribution |
| Minimal binary, supply-chain control | Custom build with ocb |
| Quick local experimentation | otelcol-contrib Docker image |

Key Takeaway: OpenTelemetry distributions are pre-assembled bundles of Collector components. Use otelcol-contrib for most production pipelines, vendor distributions when you want first-party support, and ocb-built custom distributions when supply-chain hygiene or binary size matter. In all cases, applications continue to emit standard OTLP — the distribution choice is a Collector concern, not an instrumentation concern.

Chapter Summary

OpenTelemetry is three layers, not one: a stable API that instrumentation code calls, a configurable SDK that turns those calls into real telemetry, and an out-of-process Collector that aggregates, processes, and forwards data. The API/SDK split is what makes vendor-neutral instrumentation possible — libraries can depend only on the API, and applications choose backends at deployment time without recompiling.

Across more than a dozen languages, three pillars give OpenTelemetry consistency: the specification that defines API/SDK semantics, semantic conventions that standardize attribute names so any backend can interpret any signal, and OTLP — the wire protocol with three transport variants (gRPC on 4317, HTTP/protobuf on 4318, HTTP/JSON on 4318) — that carries telemetry between SDKs, Collectors, and backends.

Collector topology is the most operationally flexible knob. Agents (sidecars or DaemonSets) excel at node-local enrichment, host signals, and isolated failures. Gateways (centralized Deployments) excel at tail-based sampling, centralized auth, and multi-tenant routing. Most non-trivial Kubernetes deployments combine both — a DaemonSet agent for cheap local work and a Deployment gateway for heavy centralized work, often managed by the OpenTelemetry Operator.

Finally, distributions package the Collector for different audiences: otelcol for minimal stable pipelines, otelcol-contrib for the practical default, vendor distributions for first-party support, and ocb-built custom distributions for supply-chain hygiene. Regardless of distribution, applications emit standard OTLP — keeping the seam between application code and operations exactly where the architecture intends.

In the next chapter, we’ll dive deeper into instrumenting applications: auto-instrumentation versus manual instrumentation, language-specific patterns, and how to evolve from zero-touch coverage to high-value custom spans and metrics.

Key Terms

| Term | Definition |
| --- | --- |
| OpenTelemetry API | The stable, vendor-neutral interface (Tracer/Meter/Logger and their providers) that libraries and application code use to emit telemetry. Default behavior is no-op when no SDK is configured. |
| OpenTelemetry SDK | The concrete in-process implementation of the API that adds samplers, processors, exporters, and resource detection. Applications own SDK configuration. |
| OpenTelemetry Collector | An out-of-process Go binary that receives, processes, and exports telemetry. Pipeline = receivers → processors → exporters. |
| OTLP | OpenTelemetry Protocol; the vendor-neutral wire protocol between SDKs, Collectors, and backends. Defined in protobuf with three transport variants. |
| OTLP/gRPC | OTLP with protobuf payloads over gRPC (HTTP/2), default port 4317. Low overhead; well suited to modern infrastructure. |
| OTLP/HTTP/protobuf | OTLP over HTTP/1.1 or HTTP/2 with binary protobuf body, port 4318. Friendly to traditional HTTP proxies and load balancers. |
| OTLP/HTTP/JSON | OTLP over HTTP with JSON body, port 4318. Easiest to debug; a common choice for browser telemetry, where gRPC is not available. |
| Semantic conventions | Standardized attribute names (e.g., service.name, http.response.status_code, k8s.pod.name) that let any backend interpret telemetry from any language uniformly. |
| Agent deployment | Collector running per-node (DaemonSet) or per-pod (sidecar). Node-local collection, host signals, isolated failures. |
| Gateway deployment | Collector running as a centralized Deployment behind a Service. Enables tail sampling, centralized auth, multi-tenant routing. |
| Distribution | A pre-built bundle of Collector components (otelcol, otelcol-contrib, vendor builds, or custom ocb builds). |
| ocb (OpenTelemetry Collector Builder) | A CLI that builds a custom Collector binary from a manifest listing exactly the receivers, processors, and exporters you want. |
| Resource | OTLP/SDK construct describing the entity producing telemetry (service.name, host.name, deployment.environment, cloud attributes). |
| Partial success | OTLP response indicating that the receiver accepted the request but rejected some of its items (e.g., malformed data points). Rejected items should not be retried. |

Chapter 6: Instrumentation: Manual, Automatic, and Zero-Code

Instrumentation is the act of teaching your code to talk about itself. Before any dashboard can be drawn, before any alert can fire, before any trace can be visualized, some agent — your code, a runtime hook, or the Linux kernel — must produce a signal. OpenTelemetry recognizes three broad strategies for producing those signals: manual instrumentation, where developers explicitly emit spans, metrics, and logs; automatic instrumentation, where libraries are patched at runtime to emit telemetry on your behalf; and zero-code instrumentation, where an external observer (often eBPF in the kernel, or a Kubernetes Operator injecting agents) generates telemetry with no awareness from the application itself.

Think of the three approaches as the difference between an author writing a memoir, a transcriptionist sitting beside them, and a hidden microphone in the ceiling. Each captures a story; each captures it differently; each has a place.

Learning Objectives

By the end of this chapter, you will be able to:

Figure 6.1: Three instrumentation approaches compared

graph TD
    I[Application Telemetry]
    I --> M[Manual<br/>developer writes<br/>tracer.start_span]
    I --> A[Automatic<br/>runtime agent<br/>wraps libraries]
    I --> Z[Zero-Code<br/>eBPF kernel probes<br/>or Operator injection]
    M -->|effort: high| Q1[business attributes<br/>tenant.id, order.id]
    A -->|effort: low| Q2[broad HTTP/DB/RPC<br/>library coverage]
    Z -->|effort: none| Q3[polyglot, no rebuild<br/>kernel-wide visibility]

Section 1: Manual Instrumentation

Manual instrumentation puts the developer in direct control. You acquire a Tracer, Meter, or Logger through the OpenTelemetry API, then explicitly call methods to start spans, record measurements, or write structured log events. Manual code is verbose, but it is the only way to express domain context: concepts like order_id, tenant_id, payment.status, and feature_flag.variant that no auto-instrumenter could ever guess.

Acquiring Tracers, Meters, and Loggers

OpenTelemetry exposes three top-level provider objects: a TracerProvider, a MeterProvider, and a LoggerProvider. The API defines them and the SDK supplies the working implementations. From each provider you obtain a named, versioned instance scoped to your library or module. The name is conventionally the import path of the instrumented package; it becomes the instrumentation scope name on every signal you emit, letting backends filter by which code produced the data [Source: https://opentelemetry.io/docs/specs/semconv/db/database-spans/].

// Java: acquire scoped tracer and meter
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.api.metrics.Meter;

Tracer tracer = GlobalOpenTelemetry.getTracer("com.acme.payments", "1.4.0");
Meter meter   = GlobalOpenTelemetry.getMeter("com.acme.payments");

# Python: acquire scoped tracer and meter
from opentelemetry import trace, metrics

tracer = trace.get_tracer("acme.payments", "1.4.0")
meter  = metrics.get_meter("acme.payments")

// Node.js: acquire scoped tracer and meter
const { trace, metrics } = require('@opentelemetry/api');

const tracer = trace.getTracer('acme-payments', '1.4.0');
const meter  = metrics.getMeter('acme-payments');

Creating Spans and Recording Attributes

A span is the basic building block of a trace: a named, timed operation with a start, an end, attributes, events, and a status. The idiomatic pattern is to wrap a unit of work in a span block so it closes automatically, even on exceptions.

// Java: a span around a business operation
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.context.Scope;

Span span = tracer.spanBuilder("authorize_payment")
    .setSpanKind(SpanKind.INTERNAL)
    .setAttribute("payment.method", "card")
    .setAttribute("tenant.id", tenantId)
    .startSpan();
try (Scope scope = span.makeCurrent()) {
    boolean ok = gateway.authorize(amount);
    span.setAttribute("payment.outcome", ok ? "approved" : "declined");
    if (!ok) span.setStatus(StatusCode.ERROR, "gateway declined");
} catch (Exception e) {
    span.recordException(e);
    span.setStatus(StatusCode.ERROR);
    throw e;
} finally {
    span.end();   // always end the span, even when an exception escapes
}

# Python: idiomatic context-manager span
# start_as_current_span records an escaping exception and sets ERROR status
# by default, so an explicit except block would record the exception twice.
with tracer.start_as_current_span("authorize_payment") as span:
    span.set_attribute("payment.method", "card")
    span.set_attribute("tenant.id", tenant_id)
    approved = gateway.authorize(amount)
    span.set_attribute("payment.outcome",
                       "approved" if approved else "declined")

// Node.js: span around an async function
const { SpanStatusCode } = require('@opentelemetry/api');

await tracer.startActiveSpan('authorize_payment', async (span) => {
  span.setAttribute('payment.method', 'card');
  span.setAttribute('tenant.id', tenantId);
  try {
    const approved = await gateway.authorize(amount);
    span.setAttribute('payment.outcome', approved ? 'approved' : 'declined');
  } catch (err) {
    span.recordException(err);
    span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
    throw err;
  } finally {
    span.end();   // startActiveSpan does not end the span for you
  }
});

Attributes are key–value pairs. Their job is twofold: provide search and grouping handles in your trace UI, and feed dashboards through trace-to-metric pipelines. Use dot-namespaced keys (payment.method, not paymentMethod), follow semantic conventions when one exists, and treat your custom namespace (acme.*, payment.*) the way you would treat a public API — once dashboards depend on it, you cannot freely rename it.
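One way to enforce the naming rule is a lint step in code review or CI. The sketch below encodes a hypothetical house rule (lowercase, dot-namespaced keys with at least one namespace segment); the pattern is an assumption for illustration, not an official OpenTelemetry validator.

```python
import re

# Assumed house rule: lowercase segments separated by dots, at least two segments.
KEY_PATTERN = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")

def invalid_keys(attributes: dict) -> list:
    """Return attribute keys that break the naming rule, sorted for stable output."""
    return sorted(key for key in attributes if not KEY_PATTERN.match(key))

span_attributes = {
    "payment.method": "card",
    "paymentMethod": "card",        # camelCase, no namespace
    "tenant.id": "acme-corp",
    "http.response.status_code": 200,
}

violations = invalid_keys(span_attributes)
```

Here only paymentMethod is flagged. Catching such keys before they ship is far cheaper than renaming them after dashboards depend on them.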

Custom Metric Instruments

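Where spans tell stories of individual requests, metrics tell stories of aggregate behavior. OpenTelemetry provides synchronous instruments (Counter, UpDownCounter, Histogram, and Gauge) that you call inline, plus asynchronous, callback-based counterparts (observable Counter, UpDownCounter, and Gauge) that the SDK polls at collection time. Choose the instrument that matches the semantics of what you are counting, not just the dashboard panel you want.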

| Instrument | Direction | Aggregation | Typical Use |
| --- | --- | --- | --- |
| Counter | Monotonic up | Sum | Total requests, errors, bytes sent |
| UpDownCounter | Up or down | Sum | Active connections, queue depth, pool size |
| Histogram | Records observations | Bucketed distribution | Request latency, payload size |
| Gauge (observable) | Sampled | Last value | CPU utilization, current temperature |

# Python: a Counter and a Histogram for HTTP server work
import time

http_requests = meter.create_counter(
    name="http.server.requests",
    description="Number of HTTP requests served",
    unit="1",
)

http_latency = meter.create_histogram(
    name="http.server.request.duration",
    description="HTTP server request latency",
    unit="s",
)

start = time.monotonic()
try:
    response = handle(request)
    http_requests.add(1, {
        "http.request.method": request.method,
        "http.response.status_code": response.status,
        "http.route": request.matched_route,
    })
finally:
    http_latency.record(time.monotonic() - start, {
        "http.request.method": request.method,
        "http.route": request.matched_route,
    })

Three rules deserve special emphasis. First, units matter: declare seconds (s), bytes (By), or 1 for dimensionless counts so backends can convert and label correctly. Second, histogram bucket boundaries are usually picked by the SDK — override them only when you know the latency profile of your service. Third, every attribute you attach to a metric multiplies cardinality; we return to that in Section 4.
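The third rule is easy to quantify. In the worst case, the number of time series for one metric is the product of the distinct values of each attribute. The value counts below are hypothetical.

```python
from math import prod

def worst_case_series(distinct_values: dict) -> int:
    """Upper bound on time series: the product of distinct values per attribute."""
    return prod(distinct_values.values())

# Assumed counts of distinct values for a typical HTTP server metric.
base = {
    "http.request.method": 5,
    "http.route": 40,
    "http.response.status_code": 8,
}
baseline = worst_case_series(base)                          # 5 * 40 * 8

# Adding one high-cardinality attribute multiplies the total.
with_tenant = worst_case_series({**base, "tenant.id": 2000})
```

With these numbers the baseline is 1,600 series and the tenant-labelled variant is 3,200,000. Real counts are usually lower because not every combination occurs, but the multiplication is what makes a single unbounded attribute so expensive.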

Key Takeaway: Manual instrumentation gives you the only path to business-meaningful telemetry. Acquire scoped tracers and meters once per module, wrap units of work in spans with carefully chosen attributes, and pick metric instruments by semantics — Counter for monotonic totals, UpDownCounter for things that ebb and flow, Histogram for distributions, Gauge for current values.

Section 2: Automatic Instrumentation

Automatic instrumentation is the OpenTelemetry community’s answer to a hard question: “How do I get traces from libraries I did not write?” The answer differs by runtime, because each language exposes different hooks for intercepting library code without changing it.

Bytecode Injection: Java Agent and .NET Profiler

In Java, the opentelemetry-javaagent.jar attaches to the JVM at startup using the -javaagent flag. Internally, it registers a premain method via the Java Instrumentation API and uses a bytecode library such as ByteBuddy to rewrite classes as the classloader loads them, weaving span start/end logic around methods of interest [Source: https://javapro.io/wp-content/uploads/2026/02/JAVAPRO_01-2026.pdf].

OTEL_SERVICE_NAME=checkout-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
OTEL_EXPORTER_OTLP_PROTOCOL=grpc \
OTEL_TRACES_EXPORTER=otlp \
OTEL_LOGS_EXPORTER=otlp \
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=prod,service.version=2.3.1 \
java -javaagent:/opt/otel/opentelemetry-javaagent.jar -jar /app/checkout.jar

The agent ships with dozens of instrumentation modules for Servlet, Spring MVC and WebFlux, JAX-RS, gRPC, OkHttp, Apache HttpClient, JDBC, R2DBC, Hibernate, Mongo, Cassandra, Kafka, RabbitMQ, JMS, and many more. Because the rewriting happens at class load time, it works without source changes; however, the agent must attach at JVM start — you generally cannot retrofit a running JVM — and exotic custom classloaders sometimes need extra configuration [Source: https://javapro.io/wp-content/uploads/2026/02/JAVAPRO_01-2026.pdf].

.NET uses a conceptually similar mechanism via a CLR profiler, registered through CORECLR_ENABLE_PROFILING=1 and a companion DLL that injects IL into managed methods at JIT time.

Monkey-Patching: Python and Node.js

Dynamic languages do not need bytecode rewriting — they let you replace functions at runtime. Python’s auto-instrumentation uses the opentelemetry-instrument CLI to run a bootstrap before your application code; that bootstrap loads every installed opentelemetry-instrumentation-* package, each of which monkey-patches the relevant library at import time [Source: https://lumigo.io/opentelemetry/].

pip install opentelemetry-distro opentelemetry-exporter-otlp \
            opentelemetry-instrumentation-requests \
            opentelemetry-instrumentation-psycopg2 \
            opentelemetry-instrumentation-flask

OTEL_SERVICE_NAME=orders-api \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
opentelemetry-instrument gunicorn orders.wsgi:application

For example, the requests instrumentation wraps the requests.Session method that sends each request: the wrapper opens a client span, records HTTP attributes, calls the original function, captures the response, and ends the span. Patching must happen before application code binds its own reference to the patched function (for example through from module import name); a reference captured earlier keeps pointing at the unpatched original.
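The patching technique and the import-order pitfall can both be shown with the standard library alone. In this sketch, fakehttp is a hypothetical stand-in for a third-party client library, and the recorded dictionaries stand in for spans.

```python
import types

# A stand-in for a third-party HTTP library (hypothetical, not the real requests).
fakehttp = types.SimpleNamespace()

def _original_get(url):
    return f"response from {url}"

fakehttp.get = _original_get

recorded_spans = []

# Application code that captured the function BEFORE instrumentation ran,
# equivalent to "from fakehttp import get" at import time.
early_get = fakehttp.get

def instrument(module):
    """Replace module.get with a wrapper that records a span-like entry."""
    original = module.get

    def traced_get(url):
        recorded_spans.append({"name": "GET", "url.full": url})
        return original(url)

    module.get = traced_get

instrument(fakehttp)

patched_result = fakehttp.get("http://inventory/items")   # goes through the wrapper
cached_result = early_get("http://inventory/stock")       # bypasses it entirely
```

Both calls return a response, but only the first one produces a span. The second call went through a reference that was captured before patching, which is why the bootstrap has to run first.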

Node.js relies on require hooks. @opentelemetry/auto-instrumentations-node registers handlers for require() (via require-in-the-middle or similar) and patches each module’s exports as it is loaded.

// tracing.js — must be required before any other module
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({}),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

# Shell: preload the bootstrap before the application starts
NODE_OPTIONS="--require ./tracing.js" node app.js

Coverage and the Cross-Language Comparison

All three approaches expose the same observable surface — HTTP servers and clients, gRPC, SQL and NoSQL clients, message queues — but each has a different “blast radius” and quirks.

Figure 6.2: Java agent bytecode injection lifecycle

sequenceDiagram
    participant OS as OS / shell
    participant JVM as JVM
    participant Agent as otel-javaagent.jar
    participant CL as ClassLoader
    participant App as Application code
    participant Col as OTLP Collector
    OS->>JVM: java -javaagent:otel.jar -jar app.jar
    JVM->>Agent: invoke premain(Instrumentation)
    Agent->>JVM: register ClassFileTransformer (ByteBuddy)
    JVM->>App: start main()
    App->>CL: load HttpServlet, JdbcDriver, ...
    CL->>Agent: transform(class bytes)
    Agent-->>CL: rewritten bytes with span hooks
    App->>App: first request enters Servlet.service()
    App->>Col: OTLP span exported<br/>(http.server.request)

| Aspect | Java | Python | Node.js |
| --- | --- | --- | --- |
| Primary mechanism | -javaagent bytecode rewrite | Monkey-patching at import | require hook + export patching |
| Entry point | JVM flag | opentelemetry-instrument CLI | NODE_OPTIONS=--require |
| Runtime hook | Java Instrumentation API + ByteBuddy | Dynamic attribute assignment | require-in-the-middle |
| Code changes | None | None | One bootstrap file |
| Context propagation | Thread-locals + executor wrappers | contextvars + async wrappers | Async hooks integrated per library |
| Configuration | OTEL_* env vars | OTEL_* env vars + CLI flags | OTEL_* env vars + NodeSDK options |
| Common pitfall | Custom classloaders | Import order before patch | Bundlers/serverless hide require |

A shared environment-variable contract spans every language: OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_TRACES_EXPORTER, OTEL_METRICS_EXPORTER, OTEL_LOGS_EXPORTER, OTEL_EXPORTER_OTLP_PROTOCOL, OTEL_TRACES_SAMPLER, OTEL_PROPAGATORS, and OTEL_RESOURCE_ATTRIBUTES. Operators love this: one ConfigMap, one set of variables, and every workload — whether Java, Python, or Node — speaks the same dialect [Source: https://lumigo.io/opentelemetry/].

When something goes wrong, two debugging axioms apply. No traces at all usually means an exporter is set to none, the endpoint protocol is mismatched (grpc vs. http/protobuf), or the bootstrap is loading too late. Duplicate spans almost always mean a library is being captured by both auto- and manual instrumentation; disable one for that library.
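The first axiom lends itself to a quick preflight check. The sketch below inspects a dictionary of environment variables for two of the causes named above; the diagnose helper is hypothetical, and the port-to-protocol mapping relies on the conventional OTLP ports (4317 for gRPC, 4318 for HTTP).

```python
from urllib.parse import urlsplit

def diagnose(env: dict) -> list:
    """Return human-readable findings for common 'no traces at all' causes."""
    findings = []
    if env.get("OTEL_TRACES_EXPORTER", "").lower() == "none":
        findings.append("traces exporter is set to none")

    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "")
    protocol = env.get("OTEL_EXPORTER_OTLP_PROTOCOL", "")
    port = urlsplit(endpoint).port if endpoint else None

    # Conventional ports: 4317 carries gRPC, 4318 carries HTTP.
    if port == 4317 and protocol.startswith("http/"):
        findings.append(f"port 4317 is conventionally gRPC but protocol is {protocol}")
    if port == 4318 and protocol == "grpc":
        findings.append("port 4318 is conventionally HTTP but protocol is grpc")
    return findings

mismatch = diagnose({
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector:4317",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
})
healthy = diagnose({
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector:4317",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
})
```

In practice you would pass dict(os.environ) from inside the container. The check cannot detect a late-loading bootstrap, which still has to be verified from startup logs.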

Key Takeaway: Auto-instrumentation gives you HTTP, gRPC, database, and queue spans for free, but how it gets injected depends on the runtime. Java rewrites bytecode at class load; Python and Node monkey-patch at import or require. All three share the same OTEL_* configuration vocabulary, which is what makes mixed-language fleets tractable.

Section 3: Zero-Code Instrumentation

“Zero-code” is the marketing label for a stronger promise than auto-instrumentation: not only does the developer not write tracing code, the developer’s build artifact is not modified at all. Two distinct technologies live under this umbrella: eBPF agents that observe processes from the Linux kernel, and the OpenTelemetry Operator that injects auto-instrumentation agents into Kubernetes pods at admission time without changing container images.

eBPF-Based Auto-Instrumentation

eBPF (extended Berkeley Packet Filter) lets you load safe, sandboxed bytecode into the Linux kernel and attach it to kernel events — syscalls, function entry/exit, tracepoints, network events — at runtime, without recompiling the kernel [Source: https://ebpf.io/what-is-ebpf/]. An eBPF observability agent typically does the following [Source: https://logz.io/glossary/what-is-ebpf/] [Source: https://www.sysdig.com/blog/the-art-of-writing-ebpf-programs-a-primer]:

  1. Attaches kprobes to kernel network functions such as tcp_sendmsg and tcp_cleanup_rbuf, and tracepoints such as sys_enter_sendto and sys_enter_recvfrom, to observe the data that crosses TCP sockets.
  2. Attaches uprobes to user-space functions in shared libraries — SSL_read / SSL_write in libssl, the Go runtime’s HTTP handlers, JVM JNI entry points — to see data before encryption or after decryption.
  3. Writes structured records into eBPF maps that a user-space agent drains at high frequency.
  4. Reconstructs requests in user space — matching send/recv into request–response pairs, parsing HTTP headers, gRPC HTTP/2 framing, and SQL handshakes — to produce L7 metrics and OTLP spans [Source: https://www.groundcover.com/ebpf].
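Step 4 is ordinary user-space code, so it can be sketched without any eBPF tooling. The example below pairs send and receive events on the same connection into span-like records. The event shape (conn, direction, ts_ms, payload) is a hypothetical simplification of what an agent would read from an eBPF map, and it handles only plain HTTP/1.1 with one request in flight per connection.

```python
def pair_requests(events: list) -> list:
    """Match each outbound request with the next response on the same connection."""
    pending = {}   # connection id -> outstanding request event
    spans = []
    for event in sorted(events, key=lambda e: e["ts_ms"]):
        conn = event["conn"]
        if event["direction"] == "send":
            pending[conn] = event
        elif conn in pending:
            request = pending.pop(conn)
            method, target = request["payload"].split(" ")[:2]
            status = int(event["payload"].split(" ")[1])
            spans.append({
                "http.request.method": method,
                "url.path": target,
                "http.response.status_code": status,
                "duration_ms": event["ts_ms"] - request["ts_ms"],
            })
    return spans

EVENTS = [
    {"conn": 7, "direction": "send", "ts_ms": 100, "payload": "GET /orders/42 HTTP/1.1"},
    {"conn": 9, "direction": "send", "ts_ms": 105, "payload": "POST /pay HTTP/1.1"},
    {"conn": 7, "direction": "recv", "ts_ms": 130, "payload": "HTTP/1.1 200 OK"},
    {"conn": 9, "direction": "recv", "ts_ms": 405, "payload": "HTTP/1.1 503 Service Unavailable"},
]

spans = pair_requests(EVENTS)
```

Everything in the result comes from bytes on the wire: method, path, status, and timing. Nothing here can reveal which tenant or order the request belonged to, a limit that matters when choosing between instrumentation strategies.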

Figure 6.3: eBPF zero-code instrumentation dataflow

flowchart TD
    subgraph US[User space]
        A1[App A<br/>Go binary]
        A2[App B<br/>Java JVM]
        A3[App C<br/>Python]
        SSL[libssl.so]
        DS[Beyla / Pixie DaemonSet<br/>user-space agent]
    end
    subgraph K[Linux kernel]
        KP1[kprobe: tcp_sendmsg]
        KP2[kprobe: tcp_cleanup_rbuf]
        KP3[tracepoint: sys_enter_sendto]
        UP[uprobe: SSL_read / SSL_write]
        MAP[(eBPF map<br/>ring buffer)]
    end
    A1 -->|syscalls| KP1
    A2 -->|syscalls| KP3
    A3 -->|TLS calls| SSL
    SSL --> UP
    KP1 --> MAP
    KP2 --> MAP
    KP3 --> MAP
    UP --> MAP
    MAP --> DS
    DS -->|OTLP spans + RED metrics| COL[OpenTelemetry Collector]

Because the hook points are in the kernel and in shared libraries, eBPF works for every language on the host (Go, Rust, Java, Python, Node, C++, even closed-source binaries) without touching their code [Source: https://www.contrastsecurity.com/glossary/ebpf]. The output is typically RED metrics per service (request rate, errors, duration) plus distributed traces for common protocols [Source: https://newrelic.com/blog/observability/what-is-ebpf].

Tool landscape:

| Tool | Focus | Output |
| --- | --- | --- |
| Grafana Beyla | Zero-code OTel auto-instrumentation for HTTP/gRPC/DB | OTLP traces + RED metrics |
| Pixie | K8s deep debugging, full request bodies, PxL scripts | In-cluster live data, dashboards |
| Cilium Tetragon | Runtime security and policy enforcement | Process/file/network events; can block |
| Odigos | eBPF + SDK hybrid OTel platform | OTLP routed by policy |

OpenTelemetry Operator and Auto-Instrumentation CRDs

For Kubernetes workloads, the OpenTelemetry Operator offers a different flavor of zero-code: it lets the cluster itself inject the auto-instrumentation agents we saw in Section 2, with no changes to your container images [Source: https://lumigo.io/opentelemetry/]. The Operator defines an Instrumentation Custom Resource that describes how to instrument, then a mutating admission webhook applies that recipe when pods are annotated for injection.

# 1. The Instrumentation CRD: a reusable recipe per language
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
  namespace: production
spec:
  exporter:
    endpoint: http://otel-collector.observability:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"
  resource:
    attributes:
      deployment.environment: prod
      service.namespace: payments
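  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
    env:
      - name: OTEL_INSTRUMENTATION_JDBC_ENABLED
        value: "true"
      # Recent Java agents default to http/protobuf; select gRPC to match port 4317
      - name: OTEL_EXPORTER_OTLP_PROTOCOL
        value: grpc
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
    env:
      # Python auto-instrumentation exports over http/protobuf by default: use port 4318
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://otel-collector.observability:4318
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest

# 2. The Deployment opts in via pod annotations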
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "production/default-instrumentation"
    spec:
      containers:
        - name: app
          image: registry.example.com/checkout:2.3.1

When a pod with that annotation is created, the webhook injects an init container that copies the Java agent JAR into a shared volume, then patches the application container with JAVA_TOOL_OPTIONS=-javaagent:/otel-auto-instrumentation/javaagent.jar and the appropriate OTEL_* environment variables. Python and Node.js use analogous mechanisms — wrapping the entry command or injecting startup hooks — so the application image stays untouched [Source: https://lumigo.io/opentelemetry/].
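The mutation described above amounts to a few edits on the pod spec. The sketch below applies them to a plain dictionary. It is an illustration of the idea only: the function name, volume name, init container name, and copy command are hypothetical, and the real webhook handles many more cases.

```python
import copy

AGENT_DIR = "/otel-auto-instrumentation"   # mount path quoted in the text above

def inject_java_agent(pod_spec: dict, agent_image: str, otel_env: dict) -> dict:
    """Return a patched copy of pod_spec; the original is left untouched."""
    patched = copy.deepcopy(pod_spec)
    mount = {"name": "otel-agent", "mountPath": AGENT_DIR}

    # Shared scratch volume that carries the agent JAR between containers.
    patched.setdefault("volumes", []).append({"name": "otel-agent", "emptyDir": {}})

    # Init container copies the JAR out of the agent image (names are hypothetical).
    patched.setdefault("initContainers", []).append({
        "name": "otel-agent-copy",
        "image": agent_image,
        "command": ["cp", "/javaagent.jar", f"{AGENT_DIR}/javaagent.jar"],
        "volumeMounts": [mount],
    })

    for container in patched["containers"]:
        container.setdefault("volumeMounts", []).append(mount)
        env = container.setdefault("env", [])
        env.append({"name": "JAVA_TOOL_OPTIONS",
                    "value": f"-javaagent:{AGENT_DIR}/javaagent.jar"})
        env.extend({"name": k, "value": v} for k, v in otel_env.items())
    return patched

original = {"containers": [{"name": "app", "image": "registry.example.com/checkout:2.3.1"}]}
patched = inject_java_agent(
    original,
    agent_image="ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest",
    otel_env={"OTEL_SERVICE_NAME": "checkout"},
)
```

The application image reference is identical before and after. Only the pod spec around it changes, which is why no rebuild is needed.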

This is the easiest path to fleet-wide instrumentation in Kubernetes: write the Instrumentation resource once, annotate pod templates by language, and every new pod is born observable.

Limits of Zero-Code Approaches

Zero-code is not a free lunch. Compare the three strategies:

| Capability | Manual | Auto (SDK) | eBPF Zero-Code |
| --- | --- | --- | --- |
| Captures HTTP/gRPC/DB calls | If coded | Yes, broad | Yes, broad |
| Captures business attributes (order_id) | Yes | No | No |
| Works on closed-source binaries | No | No | Yes |
| Sees TLS-encrypted in-process traffic | Yes | Yes | Only via libssl uprobes |
| Works on Windows/macOS | Yes | Yes | Linux only |
| Custom binary protocols | Yes | Sometimes | Rarely |
| Operational rollout effort | High | Low | Very low (DaemonSet) |
| Privilege required | App identity | App identity | CAP_SYS_ADMIN/CAP_BPF |

eBPF loses when business context matters (it has no idea that the request it just saw belongs to tenant=acme-corp paying order=ORD-9182), when traffic is encrypted with libraries the agent doesn’t know how to probe, when protocols are custom or binary, or when the platform isn’t Linux [Source: https://www.stackrox.io/blog/what-is-ebpf/]. The Operator-based path inherits all the limits of the SDK auto-instrumentation it ships — no domain attributes, library-version compatibility risk — but it is fantastic for getting wide coverage quickly.

Hybrid is the production-grade answer. Run eBPF for horizontal, language-agnostic baseline coverage of every workload on every node; use the Operator to inject SDK auto-instrumentation on every K8s pod; and add manual instrumentation on the critical business flows where you need tenant_id, feature_flag, payment.outcome, and the like to debug or to model SLOs.

Key Takeaway: Zero-code means two different things: eBPF probes in the kernel that see every process on a node, and the OpenTelemetry Operator injecting SDK agents into K8s pods at admission. Both eliminate code changes; neither captures business meaning. Combine them with manual instrumentation on the flows that matter.

Section 4: Semantic Conventions in Practice

Instrumentation that nobody can query is just expensive noise. Semantic conventions are OpenTelemetry’s contract that names things the same way everywhere, so a dashboard written against http.response.status_code works whether the data came from a Java agent, a Python monkey-patch, a Beyla eBPF probe, or your own manual code [Source: https://opentelemetry.io/docs/specs/semconv/db/database-spans/].

Attributes for HTTP, RPC, Database, Messaging

The conventions divide into stable attributes (safe to anchor dashboards on) and experimental ones (subject to change). The most heavily used stable attributes:

| Domain | Attribute | Value | Use |
|---|---|---|---|
| HTTP | http.request.method | GET, POST, … | Method dimension on RED metrics |
| HTTP | http.response.status_code | 200, 404, 503 | Error-rate panels, SLO burn |
| HTTP | http.route | /orders/{id} | Path grouping without ID explosion |
| HTTP | url.full | https://api/...?token=… | Debugging (sensitive; see hygiene below) |
| HTTP | server.address | api.acme.io | Backend grouping |
| HTTP | user_agent.original | raw UA string | Client breakdown |
| RPC | rpc.system | grpc, connect_rpc | Filter by RPC family |
| RPC | rpc.service / rpc.method | PaymentService / Authorize | Endpoint heatmaps |
| DB | db.system | postgresql, mysql, mongodb | Engine breakdown |
| DB | db.operation | SELECT, INSERT | Latency by operation |
| DB | db.statement | full text | Slow-query debugging (sensitive) |
| DB | db.name | logical DB/schema | Per-schema metrics |
| Messaging | messaging.system | kafka, rabbitmq | Broker breakdown |
| Messaging | messaging.destination.name | topic/queue name | Per-topic throughput |
| Messaging | messaging.operation | publish, receive, process | Lifecycle staging |

A clean dashboard query — “p95 HTTP server latency by route and status, last 30 minutes” — is just groupby(http.route, http.response.status_code) of histogram(http.server.request.duration). Because every service emits those attribute keys, the same panel works across the entire fleet, and across vendors that ingest OTLP natively [Source: https://lumigo.io/opentelemetry/].

Resource Attributes for Service Identity

Attributes describe a single signal; resource attributes describe the emitter of every signal it produces. They live in the OTLP Resource and are typically set once, at SDK initialization, via OTEL_RESOURCE_ATTRIBUTES or runtime resource detectors.

Stable resource attributes you should always set:

OTEL_SERVICE_NAME=checkout-api
OTEL_RESOURCE_ATTRIBUTES=\
service.namespace=payments,\
service.version=2.3.1,\
service.instance.id=checkout-api-7d4f-x9w2,\
deployment.environment=prod,\
k8s.namespace.name=production,\
k8s.deployment.name=checkout-api,\
k8s.pod.name=checkout-api-7d4f-x9w2,\
cloud.provider=aws,\
cloud.region=us-east-1

The OpenTelemetry Operator can fill many of these for you automatically by reading the pod’s downward API; in Kubernetes you should rarely need to set Kubernetes resource attributes by hand.
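
To make the format of OTEL_RESOURCE_ATTRIBUTES concrete, here is a minimal sketch of how a comma-separated key=value list like the one above could be parsed. The function name parse_resource_attributes is hypothetical and this is not the SDK's own parser; it assumes values may be percent-encoded, which is how commas and equals signs are usually escaped in this variable.

```python
from urllib.parse import unquote


def parse_resource_attributes(raw: str) -> dict[str, str]:
    """Parse a comma-separated key=value list into a dict (illustrative only)."""
    attributes = {}
    for pair in raw.split(","):
        if "=" not in pair:
            continue  # skip empty or malformed entries
        key, value = pair.split("=", 1)
        # Assumption: values may be percent-encoded (for example %2C for a comma).
        attributes[key.strip()] = unquote(value.strip())
    return attributes


raw = "service.namespace=payments,service.version=2.3.1,deployment.environment=prod"
resource = parse_resource_attributes(raw)
print(resource["service.version"])
```

Because these pairs end up on the Resource, they are attached to every span, metric, and log the process emits, which is why they are set once rather than per signal.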

Avoiding Label and Attribute Cardinality Bombs

Cardinality is the silent killer of observability platforms. Every unique combination of attribute values produces a distinct time series for metrics and a distinct index entry for traces. Pricing, retention, query speed, and even cluster stability degrade with cardinality. The single most important instrumentation discipline is asking, before you attach an attribute, “How many distinct values can this take?”
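
The question "how many distinct values can this take?" turns into simple arithmetic: the worst-case number of series is the product of the per-attribute cardinalities. The sketch below uses rough, illustrative estimates (about 10 methods, 60 status codes, 200 routes, one million users); the helper name worst_case_series is hypothetical.

```python
from math import prod


def worst_case_series(cardinalities: dict[str, int]) -> int:
    """Upper bound on distinct series: the product of distinct values per attribute."""
    return prod(cardinalities.values())


baseline = {
    "http.request.method": 10,
    "http.response.status_code": 60,
    "http.route": 200,
}
with_user_id = {**baseline, "user.id": 1_000_000}

print(worst_case_series(baseline))       # 120000
print(worst_case_series(with_user_id))   # 120000000000
```

Real systems rarely hit the full product because not every combination occurs, but the bound shows how a single unbounded attribute dominates everything else.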

Figure 6.4: OpenTelemetry Operator pod injection workflow

sequenceDiagram
    participant Dev as Developer
    participant API as kube-apiserver
    participant Op as OTel Operator
    participant WH as Mutating Webhook
    participant Init as Init container
    participant App as App container
    participant Col as Collector
    Dev->>API: kubectl apply Instrumentation CR<br/>(java/python/nodejs recipes)
    Op->>API: watch + cache Instrumentation CR
    Dev->>API: apply Deployment with annotation<br/>inject-java: "ns/instr-name"
    API->>WH: AdmissionReview (Pod create)
    WH->>WH: read annotation + CR recipe
    WH-->>API: patched Pod spec<br/>(init container + JAVA_TOOL_OPTIONS + OTEL_* env)
    API->>Init: schedule init container
    Init->>App: copy javaagent.jar to shared volume
    App->>App: JVM starts with -javaagent
    App->>Col: OTLP spans, metrics, logs

Figure 6.5: Cardinality explosion from a single attribute

graph LR
    subgraph BASE[Safe baseline]
        B1[method ~10]
        B2[status ~60]
        B3[route ~200]
        B1 --> BX[10 x 60 x 200<br/>= 120K series]
        B2 --> BX
        B3 --> BX
    end
    subgraph BAD[Add user.id]
        U[user.id<br/>~1,000,000]
        BX --> E[120K x 1M<br/>= 120 billion series]
        U --> E
    end
    E -->|TSDB OOM, cost spike| X[Cardinality bomb]

Rules of thumb:

| Attribute candidate | Cardinality | Use? |
|---|---|---|
| http.request.method | ~10 | Yes |
| http.response.status_code | ~60 | Yes |
| http.route (templated) | ~hundreds | Yes |
| db.operation | ~10 | Yes |
| service.version | ~tens | Yes |
| tenant.id (large SaaS) | ~thousands+ | Carefully; often spans only |
| url.full with raw path | ~unbounded | No on metrics; redact on spans |
| user.id | ~unbounded | Span attribute only; not on metrics |
| request.id / trace.id | per-request | Span only; never a metric label |
| db.statement raw | per-call | Span only, redacted/parameterized |

Three practical mitigations:

  1. Use templates, not raw values. Push http.route=/orders/{id} to your metric labels, leaving url.full for span-only attributes you debug with.
  2. Drop or hash at the Collector. If you cannot prevent a high-cardinality attribute at the source, the OpenTelemetry Collector’s attributes, transform, and redaction processors can drop, truncate, or one-way-hash before export [Source: https://www.honeycomb.io/blog/opentelemetry-best-practices-data-prep-cleansing].
  3. Separate metric and span schemas. It is fine — and good — for a span to carry tenant.id and order.id while the metric derived from those spans carries only tenant.tier and payment.method. Spans are sampled and indexed; metrics are aggregated forever.
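
The second and third mitigations can be sketched in a few lines. This is not Collector configuration; it is a plain-Python illustration of the same idea, with a hypothetical allowlist and a hypothetical choice of which key to hash. A real deployment would more likely do this in a Collector processor.

```python
import hashlib

# Hypothetical policy: keys that may become metric labels as-is, and keys
# that are kept only as a one-way hash.
METRIC_LABEL_ALLOWLIST = {"http.request.method", "http.response.status_code", "http.route"}
HASHED_KEYS = {"tenant.id"}


def to_metric_labels(span_attributes: dict[str, str]) -> dict[str, str]:
    """Derive a low-risk metric label set from a richer span attribute set."""
    labels = {}
    for key, value in span_attributes.items():
        if key in METRIC_LABEL_ALLOWLIST:
            labels[key] = value
        elif key in HASHED_KEYS:
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            labels[key] = digest[:12]  # hides the identity, keeps the cardinality
        # Anything else (user.id, url.full, ...) is dropped.
    return labels


labels = to_metric_labels({
    "http.request.method": "GET",
    "http.route": "/orders/{id}",
    "url.full": "https://api.acme.io/orders/42?token=abc",
    "user.id": "u-981",
    "tenant.id": "acme-corp",
})
print(sorted(labels))
```

Note that hashing hides the value but does not reduce the number of distinct values, and an unsalted hash of a guessable identifier can be reversed by brute force, so it is a hygiene measure rather than a cardinality fix.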

The same mitigations double as PII hygiene controls. url.full, url.query, client.address, network.peer.address, and db.statement may all contain personal data — email addresses, search terms, session tokens. The Honeycomb best-practices guide recommends a layered strategy: redact at the SDK where possible, allow-list at the Collector for everything you cannot vouch for, and hash where you need to preserve cardinality without preserving identity [Source: https://www.honeycomb.io/blog/opentelemetry-best-practices-data-prep-cleansing].

Stability and Evolution

OpenTelemetry semantic conventions evolve under a three-state model: Experimental, Stable, and Deprecated. The migration from http.method (old) to http.request.method (new) is a real-world example: both names exist for a period, the new name is preferred, and Collector transform processors can normalize older signals so your dashboards survive the transition [Source: https://opentelemetry.io/docs/specs/semconv/db/database-spans/]. Anchor dashboards on Stable attributes; treat Experimental ones as opt-in extras.
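
The normalization step can be pictured as a simple rename table. The sketch below is plain Python rather than a Collector transform configuration; the mapping contains the http.method example from the text plus http.status_code as a second illustrative entry, and the function name is hypothetical.

```python
# Illustrative rename table: legacy attribute key -> current key.
LEGACY_TO_STABLE = {
    "http.method": "http.request.method",
    "http.status_code": "http.response.status_code",
}


def normalize_attributes(attributes: dict) -> dict:
    """Rename legacy keys, preferring the new key when both are present."""
    normalized = {k: v for k, v in attributes.items() if k not in LEGACY_TO_STABLE}
    for old, new in LEGACY_TO_STABLE.items():
        if old in attributes:
            normalized.setdefault(new, attributes[old])
    return normalized


old_span = {"http.method": "GET", "http.status_code": 503, "http.route": "/orders/{id}"}
print(normalize_attributes(old_span))
```

Running the rename before the data reaches the backend means dashboards only ever need to query the new names.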

Key Takeaway: Semantic conventions are what make OpenTelemetry portable. Use stable attribute names (http.request.method, http.response.status_code, db.system, db.operation) on every signal, set resource attributes once for service identity, and treat cardinality and PII as instrumentation-time concerns — once they reach the Collector, the damage is harder to contain.

Chapter Summary

This chapter mapped the three pillars of OpenTelemetry instrumentation. Manual instrumentation — acquiring named tracers and meters, creating spans, choosing the right metric instrument — is where business meaning enters telemetry; nothing else can capture tenant_id, feature_flag, or payment.outcome. Automatic instrumentation runs in three flavors depending on the runtime: Java’s -javaagent bytecode rewriting, Python’s opentelemetry-instrument monkey-patching, and Node.js’s require-hook patching, all sharing a common OTEL_* environment-variable contract. Zero-code instrumentation comes in two forms: eBPF agents that watch kernel and library hooks for every process on a host, and the OpenTelemetry Operator’s Instrumentation CRD that injects SDK agents into Kubernetes pods at admission. Each strategy has a sweet spot: eBPF for fast, broad, polyglot coverage; the Operator for fleet-wide K8s rollouts; manual for the business-critical paths. Finally, semantic conventions — stable attribute names for HTTP, RPC, database, and messaging, plus resource attributes for service identity, plus disciplined cardinality and PII hygiene — are what turn instrumentation into vendor-portable, durable observability.

In Chapter 7 we turn to distributed tracing: how OpenTelemetry traces are structured, how trace context is propagated across service and protocol boundaries, how to instrument code so that traces stay useful, and how to visualize and analyze them in Jaeger and Grafana Tempo.

Key Terms

| Term | Definition |
|---|---|
| Tracer | Named, versioned SDK object used to create spans for a given library or module. |
| Meter | Named, versioned SDK object used to create metric instruments (Counter, UpDownCounter, Histogram, Gauge). |
| Instrument | A typed metric primitive (Counter for monotonic sums, UpDownCounter for values that rise and fall, Histogram for distributions, Gauge for sampled current values). |
| Auto-instrumentation | Runtime patching of common libraries to emit telemetry without source changes: Java agent, Python monkey-patching, Node.js require hooks. |
| OpenTelemetry Operator | Kubernetes operator that manages Collectors and uses an Instrumentation CRD + mutating webhook to inject auto-instrumentation agents into annotated pods. |
| eBPF | Linux kernel facility for loading sandboxed bytecode attached to kprobes, uprobes, and tracepoints; enables language-agnostic zero-code observability. |
| Semantic Conventions | OpenTelemetry-defined standard attribute names and meanings for HTTP, RPC, database, messaging, and other domains; the basis of vendor-portable dashboards. |
| Resource Attributes | Attributes describing the emitter of telemetry (service.name, service.version, deployment.environment, Kubernetes identifiers), set once per SDK instance. |

Chapter 7: Distributed Tracing with OpenTelemetry

In a monolithic application, when a user clicks “Place Order” and the page hangs, a developer can attach a debugger and walk through the call stack. The execution path is linear, the variables are local, and the entire story of the request lives in one process. In a cloud-native system, that same click might cross a dozen services, two message brokers, three databases, and a handful of language runtimes — each with its own logs, clocks, and failure modes. The stack trace is gone. What replaces it is the distributed trace: a stitched-together view of how a single request flowed through the system, who called whom, how long each hop took, and where things went wrong.

OpenTelemetry (OTel) is the open standard that makes those traces portable. It defines a data model for traces, a set of propagation formats for carrying trace identity over the wire, and APIs/SDKs for emitting trace data from instrumented applications. This chapter explains how OpenTelemetry traces are structured, how trace context is propagated across service and protocol boundaries, how to instrument code so traces are useful (and not just voluminous), and how to visualize and analyze trace data in tools like Jaeger and Grafana Tempo.

Learning Objectives

By the end of this chapter, you should be able to:


7.1 Trace Data Model

A trace is, formally, a directed acyclic graph of spans that share a common TraceId. Each span represents one unit of work — an HTTP handler, a database query, a queue publish — and carries a name, a start/end timestamp, a parent reference, attributes, events, status, and a kind. A trace is what you get when you collect all the spans for one request and arrange them in causal order.

The mental model worth carrying is this: a span is to a trace what a stack frame is to a stack trace, except spans cross process boundaries and overlap in time when work happens in parallel.
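
A minimal sketch of that mental model: given a flat list of spans with parent references, the tree can be rebuilt by grouping spans under their parent. The Span dataclass and render_tree function are hypothetical stand-ins rather than OpenTelemetry types, and the timings are a subset of those in Figure 7.1.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int


def render_tree(spans: list[Span]) -> list[str]:
    """Return one indented line per span, children ordered by start time."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)
    lines = []

    def walk(parent_id, depth):
        for span in sorted(children[parent_id], key=lambda s: s.start_ms):
            duration = span.end_ms - span.start_ms
            lines.append(f"{'  ' * depth}{span.name} ({duration}ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


spans = [
    Span("D", "B", "POST /charge", 30, 300),
    Span("A", None, "POST /checkout", 0, 480),
    Span("B", "A", "payment.charge", 20, 310),
    Span("C", "A", "inventory.reserve", 20, 180),
]
print("\n".join(render_tree(spans)))
```

The spans arrive in arbitrary order, as they would from different services, and only the parent references are needed to restore the causal structure.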

Figure 7.1: Parent-child span tree for a single trace

flowchart TD
    A["SERVER<br/>POST /checkout<br/>checkout-svc<br/>0 - 480ms"] --> B["CLIENT<br/>payment.charge<br/>checkout-svc<br/>20 - 310ms"]
    A --> C["CLIENT<br/>inventory.reserve<br/>checkout-svc<br/>20 - 180ms"]
    B --> D["SERVER<br/>POST /charge<br/>payments-svc<br/>30 - 300ms"]
    C --> E["SERVER<br/>POST /reserve<br/>inventory-svc<br/>30 - 170ms"]
    D --> F["CLIENT<br/>db.query users<br/>payments-svc<br/>50 - 110ms"]
    D --> G["CLIENT<br/>POST gateway<br/>payments-svc<br/>120 - 290ms"]
    E --> H["CLIENT<br/>db.update stock<br/>inventory-svc<br/>40 - 160ms"]

    classDef server fill:#1f3a5f,stroke:#58a6ff,color:#fff
    classDef client fill:#3a2f5f,stroke:#a78bfa,color:#fff
    class A,D,E server
    class B,C,F,G,H client

TraceId, SpanId, and TraceFlags

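Three identifiers form the backbone of every span context: the TraceId (16 bytes, shared by every span in the trace), the SpanId (8 bytes, unique to a single span), and TraceFlags (one byte, whose lowest bit records whether the trace is sampled).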

Together, these three fields form the SpanContext, the minimal envelope of identity that must travel between services for a trace to remain coherent. Everything else — names, attributes, events, status — is local to each span and is exported to a backend; only the SpanContext crosses the wire.

A useful analogy: think of TraceId as the conference badge color (everyone at the same event shares it), SpanId as the individual badge number (each person is unique), and TraceFlags as whether the photographer is allowed to publish your photo (sampled or not).

Span Kinds: SERVER, CLIENT, PRODUCER, CONSUMER, INTERNAL

SpanKind is a small enum that tells backends what role the span plays in a distributed conversation. Without it, a backend cannot tell whether a span represents the inbound side or the outbound side of an RPC, and dependency graphs become guesswork.

| Span Kind | Role | Typical Example | Pairs With |
|---|---|---|---|
| SERVER | Synchronous inbound request handler | HTTP handler, gRPC server method | CLIENT (caller) |
| CLIENT | Synchronous outbound call to a remote service | http.Client.Do, gRPC client stub, JDBC query | SERVER (callee) |
| PRODUCER | Asynchronous send onto a queue/topic | kafka.Producer.Send, SQS publish | CONSUMER (later) |
| CONSUMER | Asynchronous receive/process from a queue/topic | Kafka consumer loop, SQS poll-and-process | PRODUCER (earlier) |
| INTERNAL | Local work, not a network hop | Business-logic function, JSON parsing, an expensive loop | n/a |

The default kind is INTERNAL. Setting the kind correctly is what allows Tempo’s service-graph processor (covered in §7.4) to pair CLIENT spans with downstream SERVER spans and build a dependency map [Source: https://grafana.com/docs/loki/latest/query/log_queries/].

Beyond identity and kind, a span carries three additional structures that turn raw timing data into a diagnosis:

Figure 7.2: Span kinds and how they pair across a distributed call

graph LR
    subgraph svcA["Service A"]
        A1["SERVER<br/>POST /checkout"]
        A2["INTERNAL<br/>validate_cart"]
        A3["CLIENT<br/>POST /charge"]
        A4["PRODUCER<br/>orders.created publish"]
        A1 --> A2
        A1 --> A3
        A1 --> A4
    end

    subgraph svcB["Service B"]
        B1["SERVER<br/>POST /charge"]
        B2["CLIENT<br/>db.query"]
        B1 --> B2
    end

    subgraph svcC["Kafka + Consumer"]
        C1["CONSUMER<br/>orders.created process"]
    end

    A3 -.->|"HTTP<br/>traceparent"| B1
    A4 -.->|"Kafka<br/>traceparent header"| C1

    classDef server fill:#1f3a5f,stroke:#58a6ff,color:#fff
    classDef client fill:#3a2f5f,stroke:#a78bfa,color:#fff
    classDef producer fill:#1f5f3a,stroke:#34d399,color:#fff
    classDef consumer fill:#5f3a1f,stroke:#fbbf24,color:#fff
    classDef internal fill:#2a2a2a,stroke:#888,color:#fff
    class A1,B1 server
    class A3,B2 client
    class A4 producer
    class C1 consumer
    class A2 internal

Key Takeaway: A trace is a graph of spans tied together by a shared TraceId; each span carries a SpanId, a kind (SERVER/CLIENT/PRODUCER/CONSUMER/INTERNAL), a status, attributes, events, and optional links. Setting these correctly is what turns raw timing into a diagnosable picture of a request.


7.2 Context Propagation

A trace only works if every service in the request path reads, preserves, and forwards the SpanContext. That cross-process handoff is called context propagation, and it is implemented by propagators — small objects that know how to inject context into outbound carriers (HTTP headers, gRPC metadata, Kafka record headers) and extract context from inbound carriers.

OpenTelemetry’s default wire format for HTTP is the W3C Trace Context standard, but the SDKs also ship propagators for B3 (Zipkin) and Jaeger to interoperate with legacy systems.

W3C Trace Context: traceparent and tracestate

The W3C spec defines two HTTP headers, traceparent and tracestate [Source: https://www.w3.org/TR/trace-context/].

The traceparent header is mandatory and has a fixed, dash-separated format:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             |  |                                |                |
             |  |                                |                +-- trace-flags (01 = sampled)
             |  |                                +-- parent span-id (16 hex)
             |  +-- trace-id (32 hex, globally unique per trace)
             +-- version (currently "00")

The four fields are:

  1. version: 00 today. Implementations that see a higher version should still try to parse the four known fields and ignore anything after them; a version 00 header has exactly these four fields.
  2. trace-id: 32 lowercase hex characters (16 bytes). Maps to OTel’s TraceId.
  3. parent-id (the sender’s span-id): 16 lowercase hex characters (8 bytes). This is the sender’s span, which becomes the parent of any span the receiver creates.
  4. trace-flags: 2 hex characters. Only bit 0 is defined; 01 is sampled, 00 is not.
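
The four fields above can be validated and split with a short parser. This is a minimal sketch that accepts only the strict version 00 shape; it is not the OpenTelemetry propagator, and the names ParsedTraceparent and parse_traceparent are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id - parent-id - trace-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class ParsedTraceparent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[ParsedTraceparent]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero identifiers are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    sampled = bool(int(flags, 16) & 0x01)  # only bit 0 is defined
    return ParsedTraceparent(version, trace_id, parent_id, sampled)


parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(parsed)
```

A receiver that gets None back would normally start a fresh trace rather than fail the request.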

The tracestate header is optional and carries an ordered list of vendor-specific key–value pairs:

tracestate: ot=foo:bar,ro=1,congo=t61rcWkgMzE

Rules worth knowing [Source: https://www.w3.org/TR/trace-context/]:

tracestate is where vendors stash routing hints, custom sampling decisions, or legacy correlation IDs without disturbing the standardized traceparent.

B3 and Jaeger Legacy Formats

Many production systems pre-date W3C Trace Context. OpenTelemetry includes propagators for two legacy formats so new W3C-aware services can interoperate with them.

B3 (Zipkin) has two variants. The multi-header form uses separate headers:

X-B3-TraceId: 4bf92f3577b34da6a3ce929d0e0e4736
X-B3-SpanId: 00f067aa0ba902b7
X-B3-ParentSpanId: 5e0c63257de34c92
X-B3-Sampled: 1
X-B3-Flags: 0

The single-header form packs everything into one header:

b3: 4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-1-5e0c63257de34c92

The X-B3-Flags: 1 value (or the equivalent in the single-header form) signals debug, which forces the trace to be sampled regardless of X-B3-Sampled.

Jaeger uses a single header named uber-trace-id:

uber-trace-id: 4bf92f3577b34da6a3ce929d0e0e4736:00f067aa0ba902b7:5e0c63257de34c92:1

The fields are colon-separated: trace-id : span-id : parent-span-id : flags, where flags is a one-byte bitmap written as one or two hex digits. Bit 1 (value 0x01) means sampled and bit 2 (value 0x02) means debug.

The three formats encode the same logical SpanContext but differ in surface syntax, sampling semantics, and how they handle debug traces:

| Aspect | W3C Trace Context | B3 Multi-header | B3 Single-header | Jaeger (uber-trace-id) |
|---|---|---|---|---|
| Header(s) | traceparent, tracestate | X-B3-TraceId, X-B3-SpanId, X-B3-Sampled, X-B3-Flags, X-B3-ParentSpanId | b3 | uber-trace-id |
| TraceId length | 128-bit (32 hex) | 64 or 128-bit | 64 or 128-bit | 64 or 128-bit |
| Field separator | - | (separate headers) | - | : |
| Sampled flag | trace-flags bit 0 | X-B3-Sampled: 1/0 | 3rd field: 1/0/d | flags bit 1 |
| Debug flag | not defined | X-B3-Flags: 1 → force sampled | 3rd field: d → force sampled | flags bit 2 |
| Vendor extensions | tracestate | none | none | none |
| OTel default? | Yes (paired with Baggage) | Opt-in propagator | Opt-in propagator | Opt-in propagator |

OpenTelemetry SDKs let you compose multiple propagators so a single service can accept and emit several formats at once. In Go, that looks like [Source: https://www.w3.org/TR/trace-context/]:

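import (
    "go.opentelemetry.io/contrib/propagators/b3"
    "go.opentelemetry.io/contrib/propagators/jaeger"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

otel.SetTextMapPropagator(
    propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},   // W3C traceparent + tracestate
        propagation.Baggage{},        // W3C baggage
        b3.New(b3.WithInjectEncoding(b3.B3SingleHeader)), // inject the single b3 header
        jaeger.Jaeger{},              // uber-trace-id
    ),
)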

On extract, the propagators run in the order listed and each looks for its own header. If more than one format is present and valid, the propagator that runs last overwrites the context extracted by the earlier ones, so the list order decides which format takes precedence. On inject, every enabled propagator writes its format, so the outbound request carries traceparent, b3, and uber-trace-id simultaneously. That redundancy is the migration trick: roll out W3C alongside B3/Jaeger, let downstream services read whichever they understand, then remove legacy formats once the fleet is fully W3C-aware.

The mapping between sampled flags is straightforward but must be preserved: B3 X-B3-Flags=1 (debug) and Jaeger flags & 0x02 both force-sample and should map to W3C trace-flags=01; B3 X-B3-Sampled=1 and Jaeger flags & 0x01 map directly to W3C bit 0.
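
The flag mapping can be written down as two small converters. This is a sketch under stated assumptions, not the behavior of any particular propagator: it handles only the full b3 single-header form, treats a missing sampling state as not sampled, left-pads 64-bit trace IDs with zeros, and parses Jaeger flags as hex digits (which coincides with decimal for the values 0 to 3). All function names are hypothetical.

```python
def _traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    # Left-pad 64-bit identifiers so the trace-id is always 32 hex characters.
    flags = "01" if sampled else "00"
    return f"00-{trace_id.lower().zfill(32)}-{span_id.lower().zfill(16)}-{flags}"


def b3_single_to_traceparent(value: str) -> str:
    """Convert a full b3 header: traceid-spanid[-state[-parentspanid]]."""
    parts = value.split("-")
    trace_id, span_id = parts[0], parts[1]
    state = parts[2] if len(parts) > 2 else "0"  # assumption: absent means not sampled
    return _traceparent(trace_id, span_id, sampled=state in ("1", "d"))


def jaeger_to_traceparent(value: str) -> str:
    """Convert uber-trace-id: traceid:spanid:parentspanid:flags."""
    trace_id, span_id, _parent, flags = value.split(":")
    bits = int(flags, 16)
    # Sampled (0x01) or debug (0x02) both map to the W3C sampled bit.
    return _traceparent(trace_id, span_id, sampled=bool(bits & 0x03))


print(b3_single_to_traceparent(
    "4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-1-5e0c63257de34c92"))
print(jaeger_to_traceparent(
    "4bf92f3577b34da6a3ce929d0e0e4736:00f067aa0ba902b7:5e0c63257de34c92:1"))
```

Both calls print the same traceparent value, which is the point: the legacy headers and the W3C header describe the same SpanContext.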

Figure 7.3: traceparent propagation across a composite-propagator chain

sequenceDiagram
    participant C as Client
    participant A as Service A<br/>(W3C + B3)
    participant B as Service B<br/>(W3C + B3)
    participant D as Service C<br/>(B3 only)

    C->>A: HTTP request<br/>traceparent: 00-{trace-id}-{span-C}-01
    A->>A: extract context,<br/>start SERVER span {span-A}
    A->>B: HTTP request<br/>traceparent: 00-{trace-id}-{span-A}-01<br/>b3: {trace-id}-{span-A}-1<br/>baggage: user.id=12345
    B->>B: extract traceparent,<br/>start SERVER span {span-B}
    B->>D: HTTP request<br/>traceparent: 00-{trace-id}-{span-B}-01<br/>b3: {trace-id}-{span-B}-1
    D->>D: extract b3 header,<br/>start SERVER span {span-D}
    D-->>B: response
    B-->>A: response
    A-->>C: response

    Note over C,D: Same trace-id flows through<br/>all four hops despite mixed formats

Baggage for Cross-Cutting Attributes

Baggage is a separate W3C specification that travels alongside (but independently of) trace context. It is a set of key–value pairs stored on the Context — not on any one span — and propagated via the baggage HTTP header [Source: https://blog.nimblepros.com/blogs/otel/].

baggage: user.id=12345, tenant=acme-corp, feature.checkout_v2=enabled

Baggage is for cross-cutting, request-scoped context that every service might want to attach to its own spans, logs, or metrics: user ID, tenant ID, feature-flag variant, geographic region, support-case ID. Set it once at the edge, and every downstream hop can read it.

The distinction from span attributes is important:

| Aspect | Span attributes | Baggage |
|---|---|---|
| Lives on | A single span | The context (independent of any span) |
| Propagated downstream | No: only the span’s own service sees it | Yes: automatically injected into every outbound request |
| Typical use | Describe this operation (db.statement) | Cross-cutting request data (user.id, tenant.id) |
| Visible to | Only that span in the trace | All spans, logs, and metrics in the request flow |
| Auto-copied to spans? | n/a | No: instrumentation must opt in to copy baggage to spans |

A critical security note: untrusted clients can send any baggage header they want. Edge services should sanitize incoming baggage and apply an allowlist of accepted keys, and outbound calls to third parties should strip internal baggage to avoid leaking identifiers [Source: https://blog.nimblepros.com/blogs/otel/]. Never put secrets, tokens, or PII into baggage; treat it as data that may be stored long-term and visible to any service in the path.
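
An edge-side allowlist might look like the following sketch. It parses the baggage header by hand for illustration; a real service would normally rely on the SDK's baggage propagator and filter the resulting entries. The allowlist contents and the function names are hypothetical.

```python
from urllib.parse import unquote

# Hypothetical allowlist of baggage keys this edge service accepts.
ALLOWED_BAGGAGE_KEYS = frozenset({"tenant", "feature.checkout_v2"})


def parse_baggage(header: str) -> dict[str, str]:
    """Split a baggage header into key/value pairs, ignoring member properties."""
    entries = {}
    for member in header.split(","):
        key_value = member.split(";", 1)[0]  # drop optional ;property parts
        if "=" not in key_value:
            continue
        key, value = key_value.split("=", 1)
        entries[key.strip()] = unquote(value.strip())
    return entries


def sanitize_baggage(header: str, allowed=ALLOWED_BAGGAGE_KEYS) -> dict[str, str]:
    """Keep only allowlisted keys from an untrusted baggage header."""
    return {k: v for k, v in parse_baggage(header).items() if k in allowed}


incoming = "user.id=12345, tenant=acme-corp, feature.checkout_v2=enabled, debug=true"
print(sanitize_baggage(incoming))
```

An allowlist is preferable to a denylist here because new, unexpected keys from clients are dropped by default.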

Key Takeaway: Propagation is what stitches local spans into a global trace. W3C Trace Context (traceparent + tracestate) is OpenTelemetry’s default; composite propagators let you co-emit B3 and Jaeger headers for backward compatibility; and W3C Baggage carries cross-cutting request data — but never secrets — alongside the trace.


7.3 Building Useful Traces

Instrumenting a codebase is the easy part: auto-instrumentation libraries will produce spans for every HTTP request and database call out of the box. Producing traces that engineers actually use during an outage takes more thought. The difference between a noisy trace and a debuggable one usually comes down to span names, attribute hygiene, error recording, and knowing when not to create a span.

Naming Spans for Searchability

A span name is the primary identifier users see in Jaeger or Tempo. It should be low-cardinality (so backends can index and aggregate it) but descriptive enough to identify the operation.

The OpenTelemetry semantic conventions provide good defaults:

A simple test: if you imagine 10,000 spans being created, how many distinct names should appear? Tens or low hundreds, not millions. If your span name embeds an order ID or a UUID, it is too specific.
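
That test is easy to run on sample data. The sketch below templates numeric and UUID path segments and counts distinct names before and after. It is an illustration only: when the web framework exposes the matched route, using http.route directly is the more reliable choice, and the helper name templated_span_name is hypothetical.

```python
import re

_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.IGNORECASE
)


def templated_span_name(method: str, path: str) -> str:
    """Replace ID-like path segments with a placeholder to bound cardinality."""
    segments = []
    for segment in path.split("/"):
        if segment.isdigit() or _UUID.match(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return f"{method} {'/'.join(segments)}"


raw_names = {f"GET /orders/{n}" for n in range(10_000)}
templated = {templated_span_name("GET", f"/orders/{n}") for n in range(10_000)}
print(len(raw_names), len(templated))  # 10000 1
```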

Attributes vs. Events vs. Status

Once a span is named, the question becomes what to attach to it. The three OTel facilities serve different purposes:

| Information shape | Use | Example |
|---|---|---|
| Stable property of the operation | Attribute | http.request.method=GET, db.system=postgresql, messaging.system=kafka |
| Cardinality is bounded and useful for filtering | Attribute | http.response.status_code=503, feature_flag.checkout_v2=enabled |
| Timestamped moment within the span | Event | cache.miss, retry.attempt, circuit_breaker.opened |
| Exception / error | Event + status | recordException(e) + setStatus(ERROR, "payment declined") |
| Pass/fail outcome | Status | OK, ERROR |

Follow the OTel semantic conventions for attribute names religiously: http.request.method, http.route, http.response.status_code, db.system, db.statement, rpc.system, rpc.service, rpc.method, messaging.system, messaging.destination.name. Consistent naming is what allows backends to render request panels, build service graphs, and correlate signals across the LGTM stack [Source: https://newrelic.com/blog/log/enrich-logs-with-opentelemetry-collector].

Recording Exceptions and Error Status

When an instrumented function throws, two things should happen: the exception is recorded as an event, and the span’s status is set to ERROR. Most language SDKs provide a helper for each step:

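from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# amount, card, gateway, and PaymentDeclined come from the surrounding application code.
# By default the context manager also records the exception and sets ERROR status
# on exit; turning that off avoids a duplicate event and keeps the description below.
with tracer.start_as_current_span(
    "charge_payment", record_exception=False, set_status_on_exception=False
) as span:
    span.set_attribute("payment.amount_cents", amount)
    try:
        gateway.charge(card, amount)
    except PaymentDeclined as e:
        span.record_exception(e)  # adds an "exception" event to the span
        span.set_status(trace.StatusCode.ERROR, "payment declined")
        raise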

Three habits to internalize:

  1. Record then re-raise unless you are deliberately swallowing the exception. Recording without re-raising can hide bugs.
  2. Status ERROR is the signal that backends use to color spans red and that Tempo’s metrics-generator uses to count errors in RED metrics (§7.4). If you forget to set it, your error rate dashboards will lie.
  3. HTTP 4xx is not automatically an error on SERVER spans — the server worked correctly; the client sent a bad request. Reserve ERROR for 5xx or unhandled exceptions on the server side. On CLIENT spans, both 4xx and 5xx are typically errors from the caller’s perspective.
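
The third habit reduces to a small decision rule. The function below is a sketch of that rule as described in the list above, using plain strings instead of SDK enums; the name http_span_status is hypothetical.

```python
def http_span_status(span_kind: str, status_code: int) -> str:
    """Map an HTTP response code to a span status, depending on the span kind."""
    if status_code >= 500:
        return "ERROR"  # server-side failure is an error for both sides
    if status_code >= 400 and span_kind == "CLIENT":
        return "ERROR"  # the caller did not get what it asked for
    return "UNSET"      # a SERVER span that answered 4xx worked correctly


for kind, code in [("SERVER", 404), ("CLIENT", 404), ("SERVER", 503), ("CLIENT", 200)]:
    print(kind, code, http_span_status(kind, code))
```

Returning UNSET rather than OK for the non-error cases leaves room for application code to set an explicit status later.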

Avoiding Span Explosion in Tight Loops

The most common instrumentation mistake is creating a span per iteration of a loop. Imagine processing 5,000 Kafka messages in a single poll:

# Wrong — 5,001 spans per batch, blows up trace storage and indexing
with tracer.start_as_current_span("process_batch") as batch_span:
    for msg in batch:
        with tracer.start_as_current_span("process_message") as msg_span:
            msg_span.set_attribute("messaging.message_id", msg.id)
            handle(msg)

Most messages are uninteresting, but they all get the same span treatment. Better patterns:

A useful guideline: a span should represent a unit of work big enough that you might one day look at it in a UI. If you would never click on it, do not create it.
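
One way to apply that guideline to the batch example is a single span for the batch, a counter for the routine outcomes, and an event only for the messages that fail. The sketch below uses a hypothetical RecordingSpan stand-in so that it runs without the OpenTelemetry SDK; in real code the span would come from a tracer and the counts from a metrics Counter.

```python
from collections import Counter


class RecordingSpan:
    """Stand-in for an SDK span so the sketch runs without OpenTelemetry installed."""

    def __init__(self, name):
        self.name = name
        self.attributes = {}
        self.events = []

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def add_event(self, name, attributes=None):
        self.events.append((name, attributes or {}))


def process_batch(batch, handle):
    span = RecordingSpan("process_batch")
    outcomes = Counter()  # would be a metrics counter in real code
    span.set_attribute("messaging.batch.size", len(batch))
    for msg in batch:
        try:
            handle(msg)
            outcomes["ok"] += 1
        except Exception as exc:
            outcomes["failed"] += 1
            # Only failures leave a per-message footprint on the trace.
            span.add_event("message.failed", {"messaging.message_id": msg, "error": repr(exc)})
    span.set_attribute("batch.failed_count", outcomes["failed"])
    return span, outcomes


def handle(msg):
    if msg % 1000 == 0:  # simulate an occasional bad message
        raise ValueError(f"cannot process message {msg}")


span, outcomes = process_batch(list(range(1, 5001)), handle)
print(len(span.events), dict(outcomes))
```

The batch of 5,000 messages produces one span with five events instead of 5,001 spans, while the counter still accounts for every message.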

Figure 7.4: Span explosion vs. disciplined instrumentation

flowchart TB
    subgraph wrong["Wrong: 5001 spans per batch"]
        W1["process_batch SERVER"]
        W2["process_message x 5000<br/>uniform child spans<br/>blows up trace storage"]
        W1 --> W2
    end

    subgraph right["Right: 1 span + events + metrics"]
        R1["process_batch SERVER<br/>messaging.batch.size=5000"]
        R2["handle_failed_message<br/>(child span, only on error) x 3"]
        R3["events: cache.miss,<br/>retry.attempt, dlq.send"]
        R4["counter: messages_processed_total<br/>(metrics, not spans)"]
        R1 --> R2
        R1 -.-> R3
        R1 -.-> R4
    end

    classDef bad fill:#5f1f1f,stroke:#f87171,color:#fff
    classDef good fill:#1f5f3a,stroke:#34d399,color:#fff
    classDef neutral fill:#1f3a5f,stroke:#58a6ff,color:#fff
    class W1,W2 bad
    class R1,R2 good
    class R3,R4 neutral

Key Takeaway: Useful traces follow the semantic conventions, use low-cardinality span names, distinguish attributes (stable properties) from events (timestamped moments), set ERROR status deliberately, and avoid emitting a span per loop iteration — events, counters, or links usually serve better.


7.4 Trace Visualization and Analysis

Generated traces are only valuable if engineers can find and read them. Two open-source backends dominate the OpenTelemetry ecosystem: Jaeger, the original CNCF tracing project, and Grafana Tempo, the high-scale, object-storage-backed tracing backend in the Grafana LGTM stack. Both ingest OTLP, both render traces as Gantt-style waterfalls, and both produce RED-style metrics from spans — but the way they store, scale, and integrate differs.

Jaeger UI

Jaeger’s UI is a focused trace explorer. The core views are:

Jaeger stores traces in Cassandra, Elasticsearch, or OpenSearch (with experimental support for object storage), and is well-suited to mid-scale deployments.

Grafana Tempo

Tempo takes a different design tack: it stores spans in object storage (S3, GCS, Azure Blob) and indexes only the TraceId, making per-trace lookup cheap but full-text span search expensive. Tempo’s bet is that most trace queries come from exemplars — a metric or log line that already gives you the TraceId — and that for ad-hoc search you can use a separate index or the TraceQL query language.

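Tempo’s signature feature is the metrics-generator, a component that reads spans from the ingest pipeline and emits Prometheus metrics in real time [Source: https://grafana.com/docs/loki/latest/query/log_queries/]. It runs two processors: service_graphs, which pairs CLIENT spans with their downstream SERVER spans to produce per-edge metrics, and span_metrics, which aggregates spans into per-service RED metrics.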

A minimal Tempo configuration:

metrics_generator:
  processor:
    service_graphs:
      enabled: true
      wait: 10s
      max_items: 10000
      peer_attributes:
        - peer.service
        - db.name
        - messaging.system
    span_metrics:
      enabled: true
      dimensions:
        - http.method
        - http.status_code
        - rpc.system
      include_span_kinds:
        - server
        - consumer

Service Maps and Dependency Graphs

A service map is the aggregate dependency graph derived from many traces. Both Jaeger and Tempo can render one. The accuracy of the map depends entirely on the correctness of:

When all three are right, the service map shows a directed graph with edges colored by error rate, throughput, or p95 latency — a near-real-time view of system topology that is impossible to maintain by hand.

Figure 7.5: Service map derived from trace data with RED metrics on each edge

flowchart LR
    web["web<br/>SERVER"]
    gw["api-gateway<br/>SERVER + CLIENT"]
    orders["orders<br/>SERVER + CLIENT"]
    pay["payments<br/>SERVER + CLIENT"]
    inv["inventory<br/>SERVER + CLIENT"]
    db[("db<br/>peer.service")]

    web -->|"rate 920/s<br/>err 0.1%<br/>p95 95ms"| gw
    gw -->|"rate 880/s<br/>err 0.2%<br/>p95 180ms"| orders
    gw ===>|"rate 412/s<br/>err 3.4%<br/>p95 820ms"| pay
    orders -->|"rate 720/s<br/>err 0.1%<br/>p95 60ms"| inv
    orders -->|"rate 720/s<br/>err 0.0%<br/>p95 40ms"| db
    pay -->|"rate 410/s<br/>err 0.1%<br/>p95 35ms"| db
    inv -->|"rate 720/s<br/>err 0.0%<br/>p95 30ms"| db

    classDef ok fill:#1f5f3a,stroke:#34d399,color:#fff
    classDef hot fill:#5f1f1f,stroke:#f87171,color:#fff
    classDef store fill:#3a2f5f,stroke:#a78bfa,color:#fff
    class web,gw,orders,inv ok
    class pay hot
    class db store

Trace-Based Metrics: RED and USE Generation

The RED method — Rate, Errors, Duration — is the de facto SLI vocabulary for request-driven services. Tempo’s span_metrics processor emits exactly the time series needed to compute RED via PromQL:

Rate per service:

sum by (service) (
  rate(traces_spanmetrics_calls_total{span_kind="SPAN_KIND_SERVER"}[5m])
)

Errored-call rate per service:

sum by (service) (
  rate(
    traces_spanmetrics_calls_total{
      span_kind="SPAN_KIND_SERVER",
      status_code="STATUS_CODE_ERROR"
    }[5m]
  )
)

p95 duration per service:

histogram_quantile(
  0.95,
  sum by (service, le) (
    rate(traces_spanmetrics_latency_bucket{span_kind="SPAN_KIND_SERVER"}[5m])
  )
)

The service_graphs processor emits parallel metrics keyed by (client, server) so you can ask the same questions per edge rather than per service — useful when a problem isn’t in a service but in a particular dependency between two services.

The USE method (Utilization, Saturation, Errors) applies to resources rather than requests, but trace spans can contribute to USE too. A span on a database client carries db.system and peer.service attributes; aggregating its duration and error counts gives you per-database errors and saturation indicators. Resource-level utilization (CPU, memory) still comes from Prometheus exporters, but traces give you USE from the consumer’s perspective: how much of a downstream resource each caller is using.

Caveats for Trace-Derived Metrics

Two pitfalls deserve emphasis [Source: https://grafana.com/docs/loki/latest/query/log_queries/]:

  1. Sampling distorts rate. Head-based sampling at 10% means trace-derived metrics report roughly 1/10 of true request rate. Tail-based sampling that preferentially keeps errors and slow traces over-represents errors in the metric stream. Many teams treat trace-derived metrics as a correlation tool and keep direct application metrics as the SLO source of truth.
  2. Cardinality. Each dimension (http.method, http.status_code) becomes a Prometheus label. Adding user_id or request_id to span metrics will blow up cardinality and crash your TSDB. Stick to bounded labels: service, operation, status, method, coarse path.
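Both pitfalls come down to simple arithmetic, sketched below in Go. The function names and the label counts are hypothetical; the sampling correction assumes uniform head-based sampling with a known probability, which does not hold for tail-based sampling.

```go
package main

import "fmt"

// estimatedRate scales a rate observed after head-based sampling back to an
// estimate of the true rate. It assumes uniform sampling with probability p.
func estimatedRate(observedPerSec, p float64) float64 {
	return observedPerSec / p
}

// maxSeries returns the worst-case number of time series for one metric:
// the product of the distinct-value counts of its labels.
func maxSeries(distinctValues map[string]int) int {
	total := 1
	for _, n := range distinctValues {
		total *= n
	}
	return total
}

func main() {
	// 92 req/s seen in span metrics at 10% sampling is roughly 920 req/s.
	fmt.Printf("estimated true rate: %.0f req/s\n", estimatedRate(92, 0.10))

	labels := map[string]int{"service": 20, "http.method": 5, "http.status_code": 10}
	fmt.Println(maxSeries(labels)) // 1000

	// One unbounded label multiplies everything that came before it.
	labels["user_id"] = 100000
	fmt.Println(maxSeries(labels)) // 100000000
}
```

The product is an upper bound, since not every label combination occurs in practice, but it shows why a single unbounded label dominates the total.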

Jaeger SPM vs. Tempo Metrics-Generator

The two systems converge on the same goal — RED metrics from spans — but differ in placement and tightness of integration:

| Aspect | Jaeger SPM | Tempo Metrics-Generator |
| --- | --- | --- |
| Implementation | OTel Collector spanmetrics connector (formerly a processor) in front of Jaeger | Built into Tempo as a first-class component |
| Service graphs | Often a separate processor or external tool | Native service_graphs processor in metrics-generator |
| Storage backend | Cassandra, Elasticsearch, OpenSearch | Object storage (S3/GCS/Azure Blob) |
| Metrics destination | Prometheus | Prometheus / Mimir |
| Grafana integration | Good | Native (designed alongside Grafana) |
| Multi-tenancy | Limited | First-class per-tenant isolation |
| Best fit | Existing Jaeger deployments, on-prem Cassandra/ES stacks | Cloud-native, object-storage-backed, LGTM stack adopters |

Either choice still benefits from running an OTel Collector in front of the tracing backend: the Collector handles batching, retries, tail-based sampling, attribute scrubbing, and multi-backend fan-out, leaving the backend to do storage and query.

Key Takeaway: Jaeger and Tempo both render traces as Gantt waterfalls and derive RED metrics from spans, but Tempo couples object storage with a built-in metrics-generator that emits both per-service and per-edge metrics into Prometheus. Trace-derived metrics are excellent for exploration and service maps, but sampling and cardinality limits mean teams should keep direct application metrics as the SLO source of truth.


Chapter Summary

Distributed tracing makes the invisible visible: it stitches the journey of one request across services, queues, and databases into a single causal graph. OpenTelemetry defines that graph as a tree of spans sharing a TraceId, each span carrying a SpanId, a kind (SERVER/CLIENT/PRODUCER/CONSUMER/INTERNAL), a status, attributes, events, and optional links. Setting the kind correctly is what enables backends to build dependency maps; setting status and recording exceptions is what makes errors searchable.

Context propagation is the wire-level mechanism that keeps spans in the same trace. The W3C Trace Context spec defines the traceparent header (version, trace-id, parent span-id, sampled flag) and the optional tracestate for vendor-specific data, and is OpenTelemetry’s default propagator. Legacy B3 (multi-header or single-header b3:) and Jaeger (uber-trace-id) propagators are available for interoperability, and composite propagators let services emit and accept multiple formats simultaneously — the foundation of a gradual migration to W3C. Baggage is a separate, complementary propagator for cross-cutting request data (user.id, tenant, feature flags) that lives on the context rather than on any one span; it must never carry secrets and should be sanitized at trust boundaries.

Useful traces require discipline: low-cardinality span names following the OTel semantic conventions, attributes for stable properties, events for timestamped moments, deliberate ERROR status on real failures, and the willingness to not emit a span per loop iteration. Events, counters, and links usually serve the high-cardinality cases better than per-iteration spans.

Visualization closes the loop. Jaeger is the classic, focused trace explorer with strong search and dependency-graph features. Grafana Tempo stores spans cheaply in object storage and ships with a metrics-generator that turns spans into Prometheus RED metrics — per service via span_metrics and per dependency edge via service_graphs. Both tools support the RED method directly and contribute to USE-style resource views. Trace-derived metrics are powerful for correlation and dependency analysis but should not replace direct application metrics for SLO accounting, because sampling and cardinality choices distort the signal.

A well-instrumented system gives an on-call engineer three things at 3 a.m.: a metric that says “errors are up,” a log line with a TraceId, and a trace that points at the exact span where the request died. Everything in this chapter exists to make that handoff work.


Key Terms

| Term | Definition |
| --- | --- |
| TraceId | 128-bit identifier (32 hex chars) shared by every span in a single logical request; globally unique per trace. |
| SpanId | 64-bit identifier (16 hex chars) that uniquely identifies one span within a trace. |
| span kind | Enum (SERVER, CLIENT, PRODUCER, CONSUMER, INTERNAL) describing the span’s role in a distributed conversation; required for accurate service graphs. |
| traceparent | W3C header carrying version, trace-id, parent span-id, and trace-flags: 00-<trace-id>-<span-id>-<flags>. Mandatory for W3C Trace Context. |
| tracestate | Optional W3C header carrying an ordered list of vendor-specific key=value entries; up to ~32 entries, ~512 chars total. |
| trace-flags | 8-bit field in traceparent; the least significant bit (mask 0x01) is the sampled/recorded flag (01 = sampled). |
| baggage | Key–value data on the distributed context (W3C baggage header) propagated across services for cross-cutting request data like user.id, tenant.id, feature flags. |
| W3C Trace Context | The W3C standard for HTTP trace propagation defining traceparent and tracestate; OpenTelemetry’s default propagator. |
| B3 | Legacy Zipkin propagation format using either multi-headers (X-B3-TraceId, X-B3-SpanId, X-B3-Sampled, X-B3-Flags) or a single b3: header. |
| uber-trace-id | Legacy Jaeger propagation header: <trace>:<span>:<parent>:<flags>, where flags encodes sampled (mask 0x01) and debug (mask 0x02). |
| composite propagator | OTel construct that runs multiple propagators in sequence: extractors run in the configured order and all enabled formats inject simultaneously, which enables gradual format migration. |
| span attributes | Key–value pairs attached to a single span; describe that operation; not propagated downstream. |
| span events | Timestamped, named annotations within a span; the OTel-native way to record exceptions and intra-span moments. |
| span links | References from one span to other related SpanContext values; the right tool for fan-in patterns like batch processing. |
| Jaeger | CNCF open-source tracing backend with a trace explorer, dependency graph view, and Service Performance Monitoring (SPM) via the OTel Collector spanmetrics connector. |
| Tempo | Grafana’s object-storage-backed tracing backend with a built-in metrics-generator producing per-service span_metrics and per-edge service_graphs Prometheus metrics. |
| RED method | Rate, Errors, Duration: the standard SLI vocabulary for request-driven services, derivable from span metrics via PromQL. |
| metrics-generator | Tempo component that reads spans and emits Prometheus metrics in real time; runs span_metrics and service_graphs processors. |

Chapter 8: Metrics Pipeline: Bridging OpenTelemetry and Prometheus

OpenTelemetry (OTel) and Prometheus were designed by different communities to solve overlapping but distinct problems. Prometheus grew up around pull-based scraping of cumulative counters with a strict text exposition format. OpenTelemetry was conceived as a vendor-neutral push pipeline with a rich metric data model that includes deltas, exponential histograms, and full resource attribution. In production observability stacks, these two worlds must coexist — usually because teams already operate Prometheus dashboards and alerts, but want to instrument applications with OTel’s polyglot SDKs.

This chapter walks the metric all the way from the instrument call inside your application through the OpenTelemetry SDK, across the wire (OTLP or Prometheus exposition), and into a Prometheus-compatible backend. By the end, you will be able to choose between the four common bridging patterns, configure aggregation temporality correctly for each, and translate OTel attributes into Prometheus labels without quietly losing data.

Learning Objectives

By the end of this chapter, you will be able to:


8.1 OpenTelemetry Metrics Data Model

Before you can bridge OTel metrics into Prometheus, you have to understand what OTel actually produces. OTel’s metrics data model is richer than Prometheus’ exposition format, which is precisely why the bridge is nontrivial.

Figure 8.1: OpenTelemetry metrics data model from Meter to Exporter

flowchart TD
    Meter[Meter]
    Meter --> Sync[Synchronous Instruments]
    Meter --> Obs[Observable Instruments]
    Sync --> C[Counter]
    Sync --> UDC[UpDownCounter]
    Sync --> H[Histogram]
    Obs --> OC[ObservableCounter]
    Obs --> OUDC[ObservableUpDownCounter]
    Obs --> OG[ObservableGauge]
    C --> View[View<br/>rename, filter,<br/>change aggregation]
    UDC --> View
    H --> View
    OC --> View
    OUDC --> View
    OG --> View
    View --> Agg[Aggregation<br/>Sum / LastValue /<br/>Histogram / ExpHistogram]
    Agg --> DP[Data Point<br/>value + attributes +<br/>timestamp + temporality]
    DP --> Exp[Exporter<br/>OTLP / Prometheus / stdout]

Instruments: The Six Core Shapes

An instrument is the API surface your application code calls. OpenTelemetry defines six standard instrument types, organized along two axes: synchronous versus observable, and monotonic versus non-monotonic [Source: https://www.groundcover.com/opentelemetry/opentelemetry-metrics].

| Instrument | Synchronous? | Monotonic? | Typical use case |
| --- | --- | --- | --- |
| Counter | Yes | Yes (only adds) | Requests served, bytes sent, errors raised |
| UpDownCounter | Yes | No (can go down) | In-flight requests, queue depth, open connections |
| Histogram | Yes | n/a (distribution) | Request duration, payload size |
| ObservableCounter | No (callback) | Yes | CPU seconds, GC bytes: anything you read from an OS counter |
| ObservableUpDownCounter | No (callback) | No | Memory in use, thread pool size |
| ObservableGauge | No (callback) | n/a (last value) | Temperature, queue saturation, current load average |

Synchronous instruments are recorded inline on the hot path: requestCounter.Add(ctx, 1, attrs). Observable instruments register a callback that the SDK invokes during each collection cycle, which is useful when the value is already maintained elsewhere (an OS counter, a pool size) and you want to read it once per collection rather than pay for it on every request.

By analogy: synchronous instruments are like ringing a bell every time something happens, while observable instruments are like a thermostat that the SDK reads on a schedule. Both produce time series, but the cost model is very different.
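The cost difference can be shown with a toy model in plain Go. This is not the OpenTelemetry API: the counter and observableGauge types and the collect function are hypothetical stand-ins that only mimic who does the work and when.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// counter models a synchronous instrument: callers pay for Add on every event.
type counter struct{ total atomic.Int64 }

func (c *counter) Add(n int64) { c.total.Add(n) }

// observableGauge models an observable instrument: its read function runs
// only when a collection cycle asks for the current value.
type observableGauge struct{ read func() float64 }

// collect models one collection cycle: it snapshots the counter and invokes
// the gauge callback exactly once.
func collect(c *counter, g *observableGauge) (int64, float64) {
	return c.total.Load(), g.read()
}

func main() {
	requests := &counter{}
	reads := 0
	queueDepth := &observableGauge{read: func() float64 {
		reads++ // track how often the callback actually runs
		return 42
	}}

	for i := 0; i < 1000; i++ {
		requests.Add(1) // the "bell": rung once per request
	}
	total, depth := collect(requests, queueDepth) // the "thermostat": read once
	fmt.Println(total, depth, reads)              // 1000 42 1
}
```

A thousand requests cost a thousand Add calls but only one callback invocation, which is the tradeoff the analogy describes.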

Views: Customizing Aggregation at the SDK

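A View is an SDK configuration mechanism that lets you intercept measurements from a specific instrument and change how they are aggregated, named, or attributed before export. Views are how you rename a metric stream, filter the attributes it keeps, or change its aggregation (for example, swapping in different histogram buckets).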

Views are essential for the OTel-to-Prometheus bridge because they let you operate two pipelines from one instrument: an explicit-bucket histogram for the Prometheus scrape path, and an exponential histogram for an OTLP backend that supports them [Source: https://www.groundcover.com/opentelemetry/opentelemetry-metrics].

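// Go SDK: configure a View to use an exponential histogram
view := sdkmetric.NewView(
    sdkmetric.Instrument{Name: "request_duration_seconds"},
    sdkmetric.Stream{
        // Base-2 exponential buckets; the SDK downscales once MaxSize is exceeded.
        Aggregation: sdkmetric.AggregationBase2ExponentialHistogram{
            MaxSize:  160,
            MaxScale: 10,
        },
    },
)
// Register the View when building the MeterProvider.
provider := sdkmetric.NewMeterProvider(sdkmetric.WithView(view))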

Exponential Histograms

OTel ExponentialHistogram is a compact, base-2 exponential bucket representation. Instead of fixing bucket boundaries up front, it uses a scale parameter s such that bucket i covers the range (2^(i/2^s), 2^((i+1)/2^s)], exclusive at the lower bound and inclusive at the upper. A higher scale gives more buckets per power of 2, that is, more resolution: each power of 2 is split into 2^s buckets.

Exponential histograms have three useful properties:

  1. Automatic dynamic range. If observed values span microseconds to minutes, you do not need to pre-pick boundaries. The aggregator chooses scale automatically.
  2. Bounded memory. When the bucket count would exceed MaxSize, the aggregator downscales — lowering s and merging neighboring buckets. Memory stays constant; precision degrades gracefully.
  3. Separate positive, negative, and zero buckets. Useful for instruments that can take negative values (rare, but well-defined).

Compare this to a classic explicit-bucket histogram, where if your latency suddenly grows tenfold past your last bucket boundary, everything piles into the +Inf overflow bucket and your quantiles become useless. Exponential histograms degrade much more gracefully.

Key Takeaway: OpenTelemetry’s metric model expresses the shape of a measurement (counter, gauge, histogram) and the cadence (synchronous vs. observable) separately from how it is aggregated for export. Views are the configuration knob that lets you serve multiple backends — including Prometheus — from a single instrumentation point.


8.2 Aggregation Temporality

Aggregation temporality is the single most common source of OTel-to-Prometheus bugs. It is the difference between a counter that goes up forever and one that resets every export — and Prometheus’ query language assumes the former.

What Temporality Means

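Temporality describes what time window each exported data point represents [Source: https://www.groundcover.com/opentelemetry/opentelemetry-metrics]. There are two options: cumulative, where each point is the total since a fixed start time, and delta, where each point is the change since the previous export.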

Temporality applies to sums (counters) and histograms. Gauges and last-value metrics are instantaneous snapshots and do not use temporality in the same way — they are simply “the current value at observation time.”

Crucially, temporality is a property of the exported time series, not the instrument. The same Counter can be exported as cumulative to Prometheus and delta to an OTLP backend, possibly through the same Collector.

Cumulative Temporality

A cumulative Counter at time T reports the total events since the SDK started. A cumulative Histogram bucket at time T reports the total observations that fell into that bucket since the SDK started.

The backend computes deltas by subtracting successive samples:

delta = value(T2) - value(T1)

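This is exactly how Prometheus’ rate() and increase() work. The tradeoff: a lost export costs only resolution, because the next sample still carries the running total, but the backend must detect the counter reset when a process restarts.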
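A minimal Go sketch of that differencing, including reset handling, is shown below. It is a simplification under stated assumptions: the function names are hypothetical, and Prometheus additionally extrapolates to the edges of the query window, which this sketch omits.

```go
package main

import "fmt"

// increase sums the growth of a cumulative counter across successive samples.
// A sample lower than its predecessor is treated as a counter reset, so the
// whole new value counts as growth since the restart.
func increase(samples []float64) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		if samples[i] < samples[i-1] {
			total += samples[i]
		} else {
			total += samples[i] - samples[i-1]
		}
	}
	return total
}

// ratePerSecond divides the increase by the window length in seconds.
func ratePerSecond(samples []float64, windowSeconds float64) float64 {
	return increase(samples) / windowSeconds
}

func main() {
	fmt.Println(increase([]float64{5, 12, 18, 25})) // 20
	// The counter restarted between the second and third sample.
	fmt.Println(increase([]float64{5, 12, 3, 10})) // 17
	fmt.Printf("%.3f events/s\n", ratePerSecond([]float64{5, 12, 18, 25}, 45))
}
```

Note what happens if a sample goes missing: dropping the 12 from the first series still yields an increase of 20, which is why a lost export is harmless for cumulative data.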

Delta Temporality

A delta Counter at time T reports events since the previous export. Each exported point is already a per-interval increment.

Side-by-Side Comparison

| Aspect | Cumulative | Delta |
| --- | --- | --- |
| Value at time T | Total since start | Change since last export |
| Lost export | Backend just computes a longer-window rate | Data for that interval is lost forever |
| Process restart | Backend must detect reset | Each export independent; no reset concept |
| Backend rate computation | Backend differences samples | Per-interval value already given |
| Aligns with Prometheus | Natural fit | Not supported natively |
| Typical OTLP push guidance | Supported, less common | Strongly favored for many OTel backends |

When Each Is the Right Choice

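Use cumulative when the destination is Prometheus or another backend that computes rates by differencing successive samples, or when exports may be lost in transit.

Use delta when the OTLP backend prefers per-interval values and the export path is reliable enough that dropped intervals are rare.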

The Collector as Temporality Translator

In a real pipeline, you rarely want to pick one and force every backend to live with it. The dominant 2025 pattern is to let the OpenTelemetry Collector convert temporality per exporter:

  1. Application SDK exports OTLP with AggregationTemporality = DELTA for sums and histograms.
  2. Collector receives delta points.
  3. Collector forwards delta to an OTLP vendor backend that prefers deltas.
  4. Collector accumulates delta into cumulative for the prometheusremotewrite exporter or the /metrics endpoint Prometheus scrapes.

Figure 8.2: Collector as temporality translator — one input, two temporalities out

flowchart LR
    App[Application SDK]
    App -->|OTLP delta| Coll[OpenTelemetry Collector]
    Coll -->|cumulative| PromExp[prometheus exporter]
    Coll -->|delta| OTLPExp[otlp exporter]
    PromExp -->|scrape| Prom[(Prometheus)]
    OTLPExp -->|push| Vendor[(Vendor OTLP backend)]

Cumulative versus delta in the time domain — same underlying event stream, two export shapes:

Figure 8.3: Cumulative vs delta temporality over time

graph LR
    subgraph Cumulative
        C1["t1: 5"] --> C2["t2: 12"] --> C3["t3: 18"] --> C4["t4: 25"]
    end
    subgraph Delta
        D1["t1: +5"] --> D2["t2: +7"] --> D3["t3: +6"] --> D4["t4: +7"]
    end

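The Collector’s prometheus exporter maintains internal state per series so it can sum deltas into a running total. The prometheusremotewrite exporter expects cumulative input, so delta streams bound for it are typically converted first with the deltatocumulative processor. This per-series state is what makes the “delta-from-apps, cumulative-to-Prometheus” pattern work end to end.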
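The per-series state involved is small. The Go sketch below is a toy accumulator, not the Collector's implementation: the accumulator type and the series key format are hypothetical, and a real component also has to handle start timestamps, stale series, and concurrent access.

```go
package main

import "fmt"

// accumulator keeps one running total per series key.
type accumulator struct {
	totals map[string]float64
}

func newAccumulator() *accumulator {
	return &accumulator{totals: make(map[string]float64)}
}

// add folds a delta point into the series' running total and returns the
// cumulative value to expose to Prometheus.
func (a *accumulator) add(series string, delta float64) float64 {
	a.totals[series] += delta
	return a.totals[series]
}

func main() {
	acc := newAccumulator()
	series := `http_requests_total{method="GET"}`
	// The delta stream from Figure 8.3 becomes its cumulative counterpart.
	for _, d := range []float64{5, 7, 6, 7} {
		fmt.Println(acc.add(series, d)) // 5, 12, 18, 25
	}
}
```

The state lives in memory, so a restart of the accumulating component looks like a counter reset to Prometheus, which rate() already tolerates.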

Common Temporality Bugs

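If you misconfigure temporality, the symptoms in Prometheus are distinctive: a counter exported as delta looks like a series that resets on almost every export, so rate() and increase() return erratic values that do not match real traffic.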

The fix is always the same: ensure the exporter feeding Prometheus is configured to produce cumulative, regardless of what temporality the SDK and Collector use internally.

Key Takeaway: Prometheus is built around cumulative monotonic counters; OTLP push pipelines often favor delta. Pick temporality per exporter — usually delta from the SDK, cumulative at the Prometheus boundary — and let the Collector translate between them.


8.3 Bridging to Prometheus

There are four practical ways to get OpenTelemetry metrics into a Prometheus-based observability stack. Each has tradeoffs around coupling, push-versus-pull semantics, and how many moving parts you operate.

Figure 8.4: Four bridge pipelines from OTel-instrumented apps to Prometheus-compatible storage

flowchart TD
    App1[App with OTel SDK<br/>+ Prometheus exporter]
    App2[App with OTel SDK]
    App3[App with OTel SDK]
    App4[App with OTel SDK]
    Coll2[OTel Collector<br/>prometheus exporter]
    Coll3[OTel Collector<br/>prometheusremotewrite]
    App1 -->|expose /metrics| P1Slash[/metrics endpoint/]
    P1Slash -->|scrape| Prom1[(Prometheus)]
    App2 -->|OTLP push| Coll2
    Coll2 -->|expose /metrics| P2Slash[/metrics endpoint/]
    P2Slash -->|scrape| Prom2[(Prometheus)]
    App3 -->|OTLP push| Coll3
    Coll3 -->|remote_write push| Remote[(Mimir / Cortex /<br/>Thanos / VictoriaMetrics)]
    App4 -->|OTLP push| Prom4[(Prometheus<br/>OTLP receiver)]

Option 1: Prometheus Exporter in the SDK

The simplest path is to attach a Prometheus exporter directly to the OTel SDK inside your application. The SDK accumulates measurements internally and exposes them on an HTTP /metrics endpoint in the Prometheus text format.

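import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/prometheus"
    sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func initMeter() {
    // The exporter is an SDK Reader that registers its collector with the
    // default Prometheus registry.
    exporter, err := prometheus.New()
    if err != nil {
        panic(err)
    }
    provider := sdkmetric.NewMeterProvider(
        sdkmetric.WithReader(exporter),
    )
    otel.SetMeterProvider(provider)

    // promhttp serves the default registry, which now includes the OTel metrics.
    http.Handle("/metrics", promhttp.Handler())
    go func() {
        log.Fatal(http.ListenAndServe(":9464", nil))
    }()
}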

Prometheus scrapes the app directly:

scrape_configs:
  - job_name: 'myapp'
    static_configs:
      - targets: ['myapp:9464']

Pros: Native Prometheus experience, no Collector required, operationally familiar.

Cons: You are not really using OTLP — resource attributes get flattened into labels (or lost), and you couple every app to Prometheus’ wire format. Exponential histograms get down-converted to explicit-bucket form.

Best for: Small Prometheus-first shops; gradual migrations.

Option 2: Prometheus Receiver in the Collector

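A subtle but useful inversion: the OpenTelemetry Collector can scrape existing Prometheus exporters (anything exposing /metrics) and ingest the result into OTLP pipelines. This is the prometheus receiver, not exporter. It uses an embedded Prometheus scrape engine [Source: https://www.groundcover.com/opentelemetry/opentelemetry-metrics]. Because it moves metrics in the opposite direction from the other options (Prometheus format into OTLP), it does not appear as its own pipeline in Figure 8.4 or in the comparison table later in this section.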

This receiver is the bridge that lets you bring legacy Prometheus-instrumented services (node_exporter, kube-state-metrics, third-party apps) into a unified OTLP pipeline alongside OTel-native apps.

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kube-state-metrics'
          static_configs:
            - targets: ['kube-state-metrics:8080']

The Collector then routes these scraped metrics to any exporter — OTLP, prometheusremotewrite, or back out via the prometheus exporter for re-scraping.

Option 3: Prometheus Remote Write Exporter

The Collector’s prometheusremotewrite exporter pushes metrics over the Prometheus remote-write protocol. This is the dominant pattern for shipping metrics into Prometheus-compatible long-term stores like Cortex, Mimir, Thanos Receive, and VictoriaMetrics [Source: https://www.groundcover.com/opentelemetry/opentelemetry-metrics].

Important: by default, a Prometheus server is a remote-write client, not a receiver. It accepts remote-write pushes only when started with the --web.enable-remote-write-receiver flag, so the prometheusremotewrite exporter is primarily used with backends built to ingest remote-write.

receivers:
  otlp:
    protocols:
      http:
      grpc:
processors:
  batch:
    timeout: 5s
    send_batch_size: 10000
exporters:
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
    external_labels:
      cluster: prod-cluster
      source: otel
    send_metadata: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]

This path can preserve OTel exponential histograms by mapping them to Prometheus native histograms over the wire — the highest-fidelity OTel-to-Prometheus path available in 2025.

Option 4: OTLP-Native Ingestion into Prometheus

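Modern Prometheus versions include an OTLP receiver that accepts pushed OTLP metrics directly into the Prometheus TSDB. It was introduced in 2.47 behind the otlp-write-receiver feature flag and is enabled in 3.x with the --web.enable-otlp-receiver flag. The receiver speaks OTLP over HTTP only and is served on Prometheus’ regular web port at /api/v1/otlp/v1/metrics; there is no separate gRPC listener. The otlp block in the Prometheus configuration controls translation, such as which resource attributes are promoted to labels:

global:
  scrape_interval: 15s
otlp:
  promote_resource_attributes:
    - service.instance.id
    - service.name
    - service.namespace

Configure the OTel SDK to push to Prometheus’ OTLP endpoint:

export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://prometheus:9090/api/v1/otlp/v1/metrics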

This collapses the pipeline to two components — but you lose Prometheus’ pull-based service-discovery model, and OTLP-specific semantics (resource attributes, exemplars, exponential histograms) are mapped to Prometheus equivalents with varying maturity.

Comparison of the Four Options

| Option | Direction | Extra component | Model | Best for | Main limitation |
| --- | --- | --- | --- | --- | --- |
| SDK Prometheus exporter | App → Prom | None | Pull | Small/medium Prometheus shops | Couples apps to Prometheus wire format |
| Collector prometheus exporter | App → Collector → Prom | Collector | Push-to-Collector, pull-from-Collector | Prom shops adopting OTel | Extra hop |
| Collector prometheusremotewrite | App → Collector → remote store | Collector | Push to backend | Large multi-cluster Cortex/Mimir | Stock Prometheus accepts it only with the remote-write receiver enabled |
| Prometheus OTLP receiver | App → Prom (OTLP) | None | Push into Prom | OTel-first, fewer components | Newer; push semantics; less mature mappings |

The dominant recommendation for teams adopting OTel while keeping Prometheus is the Collector prometheus exporter pattern (the second row of the comparison table): applications push OTLP to a Collector, the Collector hosts a prometheus exporter on a port, and Prometheus scrapes the Collector. You get OTLP from your apps, Prometheus’ familiar pull model at the storage boundary, and a central place to filter, rename, and rate-limit metrics.

# Collector
exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
    namespace: otel
    const_labels:
      source: otel-collector
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
# Prometheus
scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:9464']

Key Takeaway: For most teams in 2025, the recommended bridge is App (OTLP) → OpenTelemetry Collector → prometheus exporter → Prometheus scrape. It combines OTel-native push from applications with Prometheus-native pull at the storage layer, and gives you a Collector chokepoint to fix temporality, naming, and cardinality.


8.4 Naming and Label Mapping

Once the wire path is sorted out, the next failure mode is semantic: OTel and Prometheus have different naming conventions, different valid character sets, and different ways of expressing units. A faithful bridge has to translate without quietly dropping meaning.

OTel Metric Name → Prometheus Metric Name

OpenTelemetry metric names are dotted and case-sensitive: http.server.request.duration. Prometheus metric names traditionally allow only [a-zA-Z_:][a-zA-Z0-9_:]* — no dots. The standard mapping is:

  1. Replace . with _: http.server.request.duration → http_server_request_duration.
  2. Replace other invalid characters with _: hyphens, slashes, etc.
  3. Append unit suffix based on the instrument’s unit (next subsection).

Worked example:

| OTel metric name | Unit | Instrument | Prometheus name |
| --- | --- | --- | --- |
| http.server.request.duration | s | Histogram | http_server_request_duration_seconds |
| http.server.active_requests | {request} | UpDownCounter | http_server_active_requests |
| http.client.request.body.size | By | Histogram | http_client_request_body_size_bytes |
| process.cpu.time | s | ObservableCounter | process_cpu_time_seconds_total |
| system.memory.usage | By | ObservableUpDownCounter | system_memory_usage_bytes |

Note _total is appended to monotonic counters per Prometheus convention; the exporter does this automatically based on the instrument type.
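The mapping steps above can be sketched as a small Go function. This is an illustration under stated assumptions rather than any exporter's actual code: promName, sanitize, and the unitSuffix table are hypothetical names, and the unit table covers only the two physical units that appear in the worked example.

```go
package main

import (
	"fmt"
	"strings"
)

// unitSuffix covers only the units used in the worked example; annotation
// units such as {request} are absent on purpose and add no suffix.
var unitSuffix = map[string]string{"s": "seconds", "By": "bytes"}

// sanitize replaces every character outside [a-zA-Z0-9_:] with an
// underscore. A leading digit is also replaced, since names cannot start
// with one.
func sanitize(name string) string {
	var b strings.Builder
	for i, r := range name {
		valid := r == '_' || r == ':' ||
			(r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') ||
			(r >= '0' && r <= '9' && i > 0)
		if valid {
			b.WriteRune(r)
		} else {
			b.WriteByte('_')
		}
	}
	return b.String()
}

// promName applies the three steps: sanitize, unit suffix, then _total for
// monotonic counters. Suffixes already present are not duplicated.
func promName(otelName, unit string, monotonicCounter bool) string {
	name := sanitize(otelName)
	if suffix, ok := unitSuffix[unit]; ok && !strings.HasSuffix(name, "_"+suffix) {
		name += "_" + suffix
	}
	if monotonicCounter && !strings.HasSuffix(name, "_total") {
		name += "_total"
	}
	return name
}

func main() {
	fmt.Println(promName("http.server.request.duration", "s", false))
	fmt.Println(promName("http.server.active_requests", "{request}", false))
	fmt.Println(promName("process.cpu.time", "s", true))
}
```

Keeping the transform deterministic matters because dashboards and alert rules refer to the output name, not the OTel name.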

Figure 8.5: OTel metric name to Prometheus name conversion pipeline

flowchart LR
    Start["OTel name:<br/>http.server.request.duration<br/>unit: s<br/>type: Histogram"]
    Start --> S1[Step 1<br/>Replace dots<br/>with underscores]
    S1 --> N1["http_server_request_duration"]
    N1 --> S2[Step 2<br/>Sanitize other<br/>invalid characters]
    S2 --> N2["http_server_request_duration"]
    N2 --> S3[Step 3<br/>Append unit suffix<br/>derived from the unit]
    S3 --> N3["http_server_request_duration_seconds"]
    N3 --> S4[Step 4<br/>Append _total<br/>if monotonic counter]
    S4 --> Final["http_server_request_duration_seconds"]

Unit Suffixes and Base Units

OpenTelemetry uses UCUM unit codes: s (seconds), ms (milliseconds), By (bytes), KiBy (kibibytes), 1 (dimensionless ratio), {request} (annotation, no physical unit). Prometheus by convention uses base units (seconds, bytes) in suffixes. The exporters spell the UCUM code out as a suffix but do not rescale the recorded values:

| OTel unit | Prometheus suffix | Value conversion at exporter |
| --- | --- | --- |
| s | _seconds | None |
| ms | _milliseconds | None |
| us | _microseconds | None |
| By | _bytes | None |
| KiBy | _kibibytes | None |
| 1 | _ratio (gauges only) | None |
| {request}, {job}, … | (none) | Annotation dropped |

The Prometheus exporter derives the suffix from the unit and leaves the values alone. If you instrument in milliseconds, the series is named with _milliseconds and carries millisecond values; to get the _seconds names that Prometheus dashboards expect, record in seconds at the instrument (as the OTel semantic conventions do for durations) or convert explicitly before export. Declaring a unit that does not match the recorded values (or declaring none) is a common silent bug that yields metrics that look right but are off by 1,000×.

Attributes to Labels

OTel attributes (key-value pairs on each measurement) and OTel resource attributes (key-value pairs on the entire SDK, like service.name and service.version) both become Prometheus labels — but with caveats.

Attribute keys go through the same dot-to-underscore conversion: http.response.status_code becomes the label http_response_status_code.

Resource attributes need more care. The Prometheus exporters often promote service.name to a label named job and service.instance.id to instance, mirroring Prometheus’ service-discovery model. Other resource attributes (k8s.namespace.name, cloud.region, etc.) are by default placed on a separate target_info series rather than on every series; copying them onto every series is an opt-in (for example, the Collector exporters’ resource_to_telemetry_conversion setting) and can explode cardinality if you are not careful.

| OTel attribute | Prometheus label |
| --- | --- |
| service.name | job (and/or service_name) |
| service.instance.id | instance |
| http.response.status_code | http_response_status_code |
| k8s.namespace.name | k8s_namespace_name |
| net.peer.name | net_peer_name |

UTF-8 Metric Names in Prometheus 3.x

A significant 2024–2025 development: Prometheus 3.x supports UTF-8 metric and label names when the OpenMetrics or native protocols are used. This means the OTel .-separated form can in principle be preserved end-to-end without the underscore mangling, by quoting the metric name:

{"http.server.request.duration", method="GET"} 1.23

In practice, dashboards, alert rules, and recording rules written against legacy underscored names mean most teams still use the classic conversion. Treat UTF-8 names as a forward-looking option: useful for greenfield deployments where every component (Prometheus, query layer, dashboards) supports them, but expect translation back to underscored names anywhere a tool was written before 2024.

Common Naming and Mapping Pitfalls

Key Takeaway: OTel-to-Prometheus naming is a deterministic transform: dots become underscores, units become name suffixes, attributes become labels, and monotonic counters get _total. Get the unit right at the instrument level; everything downstream depends on it.


Chapter Summary

The metrics pipeline between OpenTelemetry and Prometheus has to reconcile two designs: OTel’s flexible push-oriented data model and Prometheus’ cumulative pull-oriented exposition format. The bridge has four moving parts.

The data model. OpenTelemetry exposes six instruments (Counter, UpDownCounter, Histogram, ObservableCounter, ObservableUpDownCounter, ObservableGauge) plus an aggregation layer customizable via Views. Exponential histograms give you compact, dynamic-range distributions that gracefully degrade memory through auto-downscaling. Prometheus’ wire format is less expressive, which is why bridging requires deliberate choices about which OTel features to preserve and which to flatten.

Aggregation temporality. Cumulative means “total since start”; delta means “change since last export.” Prometheus only natively supports cumulative — its rate() and increase() functions assume it. OTel push pipelines often favor delta because deltas aggregate cleanly across many producers. The dominant pattern is to use delta from apps to the Collector, and have the Collector accumulate to cumulative before exposing metrics for Prometheus scraping.

The wire bridge. Four options exist: SDK Prometheus exporter, Collector prometheus exporter, Collector prometheusremotewrite exporter, and Prometheus’ native OTLP receiver. For most Prometheus shops adopting OTel, the App → OTLP → Collector → prometheus exporter → Prometheus scrape pattern is the recommended path. It preserves OTel push from apps and Prometheus pull at the storage layer, and gives you a single chokepoint to handle temporality, naming, and cardinality.

Naming and labels. OTel metric names use dots and are paired with explicit units; Prometheus uses underscores, base units, and the conventional _total suffix on counters. Resource attributes become labels — be deliberate about which ones, or your Prometheus cardinality will explode. Prometheus 3.x’s UTF-8 name support offers a forward-looking simplification but is not yet universally compatible with existing tooling.

Get these four layers right and OTel and Prometheus coexist smoothly: applications speak OTel, dashboards keep working, and you preserve the option to add OTLP-native backends without re-instrumenting.


Key Terms

| Term | Definition |
| --- | --- |
| Aggregation temporality | Whether each exported metric point represents a total since process start (cumulative) or a change since the last export (delta). |
| Cumulative | Temporality in which each data point is the total since a fixed start time; the natural fit for Prometheus. |
| Delta | Temporality in which each data point is the change since the previous export; favored for many OTLP push pipelines. |
| Exponential histogram | OTel histogram aggregation using base-2 exponential buckets controlled by a scale parameter, with automatic downscaling to bound memory. |
| View | OTel SDK configuration that customizes how measurements from an instrument are aggregated, named, attributed, or filtered before export. |
| Instrument | API surface for recording measurements: Counter, UpDownCounter, Histogram, and their Observable variants. |
| Prometheus exporter | An OTel component (in the SDK or Collector) that exposes a /metrics endpoint in Prometheus text format for scraping. |
| Prometheus receiver | A Collector component that scrapes existing Prometheus /metrics endpoints and ingests the result into the OTLP pipeline. |
| Remote write | Prometheus’ push protocol for shipping samples to remote storage backends like Cortex, Mimir, Thanos, and VictoriaMetrics. |
| prometheusremotewrite exporter | Collector exporter that pushes metrics over the Prometheus remote-write protocol; cannot target a Prometheus server unless its remote-write receiver is enabled (--web.enable-remote-write-receiver). |
| OTLP | OpenTelemetry Protocol; the push-based gRPC/HTTP wire format for traces, metrics, and logs. |
| OTLP receiver (Prometheus) | A feature in recent Prometheus versions that accepts pushed OTLP metrics directly into the TSDB. |
| Resource attributes | Key-value pairs that describe the entity producing telemetry (e.g., service.name); typically promoted to Prometheus labels. |
| Native histogram (Prometheus) | Prometheus’ base-2 exponential histogram type, transported via remote-write and roughly analogous to OTel ExponentialHistogram. |
| Exemplar | Sampled raw measurement attached to a histogram bucket, optionally carrying trace/span context for telemetry correlation. |

Chapter 9: Logs, Events, and Cross-Signal Correlation

Logs are the oldest and most stubborn signal in observability. Long before metrics dashboards and distributed traces, engineers grepped through /var/log to figure out what went wrong. That habit has not gone away — it has merely accumulated layers. Today a typical production stack carries application logs in JSON, container stdout collected by Fluent Bit, kernel messages in journald, audit events going to a SIEM, and increasingly OpenTelemetry log records flowing as OTLP. The interesting question is no longer “how do I store logs?” but “how do I make logs participate in observability alongside traces and metrics?”

This chapter walks through what OpenTelemetry says a log is, how to get logs into a collector pipeline whether your application is brand-new or twenty years old, and — most importantly — how to make a trace_id in a log line behave like a hyperlink to the actual trace.

Learning Objectives

By the end of this chapter you will be able to:

9.1 OpenTelemetry Logs Data Model

If traces describe what one operation did and metrics describe what is true in aggregate, logs describe what happened, at this exact moment, in detail. OpenTelemetry’s logs signal formalizes that intuition into a portable data structure so logs can be shipped, transformed, and queried the same way regardless of language or backend [Source: https://opentelemetry.io/docs/specs/otel/logs/].

9.1.1 The LogRecord Structure

A LogRecord is the atomic unit of the OpenTelemetry logs signal. It lives inside a hierarchy familiar from traces and metrics: a Resource (the entity that produced the data — a service, host, or pod) contains one or more InstrumentationScopes (the library or module that emitted the log), which in turn contain LogRecords [Source: https://opentelemetry.io/docs/specs/otel/logs/data-model/].

Each LogRecord carries the following core fields:

| Field | Purpose | Example |
| --- | --- | --- |
| timestamp | When the event actually occurred (nanoseconds since epoch) | 1717603200123456789 |
| observed_timestamp | When the collector/agent first saw the event | 1717603200223456789 |
| severity_number | Numeric severity (1-24, normalized across systems) | 17 (ERROR) |
| severity_text | Original textual severity | "ERROR" |
| body | The main payload, string or structured | "Payment failed" or {"message": "...", "reason": "card_declined"} |
| attributes | Key-value dimensions for filtering and grouping | {"http.route": "/pay", "user.id": "12345"} |
| trace_id | 32-hex-character correlation ID matching the trace | "8f1b5fe2d5de4a51b8884f8f4cdde3f5" |
| span_id | 16-hex-character correlation ID matching the span | "d2a41c3ff7a1b0ce" |
| trace_flags | Sampling/flag bits inherited from trace context | 01 |
| resource | Service-level attributes (inherited) | {"service.name": "payments-api"} |

The separation between timestamp and observed_timestamp is subtle but useful: if a log line sits in a buffer for thirty seconds before reaching the Collector, both moments are preserved. This makes it possible to diagnose lag in the log pipeline itself.
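
A small sketch of that diagnosis, using the two example values from the field table above. The function name is hypothetical; the only assumption is that both fields are integer nanoseconds since the epoch.

```python
NANOS_PER_SECOND = 1_000_000_000


def pipeline_lag_ns(timestamp_ns: int, observed_timestamp_ns: int) -> int:
    """Nanoseconds between the event occurring and the pipeline first seeing it."""
    return observed_timestamp_ns - timestamp_ns


# Example values taken from the field table above.
lag_ns = pipeline_lag_ns(1717603200123456789, 1717603200223456789)
print(f"pipeline lag: {lag_ns / NANOS_PER_SECOND:.3f} s")  # pipeline lag: 0.100 s
```

Integer nanoseconds are kept until the final formatting step because values of this size lose precision when converted to floats early.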

Figure 9.1: LogRecord hierarchy — Resource, InstrumentationScope, and LogRecord fields

graph TD
    R["Resource<br/>service.name=payments-api<br/>deployment.environment=prod<br/>k8s.pod.name=payments-7c584fd87f-jc6xg"]
    R --> S1["InstrumentationScope<br/>com.example.payments"]
    R --> S2["InstrumentationScope<br/>runtime"]
    S1 --> L1["LogRecord<br/>severity=ERROR<br/>body=Payment charge failed<br/>trace_id=8f1b...e3f5<br/>span_id=d2a4...b0ce"]
    S1 --> L2["LogRecord<br/>severity=INFO<br/>body=Charge created<br/>attributes.amount_cents=4200"]
    S2 --> L3["LogRecord<br/>severity=WARN<br/>body=GC pause 250ms"]
    S2 --> L4["LogRecord<br/>severity=INFO<br/>body=Heap resized"]

9.1.2 Severity, Body, and Attributes

OpenTelemetry maps the chaotic world of log levels onto a single severity scale of 1-24, divided into ranges: TRACE (1-4), DEBUG (5-8), INFO (9-12), WARN (13-16), ERROR (17-20), FATAL (21-24). The mapping lets you ask “show me everything WARN or above” across services that use Python’s logging, Java’s Logback, Node’s pino, and .NET’s ILogger — even though each of those frameworks invents its own level names.
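
The ranges above translate directly into a lookup. This sketch uses hypothetical helper names and plain dicts as stand-ins for LogRecords; it shows how a "WARN or above" filter reduces to a single numeric comparison.

```python
# Severity ranges as listed above: (low, high, name).
SEVERITY_RANGES = [
    (1, 4, "TRACE"), (5, 8, "DEBUG"), (9, 12, "INFO"),
    (13, 16, "WARN"), (17, 20, "ERROR"), (21, 24, "FATAL"),
]


def severity_range(severity_number: int) -> str:
    """Return the range name for a severity_number between 1 and 24."""
    for low, high, name in SEVERITY_RANGES:
        if low <= severity_number <= high:
            return name
    raise ValueError(f"severity_number out of range: {severity_number}")


def warn_or_above(records):
    """Keep records at WARN (13) or higher, whatever framework emitted them."""
    return [r for r in records if r["severity_number"] >= 13]


records = [
    {"severity_number": 9, "body": "Charge created"},
    {"severity_number": 13, "body": "GC pause 250ms"},
    {"severity_number": 17, "body": "Payment charge failed"},
]
print([severity_range(r["severity_number"]) for r in records])
print([r["body"] for r in warn_or_above(records)])
```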

The body field deserves attention. It can be a plain string (for legacy apps), but it can also hold a structured object. The recommendation in 2025 is:

Think of body as the headline and attributes as the metadata you’d want to slice by. A useful analogy: if logs were emails, body is the subject line and attributes are the headers (From, To, Date) that make the email searchable.

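Here is a complete LogRecord in a simplified, human-readable JSON form. The actual OTLP/JSON encoding uses camelCase field names such as timeUnixNano and severityNumber and wraps values in typed containers, but the content is the same: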

{
  "timestamp": "2025-03-10T10:15:30.123Z",
  "observed_timestamp": "2025-03-10T10:15:30.156Z",
  "severity_number": 17,
  "severity_text": "ERROR",
  "body": {
    "message": "Payment charge failed",
    "reason": "card_declined"
  },
  "attributes": {
    "http.method": "POST",
    "http.route": "/pay",
    "http.status_code": 402,
    "user.id": "12345",
    "payment.amount_cents": 4200,
    "exception.type": "StripeCardException"
  },
  "trace_id": "8f1b5fe2d5de4a51b8884f8f4cdde3f5",
  "span_id": "d2a41c3ff7a1b0ce",
  "trace_flags": "01",
  "resource": {
    "service.name": "payments-api",
    "service.namespace": "checkout",
    "deployment.environment": "prod",
    "k8s.namespace.name": "payments",
    "k8s.pod.name": "payments-7c584fd87f-jc6xg"
  }
}

That record carries enough context to answer three different questions: what happened (body), who did it (resource + user attribute), and which trace it belongs to (trace_id/span_id).

9.1.3 Maturity: Logs Compared to Traces and Metrics

The honest 2025 picture: the logs data model is stable, OTLP logs are stable, and the Collector’s log pipeline is mature. What still varies is native SDK support per language [Source: https://opentelemetry.io/docs/specs/otel/logs/].

| Language | Logs API/SDK status | Production posture |
| --- | --- | --- |
| Java | Stable Logs API/SDK; Logback & Log4j2 appenders; agent integration | Production-ready |
| .NET | Stable via ILogger provider; first-class OTLP exporter | Production-ready |
| Python | SDK exists; some surface area still volatile | Production-usable with caution |
| Go | SDK in experimental form; many teams still inject IDs manually | Early-adopter |
| Node.js/JS | No unified mature logs SDK; use existing logger + manual injection | Hybrid approach |
| C++/Rust | Partial/experimental; varies by library | Evaluate per project |

The practical implication is that for Java and .NET services you can confidently say “logs go through OpenTelemetry.” For Python, you can do it but should isolate the OTel logging setup behind your own thin abstraction so SDK churn does not propagate. For Go and Node.js, the realistic posture is: keep using your favorite logger (zap, zerolog, pino, winston) and ensure it includes trace_id/span_id fields — the OTel Collector will accept the resulting JSON lines just as happily.

A log bridge is a small adapter that listens to log events from an existing framework and converts them into OpenTelemetry LogRecords. The framework keeps its familiar API (developers still write logger.error("...")); the bridge handles the translation, enriches the record with resource attributes and current trace context, and ships it over OTLP.

The pattern, in three steps regardless of language:

  1. Application code logs as it always has.
  2. A bridge — an appender, handler, or logger provider — translates each event into an OTel LogRecord.
  3. The OTel SDK exports LogRecords to the Collector via OTLP.
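
The three steps above can be sketched with nothing but the standard library. This is a toy bridge, not the OpenTelemetry SDK: ListBridgeHandler, the LEVEL_TO_SEVERITY mapping, and the in-memory exported list (standing in for an OTLP exporter) are all assumptions made for illustration.

```python
import logging

# Illustrative mapping from Python levels to the first value of each OTel severity range.
LEVEL_TO_SEVERITY = {
    logging.DEBUG: 5, logging.INFO: 9, logging.WARNING: 13,
    logging.ERROR: 17, logging.CRITICAL: 21,
}


class ListBridgeHandler(logging.Handler):
    """Toy bridge: converts stdlib log events into LogRecord-shaped dicts."""

    def __init__(self, resource):
        super().__init__()
        self.resource = resource
        self.exported = []  # stand-in for step 3, the OTLP export

    def emit(self, record):
        # Step 2: translate the framework's event into a LogRecord-like structure.
        self.exported.append({
            "timestamp": int(record.created * 1e9),
            "severity_number": LEVEL_TO_SEVERITY.get(record.levelno, 0),
            "severity_text": record.levelname,
            "body": record.getMessage(),
            "attributes": {"logger.name": record.name},
            "resource": self.resource,
        })


logger = logging.getLogger("payments")
logger.setLevel(logging.INFO)
logger.propagate = False
bridge = ListBridgeHandler({"service.name": "payments-api"})
logger.addHandler(bridge)

# Step 1: application code logs exactly as it always has.
logger.error("Payment failed for order %s", "ord_8f3a")
print(bridge.exported[0]["severity_number"], bridge.exported[0]["body"])
```

The application code never imports anything OTel-specific; only the handler registration changes, which is what makes the bridge approach low-friction to adopt.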

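Java with Logback — Add the opentelemetry-logback-appender-1.0 dependency, declare an OTel appender in logback.xml, and attach it to the root logger. The appender copies MDC entries to attributes and picks up trace_id/span_id from the current span automatically. If you configure the SDK yourself rather than running the Java agent, you also need to hand the appender your OpenTelemetry instance at startup via OpenTelemetryAppender.install(...) [Source: https://opentelemetry.io/docs/languages/java/instrumentation/].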

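<configuration>
  <!-- Plain console output, kept alongside the OTel appender -->
  <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <appender name="OTEL" class="io.opentelemetry.instrumentation.logback.appender.v1_0.OpenTelemetryAppender">
    <captureMdcAttributes>*</captureMdcAttributes>
    <captureCodeAttributes>true</captureCodeAttributes>
  </appender>
  <root level="INFO">
    <appender-ref ref="OTEL"/>
    <appender-ref ref="CONSOLE"/>
  </root>
</configuration>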

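Python with logging — Install opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc, then wire up a LoggerProvider, a BatchLogRecordProcessor, and the OTLP exporter. The underscore-prefixed module names (_logs, _log_exporter) signal that this API surface is still subject to change: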

import logging
from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter

provider = LoggerProvider()
set_logger_provider(provider)
provider.add_log_record_processor(
    BatchLogRecordProcessor(OTLPLogExporter(endpoint="http://otel-collector:4317", insecure=True))
)

handler = LoggingHandler(level=logging.INFO, logger_provider=provider)
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)

Wrap that setup in your own observability/logging.py module so the inevitable SDK version bump only changes one file.

.NET with ILogger — Configuration is one fluent call:

builder.Logging.AddOpenTelemetry(options =>
{
    options.IncludeFormattedMessage = true;
    options.IncludeScopes = true;
    options.SetResourceBuilder(ResourceBuilder.CreateDefault().AddService("payments-api"));
    options.AddOtlpExporter(o => o.Endpoint = new Uri("http://otel-collector:4317"));
});

After that, any _logger.LogInformation("Order {OrderId} placed", id) produces a LogRecord with OrderId as a structured attribute and the current trace_id/span_id already attached.

Node.js with winston / pino — Because the JS logs SDK is still maturing, the common pattern is to keep winston or pino, configure JSON output, and add trace_id/span_id from the active context:

const { trace, context } = require('@opentelemetry/api');
const pino = require('pino');

const logger = pino({
  formatters: {
    log(obj) {
      const span = trace.getSpan(context.active());
      if (span) {
        const ctx = span.spanContext();
        obj.trace_id = ctx.traceId;
        obj.span_id = ctx.spanId;
      }
      return obj;
    }
  }
});

The resulting JSON lines flow to stdout, are picked up by the Collector’s filelog receiver, and join the OTLP pipeline.

Key Takeaway: OpenTelemetry’s LogRecord gives you a single, stable schema (timestamp, severity, body, attributes, trace_id/span_id, resource) that any logging framework can be bridged into — keeping developer ergonomics in each language while normalizing the wire format downstream.

9.2 Collecting Logs at the Edge

The data model is portable; the act of capturing logs is not. Some logs come from applications that speak OTLP. Some come from third-party software writing to files. Some come from systemd via journald. Some are already being collected by a Fluent Bit DaemonSet your platform team installed five years ago. The OpenTelemetry Collector is designed to absorb all of these without forcing a single pattern.

9.2.1 The filelog Receiver

The filelog receiver is the Collector’s answer to legacy log files. It tails files on disk, optionally parses each line (regex, JSON, CSV), and emits OTel LogRecords downstream [Source: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver].

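receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    include_file_path: true
    start_at: beginning
    operators:
      # Container runtimes wrap each line (CRI or Docker format); unwrap it first.
      # Assumes a Collector contrib release that ships the container operator.
      - type: container
      # Parse the unwrapped body as JSON and promote its fields to attributes.
      - type: json_parser
        parse_from: body
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
        severity:
          parse_from: attributes.level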

The operators list is a small log-transformation pipeline that runs per receiver. The json_parser operator parses each line as JSON and promotes fields to LogRecord attributes. The timestamp and severity sub-blocks tell the operator which attributes to use to populate the canonical fields. Other useful operators include regex_parser for unstructured logs and recombine for entries that a logging framework split across lines, such as stack traces. The receiver also has a multiline setting (configured on the receiver itself, not as an operator) that groups lines as the file is read.
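
To make the behavior concrete, here is a rough Python imitation of what a JSON-parsing step with timestamp and severity extraction does to one line. It is only an analogy: the function name and the time/level keys are assumptions for the example, and Python's %f directive stands in for the %L directive used in the receiver's layout.

```python
import json
from datetime import datetime, timezone


def parse_json_line(line: str, time_key: str = "time", level_key: str = "level") -> dict:
    """Parse one JSON log line and populate canonical fields from its attributes."""
    attributes = json.loads(line)  # every JSON field becomes an attribute
    record = {"body": line, "attributes": attributes}
    raw_time = attributes.get(time_key)
    if raw_time is not None:
        # Python's %f plays the role of %L (fractional seconds) in the YAML layout.
        parsed = datetime.strptime(raw_time, "%Y-%m-%dT%H:%M:%S.%fZ")
        record["timestamp"] = parsed.replace(tzinfo=timezone.utc)
    raw_level = attributes.get(level_key)
    if raw_level is not None:
        record["severity_text"] = str(raw_level).upper()
    return record


line = '{"time": "2025-03-10T10:15:30.123Z", "level": "error", "msg": "Payment charge failed"}'
record = parse_json_line(line)
print(record["timestamp"].isoformat(), record["severity_text"])
```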

9.2.2 The journald Receiver

For host-level events on systemd-based Linux machines — sshd logins, OOM kills, cron runs — the journald receiver reads directly from the systemd journal binary format, preserving all structured fields:

receivers:
  journald:
    directory: /var/log/journal
    units:
      - sshd
      - cron
    priority: info

This is invaluable for SRE teams who want a single observability backend for both application telemetry and host events.

9.2.3 Migrating from Fluent Bit, Fluentd, or Vector

Most clusters running for more than a year already have a log shipper deployed. The pragmatic migration path is not rip-and-replace. It is layered coexistence.

Figure 9.2: Two-track log pipeline during Fluent Bit to OTel Collector migration

flowchart LR
    subgraph Legacy["Legacy track (Phase 1 - keep running)"]
        A1["Legacy services<br/>stdout/files"] --> FB["Fluent Bit<br/>DaemonSet"]
        FB --> Splunk["Splunk / ELK"]
    end
    subgraph New["OTel track (Phase 2-3 - growing)"]
        A2["New services<br/>OTLP logs"] --> OC["OTel Collector<br/>DaemonSet<br/>filelog + OTLP receiver"]
        A3["JSON file logs"] --> OC
        OC --> Loki["Loki"]
        OC --> Tempo["Tempo / OTLP backend"]
    end
    A1 -. "service adopts OTel SDK<br/>or JSON logs" .-> A2
    FB -. "optional: Fluent Forward<br/>during transition" .-> OC

Concretely:

  1. Phase 1 — Leave Fluent Bit in place. Deploy the OTel Collector alongside it. Have new services emit OTLP logs directly to the Collector. Existing services keep flowing through Fluent Bit.
  2. Phase 2 — Standardize log format. Whether your shipper is Fluent Bit, Fluentd, or Vector, configure it to emit JSON with trace_id, span_id, and OTel-style resource attributes. This makes the eventual switchover lossless.
  3. Phase 3 — Either reconfigure Fluent Bit to forward to the OTel Collector via OTLP (Fluent Bit 2.0+ supports this), or replace it with the Collector’s filelog receiver. The destination backend (Loki, ELK, Splunk) can stay the same.

The OTel Collector also speaks the Fluent Forward protocol natively as a receiver, so you can point an existing Fluent Bit at the Collector during the transition without changing the agent’s output plugin.

9.2.4 Kubernetes Log Collection Patterns

In Kubernetes, container stdout/stderr is written to /var/log/pods/<namespace>_<pod>_<uid>/<container>/0.log on each node. There are three established patterns for collecting these:

| Pattern | Mechanism | Pros | Cons |
| --- | --- | --- | --- |
| DaemonSet with filelog | One Collector per node tailing /var/log/pods | No app changes; works for any language | Requires hostPath mount; parsing burden in Collector |
| Sidecar OTel SDK | Each pod sends OTLP directly | Structured at source; trace correlation automatic | Pod resource overhead; harder for 3rd-party images |
| Stdout + Kubernetes API enrichment | DaemonSet tails stdout, calls k8s API for pod metadata | Rich Kubernetes attributes (labels, owners) | Extra API load; permissions complexity |
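
The pod log path layout described above already encodes useful metadata. As a sketch, the following regex pulls namespace, pod, UID, and container out of a path; the function name and the example UID are hypothetical, and the pattern assumes the layout exactly as given above.

```python
import re

# Matches /var/log/pods/<namespace>_<pod>_<uid>/<container>/<n>.log
POD_LOG_PATH = re.compile(
    r"/var/log/pods/(?P<namespace>[^_/]+)_(?P<pod>[^_/]+)_(?P<uid>[^/]+)"
    r"/(?P<container>[^/]+)/\d+\.log"
)


def parse_pod_log_path(path: str):
    """Return Kubernetes metadata encoded in a pod log path, or None if it does not match."""
    match = POD_LOG_PATH.fullmatch(path)
    if match is None:
        return None
    return {
        "k8s.namespace.name": match.group("namespace"),
        "k8s.pod.name": match.group("pod"),
        "k8s.pod.uid": match.group("uid"),
        "k8s.container.name": match.group("container"),
    }


# The UID below is made up for illustration.
path = "/var/log/pods/payments_payments-7c584fd87f-jc6xg_0f2b6c1e-1111-2222-3333-444455556666/payments/0.log"
print(parse_pod_log_path(path))
```

Path-derived metadata covers only names and the UID; labels, owners, and node details still have to come from the Kubernetes API.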

The most common production layout is a DaemonSet + filelog + the k8sattributes processor, which enriches every LogRecord with pod, namespace, deployment, and node attributes pulled from the Kubernetes API:

processors:
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.node.name
      labels:
        - tag_name: app
          key: app.kubernetes.io/name

That processor alone is often the difference between “logs I can search” and “logs I can join with metrics and traces.”

Key Takeaway: The Collector’s filelog, journald, and fluentforward receivers, plus the k8sattributes processor, let you ingest logs from new OTLP-native services, legacy file-based apps, host daemons, and existing Fluent Bit deployments — all into one normalized OTel log pipeline, without a big-bang migration.

9.3 Cross-Signal Correlation

Structured logs and OTLP transport are means to an end. The end is correlation: the ability to look at a trace in Tempo, click a button, and land on the precise log lines that the failing span produced, or to find an ERROR log in Loki, click the trace_id, and land in the corresponding distributed trace.

9.3.1 Stamping trace_id and span_id on Log Records

There are two ways to get trace IDs onto a log record [Source: https://grafana.com/docs/grafana/latest/datasources/loki/configure-loki-data-source/#derived-fields]:

  1. Automatic, via SDK or bridge. The OTel logging API reads the active span from context (the same context used by tracing instrumentation) and copies its trace_id and span_id onto the LogRecord. This is the path Logback, Log4j2, ILogger, and the Python OTel handler all take.
  2. Manual, via logger enrichment. In languages where the logs SDK is immature, you fetch the active span yourself and inject the IDs as structured fields on every log. Patterns: a pino formatter (Node.js), a zap option (Go), a Serilog enricher (.NET classic), a logrus hook.

Either way the result must be the same: every log line carries the exact 32-hex-character trace_id and 16-hex-character span_id that the tracer would have sent to the trace backend. A mismatch — extra dashes, wrong case, truncation — silently breaks correlation.
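
A minimal sketch of producing and checking IDs in that exact form. It assumes the IDs are available as integers (as they are on a span context in the OpenTelemetry Python API); the helper names are hypothetical.

```python
import re

TRACE_ID_RE = re.compile(r"[0-9a-f]{32}")
SPAN_ID_RE = re.compile(r"[0-9a-f]{16}")


def format_ids(trace_id: int, span_id: int):
    """Render integer IDs as zero-padded lowercase hex (32 and 16 characters)."""
    return format(trace_id, "032x"), format(span_id, "016x")


def is_correlatable(trace_id: str, span_id: str) -> bool:
    """True only if both IDs are in the exact form a trace backend would match."""
    return bool(TRACE_ID_RE.fullmatch(trace_id)) and bool(SPAN_ID_RE.fullmatch(span_id))


trace_hex, span_hex = format_ids(0x8F1B5FE2D5DE4A51B8884F8F4CDDE3F5, 0xD2A41C3FF7A1B0CE)
print(trace_hex, span_hex)
print(is_correlatable(trace_hex, span_hex))            # True
print(is_correlatable(trace_hex.upper(), span_hex))    # False: wrong case
print(is_correlatable(trace_hex[:16], span_hex))       # False: truncated
```

A check like this is cheap to run in a unit test for your logging setup, which catches formatting drift before it silently breaks correlation in production.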

A common 2025 pitfall is mismatched casing: some loggers serialize trace IDs as uppercase, but Tempo expects lowercase. The Collector’s transform processor can normalize this:

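processors:
  transform/normalize-trace-ids:
    # Skip records that fail a statement instead of dropping the batch
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          # Applies to string attributes parsed from log lines; records without them are left alone
          - set(attributes["trace_id"], ConvertCase(attributes["trace_id"], "lower")) where attributes["trace_id"] != nil
          - set(attributes["span_id"], ConvertCase(attributes["span_id"], "lower")) where attributes["span_id"] != nil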

9.3.2 Unified Resource Attributes

Trace-log linking via trace_id is half the story. The other half is resource attribute consistency. Both your traces and your logs should carry the same service.name, service.namespace, deployment.environment, and k8s.namespace.name. Without that, Grafana cannot construct a sensible “Show logs” query when you click on a span.

In an OTel SDK setup, resource attributes are configured once and shared by all three signals:

# Environment / config
OTEL_RESOURCE_ATTRIBUTES=service.name=payments-api,service.namespace=checkout,deployment.environment=prod

That single environment variable populates the Resource section of every trace, metric, and LogRecord emitted by the process. The Collector can further upsert cluster-level attributes that the application cannot know about:

processors:
  resource/cluster:
    attributes:
      - key: k8s.cluster.name
        value: prod-us-east-1
        action: upsert
      - key: cloud.region
        value: us-east-1
        action: upsert
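
The two layers can be pictured as a parse followed by an upsert. This sketch uses hypothetical helper names and ignores details such as percent-encoded values; it only illustrates how application-level and cluster-level attributes end up in one Resource.

```python
def parse_resource_attributes(value: str) -> dict:
    """Parse a comma-separated key=value list like OTEL_RESOURCE_ATTRIBUTES."""
    attributes = {}
    for pair in value.split(","):
        if not pair.strip():
            continue  # tolerate empty segments such as a trailing comma
        key, _, val = pair.partition("=")
        attributes[key.strip()] = val.strip()
    return attributes


def upsert(resource: dict, extra: dict) -> dict:
    """Insert new keys and overwrite existing ones, like the upsert action above."""
    merged = dict(resource)
    merged.update(extra)
    return merged


env_value = "service.name=payments-api,service.namespace=checkout,deployment.environment=prod"
app_resource = parse_resource_attributes(env_value)
full_resource = upsert(app_resource, {"k8s.cluster.name": "prod-us-east-1", "cloud.region": "us-east-1"})
print(full_resource)
```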

9.3.3 The Loki + Tempo + Grafana Pivot

Grafana Loki and Tempo are the open-source duo that pioneered the trace-to-logs UX. The high-level flow:

Figure 9.3: Trace-to-logs and logs-to-trace pivot in Grafana

sequenceDiagram
    participant App as Application
    participant OC as OTel Collector
    participant Loki as Loki (Logs)
    participant Tempo as Tempo (Traces)
    participant G as Grafana
    participant U as User
    App->>OC: OTLP logs + traces (shared trace_id)
    OC->>Loki: Logs pipeline
    OC->>Tempo: Traces pipeline
    Note over U,G: Trace -> Logs pivot
    U->>G: Open trace in Tempo
    G->>Tempo: Fetch spans
    Tempo-->>G: Spans with trace_id
    U->>G: Click "Show logs"
    G->>Loki: LogQL {service="payments-api"} <br/>| json | trace_id="8f1b..."
    Loki-->>G: Matching log lines
    G-->>U: Render trace + logs timeline
    Note over U,G: Logs -> Trace pivot
    U->>G: Click trace_id in log line
    G->>Tempo: Lookup trace by trace_id
    Tempo-->>G: Full trace
    G-->>U: Render trace view

The Collector configuration:

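exporters:
  # The loki exporter is deprecated in newer Collector contrib releases, which favor
  # otlphttp pointed at Loki's native OTLP endpoint; check what your version ships.
  loki: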
    endpoint: http://loki:3100/loki/api/v1/push
    default_labels_enabled:
      exporter: false
      job: false
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource/cluster, batch]  # batch last, so it groups already-enriched data
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp, filelog]
      processors: [k8sattributes, resource/cluster, transform/normalize-trace-ids, batch]
      exporters: [loki]

Then on the Grafana side, the Loki data source defines a derived field: a regex that extracts the trace_id value from a log line and turns it into a clickable link to the Tempo data source.
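
Conceptually, a derived field is a regex with a capture group plus a link template. The following Python sketch imitates that behavior outside Grafana; the regex, the function name, and the grafana.example.com URL template are all hypothetical and are not Grafana configuration.

```python
import re
from urllib.parse import quote

# Hypothetical regex: capture a 32-character lowercase hex trace_id from a JSON log line.
TRACE_ID_PATTERN = re.compile(r'"trace_id"\s*:\s*"([0-9a-f]{32})"')


def derive_trace_link(log_line: str,
                      url_template: str = "https://grafana.example.com/explore?trace={trace_id}"):
    """Return a link built from the captured trace_id, or None if the line has none."""
    match = TRACE_ID_PATTERN.search(log_line)
    if match is None:
        return None
    return url_template.format(trace_id=quote(match.group(1)))


line = '{"level":"error","msg":"Payment charge failed","trace_id":"8f1b5fe2d5de4a51b8884f8f4cdde3f5"}'
print(derive_trace_link(line))
print(derive_trace_link('{"level":"info","msg":"startup complete"}'))  # None
```

Note how strict the pattern is: an uppercase or truncated ID simply fails to match, and the link never appears. That is the silent failure mode that ID normalization is meant to prevent.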

The Tempo data source gets the reverse mapping under Trace to logs: pick Loki as the logs source, list the tags whose values should match (service.name, k8s.namespace.name, deployment.environment), and define the label mapping (service.name → service).

A critical performance note: never index trace_id as a Loki label. Each unique trace becomes a label value, exploding cardinality and devastating Loki’s index. Keep trace_id as a structured field within the log body, and rely on derived fields plus | json filtering at query time [Source: https://grafana.com/docs/tempo/latest/configuration/grafana-agent/]. Loki labels should be bounded-cardinality attributes only: service, environment, namespace, pod.

The resulting Grafana query, generated automatically when a user clicks “Show logs” on a span:

{service="payments-api", env="prod"} | json | trace_id = "8f1b5fe2d5de4a51b8884f8f4cdde3f5"
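
A sketch of how such a query could be assembled from a span's resource attributes. The build_logql function is hypothetical, and the deployment.environment to env mapping is an assumption inferred from the query above; only service.name to service is stated explicitly in the text.

```python
# Resource attribute -> Loki label. The second entry is assumed for illustration.
LABEL_MAPPING = {"service.name": "service", "deployment.environment": "env"}


def build_logql(resource: dict, trace_id: str, mapping: dict = LABEL_MAPPING) -> str:
    """Build a LogQL query: bounded-cardinality labels in the selector, trace_id as a field filter."""
    selectors = [
        f'{label}="{resource[attr]}"'
        for attr, label in mapping.items()
        if attr in resource
    ]
    return "{" + ", ".join(selectors) + "}" + f' | json | trace_id = "{trace_id}"'


resource = {"service.name": "payments-api", "deployment.environment": "prod"}
print(build_logql(resource, "8f1b5fe2d5de4a51b8884f8f4cdde3f5"))
```

The trace_id never appears inside the label selector, only after the json stage, which keeps it out of Loki's index as the performance note above requires.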

Key Takeaway: Cross-signal correlation requires three aligned pieces: every log carries the same trace_id/span_id as the trace, both signals share service.name-style resource attributes, and Grafana is configured with Loki derived fields and Tempo trace-to-logs — without all three, the UX silently degrades to “open two tabs and grep.”

9.4 Events and Span Events

The final corner of the logs story is the most confusing for newcomers: OpenTelemetry has two things both called “events.” A LogRecord in the logs signal is one. A span event in the traces signal is the other. They look superficially similar — a timestamp, a name or message, attributes — but they live in different pipelines and obey different rules.

9.4.1 Span Events vs. LogRecords

A span event is a timestamped annotation attached inside a span. It has no independent identity; it is shipped as part of its parent span over the trace pipeline, inheriting that span’s trace_id and span_id automatically. Span events have a name (such as "exception" or "retry") and attributes, but no severity level [Source: https://opentelemetry.io/docs/specs/otel/logs/data-model/].

A LogRecord is a first-class log entry. It has its own severity, its own body, and may exist without any span context at all (startup messages, cron job output, background workers). When it does have trace context, that context is carried as explicit trace_id/span_id fields.

Here is a comparison table that captures the distinctions most likely to trip people up:

| Aspect | LogRecord | Span Event |
| --- | --- | --- |
| Signal | Logs | Traces |
| Independent identity | Yes — exists on its own | No — lives inside a span |
| Severity | Yes (severity_number, severity_text) | No |
| Body | Yes (text or structured) | No (just name + attributes) |
| Trace correlation | Optional, via explicit trace_id/span_id | Automatic — inherits from parent span |
| Pipeline | Logs pipeline (OTLP logs, Loki, ELK, Splunk) | Trace pipeline (OTLP traces, Tempo, Jaeger) |
| Affected by trace sampling | No (separate sampling) | Yes — dropped if span is dropped |
| Volume profile | High; designed for log-scale backends | Low; embedded in spans |
| Retention | Typically days to months | Typically hours to days |
| Best for | “What is the app doing over time?” | “What happened inside this operation?” |

The sampling row is the most operationally important. If your tracing pipeline samples at 1%, 99% of span events disappear. If you need to keep an event around for postmortems no matter what, it must be a LogRecord.

Figure 9.4: Span events vs. LogRecords — different pipelines, joined by trace_id

flowchart LR
    subgraph Traces["Trace pipeline (sampled)"]
        SP["Span: GET /orders/{id}<br/>trace_id=8f1b...e3f5<br/>span_id=d2a4...b0ce"]
        SP -.- E1["Span event: exception<br/>exception.type=NPE"]
        SP -.- E2["Span event: cache.miss<br/>cache.key=user:42"]
        SP --> Tempo["Tempo"]
    end
    subgraph Logs["Logs pipeline (unsampled)"]
        L1["LogRecord<br/>severity=ERROR<br/>body=Order lookup failed<br/>trace_id=8f1b...e3f5"]
        L2["LogRecord<br/>severity=INFO<br/>body=checkout.completed"]
        L1 --> Loki["Loki"]
        L2 --> Loki
    end
    L1 -. "shared trace_id<br/>(cross-signal join)" .-> SP

Figure 9.5: Decision tree — LogRecord vs. span event

flowchart TD
    Start["New piece of information<br/>to capture"] --> Q1{"Must survive even if<br/>trace is sampled out?"}
    Q1 -->|Yes| LR["Emit as LogRecord"]
    Q1 -->|No| Q2{"Describes a moment<br/>inside one span's operation?"}
    Q2 -->|Yes| Q3{"Searched in logs backend?<br/>(audit, business, security)"}
    Q2 -->|No| LR
    Q3 -->|Yes| Both["Emit BOTH:<br/>LogRecord + span event"]
    Q3 -->|No| SE["Add as span event<br/>(exception / retry / state)"]
    LR --> Loki2["Logs pipeline -> Loki"]
    SE --> Tempo2["Trace pipeline -> Tempo"]
    Both --> Loki2
    Both --> Tempo2

9.4.2 Span Events: Annotating Operations

Span events shine when you want to capture moments within an operation without inflating your tracing schema with extra spans. Typical uses:

A Java example, adding a retry event:

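import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.Span;

// attempt and backoffMs are assumed to be numeric variables from the surrounding retry loop
Span span = Span.current();
span.addEvent("retry", Attributes.builder()
    .put("retry.count", attempt)
    .put("retry.delay_ms", backoffMs)
    .put("retry.reason", "timeout")
    .build());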

9.4.3 Domain Events: Product Analytics in the Logs Pipeline

A third use of the term “event” comes from product analytics: a “user.signed_up” or “checkout.completed” record meant for product dashboards. In OpenTelemetry, these are best modeled as LogRecords with a specific naming convention: an event.name attribute, often paired with event.domain to namespace them (newer semantic conventions fold the namespace into event.name itself, so check the version you target):

{
  "timestamp": "2025-03-10T10:15:30.123Z",
  "severity_number": 9,
  "severity_text": "INFO",
  "body": "Checkout completed",
  "attributes": {
    "event.name": "checkout.completed",
    "event.domain": "commerce",
    "order.id": "ord_8f3a",
    "order.total_cents": 4200,
    "user.id": "12345",
    "service.name": "checkout-api"
  },
  "trace_id": "8f1b5fe2d5de4a51b8884f8f4cdde3f5",
  "span_id": "d2a41c3ff7a1b0ce"
}

The advantage: domain events ride the logs pipeline (long retention, durable, unsampled) while still carrying trace context for forensic analysis when something goes wrong.

9.4.4 Decision Heuristic: Which Signal?

For a given piece of information, ask:

| Question | Lean toward |
| --- | --- |
| “Will I search this in the logs backend?” | LogRecord |
| “Does it describe something inside a span’s operation?” | Span event |
| “Do I need this even if the trace is sampled out?” | LogRecord |
| “Is it a retry, exception, or state transition in a specific operation?” | Span event |
| “Is it a high-volume cross-cutting signal (audit, security, business)?” | LogRecord |
| “Is it a rare, notable moment only meaningful within a single trace?” | Span event |

For the most important incidents — a 500 from a payment service — emit both: a LogRecord captures the cross-service debugging detail and survives sampling; a span event on the relevant span keeps the in-trace view rich for engineers diving into Tempo. The two are correlated automatically by shared trace_id/span_id.
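
The decision tree in Figure 9.5 is small enough to write down as a function. This is one possible encoding of that figure; the function and parameter names are hypothetical.

```python
def choose_signal(must_survive_sampling: bool,
                  inside_span_operation: bool,
                  searched_in_logs_backend: bool) -> str:
    """Follow Figure 9.5 and return 'log_record', 'span_event', or 'both'."""
    if must_survive_sampling:
        return "log_record"   # span events vanish when their span is sampled out
    if not inside_span_operation:
        return "log_record"   # nothing ties it to one span's operation
    if searched_in_logs_backend:
        return "both"         # in-trace detail plus a searchable log entry
    return "span_event"       # a retry, exception, or state change within one span


print(choose_signal(True, True, True))     # log_record
print(choose_signal(False, True, True))    # both
print(choose_signal(False, True, False))   # span_event
print(choose_signal(False, False, False))  # log_record
```

Note that the figure sends anything that must survive sampling straight to a LogRecord. Emitting both, as recommended above for the most important incidents, is a deliberate addition on top of that branch.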

Key Takeaway: Use span events for in-operation milestones (exceptions, retries, state changes) that benefit from automatic trace correlation, and use LogRecords for cross-cutting, high-volume, or sampling-resistant signals — and for the most important events, emit both, knowing they will join up automatically via shared trace context.

Chapter Summary

Logs in 2025 are no longer a separate world from traces and metrics — they are the third stable signal in OpenTelemetry, with a portable LogRecord schema and a mature Collector pipeline. The headline shifts from this chapter are:

  1. The LogRecord schema (timestamp, severity, body, attributes, trace_id/span_id, resource) is stable. Native SDK support is strong in Java and .NET, usable in Python, and still experimental in Go and Node.js — for the latter, inject trace_id/span_id manually into your existing logger.
  2. Log bridges keep developer ergonomics intact. Code keeps using Logback, logging, ILogger, or winston; the bridge converts events to OTel LogRecords with resource and trace context attached.
  3. Edge collection does not require a rip-and-replace. The Collector’s filelog, journald, and fluentforward receivers, plus the k8sattributes processor, let new OTLP-native services coexist with Fluent Bit/Fluentd/Vector for as long as needed.
  4. Cross-signal correlation demands three aligned things: identical trace_id/span_id on every log line, shared resource attributes across signals, and Grafana data sources configured with Loki derived fields and Tempo “Show logs.”
  5. Cardinality discipline: keep trace_id as a log field, never a Loki label. Labels are for bounded-cardinality dimensions (service, environment, namespace).
  6. Span events vs LogRecords: span events are in-trace annotations that disappear under sampling; LogRecords are durable, queryable, severity-bearing entries. Use both for important incidents.

The recurring theme is that observability is not three siloed pipelines but one cross-referenced data graph, and the trace_id/span_id pair is the glue that holds it together.

Key Terms

| Term | Definition |
| --- | --- |
| LogRecord | The atomic unit of the OpenTelemetry logs signal, with fields for timestamp, severity, body, attributes, trace_id/span_id, and resource context |
| Log bridge | An adapter (appender, handler, provider) that converts a logging framework’s events into OpenTelemetry LogRecords and exports them via OTLP |
| Filelog receiver | An OpenTelemetry Collector receiver that tails log files on disk, parses each line, and emits LogRecords; essential for legacy or third-party applications |
| Span event | A timestamped annotation inside a span (no independent identity, no severity), used to capture milestones such as retries or exceptions within an operation |
| Trace ID | A 128-bit identifier (32 hex characters) that ties all signals related to a single distributed operation together |
| Span ID | A 64-bit identifier (16 hex characters) for a single span within a trace, used along with trace_id to correlate logs to specific operations |
| Loki | Grafana’s log aggregation system, designed around indexed labels (bounded cardinality) plus full-text and JSON field search at query time |
| Structured logging | The practice of emitting logs as machine-parseable key-value records (typically JSON) rather than free-form text, enabling reliable filtering and correlation |
| Resource attributes | Service-level metadata (service.name, deployment.environment, k8s.namespace.name) shared across traces, metrics, and logs to make cross-signal queries possible |
| Derived field | A Grafana Loki data source feature that uses a regex to extract a value (like trace_id) from log lines and turn it into a clickable link to another data source |
| Trace-to-logs | A Grafana Tempo data source feature that builds a Loki query from a span’s resource attributes and trace_id, allowing one-click pivot from trace to logs |

Chapter 10: The OpenTelemetry Collector in Depth

The OpenTelemetry Collector is the centerpiece of any non-trivial observability deployment. It is a vendor-neutral, pluggable data plane that receives telemetry from applications and infrastructure, transforms it on the fly, and exports it to one or more backends — Prometheus, Loki, Tempo, Jaeger, SaaS observability vendors, or all of the above simultaneously. Think of the Collector as the USB-C hub of observability: a single device that lets dozens of input cables (receivers) flow through programmable adapters (processors) into dozens of output ports (exporters), without forcing applications or backends to know about one another.

This chapter takes a deep look at how the Collector is composed, how to shape data with processors like transform and tail_sampling, which receivers and exporters you will reach for most often, and how to operate it reliably under load.

Learning Objectives

By the end of this chapter you will be able to:


10.1 Pipeline Architecture

The Collector is not a single black box; it is a configurable pipeline engine built from four kinds of components plus optional extensions. Every piece of telemetry flowing through the Collector takes the same conceptual journey: it enters through a receiver, passes through a chain of processors, and leaves through one or more exporters. Pipelines are declared per signal type (traces, metrics, logs), and the service block is what actually wires the components together into runnable pipelines.

Figure 10.0: Canonical Collector pipeline anatomy

flowchart LR
    R1[OTLP receiver] --> P1[memory_limiter]
    R2[Prometheus receiver] --> P1
    P1 --> P2[k8sattributes / resource]
    P2 --> P3[filter / tail_sampling]
    P3 --> P4[transform OTTL]
    P4 --> P5[batch]
    P5 --> E1[OTLP exporter]
    P5 --> E2[prometheusremotewrite]
    P5 --> E3[debug]

Figure 10.1: Multi-signal pipeline topology

graph TD
    subgraph Extensions
        HC[health_check]
        PP[pprof]
        ZP[zpages]
    end

    subgraph "traces pipeline"
        TR[otlp receiver] --> TML[memory_limiter]
        TML --> TK8S[k8sattributes]
        TK8S --> TB[batch]
        TB --> TE[otlp/tempo exporter]
    end

    subgraph "metrics pipeline"
        MR[otlp receiver] --> MML[memory_limiter]
        MML --> MK8S[k8sattributes]
        MK8S --> MB[batch]
        MB --> ME[prometheusremotewrite exporter]
    end

    subgraph "logs pipeline"
        LR[otlp receiver] --> LML[memory_limiter]
        LML --> LK8S[k8sattributes]
        LK8S --> LB[batch]
        LB --> LE[loki exporter]
    end

10.1.1 The four component types (plus extensions)

| Component | Role | Examples |
| --- | --- | --- |
| Receiver | Accepts data in (push) or pulls data from a source | otlp, prometheus, hostmetrics, filelog, kafka |
| Processor | Mutates, filters, batches, samples, or enriches data in flight | memory_limiter, batch, transform, tail_sampling |
| Exporter | Sends data to one or more backends | otlp, prometheusremotewrite, loki, debug |
| Connector | Joins two pipelines: acts as an exporter on one side and a receiver on the other | spanmetrics, routing, forward |
| Extension | Non-pipeline capabilities (health, profiling, debugging) | health_check, pprof, zpages, file_storage |

Connectors are a relatively recent addition and are the cleanest way to derive one signal from another — for example, generating RED metrics (Rate, Errors, Duration) from spans by piping traces into a spanmetrics connector and out as metrics on a separate pipeline.

10.1.2 Pipelines per signal type

Each pipeline is strictly typed: a traces pipeline can only contain receivers, processors, and exporters that understand spans. The same is true for metrics and logs. You can declare multiple pipelines of the same type (traces/internal, traces/external) to apply different processing to different streams.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 800
    spike_limit_mib: 200
  batch:
    timeout: 5s
    send_batch_size: 512

exporters:
  otlp/tempo:
    endpoint: tempo-distributor.observability:4317
    tls: { insecure: true }
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  loki:
    endpoint: http://loki-gateway/loki/api/v1/push

extensions:
  health_check:
  pprof:
  zpages:

service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

The service.pipelines section is the contract that turns isolated component definitions into a running data plane. A component declared in receivers, processors, or exporters but not referenced in service.pipelines is silently ignored.
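To make the "defined but never referenced" case concrete, here is a minimal Python sketch that scans an already-parsed configuration (a plain dict, as a YAML loader would produce) and reports such components. The function name `unused_components` and the example dict are illustrative assumptions, not part of the Collector or its tooling; connectors and extensions are deliberately left out.

```python
def unused_components(config):
    """Return components that are defined but never referenced by a pipeline."""
    pipelines = config.get("service", {}).get("pipelines", {})
    unused = {}
    for section in ("receivers", "processors", "exporters"):
        defined = set(config.get(section) or {})
        referenced = set()
        for pipeline in pipelines.values():
            referenced.update(pipeline.get(section, []))
        leftover = defined - referenced
        if leftover:
            unused[section] = leftover
    return unused


# Hypothetical parsed config: "prometheus" and "debug" are defined but never wired in.
example = {
    "receivers": {"otlp": {}, "prometheus": {}},
    "processors": {"memory_limiter": {}, "batch": {}},
    "exporters": {"otlp/tempo": {}, "debug": {}},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["otlp/tempo"],
            }
        }
    },
}

print(unused_components(example))
```

A check like this could run in CI next to the Collector's own config validation, so that a forgotten pipeline reference shows up as a review comment rather than as missing data.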

10.1.3 Processor order matters

Processors run in the order listed in the pipeline. This is one of the most consequential, and most commonly overlooked, properties of Collector configuration. As a rule of thumb [Source: https://kodekloud.com/blog/kubernetes-best-practices-2025/]:

  1. memory_limiter always first — so back-pressure kicks in before later, more expensive processors waste CPU.
  2. Enrichment processors next (e.g., k8sattributes, resource) — so downstream filters can see the full context.
  3. Filter / sampling next — drop unwanted data before transforms touch it.
  4. transform / scrubbing — reshape what’s left.
  5. batch last — coalesce into large outbound batches just before the exporter.
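The ordering rule of thumb above can be expressed as a small lint, sketched here in Python. The stage table and the function name `order_violations` are assumptions made for illustration; the ranking encodes only the five steps listed above, and a deliberate exception (such as running transform before filter) would show up as a reported pair that you then choose to accept.

```python
# Stage of each processor family in the rule of thumb (lower runs earlier).
STAGE = {
    "memory_limiter": 0,
    "k8sattributes": 1,
    "resource": 1,
    "filter": 2,
    "tail_sampling": 2,
    "probabilistic_sampler": 2,
    "transform": 3,
    "batch": 4,
}


def order_violations(processors):
    """Return (earlier, later) pairs that break the recommended ordering."""
    ranked = []
    for name in processors:
        family = name.split("/")[0]  # "filter/web" -> "filter"
        if family in STAGE:
            ranked.append((name, STAGE[family]))
    violations = []
    for i, (first, first_rank) in enumerate(ranked):
        for second, second_rank in ranked[i + 1:]:
            if first_rank > second_rank:
                violations.append((first, second))
    return violations


print(order_violations(["memory_limiter", "k8sattributes", "filter/web", "transform", "batch"]))
print(order_violations(["batch", "memory_limiter"]))
```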

Key Takeaway: A Collector pipeline is a typed chain of receivers, processors, and exporters wired together in service.pipelines. Component order, not just component choice, is what defines behavior.


10.2 Key Processors

Processors are where most of the Collector’s intelligence lives. Two of them — batch and memory_limiter — are effectively mandatory in production; the rest you reach for as your needs grow.

10.2.1 memory_limiter and batch — the mandatory pair

The memory_limiter processor measures the Collector’s own memory usage on a fixed interval and, when usage crosses configurable thresholds, refuses new data (returning errors to receivers). This is what creates back-pressure: instead of the Collector silently dying from an out-of-memory kill, upstream senders see failures, retry, and slow down [Source: https://kubernetes.io/docs/setup/best-practices/cluster-large/].

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 800       # ~80% of container memory limit
    spike_limit_mib: 200 # tolerance for short bursts
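One way to read these two settings: the processor starts refusing data once usage passes a soft limit of limit_mib minus spike_limit_mib, and reacts more aggressively once usage passes limit_mib itself. The Python sketch below models only that decision; the function name, the return strings, and the exact comparison operators are illustrative assumptions rather than the Collector's implementation.

```python
def memory_limiter_decision(used_mib, limit_mib=800, spike_limit_mib=200):
    """Classify a memory reading against the soft and hard limits."""
    soft_limit_mib = limit_mib - spike_limit_mib
    if used_mib > limit_mib:
        return "refuse_and_force_gc"  # above the hard limit
    if used_mib > soft_limit_mib:
        return "refuse"  # above the soft limit: push back on senders
    return "accept"


for reading in (500, 650, 850):
    print(reading, memory_limiter_decision(reading))
```

With the values from the config above, the soft limit sits at 600 MiB, which leaves 200 MiB of headroom for a burst that arrives between two checks.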

The batch processor groups telemetry into larger payloads before they reach an exporter. Larger batches mean fewer gRPC/HTTP calls and dramatically better throughput, at the cost of a few seconds of added latency.

processors:
  batch:
    timeout: 5s
    send_batch_size: 512
    send_batch_max_size: 4096

Analogy: batch is a hotel shuttle that waits up to five minutes (or until the seats are full) before driving to the airport — much more efficient than running an empty taxi for every guest. memory_limiter is the bouncer at the lobby door who turns guests away when the lobby is full, so the building never collapses.
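The "full seats or timeout" behavior can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the class name `Batcher` is hypothetical, the clock is passed in explicitly so the example is deterministic, and the timeout is measured from the oldest buffered item, which is a simplification of how the real processor schedules its timer.

```python
class Batcher:
    """Flush when send_batch_size items are buffered or the timeout elapses."""

    def __init__(self, send_batch_size=512, timeout_s=5.0):
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self.items = []
        self.oldest_at = None

    def _flush(self):
        batch, self.items, self.oldest_at = self.items, [], None
        return batch

    def add(self, item, now):
        """Buffer one item; return a batch if the size threshold is reached."""
        if not self.items:
            self.oldest_at = now
        self.items.append(item)
        if len(self.items) >= self.send_batch_size:
            return self._flush()
        return None

    def tick(self, now):
        """Called periodically; return a batch if the timeout has elapsed."""
        if self.items and now - self.oldest_at >= self.timeout_s:
            return self._flush()
        return None


b = Batcher(send_batch_size=3, timeout_s=5.0)
print(b.add("a", now=0.0), b.add("b", now=1.0), b.add("c", now=2.0))
print(b.add("d", now=3.0), b.tick(now=7.0), b.tick(now=8.0))
```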

10.2.2 attributes, resource, and transform (OTTL)

The attributes and resource processors handle straightforward add/update/delete operations on attribute keys. For anything richer — conditional logic, regex substitution, cross-field arithmetic — reach for the transform processor, which uses the OpenTelemetry Transformation Language (OTTL) [Source: https://arxiv.org/html/2501.11709v3].

OTTL is an expression-based DSL. A statement looks like a function call with an optional where clause:

set(target, value) where <boolean condition>

Statements run inside a context (span, metric, datapoint, log, resource, or scope) and can read or modify fields like attributes["key"], name, body, severity_text, and resource.attributes["key"].

Here is a transform block that normalizes HTTP routes (so /users/42 and /users/9000 are aggregated together), scrubs PII, and tags every span emitted by the checkout service:

processors:
  transform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # Collapse user IDs in URL paths so cardinality stays bounded
          - replace_pattern(attributes["http.target"], "/users/[0-9]+", "/users/:id") where attributes["http.target"] != nil
          # Remove PII before exporting
          - delete_key(attributes, "user.email")
          - delete_key(attributes, "user.id")
          # Whitelist what's allowed to leave
          - keep_keys(attributes, ["http.method", "http.target", "http.status_code", "service.name"])
          # Mark anything from the checkout service
          - set(attributes["env"], "prod") where resource.attributes["service.name"] == "checkout-service"

The error_mode knob is important: propagate (the default in some versions) can fail an entire batch when a single statement errors; ignore is the safer choice in most production pipelines.
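To see why the route-normalization statement in the transform block matters for cardinality, the same regular expression can be tried outside the Collector. The Python sketch below is only an illustration of the pattern's effect; `normalize_route` is a hypothetical helper, and Python's regex engine stands in for the one the Collector uses.

```python
import re

USER_ID = re.compile(r"/users/[0-9]+")


def normalize_route(target):
    """Collapse numeric user IDs so all user routes share one attribute value."""
    return USER_ID.sub("/users/:id", target)


raw_targets = ["/users/42", "/users/9000", "/users/42/orders", "/healthz"]
normalized = sorted({normalize_route(t) for t in raw_targets})

print(len(set(raw_targets)), "distinct raw values")
print(len(normalized), "distinct normalized values:", normalized)
```

Every distinct user ID would otherwise become a distinct attribute value; after normalization the number of values is bounded by the number of route templates.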

A close cousin is the filter processor, which uses OTTL conditions to drop data outright. Filtering health-check spans is a classic use:

processors:
  filter/web:
    error_mode: ignore
    traces:
      span:
        # A span matching any condition in this list is dropped
        - 'name == "/healthz" or attributes["http.target"] == "/healthz"'

Order matters here too. Putting filter before transform saves CPU because dropped data never gets reshaped. Putting transform before filter lets filters operate on normalized values. Pick the order that matches your intent.

10.2.3 tail_sampling and probabilistic_sampler

Sampling is the cost-control lever for traces. The probabilistic_sampler makes a quick, stateless decision (e.g., “keep 5%”) based on the trace ID. It is cheap and predictable but blind — it cannot prefer error or slow traces over normal ones.

The tail_sampling processor is fundamentally different: it buffers all spans for a trace in memory, keyed by trace ID, and decides whether to keep or drop the entire trace only once decision_wait has elapsed since the first span of that trace arrived [Source: https://news.ycombinator.com/item?id=44095189]. The Collector has no way to know that a trace is complete, so the wait window is a timeout, not a completion signal. Because the processor sees the spans collected in that window together, it can sample based on end-to-end latency, final status code, or attributes that only appear on a leaf span.

A production-grade tail sampling policy usually combines several rules through a composite policy with priority ordering:

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 2000
    policies:
      - name: main
        type: composite
        composite:
          max_total_spans_per_second: 1000
          policy_order:
            - error-traces
            - slow-traces
            - premium-tenants
            - baseline
          composite_sub_policy:
            - name: error-traces
              type: status_code
              status_code:
                status_codes: [ERROR]
            - name: slow-traces
              type: latency
              latency:
                threshold_ms: 4000
            - name: premium-tenants
              type: string_attribute
              string_attribute:
                key: tenant.tier
                values: ["gold", "platinum"]
            - name: baseline
              type: probabilistic
              probabilistic:
                sampling_percentage: 1.0

This config keeps every error trace and every trace slower than 4 seconds, keeps everything from premium tenants, and falls back to 1% random sampling for the rest — all under an overall budget of 1,000 spans/sec.

A crucial gotcha: tail sampling only works if SDKs export spans unsampled (always_on or parentbased(always_on)). If the SDK already dropped the spans before they reached the Collector, no policy can resurrect them [Source: http://susandumais.com/CHI2012-12-tailanswers-chi2012.pdf].

Figure 10.2: Tail sampling decision flow

sequenceDiagram
    participant SDK as "SDK always_on"
    participant Col as "Collector tail_sampling"
    participant Buf as "Trace buffer"
    participant Pol as "Policy evaluator"
    participant BE as "Backend Tempo"

    SDK->>Col: Span A trace T1 root
    Col->>Buf: Buffer T1 spans
    SDK->>Col: Span B trace T1 child
    Col->>Buf: Buffer T1 spans
    SDK->>Col: Span C trace T1 error
    Col->>Buf: Buffer T1 spans
    Note over Col,Buf: Wait decision_wait 10s
    Col->>Pol: Evaluate composite policies
    Pol->>Pol: error-traces matches? YES
    Pol-->>Col: KEEP trace T1
    Col->>BE: Export all T1 spans
    Note over Col: Traces matching no policy<br/>are dropped before export

| Aspect | Head/parent-based sampler | tail_sampling processor |
| --- | --- | --- |
| Decision point | Root span start | After decision_wait in Collector |
| Sees full trace | No | Yes |
| Can prefer errors / slow traces | No | Yes |
| SDK overhead | Low (drops at source) | High (must export everything) |
| Collector memory & CPU | Minimal | Substantial (buffers spans) |

10.2.4 k8sattributes — Kubernetes enrichment

The k8sattributes processor watches the Kubernetes API and decorates incoming telemetry with metadata about the pod that sent it: namespace, deployment name, node, labels, annotations. It identifies the sender either by inbound connection IP or by an explicit k8s.pod.ip resource attribute.

processors:
  k8sattributes:
    auth_type: serviceAccount
    filter:
      node_from_env_var: K8S_NODE_NAME
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.node.name
        - k8s.pod.uid
        - k8s.deployment.name
    pod_association:
      - sources:
          - from: connection
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip

Run k8sattributes on the agent (DaemonSet) Collector — the one that receives data directly from local pods — where connection-based association still works. Running it on a central gateway is rarely useful because the source IP it sees is the agent’s, not the application pod’s. Limit the extract.metadata list to what you actually query on; every additional field multiplies cardinality and API-server load.

Key Takeaway: memory_limiter and batch are non-negotiable in production; transform (with OTTL), filter, and tail_sampling reshape and sample data; k8sattributes enriches with Kubernetes context — but only when run close to the source.


10.3 Key Receivers and Exporters

The Collector’s strength is its plug-and-play library of receivers and exporters. A handful cover the vast majority of real-world deployments.

10.3.1 Workhorse receivers

| Receiver | What it ingests | Typical use |
| --- | --- | --- |
| otlp | OTLP/gRPC and OTLP/HTTP (traces, metrics, logs) | Default for SDK and Collector-to-Collector traffic |
| prometheus | Scrapes Prometheus /metrics endpoints | Migrating from Prometheus, scraping exporters |
| hostmetrics | OS-level CPU, memory, disk, network, filesystem, process | Node-agent monitoring (CPU%, memory, load average) |
| filelog | Tails log files with multiline and parser support | Collecting container logs from /var/log/pods on the node |
| kafka | Reads OTLP-encoded data from Kafka topics | Decoupling ingest from processing |
| jaeger, zipkin | Legacy formats for migration | Brownfield environments still emitting Jaeger/Zipkin |

A common DaemonSet receiver block looks like this:

receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      filesystem:
      network:
      load:
  filelog:
    include: ["/var/log/pods/*/*/*.log"]
    start_at: end
    include_file_path: true
    operators:
      - type: container

The prometheus receiver is worth a special mention: it accepts native Prometheus scrape config, which means an existing prometheus.yml can be lifted into the Collector almost verbatim — a powerful migration path when teams want to start consolidating their data plane without rewriting their scrape rules.

10.3.2 Workhorse exporters

| Exporter | Destination | Notes |
| --- | --- | --- |
| otlp | Any OTLP-compatible backend (Tempo, Jaeger, vendors) | Default; speaks OTLP over gRPC (the separate otlphttp exporter covers OTLP/HTTP) |
| prometheusremotewrite | Prometheus, Mimir, Cortex, Thanos | For metrics fan-out into the Prometheus ecosystem |
| loki | Grafana Loki | Logs only; attribute-to-label mapping is configurable |
| debug | stdout of the Collector itself | Replaces the older logging exporter; invaluable for dev |
| kafka | Kafka topic, OTLP-encoded | Pairs with the kafka receiver to build buffered pipelines |
| file | Local file (JSON) | Disaster-recovery sink, offline replay |

A production gateway exporter block typically configures both queueing and retry on the otlp exporter:

exporters:
  otlp/tempo:
    endpoint: tempo-distributor.observability:4317
    tls: { insecure: true }
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 2000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 60s
      max_elapsed_time: 0   # retry forever
  prometheusremotewrite:
    endpoint: http://mimir.observability:8080/api/v1/push
    resource_to_telemetry_conversion: { enabled: true }
  loki:
    endpoint: http://loki-gateway/loki/api/v1/push
  debug:
    verbosity: basic

Setting resource_to_telemetry_conversion.enabled to true is a handy switch on prometheusremotewrite: it promotes OTLP resource attributes (like service.name, k8s.pod.name) into Prometheus labels so they become queryable in PromQL.
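The effect of that promotion can be pictured with a short Python sketch. It is a conceptual model only: `promote_resource_attributes` is a hypothetical helper, the rule of replacing characters that are invalid in Prometheus label names with underscores is a simplification, and letting data point attributes win on a name clash is an assumption made here for illustration.

```python
import re

_INVALID_LABEL_CHARS = re.compile(r"[^a-zA-Z0-9_]")


def promote_resource_attributes(resource_attrs, point_attrs):
    """Merge resource and data point attributes into one flat label set."""
    labels = {}
    for source in (resource_attrs, point_attrs):  # later source wins on clashes
        for key, value in source.items():
            labels[_INVALID_LABEL_CHARS.sub("_", key)] = str(value)
    return labels


labels = promote_resource_attributes(
    {"service.name": "checkout-service", "k8s.pod.name": "checkout-7d9f"},
    {"http.method": "GET"},
)
print(labels)
```

Because every promoted attribute becomes a label on every series, it is worth promoting only resource attributes with bounded cardinality.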

10.3.3 Connectors — the inter-pipeline glue

A connector behaves as an exporter on one pipeline and as a receiver on another. The canonical example is spanmetrics, which consumes spans and emits aggregated RED metrics — exactly the kind of derived signal you want produced once, near the source, instead of repeatedly in each backend.

connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s]
    dimensions:
      - name: http.method
      - name: http.status_code

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [spanmetrics, otlp/tempo]
    metrics/spans:
      receivers: [spanmetrics]
      processors: [batch]
      exporters: [prometheusremotewrite]

Notice how the same spanmetrics instance appears as an exporter under the traces pipeline and as a receiver under metrics/spans. Other useful connectors include routing (split traffic by attribute) and forward (chain pipelines together).
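Conceptually, deriving RED metrics from spans is a group-by over the configured dimensions. The Python sketch below shows that aggregation under stated assumptions: spans are plain dicts with hypothetical keys, the bucket bounds mirror the config above (in milliseconds), and `red_metrics` is an illustrative name, not the connector's API.

```python
from bisect import bisect_left
from collections import defaultdict

# Upper bounds in milliseconds, mirroring the histogram buckets configured above.
BUCKETS_MS = [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000]


def red_metrics(spans):
    """Aggregate spans into call counts, error counts, and duration buckets."""
    series = defaultdict(
        lambda: {"calls": 0, "errors": 0, "buckets": [0] * (len(BUCKETS_MS) + 1)}
    )
    for span in spans:
        key = (span["service"], span["http.method"], span["http.status_code"])
        metric = series[key]
        metric["calls"] += 1
        metric["errors"] += int(span["status"] == "ERROR")
        # Index of the first bucket whose upper bound is >= the duration;
        # the final slot collects everything above the largest bound.
        metric["buckets"][bisect_left(BUCKETS_MS, span["duration_ms"])] += 1
    return dict(series)


spans = [
    {"service": "checkout-service", "http.method": "POST", "http.status_code": 200, "status": "OK", "duration_ms": 7},
    {"service": "checkout-service", "http.method": "POST", "http.status_code": 200, "status": "OK", "duration_ms": 120},
    {"service": "checkout-service", "http.method": "POST", "http.status_code": 500, "status": "ERROR", "duration_ms": 6000},
]
for key, metric in red_metrics(spans).items():
    print(key, metric)
```

Each extra dimension multiplies the number of series, which is why the connector config lists dimensions explicitly.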

Key Takeaway: OTLP and Prometheus receivers cover most ingest; OTLP, Prometheus remote-write, and Loki exporters cover most egress; connectors like spanmetrics cleanly derive one signal from another inside the same Collector.


10.4 Reliability and Operations

Once a Collector is the single conduit for an organization’s telemetry, it becomes a critical piece of infrastructure. Reliability hinges on three things: durable queues, fast feedback through extensions, and resource sizing that matches load.

10.4.1 Persistent queue and retry on failure

Every OTLP exporter — and most other exporters — supports a sending_queue and retry_on_failure. By default the sending queue is in memory: fast, but lost on restart. Pairing it with the file_storage extension makes the queue persistent across restarts, so an OOM kill or rolling deployment doesn’t drop telemetry already accepted from upstream [Source: https://www.pulumi.com/blog/kubernetes-best-practices-i-wish-i-had-known-before/].

extensions:
  file_storage:
    directory: /var/lib/otelcol/storage
    timeout: 1s

exporters:
  otlp/tempo:
    endpoint: tempo-distributor.observability:4317
    sending_queue:
      enabled: true
      storage: file_storage   # makes the queue durable
      num_consumers: 10
      queue_size: 2000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 60s
      max_elapsed_time: 0
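To get a feel for what initial_interval and max_interval mean in practice, the Python sketch below prints an idealized retry schedule. The growth multiplier of 1.5 is an assumption for illustration, jitter is ignored, and `backoff_schedule` is a hypothetical helper rather than Collector code.

```python
def backoff_schedule(initial_s=5.0, max_interval_s=60.0, multiplier=1.5, attempts=8):
    """Return the wait before each retry, growing geometrically up to a cap."""
    delays = []
    delay = initial_s
    for _ in range(attempts):
        delays.append(min(delay, max_interval_s))
        delay *= multiplier
    return delays


schedule = backoff_schedule()
print([round(d, 1) for d in schedule])
print("total wait across retries:", round(sum(schedule), 1), "seconds")
```

With max_elapsed_time set to 0 the retries never give up, so during a long backend outage it is the queue size, not the retry policy, that bounds how much data can be held.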

Figure 10.4: memory_limiter, batch, sending queue, and retry interplay

flowchart LR
    IN[Incoming spans / metrics / logs] --> ML{memory_limiter<br/>over threshold?}
    ML -- "yes: refuse" --> REJ[Return error to receiver<br/>upstream retries / backs off]
    ML -- "no: accept" --> BAT[batch<br/>timeout or send_batch_size]
    BAT --> SQ[(sending_queue<br/>file_storage backed)]
    SQ --> EXP[Exporter consumer pool<br/>num_consumers]
    EXP -- "success" --> BE[Backend]
    EXP -- "failure" --> RET{retry_on_failure<br/>exponential backoff}
    RET -- "retry" --> SQ
    RET -- "queue full" --> DROP[Reject new data<br/>otelcol_exporter_enqueue_failed_spans]

Sizing tips:

10.4.2 Extensions: health_check, pprof, zpages

Extensions don’t participate in pipelines, but they are how you operate the Collector day-to-day.

| Extension | Endpoint (default) | Use |
| --- | --- | --- |
| health_check | :13133/ | Kubernetes liveness/readiness probes |
| pprof | :1777/debug/pprof/ | CPU and heap profiling under load |
| zpages | :55679/debug/ | Live in-process views: pipelines, exporter queues, recent spans |
| file_storage | n/a (filesystem) | Backing store for persistent queues |

A reasonable Kubernetes probe configuration looks like:

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1777
  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [health_check, pprof, zpages, file_storage]  # file_storage must also be defined under extensions (see 10.4.1)

In your pod spec:

livenessProbe:
  httpGet: { path: /, port: 13133 }
  initialDelaySeconds: 10
readinessProbe:
  httpGet: { path: /, port: 13133 }
  periodSeconds: 5

zpages is especially useful during incidents — it exposes per-component counters and sampled recent traces directly from the Collector’s process, so you can answer “Is data flowing? Is anything dropping?” without leaving the cluster.

Figure 10.3: Two-tier agent + gateway Collector topology

flowchart LR
    subgraph "Node 1"
        P1[App pod] --> A1["Agent Collector<br/>DaemonSet<br/>memory_limiter<br/>k8sattributes<br/>light batch"]
        P2[App pod] --> A1
    end
    subgraph "Node 2"
        P3[App pod] --> A2["Agent Collector<br/>DaemonSet<br/>memory_limiter<br/>k8sattributes<br/>light batch"]
        P4[App pod] --> A2
    end
    subgraph "Node N"
        P5[App pod] --> A3["Agent Collector<br/>DaemonSet<br/>memory_limiter<br/>k8sattributes<br/>light batch"]
    end

    A1 --> GW["Gateway Collector<br/>Deployment + HPA<br/>tail_sampling<br/>transform<br/>heavy batch<br/>persistent queue"]
    A2 --> GW
    A3 --> GW

    GW --> TEMPO[(Tempo)]
    GW --> MIMIR[(Mimir)]
    GW --> LOKI[(Loki)]

10.4.3 Sizing, throughput, and memory tuning

In Kubernetes, the recommended pattern is a two-tier topology [Source: https://kodekloud.com/blog/kubernetes-best-practices-2025/]:

Recommended starting resources:

| Role | CPU request | CPU limit | Memory request | Memory limit |
| --- | --- | --- | --- | --- |
| Agent (DaemonSet) | 100–250m | 500–750m | 256–512 Mi | 512 Mi–1 Gi |
| Gateway (Deployment) | 500m–1 vCPU | 2–4 vCPU | 1–2 Gi | 2–4 Gi |

Tune from there using observed peaks. The memory_limiter should target 70–80% of the container memory limit, with spike_limit_mib covering the largest plausible batch:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1600       # for a 2 GiB-limit gateway pod
    spike_limit_mib: 400
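The arithmetic behind these numbers is simple enough to script. In the Python sketch below, `memory_limiter_settings` is a hypothetical helper; the target fraction is the 70–80% guideline from the text, and the spike fraction of 25% is an assumption inferred from the example values in this chapter (200 of 800, 400 of 1600), not an official rule.

```python
def memory_limiter_settings(container_limit_mib, target_fraction=0.75, spike_fraction=0.25):
    """Derive limit_mib and spike_limit_mib from a container memory limit."""
    limit_mib = int(container_limit_mib * target_fraction)
    spike_limit_mib = int(limit_mib * spike_fraction)
    return {"limit_mib": limit_mib, "spike_limit_mib": spike_limit_mib}


# A 2 GiB gateway pod at roughly 78% reproduces the values used above.
print(memory_limiter_settings(2048, target_fraction=0.78125))
# A 1 GiB agent pod at the default 75% target.
print(memory_limiter_settings(1024))
```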

For the gateway, an HPA on CPU (target 60–70% utilization) with minReplicas: 2 gives you graceful scaling and high availability [Source: https://www.gravitee.io/blog/top-5-kubernetes-deployment-strategies]:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65

Agents are not HPA-scaled; they scale with node count via the DaemonSet, so adjust per-agent resources instead. Tail sampling capacity follows a simple rule of thumb: num_traces ≥ expected_new_traces_per_sec × decision_wait × 2. At 2,000 traces/sec and a 10-second decision_wait, that’s num_traces: 40000 minimum — round up to 50,000 for headroom.
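The rule of thumb for num_traces is easy to turn into a helper, sketched here in Python. `min_num_traces` is a hypothetical name and the safety factor of 2 is simply the multiplier from the formula above.

```python
import math


def min_num_traces(new_traces_per_sec, decision_wait_s, safety_factor=2):
    """Lower bound for tail_sampling num_traces under the rule of thumb."""
    return math.ceil(new_traces_per_sec * decision_wait_s * safety_factor)


print(min_num_traces(2000, 10))
print(min_num_traces(500, 30))
```

Note how decision_wait enters linearly: tripling the wait window triples the number of traces that must be held in memory at the same ingest rate.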

Monitor the Collector with… itself: every Collector exposes its own internal metrics on port 8888 (e.g., otelcol_exporter_queue_size, otelcol_processor_dropped_spans, otelcol_receiver_refused_spans). Alert on:

Key Takeaway: Run a two-tier (agent + gateway) topology, give the gateway a durable sending queue with retry, expose health_check/pprof/zpages for operability, and size memory_limiter, num_traces, and HPA targets from observed load rather than guesses.


Chapter Summary

The OpenTelemetry Collector is a pluggable, vendor-neutral data plane composed of receivers, processors, exporters, connectors, and extensions, with pipelines declared per signal type in the service block. Order is decisive: memory_limiter belongs first, enrichment and filtering come next, transforms reshape what survives, and batch is the final stop before an exporter. OTTL — the OpenTelemetry Transformation Language — gives the transform and filter processors a small, context-aware DSL for setting, deleting, whitelisting, and pattern-replacing attributes, with where clauses to scope each statement.

For cost control on traces, tail_sampling buffers spans by trace ID and decides per-trace after decision_wait, enabling policies that prefer errors, slow traces, and key tenants while a probabilistic backstop gives baseline coverage. Workhorse components — otlp, prometheus, hostmetrics, filelog, kafka receivers; otlp, prometheusremotewrite, loki, debug exporters; and connectors like spanmetrics — cover the overwhelming majority of real deployments.

Reliability comes from three habits: configuring memory_limiter and batch everywhere, enabling persistent sending queues with retry on the gateway, and exposing health_check, pprof, and zpages for fast operational feedback. In Kubernetes, deploy a DaemonSet agent for local concerns (logs, host metrics, k8sattributes) and a horizontally-scaled gateway Deployment for tail sampling, transforms, and egress. Size everything from real metrics, not hope — the Collector’s own :8888 endpoint is your honest source of truth.

Key Terms

| Term | Definition |
| --- | --- |
| Receiver | Component that ingests telemetry into the Collector via push (OTLP) or pull (Prometheus scrape, file tail, etc.). |
| Processor | Component that mutates, filters, batches, samples, or enriches telemetry as it flows through a pipeline. |
| Exporter | Component that sends telemetry from the Collector to one or more downstream backends. |
| Connector | Hybrid component acting as an exporter on one pipeline and a receiver on another; used to derive signals (e.g., metrics from spans). |
| OTTL | OpenTelemetry Transformation Language, a context-aware DSL used by transform (statements) and filter (conditions). |
| Tail sampling | Per-trace sampling decision made after spans have been buffered, enabling policies based on full-trace properties. |
| memory_limiter | Processor that monitors Collector memory and applies back-pressure (refuses new data) above configured thresholds. |
| k8sattributes | Processor that enriches telemetry with Kubernetes pod metadata (namespace, deployment, labels) via the Kubernetes API. |

Chapter 11: Sampling, Performance, and Cost Control

Observability is one of those engineering disciplines where the cure can become as expensive as the disease. A team gets a P1 incident, blames lack of telemetry, and overcorrects by instrumenting everything at 100% sampling, adding every available label, and shipping every line of structured log to a hot index. Three months later, the observability bill rivals the infrastructure bill, the ingestion pipeline is back-pressuring, and engineers are still no faster at finding the root cause.

The discipline this chapter teaches is the opposite reflex: deliberately keeping the right telemetry and dropping the rest, in the right place, at the right cost. We will look at sampling strategies (head vs tail), cardinality management for metrics, the real performance overhead of instrumentation in Java and Go, and how to design a cost-aware architecture that scales.

Think of telemetry like a city’s water system. You do not need to bottle and refrigerate every drop that flows through a pipe — but you do need flow sensors at junctions, alarms when a main bursts, and a sampled chemistry test now and then. Get the sampling right, and a small storage tank tells you everything you need. Get it wrong, and you are paying to refrigerate the entire reservoir.

Learning Objectives

By the end of this chapter, you will be able to:


Sampling Strategies

Sampling is the single most powerful lever you have for controlling telemetry cost and overhead. It is also the most misunderstood. Engineers reach for “100% sampling, we want to see everything,” not realizing that at high throughput this can cost more than the application itself. The trick is to sample in a way that preserves what matters — errors, slow traces, unusual tenants — while discarding the redundant majority.

In OpenTelemetry, two sampling families dominate: head-based sampling (decided in the SDK at the start of a trace) and tail-based sampling (decided in the Collector after most of the trace has been seen) [Source: https://opentelemetry.io/docs/zero-code/java/agent/performance/].

Head-based: ParentBased, TraceIdRatio, AlwaysOn/Off

Head-based sampling makes the keep-or-drop decision at the first service that creates the root span — typically the SDK in your front-door API. Because the decision is made up front, downstream services never have to record, store, or ship the spans of a dropped trace. This is the cheapest possible form of sampling.

The canonical head sampler in OTel is TraceIdRatioBased. It derives the decision deterministically from the trace ID itself (in practice by comparing part of the ID against a threshold proportional to p), so two services that see the same trace ID and use the same ratio will always agree. Set p = 0.1 and you keep roughly 10% of traces, uniformly at random [Source: https://opentelemetry.io/docs/zero-code/java/agent/performance/].

In isolation, TraceIdRatioBased is not enough. If Service A keeps a trace and Service B re-decides on its own with a different ratio, you get incoherent traces where the parent is sampled but children are missing. The fix is to wrap it in ParentBased: child spans honor the parent’s decision regardless of their own configured rate. The idiomatic OTel sampler is therefore:

ParentBased(root=TraceIdRatioBased(0.1))

This means: “If I am a root span, decide using a 10% ratio. If I have a parent, do whatever the parent did.” This single line is what makes distributed sampling coherent.
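The decision logic can be sketched in a few lines of plain Python. This is a toy model, not the OpenTelemetry SDK's code: the function names are hypothetical, and comparing the low 64 bits of the trace ID against a threshold is only an assumption about one reasonable way to obtain a deterministic ratio.

```python
import random
from typing import Optional

_LIMIT = 1 << 64  # toy model: decide on the low 64 bits of the trace ID


def ratio_decision(trace_id: int, ratio: float) -> bool:
    """Deterministic keep/drop: the same trace ID and ratio always agree."""
    threshold = round(ratio * _LIMIT)
    return (trace_id & (_LIMIT - 1)) < threshold


def parent_based_decision(
    trace_id: int, parent_sampled: Optional[bool], ratio: float
) -> bool:
    """Root spans use the ratio; child spans copy the parent's decision."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_decision(trace_id, ratio)


if __name__ == "__main__":
    rng = random.Random(7)
    ids = [rng.getrandbits(128) for _ in range(20_000)]
    kept = sum(parent_based_decision(t, None, 0.1) for t in ids)
    print(f"kept {kept / len(ids):.1%} of root traces")
```

Because the root decision depends only on the trace ID, any service that evaluates it for the same trace reaches the same answer, and the parent flag removes the need to re-evaluate it downstream at all.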

The other head samplers are extremes: AlwaysOn keeps every trace (good for dev/staging), AlwaysOff drops every trace (useful as a kill switch or for ephemeral services where traces are noise).

Figure 11.1: Head sampling KEEP vs DROP across services

sequenceDiagram
    participant C as Client
    participant A as Service A<br/>(root span)
    participant B as Service B
    participant D as Service D
    participant Col as Collector
    Note over A: KEEP path (sampled in)
    C->>A: request 1
    A->>A: TraceIdRatio: keep
    A->>B: call (sampled=true)
    B->>D: call (sampled=true)
    A-->>Col: export spans
    B-->>Col: export spans
    D-->>Col: export spans
    Note over A: DROP path (sampled out)
    C->>A: request 2
    A->>A: TraceIdRatio: drop
    A->>B: call (sampled=false)
    B->>D: call (sampled=false)
    Note over A,D: no span recording,<br/>no export, no cost

Tail-based at the Collector

Head sampling is statistically unbiased but stupid: it will happily throw away the one trace where the payment service threw a 500, just because the dice rolled the wrong way. For rare-but-important events, you want to decide after you have seen the whole trace.

Tail-based sampling lives in the OpenTelemetry Collector, in the tail_sampling processor. The mechanics are:

  1. SDKs run AlwaysOn (or a generous head sample like 50%) and ship every span.
  2. The Collector buffers spans in memory, grouped by trace ID.
  3. It waits a decision_wait window (often 5–30 seconds) for the trace to “complete.”
  4. It evaluates policies — keep if any span has error=true, keep if duration > 3s, keep if tenant_id=enterprise-1234, otherwise sample 10% randomly.
  5. Selected traces are flushed to the backend; the rest are dropped.
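
The policy step (step 4) can be sketched as a pure function over the buffered spans of one trace. This is an illustration only, not the tail_sampling processor's implementation: the Span type, the function name, the use of the longest span as a stand-in for trace latency, and the modulo-based baseline are all assumptions made to keep the example small.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: int
    duration_s: float
    error: bool = False
    tenant_id: str = ""


def tail_decision(
    spans: List[Span],
    slow_threshold_s: float = 3.0,
    keep_tenants: frozenset = frozenset({"enterprise-1234"}),
    baseline_percent: int = 10,
) -> Optional[str]:
    """Return the first matching policy name, or None to drop the trace.

    Expects the non-empty list of spans buffered for a single trace.
    """
    if any(s.error for s in spans):
        return "error"
    # Simplification: the longest span stands in for whole-trace latency.
    if max(s.duration_s for s in spans) > slow_threshold_s:
        return "slow"
    if any(s.tenant_id in keep_tenants for s in spans):
        return "tenant"
    # Deterministic stand-in for "otherwise sample 10% randomly".
    if spans[0].trace_id % 100 < baseline_percent:
        return "baseline"
    return None
```

Returning the policy name rather than a bare boolean makes it easy to count how many kept traces each policy is responsible for.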

The advantage is huge: you get full knowledge of the trace, including attributes that downstream services added (status codes, latency, tenant). The disadvantage is equally large: you must buffer everything in collector memory until the decision window closes.

The collector’s memory cost scales linearly with throughput and the wait window:

buffered_bytes ≈ traces/sec × avg_spans/trace × avg_span_size × wait_window_seconds

Plug in realistic numbers: 10,000 traces/sec × 20 spans/trace × 1 KB/span × 10 s window = 2 GB of in-memory spans [Source: https://opentelemetry.io/docs/zero-code/java/agent/performance/]. Double the window, double the memory. This is why tail sampling typically requires sharding collectors by trace ID (so all spans of a given trace land on the same collector) and aggressive timeout tuning.
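
The sizing arithmetic is easy to script for your own traffic. The sketch below reads the throughput term as traces per second (so that multiplying by spans per trace does not double count) and treats 1 KB as 1,000 bytes; the function name is hypothetical.

```python
def tail_buffer_bytes(
    traces_per_sec: float,
    spans_per_trace: float,
    span_bytes: float,
    wait_seconds: float,
) -> float:
    """Approximate bytes held in collector memory during the decision window."""
    return traces_per_sec * spans_per_trace * span_bytes * wait_seconds


if __name__ == "__main__":
    size = tail_buffer_bytes(10_000, 20, 1_000, 10)
    print(f"{size / 1e9:.1f} GB buffered")
```

Treat the result as a lower bound: per-span bookkeeping in a real collector adds overhead that this estimate ignores.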

Tail sampling also adds observability latency: the time between a trace happening and the trace appearing in your backend. If your SREs depend on traces for live incident detection, a 30-second decision window means a 30-second blind spot.

Probabilistic vs Adaptive Sampling

Both head and tail sampling can be either probabilistic (fixed rate) or adaptive (rate adjusts to volume).

Probabilistic is the default and the simplest: “keep 10% of traces, forever.” It is deterministic, predictable, and easy to reason about for SLO math (your sampled error rate is an unbiased estimate of the true error rate). Most teams should start here.

Adaptive sampling dynamically adjusts the rate to hit a target traces-per-second budget. If traffic spikes 5×, the sampler drops the rate from 10% to 2% to keep the ingest pipeline steady. This is great for cost control but breaks naive statistical extrapolation: a 1% error rate in a 2%-sampled bucket means something different than in a 50%-sampled bucket. Note that the OTel Collector probabilistic_sampler processor applies a fixed sampling_percentage and does not adapt on its own; volume-capping behavior comes from mechanisms such as the rate_limiting policy of the tail_sampling processor or a remote sampling service that pushes updated rates to the SDKs.
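
A minimal sketch of the rate calculation, assuming the simplest possible controller (target divided by observed volume, capped at 100%). The function names are hypothetical, and real adaptive samplers smooth the observed rate over time rather than reacting to a single reading.

```python
def adaptive_rate(target_tps: float, observed_tps: float) -> float:
    """Sampling probability that keeps roughly target_tps traces per second."""
    if observed_tps <= 0:
        return 1.0
    return min(1.0, target_tps / observed_tps)


def extrapolate(kept_count: int, rate: float) -> float:
    """Estimate the original count: each kept trace represents 1/rate traces."""
    return kept_count / rate


if __name__ == "__main__":
    for observed in (1_000, 5_000):
        rate = adaptive_rate(100, observed)
        print(f"{observed} traces/s -> sample {rate:.0%}")
```

The extrapolate helper shows why the rate in force must be recorded with each kept trace: without it, counts taken under different rates cannot be compared.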

A common production pattern is the hybrid: a light head sample (say, ParentBased(TraceIdRatioBased(0.2)) for 20%) plus a tail sampler that further refines the 20% to keep errors and slow traces. This caps collector memory at a fifth of full firehose, while still letting the tail layer find interesting traces. The catch: errors dropped by the head sampler are lost forever. “Keep all errors” becomes “keep all errors among the 20% that survived head sampling” [Source: https://github.com/open-telemetry/opentelemetry-java-instrumentation/discussions/2104].

Figure 11.2: Hybrid head + tail sampling decision pipeline

flowchart TD
    Start[Trace begins<br/>at SDK]
    Start --> Head{Head sampler<br/>ParentBased<br/>TraceIdRatio 20%}
    Head -->|80% drop| Drop1[Discard at SDK<br/>no spans recorded]
    Head -->|20% keep| Export[Export spans<br/>to Collector]
    Export --> Buffer[Buffer by trace_id<br/>decision_wait window]
    Buffer --> Tail{Tail sampling<br/>policies}
    Tail -->|error=true| Keep1[Persist to backend]
    Tail -->|latency &gt; 3s| Keep2[Persist to backend]
    Tail -->|tenant=enterprise| Keep3[Persist to backend]
    Tail -->|10% random| Keep4[Persist to backend]
    Tail -->|none matched| Drop2[Discard at Collector]

Comparison: When to Use Which

| Dimension | Head-based (TraceIdRatioBased) | Tail-based (collector tail_sampling) |
| --- | --- | --- |
| Where decided | SDK, at first span | Collector, after buffering |
| Decision timing | Immediate, pre-export | After decision_wait (5–30 s) |
| Latency impact (request) | Negligible | None on request, but observability lag |
| Memory cost | Very low | High; scales with spans/sec × wait |
| Network cost | Low (drops never leave service) | High (all spans cross the wire) |
| Accuracy for rare events | Poor (random drops) | Excellent (sees full trace) |
| Statistical properties | Unbiased random sample | Biased toward “interesting” |
| Configuration complexity | Simple, per-SDK setting | Complex; tune buffers, timeouts, policies |
| Best for | High-QPS, cost-sensitive APIs | Rare-error capture, targeted debugging |
| Example use case | 50k RPS API at 1% sample | Payment service, keep all errors + slow |

[Source: https://opentelemetry.io/docs/zero-code/java/agent/performance/]

Key Takeaway: Head-based sampling is cheap, deterministic, and ideal for high-volume systems where statistical aggregates matter more than every individual trace. Tail-based sampling is expensive and complex, but it is the only way to reliably catch rare-but-important traces — use it where the cost of missing a trace exceeds the cost of buffering all traces.


Cardinality Management

If sampling controls trace cost, cardinality controls metric cost. In a Prometheus-style time-series database, the unit of storage is not the metric — it is the unique combination of metric name plus label set. Each unique combination is one time series, with its own samples, its own memory footprint in the head block, and its own row in long-term storage [Source: https://blog.codinghorror.com/the-problem-with-logging/].

A metric like http_requests_total with labels {method="GET", status="200"} is one series. Add a user_id label with 100,000 distinct users and you have potentially 100,000 series per method-status combination. The math is multiplicative, not additive, and the explosion is brutal.
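
The multiplication is worth seeing with numbers. The sketch below computes the worst-case series count as the product of distinct values per label; the label counts are made-up illustrations, and real metrics usually sit below this bound because not every combination occurs.

```python
import math
from typing import Dict


def worst_case_series(distinct_values: Dict[str, int]) -> int:
    """Upper bound on series for one metric: product of per-label value counts."""
    return math.prod(distinct_values.values())


if __name__ == "__main__":
    base = {"method": 5, "status": 10}
    print(worst_case_series(base))                         # bounded labels only
    print(worst_case_series({**base, "user_id": 100_000}))  # one unbounded label added
```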

Identifying High-Cardinality Labels

The first job is forensic: figuring out which metrics and labels are causing the blow-up. Prometheus has a number of built-in tools for this.

The most useful PromQL query is the “top-N by series count”:

topk(20, count by (__name__)({__name__=~".+"}))

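This returns the 20 metric names with the most series. Once you have a suspect, drill into its labels, substituting the label you want to inspect (for example uri) for the label_name placeholder. The result is the number of series per value of that label:

topk(10, count by (label_name) (http_server_requests_seconds_count))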

For deeper analysis, promtool tsdb analyze reads the on-disk block format and produces a report of the heaviest labels and series:

promtool tsdb analyze /path/to/data > tsdb-report.txt

The TSDB also exposes meta-metrics that should be on every Prometheus operator’s dashboard: prometheus_tsdb_head_series (live series count), prometheus_tsdb_head_series_created_total (churn), and prometheus_tsdb_wal_fsync_duration_seconds (early sign of disk pressure).

For users running Grafana Mimir, the Cardinality Explorer UI and a set of cardinality API endpoints make this even easier: /api/v1/cardinality/label_names, /api/v1/cardinality/label_values, and so on, served under the Prometheus HTTP prefix. You can script these into CI/CD so a build fails if a new metric crosses, say, 50,000 series [Source: https://blog.codinghorror.com/the-problem-with-logging/].
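
A CI gate of that kind can be sketched without any backend at all by counting series in a scraped exposition payload. This is an assumption-laden illustration: it parses the Prometheus text format with a simple regular expression, counts series in a single scrape rather than querying Mimir, and uses hypothetical helper names.

```python
import re
from collections import Counter
from typing import List

# Metric name, optional {label set}, then whitespace before the sample value.
_SERIES_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{.*\})?\s")

SAMPLE = """\
# TYPE http_requests_total counter
http_requests_total{method="GET",user_id="1"} 3
http_requests_total{method="GET",user_id="2"} 5
http_requests_total{method="POST",user_id="1"} 1
up 1
"""


def series_per_metric(exposition_text: str) -> Counter:
    """Count exposed series per metric name, skipping comments and blanks."""
    counts: Counter = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SERIES_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def over_budget(counts: Counter, max_series: int) -> List[str]:
    """Names of metrics whose series count exceeds the budget."""
    return sorted(name for name, n in counts.items() if n > max_series)


if __name__ == "__main__":
    offenders = over_budget(series_per_metric(SAMPLE), max_series=2)
    print("over budget:", offenders)
```

In a pipeline, a non-empty offenders list would be turned into a failing exit code.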

Allow/Deny Lists and Attribute Drops

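Once you know the offenders, the fastest fix is to drop them at scrape time using metric_relabel_configs. This Prometheus config runs after the sample has been scraped but before it is committed to storage, making it the right place for cardinality surgery. One caveat: dropping a label does not merge the series that differed only by that label. If the remaining label sets collide within a scrape, Prometheus treats the extras as duplicates rather than summing them, so reserve labeldrop for labels whose removal leaves every series unique, and use recording rules or application changes for true aggregation.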

Drop a single noisy label (labeldrop matches its regex against label names, so it takes no source_labels):

metric_relabel_configs:
  - regex: "uri"
    action: labeldrop

Drop several at once:

  - regex: "user_id|session_id|request_id"
    action: labeldrop

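Keep only an allow-listed set (paranoid mode). The list must include __name__, and le for histograms, because labelkeep removes every label whose name does not match:

  - regex: "__name__|job|instance|method|status_code|le"
    action: labelkeep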

Drop entire metric families:

  - source_labels: [__name__]
    regex: ".*_per_user_.*"
    action: drop

Normalize dynamic path segments:

  - source_labels: [path]
    regex: "/api/v1/users/[0-9]+/(.*)"
    target_label: path
    replacement: "/api/v1/users/:id/$1"
    action: replace

That last pattern — normalizing /users/12345/orders into /users/:id/orders — is one of the most useful tricks in the playbook. It collapses unbounded user-ID-laden paths into a bounded set of route templates without losing the structural information.
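
The same rewrite can be tried offline before touching the scrape config. The sketch below mirrors the relabel rule in Python, assuming (as Prometheus does) that the regex must match the whole value; the function name is hypothetical.

```python
import re

_USER_PATH = re.compile(r"/api/v1/users/[0-9]+/(.*)")


def normalize_path(path: str) -> str:
    """Replace the numeric user ID with :id; leave non-matching paths alone."""
    match = _USER_PATH.fullmatch(path)
    if match is None:
        return path
    return "/api/v1/users/:id/" + match.group(1)


if __name__ == "__main__":
    raw = [f"/api/v1/users/{uid}/orders" for uid in (1, 22, 12345)]
    print({normalize_path(p) for p in raw})
```

Note that a path such as /api/v1/users/12345 with no trailing segment does not match this particular regex and passes through unchanged, which is a reminder to test the rule against real traffic samples.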

In the OpenTelemetry Collector, the equivalent surgery happens in the attributes and transform processors, which can drop, rename, or hash attributes before they reach the metrics exporter.

Aggregating Away Unbounded Dimensions

Sometimes a label genuinely matters for some questions but is too expensive to keep on the raw metric. The pattern here is recording rules: pre-compute and persist a lower-cardinality aggregation, and query the rolled-up series instead of the raw one.

groups:
  - name: myapp_aggregates
    interval: 30s
    rules:
      - record: myapp:http_request_duration_seconds_bucket:service
        expr: |
          sum by (service, le) (
            rate(myapp_http_request_duration_seconds_bucket[5m])
          )

This rule takes the per-series rate first and then sums across all labels except service and le (bucket boundary). Taking the rate before the sum keeps a counter reset on a single instance from distorting the aggregate. Quantile queries then become trivial:

histogram_quantile(0.95, myapp:http_request_duration_seconds_bucket:service)

You can keep the high-cardinality raw metric at short retention (say, 2 hours) for live debugging, while the rolled-up recording rule retains for 90 days at a tiny fraction of the storage.
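
To make the reduction concrete, here is a toy version of a "sum by" aggregation over labeled values. It is only a model of what the rollup does to series count; the sample data, the pod and route labels, and the function name are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Labels = Dict[str, str]


def sum_by(series: Iterable[Tuple[Labels, float]], keep: Tuple[str, ...]) -> dict:
    """Sum values across all labels except those named in keep."""
    out: Dict[tuple, float] = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels[name]) for name in keep)
        out[key] += value
    return dict(out)


RAW = [
    ({"service": "orders", "le": "0.5", "pod": "a", "route": "/x"}, 10.0),
    ({"service": "orders", "le": "0.5", "pod": "b", "route": "/x"}, 5.0),
    ({"service": "orders", "le": "+Inf", "pod": "a", "route": "/x"}, 12.0),
    ({"service": "orders", "le": "+Inf", "pod": "b", "route": "/x"}, 6.0),
]

if __name__ == "__main__":
    rolled = sum_by(RAW, ("service", "le"))
    print(f"{len(RAW)} raw series -> {len(rolled)} rolled-up series")
```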

The best long-term fix, however, is always application-level: do not emit unbounded labels in the first place. Use templated route names. Use stable error_code enums, not raw exception messages. Never put user_id, session_id, request_id, email, or cart_id on a metric — those belong on traces or logs, where the storage model can handle high cardinality [Source: https://blog.codinghorror.com/the-problem-with-logging/].

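If you already have bad data in storage, Prometheus exposes a TSDB admin API (enabled with the --web.enable-admin-api flag): POST /api/v1/admin/tsdb/delete_series with match[] selectors and an optional time range, followed by POST /api/v1/admin/tsdb/clean_tombstones to actually reclaim disk. Deletion support in Mimir differs, so check its documentation before relying on the same calls.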

Figure 11.3: Cardinality reduction funnel

flowchart TD
    Raw["Raw exposition<br/>http_requests_total{method, status,<br/>user_id, request_id, path}<br/>~1,000,000 series<br/>(cost: 1000x)"]
    Raw -->|"metric_relabel_configs:<br/>labeldrop user_id, request_id<br/>normalize /users/:id/"| Mid
    Mid["After scrape-time relabel<br/>http_requests_total{method, status, route}<br/>~50,000 series<br/>(cost: 50x)"]
    Mid -->|"recording rule:<br/>sum by (service, le)"| Roll
    Roll["Recording-rule rollup<br/>myapp:http_request_duration:service<br/>~200 series<br/>(cost: 1x)"]

Key Takeaway: Each unique label combination is its own time series, and high-cardinality labels multiply storage cost. Identify offenders with topk queries and Mimir’s Cardinality Explorer, drop them at scrape time with metric_relabel_configs, aggregate via recording rules, and — above all — never put unbounded IDs on metric labels at the application level.


Performance Overhead

The third lever is the cost of generating telemetry — the CPU and memory the instrumentation itself consumes inside your application. This is the cost engineers worry about most, but it is usually the cost that matters least if you have sampling and cardinality under control. Still, knowing the numbers — and how to tune them — is part of being a competent observability operator.

CPU and Memory Cost of SDK Instrumentation

The honest answer is “it depends on the runtime, the agent style, and the workload.” But we can give meaningful ranges.

Java auto-instrumentation (the OpenTelemetry Java Agent) is the heaviest-weight common case because it uses bytecode instrumentation. The Elastic EDOT Java benchmark on a sample JVM service gives concrete numbers [Source: https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/java/overhead]:

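| Metric | No agent | With EDOT Java | Relative impact |
| --- | --- | --- | --- |
| Startup time | 5.55 s | 6.82 s | ~+23% (+1.3 s) |
| p95 request latency | 1.96 ms | 2.06 ms | ~+5% |
| Total system CPU | 53.82% | 54.25% | ~+0.8% relative (+0.43 percentage points) |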

For most well-tuned JVM microservices, expect 1–5% additional CPU and tens of MB of extra heap [Source: https://opentelemetry.io/docs/zero-code/java/agent/performance/]. The OpenTelemetry community discussion on this topic puts the worst-case at up to ~20% CPU and ≥0.5 ms extra latency per instrumented hop for very high-throughput, fully instrumented apps without sampling [Source: https://github.com/open-telemetry/opentelemetry-java-instrumentation/discussions/2104]. That worst case is largely driven by GC pressure: every span allocates objects, and at high QPS the allocation rate dominates.

Go instrumentation has a different architecture. The Go SDK is library-based (you import packages and add middleware) rather than a runtime bytecode agent. There is no JVM-style startup penalty and no class-loading hit. In practice, Go OpenTelemetry overhead at sampled rates is typically in the low single-digit CPU percent range with a few to tens of MB of additional RSS [Source: https://opentelemetry.io/docs/zero-code/java/agent/performance/].

Across both runtimes, the dominant cost is almost never the trace API itself — it is the attributes you collect and the exporter you use. Avoid expensive attribute collection in hot loops (don’t call getName() on a reflection-heavy object for every span), keep attribute lists short, and prefer the async batched exporter.

Batching, Async Export, and Buffer Tuning

The OpenTelemetry BatchSpanProcessor (BSP) is the workhorse that decouples span generation from span export. Spans go into an in-memory queue, a background thread pulls them in batches, and the exporter ships them over HTTP/gRPC. The tunable knobs all trade memory for CPU and loss-resistance:

| Knob | Effect of larger value | Effect of smaller value |
| --- | --- | --- |
| Batch size | Better CPU/network efficiency, more memory | Lower memory, more frequent exports |
| Schedule delay | Fewer export calls, spans linger in memory | Lower loss risk, more even export |
| Max queue size | Survives bigger spikes without dropping | Smaller memory footprint, drops earlier |
| Sampling rate | More traces shipped, higher overhead | Cheaper, less coverage |

[Source: https://opentelemetry.io/docs/zero-code/java/agent/performance/]

Sampling is by far the most powerful knob. Going from 100% to 10% sampling reduces per-request export work by roughly 10×, and that compound saving cascades through the entire pipeline — fewer spans allocated, fewer batches, less network, less collector CPU, less backend ingest cost. Most production systems run at 1–10% probabilistic sampling, sometimes augmented with tail sampling for rare-error coverage.

Two other patterns matter at scale. First, back-pressure: when the queue fills, BSP drops spans rather than blocking the application. You must monitor the SDK’s dropped-span counter (written here as otel_sdk_span_processor_dropped_spans; the exact metric name depends on the SDK and version); a sustained nonzero rate means you are losing trace data and need to either sample harder or grow the queue. Second, async-only exporters in production: the synchronous SimpleSpanProcessor blocks the application thread on every export and should be reserved for dev/test.
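
The drop-on-full behavior is easy to model. The class below is a single-threaded toy, not the BatchSpanProcessor itself: the class and method names are hypothetical, and a real processor adds a background worker, a schedule delay, and thread safety.

```python
from collections import deque
from typing import Any, List


class BoundedSpanQueue:
    """Toy model of a span queue that drops instead of blocking when full."""

    def __init__(self, max_queue_size: int, max_batch_size: int) -> None:
        self._queue: deque = deque()
        self.max_queue_size = max_queue_size
        self.max_batch_size = max_batch_size
        self.dropped = 0  # the counter an operator would alert on

    def offer(self, span: Any) -> bool:
        """Enqueue without blocking; count and reject the span if full."""
        if len(self._queue) >= self.max_queue_size:
            self.dropped += 1
            return False
        self._queue.append(span)
        return True

    def next_batch(self) -> List[Any]:
        """Remove and return up to max_batch_size spans for export."""
        batch = []
        while self._queue and len(batch) < self.max_batch_size:
            batch.append(self._queue.popleft())
        return batch


if __name__ == "__main__":
    q = BoundedSpanQueue(max_queue_size=3, max_batch_size=2)
    accepted = [q.offer(i) for i in range(5)]
    print(accepted, "dropped:", q.dropped)
```

The important property is that offer never waits: the application thread pays a constant cost whether or not the exporter is keeping up.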

Benchmarking Instrumented vs Uninstrumented Code

The right way to make capacity decisions is to measure your own application, not trust generic numbers. A minimal benchmark protocol:

  1. Baseline run: same hardware, same load generator, instrumentation disabled. Record p50/p95/p99 latency, CPU%, RSS, GC pause time, throughput.
  2. Instrumented run: same workload, instrumentation enabled at the sampling rate you intend to ship. Record the same metrics.
  3. Compute deltas: percentage CPU increase, absolute latency increase per request, RSS delta in MB.
  4. Stress run: drive load 2–3× above expected peak to find the breakpoint where instrumentation overhead becomes nonlinear (usually GC-driven on JVM).
  5. Tune and re-measure: lower sampling rate, drop unnecessary instrumentations, raise queue size, and re-run.
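
Step 3 of the protocol can be sketched as a small delta report. The helper name and dictionary keys are hypothetical; the input figures are the ones quoted from the benchmark table earlier in this section.

```python
from typing import Dict


def overhead_report(
    baseline: Dict[str, float], instrumented: Dict[str, float]
) -> Dict[str, Dict[str, float]]:
    """Absolute and percentage change for every metric in the baseline run."""
    report = {}
    for name, base in baseline.items():
        delta = instrumented[name] - base
        report[name] = {
            "absolute": round(delta, 2),
            "percent": round(100 * delta / base, 1),
        }
    return report


if __name__ == "__main__":
    baseline = {"startup_s": 5.55, "p95_ms": 1.96, "cpu_pct": 53.82}
    instrumented = {"startup_s": 6.82, "p95_ms": 2.06, "cpu_pct": 54.25}
    for metric, row in overhead_report(baseline, instrumented).items():
        print(metric, row)
```

Keeping both columns matters: a change can look large in relative terms and small in absolute terms, or the reverse.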

Plan capacity with headroom: +5–20% CPU for Java auto-instrumentation, low single-digit for Go, and budget heap growth on the JVM side [Source: https://www.elastic.co/docs/reference/opentelemetry/edot-sdks/java/overhead].

Figure 11.4: Async export pipeline (BatchSpanProcessor)

flowchart LR
    App["Application thread<br/>span.end()"]
    App -->|enqueue<br/>non-blocking| Queue["In-memory queue<br/>max_queue_size"]
    Queue -->|"drop on full<br/>(dropped_spans counter)"| DropPath[Span dropped]
    Queue -->|background worker<br/>pulls batches| Batcher["Batcher<br/>batch_size /<br/>schedule_delay"]
    Batcher -->|OTLP gRPC/HTTP| Exp[Exporter]
    Exp --> Col[OTel Collector]
    Col --> Backend["Backend<br/>(Tempo / Mimir / Loki)"]

Key Takeaway: OpenTelemetry instrumentation typically costs 1–5% CPU and tens of MB of memory in well-tuned services, with Java agents at the higher end and Go SDKs at the lower. The most powerful overhead knob is the sampling rate, followed by batching configuration. Always benchmark against your own workload — generic numbers are a starting point, not a substitute for measurement.


Cost-aware Architecture

Sampling, cardinality, and overhead controls are tactical. The strategic layer is architecture: where in the pipeline you do what work, how telemetry flows from edge to backend, and how long each tier retains data. A well-architected observability stack costs a fraction of a poorly architected one for the same operational value.

Think of it as cold-chain logistics for telemetry. Fresh data needs fast, expensive storage. As it ages, you transfer it to progressively cheaper tiers, eventually freezing it into cold archive or deleting it entirely. The trick is matching the temperature curve of value-over-time to the cost curve of storage tiers.

Metric Aggregation at the Edge

The further telemetry travels before it is reduced, the more expensive it becomes. Network bytes, collector CPU, backend ingestion fees, hot-storage GB-hours — all scale with the volume that crosses each boundary.

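Edge aggregation means doing reduction work as close to the source as possible: head sampling and bounded attributes in the SDK, relabeling and sampling in a node-level collector, and tail sampling plus cardinality enforcement in a regional collector tier.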

This “fan-in funnel” is the dominant production pattern. A typical Kubernetes observability stack might run a Prometheus Agent or OTel Collector DaemonSet (one per node), feeding a regional collector StatefulSet (one per cluster), feeding the central backend. Each tier reduces volume by 2–10×.

Figure 11.5: Fan-in collector tiers and tiered storage

flowchart LR
    subgraph Apps["App pods (many)"]
        A1[App + SDK]
        A2[App + SDK]
        A3[App + SDK]
    end
    subgraph Node["Node-level Collector<br/>(DaemonSet)"]
        N1[Relabel + sample]
    end
    subgraph Regional["Regional Collector tier<br/>(StatefulSet)"]
        R1["Tail sampling<br/>cardinality enforcement"]
    end
    subgraph Backend["Observability backend<br/>(Mimir / Tempo / Loki)"]
        Hot["Hot tier<br/>SSD, 2–24 h"]
        Warm["Warm tier<br/>S3, 7–30 d"]
        Cold["Cold tier<br/>Glacier, 90 d – 1 y"]
    end
    A1 --> N1
    A2 --> N1
    A3 --> N1
    N1 -->|2–10x reduction| R1
    R1 -->|2–10x reduction| Hot
    Hot -->|age out| Warm
    Warm -->|age out| Cold

Log Volume Reduction Strategies

Logs are the wildest tier in cost. Unlike metrics (bounded by cardinality) and traces (bounded by sampling), logs are a free-form firehose that engineers reflexively crank up under stress. A few patterns to keep log spend in check:

  1. Structured logging only: JSON or equivalent. This lets the pipeline filter and route deterministically rather than running expensive regex over free-text.
  2. Severity-based routing: keep DEBUG and INFO in a short-retention warm tier (3–7 days). Send WARN and above to longer retention. Send only ERROR+ to the always-on alerting pipeline.
  3. Sample DEBUG/INFO in production: 1–10% sampling of high-volume application logs, similar to traces. Per-tenant or per-route exceptions for active investigations.
  4. Drop or hash unbounded fields: stack traces are large and often duplicate — deduplicate by hash, store the hash with the log, keep the full trace only on the first occurrence per N-minute window.
  5. Convert logs to metrics where possible: a counter of db_timeout_total{service="orders"} is vastly cheaper than ten thousand log lines of “DB timeout in orders.” Use log-to-metric processors in the collector.
  6. Suppress duplicate spam: a “circuit breaker open” log line at 1000 lines/sec is not telling you anything new after the first one. Use a rate-limiting processor.
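
Two of the patterns above, severity-based routing and duplicate suppression, can be sketched together. Everything here is illustrative: the destination names, the class, and the choice to key duplicates on a hash of the message are assumptions, and a production processor would also bound the size of its state.

```python
import hashlib
from typing import Dict


def route(severity: str) -> str:
    """Pick a destination tier from the log severity."""
    if severity in ("DEBUG", "INFO"):
        return "short-retention"
    if severity == "WARN":
        return "long-retention"
    return "long-retention+alerting"  # ERROR and above


class DuplicateSuppressor:
    """Emit the first copy of a message per window; count the rest."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._last_emitted: Dict[str, float] = {}
        self.suppressed = 0

    def should_emit(self, message: str, now: float) -> bool:
        key = hashlib.sha256(message.encode("utf-8")).hexdigest()
        last = self._last_emitted.get(key)
        if last is not None and now - last < self.window_seconds:
            self.suppressed += 1
            return False
        self._last_emitted[key] = now
        return True


if __name__ == "__main__":
    dedupe = DuplicateSuppressor(window_seconds=60)
    for t in (0.0, 0.5, 1.0, 61.0):
        print(t, dedupe.should_emit("circuit breaker open", now=t))
```

Passing the timestamp in explicitly keeps the logic deterministic and easy to test; the suppressed counter is a natural candidate for a log-to-metric conversion.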

Tiered Storage and Retention Policies

A modern observability backend like Mimir, Tempo, or Loki splits storage into tiers:

| Tier | Latency | Cost | Typical use |
| --- | --- | --- | --- |
| Hot (memory / SSD) | ms | High | Last 2–24 h, live alerting, on-call investigation |
| Warm (object storage, e.g. S3) | seconds | Medium | 1–30 days, recent incident review, capacity planning |
| Cold (archive / Glacier) | minutes–hours | Very low | 30 d – years, compliance, long-term trend analysis |

A reasonable default policy for each pillar:

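| Pillar | Hot | Warm | Cold |
| --- | --- | --- | --- |
| Metrics (raw) | 2 h | 15 d | |
| Metrics (recording rules) | 24 h | 90 d | 1 y |
| Traces (sampled) | 24 h | 7 d | |
| Traces (interesting, tail-sampled) | 24 h | 30 d | 90 d |
| Logs (ERROR+) | 7 d | 30 d | 1 y |
| Logs (INFO/DEBUG, sampled) | 24 h | 7 d | |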

These numbers are illustrative — real retention should be driven by your incident-postmortem cadence, audit/compliance requirements, and how often investigations actually reach back beyond 7, 30, or 90 days. The principle is to measure how often queries hit each retention bucket and prune accordingly. Most teams discover that less than 1% of queries reach beyond 30 days, which is a strong signal that anything older should move to cold storage.
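
Measuring how far back queries reach is a simple counting exercise once you have a query log. The sketch below assumes you can extract a lookback age in days for each query; the function name and the sample distribution are invented for illustration.

```python
from typing import Dict, Sequence


def lookback_share(
    lookback_days: Sequence[float], boundaries: Sequence[int] = (7, 30, 90)
) -> Dict[int, float]:
    """Fraction of queries reaching further back than each boundary (days)."""
    total = len(lookback_days)
    if total == 0:
        return {b: 0.0 for b in boundaries}
    return {
        b: sum(1 for d in lookback_days if d > b) / total for b in boundaries
    }


if __name__ == "__main__":
    # Hypothetical query log: mostly recent lookbacks, a few long ones.
    ages = [0.1] * 90 + [10] * 8 + [45, 120]
    for days, share in lookback_share(ages).items():
        print(f"beyond {days:>2} d: {share:.0%} of queries")
```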

The architectural lever that ties this all together is back-pressure handling. When ingest spikes above capacity — whether from a traffic surge, a noisy deploy, or a runaway debug log — every component in the pipeline must have a documented behavior: drop, buffer, downsample, or block. The wrong default (block) cascades into application latency. The right default (drop with metric) keeps the application healthy and surfaces the overload to operators via clear telemetry-about-telemetry counters.

Key Takeaway: Cost-aware observability architecture treats telemetry as a tiered logistics problem: reduce volume as early as possible (edge aggregation), match retention to query frequency (hot/warm/cold tiers), and design every component for graceful degradation under back-pressure. The cheapest byte is the one you never shipped.


Chapter Summary

Sampling, cardinality, performance overhead, and architecture are the four levers that determine whether your observability stack is a tool or a cost center.

Sampling is your primary cost lever. Use ParentBased(TraceIdRatioBased(p)) for cheap, coherent head sampling at 1–10% on high-volume APIs. Add tail sampling at the collector when you must guarantee capture of rare events like errors or slow traces, but budget memory carefully (traces/sec × spans/trace × span_size × wait_seconds) and accept 5–30 seconds of observability latency.

Cardinality is the metric-side equivalent of sampling. Each unique label combination is a series, and unbounded labels (user_id, request_id, raw paths) explode cost multiplicatively. Identify offenders with topk queries and Mimir’s Cardinality Explorer, drop or normalize them at scrape time with metric_relabel_configs, aggregate via recording rules, and fix the application code so they never reach the pipeline.

Performance overhead of instrumentation is usually 1–5% CPU and tens of MB of memory in well-tuned services, with Java agents at the higher end (up to ~20% in worst cases) and Go SDKs at the lower end. The sampling rate is the most powerful overhead knob; batching tuning is second. Always benchmark against your own workload.

Architecture ties it all together. Reduce volume as early as possible (edge aggregation in app, node, regional tiers), match retention to query frequency (hot/warm/cold tiers), convert logs to metrics where possible, and design every stage for back-pressure with documented drop/buffer behavior. The cheapest, fastest, most accurate telemetry is the telemetry you correctly chose not to keep.

The discipline these four levers share is intentionality: every byte of telemetry should exist because you made a deliberate decision that its value exceeds its cost.


Key Terms

| Term | Definition |
| --- | --- |
| Head sampling | Sampling decision made at the start of a trace in the SDK, before spans are recorded. Cheap and deterministic but may miss rare events. |
| Tail sampling | Sampling decision made in the collector after the trace has (mostly) completed. Accurate for rare events but expensive in memory and adds observability latency. |
| TraceIdRatioBased | OpenTelemetry head sampler that hashes the trace ID to make a deterministic keep-or-drop decision at probability p. Usually wrapped in ParentBased. |
| ParentBased | OpenTelemetry sampler wrapper that makes child spans inherit their parent’s sampling decision, ensuring coherent traces across services. |
| Adaptive sampling | Sampling whose rate adjusts dynamically to traffic volume or a target traces-per-second budget. Useful for cost control but complicates statistical extrapolation. |
| Cardinality | The number of unique time series produced by a metric, bounded above by the product of the distinct values of each label. Unbounded labels cause cardinality explosion. |
| metric_relabel_configs | Prometheus configuration that operates on individual scraped samples to drop, rename, or normalize labels. The primary scrape-time cardinality control. |
| Recording rule | A Prometheus rule that pre-computes and persists an aggregation (e.g., sum by (service, le)) so queries hit a lower-cardinality derived series. |
| BatchSpanProcessor (BSP) | OpenTelemetry SDK component that buffers spans and exports them asynchronously in batches. Tunable via batch size, schedule delay, and queue size. |
| Batching | The practice of accumulating telemetry into groups before export to amortize per-call overhead and improve network efficiency, at the cost of slightly increased memory and latency. |
| Back-pressure | The condition where a downstream component cannot accept telemetry as fast as the upstream produces it. Must be handled by dropping, buffering, downsampling, or blocking, with the chosen behavior documented. |
| Retention | The duration that telemetry is kept queryable in each storage tier. Hot tiers (ms-latency) are short and expensive; cold tiers (minute-latency) are long and cheap. |
| Tiered storage | Architecture that automatically migrates telemetry between hot, warm, and cold tiers based on age, matching cost to query frequency. |
| Edge aggregation | The practice of reducing telemetry volume (sampling, aggregating, dropping attributes) as close to the source as possible (in the SDK, sidecar, or node agent) to minimize downstream cost. |

Chapter 12: SLOs, Alerting, and Operational Excellence

This is the capstone chapter. Earlier chapters gave you the signals — metrics from Prometheus, traces and logs from OpenTelemetry, dashboards from Grafana. This chapter gives you the discipline that turns those signals into a reliable production practice. You will move from “we collect telemetry” to “we run services to a contract with our users, and we know — quantitatively — when that contract is at risk.”

We will build the SLI/SLO model on top of PromQL, design multi-window multi-burn-rate (MWMBR) alerts that respect error budgets, configure Alertmanager so the on-call rotation is humane instead of corrosive, sketch a reference architecture for a Kubernetes platform, and finish with a look at where observability is heading — profiles as a fourth signal, AI-assisted root cause analysis, and the convergence of OpenTelemetry semantic conventions.

Learning Objectives

By the end of this chapter, you will be able to:


Service Level Indicators and Objectives

A production service is a promise to its users. Service Level Indicators (SLIs) measure how well you keep that promise; Service Level Objectives (SLOs) are the targets you commit to; and the error budget is the contractually allowed slack between perfect and “good enough.” This is the conceptual machinery that lets engineering and product organizations talk about reliability without arguing about anecdotes.

Choosing Good SLIs

A good SLI is a ratio between good events and valid events — for example, “the fraction of HTTP requests that returned a non-5xx within 300 ms” — measured from the user’s perspective. Ratios are powerful because they normalize naturally with traffic: a 1% error rate is the same whether you serve 10 RPS or 10,000 RPS [Source: https://www.dash0.com/guides/prometheus-monitoring].

Four families of SLIs cover most services:

| SLI Family | Question It Answers | Example PromQL Shape |
| --- | --- | --- |
| Availability | Did the request succeed? | sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m])) |
| Latency | Was the response fast enough? | sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) / sum(rate(http_request_duration_seconds_count[5m])) |
| Freshness | Is the data recent enough? | time() - max(pipeline_last_success_timestamp_seconds) < 600 |
| Correctness | Did the system compute the right answer? | Domain-specific, e.g., reconciliation counters |

Notice that two of these, availability and latency, are directly expressible from any standard OpenTelemetry HTTP instrumentation: the http.server.request.duration histogram carries the http.response.status_code attribute, and its count series doubles as a request counter, so PromQL can group both SLIs on the same attribute. This is why aligning on OTel semantic conventions (Chapter 8) pays off here: every service in your fleet uses the same metric names, so a single SLO recording-rule template applies everywhere.

Analogy: think of SLIs as the dashboard of a car. Availability is “does the engine start?”, latency is “how fast can it accelerate?”, freshness is “how old is the GPS reading?”, and correctness is “does the odometer read what we actually drove?” The car can be running while still failing on any of these dimensions.

Figure 12.1: SLI → SLO → Error Budget → Burn Rate

graph TD
    SLI[Service Level Indicator<br/>good events / valid events<br/>e.g. non-5xx requests under 300ms]
    SLO[Service Level Objective<br/>target on the SLI<br/>e.g. 99.9% over 30 days]
    EB[Error Budget<br/>= 1 - SLO<br/>e.g. 0.1% ~ 43.2 min / month]
    BR[Burn Rate<br/>observed_error_fraction / error_budget<br/>how fast budget is consumed]
    A[Burn-Rate Alert<br/>fires when consumption is unsustainable<br/>page or ticket by severity]
    SLI --> SLO
    SLO --> EB
    EB --> BR
    BR --> A

Error Budgets

Once you commit to an SLO, simple arithmetic gives you the error budget:

error_budget = 1 - SLO

For a 99.9% availability SLO over 30 days:

error_budget = (1 - 0.999) × 30 d × 24 h × 60 min = 0.001 × 43,200 min = 43.2 min

That number is the most useful artifact in your reliability program. It converts an abstract percentage into a concrete budget that engineers, product managers, and on-call leads can reason about. If you have already burned 30 minutes this month, you have about 13 minutes left, and that knowledge changes deployment risk decisions immediately.

A common error-budget policy reads: “While budget remains, ship features aggressively. When budget is exhausted, freeze risky deploys until the next window.” This makes reliability a forcing function on engineering priorities rather than a vague aspiration [Source: https://www.dash0.com/guides/prometheus-monitoring].

The table below shows budgets for common SLO targets. Note that each additional nine shrinks the budget tenfold, which is why every extra nine is so much more expensive to deliver:

SLO | Allowed Bad Fraction | Monthly Budget (30d) | Quarterly Budget (90d)
99% | 1.0% | 7 h 12 m | 21 h 36 m
99.5% | 0.5% | 3 h 36 m | 10 h 48 m
99.9% | 0.1% | 43.2 m | 2 h 9.6 m
99.95% | 0.05% | 21.6 m | 1 h 4.8 m
99.99% | 0.01% | 4.32 m | 12.96 m
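The budgets in the table follow directly from error_budget = 1 - SLO multiplied by the window length. A minimal Python sketch that reproduces the arithmetic is shown here; the function names are illustrative and not part of any library.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed 'bad' minutes for an SLO target (e.g. 0.999) over a window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


def format_minutes(minutes: float) -> str:
    """Render minutes as 'H h M m' or 'M m', matching the table's style."""
    hours, rem = divmod(round(minutes, 2), 60)
    rem = round(rem, 2)
    return f"{int(hours)} h {rem:g} m" if hours else f"{rem:g} m"


# Recompute the monthly (30d) and quarterly (90d) budgets for each target.
for slo in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    monthly = format_minutes(error_budget_minutes(slo, 30))
    quarterly = format_minutes(error_budget_minutes(slo, 90))
    print(f"{slo:.2%}  {monthly:>10}  {quarterly:>11}")
```

Running the same function against your own SLO and window is a quick sanity check before a target is written into a policy document.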

Burn Rate and Multi-Window Multi-Burn-Rate Alerts

The burn rate B measures how fast you are consuming the error budget relative to a steady-state pace:

B = observed_error_fraction / error_budget

If B = 1, you will consume exactly the entire budget over the SLO window. If B = 14.4, you’ll exhaust a 30-day budget in roughly 30/14.4 ≈ 2 days. Burn rate is the right abstraction for alerting because it answers the only question that matters: at the current pace, when do we run out of budget? [Source: https://www.dash0.com/guides/prometheus-monitoring]
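As a worked illustration of the formula above, the following sketch computes a burn rate from an observed error fraction and converts it into time to exhaustion. It assumes a constant burn over the window, which real traffic rarely delivers, and the function names are hypothetical.

```python
def burn_rate(observed_error_fraction: float, slo: float) -> float:
    """B = observed_error_fraction / error_budget, with error_budget = 1 - SLO."""
    return observed_error_fraction / (1 - slo)


def days_to_exhaustion(burn: float, window_days: float = 30) -> float:
    """Days until the whole window's budget is spent at a constant burn rate."""
    if burn <= 0:
        return float("inf")  # no errors, so the budget is never exhausted
    return window_days / burn


# A 1.44% error ratio against a 99.9% SLO.
b = burn_rate(0.0144, 0.999)
print(f"burn rate {b:.1f}x, budget gone in {days_to_exhaustion(b):.2f} days")
```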

The naive approach, alerting when the error rate exceeds some constant threshold, has two problems: it pages on transient blips (false positives), and it does not detect slow, steady erosion of the budget (false negatives). The multi-window multi-burn-rate (MWMBR) pattern from the Google SRE Workbook addresses both by requiring agreement across two windows of different durations: the long window shows that a meaningful share of the budget has been spent, and the short window shows that the burn is still happening [Source: https://prometheus.io/docs/prometheus/latest/getting_started/].

For a 99.9% / 30-day SLO, the canonical thresholds are:

Alert Type | Short Window | Long Window | Burn Rate | Error Threshold | Severity
Fast-burn page | 5 m | 1 h | 14.4 | 1.44% | page
Medium-burn page | 30 m | 6 h | 6 | 0.6% | page
Slow-burn ticket | 2 h | 24 h | 3 | 0.3% | ticket
Slowest-burn ticket | 6 h | 3 d | 1 | 0.1% | ticket
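The burn rates in the table are not arbitrary. Each corresponds to spending a chosen share of the 30-day budget within the long window. The sketch below derives them; the budget shares (2%, 5%, 10%, 10%) are the ones commonly quoted alongside this pattern and are an assumption here, since the text above does not state them.

```python
WINDOW_HOURS = 30 * 24  # 30-day SLO window


def burn_rate_for(budget_share: float, long_window_hours: float) -> float:
    """Burn rate that spends `budget_share` of the budget in `long_window_hours`."""
    return budget_share * WINDOW_HOURS / long_window_hours


def error_threshold(burn: float, slo: float) -> float:
    """Error ratio that corresponds to a burn rate under a given SLO."""
    return burn * (1 - slo)


# (tier, assumed share of budget consumed, long window in hours)
tiers = [("fast", 0.02, 1), ("medium", 0.05, 6), ("slow", 0.10, 24), ("slowest", 0.10, 72)]
for name, share, hours in tiers:
    burn = burn_rate_for(share, hours)
    print(f"{name:8s} burn={burn:5.1f}x  threshold={error_threshold(burn, 0.999):.2%}")
```

For a different SLO window or target, only WINDOW_HOURS and the slo argument change; the burn rates themselves depend on the window, not on the target.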

Implement the ratios as recording rules so they evaluate once and can be reused by alerts, dashboards, and ad-hoc queries:

groups:
- name: slo-recording-rules
  interval: 30s
  rules:
  - record: slo:http_errors:ratio_rate5m
    expr: |
      sum by (service, env) (rate(http_requests_total{code=~"5.."}[5m]))
      /
      sum by (service, env) (rate(http_requests_total[5m]))

  - record: slo:http_errors:ratio_rate1h
    expr: |
      sum by (service, env) (rate(http_requests_total{code=~"5.."}[1h]))
      /
      sum by (service, env) (rate(http_requests_total[1h]))

  - record: slo:http_errors:ratio_rate6h
    expr: |
      sum by (service, env) (rate(http_requests_total{code=~"5.."}[6h]))
      /
      sum by (service, env) (rate(http_requests_total[6h]))

  - record: slo:http_errors:ratio_rate3d
    expr: |
      sum by (service, env) (rate(http_requests_total{code=~"5.."}[3d]))
      /
      sum by (service, env) (rate(http_requests_total[3d]))

Then the MWMBR alert rules:

- alert: SLOErrorBudgetBurnFast
  expr: |
    (
      slo:http_errors:ratio_rate5m{service="checkout"} > (14.4 * 0.001)
      and
      slo:http_errors:ratio_rate1h{service="checkout"} > (14.4 * 0.001)
    )
    and
    sum by (service, env) (rate(http_requests_total{service="checkout"}[5m])) > 1
  for: 2m
  labels:
    severity: page
    slo: availability-99.9-30d
    team: payments
  annotations:
    summary: "Checkout burning error budget at >14.4x in {{ $labels.env }}"
    description: |
      5m and 1h error ratios both exceed 1.44%. At this rate the 30-day
      budget will be exhausted in ~2 days. Investigate immediately.
    runbook_url: "https://runbooks.example.com/checkout/slo-fast-burn"

- alert: SLOErrorBudgetBurnSlow
  expr: |
    (
      slo:http_errors:ratio_rate6h{service="checkout"} > (1 * 0.001)
      and
      slo:http_errors:ratio_rate3d{service="checkout"} > (1 * 0.001)
    )
  for: 15m
  labels:
    severity: ticket
    slo: availability-99.9-30d
  annotations:
    summary: "Checkout slow burn — budget exhaustion within SLO window"
    runbook_url: "https://runbooks.example.com/checkout/slo-slow-burn"
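To see why the paired windows matter, it can help to model the alert condition outside Prometheus. The following is a simplified sketch of the boolean logic in the fast-burn rule; it ignores the for: clause and rule evaluation timing, and the function and scenario names are illustrative.

```python
def mwmbr_fires(short_ratio, long_ratio, burn, slo=0.999,
                request_rate=None, min_request_rate=1.0):
    """Return True when both windows exceed burn * error_budget.

    request_rate mirrors the traffic gate in the fast-burn rule: when given,
    the alert is suppressed unless traffic exceeds min_request_rate req/s.
    """
    threshold = burn * (1 - slo)
    if request_rate is not None and request_rate <= min_request_rate:
        return False
    return short_ratio > threshold and long_ratio > threshold


# (5m error ratio, 1h error ratio) for three hypothetical situations.
scenarios = {
    "transient blip": (0.05, 0.002),     # 5m spikes, 1h still healthy
    "sustained outage": (0.05, 0.02),    # both windows above 1.44%
    "already recovered": (0.0, 0.02),    # 1h still elevated, 5m back to zero
}
for name, (short, long_) in scenarios.items():
    print(f"{name}: {mwmbr_fires(short, long_, burn=14.4)}")
```

The short window is what lets the alert stop firing soon after recovery, while the long window is what keeps a brief spike from paging anyone.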

Two production details worth noting. First, the sum(rate(...)) > 1 gate suppresses spurious 100% error ratios from a single failed request in a low-traffic window [Source: https://www.sysdig.com/blog/prometheus-exporters-best-practices]. Second, latency SLOs reuse the same machinery — just substitute a histogram-based “good ratio”:

  - record: slo:http_latency:good_ratio_5m
    expr: |
      sum by (service) (rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
      /
      sum by (service) (rate(http_request_duration_seconds_count[5m]))
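The latency ratio works because Prometheus histogram buckets are cumulative: the le="0.3" bucket already counts every request that finished in 300 ms or less, and the +Inf bucket equals the total count. A small sketch of that arithmetic follows, with made-up per-second rates; note that the threshold must be an actual bucket boundary.

```python
import math


def good_ratio(bucket_rates, threshold_le):
    """Share of requests at or under `threshold_le` seconds.

    bucket_rates maps each cumulative `le` bound to its per-second rate;
    math.inf stands for the +Inf bucket, which equals the total request rate.
    A missing boundary raises KeyError, because the ratio cannot be read
    from buckets that do not exist.
    """
    total = bucket_rates[math.inf]
    if total == 0:
        return math.nan  # PromQL likewise yields NaN for 0/0
    return bucket_rates[threshold_le] / total


# Hypothetical cumulative bucket rates (requests per second).
rates = {0.1: 60.0, 0.3: 95.0, 1.0: 99.0, math.inf: 100.0}
print(good_ratio(rates, 0.3))
```

If the SLO threshold is 300 ms, the instrumentation needs a bucket boundary at exactly 0.3; otherwise the nearest boundary silently becomes the real objective.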

Figure 12.2: Multi-Window Multi-Burn-Rate Alert Escalation

graph TD
    ER[Observed Error Ratio<br/>recording rules over 5m, 1h, 6h, 3d]
    F1{5m AND 1h<br/>above 14.4x burn?}
    M1{30m AND 6h<br/>above 6x burn?}
    S1{2h AND 24h<br/>above 3x burn?}
    SS1{6h AND 3d<br/>above 1x burn?}
    P1[Fast-Burn PAGE<br/>budget gone in ~2 days<br/>wake on-call now]
    P2[Medium-Burn PAGE<br/>budget gone in ~5 days]
    T1[Slow-Burn TICKET<br/>budget gone in ~10 days]
    T2[Slowest-Burn TICKET<br/>budget exhaustion in window]
    OK[No alert<br/>budget healthy]
    ER --> F1
    F1 -->|yes| P1
    F1 -->|no| M1
    M1 -->|yes| P2
    M1 -->|no| S1
    S1 -->|yes| T1
    S1 -->|no| SS1
    SS1 -->|yes| T2
    SS1 -->|no| OK

Key Takeaway: SLOs convert reliability from opinion into arithmetic. Express SLIs as good/valid ratios using PromQL on OTel-conformant metrics, derive an error budget, and alert on burn rate across paired short and long windows so you page on spikes and catch slow erosion — without crying wolf.


Alerting Architecture

A well-tuned SLO is wasted if the alert it produces lands in a flood of unrelated noise at 3 a.m. The alerting architecture — how alerts route, group, inhibit, and reach humans — determines whether on-call is a sustainable practice or a route to burnout.

Alertmanager Routing, Grouping, and Inhibition

Alertmanager sits downstream of every Prometheus and routes alerts to receivers (PagerDuty, Slack, email). Three primitives shape its behavior: grouping, routing trees, and inhibition.

Grouping collapses multiple related alerts into a single notification. The right group_by labels are stable, low-cardinality identifiers of an incident, not of individual resources. Grouping by pod or container will explode notifications during every rolling deploy; grouping by alertname, service, severity, env produces one notification per actual problem [Source: https://institute.sfeir.com/en/kubernetes-training/deploy-kube-prometheus-stack-production-kubernetes/].

Routing trees branch by severity, environment, and team ownership. Critical prod alerts go to PagerDuty; warnings go to email or a team Slack channel; info goes to a low-priority channel or nowhere [Source: https://www.plural.sh/blog/prometheus-operator-kubernetes-guide/].

Inhibition rules suppress symptom alerts when a known root-cause alert is firing — for instance, a node-down alert silences every per-pod alert on the same node.
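The effect of the group_by choice is easy to demonstrate with a toy model. This sketch groups a burst of per-pod alerts by two different label sets; it illustrates the idea only and is not Alertmanager's implementation. The alert labels are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels (missing label = '')."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Twenty pods of one service become unready during a rolling deploy.
alerts = [
    {"alertname": "KubePodNotReady", "service": "checkout",
     "severity": "warning", "env": "prod", "pod": f"checkout-{i}"}
    for i in range(20)
]

by_incident = group_alerts(alerts, ["alertname", "service", "severity", "env"])
by_pod = group_alerts(alerts, ["alertname", "service", "severity", "env", "pod"])
print(len(by_incident), "notification group vs", len(by_pod), "groups")
```

One group means one notification listing twenty pods; twenty groups means twenty separate notifications for the same incident.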

A production-shaped Alertmanager configuration:

global:
  resolve_timeout: 5m

route:
  receiver: 'default-slack'
  group_by: ['alertname', 'service', 'severity', 'env']
  group_wait: 60s         # collect bursts before first send
  group_interval: 10m     # how often to send updates for an active group
  repeat_interval: 4h     # re-notify cadence for unresolved pages
  routes:
    - matchers:
        # the SLO rules earlier in this chapter label pages "page", so match both
        - severity=~"critical|page"
      routes:
        - matchers: [ env="prod" ]
          receiver: 'pagerduty-prod'
          continue: true
          routes:
            - matchers: [ team="payments" ]
              receiver: 'pagerduty-payments'
            - matchers: [ team="platform" ]
              receiver: 'pagerduty-platform'
        - matchers: [ env=~"staging|dev" ]
          receiver: 'slack-nonprod'
    - matchers:
        # the SLO rules earlier in this chapter label tickets "ticket", so match both
        - severity=~"warning|ticket"
      receiver: 'slack-warnings'
      group_interval: 30m
      repeat_interval: 12h
    - matchers:
        - severity="info"
      receiver: 'slack-info'
      repeat_interval: 24h

inhibit_rules:
  - source_matchers: [ severity="critical", alertname="KubernetesNodeDown" ]
    target_matchers: [ alertname=~"KubePodCrashLooping|KubePodNotReady|InstanceDown" ]
    equal: ['node', 'env']

  - source_matchers: [ severity="critical", alertname="DatabaseUnavailable" ]
    target_matchers: [ alertname=~"SLOErrorBudgetBurn.*" ]
    equal: ['env', 'service']

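receivers:
  # Every receiver named in the route tree must be defined here,
  # or Alertmanager will refuse to load the configuration.
  - name: 'default-slack'
    slack_configs:
      - api_url: '<URL>'
        channel: '#alerts-default'   # placeholder channel; not specified in the text
  - name: 'pagerduty-prod'
    pagerduty_configs:
      - service_key: '<KEY>'
  - name: 'pagerduty-payments'
    pagerduty_configs:
      - service_key: '<KEY>'
  - name: 'pagerduty-platform'
    pagerduty_configs:
      - service_key: '<KEY>'
  - name: 'slack-nonprod'
    slack_configs:
      - api_url: '<URL>'
        channel: '#nonprod'
  - name: 'slack-warnings'
    slack_configs:
      - api_url: '<URL>'
        channel: '#alerts-warnings'
        send_resolved: true
  - name: 'slack-info'
    slack_configs:
      - api_url: '<URL>'
        channel: '#alerts-info'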
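The inhibit_rules in the configuration above hinge on the equal list: a symptom is muted only when a firing source alert shares the same values for those labels. The following sketch models that check in simplified form. It treats every matcher value as an anchored regex and skips details such as silences and timing, so read it as an illustration rather than Alertmanager's exact behavior. The alert label values are hypothetical.

```python
import re


def matches(labels, matchers):
    """True if every matcher's regex fully matches the label value ('' if absent)."""
    return all(re.fullmatch(pattern, labels.get(name, ""))
               for name, pattern in matchers.items())


def is_inhibited(target, firing, rule):
    """True if some other firing alert is a source that shares the `equal` labels."""
    if not matches(target, rule["target"]):
        return False
    return any(
        matches(source, rule["source"])
        and all(source.get(l, "") == target.get(l, "") for l in rule["equal"])
        for source in firing
        if source is not target
    )


rule = {
    "source": {"severity": "critical", "alertname": "KubernetesNodeDown"},
    "target": {"alertname": "KubePodCrashLooping|KubePodNotReady|InstanceDown"},
    "equal": ["node", "env"],
}
node_down = {"alertname": "KubernetesNodeDown", "severity": "critical",
             "node": "node-3", "env": "prod"}
pod_same_node = {"alertname": "KubePodNotReady", "node": "node-3", "env": "prod"}
pod_other_node = {"alertname": "KubePodNotReady", "node": "node-4", "env": "prod"}
firing = [node_down, pod_same_node, pod_other_node]

print(is_inhibited(pod_same_node, firing, rule))   # same node as the root cause
print(is_inhibited(pod_other_node, firing, rule))  # different node
```

The practical consequence is that both the source and the target alerts must actually carry the labels named in equal; an inhibition rule keyed on a label that the alert rules never emit will not behave as intended.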

Pay attention to the timing knobs:

Setting | Typical Range | Effect
group_wait | 30s – 2m | Hold first notification to collect siblings
group_interval | 5m – 30m | Cadence of updates while group active
repeat_interval | 2h–6h pages / 12h–24h tickets | Cadence of re-pages for unresolved issues

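For high availability, run several Alertmanager replicas (three is a common choice) with gossip clustering, and configure every Prometheus to send alerts to all of them rather than through a load balancer. Clustered replicas share notification state and deduplicate, so notifications keep flowing during rolling restarts; independent, unclustered replicas would each send their own copy of every notification [Source: https://institute.sfeir.com/en/kubernetes-training/deploy-kube-prometheus-stack-production-kubernetes/].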

Every page must be actionable: the on-call engineer should know within 60 seconds what to do. The hard rule is no page without a runbook URL. A runbook entry should answer five questions:

  1. How do I confirm this is real? (PromQL, dashboard, log query)
  2. What is the most likely cause?
  3. What are the safe remediation steps?
  4. When do I escalate, and to whom?
  5. What is the rollback criterion?

Annotate every alert rule:

annotations:
  summary: "Checkout SLO fast burn in {{ $labels.env }}"
  description: |
    5m/1h error ratio > 1.44% for service={{ $labels.service }}.
    Budget exhaustion in ~2 days at current rate.
  dashboard_url: "https://grafana.example.com/d/checkout"
  runbook_url: "https://runbooks.example.com/checkout/slo-fast-burn"

Track three on-call hygiene metrics monthly: alerts per incident, percentage of pages outside business hours, and percentage of alerts with valid runbooks. Any number trending in the wrong direction is a fixable problem [Source: https://community.grafana.com/t/best-practices-we-can-implement-in-production/79356].

Anti-Patterns: Paging on Causes vs Symptoms

The most common failure mode in alerting is paging on causes (“CPU > 90%”, “disk I/O latency > 50 ms”, “garbage collector pauses > 200 ms”) instead of symptoms that affect users. A CPU at 99% with users perfectly happy is not an incident; the same CPU at 60% during a brownout is. Anchor pages on SLIs, so that a page always means users have noticed something is wrong. Reserve cause-based alerts as warning-tier signals or as inputs to ticketing automation [Source: https://www.dash0.com/guides/prometheus-monitoring].

Other anti-patterns worth naming:

Figure 12.3: Alertmanager Routing Tree with Grouping and Inhibition

flowchart TD
    P[Prometheus<br/>alert rules fire] --> AM[Alertmanager<br/>HA cluster x3]
    AM --> G[Group by<br/>alertname, service,<br/>severity, env]
    G --> I{Inhibition rules<br/>root-cause active?}
    I -->|suppressed| X[Drop symptom alert]
    I -->|allowed| R{Route by severity}
    R -->|critical| RE{env?}
    R -->|warning| SW[Slack #alerts-warnings<br/>repeat 12h]
    R -->|info| SI[Slack #alerts-info<br/>repeat 24h]
    RE -->|prod| RT{team?}
    RE -->|staging or dev| SN[Slack #nonprod]
    RT -->|payments| PDP[PagerDuty<br/>payments rotation]
    RT -->|platform| PDF[PagerDuty<br/>platform rotation]
    PDP --> RB[Runbook URL<br/>+ dashboard link<br/>+ silence controls]
    PDF --> RB

Key Takeaway: A humane on-call rotation depends on Alertmanager doing aggressive grouping and inhibition, a routing tree that respects severity and ownership, and a strict rule that every page links to a runbook and points at a user-visible symptom — never a raw cause metric.


Putting It All Together

You have all the pieces. The remaining job is integration — designing a platform that holds together at scale, can be rolled out incrementally, and is itself observable.

Reference Architecture for a Kubernetes Platform

A mature cloud-native observability stack on Kubernetes has six logical layers:

  1. Instrumentation layer — OpenTelemetry SDKs and auto-instrumentation in application processes, emitting OTLP for metrics, traces, logs, and (increasingly) profiles.
  2. Collection layer — OpenTelemetry Collector DaemonSets (per node) for host/pod metrics and OTLP receivers, plus Collector Deployments for fan-in, processing, and routing. Prometheus is deployed via the Prometheus Operator for scrape-based metrics (kube-state-metrics, node-exporter, cAdvisor) [Source: https://www.plural.sh/blog/prometheus-operator-kubernetes-guide/].
  3. Storage layer — Prometheus for short-term metric storage; remote-write to a long-term store (Thanos, Mimir, Cortex); Tempo or Jaeger for traces; Loki or Elasticsearch for logs; Pyroscope or Parca for profiles.
  4. Rule and alerting layer — Prometheus rule files (recording + alerting), Alertmanager HA cluster, runbook hosting (often in the same docs site as user-facing documentation).
  5. Visualization & exploration layer — Grafana with datasource configuration spanning all signal stores, dashboards organized by domain (service, infrastructure, business KPIs).
  6. Meta-observability layer — A small, segregated Prometheus and Alertmanager whose only job is monitoring the observability stack itself.

Figure 12.4: Reference Observability Platform on Kubernetes

flowchart LR
    subgraph L1[1. Instrumentation]
        APP[Application Pods<br/>OTel SDKs<br/>auto-instrumentation]
    end
    subgraph L2[2. Collection]
        DS[OTel Collector<br/>DaemonSet]
        DEP[OTel Collector<br/>Deployment fan-in]
        PROM[Prometheus Operator<br/>ServiceMonitors]
    end
    subgraph L3[3. Storage]
        TSDB[Prometheus TSDB<br/>short-term]
        LTM[Thanos / Mimir<br/>long-term metrics]
        TR[Tempo / Jaeger<br/>traces]
        LG[Loki<br/>logs]
        PR[Pyroscope / Parca<br/>profiles]
    end
    subgraph L4[4. Rules & Alerting]
        RR[Recording Rules]
        AR[Alerting Rules<br/>MWMBR]
        AM[Alertmanager HA<br/>routing + inhibition]
    end
    subgraph L5[5. Visualization]
        GR[Grafana<br/>dashboards + Explore]
    end
    subgraph L6[6. Meta-Observability]
        WD[Watchdog Prometheus<br/>+ Alertmanager<br/>monitors the monitors]
    end
    APP --> DS
    APP --> DEP
    APP --> PROM
    DS --> TSDB
    DS --> TR
    DS --> LG
    DEP --> LTM
    PROM --> TSDB
    TSDB --> LTM
    TSDB --> RR
    RR --> AR
    AR --> AM
    TSDB --> GR
    LTM --> GR
    TR --> GR
    LG --> GR
    PR --> GR
    AM --> GR
    WD -.watches.-> L2
    WD -.watches.-> L3
    WD -.watches.-> L4

The Prometheus Operator simplifies operations because rule files, scrape configs, and Alertmanager configs become Kubernetes CRDs (PrometheusRule, ServiceMonitor, AlertmanagerConfig) managed by the same GitOps tooling as the applications they observe [Source: https://www.plural.sh/blog/prometheus-operator-kubernetes-guide/].

Greenfield Rollout Strategy

If you are starting from scratch, the rollout sequence that minimizes risk while maximizing early value is:

Week | Action | Outcome
1 | Deploy kube-prometheus-stack with default rules | Cluster-level metrics, basic alerts, Grafana dashboards
2 | Add OTel Collector DaemonSet, route to Prometheus + Tempo | First traces from auto-instrumented services
3 | Define 2–3 critical SLOs with recording rules + MWMBR alerts | First user-anchored pages
4 | Configure Alertmanager routing tree, inhibition, runbook URLs | Reduced noise; team ownership in place
5–6 | Long-term metric store (Thanos/Mimir) via remote-write | Multi-month retention for capacity reviews
7–8 | Add logs (Loki) and continuous profiling (Pyroscope) | Full four-signal stack
Ongoing | Per-team SLO definition workshops | Reliability conversations become a team practice

Migration from Prometheus-only stacks follows a “wrap, don’t replace” rule. Keep Prometheus where it works (scrape-based infra metrics, alerting), and add OpenTelemetry Collectors as the entry point for new signals. The Collector’s prometheusremotewrite exporter ships OTel metrics into Prometheus or Mimir, while the prometheus receiver lets the Collector scrape existing exporters [Source: https://www.dash0.com/guides/prometheus-monitoring]. The result is a unified pipeline without a forklift migration.

Capacity Planning for the Observability Platform Itself (Meta-Observability)

Observability platforms tend to grow until they consume a meaningful fraction of cluster capacity. Treat the platform like any other service with SLOs of its own. The most important meta-metrics to alert on:

Component | Meta-Metric | Why It Matters
Prometheus | prometheus_rule_evaluation_duration_seconds | Rule eval lag means alerts arrive late
Prometheus | prometheus_tsdb_head_series | Cardinality runaway is the #1 outage cause
Prometheus | up == 0 (failed scrapes, per target) | Missed scrapes = blind spots
Alertmanager | alertmanager_notifications_failed_total | Pages may not reach humans
Alertmanager | alertmanager_cluster_members | Peers missing from the gossip cluster → duplicate notifications
OTel Collector | otelcol_exporter_queue_size / otelcol_exporter_queue_capacity | Backpressure → telemetry loss
OTel Collector | otelcol_processor_dropped_spans | Sampling/budgets working too aggressively
Tempo/Loki | Ingestion rate vs. configured limits | Quota violations drop data silently

A small secondary Prometheus with a separate Alertmanager — sometimes called a “watchdog” — scrapes the primary stack and pages the platform team if it goes blind. The pattern is “who watches the watchmen?” implemented as a few hundred lines of YAML [Source: https://prometheus.io/docs/prometheus/latest/getting_started/].

Capacity sizing rules of thumb:

Key Takeaway: A production observability platform is six layers — instrument, collect, store, rule, visualize, and observe itself. Roll out incrementally, lean on the Prometheus Operator and OTel Collector to keep configuration as code, and treat capacity planning for the platform as seriously as you would any user-facing service.


Where Observability Is Heading

The fundamentals — SLOs, MWMBR, Alertmanager hygiene — have been stable for nearly a decade. The interesting changes are happening at the edges, and they will reshape the platform you just built.

Profiles as a Fourth Signal

For most of observability’s history there have been “three pillars”: metrics, logs, traces. Continuous profiling — always-on sampling of stack traces with associated resource usage (CPU cycles, allocated bytes, off-CPU wait time) at 50–200 Hz with 1–3% overhead — is becoming the fourth [Source: https://www.dash0.com/guides/prometheus-monitoring].

What it answers, that the other three cannot:

Signal | Question
Metrics | Is something wrong?
Logs | What happened?
Traces | Where in the request path?
Profiles | Exactly which code is consuming resources, and how did that change?

A concrete example: metrics show p99 latency rising from 200 ms to 500 ms after a deploy. Traces narrow the slowness to CalculateDiscounts spans. Profiles reveal that 40% of CPU is now in a new calculate_rewards_v2 function, dominated by hashmap operations on a high-cardinality in-memory map. Without profiles, you have a suspect; with profiles, you have the exact lines of code.
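The "how did that change?" part of profiling comes down to a diff between two sets of samples. The sketch below illustrates the idea using leaf-function names as samples; all function names except calculate_rewards_v2 and all counts are made up, and real profilers compare full stack traces rather than leaf frames.

```python
from collections import Counter


def cpu_share(samples):
    """Fraction of CPU samples attributed to each function."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {fn: n / total for fn, n in counts.items()}


def biggest_regressions(before, after, top=3):
    """Functions whose share of CPU grew the most between two profiles."""
    b, a = cpu_share(before), cpu_share(after)
    deltas = {fn: a.get(fn, 0.0) - b.get(fn, 0.0) for fn in set(a) | set(b)}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Hypothetical samples collected before and after the deploy.
before = ["apply_coupons"] * 50 + ["serialize"] * 30 + ["lookup_prices"] * 20
after = (["calculate_rewards_v2"] * 40 + ["apply_coupons"] * 30
         + ["serialize"] * 18 + ["lookup_prices"] * 12)

for fn, delta in biggest_regressions(before, after):
    print(f"{fn}: {delta:+.0%} of CPU")
```

Diff views in profiling tools present the same comparison graphically, which is what turns a suspect span into a specific function to fix.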

Two open-source projects lead the space: Pyroscope and Parca.

The OpenTelemetry profiling signal (currently maturing) adds profiles as a first-class OTLP signal alongside metrics/logs/traces, sharing resource attributes (service.name, k8s.pod.name) and trace/span IDs. Once stable, the same Collector pipeline that already routes your other signals will route profiles too [Source: https://prometheus.io/docs/prometheus/latest/getting_started/].

Adopt profiling when you have at least two of: CPU/memory bills you cannot easily attribute, mysterious tail latency, or performance regressions that are hard to bisect.

Continuous AI-Assisted Root Cause Analysis

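The next pragmatic shift is in RCA workflows. Vendors and open-source projects are building systems that correlate signals across metrics, logs, traces, and profiles, summarize candidate hypotheses, and suggest queries drawn from runbooks.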

This is not a replacement for engineering judgment. It is an accelerator that reduces median time-to-diagnose by removing the manual query-writing tax during incidents. Two prerequisites must be in place for it to work well: OTel semantic conventions so the assistant has a consistent vocabulary, and structured runbooks that the assistant can quote as remediation playbooks.

Convergence of OpenTelemetry Semantic Conventions

The longest-running success of the OTel project is that vendors are converging on the same names for the same things — service.name, http.response.status_code, db.system, k8s.pod.name, deployment.environment — so dashboards, alerts, and runbooks written against one backend can be ported to another with relatively little work [Source: https://www.dash0.com/guides/prometheus-monitoring].

For platform teams, the practical implications are:

The strategic takeaway: invest in semantic-convention conformance now. Add CI checks that reject instrumentation using non-standard attribute names; require the OTel resource attributes on every service via the Collector’s resourcedetection processor.
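A CI check of this kind can start very small. The sketch below is a hypothetical linter: the allowlist contains only the attribute names mentioned in this chapter, and the choice of which resource attributes are mandatory is an assumption made for illustration. A real check would load the names from the published semantic conventions.

```python
# Attribute names taken from this chapter; a real check would use the full
# semantic-conventions registry instead of a hand-written allowlist.
ALLOWED_ATTRIBUTES = {
    "service.name", "http.response.status_code", "db.system",
    "k8s.pod.name", "deployment.environment",
}
# Assumption: these two resource attributes are mandatory in this fleet.
REQUIRED_RESOURCE_ATTRIBUTES = {"service.name", "deployment.environment"}


def check_attributes(used, resource):
    """Return a list of human-readable problems; an empty list means conformant."""
    problems = [f"non-standard attribute: {name}"
                for name in sorted(set(used) - ALLOWED_ATTRIBUTES)]
    problems += [f"missing resource attribute: {name}"
                 for name in sorted(REQUIRED_RESOURCE_ATTRIBUTES - set(resource))]
    return problems


# Example: a service still using a legacy attribute name.
problems = check_attributes(
    used=["service.name", "http.status_code"],
    resource={"service.name": "checkout"},
)
for line in problems:
    print(line)
```

In a pipeline, a non-empty result would fail the build, which keeps naming drift from reaching dashboards and alert rules.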

Figure 12.5: The Four Pillars and the Path Forward

graph TD
    subgraph PILLARS[Signals of Modern Observability]
        M[Metrics<br/>Is something wrong?<br/>Prometheus, OTLP]
        L[Logs<br/>What happened?<br/>Loki, Elasticsearch]
        T[Traces<br/>Where in the request path?<br/>Tempo, Jaeger]
        PF[Profiles<br/>Which lines of code?<br/>Pyroscope, Parca]
    end
    SC[OTel Semantic Conventions<br/>service.name, http.response.status_code,<br/>k8s.pod.name, deployment.environment]
    AI[AI-Assisted RCA<br/>correlate signals, summarize hypotheses,<br/>suggest queries from runbooks]
    INC[Incident<br/>faster MTTD & MTTR<br/>portable across vendors]
    M --> SC
    L --> SC
    T --> SC
    PF --> SC
    SC --> AI
    AI --> INC

Key Takeaway: The next decade of observability adds profiles as a fully supported signal, weaves AI assistance into the incident response loop, and accelerates semantic-convention convergence that turns vendor switching into a configuration exercise. Build the foundation on OTel today and you will inherit those benefits without re-platforming.


Chapter Summary

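You have crossed the bridge from “we collect telemetry” to “we operate to a contract.” The model is a chain: SLIs measure user experience, SLOs set targets on them, error budgets quantify the allowed failure, burn-rate alerts fire when that budget is being spent too fast, and Alertmanager routes the resulting pages to an owner with a runbook.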

The discipline embedded in these patterns is what separates teams that run reliable services from teams that fight fires. The arithmetic of error budgets gives you a shared language with product and engineering leadership. The grouping and inhibition rules give you a sustainable on-call rotation. The meta-observability layer gives you confidence that your monitoring still works when everything else doesn’t. And the trajectory toward profiles, AI assistance, and convention-conformant telemetry means the investment compounds: the foundation you build today gets more powerful with every new capability the ecosystem ships.

If you have made it through this textbook and applied even half of these practices, you are running a cloud-native observability stack that would have been state of the art at most major tech companies five years ago. The work from here is iteration: refining SLOs as you learn what users actually care about, tightening alert rules after every incident review, and adopting new signals as they mature. That is the operational excellence loop.


Key Terms

Term | Definition
SLI | Service Level Indicator. A quantitative measure of a service property, usually a ratio of good events to valid events, that reflects user experience (availability, latency, freshness, correctness).
SLO | Service Level Objective. A target value or range for an SLI over a defined window (e.g., 99.9% availability over 30 days).
Error budget | The allowed amount of “bad” behavior under an SLO, computed as 1 - SLO over the SLO window. For 99.9% over 30 days, ≈ 43 minutes of downtime per month.
Burn rate | The ratio of observed error fraction to error budget. A burn rate of 1 exhausts budget exactly over the SLO window; 14.4 exhausts it in ~2 days for a 30-day window.
MWMBR | Multi-window multi-burn-rate. An alerting pattern that requires both a short and a long window to exceed a burn-rate threshold before firing, balancing fast detection against false positives.
Alertmanager | The Prometheus-ecosystem component that receives alerts, deduplicates and groups them, applies inhibition rules and silences, and routes notifications to receivers like PagerDuty or Slack.
Inhibition rule | An Alertmanager rule that suppresses a target alert while a higher-severity source alert is firing for the same labeled objects; used to silence symptom alerts when a root-cause alert exists.
Runbook | An operational document linked from every paging alert that explains how to confirm the issue, likely causes, remediation steps, escalation paths, and rollback criteria.
Meta-observability | The practice of monitoring the observability platform itself (Prometheus rule eval duration, Alertmanager notification success, Collector queue saturation, exporter scrape health), typically via a separate, isolated stack.
Continuous profiling | Always-on sampling of stack traces with associated resource usage (CPU, memory, off-CPU time) at 50–200 Hz with 1–3% overhead, visualized as flame graphs and diff views; the emerging fourth signal of observability.
OTel semantic conventions | The standardized set of attribute names (service.name, http.response.status_code, k8s.pod.name, etc.) defined by the OpenTelemetry project that enable cross-vendor portability of dashboards, alerts, and AI-assisted tooling.