Chapter 2: The Three Signals: Metrics, Logs, and Traces in Depth
Learning Objectives
Compare the data models of metrics, logs, and traces and identify when each is the right tool for a given investigative question.
Explain how exemplars and trace IDs correlate signals into a single investigative narrative across Prometheus, OpenTelemetry, and trace backends like Tempo or Jaeger.
Quantify the storage and cardinality cost of each signal type, including the trade-offs of Prometheus histograms versus summaries and classic versus native histograms.
Chapter 1 introduced observability as a property that emerges from three complementary signals. This chapter goes one level deeper: we dissect the wire format of each signal, examine the cardinality math that drives cost, and show how exemplars and W3C Trace Context turn three independent data streams into a single navigable investigative narrative.
Figure 2.1: The three signals and their correlation hooks
graph TD
M["Metrics counts & aggregates 'how many / how fast'"]
L["Logs discrete events 'what happened'"]
T["Traces causal span tree 'why was it slow'"]
M -- "exemplars (trace_id pointers)" --- T
T -- "trace_id / span_id in LogRecord" --- L
M -- "shared Resource (service.name, k8s.*)" --- L
Hub(("Unified investigation"))
M --> Hub
L --> Hub
T --> Hub
Part One — Metrics & Logs (Pre-Quiz)
Take this short quiz before reading sections 1 and 2 below. The same quiz appears again after the content so you can see your improvement.
Pre-Reading Quiz — Metrics & Logs
1. You operate a service across 10 pods. Each pod exposes a request_duration_seconds summary with a quantile="0.99" series. What is the safest way to get a meaningful fleet-wide p99?
Take the avg() of the ten {quantile="0.99"} series across pods.
Take the max() of the ten {quantile="0.99"} series across pods.
You cannot derive a true fleet-wide p99 from summaries; switch the metric to a histogram and aggregate buckets first.
Sum _sum and _count across pods and call histogram_quantile(0.99, ...) on the result.
2. A metric http_requests_total currently has 5 methods, 50 paths, 6 status buckets, 20 services, and 4 regions as labels. A teammate proposes adding user_id as a label "for easy debugging." Why is this dangerous?
Prometheus does not allow more than 5 labels per metric.
Each unique label combination creates a new time series; user_id can multiply 120,000 series by every active user, exploding storage cost.
User IDs cannot be expressed as strings in Prometheus labels.
It would break the rate() function.
3. Which question can a counter answer cleanly (combined with PromQL) that a gauge cannot?
"How much memory is in use right now?"
"What is the request rate per second over the last 5 minutes?"
"What is the current queue depth?"
"What is the current temperature in Celsius?"
4. The OpenTelemetry LogRecord defines a normalized severity_number field. What problem does this primarily solve?
It reduces the on-disk size of log files compared to plain text.
It lets a single query like severity_number >= 17 mean "ERROR or worse" no matter which framework (Java, Python, Go, .NET) emitted the log.
It eliminates the need for a body field.
It encrypts log content so unauthorized backends cannot read it.
5. Your team relies entirely on structured logs (no traces) to debug a microservices checkout flow. A request times out across three services. Why is logs-alone an architecturally weak strategy here?
JSON logs are slower to write than plain text, so logs lag the request.
Log levels above WARN are dropped by most collectors by default.
Logs are emitted per service and do not encode parent/child causality; you must guess causal order from timestamps and propagate request IDs perfectly through every boundary.
Structured logs cannot be indexed by Elasticsearch or Loki.
1. Metrics — Numbers Over Time
A metric is a numeric measurement of a system property recorded at a point in time. Metrics are the cheapest of the three signals because the storage system records aggregates, not individual events. A counter incremented one billion times occupies the same space as one incremented ten times: in both cases the TSDB stores one timestamped sample of the current value per scrape, no matter how many increments happened in between.
Think of metrics as the dashboard gauges in a car: speed, RPM, fuel level. They tell you the state of the system at a glance, but they don't tell you why the engine started knocking three miles back.
Counters, gauges, histograms, and summaries
Counter — monotonically increasing (only goes up, or resets on restart). Examples: http_requests_total, errors_total. Answers "how many?" and "at what rate?" via rate().
Gauge — can go up or down. Examples: memory_bytes_in_use, queue_depth. Represents instantaneous state.
Histogram — a distribution stored as cumulative buckets *_bucket{le="..."} plus _sum and _count. Quantiles like p99 are computed server-side in PromQL with histogram_quantile().
Summary — quantiles computed client-side using a sliding-window algorithm and exposed as {quantile="0.99"} series.
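To make the counter semantics concrete, here is a minimal sketch of how a per-second rate can be derived from scraped counter samples, treating any drop in value as a restart. It is a simplification: the helper names and sample data are made up for illustration, and Prometheus's real rate() additionally extrapolates to the window boundaries.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix_timestamp_seconds, counter_value)


def counter_increase(samples: List[Sample]) -> float:
    """Total increase across the samples, treating any drop as a counter reset."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # The counter restarted from zero, so everything in `cur` is new.
            increase += cur
    return increase


def per_second_rate(samples: List[Sample]) -> float:
    """Average per-second rate over the window spanned by the samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed


# Five scrapes, 60 s apart; the process restarts between the 3rd and 4th scrape.
scrapes = [(0, 1000), (60, 1600), (120, 2200), (180, 300), (240, 900)]
print(counter_increase(scrapes))  # 2100.0
print(per_second_rate(scrapes))   # 8.75
```

A gauge offers no equivalent: because it may legitimately fall, a drop cannot be interpreted as a reset, so there is no well-defined "increase" to divide by time.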
Animation — Counter, Gauge, Histogram (Live)
Counter: requests_total = 1,284,902 (monotonic)
Gauge: queue_depth = 42 (can rise or fall)
Histogram: le=0.1 .. 1s buckets · p99 = 540ms
Counters only climb. Gauges swing both ways. Histograms accumulate into le=<bound> buckets so quantiles can be computed at query time.
Histogram vs summary — the one trap to remember
The critical difference: summary quantiles cannot be aggregated across instances. Averaging the p99 from each of ten pods does not give a true global p99. You can safely sum only _sum and _count from a summary. Histogram buckets, in contrast, are fully aggregatable — sum() buckets across instances and then call histogram_quantile() to get meaningful service-wide tail latency.
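The aggregation trap can be checked with a few lines of arithmetic. The sketch below uses two hypothetical pods and assumed bucket bounds; its histogram_quantile function is a simplified imitation of the PromQL function (linear interpolation within a bucket), not the actual implementation.

```python
import math

BOUNDS = [0.1, 0.5, 1.0, 2.5, math.inf]  # hypothetical bucket upper bounds (le), in seconds


def exact_quantile(values, q):
    """Nearest-rank quantile over raw observations."""
    ordered = sorted(values)
    return ordered[math.ceil(q * len(ordered)) - 1]


def to_cumulative_buckets(values):
    """Classic-histogram layout: each bucket counts observations <= its bound."""
    return [sum(1 for v in values if v <= bound) for bound in BOUNDS]


def histogram_quantile(q, cumulative):
    """Linear interpolation inside the bucket that contains the target rank."""
    rank = q * cumulative[-1]
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in zip(BOUNDS, cumulative):
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


pod_a = [0.05] * 1000   # busy pod, fast requests
pod_b = [2.0] * 10      # nearly idle pod, slow requests

true_p99 = exact_quantile(pod_a + pod_b, 0.99)
avg_of_p99s = (exact_quantile(pod_a, 0.99) + exact_quantile(pod_b, 0.99)) / 2

# Buckets are plain counts, so they can be summed across pods before estimating.
merged = [a + b for a, b in zip(to_cumulative_buckets(pod_a), to_cumulative_buckets(pod_b))]
from_buckets = histogram_quantile(0.99, merged)

print(f"true fleet p99      = {true_p99:.3f}s")      # 0.050s
print(f"avg of per-pod p99s = {avg_of_p99s:.3f}s")   # 1.025s
print(f"p99 from buckets    = {from_buckets:.3f}s")  # 0.100s
```

The averaged figure is off by a factor of about twenty because it ignores how many requests each pod served. The bucket estimate is only as precise as the bucket width, but it lands in the bucket that actually contains the true value.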
| Aspect | Counter | Gauge | Histogram | Summary |
| --- | --- | --- | --- | --- |
| Direction | Up only | Any | Distribution | Distribution |
| Quantile compute | N/A | N/A | Server-side (PromQL) | Client-side (in process) |
| Cross-instance aggregation | Trivial | Trivial | Yes (sum buckets) | Only _sum/_count |
| Cardinality per metric | 1 series | 1 series | N buckets + 2 | N quantiles + 2 |
Classic histograms have a drawback: every bucket is its own time series, so bucket count multiplies cardinality. Native histograms (introduced as an experimental feature in Prometheus 2.40) encode dynamic log-spaced buckets inside a single time series, dramatically reducing series count while supporting high resolution.
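As a rough illustration of what "log-spaced buckets" means, the sketch below assumes a base-2 exponential layout in which a schema number sets the resolution: the growth factor between bucket bounds is 2^(2^-schema), and bucket i covers the range (base^(i-1), base^i]. The function names are hypothetical, and zero or negative observations are deliberately left out.

```python
import math


def bucket_index(value: float, schema: int) -> int:
    """Index i such that base**(i-1) < value <= base**i, with base = 2**(2**-schema)."""
    if value <= 0:
        raise ValueError("this sketch covers positive observations only")
    return math.ceil(math.log2(value) * (2 ** schema))


def bucket_upper_bound(index: int, schema: int) -> float:
    """Upper bound of bucket `index`, i.e. base**index."""
    return 2.0 ** (index / (2 ** schema))


for schema in (0, 3):
    i = bucket_index(0.3, schema)
    low, high = bucket_upper_bound(i - 1, schema), bucket_upper_bound(i, schema)
    print(f"schema={schema}: 0.3s falls in bucket {i}, range ({low:.4f}, {high:.4f}]")
```

Because the bounds follow from a formula, only the indexes and counts of populated buckets need to be stored, which is why many buckets can fit inside a single series. A higher schema gives narrower buckets at the same storage model.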
Cardinality — the silent budget killer
A Prometheus time series is uniquely identified by its metric name plus the set of label key/value pairs, for example http_requests_total{method="GET", status="2xx"} (label values here are illustrative).
Each unique combination of labels creates a new series. The math is brutal — consider these labels:
method: 5 (GET, POST, PUT, DELETE, PATCH)
path: 50 endpoints
status: 6 (status code groups)
service: 20
region: 4
Total potential series: 5 × 50 × 6 × 20 × 4 = 120,000 for one metric. Add user_id with 10,000 unique values and you reach 1.2 billion potential series. This is why veterans say "never label by user ID, request ID, or trace ID." It's also why exemplars exist — they reference a trace ID without making it a series-defining label.
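The multiplication above is easy to reproduce and to rerun with your own label sets. This is a minimal sketch using the label counts from the text; potential_series is a hypothetical helper, and the result is an upper bound since not every combination occurs in practice.

```python
from math import prod

# Distinct values per label, taken from the example above.
label_cardinalities = {"method": 5, "path": 50, "status": 6, "service": 20, "region": 4}


def potential_series(cardinalities):
    """Upper bound on series count: the product of per-label cardinalities."""
    return prod(cardinalities.values())


baseline = potential_series(label_cardinalities)
with_user_id = potential_series({**label_cardinalities, "user_id": 10_000})

print(f"{baseline:,}")      # 120,000
print(f"{with_user_id:,}")  # 1,200,000,000
```

Running the same function before approving a new label is a cheap review step: a label with a handful of values multiplies the budget by a handful, while an unbounded identifier multiplies it without limit.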
Key Points — Metrics
Cheap because aggregated: 1B increments cost the same as 10 — the TSDB stores current values plus timestamps, not individual events.
Four instrument types: Counter (up only), Gauge (any direction), Histogram (server-side quantiles, aggregatable), Summary (client-side quantiles, not aggregatable).
Histograms beat summaries for SLO work: sum buckets across pods, then histogram_quantile() for true fleet-wide p99.
Native histograms (2.40+): dynamic log-spaced buckets in one series — far less cardinality, often the better default.
Cardinality is the cost: series = product of label cardinalities. Never label by user_id, request_id, or trace_id.
2. Logs — Structured Events
A log is a discrete event record emitted by an application at a specific moment. Where a metric says "23,481 requests in the last minute," a log says "request #382749 from user 42 to /api/orders failed at 10:03:21 with NullPointerException." Logs preserve the individual story that metrics aggregate away.
Structured vs unstructured logging
For decades, application logs looked like this:
2024-05-10T10:00:00Z my-host app: ERROR user 42 not found
This is unstructured logging. Humans can read it; computers struggle. Searching for "all errors involving user 42" forces a downstream tool to parse the string with regex and hope every team used the same format — an information-destruction pipeline.
Structured logging flips this around: the application emits machine-readable key/value records from the start.
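OpenTelemetry's LogRecord carries a severity_number from 1 to 24, split into six ranges: TRACE 1-4, DEBUG 5-8, INFO 9-12, WARN 13-16, ERROR 17-20, FATAL 21-24. The genius of severity_number is normalization: a backend query severity_number >= 17 means "ERROR or worse" whether the source used Python's logging.ERROR, Java's ERROR level, Go's log.Error, or .NET's LogLevel.Error.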
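A small sketch shows what a structured record with a normalized severity might look like when built from Python's standard logging levels. The mapping table and the to_log_record helper are illustrative assumptions (each level is mapped to the lowest number of its OpenTelemetry range); this is not the OpenTelemetry SDK's API, and the record carries only a subset of the LogRecord fields.

```python
import json
import logging

# Lowest severity_number of each OpenTelemetry range:
# DEBUG=5, INFO=9, WARN=13, ERROR=17, FATAL=21.
PYTHON_LEVEL_TO_SEVERITY_NUMBER = {
    logging.DEBUG: 5,
    logging.INFO: 9,
    logging.WARNING: 13,
    logging.ERROR: 17,
    logging.CRITICAL: 21,
}


def to_log_record(level, body, attributes):
    """Build a dict shaped like a (partial) LogRecord."""
    return {
        "severity_number": PYTHON_LEVEL_TO_SEVERITY_NUMBER[level],
        "severity_text": logging.getLevelName(level),
        "body": body,
        "attributes": attributes,
    }


records = [
    to_log_record(logging.INFO, "order created", {"user.id": 42}),
    to_log_record(logging.WARNING, "retrying inventory call", {"attempt": 2}),
    to_log_record(logging.ERROR, "user not found", {"user.id": 42}),
    to_log_record(logging.CRITICAL, "database unreachable", {}),
]

# One numeric predicate means "ERROR or worse" regardless of the source's level names.
error_or_worse = [r for r in records if r["severity_number"] >= 17]
for record in error_or_worse:
    print(json.dumps(record))
```

The severity_text keeps whatever name the source framework used, so nothing is lost for humans, while queries and alerts rely on the number.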
Figure 2.2: OpenTelemetry LogRecord schema
graph TD
LR["LogRecord"]
LR --> TS["timestamp application time"]
LR --> OTS["observed_timestamp pipeline time"]
LR --> SEV["Severity"]
SEV --> SN["severity_number 1-24 normalized"]
SEV --> ST["severity_text 'WARN', 'Warning'..."]
LR --> B["body : AnyValue string / num / map / list"]
LR --> A["attributes http.method, db.system..."]
LR --> R["resource service.name, k8s.*, host.*"]
LR --> TC["Trace Context"]
TC --> TID["trace_id"]
TC --> SID["span_id"]
TC --> TF["trace_flags"]
LR --> IS["instrumentation_scope"]
LR --> DA["dropped_attributes_count"]
Indexing strategies and pipelines
A typical pipeline: application writes structured logs to stdout → node agent (Fluent Bit, Vector, OTel Collector) enriches with Kubernetes metadata and ships → aggregator buffers/samples → storage backend indexes for search.
Elasticsearch-style full-text indexing lets you grep any word in any log instantly, but the index can be larger than the data itself. Loki-style label-only indexing keeps a tiny index over a handful of labels (service, pod, namespace) and stores raw log lines compressed; queries grep through chunks — cheaper to operate but slower for arbitrary substring searches.
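The trade-off between the two indexing styles can be sketched with toy in-memory structures. Everything here is a simplified assumption for illustration (the log lines, the label sets, and the helper names); real backends add tokenizers, compression, and chunked storage.

```python
from collections import defaultdict

# (labels, raw line) pairs standing in for ingested logs.
logs = [
    ({"service": "checkout", "pod": "checkout-1"}, "ERROR user 42 not found"),
    ({"service": "checkout", "pod": "checkout-2"}, "INFO order created"),
    ({"service": "inventory", "pod": "inventory-1"}, "ERROR lock wait timeout"),
]

# Full-text style: every token points at the lines that contain it.
inverted_index = defaultdict(set)
for line_no, (_, line) in enumerate(logs):
    for token in line.lower().split():
        inverted_index[token].add(line_no)

# Label-only style: only label pairs are indexed; line contents are not.
label_index = defaultdict(set)
for line_no, (labels, _) in enumerate(logs):
    for pair in labels.items():
        label_index[pair].add(line_no)


def full_text_search(token):
    """Direct index lookup: fast, but the index grows with the vocabulary."""
    return sorted(inverted_index.get(token.lower(), set()))


def label_then_scan(label_pair, substring):
    """Narrow by label, then scan the surviving lines for the substring."""
    candidates = label_index.get(label_pair, set())
    return sorted(n for n in candidates if substring in logs[n][1])


print(full_text_search("error"))                          # [0, 2]
print(label_then_scan(("service", "checkout"), "ERROR"))  # [0]
print(len(inverted_index), len(label_index))              # 11 5
```

Even with three lines the token index holds more entries than the label index, and the gap widens with message variety. The price of the small index is the scan inside label_then_scan, whose cost grows with the volume of logs the labels select.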
| Aspect | OpenTelemetry Logs | Fluentd / Logstash |
| --- | --- | --- |
| Data model | Standardized LogRecord schema | No global standard; per-pipeline JSON |
| Severity | Normalized severity_number + severity_text | String, ad-hoc normalization |
| Service context | Built-in Resource shared with traces/metrics | Plugin-specific conventions |
| Trace correlation | First-class trace_id/span_id fields | Manual injection + custom parsing |
| Multi-signal | Logs, traces, metrics share Resource | Logs only |
Why logs alone cannot do root-cause analysis
A checkout request times out. The checkout-service log says "called inventory: timeout after 5s." The inventory-service log says "received call, query took 4.8s." The postgres log says "lock wait." Three lines, three streams, one causal chain, but the chain is implicit. Reconstructing it requires propagating a request ID through every boundary (most teams miss at least one), aligning clocks that drift, and guessing the causal order from timestamps that don't capture parent/child relationships.
This is exactly the problem traces solve. Logs complement traces by carrying rich per-event context; they do not replace the causal graph.
Key Points — Logs
Structured beats unstructured: emit machine-readable key/value records, not free-form strings; OpenTelemetry's LogRecord formalizes this with a typed schema.
Normalized severity: severity_number on a 1–24 scale lets one query mean the same thing across Python/Java/Go/.NET.
Trace context is first-class: trace_id, span_id, and trace_flags live at the top level of LogRecord, not buried in attributes.
Indexing strategy is a cost choice: full-text (Elasticsearch) is powerful and expensive; label-only (Loki) is cheap but slower for arbitrary substring searches, because unindexed content is found by scanning chunks.
Logs alone cannot reconstruct distributed causality: they are per-service and lack a parent/child graph — that's what traces are for.
Part One — Post-Reading Quiz
Same questions, now with the reading behind you.
Post-Reading Quiz — Metrics & Logs
1. You operate a service across 10 pods. Each pod exposes a request_duration_seconds summary with a quantile="0.99" series. What is the safest way to get a meaningful fleet-wide p99?
Take the avg() of the ten {quantile="0.99"} series across pods.
Take the max() of the ten {quantile="0.99"} series across pods.
You cannot derive a true fleet-wide p99 from summaries; switch the metric to a histogram and aggregate buckets first.
Sum _sum and _count across pods and call histogram_quantile(0.99, ...) on the result.
2. A metric http_requests_total currently has 5 methods, 50 paths, 6 status buckets, 20 services, and 4 regions as labels. A teammate proposes adding user_id as a label "for easy debugging." Why is this dangerous?
Prometheus does not allow more than 5 labels per metric.
Each unique label combination creates a new time series; user_id can multiply 120,000 series by every active user, exploding storage cost.
User IDs cannot be expressed as strings in Prometheus labels.
It would break the rate() function.
3. Which question can a counter answer cleanly (combined with PromQL) that a gauge cannot?
"How much memory is in use right now?"
"What is the request rate per second over the last 5 minutes?"
"What is the current queue depth?"
"What is the current temperature in Celsius?"
4. The OpenTelemetry LogRecord defines a normalized severity_number field. What problem does this primarily solve?
It reduces the on-disk size of log files compared to plain text.
It lets a single query like severity_number >= 17 mean "ERROR or worse" no matter which framework (Java, Python, Go, .NET) emitted the log.
It eliminates the need for a body field.
It encrypts log content so unauthorized backends cannot read it.
5. Your team relies entirely on structured logs (no traces) to debug a microservices checkout flow. A request times out across three services. Why is logs-alone an architecturally weak strategy here?
JSON logs are slower to write than plain text, so logs lag the request.
Log levels above WARN are dropped by most collectors by default.
Logs are emitted per service and do not encode parent/child causality; you must guess causal order from timestamps and propagate request IDs perfectly through every boundary.
Structured logs cannot be indexed by Elasticsearch or Loki.
Part Two — Traces & Correlation (Pre-Quiz)
One more pre-quiz before the second half. The same five questions appear again afterward.
Pre-Reading Quiz — Traces & Correlation
6. Why does an OpenTelemetry span carry both a span_id and a parent span_id in addition to the shared trace_id?
Because trace_id is too long to fit in HTTP headers.
To explicitly encode parent-child causality as a directed graph, rather than inferring causal order from clock timestamps that may drift between hosts.
To allow each service to rename the trace before forwarding it.
Parent span_id is optional and only used by Jaeger.
7. A single HTTP call from checkout-service to inventory-service produces how many spans, and what kinds, under OpenTelemetry semantic conventions?
One INTERNAL span on the caller only.
Two spans: a CLIENT span on checkout-service and a SERVER span on inventory-service, sharing trace_id with a parent-child link.
Two SERVER spans, one on each side.
Three spans: CLIENT, SERVER, and a routing span on the network layer.
8. Why are exemplars stored in a separate path from normal Prometheus time series rather than as additional labels on the metric?
Because Prometheus requires alphabetical ordering of labels and trace_ids break that ordering.
Because trace_id has effectively unbounded cardinality; storing it as a series-defining label would multiply time series per unique trace and crush the TSDB.
Because exemplars are encrypted while metric samples are not.
Because Grafana cannot read trace_ids that live alongside other labels.
9. Your team head-samples traces at 10% but logs unconditionally on every request, with trace_id injected into LogRecords automatically. What is the predictable consequence?
Tempo will reject the over-sampled log volume.
Roughly 90% of trace_id values in logs will point to traces that were never stored, breaking "jump from log to trace" links.
Log timestamps will drift relative to trace timestamps.
Loki's label index will exceed Elasticsearch's, increasing cost.
10. Why does a shared OpenTelemetry Resource (e.g. service.name = checkout-service) matter more than any single correlation hook between metrics, logs, and traces?
Because Resource attributes are encrypted end-to-end.
Because once metrics, logs, and traces all carry the same Resource attributes, a single query like service.name = checkout-service returns matching telemetry of all three signals with no per-tool tag mapping.
Because Resource attributes are required to compute histogram_quantile().
Because Resource attributes replace the need for trace_id propagation.
3. Traces — Causally-linked Spans
A trace is the recorded journey of a single request through a distributed system. If metrics are the dashboard and logs are the event log, traces are the GPS track — not just that the trip happened but the exact route, which segments were slow, and where the detours were.
Trace, span, and span context
A trace is identified by a 128-bit trace_id (32 hex chars), e.g. 4bf92f3577b34da6a3ce929d0e0e4736. All work performed in service of one logical request shares this ID.
A span is a named unit of work within the trace with a 64-bit span_id (16 hex chars). Each span has a start, end, name, status, and attributes.
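A span context is the immutable envelope carrying trace_id, current span_id, and trace_flags across process boundaries. The standard wire format is the W3C traceparent header, four hyphen-separated hex fields in the form version-trace_id-parent_id-trace_flags, for example 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01 (the span ID shown is illustrative).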
When checkout-service calls inventory-service, it copies its current span context into a traceparent header. The receiver reads it, knows the parent's trace_id and span_id, and starts a child span under the same trace. This is how one trace stitches itself together across N services with no central coordinator.
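The hand-off described above can be sketched in a few functions: parse the incoming header, start a child span under the same trace, and emit a new header for the next hop. This is a minimal standard-library sketch for version 00 headers only; the function names are hypothetical, the span ID in the sample header is illustrative, and in practice an OpenTelemetry propagator does this work for you.

```python
import re
import secrets
from typing import NamedTuple


class SpanContext(NamedTuple):
    trace_id: str     # 32 lowercase hex chars (128 bits)
    span_id: str      # 16 lowercase hex chars (64 bits)
    trace_flags: int  # bit 0 set means "sampled"


_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> SpanContext:
    """Receiver side: recover the caller's span context from the header."""
    match = _TRACEPARENT.match(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        raise ValueError("all-zero trace_id or span_id is invalid")
    return SpanContext(trace_id, span_id, int(flags, 16))


def format_traceparent(ctx: SpanContext) -> str:
    """Caller side: serialize the current span context for an outgoing request."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-{ctx.trace_flags:02x}"


def start_child(parent: SpanContext) -> SpanContext:
    """Same trace_id and flags, fresh random span_id."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.trace_flags)


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parent = parse_traceparent(incoming)
child = start_child(parent)
outgoing = format_traceparent(child)

print(parent.trace_id == child.trace_id)  # True
print(outgoing)
```

Note that nothing in this exchange requires a lookup against a central service: the header alone carries everything the receiver needs to attach its span to the right trace and parent.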
Animation — Span Tree Expansion (Checkout request)
SERVER POST /checkout (a1) 980ms
CLIENT inventory.check (b2, parent=a1) 620ms
SERVER GET /inventory (c3, parent=b2) 600ms
DB db.query SELECT stock (d4) 540ms
CLIENT payment.charge (e5, parent=a1) 320ms
SERVER POST /charge (f6, parent=e5) 300ms
All six spans share one trace_id. Each non-root span points to its parent's span_id, forming an explicit causal tree. The db.query span (540ms) dominates the request; it is the bar a flame graph would highlight as the likely root cause.
Span kinds and auto-derived service maps
OpenTelemetry classifies spans by kind:
INTERNAL — work entirely within one process.
SERVER — receives an incoming RPC (callee side).
CLIENT — sends an outgoing RPC (caller side).
PRODUCER — emits a message to a queue.
CONSUMER — receives a message from a queue.
A typical HTTP call produces two spans: a CLIENT span on the caller and a SERVER span on the callee, linked by the same trace_id and a parent-child relationship. Because every cross-service call produces a tagged CLIENT/SERVER pair, a backend can build a live service map automatically: services are nodes, observed parent-child cross-service relationships are edges, edge weight = call rate, edge color = error rate or latency.
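Deriving service-map edges from spans amounts to a join between each SERVER span and its CLIENT parent. The sketch below uses a simplified, hypothetical span representation (plain dicts with span_id, parent, kind, and service) based on the checkout example; real backends work from full span data and also compute rates and error ratios per edge.

```python
from collections import Counter

# Spans from one checkout trace, reduced to the fields needed for the join.
spans = [
    {"span_id": "a1", "parent": None, "kind": "SERVER", "service": "checkout-service"},
    {"span_id": "b2", "parent": "a1", "kind": "CLIENT", "service": "checkout-service"},
    {"span_id": "c3", "parent": "b2", "kind": "SERVER", "service": "inventory-service"},
    {"span_id": "e5", "parent": "a1", "kind": "CLIENT", "service": "checkout-service"},
    {"span_id": "f6", "parent": "e5", "kind": "SERVER", "service": "payment-service"},
]


def service_edges(spans):
    """Count caller -> callee edges from CLIENT parent / SERVER child pairs."""
    by_id = {span["span_id"]: span for span in spans}
    edges = Counter()
    for span in spans:
        parent = by_id.get(span["parent"])
        if parent is None:
            continue  # root span, or parent not present in this batch
        crosses_services = parent["service"] != span["service"]
        if parent["kind"] == "CLIENT" and span["kind"] == "SERVER" and crosses_services:
            edges[(parent["service"], span["service"])] += 1
    return edges


for (caller, callee), calls in service_edges(spans).items():
    print(f"{caller} -> {callee}: {calls} call(s)")
```

Aggregating these counts over a time window gives the edge weights, and the same join over span status or duration gives the error rate and latency shown on each edge.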
Figure 2.4: Service map auto-derived from CLIENT→SERVER span pairs
graph LR
Web["web-frontend"]
Co["checkout-service"]
Inv["inventory-service"]
Pay["payment-service"]
PG[("postgres")]
Rd[("redis")]
Web -- "1200 rps err 0.1%" --> Co
Co -- "1100 rps err 0.4%" --> Inv
Co -- "900 rps err 2.1% (hot)" --> Pay
Inv -- "1100 rps p99 540ms" --> PG
Co -- "1200 rps p99 5ms" --> Rd
Pay -- "900 rps err 1.8%" --> PG
Key Points — Traces
One trace_id, many spans: 128-bit trace_id is shared across the whole request; each span has its own 64-bit span_id plus a parent span_id.
W3C Trace Context (traceparent header) is the vendor-neutral wire format that stitches a trace across services with no central coordinator.
Span kinds (SERVER/CLIENT/PRODUCER/CONSUMER/INTERNAL) let backends draw flame graphs and auto-derive service maps from observed CLIENT→SERVER pairs.
Traces capture causality, not just timing — the parent-child graph is explicit, so root-cause analysis no longer guesses order from drifting clocks.
Service maps reflect the current architecture from live traffic — including unexpected edges your hand-drawn diagram missed.
4. Correlating the Three
Each signal in isolation is useful; together they become an investigative narrative. The mechanics of correlation — how a dot on a Grafana chart links to a trace, which in turn surfaces the relevant log lines — rely on three standardized hooks.
Exemplars in Prometheus histograms
The naive way to link a latency spike to a slow request is to put trace_id in the metric labels — which blows up cardinality. Exemplars solve this by attaching a few representative trace_id/span_id pointers alongside metric samples without making them series-defining labels.
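The OpenMetrics text exposition format appends an exemplar after the sample value, introduced by " # " and followed by the exemplar's own label set and value. An illustrative line (metric name and numbers are assumed): http_request_duration_seconds_bucket{le="0.5"} 1234 # {trace_id="4bf92f3577b34da6a3ce929d0e0e4736"} 0.43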
At most one exemplar per series per scrape interval is typically exported.
Exemplars attach mostly to histogram buckets (great for "find me a trace in this latency bucket") and sometimes to counter increments.
Exemplars live in a separate storage path from ordinary series in the Prometheus TSDB — specifically to avoid cardinality explosion.
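To see why an exemplar does not affect series identity, it helps to split a sample line into its parts. The parser below is a deliberately small sketch: the metric name and values are made up, and it ignores optional timestamps, escaped quotes, and other details of the OpenMetrics grammar.

```python
import re
from typing import Dict, NamedTuple, Optional


class Exemplar(NamedTuple):
    labels: Dict[str, str]
    value: float


class Sample(NamedTuple):
    series: str  # metric name plus series-defining labels
    value: float
    exemplar: Optional[Exemplar]


_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line: str) -> Sample:
    """Split 'series value # {exemplar labels} exemplar_value' into its parts."""
    sample_part, _, exemplar_part = line.partition(" # ")
    series, value = sample_part.rsplit(" ", 1)
    exemplar = None
    if exemplar_part:
        label_block, _, rest = exemplar_part.partition("} ")
        labels = dict(_LABEL.findall(label_block))
        exemplar = Exemplar(labels, float(rest.split()[0]))
    return Sample(series, float(value), exemplar)


line = (
    'http_request_duration_seconds_bucket{le="0.5"} 1234 '
    '# {trace_id="4bf92f3577b34da6a3ce929d0e0e4736"} 0.43'
)
sample = parse_sample(line)

print(sample.series)                       # http_request_duration_seconds_bucket{le="0.5"}
print(sample.exemplar.labels["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
```

The series string is the same for every request that lands in this bucket; only the attached exemplar changes from scrape to scrape, which is what keeps the series count flat.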
Animation — Metric → Trace → Log Correlation Jump
Metric (Prometheus): p99 latency spike at 10:05 · histogram_quantile(0.99, ...) · exemplar trace_id = 4bf92f3577b34da6...
Trace (Tempo): GET trace by id → checkout → inventory → db · db.query span = 2.1s (lock wait)
Log (Loki): {service="checkout"} |= trace_id · ERROR row-level lock · trace_id=4bf92f35... span_id=d4
Same trace_id appears in all three signals. Exemplar takes you from metric dot → trace; trace_id field in LogRecord takes you from trace span → logs. Shared service.name binds everything to one origin.
sequenceDiagram
participant Dev as Developer
participant Graf as Grafana
participant Prom as Prometheus (TSDB + exemplar store)
participant Tempo as Tempo
Dev->>Graf: open p99 latency dashboard
Graf->>Prom: histogram_quantile(0.99, ...)
Prom-->>Graf: latency series + exemplar dots
Note over Graf: spike at 10:05 dot shows trace_id=4bf92f35...
Dev->>Graf: hover dot, click "View trace"
Graf->>Tempo: GET /traces/4bf92f35...
Tempo-->>Graf: span tree (checkout → inventory → db)
Note over Dev,Tempo: db.query span = 2.1s (lock wait identified)
Trace-to-logs joins via trace_id and span_id
The second leg is trace-to-logs, made possible because the OpenTelemetry LogRecord schema reserves dedicated trace_id and span_id fields at the top level, not buried in attributes. When an application logs inside an active span, the OpenTelemetry logging integration automatically copies the current span's trace_id and span_id into the LogRecord. This holds in Java with log4j2, in Python with the OTel logging instrumentation, and in Go with otelslog: the integration does it for you.
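The mechanism behind this automatic injection can be imitated with the standard library alone. The sketch below is not the OpenTelemetry API: current_span is a hypothetical stand-in for the SDK's active-span context, and TraceContextFilter is an assumed name for a filter that stamps each record with whatever span is current.

```python
import contextvars
import io
import logging

# Hypothetical stand-in for the SDK's notion of "the currently active span".
current_span = contextvars.ContextVar("current_span", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active span's IDs onto every log record passing through."""

    def filter(self, record):
        span = current_span.get()
        record.trace_id = span["trace_id"] if span else ""
        record.span_id = span["span_id"] if span else ""
        return True  # never drop the record, only enrich it


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s %(message)s trace_id=%(trace_id)s span_id=%(span_id)s")
)
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

# Application code only logs; it never passes IDs explicitly.
token = current_span.set(
    {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7"}
)
try:
    logger.error("row-level lock wait")
finally:
    current_span.reset(token)

print(stream.getvalue(), end="")
```

The important property is that the call site stays unchanged: the IDs come from ambient context, so every log statement inside a span is correlated without the developer remembering to pass anything.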
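From the trace view, "show logs for this span" is a single query that filters log lines on the span's trace_id (and, if needed, span_id), in the style of the Loki filter {service="checkout"} |= "<trace_id>".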
Sampling pitfall: if you head-sample traces at 10% but log unconditionally, roughly 90% of the trace_ids in your logs point to traces that were never stored in Tempo. Use tail-based sampling (decide after the trace completes, so errors and slow traces are always kept), or make sure the head-sampling decision travels with the context in trace_flags so log tooling can tell which trace_ids were actually sampled.
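The size of the gap is easy to simulate. The sketch below assumes a ratio-style head sampler that decides from the trace_id itself (so every service reaches the same decision); the threshold comparison, the seed, and the trace count are illustrative choices, not a description of any particular SDK's sampler.

```python
import random


def head_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministic decision from the low 64 bits of the trace_id."""
    return (trace_id & ((1 << 64) - 1)) < int(ratio * (1 << 64))


rng = random.Random(7)  # fixed seed so the run is repeatable
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]

stored_traces = {t for t in trace_ids if head_sampled(t, 0.10)}
logged_trace_ids = trace_ids  # logging is unconditional

orphaned = sum(1 for t in logged_trace_ids if t not in stored_traces)
orphan_ratio = orphaned / len(logged_trace_ids)
print(f"{orphan_ratio:.0%} of logged trace_ids have no stored trace")
```

A log UI that checks the sampled bit before rendering a "view trace" link avoids sending users to traces that were never kept, while tail-based sampling attacks the problem from the other side by keeping the traces people are most likely to look for.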
Unified resource attributes across all signals
The third hook is the OpenTelemetry Resource, a small, fixed set of attributes describing where the telemetry came from (for example service.name, deployment.environment, k8s.pod.name, and host.* attributes), attached to every metric, log, and trace emitted by a process.
Because all three signals share these attributes verbatim, the query "all telemetry for service.name = checkout-service in deployment.environment = prod" returns matching metrics, logs, and traces from a single filter — no vendor-specific tag mapping. Semantic conventions extend this consistency to operation-level attributes (http.request.method, db.system.name, rpc.service) so queries port across backends.
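The "single filter" claim can be illustrated with three toy in-memory stores that share one Resource dictionary. All names and values below are hypothetical (including the pod names), and select is an assumed helper; a real backend would run the equivalent filter in its own query language.

```python
checkout = {
    "service.name": "checkout-service",
    "deployment.environment": "prod",
    "k8s.pod.name": "checkout-7d9f",
}
inventory = {
    "service.name": "inventory-service",
    "deployment.environment": "prod",
    "k8s.pod.name": "inventory-5c2a",
}

# Each signal keeps its own payload but carries the same Resource verbatim.
metrics = [
    {"resource": checkout, "name": "http_requests_total", "value": 1284902},
    {"resource": inventory, "name": "http_requests_total", "value": 990113},
]
logs = [
    {"resource": checkout, "body": "called inventory: timeout after 5s"},
    {"resource": inventory, "body": "received call, query took 4.8s"},
]
spans = [
    {"resource": checkout, "name": "POST /checkout"},
    {"resource": inventory, "name": "GET /inventory"},
]


def select(signal, wanted):
    """Keep items whose Resource matches every wanted attribute."""
    return [
        item for item in signal
        if all(item["resource"].get(key) == value for key, value in wanted.items())
    ]


wanted = {"service.name": "checkout-service", "deployment.environment": "prod"}
for name, signal in (("metrics", metrics), ("logs", logs), ("traces", spans)):
    print(name, len(select(signal, wanted)))
```

The same wanted dictionary works against all three stores because the keys and values are identical across signals; no per-tool renaming (for example service versus app versus job) is needed.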
| Correlation hook | Joins | Mechanism |
| --- | --- | --- |
| Exemplars | Metrics → Traces | trace_id/span_id appended to histogram samples in OpenMetrics |
| Trace context in LogRecord | Logs ↔ Traces | trace_id/span_id as first-class LogRecord fields, auto-populated by SDK |
| Shared resource attributes | All three | service.name, k8s.*, host.* identical across signals from the same process |
| Semantic conventions | All three | Standard attribute keys (http.*, db.*, rpc.*) used uniformly |
Figure 2.6: Correlation hub — three standardized hooks unifying the signals
graph TB
subgraph Signals
M["Metrics Prometheus TSDB"]
L["Logs Loki / Elasticsearch"]
T["Traces Tempo / Jaeger"]
end
H["Shared Resource service.name k8s.pod.name deployment.environment"]
EX["Exemplars trace_id appended to histogram samples"]
TC["LogRecord trace context top-level trace_id / span_id auto-populated by SDK"]
M -- "drill via" --- EX
EX -- "to" --- T
T -- "join via" --- TC
TC -- "to" --- L
M --- H
L --- H
T --- H
H --- Q["query: service.name=checkout returns metrics + logs + traces"]
Key Points — Correlation
Exemplars attach trace_id/span_id to histogram samples in OpenMetrics format without making them series-defining labels — stored in a separate path to avoid cardinality explosion.
LogRecord has trace_id/span_id as top-level fields — not attributes — so trace-to-logs is a single join key.
OpenTelemetry SDKs auto-inject the current span context into LogRecords when logging inside an active span; you do not pass it manually.
Shared Resource (service.name, k8s.pod.name, deployment.environment) means one filter returns all three signals from the same origin.
Sampling must be coordinated: head-sampling traces but unconditionally logging breaks log-to-trace links; prefer tail-based sampling or propagate the decision early.
Part Two — Post-Reading Quiz
Same five questions one more time.
Post-Reading Quiz — Traces & Correlation
6. Why does an OpenTelemetry span carry both a span_id and a parent span_id in addition to the shared trace_id?
Because trace_id is too long to fit in HTTP headers.
To explicitly encode parent-child causality as a directed graph, rather than inferring causal order from clock timestamps that may drift between hosts.
To allow each service to rename the trace before forwarding it.
Parent span_id is optional and only used by Jaeger.
7. A single HTTP call from checkout-service to inventory-service produces how many spans, and what kinds, under OpenTelemetry semantic conventions?
One INTERNAL span on the caller only.
Two spans: a CLIENT span on checkout-service and a SERVER span on inventory-service, sharing trace_id with a parent-child link.
Two SERVER spans, one on each side.
Three spans: CLIENT, SERVER, and a routing span on the network layer.
8. Why are exemplars stored in a separate path from normal Prometheus time series rather than as additional labels on the metric?
Because Prometheus requires alphabetical ordering of labels and trace_ids break that ordering.
Because trace_id has effectively unbounded cardinality; storing it as a series-defining label would multiply time series per unique trace and crush the TSDB.
Because exemplars are encrypted while metric samples are not.
Because Grafana cannot read trace_ids that live alongside other labels.
9. Your team head-samples traces at 10% but logs unconditionally on every request, with trace_id injected into LogRecords automatically. What is the predictable consequence?
Tempo will reject the over-sampled log volume.
Roughly 90% of trace_id values in logs will point to traces that were never stored, breaking "jump from log to trace" links.
Log timestamps will drift relative to trace timestamps.
Loki's label index will exceed Elasticsearch's, increasing cost.
10. Why does a shared OpenTelemetry Resource (e.g. service.name = checkout-service) matter more than any single correlation hook between metrics, logs, and traces?
Because Resource attributes are encrypted end-to-end.
Because once metrics, logs, and traces all carry the same Resource attributes, a single query like service.name = checkout-service returns matching telemetry of all three signals with no per-tool tag mapping.
Because Resource attributes are required to compute histogram_quantile().
Because Resource attributes replace the need for trace_id propagation.