Study Guide: The Three Signals: Metrics, Logs, and Traces in Depth

Chapter 1 introduced observability as a property that emerges from three complementary signals. This chapter goes one level deeper: we dissect the wire format of each signal, examine the cardinality math that drives cost, and show how exemplars and W3C Trace Context turn three independent data streams into a single navigable investigative narrative.

Figure 2.1: The three signals and their correlation hooks

Part One — Metrics & Logs (Pre-Quiz)

Take this short quiz before reading sections 1 and 2 below. The same quiz appears again after the content so you can see your improvement.

Pre-Reading Quiz — Metrics & Logs

1. You operate a service across 10 pods. Each pod exposes a request_duration_seconds summary with a quantile="0.99" series. What is the safest way to get a meaningful fleet-wide p99?

Take the avg() of the ten {quantile="0.99"} series across pods.

Take the max() of the ten {quantile="0.99"} series across pods.

You cannot derive a true fleet-wide p99 from summaries; switch the metric to a histogram and aggregate buckets first.

Sum _sum and _count across pods and call histogram_quantile(0.99, ...) on the result.

2. A metric http_requests_total currently has 5 methods, 50 paths, 6 status buckets, 20 services, and 4 regions as labels. A teammate proposes adding user_id as a label "for easy debugging." Why is this dangerous?

Prometheus does not allow more than 5 labels per metric.

Each unique label combination creates a new time series; user_id can multiply 120,000 series by every active user, exploding storage cost.

User IDs cannot be expressed as strings in Prometheus labels.

It would break the rate() function.

3. Which question can a counter answer cleanly (combined with PromQL) that a gauge cannot?

"How much memory is in use right now?"

"What is the request rate per second over the last 5 minutes?"

"What is the current queue depth?"

"What is the current temperature in Celsius?"

4. The OpenTelemetry LogRecord defines a normalized severity_number field. What problem does this primarily solve?

It reduces the on-disk size of log files compared to plain text.

It lets a single query like severity_number >= 17 mean "ERROR or worse" no matter which framework (Java, Python, Go, .NET) emitted the log.

It eliminates the need for a body field.

It encrypts log content so unauthorized backends cannot read it.

5. Your team relies entirely on structured logs (no traces) to debug a microservices checkout flow. A request times out across three services. Why is logs-alone an architecturally weak strategy here?

JSON logs are slower to write than plain text, so logs lag the request.

Log levels above WARN are dropped by most collectors by default.

Logs are emitted per service and do not encode parent/child causality; you must guess causal order from timestamps and propagate request IDs perfectly through every boundary.

Structured logs cannot be indexed by Elasticsearch or Loki.

1. Metrics — Numbers Over Time

A metric is a numeric measurement of a system property recorded at a point in time. Metrics are the cheapest of the three signals because the storage system records aggregates, not individual events. A counter incremented one billion times occupies the same space as one incremented ten times: in both cases the TSDB stores one timestamped value per scrape, not one record per increment.

Think of metrics as the dashboard gauges in a car: speed, RPM, fuel level. They tell you the state of the system at a glance, but they don't tell you why the engine started knocking three miles back.

Counters, gauges, histograms, and summaries

Counter — monotonically increasing (only goes up, or resets on restart). Examples: http_requests_total, errors_total. Answers "how many?" and "at what rate?" via rate().
Gauge — can go up or down. Examples: memory_bytes_in_use, queue_depth. Represents instantaneous state.
Histogram — a distribution stored as cumulative buckets *_bucket{le="..."} plus _sum and _count. Quantiles like p99 are computed server-side in PromQL with histogram_quantile().
Summary — quantiles computed client-side using a sliding-window algorithm and exposed as {quantile="0.99"} series.
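To make the counter's "at what rate?" answer concrete, here is a minimal sketch of the idea behind rate(): sum the increases between scrapes and treat any drop as a restart. It is a simplification (PromQL's rate() also extrapolates to the window boundaries), and the names counter_increase, simple_rate, and the sample scrape values are hypothetical.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix_timestamp_seconds, counter_value)


def counter_increase(samples: Sequence[Sample]) -> float:
    """Total increase across the window, treating any drop as a counter reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A counter never decreases, so a lower value means the process
        # restarted and the counter began again from zero.
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples: Sequence[Sample]) -> float:
    """Per-second rate over the window (no boundary extrapolation)."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes every 15 s; the counter resets between t=30 and t=45.
scrapes = [(0, 100), (15, 160), (30, 220), (45, 30), (60, 90)]
print(simple_rate(scrapes))  # 3.5
```

A gauge cannot be treated this way: because it may legitimately fall, a drop carries no information about restarts.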

Animation — Counter, Gauge, Histogram (Live)

Counter: requests_total = 1,284,902 (monotonic)
Gauge: queue_depth = 42 (can rise or fall)
Histogram: le=0.1 .. 1s buckets · p99 = 540ms

Counters only climb. Gauges swing both ways. Histograms accumulate into le=<bound> buckets so quantiles can be computed at query time.

Histogram vs summary — the one trap to remember

The critical difference: summary quantiles cannot be aggregated across instances. Averaging the p99 from each of ten pods does not give a true global p99. You can safely sum only _sum and _count from a summary. Histogram buckets, in contrast, are fully aggregatable — sum() buckets across instances and then call histogram_quantile() to get meaningful service-wide tail latency.
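The difference is easy to see with numbers. The sketch below is a simplified, pure-Python imitation of the linear interpolation that histogram_quantile() performs on classic buckets; it is not Prometheus code, and the two pods' bucket counts are invented for illustration. It assumes every histogram includes a +Inf bucket holding the total count.

```python
import math
from typing import Dict, List

Buckets = Dict[float, int]  # upper bound (le) -> cumulative count


def merge_buckets(per_pod: List[Buckets]) -> Buckets:
    """Sum cumulative bucket counts across instances, bound by bound."""
    merged: Buckets = {}
    for buckets in per_pod:
        for le, count in buckets.items():
            merged[le] = merged.get(le, 0) + count
    return merged


def histogram_quantile(q: float, buckets: Buckets) -> float:
    """Estimate a quantile by interpolating linearly inside the target bucket."""
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for le in bounds:
        count = buckets[le]
        if count >= rank:
            if math.isinf(le) or count == prev_count:
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (le - prev_bound) * fraction
        prev_bound, prev_count = le, count
    return prev_bound


# Hypothetical fleet: a busy fast pod and a quiet slow pod.
pod_a = {0.1: 900, 0.5: 995, 1.0: 1000, 2.0: 1000, math.inf: 1000}
pod_b = {0.1: 0, 0.5: 10, 1.0: 50, 2.0: 100, math.inf: 100}

avg_of_p99 = (histogram_quantile(0.99, pod_a) + histogram_quantile(0.99, pod_b)) / 2
fleet_p99 = histogram_quantile(0.99, merge_buckets([pod_a, pod_b]))
print(round(avg_of_p99, 3), round(fleet_p99, 3))  # approximately 1.229 and 1.78
```

Averaging the per-pod values weights both pods equally even though one serves ten times the traffic, which is why the two results disagree.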

Aspect	Counter	Gauge	Histogram	Summary
Direction	Up only	Any	Distribution	Distribution
Quantile compute	N/A	N/A	Server-side (PromQL)	Client-side (in process)
Cross-instance aggregation	Trivial	Trivial	Yes (sum buckets)	Only `_sum`/`_count`
Cardinality per metric	1 series	1 series	N buckets + 2	N quantiles + 2

Classic histograms have a drawback: every bucket is its own series, so bucket count multiplies cardinality. Native histograms (introduced as an experimental feature in Prometheus 2.40) encode dynamic, exponentially spaced buckets inside a single time series, sharply reducing series count while still supporting high resolution.

Cardinality — the silent budget killer

A Prometheus time series is uniquely identified by its metric name plus the set of label key/value pairs:

http_requests_total{method="GET", path="/api/orders", status="200", service="checkout"}

Each unique combination of labels creates a new series. The math is brutal — consider these labels:

method: 5 (GET, POST, PUT, DELETE, PATCH)
path: 50 endpoints
status: 6 (status code groups)
service: 20
region: 4

Total potential series: 5 × 50 × 6 × 20 × 4 = 120,000 for one metric. Add user_id with 10,000 unique values and you reach 1.2 billion potential series. This is why veterans say "never label by user ID, request ID, or trace ID." It's also why exemplars exist — they reference a trace ID without making it a series-defining label.
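The arithmetic above can be checked in a few lines. This sketch simply multiplies the label cardinalities from the list; the function name potential_series is hypothetical, and the result is an upper bound, since real traffic rarely exercises every combination.

```python
import math
from typing import Dict

label_cardinalities: Dict[str, int] = {
    "method": 5,
    "path": 50,
    "status": 6,
    "service": 20,
    "region": 4,
}


def potential_series(cardinalities: Dict[str, int]) -> int:
    """Upper bound on series for one metric: the product of label cardinalities."""
    return math.prod(cardinalities.values())


base = potential_series(label_cardinalities)
with_user_id = potential_series({**label_cardinalities, "user_id": 10_000})
print(f"{base:,}")          # 120,000
print(f"{with_user_id:,}")  # 1,200,000,000
```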

Key Points — Metrics

Cheap because aggregated: 1B increments cost the same as 10 — the TSDB stores current values plus timestamps, not individual events.
Four instrument types: Counter (up only), Gauge (any direction), Histogram (server-side quantiles, aggregatable), Summary (client-side quantiles, not aggregatable).
Histograms beat summaries for SLO work: sum buckets across pods, then histogram_quantile() for true fleet-wide p99.
Native histograms (2.40+): dynamic log-spaced buckets in one series — far less cardinality, often the better default.
Cardinality is the cost: series = product of label cardinalities. Never label by user_id, request_id, or trace_id.

2. Logs — Structured Events

A log is a discrete event record emitted by an application at a specific moment. Where a metric says "23,481 requests in the last minute," a log says "request #382749 from user 42 to /api/orders failed at 10:03:21 with NullPointerException." Logs preserve the individual story that metrics aggregate away.

Structured vs unstructured logging

For decades, application logs looked like this:

2024-05-10T10:00:00Z my-host app: ERROR user 42 not found

This is unstructured logging. Humans can read it; computers struggle. Searching for "all errors involving user 42" forces a downstream tool to parse the string with regex and hope every team used the same format — an information-destruction pipeline.

Structured logging flips this around: the application emits machine-readable key/value records from the start.

{
  "ts": "2024-05-10T10:00:00Z",
  "level": "info",
  "service": "billing",
  "request_id": "abc-123",
  "user_id": 42,
  "msg": "charged credit card",
  "amount": 99.99
}
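A record like the one above can be produced with nothing but the standard library. The following is a minimal sketch, assuming Python's logging module and a custom formatter; JsonFormatter, build_logger, and the "fields" key passed through extra= are hypothetical names chosen for this example, not part of any logging standard.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record;
        # "fields" is a hypothetical key used only in this sketch.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("billing")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info(
    "charged credit card",
    extra={"fields": {"service": "billing", "request_id": "abc-123",
                      "user_id": 42, "amount": 99.99}},
)
print(buffer.getvalue(), end="")
```

Because every field is a key, "all events for user 42" becomes a field match instead of a regex over free text.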

OpenTelemetry formalizes this with a typed LogRecord schema. Key fields:

timestamp — when the event occurred (application time).
observed_timestamp — when the collector saw it (pipeline time).
severity_number — normalized 1–24 scale: TRACE (1–4), DEBUG (5–8), INFO (9–12), WARN (13–16), ERROR (17–20), FATAL (21–24).
severity_text — original severity string ("WARN", "Warning", "warn").
body — main content, typed as AnyValue (string, number, map, or list).
attributes — structured context (http.method, db.system, ...).
resource — service/host/cluster metadata shared with traces and metrics.
trace_id, span_id, trace_flags — first-class trace context fields.

The genius of severity_number is normalization: a backend query severity_number >= 17 means "ERROR or worse" whether the source used Python's logging.ERROR, Java's Level.ERROR, Go's slog.LevelError, or .NET's LogLevel.Error.
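As a rough illustration of that normalization, the sketch below maps framework-specific severity strings onto the base value of each range listed above. The alias table is an assumption made for this example (real SDK bridges define their own mappings), and 0 is used for unrecognized text.

```python
# Base value of each severity_number range from the LogRecord schema.
SEVERITY_BASE = {"TRACE": 1, "DEBUG": 5, "INFO": 9, "WARN": 13, "ERROR": 17, "FATAL": 21}

# Hypothetical aliases for spellings used by different frameworks.
ALIASES = {"WARNING": "WARN", "CRITICAL": "FATAL", "ERR": "ERROR", "INFORMATION": "INFO"}


def severity_number(severity_text: str) -> int:
    """Map a framework's severity string to the normalized number (0 if unknown)."""
    name = severity_text.strip().upper()
    name = ALIASES.get(name, name)
    return SEVERITY_BASE.get(name, 0)


def is_error_or_worse(severity_text: str) -> bool:
    """Equivalent of the backend query severity_number >= 17."""
    return severity_number(severity_text) >= 17


for text in ["warn", "Warning", "ERROR", "Critical"]:
    print(text, severity_number(text), is_error_or_worse(text))
```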

Figure 2.2: OpenTelemetry LogRecord schema

graph TD
  LR["LogRecord"]
  LR --> TS["timestamp<br/>application time"]
  LR --> OTS["observed_timestamp<br/>pipeline time"]
  LR --> SEV["Severity"]
  SEV --> SN["severity_number<br/>1-24 normalized"]
  SEV --> ST["severity_text<br/>'WARN', 'Warning'..."]
  LR --> B["body : AnyValue<br/>string / num / map / list"]
  LR --> A["attributes<br/>http.method, db.system..."]
  LR --> R["resource<br/>service.name, k8s.*, host.*"]
  LR --> TC["Trace Context"]
  TC --> TID["trace_id"]
  TC --> SID["span_id"]
  TC --> TF["trace_flags"]
  LR --> IS["instrumentation_scope"]
  LR --> DA["dropped_attributes_count"]

Indexing strategies and pipelines

A typical pipeline: application writes structured logs to stdout → node agent (Fluent Bit, Vector, OTel Collector) enriches with Kubernetes metadata and ships → aggregator buffers/samples → storage backend indexes for search.

Elasticsearch-style full-text indexing lets you grep any word in any log instantly, but the index can be larger than the data itself. Loki-style label-only indexing keeps a tiny index over a handful of labels (service, pod, namespace) and stores raw log lines compressed; queries grep through chunks — cheaper to operate but slower for arbitrary substring searches.

Aspect	OpenTelemetry Logs	Fluentd / Logstash
Data model	Standardized LogRecord schema	No global standard; per-pipeline JSON
Severity	Normalized `severity_number` + `severity_text`	String, ad-hoc normalization
Service context	Built-in `Resource` shared with traces/metrics	Plugin-specific conventions
Trace correlation	First-class `trace_id`/`span_id` fields	Manual injection + custom parsing
Multi-signal	Logs, traces, metrics share Resource	Logs only

Why logs alone cannot do root-cause analysis

A checkout request times out. The checkout-service log says "called inventory: timeout after 5s." The inventory-service log says "received call, query took 4.8s." The postgres log says "lock wait." Three lines, three streams, one causal chain, but the chain is only implicit. Reconstructing it requires propagating a request ID through every boundary (most teams miss at least one), aligning clocks that drift, and guessing the causal order from timestamps that do not capture parent/child relationships.

This is exactly the problem traces solve. Logs complement traces by carrying rich per-event context; they do not replace the causal graph.

Key Points — Logs

Structured beats unstructured: emit machine-readable key/value records, not free-form strings; OpenTelemetry's LogRecord formalizes this with a typed schema.
Normalized severity: severity_number on a 1–24 scale lets one query mean the same thing across Python/Java/Go/.NET.
Trace context is first-class: trace_id, span_id, and trace_flags live at the top level of LogRecord — not buried in attributes.
Indexing strategy is a cost choice: full-text (Elasticsearch) is powerful and expensive; label-only (Loki) is cheap to operate, but arbitrary substring searches must scan compressed chunks and run slower.
Logs alone cannot reconstruct distributed causality: they are per-service and lack a parent/child graph — that's what traces are for.

Part One — Post-Reading Quiz

Post-Reading Quiz — Metrics & Logs

1. You operate a service across 10 pods. Each pod exposes a request_duration_seconds summary with a quantile="0.99" series. What is the safest way to get a meaningful fleet-wide p99?

Take the avg() of the ten {quantile="0.99"} series across pods.

Take the max() of the ten {quantile="0.99"} series across pods.

You cannot derive a true fleet-wide p99 from summaries; switch the metric to a histogram and aggregate buckets first.

Sum _sum and _count across pods and call histogram_quantile(0.99, ...) on the result.

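2. A metric http_requests_total currently has 5 methods, 50 paths, 6 status buckets, 20 services, and 4 regions as labels. A teammate proposes adding user_id as a label "for easy debugging." Why is this dangerous?

Prometheus does not allow more than 5 labels per metric.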

Each unique label combination creates a new time series; user_id can multiply 120,000 series by every active user, exploding storage cost.

User IDs cannot be expressed as strings in Prometheus labels.

It would break the rate() function.

3. Which question can a counter answer cleanly (combined with PromQL) that a gauge cannot?

"How much memory is in use right now?"

"What is the request rate per second over the last 5 minutes?"

"What is the current queue depth?"

"What is the current temperature in Celsius?"

4. The OpenTelemetry LogRecord defines a normalized severity_number field. What problem does this primarily solve?

It reduces the on-disk size of log files compared to plain text.

It lets a single query like severity_number >= 17 mean "ERROR or worse" no matter which framework (Java, Python, Go, .NET) emitted the log.

It eliminates the need for a body field.

It encrypts log content so unauthorized backends cannot read it.

5. Your team relies entirely on structured logs (no traces) to debug a microservices checkout flow. A request times out across three services. Why is logs-alone an architecturally weak strategy here?

JSON logs are slower to write than plain text, so logs lag the request.

Log levels above WARN are dropped by most collectors by default.

Logs are emitted per service and do not encode parent/child causality; you must guess causal order from timestamps and propagate request IDs perfectly through every boundary.

Structured logs cannot be indexed by Elasticsearch or Loki.

Part Two — Traces & Correlation (Pre-Quiz)

Pre-Reading Quiz — Traces & Correlation

6. Why does an OpenTelemetry span carry both a span_id and a parent span_id in addition to the shared trace_id?

Because trace_id is too long to fit in HTTP headers.

To explicitly encode parent-child causality as a directed graph, rather than inferring causal order from clock timestamps that may drift between hosts.

To allow each service to rename the trace before forwarding it.

Parent span_id is optional and only used by Jaeger.

7. A single HTTP call from checkout-service to inventory-service produces how many spans, and what kinds, under OpenTelemetry semantic conventions?

One INTERNAL span on the caller only.

Two spans: a CLIENT span on checkout-service and a SERVER span on inventory-service, sharing trace_id with a parent-child link.

Two SERVER spans, one on each side.

Three spans: CLIENT, SERVER, and a routing span on the network layer.

8. Why are exemplars stored in a separate path from normal Prometheus time series rather than as additional labels on the metric?

Because Prometheus requires alphabetical ordering of labels and trace_ids break that ordering.

Because trace_id has effectively unbounded cardinality; storing it as a series-defining label would multiply time series per unique trace and crush the TSDB.

Because exemplars are encrypted while metric samples are not.

Because Grafana cannot read trace_ids that live alongside other labels.

9. Your team head-samples traces at 10% but logs unconditionally on every request, with trace_id injected into LogRecords automatically. What is the predictable consequence?

Tempo will reject the over-sampled log volume.

Roughly 90% of trace_id values in logs will point to traces that were never stored, breaking "jump from log to trace" links.

Log timestamps will drift relative to trace timestamps.

Loki's label index will exceed Elasticsearch's, increasing cost.

10. Why does a shared OpenTelemetry Resource (e.g. service.name = checkout-service) matter more than any single correlation hook between metrics, logs, and traces?

Because Resource attributes are encrypted end-to-end.

Because once metrics, logs, and traces all carry the same Resource attributes, a single query like service.name = checkout-service returns matching telemetry of all three signals with no per-tool tag mapping.

Because Resource attributes are required to compute histogram_quantile().

Because Resource attributes replace the need for trace_id propagation.

3. Traces — Causally-linked Spans

A trace is the recorded journey of a single request through a distributed system. If metrics are the dashboard and logs are the event log, traces are the GPS track — not just that the trip happened but the exact route, which segments were slow, and where the detours were.

Trace, span, and span context

A trace is identified by a 128-bit trace_id (32 hex chars), e.g. 4bf92f3577b34da6a3ce929d0e0e4736. All work performed in service of one logical request shares this ID.
A span is a named unit of work within the trace with a 64-bit span_id (16 hex chars). Each span has a start, end, name, status, and attributes.
A span context is the immutable envelope carrying trace_id, current span_id, and trace_flags across process boundaries. The standard wire format is the W3C traceparent header:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             |  |                                |                |
             |  trace_id (32 hex)                span_id (16 hex) trace_flags
             version

When checkout-service calls inventory-service, it copies its current span context into a traceparent header. The receiver reads it, knows the parent's trace_id and span_id, and starts a child span under the same trace. This is how one trace stitches itself together across N services with no central coordinator.
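The receive-and-continue step can be sketched as follows. This is a simplified illustration, not an SDK implementation: it accepts only the four-field layout shown above, and SpanContext, parse_traceparent, and start_child are hypothetical names. A real application would rely on an OpenTelemetry propagator instead.

```python
import re
import secrets
from dataclasses import dataclass

# version - trace_id - parent span_id - flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str
    trace_flags: int

    @property
    def sampled(self) -> bool:
        # Bit 0 of trace_flags is the "sampled" flag.
        return bool(self.trace_flags & 0x01)

    def to_header(self) -> str:
        return f"00-{self.trace_id}-{self.span_id}-{self.trace_flags:02x}"


def parse_traceparent(header: str) -> SpanContext:
    """Parse an incoming traceparent header, rejecting malformed values."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per W3C Trace Context.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        raise ValueError(f"invalid traceparent: {header!r}")
    return SpanContext(trace_id, span_id, int(flags, 16))


def start_child(parent: SpanContext) -> SpanContext:
    """Same trace_id and flags, fresh random 64-bit span_id."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.trace_flags)


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parent = parse_traceparent(incoming)
child = start_child(parent)
print(child.to_header())  # same trace_id, new span_id
```

The child records the parent's span_id as its parent and sends its own span_id in the header of any further outgoing call.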

Animation — Span Tree Expansion (Checkout request)

SERVER POST /checkout (a1) 980ms
  CLIENT inventory.check (b2, parent=a1) 620ms
    SERVER GET /inventory (c3, parent=b2) 600ms
      DB db.query SELECT stock (d4) 540ms
  CLIENT payment.charge (e5, parent=a1) 320ms
    SERVER POST /charge (f6, parent=e5) 300ms

All six spans share one trace_id. Each non-root span points to its parent's span_id, forming an explicit causal tree. The db.query span (540ms) dominates the request; this is what a flame graph would highlight as the root cause.
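One way to see why db.query stands out is to rebuild the tree from parent pointers and compute each span's self time (its duration minus its direct children). The sketch below uses the six spans from the example; it assumes d4's parent is c3, which the example implies but does not state, and it assumes children run sequentially. The Span class and helper names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: int


SPANS = [
    Span("a1", None, "POST /checkout", 980),
    Span("b2", "a1", "inventory.check", 620),
    Span("c3", "b2", "GET /inventory", 600),
    Span("d4", "c3", "db.query SELECT stock", 540),  # parent assumed to be c3
    Span("e5", "a1", "payment.charge", 320),
    Span("f6", "e5", "POST /charge", 300),
]


def children_of(spans: List[Span]) -> Dict[Optional[str], List[Span]]:
    """Index spans by their parent's span_id (None for the root)."""
    index: Dict[Optional[str], List[Span]] = defaultdict(list)
    for span in spans:
        index[span.parent_id].append(span)
    return index


def self_times(spans: List[Span]) -> Dict[str, int]:
    """Duration minus time spent in direct children, per span."""
    index = children_of(spans)
    return {
        s.span_id: s.duration_ms - sum(c.duration_ms for c in index[s.span_id])
        for s in spans
    }


def render(spans: List[Span], parent_id: Optional[str] = None, depth: int = 0) -> List[str]:
    """Indented text rendering of the span tree."""
    lines: List[str] = []
    for span in children_of(spans)[parent_id]:
        lines.append(f"{'  ' * depth}{span.name} ({span.span_id}) {span.duration_ms}ms")
        lines.extend(render(spans, span.span_id, depth + 1))
    return lines


times = self_times(SPANS)
slowest = max(times, key=times.get)
print("\n".join(render(SPANS)))
print(slowest, times[slowest])  # d4 540
```

No timestamps are compared anywhere: the ordering comes entirely from the parent pointers.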

Span kinds and auto-derived service maps

OpenTelemetry classifies spans by kind:

INTERNAL — work entirely within one process.
SERVER — receives an incoming RPC (callee side).
CLIENT — sends an outgoing RPC (caller side).
PRODUCER — emits a message to a queue.
CONSUMER — receives a message from a queue.

A typical HTTP call produces two spans: a CLIENT span on the caller and a SERVER span on the callee, linked by the same trace_id and a parent-child relationship. Because every cross-service call produces a tagged CLIENT/SERVER pair, a backend can build a live service map automatically: services are nodes, observed parent-child cross-service relationships are edges, edge weight = call rate, edge color = error rate or latency.
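A backend's edge derivation can be approximated in a few lines: for every SERVER span whose parent is a CLIENT span in a different service, count one call on the edge between the two services. This is a minimal sketch under that assumption; the Span fields and the service_edges function are hypothetical, and real backends also attach rate, error, and latency statistics to each edge.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    kind: str  # "SERVER", "CLIENT", "INTERNAL", "PRODUCER", or "CONSUMER"


def service_edges(spans: Iterable[Span]) -> Counter:
    """Count caller -> callee edges from CLIENT/SERVER parent-child pairs."""
    spans = list(spans)
    by_id = {s.span_id: s for s in spans}
    edges: Counter = Counter()
    for span in spans:
        parent = by_id.get(span.parent_id) if span.parent_id else None
        if (
            span.kind == "SERVER"
            and parent is not None
            and parent.kind == "CLIENT"
            and parent.service != span.service
        ):
            edges[(parent.service, span.service)] += 1
    return edges


trace = [
    Span("a1", None, "checkout-service", "SERVER"),
    Span("b2", "a1", "checkout-service", "CLIENT"),
    Span("c3", "b2", "inventory-service", "SERVER"),
    Span("e5", "a1", "checkout-service", "CLIENT"),
    Span("f6", "e5", "payment-service", "SERVER"),
]

for (caller, callee), calls in service_edges(trace).items():
    print(f"{caller} -> {callee}: {calls}")
```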

Figure 2.4: Service map auto-derived from CLIENT→SERVER span pairs

graph LR
  Web["web-frontend"]
  Co["checkout-service"]
  Inv["inventory-service"]
  Pay["payment-service"]
  PG[("postgres")]
  Rd[("redis")]
  Web -- "1200 rps<br/>err 0.1%" --> Co
  Co -- "1100 rps<br/>err 0.4%" --> Inv
  Co -- "900 rps<br/>err 2.1% (hot)" --> Pay
  Inv -- "1100 rps<br/>p99 540ms" --> PG
  Co -- "1200 rps<br/>p99 5ms" --> Rd
  Pay -- "900 rps<br/>err 1.8%" --> PG

Key Points — Traces

One trace_id, many spans: 128-bit trace_id is shared across the whole request; each span has its own 64-bit span_id plus a parent span_id.
W3C Trace Context (traceparent header) is the vendor-neutral wire format that stitches a trace across services with no central coordinator.
Span kinds (SERVER/CLIENT/PRODUCER/CONSUMER/INTERNAL) let backends draw flame graphs and auto-derive service maps from observed CLIENT→SERVER pairs.
Traces capture causality, not just timing — the parent-child graph is explicit, so root-cause analysis no longer guesses order from drifting clocks.
Service maps reflect the current architecture from live traffic — including unexpected edges your hand-drawn diagram missed.

4. Correlating the Three

Each signal in isolation is useful; together they become an investigative narrative. The mechanics of correlation — how a dot on a Grafana chart links to a trace, which in turn surfaces the relevant log lines — rely on three standardized hooks.

Exemplars in Prometheus histograms

The naive way to link a latency spike to a slow request is to put trace_id in the metric labels — which blows up cardinality. Exemplars solve this by attaching a few representative trace_id/span_id pointers alongside metric samples without making them series-defining labels.

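The OpenMetrics text exposition format appends an exemplar to the end of a sample line, introduced by " # ": a label set, the exemplar's own observed value, and an optional timestamp. In the line below, 0.43 is an illustrative observed latency in seconds that falls inside the le="0.5" bucket:

http_server_request_duration_seconds_bucket{le="0.5",method="GET",service="api"} 240 1700000100 # {trace_id="4bf92f3577b34da6a3ce929d0e0e4736",span_id="00f067aa0ba902b7"} 0.43 1700000099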

Constraints worth memorizing:

At most one exemplar per series per scrape interval is typically exported.
Exemplars attach mostly to histogram buckets (great for "find me a trace in this latency bucket") and sometimes to counter increments.
Exemplars live in a separate storage path from ordinary series in the Prometheus TSDB — specifically to avoid cardinality explosion.
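To show that the exemplar travels beside the sample rather than inside its label set, here is a small sketch that splits one exposition line into its sample and exemplar parts. It is a simplified parser written for this example (it would misread a label value containing " # "), the function name split_exemplar is hypothetical, and the exemplar value 0.43 is invented.

```python
import re
from typing import Dict, Optional, Tuple

_LABEL = re.compile(r'(\w+)="([^"]*)"')


def split_exemplar(line: str) -> Tuple[str, Optional[Dict[str, str]], Optional[float]]:
    """Split a sample line into (sample, exemplar_labels, exemplar_value)."""
    sample, separator, exemplar = line.partition(" # ")
    if not separator:
        return sample, None, None  # line carries no exemplar
    labels_part, _, rest = exemplar.partition("} ")
    labels = dict(_LABEL.findall(labels_part))
    value = float(rest.split()[0])  # an optional timestamp may follow
    return sample, labels, value


line = (
    'http_server_request_duration_seconds_bucket{le="0.5",method="GET",service="api"} 240'
    ' # {trace_id="4bf92f3577b34da6a3ce929d0e0e4736",span_id="00f067aa0ba902b7"} 0.43'
)
sample, exemplar_labels, exemplar_value = split_exemplar(line)
print(sample)
print(exemplar_labels["trace_id"], exemplar_value)
```

The series identity is fully determined by the sample part; the trace_id never appears among the series labels.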

Animation — Metric → Trace → Log Correlation Jump

Metric (Prometheus): p99 latency spike at 10:05 from histogram_quantile(0.99, ...); exemplar trace_id = 4bf92f3577b34da6...
Trace (Tempo): GET trace by id → checkout → inventory → db; db.query span = 2.1s (lock wait)
Log (Loki): {service="checkout"} |= trace_id → ERROR row-level lock, trace_id=4bf92f35... span_id=d4

Same trace_id appears in all three signals. Exemplar takes you from metric dot → trace; trace_id field in LogRecord takes you from trace span → logs. Shared service.name binds everything to one origin.

Figure 2.5: Metric-to-trace exemplar drill-down workflow

sequenceDiagram
  participant Dev as Developer
  participant Graf as Grafana
  participant Prom as Prometheus<br/>(TSDB + exemplar store)
  participant Tempo as Tempo
  Dev->>Graf: open p99 latency dashboard
  Graf->>Prom: histogram_quantile(0.99, ...)
  Prom-->>Graf: latency series + exemplar dots
  Note over Graf: spike at 10:05<br/>dot shows trace_id=4bf92f35...
  Dev->>Graf: hover dot, click "View trace"
  Graf->>Tempo: GET /traces/4bf92f35...
  Tempo-->>Graf: span tree (checkout → inventory → db)
  Note over Dev,Tempo: db.query span = 2.1s<br/>(lock wait identified)

Trace-to-logs joins via trace_id and span_id

The second leg is trace-to-logs, made possible because the OpenTelemetry LogRecord schema reserves dedicated trace_id and span_id fields at the top level — not buried in attributes. When an application logs inside an active span, the OpenTelemetry logging integration automatically copies the current span's trace_id and span_id into the LogRecord. In Java with log4j2, in Python with the OTel logging instrumentation, in Go with otelslog — the SDK does it for you.

From the trace view, "show logs for this trace" is a single LogQL line-filter query on the trace ID (chain a second |= filter on the span_id to narrow it to one span):

{service="checkout"} |= "4bf92f3577b34da6a3ce929d0e0e4736"

Sampling pitfall: if you head-sample traces at 10% but log unconditionally, 90% of your log trace_ids point to traces that don't exist in Tempo. Use tail-based sampling (decide based on outcomes — errors and slow traces always kept) or ensure the sampling decision propagates early so non-sampled traces don't leave stale IDs in logs.
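One way to keep log-to-trace links honest under head sampling is to record the sampling decision alongside the IDs, so a UI offers "view trace" only when the trace was kept. The sketch below illustrates the idea; the ratio test on the low 64 bits of the trace ID is an assumption made for this example (SDK samplers differ in detail), and head_sample, log_context, and has_stored_trace are hypothetical names.

```python
SAMPLED_FLAG = 0x01  # bit 0 of trace_flags


def head_sample(trace_id: str, ratio: float) -> bool:
    """Deterministic decision: keep when the low 64 bits fall below ratio * 2**64."""
    threshold = int(ratio * (1 << 64))
    return int(trace_id[-16:], 16) < threshold


def log_context(trace_id: str, span_id: str, sampled: bool) -> dict:
    """Trace context fields to attach to a log record."""
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "trace_flags": SAMPLED_FLAG if sampled else 0x00,
    }


def has_stored_trace(record: dict) -> bool:
    """Only offer a 'jump to trace' link when the sampled bit is set."""
    return bool(record.get("trace_flags", 0) & SAMPLED_FLAG)


# Two illustrative trace IDs evaluated at a 10% ratio.
kept_id = "0af7651916cd43dd0000000000000001"
dropped_id = "4bf92f3577b34da6a3ce929d0e0e4736"

for trace_id in (kept_id, dropped_id):
    record = log_context(trace_id, "00f067aa0ba902b7", head_sample(trace_id, 0.10))
    print(trace_id, "link" if has_stored_trace(record) else "no link")
```

Because the decision depends only on the trace ID, every service that applies the same rule reaches the same verdict; in practice the sampled bit in traceparent carries the decision downstream. The trace_id remains useful for grouping log lines even when no trace was stored.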

Unified resource attributes across all signals

The third hook is the OpenTelemetry Resource — a small, fixed set of attributes describing where the telemetry came from, attached to every metric, log, and trace emitted by a process:

service.name = "checkout-service"
service.namespace = "payments"
service.instance.id = "pod-abc123"
service.version = "1.4.2"
deployment.environment = "prod"
k8s.pod.name = "checkout-7f8b9-abcde"
k8s.namespace.name = "payments-prod"
host.name = "ip-10-0-1-42.ec2.internal"
cloud.region = "us-east-1"

Because all three signals share these attributes verbatim, the query "all telemetry for service.name = checkout-service in deployment.environment = prod" returns matching metrics, logs, and traces from a single filter — no vendor-specific tag mapping. Semantic conventions extend this consistency to operation-level attributes (http.request.method, db.system.name, rpc.service) so queries port across backends.
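The single-filter idea can be illustrated with an in-memory stand-in for a backend. In this sketch, Telemetry and select_by_resource are hypothetical names and the records are invented; the point is only that one resource predicate applies unchanged to metrics, logs, and traces.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Telemetry:
    signal: str  # "metric", "log", or "trace"
    name: str
    resource: Dict[str, str]


def select_by_resource(items: Iterable[Telemetry], wanted: Dict[str, str]) -> List[Telemetry]:
    """Return every item whose resource matches all wanted attributes."""
    return [
        item for item in items
        if all(item.resource.get(key) == value for key, value in wanted.items())
    ]


checkout_prod = {"service.name": "checkout-service", "deployment.environment": "prod"}
inventory_prod = {"service.name": "inventory-service", "deployment.environment": "prod"}

store = [
    Telemetry("metric", "http_server_request_duration_seconds", checkout_prod),
    Telemetry("log", "charged credit card", checkout_prod),
    Telemetry("trace", "POST /checkout", checkout_prod),
    Telemetry("trace", "GET /inventory", inventory_prod),
]

matches = select_by_resource(store, checkout_prod)
print(sorted(item.signal for item in matches))  # ['log', 'metric', 'trace']
```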

Correlation hook	Joins	Mechanism
Exemplars	Metrics → Traces	trace_id/span_id appended to histogram samples in OpenMetrics
Trace context in LogRecord	Logs ↔ Traces	trace_id/span_id as first-class LogRecord fields, auto-populated by SDK
Shared resource attributes	All three	service.name, k8s.*, host.* identical across signals from the same process
Semantic conventions	All three	Standard attribute keys (http.*, db.*, rpc.*) used uniformly

Figure 2.6: Correlation hub — three standardized hooks unifying the signals

graph TB
  subgraph Signals
    M["Metrics<br/>Prometheus TSDB"]
    L["Logs<br/>Loki / Elasticsearch"]
    T["Traces<br/>Tempo / Jaeger"]
  end
  H["Shared Resource<br/>service.name<br/>k8s.pod.name<br/>deployment.environment"]
  EX["Exemplars<br/>trace_id appended<br/>to histogram samples"]
  TC["LogRecord trace context<br/>top-level trace_id / span_id<br/>auto-populated by SDK"]
  M -- "drill via" --- EX
  EX -- "to" --- T
  T -- "join via" --- TC
  TC -- "to" --- L
  M --- H
  L --- H
  T --- H
  H --- Q["query: service.name=checkout<br/>returns metrics + logs + traces"]

Key Points — Correlation

Exemplars attach trace_id/span_id to histogram samples in OpenMetrics format without making them series-defining labels — stored in a separate path to avoid cardinality explosion.
LogRecord has trace_id/span_id as top-level fields — not attributes — so trace-to-logs is a single join key.
OpenTelemetry SDKs auto-inject the current span context into LogRecords when logging inside an active span; you do not pass it manually.
Shared Resource (service.name, k8s.pod.name, deployment.environment) means one filter returns all three signals from the same origin.
Sampling must be coordinated: head-sampling traces but unconditionally logging breaks log-to-trace links; prefer tail-based sampling or propagate the decision early.

Part Two — Post-Reading Quiz

Post-Reading Quiz — Traces & Correlation

6. Why does an OpenTelemetry span carry both a span_id and a parent span_id in addition to the shared trace_id?

Because trace_id is too long to fit in HTTP headers.

To explicitly encode parent-child causality as a directed graph, rather than inferring causal order from clock timestamps that may drift between hosts.

To allow each service to rename the trace before forwarding it.

Parent span_id is optional and only used by Jaeger.

7. A single HTTP call from checkout-service to inventory-service produces how many spans, and what kinds, under OpenTelemetry semantic conventions?

One INTERNAL span on the caller only.

Two spans: a CLIENT span on checkout-service and a SERVER span on inventory-service, sharing trace_id with a parent-child link.

Two SERVER spans, one on each side.

Three spans: CLIENT, SERVER, and a routing span on the network layer.

8. Why are exemplars stored in a separate path from normal Prometheus time series rather than as additional labels on the metric?

Because Prometheus requires alphabetical ordering of labels and trace_ids break that ordering.

Because trace_id has effectively unbounded cardinality; storing it as a series-defining label would multiply time series per unique trace and crush the TSDB.

Because exemplars are encrypted while metric samples are not.

Because Grafana cannot read trace_ids that live alongside other labels.

9. Your team head-samples traces at 10% but logs unconditionally on every request, with trace_id injected into LogRecords automatically. What is the predictable consequence?

Tempo will reject the over-sampled log volume.

Roughly 90% of trace_id values in logs will point to traces that were never stored, breaking "jump from log to trace" links.

Log timestamps will drift relative to trace timestamps.

Loki's label index will exceed Elasticsearch's, increasing cost.

10. Why does a shared OpenTelemetry Resource (e.g. service.name = checkout-service) matter more than any single correlation hook between metrics, logs, and traces?

Because Resource attributes are encrypted end-to-end.

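Because once metrics, logs, and traces all carry the same Resource attributes, a single query like service.name = checkout-service returns matching telemetry of all three signals with no per-tool tag mapping.

Because Resource attributes are required to compute histogram_quantile().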

Because Resource attributes replace the need for trace_id propagation.
