Chapter 2: The Three Signals: Metrics, Logs, and Traces in Depth

Learning Objectives

Chapter 1 introduced observability as a property that emerges from three complementary signals. This chapter goes one level deeper: we dissect the wire format of each signal, examine the cardinality math that drives cost, and show how exemplars and W3C Trace Context turn three independent data streams into a single navigable investigative narrative.

Figure 2.1: The three signals and their correlation hooks

graph TD
  M["Metrics<br/>counts & aggregates<br/>'how many / how fast'"]
  L["Logs<br/>discrete events<br/>'what happened'"]
  T["Traces<br/>causal span tree<br/>'why was it slow'"]
  M -- "exemplars<br/>(trace_id pointers)" --- T
  T -- "trace_id / span_id<br/>in LogRecord" --- L
  M -- "shared Resource<br/>(service.name, k8s.*)" --- L
  Hub(("Unified<br/>investigation"))
  M --> Hub
  L --> Hub
  T --> Hub

Part One — Metrics & Logs (Pre-Quiz)

Take this short quiz before reading sections 1 and 2 below. The same quiz appears again after the content so you can see your improvement.

Pre-Reading Quiz — Metrics & Logs

1. You operate a service across 10 pods. Each pod exposes a request_duration_seconds summary with a quantile="0.99" series. What is the safest way to get a meaningful fleet-wide p99?

Take the avg() of the ten {quantile="0.99"} series across pods.
Take the max() of the ten {quantile="0.99"} series across pods.
You cannot derive a true fleet-wide p99 from summaries; switch the metric to a histogram and aggregate buckets first.
Sum _sum and _count across pods and call histogram_quantile(0.99, ...) on the result.

2. A metric http_requests_total currently has 5 methods, 50 paths, 6 status buckets, 20 services, and 4 regions as labels. A teammate proposes adding user_id as a label "for easy debugging." Why is this dangerous?

Prometheus does not allow more than 5 labels per metric.
Each unique label combination creates a new time series; user_id can multiply 120,000 series by every active user, exploding storage cost.
User IDs cannot be expressed as strings in Prometheus labels.
It would break the rate() function.

3. Which question can a counter answer cleanly (combined with PromQL) that a gauge cannot?

"How much memory is in use right now?"
"What is the request rate per second over the last 5 minutes?"
"What is the current queue depth?"
"What is the current temperature in Celsius?"

4. The OpenTelemetry LogRecord defines a normalized severity_number field. What problem does this primarily solve?

It reduces the on-disk size of log files compared to plain text.
It lets a single query like severity_number >= 17 mean "ERROR or worse" no matter which framework (Java, Python, Go, .NET) emitted the log.
It eliminates the need for a body field.
It encrypts log content so unauthorized backends cannot read it.

5. Your team relies entirely on structured logs (no traces) to debug a microservices checkout flow. A request times out across three services. Why is logs-alone an architecturally weak strategy here?

JSON logs are slower to write than plain text, so logs lag the request.
Log levels above WARN are dropped by most collectors by default.
Logs are emitted per service and do not encode parent/child causality; you must guess causal order from timestamps and propagate request IDs perfectly through every boundary.
Structured logs cannot be indexed by Elasticsearch or Loki.

1. Metrics — Numbers Over Time

A metric is a numeric measurement of a system property recorded at a point in time. Metrics are the cheapest of the three signals because the storage system records aggregates, not individual events. A counter incremented one billion times occupies the same space as one incremented ten times — both are just a current value plus a timestamp series.

Think of metrics as the dashboard gauges in a car: speed, RPM, fuel level. They tell you the state of the system at a glance, but they don't tell you why the engine started knocking three miles back.

Counters, gauges, histograms, and summaries

Animation: Counter, Gauge, Histogram
Counter: requests_total = 1,284,902 (monotonic)
Gauge: queue_depth = 42 (can rise or fall)
Histogram: le=0.1 .. 1s buckets, p99 = 540ms
Counters only climb. Gauges swing both ways. Histograms accumulate into le=<bound> buckets so quantiles can be computed at query time.
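Because a counter only climbs, a backend can turn it into a rate by differencing successive samples. The following is a minimal Python sketch of that idea, not PromQL's actual implementation: the function name `simple_rate` and the sample data are hypothetical, and the real rate() additionally extrapolates to the edges of the query window.

```python
def simple_rate(samples):
    """Per-second increase of a counter over the sampled window.

    samples: list of (unix_seconds, counter_value) pairs, oldest first.
    A decrease between neighbouring samples is treated as a counter reset
    (for example a process restart).
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the whole
        # current value counts as new increase.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical scrape results: the process restarted before t=120.
samples = [(0, 100), (60, 160), (120, 20), (180, 80)]
print(simple_rate(samples))
```

A gauge offers no equivalent: a drop in a gauge is a legitimate value, so there is no way to tell a reset from a real decrease.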

Histogram vs summary — the one trap to remember

The critical difference: summary quantiles cannot be aggregated across instances. Averaging the p99 from each of ten pods does not give a true global p99. You can safely sum only _sum and _count from a summary. Histogram buckets, in contrast, are fully aggregatable — sum() buckets across instances and then call histogram_quantile() to get meaningful service-wide tail latency.
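To make the trap concrete, here is a small Python sketch that compares the averaged per-pod p99 with the p99 computed from merged buckets. The bucket bounds and counts are invented for illustration, and `histogram_quantile` below is a simplified stand-in for the PromQL function (linear interpolation inside the target bucket), not its exact implementation.

```python
import math

# Hypothetical `le` bucket upper bounds in seconds; the last one is +Inf.
BOUNDS = [0.1, 0.25, 0.5, 1.0, math.inf]


def histogram_quantile(q, bounds, cumulative):
    """Estimate quantile q from cumulative bucket counts."""
    total = cumulative[-1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, count in enumerate(cumulative):
        if count >= rank:
            if math.isinf(bounds[i]):
                return bounds[i - 1]  # fall back to the highest finite bound
            lower = bounds[i - 1] if i > 0 else 0.0
            below = cumulative[i - 1] if i > 0 else 0
            if count == below:
                return lower
            # Linear interpolation inside the bucket that holds the rank.
            return lower + (bounds[i] - lower) * (rank - below) / (count - below)
    return math.nan


# Two pods with very different traffic: pod_a is busy and fast,
# pod_b is quiet and slow (cumulative counts per bucket).
pod_a = [900, 990, 1000, 1000, 1000]
pod_b = [0, 10, 50, 100, 100]

per_pod = [histogram_quantile(0.99, BOUNDS, pod) for pod in (pod_a, pod_b)]
averaged = sum(per_pod) / len(per_pod)

merged = [a + b for a, b in zip(pod_a, pod_b)]  # the sum() step
fleet = histogram_quantile(0.99, BOUNDS, merged)

print(f"average of per-pod p99s: {averaged:.2f}s")
print(f"p99 from merged buckets: {fleet:.2f}s")
```

With these made-up numbers the averaged figure lands near 0.62 s while the merged-bucket estimate is near 0.89 s, because averaging gives the quiet pod the same weight as the busy one.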

| Aspect | Counter | Gauge | Histogram | Summary |
|---|---|---|---|---|
| Direction | Up only | Any | Distribution | Distribution |
| Quantile compute | N/A | N/A | Server-side (PromQL) | Client-side (in process) |
| Cross-instance aggregation | Trivial | Trivial | Yes (sum buckets) | Only _sum/_count |
| Cardinality per metric | 1 series | 1 series | N buckets + 2 | N quantiles + 2 |

Classic histograms have a drawback: every bucket is its own time series, so bucket count multiplies cardinality. Native histograms (introduced as an experimental feature in Prometheus 2.40) encode dynamic, log-spaced buckets inside a single time series, dramatically reducing series count while supporting high resolution.

Cardinality — the silent budget killer

A Prometheus time series is uniquely identified by its metric name plus the set of label key/value pairs:

http_requests_total{method="GET", path="/api/orders", status="200", service="checkout"}

Each unique combination of labels creates a new series. The math is brutal. Consider one metric with these labels: method (5 values), path (50), status (6), service (20), and region (4).

Total potential series: 5 × 50 × 6 × 20 × 4 = 120,000 for one metric. Add user_id with 10,000 unique values and you reach 1.2 billion potential series. This is why veterans say "never label by user ID, request ID, or trace ID." It's also why exemplars exist — they reference a trace ID without making it a series-defining label.
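The arithmetic above is easy to reproduce. A minimal Python sketch, assuming the label names `method`, `path`, `status`, `service`, and `region` for the five dimensions (the helper name `potential_series` is hypothetical):

```python
from math import prod

# Distinct values per label; the label names are assumed for illustration.
label_values = {"method": 5, "path": 50, "status": 6, "service": 20, "region": 4}


def potential_series(labels):
    """Upper bound on series count: the product of per-label value counts."""
    return prod(labels.values())


base = potential_series(label_values)
with_user = potential_series({**label_values, "user_id": 10_000})

print(f"without user_id: {base:,}")
print(f"with user_id:    {with_user:,}")
```

This is a worst-case bound, since not every combination occurs in practice, but it is the number to budget against when reviewing a proposed label.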

Key Points — Metrics

2. Logs — Structured Events

A log is a discrete event record emitted by an application at a specific moment. Where a metric says "23,481 requests in the last minute," a log says "request #382749 from user 42 to /api/orders failed at 10:03:21 with NullPointerException." Logs preserve the individual story that metrics aggregate away.

Structured vs unstructured logging

For decades, application logs looked like this:

2024-05-10T10:00:00Z my-host app: ERROR user 42 not found

This is unstructured logging. Humans can read it; computers struggle. Searching for "all errors involving user 42" forces a downstream tool to parse the string with regex and hope every team used the same format — an information-destruction pipeline.

Structured logging flips this around: the application emits machine-readable key/value records from the start.

{
  "ts": "2024-05-10T10:00:00Z",
  "level": "info",
  "service": "billing",
  "request_id": "abc-123",
  "user_id": 42,
  "msg": "charged credit card",
  "amount": 99.99
}
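A record like the one above can be produced with nothing but the Python standard library. This is a minimal sketch under stated assumptions: the `JsonFormatter` class and the `fields` key passed through `extra` are hypothetical conventions chosen for this example, not part of any logging standard.

```python
import io
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    converter = time.gmtime  # format timestamps in UTC

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Structured fields travel as data, never interpolated into the message.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("billing")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info(
    "charged credit card",
    extra={"fields": {"service": "billing", "request_id": "abc-123",
                      "user_id": 42, "amount": 99.99}},
)

event = json.loads(stream.getvalue())
print(event["user_id"], event["msg"])
```

Downstream tools can now filter on `user_id` or `request_id` as fields instead of parsing them out of a message string with regex.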

OpenTelemetry formalizes this with a typed LogRecord schema. Key fields:

The genius of severity_number is normalization: a backend query severity_number >= 17 means "ERROR or worse" whether the source used Python's logging.ERROR, Java's ERROR, Go's log.Error, or .NET's LogLevel.Error.
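The normalization can be sketched in a few lines of Python. The OpenTelemetry ranges are 1-4 TRACE, 5-8 DEBUG, 9-12 INFO, 13-16 WARN, 17-20 ERROR, and 21-24 FATAL; the two mapping tables below are illustrative assumptions about how a bridge might translate native levels, not a copy of any SDK's table.

```python
import logging

# Illustrative mappings from native levels to OpenTelemetry severity_number.
PYTHON_TO_OTEL = {
    logging.DEBUG: 5,
    logging.INFO: 9,
    logging.WARNING: 13,
    logging.ERROR: 17,
    logging.CRITICAL: 21,
}
DOTNET_TO_OTEL = {
    "Trace": 1, "Debug": 5, "Information": 9,
    "Warning": 13, "Error": 17, "Critical": 21,
}

# Hypothetical records from two services written in different languages.
records = [
    {"service": "billing", "severity_text": "WARNING",
     "severity_number": PYTHON_TO_OTEL[logging.WARNING]},
    {"service": "billing", "severity_text": "ERROR",
     "severity_number": PYTHON_TO_OTEL[logging.ERROR]},
    {"service": "orders", "severity_text": "Critical",
     "severity_number": DOTNET_TO_OTEL["Critical"]},
    {"service": "orders", "severity_text": "Information",
     "severity_number": DOTNET_TO_OTEL["Information"]},
]


def error_or_worse(items):
    """One numeric predicate, regardless of the emitting framework."""
    return [r for r in items if r["severity_number"] >= 17]


for r in error_or_worse(records):
    print(r["service"], r["severity_text"], r["severity_number"])
```

The severity_text values differ ("ERROR" versus "Critical"), yet the numeric comparison selects both without any per-framework string matching.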

Figure 2.2: OpenTelemetry LogRecord schema

graph TD
  LR["LogRecord"]
  LR --> TS["timestamp<br/>application time"]
  LR --> OTS["observed_timestamp<br/>pipeline time"]
  LR --> SEV["Severity"]
  SEV --> SN["severity_number<br/>1-24 normalized"]
  SEV --> ST["severity_text<br/>'WARN', 'Warning'..."]
  LR --> B["body : AnyValue<br/>string / num / map / list"]
  LR --> A["attributes<br/>http.method, db.system..."]
  LR --> R["resource<br/>service.name, k8s.*, host.*"]
  LR --> TC["Trace Context"]
  TC --> TID["trace_id"]
  TC --> SID["span_id"]
  TC --> TF["trace_flags"]
  LR --> IS["instrumentation_scope"]
  LR --> DA["dropped_attributes_count"]

Indexing strategies and pipelines

A typical pipeline: application writes structured logs to stdout → node agent (Fluent Bit, Vector, OTel Collector) enriches with Kubernetes metadata and ships → aggregator buffers/samples → storage backend indexes for search.

Elasticsearch-style full-text indexing lets you grep any word in any log instantly, but the index can be larger than the data itself. Loki-style label-only indexing keeps a tiny index over a handful of labels (service, pod, namespace) and stores raw log lines compressed; queries grep through chunks — cheaper to operate but slower for arbitrary substring searches.
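The trade-off of label-only indexing can be illustrated with a toy in-memory store. This Python sketch is purely conceptual: the class `LabelIndexedStore` is hypothetical and ignores compression, chunking by time, and everything else a real backend does.

```python
from collections import defaultdict


class LabelIndexedStore:
    """Toy log store: only label sets are indexed; lines are kept raw."""

    def __init__(self):
        self._chunks = defaultdict(list)  # frozen label set -> raw lines

    def append(self, labels, line):
        self._chunks[frozenset(labels.items())].append(line)

    def query(self, selector, needle):
        wanted = set(selector.items())
        hits = []
        for label_set, lines in self._chunks.items():
            if wanted <= label_set:  # cheap: index lookup on labels only
                # Expensive: brute-force substring scan of every raw line.
                hits.extend(line for line in lines if needle in line)
        return hits


store = LabelIndexedStore()
store.append({"service": "checkout", "namespace": "payments"},
             "called inventory: timeout after 5s")
store.append({"service": "checkout", "namespace": "payments"},
             "charged credit card")
store.append({"service": "inventory", "namespace": "payments"},
             "received call, query took 4.8s")

print(store.query({"service": "checkout"}, "timeout"))
```

The index stays tiny because it grows with the number of label sets rather than the number of words, but every substring search pays for a scan of the selected chunks.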

| Aspect | OpenTelemetry Logs | Fluentd / Logstash |
|---|---|---|
| Data model | Standardized LogRecord schema | No global standard; per-pipeline JSON |
| Severity | Normalized severity_number + severity_text | String, ad-hoc normalization |
| Service context | Built-in Resource shared with traces/metrics | Plugin-specific conventions |
| Trace correlation | First-class trace_id/span_id fields | Manual injection + custom parsing |
| Multi-signal | Logs, traces, metrics share Resource | Logs only |

Why logs alone cannot do root-cause analysis

A checkout request times out. The checkout-service log says "called inventory: timeout after 5s." The inventory-service log says "received call, query took 4.8s." The postgres log says "lock wait." Three lines, three streams, one causal chain, but the chain is implicit. Reconstructing it requires propagating a request ID through every boundary (most teams miss at least one), aligning clocks that drift between hosts, and guessing the causal order from timestamps that do not capture parent/child relationships.

This is exactly the problem traces solve. Logs complement traces by carrying rich per-event context; they do not replace the causal graph.

Key Points — Logs

Part One — Post-Reading Quiz

Same questions, now with the reading behind you.

Post-Reading Quiz — Metrics & Logs

1. You operate a service across 10 pods. Each pod exposes a request_duration_seconds summary with a quantile="0.99" series. What is the safest way to get a meaningful fleet-wide p99?

Take the avg() of the ten {quantile="0.99"} series across pods.
Take the max() of the ten {quantile="0.99"} series across pods.
You cannot derive a true fleet-wide p99 from summaries; switch the metric to a histogram and aggregate buckets first.
Sum _sum and _count across pods and call histogram_quantile(0.99, ...) on the result.

2. A metric http_requests_total currently has 5 methods, 50 paths, 6 status buckets, 20 services, and 4 regions as labels. A teammate proposes adding user_id as a label "for easy debugging." Why is this dangerous?

Prometheus does not allow more than 5 labels per metric.
Each unique label combination creates a new time series; user_id can multiply 120,000 series by every active user, exploding storage cost.
User IDs cannot be expressed as strings in Prometheus labels.
It would break the rate() function.

3. Which question can a counter answer cleanly (combined with PromQL) that a gauge cannot?

"How much memory is in use right now?"
"What is the request rate per second over the last 5 minutes?"
"What is the current queue depth?"
"What is the current temperature in Celsius?"

4. The OpenTelemetry LogRecord defines a normalized severity_number field. What problem does this primarily solve?

It reduces the on-disk size of log files compared to plain text.
It lets a single query like severity_number >= 17 mean "ERROR or worse" no matter which framework (Java, Python, Go, .NET) emitted the log.
It eliminates the need for a body field.
It encrypts log content so unauthorized backends cannot read it.

5. Your team relies entirely on structured logs (no traces) to debug a microservices checkout flow. A request times out across three services. Why is logs-alone an architecturally weak strategy here?

JSON logs are slower to write than plain text, so logs lag the request.
Log levels above WARN are dropped by most collectors by default.
Logs are emitted per service and do not encode parent/child causality; you must guess causal order from timestamps and propagate request IDs perfectly through every boundary.
Structured logs cannot be indexed by Elasticsearch or Loki.

Part Two — Traces & Correlation (Pre-Quiz)

One more pre-quiz before the second half. The same five questions appear again afterward.

Pre-Reading Quiz — Traces & Correlation

6. Why does an OpenTelemetry span carry both a span_id and a parent span_id in addition to the shared trace_id?

Because trace_id is too long to fit in HTTP headers.
To explicitly encode parent-child causality as a directed graph, rather than inferring causal order from clock timestamps that may drift between hosts.
To allow each service to rename the trace before forwarding it.
Parent span_id is optional and only used by Jaeger.

7. A single HTTP call from checkout-service to inventory-service produces how many spans, and what kinds, under OpenTelemetry semantic conventions?

One INTERNAL span on the caller only.
Two spans: a CLIENT span on checkout-service and a SERVER span on inventory-service, sharing trace_id with a parent-child link.
Two SERVER spans, one on each side.
Three spans: CLIENT, SERVER, and a routing span on the network layer.

8. Why are exemplars stored in a separate path from normal Prometheus time series rather than as additional labels on the metric?

Because Prometheus requires alphabetical ordering of labels and trace_ids break that ordering.
Because trace_id has effectively unbounded cardinality; storing it as a series-defining label would multiply time series per unique trace and crush the TSDB.
Because exemplars are encrypted while metric samples are not.
Because Grafana cannot read trace_ids that live alongside other labels.

9. Your team head-samples traces at 10% but logs unconditionally on every request, with trace_id injected into LogRecords automatically. What is the predictable consequence?

Tempo will reject the over-sampled log volume.
Roughly 90% of trace_id values in logs will point to traces that were never stored, breaking "jump from log to trace" links.
Log timestamps will drift relative to trace timestamps.
Loki's label index will exceed Elasticsearch's, increasing cost.

10. Why does a shared OpenTelemetry Resource (e.g. service.name = checkout-service) matter more than any single correlation hook between metrics, logs, and traces?

Because Resource attributes are encrypted end-to-end.
Because once metrics, logs, and traces all carry the same Resource attributes, a single query like service.name = checkout-service returns matching telemetry of all three signals with no per-tool tag mapping.
Because Resource attributes are required to compute histogram_quantile().
Because Resource attributes replace the need for trace_id propagation.

3. Traces — Causally-linked Spans

A trace is the recorded journey of a single request through a distributed system. If metrics are the dashboard and logs are the event log, traces are the GPS track — not just that the trip happened but the exact route, which segments were slow, and where the detours were.

Trace, span, and span context

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             version (00) - trace_id (32 hex chars) - span_id of the caller (16 hex chars) - flags (01 = sampled)

When checkout-service calls inventory-service, it copies its current span context into a traceparent header. The receiver reads it, knows the parent's trace_id and span_id, and starts a child span under the same trace. This is how one trace stitches itself together across N services with no central coordinator.
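The receiving side of that handshake can be sketched in Python. This is a simplified illustration of W3C Trace Context handling, not the OpenTelemetry propagator itself: the function names `parse_traceparent` and `start_child_span` are hypothetical, and tracestate handling and version negotiation are omitted.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Split a traceparent header into its four fields."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace_id or span_id is invalid")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def start_child_span(incoming_header):
    """Create a child span under the caller's trace and build the header
    to send on any further outgoing call."""
    ctx = parse_traceparent(incoming_header)
    span = {
        "trace_id": ctx["trace_id"],            # inherited unchanged
        "span_id": secrets.token_hex(8),        # fresh 16-hex-char id
        "parent_span_id": ctx["parent_id"],     # explicit causal link
    }
    flags = "01" if ctx["sampled"] else "00"
    outgoing = f"00-{span['trace_id']}-{span['span_id']}-{flags}"
    return span, outgoing


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
span, outgoing = start_child_span(incoming)
print(span)
print(outgoing)
```

Note that the trace_id never changes across hops; only the span_id position in the header is rewritten, which is what lets each service link itself to its caller without a central coordinator.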

Animation: Span Tree Expansion (checkout request)
SERVER  POST /checkout (a1) 980ms
  CLIENT  inventory.check (b2, parent=a1) 620ms
    SERVER  GET /inventory (c3, parent=b2) 600ms
      DB  db.query SELECT stock (d4) 540ms
  CLIENT  payment.charge (e5, parent=a1) 320ms
    SERVER  POST /charge (f6, parent=e5) 300ms
All six spans share one trace_id. Each non-root span points to its parent's span_id, forming an explicit causal tree. The db.query span (540ms) dominates the request; this is the span a flame graph would highlight as the root cause.
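The "which span dominates" question can be answered mechanically by computing each span's self time, that is, its duration minus the time spent in its children. The Python sketch below uses the six spans from the example; the parent of d4 is not stated in the original and is assumed here to be c3, and the calculation assumes children of a span do not overlap in time.

```python
from collections import defaultdict

# (span_id, parent_span_id, name, duration_ms)
SPANS = [
    ("a1", None, "POST /checkout", 980),
    ("b2", "a1", "inventory.check", 620),
    ("c3", "b2", "GET /inventory", 600),
    ("d4", "c3", "db.query SELECT stock", 540),  # parent assumed to be c3
    ("e5", "a1", "payment.charge", 320),
    ("f6", "e5", "POST /charge", 300),
]


def self_times(spans):
    """Duration of each span minus the total duration of its direct children."""
    child_total = defaultdict(int)
    for _, parent, _, duration in spans:
        if parent is not None:
            child_total[parent] += duration
    return {sid: duration - child_total[sid] for sid, _, _, duration in spans}


times = self_times(SPANS)
hottest = max(times, key=times.get)
names = {sid: name for sid, _, name, _ in SPANS}

for sid, ms in sorted(times.items(), key=lambda kv: -kv[1]):
    print(f"{sid} {names[sid]:<24} self={ms}ms")
print("dominant span:", hottest)
```

The root span is the longest by wall-clock duration, but almost all of that time is inherited from descendants; ranking by self time is what surfaces the database query.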

Span kinds and auto-derived service maps

OpenTelemetry classifies spans by kind: SERVER, CLIENT, PRODUCER, CONSUMER, and INTERNAL.

A typical HTTP call produces two spans: a CLIENT span on the caller and a SERVER span on the callee, linked by the same trace_id and a parent-child relationship. Because every cross-service call produces a tagged CLIENT/SERVER pair, a backend can build a live service map automatically: services are nodes, observed parent-child cross-service relationships are edges, edge weight = call rate, edge color = error rate or latency.
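The derivation of edges from span pairs can be sketched as follows. This Python example is a simplified illustration with a hypothetical span record shape (`span_id`, `parent_id`, `service`, `kind`, `error`); real backends also aggregate over time windows and compute latency percentiles per edge.

```python
from collections import defaultdict


def service_edges(spans):
    """Count calls and errors per (caller, callee) pair.

    An edge exists wherever a SERVER span's parent is a CLIENT span
    that belongs to a different service.
    """
    by_id = {s["span_id"]: s for s in spans}
    edges = defaultdict(lambda: {"calls": 0, "errors": 0})
    for span in spans:
        parent = by_id.get(span["parent_id"])
        if (
            span["kind"] == "SERVER"
            and parent is not None
            and parent["kind"] == "CLIENT"
            and parent["service"] != span["service"]
        ):
            edge = edges[(parent["service"], span["service"])]
            edge["calls"] += 1
            edge["errors"] += 1 if span.get("error") else 0
    return dict(edges)


# Hypothetical spans from one checkout trace.
spans = [
    {"span_id": "a1", "parent_id": None, "service": "checkout-service", "kind": "SERVER"},
    {"span_id": "b2", "parent_id": "a1", "service": "checkout-service", "kind": "CLIENT"},
    {"span_id": "c3", "parent_id": "b2", "service": "inventory-service", "kind": "SERVER"},
    {"span_id": "e5", "parent_id": "a1", "service": "checkout-service", "kind": "CLIENT"},
    {"span_id": "f6", "parent_id": "e5", "service": "payment-service", "kind": "SERVER",
     "error": True},
]

for (caller, callee), stats in service_edges(spans).items():
    print(f"{caller} -> {callee}: {stats}")
```

Dividing `calls` by the window length gives the edge weight (call rate), and `errors / calls` gives the error rate used for coloring.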

Figure 2.4: Service map auto-derived from CLIENT→SERVER span pairs

graph LR
  Web["web-frontend"]
  Co["checkout-service"]
  Inv["inventory-service"]
  Pay["payment-service"]
  PG[("postgres")]
  Rd[("redis")]
  Web -- "1200 rps<br/>err 0.1%" --> Co
  Co -- "1100 rps<br/>err 0.4%" --> Inv
  Co -- "900 rps<br/>err 2.1% (hot)" --> Pay
  Inv -- "1100 rps<br/>p99 540ms" --> PG
  Co -- "1200 rps<br/>p99 5ms" --> Rd
  Pay -- "900 rps<br/>err 1.8%" --> PG

Key Points — Traces

4. Correlating the Three

Each signal in isolation is useful; together they become an investigative narrative. The mechanics of correlation — how a dot on a Grafana chart links to a trace, which in turn surfaces the relevant log lines — rely on three standardized hooks.

Exemplars in Prometheus histograms

The naive way to link a latency spike to a slow request is to put trace_id in the metric labels — which blows up cardinality. Exemplars solve this by attaching a few representative trace_id/span_id pointers alongside metric samples without making them series-defining labels.

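The OpenMetrics text exposition format appends an exemplar after the sample, introduced by #. The exemplar carries its own label set, the observed value, and an optional timestamp. In the example below the 0.43 s observation is illustrative, and the whole entry is a single line on the wire:

http_server_request_duration_seconds_bucket{le="0.5",method="GET",service="api"} 240 1700000100 # {trace_id="4bf92f3577b34da6a3ce929d0e0e4736",span_id="00f067aa0ba902b7"} 0.43 1700000099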
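To show how a consumer separates the series identity from the exemplar, here is a small Python parser for one exposition line. It is a sketch for illustration only: the regular expressions cover just this simple shape (no escaped quotes, no metrics without labels), the function name `parse_sample` is hypothetical, and the exemplar value of 0.43 seconds is an assumed observation.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'\{(?P<labels>[^}]*)\}'
    r' (?P<value>[^ #]+)'
    r'(?: (?P<ts>[^ #]+))?'
    r'(?: # \{(?P<ex_labels>[^}]*)\} (?P<ex_value>[^ ]+)(?: (?P<ex_ts>[^ ]+))?)?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Parse one sample line into series labels, value, and optional exemplar."""
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError("unsupported line shape")
    sample = {
        "name": match["name"],
        "labels": dict(LABEL_RE.findall(match["labels"])),  # series identity
        "value": float(match["value"]),
        "exemplar": None,
    }
    if match["ex_labels"] is not None:
        ex_labels = dict(LABEL_RE.findall(match["ex_labels"]))
        # OpenMetrics caps the combined length of exemplar label names
        # and values at 128 characters.
        if sum(len(k) + len(v) for k, v in ex_labels.items()) > 128:
            raise ValueError("exemplar label set too long")
        sample["exemplar"] = {"labels": ex_labels, "value": float(match["ex_value"])}
    return sample


line = (
    'http_server_request_duration_seconds_bucket'
    '{le="0.5",method="GET",service="api"} 240 1700000100'
    ' # {trace_id="4bf92f3577b34da6a3ce929d0e0e4736",span_id="00f067aa0ba902b7"}'
    ' 0.43 1700000099'
)
sample = parse_sample(line)
print(sorted(sample["labels"]))
print(sample["exemplar"]["labels"]["trace_id"])
```

The key observation is that trace_id appears only in the exemplar's label set, never in `sample["labels"]`, so it cannot create a new time series.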

Constraints worth memorizing:

Animation: Metric → Trace → Log correlation jump
Metric (Prometheus): p99 latency spike at 10:05 from histogram_quantile(0.99, ...); exemplar trace_id = 4bf92f3577b34da6...
Trace (Tempo): GET trace by id → checkout → inventory → db; db.query span = 2.1s (lock wait)
Log (Loki): {service="checkout"} |= trace_id returns ERROR row-level lock, trace_id=4bf92f35... span_id=d4
The same trace_id appears in all three signals. The exemplar takes you from metric dot → trace; the trace_id field in the LogRecord takes you from trace span → logs. A shared service.name binds everything to one origin.

Figure 2.5: Metric-to-trace exemplar drill-down workflow

sequenceDiagram
  participant Dev as Developer
  participant Graf as Grafana
  participant Prom as Prometheus<br/>(TSDB + exemplar store)
  participant Tempo as Tempo
  Dev->>Graf: open p99 latency dashboard
  Graf->>Prom: histogram_quantile(0.99, ...)
  Prom-->>Graf: latency series + exemplar dots
  Note over Graf: spike at 10:05<br/>dot shows trace_id=4bf92f35...
  Dev->>Graf: hover dot, click "View trace"
  Graf->>Tempo: GET /traces/4bf92f35...
  Tempo-->>Graf: span tree (checkout → inventory → db)
  Note over Dev,Tempo: db.query span = 2.1s<br/>(lock wait identified)

Trace-to-logs joins via trace_id and span_id

The second leg is trace-to-logs, made possible because the OpenTelemetry LogRecord schema reserves dedicated trace_id and span_id fields at the top level — not buried in attributes. When an application logs inside an active span, the OpenTelemetry logging integration automatically copies the current span's trace_id and span_id into the LogRecord. In Java with log4j2, in Python with the OTel logging instrumentation, in Go with otelslog — the SDK does it for you.

From the trace view, "show logs for this span" reduces to a single LogQL filter on the trace ID:

{service="checkout"} |= "4bf92f3577b34da6a3ce929d0e0e4736"

Sampling pitfall: if you head-sample traces at 10% but log unconditionally, roughly 90% of the trace_ids in your logs point to traces that were never stored in Tempo. Use tail-based sampling (decide after the trace completes, always keeping errors and slow traces), or make the head-sampling decision once at the entry point and propagate it in the trace flags, so that unsampled requests can be recognized and do not leave dangling IDs in logs.
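One way to keep log-to-trace links honest under head sampling is to record the sampled flag on every log record and only offer a trace link when it is set. The Python sketch below illustrates this under stated assumptions: `head_sample` is a simplified ratio sampler keyed on the low 64 bits of the trace_id (similar in spirit to, but not a copy of, OpenTelemetry's ratio-based sampler), and the record shape is hypothetical.

```python
def head_sample(trace_id, ratio):
    """Deterministic head sampling: every service that sees the same
    trace_id reaches the same decision."""
    return int(trace_id[-16:], 16) < int(ratio * 2**64)


def make_log_record(body, trace_id, span_id, ratio=0.10):
    sampled = head_sample(trace_id, ratio)
    return {
        "body": body,
        "trace_id": trace_id,
        "span_id": span_id,
        "trace_flags": 0x01 if sampled else 0x00,  # bit 0 = sampled
    }


def has_stored_trace(record):
    """Only link to a trace when the sampled bit says it was kept."""
    return bool(record["trace_flags"] & 0x01)


# Two hypothetical trace ids: one falls inside the 10% range, one outside.
kept = make_log_record("lock wait", "4bf92f3577b34da6" + "0000000000000001", "00f067aa0ba902b7")
dropped = make_log_record("lock wait", "4bf92f3577b34da6" + "ffffffffffffffff", "00f067aa0ba902b7")

for record in (kept, dropped):
    label = "View trace" if has_stored_trace(record) else "(trace not sampled)"
    print(record["trace_id"][-4:], label)
```

This does not recover the missing traces; it only prevents the UI from promising a jump that will fail. Tail-based sampling is the option that actually keeps the interesting traces.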

Unified resource attributes across all signals

The third hook is the OpenTelemetry Resource — a small, fixed set of attributes describing where the telemetry came from, attached to every metric, log, and trace emitted by a process:

service.name = "checkout-service"
service.namespace = "payments"
service.instance.id = "pod-abc123"
service.version = "1.4.2"
deployment.environment = "prod"
k8s.pod.name = "checkout-7f8b9-abcde"
k8s.namespace.name = "payments-prod"
host.name = "ip-10-0-1-42.ec2.internal"
cloud.region = "us-east-1"

Because all three signals share these attributes verbatim, the query "all telemetry for service.name = checkout-service in deployment.environment = prod" returns matching metrics, logs, and traces from a single filter — no vendor-specific tag mapping. Semantic conventions extend this consistency to operation-level attributes (http.request.method, db.system.name, rpc.service) so queries port across backends.
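The "single filter" claim can be illustrated with a toy in-memory example. In this Python sketch the three lists stand in for separate metric, log, and trace backends, and the `select` helper is hypothetical; the point is only that one predicate over Resource attributes works unchanged for every signal.

```python
CHECKOUT = {"service.name": "checkout-service", "deployment.environment": "prod"}
INVENTORY = {"service.name": "inventory-service", "deployment.environment": "prod"}

# Hypothetical telemetry; each item carries the Resource of its emitter.
telemetry = {
    "metrics": [
        {"resource": CHECKOUT, "name": "http_requests_total", "value": 1284902},
        {"resource": INVENTORY, "name": "http_requests_total", "value": 90210},
    ],
    "logs": [
        {"resource": CHECKOUT, "body": "called inventory: timeout after 5s"},
        {"resource": INVENTORY, "body": "received call, query took 4.8s"},
    ],
    "traces": [
        {"resource": CHECKOUT, "name": "POST /checkout"},
        {"resource": INVENTORY, "name": "GET /inventory"},
    ],
}


def select(data, wanted):
    """Apply one Resource filter to every signal type."""
    return {
        signal: [item for item in items if wanted.items() <= item["resource"].items()]
        for signal, items in data.items()
    }


result = select(telemetry, {"service.name": "checkout-service",
                            "deployment.environment": "prod"})
for signal, items in result.items():
    print(signal, len(items))
```

Without a shared Resource, the same question would need three different tag names and a mapping table maintained by hand.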

| Correlation hook | Joins | Mechanism |
|---|---|---|
| Exemplars | Metrics → Traces | trace_id/span_id appended to histogram samples in OpenMetrics |
| Trace context in LogRecord | Logs ↔ Traces | trace_id/span_id as first-class LogRecord fields, auto-populated by SDK |
| Shared resource attributes | All three | service.name, k8s.*, host.* identical across signals from the same process |
| Semantic conventions | All three | Standard attribute keys (http.*, db.*, rpc.*) used uniformly |

Figure 2.6: Correlation hub — three standardized hooks unifying the signals

graph TB
  subgraph Signals
    M["Metrics<br/>Prometheus TSDB"]
    L["Logs<br/>Loki / Elasticsearch"]
    T["Traces<br/>Tempo / Jaeger"]
  end
  H["Shared Resource<br/>service.name<br/>k8s.pod.name<br/>deployment.environment"]
  EX["Exemplars<br/>trace_id appended<br/>to histogram samples"]
  TC["LogRecord trace context<br/>top-level trace_id / span_id<br/>auto-populated by SDK"]
  M -- "drill via" --- EX
  EX -- "to" --- T
  T -- "join via" --- TC
  TC -- "to" --- L
  M --- H
  L --- H
  T --- H
  H --- Q["query: service.name=checkout<br/>returns metrics + logs + traces"]

Key Points — Correlation

Part Two — Post-Reading Quiz

Same five questions one more time.

Post-Reading Quiz — Traces & Correlation

6. Why does an OpenTelemetry span carry both a span_id and a parent span_id in addition to the shared trace_id?

Because trace_id is too long to fit in HTTP headers.
To explicitly encode parent-child causality as a directed graph, rather than inferring causal order from clock timestamps that may drift between hosts.
To allow each service to rename the trace before forwarding it.
Parent span_id is optional and only used by Jaeger.

7. A single HTTP call from checkout-service to inventory-service produces how many spans, and what kinds, under OpenTelemetry semantic conventions?

One INTERNAL span on the caller only.
Two spans: a CLIENT span on checkout-service and a SERVER span on inventory-service, sharing trace_id with a parent-child link.
Two SERVER spans, one on each side.
Three spans: CLIENT, SERVER, and a routing span on the network layer.

8. Why are exemplars stored in a separate path from normal Prometheus time series rather than as additional labels on the metric?

Because Prometheus requires alphabetical ordering of labels and trace_ids break that ordering.
Because trace_id has effectively unbounded cardinality; storing it as a series-defining label would multiply time series per unique trace and crush the TSDB.
Because exemplars are encrypted while metric samples are not.
Because Grafana cannot read trace_ids that live alongside other labels.

9. Your team head-samples traces at 10% but logs unconditionally on every request, with trace_id injected into LogRecords automatically. What is the predictable consequence?

Tempo will reject the over-sampled log volume.
Roughly 90% of trace_id values in logs will point to traces that were never stored, breaking "jump from log to trace" links.
Log timestamps will drift relative to trace timestamps.
Loki's label index will exceed Elasticsearch's, increasing cost.

10. Why does a shared OpenTelemetry Resource (e.g. service.name = checkout-service) matter more than any single correlation hook between metrics, logs, and traces?

Because Resource attributes are encrypted end-to-end.
Because once metrics, logs, and traces all carry the same Resource attributes, a single query like service.name = checkout-service returns matching telemetry of all three signals with no per-tool tag mapping.
Because Resource attributes are required to compute histogram_quantile().
Because Resource attributes replace the need for trace_id propagation.
