Chapter 9: Logs, Events, and Cross-Signal Correlation
Logs in OpenTelemetry are no longer a separate world — they are the third stable signal, with a portable LogRecord schema, mature Collector pipelines, and the all-important trace_id that glues every signal together.
Learning Objectives
Emit OpenTelemetry-compatible structured logs from applications using bridges or native SDKs.
Correlate logs to traces using trace_id and span_id without coupling code to a vendor.
Design a log pipeline that mixes the Collector, Loki, and existing log shippers.
Decide when information belongs in a LogRecord, a span event, or both.
Part 1 — Logs Data Model & Edge Collection
Pre-Quiz — Part 1 (sections 1 & 2)
1. Which field on an OpenTelemetry LogRecord exists specifically so you can diagnose lag introduced by the log pipeline itself?
2. A LogRecord's severity_number of 17 corresponds to which level range?
3. The OpenTelemetry Collector's filelog receiver is most appropriate when:
4. According to the chapter, which approach is recommended for Go and Node.js services where the OTel logs SDK is still experimental?
5. Which Collector processor is most often the difference between "logs I can search" and "logs I can join with metrics and traces" in Kubernetes?
1. OpenTelemetry Logs Data Model
If traces describe what one operation did and metrics describe what is true in aggregate, logs describe what happened, at this exact moment, in detail. OpenTelemetry's logs signal formalizes that intuition into a portable data structure so logs can be shipped, transformed, and queried the same way regardless of language or backend.
A LogRecord is the atomic unit of the logs signal. It lives inside a familiar hierarchy: a Resource (service, host, or pod) contains one or more InstrumentationScopes (the library that emitted the log), which in turn contain LogRecords.
Core fields of a LogRecord
Field | Purpose
timestamp | When the event actually occurred
observed_timestamp | When the collector first saw the event; used to detect pipeline lag
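The hierarchy and the two timestamps can be sketched with plain dataclasses. This is a minimal illustration rather than the OTel SDK: the class names mirror the data model, while the `pipeline_lag_ns` helper, the reduced field set, and the sample values are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class LogRecord:
    timestamp: int            # when the event occurred (ns since epoch)
    observed_timestamp: int   # when the collection system first saw it (ns)
    body: str
    attributes: dict = field(default_factory=dict)

    def pipeline_lag_ns(self) -> int:
        """Hypothetical helper: delay between occurrence and observation."""
        return self.observed_timestamp - self.timestamp


@dataclass
class InstrumentationScope:
    name: str                                    # library that emitted the logs
    log_records: list = field(default_factory=list)


@dataclass
class Resource:
    attributes: dict                             # e.g. service.name, host, pod
    scopes: list = field(default_factory=list)


# Resource -> InstrumentationScope -> LogRecord, as described above.
record = LogRecord(
    timestamp=1_700_000_000_000_000_000,
    observed_timestamp=1_700_000_002_500_000_000,
    body="payment declined",
)
scope = InstrumentationScope(name="payments.billing", log_records=[record])
resource = Resource(attributes={"service.name": "payments-api"}, scopes=[scope])

lag_seconds = record.pipeline_lag_ns() / 1e9
print(f"pipeline lag: {lag_seconds:.1f}s")
```

A large gap between the two timestamps points at the pipeline (buffering, backpressure, a stalled shipper) rather than at the application.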
Animation A — LogRecord anatomy: fields stagger into place
The trace_id and span_id fields are the correlation hooks tying this log to its trace.
Severity, body, and attributes
OpenTelemetry maps log levels onto a single 1–24 scale: TRACE (1–4), DEBUG (5–8), INFO (9–12), WARN (13–16), ERROR (17–20), FATAL (21–24). The rule of thumb: body is the headline; attributes are the metadata you slice by.
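The range table above translates directly into a lookup. The sketch below is illustrative only; the function name `severity_range` is an assumption, not part of any OTel API.

```python
# Severity ranges as listed above: (low, high, name).
SEVERITY_RANGES = [
    (1, 4, "TRACE"),
    (5, 8, "DEBUG"),
    (9, 12, "INFO"),
    (13, 16, "WARN"),
    (17, 20, "ERROR"),
    (21, 24, "FATAL"),
]


def severity_range(severity_number: int) -> str:
    """Return the level name whose range contains severity_number."""
    for low, high, name in SEVERITY_RANGES:
        if low <= severity_number <= high:
            return name
    raise ValueError(f"severity_number {severity_number} is outside 1-24")


for number in (9, 16, 17, 21):
    print(number, severity_range(number))
```

Because the scale is numeric, a backend can express "WARN and above" as `severity_number >= 13` without knowing any language-specific level names.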
A bridge (appender, handler, provider) converts each event to an OTel LogRecord.
The OTel SDK exports LogRecords via OTLP to the Collector.
For Java, the opentelemetry-logback-appender hooks into Logback; for .NET, builder.Logging.AddOpenTelemetry(...); for Python, a LoggingHandler attached to the root logger; for Node, a pino formatter that pulls trace_id/span_id from the active context.
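To make the bridge idea concrete, here is a minimal sketch built only on Python's standard `logging` module. It is not the OTel Python handler: `SketchBridgeHandler`, the `current_span` context variable, the level mapping, and the example IDs are all assumptions standing in for what a real bridge and SDK would provide.

```python
import contextvars
import logging

# Hypothetical stand-in for the active span context an OTel SDK would track.
current_span = contextvars.ContextVar("current_span", default=None)

# Assumed mapping: each stdlib level maps to the first number of its OTel range.
LEVEL_TO_SEVERITY = {
    logging.DEBUG: 5,
    logging.INFO: 9,
    logging.WARNING: 13,
    logging.ERROR: 17,
    logging.CRITICAL: 21,
}


class SketchBridgeHandler(logging.Handler):
    """Converts stdlib log events into LogRecord-shaped dicts."""

    def __init__(self):
        super().__init__()
        self.exported = []  # a real bridge would hand these to an OTLP exporter

    def emit(self, record: logging.LogRecord) -> None:
        span = current_span.get()
        self.exported.append({
            "timestamp": int(record.created * 1e9),
            "severity_number": LEVEL_TO_SEVERITY.get(record.levelno, 0),
            "severity_text": record.levelname,
            "body": record.getMessage(),
            "attributes": {"logger": record.name},  # illustrative attribute
            "trace_id": span["trace_id"] if span else None,
            "span_id": span["span_id"] if span else None,
        })


logger = logging.getLogger("payments")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = SketchBridgeHandler()
logger.addHandler(handler)

# Example IDs only; in practice they come from the active span.
current_span.set({
    "trace_id": "0af7651916cd43dd8448eb211c80319c",
    "span_id": "b7ad6b7169203331",
})
logger.error("card declined for order %s", "A-1001")
print(handler.exported[0])
```

Application code keeps calling `logger.error(...)` as before; only the handler knows about the LogRecord shape, which is the ergonomic point of a bridge.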
Key Points — Section 1
Severity is a normalized 1–24 scale: 17–20 is ERROR and 21–24 is FATAL.
Body is the headline; attributes are the searchable metadata.
Java/.NET are production-ready; Python is usable; Go/Node should inject IDs into existing loggers.
A log bridge keeps developer ergonomics while normalizing wire format downstream.
2. Collecting Logs at the Edge
The data model is portable; the act of capturing logs is not. The OpenTelemetry Collector absorbs OTLP, files, journald, and even Fluent Forward traffic without forcing a single pattern.
The filelog receiver
The filelog receiver tails files on disk and applies a per-receiver operator pipeline (JSON parser, regex parser, multiline, timestamp/severity extraction).
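The operator pipeline is configured in the Collector, not written in Python, but its behavior can be modeled as a chain of small functions applied to each tailed line. The sketch below is a conceptual model under that assumption; the function names, the `time` and `level` keys, and the level mapping are hypothetical.

```python
import json
from datetime import datetime, timezone

# Assumed mapping from level text to the first number of each severity range.
SEVERITY_BY_LEVEL = {"debug": 5, "info": 9, "warn": 13, "error": 17, "fatal": 21}


def json_parser(entry: dict) -> dict:
    """Parse the raw line as JSON and move its keys into attributes."""
    entry["attributes"] = json.loads(entry["body"])
    return entry


def timestamp_parser(entry: dict) -> dict:
    """Promote attributes['time'] (assumed UTC, second precision) to timestamp."""
    raw = entry["attributes"].pop("time")
    parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
    entry["timestamp"] = parsed.replace(tzinfo=timezone.utc)
    return entry


def severity_parser(entry: dict) -> dict:
    """Promote attributes['level'] to a numeric severity."""
    level = entry["attributes"].pop("level").lower()
    entry["severity_number"] = SEVERITY_BY_LEVEL[level]
    return entry


OPERATORS = [json_parser, timestamp_parser, severity_parser]


def run_pipeline(line: str) -> dict:
    """Apply each operator in order, as a per-receiver pipeline would."""
    entry = {"body": line}
    for operator in OPERATORS:
        entry = operator(entry)
    return entry


line = '{"time": "2025-01-15T10:30:00Z", "level": "ERROR", "msg": "upstream timeout"}'
entry = run_pipeline(line)
print(entry["timestamp"].isoformat(), entry["severity_number"], entry["attributes"])
```

Order matters: the timestamp and severity steps depend on the JSON step having populated the attributes first.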
Animation C — Log pipeline migration: Fluent Bit fades out, OTel Collector fades in
Same log file, two collectors. As migration progresses, Fluent Bit traffic fades while the OTel Collector path lights up — layered coexistence, not rip-and-replace.
Kubernetes patterns
Pattern | Pros | Cons
DaemonSet + filelog | No app changes; any language | hostPath mount; parsing burden
Sidecar OTel SDK | Structured at source; auto-correlation | Pod overhead; hard for 3rd-party
Stdout + k8sattributes | Rich K8s metadata | API load; permissions
The common production layout: DaemonSet + filelog + k8sattributes processor — that processor is often the difference between "logs I can search" and "logs I can join with metrics and traces."
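What the enrichment step contributes can be shown with a small lookup. This is a conceptual sketch only: the real processor obtains pod metadata from the Kubernetes API, whereas here `POD_INDEX`, the `enrich` function, and the choice of pod IP as the join key are assumptions for illustration.

```python
# Hypothetical in-memory view of pod metadata, keyed by pod IP.
POD_INDEX = {
    "10.0.3.17": {
        "k8s.pod.name": "payments-api-7d9c",
        "k8s.namespace.name": "payments",
        "k8s.deployment.name": "payments-api",
    },
}


def enrich(record: dict, pod_index: dict) -> dict:
    """Return a copy of record whose resource carries the pod's metadata."""
    pod_ip = record["resource"].get("k8s.pod.ip")
    metadata = pod_index.get(pod_ip, {})
    enriched = dict(record)
    enriched["resource"] = {**record["resource"], **metadata}
    return enriched


raw = {"body": "upstream timeout", "resource": {"k8s.pod.ip": "10.0.3.17"}}
enriched = enrich(raw, POD_INDEX)
print(enriched["resource"])
```

Once every record carries the same namespace and deployment keys that metrics and traces carry, a join across signals becomes a simple equality match.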
Key Points — Section 2
filelog: tail and parse files on disk (legacy / third-party apps).
journald: pull structured host events from systemd.
fluentforward: ingest from an existing Fluent Bit during migration.
Migration is phased coexistence, not rip-and-replace.
k8sattributes processor enriches every record with pod/namespace/deployment metadata.
Post-Quiz — Part 1 (same questions, retake)
Part 2 — Cross-Signal Correlation & Events
Pre-Quiz — Part 2 (sections 3 & 4)
1. What is the primary mechanism that lets Grafana pivot from a span in Tempo to the exact log lines that span produced?
2. Why should you never index trace_id as a Loki label?
3. Your tracing pipeline samples at 1%. You record a critical postmortem-worthy event as a span event. What happens?
4. Which is a key structural difference between a LogRecord and a span event?
5. You need to record a retry attempt that is only meaningful within a single failing HTTP call and benefits from automatic in-trace correlation. Best fit?
3. Cross-Signal Correlation
Structured logs and OTLP transport are means to an end. The end is correlation — clicking a span in Tempo and landing in the precise log lines it produced, or clicking a trace_id in Loki and landing in the full distributed trace.
Stamping trace_id and span_id on logs
There are two paths to get IDs onto a log record:
Automatic, via SDK or bridge — Logback, Log4j2, ILogger, the Python OTel handler all read the active span from context.
Manual, via logger enrichment — in Go and Node.js, fetch the active span yourself and inject IDs as structured fields.
Either way, the IDs on the log must match the trace's IDs exactly: a 32-character hex trace_id and a 16-character hex span_id, lowercase, with no dashes. Mismatched casing is a common pitfall in 2025; the Collector's transform processor can normalize it.
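The format rule is easy to enforce at the point where IDs are attached to a log. The helpers below are a minimal sketch; the function names and the example IDs are assumptions, and the integer-formatting helper assumes the tracing library exposes IDs as integers.

```python
import re

TRACE_ID_PATTERN = re.compile(r"[0-9a-f]{32}")
SPAN_ID_PATTERN = re.compile(r"[0-9a-f]{16}")


def _normalize(raw: str, pattern: re.Pattern, label: str) -> str:
    """Lowercase, strip dashes, then validate against the expected hex length."""
    candidate = raw.replace("-", "").lower()
    if not pattern.fullmatch(candidate):
        raise ValueError(f"invalid {label}: {raw!r}")
    return candidate


def normalize_trace_id(raw: str) -> str:
    return _normalize(raw, TRACE_ID_PATTERN, "trace_id")


def normalize_span_id(raw: str) -> str:
    return _normalize(raw, SPAN_ID_PATTERN, "span_id")


def format_ids(trace_id: int, span_id: int) -> tuple:
    """Render integer IDs as zero-padded lowercase hex (32 and 16 characters)."""
    return format(trace_id, "032x"), format(span_id, "016x")


print(normalize_trace_id("0AF76519-16CD-43DD-8448-EB211C80319C"))
print(format_ids(0x0AF7651916CD43DD8448EB211C80319C, 0xB7AD6B7169203331))
```

Normalizing in the application keeps the pipeline simple; normalizing in the Collector is the fallback when the emitting code cannot be changed.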
Unified resource attributes
Trace-log linking via trace_id is only half the story. Both signals must share the same service.name, service.namespace, deployment.environment, and k8s.namespace.name — otherwise Grafana cannot construct a sensible "Show logs" query.
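A quick consistency check can catch attribute drift before it breaks the pivot. The sketch below compares the four attributes named above; `mismatched_keys` and the sample resources are hypothetical.

```python
# The resource attributes both signals must share, per the text above.
CORRELATION_KEYS = (
    "service.name",
    "service.namespace",
    "deployment.environment",
    "k8s.namespace.name",
)


def mismatched_keys(log_resource: dict, trace_resource: dict) -> list:
    """Return the keys that are missing on either side or differ in value."""
    problems = []
    for key in CORRELATION_KEYS:
        log_value = log_resource.get(key)
        trace_value = trace_resource.get(key)
        if log_value is None or trace_value is None or log_value != trace_value:
            problems.append(key)
    return problems


log_resource = {
    "service.name": "payments-api",
    "service.namespace": "checkout",
    "deployment.environment": "prod",
    "k8s.namespace.name": "payments",
}
# Same service, but the tracing side spells the environment differently.
trace_resource = dict(log_resource, **{"deployment.environment": "production"})

print(mismatched_keys(log_resource, trace_resource))
```

A mismatch as small as `prod` versus `production` is enough to make a generated log query return nothing, even when the trace_id is correct.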
Figure 9.3 — Trace-to-logs and logs-to-trace pivot
sequenceDiagram
participant App as Application
participant OC as OTel Collector
participant Loki as Loki (Logs)
participant Tempo as Tempo (Traces)
participant G as Grafana
participant U as User
App->>OC: OTLP logs + traces (shared trace_id)
OC->>Loki: Logs pipeline
OC->>Tempo: Traces pipeline
Note over U,G: Trace -> Logs pivot
U->>G: Open trace in Tempo
G->>Tempo: Fetch spans
Tempo-->>G: Spans with trace_id
U->>G: Click "Show logs"
G->>Loki: LogQL {service="payments-api"} | json | trace_id="8f1b..."
Loki-->>G: Matching log lines
G-->>U: Render trace + logs timeline
Note over U,G: Logs -> Trace pivot
U->>G: Click trace_id in log line
G->>Tempo: Lookup trace by trace_id
Tempo-->>G: Full trace
G-->>U: Render trace view
Animation B — Trace -> Log jump: clicking a span flies to Loki and pulls back matching lines
Click a failing span in Tempo → Grafana builds a LogQL query filtered by trace_id → Loki returns the matching log lines into the panel.
The Loki + Tempo + Grafana setup
The Loki data source defines a derived field: a regex extracts trace_id from the log line and turns it into a clickable link to Tempo.
The Tempo data source's "Trace to logs" setting selects Loki, lists which resource attributes to match (service.name, k8s.namespace.name), and defines the label mapping.
Never index trace_id as a Loki label: nearly every trace would create a new label value, so cardinality explodes. Keep it as a structured field and rely on derived fields plus | json filtering at query time.
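Both directions of the pivot come down to string handling: a regex pulls trace_id out of a log line, and a query string filters by it at read time. The sketch below mimics that logic outside Grafana; the regex, the function names, and the sample log line are assumptions, and the query shape follows the one shown in Figure 9.3.

```python
import re

# Assumed derived-field regex for JSON log lines; adjust to the actual format.
TRACE_ID_REGEX = re.compile(r'"trace_id"\s*:\s*"([0-9a-f]{32})"')


def extract_trace_id(log_line: str):
    """Logs -> trace: return the trace_id found in the line, or None."""
    match = TRACE_ID_REGEX.search(log_line)
    return match.group(1) if match else None


def build_logql(service: str, trace_id: str) -> str:
    """Trace -> logs: select by a low-cardinality label, filter by trace_id."""
    return f'{{service="{service}"}} | json | trace_id="{trace_id}"'


line = (
    '{"level": "error", "msg": "card declined", '
    '"trace_id": "0af7651916cd43dd8448eb211c80319c"}'
)
trace_id = extract_trace_id(line)
print(trace_id)
print(build_logql("payments-api", trace_id))
```

Only `service` appears inside the stream selector; trace_id is applied as a filter after parsing, which is what keeps it out of the label index.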
Key Points — Section 3
Cross-signal correlation needs three things aligned: matching trace_id/span_id, shared resource attributes, and Grafana data-source wiring.
Log IDs must match trace IDs exactly: lowercase, no dashes, 32 and 16 hex characters.
trace_id is a structured field, never a Loki label (cardinality bomb).
Derived fields make trace_id in log lines clickable into Tempo.
"Trace to logs" in Tempo builds the reverse pivot.
4. Events and Span Events
OpenTelemetry has two things called "events": a LogRecord in the logs signal, and a span event in the traces signal. They look similar but live in different pipelines under different rules.
The structural difference
Aspect | LogRecord | Span Event
Signal | Logs | Traces
Independent identity | Yes | No; lives in a span
Severity | Yes | No
Body | Yes (text or structured) | No (name + attributes only)
Trace correlation | Optional, via explicit IDs | Automatic, inherited from span
Affected by trace sampling | No (separate sampling) | Yes; dropped if span is dropped
Volume profile | High; log-scale backends | Low; embedded in spans
Best for | "App over time" | "Inside this one operation"
The sampling row is the operational killer: if tracing samples at 1%, 99% of span events disappear. Postmortem-grade signals must be LogRecords.
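The arithmetic behind that warning is worth spelling out. The sketch below assumes each trace is kept independently with probability equal to the sample rate; both function names are hypothetical.

```python
def expected_retained(total_span_events: int, sample_rate: float) -> float:
    """Expected number of span events that survive trace sampling."""
    return total_span_events * sample_rate


def chance_any_retained(incidents: int, sample_rate: float) -> float:
    """Probability that at least one of several incidents keeps its span event,
    assuming each incident's trace is sampled independently."""
    return 1 - (1 - sample_rate) ** incidents


rate = 0.01  # 1% trace sampling, as in the text
print(expected_retained(10_000, rate))
print(round(chance_any_retained(10, rate), 4))
```

Under these assumptions, even ten separate occurrences of the same failure leave less than a one-in-ten chance that any span event survives, which is why a postmortem cannot depend on them.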
flowchart TD
Start["New piece of information to capture"] --> Q1{"Must survive even if trace is sampled out?"}
Q1 -->|Yes| LR["Emit as LogRecord"]
Q1 -->|No| Q2{"Describes a moment inside one span's operation?"}
Q2 -->|Yes| Q3{"Searched in logs backend? (audit, business, security)"}
Q2 -->|No| LR
Q3 -->|Yes| Both["Emit BOTH: LogRecord + span event"]
Q3 -->|No| SE["Add as span event (exception / retry / state)"]
LR --> Loki2["Logs pipeline -> Loki"]
SE --> Tempo2["Trace pipeline -> Tempo"]
Both --> Loki2
Both --> Tempo2
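The flowchart above reduces to three yes/no questions, which can be written as a small function. The function and parameter names are assumptions chosen to mirror the diagram's wording.

```python
def choose_signal(
    must_survive_sampling: bool,
    inside_one_span: bool,
    searched_in_logs_backend: bool,
) -> str:
    """Walk the decision flowchart and return where the information belongs."""
    if must_survive_sampling:
        return "log_record"
    if not inside_one_span:
        return "log_record"
    if searched_in_logs_backend:
        return "both"
    return "span_event"


# A retry that only matters inside one failing HTTP call.
print(choose_signal(False, True, False))
# An audit entry that must be kept regardless of sampling.
print(choose_signal(True, True, True))
# A security-relevant moment inside an operation that is also searched in logs.
print(choose_signal(False, True, True))
```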
Span events shine for
Exceptions — the OTel semantic convention defines the exception event with exception.type, exception.message, exception.stacktrace.
Retries — retry.count, retry.delay_ms, retry.reason without exploding span count.
State transitions — connection.established, circuit_breaker.opened, feature_flag.evaluated.
Cache hits/misses on the database span.
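As an illustration of the first item, the sketch below models a span that records an exception event carrying the three attributes named above. The `Span` and `SpanEvent` classes are simplified stand-ins written for this example, not the OTel SDK types.

```python
import time
import traceback
from dataclasses import dataclass, field


@dataclass
class SpanEvent:
    name: str
    timestamp_ns: int
    attributes: dict


@dataclass
class Span:
    name: str
    events: list = field(default_factory=list)

    def record_exception(self, exc: BaseException) -> None:
        """Attach an 'exception' event; it has a name and attributes, no body."""
        stack = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
        self.events.append(SpanEvent(
            name="exception",
            timestamp_ns=time.time_ns(),
            attributes={
                "exception.type": type(exc).__name__,
                "exception.message": str(exc),
                "exception.stacktrace": stack,
            },
        ))


span = Span(name="POST /charge")
try:
    raise TimeoutError("upstream timed out")
except TimeoutError as exc:
    span.record_exception(exc)

event = span.events[0]
print(event.name, event.attributes["exception.type"])
```

The event has no identity outside `span.events`, so if the span is dropped by sampling, the exception detail is dropped with it.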
Domain events go in logs
Product-analytics-style events (checkout.completed, user.signed_up) are best modeled as LogRecords with an event.name attribute. They ride the logs pipeline (long retention, durable, unsampled) while still carrying trace context for forensic analysis.
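A domain event modeled this way is an ordinary LogRecord with one distinguishing attribute. The sketch below builds such a record as a plain dict; the `domain_event` helper, the INFO severity choice, and the sample attributes and IDs are assumptions for illustration.

```python
import time


def domain_event(name, body, attributes=None, trace_id=None, span_id=None):
    """Build a LogRecord-shaped dict for a business event."""
    now = time.time_ns()
    return {
        "timestamp": now,
        "observed_timestamp": now,
        "severity_number": 9,  # INFO; an assumed default for business events
        "body": body,
        "attributes": {"event.name": name, **(attributes or {})},
        # Trace context is optional but keeps the event joinable to its trace.
        "trace_id": trace_id,
        "span_id": span_id,
    }


event_record = domain_event(
    "checkout.completed",
    "order A-1001 completed",
    attributes={"order.total_cents": 4599},
    trace_id="0af7651916cd43dd8448eb211c80319c",
    span_id="b7ad6b7169203331",
)
print(event_record["attributes"])
```

Because the record travels through the logs pipeline, it is retained even when the trace it references was sampled out.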
Key Points — Section 4
Span events live inside spans (no severity, no body, dropped under sampling).
LogRecords are independent, durable, sampling-resistant.
For exceptions/retries/state transitions inside an operation → span event.
For audit, business, security, postmortem-grade signals → LogRecord.
For critical incidents (500 from payment service) → emit both, joined automatically by trace_id.
Post-Quiz — Part 2 (same questions, retake)