Chapter 9: Logs, Events, and Cross-Signal Correlation

Logs in OpenTelemetry are no longer a separate world — they are the third stable signal, with a portable LogRecord schema, mature Collector pipelines, and the all-important trace_id that glues every signal together.

Learning Objectives

Part 1 — Logs Data Model & Edge Collection

Pre-Quiz — Part 1 (sections 1 & 2)

1. Which field on an OpenTelemetry LogRecord exists specifically so you can diagnose lag introduced by the log pipeline itself?

2. A LogRecord's severity_number of 17 corresponds to which level range?

3. The OpenTelemetry Collector's filelog receiver is most appropriate when:

4. According to the chapter, which approach is recommended for Go and Node.js services where the OTel logs SDK is still experimental?

5. Which Collector processor is most often the difference between "logs I can search" and "logs I can join with metrics and traces" in Kubernetes?

1. OpenTelemetry Logs Data Model

If traces describe what one operation did and metrics describe what is true in aggregate, logs describe what happened, at this exact moment, in detail. OpenTelemetry's logs signal formalizes that intuition into a portable data structure so logs can be shipped, transformed, and queried the same way regardless of language or backend.

A LogRecord is the atomic unit of the logs signal. It lives inside a familiar hierarchy: a Resource (service, host, or pod) contains one or more InstrumentationScopes (the library that emitted the log), which in turn contain LogRecords.

Core fields of a LogRecord

| Field | Purpose |
| --- | --- |
| timestamp | When the event actually occurred |
| observed_timestamp | When the collector first saw the event; used to detect pipeline lag |
| severity_number / severity_text | Normalized 1–24 scale across all systems |
| body | The main payload: the "headline" |
| attributes | Dimensional metadata: the searchable "headers" |
| trace_id / span_id / trace_flags | Cross-signal correlation IDs |
| resource | Service-level metadata shared across signals |
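
The gap between timestamp and observed_timestamp gives a rough measure of pipeline lag. The following is a minimal sketch, assuming both values arrive as UTC strings in the millisecond format used by the examples in this chapter; pipeline_lag_seconds is a hypothetical helper, not part of any OpenTelemetry SDK.

```python
from datetime import datetime, timezone


def _parse(ts: str) -> datetime:
    # Accepts strings such as "2025-03-10T10:15:30.123Z" (assumed UTC).
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


def pipeline_lag_seconds(timestamp: str, observed_timestamp: str) -> float:
    """Seconds between the event occurring and the collector first seeing it."""
    return (_parse(observed_timestamp) - _parse(timestamp)).total_seconds()


lag = pipeline_lag_seconds("2025-03-10T10:15:30.123Z", "2025-03-10T10:15:32.623Z")
```

A lag that grows steadily over time usually points at the pipeline rather than at the application.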

Figure 9.1 — LogRecord hierarchy

graph TD
    R["Resource<br/>service.name=payments-api<br/>deployment.environment=prod<br/>k8s.pod.name=payments-7c584fd87f-jc6xg"]
    R --> S1["InstrumentationScope<br/>com.example.payments"]
    R --> S2["InstrumentationScope<br/>runtime"]
    S1 --> L1["LogRecord<br/>severity=ERROR<br/>body=Payment charge failed<br/>trace_id=8f1b...e3f5<br/>span_id=d2a4...b0ce"]
    S1 --> L2["LogRecord<br/>severity=INFO<br/>body=Charge created<br/>attributes.amount_cents=4200"]
    S2 --> L3["LogRecord<br/>severity=WARN<br/>body=GC pause 250ms"]
    S2 --> L4["LogRecord<br/>severity=INFO<br/>body=Heap resized"]
Animation A: LogRecord anatomy

LogRecord {
  "timestamp": "2025-03-10T10:15:30.123Z",
  "severity_number": 17,   // ERROR
  "severity_text": "ERROR",
  "body": "Payment charge failed",
  "attributes": { "http.route": "/pay", "user.id": "12345" },
  "resource": { "service.name": "payments-api" },
  "trace_id": "8f1b5fe2d5de4a51b8884f8f4cdde3f5",
  "span_id": "d2a41c3ff7a1b0ce"
}

The trace_id and span_id are the correlation hooks tying this log to its trace.

Severity, body, and attributes

OpenTelemetry maps log levels onto a single 1–24 scale: TRACE (1–4), DEBUG (5–8), INFO (9–12), WARN (13–16), ERROR (17–20), FATAL (21–24). The rule of thumb: body is the headline; attributes are the metadata you slice by.
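
The ranges above translate directly into a lookup. This sketch only encodes the six ranges listed in this section; severity_level is an illustrative name, and values outside 1–24 are rejected rather than guessed at.

```python
# (low, high, level) tuples taken from the ranges listed above.
SEVERITY_RANGES = [
    (1, 4, "TRACE"),
    (5, 8, "DEBUG"),
    (9, 12, "INFO"),
    (13, 16, "WARN"),
    (17, 20, "ERROR"),
    (21, 24, "FATAL"),
]


def severity_level(severity_number: int) -> str:
    """Return the level name whose range contains severity_number."""
    for low, high, name in SEVERITY_RANGES:
        if low <= severity_number <= high:
            return name
    raise ValueError(f"severity_number {severity_number} is outside 1-24")
```

For example, severity_level(17) falls in the ERROR range, which matches the sample record shown earlier.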

Maturity by language (2025)

| Language | Posture |
| --- | --- |
| Java, .NET | Production-ready: native logs SDK and bridges |
| Python | Usable behind a thin abstraction |
| Go, Node.js | Early-adopter: keep existing logger, inject IDs manually |

Log bridges — the three-step pattern

  1. Application code logs as it always has.
  2. A bridge (appender, handler, provider) converts each event to an OTel LogRecord.
  3. The OTel SDK exports LogRecords via OTLP to the Collector.

The bridge itself differs by language. For Java, the opentelemetry-logback-appender hooks into Logback. For .NET, builder.Logging.AddOpenTelemetry(...) registers the logging provider. For Python, a LoggingHandler is attached to the root logger. For Node.js, a pino formatter pulls trace_id/span_id from the active context.
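
To make the three-step pattern concrete, here is a standard-library-only sketch of step 2. It is not the real OpenTelemetry handler: SketchBridgeHandler is a hypothetical class that collects LogRecord-shaped dictionaries in memory instead of exporting them via OTLP, and the level-to-severity mapping is an assumption that uses the first value of each range from this chapter.

```python
import logging

# Assumed mapping from Python levels to OTel severity numbers.
_LEVEL_TO_SEVERITY = {
    logging.DEBUG: 5,
    logging.INFO: 9,
    logging.WARNING: 13,
    logging.ERROR: 17,
    logging.CRITICAL: 21,
}


class SketchBridgeHandler(logging.Handler):
    """Hypothetical bridge: converts each logging event to an OTel-shaped dict."""

    def __init__(self) -> None:
        super().__init__()
        self.exported = []  # stands in for the OTLP exporter

    def emit(self, record: logging.LogRecord) -> None:
        self.exported.append({
            "timestamp": record.created,
            # Unknown custom levels fall back to INFO in this sketch.
            "severity_number": _LEVEL_TO_SEVERITY.get(record.levelno, 9),
            "severity_text": record.levelname,
            "body": record.getMessage(),
            "attributes": {"logger.name": record.name},
        })


# Step 1: application code logs as it always has.
logger = logging.getLogger("payments.sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = SketchBridgeHandler()
logger.addHandler(handler)
logger.error("Payment charge failed for order %s", "A-42")
```

The application code never touches the OTel data model; only the handler does, which is why a bridge can be added or removed without changing call sites.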

Key Points — Section 1

2. Collecting Logs at the Edge

The data model is portable; the act of capturing logs is not. The OpenTelemetry Collector absorbs OTLP, files, journald, and even Fluent Forward traffic without forcing a single pattern.

The filelog receiver

The filelog receiver tails files on disk and applies a per-receiver operator pipeline (JSON parser, regex parser, multiline, timestamp/severity extraction).

receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    include_file_path: true
    start_at: beginning
    operators:
      - type: json_parser
        parse_from: body
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
        severity:
          parse_from: attributes.level
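
As a rough mental model of what the operator above does to each line, the sketch below parses a JSON body, keeps the parsed fields as attributes, and promotes the time and level fields. It is an approximation in Python, not the receiver's implementation; parse_line and the sample field names (time, level, msg) are assumptions chosen to match the config.

```python
import json
from datetime import datetime, timezone


def parse_line(raw_body: str) -> dict:
    """Approximate the json_parser step: body -> attributes, timestamp, severity."""
    fields = json.loads(raw_body)
    timestamp = datetime.strptime(
        fields["time"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    return {
        "body": raw_body,          # the original line is kept as the body
        "attributes": fields,      # parsed keys become attributes
        "timestamp": timestamp,    # taken from attributes.time
        "severity_text": str(fields["level"]).upper(),  # from attributes.level
    }


record = parse_line(
    '{"time": "2025-03-10T10:15:30.123Z", "level": "error", "msg": "Payment charge failed"}'
)
```

A line that is not valid JSON raises an error here; in a real pipeline you would decide explicitly whether such lines are dropped or passed through unparsed.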

The journald receiver

For host-level systemd events — sshd, cron, OOM kills — the journald receiver reads the binary journal directly, preserving structured fields.

Migrating from Fluent Bit / Fluentd / Vector

The pragmatic path is layered coexistence, not rip-and-replace:

  1. Phase 1 — Leave Fluent Bit in place; deploy the Collector alongside it for new services.
  2. Phase 2 — Standardize log format: JSON with trace_id, span_id, and resource attributes.
  3. Phase 3 — Reconfigure Fluent Bit to forward to the Collector via OTLP, or replace it with the filelog receiver.

The Collector also speaks the Fluent Forward protocol natively as a receiver, so an existing Fluent Bit can point at it during the transition.

Figure 9.2 — Two-track log pipeline during migration

flowchart LR
    subgraph Legacy["Legacy track (Phase 1 - keep running)"]
        A1["Legacy services<br/>stdout/files"] --> FB["Fluent Bit<br/>DaemonSet"]
        FB --> Splunk["Splunk / ELK"]
    end
    subgraph New["OTel track (Phase 2-3 - growing)"]
        A2["New services<br/>OTLP logs"] --> OC["OTel Collector<br/>DaemonSet<br/>filelog + OTLP receiver"]
        A3["JSON file logs"] --> OC
        OC --> Loki["Loki"]
        OC --> Tempo["Tempo / OTLP backend"]
    end
    A1 -. "service adopts OTel SDK<br/>or JSON logs" .-> A2
    FB -. "optional: Fluent Forward<br/>during transition" .-> OC
Animation C: Log pipeline migration. Fluent Bit fades out, OTel Collector fades in.

The same log file (/var/log/pods, app/0.log) is read by two collectors: the legacy Fluent Bit DaemonSet (tail -> parse -> ship) feeding Splunk / ELK, and the OTel Collector (filelog receiver + k8sattributes) feeding Loki + Tempo. As migration progresses, Fluent Bit traffic fades while the OTel Collector path lights up: layered coexistence, not rip-and-replace.

Kubernetes patterns

| Pattern | Pros | Cons |
| --- | --- | --- |
| DaemonSet + filelog | No app changes; any language | hostPath mount; parsing burden |
| Sidecar OTel SDK | Structured at source; auto-correlation | Pod overhead; hard for 3rd-party |
| Stdout + k8sattributes | Rich K8s metadata | API load; permissions |

The common production layout: DaemonSet + filelog + k8sattributes processor — that processor is often the difference between "logs I can search" and "logs I can join with metrics and traces."
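
The k8sattributes processor does its enrichment by looking pods up through the Kubernetes API. The sketch below illustrates the underlying idea with a simpler source: the kubelet's log path layout, /var/log/pods/<namespace>_<pod>_<uid>/<container>/<restart>.log, which the filelog receiver exposes when include_file_path is enabled. The function name and the choice of attribute keys are assumptions for illustration.

```python
import re

# Matches the kubelet pod log layout described in the lead-in.
_POD_LOG_PATH = re.compile(
    r"^/var/log/pods/"
    r"(?P<namespace>[^_/]+)_(?P<pod>[^_/]+)_(?P<uid>[^/]+)/"
    r"(?P<container>[^/]+)/(?P<restart>\d+)\.log$"
)


def k8s_attributes_from_path(path: str) -> dict:
    """Derive joinable Kubernetes attributes from a pod log file path."""
    match = _POD_LOG_PATH.match(path)
    if match is None:
        return {}  # not a pod log path; leave the record unenriched
    return {
        "k8s.namespace.name": match["namespace"],
        "k8s.pod.name": match["pod"],
        "k8s.pod.uid": match["uid"],
        "k8s.container.name": match["container"],
        "k8s.container.restart_count": int(match["restart"]),
    }


attrs = k8s_attributes_from_path(
    "/var/log/pods/prod_payments-7c584fd87f-jc6xg_0f1e2d3c-aaaa-bbbb-cccc-1234567890ab/app/0.log"
)
```

Once logs carry the same k8s.namespace.name and k8s.pod.name as metrics and traces, a backend can join the three signals on those keys instead of relying on free-text search.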

Key Points — Section 2

Post-Quiz — Part 1 (same questions, retake)


Part 2 — Cross-Signal Correlation & Events

Pre-Quiz — Part 2 (sections 3 & 4)

1. What is the primary mechanism that lets Grafana pivot from a span in Tempo to the exact log lines that span produced?

2. Why should you never index trace_id as a Loki label?

3. Your tracing pipeline samples at 1%. You record a critical postmortem-worthy event as a span event. What happens?

4. Which is a key structural difference between a LogRecord and a span event?

5. You need to record a retry attempt that is only meaningful within a single failing HTTP call and benefits from automatic in-trace correlation. Best fit?

3. Cross-Signal Correlation

Structured logs and OTLP transport are means to an end. The end is correlation — clicking a span in Tempo and landing in the precise log lines it produced, or clicking a trace_id in Loki and landing in the full distributed trace.

Stamping trace_id and span_id on logs

There are two paths to get IDs onto a log record:

  1. Automatic, via SDK or bridge — Logback, Log4j2, ILogger, the Python OTel handler all read the active span from context.
  2. Manual, via logger enrichment — in Go and Node.js, fetch the active span yourself and inject IDs as structured fields.
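
The manual path can be sketched with nothing but a logging filter. The example is in Python for consistency with the other examples in this chapter, although the text above is about Go and Node.js; the pattern is the same in any language. current_span is a hypothetical stand-in for the tracer's active-span context, not a real OpenTelemetry API.

```python
import contextvars
import logging

# Hypothetical stand-in for the tracing SDK's active-span context.
current_span = contextvars.ContextVar("current_span", default=None)


class TraceContextFilter(logging.Filter):
    """Copies the active span's IDs onto every log record as structured fields."""

    def filter(self, record: logging.LogRecord) -> bool:
        span = current_span.get()
        record.trace_id = span["trace_id"] if span else ""
        record.span_id = span["span_id"] if span else ""
        return True  # never drop the record, only enrich it


class ListHandler(logging.Handler):
    """Collects formatted lines in memory so the result can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.lines = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(self.format(record))


handler = ListHandler()
handler.setFormatter(logging.Formatter(
    "%(levelname)s %(message)s trace_id=%(trace_id)s span_id=%(span_id)s"
))
log = logging.getLogger("enrich.sketch")
log.setLevel(logging.INFO)
log.propagate = False
log.addFilter(TraceContextFilter())
log.addHandler(handler)

current_span.set({
    "trace_id": "8f1b5fe2d5de4a51b8884f8f4cdde3f5",
    "span_id": "d2a41c3ff7a1b0ce",
})
log.warning("retry attempt=2")
```

Putting the enrichment in a filter rather than at each call site means no log statement can forget to include the IDs.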

Either way, the IDs on the log must match the IDs on the span exactly: a 32-character hexadecimal trace_id and a 16-character hexadecimal span_id, lowercase, with no dashes. Mismatched casing is a common 2025 pitfall, and the Collector's transform processor can normalize it.
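
The format rule is easy to enforce before a log leaves the process. A minimal sketch, assuming IDs may arrive uppercase or with UUID-style dashes; the function names are illustrative.

```python
import re

_LOWER_HEX = re.compile(r"[0-9a-f]+")


def normalize_id(value: str, length: int) -> str:
    """Lowercase the ID, strip dashes, and verify it is `length` hex characters."""
    cleaned = value.strip().replace("-", "").lower()
    if len(cleaned) != length or not _LOWER_HEX.fullmatch(cleaned):
        raise ValueError(f"expected {length} hex characters, got {value!r}")
    return cleaned


def normalize_trace_id(value: str) -> str:
    return normalize_id(value, 32)


def normalize_span_id(value: str) -> str:
    return normalize_id(value, 16)
```

Rejecting malformed IDs outright, rather than passing them through, keeps a bad value from silently breaking the join later.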

Unified resource attributes

Trace-log linking via trace_id is only half the story. Both signals must share the same service.name, service.namespace, deployment.environment, and k8s.namespace.name — otherwise Grafana cannot construct a sensible "Show logs" query.
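
A quick consistency check can catch drift between the two signals before it reaches Grafana. This sketch compares only the four attributes named above; resource_mismatches is a hypothetical helper and the sample values are made up for illustration.

```python
# The four attributes this section says both signals must share.
JOIN_KEYS = (
    "service.name",
    "service.namespace",
    "deployment.environment",
    "k8s.namespace.name",
)


def resource_mismatches(log_resource: dict, trace_resource: dict) -> dict:
    """Return {key: (log_value, trace_value)} for every join key that differs."""
    return {
        key: (log_resource.get(key), trace_resource.get(key))
        for key in JOIN_KEYS
        if log_resource.get(key) != trace_resource.get(key)
    }


log_res = {"service.name": "payments-api", "service.namespace": "shop",
           "deployment.environment": "prod", "k8s.namespace.name": "payments"}
trace_res = {"service.name": "payments-api", "service.namespace": "shop",
             "deployment.environment": "production", "k8s.namespace.name": "payments"}
diff = resource_mismatches(log_res, trace_res)
```

Here the only difference is prod versus production, which is exactly the kind of mismatch that makes a generated "Show logs" query return nothing.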

Figure 9.3 — Trace-to-logs and logs-to-trace pivot

sequenceDiagram
    participant App as Application
    participant OC as OTel Collector
    participant Loki as Loki (Logs)
    participant Tempo as Tempo (Traces)
    participant G as Grafana
    participant U as User
    App->>OC: OTLP logs + traces (shared trace_id)
    OC->>Loki: Logs pipeline
    OC->>Tempo: Traces pipeline
    Note over U,G: Trace -> Logs pivot
    U->>G: Open trace in Tempo
    G->>Tempo: Fetch spans
    Tempo-->>G: Spans with trace_id
    U->>G: Click "Show logs"
    G->>Loki: LogQL {service="payments-api"}<br/>| json | trace_id="8f1b..."
    Loki-->>G: Matching log lines
    G-->>U: Render trace + logs timeline
    Note over U,G: Logs -> Trace pivot
    U->>G: Click trace_id in log line
    G->>Tempo: Lookup trace by trace_id
    Tempo-->>G: Full trace
    G-->>U: Render trace view
Animation B: Trace -> Log jump. Clicking a span flies to Loki and pulls back matching lines.

Tempo, trace 8f1b...e3f5: POST /pay, stripe.charge, db.write (FAILED). The user clicks "Show logs".

Loki, LogQL: {service="payments-api"} | json | trace_id = "8f1b5fe2d5de4a51..."

ERROR db.write timeout trace_id=8f1b... span_id=d2a4...
WARN retry attempt=2 trace_id=8f1b... span_id=d2a4...
INFO charge initiated trace_id=8f1b... span_id=d2a4...

3 lines matched, shared trace_id.

Click a failing span in Tempo → Grafana builds a LogQL query filtered by trace_id → Loki returns the matching log lines into the panel.

The Loki + Tempo + Grafana setup

The generated LogQL query

{service="payments-api", env="prod"} | json | trace_id = "8f1b5fe2d5de4a51b8884f8f4cdde3f5"
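
A query of this shape can be assembled from a label set and a trace_id. The sketch below is only an illustration of the string's structure, not Grafana's own query builder; build_logql is a hypothetical name, and label values are assumed to contain no characters that need escaping.

```python
def build_logql(labels: dict, trace_id: str) -> str:
    """Build a LogQL query: label selector, JSON parsing, then a trace_id filter."""
    selector = ", ".join(f'{key}="{value}"' for key, value in labels.items())
    return f'{{{selector}}} | json | trace_id = "{trace_id}"'


query = build_logql(
    {"service": "payments-api", "env": "prod"},
    "8f1b5fe2d5de4a51b8884f8f4cdde3f5",
)
```

Note that trace_id appears after the | json stage as a filter on a parsed field, not inside the braces as a stream label.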

Key Points — Section 3

4. Events and Span Events

OpenTelemetry has two things called "events": a LogRecord in the logs signal, and a span event in the traces signal. They look similar but live in different pipelines under different rules.

The structural difference

| Aspect | LogRecord | Span Event |
| --- | --- | --- |
| Signal | Logs | Traces |
| Independent identity | Yes | No; lives in a span |
| Severity | Yes | No |
| Body | Yes (text or structured) | No (name + attributes only) |
| Trace correlation | Optional, via explicit IDs | Automatic, inherited from span |
| Affected by trace sampling | No (separate sampling) | Yes; dropped if span is dropped |
| Volume profile | High; log-scale backends | Low; embedded in spans |
| Best for | "App over time" | "Inside this one operation" |

The sampling row is the operational killer: if tracing samples at 1%, 99% of span events disappear. Postmortem-grade signals must be LogRecords.
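
The arithmetic is easy to see with a toy model. The sampler below is deliberately simplistic (it keeps a trace when its ID, read as an integer, is divisible by 100) and is not how any real OpenTelemetry sampler decides; it only shows that span events share the fate of their span while log records do not.

```python
def keep_trace(trace_id: str, one_in: int = 100) -> bool:
    # Toy deterministic sampler: roughly 1 in `one_in` traces is kept.
    return int(trace_id, 16) % one_in == 0


# 1000 synthetic traces, each emitting one span event and one LogRecord.
trace_ids = [format(i, "032x") for i in range(1000)]

# Span events live inside spans, so they survive only if the trace is kept.
span_events_kept = sum(1 for trace_id in trace_ids if keep_trace(trace_id))

# LogRecords travel through the logs pipeline and are untouched by trace sampling.
log_records_kept = len(trace_ids)
```

With these inputs, 10 of the 1000 span events survive while all 1000 log records do.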

Figure 9.4 — Two pipelines joined by trace_id

flowchart LR
    subgraph Traces["Trace pipeline (sampled)"]
        SP["Span: GET /orders/{id}<br/>trace_id=8f1b...e3f5<br/>span_id=d2a4...b0ce"]
        SP -.- E1["Span event: exception<br/>exception.type=NPE"]
        SP -.- E2["Span event: cache.miss<br/>cache.key=user:42"]
        SP --> Tempo["Tempo"]
    end
    subgraph Logs["Logs pipeline (unsampled)"]
        L1["LogRecord<br/>severity=ERROR<br/>body=Order lookup failed<br/>trace_id=8f1b...e3f5"]
        L2["LogRecord<br/>severity=INFO<br/>body=checkout.completed"]
        L1 --> Loki["Loki"]
        L2 --> Loki
    end
    L1 -. "shared trace_id<br/>(cross-signal join)" .-> SP

Figure 9.5 — Decision tree

flowchart TD
    Start["New piece of information<br/>to capture"] --> Q1{"Must survive even if<br/>trace is sampled out?"}
    Q1 -->|Yes| LR["Emit as LogRecord"]
    Q1 -->|No| Q2{"Describes a moment<br/>inside one span's operation?"}
    Q2 -->|Yes| Q3{"Searched in logs backend?<br/>(audit, business, security)"}
    Q2 -->|No| LR
    Q3 -->|Yes| Both["Emit BOTH:<br/>LogRecord + span event"]
    Q3 -->|No| SE["Add as span event<br/>(exception / retry / state)"]
    LR --> Loki2["Logs pipeline -> Loki"]
    SE --> Tempo2["Trace pipeline -> Tempo"]
    Both --> Loki2
    Both --> Tempo2
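
The decision tree in Figure 9.5 reduces to three yes/no questions. A minimal sketch that mirrors its branches one-for-one; the function name and the returned labels are illustrative.

```python
def choose_signal(
    must_survive_sampling: bool,
    inside_one_span: bool,
    searched_in_logs_backend: bool,
) -> str:
    """Return "log_record", "span_event", or "both", following Figure 9.5."""
    if must_survive_sampling:
        return "log_record"
    if not inside_one_span:
        return "log_record"
    if searched_in_logs_backend:
        return "both"
    return "span_event"
```

A retry attempt that matters only within one failing HTTP call answers no, yes, no, and so lands on a span event.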

Span events shine for

Domain events go in logs

Product-analytics-style events (checkout.completed, user.signed_up) are best modeled as LogRecords with an event.name attribute. They ride the logs pipeline (long retention, durable, unsampled) while still carrying trace context for forensic analysis.
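
A domain event modeled this way is just a LogRecord with a well-known attribute. The sketch below builds one as a plain dictionary; domain_event is a hypothetical helper, and the INFO severity and the use of the event name as the body are assumptions that follow the checkout.completed example in Figure 9.4.

```python
def domain_event(name: str, attributes: dict, trace_id: str = "", span_id: str = "") -> dict:
    """Build a LogRecord-shaped dict for a domain event such as checkout.completed."""
    return {
        "severity_number": 9,            # INFO (assumed default for domain events)
        "severity_text": "INFO",
        "body": name,
        "attributes": {"event.name": name, **attributes},
        "trace_id": trace_id,            # optional trace context for forensics
        "span_id": span_id,
    }


event = domain_event(
    "checkout.completed",
    {"amount_cents": 4200},
    trace_id="8f1b5fe2d5de4a51b8884f8f4cdde3f5",
    span_id="d2a41c3ff7a1b0ce",
)
```

Because the record carries trace context but travels through the logs pipeline, it stays queryable even when the trace it points to was sampled out.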

Key Points — Section 4

Post-Quiz — Part 2 (same questions, retake)



Answer Explanations