Study Guide: Chapter 9 — Logs, Events, and Cross-Signal Correlation

Logs in OpenTelemetry are no longer a separate world — they are the third stable signal, with a portable LogRecord schema, mature Collector pipelines, and the all-important trace_id that glues every signal together.

Part 1 — Logs Data Model & Edge Collection

Pre-Quiz — Part 1 (sections 1 & 2)

1. Which field on an OpenTelemetry LogRecord exists specifically so you can diagnose lag introduced by the log pipeline itself?

A. severity_text B. observed_timestamp C. trace_flags D. body

2. A LogRecord's severity_number of 17 corresponds to which level range?

A. DEBUG B. INFO C. WARN D. ERROR

3. The OpenTelemetry Collector's filelog receiver is most appropriate when:

A. Your services already speak OTLP and send logs directly to the Collector. B. You need to tail JSON log files written by third-party or legacy applications on disk. C. You want to read systemd journal entries from /var/log/journal. D. You need to ingest Prometheus remote-write samples.

4. According to the chapter, which approach is recommended for Go and Node.js services where the OTel logs SDK is still experimental?

A. Block log emission until the native SDK reaches GA. B. Switch to Java just for the logging subsystem. C. Keep an existing logger (zap, pino) and inject trace_id/span_id manually so the Collector ingests the JSON. D. Send logs only via Fluent Bit and skip OpenTelemetry entirely.

5. Which Collector processor is most often the difference between "logs I can search" and "logs I can join with metrics and traces" in Kubernetes?

A. batch B. k8sattributes C. memory_limiter D. tail_sampling

1. OpenTelemetry Logs Data Model

If traces describe what one operation did and metrics describe what is true in aggregate, logs describe what happened, at this exact moment, in detail. OpenTelemetry's logs signal formalizes that intuition into a portable data structure so logs can be shipped, transformed, and queried the same way regardless of language or backend.

A LogRecord is the atomic unit of the logs signal. It lives inside a familiar hierarchy: a Resource (service, host, or pod) contains one or more InstrumentationScopes (the library that emitted the log), which in turn contain LogRecords.

Core fields of a LogRecord

Field	Purpose
`timestamp`	When the event actually occurred
`observed_timestamp`	When the collector first saw the event — used to detect pipeline lag
`severity_number` / `severity_text`	Normalized 1–24 number, plus the level name as the source wrote it
`body`	The main payload — the "headline"
`attributes`	Dimensional metadata — the searchable "headers"
`trace_id` / `span_id` / `trace_flags`	Cross-signal correlation IDs
`resource`	Service-level metadata shared across signals
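As a rough illustration of how these fields fit together, the sketch below models a LogRecord as a plain Python dataclass and derives pipeline lag from the two timestamps. The class and the `pipeline_lag_ms` helper are hypothetical stand-ins for this guide, not the OpenTelemetry SDK types, and the timestamp values are made up.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class LogRecord:
    """Illustrative stand-in for an OTel LogRecord (not the SDK class)."""
    timestamp_ns: int            # when the event occurred
    observed_timestamp_ns: int   # when the collection system first saw it
    severity_number: int
    severity_text: str
    body: Any
    attributes: dict = field(default_factory=dict)
    trace_id: Optional[str] = None
    span_id: Optional[str] = None


def pipeline_lag_ms(record: LogRecord) -> float:
    """Milliseconds between the event occurring and being observed."""
    return (record.observed_timestamp_ns - record.timestamp_ns) / 1_000_000


# Made-up values: the record was observed 250 ms after it occurred.
record = LogRecord(
    timestamp_ns=1_700_000_000_000_000_000,
    observed_timestamp_ns=1_700_000_000_250_000_000,
    severity_number=17,
    severity_text="ERROR",
    body="Payment charge failed",
    attributes={"amount_cents": 4200},
)
print(pipeline_lag_ms(record))  # 250.0
```

A large or growing gap between the two timestamps points at the pipeline (buffering, backpressure, slow tailing) rather than at the application.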

Figure 9.1 — LogRecord hierarchy

graph TD
    R["Resource
service.name=payments-api
deployment.environment=prod
k8s.pod.name=payments-7c584fd87f-jc6xg"]
    R --> S1["InstrumentationScope
com.example.payments"]
    R --> S2["InstrumentationScope
runtime"]
    S1 --> L1["LogRecord
severity=ERROR
body=Payment charge failed
trace_id=8f1b...e3f5
span_id=d2a4...b0ce"]
    S1 --> L2["LogRecord
severity=INFO
body=Charge created
attributes.amount_cents=4200"]
    S2 --> L3["LogRecord
severity=WARN
body=GC pause 250ms"]
    S2 --> L4["LogRecord
severity=INFO
body=Heap resized"]

Animation A — LogRecord anatomy: fields stagger into place

The trace_id and span_id fields are the correlation hooks tying this log to its trace.

Severity, body, and attributes

OpenTelemetry maps log levels onto a single 1–24 scale: TRACE (1–4), DEBUG (5–8), INFO (9–12), WARN (13–16), ERROR (17–20), FATAL (21–24). The rule of thumb: body is the headline; attributes are the metadata you slice by.
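Because each level occupies four consecutive numbers, the range name can be computed with integer division. The following is a minimal sketch of that mapping; the function name `severity_range` is an assumption made for this example.

```python
def severity_range(severity_number: int) -> str:
    """Map a 1-24 severity_number to the name of its range."""
    if not 1 <= severity_number <= 24:
        raise ValueError(f"severity_number out of range: {severity_number}")
    names = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"]
    # Each range covers four numbers: 1-4, 5-8, 9-12, 13-16, 17-20, 21-24.
    return names[(severity_number - 1) // 4]


print(severity_range(17))  # ERROR
print(severity_range(12))  # INFO
```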

Maturity by language (2025)

Language	Posture
Java, .NET	Production-ready — native logs SDK and bridges
Python	Usable behind a thin abstraction
Go, Node.js	Early-adopter — keep existing logger, inject IDs manually

Log bridges — the three-step pattern

1. Application code logs as it always has.
2. A bridge (appender, handler, provider) converts each event to an OTel LogRecord.
3. The OTel SDK exports LogRecords via OTLP to the Collector.

For Java, the opentelemetry-logback-appender hooks into Logback; for .NET, builder.Logging.AddOpenTelemetry(...); for Python, a LoggingHandler attached to the root logger; for Node, a pino formatter that pulls trace_id/span_id from the active context.
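To make the bridge idea concrete without depending on any SDK, here is a standard-library-only sketch of the conversion step. `SketchBridgeHandler`, the level-to-severity table, the `logger.name` attribute key, and the in-memory `exported` list (standing in for an OTLP exporter) are all assumptions made for illustration; a real bridge such as the Python OTel handler does considerably more.

```python
import logging

# Assumed mapping: each stdlib level maps to the first number of its OTel range.
_SEVERITY = {
    logging.DEBUG: 5,
    logging.INFO: 9,
    logging.WARNING: 13,
    logging.ERROR: 17,
    logging.CRITICAL: 21,
}


class SketchBridgeHandler(logging.Handler):
    """Converts stdlib log records into LogRecord-shaped dicts."""

    def __init__(self) -> None:
        super().__init__()
        self.exported = []  # stand-in for an OTLP exporter

    def emit(self, record: logging.LogRecord) -> None:
        self.exported.append({
            "timestamp_ns": int(record.created * 1e9),
            "severity_number": _SEVERITY.get(record.levelno, 0),
            "severity_text": record.levelname,
            "body": record.getMessage(),
            "attributes": {"logger.name": record.name},
        })


# Application code keeps using the logger it already knows.
logger = logging.getLogger("payments")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = SketchBridgeHandler()
logger.addHandler(handler)
logger.error("Payment charge failed for order %s", "o-123")
```

The application call site is unchanged; only the handler attached to the logger knows about the LogRecord shape.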

Key Points — Section 1

LogRecord schema: timestamp, observed_timestamp, severity_number/text, body, attributes, trace_id, span_id, trace_flags, resource.
Severity is a normalized 1–24 scale: 17–20 is ERROR and 21–24 is FATAL.
Body is the headline; attributes are the searchable metadata.
Java/.NET are production-ready; Python is usable; Go/Node should inject IDs into existing loggers.
A log bridge keeps developer ergonomics while normalizing wire format downstream.

2. Collecting Logs at the Edge

The data model is portable; the act of capturing logs is not. The OpenTelemetry Collector absorbs OTLP, files, journald, and even Fluent Forward traffic without forcing a single pattern.

The filelog receiver

The filelog receiver tails files on disk and applies a per-receiver operator pipeline (JSON parser, regex parser, multiline, timestamp/severity extraction).

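receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    include_file_path: true
    # "beginning" reads existing file contents on first start; the default is "end".
    start_at: beginning
    operators:
      # Assumes each line is one JSON object with "time" and "level" keys.
      # Lines wrapped in the CRI container log format need a parsing step before this one.
      - type: json_parser
        parse_from: body
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'  # %L = milliseconds
        severity:
          parse_from: attributes.level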
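The idea behind that operator chain can be sketched in a few lines of Python: parse the line as JSON, lift the time and level keys into the record's timestamp and severity, and keep everything as attributes. This is only an approximation of the concept, not a reimplementation of the Collector's operator semantics; the function name and the sample line are made up.

```python
import json
from datetime import datetime, timezone


def parse_json_line(line: str) -> dict:
    """Roughly what the JSON parser step does for one tailed line."""
    fields = json.loads(line)
    # Python's %f plays the role of the layout's fractional-seconds field.
    timestamp = datetime.strptime(
        fields["time"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    return {
        "timestamp": timestamp,
        "severity_text": str(fields.get("level", "")).upper(),
        "body": line,          # the original line is kept as the body
        "attributes": fields,  # parsed keys become attributes
    }


sample = '{"time": "2025-01-15T10:30:00.123Z", "level": "error", "msg": "Payment charge failed"}'
parsed = parse_json_line(sample)
print(parsed["severity_text"], parsed["timestamp"].isoformat())
```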

The journald receiver

For host-level systemd events — sshd, cron, OOM kills — the journald receiver reads the binary journal directly, preserving structured fields.

Migrating from Fluent Bit / Fluentd / Vector

The pragmatic path is layered coexistence, not rip-and-replace:

Phase 1 — Leave Fluent Bit in place; deploy the Collector alongside it for new services.
Phase 2 — Standardize log format: JSON with trace_id, span_id, and resource attributes.
Phase 3 — Reconfigure Fluent Bit to forward to the Collector via OTLP, or replace it with the filelog receiver.

The Collector also speaks the Fluent Forward protocol natively as a receiver, so an existing Fluent Bit can point at it during the transition.

Figure 9.2 — Two-track log pipeline during migration

flowchart LR
    subgraph Legacy["Legacy track (Phase 1 - keep running)"]
        A1["Legacy services
stdout/files"] --> FB["Fluent Bit
DaemonSet"]
        FB --> Splunk["Splunk / ELK"]
    end
    subgraph New["OTel track (Phase 2-3 - growing)"]
        A2["New services
OTLP logs"] --> OC["OTel Collector
DaemonSet
filelog + OTLP receiver"]
        A3["JSON file logs"] --> OC
        OC --> Loki["Loki"]
        OC --> Tempo["Tempo / OTLP backend"]
    end
    A1 -. "service adopts OTel SDK
or JSON logs" .-> A2
    FB -. "optional: Fluent Forward
during transition" .-> OC

Animation C — Log pipeline migration: Fluent Bit fades out, OTel Collector fades in

The same log file is read by two collectors. As migration progresses, traffic shifts from Fluent Bit to the OTel Collector.

Kubernetes patterns

Pattern	Pros	Cons
DaemonSet + filelog	No app changes; any language	hostPath mount; parsing burden
Sidecar OTel SDK	Structured at source; auto-correlation	Pod overhead; hard for 3rd-party
Stdout + k8sattributes	Rich K8s metadata	API load; permissions

The common production layout: DaemonSet + filelog + k8sattributes processor — that processor is often the difference between "logs I can search" and "logs I can join with metrics and traces."

Post-Quiz — Part 1 (same questions, retake)

1. Which field on an OpenTelemetry LogRecord exists specifically so you can diagnose lag introduced by the log pipeline itself?

A. severity_text B. observed_timestamp C. trace_flags D. body

2. A LogRecord's severity_number of 17 corresponds to which level range?

A. DEBUG B. INFO C. WARN D. ERROR

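3. The OpenTelemetry Collector's filelog receiver is most appropriate when:

A. Your services already speak OTLP and send logs directly to the Collector. B. You need to tail JSON log files written by third-party or legacy applications on disk. C. You want to read systemd journal entries from /var/log/journal. D. You need to ingest Prometheus remote-write samples.

4. According to the chapter, which approach is recommended for Go and Node.js services where the OTel logs SDK is still experimental?

A. Block log emission until the native SDK reaches GA. B. Switch to Java just for the logging subsystem. C. Keep an existing logger (zap, pino) and inject trace_id/span_id manually so the Collector ingests the JSON. D. Send logs only via Fluent Bit and skip OpenTelemetry entirely.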

5. Which Collector processor is most often the difference between "logs I can search" and "logs I can join with metrics and traces" in Kubernetes?

A. batch B. k8sattributes C. memory_limiter D. tail_sampling

Part 2 — Cross-Signal Correlation & Events

Pre-Quiz — Part 2 (sections 3 & 4)

1. What is the primary mechanism that lets Grafana pivot from a span in Tempo to the exact log lines that span produced?

A. Sharing a database between Loki and Tempo. B. Each log line carries the identical trace_id/span_id as the trace, plus shared resource attributes. C. Loki uses a hash of the log body as a foreign key to Tempo. D. Grafana stores spans inline alongside log lines.

2. Why should you never index trace_id as a Loki label?

A. Loki cannot store hex strings as labels. B. It exposes trace IDs in dashboards. C. Each unique trace becomes a label value, exploding cardinality and devastating the index. D. Tempo will reject the trace IDs as duplicates.

3. Your tracing pipeline samples at 1%. You record a critical postmortem-worthy event as a span event. What happens?

A. Span events bypass sampling and are always retained. B. The event is automatically promoted to a LogRecord. C. There is a 99% chance the span (and its event) is dropped — for postmortem-grade signals, use a LogRecord. D. The event is stored in Tempo regardless of sampling decisions.

4. Which is a key structural difference between a LogRecord and a span event?

A. Span events carry their own severity_number just like LogRecords. B. LogRecords have an independent identity and a severity; span events have no severity and live inside a parent span. C. Span events ride the logs pipeline; LogRecords ride the traces pipeline. D. LogRecords inherit trace_id automatically; span events do not.

5. You need to record a retry attempt that is only meaningful within a single failing HTTP call and benefits from automatic in-trace correlation. Best fit?

A. A LogRecord at WARN level on the logs pipeline. B. A separate child span per retry attempt. C. A span event named "retry" with retry.count, retry.delay_ms, retry.reason attributes. D. A Prometheus counter incremented per retry.

3. Cross-Signal Correlation

Structured logs and OTLP transport are means to an end. The end is correlation — clicking a span in Tempo and landing in the precise log lines it produced, or clicking a trace_id in Loki and landing in the full distributed trace.

Stamping trace_id and span_id on logs

There are two paths to get IDs onto a log record:

Automatic, via SDK or bridge: Logback, Log4j2, ILogger, and the Python OTel handler all read the active span from context.
Manual, via logger enrichment: in Go and Node.js, fetch the active span yourself and inject its IDs as structured fields.

Either way, the IDs written to the log must match the trace's IDs exactly: a 32-hex trace_id and a 16-hex span_id, lowercase, no dashes. Mismatched casing is a common 2025 pitfall; the Collector's transform processor can normalize it.
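The format rule is easy to enforce in application code before the log line is written. The sketch below shows two hypothetical helpers: one that renders integer IDs as zero-padded lowercase hex, and one that cleans up an ID that arrived as text with uppercase letters or dashes. Both function names are assumptions for this example.

```python
import re

_HEX = re.compile(r"[0-9a-f]+")


def format_ids(trace_id: int, span_id: int) -> dict:
    """Render integer IDs as 32-hex and 16-hex lowercase strings."""
    return {
        "trace_id": format(trace_id, "032x"),
        "span_id": format(span_id, "016x"),
    }


def normalize_id(raw: str, length: int) -> str:
    """Lowercase an ID, strip dashes, and verify its length and alphabet."""
    cleaned = raw.strip().replace("-", "").lower()
    if len(cleaned) != length or not _HEX.fullmatch(cleaned):
        raise ValueError(f"not a {length}-hex ID: {raw!r}")
    return cleaned


print(normalize_id("8F1B5FE2-D5DE-4A51-B888-4F8F4CDDE3F5", 32))
# 8f1b5fe2d5de4a51b8884f8f4cdde3f5
```

Rejecting malformed IDs at the source is cheaper than discovering at query time that a log line never joins to its trace.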

Unified resource attributes

Trace-log linking via trace_id is only half the story. Both signals must share the same service.name, service.namespace, deployment.environment, and k8s.namespace.name — otherwise Grafana cannot construct a sensible "Show logs" query.

Figure 9.3 — Trace-to-logs and logs-to-trace pivot

sequenceDiagram
    participant App as Application
    participant OC as OTel Collector
    participant Loki as Loki (Logs)
    participant Tempo as Tempo (Traces)
    participant G as Grafana
    participant U as User
    App->>OC: OTLP logs + traces (shared trace_id)
    OC->>Loki: Logs pipeline
    OC->>Tempo: Traces pipeline
    Note over U,G: Trace -> Logs pivot
    U->>G: Open trace in Tempo
    G->>Tempo: Fetch spans
    Tempo-->>G: Spans with trace_id
    U->>G: Click "Show logs"
    G->>Loki: LogQL {service="payments-api"} | json | trace_id="8f1b..."
    Loki-->>G: Matching log lines
    G-->>U: Render trace + logs timeline
    Note over U,G: Logs -> Trace pivot
    U->>G: Click trace_id in log line
    G->>Tempo: Lookup trace by trace_id
    Tempo-->>G: Full trace
    G-->>U: Render trace view

Animation B — Trace -> Log jump: clicking a span flies to Loki and pulls back matching lines

Clicking a failing span in Tempo makes Grafana build a LogQL query filtered by trace_id, and Loki returns the matching log lines into the panel.

The Loki + Tempo + Grafana setup

The Loki data source defines a derived field: a regex extracts trace_id from the log line and turns it into a clickable link to Tempo.
The Tempo data source's Trace to logs setting selects Loki, lists which resource attributes to match (service.name, k8s.namespace.name), and defines the label mapping.
Never index trace_id as a Loki label — cardinality explosion. Keep it as a structured field and rely on derived fields plus | json filtering at query time.
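What the derived-field regex does can be tried out locally. The sketch below applies one plausible pattern to a JSON log line; the pattern itself and the helper name are assumptions for this example and would need adjusting to the actual log format.

```python
import json
import re
from typing import Optional

# Assumed pattern for JSON-formatted lines: capture a 32-hex trace_id value.
TRACE_ID_PATTERN = re.compile(r'"trace_id"\s*:\s*"([0-9a-f]{32})"')


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the trace_id embedded in a log line, or None if absent."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


line = json.dumps({
    "level": "error",
    "msg": "Payment charge failed",
    "trace_id": "8f1b5fe2d5de4a51b8884f8f4cdde3f5",
})
print(extract_trace_id(line))  # 8f1b5fe2d5de4a51b8884f8f4cdde3f5
```

The trace_id stays inside the line and is extracted only when needed, so it never becomes part of the index.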

The generated LogQL query

{service="payments-api", env="prod"} | json | trace_id = "8f1b5fe2d5de4a51b8884f8f4cdde3f5"
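The shape of that query is simple: low-cardinality labels go in the stream selector, and the trace_id is filtered after JSON parsing. A minimal sketch of assembling it follows; `build_trace_logql` is a hypothetical helper, not how Grafana builds the query internally.

```python
def build_trace_logql(labels: dict, trace_id: str) -> str:
    """Build a LogQL query: label selector, then a parsed-field filter."""
    # Note: label values are not escaped here; this sketch assumes they
    # contain no quotes or backslashes.
    selector = ", ".join(f'{key}="{value}"' for key, value in labels.items())
    return f'{{{selector}}} | json | trace_id = "{trace_id}"'


query = build_trace_logql(
    {"service": "payments-api", "env": "prod"},
    "8f1b5fe2d5de4a51b8884f8f4cdde3f5",
)
print(query)
```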

4. Events and Span Events

OpenTelemetry has two things called "events": a LogRecord in the logs signal, and a span event in the traces signal. They look similar but live in different pipelines under different rules.

The structural difference

Aspect	LogRecord	Span Event
Signal	Logs	Traces
Independent identity	Yes	No — lives in a span
Severity	Yes	No
Body	Yes (text or structured)	No (name + attributes only)
Trace correlation	Optional; IDs stamped by a bridge or injected manually	Automatic, inherited from span
Affected by trace sampling	No (separate sampling)	Yes — dropped if span is dropped
Volume profile	High; log-scale backends	Low; embedded in spans
Best for	"App over time"	"Inside this one operation"

The sampling row is the operational killer: if tracing samples at 1%, 99% of span events disappear. Postmortem-grade signals must be LogRecords.
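The effect is easy to see in a small simulation. The sketch below assumes simple probabilistic sampling decided once per request, with one span event and one LogRecord emitted per request; the function name and the fixed seed are choices made for this example.

```python
import random


def simulate(n_requests: int, sample_rate: float, seed: int = 0) -> tuple:
    """Count how many span events and LogRecords survive trace sampling."""
    rng = random.Random(seed)
    kept_span_events = 0
    kept_log_records = 0
    for _ in range(n_requests):
        span_sampled = rng.random() < sample_rate
        if span_sampled:
            kept_span_events += 1  # the event lives and dies with its span
        kept_log_records += 1      # the logs pipeline is not trace-sampled
    return kept_span_events, kept_log_records


span_events, log_records = simulate(100_000, 0.01)
print(f"span events kept: {span_events}, log records kept: {log_records}")
```

With a 1% rate, the span-event count lands near 1,000 out of 100,000 while every LogRecord is retained.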

Figure 9.4 — Two pipelines joined by trace_id

flowchart LR
    subgraph Traces["Trace pipeline (sampled)"]
        SP["Span: GET /orders/{id}
trace_id=8f1b...e3f5
span_id=d2a4...b0ce"]
        SP -.- E1["Span event: exception
exception.type=NPE"]
        SP -.- E2["Span event: cache.miss
cache.key=user:42"]
        SP --> Tempo["Tempo"]
    end
    subgraph Logs["Logs pipeline (unsampled)"]
        L1["LogRecord
severity=ERROR
body=Order lookup failed
trace_id=8f1b...e3f5"]
        L2["LogRecord
severity=INFO
body=checkout.completed"]
        L1 --> Loki["Loki"]
        L2 --> Loki
    end
    L1 -. "shared trace_id
(cross-signal join)" .-> SP

Figure 9.5 — Decision tree

flowchart TD
    Start["New piece of information
to capture"] --> Q1{"Must survive even if
trace is sampled out?"}
    Q1 -->|Yes| LR["Emit as LogRecord"]
    Q1 -->|No| Q2{"Describes a moment
inside one span's operation?"}
    Q2 -->|Yes| Q3{"Searched in logs backend?
(audit, business, security)"}
    Q2 -->|No| LR
    Q3 -->|Yes| Both["Emit BOTH:
LogRecord + span event"]
    Q3 -->|No| SE["Add as span event
(exception / retry / state)"]
    LR --> Loki2["Logs pipeline -> Loki"]
    SE --> Tempo2["Trace pipeline -> Tempo"]
    Both --> Loki2
    Both --> Tempo2
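The decision tree in Figure 9.5 translates directly into a small function. The sketch below is one way to encode it; the function and parameter names are assumptions chosen to mirror the three questions in the figure.

```python
def choose_signal(
    must_survive_sampling: bool,
    inside_one_span: bool,
    searched_in_logs_backend: bool,
) -> str:
    """Return 'log_record', 'span_event', or 'both' per Figure 9.5."""
    if must_survive_sampling:
        return "log_record"
    if not inside_one_span:
        return "log_record"
    if searched_in_logs_backend:
        return "both"
    return "span_event"


# A retry inside one HTTP call that nobody searches for in the logs backend:
print(choose_signal(False, True, False))  # span_event
# A postmortem-grade signal:
print(choose_signal(True, True, False))   # log_record
```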

Span events shine for

Exceptions — the OTel semantic convention defines the exception event with exception.type, exception.message, exception.stacktrace.
Retries — retry.count, retry.delay_ms, retry.reason without exploding span count.
State transitions — connection.established, circuit_breaker.opened, feature_flag.evaluated.
Cache hits and misses, recorded on the database span.

Domain events go in logs

Product-analytics-style events (checkout.completed, user.signed_up) are best modeled as LogRecords with an event.name attribute. They ride the logs pipeline (long retention, durable, unsampled) while still carrying trace context for forensic analysis.
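As a sketch of what such a record might look like on the wire, the helper below serializes a domain event as a JSON log line with an event.name attribute and trace context. The function name, the field layout, the `order.id` attribute, and the span_id value are illustrative assumptions, not a prescribed schema.

```python
import json


def domain_event(name: str, trace_id: str, span_id: str, **attributes) -> str:
    """Serialize a domain event as a LogRecord-shaped JSON line."""
    record = {
        "severity_text": "INFO",
        "body": name,
        "attributes": {"event.name": name, **attributes},
        "trace_id": trace_id,
        "span_id": span_id,
    }
    return json.dumps(record, sort_keys=True)


event_line = domain_event(
    "checkout.completed",
    trace_id="8f1b5fe2d5de4a51b8884f8f4cdde3f5",
    span_id="00f067aa0ba902b7",  # illustrative value
    **{"order.id": "o-123"},      # hypothetical attribute
)
print(event_line)
```

Because the event is a LogRecord, it is retained regardless of the trace sampling decision, yet it can still be joined to its trace when that trace was kept.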

Post-Quiz — Part 2 (same questions, retake)