Chapter 7 — Distributed Tracing with OpenTelemetry

Learning Objectives

In a monolith, a stack trace tells you what happened. In a cloud-native system, a single click might cross a dozen services, two brokers, three databases, and a handful of language runtimes. The stack trace is gone; what replaces it is the distributed trace: a stitched-together view of how one request flowed through the system, who called whom, how long each hop took, and where things went wrong. OpenTelemetry (OTel) is the open standard that makes those traces portable.

Part 1 — Trace Data Model & Context Propagation

Pre-Reading Check — Part 1

1. Which identifier in a SpanContext is shared by every span in a single logical request?

   a) SpanId
   b) TraceId
   c) TraceFlags
   d) parent_span_id

2. A worker reads a Kafka message, processes it, and finishes. Which SpanKind best describes the worker's top-level span?

   a) SERVER
   b) CLIENT
   c) PRODUCER
   d) CONSUMER

3. In the header traceparent: 00-4bf92f35...4736-00f067aa0ba902b7-01, what does 00f067aa0ba902b7 represent?

   a) The TraceId
   b) The version
   c) The sender's span-id (parent of any span the receiver creates)
   d) A vendor-specific tracestate value

4. You want the user ID to be available on every span emitted by every service downstream of the edge. Which OTel facility should carry it?

   a) A span attribute on the edge SERVER span
   b) A span event
   c) W3C Baggage
   d) A span link

5. When composite propagators are configured with W3C, B3, and Jaeger formats, what happens on an outbound HTTP request?

   a) Only the first format that succeeded on extract is injected
   b) Every enabled propagator writes its format, so multiple headers go out together
   c) The SDK randomly picks one format per request
   d) Only W3C is injected; the others are extract-only

7.1 Trace Data Model

A trace is a directed acyclic graph of spans that share a common TraceId. Each span represents one unit of work — an HTTP handler, a database query, a queue publish — and carries a name, a start/end timestamp, a parent reference, attributes, events, status, and a kind. The mental model: a span is to a trace what a stack frame is to a stack trace, except spans cross process boundaries and overlap in time when work happens in parallel.

SpanContext: the wire envelope

Analogy: TraceId = conference badge color (everyone shares it), SpanId = individual badge number, TraceFlags = whether the photographer may publish your photo.
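
To make the analogy concrete, the sketch below models a SpanContext as a small immutable record. The class and helper names are hypothetical stand-ins rather than the OpenTelemetry API; only the field sizes (a 32-hex-character trace id, a 16-hex-character span id, and a flags byte whose 0x01 bit means "sampled") follow the traceparent format used later in this chapter.

```python
import secrets
from dataclasses import dataclass

SAMPLED = 0x01  # trace-flags bit that marks a trace as sampled


@dataclass(frozen=True)
class SpanContext:
    """Hypothetical stand-in for the identifiers that travel with a request."""
    trace_id: str     # 32 hex chars, shared by every span in the trace
    span_id: str      # 16 hex chars, unique to one span
    trace_flags: int  # 8-bit field; bit 0x01 means "sampled"

    @property
    def sampled(self) -> bool:
        return bool(self.trace_flags & SAMPLED)


def new_root_context(sampled: bool = True) -> SpanContext:
    """Start a new trace: fresh trace id and fresh span id."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8),
                       SAMPLED if sampled else 0)


def child_context(parent: SpanContext) -> SpanContext:
    """A child keeps the trace id and flags but gets its own span id."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.trace_flags)


root = new_root_context()
child = child_context(root)
```

Here `root` and `child` share the same badge color (trace id) while carrying different badge numbers (span ids), and the sampling decision made at the root is inherited unchanged.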

Span kinds

Kind      Role                        Example
SERVER    Inbound RPC handler         HTTP handler, gRPC server method
CLIENT    Outbound RPC                http.Client.Do, JDBC query
PRODUCER  Async send to queue         kafka.Producer.Send
CONSUMER  Async receive/process       Kafka consumer loop
INTERNAL  Local work, no network hop  validate_cart, JSON parse

The SERVER/CLIENT pairing is what lets Tempo's service-graph processor build a dependency map automatically — without correct kinds, the map is guesswork.

Status, events, links

Figure 7.1 — Parent-child span tree

flowchart TD
    A["SERVER<br/>POST /checkout<br/>checkout-svc<br/>0 - 480ms"] --> B["CLIENT<br/>payment.charge<br/>checkout-svc<br/>20 - 310ms"]
    A --> C["CLIENT<br/>inventory.reserve<br/>checkout-svc<br/>20 - 180ms"]
    B --> D["SERVER<br/>POST /charge<br/>payments-svc<br/>30 - 300ms"]
    C --> E["SERVER<br/>POST /reserve<br/>inventory-svc<br/>30 - 170ms"]
    D --> F["CLIENT<br/>db.query users<br/>payments-svc<br/>50 - 110ms"]
    D --> G["CLIENT<br/>POST gateway<br/>payments-svc<br/>120 - 290ms"]
    E --> H["CLIENT<br/>db.update stock<br/>inventory-svc<br/>40 - 160ms"]
    classDef server fill:#1f3a5f,stroke:#58a6ff,color:#fff
    classDef client fill:#3a2f5f,stroke:#a78bfa,color:#fff
    class A,D,E server
    class B,C,F,G,H client

Animation: Span Tree Expansion (Gantt view), time axis 0 to 200 ms. Spans shown: POST /checkout, db.query users, cache.get (event), POST /charge. The top-level span (root, 200 ms) appears first; child spans cascade in at their start offsets.

Key Points — 7.1

7.2 Context Propagation

A trace only works if every service in the path reads, preserves, and forwards the SpanContext. That cross-process handoff is context propagation, implemented by propagators that inject context into outbound carriers and extract it from inbound carriers.

W3C Trace Context — traceparent

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             |  |                                |                |
             |  |                                |                +-- trace-flags (01 = sampled)
             |  |                                +-- parent span-id (16 hex)
             |  +-- trace-id (32 hex, globally unique per trace)
             +-- version (currently "00")

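The optional tracestate header carries vendor-specific key=value entries: at most 32 list members, with the most recently updated entry placed leftmost. Implementations should propagate at least 512 characters of the combined header, and values may not contain commas or equals signs.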
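
The field layout above can be checked mechanically. The following is a minimal sketch of a traceparent parser, assuming only the four-field version 00 layout shown in the diagram; the function and type names are illustrative, and a production service would rely on its SDK's propagator instead.

```python
import re
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, all lowercase hex (four-field layout only)
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero ids are invalid per W3C Trace Context.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceParent(version, trace_id, span_id, bool(int(flags, 16) & 0x01))


parsed = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
```

Returning None for a malformed header, rather than raising, mirrors the usual propagation behavior: a receiver that cannot read the context simply starts a new trace.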

Animation: traceparent propagation across three hops (Frontend, Order, Inventory, connected over HTTP). Frontend starts span a1b2… and sends 00-4bf92f3577b34da6a3ce929d0e0e4736-a1b2c3d4e5f6a7b8-01; Order starts span b9c8… and forwards 00-4bf92f3577b34da6a3ce929d0e0e4736-b9c8d7e6f5a4b3c2-01; Inventory starts span c7d6…. The trace-id stays constant across all hops, while the parent span-id field is rewritten at each hop to the sender's current span. Each receiver extracts the header, then starts a new SERVER span whose parent_span_id is the value it received.
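
The hop-by-hop rewrite can be sketched in a few lines. This is a simplified illustration, not SDK code: spans are plain dicts, the helper names are made up for the example, and the intermediate CLIENT span that a real service would create for its outbound call is omitted for brevity.

```python
import secrets


def start_server_span(incoming_traceparent: str) -> dict:
    """Extract the header and describe the SERVER span the receiver starts."""
    _version, trace_id, sender_span_id, flags = incoming_traceparent.split("-")
    return {
        "trace_id": trace_id,              # unchanged for the whole trace
        "parent_span_id": sender_span_id,  # the sender's current span
        "span_id": secrets.token_hex(8),   # new id for this service's span
        "flags": flags,
    }


def outbound_traceparent(span: dict) -> str:
    """Inject: the current span's id goes into the parent span-id field."""
    return f"00-{span['trace_id']}-{span['span_id']}-{span['flags']}"


# Frontend -> Order -> Inventory, starting from the Frontend's header
from_frontend = "00-4bf92f3577b34da6a3ce929d0e0e4736-a1b2c3d4e5f6a7b8-01"
order_span = start_server_span(from_frontend)
to_inventory = outbound_traceparent(order_span)
inventory_span = start_server_span(to_inventory)
```

After both hops, `inventory_span["trace_id"]` still equals the original trace-id, while its `parent_span_id` is the span id that Order generated.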

Legacy formats: B3 and Jaeger

Aspect        W3C Trace Context              B3 single-header  Jaeger
Header(s)     traceparent + tracestate       b3                uber-trace-id
Separator     -                              -                 :
Sampled flag  trace-flags bit 0 (mask 0x01)  3rd field: 1/0/d  flags bit 1 (mask 0x01)
Debug flag    not defined                    3rd field: d      flags bit 2 (mask 0x02)
Vendor data   tracestate                     none              none

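Composite propagators let one service emit and accept all three formats simultaneously. On extract, the propagators run in their configured order, so that order decides which header takes effect when a request carries more than one format (in several SDK implementations a later propagator that finds a valid header overrides an earlier one, so verify the behavior of yours); on inject, every enabled propagator writes its own header. That redundancy is the migration trick: roll out W3C alongside B3/Jaeger, let downstream services read whichever they understand, then drop the legacy formats.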
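
The inject side of that migration trick can be illustrated with a small sketch. The functions below are hypothetical and only mimic the three header formats from the table (the Jaeger value uses 0 for the deprecated parent-span-id field); they are not the SDK's propagator classes.

```python
from typing import Callable, Dict, List, Optional

Headers = Dict[str, str]
Injector = Callable[[str, str, bool, Headers], None]


def inject_w3c(trace_id: str, span_id: str, sampled: bool, headers: Headers) -> None:
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def inject_b3_single(trace_id: str, span_id: str, sampled: bool, headers: Headers) -> None:
    headers["b3"] = f"{trace_id}-{span_id}-{'1' if sampled else '0'}"


def inject_jaeger(trace_id: str, span_id: str, sampled: bool, headers: Headers) -> None:
    # Format: trace-id:span-id:parent-span-id:flags
    headers["uber-trace-id"] = f"{trace_id}:{span_id}:0:{'1' if sampled else '0'}"


def composite_inject(injectors: List[Injector], trace_id: str, span_id: str,
                     sampled: bool) -> Headers:
    """Every enabled propagator writes its own header onto the same carrier."""
    headers: Headers = {}
    for inject in injectors:
        inject(trace_id, span_id, sampled, headers)
    return headers


def read_trace_id_w3c(headers: Headers) -> Optional[str]:
    """What a W3C-only downstream would recover."""
    value = headers.get("traceparent")
    return value.split("-")[1] if value else None


def read_trace_id_b3(headers: Headers) -> Optional[str]:
    """What a B3-only downstream would recover."""
    value = headers.get("b3")
    return value.split("-")[0] if value else None


outbound = composite_inject(
    [inject_w3c, inject_b3_single, inject_jaeger],
    "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", True,
)
```

A W3C-only reader and a B3-only reader both recover the same trace-id from `outbound`, which is what allows legacy and migrated services to stay in one trace during the rollout.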

Baggage — cross-cutting request data

baggage: user.id=12345, tenant=acme-corp, feature.checkout_v2=enabled

Aspect                 Span attribute                          Baggage
Lives on               A single span                           The context (independent of any span)
Propagated downstream  No                                      Yes, auto-injected on every outbound call
Use                    Describe this operation (db.statement)  Request-scoped data (user.id, tenant.id)
Auto-copied to spans?  n/a                                     No, instrumentation must opt in

Security note: untrusted clients can send any baggage they want. Sanitize at edges with an allowlist, strip internal baggage on outbound calls to third parties, and never put secrets, tokens, or PII in baggage.
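
One way to apply the allowlist advice at an edge service is sketched below. It is a simplified, standard-library-only illustration: the function names and the `internal.debug` key are hypothetical, per-entry properties (the part after `;`) are dropped, and a real deployment would normally do this through its SDK's baggage API or at the gateway.

```python
from typing import Dict, Iterable
from urllib.parse import quote, unquote


def parse_baggage(header: str) -> Dict[str, str]:
    """Parse a baggage header into a dict, ignoring per-entry properties."""
    entries: Dict[str, str] = {}
    for member in header.split(","):
        key_value = member.split(";", 1)[0]  # drop ";property" metadata
        if "=" not in key_value:
            continue  # skip malformed members instead of failing the request
        key, value = key_value.split("=", 1)
        entries[key.strip()] = unquote(value.strip())
    return entries


def sanitize_baggage(header: str, allowed_keys: Iterable[str]) -> str:
    """Keep only allowlisted keys and re-serialize the header."""
    allowed = set(allowed_keys)
    kept = {k: v for k, v in parse_baggage(header).items() if k in allowed}
    return ",".join(f"{k}={quote(v, safe='')}" for k, v in kept.items())


incoming = ("user.id=12345, tenant=acme-corp, "
            "feature.checkout_v2=enabled, internal.debug=1")
clean = sanitize_baggage(
    incoming, allowed_keys={"user.id", "tenant", "feature.checkout_v2"}
)
```

Anything a client sends that is not on the allowlist, such as the `internal.debug` entry here, is silently dropped before the request continues downstream.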

Figure 7.3 — Composite propagation sequence

sequenceDiagram
    participant C as Client
    participant A as Service A<br/>(W3C + B3)
    participant B as Service B<br/>(W3C only)
    participant D as Service C<br/>(B3 only)
    C->>A: HTTP request<br/>traceparent: 00-{trace-id}-{span-C}-01
    A->>A: extract context,<br/>start SERVER span {span-A}
    A->>B: HTTP request<br/>traceparent: 00-{trace-id}-{span-A}-01<br/>b3: {trace-id}-{span-A}-1<br/>baggage: user.id=12345
    B->>B: extract traceparent,<br/>start SERVER span {span-B}
    B->>D: HTTP request<br/>traceparent: 00-{trace-id}-{span-B}-01<br/>b3: {trace-id}-{span-B}-1
    D->>D: extract b3 header,<br/>start SERVER span {span-D}
    D-->>B: response
    B-->>A: response
    A-->>C: response
    Note over C,D: Same trace-id flows through<br/>all four hops despite mixed formats

Key Points — 7.2

Part 2 — Building Useful Traces & Visualization

Pre-Reading Check — Part 2

6. Which HTTP server span name is correct under OTel semantic conventions?

   a) GET /users/12345
   b) GET /users/{id}
   c) GET https://api.example.com/users/12345?ref=home
   d) get_user_by_id with the URL in the name

7. A worker processes 5,000 Kafka messages per poll. Which instrumentation pattern is healthiest?

   a) One span per message so every message is searchable
   b) One parent span per batch, events for noteworthy items, counters for "how many processed"
   c) One span per message but with sampling at 1%
   d) No spans, just logs

8. A payments service handler receives a malformed request and returns HTTP 400. What should the SERVER span's status be?

   a) ERROR because the response was non-2xx
   b) OK or UNSET: the server worked correctly; the client sent a bad request
   c) ERROR because all 4xx and 5xx are errors
   d) It depends on the operation, not on the status

9. Which Tempo metrics-generator processor produces metrics keyed by the (caller, callee) edge of a dependency?

   a) span_metrics
   b) service_graphs
   c) spanmetrics connector in the Collector
   d) histogram_quantile

10. Why are trace-derived metrics generally not the right source of truth for an SLO?

   a) They lack the necessary HTTP status labels
   b) PromQL cannot compute quantiles over them
   c) Sampling distorts rate; tail-sampling that keeps errors/slow traces inflates the error rate
   d) Grafana cannot render them

7.3 Building Useful Traces

Auto-instrumentation will produce spans for every HTTP request and DB call out of the box. The difference between a noisy trace and a debuggable one comes down to names, attributes, error recording, and knowing when not to create a span.

Naming spans for searchability

Test: imagine 10,000 spans — how many distinct names should appear? Tens or low hundreds, not millions. If your span name embeds a UUID, it is too specific.
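
The cardinality test can be run against your own naming function. The sketch below is a hypothetical fallback that templates numeric and UUID path segments as `{id}`; when the web framework exposes the matched route template, that template is the better source for the span name.

```python
import re

# Hypothetical patterns for path segments that vary per request.
_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)
_NUMBER = re.compile(r"^\d+$")


def span_name(method: str, url_path: str) -> str:
    """Build a low-cardinality span name such as 'GET /users/{id}'."""
    path = url_path.split("?", 1)[0]  # keep query strings out of the name
    segments = []
    for segment in path.split("/"):
        if _NUMBER.match(segment) or _UUID.match(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return f"{method.upper()} {'/'.join(segments)}"


requests = [
    ("get", "/users/12345"),
    ("GET", "/users/67890?ref=home"),
    ("GET", "/orders/3f2b8c1e-9a4d-4e6f-8b7a-1c2d3e4f5a6b/items"),
]
names = {span_name(method, path) for method, path in requests}
```

Three different requests collapse into two span names, which is the behavior the test above is looking for: the number of distinct names tracks the number of routes, not the number of requests.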

Attributes vs. events vs. status

Information shape                   Use             Example
Stable property of the operation    Attribute       http.method=GET, db.system=postgresql
Bounded-cardinality filter          Attribute       http.status_code=503
Timestamped moment within the span  Event           cache.miss, retry.attempt
Exception / error                   Event + Status  recordException(e) + setStatus(ERROR, "...")
Pass/fail outcome                   Status          OK / ERROR

Recording exceptions

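from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def charge_payment(gateway, card, amount):
    # gateway and PaymentDeclined come from the application's payment client.
    with tracer.start_as_current_span("charge_payment") as span:
        span.set_attribute("payment.amount_cents", amount)
        try:
            gateway.charge(card, amount)
        except PaymentDeclined as e:
            # Attach the exception as a span event and mark the span as failed.
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, "payment declined"))
            raise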

Span explosion vs. discipline

The most common mistake is a span per loop iteration. Better options: one span per batch + events for failures; sample inside the loop; use counters for counts; use links for fan-in references.

flowchart TB
    subgraph wrong["Wrong: 5001 spans per batch"]
        W1["process_batch SERVER"]
        W2["process_message x 5000<br/>uniform child spans<br/>blows up trace storage"]
        W1 --> W2
    end
    subgraph right["Right: 1 span + events + metrics"]
        R1["process_batch SERVER<br/>messaging.batch.size=5000"]
        R2["handle_failed_message<br/>(child span, only on error) x 3"]
        R3["events: cache.miss,<br/>retry.attempt, dlq.send"]
        R4["counter: messages_processed_total<br/>(metrics, not spans)"]
        R1 --> R2
        R1 -.-> R3
        R1 -.-> R4
    end
    classDef bad fill:#5f1f1f,stroke:#f87171,color:#fff
    classDef good fill:#1f5f3a,stroke:#34d399,color:#fff
    classDef neutral fill:#1f3a5f,stroke:#58a6ff,color:#fff
    class W1,W2 bad
    class R1,R2 good
    class R3,R4 neutral

Rule of thumb: a span should represent a unit of work big enough that you might one day look at it in a UI.
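
The batch pattern can be sketched without any tracing library. `FakeSpan`, the `message.failed` event name, and the counter dict below are hypothetical in-memory stand-ins for a real span and metrics instrument; the point is only the shape: one span per batch, an event per failure, and counters for volume.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class FakeSpan:
    """Hypothetical in-memory stand-in for a tracing span."""
    name: str
    attributes: Dict[str, object] = field(default_factory=dict)
    events: List[Tuple[str, Dict[str, object]]] = field(default_factory=list)

    def add_event(self, name: str, attributes: Dict[str, object]) -> None:
        self.events.append((name, attributes))


def process_batch(messages: Iterable[bytes], handle: Callable[[bytes], None],
                  counters: Dict[str, int]) -> FakeSpan:
    """One span for the whole batch; failures become events, counts become metrics."""
    batch = list(messages)
    span = FakeSpan("process_batch", {"messaging.batch.size": len(batch)})
    for offset, message in enumerate(batch):
        try:
            handle(message)
            counters["messages_processed_total"] += 1
        except ValueError as exc:
            # Only the noteworthy items leave a record on the span.
            span.add_event("message.failed", {"offset": offset, "error": str(exc)})
            counters["messages_failed_total"] += 1
    return span


def handle(message: bytes) -> None:
    if not message:
        raise ValueError("empty payload")


counters = {"messages_processed_total": 0, "messages_failed_total": 0}
span = process_batch([b"ok"] * 4997 + [b""] * 3, handle, counters)
```

A batch of 5,000 messages with three failures produces one span carrying three events, while the "how many" question is answered by the counters rather than by counting spans.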

Key Points — 7.3

7.4 Trace Visualization and Analysis

Jaeger and Grafana Tempo dominate the open-source tracing backends. Both ingest OTLP, both render Gantt waterfalls, and both produce RED-style metrics — but they differ in storage and integration.

Jaeger UI

Storage: Cassandra, Elasticsearch, or OpenSearch.

Grafana Tempo

Tempo stores spans in object storage (S3/GCS/Azure Blob) and indexes only the TraceId — per-trace lookup is cheap; full-text search is expensive. The bet: most queries come from exemplars (a metric or log line already gives you the TraceId).

metrics_generator:
  processor:
    service_graphs:
      enabled: true
      wait: 10s
      max_items: 10000
      peer_attributes:
        - peer.service
        - db.name
        - messaging.system
    span_metrics:
      enabled: true
      dimensions:
        - http.method
        - http.status_code
        - rpc.system
      include_span_kinds:
        - server
        - consumer

Animation: Spans → spanmetrics → RED metrics. Spans flow into the spanmetrics connector and come out as Prometheus counters and histograms for Rate, Errors, and Duration (sample readout: 920/s, 0.4% errors, 180 ms p95). The Tempo metrics-generator also runs service_graphs to produce per-edge metrics.

Service maps

An accurate service map requires three things: a consistent service.name resource attribute, the correct SpanKind on every span, and peer.service (or db.name, messaging.system) on outbound spans so uninstrumented downstreams can still be inferred.
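
To see why all three ingredients matter, here is a conceptual sketch of how (caller, callee) edges can be derived from spans. It is not Tempo's implementation: spans are plain dicts with illustrative field names, and `users-db` is a made-up uninstrumented downstream.

```python
from collections import Counter
from typing import Dict, List, Optional

# Each span is a plain dict here; field names are illustrative, not an SDK type.
Span = Dict[str, Optional[str]]


def service_edges(spans: List[Span]) -> Counter:
    """Count (caller, callee) edges from CLIENT spans.

    A CLIENT span whose child is a SERVER span yields an edge between the two
    service names. If no SERVER child was recorded (uninstrumented
    downstream), fall back to the peer.service attribute.
    """
    server_by_parent: Dict[str, Span] = {
        s["parent_span_id"]: s for s in spans
        if s["kind"] == "SERVER" and s["parent_span_id"]
    }
    edges: Counter = Counter()
    for span in spans:
        if span["kind"] != "CLIENT":
            continue
        server = server_by_parent.get(span["span_id"])
        callee = server["service"] if server else span.get("peer.service")
        if callee:
            edges[(span["service"], callee)] += 1
    return edges


spans = [
    {"span_id": "a", "parent_span_id": None, "kind": "SERVER", "service": "checkout-svc"},
    {"span_id": "b", "parent_span_id": "a", "kind": "CLIENT", "service": "checkout-svc"},
    {"span_id": "d", "parent_span_id": "b", "kind": "SERVER", "service": "payments-svc"},
    {"span_id": "f", "parent_span_id": "d", "kind": "CLIENT", "service": "payments-svc",
     "peer.service": "users-db"},
]
edges = service_edges(spans)
```

If the kinds were wrong, the CLIENT/SERVER pairing would find nothing; if service names were inconsistent, the same dependency would split into several edges; and without `peer.service` the database edge would disappear from the map.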

flowchart LR
    web["web<br/>SERVER"]
    gw["api-gateway<br/>SERVER + CLIENT"]
    orders["orders<br/>SERVER + CLIENT"]
    pay["payments<br/>SERVER + CLIENT"]
    inv["inventory<br/>SERVER + CLIENT"]
    db[("db<br/>peer.service")]
    web -->|"rate 920/s<br/>err 0.1%<br/>p95 95ms"| gw
    gw -->|"rate 880/s<br/>err 0.2%<br/>p95 180ms"| orders
    gw ===>|"rate 412/s<br/>err 3.4%<br/>p95 820ms"| pay
    orders -->|"rate 720/s<br/>err 0.1%<br/>p95 60ms"| inv
    orders -->|"rate 720/s<br/>err 0.0%<br/>p95 40ms"| db
    pay -->|"rate 410/s<br/>err 0.1%<br/>p95 35ms"| db
    inv -->|"rate 720/s<br/>err 0.0%<br/>p95 30ms"| db
    classDef ok fill:#1f5f3a,stroke:#34d399,color:#fff
    classDef hot fill:#5f1f1f,stroke:#f87171,color:#fff
    classDef store fill:#3a2f5f,stroke:#a78bfa,color:#fff
    class web,gw,orders,inv ok
    class pay hot
    class db store

RED via PromQL

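# Rate: requests per second, per service
sum by (service_name) (rate(tempo_span_calls_total{span_kind="server"}[5m]))

# Errors: failed requests per second, per service (UNSET spans are not errors)
sum by (service_name) (rate(tempo_span_calls_total{span_kind="server", status_code="ERROR"}[5m]))

# Duration: p95 latency, per service
histogram_quantile(0.95,
  sum by (service_name, le) (
    rate(tempo_span_duration_seconds_bucket{span_kind="server"}[5m])
  )
)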

Caveats for trace-derived metrics
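
One caveat raised in this part's check questions is that sampling distorts trace-derived rates, and tail sampling that favors error traces inflates the error rate. The arithmetic below works through that effect with made-up numbers (10,000 requests, 40 of them failing); the function name and the keep probabilities are illustrative assumptions.

```python
def observed_error_rate(total: int, errors: int,
                        keep_error: float, keep_ok: float) -> float:
    """Error rate computed from the traces that survive sampling."""
    kept_errors = errors * keep_error
    kept_ok = (total - errors) * keep_ok
    return kept_errors / (kept_errors + kept_ok)


true_rate = 40 / 10_000  # 0.4% of requests actually fail

# Tail sampling: keep every error trace, but only 10% of successful ones.
tail_sampled = observed_error_rate(10_000, 40, keep_error=1.0, keep_ok=0.10)

# Uniform 10% sampling preserves the ratio but undercounts request rate tenfold.
uniform = observed_error_rate(10_000, 40, keep_error=0.10, keep_ok=0.10)
```

With these numbers the tail-sampled stream reports roughly 3.9% errors (40 out of 1,036 kept traces) against a true 0.4%, while uniform sampling keeps the ratio but sees only a tenth of the traffic. Either way, metrics computed after sampling need correction before they can back an SLO.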

Key Points — 7.4
