Chapter 7 — Distributed Tracing with OpenTelemetry
Learning Objectives
Construct and propagate trace context across service and protocol boundaries.
Build readable, debuggable traces with appropriate span hierarchy and attributes.
Visualize trace data in Jaeger or Tempo to diagnose latency and error patterns.
In a monolith, a stack trace tells you what happened. In a cloud-native system, a single user click might cross a dozen services, two brokers, three databases, and a handful of language runtimes. The stack trace is gone. What replaces it is the distributed trace: a stitched-together view of how one request flowed through the system, who called whom, how long each hop took, and where things went wrong. OpenTelemetry (OTel) is the open standard that makes those traces portable.
Part 1 — Trace Data Model & Context Propagation
Pre-Reading Check — Part 1
1. Which identifier in a SpanContext is shared by every span in a single logical request?
a) SpanId
b) TraceId
c) TraceFlags
d) parent_span_id
2. A worker reads a Kafka message, processes it, and finishes. Which SpanKind best describes the worker's top-level span?
a) SERVER
b) CLIENT
c) PRODUCER
d) CONSUMER
3. In the header traceparent: 00-4bf92f35...4736-00f067aa0ba902b7-01, what does 00f067aa0ba902b7 represent?
a) The TraceId
b) The version
c) The sender's span-id (parent of any span the receiver creates)
d) A vendor-specific tracestate value
4. You want the user ID to be available on every span emitted by every service downstream of the edge. Which OTel facility should carry it?
a) A span attribute on the edge SERVER span
b) A span event
c) W3C Baggage
d) A span link
5. When composite propagators are configured with W3C, B3, and Jaeger formats, what happens on an outbound HTTP request?
a) Only the first format that succeeded on extract is injected
b) Every enabled propagator writes its format, so multiple headers go out together
c) The SDK randomly picks one format per request
d) Only W3C is injected; the others are extract-only
7.1 Trace Data Model
A trace is a directed acyclic graph of spans that share a common TraceId. Each span represents one unit of work — an HTTP handler, a database query, a queue publish — and carries a name, a start/end timestamp, a parent reference, attributes, events, status, and a kind. The mental model: a span is to a trace what a stack frame is to a stack trace, except spans cross process boundaries and overlap in time when work happens in parallel.
SpanContext: the wire envelope
TraceId — 128-bit (32 hex chars), globally unique per trace; must not be all zeros.
SpanId — 64-bit (16 hex chars), unique within a trace; the parent's SpanId becomes the child's parent_span_id.
TraceFlags — 8 bits; only bit 0 is defined (01 = sampled).
Analogy: TraceId = conference badge color (everyone shares it), SpanId = individual badge number, TraceFlags = whether the photographer may publish your photo.
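The three SpanContext fields can be modelled in a few lines. This is a minimal sketch using only the standard library; the class and helper names (SpanContext, new_root_context, new_child_context) are illustrative stand-ins, not the OTel SDK types.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanContext:
    trace_id: int     # 128-bit, shared by every span in the trace
    span_id: int      # 64-bit, unique to this span
    trace_flags: int  # 8 bits; bit 0 = sampled

    @property
    def sampled(self) -> bool:
        return bool(self.trace_flags & 0x01)

    def is_valid(self) -> bool:
        # All-zero IDs are reserved as "invalid" and must never be emitted.
        return self.trace_id != 0 and self.span_id != 0


def _random_nonzero(bits: int) -> int:
    value = 0
    while value == 0:
        value = secrets.randbits(bits)
    return value


def new_root_context(sampled: bool = True) -> SpanContext:
    """Start a new trace: fresh TraceId and fresh SpanId."""
    return SpanContext(_random_nonzero(128), _random_nonzero(64), 0x01 if sampled else 0x00)


def new_child_context(parent: SpanContext) -> SpanContext:
    """A child keeps the TraceId and flags and gets its own SpanId."""
    return SpanContext(parent.trace_id, _random_nonzero(64), parent.trace_flags)


root = new_root_context()
child = new_child_context(root)
# Hex widths match the wire format: 32 chars for TraceId, 16 for SpanId.
print(format(root.trace_id, "032x"), format(child.span_id, "016x"))
```

The child's parent_span_id would be root.span_id; it is stored on the span itself rather than in the SpanContext.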
Span kinds
| Kind | Role | Example |
| --- | --- | --- |
| SERVER | Inbound RPC handler | HTTP handler, gRPC server method |
| CLIENT | Outbound RPC | http.Client.Do, JDBC query |
| PRODUCER | Async send to queue | kafka.Producer.Send |
| CONSUMER | Async receive/process | Kafka consumer loop |
| INTERNAL | Local work, no network hop | validate_cart, JSON parse |
The SERVER/CLIENT pairing is what lets Tempo's service-graph processor build a dependency map automatically — without correct kinds, the map is guesswork.
Status, events, links
Status is UNSET, OK, or ERROR. The SDK does not infer status from HTTP codes on its own; instrumentation has to set it. A 4xx is generally not an error on a SERVER span (the client sent a bad request), but it is from a CLIENT span's perspective. In manual instrumentation, you must set it deliberately.
Events are timestamped annotations within a span. recordException(e) adds an exception event with exception.type, exception.message, and exception.stacktrace attributes.
Links reference other related SpanContexts that are not the strict parent — the right tool for fan-in (one batch span with 1,000 links instead of 1,000 parent references).
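The 4xx rule in the Status bullet above can be written down as a small mapping. This is a sketch under the assumption that 5xx counts as an error on both sides; the enums are local stand-ins defined here, not the OTel classes.

```python
from enum import Enum


class SpanKind(Enum):
    SERVER = "server"
    CLIENT = "client"


class StatusCode(Enum):
    UNSET = "unset"
    OK = "ok"
    ERROR = "error"


def status_for_http(kind: SpanKind, http_status: int) -> StatusCode:
    """Map an HTTP response code to a span status, depending on span kind."""
    if http_status >= 500:
        return StatusCode.ERROR  # the operation failed, whichever side you are on
    if http_status >= 400 and kind is SpanKind.CLIENT:
        return StatusCode.ERROR  # from the caller's view, the request did not succeed
    return StatusCode.UNSET      # leave OK for explicit, deliberate use


for kind in SpanKind:
    for code in (200, 400, 503):
        print(kind.name, code, status_for_http(kind, code).name)
```

Returning UNSET rather than OK for successful responses is a design choice in this sketch: it keeps OK available as an explicit signal from application code.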
Figure 7.1 — Parent-child span tree
flowchart TD
A["SERVER POST /checkout checkout-svc 0 - 480ms"] --> B["CLIENT payment.charge checkout-svc 20 - 310ms"]
A --> C["CLIENT inventory.reserve checkout-svc 20 - 180ms"]
B --> D["SERVER POST /charge payments-svc 30 - 300ms"]
C --> E["SERVER POST /reserve inventory-svc 30 - 170ms"]
D --> F["CLIENT db.query users payments-svc 50 - 110ms"]
D --> G["CLIENT POST gateway payments-svc 120 - 290ms"]
E --> H["CLIENT db.update stock inventory-svc 40 - 160ms"]
classDef server fill:#1f3a5f,stroke:#58a6ff,color:#fff
classDef client fill:#3a2f5f,stroke:#a78bfa,color:#fff
class A,D,E server
class B,C,F,G,H client
Animation — Span Tree Expansion (Gantt View)
Top span (root, 200ms) appears first; child spans cascade in at proper time offsets.
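The span tree in Figure 7.1 can be rebuilt from parent references alone, which is essentially what a trace UI does before drawing the waterfall. A minimal sketch, with the span data copied from the figure and short letters used as stand-in span IDs:

```python
from collections import defaultdict

# (span_id, parent_id, kind, name, service, start_ms, end_ms), from Figure 7.1
SPANS = [
    ("A", None, "SERVER", "POST /checkout", "checkout-svc", 0, 480),
    ("B", "A", "CLIENT", "payment.charge", "checkout-svc", 20, 310),
    ("C", "A", "CLIENT", "inventory.reserve", "checkout-svc", 20, 180),
    ("D", "B", "SERVER", "POST /charge", "payments-svc", 30, 300),
    ("E", "C", "SERVER", "POST /reserve", "inventory-svc", 30, 170),
    ("F", "D", "CLIENT", "db.query users", "payments-svc", 50, 110),
    ("G", "D", "CLIENT", "POST gateway", "payments-svc", 120, 290),
    ("H", "E", "CLIENT", "db.update stock", "inventory-svc", 40, 160),
]


def render_tree(spans):
    """Return one indented line per span, children ordered by start time."""
    by_id = {}
    children = defaultdict(list)
    for span in spans:
        by_id[span[0]] = span
        children[span[1]].append(span[0])

    lines = []

    def walk(span_id, depth):
        _, _, kind, name, service, start, end = by_id[span_id]
        indent = "  " * depth
        lines.append(f"{indent}{kind} {name} [{service}] {start}-{end}ms ({end - start}ms)")
        for child_id in sorted(children[span_id], key=lambda c: by_id[c][5]):
            walk(child_id, depth + 1)

    for root_id in children[None]:
        walk(root_id, 0)
    return lines


for line in render_tree(SPANS):
    print(line)
```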
Key Points — 7.1
A trace = DAG of spans sharing one TraceId; each span has its own SpanId and a parent_span_id.
SpanContext (TraceId + SpanId + TraceFlags, plus optional tracestate) is the only span data that crosses the wire; names/attributes/events stay local until exported.
SpanKind must be set correctly — SERVER pairs with CLIENT, PRODUCER pairs with CONSUMER. Default INTERNAL kills service maps.
Status ERROR must be set deliberately; HTTP 4xx is not automatically an error on a SERVER span.
Use events for intra-span moments (cache.miss, retry.attempt) and links for fan-in patterns.
7.2 Context Propagation
A trace only works if every service in the path reads, preserves, and forwards the SpanContext. That cross-process handoff is context propagation, implemented by propagators that inject context into outbound carriers and extract it from inbound carriers.
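The optional tracestate header carries up to 32 comma-separated, vendor-specific key=value entries. A vendor that updates its entry moves it to the leftmost position, so the most recent writer comes first. Implementations should be able to propagate at least 512 characters in total, and values may not contain commas or equals signs.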
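The tracestate rules are easy to get subtly wrong, so here is a minimal sketch of an update operation. The function name update_tracestate and the vendor keys rojo and congo are illustrative, and the sketch skips the full key and value grammar checks a real implementation would need.

```python
MAX_MEMBERS = 32  # limit on the number of tracestate entries


def update_tracestate(header: str, key: str, value: str) -> str:
    """Set key=value in a tracestate header, moving the entry to the front."""
    if "," in value or "=" in value:
        raise ValueError("tracestate values may not contain ',' or '='")

    members = []
    for item in header.split(","):
        item = item.strip()
        if not item or "=" not in item:
            continue                   # skip empty or malformed members
        k, _, v = item.partition("=")
        if k != key:                   # drop the old entry for this key
            members.append((k, v))

    members.insert(0, (key, value))    # the updated entry goes leftmost
    members = members[:MAX_MEMBERS]    # trim from the right if the list is too long
    return ",".join(f"{k}={v}" for k, v in members)


print(update_tracestate("rojo=00f067aa0ba902b7,congo=t61rcWkgMzE", "congo", "ucfJifl5GOE"))
```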
Animation — traceparent propagation across three hops
trace-id is invariant; only the parent span-id field rotates as the request crosses service boundaries.
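The traceparent header has four dash-separated fields: version, trace-id, parent span-id, and trace-flags. The sketch below parses a version-00 header and builds the header for the next hop, showing that only the span-id field changes. The function names and the sample header values are assumptions made for illustration.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    fields = match.groupdict()
    if fields["trace_id"] == "0" * 32 or fields["parent_id"] == "0" * 16:
        raise ValueError("all-zero trace-id or parent-id is invalid")
    return fields


def next_hop(header: str) -> tuple:
    """Return (this service's new span-id, traceparent to send downstream)."""
    fields = parse_traceparent(header)
    span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex chars
    outbound = f"00-{fields['trace_id']}-{span_id}-{fields['flags']}"
    return span_id, outbound


# Hypothetical inbound header with sample values.
inbound = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
hop_a_span, to_b = next_hop(inbound)   # Service A handles the request, calls B
hop_b_span, to_c = next_hop(to_b)      # Service B handles it, calls C
print(to_b)
print(to_c)
```

Both printed headers share the same trace-id and flags; the third field differs because each service puts its own span-id there before calling downstream.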
Legacy formats: B3 and Jaeger
| Aspect | W3C Trace Context | B3 single-header | Jaeger |
| --- | --- | --- | --- |
| Header(s) | traceparent + tracestate | b3 | uber-trace-id |
| Separator | - | - | : |
| Sampled flag | trace-flags bit 0 (mask 0x01) | 3rd field: 1/0/d | flags mask 0x01 (lowest bit) |
| Debug flag | not defined | 3rd field: d | flags mask 0x02 |
| Vendor data | tracestate | none | none |
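Composite propagators let one service emit and accept all three formats simultaneously. On inject, every enabled propagator writes its own header. On extract, the propagators run in their configured order and each one updates the context it receives, so when a request carries several formats, the last propagator that finds a valid header typically determines the result; order them with that in mind. That redundancy is the migration trick: roll out W3C alongside B3/Jaeger, let downstream services read whichever they understand, then drop the legacy formats.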
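The inject side of a composite propagator can be sketched with plain dictionaries as carriers. The classes below are simplified stand-ins written for this example (they are not the OTel SDK propagators), and the context is reduced to a dict with trace_id, span_id, and sampled.

```python
class W3CPropagator:
    def inject(self, ctx: dict, carrier: dict) -> None:
        flags = "01" if ctx["sampled"] else "00"
        carrier["traceparent"] = f"00-{ctx['trace_id']}-{ctx['span_id']}-{flags}"

    def extract(self, carrier: dict):
        parts = carrier.get("traceparent", "").split("-")
        if len(parts) != 4:
            return None
        _version, trace_id, span_id, flags = parts
        return {"trace_id": trace_id, "span_id": span_id,
                "sampled": bool(int(flags, 16) & 0x01)}


class B3SinglePropagator:
    def inject(self, ctx: dict, carrier: dict) -> None:
        state = "1" if ctx["sampled"] else "0"
        carrier["b3"] = f"{ctx['trace_id']}-{ctx['span_id']}-{state}"

    def extract(self, carrier: dict):
        parts = carrier.get("b3", "").split("-")
        if len(parts) < 3:
            return None
        trace_id, span_id, state = parts[:3]
        return {"trace_id": trace_id, "span_id": span_id,
                "sampled": state in ("1", "d")}


class CompositeInjector:
    """Inject side only: every configured propagator writes its own header."""

    def __init__(self, propagators):
        self.propagators = list(propagators)

    def inject(self, ctx: dict, carrier: dict) -> None:
        for propagator in self.propagators:
            propagator.inject(ctx, carrier)


ctx = {"trace_id": "0af7651916cd43dd8448eb211c80319c",
       "span_id": "b7ad6b7169203331", "sampled": True}
headers = {}
CompositeInjector([W3CPropagator(), B3SinglePropagator()]).inject(ctx, headers)

# A W3C-only service and a B3-only service each read the header they understand.
print(W3CPropagator().extract(headers) == ctx)
print(B3SinglePropagator().extract(headers) == ctx)
```

Both downstream readers recover the same trace-id and span-id from one outbound request, which is what makes a gradual migration possible.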
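Security note: baggage is a set of key=value pairs carried on the context in the W3C baggage header and forwarded to every downstream service, and untrusted clients can send any baggage they want. Sanitize at edges with an allowlist, strip internal baggage on outbound calls to third parties, and never put secrets, tokens, or PII in baggage.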
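An edge allowlist for baggage might look like the sketch below. The allowlisted keys and function names are hypothetical, and the parser handles only the common key=value form with optional ;property metadata, which it discards.

```python
from urllib.parse import quote, unquote

ALLOWED_KEYS = {"tenant.tier", "request.origin"}  # hypothetical allowlist


def parse_baggage(header: str) -> dict:
    """Parse a baggage header into a dict of decoded key/value pairs."""
    entries = {}
    for member in header.split(","):
        pair = member.split(";", 1)[0].strip()  # drop ";property" metadata
        if "=" not in pair:
            continue
        key, _, value = pair.partition("=")
        entries[key.strip()] = unquote(value.strip())
    return entries


def sanitize_baggage(header: str, allowed=ALLOWED_KEYS) -> str:
    """Keep only allowlisted keys and re-encode the result as a header value."""
    kept = {k: v for k, v in parse_baggage(header).items() if k in allowed}
    return ",".join(f"{k}={quote(v, safe='')}" for k, v in kept.items())


incoming = "tenant.tier=gold,session.token=abc123,request.origin=mobile%20app"
print(sanitize_baggage(incoming))
```

The same function, called with an empty allowlist, strips all baggage before a call to a third party.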
Figure 7.3 — Composite propagation sequence
sequenceDiagram
participant C as Client
participant A as Service A (W3C + B3)
participant B as Service B (extracts W3C and injects W3C + B3)
participant D as Service C (B3 only)
C->>A: HTTP request traceparent: 00-{trace-id}-{span-C}-01
A->>A: extract context, start SERVER span {span-A}
A->>B: HTTP request traceparent: 00-{trace-id}-{span-A}-01 b3: {trace-id}-{span-A}-1 baggage: user.id=12345
B->>B: extract traceparent, start SERVER span {span-B}
B->>D: HTTP request traceparent: 00-{trace-id}-{span-B}-01 b3: {trace-id}-{span-B}-1
D->>D: extract b3 header, start SERVER span {span-D}
D-->>B: response
B-->>A: response
A-->>C: response
Note over C,D: Same trace-id flows through all three hops despite mixed formats
Key Points — 7.2
W3C Trace Context (traceparent + tracestate) is OTel's default HTTP propagator; trace-id is invariant, parent span-id rotates at every hop.
Composite propagators co-emit W3C + B3 + Jaeger headers — the foundation of a gradual migration.
Baggage propagates cross-cutting request data on the context, not on any one span; downstream services must opt in to copy baggage onto their spans.
Never put secrets or PII into baggage. Sanitize at trust boundaries, strip on outbound third-party calls.
Post-Reading Check — Part 1
1. Which identifier in a SpanContext is shared by every span in a single logical request?
a) SpanId
b) TraceId
c) TraceFlags
d) parent_span_id
2. A worker reads a Kafka message, processes it, and finishes. Which SpanKind best describes the worker's top-level span?
a) SERVER
b) CLIENT
c) PRODUCER
d) CONSUMER
3. In the header traceparent: 00-4bf92f35...4736-00f067aa0ba902b7-01, what does 00f067aa0ba902b7 represent?
a) The TraceId
b) The version
c) The sender's span-id (parent of any span the receiver creates)
d) A vendor-specific tracestate value
4. You want the user ID to be available on every span emitted by every service downstream of the edge. Which OTel facility should carry it?
a) A span attribute on the edge SERVER span
b) A span event
c) W3C Baggage
d) A span link
5. When composite propagators are configured with W3C, B3, and Jaeger formats, what happens on an outbound HTTP request?
a) Only the first format that succeeded on extract is injected
b) Every enabled propagator writes its format, so multiple headers go out together
c) The SDK randomly picks one format per request
d) Only W3C is injected; the others are extract-only
Part 2 — Building Useful Traces & Visualization
Pre-Reading Check — Part 2
6. Which HTTP server span name is correct under OTel semantic conventions?
a) GET /users/12345
b) GET /users/{id}
c) GET https://api.example.com/users/12345?ref=home
d) get_user_by_id with the URL in the name
7. A worker processes 5,000 Kafka messages per poll. Which instrumentation pattern is healthiest?
a) One span per message so every message is searchable
b) One parent span per batch, events for noteworthy items, counters for "how many processed"
c) One span per message but with sampling at 1%
d) No spans, just logs
8. A payments service handler receives a malformed request and returns HTTP 400. What should the SERVER span's status be?
a) ERROR because the response was non-2xx
b) OK or UNSET: the server worked correctly; the client sent a bad request
c) ERROR because all 4xx and 5xx are errors
d) It depends on the operation, not on the status
9. Which Tempo metrics-generator processor produces metrics keyed by the (caller, callee) edge of a dependency?
a) span_metrics
b) service_graphs
c) spanmetrics connector in the Collector
d) histogram_quantile
10. Why are trace-derived metrics generally not the right source of truth for an SLO?
a) They lack the necessary HTTP status labels
b) PromQL cannot compute quantiles over them
c) Sampling distorts rate; tail-sampling that keeps errors/slow traces inflates the error rate
d) Grafana cannot render them
7.3 Building Useful Traces
Auto-instrumentation will produce spans for every HTTP request and DB call out of the box. The difference between a noisy trace and a debuggable one comes down to names, attributes, error recording, and knowing when not to create a span.
Naming spans for searchability
HTTP server: route template, not raw URL. GET /users/{id}, not GET /users/12345.
HTTP client: method plus route or host. POST api.payments.svc.
Database: operation + target. SELECT users, full statement in db.statement.
Test: imagine 10,000 spans — how many distinct names should appear? Tens or low hundreds, not millions. If your span name embeds a UUID, it is too specific.
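Web frameworks usually expose the matched route template, and that is the better source for the span name. Where it is unavailable, a fallback heuristic can keep cardinality bounded. The sketch below is one such heuristic under that assumption; the function names are hypothetical.

```python
import re

UUID_RE = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def fallback_route(path: str) -> str:
    """Replace numeric and UUID path segments with a placeholder."""
    path = path.split("?", 1)[0]  # the query string never belongs in a span name
    segments = []
    for segment in path.split("/"):
        if segment.isdigit() or UUID_RE.match(segment):
            segments.append("{id}")
        else:
            segments.append(segment)
    return "/".join(segments)


def http_server_span_name(method: str, path: str) -> str:
    return f"{method.upper()} {fallback_route(path)}"


# 10,000 requests for different users collapse to a single span name.
names = {http_server_span_name("GET", f"/users/{user_id}") for user_id in range(10_000)}
print(names)
```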
Attributes vs. events vs. status
| Information shape | Use | Example |
| --- | --- | --- |
| Stable property of the operation | Attribute | http.method=GET, db.system=postgresql |
| Bounded-cardinality filter | Attribute | http.status_code=503 |
| Timestamped moment within the span | Event | cache.miss, retry.attempt |
| Exception / error | Event + Status | recordException(e) + setStatus(ERROR, "...") |
| Pass/fail outcome | Status | OK / ERROR |
Recording exceptions
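from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# amount, card, gateway, and PaymentDeclined come from the surrounding application code.
with tracer.start_as_current_span("charge_payment") as span:
    span.set_attribute("payment.amount_cents", amount)
    try:
        gateway.charge(card, amount)
    except PaymentDeclined as e:
        # Attach the exception as a span event, mark the span failed, then re-raise.
        span.record_exception(e)
        span.set_status(trace.StatusCode.ERROR, "payment declined")
        raise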
Record then re-raise — swallowing without re-raising hides bugs.
Status ERROR is what backends color red and what Tempo's metrics-generator counts in RED metrics.
HTTP 4xx is not automatically a server-side error; it usually is on the client side.
Span explosion vs. discipline
The most common mistake is a span per loop iteration. Better options: one span per batch + events for failures; sample inside the loop; use counters for counts; use links for fan-in references.
flowchart TB
subgraph wrong["Wrong: 5001 spans per batch"]
W1["process_batch CONSUMER"]
W2["process_message x 5000 uniform child spans blows up trace storage"]
W1 --> W2
end
subgraph right["Right: 1 span + events + metrics"]
R1["process_batch CONSUMER messaging.batch.size=5000"]
R2["handle_failed_message (child span, only on error) x 3"]
R3["events: cache.miss, retry.attempt, dlq.send"]
R4["counter: messages_processed_total (metrics, not spans)"]
R1 --> R2
R1 -.-> R3
R1 -.-> R4
end
classDef bad fill:#5f1f1f,stroke:#f87171,color:#fff
classDef good fill:#1f5f3a,stroke:#34d399,color:#fff
classDef neutral fill:#1f3a5f,stroke:#58a6ff,color:#fff
class W1,W2 bad
class R1,R2 good
class R3,R4 neutral
Rule of thumb: a span should represent a unit of work big enough that you might one day look at it in a UI.
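The batch pattern can be sketched without any tracing library. RecordingSpan below is a tiny stand-in that mimics the shape of a span's set_attribute and add_event methods, and the Counter stands in for a metrics instrument; the event name message.failed and the attribute messaging.batch.failed are hypothetical.

```python
from collections import Counter


class RecordingSpan:
    """Stand-in for a tracing span; it only records what it is told."""

    def __init__(self, name):
        self.name = name
        self.attributes = {}
        self.events = []

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def add_event(self, name, attributes=None):
        self.events.append((name, attributes or {}))


metrics = Counter()  # stand-in for a metrics counter instrument


def process_batch(messages, handle):
    """One span for the whole batch; events only for the messages that fail."""
    span = RecordingSpan("process_batch")
    span.set_attribute("messaging.batch.size", len(messages))
    failed = 0
    for offset, message in enumerate(messages):
        try:
            handle(message)
            metrics["messages_processed_total"] += 1  # counts go to metrics, not spans
        except ValueError as exc:
            failed += 1
            span.add_event("message.failed", {"offset": offset, "reason": str(exc)})
    span.set_attribute("messaging.batch.failed", failed)
    return span


def handle(message):
    # Hypothetical handler: every 1000th message has a bad payload.
    if message % 1000 == 999:
        raise ValueError("bad payload")


span = process_batch(list(range(5000)), handle)
print(len(span.events), metrics["messages_processed_total"])
```

A batch of 5,000 messages yields one span with a handful of events instead of 5,001 spans, while the counter still answers "how many were processed".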
Key Points — 7.3
Low-cardinality span names (route templates, not raw URLs); follow OTel semantic conventions religiously.
Attributes describe stable properties; events mark timestamped moments; status is the deliberate OK/ERROR signal.
record_exception() + set_status(ERROR, ...) + re-raise is the canonical error pattern.
HTTP 4xx is not automatically an error on a SERVER span; reserve ERROR for 5xx or unhandled exceptions.
Avoid span explosion: one span per batch + events/counters/links beats one span per iteration.
7.4 Trace Visualization and Analysis
Jaeger and Grafana Tempo dominate the open-source tracing backends. Both ingest OTLP, both render Gantt waterfalls, and both produce RED-style metrics — but they differ in storage and integration.
Jaeger UI
Search by service, operation, time, duration, tags (http.status_code=500, error=true), or free-form trace-id lookup.
Trace timeline — the Gantt chart with attributes, events, and stack traces from record_exception.
Trace graph — nodes/edges of one trace.
System architecture — aggregated dependency graph.
Service Performance Monitoring (SPM) — RED metrics, typically via the OTel Collector spanmetrics connector (which replaced the older spanmetrics processor).
Storage: Cassandra, Elasticsearch, or OpenSearch.
Grafana Tempo
Tempo stores spans in object storage (S3/GCS/Azure Blob) and indexes only the TraceId — per-trace lookup is cheap; full-text search is expensive. The bet: most queries come from exemplars (a metric or log line already gives you the TraceId).
Spans pour into the spanmetrics connector; Rate, Errors, Duration come out as Prometheus time series.
Service maps
An accurate service map requires three things: consistent service.name resource attribute, correct SpanKind, and peer.service (or db.name, messaging.system) on outbound spans so uninstrumented downstreams can still be inferred.
flowchart LR
web["web SERVER"]
gw["api-gateway SERVER + CLIENT"]
orders["orders SERVER + CLIENT"]
pay["payments SERVER + CLIENT"]
inv["inventory SERVER + CLIENT"]
db[("db peer.service")]
web -->|"rate 920/s err 0.1% p95 95ms"| gw
gw -->|"rate 880/s err 0.2% p95 180ms"| orders
gw ===>|"rate 412/s err 3.4% p95 820ms"| pay
orders -->|"rate 720/s err 0.1% p95 60ms"| inv
orders -->|"rate 720/s err 0.0% p95 40ms"| db
pay -->|"rate 410/s err 0.1% p95 35ms"| db
inv -->|"rate 720/s err 0.0% p95 30ms"| db
classDef ok fill:#1f5f3a,stroke:#34d399,color:#fff
classDef hot fill:#5f1f1f,stroke:#f87171,color:#fff
classDef store fill:#3a2f5f,stroke:#a78bfa,color:#fff
class web,gw,orders,inv ok
class pay hot
class db store
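To see why correct SpanKind matters for the service map, consider a simplified edge builder that pairs each SERVER span with a CLIENT parent in another service. This is a sketch of the idea only, not how any particular backend implements it; the span tuples and IDs are made up for the example.

```python
from collections import Counter

# Each span: (span_id, parent_id, kind, service.name)
SPANS = [
    ("a1", None, "SERVER", "api-gateway"),
    ("a2", "a1", "CLIENT", "api-gateway"),
    ("b1", "a2", "SERVER", "orders"),
    ("b2", "b1", "CLIENT", "orders"),
    ("c1", "b2", "SERVER", "inventory"),
    ("a3", "a1", "INTERNAL", "api-gateway"),  # outbound call left at the default kind
    ("d1", "a3", "SERVER", "payments"),
]


def service_edges(spans):
    """Count caller -> callee edges from CLIENT/SERVER parent-child pairs."""
    by_id = {span[0]: span for span in spans}
    edges = Counter()
    for _span_id, parent_id, kind, service in spans:
        parent = by_id.get(parent_id)
        if kind == "SERVER" and parent is not None and parent[2] == "CLIENT":
            edges[(parent[3], service)] += 1
    return edges


for (caller, callee), count in service_edges(SPANS).items():
    print(f"{caller} -> {callee}: {count}")
```

The call to payments is real, but because its outbound span was left as INTERNAL, this pairing logic never sees the edge and the map would be missing it.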
RED via PromQL
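# Metric and label names follow Tempo metrics-generator defaults; verify them against your deployment.
# Rate
sum by (service) (rate(traces_spanmetrics_calls_total{span_kind="SPAN_KIND_SERVER"}[5m]))
# Errors: match ERROR explicitly; status_code!="OK" would also count UNSET spans
sum by (service) (rate(traces_spanmetrics_calls_total{span_kind="SPAN_KIND_SERVER", status_code="STATUS_CODE_ERROR"}[5m]))
# Duration (p95)
histogram_quantile(0.95,
  sum by (service, le) (
    rate(traces_spanmetrics_latency_bucket{span_kind="SPAN_KIND_SERVER"}[5m])
  )
)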
Caveats for trace-derived metrics
Sampling distorts rate. Head sampling at 10% reports ~1/10 of true rate. Tail sampling that preferentially keeps errors/slow traces over-represents errors. Treat trace-derived metrics as a correlation tool; keep direct application metrics as the SLO source of truth.
Cardinality. Don't add user_id or request_id as span_metrics dimensions: every distinct value creates a new time series, which can overwhelm your TSDB. Stick to bounded labels: service, operation, status, method, coarse path.
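The sampling distortion is simple arithmetic, and working through one case makes the size of the effect concrete. The numbers below (10,000 requests, 1% true error ratio, a tail sampler keeping all errors and 5% of successes) are assumptions chosen for illustration.

```python
def observed_rate_after_head_sampling(true_rate: float, sample_ratio: float) -> float:
    """Head sampling keeps a fixed fraction of traces, so the observed rate scales down."""
    return true_rate * sample_ratio


def observed_error_ratio_after_tail_sampling(
    total: int, errors: int, keep_ok_ratio: float, keep_error_ratio: float = 1.0
) -> float:
    """Error ratio as seen in the kept traces when errors are kept preferentially."""
    kept_errors = errors * keep_error_ratio
    kept_ok = (total - errors) * keep_ok_ratio
    return kept_errors / (kept_errors + kept_ok)


true_ratio = 100 / 10_000
seen_ratio = observed_error_ratio_after_tail_sampling(
    total=10_000, errors=100, keep_ok_ratio=0.05
)
print(f"head-sampled rate: {observed_rate_after_head_sampling(1000.0, 0.10):.0f} req/s")
print(f"true error ratio: {true_ratio:.1%}, observed after tail sampling: {seen_ratio:.1%}")
```

With these inputs the kept traces contain 100 errors and about 495 successes, so a 1% true error ratio shows up as roughly 17% in trace-derived metrics.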
Key Points — 7.4
Jaeger: Cassandra/ES storage, search-oriented UI, SPM via Collector spanmetrics.
Tempo: object-storage backed, trace-id indexed; built-in metrics-generator with span_metrics (per service) and service_graphs (per edge) processors.
RED via PromQL: request rate from the span-metrics calls counter, error rate filtered on ERROR status (not on "anything but OK", which would also count UNSET spans), p95 via histogram_quantile over the bucket histogram.
Service maps need consistent service.name, correct SpanKind, and peer.service attributes on outbound spans.
Sampling and cardinality limits mean trace-derived metrics are great for correlation but not for SLO accounting.
Post-Reading Check — Part 2
6. Which HTTP server span name is correct under OTel semantic conventions?
a) GET /users/12345
b) GET /users/{id}
c) GET https://api.example.com/users/12345?ref=home
d) get_user_by_id with the URL in the name
7. A worker processes 5,000 Kafka messages per poll. Which instrumentation pattern is healthiest?
a) One span per message so every message is searchable
b) One parent span per batch, events for noteworthy items, counters for "how many processed"
c) One span per message but with sampling at 1%
d) No spans, just logs
8. A payments service handler receives a malformed request and returns HTTP 400. What should the SERVER span's status be?
a) ERROR because the response was non-2xx
b) OK or UNSET: the server worked correctly; the client sent a bad request
c) ERROR because all 4xx and 5xx are errors
d) It depends on the operation, not on the status
9. Which Tempo metrics-generator processor produces metrics keyed by the (caller, callee) edge of a dependency?
a) span_metrics
b) service_graphs
c) spanmetrics connector in the Collector
d) histogram_quantile
10. Why are trace-derived metrics generally not the right source of truth for an SLO?
a) They lack the necessary HTTP status labels
b) PromQL cannot compute quantiles over them
c) Sampling distorts rate; tail-sampling that keeps errors/slow traces inflates the error rate
d) Grafana cannot render them