Chapter 11: Sampling, Performance, and Cost Control

Observability is one of those engineering disciplines where the cure can become as expensive as the disease, because the default reflex is to collect everything. This chapter teaches the opposite reflex: deliberately keep the right telemetry, drop the rest, and do it in the right place at the right cost.

Learning Objectives

Half 1 — Pre-Reading Check

Answer the following five questions before reading sections 1 and 2. You'll answer the same questions again after reading to measure your gain.

Pre-Reading Quiz · Half 1

1. Where in the OpenTelemetry pipeline does head-based sampling make its keep/drop decision?

In the backend (Tempo/Jaeger) after spans are stored
In the Collector after buffering spans by trace ID
In the SDK at the root span, before spans are recorded or exported
At the load balancer in front of the service

2. Why is TraceIdRatioBased(p) typically wrapped in ParentBased?

To make the sampler honor the parent's sample decision so traces are coherent across services
To increase the sample rate at child spans for better coverage
To allow tail-based decisions to override the head rate
To convert a probabilistic sampler into an adaptive one

3. A team is debating capturing every 500 error. Which strategy is best suited to that goal?

Head-based TraceIdRatioBased(0.01) — cheap and fast
Tail-based sampling at the Collector with a policy keyed on error=true
AlwaysOff sampling — errors are caught by metrics anyway
Adaptive head sampling driven by upstream load

4. Which label on an http_requests_total metric is the most dangerous for cardinality?

method (GET, POST, PUT, DELETE)
status_code (200, 400, 500, …)
user_id (one value per registered user)
service (one value per deployed microservice)

5. What is the right place in a Prometheus pipeline to drop a noisy label like request_id?

In the alerting rule that uses the metric
In metric_relabel_configs at scrape time, before the sample is stored
In Grafana panel queries with without(...)
After the fact via promtool tsdb delete-series only

1. Sampling Strategies

Sampling is the single most powerful lever you have for controlling telemetry cost and overhead. It is also the most misunderstood. The trick is to sample in a way that preserves what matters — errors, slow traces, unusual tenants — while discarding the redundant majority.

Head-based: ParentBased, TraceIdRatio, AlwaysOn/Off

Head-based sampling makes the keep-or-drop decision at the first service that creates the root span — typically the SDK in your front-door API. Because the decision is made up front, downstream services never have to record, store, or ship the spans of a dropped trace. This is the cheapest possible form of sampling.

The canonical head sampler in OTel is TraceIdRatioBased. It hashes the trace ID to produce a deterministic sample with probability p. Wrapped in ParentBased, child spans honor the parent's decision, ensuring coherent distributed traces:

ParentBased(root=TraceIdRatioBased(0.1))

Extreme samplers exist too: AlwaysOn for dev/staging and AlwaysOff as a kill switch.
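To make the two ideas concrete, here is a minimal Python sketch of a ratio sampler and a parent-based wrapper. It is a simplified illustration, not the SDK's code: the function names (`ratio_sampler`, `parent_based`) and the choice to compare the low 64 bits of the trace ID against a threshold are assumptions made for this example.

```python
import random
from typing import Callable, Optional

TRACE_ID_MASK = (1 << 64) - 1  # assumption: look only at the low 64 bits of the trace ID


def ratio_sampler(p: float) -> Callable[[int], bool]:
    """Return a keep/drop function that keeps roughly a fraction p of trace IDs."""
    bound = round(p * (1 << 64))

    def should_sample(trace_id: int) -> bool:
        # The same trace ID yields the same answer in every service,
        # so no coordination between services is needed.
        return (trace_id & TRACE_ID_MASK) < bound

    return should_sample


def parent_based(root_sampler: Callable[[int], bool]):
    """Wrap a root sampler so that child spans inherit the parent's decision."""

    def should_sample(trace_id: int, parent_sampled: Optional[bool] = None) -> bool:
        if parent_sampled is not None:
            return parent_sampled      # child span: honor the upstream decision
        return root_sampler(trace_id)  # root span: apply the ratio sampler

    return should_sample


sampler = parent_based(ratio_sampler(0.1))
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(sampler(t) for t in trace_ids)
print(f"kept {kept / len(trace_ids):.1%} of root traces")
```

Because the decision depends only on the trace ID (or on the parent's flag), every service in the call chain reaches the same verdict, which is what keeps a distributed trace whole.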

Animation A · Head vs Tail Sampling
Head sampling drops traces at the SDK before any export work (in the animation, 5 traces arrive, 2 are kept, and 3 are dropped without ever being exported). Tail sampling lets all spans flow to the Collector buffer, waits a decision window, then keeps interesting ones (errors, traces slower than 3 s) and drops the rest.

Tail-based at the Collector

Tail-based sampling lives in the OTel Collector's tail_sampling processor: SDKs ship every span, the Collector buffers spans by trace ID, waits a decision_wait window (often 5–30 s) for the trace to complete, evaluates policies (error, latency, tenant), then flushes the keepers. The cost is memory:

buffered_bytes ≈ traces/sec × spans/trace × span_size × wait_seconds

At 10,000 traces/sec × 20 spans/trace × 1 KB × 10 s, that is 2 GB of in-memory spans. Tail sampling also adds 5–30 s of observability latency, which hurts if SREs depend on live traces for incident detection.
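The estimate is easy to script so you can try your own numbers. This is a back-of-the-envelope sketch: the function name is hypothetical, the rate is counted in traces per second (which is what makes the 2 GB figure work out), and 1 KB is taken as 1,000 bytes.

```python
def tail_buffer_bytes(traces_per_sec: float, spans_per_trace: float,
                      span_size_bytes: float, wait_seconds: float) -> float:
    """Approximate bytes held in memory while traces wait for a tail decision."""
    spans_per_sec = traces_per_sec * spans_per_trace
    return spans_per_sec * span_size_bytes * wait_seconds


estimate = tail_buffer_bytes(10_000, 20, 1_000, 10)
print(f"{estimate / 1e9:.1f} GB buffered at a 10 s wait")

# The buffer grows linearly with the decision window.
for wait in (5, 10, 30):
    gb = tail_buffer_bytes(10_000, 20, 1_000, wait) / 1e9
    print(f"wait={wait:>2} s -> {gb:.1f} GB")
```

Real Collectors carry per-span overhead beyond the serialized size, so treat the result as a lower bound when sizing memory limits.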

Hybrid head + tail sampling

The common production pattern is a light head sample (e.g. ParentBased(TraceIdRatioBased(0.2))) followed by tail sampling that refines the surviving 20% to keep errors and slow traces. Caveat: anything the head sampler drops is gone forever, including errors that lost the dice roll.

flowchart TD
    Start[Trace begins<br/>at SDK]
    Start --> Head{Head sampler<br/>ParentBased<br/>TraceIdRatio 20%}
    Head -->|80% drop| Drop1[Discard at SDK<br/>no spans recorded]
    Head -->|20% keep| Export[Export spans<br/>to Collector]
    Export --> Buffer[Buffer by trace_id<br/>decision_wait window]
    Buffer --> Tail{Tail sampling<br/>policies}
    Tail -->|error=true| Keep1[Persist to backend]
    Tail -->|latency > 3s| Keep2[Persist to backend]
    Tail -->|tenant=enterprise| Keep3[Persist to backend]
    Tail -->|10% random| Keep4[Persist to backend]
    Tail -->|none matched| Drop2[Discard at Collector]
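The policy stage at the end of that flow can be sketched in a few lines of Python. This is an illustration of the decision logic only, not the Collector's implementation: the `Span` fields, the `keep_trace` function, and the use of the longest span as a stand-in for trace latency are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    trace_id: str
    duration_s: float
    error: bool = False
    tenant: str = "standard"


def keep_trace(spans: List[Span], rng: random.Random,
               latency_threshold_s: float = 3.0,
               random_keep_ratio: float = 0.10) -> str:
    """Return the name of the first matching policy, or 'drop' if none match."""
    if any(s.error for s in spans):
        return "error"
    # Assumption: the longest span approximates end-to-end trace latency.
    if max((s.duration_s for s in spans), default=0.0) > latency_threshold_s:
        return "latency"
    if any(s.tenant == "enterprise" for s in spans):
        return "tenant"
    if rng.random() < random_keep_ratio:
        return "random"  # small random baseline of otherwise boring traces
    return "drop"


rng = random.Random(7)
buffered = {
    "t1": [Span("t1", 0.2), Span("t1", 0.1, error=True)],
    "t2": [Span("t2", 4.5)],
    "t3": [Span("t3", 0.1, tenant="enterprise")],
}
for trace_id, spans in buffered.items():
    print(trace_id, "->", keep_trace(spans, rng))
```

The function only sees traces that survived the head sampler, which is the caveat in practice: a tail policy cannot rescue an error trace that was never exported.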

Comparison: When to use which

| Dimension | Head (TraceIdRatioBased) | Tail (tail_sampling) |
| --- | --- | --- |
| Where decided | SDK, at first span | Collector, after buffering |
| Decision timing | Immediate, pre-export | After decision_wait (5–30 s) |
| Memory cost | Very low | High; scales with spans/sec × wait |
| Network cost | Low (drops never leave service) | High (all spans cross the wire) |
| Accuracy for rare events | Poor (random drops) | Excellent (sees full trace) |
| Best for | High-QPS, cost-sensitive APIs | Rare-error capture, targeted debugging |

Key Takeaways — Sampling

2. Cardinality Management

If sampling controls trace cost, cardinality controls metric cost. In a Prometheus-style TSDB the unit of storage is not the metric — it's the unique combination of metric name plus label set. Each unique combination is one time series, with its own memory footprint and storage row.

Add a user_id label to http_requests_total{method, status} with 100,000 users and the math turns multiplicative very quickly.

Animation B · Cardinality Explosion
Stage 1, labels {service, endpoint}: 50 services × 20 endpoints = 1,000 series (manageable). Stage 2, add label {user_id}: 50 × 20 × 100,000 users = 100,000,000 unique label tuples, enough to OOM the TSDB.
Cardinality is multiplicative, not additive. Adding one unbounded label like user_id can take a metric from a few thousand to a hundred million series.
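The multiplication is worth doing explicitly before you ship a new label. A minimal sketch, with a hypothetical `series_count` helper that treats the product of per-label value counts as the worst case:

```python
from math import prod
from typing import Dict


def series_count(label_cardinalities: Dict[str, int]) -> int:
    """Worst-case number of series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


stage1 = {"service": 50, "endpoint": 20}
stage2 = {**stage1, "user_id": 100_000}

print(f"stage 1: {series_count(stage1):,} series")
print(f"stage 2: {series_count(stage2):,} series")
```

The product is an upper bound, since not every label combination actually occurs, but an unbounded label pushes the real number toward it over time.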

Identifying high-cardinality labels

The most useful PromQL forensic query is the “top-N by series count”:

topk(20, count by (__name__)({__name__=~".+"}))

For deeper analysis, run promtool tsdb analyze, watch the meta-metrics prometheus_tsdb_head_series and prometheus_tsdb_head_series_created_total, and (for Mimir users) use the Cardinality Explorer UI. You can even fail CI builds when a new metric crosses a series budget of, say, 50,000.
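If you can export a list of label sets (from a test scrape, for example), the same forensic idea fits in a short script. This is a sketch under assumptions: `label_value_counts` and `over_budget` are hypothetical helpers, and the input format (one dict of labels per series) is chosen for illustration rather than taken from any tool's output.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def label_value_counts(series: Iterable[Dict[str, str]]) -> List[Tuple[str, int]]:
    """Count distinct values per label name, highest cardinality first."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return sorted(((n, len(v)) for n, v in values.items()),
                  key=lambda item: (-item[1], item[0]))


def over_budget(series_per_metric: Dict[str, int], budget: int = 50_000) -> List[str]:
    """Return the metrics whose series count exceeds the budget (a CI gate)."""
    return sorted(m for m, n in series_per_metric.items() if n > budget)


scraped = [
    {"method": "GET", "status": "200", "user_id": "1"},
    {"method": "GET", "status": "200", "user_id": "2"},
    {"method": "POST", "status": "500", "user_id": "3"},
]
print(label_value_counts(scraped))
print(over_budget({"http_requests_total": 120_000, "up": 40}))
```

A CI job could fail whenever `over_budget` returns a non-empty list, which turns the 50,000-series guideline into an enforced limit.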

Allow/deny lists and attribute drops

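metric_relabel_configs:
  # Drop the uri label entirely. labeldrop matches its regex against label
  # names and takes no source_labels, so name the label in the regex itself.
  - regex: "uri"
    action: labeldrop

  # Drop per-user and per-request identifiers.
  - regex: "user_id|session_id|request_id"
    action: labeldrop

  # Collapse numeric user IDs in the path into a fixed :id placeholder.
  - source_labels: [path]
    regex: "/api/v1/users/[0-9]+/(.*)"
    target_label: path
    replacement: "/api/v1/users/:id/$1"
    action: replace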

That last rule, which collapses /api/v1/users/12345/orders into /api/v1/users/:id/orders, is one of the most useful tricks in the playbook.
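To see what the identifier drop and the path rewrite do to a single label set, here is a small Python emulation. It is a sketch, not Prometheus code: the `relabel` function is hypothetical, and it relies on the fact that Prometheus relabel regexes are fully anchored, which `re.fullmatch` mirrors.

```python
import re
from typing import Dict

DROP_LABELS = re.compile(r"user_id|session_id|request_id")
USER_PATH = re.compile(r"/api/v1/users/[0-9]+/(.*)")


def relabel(labels: Dict[str, str]) -> Dict[str, str]:
    """Apply a labeldrop and a path-normalizing replace to one label set."""
    # labeldrop: remove every label whose NAME fully matches the regex.
    kept = {k: v for k, v in labels.items() if not DROP_LABELS.fullmatch(k)}

    # replace: rewrite the path value only when the whole value matches.
    path = kept.get("path")
    if path is not None:
        match = USER_PATH.fullmatch(path)
        if match:
            kept["path"] = "/api/v1/users/:id/" + match.group(1)
    return kept


print(relabel({"method": "GET", "user_id": "42",
               "path": "/api/v1/users/12345/orders"}))
print(relabel({"method": "GET", "path": "/healthz"}))
```

Note that a path without a trailing segment, such as /api/v1/users/12345, does not match this regex and passes through unchanged; whether that case needs its own rule depends on your routes.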

Aggregating away unbounded dimensions

groups:
  - name: myapp_aggregates
    interval: 30s
    rules:
      - record: myapp:http_request_duration_seconds_bucket:service
        expr: sum by (service, le) (myapp_http_request_duration_seconds_bucket)

Keep the high-cardinality raw metric at short retention (2 h) for live debugging, and the rolled-up recording rule for 90 days at a tiny fraction of the storage. Long-term: never emit unbounded labels (user_id, session_id, email) on metrics — those belong on traces or logs.
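What the recording rule's sum by (service, le) does to the series count can be shown with a toy aggregation. This is an illustrative sketch: the `sum_by` helper and the sample data are invented for the example, and real PromQL evaluation handles timestamps and staleness that are ignored here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Sample = Tuple[Dict[str, str], float]  # (labels, value) at one instant


def sum_by(samples: Iterable[Sample], keep: Tuple[str, ...]) -> Dict[Tuple[str, ...], float]:
    """Sum values over every label except the ones listed in keep."""
    totals: Dict[Tuple[str, ...], float] = defaultdict(float)
    for labels, value in samples:
        key = tuple(labels.get(name, "") for name in keep)
        totals[key] += value
    return dict(totals)


raw = [
    ({"service": "api", "le": "0.5", "user_id": "1"}, 3.0),
    ({"service": "api", "le": "0.5", "user_id": "2"}, 4.0),
    ({"service": "api", "le": "+Inf", "user_id": "1"}, 5.0),
    ({"service": "api", "le": "+Inf", "user_id": "2"}, 6.0),
]
rolled_up = sum_by(raw, ("service", "le"))
print(f"{len(raw)} raw series -> {len(rolled_up)} rolled-up series")
print(rolled_up)
```

The rolled-up series count depends only on the labels you keep, so it stays flat no matter how many users appear in the raw data.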

flowchart TD
    Raw["Raw exposition<br/>http_requests_total{method, status,<br/>user_id, request_id, path}<br/>~1,000,000 series"]
    Raw -->|"metric_relabel_configs:<br/>labeldrop user_id, request_id<br/>normalize /users/:id/"| Mid
    Mid["After scrape-time relabel<br/>http_requests_total{method, status, route}<br/>~50,000 series"]
    Mid -->|"recording rule:<br/>sum by (service, le)"| Roll
    Roll["Recording-rule rollup<br/>myapp:http_request_duration:service<br/>~200 series"]

Key Takeaways — Cardinality

Half 1 — Post-Reading Check

Answer the same five questions from the Half 1 pre-reading check again, then compare your scores with your first attempt.

Half 2 — Pre-Reading Check

Answer the next five questions before reading sections 3 and 4.

Pre-Reading Quiz · Half 2

6. For a typical well-tuned JVM microservice, what is the expected steady-state CPU overhead of the OpenTelemetry Java agent?

Less than 0.1% — essentially free
1–5% additional CPU
25–50% additional CPU
Over 75% — never use in production

7. Why does production code almost always use BatchSpanProcessor instead of SimpleSpanProcessor?

BatchSpanProcessor encrypts spans before export
SimpleSpanProcessor exports spans synchronously and blocks the application thread on every export
SimpleSpanProcessor cannot handle Java spans
BatchSpanProcessor automatically adjusts the sampling rate

8. The BatchSpanProcessor queue is full. What is the correct default behavior, and what should an operator do?

Block the application thread; nothing to monitor
Drop spans and increment otel_sdk_span_processor_dropped_spans; monitor that counter and sample harder or grow the queue
Spill to disk silently
Crash the process to surface the problem

9. Which architectural pattern best matches “edge aggregation” in a cost-aware observability pipeline?

Send raw spans directly from every app pod to the central backend
Reduce volume as early as possible — SDK views, node-local DaemonSet collector, then regional tier
Keep all data in hot SSD storage forever
Disable instrumentation entirely until an incident occurs

10. In a tiered observability storage model, which mapping is closest to the recommended defaults?

Hot = months on SSD; Cold = seconds on Glacier
Hot = 2–24 h on SSD; Warm = 7–30 d on object storage; Cold = 90 d–1 y on archive
All data on RAM for one hour, then deleted
There is no benefit to tiering — one storage class is enough

3. Performance Overhead

The third lever is the cost of generating telemetry: the CPU and memory the instrumentation itself consumes inside your application. That cost is usually small if sampling and cardinality are under control, but it is worth measuring.

CPU and memory cost of SDK instrumentation

Java auto-instrumentation (the OTel Java Agent) is the heaviest common case because it uses bytecode instrumentation. The Elastic EDOT Java benchmark on a sample JVM service reports:

| Metric | No agent | With EDOT Java | Relative impact |
| --- | --- | --- | --- |
| Startup time | 5.55 s | 6.82 s | ~+23% (+1.3 s) |
| p95 request latency | 1.96 ms | 2.06 ms | ~+5% |
| Total system CPU | 53.82% | 54.25% | ~+0.8% relative (+0.43 percentage points) |

Expect 1–5% additional CPU for most well-tuned JVM services and tens of MB of extra heap. The worst case from community benchmarks is up to ~20% CPU and ≥0.5 ms latency per instrumented hop at high QPS without sampling — driven mostly by GC pressure.
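When you read benchmark tables like the one above, it helps to separate relative change from absolute percentage-point change, especially for CPU. A small sketch that recomputes both from the quoted figures (the helper name is hypothetical):

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change of after relative to before."""
    return (after - before) / before * 100.0


benchmark = {
    "startup time (s)": (5.55, 6.82),
    "p95 latency (ms)": (1.96, 2.06),
    "system CPU (%)": (53.82, 54.25),
}

for name, (before, after) in benchmark.items():
    print(f"{name}: {after - before:+.2f} absolute, "
          f"{relative_change(before, after):+.1f}% relative")
```

For the CPU row the absolute difference is in percentage points of total system CPU, while the relative figure expresses that difference as a share of the baseline; the two are easy to confuse.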

Go instrumentation is library-based: it is compiled into the binary rather than injected by an agent, so there's no startup penalty and no class-loading hit. Typical Go OTel overhead is low single-digit CPU percent with a few to tens of MB of additional RSS.

Batching, async export, and buffer tuning

The BatchSpanProcessor (BSP) decouples span generation from span export. Spans go into an in-memory queue, a background thread pulls them in batches, and the exporter ships them. Tuning knobs:

| Knob | Larger value | Smaller value |
| --- | --- | --- |
| Batch size | Better CPU/network efficiency, more memory | Lower memory, more frequent exports |
| Schedule delay | Fewer export calls, spans linger | Lower loss risk, more even export |
| Max queue size | Survives bigger spikes without dropping | Smaller footprint, drops earlier |
| Sampling rate | More traces shipped, higher overhead | Cheaper, less coverage |

Animation C · Async vs Synchronous Export
Async export (BatchSpanProcessor) enqueues spans into a ring buffer bounded by max_queue_size and exports in background batches, so the application thread is never blocked. Synchronous export (SimpleSpanProcessor) blocks every request thread on every export, and throughput collapses under load.
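The queue-plus-background-worker design is simple enough to sketch with the standard library. The class below is a toy stand-in, not the SDK's BatchSpanProcessor: the class and method names are hypothetical, the defaults (2048, 512, 5 s) are assumed to mirror commonly documented SDK defaults, and it omits details such as exporting early when a batch fills and export timeouts.

```python
import queue
import threading
from typing import Callable, List


class SketchBatchProcessor:
    """Toy batching processor: bounded queue, drop on full, background export."""

    def __init__(self, export: Callable[[List[object]], None],
                 max_queue_size: int = 2048,
                 max_batch_size: int = 512,
                 schedule_delay_s: float = 5.0) -> None:
        self._export = export
        self._queue: "queue.Queue[object]" = queue.Queue(maxsize=max_queue_size)
        self._max_batch_size = max_batch_size
        self._schedule_delay_s = schedule_delay_s
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self.dropped_spans = 0

    def start(self) -> None:
        self._worker.start()

    def on_end(self, span: object) -> None:
        """Called on the application thread; never waits for an export."""
        try:
            self._queue.put_nowait(span)
        except queue.Full:
            with self._lock:
                self.dropped_spans += 1  # drop, but leave a trace of the drop

    def export_once(self) -> int:
        """Pull up to one batch off the queue, export it, return its size."""
        batch: List[object] = []
        while len(batch) < self._max_batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        if batch:
            self._export(batch)
        return len(batch)

    def _run(self) -> None:
        # Wake up every schedule_delay_s and drain whatever has accumulated.
        while not self._stop.wait(self._schedule_delay_s):
            while self.export_once():
                pass

    def shutdown(self) -> None:
        self._stop.set()
        if self._worker.is_alive():
            self._worker.join()
        while self.export_once():  # final flush of anything still queued
            pass
```

A quick way to see the drop behavior is to create the processor with a tiny queue, call `on_end` more times than the queue holds without starting the worker, and inspect `dropped_spans`.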

Back-pressure and benchmarking

Watch otel_sdk_span_processor_dropped_spans (the exact metric name varies by SDK and version). A sustained nonzero rate means you're losing trace data and need to sample harder or grow the queue. To make capacity decisions, benchmark your own workload: baseline without instrumentation, then enable it at the intended sample rate, then drive 2–3× peak load to find the breakpoint. Plan capacity with +5–20% CPU headroom for Java, low single-digit for Go.
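"Sustained nonzero rate" can be made precise with a few lines of arithmetic over successive counter readings. This is a sketch with hypothetical helper names; the three-window default is an arbitrary choice for illustration, not a recommendation from any SDK.

```python
from typing import List, Sequence


def drop_rates(counter_samples: Sequence[float], interval_s: float) -> List[float]:
    """Per-second drop rate between consecutive readings of a monotonic counter."""
    rates = []
    for prev, curr in zip(counter_samples, counter_samples[1:]):
        # A reading lower than the previous one means the process restarted
        # and the counter began again from zero.
        delta = curr - prev if curr >= prev else curr
        rates.append(delta / interval_s)
    return rates


def sustained_drops(counter_samples: Sequence[float], interval_s: float,
                    windows: int = 3) -> bool:
    """True when the last `windows` intervals all show a nonzero drop rate."""
    rates = drop_rates(counter_samples, interval_s)
    return len(rates) >= windows and all(r > 0 for r in rates[-windows:])


readings = [0, 0, 30, 90, 180]  # counter value sampled every 30 s
print(drop_rates(readings, 30))
print("sustained:", sustained_drops(readings, 30))
```

A single burst of drops during a traffic spike is usually tolerable; it is the run of consecutive nonzero windows that signals the queue or sample rate needs attention.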

flowchart LR
    App["Application thread<br/>span.end()"]
    App -->|enqueue<br/>non-blocking| Queue["In-memory queue<br/>max_queue_size"]
    Queue -->|"drop on full<br/>(dropped_spans counter)"| DropPath[Span dropped]
    Queue -->|background worker<br/>pulls batches| Batcher["Batcher<br/>batch_size /<br/>schedule_delay"]
    Batcher -->|OTLP gRPC/HTTP| Exp[Exporter]
    Exp --> Col[OTel Collector]
    Col --> Backend["Backend<br/>(Tempo / Mimir / Loki)"]

Key Takeaways — Performance Overhead

4. Cost-aware Architecture

Sampling, cardinality, and overhead controls are tactical. The strategic layer is architecture: where in the pipeline you do what work, how telemetry flows from edge to backend, and how long each tier retains data. Think cold-chain logistics for telemetry: fresh data needs fast/expensive storage, ageing data shifts to progressively cheaper tiers.

Metric aggregation at the edge

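The further telemetry travels before it is reduced, the more expensive it becomes. Do reduction as close to the source as possible: first in the SDK (views), then in a node-local DaemonSet Collector (relabeling and sampling), then in a regional Collector tier (tail sampling and cardinality enforcement).

Each tier reduces volume by 2–10×.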

flowchart LR
    subgraph Apps["App pods (many)"]
        A1[App + SDK]
        A2[App + SDK]
        A3[App + SDK]
    end
    subgraph Node["Node-level Collector<br/>(DaemonSet)"]
        N1[Relabel + sample]
    end
    subgraph Regional["Regional Collector tier<br/>(StatefulSet)"]
        R1["Tail sampling<br/>cardinality enforcement"]
    end
    subgraph Backend["Observability backend<br/>(Mimir / Tempo / Loki)"]
        Hot["Hot tier<br/>SSD, 2–24 h"]
        Warm["Warm tier<br/>S3, 7–30 d"]
        Cold["Cold tier<br/>Glacier, 90 d – 1 y"]
    end
    A1 --> N1
    A2 --> N1
    A3 --> N1
    N1 -->|2–10x reduction| R1
    R1 -->|2–10x reduction| Hot
    Hot -->|age out| Warm
    Warm -->|age out| Cold
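Because the reductions compound, two Collector tiers at 2–10× each give somewhere between 4× and 100× overall. A tiny sketch of that arithmetic, using a hypothetical helper and an invented input rate of one million items per second:

```python
from typing import Sequence


def volume_after_tiers(input_per_sec: float, reductions: Sequence[float]) -> float:
    """Volume left after passing through tiers that each divide it by a factor."""
    volume = input_per_sec
    for factor in reductions:
        volume /= factor
    return volume


raw = 1_000_000  # hypothetical items/sec leaving the app pods
print("best case :", volume_after_tiers(raw, [10, 10]))  # node tier, regional tier
print("worst case:", volume_after_tiers(raw, [2, 2]))
```

The spread between the two cases is why it pays to measure the actual reduction at each tier rather than assume the midpoint.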

Log volume reduction

Tiered storage and retention

| Tier | Latency | Cost | Typical use |
| --- | --- | --- | --- |
| Hot (memory / SSD) | ms | High | Last 2–24 h, live alerting, on-call |
| Warm (object storage) | seconds | Medium | 1–30 days, recent incident review |
| Cold (archive / Glacier) | minutes–hours | Very low | 30 d – years, compliance, trends |

Reasonable default retention by pillar:

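| Pillar | Hot | Warm | Cold |
| --- | --- | --- | --- |
| Metrics (raw) | 2 h | 15 d | |
| Metrics (recording rules) | 24 h | 90 d | 1 y |
| Traces (sampled) | 24 h | 7 d | |
| Traces (interesting, tail-sampled) | 24 h | 30 d | 90 d |
| Logs (ERROR+) | 7 d | 30 d | 1 y |
| Logs (INFO/DEBUG, sampled) | 24 h | 7 d | |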
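One way to make these defaults operational is to encode them as data and ask which tier a piece of telemetry belongs in at a given age. The sketch below rests on two assumptions that the table does not state outright: rows with only two values are read as hot and warm with no cold tier, and each value is treated as a cumulative age limit. The key names and the `tier_for` function are hypothetical.

```python
from datetime import timedelta
from typing import Dict, List, Tuple

# (tier, maximum age) pairs in order; data older than the last limit expires.
RETENTION: Dict[str, List[Tuple[str, timedelta]]] = {
    "metrics_raw": [("hot", timedelta(hours=2)), ("warm", timedelta(days=15))],
    "metrics_recording_rules": [("hot", timedelta(hours=24)),
                                ("warm", timedelta(days=90)),
                                ("cold", timedelta(days=365))],
    "traces_sampled": [("hot", timedelta(hours=24)), ("warm", timedelta(days=7))],
    "traces_interesting": [("hot", timedelta(hours=24)),
                           ("warm", timedelta(days=30)),
                           ("cold", timedelta(days=90))],
    "logs_error": [("hot", timedelta(days=7)),
                   ("warm", timedelta(days=30)),
                   ("cold", timedelta(days=365))],
    "logs_info_debug": [("hot", timedelta(hours=24)), ("warm", timedelta(days=7))],
}


def tier_for(pillar: str, age: timedelta) -> str:
    """Return the tier that should hold data of this age, or 'expired'."""
    for tier, max_age in RETENTION[pillar]:
        if age <= max_age:
            return tier
    return "expired"


print(tier_for("metrics_raw", timedelta(hours=1)))
print(tier_for("traces_interesting", timedelta(days=45)))
print(tier_for("logs_info_debug", timedelta(days=10)))
```

Keeping the policy in one table like this makes it easy to review in code review and to diff when retention requirements change.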

The architectural lever that ties this together is back-pressure handling. Every component must have a documented behavior under overload — drop, buffer, downsample, or block. The right default is “drop with a metric” so the application stays healthy and operators see the overload via clear telemetry-about-telemetry counters.

Key Takeaways — Cost-aware Architecture

Half 2 — Post-Reading Check

Answer the same five questions from the Half 2 pre-reading check again, then compare your scores with your first attempt.
