Chapter 11: Sampling, Performance, and Cost Control
Observability is one of those engineering disciplines where the cure can become as expensive as the
disease. The default reflex is to collect everything; this chapter teaches the opposite: deliberately
keep the right telemetry, drop the rest, and do it in the right place at the right cost.
Learning Objectives
Compare head-based and tail-based sampling and pick the right strategy per workload.
Tune cardinality and ingestion rates to control observability spend.
Measure and reduce the performance overhead of instrumentation in production services.
Design tiered storage and retention policies that match telemetry value to cost.
Half 1 — Pre-Reading Check
Answer the following five questions before reading sections 1 and 2. You'll
answer the same questions again after reading to measure your gain.
Pre-Reading Quiz · Half 1
1. Where in the OpenTelemetry pipeline does head-based sampling make its keep/drop decision?
In the backend (Tempo/Jaeger) after spans are stored
In the Collector after buffering spans by trace ID
In the SDK at the root span, before spans are recorded or exported
At the load balancer in front of the service
2. Why is TraceIdRatioBased(p) typically wrapped in ParentBased?
To make the sampler honor the parent's sample decision so traces are coherent across services
To increase the sample rate at child spans for better coverage
To allow tail-based decisions to override the head rate
To convert a probabilistic sampler into an adaptive one
3. A team is debating capturing every 500 error. Which strategy is best suited to that goal?
Head-based TraceIdRatioBased(0.01) — cheap and fast
Tail-based sampling at the Collector with a policy keyed on error=true
AlwaysOff sampling — errors are caught by metrics anyway
Adaptive head sampling driven by upstream load
4. Which label on an http_requests_total metric is the most dangerous for cardinality?
method (GET, POST, PUT, DELETE)
status_code (200, 400, 500, …)
user_id (one value per registered user)
service (one value per deployed microservice)
5. What is the right place in a Prometheus pipeline to drop a noisy label like request_id?
In the alerting rule that uses the metric
In metric_relabel_configs at scrape time, before the sample is stored
In Grafana panel queries with without(...)
After the fact via the TSDB admin API (delete_series) only
1. Sampling Strategies
Sampling is the single most powerful lever you have for controlling telemetry cost and overhead.
It is also the most misunderstood. The trick is to sample in a way that preserves what matters
— errors, slow traces, unusual tenants — while discarding the redundant majority.
Head-based sampling makes the keep-or-drop decision at the first
service that creates the root span — typically the SDK in your front-door API. Because the
decision is made up front, downstream services never have to record, store, or ship the spans of
a dropped trace. This is the cheapest possible form of sampling.
The canonical head sampler in OTel is TraceIdRatioBased. It hashes the trace ID to
produce a deterministic sample with probability p. Wrapped in ParentBased,
child spans honor the parent's decision, ensuring coherent distributed traces:
ParentBased(root=TraceIdRatioBased(0.1))
Extreme samplers exist too: AlwaysOn for dev/staging and AlwaysOff as a
kill switch.
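To make the determinism concrete, here is a minimal pure-Python sketch of a ratio sampler with a parent-based wrapper. It is not the OpenTelemetry SDK code: the function names ratio_keep and parent_based_keep are hypothetical, and comparing the low 64 bits of the trace ID against a threshold is an assumed stand-in for whatever deterministic function a given SDK applies.

```python
from typing import Optional

# Assumption for this sketch: only the low 64 bits of the 128-bit trace ID are used.
TRACE_ID_LIMIT = (1 << 64) - 1


def ratio_keep(trace_id: int, ratio: float) -> bool:
    """Deterministic keep/drop: the same trace ID always yields the same answer."""
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


def parent_based_keep(trace_id: int, ratio: float,
                      parent_sampled: Optional[bool]) -> bool:
    """Root spans (no parent) use the ratio; child spans copy the parent's decision."""
    if parent_sampled is None:
        return ratio_keep(trace_id, ratio)
    return parent_sampled
```

Because the decision depends only on the trace ID, every service that evaluates the same ID at the same ratio agrees, and the parent-based wrapper means a child never contradicts a decision that was already made upstream.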
Animation A · Head vs Tail Sampling
Head sampling drops traces at the SDK before any export work. Tail sampling lets all spans
flow to the Collector buffer, waits a decision window, then keeps interesting ones (errors,
slow traces) and drops the rest.
Tail-based at the Collector
Tail-based sampling lives in the OTel Collector's tail_sampling processor: SDKs ship
every span, the Collector buffers spans by trace ID, waits a decision_wait window
(often 5–30 s) for the trace to complete, evaluates policies (error, latency, tenant), then
flushes the keepers. The cost is memory:
At 10,000 traces/sec × 20 spans/trace × 1 KB/span × 10 s = 2 GB of
in-memory spans. Tail sampling also adds 5–30 s of observability latency,
which hurts if SREs depend on live traces for incident detection.
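The memory estimate is simple enough to keep as a helper when sizing a Collector. The sketch below treats the first factor as a trace rate (which is what makes the units work out) and assumes 1 KB means 1,000 bytes; the function name tail_buffer_bytes is hypothetical.

```python
def tail_buffer_bytes(traces_per_sec: float, spans_per_trace: float,
                      bytes_per_span: float, decision_wait_s: float) -> float:
    """Approximate bytes of spans held in memory while traces await a decision."""
    return traces_per_sec * spans_per_trace * bytes_per_span * decision_wait_s


estimate = tail_buffer_bytes(10_000, 20, 1_000, 10)
print(f"{estimate / 1e9:.1f} GB")  # prints "2.0 GB"
```

Tripling decision_wait to 30 s triples the buffer to about 6 GB, so the wait window is as much a memory knob as a latency knob.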
Hybrid head + tail sampling
The common production pattern: light head sample (e.g. ParentBased(TraceIdRatioBased(0.2)))
plus tail sampling that further refines the 20% to keep errors and slow traces. Caveat:
anything the head sampler drops is gone forever — including errors that lost the dice roll.
flowchart TD
Start[Trace begins at SDK]
Start --> Head{Head sampler ParentBased TraceIdRatio 20%}
Head -->|80% drop| Drop1[Discard at SDK no spans recorded]
Head -->|20% keep| Export[Export spans to Collector]
Export --> Buffer[Buffer by trace_id decision_wait window]
Buffer --> Tail{Tail sampling policies}
Tail -->|error=true| Keep1[Persist to backend]
Tail -->|latency > 3s| Keep2[Persist to backend]
Tail -->|tenant=enterprise| Keep3[Persist to backend]
Tail -->|10% random| Keep4[Persist to backend]
Tail -->|none matched| Drop2[Discard at Collector]
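The tail stage of the flowchart can be sketched as an ordered list of checks over a fully buffered trace. This is an illustration of the policy logic only, not the Collector's tail_sampling implementation: the Span class and tail_decision function are hypothetical, and trace latency is approximated here by the longest span duration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    duration_s: float
    error: bool = False
    tenant: str = "standard"


def tail_decision(spans: List[Span], rng: random.Random,
                  latency_threshold_s: float = 3.0,
                  random_keep_ratio: float = 0.10) -> Optional[str]:
    """Return the name of the first matching policy, or None to discard the trace."""
    if not spans:
        return None
    if any(s.error for s in spans):
        return "error"
    # Assumption: the slowest span stands in for end-to-end trace latency.
    if max(s.duration_s for s in spans) > latency_threshold_s:
        return "latency"
    if any(s.tenant == "enterprise" for s in spans):
        return "tenant"
    if rng.random() < random_keep_ratio:
        return "random"
    return None
```

The decision needs the whole trace, which is exactly why the spans have to be buffered for the decision_wait window first.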
Comparison: When to use which
Dimension | Head (TraceIdRatioBased) | Tail (tail_sampling)
Where decided | SDK, at first span | Collector, after buffering
Decision timing | Immediate, pre-export | After decision_wait (5–30 s)
Memory cost | Very low | High; scales with spans/sec × wait
Network cost | Low (drops never leave service) | High (all spans cross the wire)
Accuracy for rare events | Poor (random drops) | Excellent (sees full trace)
Best for | High-QPS, cost-sensitive APIs | Rare-error capture, targeted debugging
Key Takeaways — Sampling
Head sampling = cheap, deterministic, statistically unbiased, but blindly drops rare events.
Tail sampling = expensive in memory and adds 5–30 s of observability latency, but captures errors and outliers reliably.
Hybrid = 20% head + tail policies is the production sweet spot — cap collector memory while still finding interesting traces.
Always wrap a head ratio sampler in ParentBased so children inherit the parent decision and traces stay coherent.
2. Cardinality Management
If sampling controls trace cost, cardinality controls metric cost. In a
Prometheus-style TSDB the unit of storage is not the metric — it's the unique combination
of metric name plus label set. Each unique combination is one time series, with its own
memory footprint and storage row.
Add a user_id label to http_requests_total{method, status} with 100,000
users and the math turns multiplicative very quickly.
Animation B · Cardinality Explosion
Cardinality is multiplicative, not additive. Adding one unbounded label like user_id
can take a metric from a few thousand to a hundred million series.
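A quick worked example shows the multiplication. The per-label value counts below are hypothetical, and the product is an upper bound, since not every combination necessarily occurs in practice.

```python
from math import prod


def series_count(label_cardinalities: dict) -> int:
    """Upper bound on time series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


base = {"method": 4, "status": 10, "route": 25}   # hypothetical value counts
with_user = {**base, "user_id": 100_000}

print(series_count(base))       # 1000
print(series_count(with_user))  # 100000000
```

One extra label multiplied the series count by the number of users; nothing else about the metric changed.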
Identifying high-cardinality labels
The most useful PromQL forensic query is the “top-N by series count”:
topk(20, count by (__name__)({__name__=~".+"}))
For deeper analysis use promtool tsdb analyze, watch the meta-metrics
prometheus_tsdb_head_series and prometheus_tsdb_head_series_created_total,
and (for Mimir users) the Cardinality Explorer UI — you can even fail CI
builds when a new metric crosses, say, 50,000 series.
Path normalization (collapsing /users/12345/orders into
/users/:id/orders) is one of the most useful tricks in the playbook.
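The collapsing step can also be applied in application code, before the label is ever emitted. The sketch below is one possible approach, not a Prometheus feature: normalize_route is a hypothetical helper, and it assumes that identifiers are either all-digit path segments or UUIDs.

```python
import re

# Assumption: IDs are purely numeric segments or canonical UUIDs.
_ID_SEGMENT = re.compile(
    r"/(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})(?=/|$)"
)


def normalize_route(path: str) -> str:
    """Replace ID-like path segments with a fixed placeholder."""
    return _ID_SEGMENT.sub("/:id", path)


print(normalize_route("/users/12345/orders"))  # /users/:id/orders
```

Segments such as /v2 are left alone because they are not entirely numeric, so version prefixes stay distinguishable.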
Aggregating away unbounded dimensions
groups:
  - name: myapp_aggregates
    interval: 30s
    rules:
      - record: myapp:http_request_duration_seconds_bucket:service
        expr: sum by (service, le) (myapp_http_request_duration_seconds_bucket)
Keep the high-cardinality raw metric at short retention (2 h) for live debugging, and the
rolled-up recording rule for 90 days at a tiny fraction of the storage. Long-term: never emit
unbounded labels (user_id, session_id, email) on
metrics — those belong on traces or logs.
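What sum by (service, le) does to the series count can be mimicked in a few lines. This is a toy model of the aggregation, not how Prometheus evaluates rules; the sum_by helper and the sample data are hypothetical.

```python
from collections import defaultdict


def sum_by(samples, keep):
    """Sum values over every label not listed in `keep`.

    samples: iterable of (labels_dict, value) pairs.
    """
    out = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels.get(name, "")) for name in keep)
        out[key] += value
    return dict(out)


raw = [
    ({"service": "orders", "le": "0.5", "user_id": "u1"}, 3.0),
    ({"service": "orders", "le": "0.5", "user_id": "u2"}, 2.0),
    ({"service": "orders", "le": "+Inf", "user_id": "u1"}, 4.0),
    ({"service": "orders", "le": "+Inf", "user_id": "u2"}, 5.0),
]
rolled = sum_by(raw, ("service", "le"))
print(len(raw), "->", len(rolled))  # 4 -> 2
```

The rolled-up result no longer grows with the number of users, which is what makes it cheap to retain for months.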
flowchart TD
Raw["Raw exposition http_requests_total{method, status, user_id, request_id, path} ~1,000,000 series"]
Raw -->|"metric_relabel_configs: labeldrop user_id, request_id normalize /users/:id/"| Mid
Mid["After scrape-time relabel http_requests_total{method, status, route} ~50,000 series"]
Mid -->|"recording rule: sum by (service, le)"| Roll
Roll["Recording-rule rollup myapp:http_request_duration:service ~200 series"]
Key Takeaways — Cardinality
Each unique label-value combination is its own time series — cost is multiplicative.
Never put user_id, session_id, request_id, or raw paths on metrics.
Use topk(20, count by (__name__)(...)) and Mimir's Cardinality Explorer to find offenders.
Drop and normalize at scrape time with metric_relabel_configs; pre-aggregate with recording rules.
Half 1 — Post-Reading Check
Same five questions, after the reading. Compare your scores when you reveal answers below.
Post-Reading Quiz · Half 1
1. Where in the OpenTelemetry pipeline does head-based sampling make its keep/drop decision?
In the backend (Tempo/Jaeger) after spans are stored
In the Collector after buffering spans by trace ID
In the SDK at the root span, before spans are recorded or exported
At the load balancer in front of the service
2. Why is TraceIdRatioBased(p) typically wrapped in ParentBased?
To make the sampler honor the parent's sample decision so traces are coherent across services
To increase the sample rate at child spans for better coverage
To allow tail-based decisions to override the head rate
To convert a probabilistic sampler into an adaptive one
3. A team is debating capturing every 500 error. Which strategy is best suited to that goal?
Head-based TraceIdRatioBased(0.01) — cheap and fast
Tail-based sampling at the Collector with a policy keyed on error=true
AlwaysOff sampling — errors are caught by metrics anyway
Adaptive head sampling driven by upstream load
4. Which label on an http_requests_total metric is the most dangerous for cardinality?
method (GET, POST, PUT, DELETE)
status_code (200, 400, 500, …)
user_id (one value per registered user)
service (one value per deployed microservice)
5. What is the right place in a Prometheus pipeline to drop a noisy label like request_id?
In the alerting rule that uses the metric
In metric_relabel_configs at scrape time, before the sample is stored
In Grafana panel queries with without(...)
After the fact via the TSDB admin API (delete_series) only
Half 2 — Pre-Reading Check
Answer the next five questions before reading sections 3 and 4.
Pre-Reading Quiz · Half 2
6. For a typical well-tuned JVM microservice, what is the expected steady-state CPU overhead of the OpenTelemetry Java agent?
Less than 0.1% — essentially free
1–5% additional CPU
25–50% additional CPU
Over 75% — never use in production
7. Why does production code almost always use BatchSpanProcessor instead of SimpleSpanProcessor?
BatchSpanProcessor encrypts spans before export
SimpleSpanProcessor exports spans synchronously and blocks the application thread on every export
SimpleSpanProcessor cannot handle Java spans
BatchSpanProcessor automatically adjusts the sampling rate
8. The BatchSpanProcessor queue is full. What is the correct default behavior, and what should an operator do?
Block the application thread; nothing to monitor
Drop spans and increment otel_sdk_span_processor_dropped_spans; monitor that counter and sample harder or grow the queue
Spill to disk silently
Crash the process to surface the problem
9. Which architectural pattern best matches “edge aggregation” in a cost-aware observability pipeline?
Send raw spans directly from every app pod to the central backend
Reduce volume as early as possible — SDK views, node-local DaemonSet collector, then regional tier
Keep all data in hot SSD storage forever
Disable instrumentation entirely until an incident occurs
10. In a tiered observability storage model, which mapping is closest to the recommended defaults?
Hot = months on SSD; Cold = seconds on Glacier
Hot = 2–24 h on SSD; Warm = 7–30 d on object storage; Cold = 90 d–1 y on archive
All data on RAM for one hour, then deleted
There is no benefit to tiering — one storage class is enough
3. Performance Overhead
The third lever is the cost of generating telemetry: the CPU and memory the
instrumentation itself consumes inside your application. This cost is usually small if sampling and
cardinality are under control, but it is worth measuring.
CPU and memory cost of SDK instrumentation
Java auto-instrumentation (the OTel Java Agent) is the heaviest common case
because it uses bytecode instrumentation. The Elastic EDOT Java benchmark on a sample JVM
service:
Metric | No agent | With EDOT Java | Relative impact
Startup time | 5.55 s | 6.82 s | ~+23% (+1.3 s)
p95 request latency | 1.96 ms | 2.06 ms | ~+5%
Total system CPU | 53.82% | 54.25% | ~+0.8% relative (+0.43 percentage points)
Expect 1–5% additional CPU for most well-tuned JVM services and
tens of MB of extra heap. The worst case from community benchmarks is up to
~20% CPU and ≥0.5 ms latency per instrumented hop at high QPS without sampling —
driven mostly by GC pressure.
Go instrumentation is library-based and compiled into the binary, so there's no agent startup
penalty and no class-loading hit. Typical Go OTel overhead is low single-digit CPU percent
with a few to tens of MB of additional RSS.
Batching, async export, and buffer tuning
The BatchSpanProcessor (BSP) decouples span generation from span export. Spans go
into an in-memory queue, a background thread pulls them in batches, and the exporter ships
them. Tuning knobs:
Knob | Larger value | Smaller value
Batch size | Better CPU/network efficiency, more memory | Lower memory, more frequent exports
Schedule delay | Fewer export calls, spans linger | Lower loss risk, more even export
Max queue size | Survives bigger spikes without dropping | Smaller footprint, drops earlier
Sampling rate | More traces shipped, higher overhead | Cheaper, less coverage
Animation C · Async vs Synchronous Export
Async export (BatchSpanProcessor) enqueues spans into a ring buffer and exports in
background batches — the application thread is never blocked. Synchronous export
(SimpleSpanProcessor) blocks every request thread on every export.
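The queue-and-batch mechanics can be sketched without any SDK. The class below is a hypothetical, simplified model of a batching processor, not the OpenTelemetry BatchSpanProcessor: spans are plain strings, the drop counter is a bare integer (a real implementation would guard it with a lock or use an atomic), and the scheduling thread is left out.

```python
import queue
from typing import Callable, List


class TinyBatchProcessor:
    def __init__(self, export: Callable[[List[str]], None],
                 max_queue_size: int = 2048, max_batch_size: int = 512):
        self._queue: "queue.Queue[str]" = queue.Queue(maxsize=max_queue_size)
        self._export = export
        self._max_batch_size = max_batch_size
        self.dropped = 0  # telemetry about telemetry: spans lost to a full queue

    def on_end(self, span: str) -> None:
        """Runs on the application thread: never blocks, drops when the queue is full."""
        try:
            self._queue.put_nowait(span)
        except queue.Full:
            self.dropped += 1

    def export_once(self) -> int:
        """Meant for a background thread: drain up to one batch and export it."""
        batch: List[str] = []
        while len(batch) < self._max_batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        if batch:
            self._export(batch)
        return len(batch)
```

In a real service a daemon thread would call export_once on a fixed schedule delay; the important property is that on_end only ever enqueues or drops, so a slow exporter cannot stall request handling.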
Back-pressure and benchmarking
Watch otel_sdk_span_processor_dropped_spans (the exact counter name varies by SDK and version): a
sustained nonzero rate means you're losing trace data and need to sample harder or grow the queue. To make capacity
decisions, benchmark your own workload: baseline without instrumentation, then enable
it at the intended sample rate, then drive 2–3× peak load to find the breakpoint.
Plan capacity with +5–20% CPU headroom for Java, low single-digit for Go.
Key Takeaways — Performance Overhead
Typical: 1–5% CPU and tens of MB of memory; Java agents at the higher end (worst case ~20%), Go SDKs at the lower end.
Sampling rate is the most powerful overhead knob; batching configuration is second.
Always use the async BatchSpanProcessor in production — never SimpleSpanProcessor.
Monitor dropped-spans counters; benchmark on your own workload at 2–3× expected peak.
4. Cost-aware Architecture
Sampling, cardinality, and overhead controls are tactical. The strategic layer is
architecture: where in the pipeline you do what work, how telemetry flows from edge to
backend, and how long each tier retains data. Think cold-chain logistics for telemetry: fresh
data needs fast/expensive storage, ageing data shifts to progressively cheaper tiers.
Metric aggregation at the edge
The further telemetry travels before it is reduced, the more expensive it becomes. Do reduction
as close to the source as possible:
In the application: SDK views aggregate buckets, drop attributes, convert deltas before export.
In a sidecar or node agent: per-node OTel Collector / Prometheus Agent does first-round relabel and sample.
In a regional collector tier: consolidates across nodes, applies tail sampling, forwards to central backend.
Each tier reduces volume by 2–10×.
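Because the reductions compound, three tiers at 2–10× each span a wide range. The sketch below just does the arithmetic; volume_after_tiers is a hypothetical helper and the starting rate of 1,000,000 data points per second is an illustrative figure.

```python
from typing import List


def volume_after_tiers(raw_per_sec: float, reduction_factors: List[float]) -> List[float]:
    """Volume entering the pipeline, followed by the volume leaving each tier."""
    volumes = [raw_per_sec]
    for factor in reduction_factors:
        volumes.append(volumes[-1] / factor)
    return volumes


print(volume_after_tiers(1_000_000, [2, 2, 2]))     # ends at 125000.0 (8x less)
print(volume_after_tiers(1_000_000, [10, 10, 10]))  # ends at 1000.0 (1000x less)
```

The gap between 8× and 1,000× is why it pays to measure the actual reduction at each tier rather than assume it.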
flowchart LR
subgraph Apps["App pods (many)"]
A1[App + SDK]
A2[App + SDK]
A3[App + SDK]
end
subgraph Node["Node-level Collector (DaemonSet)"]
N1[Relabel + sample]
end
subgraph Regional["Regional Collector tier (StatefulSet)"]
R1["Tail sampling cardinality enforcement"]
end
subgraph Backend["Observability backend (Mimir / Tempo / Loki)"]
Hot["Hot tier SSD, 2–24 h"]
Warm["Warm tier S3, 7–30 d"]
Cold["Cold tier Glacier, 90 d – 1 y"]
end
A1 --> N1
A2 --> N1
A3 --> N1
N1 -->|2–10x reduction| R1
R1 -->|2–10x reduction| Hot
Hot -->|age out| Warm
Warm -->|age out| Cold
Log volume reduction
Structured logging only — JSON, so the pipeline can filter and route without regex.
Severity-based routing — DEBUG/INFO short retention, ERROR+ to always-on alerting.
Sample DEBUG/INFO in production at 1–10%, just like traces.
Drop or hash unbounded fields — deduplicate stack traces by hash.
Convert logs to metrics — db_timeout_total{service="orders"} beats ten thousand log lines.
Suppress duplicate spam with rate-limiting processors.
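Two of the levers above, severity-based sampling and stack-trace deduplication, can be sketched with hashing alone. The names keep_log, stack_fingerprint, and SAMPLE_RATES are hypothetical, as are the specific rates; keying the sample on a request or trace identifier is an assumed design choice so that all lines from one request are kept or dropped together.

```python
import hashlib

# Hypothetical per-severity keep rates; levels not listed are always kept.
SAMPLE_RATES = {"DEBUG": 0.01, "INFO": 0.10}


def keep_log(level: str, sample_key: str) -> bool:
    """Deterministically keep a fraction of low-severity logs, keyed on sample_key."""
    rate = SAMPLE_RATES.get(level, 1.0)
    digest = hashlib.sha256(sample_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < rate * 10_000


def stack_fingerprint(stack_trace: str) -> str:
    """Short stable hash so repeated stack traces can be stored once and counted."""
    return hashlib.sha256(stack_trace.encode("utf-8")).hexdigest()[:16]
```

A pipeline could then emit the fingerprint on every occurrence and ship the full stack trace only the first time a fingerprint is seen.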
Tiered storage and retention
Tier | Latency | Cost | Typical use
Hot (memory / SSD) | ms | High | Last 2–24 h, live alerting, on-call
Warm (object storage) | seconds | Medium | 1–30 days, recent incident review
Cold (archive / Glacier) | minutes–hours | Very low | 30 d – years, compliance, trends
Reasonable default retention by pillar:
Pillar | Hot | Warm | Cold
Metrics (raw) | 2 h | 15 d | n/a
Metrics (recording rules) | 24 h | 90 d | 1 y
Traces (sampled) | 24 h | 7 d | n/a
Traces (interesting, tail-sampled) | 24 h | 30 d | 90 d
Logs (ERROR+) | 7 d | 30 d | 1 y
Logs (INFO/DEBUG, sampled) | 24 h | 7 d | n/a
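The retention defaults can be encoded as data and queried by age. This sketch makes one interpretive assumption that the table does not state: each value is read as a cumulative age limit (for example, raw metrics live in hot storage until they are 2 h old and in warm storage until 15 d). The names RETENTION_HOURS and tier_for are hypothetical.

```python
from typing import Optional

HOUR, DAY, YEAR = 1.0, 24.0, 24.0 * 365

# (hot, warm, cold) age limits in hours; None means the tier is not used.
RETENTION_HOURS = {
    "metrics_raw": (2 * HOUR, 15 * DAY, None),
    "metrics_recording_rules": (24 * HOUR, 90 * DAY, 1 * YEAR),
    "traces_sampled": (24 * HOUR, 7 * DAY, None),
    "traces_tail_sampled": (24 * HOUR, 30 * DAY, 90 * DAY),
    "logs_error": (7 * DAY, 30 * DAY, 1 * YEAR),
    "logs_info_debug": (24 * HOUR, 7 * DAY, None),
}


def tier_for(pillar: str, age_hours: float) -> Optional[str]:
    """Return the tier holding data of this age, or None once it has expired."""
    for name, limit in zip(("hot", "warm", "cold"), RETENTION_HOURS[pillar]):
        if limit is not None and age_hours <= limit:
            return name
    return None
```

Keeping the policy as a table in code makes it easy to review retention changes the same way as any other configuration change.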
The architectural lever that ties this together is back-pressure handling.
Every component must have a documented behavior under overload — drop, buffer,
downsample, or block. The right default is “drop with a metric” so the application
stays healthy and operators see the overload via clear telemetry-about-telemetry counters.
Key Takeaways — Cost-aware Architecture
Edge aggregation: reduce as early as possible — SDK views, node DaemonSet, regional tier.
Tiered storage: hot (ms, hours), warm (s, days), cold (min, months) — match retention to query frequency.
Log levers: structured-only, severity routing, sampling, convert logs to metrics where possible.
Back-pressure: every stage must have a documented drop/buffer/block behavior — never let observability stall the app.
Half 2 — Post-Reading Check
Same five questions, now post-reading.
Post-Reading Quiz · Half 2
6. For a typical well-tuned JVM microservice, what is the expected steady-state CPU overhead of the OpenTelemetry Java agent?
Less than 0.1% — essentially free
1–5% additional CPU
25–50% additional CPU
Over 75% — never use in production
7. Why does production code almost always use BatchSpanProcessor instead of SimpleSpanProcessor?
BatchSpanProcessor encrypts spans before export
SimpleSpanProcessor exports spans synchronously and blocks the application thread on every export
SimpleSpanProcessor cannot handle Java spans
BatchSpanProcessor automatically adjusts the sampling rate
8. The BatchSpanProcessor queue is full. What is the correct default behavior, and what should an operator do?
Block the application thread; nothing to monitor
Drop spans and increment otel_sdk_span_processor_dropped_spans; monitor that counter and sample harder or grow the queue
Spill to disk silently
Crash the process to surface the problem
9. Which architectural pattern best matches “edge aggregation” in a cost-aware observability pipeline?
Send raw spans directly from every app pod to the central backend
Reduce volume as early as possible — SDK views, node-local DaemonSet collector, then regional tier
Keep all data in hot SSD storage forever
Disable instrumentation entirely until an incident occurs
10. In a tiered observability storage model, which mapping is closest to the recommended defaults?
Hot = months on SSD; Cold = seconds on Glacier
Hot = 2–24 h on SSD; Warm = 7–30 d on object storage; Cold = 90 d–1 y on archive
All data on RAM for one hour, then deleted
There is no benefit to tiering — one storage class is enough