Study Guide: Chapter 11 — Sampling, Performance, and Cost Control

Observability is one of those engineering disciplines where the cure can become as expensive as the disease: the default reflex is to collect everything, and the bill follows. This chapter teaches the opposite reflex: deliberately keep the right telemetry, drop the rest, and do it in the right place at the right cost.

Half 1 — Pre-Reading Check

Answer the following five questions before reading sections 1 and 2. You'll answer the same questions again after reading to measure your gain.

Pre-Reading Quiz · Half 1

1. Where in the OpenTelemetry pipeline does head-based sampling make its keep/drop decision?

In the backend (Tempo/Jaeger) after spans are stored

In the Collector after buffering spans by trace ID

In the SDK at the root span, before spans are recorded or exported

At the load balancer in front of the service

2. Why is TraceIdRatioBased(p) typically wrapped in ParentBased?

To make the sampler honor the parent's sample decision so traces are coherent across services

To increase the sample rate at child spans for better coverage

To allow tail-based decisions to override the head rate

To convert a probabilistic sampler into an adaptive one

3. A team is debating capturing every 500 error. Which strategy is best suited to that goal?

Head-based TraceIdRatioBased(0.01) — cheap and fast

Tail-based sampling at the Collector with a policy keyed on error=true

AlwaysOff sampling — errors are caught by metrics anyway

Adaptive head sampling driven by upstream load

4. Which label on an http_requests_total metric is the most dangerous for cardinality?

method (GET, POST, PUT, DELETE)

status_code (200, 400, 500, …)

user_id (one value per registered user)

service (one value per deployed microservice)

5. What is the right place in a Prometheus pipeline to drop a noisy label like request_id?

In the alerting rule that uses the metric

In metric_relabel_configs at scrape time, before the sample is stored

In Grafana panel queries with without(...)

After the fact via promtool tsdb delete-series only

1. Sampling Strategies

Sampling is the single most powerful lever you have for controlling telemetry cost and overhead. It is also the most misunderstood. The trick is to sample in a way that preserves what matters — errors, slow traces, unusual tenants — while discarding the redundant majority.

Head-based: ParentBased, TraceIdRatio, AlwaysOn/Off

Head-based sampling makes the keep-or-drop decision at the first service that creates the root span — typically the SDK in your front-door API. Because the decision is made up front, downstream services never have to record, store, or ship the spans of a dropped trace. This is the cheapest possible form of sampling.

The canonical head sampler in OTel is TraceIdRatioBased. It hashes the trace ID to produce a deterministic sample with probability p. Wrapped in ParentBased, child spans honor the parent's decision, ensuring coherent distributed traces:

ParentBased(root=TraceIdRatioBased(0.1))

Extreme samplers exist too: AlwaysOn for dev/staging and AlwaysOff as a kill switch.
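To make the "deterministic sample with probability p" idea concrete, here is a minimal sketch in plain Python. It is not the OpenTelemetry SDK's implementation: the function names `ratio_sample` and `parent_based` and the choice to compare the low 64 bits of the trace ID are assumptions made for illustration.

```python
from typing import Optional

_MASK_64 = (1 << 64) - 1  # assumption: only the low 64 bits take part in the decision


def ratio_sample(trace_id: int, p: float) -> bool:
    """Keep the trace when the low 64 bits of its ID fall below p * 2**64.

    The decision depends only on the trace ID, so every service that sees
    the same ID reaches the same answer.
    """
    threshold = int(p * (1 << 64))
    return (trace_id & _MASK_64) < threshold


def parent_based(trace_id: int, parent_sampled: Optional[bool], p: float) -> bool:
    """Follow the parent's decision if there is one; otherwise decide at the root."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sample(trace_id, p)
```

A child span calls `parent_based` with the sampled flag it received from upstream, so a trace is either kept everywhere or dropped everywhere.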

Animation A · Head vs Tail Sampling

Head sampling drops traces at the SDK before any export work. Tail sampling lets all spans flow to the Collector buffer, waits a decision window, then keeps interesting ones (errors, slow traces) and drops the rest.

Tail-based at the Collector

Tail-based sampling lives in the OTel Collector's tail_sampling processor: SDKs ship every span, the Collector buffers spans by trace ID, waits a decision_wait window (often 5–30 s) for the trace to complete, evaluates policies (error, latency, tenant), then flushes the keepers. The cost is memory:

buffered_bytes ≈ traces/sec × spans/trace × span_size × wait_seconds

At 10,000 traces/sec × 20 spans/trace × 1 KB × 10 s = 2 GB of in-memory spans. Tail sampling also adds 5–30 s of observability latency, which hurts if SREs depend on live traces for incident detection.
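The memory estimate can be checked with a few lines of Python. This sketch reads the first factor as traces per second, which is the reading under which the quoted 2 GB figure works out, and assumes 1 KB means 1,000 bytes; the function name is hypothetical.

```python
def buffered_bytes(traces_per_sec, spans_per_trace, span_size_bytes, wait_seconds):
    """Approximate bytes held in the Collector while traces wait for a decision."""
    return traces_per_sec * spans_per_trace * span_size_bytes * wait_seconds


estimate = buffered_bytes(10_000, 20, 1_000, 10)
print(f"{estimate / 1e9:.1f} GB")  # prints "2.0 GB"
```

Halving `wait_seconds` halves the buffer, which is why the decision window is usually the first thing to tune.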

Hybrid head + tail sampling

The common production pattern: light head sample (e.g. ParentBased(TraceIdRatioBased(0.2))) plus tail sampling that further refines the 20% to keep errors and slow traces. Caveat: anything the head sampler drops is gone forever — including errors that lost the dice roll.

flowchart TD
    Start["Trace begins<br/>at SDK"]
    Start --> Head{"Head sampler<br/>ParentBased<br/>TraceIdRatio 20%"}
    Head -->|80% drop| Drop1["Discard at SDK<br/>no spans recorded"]
    Head -->|20% keep| Export["Export spans<br/>to Collector"]
    Export --> Buffer["Buffer by trace_id<br/>decision_wait window"]
    Buffer --> Tail{"Tail sampling<br/>policies"}
    Tail -->|error=true| Keep1[Persist to backend]
    Tail -->|"latency > 3s"| Keep2[Persist to backend]
    Tail -->|tenant=enterprise| Keep3[Persist to backend]
    Tail -->|10% random| Keep4[Persist to backend]
    Tail -->|none matched| Drop2[Discard at Collector]
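The tail stage of the flow above can be sketched as a function that inspects a fully buffered trace and returns the first policy that matches. This is an illustration only, not the Collector's code: the `Span` shape is hypothetical, the longest span stands in for trace latency, and the random policy is replaced by a deterministic bucket so the example is reproducible.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    trace_id: int
    duration_ms: float
    error: bool = False
    attributes: dict = field(default_factory=dict)


def tail_decision(spans: List[Span], latency_ms: float = 3000,
                  random_percent: int = 10) -> Optional[str]:
    """Return the name of the first matching policy, or None to discard.

    Assumes `spans` is the non-empty, complete set of spans for one trace.
    """
    if any(s.error for s in spans):
        return "error"
    if max(s.duration_ms for s in spans) > latency_ms:
        return "latency"
    if any(s.attributes.get("tenant") == "enterprise" for s in spans):
        return "tenant"
    # Stand-in for the random policy: a deterministic bucket from the trace ID.
    if spans[0].trace_id % 100 < random_percent:
        return "random"
    return None
```

Because the function sees every span of the trace, an error in the last hop is enough to keep the whole trace, which a head sampler cannot do.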

Comparison: When to use which

Dimension	Head `TraceIdRatioBased`	Tail `tail_sampling`
Where decided	SDK, at first span	Collector, after buffering
Decision timing	Immediate, pre-export	After `decision_wait` (5–30 s)
Memory cost	Very low	High; scales with `spans/sec × wait`
Network cost	Low (drops never leave service)	High (all spans cross the wire)
Accuracy for rare events	Poor (random drops)	Excellent (sees full trace)
Best for	High-QPS, cost-sensitive APIs	Rare-error capture, targeted debugging

2. Cardinality Management

If sampling controls trace cost, cardinality controls metric cost. In a Prometheus-style TSDB the unit of storage is not the metric — it's the unique combination of metric name plus label set. Each unique combination is one time series, with its own memory footprint and storage row.

Add a user_id label to http_requests_total{method, status} with 100,000 users and the math turns multiplicative very quickly.

Animation B · Cardinality Explosion

Cardinality is multiplicative, not additive. Adding one unbounded label like user_id can take a metric from a few thousand to a hundred million series.
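A quick way to see the multiplication is to compute the upper bound on series as the product of each label's distinct values. The counts for `method` and `status` below are illustrative assumptions; only the 100,000 users figure comes from the text.

```python
from math import prod


def max_series(label_cardinalities: dict) -> int:
    """Upper bound on series: the product of each label's distinct values."""
    return prod(label_cardinalities.values())


before = max_series({"method": 4, "status": 5})
after = max_series({"method": 4, "status": 5, "user_id": 100_000})
print(before, after)  # prints "20 2000000"
```

The real count is the number of combinations actually observed, so this is a ceiling rather than a forecast, but an unbounded label moves that ceiling by orders of magnitude.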

Identifying high-cardinality labels

The most useful PromQL forensic query is the “top-N by series count”:

topk(20, count by (__name__)({__name__=~".+"}))

For deeper analysis use promtool tsdb analyze, watch the meta-metrics prometheus_tsdb_head_series and prometheus_tsdb_head_series_created_total, and (for Mimir users) the Cardinality Explorer UI — you can even fail CI builds when a new metric crosses, say, 50,000 series.
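One way a CI gate like this might work is to scrape the service's metrics endpoint in a test environment and count series per metric name. The sketch below parses the Prometheus text exposition format with plain string handling; the function names and the idea of feeding it a captured scrape are assumptions, and it does not handle every corner of the format.

```python
from collections import Counter


def series_per_metric(exposition: str) -> Counter:
    """Count distinct series (metric name plus label set) per metric name."""
    counts = Counter()
    seen = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        if "}" in line:
            series = line[: line.rindex("}") + 1]  # name{labels}
        else:
            series = line.split()[0]  # bare metric name
        if series in seen:
            continue
        seen.add(series)
        counts[series.split("{", 1)[0]] += 1
    return counts


def over_budget(counts: Counter, limit: int = 50_000) -> list:
    """Names of metrics whose series count exceeds the limit."""
    return sorted(name for name, n in counts.items() if n > limit)
```

A build step could fail when `over_budget(...)` returns a non-empty list.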

Allow/deny lists and attribute drops

metric_relabel_configs:
  # labeldrop matches its regex against label names and takes no source_labels
  - regex: "uri"
    action: labeldrop

  - regex: "user_id|session_id|request_id"
    action: labeldrop

  - source_labels: [path]
    regex: "/api/v1/users/[0-9]+/(.*)"
    target_label: path
    replacement: "/api/v1/users/:id/$1"
    action: replace

That last rule — collapsing /users/12345/orders into /users/:id/orders — is one of the most useful tricks in the playbook.
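The effect of the path rule can be reproduced in Python to test a pattern before deploying it. Prometheus anchors relabel regexes at both ends, which `re.fullmatch` mimics here; the helper name `normalize_path` is hypothetical.

```python
import re

# Same pattern as the relabel rule; fullmatch mimics Prometheus's implicit anchoring.
USER_PATH = re.compile(r"/api/v1/users/[0-9]+/(.*)")


def normalize_path(path: str) -> str:
    """Collapse the numeric user ID into a fixed ':id' placeholder."""
    match = USER_PATH.fullmatch(path)
    if match is None:
        return path  # rule does not apply; label is left unchanged
    return "/api/v1/users/:id/" + match.group(1)


print(normalize_path("/api/v1/users/12345/orders"))  # prints "/api/v1/users/:id/orders"
```

Note that a path with no trailing segment, such as `/api/v1/users/12345`, does not match this pattern and passes through unchanged, so it may need a second rule.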

Aggregating away unbounded dimensions

groups:
  - name: myapp_aggregates
    interval: 30s
    rules:
      - record: myapp:http_request_duration_seconds_bucket:service
        expr: sum by (service, le) (myapp_http_request_duration_seconds_bucket)

Keep the high-cardinality raw metric at short retention (2 h) for live debugging, and the rolled-up recording rule for 90 days at a tiny fraction of the storage. Long-term: never emit unbounded labels (user_id, session_id, email) on metrics — those belong on traces or logs.
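What `sum by (service, le)` does to the series count can be sketched in Python: every label other than `service` and `le` is discarded and the values that now share a key are added together. The sample data and function name are illustrative assumptions.

```python
from collections import defaultdict


def sum_by_service_le(samples):
    """samples: iterable of (labels_dict, value). Returns {(service, le): total}."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[(labels["service"], labels["le"])] += value
    return dict(totals)


raw = [
    ({"service": "orders", "le": "0.5", "user_id": "1"}, 3),
    ({"service": "orders", "le": "0.5", "user_id": "2"}, 4),
    ({"service": "orders", "le": "+Inf", "user_id": "1"}, 5),
]
rolled = sum_by_service_le(raw)  # three raw series collapse into two
```

The rolled-up result no longer grows with the number of users, which is what makes long retention affordable.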

flowchart TD
    Raw["Raw exposition<br/>http_requests_total{method, status,<br/>user_id, request_id, path}<br/>~1,000,000 series"]
    Raw -->|"metric_relabel_configs:<br/>labeldrop user_id, request_id<br/>normalize /users/:id/"| Mid
    Mid["After scrape-time relabel<br/>http_requests_total{method, status, route}<br/>~50,000 series"]
    Mid -->|"recording rule:<br/>sum by (service, le)"| Roll
    Roll["Recording-rule rollup<br/>myapp:http_request_duration:service<br/>~200 series"]

Half 1 — Post-Reading Check

Same five questions, after the reading. Compare your scores when you reveal answers below.

Post-Reading Quiz · Half 1

1. Where in the OpenTelemetry pipeline does head-based sampling make its keep/drop decision?

In the backend (Tempo/Jaeger) after spans are stored

In the Collector after buffering spans by trace ID

In the SDK at the root span, before spans are recorded or exported

At the load balancer in front of the service

2. Why is TraceIdRatioBased(p) typically wrapped in ParentBased?

To make the sampler honor the parent's sample decision so traces are coherent across services

To increase the sample rate at child spans for better coverage

To allow tail-based decisions to override the head rate

To convert a probabilistic sampler into an adaptive one

3. A team is debating capturing every 500 error. Which strategy is best suited to that goal?

Head-based TraceIdRatioBased(0.01) — cheap and fast

Tail-based sampling at the Collector with a policy keyed on error=true

AlwaysOff sampling — errors are caught by metrics anyway

Adaptive head sampling driven by upstream load

4. Which label on an http_requests_total metric is the most dangerous for cardinality?

method (GET, POST, PUT, DELETE)

status_code (200, 400, 500, …)

user_id (one value per registered user)

service (one value per deployed microservice)

5. What is the right place in a Prometheus pipeline to drop a noisy label like request_id?

In the alerting rule that uses the metric

In metric_relabel_configs at scrape time, before the sample is stored

In Grafana panel queries with without(...)

After the fact via promtool tsdb delete-series only

Half 2 — Pre-Reading Check

Pre-Reading Quiz · Half 2

6. For a typical well-tuned JVM microservice, what is the expected steady-state CPU overhead of the OpenTelemetry Java agent?

Less than 0.1% — essentially free

1–5% additional CPU

25–50% additional CPU

Over 75% — never use in production

7. Why does production code almost always use BatchSpanProcessor instead of SimpleSpanProcessor?

BatchSpanProcessor encrypts spans before export

SimpleSpanProcessor exports spans synchronously and blocks the application thread on every export

SimpleSpanProcessor cannot handle Java spans

BatchSpanProcessor automatically adjusts the sampling rate

8. The BatchSpanProcessor queue is full. What is the correct default behavior, and what should an operator do?

Block the application thread; nothing to monitor

Drop spans and increment otel_sdk_span_processor_dropped_spans; monitor that counter and sample harder or grow the queue

Spill to disk silently

Crash the process to surface the problem

9. Which architectural pattern best matches “edge aggregation” in a cost-aware observability pipeline?

Send raw spans directly from every app pod to the central backend

Reduce volume as early as possible — SDK views, node-local DaemonSet collector, then regional tier

Keep all data in hot SSD storage forever

Disable instrumentation entirely until an incident occurs

10. In a tiered observability storage model, which mapping is closest to the recommended defaults?

Hot = months on SSD; Cold = seconds on Glacier

Hot = 2–24 h on SSD; Warm = 7–30 d on object storage; Cold = 90 d–1 y on archive

All data on RAM for one hour, then deleted

There is no benefit to tiering — one storage class is enough

3. Performance Overhead

The third lever is the cost of generating telemetry — the CPU and memory the instrumentation itself consumes inside your application. Usually small if sampling and cardinality are under control, but worth measuring.

CPU and memory cost of SDK instrumentation

Java auto-instrumentation (the OTel Java Agent) is the heaviest common case because it uses bytecode instrumentation. The Elastic EDOT Java benchmark on a sample JVM service reports:

Metric	No agent	With EDOT Java	Relative impact
Startup time	5.55 s	6.82 s	~+23% (+1.3 s)
p95 request latency	1.96 ms	2.06 ms	~+5%
Total system CPU	53.82%	54.25%	~+0.8% relative (+0.43 pp absolute)

Expect 1–5% additional CPU for most well-tuned JVM services and tens of MB of extra heap. The worst case from community benchmarks is up to ~20% CPU and ≥0.5 ms latency per instrumented hop at high QPS without sampling — driven mostly by GC pressure.
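The relative-impact figures in the benchmark table can be recomputed from the raw numbers. This small sketch also shows why the CPU row should be read carefully: the change is about 0.8% relative to the baseline but only 0.43 percentage points in absolute terms. The helper name is hypothetical.

```python
def relative_change(before: float, after: float) -> float:
    """Percent change of `after` relative to `before`."""
    return (after - before) / before * 100


startup = relative_change(5.55, 6.82)     # seconds
latency = relative_change(1.96, 2.06)     # milliseconds, p95
cpu_rel = relative_change(53.82, 54.25)   # percent of system CPU
cpu_abs = 54.25 - 53.82                   # percentage points

print(f"{startup:.1f}% {latency:.1f}% {cpu_rel:.1f}% {cpu_abs:.2f} pp")
# prints "22.9% 5.1% 0.8% 0.43 pp"
```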

Go instrumentation is typically library-based and compiled into the binary, so there is no agent startup penalty and no class-loading hit. Typical Go OTel overhead is low single-digit CPU percent with a few to tens of MB of additional RSS.

Batching, async export, and buffer tuning

The BatchSpanProcessor (BSP) decouples span generation from span export. Spans go into an in-memory queue, a background thread pulls them in batches, and the exporter ships them. Tuning knobs:

Knob	Larger value	Smaller value
Batch size	Better CPU/network efficiency, more memory	Lower memory, more frequent exports
Schedule delay	Fewer export calls, spans linger	Lower loss risk, more even export
Max queue size	Survives bigger spikes without dropping	Smaller footprint, drops earlier
Sampling rate	More traces shipped, higher overhead	Cheaper, less coverage

Animation C · Async vs Synchronous Export

Async export (BatchSpanProcessor) enqueues spans into a ring buffer and exports in background batches — the application thread is never blocked. Synchronous export (SimpleSpanProcessor) blocks every request thread on every export.
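The queue-and-batch behavior can be sketched with Python's standard `queue` module. This is a simplified model, not the SDK's processor: the class name, the `export` callback, and the `dropped_spans` attribute are assumptions, and a real processor would call `export_once` from a background thread every schedule delay instead of on demand.

```python
import queue


class BatchProcessorSketch:
    def __init__(self, export, max_queue_size=2048, max_batch_size=512):
        self._queue = queue.Queue(maxsize=max_queue_size)
        self._export = export              # callable that receives a list of spans
        self._max_batch_size = max_batch_size
        self.dropped_spans = 0             # telemetry about telemetry

    def on_end(self, span) -> None:
        """Called on the application thread; never blocks."""
        try:
            self._queue.put_nowait(span)
        except queue.Full:
            self.dropped_spans += 1        # drop with a metric instead of blocking

    def export_once(self) -> int:
        """Drain up to one batch and hand it to the exporter."""
        batch = []
        while len(batch) < self._max_batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        if batch:
            self._export(batch)
        return len(batch)
```

With a queue of 3 and five spans ended before any export runs, two spans are dropped and counted, which is the signal an operator would alert on.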

Back-pressure and benchmarking

Watch otel_sdk_span_processor_dropped_spans — a sustained nonzero rate means you're losing trace data and need to sample harder or grow the queue. To make capacity decisions, benchmark your own workload: baseline without instrumentation, then enable it at the intended sample rate, then drive 2–3× peak load to find the breakpoint. Plan capacity with +5–20% CPU headroom for Java, low single-digit for Go.

flowchart LR
    App["Application thread<br/>span.end()"]
    App -->|"enqueue<br/>non-blocking"| Queue["In-memory queue<br/>max_queue_size"]
    Queue -->|"drop on full<br/>(dropped_spans counter)"| DropPath[Span dropped]
    Queue -->|"background worker<br/>pulls batches"| Batcher["Batcher<br/>batch_size /<br/>schedule_delay"]
    Batcher -->|OTLP gRPC/HTTP| Exp[Exporter]
    Exp --> Col[OTel Collector]
    Col --> Backend["Backend<br/>(Tempo / Mimir / Loki)"]

4. Cost-aware Architecture

Sampling, cardinality, and overhead controls are tactical. The strategic layer is architecture: where in the pipeline you do what work, how telemetry flows from edge to backend, and how long each tier retains data. Think cold-chain logistics for telemetry: fresh data needs fast/expensive storage, ageing data shifts to progressively cheaper tiers.

Metric aggregation at the edge

The further telemetry travels before it is reduced, the more expensive it becomes. Do reduction as close to the source as possible:

In the application: SDK views aggregate buckets, drop attributes, convert deltas before export.
In a sidecar or node agent: per-node OTel Collector / Prometheus Agent does first-round relabel and sample.
In a regional collector tier: consolidates across nodes, applies tail sampling, forwards to central backend.

Each tier reduces volume by 2–10×.
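Because the reductions compound, the end-to-end effect is the product of the per-tier factors. A minimal sketch, assuming three tiers and treating the 2–10× range as the bounds for each; the function name is hypothetical.

```python
from math import prod


def remaining_fraction(reduction_factors) -> float:
    """Fraction of the original volume that survives every tier."""
    return 1 / prod(reduction_factors)


worst = remaining_fraction([2, 2, 2])      # 0.125, an 8x reduction overall
best = remaining_fraction([10, 10, 10])    # 0.001, a 1000x reduction overall
```

Even at the low end of the range, three tiers leave only one eighth of the original volume to store centrally.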

flowchart LR
    subgraph Apps["App pods (many)"]
        A1[App + SDK]
        A2[App + SDK]
        A3[App + SDK]
    end
    subgraph Node["Node-level Collector<br/>(DaemonSet)"]
        N1[Relabel + sample]
    end
    subgraph Regional["Regional Collector tier<br/>(StatefulSet)"]
        R1["Tail sampling<br/>cardinality enforcement"]
    end
    subgraph Backend["Observability backend<br/>(Mimir / Tempo / Loki)"]
        Hot["Hot tier<br/>SSD, 2–24 h"]
        Warm["Warm tier<br/>S3, 7–30 d"]
        Cold["Cold tier<br/>Glacier, 90 d – 1 y"]
    end
    A1 --> N1
    A2 --> N1
    A3 --> N1
    N1 -->|2–10x reduction| R1
    R1 -->|2–10x reduction| Hot
    Hot -->|age out| Warm
    Warm -->|age out| Cold

Log volume reduction

Structured logging only — JSON, so the pipeline can filter and route without regex.
Severity-based routing — DEBUG/INFO short retention, ERROR+ to always-on alerting.
Sample DEBUG/INFO in production at 1–10%, just like traces.
Drop or hash unbounded fields — deduplicate stack traces by hash.
Convert logs to metrics — db_timeout_total{service="orders"} beats ten thousand log lines.
Suppress duplicate spam with rate-limiting processors.
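Two of the items above, severity-based routing and sampling of low-severity logs, can be combined in a small filter. This sketch assumes JSON log lines with a `level` field and uses a simple 1-in-N counter; the class name and the set of always-kept levels are illustrative choices, not a standard.

```python
import json

ALWAYS_KEEP = {"ERROR", "FATAL", "CRITICAL"}  # assumption: these bypass sampling


class LogSampler:
    def __init__(self, keep_one_in: int = 10):
        self.keep_one_in = keep_one_in
        self._low_severity_seen = 0

    def keep(self, line: str) -> bool:
        """Always keep ERROR and above; keep 1 in N of everything else."""
        record = json.loads(line)
        if str(record.get("level", "")).upper() in ALWAYS_KEEP:
            return True
        self._low_severity_seen += 1
        return (self._low_severity_seen - 1) % self.keep_one_in == 0
```

With `keep_one_in=10`, twenty INFO lines yield two kept lines, while every ERROR line passes through.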

Tiered storage and retention

Tier	Latency	Cost	Typical use
Hot (memory / SSD)	ms	High	Last 2–24 h, live alerting, on-call
Warm (object storage)	seconds	Medium	7–30 days, recent incident review
Cold (archive / Glacier)	minutes–hours	Very low	90 d – 1 y or longer, compliance, trends

Reasonable default retention by pillar:

Pillar	Hot	Warm	Cold
Metrics (raw)	2 h	15 d	—
Metrics (recording rules)	24 h	90 d	1 y
Traces (sampled)	24 h	7 d	—
Traces (interesting, tail-sampled)	24 h	30 d	90 d
Logs (ERROR+)	7 d	30 d	1 y
Logs (INFO/DEBUG, sampled)	24 h	7 d	—
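One way to read the retention table is as a set of age limits per pillar. The sketch below assumes each column is the maximum age at which data still lives in that tier, and that a dash means the tier is skipped; both are interpretations, and the function name is hypothetical.

```python
from typing import Optional


def tier_for_age(age_hours: float, hot_hours: float,
                 warm_days: Optional[float], cold_days: Optional[float]) -> str:
    """Map a data point's age to the tier that should hold it."""
    if age_hours <= hot_hours:
        return "hot"
    if warm_days is not None and age_hours <= warm_days * 24:
        return "warm"
    if cold_days is not None and age_hours <= cold_days * 24:
        return "cold"
    return "expired"


# "Traces (interesting, tail-sampled)" row: hot 24 h, warm 30 d, cold 90 d
print(tier_for_age(10 * 24, 24, 30, 90))  # prints "warm"
```

A policy table expressed this way can be unit-tested, which helps keep retention settings and documentation in agreement.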

The architectural lever that ties this together is back-pressure handling. Every component must have a documented behavior under overload — drop, buffer, downsample, or block. The right default is “drop with a metric” so the application stays healthy and operators see the overload via clear telemetry-about-telemetry counters.

Half 2 — Post-Reading Check

Post-Reading Quiz · Half 2