Compose a Collector pipeline from receivers, processors, exporters, and extensions
Apply transform, filter, batch, and tail_sampling processors to control cost and shape data
Operate the Collector reliably with health checks, queuing, and back-pressure tuning
Pre-Study Assessment
1. Which Collector component is the contract that actually wires receivers, processors, and exporters into runnable pipelines?
The receivers top-level block
The service.pipelines block
The extensions block
The connectors block
2. Which processor should always be listed first in a Collector pipeline?
batch
transform
memory_limiter
k8sattributes
3. What is a connector in the OpenTelemetry Collector?
An extension that exposes Prometheus metrics on port 8888
A processor that drops spans matching a predicate
A hybrid component that acts as an exporter on one pipeline and a receiver on another
A YAML anchor used to share config across pipelines
4. Why is the batch processor recommended in essentially every production pipeline?
It enriches spans with Kubernetes pod metadata
It groups telemetry into larger payloads, dramatically reducing per-call overhead at the exporter
It applies tail-sampling policies based on latency and error status
It encrypts data in flight to the backend
5. For tail_sampling policies to actually have spans to evaluate, the SDK must export traces using which sampler?
An aggressive traceidratiobased(0.01) head sampler
always_off — let the Collector decide everything
always_on (or parentbased(always_on)) so every span reaches the Collector
No sampler is needed; tail sampling reconstructs dropped spans
6. Which receiver is most commonly used to ingest container logs from /var/log/pods on a Kubernetes node?
otlp
prometheus
hostmetrics
filelog
7. Which exporter is the standard choice for fanning metrics out into the Prometheus / Mimir / Cortex / Thanos ecosystem?
loki
otlp
prometheusremotewrite
debug
8. What is the role of the file_storage extension on a gateway Collector?
It stores Collector binary releases for rolling upgrades
It backs the exporter sending_queue with disk so accepted telemetry survives restarts
It writes a debug log of every span to a flat file
It mounts a ConfigMap into the Collector pod
9. Which extension exposes a live, in-process view of pipelines and exporter queues for incident triage?
health_check
pprof
zpages
file_storage
10. In the recommended two-tier topology, where does tail_sampling belong?
On the DaemonSet agent, close to each application pod
In the SDK, before spans ever leave the application
On the centralized gateway Deployment, where it can buffer whole traces
On the backend (Tempo), not in the Collector
Section 1: Pipeline Architecture
The Collector is not a single black box — it is a configurable pipeline engine built from four kinds of components, plus optional extensions. Every piece of telemetry flowing through takes the same conceptual journey: it enters through a receiver, traverses a chain of processors, and leaves through one or more exporters. Pipelines are declared per signal type (traces, metrics, logs), and the service block is what actually wires components together into runnable pipelines.
The four component types (plus extensions)
| Component | Role | Examples |
| --- | --- | --- |
| Receiver | Accepts data in (push) or pulls from a source | otlp, prometheus, hostmetrics, filelog, kafka |
| Processor | Mutates, filters, batches, samples, enriches in flight | memory_limiter, batch, transform, tail_sampling |
| Exporter | Sends data to one or more backends | otlp, prometheusremotewrite, loki, debug |
| Connector | Joins two pipelines: exporter on one side, receiver on the other | spanmetrics, routing, forward |
Connectors are the cleanest way to derive one signal from another — for example, generating RED metrics (Rate, Errors, Duration) from spans via a spanmetrics connector that exits a traces pipeline and re-enters a separate metrics pipeline.
A component defined in receivers, processors, or exporters but not referenced under service.pipelines is silently ignored. This is the most common source of "my config does nothing" surprises.
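As a minimal sketch of that wiring, the configuration below defines a single traces pipeline. The backend endpoint is a placeholder, and the debug exporter is there only to illustrate the point above: it is defined but never referenced under service.pipelines, so it would not run.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 800
    spike_limit_mib: 200
  batch:

exporters:
  otlp:
    endpoint: tempo.example.internal:4317  # hypothetical backend address
    tls:
      insecure: true                       # assumption: plaintext inside the cluster
  debug:
    verbosity: basic                       # defined, but not referenced below

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]                    # debug is absent here, so it never runs
```

Adding debug to the exporters list of the traces pipeline is all it takes to activate it.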
Processor order is decisive
Processors run in the order listed. This is one of the most consequential, and most commonly overlooked, properties of Collector configuration:
memory_limiter always first — back-pressure kicks in before later, more expensive processors waste CPU
Enrichment processors next (e.g., k8sattributes, resource) so downstream filters see full context
Filter / sampling next — drop unwanted data before transforms touch it
transform / scrubbing — reshape what is left
batch last — coalesce into large outbound batches just before the exporter
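The ordering above might translate into a pipeline definition like the following sketch for an agent. The instance name filter/healthz is illustrative, and every processor listed is assumed to be configured in the processors block of the same file.

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors:
        - memory_limiter   # 1. refuse data early under memory pressure
        - k8sattributes    # 2. enrich so later stages see full context
        - filter/healthz   # 3. drop unwanted spans before spending CPU on them
        - transform        # 4. reshape and scrub what is left
        - batch            # 5. coalesce just before export
      exporters: [otlp]
```

The list form and the inline form ([memory_limiter, batch]) are equivalent YAML; the list form just leaves room for the per-step comments.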
Figure A — Collector pipeline conveyor belt (data packet journey)
Section 1 Takeaway
A Collector pipeline is a typed chain of receivers, processors, and exporters wired together in service.pipelines
Component order, not just choice, defines behavior — memory_limiter first, enrichment, filter/sample, transform, batch last
Components not referenced under service.pipelines are silently ignored
Connectors (e.g., spanmetrics) cleanly derive one signal from another inside the same Collector
Section 2: Key Processors
memory_limiter + batch — the mandatory pair
memory_limiter samples Collector memory on a fixed interval and, when usage crosses configured thresholds, refuses new data by returning errors to receivers. That refusal is what creates back-pressure: upstream senders see failures, retry, and slow down — instead of the Collector dying from an out-of-memory kill.
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 800          # ~80% of container memory limit
    spike_limit_mib: 200    # tolerance for short bursts
  batch:
    timeout: 5s
    send_batch_size: 512
    send_batch_max_size: 4096
Analogy: batch is a hotel shuttle that waits up to five minutes (or until full) before driving to the airport — far more efficient than calling a taxi for every guest. memory_limiter is the bouncer at the lobby door who turns guests away when the lobby is full, so the building never collapses.
transform with OTTL
For richer mutations — conditional logic, regex substitution, cross-field arithmetic — reach for the transform processor, which uses the OpenTelemetry Transformation Language (OTTL). OTTL statements look like set(target, value) where <boolean> and run inside a context (span, metric, datapoint, log, resource, or scope).
processors:
  transform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # Collapse user IDs in URL paths so cardinality stays bounded
          - replace_pattern(attributes["http.target"], "/users/[0-9]+", "/users/:id") where attributes["http.target"] != nil
          # Remove PII before exporting
          - delete_key(attributes, "user.email")
          - delete_key(attributes, "user.id")
          # Whitelist what is allowed to leave
          - keep_keys(attributes, ["http.method", "http.target", "http.status_code", "service.name"])
          # Mark anything from the checkout service
          - set(attributes["env"], "prod") where resource.attributes["service.name"] == "checkout-service"
Setting error_mode: ignore matters: the default is propagate, which returns the error up the pipeline and can fail an entire batch when a single statement errors. A close cousin, filter, uses OTTL conditions to drop data outright (e.g., dropping /healthz spans).
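A filter processor for the /healthz case could look like the sketch below. The instance name filter/healthz and the choice of http.target as the attribute to match are assumptions; the attribute your instrumentation actually emits may differ.

```yaml
processors:
  filter/healthz:
    error_mode: ignore
    traces:
      span:
        # Any span for which a condition evaluates to true is dropped
        - 'attributes["http.target"] == "/healthz"'
```

Unlike transform, the entries here are bare OTTL conditions rather than full statements: a match means drop.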
tail_sampling vs probabilistic_sampler
The probabilistic_sampler is cheap and stateless: it picks (say) 5% based on trace ID, but it cannot prefer error or slow traces. The tail_sampling processor is fundamentally different: it buffers the spans of each trace, keyed by trace ID, and decides keep/drop once decision_wait has elapsed since the trace's first span arrived. It has no way of knowing that every span has arrived, so decision_wait must be longer than the slowest trace you care about. Because it sees the whole buffered trace, it can sample on end-to-end latency, final status, or attributes that appear only on a leaf span.
Crucial gotcha: tail sampling only works if SDKs export every span, with no head sampling (always_on or parentbased(always_on)). If the SDK already dropped spans, no policy can resurrect them.
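A sketch of a tail_sampling configuration that keeps all error traces, all slow traces, and a small random baseline of everything else. The policy names, the 1000 ms threshold, and the 5% baseline are illustrative values, not recommendations from this guide.

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # A trace is kept if any policy below decides to sample it
      - name: error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

On the SDK side, setting the standard OTEL_TRACES_SAMPLER environment variable to parentbased_always_on is one way to make sure every span is exported for these policies to evaluate.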
sequenceDiagram
participant SDK as SDK always_on
participant Col as Collector tail_sampling
participant Buf as Trace buffer
participant Pol as Policy evaluator
participant BE as Backend Tempo
SDK->>Col: Span A trace T1 root
Col->>Buf: Buffer T1 spans
SDK->>Col: Span B trace T1 child
Col->>Buf: Buffer T1 spans
SDK->>Col: Span C trace T1 error
Col->>Buf: Buffer T1 spans
Note over Col,Buf: Wait decision_wait 10s
Col->>Pol: Evaluate composite policies
Pol->>Pol: error-traces matches? YES
Pol-->>Col: KEEP trace T1
Col->>BE: Export all T1 spans
Head vs tail sampling at a glance
| Aspect | Head / parent-based sampler | tail_sampling processor |
| --- | --- | --- |
| Decision point | Root span start | After decision_wait in Collector |
| Sees full trace | No | Yes |
| Can prefer errors / slow traces | No | Yes |
| SDK overhead | Low (drops at source) | High (must export everything) |
| Collector memory & CPU | Minimal | Substantial (buffers spans) |
k8sattributes — Kubernetes enrichment
The k8sattributes processor watches the Kubernetes API and decorates telemetry with metadata about the sending pod (namespace, deployment, node, labels). It identifies the sender either by inbound connection IP or by an explicit k8s.pod.ip resource attribute. Run it on the agent (DaemonSet) rather than on a central gateway: by connection IP, the gateway sees only the agent's address, not the application pod's. Limit extract.metadata to fields you actually query on; each extra field adds cardinality and API-server load.
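A sketch of an agent-side k8sattributes configuration along those lines. The K8S_NODE_NAME environment variable is an assumption (it would have to be injected into the Collector pod, for example through the Kubernetes downward API), and the metadata list is just an example of a deliberately short selection.

```yaml
processors:
  k8sattributes:
    auth_type: serviceAccount
    filter:
      # Only watch pods scheduled on this agent's node
      node_from_env_var: K8S_NODE_NAME
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.pod.name
        - k8s.node.name
    pod_association:
      # Prefer an explicit pod IP if the SDK set one, else use the connection IP
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
      - sources:
          - from: connection
```

Restricting the watch to the local node keeps each agent's view of the API small, which is part of why the DaemonSet placement scales well.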
Figure B — Tail sampling buffers a trace, then evaluates policies
Section 2 Takeaway
memory_limiter and batch are non-negotiable in production — the bouncer and the shuttle
transform (OTTL statements) and filter (OTTL conditions) reshape and drop data; set error_mode: ignore to avoid losing batches on one bad statement
tail_sampling buffers spans by trace ID and decides per-trace after decision_wait; it only works if SDKs export every span (no head sampling)
k8sattributes belongs on the DaemonSet agent where connection-based pod association still works
Section 3: Key Receivers and Exporters
Workhorse receivers
| Receiver | What it ingests | Typical use |
| --- | --- | --- |
| otlp | OTLP gRPC and OTLP HTTP (traces, metrics, logs) | Default for SDK and Collector-to-Collector traffic |
| prometheus | Scrapes Prometheus /metrics endpoints | Migration from Prometheus; scraping exporters |
| hostmetrics | OS-level CPU, memory, disk, network, filesystem, process | |
The prometheus receiver is worth a special mention: it accepts native Prometheus scrape config, so an existing prometheus.yml can be lifted into the Collector almost verbatim — a powerful migration path.
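To illustrate the lift-and-shift, here is a sketch in which a scrape job that could have come from a prometheus.yml sits under the receiver's config key. The job name and target are hypothetical.

```yaml
receivers:
  prometheus:
    config:
      # Everything under "config" follows Prometheus's own scrape configuration format
      scrape_configs:
        - job_name: node-exporter          # hypothetical job
          scrape_interval: 30s
          static_configs:
            - targets: ["localhost:9100"]  # hypothetical target

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite]   # assumed to be defined under exporters
```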
Workhorse exporters
| Exporter | Destination | Notes |
| --- | --- | --- |
| otlp | Any OTLP-compatible backend (Tempo, Jaeger, vendors) | |
Enabling resource_to_telemetry_conversion (enabled: true) on prometheusremotewrite promotes OTLP resource attributes (like service.name, k8s.pod.name) into Prometheus labels so they become queryable in PromQL.
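A minimal sketch of that exporter setting; the endpoint URL is a placeholder for whatever remote-write endpoint your backend exposes.

```yaml
exporters:
  prometheusremotewrite:
    endpoint: https://mimir.example.internal/api/v1/push  # hypothetical URL
    resource_to_telemetry_conversion:
      enabled: true   # copy resource attributes onto every series as labels
```

Every promoted attribute becomes a label on every series, so this interacts directly with cardinality: it pairs best with a deliberately small set of resource attributes.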
Connectors — the inter-pipeline glue
A connector behaves as an exporter on one pipeline and a receiver on another. The canonical example is spanmetrics: it consumes spans and emits aggregated RED metrics — derived signals produced once, near the source, rather than re-derived in each backend.
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s]
    dimensions:
      - name: http.method
      - name: http.status_code

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [spanmetrics, otlp/tempo]   # spanmetrics is an EXPORTER here
    metrics/spans:
      receivers: [spanmetrics]               # ... and a RECEIVER here
      processors: [batch]
      exporters: [prometheusremotewrite]
Other useful connectors include routing (split traffic by attribute) and forward (chain pipelines together).
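As a sketch of the simplest case, the forward connector below chains an ingest pipeline into an export pipeline. The pipeline names traces/ingest and traces/export are illustrative, and the otlp receiver, the processors, and the otlp/tempo exporter are assumed to be defined elsewhere in the same file.

```yaml
connectors:
  forward:

service:
  pipelines:
    traces/ingest:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [forward]        # hand off to the next pipeline
    traces/export:
      receivers: [forward]        # pick up where traces/ingest left off
      processors: [batch]
      exporters: [otlp/tempo]
```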
Section 3 Takeaway
otlp + prometheus receivers cover most ingest; otlp + prometheusremotewrite + loki exporters cover most egress
Connectors like spanmetrics derive one signal from another inside the same Collector — appearing as exporter in one pipeline and receiver in another
resource_to_telemetry_conversion on remote-write surfaces OTLP resource attributes as PromQL-queryable labels
Section 4: Reliability and Operations
Persistent queue and retry on failure
Most exporters support a sending_queue and retry_on_failure. By default the sending queue is in memory: fast, but lost on restart. Pairing it with the file_storage extension makes the queue durable, so an OOM kill or rolling deployment doesn't drop telemetry that has already been accepted from upstream.
queue_size: 1,000-5,000 batches per exporter (each batch holds thousands of records)
num_consumers: 5-10 typical; raise only if the backend can handle more parallelism
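retry_on_failure: exponential backoff with initial_interval: 5s, max_interval: 60s; set max_elapsed_time: 0 (retry indefinitely) for critical data. When the queue fills, new batches are rejected rather than old ones evicted, and memory_limiter refuses data at the receivers so that upstream senders retry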
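Put together, a durable queue with those tuning values might be configured as in the sketch below. The storage directory and backend endpoint are placeholders; the directory is assumed to exist already and to sit on a volume that survives pod restarts.

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/queue   # hypothetical path on a persistent volume

exporters:
  otlp:
    endpoint: tempo.example.internal:4317  # hypothetical backend address
    sending_queue:
      enabled: true
      storage: file_storage   # back the queue with the extension above
      queue_size: 5000
      num_consumers: 10
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 60s
      max_elapsed_time: 0     # 0 = keep retrying, never give up on a batch

service:
  extensions: [file_storage]  # extensions must be enabled here to take effect
```

Like pipeline components, an extension that is defined but not listed under service.extensions does nothing, so the queue would quietly fall back to not having its storage available.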
zpages is especially useful during incidents — per-component counters and sampled recent traces come directly from the Collector's process, so you can answer "is data flowing? is anything dropping?" without leaving the cluster.
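A sketch of enabling the diagnostic extensions side by side. The ports shown are the conventional ones for each extension; binding zpages and pprof to localhost is an assumption that you would reach them through port-forwarding rather than exposing them on the pod network.

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133   # target for liveness/readiness probes
  zpages:
    endpoint: localhost:55679 # browse /debug/pipelinez and /debug/tracez
  pprof:
    endpoint: localhost:1777  # Go profiling endpoint

service:
  extensions: [health_check, zpages, pprof]
```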
The memory_limiter should target 70-80% of the container memory limit, with spike_limit_mib covering the largest plausible batch. Agents are not HPA-scaled (they scale with node count via DaemonSet); the gateway runs an HPA on CPU at 60-70% target utilization with minReplicas: 2 for graceful scaling and HA.
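The gateway autoscaling described above could be expressed with a manifest like this sketch. The Deployment name otel-gateway and the maxReplicas value are assumptions; the CPU target sits in the middle of the 60-70% range.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-gateway            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-gateway          # hypothetical gateway Deployment
  minReplicas: 2                # HA floor, as recommended above
  maxReplicas: 10               # assumption: size to your backend's capacity
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
```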
Tail-sampling capacity follows a simple rule of thumb: num_traces ≥ expected_new_traces_per_sec × decision_wait × 2. At 2,000 traces/sec and a 10s decision_wait, that is num_traces: 40000 minimum — round up to 50,000 for headroom.
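The same arithmetic, written out next to the settings it drives. This is only a sketch of the sizing from the example above; the single policy is included so the fragment is a complete processor definition, and its name is illustrative.

```yaml
processors:
  tail_sampling:
    # Sizing: 2,000 new traces/sec x 10s decision_wait x 2 = 40,000 traces in flight
    decision_wait: 10s
    expected_new_traces_per_sec: 2000
    num_traces: 50000           # 40,000 minimum, rounded up for headroom
    policies:
      - name: error-traces
        type: status_code
        status_code:
          status_codes: [ERROR]
```

Doubling decision_wait doubles the required num_traces, and with it the memory the gateway needs, which is why the two values should be tuned together.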
Monitor the Collector with… itself: every Collector exposes its internal metrics on port 8888. Alert on:
Memory near limit_mib for sustained periods
Sustained growth of otelcol_exporter_queue_size
Non-zero otelcol_processor_refused_* or otelcol_exporter_send_failed_* counters
HPA at maxReplicas while queues continue to grow
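Some of these conditions can be sketched as Prometheus alerting rules. The alert names, windows, and thresholds are illustrative, and the exact metric names are an assumption: depending on the Collector version, counters may be exposed with a _total suffix, so check the :8888 output before relying on them.

```yaml
groups:
  - name: otel-collector
    rules:
      - alert: OtelCollectorRefusingSpans
        # memory_limiter (or another processor) is pushing back on receivers
        expr: rate(otelcol_processor_refused_spans[5m]) > 0
        for: 5m
      - alert: OtelCollectorExportFailures
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 5m
      - alert: OtelCollectorQueueGrowing
        # Positive slope on the queue gauge, sustained for 15 minutes
        expr: deriv(otelcol_exporter_queue_size[15m]) > 0
        for: 15m
```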
Figure C — Back-pressure: exporter stalls, memory_limiter refuses, recovery
Section 4 Takeaway
Pair sending_queue with the file_storage extension so accepted telemetry survives restarts
Run a two-tier topology: DaemonSet agent (local concerns, k8sattributes) and Deployment gateway (tail sampling, transforms, durable queues, HPA on CPU)
Size memory_limiter at 70-80% of the pod memory limit; num_traces ≥ new_traces/sec × decision_wait × 2
Use the Collector's own :8888 metrics as the operational source of truth — alert on queue growth and refused/failed counters
Post-Study Assessment
1. Which Collector component is the contract that actually wires receivers, processors, and exporters into runnable pipelines?
The receivers top-level block
The service.pipelines block
The extensions block
The connectors block
2. Which processor should always be listed first in a Collector pipeline?
batch
transform
memory_limiter
k8sattributes
3. What is a connector in the OpenTelemetry Collector?
An extension that exposes Prometheus metrics on port 8888
A processor that drops spans matching a predicate
A hybrid component that acts as an exporter on one pipeline and a receiver on another
A YAML anchor used to share config across pipelines
4. Why is the batch processor recommended in essentially every production pipeline?
It enriches spans with Kubernetes pod metadata
It groups telemetry into larger payloads, dramatically reducing per-call overhead at the exporter
It applies tail-sampling policies based on latency and error status
It encrypts data in flight to the backend
5. For tail_sampling policies to actually have spans to evaluate, the SDK must export traces using which sampler?
An aggressive traceidratiobased(0.01) head sampler
always_off — let the Collector decide everything
always_on (or parentbased(always_on)) so every span reaches the Collector
No sampler is needed; tail sampling reconstructs dropped spans
6. Which receiver is most commonly used to ingest container logs from /var/log/pods on a Kubernetes node?
otlp
prometheus
hostmetrics
filelog
7. Which exporter is the standard choice for fanning metrics out into the Prometheus / Mimir / Cortex / Thanos ecosystem?
loki
otlp
prometheusremotewrite
debug
8. What is the role of the file_storage extension on a gateway Collector?
It stores Collector binary releases for rolling upgrades
It backs the exporter sending_queue with disk so accepted telemetry survives restarts
It writes a debug log of every span to a flat file
It mounts a ConfigMap into the Collector pod
9. Which extension exposes a live, in-process view of pipelines and exporter queues for incident triage?
health_check
pprof
zpages
file_storage
10. In the recommended two-tier topology, where does tail_sampling belong?
On the DaemonSet agent, close to each application pod
In the SDK, before spans ever leave the application
On the centralized gateway Deployment, where it can buffer whole traces