Study Guide: Chapter 10 — The OpenTelemetry Collector in Depth

Pre-Study Assessment

1. Which Collector component is the contract that actually wires receivers, processors, and exporters into runnable pipelines?

The receivers top-level block

The service.pipelines block

The extensions block

The connectors block

2. Which processor should always be listed first in a Collector pipeline?

batch

transform

memory_limiter

k8sattributes

3. What is a connector in the OpenTelemetry Collector?

An extension that exposes Prometheus metrics on port 8888

A processor that drops spans matching a predicate

A hybrid component that acts as an exporter on one pipeline and a receiver on another

A YAML anchor used to share config across pipelines

4. Why is the batch processor recommended in essentially every production pipeline?

It enriches spans with Kubernetes pod metadata

It groups telemetry into larger payloads, dramatically reducing per-call overhead at the exporter

It applies tail-sampling policies based on latency and error status

It encrypts data in flight to the backend

5. For tail_sampling policies to actually have spans to evaluate, the SDK must export traces using which sampler?

An aggressive traceidratiobased(0.01) head sampler

always_off — let the Collector decide everything

always_on (or parentbased(always_on)) so unsampled spans reach the Collector

No sampler is needed; tail sampling reconstructs dropped spans

6. Which receiver is most commonly used to ingest container logs from /var/log/pods on a Kubernetes node?

otlp

prometheus

hostmetrics

filelog

7. Which exporter is the standard choice for fanning metrics out into the Prometheus / Mimir / Cortex / Thanos ecosystem?

loki

otlp

prometheusremotewrite

debug

8. What is the role of the file_storage extension on a gateway Collector?

It stores Collector binary releases for rolling upgrades

It backs the exporter sending_queue with disk so accepted telemetry survives restarts

It writes a debug log of every span to a flat file

It mounts a ConfigMap into the Collector pod

9. Which extension exposes a live, in-process view of pipelines and exporter queues for incident triage?

health_check

pprof

zpages

file_storage

10. In the recommended two-tier topology, where does tail_sampling belong?

On the DaemonSet agent, close to each application pod

In the SDK, before spans ever leave the application

On the centralized gateway Deployment, where it can buffer whole traces

On the backend (Tempo), not in the Collector

Section 1: Pipeline Architecture

The Collector is not a single black box — it is a configurable pipeline engine built from four kinds of components, plus optional extensions. Every piece of telemetry flowing through takes the same conceptual journey: it enters through a receiver, traverses a chain of processors, and leaves through one or more exporters. Pipelines are declared per signal type (traces, metrics, logs), and the service block is what actually wires components together into runnable pipelines.

The four component types (plus extensions)

Component	Role	Examples
Receiver	Accepts data in (push) or pulls from a source	`otlp`, `prometheus`, `hostmetrics`, `filelog`, `kafka`
Processor	Mutates, filters, batches, samples, enriches in flight	`memory_limiter`, `batch`, `transform`, `tail_sampling`
Exporter	Sends data to one or more backends	`otlp`, `prometheusremotewrite`, `loki`, `debug`
Connector	Joins two pipelines — exporter on one side, receiver on the other	`spanmetrics`, `routing`, `forward`
Extension	Non-pipeline capabilities (health, profiling, debugging)	`health_check`, `pprof`, `zpages`, `file_storage`

Connectors are the cleanest way to derive one signal from another — for example, generating RED metrics (Rate, Errors, Duration) from spans via a spanmetrics connector that exits a traces pipeline and re-enters a separate metrics pipeline.

Mermaid: Canonical pipeline anatomy (Figure 10.0)

flowchart LR R1[OTLP receiver] --> P1[memory_limiter] R2[Prometheus receiver] --> P1 P1 --> P2[k8sattributes / resource] P2 --> P3[filter / tail_sampling] P3 --> P4[transform OTTL] P4 --> P5[batch] P5 --> E1[OTLP exporter] P5 --> E2[prometheusremotewrite] P5 --> E3[debug]

A minimal three-signal config

service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

A component defined in receivers, processors, or exporters but not referenced under service.pipelines is silently ignored. This is the most common source of "my config does nothing" surprises.

Processor order is decisive

Processors run in the order listed. This is one of the most consequential, and most commonly overlooked, properties of Collector configuration:

memory_limiter always first — back-pressure kicks in before later, more expensive processors waste CPU
Enrichment processors next (e.g., k8sattributes, resource) so downstream filters see full context
Filter / sampling next — drop unwanted data before transforms touch it
transform / scrubbing — reshape what is left
batch last — coalesce into large outbound batches just before the exporter

Section 2: Key Processors

`memory_limiter` + `batch` — the mandatory pair

memory_limiter samples Collector memory on a fixed interval and, when usage crosses configured thresholds, refuses new data by returning errors to receivers. That refusal is what creates back-pressure: upstream senders see failures, retry, and slow down — instead of the Collector dying from an out-of-memory kill.

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 800       # ~80% of container memory limit
    spike_limit_mib: 200 # tolerance for short bursts
  batch:
    timeout: 5s
    send_batch_size: 512
    send_batch_max_size: 4096

Analogy: batch is a hotel shuttle that waits up to five minutes (or until full) before driving to the airport — far more efficient than calling a taxi for every guest. memory_limiter is the bouncer at the lobby door who turns guests away when the lobby is full, so the building never collapses.

`transform` with OTTL

For richer mutations — conditional logic, regex substitution, cross-field arithmetic — reach for the transform processor, which uses the OpenTelemetry Transformation Language (OTTL). OTTL statements look like set(target, value) where <boolean> and run inside a context (span, metric, datapoint, log, resource, or scope).

processors:
  transform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # Collapse user IDs in URL paths so cardinality stays bounded
          - replace_pattern(attributes["http.target"], "/users/[0-9]+", "/users/:id") where attributes["http.target"] != nil
          # Remove PII before exporting
          - delete_key(attributes, "user.email")
          - delete_key(attributes, "user.id")
          # Whitelist what is allowed to leave
          - keep_keys(attributes, ["http.method", "http.target", "http.status_code", "service.name"])
          # Mark anything from the checkout service
          - set(attributes["env"], "prod") where resource.attributes["service.name"] == "checkout-service"

error_mode: ignore matters: the default in some versions is propagate, which can fail an entire batch when a single statement errors. A close cousin, filter, uses OTTL conditions to drop data outright (e.g., dropping /healthz spans).

`tail_sampling` vs `probabilistic_sampler`

The probabilistic_sampler is cheap and stateless — it picks (say) 5% based on trace ID, but it cannot prefer error or slow traces. The tail_sampling processor is fundamentally different: it buffers all spans for a trace, keyed by trace ID, and decides keep/drop only after decision_wait seconds or all spans have arrived. Because it sees the whole trace, it can sample on end-to-end latency, final status, or attributes that appear only on a leaf span.

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 2000
    policies:
      - name: main
        type: composite
        composite:
          max_total_spans_per_second: 1000
          policy_order: [error-traces, slow-traces, premium-tenants, baseline]
          sub_policies:
            error-traces:    { type: status_code, status_code: { status_codes: [ERROR] } }
            slow-traces:     { type: latency,      latency:      { threshold_ms: 4000 } }
            premium-tenants: { type: string_attribute, string_attribute: { key: tenant.tier, values: ["gold","platinum"] } }
            baseline:        { type: probabilistic, probabilistic: { sampling_percentage: 1.0 } }

Crucial gotcha: tail sampling only works if SDKs export spans unsampled (always_on or parentbased(always_on)). If the SDK already dropped spans, no policy can resurrect them.

Mermaid: Tail sampling decision flow (Figure 10.2)

sequenceDiagram participant SDK as SDK always_on participant Col as Collector tail_sampling participant Buf as Trace buffer participant Pol as Policy evaluator participant BE as Backend Tempo SDK->>Col: Span A trace T1 root Col->>Buf: Buffer T1 spans SDK->>Col: Span B trace T1 child Col->>Buf: Buffer T1 spans SDK->>Col: Span C trace T1 error Col->>Buf: Buffer T1 spans Note over Col,Buf: Wait decision_wait 10s Col->>Pol: Evaluate composite policies Pol->>Pol: error-traces matches? YES Pol-->>Col: KEEP trace T1 Col->>BE: Export all T1 spans

Head vs tail sampling at a glance

Aspect	Head / parent-based sampler	`tail_sampling` processor
Decision point	Root span start	After `decision_wait` in Collector
Sees full trace	No	Yes
Can prefer errors / slow traces	No	Yes
SDK overhead	Low (drops at source)	High (must export everything)
Collector memory & CPU	Minimal	Substantial (buffers spans)

`k8sattributes` — Kubernetes enrichment

The k8sattributes processor watches the Kubernetes API and decorates telemetry with metadata about the sending pod (namespace, deployment, node, labels). It identifies the sender either by inbound connection IP or by an explicit k8s.pod.ip resource attribute. Run it on the agent (DaemonSet), never on a central gateway — the gateway only sees the agent's IP, not the application pod's. Limit extract.metadata to fields you actually query on; each one multiplies cardinality and API-server load.

Section 3: Key Receivers and Exporters

Workhorse receivers

Receiver	What it ingests	Typical use
`otlp`	OTLP gRPC and OTLP HTTP (traces, metrics, logs)	Default for SDK and Collector-to-Collector traffic
`prometheus`	Scrapes Prometheus `/metrics` endpoints	Migration from Prometheus; scraping exporters
`hostmetrics`	OS-level CPU, memory, disk, network, filesystem, process	Node-agent monitoring
`filelog`	Tails log files with multiline and parser support	Container logs from `/var/log/pods` on the node
`kafka`	OTLP-encoded data from Kafka topics	Decoupling ingest from processing
`jaeger`, `zipkin`	Legacy span formats	Brownfield environments mid-migration

A common DaemonSet receiver block:

receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
  hostmetrics:
    collection_interval: 30s
    scrapers: { cpu: {}, memory: {}, disk: {}, filesystem: {}, network: {}, load: {} }
  filelog:
    include: ["/var/log/pods/*/*/*.log"]
    start_at: end
    operators:
      - type: container

The prometheus receiver is worth a special mention: it accepts native Prometheus scrape config, so an existing prometheus.yml can be lifted into the Collector almost verbatim — a powerful migration path.

Workhorse exporters

Exporter	Destination	Notes
`otlp`	Any OTLP-compatible backend (Tempo, Jaeger, vendors)	Default; gRPC and HTTP
`prometheusremotewrite`	Prometheus, Mimir, Cortex, Thanos	Metrics fan-out into Prometheus ecosystem
`loki`	Grafana Loki	Logs only; attribute-to-label mapping configurable
`debug`	Collector stdout	Replaces the older `logging` exporter
`kafka`	Kafka topic, OTLP-encoded	Pairs with the `kafka` receiver
`file`	Local file (JSON)	Disaster-recovery sink, offline replay

exporters:
  otlp/tempo:
    endpoint: tempo-distributor.observability:4317
    tls: { insecure: true }
    sending_queue: { enabled: true, num_consumers: 10, queue_size: 2000 }
    retry_on_failure: { enabled: true, initial_interval: 5s, max_interval: 60s, max_elapsed_time: 0 }
  prometheusremotewrite:
    endpoint: http://mimir.observability:8080/api/v1/push
    resource_to_telemetry_conversion: { enabled: true }
  loki:
    endpoint: http://loki-gateway/loki/api/v1/push
  debug:
    verbosity: basic

resource_to_telemetry_conversion: true on prometheusremotewrite promotes OTLP resource attributes (like service.name, k8s.pod.name) into Prometheus labels so they become queryable in PromQL.

Connectors — the inter-pipeline glue

A connector behaves as an exporter on one pipeline and a receiver on another. The canonical example is spanmetrics: it consumes spans and emits aggregated RED metrics — derived signals produced once, near the source, rather than re-derived in each backend.

connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s]
    dimensions:
      - name: http.method
      - name: http.status_code

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [spanmetrics, otlp/tempo]      # spanmetrics is an EXPORTER here
    metrics/spans:
      receivers: [spanmetrics]                  # ... and a RECEIVER here
      processors: [batch]
      exporters: [prometheusremotewrite]

Other useful connectors include routing (split traffic by attribute) and forward (chain pipelines together).

Mermaid: Multi-signal pipeline topology (Figure 10.1)

graph TD subgraph Extensions HC[health_check] PP[pprof] ZP[zpages] end subgraph traces_pipeline[traces pipeline] TR[otlp receiver] --> TML[memory_limiter] TML --> TK8S[k8sattributes] TK8S --> TB[batch] TB --> TE[otlp/tempo] end subgraph metrics_pipeline[metrics pipeline] MR[otlp receiver] --> MML[memory_limiter] MML --> MK8S[k8sattributes] MK8S --> MB[batch] MB --> ME[prometheusremotewrite] end subgraph logs_pipeline[logs pipeline] LR[otlp receiver] --> LML[memory_limiter] LML --> LK8S[k8sattributes] LK8S --> LB[batch] LB --> LE[loki] end

Section 4: Reliability and Operations

Persistent queue and retry on failure

Most exporters support a sending_queue and retry_on_failure. By default the sending queue is in memory: fast, but lost on restart. Pairing it with the file_storage extension makes the queue durable, so an OOM kill or rolling deployment doesn't drop telemetry that has already been accepted from upstream.

extensions:
  file_storage:
    directory: /var/lib/otelcol/storage
    timeout: 1s

exporters:
  otlp/tempo:
    endpoint: tempo-distributor.observability:4317
    sending_queue:
      enabled: true
      storage: file_storage   # makes the queue durable across restarts
      num_consumers: 10
      queue_size: 2000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 60s
      max_elapsed_time: 0     # retry forever

Sizing tips:

queue_size: 1,000-5,000 batches per exporter (each batch holds thousands of records)
num_consumers: 5-10 typical; raise only if the backend can handle more parallelism
retry_on_failure: exponential backoff with initial_interval: 5s, max_interval: 60s; set max_elapsed_time: 0 for critical data and let memory_limiter drop oldest when the queue fills

Mermaid: memory_limiter, batch, queue, retry interplay (Figure 10.4)

flowchart LR IN[Incoming spans / metrics / logs] --> ML{memory_limiter
over threshold?} ML -- "yes: refuse" --> REJ[Return error
upstream backs off] ML -- "no: accept" --> BAT[batch] BAT --> SQ[(sending_queue
file_storage backed)] SQ --> EXP[Exporter consumer pool
num_consumers] EXP -- success --> BE[Backend] EXP -- failure --> RET{retry_on_failure
exponential backoff} RET -- retry --> SQ RET -- queue full --> DROP[Drop oldest
otelcol_exporter_send_failed]

Extensions for operability

Extension	Endpoint	Use
`health_check`	`:13133/`	Kubernetes liveness / readiness probes
`pprof`	`:1777/debug/pprof/`	CPU and heap profiling under load
`zpages`	`:55679/debug/`	Live in-process view: pipelines, exporter queues, recent spans
`file_storage`	filesystem	Backing store for persistent queues

extensions:
  health_check: { endpoint: 0.0.0.0:13133 }
  pprof:        { endpoint: 0.0.0.0:1777 }
  zpages:       { endpoint: 0.0.0.0:55679 }

# pod spec:
livenessProbe:  { httpGet: { path: /, port: 13133 }, initialDelaySeconds: 10 }
readinessProbe: { httpGet: { path: /, port: 13133 }, periodSeconds: 5 }

zpages is especially useful during incidents — per-component counters and sampled recent traces come directly from the Collector's process, so you can answer "is data flowing? is anything dropping?" without leaving the cluster.

Mermaid: Two-tier agent + gateway topology (Figure 10.3)

flowchart LR P1[App pod] --> A1[Agent Collector
DaemonSet
memory_limiter
k8sattributes
light batch] P2[App pod] --> A1 P3[App pod] --> A2[Agent Collector] A1 --> GW[Gateway Collector
Deployment + HPA
tail_sampling
transform
persistent queue] A2 --> GW GW --> TEMPO[(Tempo)] GW --> MIMIR[(Mimir)] GW --> LOKI[(Loki)]

Sizing, throughput, and memory tuning

Role	CPU req	CPU limit	Mem req	Mem limit
Agent (DaemonSet)	100-250m	500-750m	256-512 Mi	512 Mi-1 Gi
Gateway (Deployment)	500m-1 vCPU	2-4 vCPU	1-2 Gi	2-4 Gi

The memory_limiter should target 70-80% of the container memory limit, with spike_limit_mib covering the largest plausible batch. Agents are not HPA-scaled (they scale with node count via DaemonSet); the gateway runs an HPA on CPU at 60-70% target utilization with minReplicas: 2 for graceful scaling and HA.

Tail-sampling capacity follows a simple rule of thumb: num_traces ≥ expected_new_traces_per_sec × decision_wait × 2. At 2,000 traces/sec and a 10s decision_wait, that is num_traces: 40000 minimum — round up to 50,000 for headroom.

Monitor the Collector with… itself: every Collector exposes its internal metrics on port 8888. Alert on:

Memory near limit_mib for sustained periods
Sustained growth of otelcol_exporter_queue_size
Non-zero otelcol_processor_refused_* or otelcol_exporter_send_failed_* counters
HPA at maxReplicas while queues continue to grow

Post-Study Assessment

1. Which Collector component is the contract that actually wires receivers, processors, and exporters into runnable pipelines?

The receivers top-level block

The service.pipelines block

The extensions block

The connectors block

2. Which processor should always be listed first in a Collector pipeline?

batch

transform

memory_limiter

k8sattributes

3. What is a connector in the OpenTelemetry Collector?

An extension that exposes Prometheus metrics on port 8888

A processor that drops spans matching a predicate

A hybrid component that acts as an exporter on one pipeline and a receiver on another

A YAML anchor used to share config across pipelines

4. Why is the batch processor recommended in essentially every production pipeline?

It enriches spans with Kubernetes pod metadata

It groups telemetry into larger payloads, dramatically reducing per-call overhead at the exporter

It applies tail-sampling policies based on latency and error status

It encrypts data in flight to the backend

5. For tail_sampling policies to actually have spans to evaluate, the SDK must export traces using which sampler?

An aggressive traceidratiobased(0.01) head sampler

always_off — let the Collector decide everything

always_on (or parentbased(always_on)) so unsampled spans reach the Collector

No sampler is needed; tail sampling reconstructs dropped spans

6. Which receiver is most commonly used to ingest container logs from /var/log/pods on a Kubernetes node?

otlp

prometheus

hostmetrics

filelog

7. Which exporter is the standard choice for fanning metrics out into the Prometheus / Mimir / Cortex / Thanos ecosystem?

loki

otlp

prometheusremotewrite

debug

8. What is the role of the file_storage extension on a gateway Collector?

It stores Collector binary releases for rolling upgrades

It backs the exporter sending_queue with disk so accepted telemetry survives restarts

It writes a debug log of every span to a flat file

It mounts a ConfigMap into the Collector pod

9. Which extension exposes a live, in-process view of pipelines and exporter queues for incident triage?

health_check

pprof

zpages

file_storage

10. In the recommended two-tier topology, where does tail_sampling belong?

On the DaemonSet agent, close to each application pod

In the SDK, before spans ever leave the application

On the centralized gateway Deployment, where it can buffer whole traces

On the backend (Tempo), not in the Collector

Chapter 10: The OpenTelemetry Collector in Depth

Learning Objectives

Section 1: Pipeline Architecture

The four component types (plus extensions)

Mermaid: Canonical pipeline anatomy (Figure 10.0)

A minimal three-signal config

Processor order is decisive

Section 1 Takeaway

Section 2: Key Processors

`memory_limiter` + `batch` — the mandatory pair

`transform` with OTTL

`tail_sampling` vs `probabilistic_sampler`

Mermaid: Tail sampling decision flow (Figure 10.2)

Head vs tail sampling at a glance

`k8sattributes` — Kubernetes enrichment

Section 2 Takeaway

Section 3: Key Receivers and Exporters

Workhorse receivers

Workhorse exporters

Connectors — the inter-pipeline glue

Mermaid: Multi-signal pipeline topology (Figure 10.1)

Section 3 Takeaway

Section 4: Reliability and Operations

Persistent queue and retry on failure

Mermaid: memory_limiter, batch, queue, retry interplay (Figure 10.4)

Extensions for operability

Mermaid: Two-tier agent + gateway topology (Figure 10.3)

Sizing, throughput, and memory tuning

Section 4 Takeaway

Your Progress

Answer Explanations

Chapter 10: The OpenTelemetry Collector in Depth

Learning Objectives

Section 1: Pipeline Architecture

The four component types (plus extensions)

Mermaid: Canonical pipeline anatomy (Figure 10.0)

A minimal three-signal config

Processor order is decisive

Section 1 Takeaway

Section 2: Key Processors

memory_limiter + batch — the mandatory pair

transform with OTTL

tail_sampling vs probabilistic_sampler

Mermaid: Tail sampling decision flow (Figure 10.2)

Head vs tail sampling at a glance

k8sattributes — Kubernetes enrichment

Section 2 Takeaway

Section 3: Key Receivers and Exporters

Workhorse receivers

Workhorse exporters

Connectors — the inter-pipeline glue

Mermaid: Multi-signal pipeline topology (Figure 10.1)

Section 3 Takeaway

Section 4: Reliability and Operations

Persistent queue and retry on failure

Mermaid: memory_limiter, batch, queue, retry interplay (Figure 10.4)

Extensions for operability

Mermaid: Two-tier agent + gateway topology (Figure 10.3)

Sizing, throughput, and memory tuning

Section 4 Takeaway

Your Progress

Answer Explanations

`memory_limiter` + `batch` — the mandatory pair

`transform` with OTTL

`tail_sampling` vs `probabilistic_sampler`

`k8sattributes` — Kubernetes enrichment