Study Guide: Chapter 5 — OpenTelemetry Architecture: API, SDK, and Collector

Pre-Reading Quiz — Half 1

1. A third-party library is instrumented with OpenTelemetry. Which dependency should it import to keep its footprint minimal and avoid forcing a vendor or backend on its users?

The OpenTelemetry SDK (with default exporter)

The OpenTelemetry API only

The OpenTelemetry Collector binary

A vendor-specific tracing client

2. An application has fully instrumented code but does not initialize an SDK at startup. What happens when the library calls tracer.spanBuilder("doWork").startSpan()?

An exception is thrown because no exporter is registered

A span is created and queued in memory indefinitely

A no-op span is returned and no data is recorded

The span is sent to a default OTLP endpoint on localhost

3. Which SDK component decides whether a span is recorded?

Exporter

Processor

Sampler

Resource

4. A developer sets OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317 and OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf. What problem is this likely to cause?

No problem — OTLP transports auto-negotiate

Port 4317 is the gRPC port, not the HTTP port; transport errors will result

The protocol value is invalid; only grpc is supported

HTTPS is required for OTLP

5. Why are semantic conventions important across languages?

They speed up protobuf encoding

They are required by the Collector's batch processor

They standardize attribute names so any backend can interpret data from any language

They allow gRPC to traverse HTTP/1.1 proxies

1. API vs SDK vs Collector

OpenTelemetry's most important architectural choice is to split the instrumentation surface (the API) from the pipeline implementation (the SDK), and to separate both from the out-of-process telemetry agent (the Collector). Each layer has a distinct audience, release cadence, and dependency footprint.

1.1 The API — a stable interface for instrumentation

The API defines interfaces such as Tracer, Meter, and Logger, plus their providers. Crucially, it knows nothing about exporters, samplers, batching, OTLP, or any backend. Libraries depend only on the API. If no SDK is registered, the API returns no-op implementations: calls such as startSpan() still return a span object, but nothing is recorded, and the cost is a few function calls.
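The no-op fallback can be illustrated with a toy model. This is a minimal sketch in plain Python, not the real opentelemetry-api package: every class and function name below (NoOpSpan, SdkTracer, set_tracer, get_tracer, do_work) is hypothetical and exists only to show the shape of the idea.

```python
"""Toy model of the API/SDK split; this is NOT the real opentelemetry-api package."""


class NoOpSpan:
    """Span returned when no SDK is registered: accepts calls, records nothing."""

    def set_attribute(self, key, value):
        pass

    def end(self):
        pass

    def is_recording(self):
        return False


class NoOpTracer:
    def start_span(self, name):
        return NoOpSpan()


class RecordingSpan:
    """Span a (hypothetical) SDK would hand out; it keeps what it is told."""

    def __init__(self, name, sink):
        self.name = name
        self.attributes = {}
        self._sink = sink

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def end(self):
        self._sink.append(self)

    def is_recording(self):
        return True


class SdkTracer:
    def __init__(self):
        self.finished = []

    def start_span(self, name):
        return RecordingSpan(name, self.finished)


_registered_tracer = None  # set by the application, never by a library


def set_tracer(tracer):
    global _registered_tracer
    _registered_tracer = tracer


def get_tracer():
    # Fall back to the no-op tracer when the application registered nothing.
    return _registered_tracer if _registered_tracer is not None else NoOpTracer()


def do_work():
    """Library code: it only ever talks to get_tracer()."""
    span = get_tracer().start_span("doWork")
    span.set_attribute("items", 3)
    span.end()
    return span.is_recording()
```

Calling do_work() before set_tracer() runs the same library code against the no-op span; calling it after set_tracer(SdkTracer()) records the span. The library itself never changes, which is the point of the split.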

1.2 The SDK — a configurable pipeline implementation

The SDK is what application developers wire up at startup. It owns four moving parts:

Samplers — decide whether to record (AlwaysOn, ParentBased, TraceIdRatioBased).
Processors — buffer, batch, enrich, or filter (BatchSpanProcessor).
Exporters — serialize and send (OTLP, Jaeger, Zipkin, Prometheus).
Resource — describes the producer (service.name, host.name).
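How the four parts fit together can be sketched as a small in-memory pipeline. This is an illustrative model only, assuming a ratio sampler that thresholds the low 64 bits of the trace ID; the class names (RatioSampler, BatchProcessor, InMemoryExporter, Pipeline) and the service name "checkout" used in the usage note are hypothetical and are not the SDK's actual types.

```python
"""Toy SDK pipeline: sampler -> processor -> exporter, stamped with a resource."""


class RatioSampler:
    """Keeps a fixed fraction of traces, decided deterministically from the trace ID."""

    def __init__(self, ratio):
        # Threshold over the low 64 bits of the trace ID.
        self._bound = int(ratio * (1 << 64))

    def should_sample(self, trace_id):
        return (trace_id & ((1 << 64) - 1)) < self._bound


class InMemoryExporter:
    """Stands in for a real exporter; it just remembers each batch it was given."""

    def __init__(self):
        self.batches = []

    def export(self, batch):
        self.batches.append(list(batch))


class BatchProcessor:
    """Buffers finished spans and hands them to the exporter in groups."""

    def __init__(self, exporter, max_batch_size):
        self._exporter = exporter
        self._max = max_batch_size
        self._buffer = []

    def on_end(self, span):
        self._buffer.append(span)
        if len(self._buffer) >= self._max:
            self.flush()

    def flush(self):
        if self._buffer:
            self._exporter.export(self._buffer)
            self._buffer = []


class Pipeline:
    def __init__(self, sampler, processor, resource):
        self._sampler = sampler
        self._processor = processor
        self._resource = resource

    def record(self, trace_id, name):
        # The sampler runs first; dropped spans never reach the processor.
        if not self._sampler.should_sample(trace_id):
            return False
        span = {"trace_id": trace_id, "name": name, "resource": self._resource}
        self._processor.on_end(span)
        return True
```

For example, Pipeline(RatioSampler(0.5), BatchProcessor(InMemoryExporter(), 2), {"service.name": "checkout"}) keeps roughly half of all trace IDs and exports them two at a time. Because the decision depends only on the trace ID, every service using the same ratio reaches the same verdict for a given trace.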

1.3 Why the split matters

A library depending only on opentelemetry-api adds essentially zero weight and zero opinion about your backend — the same library can be used by an app that exports to Jaeger, to a SaaS vendor, or to nothing at all, without recompilation.

1.4 The Collector

The Collector is a separate binary, written in Go, that runs outside your application. Every pipeline has the same shape: receivers → processors → exporters. Moving logic out of the app enables decoupled deploys, large-scale batching, vendor portability, and centralized policy.
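The receivers → processors → exporters shape can be modeled in a few lines. This is only a sketch of the data flow, not Collector code (the Collector is configured in YAML and written in Go); run_pipeline, drop_health_checks, add_team, and the "team" attribute are hypothetical names chosen for illustration.

```python
def run_pipeline(received, processors, exporters):
    """Push one received batch through each processor in order, then fan out to every exporter."""
    batch = received
    for process in processors:
        batch = process(batch)
    for export in exporters:
        export(batch)
    return batch


def drop_health_checks(batch):
    """Example filtering processor: discard noisy health-check spans."""
    return [span for span in batch if span["name"] != "GET /healthz"]


def add_team(batch):
    """Example enrichment processor: stamp every span with an owning team."""
    return [{**span, "team": "payments"} for span in batch]
```

Processors run in the order listed, so filtering before enrichment avoids doing work on spans that are about to be dropped. Each exporter receives the same processed batch.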

Figure 5.1: OpenTelemetry three-layer architecture and OTLP data flow

2. Cross-Language Architecture

OpenTelemetry's promise of consistency across 12+ languages rests on three pillars:

Specification — defines API/SDK semantics per signal.
Semantic conventions — standardized attribute names (service.name, http.response.status_code, k8s.pod.name).
OTLP — the wire protocol with three transport variants.

2.1 Stability per signal, per language

Each language SIG implements the spec at its own pace. Traces stabilized first; metrics followed; logs are the most recent.

2.2 Semantic conventions: the lingua franca

If every team picks its own attribute names — http.statusCode vs http_status vs httpResponse.code — "vendor-neutral" telemetry becomes useless. Semantic conventions give backends like Grafana, Datadog, Honeycomb, and Tempo a consistent vocabulary, enabling out-of-the-box visualizations.
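A short sketch shows why this matters for a backend. The query function below is hypothetical; it stands in for any dashboard or alert that looks up one attribute name, here the convention key http.response.status_code.

```python
"""Why shared attribute names matter: one query works only if every service agrees."""

CONVENTION_KEY = "http.response.status_code"


def count_server_errors(spans, key=CONVENTION_KEY):
    """Count spans whose status-code attribute (looked up under `key`) is 500 or above."""
    return sum(1 for attrs in spans if attrs.get(key, 0) >= 500)


# Three services reporting the same failure under three ad hoc names.
ad_hoc = [
    {"http.statusCode": 503},
    {"http_status": 503},
    {"httpResponse.code": 503},
]

# The same three services after adopting the shared convention.
conventional = [{CONVENTION_KEY: 503} for _ in ad_hoc]
```

Against the ad hoc data the query finds nothing, even though all three requests failed; against the conventional data it finds all three.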

2.3 OTLP transport variants

Variant	Port	Encoding	When to use
OTLP/gRPC	4317	protobuf over HTTP/2	Default in modern Kubernetes; lowest overhead
OTLP/HTTP/protobuf	4318	protobuf over HTTP/1.1 or 2	Through proxies/LBs that don't speak gRPC well
OTLP/HTTP/JSON	4318	JSON over HTTP	Browsers, curl-debugging, low volume

A common mistake is mismatching protocol and port (e.g., http/protobuf with port 4317) — this yields "unexpected response" or "transport error" messages.
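A startup check can catch this mismatch before any telemetry is lost. The helper below is a hypothetical sketch, not part of any SDK; it assumes the conventional ports from the table above and the protocol values grpc, http/protobuf, and http/json.

```python
from urllib.parse import urlparse

# Conventional OTLP ports from the table above.
CONVENTIONAL_PORTS = {"grpc": 4317, "http/protobuf": 4318, "http/json": 4318}


def check_otlp_endpoint(endpoint, protocol):
    """Return a warning string when port and protocol look mismatched, else None."""
    if protocol not in CONVENTIONAL_PORTS:
        return f"unknown protocol {protocol!r}"
    port = urlparse(endpoint).port
    expected = CONVENTIONAL_PORTS[protocol]
    other_ports = set(CONVENTIONAL_PORTS.values()) - {expected}
    # Only flag the classic mix-up; a custom port may be perfectly valid.
    if port in other_ports:
        return (
            f"port {port} is conventionally used by another OTLP transport; "
            f"{protocol} usually listens on {expected}"
        )
    return None
```

With the values from the quiz, check_otlp_endpoint("http://collector:4317", "http/protobuf") returns a warning, while the same endpoint with "grpc" returns None. Custom ports pass silently because a deployment may legitimately remap them.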

2.4 Partial success & retries

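OTLP responses carry a partial_success field that reports how many items the receiver rejected. Rejected items should NOT be retried, because they are usually permanently bad (for example, malformed data). Retry only transient failures, such as gRPC UNAVAILABLE or DEADLINE_EXCEEDED and HTTP 429, 502, 503, or 504, using exponential backoff with jitter. Other 4xx and 5xx responses should be treated as permanent.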
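The retry rule can be sketched as follows. This is a simplified model under stated assumptions: the retryable sets are deliberately small and should be checked against the OTLP specification, the "full jitter" strategy is one common choice rather than a mandated one, and send is a hypothetical callable that returns "OK" or a status.

```python
import random

# Assumed retryable sets; verify against the OTLP specification before relying on them.
RETRYABLE_GRPC = {"UNAVAILABLE", "DEADLINE_EXCEEDED"}
RETRYABLE_HTTP = {429, 502, 503, 504}


def is_retryable(status):
    """True for transient transport failures; partial_success rejections are never retried."""
    return status in RETRYABLE_GRPC or status in RETRYABLE_HTTP


def backoff_delays(attempts, base=1.0, cap=30.0, rng=random):
    """Exponential backoff with full jitter: delay n is uniform in [0, min(cap, base * 2**n)]."""
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]


def export_with_retry(send, max_attempts=5, sleep=lambda seconds: None):
    """Call send() until it succeeds, hits a non-retryable status, or runs out of attempts."""
    delays = backoff_delays(max_attempts)
    status = None
    for attempt in range(max_attempts):
        status = send()
        if status == "OK":
            return "OK", attempt + 1
        if not is_retryable(status):
            return status, attempt + 1
        sleep(delays[attempt])
    return status, max_attempts
```

The default sleep does nothing so the sketch runs instantly; a real caller would pass time.sleep. Jitter spreads retries from many clients over time so they do not all hit a recovering receiver at once.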

Figure 5.3: OTLP export request flow across transport variants

Post-Reading Quiz — Half 1

1. A third-party library is instrumented with OpenTelemetry. Which dependency should it import to keep its footprint minimal and avoid forcing a vendor or backend on its users?

The OpenTelemetry SDK (with default exporter)

The OpenTelemetry API only

The OpenTelemetry Collector binary

A vendor-specific tracing client

2. An application has fully instrumented code but does not initialize an SDK at startup. What happens when the library calls tracer.spanBuilder("doWork").startSpan()?

An exception is thrown because no exporter is registered

A span is created and queued in memory indefinitely

A no-op span is returned and no data is recorded

The span is sent to a default OTLP endpoint on localhost

3. Which SDK component decides whether a span is recorded?

Exporter

Processor

Sampler

Resource

4. A developer sets OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317 and OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf. What problem is this likely to cause?

No problem — OTLP transports auto-negotiate

Port 4317 is the gRPC port, not the HTTP port; transport errors will result

The protocol value is invalid; only grpc is supported

HTTPS is required for OTLP

5. Why are semantic conventions important across languages?

They speed up protobuf encoding

They are required by the Collector's batch processor

They standardize attribute names so any backend can interpret data from any language

They allow gRPC to traverse HTTP/1.1 proxies

Half 2 — Sections 3 & 4

Pre-Reading Quiz — Half 2

1. Which deployment topology is best suited for adding k8s.pod.name and tailing container logs from /var/log/containers/*.log?

Centralized gateway Deployment

DaemonSet agent (one Collector per node)

A single Collector pod managed by an HPA

Collector running directly on the control-plane node

2. Why does tail-based sampling typically require a gateway rather than a node-local agent?

Agents can't run the tail_sampling processor binary

Tail sampling needs visibility into the full trace, but a node-local agent only sees spans for its own node

Gateways have faster CPUs than agents

The tail-sampling processor only ships in vendor distributions

3. In a hybrid topology with multiple gateway replicas running tail sampling, what is required to make tail-sampling decisions correct?

All gateway pods must share a Redis cache

Trace-ID-aware load balancing so all spans of a trace hit the same gateway pod

A sidecar Collector in every application pod

Disabling the batch processor in the gateway

4. Which Collector distribution is the practical default for most production Kubernetes pipelines that need k8sattributes, tail_sampling, and filelog?

otelcol (core)

otelcol-contrib

A custom ocb build

Grafana Alloy only

5. A security team requires a minimal Collector binary with only the components actually used in production, and wants a Go module inventory for compliance. Which approach fits best?

Strip otelcol-contrib at runtime with environment variables

Use the OpenTelemetry Collector Builder (ocb) with a manifest listing exactly the needed components

Deploy the vendor agent and disable unused processors

Use otelcol core and write the missing components inline in YAML

3. Collector Deployment Topologies

The same Collector binary, with different configuration, can be deployed as a sidecar, a DaemonSet per node, or a centralized fleet of pods. Most production Kubernetes environments combine more than one of these patterns.

3.1 Agent mode (sidecar or DaemonSet)

Sidecar: one Collector per app pod. Useful for per-service pipelines or strict tenant isolation. Expensive at scale.
DaemonSet: one Collector per node. Apps on that node send telemetry locally.

Agents excel at: node-local enrichment (k8s.pod.name), host-level signals (kubelet, cAdvisor, container logs), minimal latency, isolated failure domain.

3.2 Gateway mode (centralized Deployment)

Gateways excel at:

Tail-based sampling — needs the full trace, which only a centralized view gives.
Efficient batching — aggregating from many sources lets the gateway build large, compressible batches.
Centralized auth — API keys, OAuth tokens, mTLS certs live in one place.
Multi-tenant routing — per-team backends, per-tenant quotas.
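The tail-sampling point in the list above can be made concrete with a toy example. This is a simplified sketch, not the tail_sampling processor: it ignores decision windows and timeouts, and the "keep any trace containing an error" policy, the span fields, and the node names are hypothetical.

```python
from collections import defaultdict


def tail_sample(spans, keep_if=lambda trace: any(s["error"] for s in trace)):
    """Group spans by trace ID, then keep or drop each trace as a whole."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    return {tid: trace for tid, trace in traces.items() if keep_if(trace)}


# Trace "t1" fails in its second span; trace "t2" is healthy throughout.
spans = [
    {"trace_id": "t1", "node": "node-1", "error": False},
    {"trace_id": "t1", "node": "node-2", "error": True},
    {"trace_id": "t2", "node": "node-1", "error": False},
]

# A gateway sees every span, so the whole failing trace is kept.
gateway_view = tail_sample(spans)

# An agent on node-1 sees only its own spans and misses the error on node-2.
agent_view = tail_sample([s for s in spans if s["node"] == "node-1"])
```

Here gateway_view keeps both spans of t1, while agent_view is empty: the node-1 agent would have dropped its half of the failing trace because the error happened on another node.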

For correct tail sampling across replicas you need trace-ID-aware load balancing so all spans of a trace land on the same pod (the loadbalancing exporter or L7 consistent hashing).
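One way to get this property is to derive the destination from the trace ID itself. The sketch below uses rendezvous (highest-random-weight) hashing, a simple relative of consistent hashing; it is an illustration of the idea and not the algorithm the loadbalancing exporter actually implements. The gateway names in the usage note are hypothetical.

```python
import hashlib


def pick_gateway(trace_id, gateways):
    """Map a trace ID to one gateway using rendezvous (highest-random-weight) hashing."""

    def weight(gateway):
        digest = hashlib.sha256(f"{gateway}:{trace_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # Every sender computes the same weights, so every sender picks the same pod.
    return max(gateways, key=weight)
```

Any agent calling pick_gateway with the same trace ID and the same set of gateways (for example ["gateway-0", "gateway-1", "gateway-2"]) reaches the same answer without coordination. When a gateway is removed, only the traces it owned move elsewhere.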

3.3 Trade-off summary

Concern	Agent	Gateway
Batching efficiency	Smaller batches per node	Large aggregated batches
Tail sampling	Limited local view → broken	Global view → correct
Auth to backends	Secrets on every node	Secrets centralized
Reliability	No central SPOF; per-node blast radius	Choke point; needs replicas + HPA
Multi-tenancy	Hard to enforce	Natural enforcement point

3.4 Hybrid — the recommended pattern

A DaemonSet agent does cheap node-local work (enrichment, host signals, log tailing) and forwards OTLP to a Deployment gateway, which handles tail sampling, tenant routing, and backend auth. If the gateway is overloaded, agents can buffer locally, within the limits of their queue and retry settings, until pressure subsides.

3.5 The OpenTelemetry Operator

On Kubernetes, the Operator provides an OpenTelemetryCollector CRD whose mode can be daemonset, sidecar, deployment, or statefulset, plus an Instrumentation CR that auto-injects SDKs for Java, Python, Node, .NET, and Go.

Figure 5.2: Agent, gateway, and hybrid Collector topologies

flowchart TD
  subgraph AgentMode["Agent topology (DaemonSet)"]
    direction TB
    A_App1["App Pod<br/>Node 1"]
    A_App2["App Pod<br/>Node 2"]
    A_Ag1["Collector Agent<br/>Node 1"]
    A_Ag2["Collector Agent<br/>Node 2"]
    A_BE[(Backend)]
    A_App1 -->|"OTLP localhost"| A_Ag1
    A_App2 -->|"OTLP localhost"| A_Ag2
    A_Ag1 --> A_BE
    A_Ag2 --> A_BE
  end
  subgraph GatewayMode["Gateway topology (Deployment)"]
    direction TB
    G_App1["App Pod<br/>Node 1"]
    G_App2["App Pod<br/>Node 2"]
    G_GW["Gateway Collectors<br/>centralized Deployment + Service"]
    G_BE[(Backend)]
    G_App1 -->|"OTLP cluster Service"| G_GW
    G_App2 -->|"OTLP cluster Service"| G_GW
    G_GW --> G_BE
  end
  subgraph HybridMode["Hybrid topology (recommended)"]
    direction TB
    H_App1["App Pod<br/>Node 1"]
    H_App2["App Pod<br/>Node 2"]
    H_Ag1["Agent<br/>Node 1"]
    H_Ag2["Agent<br/>Node 2"]
    H_GW["Gateway Collectors<br/>tail sampling + tenant routing + auth"]
    H_BE[(Backend)]
    H_App1 -->|"OTLP localhost"| H_Ag1
    H_App2 -->|"OTLP localhost"| H_Ag2
    H_Ag1 -->|"OTLP cross-node"| H_GW
    H_Ag2 -->|"OTLP cross-node"| H_GW
    H_GW --> H_BE
  end

4. Distributions and Builds

The Collector ships in multiple distributions — pre-built bundles of receivers, processors, exporters, and extensions.

4.1 `otelcol` vs `otelcol-contrib`

otelcol (core): minimal, stable, includes OTLP, Prometheus, batch, memory_limiter, attributes, debug.
otelcol-contrib: kitchen-sink — vendor exporters (Datadog, Splunk, AWS, GCP), specialized receivers (kubeletstats, filelog, hostmetrics), advanced processors (tail_sampling, transform, routing, k8sattributes).

Most real-world pipelines need at least one contrib component — otelcol-contrib is the practical default.

4.2 Custom builds with `ocb`

The OpenTelemetry Collector Builder lets you assemble exactly the components you need from a manifest. Motivations:

Supply-chain hygiene — smaller binary, smaller attack surface, faster startup.
Compliance — explicit Go-module inventory.
Performance — fewer registered components.

dist:
  name: my-otelcol
  otelcol_version: 0.95.0
receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.95.0
processors:
  - gomod: github.com/.../k8sattributesprocessor v0.95.0
  - gomod: github.com/.../tailsamplingprocessor v0.95.0
exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.95.0

Running ocb --config manifest.yaml then produces a single Go binary containing exactly those components and nothing more.

4.3 Vendor distributions

AWS ADOT, Splunk OTel Collector, Datadog Agent (OTel mode), and Grafana Agent / Alloy bundle the upstream Collector with vendor-tuned defaults. They still respect the API/SDK/OTLP boundaries: applications continue emitting standard OTLP, and only the Collector is vendor-flavored. Switching vendors is largely a Collector configuration change.

4.4 Choosing a distribution

If you need…	Start with
OTLP-only, simple pipeline	`otelcol` (core)
Most real-world Kubernetes pipelines	`otelcol-contrib`
First-party vendor support contract	Vendor distribution
Minimal binary, supply-chain control	Custom `ocb` build

Figure 5.4: Collector pipeline anatomy

Post-Reading Quiz — Half 2

1. Which deployment topology is best suited for adding k8s.pod.name and tailing container logs from /var/log/containers/*.log?

Centralized gateway Deployment

DaemonSet agent (one Collector per node)

A single Collector pod managed by an HPA

Collector running directly on the control-plane node

2. Why does tail-based sampling typically require a gateway rather than a node-local agent?

Agents can't run the tail_sampling processor binary

Tail sampling needs visibility into the full trace, but a node-local agent only sees spans for its own node

Gateways have faster CPUs than agents

The tail-sampling processor only ships in vendor distributions

3. In a hybrid topology with multiple gateway replicas running tail sampling, what is required to make tail-sampling decisions correct?

All gateway pods must share a Redis cache

Trace-ID-aware load balancing so all spans of a trace hit the same gateway pod

A sidecar Collector in every application pod

Disabling the batch processor in the gateway

4. Which Collector distribution is the practical default for most production Kubernetes pipelines that need k8sattributes, tail_sampling, and filelog?

otelcol (core)

otelcol-contrib

A custom ocb build

Grafana Alloy only

5. A security team requires a minimal Collector binary with only the components actually used in production, and wants a Go module inventory for compliance. Which approach fits best?

Strip otelcol-contrib at runtime with environment variables

Use the OpenTelemetry Collector Builder (ocb) with a manifest listing exactly the needed components

Deploy the vendor agent and disable unused processors

Use otelcol core and write the missing components inline in YAML

