Chapter 5 — OpenTelemetry Architecture: API, SDK, and Collector
Learning Objectives
Differentiate the OpenTelemetry API, SDK, and Collector and explain why the API/SDK split matters.
Trace data flow from instrumented application to backend through SDK exporters and the Collector.
Select an appropriate deployment topology (agent, gateway, or both) for a given workload.
Compare OTLP transport variants (gRPC, HTTP/protobuf, HTTP/JSON) and pick one for given infrastructure constraints.
Distinguish distributions (otelcol, otelcol-contrib, vendor, custom ocb) and choose appropriately.
Half 1 — Sections 1 & 2
Pre-Reading Quiz — Half 1
1. A third-party library is instrumented with OpenTelemetry. Which dependency should it import to keep its footprint minimal and avoid forcing a vendor or backend on its users?
The OpenTelemetry SDK (with default exporter)
The OpenTelemetry API only
The OpenTelemetry Collector binary
A vendor-specific tracing client
2. An application has fully instrumented code but does not initialize an SDK at startup. What happens when the library calls tracer.spanBuilder("doWork").startSpan()?
An exception is thrown because no exporter is registered
A span is created and queued in memory indefinitely
A no-op span is returned and no data is recorded
The span is sent to a default OTLP endpoint on localhost
3. Which SDK component decides whether a span is recorded?
Exporter
Processor
Sampler
Resource
4. A developer sets OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317 and OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf. What problem is this likely to cause?
No problem — OTLP transports auto-negotiate
Port 4317 is the gRPC port, not the HTTP port; transport errors will result
The protocol value is invalid; only grpc is supported
HTTPS is required for OTLP
5. Why are semantic conventions important across languages?
They speed up protobuf encoding
They are required by the Collector's batch processor
They standardize attribute names so any backend can interpret data from any language
They allow gRPC to traverse HTTP/1.1 proxies
1. API vs SDK vs Collector
OpenTelemetry's most important architectural choice is splitting instrumentation surface (API) from pipeline implementation (SDK), and separating both from the out-of-process telemetry agent (Collector). Each layer has a distinct audience, release cadence, and dependency footprint.
1.1 The API — a stable interface for instrumentation
The API defines interfaces like Tracer, Meter, and Logger (plus their providers). Crucially, it does not know about exporters, samplers, batching, OTLP, or any backend. Libraries depend only on the API. If no SDK is registered, the API returns no-op implementations — spans are created but never recorded; the cost is a few function calls.
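The no-op default described above can be illustrated with a toy model. This is a minimal sketch, not the real opentelemetry-api package: every class and function name here (NoOpTracer, RecordingTracer, get_tracer, set_tracer, library_do_work) is hypothetical and exists only to show the shape of the split.

```python
# Toy model of the API/SDK split. NOT the real OpenTelemetry API;
# all names below are hypothetical and chosen for illustration only.


class NoOpSpan:
    """Returned when no SDK is registered: accepts calls, records nothing."""

    def set_attribute(self, key, value):
        return self

    def end(self):
        pass

    def is_recording(self):
        return False


class NoOpTracer:
    def start_span(self, name):
        return NoOpSpan()


class RecordingSpan:
    """Stand-in for an SDK span that actually keeps data."""

    def __init__(self, name, sink):
        self.name = name
        self.attributes = {}
        self._sink = sink

    def set_attribute(self, key, value):
        self.attributes[key] = value
        return self

    def end(self):
        self._sink.append(self)

    def is_recording(self):
        return True


class RecordingTracer:
    def __init__(self):
        self.finished = []

    def start_span(self, name):
        return RecordingSpan(name, self.finished)


_tracer = NoOpTracer()  # "API" default until an "SDK" is installed


def get_tracer():
    return _tracer


def set_tracer(tracer):
    global _tracer
    _tracer = tracer


def library_do_work():
    """Library code: talks only to get_tracer(), never to SDK classes."""
    span = get_tracer().start_span("doWork")
    span.set_attribute("items", 3)
    span.end()
    return span.is_recording()


recorded_without_sdk = library_do_work()  # no SDK registered yet
sdk_tracer = RecordingTracer()
set_tracer(sdk_tracer)                    # the application wires up an "SDK"
recorded_with_sdk = library_do_work()
```

The library function is identical in both calls; only the application's startup wiring decides whether anything is recorded.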
1.2 The SDK — a configurable pipeline implementation
The SDK is what application developers wire up at startup. It owns four moving parts:
Samplers — decide whether to record (AlwaysOn, ParentBased, TraceIdRatioBased).
Processors — buffer, batch, enrich, or filter (BatchSpanProcessor).
Exporters — serialize and send (OTLP, Jaeger, Zipkin, Prometheus).
Resource — describes the producer (service.name, host.name).
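To make the four parts concrete, here is a small sketch of how a sampler, a processor, an exporter, and a resource could fit together. It is a simplified model, not the SDK's implementation: the names ratio_sampler, BatchProcessor, and InMemoryExporter are hypothetical, the batch size of 2 is arbitrary, and the ratio check (comparing the low 64 bits of the trace ID against a bound) is one plausible way to implement ratio-based sampling, assumed here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: int
    name: str
    resource: dict


def ratio_sampler(ratio):
    """Sample a trace when the low 64 bits of its ID fall below ratio * 2**64."""
    bound = round(ratio * (1 << 64))

    def should_sample(trace_id):
        return (trace_id & ((1 << 64) - 1)) < bound

    return should_sample


class InMemoryExporter:
    """Stands in for a real exporter; just remembers each batch it receives."""

    def __init__(self):
        self.batches = []

    def export(self, spans):
        self.batches.append(list(spans))


class BatchProcessor:
    """Buffers finished spans and hands them to the exporter in batches."""

    def __init__(self, exporter, max_batch=2):
        self.exporter = exporter
        self.max_batch = max_batch
        self.buffer = []

    def on_end(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.buffer:
            self.exporter.export(self.buffer)
            self.buffer = []


# Resource: describes the producer and is attached to every span.
resource = {"service.name": "checkout", "host.name": "node-1"}
sampler = ratio_sampler(0.5)
exporter = InMemoryExporter()
processor = BatchProcessor(exporter, max_batch=2)

for trace_id in (1, 2, (1 << 64) - 1, 3):
    if sampler(trace_id):  # sampler decides first
        processor.on_end(Span(trace_id, "doWork", resource))
processor.flush()          # flush the remainder, as a shutdown hook would
```

With a ratio of 0.5, the trace whose ID is all ones in its low 64 bits is dropped, and the remaining three spans reach the exporter as one full batch and one partial batch.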
1.3 Why the split matters
A library depending only on opentelemetry-api adds essentially zero weight and zero opinion about your backend — the same library can be used by an app that exports to Jaeger, to a SaaS vendor, or to nothing at all, without recompilation.
1.4 The Collector
The Collector is a separate Go binary that runs outside your application process. Every pipeline has the same fixed shape: receivers → processors → exporters. Moving this logic out of the app enables decoupled deploys, large-scale batching, vendor portability, and centralized policy.
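The receivers → processors → exporters shape can be sketched as plain function composition. This is only a conceptual model written in Python (the real Collector is configured in YAML and implemented in Go); run_pipeline, drop_health_checks, and add_cluster_attribute are hypothetical names, and the /healthz route and "demo" cluster name are made-up sample values.

```python
def run_pipeline(received, processors, exporters):
    """Apply each processor in order, then fan the result out to every exporter."""
    data = list(received)
    for process in processors:
        data = process(data)
    for export in exporters:
        export(data)
    return data


def drop_health_checks(spans):
    # Filtering processor: discard noisy health-check spans.
    return [s for s in spans if s.get("http.route") != "/healthz"]


def add_cluster_attribute(spans):
    # Enrichment processor: stamp every span with a cluster name.
    return [{**s, "k8s.cluster.name": "demo"} for s in spans]


backend_a, backend_b = [], []  # two stand-in backends

out = run_pipeline(
    received=[
        {"name": "GET /cart", "http.route": "/cart"},
        {"name": "GET /healthz", "http.route": "/healthz"},
    ],
    processors=[drop_health_checks, add_cluster_attribute],
    exporters=[backend_a.extend, backend_b.extend],
)
```

Because filtering and enrichment live in the pipeline rather than in the application, changing either one is a Collector change and needs no application redeploy.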
Figure 5.1: OpenTelemetry three-layer architecture and OTLP data flow
flowchart TD
subgraph App["Application Process"]
Lib[Library Code depends on API only]
Code[Application Code depends on API only]
API[OpenTelemetry API Tracer / Meter / Logger interfaces]
SDK[OpenTelemetry SDK samplers + processors + exporters + resource]
Lib --> API
Code --> API
API --> SDK
end
SDK -->|"OTLP/gRPC :4317 or OTLP/HTTP :4318"| Col
subgraph Col["OpenTelemetry Collector (out-of-process)"]
Recv[Receivers OTLP, Prometheus, filelog]
Proc[Processors batch, memory_limiter, k8sattributes, tail_sampling]
Exp[Exporters OTLP, vendor-specific]
Recv --> Proc --> Exp
end
Exp -->|"OTLP or vendor protocol"| BE[("Backend Prometheus / Tempo / Loki / Vendor SaaS")]
Animation: API → SDK → Collector → Backend
A telemetry packet originates in code, passes through API → SDK exporter → Collector → backend.
Key Points — Section 1
API = interfaces (no-op default). Libraries import only this.
OTLP — the wire protocol with three transport variants.
2.1 Stability per signal, per language
Each language SIG implements the spec at its own pace. Traces stabilized first; metrics followed; logs are the most recent.
2.2 Semantic conventions: the lingua franca
If every team picks its own attribute names — http.statusCode vs http_status vs httpResponse.code — "vendor-neutral" telemetry becomes useless. Semantic conventions give backends like Grafana, Datadog, Honeycomb, and Tempo a consistent vocabulary, enabling out-of-the-box visualizations.
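A short sketch shows why a shared vocabulary matters. It maps the three ad hoc names from the paragraph above onto a single key so that one query works for all teams. The canonical key used here, http.response.status_code, is assumed from the current HTTP semantic conventions; the ALIASES table and the helper names are hypothetical.

```python
# Ad hoc names from three hypothetical teams, mapped to one conventional key.
ALIASES = {
    "http.statusCode": "http.response.status_code",
    "http_status": "http.response.status_code",
    "httpResponse.code": "http.response.status_code",
}


def normalize(attributes):
    """Rename known ad hoc keys; leave everything else untouched."""
    return {ALIASES.get(key, key): value for key, value in attributes.items()}


def is_server_error(attributes):
    """A backend-side query that relies on the conventional key only."""
    return attributes.get("http.response.status_code", 0) >= 500


team_a = normalize({"http.statusCode": 500})
team_b = normalize({"http_status": 500})
team_c = normalize({"httpResponse.code": 500})
```

Without the shared key, is_server_error would need one branch per team; with it, a backend can ship a single dashboard query that works for every service.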
2.3 OTLP transport variants
| Variant | Port | Encoding | When to use |
| --- | --- | --- | --- |
| OTLP/gRPC | 4317 | protobuf over HTTP/2 | Default in modern Kubernetes; lowest overhead |
| OTLP/HTTP/protobuf | 4318 | protobuf over HTTP/1.1 or 2 | Through proxies/LBs that don't speak gRPC well |
| OTLP/HTTP/JSON | 4318 | JSON over HTTP | Browsers, curl-debugging, low volume |
A common mistake is mismatching protocol and port (e.g., http/protobuf with port 4317) — this yields "unexpected response" or "transport error" messages.
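The mismatch is easy to catch before deployment. The following is a minimal sketch of a pre-flight check, assuming the conventional ports from the table above; check_otlp_config is a hypothetical helper, not part of any OpenTelemetry package, and because a Collector can be configured to listen on other ports, its result is best read as a warning rather than a hard error.

```python
from urllib.parse import urlparse

# Conventional ports from the table above; a deployment may override them.
CONVENTIONAL_PORT = {"grpc": 4317, "http/protobuf": 4318, "http/json": 4318}


def check_otlp_config(endpoint, protocol):
    """Return a warning string for a suspicious combination, else None."""
    if protocol not in CONVENTIONAL_PORT:
        return f"unknown protocol {protocol!r}"
    port = urlparse(endpoint).port
    expected = CONVENTIONAL_PORT[protocol]
    if port is not None and port != expected:
        return f"{protocol} usually listens on {expected}, but endpoint uses {port}"
    return None


warning = check_otlp_config("http://collector:4317", "http/protobuf")
ok = check_otlp_config("http://collector:4318", "http/protobuf")
```

Running a check like this in CI or at application startup turns a confusing runtime transport error into an explicit configuration message.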
2.4 Partial success & retries
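OTLP responses carry a partial_success field that reports how many items the receiver rejected. Rejected items should NOT be retried, because they are usually permanently bad (for example, malformed data). Retry only transient failures, with exponential backoff and jitter: gRPC codes such as UNAVAILABLE and DEADLINE_EXCEEDED, and HTTP 429, 502, 503, or 504 (the OTLP specification does not treat every 5xx status as retryable).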
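The retry rule can be sketched as a small decision function plus a backoff schedule. This is an illustrative sketch, not an exporter implementation: should_retry and backoff_delays are hypothetical names, the base delay of 1 second and cap of 30 seconds are arbitrary, and "full jitter" (a uniform draw between zero and the exponential ceiling) is one common jitter strategy chosen here as an assumption. The HTTP set is limited to 429, 502, 503, and 504, the statuses the OTLP specification names as retryable, which is narrower than all 5xx.

```python
import random

RETRYABLE_GRPC = {"UNAVAILABLE", "DEADLINE_EXCEEDED"}
RETRYABLE_HTTP = {429, 502, 503, 504}


def should_retry(transport, code):
    """Retry only transient transport failures.

    A partial_success rejection arrives with a successful status
    (gRPC OK / HTTP 200), so it is never retried by this rule.
    """
    if transport == "grpc":
        return code in RETRYABLE_GRPC
    return code in RETRYABLE_HTTP


def backoff_delays(attempts, base=1.0, cap=30.0, rng=random.random):
    """Full-jitter backoff: each delay is rng() * min(cap, base * 2**attempt)."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]


# Upper bounds of the schedule, obtained by pinning the random draw to 1.0.
ceilings = backoff_delays(6, rng=lambda: 1.0)
```

The jitter spreads retries from many SDK instances over time, so a recovering Collector is not hit by a synchronized wave of resends.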
Figure 5.3: OTLP export request flow across transport variants
sequenceDiagram
participant SDK as SDK Exporter
participant Col as Collector OTLP Receiver
Note over SDK,Col: OTLP/gRPC on :4317
SDK->>Col: HTTP/2 frame: TraceService.Export(ExportTraceServiceRequest, protobuf)
Col-->>SDK: ExportTraceServiceResponse (may include partial_success)
Note over SDK,Col: OTLP/HTTP/protobuf on :4318
SDK->>Col: POST /v1/traces Content-Type application/x-protobuf
Col-->>SDK: 200 OK protobuf body (partial_success)
Note over SDK,Col: OTLP/HTTP/JSON on :4318
SDK->>Col: POST /v1/traces Content-Type application/json
Col-->>SDK: 200 OK JSON body (partial_success)
Note over SDK,Col: Retry only on UNAVAILABLE / 5xx / 429 with backoff
Animation: OTLP/gRPC vs OTLP/HTTP/JSON race
gRPC sends a compact binary frame; HTTP/JSON ships verbose text — same data, higher wire overhead.
Key Points — Section 2
Three pillars of cross-language consistency: spec, semantic conventions, OTLP.
Default to gRPC on 4317; switch to HTTP/protobuf on 4318 for proxies; HTTP/JSON for browsers/curl.
Never cross protocol with port: gRPC ⇔ 4317, HTTP ⇔ 4318.
Retry on transient transport errors only. partial_success rejections are permanent.
Post-Reading Quiz — Half 1
1. A third-party library is instrumented with OpenTelemetry. Which dependency should it import to keep its footprint minimal and avoid forcing a vendor or backend on its users?
The OpenTelemetry SDK (with default exporter)
The OpenTelemetry API only
The OpenTelemetry Collector binary
A vendor-specific tracing client
2. An application has fully instrumented code but does not initialize an SDK at startup. What happens when the library calls tracer.spanBuilder("doWork").startSpan()?
An exception is thrown because no exporter is registered
A span is created and queued in memory indefinitely
A no-op span is returned and no data is recorded
The span is sent to a default OTLP endpoint on localhost
3. Which SDK component decides whether a span is recorded?
Exporter
Processor
Sampler
Resource
4. A developer sets OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317 and OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf. What problem is this likely to cause?
No problem — OTLP transports auto-negotiate
Port 4317 is the gRPC port, not the HTTP port; transport errors will result
The protocol value is invalid; only grpc is supported
HTTPS is required for OTLP
5. Why are semantic conventions important across languages?
They speed up protobuf encoding
They are required by the Collector's batch processor
They standardize attribute names so any backend can interpret data from any language
They allow gRPC to traverse HTTP/1.1 proxies
Half 2 — Sections 3 & 4
Pre-Reading Quiz — Half 2
1. Which deployment topology is best suited for adding k8s.pod.name and tailing container logs from /var/log/containers/*.log?
Centralized gateway Deployment
DaemonSet agent (one Collector per node)
A single Collector pod managed by an HPA
Collector running directly on the control-plane node
2. Why does tail-based sampling typically require a gateway rather than a node-local agent?
Agents can't run the tail_sampling processor binary
Tail sampling needs visibility into the full trace, but a node-local agent only sees spans for its own node
Gateways have faster CPUs than agents
The tail-sampling processor only ships in vendor distributions
3. In a hybrid topology with multiple gateway replicas running tail sampling, what is required to make tail-sampling decisions correct?
All gateway pods must share a Redis cache
Trace-ID-aware load balancing so all spans of a trace hit the same gateway pod
A sidecar Collector in every application pod
Disabling the batch processor in the gateway
4. Which Collector distribution is the practical default for most production Kubernetes pipelines that need k8sattributes, tail_sampling, and filelog?
otelcol (core)
otelcol-contrib
A custom ocb build
Grafana Alloy only
5. A security team requires a minimal Collector binary with only the components actually used in production, and wants a Go module inventory for compliance. Which approach fits best?
Strip otelcol-contrib at runtime with environment variables
Use the OpenTelemetry Collector Builder (ocb) with a manifest listing exactly the needed components
Deploy the vendor agent and disable unused processors
Use otelcol core and write the missing components inline in YAML
3. Collector Deployment Topologies
The same Collector binary, with different configuration, can be deployed as a sidecar, a DaemonSet per node, or a centralized fleet of pods. Most production Kubernetes environments combine more than one of these patterns.
3.1 Agent mode (sidecar or DaemonSet)
Sidecar: one Collector per app pod. Useful for per-service pipelines or strict tenant isolation. Expensive at scale.
DaemonSet: one Collector per node. Apps on that node send telemetry locally.
3.2 Gateway mode (centralized Deployment)
For correct tail sampling across gateway replicas, you need trace-ID-aware load balancing so that all spans of a trace land on the same pod (for example, the loadbalancing exporter or L7 consistent hashing).
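The idea of trace-ID-aware routing can be shown in a few lines. The sketch below uses rendezvous (highest-random-weight) hashing as a stand-in technique; it is not the algorithm used by the loadbalancing exporter, and pick_gateway plus the gateway and trace names are hypothetical.

```python
import hashlib


def pick_gateway(trace_id, gateways):
    """Rendezvous hashing: score every gateway for this trace, take the highest."""

    def weight(gateway):
        digest = hashlib.sha256(f"{gateway}|{trace_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(gateways, key=weight)


gateways = ["gateway-0", "gateway-1", "gateway-2"]
spans = [
    ("trace-a", "GET /cart"),
    ("trace-b", "GET /pay"),
    ("trace-a", "SELECT items"),  # same trace, emitted from another node
]

# Route each span independently; spans of one trace still converge on one pod.
routed = {}
for trace_id, name in spans:
    routed.setdefault(pick_gateway(trace_id, gateways), []).append((trace_id, name))
```

Because the choice depends only on the trace ID and the set of gateways, any agent on any node makes the same decision without coordination, and removing a gateway only moves the traces that were assigned to it.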
3.3 Trade-off summary
| Concern | Agent | Gateway |
| --- | --- | --- |
| Batching efficiency | Smaller batches per node | Large aggregated batches |
| Tail sampling | Limited local view → broken | Global view → correct |
| Auth to backends | Secrets on every node | Secrets centralized |
| Reliability | No central SPOF; per-node blast radius | Choke point; needs replicas + HPA |
| Multi-tenancy | Hard to enforce | Natural enforcement point |
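The tail-sampling row deserves a concrete illustration. The sketch below applies one simple policy, "keep every span of any trace that contains an error", first per node and then over the combined data. The policy, the tail_sample function, and the sample spans are all assumptions made for this example; real tail-sampling policies are configurable and richer.

```python
from collections import defaultdict


def tail_sample(spans):
    """Keep all spans of any trace that has at least one error span."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span)
    kept = []
    for trace_spans in by_trace.values():
        if any(s["error"] for s in trace_spans):
            kept.extend(trace_spans)
    return kept


# One trace, two spans, emitted by services on two different nodes.
node_1 = [{"trace_id": "t1", "name": "GET /checkout", "error": False}]
node_2 = [{"trace_id": "t1", "name": "charge card", "error": True}]

# Agent view: each node decides with only its own spans.
agent_view = tail_sample(node_1) + tail_sample(node_2)

# Gateway view: one decision over the whole trace.
gateway_view = tail_sample(node_1 + node_2)
```

The agent on node 1 sees no error and drops its span, so the per-node result keeps only a fragment of the failing trace, while the gateway keeps the trace intact.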
3.4 Hybrid — the recommended pattern
In the hybrid pattern, a DaemonSet agent does the cheap node-local work (enrichment, host signals, log tailing) and forwards OTLP to a Deployment gateway, which handles tail sampling, tenant routing, and backend auth. If the gateway is overloaded, agents buffer locally, within their configured queue limits, until pressure subsides.
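Local buffering under gateway pressure can be modeled with a bounded queue. This is a rough sketch under stated assumptions: AgentBuffer is a hypothetical class, the capacity of 2 batches is arbitrary, and rejecting new batches when the queue is full is one possible overflow policy picked for illustration, not a statement of how the Collector behaves.

```python
from collections import deque


class AgentBuffer:
    """Bounded in-memory queue of batches waiting for the gateway."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.dropped = 0

    def enqueue(self, batch):
        """Queue a batch; count it as dropped if the buffer is already full."""
        if len(self.queue) >= self.capacity:
            self.dropped += 1
            return False
        self.queue.append(batch)
        return True

    def drain(self, send):
        """Send queued batches in order; stop at the first failure, keep the rest."""
        while self.queue:
            if not send(self.queue[0]):
                return
            self.queue.popleft()


delivered = []
gateway_healthy = False


def send_to_gateway(batch):
    if gateway_healthy:
        delivered.append(batch)
    return gateway_healthy


buffer = AgentBuffer(capacity=2)
for batch in ("batch-1", "batch-2", "batch-3"):
    buffer.enqueue(batch)
    buffer.drain(send_to_gateway)  # gateway overloaded: nothing leaves

gateway_healthy = True             # pressure subsides
buffer.drain(send_to_gateway)
```

The bound matters: buffering absorbs short gateway outages, but an unbounded queue would trade lost telemetry for an out-of-memory agent, so sizing the queue is a deliberate capacity decision.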
3.5 The OpenTelemetry Operator
On Kubernetes, the Operator provides OpenTelemetryCollector CRDs with mode: daemonset, sidecar, or deployment, plus an Instrumentation CR that auto-injects SDKs for Java, Python, Node, .NET, and Go.
Figure 5.2: Agent, gateway, and hybrid Collector topologies
Then ocb --config manifest.yaml emits a single Go binary with exactly those components — nothing more.
4.3 Vendor distributions
AWS ADOT, Splunk OTel Collector, Datadog Agent (OTel mode), and Grafana Agent / Alloy bundle the upstream Collector with vendor-tuned defaults. They still respect the API/SDK/OTLP boundaries: applications continue emitting standard OTLP, and only the Collector is vendor-flavored. Switching vendors is therefore largely a Collector configuration change.
otelcol-contrib is the practical default; otelcol core is the OTLP-only minimal build.
Use ocb when you need supply-chain control or a minimal binary.
Vendor distributions are upstream Collector + vendor defaults + support.
Apps always emit standard OTLP — distribution choice is a Collector concern.
Post-Reading Quiz — Half 2
1. Which deployment topology is best suited for adding k8s.pod.name and tailing container logs from /var/log/containers/*.log?
Centralized gateway Deployment
DaemonSet agent (one Collector per node)
A single Collector pod managed by an HPA
Collector running directly on the control-plane node
2. Why does tail-based sampling typically require a gateway rather than a node-local agent?
Agents can't run the tail_sampling processor binary
Tail sampling needs visibility into the full trace, but a node-local agent only sees spans for its own node
Gateways have faster CPUs than agents
The tail-sampling processor only ships in vendor distributions
3. In a hybrid topology with multiple gateway replicas running tail sampling, what is required to make tail-sampling decisions correct?
All gateway pods must share a Redis cache
Trace-ID-aware load balancing so all spans of a trace hit the same gateway pod
A sidecar Collector in every application pod
Disabling the batch processor in the gateway
4. Which Collector distribution is the practical default for most production Kubernetes pipelines that need k8sattributes, tail_sampling, and filelog?
otelcol (core)
otelcol-contrib
A custom ocb build
Grafana Alloy only
5. A security team requires a minimal Collector binary with only the components actually used in production, and wants a Go module inventory for compliance. Which approach fits best?
Strip otelcol-contrib at runtime with environment variables
Use the OpenTelemetry Collector Builder (ocb) with a manifest listing exactly the needed components
Deploy the vendor agent and disable unused processors
Use otelcol core and write the missing components inline in YAML