Chapter 6 — Instrumentation: Manual, Automatic, and Zero-Code
Learning Objectives
Choose between manual, automatic, and zero-code instrumentation for a given language and runtime.
Add custom spans, metrics, and attributes that follow OpenTelemetry semantic conventions.
Configure auto-instrumentation in Kubernetes using the OpenTelemetry Operator and its Instrumentation CRD.
Instrumentation is the act of teaching your code to talk about itself. OpenTelemetry recognizes three strategies for producing signals: manual (developers explicitly emit spans, metrics, and logs), automatic (libraries are patched at runtime), and zero-code (an external observer — eBPF in the kernel or a Kubernetes Operator injecting agents — generates telemetry with no application awareness). Think of them as a memoir, a transcriptionist, and a hidden ceiling microphone: each captures the story differently; each has a place.
graph TD
    I[Application Telemetry]
    I --> M["Manual: developer writes tracer.start_span"]
    I --> A["Automatic: runtime agent wraps libraries"]
    I --> Z["Zero-Code: eBPF kernel probes or Operator injection"]
    M -->|effort: high| Q1["business attributes: tenant.id, order.id"]
    A -->|effort: low| Q2["broad HTTP/DB/RPC library coverage"]
    Z -->|effort: none| Q3["polyglot, no rebuild, kernel-wide visibility"]
Part 1 Pre-Quiz — Sections 1 & 2
Answer first, then read. You'll re-answer the same questions after the reading to measure improvement.
Pre-Reading Check — Part 1
1. Which instrumentation approach is the only one that can reliably capture a tenant.id or payment.outcome attribute?
2. You need to count active WebSocket connections (which go up and down). Which OpenTelemetry instrument fits?
3. How does the Java OpenTelemetry agent inject instrumentation into your app?
4. A Python service is auto-instrumented but produces no spans. The most common root cause is…
5. Which environment variable convention is shared across Java, Python, and Node.js auto-instrumentation?
Section 1: Manual Instrumentation
Manual instrumentation puts the developer in direct control. You acquire a Tracer, Meter, or Logger from the SDK and explicitly start spans, record measurements, or write structured log events. It is the only way to express domain context: concepts like order.id, tenant.id, payment.status, and feature_flag.variant that no auto-instrumenter could ever guess.
Acquiring Tracers and Meters
The SDK exposes TracerProvider, MeterProvider, and LoggerProvider. From each you obtain a named, versioned instance scoped to your module (conventionally named after the package import path). The name is recorded as the instrumentation scope name on every signal (surfaced as otel.scope.name by non-OTLP exporters), letting backends filter by which code produced the data.
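// Java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.api.trace.Tracer;

Tracer tracer = GlobalOpenTelemetry.getTracer("com.acme.payments", "1.4.0");
Meter meter = GlobalOpenTelemetry.getMeter("com.acme.payments");

# Python
from opentelemetry import metrics, trace

tracer = trace.get_tracer("acme.payments", "1.4.0")
meter = metrics.get_meter("acme.payments")

// Node.js
const { metrics, trace } = require('@opentelemetry/api');

const tracer = trace.getTracer('acme-payments', '1.4.0');
const meter = metrics.getMeter('acme-payments');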
Creating Spans and Recording Attributes
A span is a named, timed operation with attributes, events, and a status. The idiomatic pattern wraps a unit of work so the span closes even on exceptions:
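# Python: idiomatic context-manager span
from opentelemetry import trace

tracer = trace.get_tracer("acme.payments", "1.4.0")

# start_as_current_span records exceptions and sets error status by default;
# both are switched off here because the except block does it explicitly.
with tracer.start_as_current_span(
    "authorize_payment",
    record_exception=False,
    set_status_on_exception=False,
) as span:
    span.set_attribute("payment.method", "card")
    span.set_attribute("tenant.id", tenant_id)  # tenant_id, gateway, amount come from the caller
    try:
        approved = gateway.authorize(amount)
        span.set_attribute("payment.outcome",
                           "approved" if approved else "declined")
    except Exception as exc:
        span.record_exception(exc)
        span.set_status(trace.StatusCode.ERROR, str(exc))
        raise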
Use dot-namespaced keys (payment.method, not paymentMethod); follow semantic conventions where one exists; treat your custom namespace (acme.*) like a public API — once a dashboard depends on it, you cannot freely rename it.
Picking the Right Metric Instrument
| Instrument | Direction | Aggregation | Typical Use |
| --- | --- | --- | --- |
| Counter | Monotonic up | Sum | Total requests, errors, bytes sent |
| UpDownCounter | Up or down | Sum | Active connections, queue depth, pool size |
| Histogram | Observations | Bucketed distribution | Request latency, payload size |
| Gauge (observable) | Sampled | Last value | CPU utilization, current temperature |
Three rules: declare units (s, By, 1); let the SDK pick histogram buckets unless you really know the latency profile; remember that every attribute multiplies cardinality — an idea we revisit in Section 4.
Key Points — Section 1
Manual is the only path to business-meaningful telemetry. No auto-instrumenter can invent tenant.id, order.id, or feature_flag.variant.
Acquire named, versioned tracers and meters once per module — the name becomes the instrumentation.scope.
Wrap units of work in context-managed spans so they close on exception; always set status and record exceptions.
Match the instrument to the semantics: Counter (monotonic), UpDownCounter (ebb & flow), Histogram (distribution), Gauge (current value).
Set units, choose attribute keys carefully, and recognize that each label dimension multiplies the time-series count.
Section 2: Automatic Instrumentation
Auto-instrumentation answers "How do I get traces from libraries I did not write?" The mechanism differs by runtime because each language exposes different hooks.
Bytecode Injection: Java and .NET
The Java agent attaches via the -javaagent flag, registers a premain with the Instrumentation API, and uses ByteBuddy to rewrite classes as the classloader loads them (the lifecycle is traced in Figure 6.2).
The agent ships with modules for Servlet, Spring MVC/WebFlux, JAX-RS, gRPC, OkHttp, JDBC, R2DBC, Hibernate, Mongo, Cassandra, Kafka, RabbitMQ, JMS, and more. Because rewriting happens at class load, no source change is needed — but the agent must attach at JVM start and exotic classloaders sometimes need extra config. .NET uses a conceptually similar mechanism via a CLR profiler activated by CORECLR_ENABLE_PROFILING=1.
Animation: the original class bytes load, and the agent's ClassFileTransformer weaves span start and end calls around the method body. The very first request then emits a span (for example http.server.request with method=GET, route=/checkout, status=200) with zero source changes.
Monkey-Patching: Python and Node.js
Dynamic languages let you replace functions at runtime. Python's opentelemetry-instrument CLI bootstraps every installed opentelemetry-instrumentation-* package, each of which monkey-patches its target library at import time.
The requests instrumentation replaces Session.request with a wrapper that opens a client span, records HTTP attributes, calls the original, captures the response, and ends the span. It must run before the first import of the patched library; otherwise the cached reference is the unpatched one. Node.js relies on a require hook via require-in-the-middle:
// tracing.js — must be required first
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({}),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

// run: NODE_OPTIONS="--require ./tracing.js" node app.js
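The Python monkey-patching mechanism described above, and its import-order pitfall, can be illustrated with a minimal standard-library sketch. Nothing here is OpenTelemetry code: fakelib, instrument, and recorded_spans are hypothetical stand-ins for a patched library, an instrumentation package, and an exporter.

```python
import functools
import types

# Stand-in for a third-party library module (hypothetical, not the real `requests`).
fakelib = types.ModuleType("fakelib")
fakelib.fetch = lambda url: f"200 OK from {url}"

# Code that ran `from fakelib import fetch` BEFORE patching keeps a direct
# reference to the original, unpatched function.
early_reference = fakelib.fetch

recorded_spans = []  # stand-in for an exporter


def instrument(module, func_name):
    """Replace module.func_name with a wrapper that records a span-like dict."""
    original = getattr(module, func_name)

    @functools.wraps(original)
    def wrapper(url):
        span = {"name": f"{module.__name__}.{func_name}", "url.full": url}
        try:
            return original(url)         # call through to the unpatched function
        finally:
            recorded_spans.append(span)  # "end" the span even if the call raises

    setattr(module, func_name, wrapper)


instrument(fakelib, "fetch")

fakelib.fetch("http://a.example")    # looked up on the module: goes through the wrapper
early_reference("http://b.example")  # cached before patching: bypasses the wrapper

print(len(recorded_spans))  # 1
```

Only the call that looks the function up on the module after patching produces a span, which is why the bootstrap has to run before application code imports the target library.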
Java Agent Lifecycle (Figure 6.2)
sequenceDiagram
participant OS as OS / shell
participant JVM as JVM
participant Agent as otel-javaagent.jar
participant CL as ClassLoader
participant App as Application code
participant Col as OTLP Collector
OS->>JVM: java -javaagent:otel.jar -jar app.jar
JVM->>Agent: invoke premain(Instrumentation)
Agent->>JVM: register ClassFileTransformer (ByteBuddy)
JVM->>App: start main()
App->>CL: load HttpServlet, JdbcDriver, ...
CL->>Agent: transform(class bytes)
Agent-->>CL: rewritten bytes with span hooks
App->>App: first request enters Servlet.service()
App->>Col: OTLP span exported (http.server.request)
Cross-Language Comparison
| Aspect | Java | Python | Node.js |
| --- | --- | --- | --- |
| Primary mechanism | -javaagent bytecode rewrite | Monkey-patching at import | require hook + export patching |
| Entry point | JVM flag | opentelemetry-instrument CLI | NODE_OPTIONS=--require |
| Code changes | None | None | One bootstrap file |
| Context propagation | Thread-locals + executor wrappers | contextvars + async wrappers | Per-library async hooks |
| Common pitfall | Custom classloaders | Import order before patch | Bundlers/serverless hide require |
A shared environment-variable contract spans every language: OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_TRACES_EXPORTER, OTEL_METRICS_EXPORTER, OTEL_LOGS_EXPORTER, OTEL_EXPORTER_OTLP_PROTOCOL, OTEL_TRACES_SAMPLER, OTEL_PROPAGATORS, OTEL_RESOURCE_ATTRIBUTES. One ConfigMap, one vocabulary, every workload.
Debugging axioms: No traces at all usually means an exporter is set to none, the protocol is mismatched (grpc vs. http/protobuf), or the bootstrap loaded too late. Duplicate spans almost always mean a library is being captured by both auto- and manual instrumentation; disable one.
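Two of the "no traces" causes can be checked mechanically from the environment alone. The sketch below is a hypothetical helper (diagnose_otel_env is not part of any SDK) and relies on one assumption: that the Collector uses the conventional OTLP ports, 4317 for gRPC and 4318 for HTTP. A Collector can listen elsewhere, so a mismatch is a hint rather than proof. Bootstrap timing cannot be detected this way.

```python
from urllib.parse import urlparse

# Conventional OTLP ports (assumption: the Collector uses the defaults).
CONVENTIONAL_PORT = {"grpc": 4317, "http/protobuf": 4318}


def diagnose_otel_env(env):
    """Return warnings for common 'no traces' misconfigurations in an env mapping."""
    warnings = []

    if env.get("OTEL_TRACES_EXPORTER", "otlp").strip().lower() == "none":
        warnings.append("OTEL_TRACES_EXPORTER=none: trace export is disabled")

    protocol = env.get("OTEL_EXPORTER_OTLP_PROTOCOL")
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if protocol and endpoint:
        port = urlparse(endpoint).port
        expected = CONVENTIONAL_PORT.get(protocol)
        if port and expected and port != expected:
            warnings.append(
                f"protocol {protocol} conventionally uses port {expected}, "
                f"but the endpoint targets port {port}"
            )
    return warnings


findings = diagnose_otel_env({
    "OTEL_TRACES_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector.observability:4317",
})
for line in findings:
    print(line)
```

In practice you would pass os.environ, or the env section of a rendered pod spec, instead of a literal dictionary.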
Key Points — Section 2
Java/.NET rewrite bytecode/IL at class load — agent must attach at JVM/CLR start; cannot be retrofitted to a running process.
Python & Node.js monkey-patch on import / require; instrumentation must run before the target library is loaded.
All three runtimes honor the same OTEL_* environment variables — this is what makes mixed-language fleets manageable.
No traces? Check exporter, protocol mismatch, and bootstrap timing. Duplicate spans? You stacked auto + manual on the same library.
Auto buys broad coverage; it cannot infer business meaning. Combine it with manual on critical flows.
Part 1 Post-Quiz — Sections 1 & 2
Same questions, now with context. Don't peek — the explanations reveal at the end.
Post-Reading Check — Part 1
1. Which instrumentation approach is the only one that can reliably capture a tenant.id or payment.outcome attribute?
2. You need to count active WebSocket connections (which go up and down). Which OpenTelemetry instrument fits?
3. How does the Java OpenTelemetry agent inject instrumentation into your app?
4. A Python service is auto-instrumented but produces no spans. The most common root cause is…
5. Which environment variable convention is shared across Java, Python, and Node.js auto-instrumentation?
Part 2 Pre-Quiz — Sections 3 & 4
Same routine — answer first, then read.
Pre-Reading Check — Part 2
6. What does the OpenTelemetry Operator's mutating webhook do when a pod is annotated with instrumentation.opentelemetry.io/inject-java?
7. eBPF zero-code instrumentation is great at HTTP/gRPC coverage but loses in which scenario?
8. Why are semantic conventions important?
9. Which attribute is safe to use as a metric label dimension on an HTTP latency histogram?
10. Where do attributes like service.name, service.version, and k8s.pod.name belong?
Section 3: Zero-Code Instrumentation
"Zero-code" goes further than auto-instrumentation: the developer doesn't write tracing code and the build artifact isn't modified. Two distinct technologies live under this umbrella: eBPF agents that observe processes from the Linux kernel, and the OpenTelemetry Operator that injects auto-instrumentation agents into Kubernetes pods at admission time without changing container images.
eBPF-Based Auto-Instrumentation
eBPF lets you load safe, sandboxed bytecode into the Linux kernel and attach it to kernel events — syscalls, function entry/exit, tracepoints, network events — at runtime, without recompiling the kernel. An eBPF observability agent typically:
Attaches kprobes and tracepoints to kernel networking paths (kprobes on tcp_sendmsg and tcp_cleanup_rbuf, tracepoints such as sys_enter_sendto and sys_enter_recvfrom) to observe every byte crossing TCP.
Attaches uprobes to user-space functions in shared libraries — SSL_read/SSL_write in libssl, Go's HTTP runtime, JVM JNI entries — to see data before encryption.
Writes structured records into eBPF maps drained at high frequency by a user-space agent.
Reconstructs requests — pairing sends/recvs, parsing HTTP headers, gRPC HTTP/2 framing — to produce L7 metrics and OTLP spans.
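The last step, reconstructing requests from raw socket events, can be sketched in user space as follows. This is a simplified model, not code from any eBPF agent: SocketEvent and its fields are hypothetical, it handles only plaintext HTTP/1.x, and it assumes one in-flight request per connection.

```python
from dataclasses import dataclass


@dataclass
class SocketEvent:
    conn_id: int       # hypothetical connection key, e.g. derived from pid + fd
    direction: str     # "recv" or "send", from the server's point of view
    timestamp_ns: int
    payload: bytes     # leading bytes of the buffer captured by the probe


def reconstruct_http_spans(events):
    """Pair each inbound request with the next outbound response per connection."""
    pending = {}  # conn_id -> (start_ns, method, path)
    spans = []
    for ev in sorted(events, key=lambda e: e.timestamp_ns):
        first_line = ev.payload.split(b"\r\n", 1)[0].decode("ascii", "replace")
        parts = first_line.split(" ")
        is_request = len(parts) == 3 and parts[2].startswith("HTTP/")
        is_response = (len(parts) >= 2 and parts[0].startswith("HTTP/")
                       and parts[1].isdigit())
        if ev.direction == "recv" and is_request:
            pending[ev.conn_id] = (ev.timestamp_ns, parts[0], parts[1])
        elif ev.direction == "send" and is_response and ev.conn_id in pending:
            start_ns, method, path = pending.pop(ev.conn_id)
            spans.append({
                "http.request.method": method,
                "url.path": path,
                "http.response.status_code": int(parts[1]),
                "duration_ns": ev.timestamp_ns - start_ns,
            })
    return spans


events = [
    SocketEvent(7, "recv", 1_000, b"GET /checkout HTTP/1.1\r\nHost: shop\r\n\r\n"),
    SocketEvent(7, "send", 4_500, b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok"),
]
spans = reconstruct_http_spans(events)
print(spans[0]["http.response.status_code"], spans[0]["duration_ns"])  # 200 3500
```

Real agents also have to cope with pipelining, HTTP/2 framing, and payloads split across several events, which this sketch deliberately ignores.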
flowchart TD
    subgraph US[User space]
        A1["App A: Go binary"]
        A2["App B: Java JVM"]
        A3["App C: Python"]
        SSL[libssl.so]
        DS["Beyla / Pixie DaemonSet (user-space agent)"]
    end
    subgraph K[Linux kernel]
        KP1[kprobe: tcp_sendmsg]
        KP2[kprobe: tcp_cleanup_rbuf]
        KP3[tracepoint: sys_enter_sendto]
        UP[uprobe: SSL_read / SSL_write]
        MAP[("eBPF map / ring buffer")]
    end
    A1 -->|syscalls| KP1
    A2 -->|syscalls| KP3
    A3 -->|TLS calls| SSL
    SSL --> UP
    KP1 --> MAP
    KP2 --> MAP
    KP3 --> MAP
    UP --> MAP
    MAP --> DS
    DS -->|OTLP spans + RED metrics| COL[OpenTelemetry Collector]
Because the hooks live in the kernel and in shared libraries, eBPF works for every language on a Linux host (Go, Rust, Java, Python, Node, C++, even closed-source binaries) without touching their code. The output is typically RED metrics (rate, errors, duration) plus traces for common protocols.
| Tool | Focus | Output |
| --- | --- | --- |
| Grafana Beyla | Zero-code OTel auto-instrumentation for HTTP/gRPC/DB | OTLP traces + RED metrics |
| Pixie | K8s deep debugging, full request bodies, PxL scripts | In-cluster live data |
| Cilium Tetragon | Runtime security and policy enforcement | Process/file/network events; can block |
| Odigos | eBPF + SDK hybrid OTel platform | OTLP routed by policy |
OpenTelemetry Operator and the Instrumentation CRD
For Kubernetes workloads, the Operator offers a different flavor of zero-code: the cluster itself injects the SDK auto-instrumentation agents we saw in Section 2, without changing your container images.
# 1. The Instrumentation CRD: a reusable recipe per language
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
  namespace: production
spec:
  exporter:
    # 4317 is the conventional OTLP/gRPC port; agents that export over
    # http/protobuf conventionally target 4318 instead.
    endpoint: http://otel-collector.observability:4317
  propagators: [tracecontext, baggage]
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"
  resource:
    resourceAttributes:
      deployment.environment: prod
      service.namespace: payments
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest
---
# 2. The Deployment opts in via annotations on its pod template
#    (spec.template.metadata), not on the Deployment's own metadata
metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-java: "production/default-instrumentation"
When the annotated pod is created, the mutating webhook injects an init container that copies the agent JAR into a shared volume, then patches the application container with JAVA_TOOL_OPTIONS=-javaagent:/otel-auto-instrumentation/javaagent.jar and the appropriate OTEL_* environment variables. Python and Node.js use analogous mechanisms. Write the CRD once, annotate deployments by language, and every new pod is born observable.
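To make the mutation concrete, here is a simplified model of what the webhook does to a pod spec, written as plain dictionary manipulation. It is a sketch under assumptions, not the Operator's implementation: the volume and init-container names, the copy command, and the choice to patch only the first container are all hypothetical.

```python
import copy

ANNOTATION = "instrumentation.opentelemetry.io/inject-java"
AGENT_DIR = "/otel-auto-instrumentation"
VOLUME_NAME = "otel-agent"  # hypothetical volume name


def inject_java_agent(pod, agent_image, otel_env):
    """Return a patched copy of `pod` if it opted in; otherwise return it unchanged."""
    annotations = pod.get("metadata", {}).get("annotations", {})
    if ANNOTATION not in annotations:
        return pod

    patched = copy.deepcopy(pod)
    spec = patched["spec"]
    mount = {"name": VOLUME_NAME, "mountPath": AGENT_DIR}

    # Shared scratch volume visible to both the init and the app container.
    spec.setdefault("volumes", []).append({"name": VOLUME_NAME, "emptyDir": {}})

    # Init container copies the agent JAR out of the agent image (paths assumed).
    spec.setdefault("initContainers", []).append({
        "name": "copy-otel-agent",
        "image": agent_image,
        "command": ["cp", "/javaagent.jar", f"{AGENT_DIR}/javaagent.jar"],
        "volumeMounts": [mount],
    })

    # Patch the first container only; treating it as the application is an assumption.
    app = spec["containers"][0]
    app.setdefault("volumeMounts", []).append(mount)
    env = app.setdefault("env", [])
    env.append({"name": "JAVA_TOOL_OPTIONS",
                "value": f"-javaagent:{AGENT_DIR}/javaagent.jar"})
    env.extend({"name": k, "value": v} for k, v in otel_env.items())
    return patched


pod = {
    "metadata": {"annotations": {ANNOTATION: "production/default-instrumentation"}},
    "spec": {"containers": [{"name": "app", "image": "acme/payments:1.4.0"}]},
}
patched = inject_java_agent(
    pod,
    agent_image="autoinstrumentation-java:example",
    otel_env={"OTEL_SERVICE_NAME": "payments"},
)
print([e["name"] for e in patched["spec"]["containers"][0]["env"]])
# ['JAVA_TOOL_OPTIONS', 'OTEL_SERVICE_NAME']
```

The application image is untouched; only the pod spec changes, which is what makes this path "zero-code" from the developer's point of view.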
Animation B — Operator Pod Injection Workflow
Step 1: pod create hits the API server. Step 2: Operator's mutating webhook reads the Instrumentation CRD plus pod annotation. Step 3: it returns a patched pod — init container copies the agent into a shared volume; app container is patched with JAVA_TOOL_OPTIONS and OTEL_* env vars. Step 4: the JVM starts with the agent and OTLP spans flow to the Collector. No container image change.
Limits of Zero-Code — Hybrid Is Production-Grade
| Capability | Manual | Auto (SDK) | eBPF |
| --- | --- | --- | --- |
| HTTP/gRPC/DB calls | If coded | Yes, broad | Yes, broad |
| Business attrs (order_id) | Yes | No | No |
| Closed-source binaries | No | No | Yes |
| TLS-encrypted in-process | Yes | Yes | Only via libssl uprobes |
| Windows/macOS | Yes | Yes | Linux only |
| Rollout effort | High | Low | Very low (DaemonSet) |
| Privilege required | App identity | App identity | CAP_SYS_ADMIN/CAP_BPF |
Hybrid wins. Run eBPF for horizontal, polyglot baseline coverage; use the Operator to inject SDK auto-instrumentation on every K8s pod; add manual instrumentation on the critical business flows where you need tenant_id, feature_flag, and payment.outcome.
Key Points — Section 3
"Zero-code" really means two things: eBPF kernel/library probes and the OTel Operator injecting SDK agents into K8s pods at admission.
eBPF works for every language on a Linux host — even closed-source binaries — but cannot see business meaning and needs libssl uprobes for TLS bodies.
The Operator path inherits all SDK auto-instrumentation strengths and limits — broad coverage, no business attributes — but rolls out cluster-wide via one CRD plus a pod annotation.
Hybrid is the production answer: eBPF baseline + Operator-injected SDK + manual on critical flows.
Section 4: Semantic Conventions in Practice
Instrumentation nobody can query is expensive noise. Semantic conventions are OpenTelemetry's contract that names things the same way everywhere, so a dashboard against http.response.status_code works whether the data came from a Java agent, a Python monkey-patch, a Beyla eBPF probe, or your own manual code.
Common Stable Attributes
| Domain | Attribute | Example Value | Use |
| --- | --- | --- | --- |
| HTTP | http.request.method | GET, POST | Method dimension on RED metrics |
| HTTP | http.response.status_code | 200, 503 | Error rate, SLO burn |
| HTTP | http.route | /orders/{id} | Path grouping without ID explosion |
| HTTP | server.address | api.acme.io | Backend grouping |
| RPC | rpc.system | grpc | Filter by RPC family |
| DB | db.system | postgresql | Engine breakdown |
| DB | db.operation | SELECT | Latency by operation |
| Messaging | messaging.system | kafka | Broker breakdown |
| Messaging | messaging.destination.name | orders.events | Per-topic throughput |
A query like "p95 HTTP latency by route and status, last 30 minutes" is just groupby(http.route, http.response.status_code) of histogram(http.server.request.duration) — same panel, across the fleet, across vendors.
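The grouping in that query can be sketched over raw observations. This is illustrative only: real backends estimate percentiles from histogram buckets rather than storing every duration, and p95_by_route_and_status is a hypothetical helper using a nearest-rank percentile.

```python
from collections import defaultdict


def p95_by_route_and_status(observations):
    """observations: iterable of (http.route, status_code, duration_ms) tuples."""
    groups = defaultdict(list)
    for route, status, duration_ms in observations:
        groups[(route, status)].append(duration_ms)

    result = {}
    for key, durations in groups.items():
        durations.sort()
        # Nearest-rank p95: ceil(0.95 * n), computed with integer arithmetic.
        rank = (95 * len(durations) + 99) // 100
        result[key] = durations[rank - 1]
    return result


# Twenty requests to one templated route: 10 ms, 20 ms, ..., 200 ms.
observations = [("/orders/{id}", 200, ms) for ms in range(10, 201, 10)]
observations.append(("/orders/{id}", 503, 950))

p95 = p95_by_route_and_status(observations)
print(p95[("/orders/{id}", 200)])  # 190
print(p95[("/orders/{id}", 503)])  # 950
```

Because the keys are the semantic-convention attributes, the same grouping works regardless of which instrumentation path produced the data.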
Animation C — One Request, Two Services, Same Attributes
The same HTTP call hits two services that use different instrumentation paths. Both attach the same attribute keys (http.request.method, http.response.status_code, server.address, http.route) because they follow semantic conventions, so one query and one dashboard work against both signals. That's vendor portability.
Resource Attributes — Identity of the Emitter
Where attributes describe a single signal, resource attributes describe the emitter. Set them once at SDK init via OTEL_RESOURCE_ATTRIBUTES, a comma-separated list of key=value pairs.
The OpenTelemetry Operator can fill many of these from the pod's downward API — you should rarely set K8s resource attributes by hand.
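As an illustration of the OTEL_RESOURCE_ATTRIBUTES format, the sketch below parses a comma-separated key=value string into a dictionary. parse_resource_attributes is a hypothetical helper, not an SDK function, and the example values are made up; the SDKs do this parsing for you.

```python
from urllib.parse import unquote


def parse_resource_attributes(raw):
    """Parse 'k1=v1,k2=v2' into a dict, skipping empty or malformed entries."""
    attributes = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            continue  # no '=' or empty key: ignore the entry
        # Values may be percent-encoded, so decode them.
        attributes[key.strip()] = unquote(value.strip())
    return attributes


raw = "service.name=checkout,service.version=1.4.0,deployment.environment=prod"
resource = parse_resource_attributes(raw)
print(resource["service.name"], resource["service.version"])  # checkout 1.4.0
```

Every span, metric, and log record emitted by the process then carries this identity without any per-signal code.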
Operator Workflow (Figure 6.4)
sequenceDiagram
participant Dev as Developer
participant API as kube-apiserver
participant Op as OTel Operator
participant WH as Mutating Webhook
participant Init as Init container
participant App as App container
participant Col as Collector
Dev->>API: kubectl apply Instrumentation CR
Op->>API: watch + cache Instrumentation CR
Dev->>API: apply Deployment w/ annotation inject-java
API->>WH: AdmissionReview (Pod create)
WH->>WH: read annotation + CR recipe
WH-->>API: patched Pod spec (init container + JAVA_TOOL_OPTIONS + OTEL_*)
API->>Init: schedule init container
Init->>App: copy javaagent.jar to shared volume
App->>App: JVM starts with -javaagent
App->>Col: OTLP spans, metrics, logs
Cardinality — The Silent Killer
Every unique combination of attribute values produces a distinct time series. High cardinality drives up cost and erodes retention, query speed, and cluster stability. Before attaching an attribute, ask "How many distinct values can this take?"
graph LR
subgraph BASE[Safe baseline]
B1[method ~10]
B2[status ~60]
B3[route ~200]
B1 --> BX[10 x 60 x 200 = 120K series]
B2 --> BX
B3 --> BX
end
subgraph BAD[Add user.id]
U[user.id ~1,000,000]
BX --> E[120K x 1M = 120 billion series]
U --> E
end
E -->|TSDB OOM, cost spike| X[Cardinality bomb]
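The arithmetic in the diagram is a plain product of per-label cardinalities, which is an upper bound since not every combination necessarily occurs. A short sketch, using the diagram's approximate counts and a hypothetical series_count helper:

```python
import math


def series_count(cardinalities):
    """Upper bound on time series: product of distinct values per label."""
    return math.prod(cardinalities.values())


baseline = {
    "http.request.method": 10,
    "http.response.status_code": 60,
    "http.route": 200,
}
with_user_id = {**baseline, "user.id": 1_000_000}

print(f"{series_count(baseline):,}")      # 120,000
print(f"{series_count(with_user_id):,}")  # 120,000,000,000
```

Running this kind of estimate before adding a label is a cheap way to catch a cardinality bomb in code review rather than in the TSDB.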
| Candidate attribute | Cardinality | Use as metric label? |
| --- | --- | --- |
| http.request.method | ~10 | Yes |
| http.response.status_code | ~60 | Yes |
| http.route (templated) | ~hundreds | Yes |
| service.version | ~tens | Yes |
| tenant.id (large SaaS) | ~thousands+ | Carefully; often spans only |
| url.full with raw path | ~unbounded | No on metrics; redact on spans |
| user.id | ~unbounded | Span attribute only |
| trace.id / request.id | per-request | Span only; never a metric label |
| db.statement raw | per-call | Span only; parameterize/redact |
Three mitigations:
Templates, not raw values. Use http.route=/orders/{id} on metrics; keep url.full for span-only debugging.
Drop or hash at the Collector. The attributes, transform, and redaction processors can drop, truncate, or one-way-hash values before export.
Separate metric and span schemas. A span can carry tenant.id and order.id; the derived metric should only carry tenant.tier and payment.method. Spans are sampled; metrics aggregate forever.
These same mitigations double as PII hygiene: url.full, url.query, client.address, and db.statement may contain personal data — redact at the SDK, allow-list at the Collector, hash where you need cardinality without identity.
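Two of these mitigations, templating and one-way hashing, are easy to sketch. Both helpers are hypothetical illustrations: web frameworks normally supply http.route directly, so template_path is only a fallback heuristic for numeric and UUID-like segments, and the salt handling in hash_value is deliberately simplistic.

```python
import hashlib
import re

# Matches purely numeric segments or 36-character UUID-like segments.
_ID_SEGMENT = re.compile(r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F-]{27})$")


def template_path(path):
    """Replace ID-like path segments with {id} to bound cardinality."""
    return "/".join(
        "{id}" if _ID_SEGMENT.match(segment) else segment
        for segment in path.split("/")
    )


def hash_value(value, salt):
    """One-way hash: keeps values distinguishable without exposing identity."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


print(template_path("/orders/12345/items"))  # /orders/{id}/items
print(template_path("/users/3f2b8c1e-9a4d-4c7e-8b1a-2d5e6f7a8b9c"))  # /users/{id}
print(len(hash_value("alice@example.com", salt="per-env-secret")))  # 16
```

Hashing preserves the number of distinct values, so it addresses PII exposure but not cardinality; templating addresses both for paths.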
Stability and Evolution
Conventions move through stability levels: Experimental → Stable → Deprecated. The migration from http.method to http.request.method is a real example: both names coexist for a period, the new one is preferred, and Collector transform processors can normalize older signals so dashboards survive. Anchor dashboards on Stable attributes; treat Experimental as opt-in.
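The normalization step during such a migration amounts to renaming keys while preferring the new name when both are present. A minimal sketch of that logic in plain Python follows; in a real pipeline this would be expressed as Collector processor configuration, and the RENAMES table and normalize_attributes function are hypothetical.

```python
# Old key -> new key for the HTTP convention migration.
RENAMES = {
    "http.method": "http.request.method",
    "http.status_code": "http.response.status_code",
}


def normalize_attributes(attributes):
    """Rename legacy keys; a value already set under the new name wins."""
    normalized = {}
    for key, value in attributes.items():
        new_key = RENAMES.get(key, key)
        if key != new_key and new_key in normalized:
            continue  # new-style value already present: keep it
        normalized[new_key] = value
    return normalized


legacy_span = {"http.method": "GET", "http.status_code": 200, "http.route": "/orders/{id}"}
print(normalize_attributes(legacy_span))
# {'http.request.method': 'GET', 'http.response.status_code': 200, 'http.route': '/orders/{id}'}
```

Dashboards that query only the new names then keep working while older agents are still being upgraded.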
Key Points — Section 4
Semantic conventions are what make OpenTelemetry portable — the same attribute keys appear from Java agents, Python monkey-patches, eBPF probes, and manual code.
Resource attributes describe the emitter (service.name, service.version, k8s.pod.name); set once via OTEL_RESOURCE_ATTRIBUTES.
Cardinality is multiplicative: user.id on a metric label can produce billions of series.
Use templated routes, the Collector's processors, and a separate span/metric schema to keep cardinality and PII in check.
Anchor on Stable attributes; the Collector's transform processor can normalize older signals during convention migrations.
Part 2 Post-Quiz — Sections 3 & 4
Same questions, post-reading.
Post-Reading Check — Part 2
6. What does the OpenTelemetry Operator's mutating webhook do when a pod is annotated with instrumentation.opentelemetry.io/inject-java?
7. eBPF zero-code instrumentation is great at HTTP/gRPC coverage but loses in which scenario?
8. Why are semantic conventions important?
9. Which attribute is safe to use as a metric label dimension on an HTTP latency histogram?
10. Where do attributes like service.name, service.version, and k8s.pod.name belong?