Chapter 6 — Instrumentation: Manual, Automatic, and Zero-Code

Learning Objectives

Instrumentation is the act of teaching your code to talk about itself. OpenTelemetry recognizes three strategies for producing signals: manual (developers explicitly emit spans, metrics, and logs), automatic (libraries are patched at runtime), and zero-code (an external observer — eBPF in the kernel or a Kubernetes Operator injecting agents — generates telemetry with no application awareness). Think of them as a memoir, a transcriptionist, and a hidden ceiling microphone: each captures the story differently; each has a place.

graph TD
    I[Application Telemetry]
    I --> M["Manual<br/>developer writes<br/>tracer.start_span"]
    I --> A["Automatic<br/>runtime agent<br/>wraps libraries"]
    I --> Z["Zero-Code<br/>eBPF kernel probes<br/>or Operator injection"]
    M -->|effort: high| Q1["business attributes<br/>tenant.id, order.id"]
    A -->|effort: low| Q2["broad HTTP/DB/RPC<br/>library coverage"]
    Z -->|effort: none| Q3["polyglot, no rebuild<br/>kernel-wide visibility"]

Part 1 Pre-Quiz — Sections 1 & 2

Answer first, then read. You'll re-answer the same questions after the reading to measure improvement.

Pre-Reading Check — Part 1

1. Which instrumentation approach is the only one that can reliably capture a tenant.id or payment.outcome attribute?

2. You need to count active WebSocket connections (which go up and down). Which OpenTelemetry instrument fits?

3. How does the Java OpenTelemetry agent inject instrumentation into your app?

4. A Python service is auto-instrumented but produces no spans. The most common root cause is…

5. Which environment variable convention is shared across Java, Python, and Node.js auto-instrumentation?

Section 1: Manual Instrumentation

Manual instrumentation puts the developer in direct control. You acquire a Tracer, Meter, or Logger from the SDK and explicitly start spans, record measurements, or write structured log events. It is the only way to express domain context: attributes such as order.id, tenant.id, payment.status, and feature_flag.variant that no auto-instrumentation can infer.

Acquiring Tracers and Meters

The SDK exposes TracerProvider, MeterProvider, and LoggerProvider. From each you obtain a named, versioned instance scoped to your module, conventionally named after the package import path. The name is recorded as the instrumentation scope name on every signal (many non-OTLP exporters surface it as otel.scope.name), letting backends filter by which code produced the data.

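// Java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.api.trace.Tracer;

Tracer tracer = GlobalOpenTelemetry.getTracer("com.acme.payments", "1.4.0");
Meter  meter  = GlobalOpenTelemetry.getMeter("com.acme.payments");

# Python
from opentelemetry import metrics, trace

tracer = trace.get_tracer("acme.payments", "1.4.0")
meter  = metrics.get_meter("acme.payments")

// Node.js
const { metrics, trace } = require('@opentelemetry/api');

const tracer = trace.getTracer('acme-payments', '1.4.0');
const meter  = metrics.getMeter('acme-payments');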

Creating Spans and Recording Attributes

A span is a named, timed operation with attributes, events, and a status. The idiomatic pattern wraps a unit of work so the span closes even on exceptions:

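# Python: idiomatic context-manager span
from opentelemetry import trace

# tenant_id, amount, and gateway come from the surrounding request handler.
# The exception is recorded explicitly below, so the context manager's own
# exception recording is switched off to avoid a duplicate exception event.
with tracer.start_as_current_span(
    "authorize_payment",
    record_exception=False,
    set_status_on_exception=False,
) as span:
    span.set_attribute("payment.method", "card")
    span.set_attribute("tenant.id", tenant_id)
    try:
        approved = gateway.authorize(amount)
        span.set_attribute("payment.outcome",
                           "approved" if approved else "declined")
    except Exception as exc:
        span.record_exception(exc)
        span.set_status(trace.StatusCode.ERROR, str(exc))
        raise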
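
To see why the context-manager form guarantees that a span closes, here is a minimal stdlib-only sketch. It is a toy model, not the OpenTelemetry SDK: toy_span, FINISHED, and authorize are hypothetical names invented for illustration.

```python
import time
from contextlib import contextmanager

FINISHED = []  # stand-in for an exporter: finished span records land here


@contextmanager
def toy_span(name):
    """Toy span: always ends, and marks itself ERROR if the body raises."""
    span = {"name": name, "attributes": {}, "status": "UNSET",
            "start": time.monotonic()}
    try:
        yield span
    except Exception as exc:
        span["status"] = "ERROR"
        span["attributes"]["exception.type"] = type(exc).__name__
        raise  # never swallow the application's exception
    finally:
        span["end"] = time.monotonic()  # runs on success and on failure
        FINISHED.append(span)


def authorize(amount):
    with toy_span("authorize_payment") as span:
        span["attributes"]["payment.method"] = "card"
        if amount <= 0:
            raise ValueError("amount must be positive")
        span["attributes"]["payment.outcome"] = "approved"
        return True
```

The finally clause is what matters: whether the body returns or raises, the span gets an end timestamp and is handed off.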

Use dot-namespaced keys (payment.method, not paymentMethod); follow semantic conventions where one exists; treat your custom namespace (acme.*) like a public API — once a dashboard depends on it, you cannot freely rename it.

Picking the Right Metric Instrument

| Instrument | Direction | Aggregation | Typical Use |
|---|---|---|---|
| Counter | Monotonic up | Sum | Total requests, errors, bytes sent |
| UpDownCounter | Up or down | Sum | Active connections, queue depth, pool size |
| Histogram | Observations | Bucketed distribution | Request latency, payload size |
| Gauge (observable) | Sampled | Last value | CPU utilization, current temperature |

Three rules: declare units (s, By, 1); let the SDK pick histogram buckets unless you really know the latency profile; remember that every attribute multiplies cardinality — an idea we revisit in Section 4.
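
The differences between the instruments can be sketched with a toy model. This is not the OpenTelemetry API: the Toy* classes are hypothetical, and a real SDK may handle an invalid negative increment differently (for example by dropping it with a warning rather than raising).

```python
from bisect import bisect_left


class ToyCounter:
    """Monotonic sum: rejects negative increments."""
    def __init__(self):
        self.value = 0

    def add(self, amount):
        if amount < 0:
            raise ValueError("Counter is monotonic; use an UpDownCounter")
        self.value += amount


class ToyUpDownCounter:
    """Sum that may rise or fall, e.g. active connections."""
    def __init__(self):
        self.value = 0

    def add(self, amount):
        self.value += amount


class ToyHistogram:
    """Counts observations per bucket; bounds are inclusive upper limits."""
    def __init__(self, bounds):
        self.bounds = list(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot = overflow

    def record(self, value):
        self.counts[bisect_left(self.bounds, value)] += 1


active_ws = ToyUpDownCounter()
active_ws.add(1)   # client connects
active_ws.add(1)   # another client connects
active_ws.add(-1)  # one disconnects

latency_s = ToyHistogram(bounds=[0.1, 0.5, 1.0])  # unit: seconds
for seconds in (0.05, 0.1, 0.3, 2.0):
    latency_s.record(seconds)
```

After these calls active_ws.value is 1 and latency_s.counts is [2, 1, 0, 1]: two observations at or below 0.1 s, one in (0.1, 0.5], none in (0.5, 1.0], and one overflow.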

Key Points — Section 1

Section 2: Automatic Instrumentation

Auto-instrumentation answers "How do I get traces from libraries I did not write?" The mechanism differs by runtime because each language exposes different hooks.

Bytecode Injection: Java and .NET

The Java agent attaches via the -javaagent flag. At startup the JVM calls the agent's premain method with an Instrumentation handle, and the agent uses ByteBuddy to register a class-file transformer that rewrites classes as they are loaded:

OTEL_SERVICE_NAME=checkout-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
OTEL_EXPORTER_OTLP_PROTOCOL=grpc \
OTEL_TRACES_EXPORTER=otlp \
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=prod,service.version=2.3.1 \
java -javaagent:/opt/otel/opentelemetry-javaagent.jar -jar /app/checkout.jar

The agent ships with modules for Servlet, Spring MVC/WebFlux, JAX-RS, gRPC, OkHttp, JDBC, R2DBC, Hibernate, Mongo, Cassandra, Kafka, RabbitMQ, JMS, and more. Because rewriting happens at class load, no source change is needed — but the agent must attach at JVM start and exotic classloaders sometimes need extra config. .NET uses a conceptually similar mechanism via a CLR profiler activated by CORECLR_ENABLE_PROFILING=1.

Animation A — Bytecode Injection Moment
Class: com.acme.CheckoutServlet.service(HttpServletRequest, HttpServletResponse)
Original:  CAFEBABE 2AB40007 B60012B1   // method body
           ▼ -javaagent premain → ByteBuddy transform()
Rewritten: CAFEBABE B8 00 1F   // Span.start("http.server.request")
           2AB40007 B60012     // method body
           B8 00 2A            // Span.end()
           B1
First OTLP span emitted: http.server.request | method=GET, route=/checkout, status=200
The original class bytes load; the agent's ClassFileTransformer weaves Span.start / Span.end bytecode around the method body. The very first request emits a span, with zero source changes.

Monkey-Patching: Python and Node.js

Dynamic languages let you replace functions at runtime. Python's opentelemetry-instrument CLI bootstraps every installed opentelemetry-instrumentation-* package, which monkey-patches its target library at import:

pip install opentelemetry-distro opentelemetry-exporter-otlp \
            opentelemetry-instrumentation-requests \
            opentelemetry-instrumentation-psycopg2 \
            opentelemetry-instrumentation-flask

OTEL_SERVICE_NAME=orders-api \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
opentelemetry-instrument gunicorn orders.wsgi:application

The requests instrumentation replaces Session.request with a wrapper that opens a client span, records HTTP attributes, calls the original, captures the response, and ends the span. It must run before the first import of the patched library; otherwise the cached reference is the unpatched one. Node.js relies on a require hook via require-in-the-middle:

// tracing.js  — must be required first
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({}),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

// run: NODE_OPTIONS="--require ./tracing.js" node app.js
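
The wrapper pattern described above for the requests instrumentation, and the import-order pitfall, can be sketched in plain Python. This is a toy illustration under stated assumptions: Session here is a hypothetical stand-in class, not the real requests.Session, and SPANS replaces a real exporter.

```python
import time

SPANS = []  # stand-in for an exporter


class Session:
    """Hypothetical HTTP client standing in for a real library class."""
    def request(self, method, url):
        return 200  # pretend every call succeeds


early_ref = Session.request  # reference cached BEFORE patching


def instrument():
    original = Session.request

    def traced_request(self, method, url):
        start = time.monotonic()
        status = original(self, method, url)  # delegate to the real method
        SPANS.append({
            "name": method,
            "http.request.method": method,
            "url.full": url,
            "http.response.status_code": status,
            "duration_s": time.monotonic() - start,
        })
        return status

    Session.request = traced_request  # the monkey-patch


instrument()
session = Session()
session.request("GET", "https://example.test/orders")     # patched: one span
early_ref(session, "GET", "https://example.test/orders")  # bypasses the patch
```

Only the first call produces a span. The second goes through a reference captured before patching, which is the same failure mode as instrumenting after application code has already imported and cached the library function.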

Java Agent Lifecycle (Figure 6.2)

sequenceDiagram
    participant OS as OS / shell
    participant JVM as JVM
    participant Agent as otel-javaagent.jar
    participant CL as ClassLoader
    participant App as Application code
    participant Col as OTLP Collector
    OS->>JVM: java -javaagent:otel.jar -jar app.jar
    JVM->>Agent: invoke premain(Instrumentation)
    Agent->>JVM: register ClassFileTransformer (ByteBuddy)
    JVM->>App: start main()
    App->>CL: load HttpServlet, JdbcDriver, ...
    CL->>Agent: transform(class bytes)
    Agent-->>CL: rewritten bytes with span hooks
    App->>App: first request enters Servlet.service()
    App->>Col: OTLP span exported (http.server.request)

Cross-Language Comparison

| Aspect | Java | Python | Node.js |
|---|---|---|---|
| Primary mechanism | -javaagent bytecode rewrite | Monkey-patching at import | require hook + export patching |
| Entry point | JVM flag | opentelemetry-instrument CLI | NODE_OPTIONS=--require |
| Code changes | None | None | One bootstrap file |
| Context propagation | Thread-locals + executor wrappers | contextvars + async wrappers | Per-library async hooks |
| Common pitfall | Custom classloaders | Import order before patch | Bundlers/serverless hide require |

A shared environment-variable contract spans every language: OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_TRACES_EXPORTER, OTEL_METRICS_EXPORTER, OTEL_LOGS_EXPORTER, OTEL_EXPORTER_OTLP_PROTOCOL, OTEL_TRACES_SAMPLER, OTEL_PROPAGATORS, OTEL_RESOURCE_ATTRIBUTES. One ConfigMap, one vocabulary, every workload.

Debugging axioms: No traces at all usually means an exporter is set to none, the protocol is mismatched (grpc vs. http/protobuf), or the bootstrap loaded too late. Duplicate spans almost always mean a library is being captured by both auto- and manual instrumentation — disable one.
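
The first axiom can be turned into a small pre-flight check. The sketch below is a hypothetical helper (diagnose is not part of any OpenTelemetry package); it assumes the conventional OTLP ports, 4317 for gRPC and 4318 for HTTP, and only flags likely problems.

```python
from urllib.parse import urlparse

# Conventional OTLP ports: 4317 for gRPC, 4318 for HTTP.
CONVENTIONAL_PORT = {"grpc": 4317, "http/protobuf": 4318}


def diagnose(env):
    """Return human-readable findings for a dict of OTEL_* settings."""
    findings = []
    if not env.get("OTEL_SERVICE_NAME"):
        findings.append("OTEL_SERVICE_NAME is unset")
    if env.get("OTEL_TRACES_EXPORTER", "otlp") == "none":
        findings.append("OTEL_TRACES_EXPORTER=none disables trace export")
    protocol = env.get("OTEL_EXPORTER_OTLP_PROTOCOL")
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    if protocol in CONVENTIONAL_PORT and endpoint:
        port = urlparse(endpoint).port
        expected = CONVENTIONAL_PORT[protocol]
        if port is not None and port != expected:
            findings.append(
                f"protocol {protocol} usually uses port {expected}, "
                f"but the endpoint uses {port}")
    return findings
```

Calling diagnose(dict(os.environ)) at startup would be one way to use it. A port mismatch is a hint rather than proof, since a Collector can be configured to listen on any port.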

Key Points — Section 2

Part 1 Post-Quiz — Sections 1 & 2

Same questions, now with context. Don't peek — the explanations reveal at the end.


Part 2 Pre-Quiz — Sections 3 & 4

Same routine — answer first, then read.

Pre-Reading Check — Part 2

6. What does the OpenTelemetry Operator's mutating webhook do when a pod is annotated with instrumentation.opentelemetry.io/inject-java?

7. eBPF zero-code instrumentation is great at HTTP/gRPC coverage but loses in which scenario?

8. Why are semantic conventions important?

9. Which attribute is safe to use as a metric label dimension on an HTTP latency histogram?

10. Where do attributes like service.name, service.version, and k8s.pod.name belong?

Section 3: Zero-Code Instrumentation

"Zero-code" goes further than auto-instrumentation: the developer doesn't write tracing code and the build artifact isn't modified. Two distinct technologies live under this umbrella: eBPF agents that observe processes from the Linux kernel, and the OpenTelemetry Operator that injects auto-instrumentation agents into Kubernetes pods at admission time without changing container images.

eBPF-Based Auto-Instrumentation

eBPF lets you load safe, sandboxed bytecode into the Linux kernel and attach it to kernel events — syscalls, function entry/exit, tracepoints, network events — at runtime, without recompiling the kernel. An eBPF observability agent typically:

  1. Attaches kprobes to kernel network functions (tcp_sendmsg, tcp_cleanup_rbuf) and syscall tracepoints (sys_enter_sendto, sys_enter_recvfrom) to observe every byte crossing TCP.
  2. Attaches uprobes to user-space functions (SSL_read/SSL_write in libssl, Go's HTTP runtime, JVM JNI entries) to see plaintext before encryption and after decryption.
  3. Writes structured records into eBPF maps drained at high frequency by a user-space agent.
  4. Reconstructs requests — pairing sends/recvs, parsing HTTP headers, gRPC HTTP/2 framing — to produce L7 metrics and OTLP spans.

flowchart TD
    subgraph US[User space]
        A1["App A<br/>Go binary"]
        A2["App B<br/>Java JVM"]
        A3["App C<br/>Python"]
        SSL[libssl.so]
        DS["Beyla / Pixie DaemonSet<br/>user-space agent"]
    end
    subgraph K[Linux kernel]
        KP1[kprobe: tcp_sendmsg]
        KP2[kprobe: tcp_cleanup_rbuf]
        KP3[tracepoint: sys_enter_sendto]
        UP[uprobe: SSL_read / SSL_write]
        MAP[("eBPF map<br/>ring buffer")]
    end
    A1 -->|syscalls| KP1
    A2 -->|syscalls| KP3
    A3 -->|TLS calls| SSL
    SSL --> UP
    KP1 --> MAP
    KP2 --> MAP
    KP3 --> MAP
    UP --> MAP
    MAP --> DS
    DS -->|OTLP spans + RED metrics| COL[OpenTelemetry Collector]
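
Step 4 of the list above, request reconstruction, is ordinary user-space code. The sketch below is a simplified illustration, not how any particular agent is implemented: it assumes one in-flight HTTP/1.1 request per connection, and the event tuple layout and the name pair_http_exchanges are hypothetical.

```python
def pair_http_exchanges(events):
    """Pair request and response payloads seen on the same connection.

    `events` is an iterable of (timestamp_s, conn_id, direction, payload)
    tuples, where direction is "send" or "recv" from the client's side.
    """
    pending = {}   # conn_id -> (start_ts, method, path)
    records = []
    for ts, conn, direction, payload in sorted(events, key=lambda e: e[0]):
        first_line = payload.split(b"\r\n", 1)[0].decode("ascii", "replace")
        if direction == "send":
            method, path, _version = first_line.split(" ", 2)
            pending[conn] = (ts, method, path)
        elif direction == "recv" and conn in pending:
            start, method, path = pending.pop(conn)
            status = int(first_line.split(" ", 2)[1])
            records.append({
                "http.request.method": method,
                "url.path": path,
                "http.response.status_code": status,
                "duration_s": round(ts - start, 6),
            })
    return records


events = [
    (10.000, "c1", "send", b"GET /orders/42 HTTP/1.1\r\nHost: api\r\n\r\n"),
    (10.002, "c2", "send", b"POST /checkout HTTP/1.1\r\nHost: api\r\n\r\n"),
    (10.050, "c1", "recv", b"HTTP/1.1 200 OK\r\n\r\n"),
    (10.302, "c2", "recv", b"HTTP/1.1 503 Service Unavailable\r\n\r\n"),
]
records = pair_http_exchanges(events)
```

Each record carries enough to derive rate, error, and duration metrics. Note that it contains only what was visible on the wire, which is why business attributes never appear in eBPF-derived telemetry.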

Because the hooks live in the kernel and in shared libraries, eBPF works for every language on the host (Go, Rust, Java, Python, Node, C++, even closed-source binaries) without touching application code. The output is typically RED metrics (rate, errors, duration) plus distributed traces for common protocols.

| Tool | Focus | Output |
|---|---|---|
| Grafana Beyla | Zero-code OTel auto-instrumentation for HTTP/gRPC/DB | OTLP traces + RED metrics |
| Pixie | K8s deep debugging, full request bodies, PxL scripts | In-cluster live data |
| Cilium Tetragon | Runtime security and policy enforcement | Process/file/network events; can block |
| Odigos | eBPF + SDK hybrid OTel platform | OTLP routed by policy |

OpenTelemetry Operator and the Instrumentation CRD

For Kubernetes workloads, the Operator offers a different flavor of zero-code: the cluster itself injects the SDK auto-instrumentation agents we saw in Section 2, without changing your container images.

# 1. The Instrumentation CRD: a reusable recipe per language
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
  namespace: production
spec:
  exporter:
    endpoint: http://otel-collector.observability:4317
  propagators: [tracecontext, baggage]
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"
  resource:
    attributes:
      deployment.environment: prod
      service.namespace: payments
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest

# 2. The Deployment opts in via annotations on its pod template
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-java: "production/default-instrumentation"

When the annotated pod is created, the mutating webhook injects an init container that copies the agent JAR into a shared volume, then patches the application container with JAVA_TOOL_OPTIONS=-javaagent:/otel-auto-instrumentation/javaagent.jar and the appropriate OTEL_* environment variables. Python and Node.js use analogous mechanisms. Write the Instrumentation resource once, annotate each workload's pod template by language, and every new pod is born observable.
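
The mutation described above can be modeled as a pure function over a pod manifest. This is a rough sketch, not the Operator's actual code: inject_java, the volume name otel-agent, the init container name, and the cp command are illustrative assumptions, and a real webhook also has to merge with any existing JAVA_TOOL_OPTIONS value.

```python
import copy

ANNOTATION = "instrumentation.opentelemetry.io/inject-java"
AGENT_DIR = "/otel-auto-instrumentation"


def inject_java(pod, agent_image, otel_env):
    """Return a patched copy of `pod`, or the pod unchanged if not annotated."""
    annotations = pod.get("metadata", {}).get("annotations", {})
    if ANNOTATION not in annotations:
        return pod
    patched = copy.deepcopy(pod)
    spec = patched["spec"]
    # Shared scratch volume that carries the agent JAR between containers.
    spec.setdefault("volumes", []).append(
        {"name": "otel-agent", "emptyDir": {}})
    mount = {"name": "otel-agent", "mountPath": AGENT_DIR}
    # Init container copies the JAR out of the agent image.
    spec.setdefault("initContainers", []).append({
        "name": "otel-agent-init",
        "image": agent_image,
        "command": ["cp", "/javaagent.jar", f"{AGENT_DIR}/javaagent.jar"],
        "volumeMounts": [mount],
    })
    for container in spec["containers"]:
        container.setdefault("volumeMounts", []).append(mount)
        env = container.setdefault("env", [])
        env.append({"name": "JAVA_TOOL_OPTIONS",
                    "value": f"-javaagent:{AGENT_DIR}/javaagent.jar"})
        env.extend({"name": k, "value": v} for k, v in otel_env.items())
    return patched
```

The application image is never touched: everything the JVM needs arrives through the shared volume and environment variables, which is what makes this approach zero-code from the developer's point of view.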

Animation B — Operator Pod Injection Workflow
  1. The pod create request hits the API server (kube-apiserver).
  2. The Operator's mutating webhook reads the Instrumentation CRD plus the pod annotation.
  3. It returns a patched pod: an init container copies the agent into a shared volume, and the app container is patched with JAVA_TOOL_OPTIONS and OTEL_* env vars.
  4. The JVM starts with the agent, and OTLP spans and metrics flow to the Collector on port 4317.
No container image change.

Limits of Zero-Code — Hybrid Is Production-Grade

| Capability | Manual | Auto (SDK) | eBPF |
|---|---|---|---|
| HTTP/gRPC/DB calls | If coded | Yes, broad | Yes, broad |
| Business attrs (order.id) | Yes | No | No |
| Closed-source binaries | No | No | Yes |
| TLS-encrypted in-process | Yes | Yes | Only via libssl uprobes |
| Windows/macOS | Yes | Yes | Linux only |
| Rollout effort | High | Low | Very low (DaemonSet) |
| Privilege required | App identity | App identity | CAP_SYS_ADMIN/CAP_BPF |

Hybrid wins. Run eBPF for horizontal, polyglot baseline coverage; use the Operator to inject SDK auto-instrumentation on every K8s pod; add manual instrumentation on the critical business flows where you need tenant.id, feature_flag.variant, and payment.outcome.

Key Points — Section 3

Section 4: Semantic Conventions in Practice

Instrumentation nobody can query is expensive noise. Semantic conventions are OpenTelemetry's contract that names things the same way everywhere, so a dashboard against http.response.status_code works whether the data came from a Java agent, a Python monkey-patch, a Beyla eBPF probe, or your own manual code.

Common Stable Attributes

| Domain | Attribute | Example Value | Use |
|---|---|---|---|
| HTTP | http.request.method | GET, POST | Method dimension on RED metrics |
| HTTP | http.response.status_code | 200, 503 | Error rate, SLO burn |
| HTTP | http.route | /orders/{id} | Path grouping without ID explosion |
| HTTP | server.address | api.acme.io | Backend grouping |
| RPC | rpc.system | grpc | Filter by RPC family |
| DB | db.system | postgresql | Engine breakdown |
| DB | db.operation | SELECT | Latency by operation |
| Messaging | messaging.system | kafka | Broker breakdown |
| Messaging | messaging.destination.name | orders.events | Per-topic throughput |

A query like "p95 HTTP latency by route and status, last 30 minutes" is just groupby(http.route, http.response.status_code) of histogram(http.server.request.duration) — same panel, across the fleet, across vendors.
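
The shape of that query can be sketched over raw samples. This is a simplified stand-in: real backends usually estimate percentiles from histogram buckets rather than individual durations, and the function name and sample data are made up for illustration.

```python
from collections import defaultdict


def p95_by_route_and_status(samples):
    """Nearest-rank p95 of duration, grouped by (http.route, status code)."""
    groups = defaultdict(list)
    for s in samples:
        key = (s["http.route"], s["http.response.status_code"])
        groups[key].append(s["duration_s"])
    result = {}
    for key, durations in groups.items():
        durations.sort()
        rank = (95 * len(durations) + 99) // 100  # ceil(0.95 * n), 1-based
        result[key] = durations[rank - 1]
    return result


samples = [
    {"http.route": "/orders/{id}", "http.response.status_code": 200,
     "duration_s": d / 100}
    for d in range(1, 21)            # 0.01 s .. 0.20 s
] + [
    {"http.route": "/checkout", "http.response.status_code": 503,
     "duration_s": 1.5},
]
p95 = p95_by_route_and_status(samples)
```

Because the grouping keys are semantic-convention names, the same function works on samples from any service that follows the conventions, regardless of how it was instrumented.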

Animation C — One Request, Two Services, Same Attributes
Service A (Java auto-agent, otel-javaagent.jar): POST /checkout → HTTP/1.1 200
    http.request.method = "POST"
    http.response.status_code = 200
    server.address = "api.acme.io"
    http.route = "/checkout"
Service B (Python manual, opentelemetry-api): POST /checkout → HTTP/1.1 200
    http.request.method = "POST"
    http.response.status_code = 200
    server.address = "api.acme.io"
    http.route = "/checkout"
groupby(http.route, http.response.status_code) → one dashboard, both services, any vendor.
The same HTTP call hits two services that use different instrumentation paths. Both attach the same attribute keys (http.request.method, http.response.status_code, server.address, http.route) because they follow semantic conventions, so one query and one dashboard work against both. That's vendor portability.

Resource Attributes — Identity of the Emitter

Where attributes describe a single signal, resource attributes describe the emitter. Set them once at SDK init via OTEL_RESOURCE_ATTRIBUTES:

OTEL_SERVICE_NAME=checkout-api
OTEL_RESOURCE_ATTRIBUTES=\
service.namespace=payments,\
service.version=2.3.1,\
service.instance.id=checkout-api-7d4f-x9w2,\
deployment.environment=prod,\
k8s.namespace.name=production,\
k8s.deployment.name=checkout-api,\
k8s.pod.name=checkout-api-7d4f-x9w2,\
cloud.provider=aws,\
cloud.region=us-east-1

The OpenTelemetry Operator can fill many of these from the pod's downward API — you should rarely set K8s resource attributes by hand.
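
How an SDK might turn these variables into a resource can be sketched as follows. This is an approximation of the documented behavior (comma-separated key=value pairs with percent-encoded values, and OTEL_SERVICE_NAME taking precedence over a service.name entry), not any SDK's actual parser; resource_from_env is a hypothetical name.

```python
from urllib.parse import unquote


def resource_from_env(env):
    """Build a resource-attribute dict from OTEL_* environment settings."""
    attrs = {}
    raw = env.get("OTEL_RESOURCE_ATTRIBUTES", "")
    for pair in raw.split(","):
        if "=" not in pair:
            continue  # skip empty or malformed entries
        key, value = pair.split("=", 1)
        attrs[key.strip()] = unquote(value.strip())
    # OTEL_SERVICE_NAME takes precedence over service.name in the list.
    if env.get("OTEL_SERVICE_NAME"):
        attrs["service.name"] = env["OTEL_SERVICE_NAME"]
    return attrs
```

Passing dict(os.environ) yields the identity that every span, metric, and log from the process will share.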

Operator Workflow (Figure 6.4)

sequenceDiagram
    participant Dev as Developer
    participant API as kube-apiserver
    participant Op as OTel Operator
    participant WH as Mutating Webhook
    participant Init as Init container
    participant App as App container
    participant Col as Collector
    Dev->>API: kubectl apply Instrumentation CR
    Op->>API: watch + cache Instrumentation CR
    Dev->>API: apply Deployment w/ annotation inject-java
    API->>WH: AdmissionReview (Pod create)
    WH->>WH: read annotation + CR recipe
    WH-->>API: patched Pod spec (init container + JAVA_TOOL_OPTIONS + OTEL_*)
    API->>Init: schedule init container
    Init->>App: copy javaagent.jar to shared volume
    App->>App: JVM starts with -javaagent
    App->>Col: OTLP spans, metrics, logs

Cardinality — The Silent Killer

Every unique combination of attribute values produces a distinct time series. High cardinality drives up cost and degrades retention, query speed, and cluster stability. Before attaching an attribute, ask "How many distinct values can this take?"

graph LR
    subgraph BASE[Safe baseline]
        B1[method ~10]
        B2[status ~60]
        B3[route ~200]
        B1 --> BX["10 x 60 x 200<br/>= 120K series"]
        B2 --> BX
        B3 --> BX
    end
    subgraph BAD[Add user.id]
        U["user.id<br/>~1,000,000"]
        BX --> E["120K x 1M<br/>= 120 billion series"]
        U --> E
    end
    E -->|TSDB OOM, cost spike| X[Cardinality bomb]

| Candidate attribute | Cardinality | Use as metric label? |
|---|---|---|
| http.request.method | ~10 | Yes |
| http.response.status_code | ~60 | Yes |
| http.route (templated) | ~hundreds | Yes |
| service.version | ~tens | Yes |
| tenant.id (large SaaS) | ~thousands+ | Carefully; often spans only |
| url.full with raw path | ~unbounded | No on metrics; redact on spans |
| user.id | ~unbounded | Span attribute only |
| trace.id / request.id | per-request | Span only; never a metric label |
| db.statement raw | per-call | Span only; parameterize/redact |
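
The arithmetic behind the cardinality bomb is a plain product of per-attribute cardinalities. A quick sketch, using the approximate counts from the diagram (series_upper_bound is a hypothetical helper, and the result is a worst-case bound since not every combination occurs in practice):

```python
from math import prod


def series_upper_bound(distinct_values):
    """Worst-case time series count: product of per-attribute cardinalities."""
    return prod(distinct_values.values())


baseline = {"http.request.method": 10,
            "http.response.status_code": 60,
            "http.route": 200}
with_user = {**baseline, "user.id": 1_000_000}

print(f"{series_upper_bound(baseline):,}")    # 120,000
print(f"{series_upper_bound(with_user):,}")   # 120,000,000,000
```

Running the same calculation before adding any new metric label makes the cost of that label explicit.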

Three mitigations:

  1. Templates, not raw values. Use http.route=/orders/{id} on metrics; keep url.full for span-only debugging.
  2. Drop or hash at the Collector. attributes, transform, and redaction processors can drop, truncate, or one-way-hash before export.
  3. Separate metric and span schemas. A span can carry tenant.id and order.id; the derived metric should only carry tenant.tier and payment.method. Spans are sampled; metrics aggregate forever.

These same mitigations double as PII hygiene: url.full, url.query, client.address, and db.statement may contain personal data — redact at the SDK, allow-list at the Collector, hash where you need cardinality without identity.
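
Mitigations 2 and 3 can be sketched as two small filters. In practice this logic usually lives in Collector processors rather than application code; the key sets, function names, and salt handling below are illustrative assumptions only.

```python
import hashlib

METRIC_ALLOW_LIST = {"http.request.method", "http.route",
                     "http.response.status_code", "tenant.tier"}
HASHED_ON_SPANS = {"user.id", "client.address"}


def metric_attributes(attrs):
    """Keep only allow-listed, low-cardinality keys for metric export."""
    return {k: v for k, v in attrs.items() if k in METRIC_ALLOW_LIST}


def span_attributes(attrs, salt):
    """Replace identifying values with a salted one-way hash."""
    cleaned = {}
    for key, value in attrs.items():
        if key in HASHED_ON_SPANS:
            digest = hashlib.sha256((salt + str(value)).encode("utf-8"))
            cleaned[key] = digest.hexdigest()[:16]  # truncated for brevity
        else:
            cleaned[key] = value
    return cleaned
```

Hashing keeps values distinguishable, so you can still count distinct users on spans, while the allow-list guarantees that an unexpected attribute never becomes a metric dimension.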

Stability and Evolution

Conventions move through Experimental → Stable → Deprecated. The migration from http.method to http.request.method is a real example: both names coexist for a period, the new name is preferred, and Collector transform processors can normalize older signals so dashboards survive. Anchor dashboards on Stable attributes; treat Experimental as opt-in.
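
The normalization step amounts to a key rename. Below is a minimal sketch of that logic in Python; in a real pipeline it would typically be expressed as a Collector transform rule. RENAMES covers only an illustrative subset of the HTTP renames, and normalize is a hypothetical helper.

```python
# Illustrative subset of old -> new HTTP attribute names.
RENAMES = {
    "http.method": "http.request.method",
    "http.status_code": "http.response.status_code",
    "http.url": "url.full",
    "http.scheme": "url.scheme",
}


def normalize(attrs):
    """Rename legacy keys; a value already under the new key wins."""
    out = {k: v for k, v in attrs.items() if k not in RENAMES}
    for old, new in RENAMES.items():
        if old in attrs and new not in out:
            out[new] = attrs[old]
    return out
```

Letting the new key win when both are present avoids overwriting data from services that have already migrated.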

Key Points — Section 4

Part 2 Post-Quiz — Sections 3 & 4

Same questions, post-reading.


Answer Explanations