Study Guide: Chapter 3 — Prometheus Architecture and Data Model

Pre-Reading Check — Half 1 (Sections 1 & 2)

Answer first using your existing knowledge. You will get a chance to revise after reading.

1. Which subsystem inside the Prometheus binary actually opens HTTP connections to targets?

A) The rule engine

B) The retrieval / scrape loop

C) Alertmanager (bundled with Prometheus)

D) The remote write client

2. When a scrape fails (timeout or 5xx), what does Prometheus write to the TSDB?

A) Nothing — the gap is simply absent

B) The previous sample value, repeated

C) A synthetic up series with value 0

D) An entry in a dead-letter queue for retry

3. Why does Prometheus prefer pull over push for steady-state services?

A) Pull is faster on the wire than push

B) Pull gives implicit liveness, server-controlled load, and a debuggable wire protocol

C) Pull avoids the need for service discovery

D) Pull is the only model that supports labels

4. Which of these is a legitimate use of the Pushgateway?

A) Buffering high-cardinality SLO latency metrics from a long-running web service

B) Reporting last-run duration and status for a nightly batch job

C) Replacing remote write for shipping all metrics to long-term storage

D) Storing per-pod metrics for ephemeral Kubernetes pods

5. In a federation hierarchy, what should you typically pull through /federate?

A) Every raw series from every leaf Prometheus

B) Aggregated series produced by recording rules

C) The WAL files of upstream Prometheus servers

D) Alertmanager state for cross-region dedup

Section 1: Server Components

A single prometheus binary is best understood as a small distributed system that happens to run inside one process. Four cooperating subsystems share the binary:

Retrieval — the scrape loop that pulls /metrics over HTTP on a schedule.
Storage — the TSDB (head block in memory, WAL on disk, immutable 2-hour blocks).
Query — the HTTP API and PromQL engine that answers /api/v1/query and /api/v1/query_range.
Rules — the rule engine that evaluates recording and alerting rules and fires alerts to an external Alertmanager.

Figure 3.1: Prometheus server components and external integrations

flowchart LR
    SD[Service Discovery<br/>K8s, Consul, EC2, DNS] --> R[Retrieval<br/>Scrape Loop]
    R -->|HTTP GET /metrics| Targets[(Scrape Targets)]
    R --> TSDB[(TSDB<br/>head + WAL + blocks)]
    TSDB --> API[HTTP API + Web UI]
    API --> Grafana[Grafana / Clients]
    TSDB --> RE[Rule Engine<br/>recording + alerting]
    RE --> AM[Alertmanager<br/>external]
    TSDB --> RW[Remote Write]
    RW --> LTS[(Long-term Storage<br/>Thanos / Mimir / Cortex)]

The scrape loop state machine

For every target, Prometheus runs five steps on a timer: resolve address → HTTP GET /metrics → parse OpenMetrics → relabel → append to the head block. Whether or not the scrape succeeds, the loop also writes built-in meta-series: up, scrape_duration_seconds, and scrape_samples_scraped. These are the series you alert on when you want to alert on your monitoring itself.

Analogy: the scrape loop is a postal carrier walking the same route every 15 seconds. They do not wait for a letter; they pick up whatever is in the mailbox at that moment. If the house has been demolished, the carrier files a report (up = 0) and moves on. Service discovery is the address book that tells the carrier which houses exist this morning.
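The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Prometheus's implementation: it skips address resolution and relabeling, ignores labels, and the `scrape_once` and `failing_fetch` names are hypothetical. What it shows is the one property worth remembering: the meta-series are appended on both the success and the failure path.

```python
import time
from typing import Callable, List, Tuple

Sample = Tuple[str, float]  # (series name, value); labels omitted for brevity


def scrape_once(fetch: Callable[[], str], head: List[Sample]) -> None:
    """One simplified scrape: fetch, parse, append, then write meta-series."""
    started = time.monotonic()
    samples: List[Sample] = []
    up = 0.0
    try:
        body = fetch()  # stands in for HTTP GET /metrics
        for line in body.splitlines():
            if not line or line.startswith("#"):
                continue  # skip blank lines and HELP/TYPE comments
            name, value = line.rsplit(" ", 1)
            samples.append((name, float(value)))
        up = 1.0
    except Exception:
        samples = []  # a failed scrape contributes no target samples
    head.extend(samples)
    # The meta-series are appended whether or not the scrape succeeded.
    head.append(("up", up))
    head.append(("scrape_duration_seconds", time.monotonic() - started))
    head.append(("scrape_samples_scraped", float(len(samples))))


def failing_fetch() -> str:
    raise TimeoutError("scrape timed out")


head: List[Sample] = []
scrape_once(lambda: "# TYPE demo gauge\ndemo 1\n", head)
scrape_once(failing_fetch, head)
print([s for s in head if s[0] == "up"])  # [('up', 1.0), ('up', 0.0)]
```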

HTTP API endpoints worth memorizing

Endpoint	Purpose
`/api/v1/query`	Instant PromQL query
`/api/v1/query_range`	Range PromQL query (what Grafana uses)
`/api/v1/targets`	Active and dropped scrape targets
`/api/v1/rules`, `/api/v1/alerts`	Configured rules and active alerts
`/-/reload`, `/-/healthy`, `/-/ready`	Lifecycle endpoints
`/metrics`	Prometheus's own metrics (it scrapes itself)
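As a rough illustration of what a client does with `/api/v1/query`, the sketch below parses a hand-written response for the expression `up` and lists the instances that are down. The payload follows the instant-vector response shape; the job and instance values are made up, and `down_instances` is a hypothetical helper.

```python
import json
from typing import List

# Hand-written example in the shape returned by /api/v1/query for `up`.
EXAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "up", "job": "api", "instance": "10.0.0.1:8080"},
       "value": [1717520400.0, "1"]},
      {"metric": {"__name__": "up", "job": "api", "instance": "10.0.0.2:8080"},
       "value": [1717520400.0, "0"]}
    ]
  }
}
"""


def down_instances(raw: str) -> List[str]:
    """Return the instance label of every series whose value is 0."""
    payload = json.loads(raw)
    if payload["status"] != "success":
        raise ValueError("query failed")
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        if float(value) == 0.0:
            down.append(series["metric"]["instance"])
    return down


print(down_instances(EXAMPLE_RESPONSE))  # ['10.0.0.2:8080']
```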

Key Takeaways — Server Components

Prometheus = retrieval + TSDB + HTTP/PromQL + rules, all in one binary; Alertmanager is external.
The scrape loop always writes up and scrape_* meta-series — even on failure.
Service discovery (K8s, Consul, EC2, DNS, file) feeds the scrape loop a current target list with __meta_* labels.

Section 2: The Pull Model

In pull, targets expose /metrics and Prometheus calls it. In push, targets stream samples to a collector. Pull buys Prometheus a surprisingly large bundle of properties that are hard to replicate in push systems:

Implicit liveness. Cannot reach target ⇒ up=0 and existing series go stale — no "is this from a dead pod?" ambiguity.
Server-side load control. Scrape interval, timeout, and concurrency are set on Prometheus; a misbehaving target cannot flood it.
Debuggable with curl. Anyone with shell access can call /metrics and see exactly what Prometheus sees.
No client buffering / retries. Targets do not need to know about Prometheus availability.
Sharding by target. Scale horizontally by splitting the target list across servers.
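Sharding by target can be pictured as a stable hash over the target address. The sketch below is an assumption-laden illustration: it uses SHA-256 and a modulus, which is similar in spirit to hash-based target selection but is not guaranteed to reproduce the assignments any Prometheus configuration would make. The addresses are made up.

```python
import hashlib
from typing import Dict, List


def shard_for(address: str, shard_count: int) -> int:
    """Map a target address to a shard index with a stable hash."""
    digest = hashlib.sha256(address.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def split_targets(addresses: List[str], shard_count: int) -> Dict[int, List[str]]:
    """Group addresses by shard; every address lands in exactly one shard."""
    shards: Dict[int, List[str]] = {i: [] for i in range(shard_count)}
    for address in addresses:
        shards[shard_for(address, shard_count)].append(address)
    return shards


targets = [f"10.0.0.{i}:8080" for i in range(1, 11)]  # made-up addresses
for shard, members in split_targets(targets, 3).items():
    print(shard, members)
```

Because the hash depends only on the address, each server can compute its own share of the target list without coordinating with the others.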

Pull is awkward when targets live behind NAT, are very short-lived, or are conceptually event streams rather than stateful endpoints. It fits best in the steady state of long-lived services on trusted networks, which describes most of a Kubernetes cluster.

Figure 3.2: Scrape loop sequence (pull model with implicit liveness)

sequenceDiagram
    participant SD as Service Discovery
    participant P as Prometheus<br/>Scrape Loop
    participant T as Target /metrics
    participant DB as TSDB Head
    SD->>P: target list (IPs, labels)
    loop every 15s
        P->>T: HTTP GET /metrics
        alt target healthy
            T-->>P: 200 OK + OpenMetrics body
            P->>P: parse + relabel samples
            P->>DB: append samples + "up=1"
        else target unreachable
            T--xP: timeout / 5xx / parse error
            P->>DB: append "up=0" + staleness markers
        end
    end

Pushgateway — the narrow escape hatch

For short-lived batch jobs, the Pushgateway is a tiny HTTP server that accepts pushes, holds the most recent values in memory, and exposes them on its own /metrics, which Prometheus scrapes like any other target. The right shape: a nightly job pushes nightly_import_last_run_status, _duration_seconds, and _timestamp_seconds right before exit.
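A batch job's final push might look roughly like the sketch below. It only builds the request and never sends it. The Pushgateway address is hypothetical, and the full metric names are one plausible expansion of the shorthand above, not names confirmed by the text. The URL uses the Pushgateway's `/metrics/job/<job>` path.

```python
import time
import urllib.request
from typing import Optional

PUSHGATEWAY = "http://pushgateway.example.internal:9091"  # hypothetical address


def build_push(job: str, succeeded: bool, duration_s: float,
               finished_at: Optional[float] = None) -> urllib.request.Request:
    """Build (but do not send) a request that replaces the job's metric group."""
    finished_at = time.time() if finished_at is None else finished_at
    body = (
        "# TYPE nightly_import_last_run_status gauge\n"
        f"nightly_import_last_run_status {1 if succeeded else 0}\n"
        "# TYPE nightly_import_last_run_duration_seconds gauge\n"
        f"nightly_import_last_run_duration_seconds {duration_s}\n"
        "# TYPE nightly_import_last_run_timestamp_seconds gauge\n"
        f"nightly_import_last_run_timestamp_seconds {finished_at}\n"
    )
    return urllib.request.Request(
        url=f"{PUSHGATEWAY}/metrics/job/{job}",
        data=body.encode("utf-8"),
        method="PUT",  # PUT replaces all metrics in the group for this job
    )


request = build_push("nightly_import", succeeded=True, duration_s=182.4,
                     finished_at=1717520400)
print(request.get_method(), request.full_url)
# Sending is left to the caller: urllib.request.urlopen(request, timeout=10)
```

Note that the grouping key is the job name only. Adding a per-run or per-pod label to the URL would recreate the permanent-series problem described in the anti-patterns.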

Anti-patterns to avoid:

Do not use Pushgateway for long-running services — no built-in staleness; dead services look healthy forever.
Do not push per-pod labels for ephemeral pods — each pod UUID becomes a permanent series.
Do not treat it as an event stream — each push replaces the current value.
Do not use it for SLO metrics — cached values hide outages.

For genuinely push-shaped workloads, the modern answer is usually the OpenTelemetry Collector (covered in chapter 5).

Federation

Federation lets one Prometheus pull a subset of series from another via /federate. The healthy hierarchy is leaf → aggregator → global, and you should pull only aggregated series (recording-rule outputs) through it. For raw data shipping, remote write is almost always the better tool.
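To make "pull only aggregated series" concrete, the sketch below builds a `/federate` URL whose `match[]` selector picks up only series whose names start with `job:`. That prefix assumes recording rules are named in the `level:metric:operations` style; the leaf server address is hypothetical, and `federate_url` is an illustrative helper.

```python
from typing import List
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base: str, selectors: List[str]) -> str:
    """Build a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base}/federate?{query}"


url = federate_url(
    "http://prom-leaf.example.internal:9090",  # hypothetical leaf server
    ['{__name__=~"job:.*"}'],                  # assumes rules named job:...
)
print(url)
print(parse_qs(urlsplit(url).query))  # decoded view of the same parameters
```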

Key Takeaways — Pull Model

Pull = implicit liveness + server-controlled load + curl-debuggable + simple sharding.
Pull's pain points: NAT/firewalled targets, very short-lived jobs, event-shaped workloads.
Pushgateway is for batch job-level metrics, never service-level SLOs.
Use federation for aggregated views; use remote write for raw long-term storage.

Post-Reading Check — Half 1 (Sections 1 & 2)

Same questions — revise your answers based on what you just read.

1. Which subsystem inside the Prometheus binary actually opens HTTP connections to targets?

A) The rule engine

B) The retrieval / scrape loop

C) Alertmanager (bundled with Prometheus)

D) The remote write client

2. When a scrape fails (timeout or 5xx), what does Prometheus write to the TSDB?

A) Nothing — the gap is simply absent

B) The previous sample value, repeated

C) A synthetic up series with value 0

D) An entry in a dead-letter queue for retry

3. Why does Prometheus prefer pull over push for steady-state services?

A) Pull is faster on the wire than push

B) Pull gives implicit liveness, server-controlled load, and a debuggable wire protocol

C) Pull avoids the need for service discovery

D) Pull is the only model that supports labels

4. Which of these is a legitimate use of the Pushgateway?

A) Buffering high-cardinality SLO latency metrics from a long-running web service

B) Reporting last-run duration and status for a nightly batch job

C) Replacing remote write for shipping all metrics to long-term storage

D) Storing per-pod metrics for ephemeral Kubernetes pods

5. In a federation hierarchy, what should you typically pull through /federate?

A) Every raw series from every leaf Prometheus

B) Aggregated series produced by recording rules

C) The WAL files of upstream Prometheus servers

D) Alertmanager state for cross-region dedup

Pre-Reading Check — Half 2 (Sections 3 & 4)

Answer first using what you already know.

6. What uniquely identifies a Prometheus time series?

A) The metric name alone

B) The metric name plus its complete label set

C) The job label combined with the value

D) The series ID assigned by the WAL

7. By default, if a target's exposed labels conflict with target labels, which set wins?

A) The target's exposed labels — they are more specific

B) Whichever label was set most recently

C) Prometheus's target labels (e.g., job, instance)

D) Both are kept under prefixed names

8. What does Prometheus do before updating the in-memory head block when it ingests a sample?

A) Compress the chunk with XOR encoding

B) Append a CRC-checksummed record to the WAL

C) Flush the previous 2h block to disk

D) POST the sample to remote_write

9. If a remote-write endpoint becomes unavailable for an extended period, what happens?

A) Prometheus blocks scrapes until the backend recovers

B) The queue fills and Prometheus eventually drops samples

C) Prometheus replays from local TSDB once the endpoint returns

D) The samples are written to a local replay log forever

10. Which long-term-storage backend is the natural fit when you already operate many Prometheus servers and want object-store-backed retention with downsampling and minimal disruption?

A) Thanos (sidecar pattern)

B) Grafana Mimir

C) Cortex

D) Pushgateway

Section 3: Data Model and Exposition Format

A time series is uniquely identified by its metric name plus its label set. Everything else — help text, type, unit — is metadata. A single sample looks like:

http_requests_total{job="api", method="GET", status="200", path="/users"}  1873  1717520400000

That decomposes into:

Metric name: http_requests_total. Conventions: _total for counters; _seconds, _bytes, _ratio for units.
Label set: {job, method, status, path}. Each is a (name, value) pair. Change any value ⇒ new time series.
Sample: a float value plus a Unix-millisecond timestamp.

This is the multi-dimensional idea: one metric name, many label dimensions, then slice at query time:

sum by (status) (rate(http_requests_total{job="api"}[5m]))

Analogy: metric name = spreadsheet, labels = columns, label values = cell contents, each unique row = a time series. Adding a high-cardinality column (user_id) forces at least one row per user, and every distinct value multiplies against the combinations of the other columns, so cost grows at least linearly in distinct values.
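The identity rule and the cardinality arithmetic can be checked with a short sketch. The `series_key` and `worst_case_series` helpers are hypothetical, and the worst case assumes every combination of label values actually occurs, which real workloads rarely reach.

```python
from math import prod
from typing import Dict, FrozenSet, List, Tuple

SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]


def series_key(name: str, labels: Dict[str, str]) -> SeriesKey:
    """Identity of a series: metric name plus the full label set."""
    return name, frozenset(labels.items())


def worst_case_series(label_values: Dict[str, List[str]]) -> int:
    """Upper bound for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


a = series_key("http_requests_total", {"method": "GET", "status": "200"})
b = series_key("http_requests_total", {"status": "200", "method": "GET"})
c = series_key("http_requests_total", {"method": "GET", "status": "500"})
print(a == b, a == c)  # True False

labels = {"method": ["GET", "POST"], "status": ["200", "500"]}
print(worst_case_series(labels))  # 4
labels["user_id"] = [str(i) for i in range(1000)]
print(worst_case_series(labels))  # 4000
```

Label order does not matter, but any changed value produces a different key, and one unbounded label multiplies the series count for the whole metric.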

OpenMetrics text exposition format

# HELP http_requests_total Total HTTP requests received.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1873
http_requests_total{method="GET",status="500"} 4
http_requests_total{method="POST",status="200"} 219

# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.2848256e+07

# HELP http_request_duration_seconds HTTP request latency.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 1500
http_request_duration_seconds_bucket{le="0.5"} 1860
http_request_duration_seconds_bucket{le="1.0"} 1872
http_request_duration_seconds_bucket{le="+Inf"} 1873
http_request_duration_seconds_sum 92.7
http_request_duration_seconds_count 1873
# EOF

Rules:

# HELP and # TYPE are semantically meaningful comments.
Each sample line: metric_name{labels} value [timestamp]. Timestamp is optional — scrape time is used if omitted.
Histograms expand to multiple lines: one bucket per le boundary, plus _sum and _count — each bucket is its own series.
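OpenMetrics ends with a literal # EOF line. Strict OpenMetrics also names a counter family without the _total suffix in # HELP and # TYPE (http_requests here) and does not allow blank lines; the sample above otherwise follows the classic Prometheus text format.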
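The sample-line rule above is simple enough to parse with two regular expressions. The following is a deliberately simplified sketch: it does not handle escaped quotes or a `}` inside a label value, and it is not the parser Prometheus uses. `parse_sample` is a hypothetical helper.

```python
import re
from typing import Dict, Optional, Tuple

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'            # optional {label="value",...}
    r'\s+(?P<value>\S+)'                     # sample value
    r'(?:\s+(?P<ts>\S+))?$'                  # optional timestamp
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

Parsed = Tuple[str, Dict[str, str], float, Optional[str]]


def parse_sample(line: str) -> Optional[Parsed]:
    """Parse one sample line; return None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"unparseable line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value")), match.group("ts")


print(parse_sample('http_requests_total{method="GET",status="200"} 1873'))
print(parse_sample('http_request_duration_seconds_bucket{le="+Inf"} 1873'))
print(parse_sample("# EOF"))  # None
```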

Honor labels, target labels, and relabeling

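By default, when a target's exposed labels conflict with the labels Prometheus attaches (most importantly job and instance), the target labels win and the conflicting exposed label is kept under an exported_ prefix (for example exported_job). Setting honor_labels: true flips that, so the exposed labels win, which is useful for the Pushgateway and federation.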
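The conflict rule can be expressed as a small merge function. This is a sketch of the behavior as described here, with the losing exposed label preserved under an `exported_` prefix in the default case; `merge_labels` and the label values are hypothetical, and edge cases such as empty label values are ignored.

```python
from typing import Dict


def merge_labels(exposed: Dict[str, str], target: Dict[str, str],
                 honor_labels: bool = False) -> Dict[str, str]:
    """Combine labels from the scraped sample with the target's labels."""
    if honor_labels:
        # Exposed labels win; target labels only fill in what is missing.
        return {**target, **exposed}
    merged = dict(target)
    for name, value in exposed.items():
        if name in target:
            merged[f"exported_{name}"] = value  # keep the loser under a prefix
        else:
            merged[name] = value
    return merged


exposed = {"job": "nightly_import", "code": "200"}
target = {"job": "pushgateway", "instance": "pgw:9091"}
print(merge_labels(exposed, target))
print(merge_labels(exposed, target, honor_labels=True))
```

The second call shows why honor_labels: true suits the Pushgateway: the job label stays the batch job's name rather than becoming the gateway's.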

Two flavors of relabeling:

relabel_configs — runs before the scrape, against __meta_* SD labels. Selects which targets to scrape and rewrites their final labels.
metric_relabel_configs — runs after the scrape, against each sample. Drops high-cardinality samples or renames labels.

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that opt in.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
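      # Set the job label from the pod's annotation. The (.+) regex skips pods
      # without the annotation; the default (.*) would also match an empty
      # value and clear the job label.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_job]
        regex: "(.+)"
        target_label: job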
    metric_relabel_configs:
      # Drop a known cardinality-bomb metric.
      - source_labels: [__name__]
        regex: "go_gc_pauses_seconds_bucket"
        action: drop

Figure 3.3: Relabeling pipeline (target metadata to stored series)

flowchart LR
    SD[Service Discovery] -->|"__meta_kubernetes_*<br/>__meta_consul_*"| RC[relabel_configs<br/>keep / drop / rewrite]
    RC --> TL["Final target labels<br/>job, instance, namespace, pod"]
    TL --> SCR[Scrape /metrics]
    SCR --> RAW["Raw samples<br/>name + labels + value"]
    RAW --> MRC[metric_relabel_configs<br/>drop high-cardinality]
    MRC --> TSDB[(TSDB head block)]

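Two rules of thumb: (1) test relabeling against a known target by comparing its discovered labels with its final labels (the /api/v1/targets response lists both), raising --log.level=debug if you need more detail; (2) drop unbounded labels (user IDs, request IDs, raw URL paths) before they reach the TSDB.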
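The keep and drop actions used in the configuration above share one mechanism: join the source label values with a separator, then test the result against a fully anchored regex. The sketch below models just that. It uses Python's `re` module rather than the RE2 syntax Prometheus uses, and `apply_keep_drop` is a hypothetical helper.

```python
import re
from typing import Dict, List, Optional


def apply_keep_drop(labels: Dict[str, str], source_labels: List[str],
                    regex: str, action: str,
                    separator: str = ";") -> Optional[Dict[str, str]]:
    """Return the labels unchanged, or None if the rule discards them."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    matched = re.fullmatch(regex, joined) is not None  # fully anchored match
    if action == "keep":
        return labels if matched else None
    if action == "drop":
        return None if matched else labels
    raise ValueError(f"unsupported action: {action}")


pod = {"__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true"}
print(apply_keep_drop(pod, list(pod), "true", "keep") is not None)  # True

sample = {"__name__": "go_gc_pauses_seconds_bucket"}
print(apply_keep_drop(sample, ["__name__"], "go_gc_pauses_seconds_bucket", "drop"))  # None
print(apply_keep_drop(sample, ["__name__"], "go_gc", "drop") is not None)  # True
```

The last call is the one that surprises people: because the match is anchored, a prefix such as go_gc does not drop go_gc_pauses_seconds_bucket; the regex would need to be go_gc.* for that.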

Key Takeaways — Data Model

metric_name + label_set = unique time series. Histograms = one series per bucket.
Target labels beat exposed labels unless honor_labels: true.
relabel_configs shapes target labels before scrape; metric_relabel_configs drops samples after.
Cardinality grows linearly with unique label-value combinations — cap unbounded labels at ingest.

Section 4: Storage, Retention, and Remote Write

Every sample follows a fixed path from scrape to retention:

1. WAL append. The sample (series ID + timestamp + value) is encoded as a CRC-checksummed record and written to the current WAL segment (~128 MB files: 00000001, 00000002, ...).
2. Head update. The TSDB looks up (or creates) the in-memory series for the label set and appends the sample to its current chunk.

The WAL is the durability story: if Prometheus crashes after step 1, the sample is replayed on restart from the most recent checkpoint.
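The write-ahead ordering can be illustrated with a toy store. Everything about the record layout below is invented for the example: a fixed 24-byte payload followed by a zlib CRC32, one sample per record. The real WAL batches samples and uses its own encoding and checksum. The point is only the order of the two steps and what the checksum buys you.

```python
import struct
import zlib
from typing import Dict, List, Tuple

PAYLOAD = struct.Struct("<Qqd")  # series ID, timestamp in ms, float value


def encode_record(series_id: int, ts_ms: int, value: float) -> bytes:
    """Serialize one sample and append a CRC32 of the payload."""
    payload = PAYLOAD.pack(series_id, ts_ms, value)
    return payload + struct.pack("<I", zlib.crc32(payload))


def decode_record(record: bytes) -> Tuple[int, int, float]:
    """Verify the checksum, then unpack the sample."""
    payload, (crc,) = record[:-4], struct.unpack("<I", record[-4:])
    if zlib.crc32(payload) != crc:
        raise ValueError("CRC mismatch")
    return PAYLOAD.unpack(payload)


class TinyTSDB:
    def __init__(self) -> None:
        self.wal: List[bytes] = []                          # stands in for WAL segments
        self.head: Dict[int, List[Tuple[int, float]]] = {}  # series ID -> samples

    def append(self, series_id: int, ts_ms: int, value: float) -> None:
        self.wal.append(encode_record(series_id, ts_ms, value))     # step 1: WAL
        self.head.setdefault(series_id, []).append((ts_ms, value))  # step 2: head


db = TinyTSDB()
db.append(1, 1717520400000, 1873.0)
print(decode_record(db.wal[0]))  # (1, 1717520400000, 1873.0)
```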

Block layout

Every ~2 hours the head is cut into an immutable on-disk block:

01J1H1Q2K3V3Y2.../
  meta.json       # ULID, minTime, maxTime, compaction level, sources, stats
  index           # symbol table + postings lists (label=value -> series IDs)
  chunks/
    000001        # XOR-compressed chunks (mmapped at query time)
    000002
  tombstones      # optional, deletion intervals

The index is the crown jewel: a symbol table interns every label name and value, every series gets an integer ID, and postings lists hold the sorted series IDs for each label=value. A PromQL selector like {__name__="http_requests_total", job="api"} resolves by intersecting two postings lists — milliseconds, no table scan.
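The selector resolution described above amounts to intersecting sorted lists of integers. Here is a minimal sketch with a hand-built postings table; the series IDs are made up, and the real index stores postings in a compact on-disk form rather than Python lists.

```python
from typing import Dict, List, Tuple

Postings = Dict[Tuple[str, str], List[int]]  # (label, value) -> sorted series IDs


def intersect(a: List[int], b: List[int]) -> List[int]:
    """Merge-style intersection of two sorted ID lists."""
    out: List[int] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def select(postings: Postings, matchers: Dict[str, str]) -> List[int]:
    """Resolve equality matchers by intersecting their postings lists."""
    lists = [postings.get(pair, []) for pair in matchers.items()]
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result


postings: Postings = {
    ("__name__", "http_requests_total"): [1, 2, 3, 7],
    ("job", "api"): [2, 3, 5],
    ("job", "web"): [1, 7],
}
print(select(postings, {"__name__": "http_requests_total", "job": "api"}))  # [2, 3]
```

Each list is walked once, so the cost is proportional to the lengths of the lists involved, not to the total number of series in the block.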

Chunks themselves use delta-of-delta encoding for timestamps and Gorilla-style XOR for floats — typically ~1-2 bytes per sample. Chunks are memory-mapped, which is why Prometheus often shows modest RSS but a large OS page cache.
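Why these two encodings compress so well is easiest to see on concrete numbers. The sketch below computes only the intermediate values (delta-of-delta for timestamps, XOR of the IEEE 754 bit patterns for values) and leaves out the variable-length bit packing that turns them into bytes. The function names are hypothetical.

```python
import struct
from typing import List


def delta_of_delta(timestamps: List[int]) -> List[int]:
    """First value, first delta, then the change between consecutive deltas."""
    if len(timestamps) < 2:
        return list(timestamps)
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return [timestamps[0], deltas[0]] + dods


def xor_bits(previous: float, current: float) -> int:
    """XOR of the two values' 64-bit IEEE 754 representations."""
    (p,) = struct.unpack(">Q", struct.pack(">d", previous))
    (c,) = struct.unpack(">Q", struct.pack(">d", current))
    return p ^ c


# A 15-second scrape interval with 2 ms of jitter on the last scrape.
stamps = [1717520400000, 1717520415000, 1717520430000, 1717520445002]
print(delta_of_delta(stamps))  # [1717520400000, 15000, 0, 2]

print(xor_bits(1873.0, 1873.0))  # 0
print(bin(xor_bits(1873.0, 1874.0)).count("1"))  # 2
```

A steady scrape interval yields mostly zeros, and an unchanged or slowly changing value yields an XOR with few set bits, which is what lets the encoder spend only a bit or two on many samples.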

Figure 3.4: TSDB block lifecycle from scrape to retention

stateDiagram-v2
    [*] --> Scrape: sample produced
    Scrape --> WAL: append CRC record
    WAL --> Head: update in-memory head chunk
    Head --> Head: accumulate ~2h of samples
    Head --> Block2h: cut immutable block<br/>(meta.json + index + chunks)
    Block2h --> Block6h: compact adjacent blocks
    Block6h --> Block24h: compact further
    Block24h --> BlockMulti: compact multi-day
    BlockMulti --> Deleted: retention horizon passed
    Deleted --> [*]
    Head --> Recovery: crash
    Recovery --> Head: replay WAL from checkpoint.N

Crash recovery worked example

Suppose Prometheus has wal/checkpoint.10/ plus segments 00000011, 00000012, 00000013, and crashes mid-write on segment 13. On startup:

Existing immutable blocks load normally — no WAL needed.
The head is rebuilt from checkpoint.10.
Segments 11, 12, and 13 are replayed; recovery stops at the first record with a failing CRC.
Any incomplete writes are lost, but the head is consistent and queries work immediately.
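The recovery sequence can be modeled in a few lines. This is a toy under stated assumptions: records are (payload, CRC32) pairs, the checkpoint is a list of already-applied payloads, and segment numbers are plain dictionary keys. `make_record` and `recover_head` are hypothetical names.

```python
import zlib
from typing import Dict, List, Tuple

Record = Tuple[bytes, int]  # (payload, stored CRC32)


def make_record(payload: bytes) -> Record:
    return payload, zlib.crc32(payload)


def recover_head(checkpoint: List[bytes],
                 segments: Dict[int, List[Record]]) -> List[bytes]:
    """Rebuild head state: checkpoint first, then segments in order."""
    head = list(checkpoint)
    for number in sorted(segments):
        for payload, crc in segments[number]:
            if zlib.crc32(payload) != crc:
                return head  # torn write: stop replay at the first bad record
            head.append(payload)
    return head


torn = (b"sample-e", zlib.crc32(b"sample-e") ^ 0xFFFFFFFF)  # simulated crash mid-write
segments = {
    11: [make_record(b"sample-a"), make_record(b"sample-b")],
    12: [make_record(b"sample-c")],
    13: [make_record(b"sample-d"), torn],
}
print(recover_head([b"checkpoint-state"], segments))
# [b'checkpoint-state', b'sample-a', b'sample-b', b'sample-c', b'sample-d']
```

Replay time grows with the number of records after the checkpoint, which is why a missing or stale checkpoint shows up as a slow start.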

Slow startup almost always means either checkpoints are not happening (frequent restarts, OOM kills) or the WAL grew during a long outage. The fix: keep Prometheus healthy long enough to checkpoint.

Remote write

Local TSDB is intentionally short-term (weeks). For longer retention or central aggregation, remote write POSTs Snappy-compressed protobuf batches to a configured URL:

remote_write:
  - url: https://mimir.example.com/api/v1/push
    basic_auth:
      username: tenant-42
      password_file: /etc/prometheus/mimir-token
    queue_config:
      capacity: 10000
      max_shards: 30
      min_backoff: 30ms
      max_backoff: 5s
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop   # don't ship Go runtime metrics

Critical properties:

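Lossy under backpressure. Remote write buffers outgoing samples in bounded per-shard queues. If the remote endpoint cannot keep up for long enough, those queues fill, sending falls behind, and samples are eventually dropped. Watch the prometheus_remote_storage_* metrics, such as prometheus_remote_storage_samples_pending and prometheus_remote_storage_samples_failed_total.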
Not a replay protocol. If you enable it on day 30, the backend sees day 30 onward — not the previous 30 days from local TSDB.
write_relabel_configs can drop or rewrite labels on the way out — useful for stripping cardinality before it leaves your network.
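"Lossy under backpressure" is the behavior of any bounded buffer whose consumer is slower than its producer. The toy model below makes that visible. It is not how Prometheus implements its remote-write queues; the class, the drop-oldest policy, and the capacity are all assumptions chosen for illustration.

```python
from collections import deque
from typing import Deque, List, Tuple

Sample = Tuple[int, float]  # (timestamp in ms, value)


class BoundedSendQueue:
    """Toy model: a fixed-capacity buffer that sheds the oldest samples."""

    def __init__(self, capacity: int) -> None:
        self.pending: Deque[Sample] = deque()
        self.capacity = capacity
        self.dropped = 0

    def enqueue(self, sample: Sample) -> None:
        if len(self.pending) >= self.capacity:
            self.pending.popleft()  # backend too slow: oldest sample is lost
            self.dropped += 1
        self.pending.append(sample)

    def drain(self, batch_size: int) -> List[Sample]:
        """Remove up to batch_size samples, as a successful send would."""
        batch: List[Sample] = []
        while self.pending and len(batch) < batch_size:
            batch.append(self.pending.popleft())
        return batch


queue = BoundedSendQueue(capacity=3)
for i in range(5):  # the backend is down, so nothing drains
    queue.enqueue((1717520400000 + i * 15000, float(i)))
print(queue.dropped, [value for _, value in queue.pending])  # 2 [2.0, 3.0, 4.0]
```

Once the backend returns, only what is still buffered can be sent, which is also why enabling remote write late never backfills older data.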

Thanos vs. Cortex vs. Mimir

Aspect	Thanos	Cortex	Grafana Mimir
Ingest model	Sidecar uploads blocks; optional Receive for remote write	Remote write only	Remote write only; simplified Cortex
Multi-tenancy	Label-based; basic	First-class tenant IDs	First-class with shuffle-sharding
HA dedup	Query-time	Ingest-time	Ingest-time, improved
Downsampling	Native (5m, 1h) via compactor	None at storage	None at storage
Object storage	S3/GCS/Azure/Swift	Blocks or chunks engine	Blocks only
Best fit (2024-25)	Bolt LTS onto existing Prom	Legacy deployments	New large-scale, multi-tenant

Rules of thumb:

Thanos: fits a topology of many independent Prometheus servers that need LTS and a global view. The sidecar adds few moving parts; downsampling makes year-long dashboards fast.
Mimir: fits when you are building a multi-tenant metrics service with per-tenant quotas, isolation, and SLOs.
Cortex: mostly if you already run it; for greenfield, Mimir is usually better.
VictoriaMetrics: simpler single-binary alternative for teams without dedicated SREs.

Figure 3.5: Long-term storage architectures — Thanos vs. Mimir/Cortex

flowchart TB
    subgraph Thanos["Thanos (sidecar pattern)"]
        direction LR
        TP[Prometheus] --> TS[Thanos Sidecar]
        TS -->|upload 2h blocks| TOS[(Object Store<br/>S3 / GCS)]
        TQ[Thanos Querier] --> TS
        TQ --> TSG[Store Gateway]
        TSG --> TOS
        TC[Compactor<br/>downsample 5m / 1h] --> TOS
    end
    subgraph Mimir["Mimir / Cortex (remote-write platform)"]
        direction LR
        MP[Prometheus / Agent] -->|remote_write| MD[Distributor]
        MD --> MI[Ingester<br/>head + WAL]
        MI -->|flush blocks| MOS[(Object Store)]
        MQF[Query Frontend<br/>+ cache] --> MQ[Querier]
        MQ --> MI
        MQ --> MSG[Store Gateway]
        MSG --> MOS
    end

Key Takeaways — Storage & Remote Write

Ingest order: WAL first, head second — WAL is the durability story.
Head → 2h block → compact (6h, 24h, multi-day) → retention deletes whole blocks.
Index = symbol table + postings lists → PromQL selectors resolve via list intersection.
Remote write is lossy under backpressure and is not a replay protocol.
Choose Thanos for "add LTS to existing Prom," Mimir for multi-tenant platforms, Cortex for legacy.

Post-Reading Check — Half 2 (Sections 3 & 4)

Same questions — revise after reading.