Chapter 3 — Prometheus Architecture and Data Model
Learning Objectives
Diagram the Prometheus server, the scrape loop, the TSDB, and Alertmanager, and describe how they interact during a normal scrape and alert cycle.
Explain the pull-based scraping model and the specific trade-offs it makes versus a push model — including where the Pushgateway fits in.
Map a metric name, its label set, and its timestamp onto Prometheus's multi-dimensional data model, and read or write a valid OpenMetrics exposition payload.
Reason about TSDB blocks, the WAL, retention, and remote-write integrations like Thanos, Cortex, and Mimir well enough to plan capacity and debug failures.
Pre-Reading Check — Half 1 (Sections 1 & 2)
Answer first using your existing knowledge. You will get a chance to revise after reading.
1. Which subsystem inside the Prometheus binary actually opens HTTP connections to targets?
A) The rule engine
B) The retrieval / scrape loop
C) Alertmanager (bundled with Prometheus)
D) The remote write client
2. When a scrape fails (timeout or 5xx), what does Prometheus write to the TSDB?
A) Nothing — the gap is simply absent
B) The previous sample value, repeated
C) A synthetic up series with value 0
D) An entry in a dead-letter queue for retry
3. Why does Prometheus prefer pull over push for steady-state services?
A) Pull is faster on the wire than push
B) Pull gives implicit liveness, server-controlled load, and a debuggable wire protocol
C) Pull avoids the need for service discovery
D) Pull is the only model that supports labels
4. Which of these is a legitimate use of the Pushgateway?
A) Buffering high-cardinality SLO latency metrics from a long-running web service
B) Reporting last-run duration and status for a nightly batch job
C) Replacing remote write for shipping all metrics to long-term storage
D) Storing per-pod metrics for ephemeral Kubernetes pods
5. In a federation hierarchy, what should you typically pull through /federate?
A) Every raw series from every leaf Prometheus
B) Aggregated series produced by recording rules
C) The WAL files of upstream Prometheus servers
D) Alertmanager state for cross-region dedup
Section 1: Server Components
A single prometheus binary is best understood as a small distributed system that happens to run inside one process. Four cooperating subsystems share the binary:
Retrieval — the scrape loop that pulls /metrics over HTTP on a schedule.
Storage — the TSDB (head block in memory, WAL on disk, immutable 2-hour blocks).
Query — the HTTP API and PromQL engine that answers /api/v1/query and /api/v1/query_range.
Rules — the rule engine that evaluates recording and alerting rules and fires alerts to an external Alertmanager.
Figure 3.1: Prometheus server components and external integrations
flowchart LR
SD[Service Discovery K8s, Consul, EC2, DNS] --> R[Retrieval Scrape Loop]
R -->|HTTP GET /metrics| Targets[(Scrape Targets)]
R --> TSDB[(TSDB head + WAL + blocks)]
TSDB --> API[HTTP API + Web UI]
API --> Grafana[Grafana / Clients]
TSDB --> RE[Rule Engine recording + alerting]
RE --> AM[Alertmanager external]
TSDB --> RW[Remote Write]
RW --> LTS[(Long-term Storage Thanos / Mimir / Cortex)]
The scrape loop state machine
For every target, Prometheus runs five steps on a timer: resolve address → HTTP GET /metrics → parse OpenMetrics → relabel → append to the head block. Whether or not the scrape succeeds, the loop also writes built-in meta-series: up, scrape_duration_seconds, and scrape_samples_scraped. These are the series you alert on when you want to alert on your monitoring itself.
Analogy: the scrape loop is a postal carrier walking the same route every 15 seconds. They do not wait for a letter; they pick up whatever is in the mailbox at that moment. If the house has been demolished, the carrier files a report (up = 0) and moves on. Service discovery is the address book that tells the carrier which houses exist this morning.
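The five steps and the always-written meta-series can be sketched in a few lines of Python. This is a toy model, not Prometheus's implementation: `scrape_once`, `parse_line`, and `fake_fetch` are hypothetical names, the parser only handles label-free lines, and the relabel step is omitted.

```python
import time
from typing import Callable

Sample = tuple[str, dict[str, str], float]  # (metric name, labels, value)

def parse_line(line: str) -> Sample:
    """Parse a label-free sample line such as 'queue_depth 7' (toy parser)."""
    name, value = line.split()
    return name, {}, float(value)

def scrape_once(target: str, fetch: Callable[[str], str], head: list[Sample]) -> None:
    """Run one scrape of `target` and append the results to `head`."""
    labels = {"instance": target}
    start = time.monotonic()
    try:
        body = fetch(target)  # stands in for HTTP GET /metrics
        samples = [parse_line(ln) for ln in body.splitlines()
                   if ln and not ln.startswith("#")]
        up = 1.0
    except Exception:
        samples, up = [], 0.0  # timeout, 5xx, or parse error
    for name, extra, value in samples:
        head.append((name, {**labels, **extra}, value))
    # Meta-series are written whether or not the scrape succeeded.
    head.append(("up", labels, up))
    head.append(("scrape_duration_seconds", labels, time.monotonic() - start))
    head.append(("scrape_samples_scraped", labels, float(len(samples))))

def fake_fetch(target: str) -> str:
    """Hypothetical fetcher: one healthy target, one that times out."""
    if target == "dead:9100":
        raise TimeoutError("no response")
    return "# TYPE queue_depth gauge\nqueue_depth 7\n"

head: list[Sample] = []
scrape_once("app:9100", fake_fetch, head)
scrape_once("dead:9100", fake_fetch, head)
```

Both targets end up with an `up` sample in `head`; only the healthy one contributes `queue_depth`.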
HTTP API endpoints worth memorizing
| Endpoint | Purpose |
| --- | --- |
| /api/v1/query | Instant PromQL query |
| /api/v1/query_range | Range PromQL query (what Grafana uses) |
| /api/v1/targets | Active and dropped scrape targets |
| /api/v1/rules, /api/v1/alerts | Configured rules and active alerts |
| /-/reload, /-/healthy, /-/ready | Lifecycle endpoints |
| /metrics | Prometheus's own metrics (it scrapes itself) |
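As a small illustration of the query endpoints, the sketch below builds a /api/v1/query_range URL using the API's query, start, end, and step parameters. The host `localhost:9090` and the helper name `query_range_url` are assumptions for the example; nothing is sent over the network.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def query_range_url(base: str, promql: str, start: int, end: int, step: str) -> str:
    """Build a GET URL for /api/v1/query_range (start/end are Unix seconds)."""
    params = {"query": promql, "start": start, "end": end, "step": step}
    return f"{base}/api/v1/query_range?{urlencode(params)}"

url = query_range_url(
    "http://localhost:9090",      # assumed address of a local Prometheus
    'up{job="api"}',
    1700000000,
    1700003600,
    "15s",
)

# Round-trip the query string to confirm the PromQL survived URL encoding.
decoded = parse_qs(urlparse(url).query)
print(decoded["query"][0])
```

Passing the resulting URL to any HTTP client (or to curl) returns a JSON document whose data field holds one value list per matching series.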
Key Takeaways — Server Components
Prometheus = retrieval + TSDB + HTTP/PromQL + rules, all in one binary; Alertmanager is external.
The scrape loop always writes up and scrape_* meta-series — even on failure.
Service discovery (K8s, Consul, EC2, DNS, file) feeds the scrape loop a current target list with __meta_* labels.
Section 2: The Pull Model
In pull, targets expose /metrics and Prometheus calls it. In push, targets stream samples to a collector. Pull buys Prometheus a surprisingly large bundle of properties that are hard to replicate in push systems:
Implicit liveness. Cannot reach target ⇒ up=0 and existing series go stale — no "is this from a dead pod?" ambiguity.
Server-side load control. Scrape interval, timeout, and concurrency are set on Prometheus; a misbehaving target cannot flood it.
Debuggable with curl. Anyone with shell access can call /metrics and see exactly what Prometheus sees.
No client buffering / retries. Targets do not need to know about Prometheus availability.
Sharding by target. Scale horizontally by splitting the target list across servers.
Pull is awkward when targets live behind NAT, are very short-lived, or are conceptually event streams rather than stateful endpoints. It fits best in the steady state of long-lived services on trusted networks, which describes most of a Kubernetes cluster.
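Sharding by target is the least obvious item in the list above, so here is a minimal sketch of the idea: hash each target address and take it modulo the number of servers. Prometheus exposes this as the hashmod relabel action; the code below only illustrates the principle and is not claimed to reproduce the exact shard numbers Prometheus would compute. `shard_for` and `split_targets` are hypothetical helper names.

```python
import hashlib

def shard_for(target: str, num_shards: int) -> int:
    """Map a target address to a shard index with a stable hash."""
    digest = hashlib.md5(target.encode()).digest()
    return int.from_bytes(digest[-8:], "big") % num_shards

def split_targets(targets: list[str], num_shards: int) -> dict[int, list[str]]:
    """Assign every target to exactly one shard."""
    shards: dict[int, list[str]] = {i: [] for i in range(num_shards)}
    for target in targets:
        shards[shard_for(target, num_shards)].append(target)
    return shards

targets = [f"10.0.0.{i}:9100" for i in range(1, 21)]
shards = split_targets(targets, 3)
for index, members in shards.items():
    print(index, len(members))
```

Because the hash depends only on the address, every server can compute the same assignment independently and keep only its own share.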
Animation A — The 15-second scrape pull cycle
Every ~15s the server fires HTTP GETs at all targets; OpenMetrics responses become samples in the TSDB head.
Figure 3.2: Scrape loop sequence (pull model with implicit liveness)
sequenceDiagram
participant SD as Service Discovery
participant P as Prometheus Scrape Loop
participant T as Target /metrics
participant DB as TSDB Head
SD->>P: target list (IPs, labels)
loop every 15s
P->>T: HTTP GET /metrics
alt target healthy
T-->>P: 200 OK + OpenMetrics body
P->>P: parse + relabel samples
P->>DB: append samples + "up=1"
else target unreachable
T--xP: timeout / 5xx / parse error
P->>DB: append "up=0" + staleness markers for the target's series
end
end
Pushgateway — the narrow escape hatch
For short-lived batch jobs, the Pushgateway is a tiny HTTP server that accepts pushes, holds the most recently pushed values in memory, and exposes them on its own /metrics endpoint, which Prometheus scrapes like any other target. The right shape: a nightly job pushes nightly_import_last_run_status, plus matching _duration_seconds and _timestamp_seconds gauges, right before it exits.
Anti-patterns to avoid:
Do not use Pushgateway for long-running services — no built-in staleness; dead services look healthy forever.
Do not push per-pod labels for ephemeral pods — each pod UUID becomes a permanent series.
Do not treat it as an event stream — each push replaces the current value.
Do not use it for SLO metrics — cached values hide outages.
For genuinely push-shaped workloads, the modern answer is usually the OpenTelemetry Collector (covered in chapter 5).
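For the batch-job case, a push is just an HTTP request whose body is a few exposition lines, addressed to the Pushgateway's /metrics/job/<job> path. The sketch below builds such a request without sending it. The gateway address, the job name, and the full metric names (expanded from the shorthand in the text) are assumptions for illustration.

```python
import urllib.request

def build_push(gateway: str, job: str, status: int,
               duration_s: float, finished_at: int) -> urllib.request.Request:
    """Build (but do not send) a push of three job-level gauges."""
    prefix = "nightly_import_last_run"  # assumed metric name prefix
    lines = []
    for suffix, value in (
        ("status", status),
        ("duration_seconds", duration_s),
        ("timestamp_seconds", finished_at),
    ):
        lines.append(f"# TYPE {prefix}_{suffix} gauge")
        lines.append(f"{prefix}_{suffix} {value}")
    body = "\n".join(lines) + "\n"
    # PUT replaces every metric previously pushed for this job group.
    return urllib.request.Request(
        f"{gateway}/metrics/job/{job}", data=body.encode(), method="PUT"
    )

req = build_push("http://pushgateway.example:9091", "nightly_import",
                 status=1, duration_s=42.5, finished_at=1700000000)
print(req.get_method(), req.full_url)
print(req.data.decode())
```

Calling urllib.request.urlopen(req) as the job's last step would perform the push; there is no per-pod label in the path, which keeps the series count fixed from run to run.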
Federation
Federation lets one Prometheus pull a subset of series from another via /federate. The healthy hierarchy is leaf → aggregator → global, and you should pull only aggregated series (recording-rule outputs) through it. For raw data shipping, remote write is almost always the better tool.
Key Takeaways — The Pull Model
Pull's pain points: NAT/firewalled targets, very short-lived jobs, event-shaped workloads.
Pushgateway is for batch job-level metrics, never service-level SLOs.
Use federation for aggregated views; use remote write for raw long-term storage.
Post-Reading Check — Half 1 (Sections 1 & 2)
Same questions — revise your answers based on what you just read.
1. Which subsystem inside the Prometheus binary actually opens HTTP connections to targets?
A) The rule engine
B) The retrieval / scrape loop
C) Alertmanager (bundled with Prometheus)
D) The remote write client
2. When a scrape fails (timeout or 5xx), what does Prometheus write to the TSDB?
A) Nothing — the gap is simply absent
B) The previous sample value, repeated
C) A synthetic up series with value 0
D) An entry in a dead-letter queue for retry
3. Why does Prometheus prefer pull over push for steady-state services?
A) Pull is faster on the wire than push
B) Pull gives implicit liveness, server-controlled load, and a debuggable wire protocol
C) Pull avoids the need for service discovery
D) Pull is the only model that supports labels
4. Which of these is a legitimate use of the Pushgateway?
A) Buffering high-cardinality SLO latency metrics from a long-running web service
B) Reporting last-run duration and status for a nightly batch job
C) Replacing remote write for shipping all metrics to long-term storage
D) Storing per-pod metrics for ephemeral Kubernetes pods
5. In a federation hierarchy, what should you typically pull through /federate?
A) Every raw series from every leaf Prometheus
B) Aggregated series produced by recording rules
C) The WAL files of upstream Prometheus servers
D) Alertmanager state for cross-region dedup
Pre-Reading Check — Half 2 (Sections 3 & 4)
Answer first using what you already know.
6. What uniquely identifies a Prometheus time series?
A) The metric name alone
B) The metric name plus its complete label set
C) The job label combined with the value
D) The series ID assigned by the WAL
7. By default, if a target's exposed labels conflict with target labels, which set wins?
A) The target's exposed labels — they are more specific
B) Whichever label was set most recently
C) Prometheus's target labels (e.g., job, instance)
D) Both are kept under prefixed names
8. What does Prometheus do before updating the in-memory head block when it ingests a sample?
A) Compress the chunk with XOR encoding
B) Append a CRC-checksummed record to the WAL
C) Flush the previous 2h block to disk
D) POST the sample to remote_write
9. If a remote-write endpoint becomes unavailable for an extended period, what happens?
A) Prometheus blocks scrapes until the backend recovers
B) The queue fills and Prometheus eventually drops samples
C) Prometheus replays from local TSDB once the endpoint returns
D) The samples are written to a local replay log forever
10. Which long-term-storage backend is the natural fit when you already operate many Prometheus servers and want object-store-backed retention with downsampling and minimal disruption?
A) Thanos (sidecar pattern)
B) Grafana Mimir
C) Cortex
D) Pushgateway
Section 3: Data Model and Exposition Format
A time series is uniquely identified by its metric name plus its label set. Everything else — help text, type, unit — is metadata. A single sample looks like:
Metric name: http_requests_total. Conventions: _total for counters; _seconds, _bytes, _ratio for units.
Label set: {job, method, status, path}. Each is a (name, value) pair. Change any value ⇒ new time series.
Sample: a float value plus a Unix-millisecond timestamp.
This is the multi-dimensional idea: one metric name, many label dimensions, then slice at query time:
sum by (status) (rate(http_requests_total{job="api"}[5m]))
Analogy: metric name = spreadsheet, labels = columns, label values = cell contents, each unique row = a time series. Adding a high-cardinality column (user_id) multiplies the row count by the number of distinct users, so every new distinct value costs you more series.
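The identity rule and the cardinality arithmetic can be made concrete with a short sketch. It models a series key as the metric name plus the sorted label pairs and computes the worst-case series count as the product of distinct values per label. The helper names and the sample numbers are made up for illustration.

```python
from math import prod

def series_key(name: str, labels: dict[str, str]) -> tuple:
    """Identity of a series: metric name plus its complete, order-free label set."""
    return (name, tuple(sorted(labels.items())))

observations = [
    ("http_requests_total", {"method": "GET", "status": "200"}),
    ("http_requests_total", {"status": "200", "method": "GET"}),  # same series
    ("http_requests_total", {"method": "GET", "status": "500"}),
    ("http_requests_total", {"method": "POST", "status": "200"}),
]
series = {series_key(name, labels) for name, labels in observations}

def max_series(distinct_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of per-label cardinalities."""
    return prod(distinct_values.values())

print(len(series))
print(max_series({"method": 2, "status": 3}))
print(max_series({"method": 2, "status": 3, "user_id": 1000}))
```

The last two calls show why a user_id label is dangerous: a metric bounded at 6 series becomes one bounded at 6000.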
OpenMetrics text exposition format
# HELP http_requests Total HTTP requests received.
# TYPE http_requests counter
http_requests_total{method="GET",status="200"} 1873
http_requests_total{method="GET",status="500"} 4
http_requests_total{method="POST",status="200"} 219
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.2848256e+07
# HELP http_request_duration_seconds HTTP request latency.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 1500
http_request_duration_seconds_bucket{le="0.5"} 1860
http_request_duration_seconds_bucket{le="1.0"} 1872
http_request_duration_seconds_bucket{le="+Inf"} 1873
http_request_duration_seconds_sum 92.7
http_request_duration_seconds_count 1873
# EOF
Rules:
# HELP and # TYPE are semantically meaningful comments.
Each sample line: metric_name{labels} value [timestamp]. Timestamp is optional — scrape time is used if omitted.
Histograms expand to multiple lines: one bucket per le boundary, plus _sum and _count — each bucket is its own series.
OpenMetrics ends with a literal # EOF line.
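To make the line grammar concrete, here is a deliberately small parser for sample lines of the form metric_name{labels} value [timestamp]. It is a sketch under simplifying assumptions (no escaped quotes inside label values, no exemplars) and is not a replacement for a real client library; `parse_sample` and `parse_payload` are hypothetical names.

```python
import re

LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+(\S+))?$')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

def parse_sample(line: str):
    """Return (name, labels, value, timestamp or None) for one sample line."""
    match = LINE.match(line)
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, value, timestamp = match.groups()
    labels = dict(LABEL.findall(raw_labels or ""))
    return name, labels, float(value), (float(timestamp) if timestamp else None)

def parse_payload(text: str) -> list:
    """Parse every sample line, skipping comments and stopping at # EOF."""
    samples = []
    for line in text.splitlines():
        if line == "# EOF":
            break
        if line and not line.startswith("#"):
            samples.append(parse_sample(line))
    return samples

PAYLOAD = """\
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 1500
http_request_duration_seconds_bucket{le="+Inf"} 1873
http_request_duration_seconds_sum 92.7
http_request_duration_seconds_count 1873
# EOF
"""
samples = parse_payload(PAYLOAD)
for sample in samples:
    print(sample)
```

Each bucket line comes back with its own le label, which is the parser-level view of "each bucket is its own series".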
Honor labels, target labels, and relabeling
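By default, when a label exposed by the target conflicts with a label Prometheus attaches (most importantly job and instance), the Prometheus-side target label wins and the exposed value is kept under an exported_ prefix (for example exported_job). Setting honor_labels: true flips that so the exposed value wins, which is useful for the Pushgateway and federation.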
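The conflict rule is easy to misread, so here is a small sketch of it. With honor_labels off, a conflicting exposed label is renamed to exported_<name>; with it on, the exposed value replaces the server-side one. `merge_labels` is a hypothetical helper that models only this rule, not the rest of the scrape pipeline.

```python
def merge_labels(target_labels: dict[str, str],
                 exposed_labels: dict[str, str],
                 honor_labels: bool = False) -> dict[str, str]:
    """Combine Prometheus-attached target labels with labels from the scrape."""
    merged = dict(target_labels)
    for name, value in exposed_labels.items():
        if name in target_labels and not honor_labels:
            merged["exported_" + name] = value   # keep both, target label wins
        else:
            merged[name] = value                 # no conflict, or exposed wins
    return merged

target = {"job": "pushgateway", "instance": "pgw:9091"}
exposed = {"job": "nightly_import", "status": "ok"}

default = merge_labels(target, exposed)
honored = merge_labels(target, exposed, honor_labels=True)
print(default)
print(honored)
```

In the Pushgateway case the honored result is the useful one: the series carries the batch job's own job label rather than the gateway's.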
Two flavors of relabeling:
relabel_configs — runs before the scrape, against __meta_* SD labels. Selects which targets to scrape and rewrites their final labels.
metric_relabel_configs — runs after the scrape, against each sample. Drops high-cardinality samples or renames labels.
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that opt in.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Set the job label from the pod's annotation.
      # (.+) skips pods without the annotation, so job keeps its default
      # instead of being overwritten with an empty value.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_job]
        regex: "(.+)"
        target_label: job
    metric_relabel_configs:
      # Drop a known cardinality-bomb metric.
      - source_labels: [__name__]
        regex: "go_gc_pauses_seconds_bucket"
        action: drop
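The semantics of keep, drop, and replace can be sketched as a small interpreter. This is a simplified model under stated assumptions: regexes are fully anchored, source labels are joined with ";", Python's \1 stands in for Prometheus's $1, and a replace that yields an empty string removes the label. It does not model other actions or the later removal of __meta_* labels. The rule dictionaries mirror the annotation names used in the configuration above; the (.+) regex on the job rule is a choice made for this sketch.

```python
import re
from typing import Optional

def relabel(labels: dict[str, str], rules: list[dict]) -> Optional[dict[str, str]]:
    """Apply keep/drop/replace rules in order; None means the item was dropped."""
    labels = dict(labels)
    for rule in rules:
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        match = re.fullmatch(rule.get("regex", "(.*)"), value)
        action = rule.get("action", "replace")
        if action == "keep" and match is None:
            return None
        if action == "drop" and match is not None:
            return None
        if action == "replace" and match is not None:
            new_value = match.expand(rule.get("replacement", r"\1"))
            if new_value:
                labels[rule["target_label"]] = new_value
            else:
                labels.pop(rule["target_label"], None)
    return labels

SCRAPE = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"
JOB = "__meta_kubernetes_pod_annotation_prometheus_io_job"
target_rules = [
    {"source_labels": [SCRAPE], "action": "keep", "regex": "true"},
    {"source_labels": [JOB], "regex": "(.+)", "target_label": "job"},
]
metric_rules = [
    {"source_labels": ["__name__"], "regex": "go_gc_pauses_seconds_bucket", "action": "drop"},
]

opted_in = relabel({"job": "kubernetes-pods", SCRAPE: "true", JOB: "checkout"}, target_rules)
opted_out = relabel({"job": "kubernetes-pods"}, target_rules)
dropped = relabel({"__name__": "go_gc_pauses_seconds_bucket", "le": "0.1"}, metric_rules)
kept = relabel({"__name__": "http_requests_total"}, metric_rules)
print(opted_in["job"], opted_out, dropped, kept)
```

The same function serves both stages: applied to a target's discovered labels it plays the role of relabel_configs, and applied to each scraped sample it plays the role of metric_relabel_configs.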
Figure 3.3: Relabeling pipeline (target metadata to stored series)
flowchart LR
SD[Service Discovery] -->|"__meta_kubernetes_* __meta_consul_*"| RC[relabel_configs keep / drop / rewrite]
RC --> TL[Final target labels "job, instance, namespace, pod"]
TL --> SCR[Scrape /metrics]
SCR --> RAW[Raw samples "name + labels + value"]
RAW --> MRC[metric_relabel_configs drop high-cardinality]
MRC --> TSDB[(TSDB head block)]
Two rules of thumb: (1) test relabeling against a known target with --log.level=debug; (2) drop unbounded labels (user IDs, request IDs, raw URL paths) before they reach the TSDB.
Key Takeaways — Data Model
metric_name + label_set = unique time series. Histograms = one series per bucket.
relabel_configs shapes target labels before scrape; metric_relabel_configs drops samples after.
Cardinality grows linearly with unique label-value combinations — cap unbounded labels at ingest.
Section 4: Storage, Retention, and Remote Write
Every sample follows a fixed path from scrape to retention:
1. WAL append. The sample (series ID + timestamp + value) is encoded as a CRC-checksummed record and written to the current WAL segment (~128 MB files: 00000001, 00000002, ...).
2. Head update. The TSDB looks up (or creates) the in-memory series for the label set and appends the sample to its current chunk.
The WAL is the durability story: if Prometheus crashes after step 1, the sample is replayed on restart from the most recent checkpoint.
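The ordering (log first, memory second) is the whole point, and a toy model shows it in a few lines. The record layout here (three fixed-width fields followed by a CRC-32 from zlib) is invented for illustration and is not Prometheus's actual WAL encoding; `TinyTSDB` and `encode_record` are hypothetical names.

```python
import struct
import zlib

RECORD = struct.Struct(">Qqd")  # series ID, timestamp in ms, float64 value

def encode_record(series_id: int, ts_ms: int, value: float) -> bytes:
    """Serialize one sample and append a CRC-32 of the payload."""
    payload = RECORD.pack(series_id, ts_ms, value)
    return payload + struct.pack(">I", zlib.crc32(payload))

class TinyTSDB:
    def __init__(self) -> None:
        self.wal = bytearray()                              # stands in for the WAL segment file
        self.head: dict[int, list[tuple[int, float]]] = {}  # in-memory series

    def append(self, series_id: int, ts_ms: int, value: float) -> None:
        # Step 1: make the sample durable before anything else sees it.
        self.wal += encode_record(series_id, ts_ms, value)
        # Step 2: only then update the in-memory head.
        self.head.setdefault(series_id, []).append((ts_ms, value))

db = TinyTSDB()
db.append(1, 1000, 0.5)
db.append(1, 16000, 0.75)
db.append(2, 1000, 7.0)
print(len(db.wal), db.head)
```

If the process died between the two steps, the record would already be in the log, so a replay could rebuild the head entry that never made it into memory.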
Block layout
Every ~2 hours the head is cut into an immutable on-disk block:
01J1H1Q2K3V3Y2.../
  meta.json     # ULID, minTime, maxTime, compaction level, sources, stats
  index         # symbol table + postings lists (label=value -> series IDs)
  chunks/
    000001      # XOR-compressed chunks (mmapped at query time)
    000002
  tombstones    # optional, deletion intervals
The index is the crown jewel: a symbol table interns every label name and value, every series gets an integer ID, and postings lists hold the sorted series IDs for each label=value. A PromQL selector like {__name__="http_requests_total", job="api"} resolves by intersecting two postings lists — milliseconds, no table scan.
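The intersection step can be sketched as a classic two-pointer merge over sorted ID lists. The postings dictionary and its series IDs below are made-up data, and the sketch handles equality matchers only; `intersect` and `select` are hypothetical names.

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """Intersect two sorted lists of series IDs in one linear pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# label=value pair -> sorted series IDs (illustrative data)
postings = {
    ("__name__", "http_requests_total"): [1, 2, 3, 7],
    ("job", "api"): [2, 3, 5, 7, 9],
    ("job", "web"): [1, 4],
}

def select(matchers: list[tuple[str, str]]) -> list[int]:
    """Resolve equality matchers by intersecting their postings lists."""
    lists = [postings.get(matcher, []) for matcher in matchers]
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result

print(select([("__name__", "http_requests_total"), ("job", "api")]))
```

The cost is proportional to the lengths of the lists involved, not to the total number of series in the block.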
Chunks themselves use delta-of-delta encoding for timestamps and Gorilla-style XOR for floats — typically ~1-2 bytes per sample. Chunks are memory-mapped, which is why Prometheus often shows modest RSS but a large OS page cache.
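Delta-of-delta is easiest to see on real numbers. The sketch below encodes a timestamp list as the first timestamp, the first delta, and then the change in delta at each step; with a steady scrape interval those changes hover around zero, which is what makes them cheap to store. It works on whole integers and leaves out the bit-packing that a real chunk encoder adds.

```python
def encode(timestamps: list[int]) -> list[int]:
    """Return [t0, first delta, delta-of-delta, delta-of-delta, ...]."""
    if len(timestamps) < 2:
        return list(timestamps)
    out = [timestamps[0], timestamps[1] - timestamps[0]]
    prev_delta = out[1]
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        out.append(delta - prev_delta)
        prev_delta = delta
    return out

def decode(encoded: list[int]) -> list[int]:
    """Invert encode() by re-accumulating deltas."""
    if len(encoded) < 2:
        return list(encoded)
    timestamps = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for dod in encoded[2:]:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps

# A 15 s scrape interval in milliseconds, with 1 ms of jitter on one scrape.
scrapes = [1000, 16000, 31000, 46001, 61000]
encoded = encode(scrapes)
print(encoded)  # [1000, 15000, 0, 1, -2]
```

After the first two entries the values are tiny, so a variable-length bit encoding can store most of them in a bit or two.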
Animation B — Block compaction over time
TSDB head → 2h blocks → compacted blocks
Head (in memory) accumulates ~2h of samples, then cuts to an immutable 2h block; with default settings the compactor merges adjacent blocks into 6h, then 18h, then multi-day blocks.
Figure 3.4: TSDB block lifecycle from scrape to retention
stateDiagram-v2
[*] --> Scrape: sample produced
Scrape --> WAL: append CRC record
WAL --> Head: update in-memory head chunk
Head --> Head: accumulate ~2h of samples
Head --> Block2h: cut immutable block (meta.json + index + chunks)
Block2h --> Block6h: compact adjacent blocks
Block6h --> Block18h: compact further
Block18h --> BlockMulti: compact multi-day
BlockMulti --> Deleted: retention horizon passed
Deleted --> [*]
Head --> Recovery: crash
Recovery --> Head: replay WAL from checkpoint.N
Crash recovery worked example
Suppose Prometheus has wal/checkpoint.10/ plus segments 00000011, 00000012, 00000013, and crashes mid-write on segment 13. On startup:
1. Existing immutable blocks load normally; no WAL is needed for them.
2. The head is rebuilt from checkpoint.10.
3. Segments 11, 12, and 13 are replayed; recovery stops at the first record with a failing CRC.
4. Any incomplete writes are lost, but the head is consistent and queries work immediately.
Slow startup almost always means either checkpoints are not happening (frequent restarts, OOM kills) or the WAL grew during a long outage. The fix: keep Prometheus healthy long enough to checkpoint.
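The recovery steps above can be sketched as a replay loop that starts from the checkpointed head, walks the segments in order, and stops at the first record that is truncated or fails its checksum. As before, the record layout (fixed-width fields plus a zlib CRC-32) is invented for illustration and is not the real WAL format; all function names are hypothetical.

```python
import struct
import zlib

RECORD = struct.Struct(">Qqd")   # series ID, timestamp in ms, float64 value
SIZE = RECORD.size + 4           # payload plus 4-byte CRC

def encode(series_id: int, ts_ms: int, value: float) -> bytes:
    payload = RECORD.pack(series_id, ts_ms, value)
    return payload + struct.pack(">I", zlib.crc32(payload))

def replay(checkpoint: dict, segments: list[bytes]) -> dict:
    """Rebuild the head from a checkpoint plus WAL segments, in order."""
    head = {sid: list(samples) for sid, samples in checkpoint.items()}
    for segment in segments:
        for offset in range(0, len(segment), SIZE):
            record = segment[offset:offset + SIZE]
            if len(record) < SIZE:
                return head                      # torn write: stop here
            payload = record[:RECORD.size]
            (crc,) = struct.unpack(">I", record[RECORD.size:])
            if zlib.crc32(payload) != crc:
                return head                      # corrupt record: stop here
            sid, ts_ms, value = RECORD.unpack(payload)
            head.setdefault(sid, []).append((ts_ms, value))
    return head

def make_segment(records: list[tuple[int, int, float]]) -> bytes:
    return b"".join(encode(*r) for r in records)

seg11 = make_segment([(1, 1000, 0.5), (2, 1000, 7.0)])
seg12 = make_segment([(1, 16000, 0.6)])
seg13 = make_segment([(2, 16000, 8.0)]) + encode(1, 31000, 0.7)[:10]  # crash mid-write
checkpoint_head = {1: [(0, 0.4)]}

head = replay(checkpoint_head, [seg11, seg12, seg13])
print(head)
```

The half-written sample for series 1 at 31000 ms is gone, but everything before it is intact, which matches step 4.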
Remote write
Local TSDB is intentionally short-term (weeks). For longer retention or central aggregation, remote write POSTs Snappy-compressed protobuf batches to a configured URL:
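Lossy under sustained backpressure. Remote write retries failed sends, so brief outages are usually absorbed. If the remote endpoint stays down or cannot keep up for long enough, Prometheus's queue falls behind and the unsent samples are eventually dropped. Watch the prometheus_remote_storage_* metrics.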
Not a replay protocol. If you enable it on day 30, the backend sees day 30 onward — not the previous 30 days from local TSDB.
write_relabel_configs can drop or rewrite labels on the way out — useful for stripping cardinality before it leaves your network.
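The first two properties can be illustrated with a toy bounded queue. This is a conceptual model only: the drop-oldest policy, the capacity, and the class name `RemoteWriteQueue` are assumptions made for the sketch, and Prometheus's real queue manager is considerably more involved.

```python
from collections import deque

class RemoteWriteQueue:
    """Bounded buffer of samples waiting to be sent to a remote endpoint."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.pending: deque = deque()
        self.dropped = 0

    def enqueue(self, sample) -> None:
        if len(self.pending) >= self.capacity:
            self.pending.popleft()     # assumed policy: drop the oldest sample
            self.dropped += 1
        self.pending.append(sample)

    def flush(self, send, batch_size: int) -> int:
        """Try to send one batch; leave it queued if the send fails."""
        batch = list(self.pending)[:batch_size]
        if not batch or not send(batch):
            return 0
        for _ in batch:
            self.pending.popleft()
        return len(batch)

received = []
endpoint = {"up": False}               # hypothetical backend state

def send(batch) -> bool:
    if endpoint["up"]:
        received.extend(batch)
    return endpoint["up"]

queue = RemoteWriteQueue(capacity=5)
for sample in range(8):                # eight samples arrive during an outage
    queue.enqueue(sample)
    queue.flush(send, batch_size=2)

endpoint["up"] = True                  # the backend recovers
while queue.flush(send, batch_size=2):
    pass
print(queue.dropped, received)
```

Samples 0 through 2 were evicted during the outage and are never re-sent after recovery, even though the local TSDB still holds them: the queue is a forward-only stream, not a replay of local storage.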
Animation C — Remote write fanout
Local Prometheus batches samples into a remote_write queue; configuring several remote_write endpoints (each with its own queue) fans the same samples out to multiple backends (Thanos / Mimir / Cortex).
Thanos vs. Cortex vs. Mimir
| Aspect | Thanos | Cortex | Grafana Mimir |
| --- | --- | --- | --- |
| Ingest model | Sidecar uploads blocks; optional Receive for remote write | Remote write only | Remote write only; simplified Cortex |
| Multi-tenancy | Label-based; basic | First-class tenant IDs | First-class with shuffle-sharding |
| HA dedup | Query-time | Ingest-time | Ingest-time, improved |
| Downsampling | Native (5m, 1h) via compactor | None at storage | None at storage |
| Object storage | S3/GCS/Azure/Swift | Blocks or chunks engine | Blocks only |
| Best fit (2024-25) | Bolt LTS onto existing Prom | Legacy deployments | New large-scale, multi-tenant |
Rules of thumb:
Thanos — topology is "many independent Prometheus servers + give us LTS and a global view." Sidecar adds minimal moving parts; downsampling makes year-long dashboards fast.
Mimir — you are building a multi-tenant metrics service with per-tenant quotas, isolation, and SLOs.
Cortex — mostly if you already run it; for greenfield, Mimir is usually better.
VictoriaMetrics — simpler single-binary alternative for teams without dedicated SREs.
Figure 3.5: Long-term storage architectures — Thanos vs. Mimir/Cortex
flowchart TB
subgraph Thanos["Thanos (sidecar pattern)"]
direction LR
TP[Prometheus] --> TS[Thanos Sidecar]
TS -->|upload 2h blocks| TOS[(Object Store S3 / GCS)]
TQ[Thanos Querier] --> TS
TQ --> TSG[Store Gateway]
TSG --> TOS
TC[Compactor downsample 5m / 1h] --> TOS
end
subgraph Mimir["Mimir / Cortex (remote-write platform)"]
direction LR
MP[Prometheus / Agent] -->|remote_write| MD[Distributor]
MD --> MI[Ingester head + WAL]
MI -->|flush blocks| MOS[(Object Store)]
MQF[Query Frontend + cache] --> MQ[Querier]
MQ --> MI
MQ --> MSG[Store Gateway]
MSG --> MOS
end
Key Takeaways — Storage & Remote Write
Ingest order: WAL first, head second — WAL is the durability story.
Index = symbol table + postings lists → PromQL selectors resolve via list intersection.
Remote write is lossy under backpressure and is not a replay protocol.
Choose Thanos for "add LTS to existing Prom," Mimir for multi-tenant platforms, Cortex for legacy.
Post-Reading Check — Half 2 (Sections 3 & 4)
Same questions — revise after reading.
6. What uniquely identifies a Prometheus time series?
A) The metric name alone
B) The metric name plus its complete label set
C) The job label combined with the value
D) The series ID assigned by the WAL
7. By default, if a target's exposed labels conflict with target labels, which set wins?
A) The target's exposed labels — they are more specific
B) Whichever label was set most recently
C) Prometheus's target labels (e.g., job, instance)
D) Both are kept under prefixed names
8. What does Prometheus do before updating the in-memory head block when it ingests a sample?
A) Compress the chunk with XOR encoding
B) Append a CRC-checksummed record to the WAL
C) Flush the previous 2h block to disk
D) POST the sample to remote_write
9. If a remote-write endpoint becomes unavailable for an extended period, what happens?
A) Prometheus blocks scrapes until the backend recovers
B) The queue fills and Prometheus eventually drops samples
C) Prometheus replays from local TSDB once the endpoint returns
D) The samples are written to a local replay log forever
10. Which long-term-storage backend is the natural fit when you already operate many Prometheus servers and want object-store-backed retention with downsampling and minimal disruption?