Chapter 3 — Prometheus Architecture and Data Model

Learning Objectives

Pre-Reading Check — Half 1 (Sections 1 & 2)

Answer first using your existing knowledge. You will get a chance to revise after reading.

1. Which subsystem inside the Prometheus binary actually opens HTTP connections to targets?

A) The rule engine
B) The retrieval / scrape loop
C) Alertmanager (bundled with Prometheus)
D) The remote write client

2. When a scrape fails (timeout or 5xx), what does Prometheus write to the TSDB?

A) Nothing — the gap is simply absent
B) The previous sample value, repeated
C) A synthetic up series with value 0
D) An entry in a dead-letter queue for retry

3. Why does Prometheus prefer pull over push for steady-state services?

A) Pull is faster on the wire than push
B) Pull gives implicit liveness, server-controlled load, and a debuggable wire protocol
C) Pull avoids the need for service discovery
D) Pull is the only model that supports labels

4. Which of these is a legitimate use of the Pushgateway?

A) Buffering high-cardinality SLO latency metrics from a long-running web service
B) Reporting last-run duration and status for a nightly batch job
C) Replacing remote write for shipping all metrics to long-term storage
D) Storing per-pod metrics for ephemeral Kubernetes pods

5. In a federation hierarchy, what should you typically pull through /federate?

A) Every raw series from every leaf Prometheus
B) Aggregated series produced by recording rules
C) The WAL files of upstream Prometheus servers
D) Alertmanager state for cross-region dedup

Section 1: Server Components

A single prometheus binary is best understood as a small distributed system that happens to run inside one process. Four cooperating subsystems share the binary:

Figure 3.1: Prometheus server components and external integrations

flowchart LR
  SD[Service Discovery<br/>K8s, Consul, EC2, DNS] --> R[Retrieval<br/>Scrape Loop]
  R -->|HTTP GET /metrics| Targets[(Scrape Targets)]
  R --> TSDB[(TSDB<br/>head + WAL + blocks)]
  TSDB --> API[HTTP API + Web UI]
  API --> Grafana[Grafana / Clients]
  TSDB --> RE[Rule Engine<br/>recording + alerting]
  RE --> AM[Alertmanager<br/>external]
  TSDB --> RW[Remote Write]
  RW --> LTS[(Long-term Storage<br/>Thanos / Mimir / Cortex)]

The scrape loop state machine

For every target, Prometheus runs five steps on a timer: resolve address → HTTP GET /metrics → parse the exposition format → relabel → append to the head block. Whether or not the scrape succeeds, the loop also writes built-in meta-series, including up, scrape_duration_seconds, and scrape_samples_scraped. These are the series to alert on when you want to monitor the monitoring itself.
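
The cycle above can be sketched in a few lines of Python. This is a simplified model, not Prometheus's actual implementation: fetch, parse, and relabel are hypothetical callables standing in for the HTTP GET, the exposition parser, and the relabeling step, and the "append" is just a returned list.

```python
import time


def scrape_once(fetch, parse, relabel, clock=time.monotonic):
    """Run one scrape cycle and return (samples, meta_series).

    fetch, parse, and relabel are hypothetical stand-ins for the HTTP GET,
    the exposition parser, and the relabeling step.
    """
    start = clock()
    samples, scraped, up = [], 0, 0.0
    try:
        parsed = parse(fetch())            # fetch() raises on timeout or 5xx
        scraped = len(parsed)
        # relabel() returns None for samples that should be dropped.
        samples = [s for s in map(relabel, parsed) if s is not None]
        up = 1.0
    except Exception:
        samples, scraped = [], 0           # a failed scrape stores no target samples
    meta = {
        "up": up,
        "scrape_duration_seconds": clock() - start,
        "scrape_samples_scraped": float(scraped),
    }
    return samples, meta


def parse_lines(body):
    # Toy parser: one "name value" pair per line.
    return [tuple(line.split()) for line in body.splitlines() if line]


def failing_fetch():
    raise TimeoutError("target did not answer in time")


ok_samples, ok_meta = scrape_once(lambda: "a 1\nb 2\n", parse_lines, lambda s: s)
bad_samples, bad_meta = scrape_once(failing_fetch, parse_lines, lambda s: s)
print(ok_meta["up"], ok_meta["scrape_samples_scraped"])    # 1.0 2.0
print(bad_meta["up"], bad_samples)                         # 0.0 []
```

The point of the sketch is that the meta-series are produced on both branches, which is what makes a failed scrape visible as data rather than as a gap.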

Analogy: the scrape loop is a postal carrier walking the same route every 15 seconds. They do not wait for a letter; they pick up whatever is in the mailbox at that moment. If the house has been demolished, the carrier files a report (up = 0) and moves on. Service discovery is the address book that tells the carrier which houses exist this morning.

HTTP API endpoints worth memorizing

| Endpoint | Purpose |
|---|---|
| /api/v1/query | Instant PromQL query |
| /api/v1/query_range | Range PromQL query (what Grafana uses) |
| /api/v1/targets | Active and dropped scrape targets |
| /api/v1/rules, /api/v1/alerts | Configured rules and active alerts |
| /-/reload, /-/healthy, /-/ready | Lifecycle endpoints |
| /metrics | Prometheus's own metrics (it scrapes itself) |
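
As a small illustration of the two query endpoints, the following Python sketch builds request URLs with the standard library only. The base URL http://localhost:9090 and the helper names are assumptions made for the example; the query, time, start, end, and step parameters are the ones the HTTP API expects.

```python
from urllib.parse import urlencode


def instant_query_url(base, promql, at=None):
    """URL for /api/v1/query; `at` is an optional Unix timestamp in seconds."""
    params = {"query": promql}
    if at is not None:
        params["time"] = str(at)
    return f"{base.rstrip('/')}/api/v1/query?{urlencode(params)}"


def range_query_url(base, promql, start, end, step_seconds):
    """URL for /api/v1/query_range over [start, end] with a fixed step."""
    params = {"query": promql, "start": str(start), "end": str(end),
              "step": str(step_seconds)}
    return f"{base.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


print(instant_query_url("http://localhost:9090", "up == 0"))
# http://localhost:9090/api/v1/query?query=up+%3D%3D+0
```

Fetching the resulting URL with any HTTP client returns a JSON document; the sketch stops at URL construction so that it runs without a server.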

Key Takeaways — Server Components

Section 2: The Pull Model

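In pull, targets expose /metrics and Prometheus calls it. In push, targets stream samples to a collector. Pull buys Prometheus a bundle of properties that are hard to replicate in push systems: implicit liveness (a failed scrape is itself a signal, recorded as up = 0), server-controlled load (the server decides how often to scrape), and a wire protocol you can debug by fetching /metrics yourself.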

Pull is awkward when targets live behind NAT, are very short-lived, or are conceptually event streams rather than stateful endpoints. Pull is therefore better for the steady state of long-lived services on trusted networks — which describes most of a Kubernetes cluster.

Animation A — The 15-second scrape pull cycle

Every ~15s the server fires HTTP GETs at all targets; OpenMetrics responses become samples in the TSDB head. In the example frame, node_exporter (10.0.4.17:9100), api-server (10.0.4.22:8080), and kube-state-metrics (10.0.4.31:8080) each answer GET /metrics with 200 OK, and the head receives samples such as up{job="node"} 1, http_requests_total{...} 1873, and a scrape duration of 0.04.

Figure 3.2: Scrape loop sequence (pull model with implicit liveness)

sequenceDiagram
  participant SD as Service Discovery
  participant P as Prometheus<br/>Scrape Loop
  participant T as Target /metrics
  participant DB as TSDB Head
  SD->>P: target list (IPs, labels)
  loop every 15s
    P->>T: HTTP GET /metrics
    alt target healthy
      T-->>P: 200 OK + OpenMetrics body
      P->>P: parse + relabel samples
      P->>DB: append samples + "up=1"
    else target unreachable
      T--xP: timeout / 5xx / parse error
      P->>DB: append "up=0" + staleness markers
    end
  end

Pushgateway — the narrow escape hatch

For short-lived batch jobs, the Pushgateway is a small HTTP server that accepts pushes, holds the most recently pushed values in memory, and exposes them on its own /metrics endpoint, which Prometheus scrapes like any other target. The right shape: a nightly job pushes nightly_import_last_run_status, _duration_seconds, and _timestamp_seconds right before it exits.
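
A minimal sketch of such a push, using only the Python standard library. The gateway address pushgateway.example:9091 and the fully spelled-out metric names are assumptions for illustration; the /metrics/job/<job> path and the plain-text body follow the Pushgateway's push API. The request is built but not sent, so the example runs without a gateway.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway_url, job, metrics):
    """Build a PUT that replaces all metrics in the group identified by `job`."""
    # One "name value" line per metric, in the text exposition format.
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    url = f"{gateway_url.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    return urllib.request.Request(url, data=body.encode("utf-8"), method="PUT")


request = build_push_request(
    "http://pushgateway.example:9091",          # hypothetical address
    "nightly_import",
    {
        # Assumed full names for the three series mentioned above.
        "nightly_import_last_run_status": 1,
        "nightly_import_last_run_duration_seconds": 42.5,
        "nightly_import_last_run_timestamp_seconds": 1717520400,
    },
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request, timeout=5)
```

Pushing once at exit, with a small fixed set of series, is what keeps this usage within the narrow case the Pushgateway is meant for.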

Anti-patterns to avoid:

For genuinely push-shaped workloads, the modern answer is usually the OpenTelemetry Collector (covered in chapter 5).

Federation

Federation lets one Prometheus pull a subset of series from another via /federate. The healthy hierarchy is leaf → aggregator → global, and you should pull only aggregated series (recording-rule outputs) through it. For raw data shipping, remote write is almost always the better tool.
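
To make "pull only aggregated series" concrete, here is a Python sketch that builds a /federate URL. The endpoint takes one or more match[] selectors; the leaf address and the job: prefix for recording-rule outputs are assumptions for the example (the prefix follows the common level:metric:operations naming convention).

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def federate_url(base, selectors):
    """URL that asks /federate for the series matching each selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base.rstrip('/')}/federate?{query}"


# Only recording-rule outputs (assumed to be named job:...) leave the leaf.
selectors = ['{__name__=~"job:.*"}']
url = federate_url("http://leaf-prometheus.example:9090", selectors)
print(url)

# Round-trip check: decoding the query string yields the original selectors.
decoded = parse_qs(urlsplit(url).query)["match[]"]
```

In a real setup the aggregator's scrape configuration would carry the same selectors as request parameters; the sketch only shows what ends up on the wire.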

Key Takeaways — Pull Model

Post-Reading Check — Half 1 (Sections 1 & 2)

Same questions — revise your answers based on what you just read.

Pre-Reading Check — Half 2 (Sections 3 & 4)

Answer first using what you already know.

6. What uniquely identifies a Prometheus time series?

A) The metric name alone
B) The metric name plus its complete label set
C) The job label combined with the value
D) The series ID assigned by the WAL

7. By default, if a target's exposed labels conflict with target labels, which set wins?

A) The target's exposed labels — they are more specific
B) Whichever label was set most recently
C) Prometheus's target labels (e.g., job, instance)
D) Both are kept under prefixed names

8. What does Prometheus do before updating the in-memory head block when it ingests a sample?

A) Compress the chunk with XOR encoding
B) Append a CRC-checksummed record to the WAL
C) Flush the previous 2h block to disk
D) POST the sample to remote_write

9. If a remote-write endpoint becomes unavailable for an extended period, what happens?

A) Prometheus blocks scrapes until the backend recovers
B) The queue fills and Prometheus eventually drops samples
C) Prometheus replays from local TSDB once the endpoint returns
D) The samples are written to a local replay log forever

10. Which long-term-storage backend is the natural fit when you already operate many Prometheus servers and want object-store-backed retention with downsampling and minimal disruption?

A) Thanos (sidecar pattern)
B) Grafana Mimir
C) Cortex
D) Pushgateway

Section 3: Data Model and Exposition Format

A time series is uniquely identified by its metric name plus its label set. Everything else — help text, type, unit — is metadata. A single sample looks like:

http_requests_total{job="api", method="GET", status="200", path="/users"}  1873  1717520400000

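That decomposes into:

  - Metric name: http_requests_total
  - Labels: job="api", method="GET", status="200", path="/users"
  - Value: 1873
  - Timestamp: 1717520400000 (milliseconds since the Unix epoch)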

This is the multi-dimensional idea: one metric name, many label dimensions, then slice at query time:

sum by (status) (rate(http_requests_total{job="api"}[5m]))
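
What that query computes can be approximated in plain Python. This is a deliberately simplified model: it takes the first and last sample in a 5-minute window per series and ignores counter resets and the extrapolation that the real rate() function performs. The sample values are made up for the example.

```python
from collections import defaultdict


def simple_rate(first, last):
    """Per-second increase between two (timestamp_seconds, value) samples."""
    (t0, v0), (t1, v1) = first, last
    return (v1 - v0) / (t1 - t0)


def sum_by(label, series_rates):
    """Group per-series rates by one label and add them up."""
    totals = defaultdict(float)
    for labels, rate in series_rates:
        totals[labels[label]] += rate
    return dict(totals)


# (labels, first sample, last sample) for a 300-second window; invented data.
window = [
    ({"method": "GET", "status": "200"}, (0, 1573.0), (300, 1873.0)),
    ({"method": "POST", "status": "200"}, (0, 189.0), (300, 219.0)),
    ({"method": "GET", "status": "500"}, (0, 1.0), (300, 4.0)),
]
rates = [(labels, simple_rate(first, last)) for labels, first, last in window]
by_status = sum_by("status", rates)
print(by_status)
```

Three stored series collapse into two result series, one per status value; the method dimension is summed away at query time rather than at collection time.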

Analogy: metric name = spreadsheet, labels = columns, label values = cell contents, each unique row = a time series. Adding a high-cardinality label (user_id) is like adding a column with a distinct value per user: the number of rows, and therefore series, grows with the number of distinct values, multiplied across every other label combination.
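
The arithmetic behind that warning is short enough to write down. The sketch below treats series identity as the metric name plus the full label set, and computes an upper bound on series count as the product of distinct values per label. The label counts are invented, and real series counts are usually lower because not every combination occurs.

```python
from math import prod


def series_identity(name, labels):
    """Identity of a series: metric name plus its complete label set."""
    return (name, frozenset(labels.items()))


def max_series(distinct_values_per_label):
    """Upper bound on series for one metric name."""
    return prod(distinct_values_per_label.values())


a = series_identity("http_requests_total", {"method": "GET", "status": "200"})
b = series_identity("http_requests_total", {"status": "200", "method": "GET"})
c = series_identity("http_requests_total", {"method": "GET", "status": "500"})
print(a == b, a == c)          # True False

bounded = {"method": 4, "status": 5, "path": 20}
print(max_series(bounded))                              # 400
print(max_series({**bounded, "user_id": 10_000}))       # 4000000
```

Label order does not matter for identity, but any change in a label value creates a new series, which is why one unbounded label dominates the total.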

OpenMetrics text exposition format

# HELP http_requests_total Total HTTP requests received.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1873
http_requests_total{method="GET",status="500"} 4
http_requests_total{method="POST",status="200"} 219

# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.2848256e+07

# HELP http_request_duration_seconds HTTP request latency.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 1500
http_request_duration_seconds_bucket{le="0.5"} 1860
http_request_duration_seconds_bucket{le="1.0"} 1872
http_request_duration_seconds_bucket{le="+Inf"} 1873
http_request_duration_seconds_sum 92.7
http_request_duration_seconds_count 1873
# EOF
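
To show how little structure the format needs, here is a deliberately small Python parser for lines like the ones above. It is a sketch under simplifying assumptions: it skips comment lines, does not handle escaped quotes inside label values, and ignores exemplars. It is not the parser Prometheus uses.

```python
import re

# name, optional {labels}, value, optional integer timestamp
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+(\d+))?$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_exposition(text):
    """Return a list of (name, labels, value, timestamp_or_None) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):    # HELP, TYPE, EOF, blank lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            raise ValueError(f"cannot parse line: {line!r}")
        name, raw_labels, value, ts = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value), int(ts) if ts else None))
    return samples


text = """\
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1873
process_resident_memory_bytes 4.2848256e+07
http_request_duration_seconds_bucket{le="+Inf"} 1873
# EOF
"""
parsed = parse_exposition(text)
for sample in parsed:
    print(sample)
```

Each parsed tuple maps directly onto the data model: the name and label dictionary identify the series, and the value (plus an optional timestamp) is the sample.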

Rules:

Honor labels, target labels, and relabeling

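By default, when a target's exposed labels conflict with the labels Prometheus attaches (most importantly job and instance), the target labels win and the conflicting exposed label is kept under an exported_ prefix (for example exported_job). Setting honor_labels: true flips that so the exposed labels win, which is useful for the Pushgateway and federation.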
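
The conflict rule can be modeled in a few lines of Python. This is a simplified sketch of the behavior described above, not Prometheus's code: it renames a conflicting exposed label to exported_<name> by default and keeps the exposed value when honor_labels is set. The label values are invented.

```python
def merge_labels(target_labels, exposed_labels, honor_labels=False):
    """Combine a sample's exposed labels with the target's labels."""
    merged = dict(exposed_labels)
    for name, value in target_labels.items():
        if name in merged:
            if honor_labels:
                continue                              # exposed value wins
            merged["exported_" + name] = merged[name]  # keep it under a prefix
        merged[name] = value                          # target label wins
    return merged


target = {"job": "pushgateway", "instance": "pgw.example:9091"}
exposed = {"job": "nightly_import", "status": "ok"}

default = merge_labels(target, exposed)
honored = merge_labels(target, exposed, honor_labels=True)
print(default["job"], default["exported_job"])    # pushgateway nightly_import
print(honored["job"])                             # nightly_import
```

The Pushgateway case shows why the flag exists: without it, every pushed series would carry the gateway's job name and the original job would only survive as exported_job.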

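Two flavors of relabeling:

  - relabel_configs run before the scrape, against the target's discovered labels (including the __meta_* labels), and decide which targets are scraped and which labels they carry.
  - metric_relabel_configs run after the scrape, against each sample's labels (including __name__), and decide which series reach the TSDB.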

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that opt in.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Set the job label from the pod's annotation, but only when the
      # annotation is non-empty (an empty replacement would remove the label).
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_job]
        regex: (.+)
        action: replace
        target_label: job
    metric_relabel_configs:
      # Drop a known cardinality-bomb metric.
      - source_labels: [__name__]
        regex: "go_gc_pauses_seconds_bucket"
        action: drop

Figure 3.3: Relabeling pipeline (target metadata to stored series)

flowchart LR
  SD[Service Discovery] -->|"__meta_kubernetes_*<br/>__meta_consul_*"| RC[relabel_configs<br/>keep / drop / rewrite]
  RC --> TL["Final target labels<br/>job, instance, namespace, pod"]
  TL --> SCR[Scrape /metrics]
  SCR --> RAW["Raw samples<br/>name + labels + value"]
  RAW --> MRC[metric_relabel_configs<br/>drop high-cardinality]
  MRC --> TSDB[(TSDB head block)]

Two rules of thumb: (1) test relabeling against a known target with --log.level=debug; (2) drop unbounded labels (user IDs, request IDs, raw URL paths) before they reach the TSDB.
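
For reasoning about rules before deploying them, a toy evaluator helps. The Python sketch below models only the keep, drop, and replace actions, joins source labels with ";", and uses full-string regex matching as an approximation of Prometheus's anchored matching. Real relabeling has more actions and details (for example, an empty replacement removes the label), so treat this as an aid to intuition rather than a reference.

```python
import re


def apply_rules(labels, rules):
    """Return the relabeled label set, or None if the rules drop it."""
    labels = dict(labels)
    for rule in rules:
        value = ";".join(labels.get(n, "") for n in rule["source_labels"])
        match = re.fullmatch(rule.get("regex", "(.*)"), value)
        action = rule.get("action", "replace")
        if action == "keep" and match is None:
            return None
        if action == "drop" and match is not None:
            return None
        if action == "replace" and match is not None:
            labels[rule["target_label"]] = match.group(1)
    return labels


SCRAPE = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"
JOB = "__meta_kubernetes_pod_annotation_prometheus_io_job"
target_rules = [
    {"source_labels": [SCRAPE], "action": "keep", "regex": "true"},
    {"source_labels": [JOB], "regex": "(.+)", "target_label": "job"},
]
metric_rules = [
    {"source_labels": ["__name__"], "regex": "go_gc_pauses_seconds_bucket",
     "action": "drop"},
]

opted_in = apply_rules({SCRAPE: "true", JOB: "checkout", "job": "kubernetes-pods"},
                       target_rules)
not_opted_in = apply_rules({"job": "kubernetes-pods"}, target_rules)
dropped = apply_rules({"__name__": "go_gc_pauses_seconds_bucket", "le": "0.1"},
                      metric_rules)
print(opted_in["job"], not_opted_in, dropped)    # checkout None None
```

The same function serves both stages because the mechanics are identical; only the input differs (target labels before the scrape, sample labels after it).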

Key Takeaways — Data Model

Section 4: Storage, Retention, and Remote Write

Every sample follows a fixed path from scrape to retention:

  1. WAL append. The sample (series ID + timestamp + value) is encoded as a CRC-checksummed record and written to the current WAL segment (~128 MB files: 00000001, 00000002, ...).
  2. Head update. The TSDB looks up (or creates) the in-memory series for the label set and appends the sample to its current chunk.

The WAL is the durability story: if Prometheus crashes after step 1, the sample is recovered on restart by replaying the WAL segments written after the most recent checkpoint.
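
The ordering of the two steps is the whole point, and a toy model makes it visible. In the Python sketch below, the record layout (series ID, millisecond timestamp, float value, then a CRC32) and the MiniTSDB class are invented for illustration; the real WAL format is different and also logs series definitions.

```python
import struct
import zlib

PAYLOAD = struct.Struct("<Qqd")    # series ID, timestamp in ms, float64 value


class MiniTSDB:
    def __init__(self):
        self.wal = bytearray()     # stands in for the current WAL segment file
        self.series_ids = {}       # label set -> series ID
        self.head = {}             # series ID -> list of (timestamp, value)

    def append(self, labels, ts_ms, value):
        key = frozenset(labels.items())
        series_id = self.series_ids.setdefault(key, len(self.series_ids) + 1)
        payload = PAYLOAD.pack(series_id, ts_ms, value)
        # Step 1: the checksummed record goes to the WAL first.
        self.wal += payload + struct.pack("<I", zlib.crc32(payload))
        # Step 2: only then is the in-memory head updated.
        self.head.setdefault(series_id, []).append((ts_ms, value))
        return series_id


db = MiniTSDB()
labels = {"__name__": "http_requests_total", "job": "api"}
sid = db.append(labels, 1717520400000, 1873.0)
db.append(labels, 1717520415000, 1874.0)
print(sid, len(db.wal), db.head[sid])
```

Because the record is on disk before the head changes, anything visible in memory can be rebuilt after a crash; the reverse order would allow queries to see samples that a restart then loses.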

Block layout

Every ~2 hours the head is cut into an immutable on-disk block:

01J1H1Q2K3V3Y2.../
  meta.json       # ULID, minTime, maxTime, compaction level, sources, stats
  index           # symbol table + postings lists (label=value -> series IDs)
  chunks/
    000001        # XOR-compressed chunks (mmapped at query time)
    000002
  tombstones      # optional, deletion intervals

The index is the crown jewel: a symbol table interns every label name and value, every series gets an integer ID, and postings lists hold the sorted series IDs for each label=value. A PromQL selector like {__name__="http_requests_total", job="api"} resolves by intersecting two postings lists — milliseconds, no table scan.
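
A toy inverted index shows why selector resolution is cheap. The Python sketch below keeps one sorted list of series IDs per label=value pair and intersects the lists with a two-pointer walk. The class and method names are invented, and only equality matchers are modeled.

```python
from collections import defaultdict


def intersect(a, b):
    """Intersect two sorted lists of series IDs in O(len(a) + len(b))."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


class PostingsIndex:
    def __init__(self):
        self.postings = defaultdict(list)   # (label, value) -> sorted series IDs
        self.series = []                    # series ID -> label set

    def add(self, labels):
        series_id = len(self.series)
        self.series.append(labels)
        for pair in labels.items():
            # IDs are handed out in increasing order, so each list stays sorted.
            self.postings[pair].append(series_id)
        return series_id

    def select(self, matchers):
        lists = [self.postings.get(pair, []) for pair in matchers.items()]
        if not lists:
            raise ValueError("at least one matcher is required")
        result = lists[0]
        for other in lists[1:]:
            result = intersect(result, other)
        return result


idx = PostingsIndex()
idx.add({"__name__": "http_requests_total", "job": "api", "status": "200"})
idx.add({"__name__": "http_requests_total", "job": "web", "status": "200"})
idx.add({"__name__": "http_requests_total", "job": "api", "status": "500"})
idx.add({"__name__": "up", "job": "api"})
print(idx.select({"__name__": "http_requests_total", "job": "api"}))   # [0, 2]
```

The work done is proportional to the length of the postings lists involved, not to the total number of series, which is the property the paragraph above describes.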

Chunks themselves use delta-of-delta encoding for timestamps and Gorilla-style XOR for floats — typically ~1-2 bytes per sample. Chunks are memory-mapped, which is why Prometheus often shows modest RSS but a large OS page cache.
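
The reason these encodings compress so well is easiest to see on data. The Python sketch below computes the intermediate values only (the delta-of-delta of timestamps and the XOR of consecutive float bit patterns); it does not implement the variable-length bit packing that turns those small values into one or two bytes.

```python
import struct


def delta_of_delta(timestamps):
    """Second-order differences of a timestamp sequence."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


def float_bits(x):
    """IEEE 754 bit pattern of a float64 as an integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_with_previous(values):
    """XOR of each value's bits with the previous value's bits."""
    bits = [float_bits(v) for v in values]
    return [a ^ b for a, b in zip(bits, bits[1:])]


regular = [1717520400000 + 15000 * i for i in range(5)]
jittery = [0, 15000, 30001, 45000]
print(delta_of_delta(regular))     # [0, 0, 0]
print(delta_of_delta(jittery))     # [1, -2]

xors = xor_with_previous([1873.0, 1873.0, 1874.0])
print([hex(x) for x in xors])
```

A perfectly regular scrape interval yields zeros, and an unchanged value XORs to zero, so the common case costs almost nothing to store; jitter and small value changes produce small numbers that still encode compactly.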

Animation B — Block compaction over time

The head (in memory) accumulates ~2h of samples, then cuts an immutable 2h block. The compactor merges adjacent blocks into progressively larger ones: with the default settings 6h, then 18h, then multi-day blocks. In the animation, the timeline runs from now (left) to older data (right): head, a row of 2h blocks, 6h compacted blocks, and larger compacted blocks.

Figure 3.4: TSDB block lifecycle from scrape to retention

stateDiagram-v2
  [*] --> Scrape: sample produced
  Scrape --> WAL: append CRC record
  WAL --> Head: update in-memory head chunk
  Head --> Head: accumulate ~2h of samples
  Head --> Block2h: cut immutable block<br/>(meta.json + index + chunks)
  Block2h --> Block6h: compact adjacent blocks
  Block6h --> Block18h: compact further
  Block18h --> BlockMulti: compact multi-day
  BlockMulti --> Deleted: retention horizon passed
  Deleted --> [*]
  Head --> Recovery: crash
  Recovery --> Head: replay WAL from checkpoint.N

Crash recovery worked example

Suppose Prometheus has wal/checkpoint.10/ plus segments 00000011, 00000012, 00000013, and crashes mid-write on segment 13. On startup:

  1. Existing immutable blocks load normally — no WAL needed.
  2. The head is rebuilt from checkpoint.10.
  3. Segments 11, 12, and 13 are replayed; recovery stops at the first record with a failing CRC.
  4. Any incomplete writes are lost, but the head is consistent and queries work immediately.
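
The four steps can be replayed in miniature. In the Python sketch below, a checkpoint is modeled as a dictionary of already-recovered samples and each segment as a byte string of fixed-size records with a trailing CRC32. The layout and function names are invented for illustration and do not match the real on-disk format.

```python
import struct
import zlib

PAYLOAD = struct.Struct("<Qqd")    # series ID, timestamp in ms, float64 value
CRC = struct.Struct("<I")
RECORD_SIZE = PAYLOAD.size + CRC.size


def encode(series_id, ts_ms, value):
    payload = PAYLOAD.pack(series_id, ts_ms, value)
    return payload + CRC.pack(zlib.crc32(payload))


def replay(checkpoint_head, segments):
    """Rebuild the head from a checkpoint, then apply segments in order."""
    head = {sid: list(samples) for sid, samples in checkpoint_head.items()}
    for segment in segments:
        for offset in range(0, len(segment), RECORD_SIZE):
            record = segment[offset:offset + RECORD_SIZE]
            if len(record) < RECORD_SIZE:
                return head                      # torn write at the tail
            payload = record[:PAYLOAD.size]
            (stored_crc,) = CRC.unpack(record[PAYLOAD.size:])
            if zlib.crc32(payload) != stored_crc:
                return head                      # stop at the first bad CRC
            series_id, ts_ms, value = PAYLOAD.unpack(payload)
            head.setdefault(series_id, []).append((ts_ms, value))
    return head


checkpoint_10 = {1: [(0, 1.0)]}
segment_11 = encode(1, 15000, 2.0)
segment_12 = encode(1, 30000, 3.0)
# Segment 13 ends with a record that was only partly written before the crash.
segment_13 = encode(1, 45000, 4.0) + encode(1, 60000, 5.0)[:10]

head = replay(checkpoint_10, [segment_11, segment_12, segment_13])
print(head[1])
```

The last, incomplete sample is lost while everything before it is recovered, which mirrors step 4: the head ends up consistent rather than complete.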

Slow startup almost always means either checkpoints are not happening (frequent restarts, OOM kills) or the WAL grew during a long outage. The fix: keep Prometheus healthy long enough to checkpoint.

Remote write

Local TSDB is intentionally short-term (weeks). For longer retention or central aggregation, remote write POSTs Snappy-compressed protobuf batches to a configured URL:

remote_write:
  - url: https://mimir.example.com/api/v1/push
    basic_auth:
      username: tenant-42
      password_file: /etc/prometheus/mimir-token
    queue_config:
      capacity: 10000
      max_shards: 30
      min_backoff: 30ms
      max_backoff: 5s
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop   # don't ship Go runtime metrics
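
To get a feel for what min_backoff and max_backoff mean during an outage, the Python sketch below lists the retry delays for one failing batch. It assumes the delay doubles after each failed attempt until it reaches the cap; that doubling is an assumption of this sketch, so check the behavior of your Prometheus version before relying on exact numbers.

```python
def backoff_schedule(min_backoff_ms, max_backoff_ms, attempts):
    """Retry delays in milliseconds, doubling up to the cap (assumed policy)."""
    delay, schedule = min_backoff_ms, []
    for _ in range(attempts):
        schedule.append(delay)
        delay = min(delay * 2, max_backoff_ms)
    return schedule


schedule = backoff_schedule(30, 5000, 10)
print(schedule)
# [30, 60, 120, 240, 480, 960, 1920, 3840, 5000, 5000]
print(sum(schedule) / 1000, "seconds spent waiting across 10 retries")
```

With the values from the configuration above, a shard reaches the 5s cap after eight failed attempts, so a long outage quickly turns into one retry every five seconds per shard while new samples keep arriving.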

Critical properties:

Animation C — Remote write fanout

Local Prometheus reads samples from its local TSDB into a sharded remote_write queue and sends them in batches; each configured endpoint gets its own queue, so the same samples can fan out to several backends. The animation shows three example destinations: Thanos Receive (S3 blocks), Grafana Mimir (distributor → ingester), and Cortex (multi-tenant).

Thanos vs. Cortex vs. Mimir

| Aspect | Thanos | Cortex | Grafana Mimir |
|---|---|---|---|
| Ingest model | Sidecar uploads blocks; optional Receive for remote write | Remote write only | Remote write only; simplified Cortex |
| Multi-tenancy | Label-based; basic | First-class tenant IDs | First-class with shuffle-sharding |
| HA dedup | Query-time | Ingest-time | Ingest-time, improved |
| Downsampling | Native (5m, 1h) via compactor | None at storage | None at storage |
| Object storage | S3/GCS/Azure/Swift | Blocks or chunks engine | Blocks only |
| Best fit (2024-25) | Bolt LTS onto existing Prom | Legacy deployments | New large-scale, multi-tenant |

Rules of thumb:

Figure 3.5: Long-term storage architectures — Thanos vs. Mimir/Cortex

flowchart TB
  subgraph Thanos["Thanos (sidecar pattern)"]
    direction LR
    TP[Prometheus] --> TS[Thanos Sidecar]
    TS -->|upload 2h blocks| TOS[(Object Store<br/>S3 / GCS)]
    TQ[Thanos Querier] --> TS
    TQ --> TSG[Store Gateway]
    TSG --> TOS
    TC[Compactor<br/>downsample 5m / 1h] --> TOS
  end
  subgraph Mimir["Mimir / Cortex (remote-write platform)"]
    direction LR
    MP[Prometheus / Agent] -->|remote_write| MD[Distributor]
    MD --> MI[Ingester<br/>head + WAL]
    MI -->|flush blocks| MOS[(Object Store)]
    MQF[Query Frontend<br/>+ cache] --> MQ[Querier]
    MQ --> MI
    MQ --> MSG[Store Gateway]
    MSG --> MOS
  end

Key Takeaways — Storage & Remote Write

Post-Reading Check — Half 2 (Sections 3 & 4)

Same questions — revise after reading.


Answer Explanations