Chapter 4 — PromQL: Querying Time-Series Data

Pre-Quiz — Half 1 (Sections 1-2)

1. You write the expression http_requests_total[5m] > 100 in the Prometheus expression browser and it errors. Why?

  a. The comparison operator > rejects range vectors; it only works on instant vectors and scalars.
  b. The metric must be wrapped in rate() before any comparison.
  c. The [5m] range modifier is invalid syntax in the expression browser.
  d. Comparisons in PromQL require a scalar on the left side, not a vector.

2. Which expression correctly produces a smooth, dashboard-friendly per-second request rate aggregated by service?

  a. irate(http_requests_total[5m])
  b. sum by (service) (rate(http_requests_total[5m]))
  c. rate(sum by (service) (http_requests_total)[5m])
  d. increase(http_requests_total[5m]) / 300 grouped by service

3. The canonical pattern for a p99 latency from classic histogram buckets is:

  a. sum by (service) (histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])))
  b. histogram_quantile(0.99, sum by (service) (rate(http_request_duration_seconds_bucket[5m])))
  c. histogram_quantile(0.99, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))
  d. avg by (service) (histogram_quantile(0.99, http_request_duration_seconds_bucket))

4. You want a spike-detection alert that fires when the instantaneous 5xx rate exceeds 10%. Which function fits best?

  a. rate with a 5-minute window: it is the smoothest signal available.
  b. irate with a 1-minute window: uses the last two samples for fast, spiky detection.
  c. increase with a 30-day window: the only function that handles counter resets.
  d. delta with a 1-minute window: designed for monotonic counters.

5. To compute the request rate as it appeared one week ago, you use:

  a. rate(http_requests_total[5m] offset 1w)
  b. rate(http_requests_total[5m]) offset 1w at the outer level only.
  c. rate(http_requests_total[5m] @ 1w): the @ modifier accepts durations.
  d. rate(http_requests_total[5m] - 7d) using duration subtraction.

1. PromQL Fundamentals

PromQL is a spreadsheet-formula language for time. Where a spreadsheet operates on static rows and columns, PromQL operates on labeled streams of timestamped float samples. Most expressions return a vector, and every vector has a shape: its number of series, its set of labels, and its temporal extent. Master the shape, and the language follows.

1.1 Instant vectors, range vectors, scalars

PromQL has four expression types; three dominate everyday work:

Type | What it is | Example | Use it for
Instant vector | Series with one sample at eval time | http_requests_total | Dashboard panels, alert conditions, most operators
Range vector | Series with a range of samples going back in time | http_requests_total[5m] | Input to rate, increase, *_over_time
Scalar | A single number, no labels | 0.99, time() | Thresholds, quantile arguments, constants

The crucial rule: most "PromQL math" operators require an instant vector. Comparison (>), binary arithmetic (+), and aggregation all reject range vectors. Range vectors exist almost exclusively to feed time-windowed functions like rate(), avg_over_time(), and increase().

1.2 Selectors and label matchers

Every PromQL query begins with a selector: a metric name plus optional matchers in {...}. Four matcher operators exist: =, !=, =~ (regex), !~ (negative regex). Regexes are anchored on both ends: code=~"5.." matches 500 and 503 but not 5000.

# Filter exact + regex
http_requests_total{job="api", code=~"5.."}

# Negative regex
http_requests_total{code!~"2..|3.."}

# Metric name as label
{__name__=~"http_.*", job="api"}
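The anchoring rule can be checked outside Prometheus. The sketch below uses Python's re.fullmatch as a stand-in for Prometheus's RE2-based matcher. For simple patterns like these the two agree, but that equivalence is an assumption of the sketch, and the helper name label_matches is hypothetical.

```python
import re


def label_matches(pattern: str, value: str) -> bool:
    """Approximate a PromQL =~ matcher: the pattern must match the whole value."""
    return re.fullmatch(pattern, value) is not None


print(label_matches("5..", "500"))      # True
print(label_matches("5..", "5000"))     # False: anchoring rejects the extra digit
print(label_matches("2..|3..", "204"))  # True
```

A negative matcher (!~) is simply the negation of the same full-value test.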

1.3 Offset and @ modifiers

The offset modifier shifts a query back by a relative duration; the @ modifier pins it to an absolute Unix timestamp (Prometheus 2.25+). Combined, they enable week-over-week comparisons and reproducible post-incident analysis.

# Week-over-week ratio
rate(http_requests_total[5m])
  /
rate(http_requests_total[5m] offset 1w)
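Because @ takes an absolute Unix timestamp rather than a duration, a post-incident query usually starts from a wall-clock time. The sketch below is one way to build such a query string in Python; the helper name pin_to_instant and the example date are illustrative assumptions, not part of any Prometheus client.

```python
from datetime import datetime, timezone


def pin_to_instant(selector: str, when: datetime) -> str:
    """Append an @ modifier that pins a selector to an absolute instant."""
    if when.tzinfo is None:
        # Naive datetimes would be interpreted in local time; refuse them.
        raise ValueError("pass a timezone-aware datetime")
    return f"{selector} @ {int(when.timestamp())}"


incident = datetime(2024, 1, 1, tzinfo=timezone.utc)
query = "rate(" + pin_to_instant("http_requests_total[5m]", incident) + ")"
print(query)  # rate(http_requests_total[5m] @ 1704067200)
```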

Subqueries ([1h:1m]) build a range vector out of an instant-vector expression evaluated at a step you specify. Useful, but expensive — prefer recording rules for anything you run repeatedly.

1.4 Data-shape transitions

Figure: data-shape transitions
  Raw samples in TSDB (timestamp, value, labels) -> Selector, e.g. http_requests_total{job="api"}
  Selector -> Instant vector (one sample per series at eval time)
  Instant vector -> Range vector: append [duration], e.g. [5m] (many samples per series)
  Range vector -> Instant vector: rate, increase, *_over_time
  Instant vector -> Aggregated instant vector: sum by, avg by, topk
  Aggregated instant vector -> Scalar (single number, no labels): scalar()
  Scalar -> Instant vector: comparison, arithmetic

2. Functions and Operators

2.1 rate, irate, and increase on counters

Counters are monotonically increasing numbers (http_requests_total, node_network_transmit_bytes_total). Raw counter values are almost never useful — you care about growth rate. Three functions answer that question with subtly different semantics:

Function | Output | Best for | Avoid for
rate(v[w]) | Per-second average over w, from the reset-adjusted difference between the first and last samples, extrapolated to the window edges | Dashboards, most alerts | Brief spike detection
irate(v[w]) | Instantaneous rate from the last two samples | Spike detection, short alerts | Long-term graphs (too noisy)
increase(v[w]) | Total increments across w = rate(v[w]) * w_seconds | SLO budgets, "how many" | Real-time rates; expecting integers

All three operate only on counters and automatically handle counter resets — when a value drops (e.g., pod restart from 12345 back to 0), the negative jump is ignored and only post-reset increments count. None of them work correctly on gauges.
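To make the reset handling concrete, here is a minimal Python sketch of the arithmetic behind the three functions. It is a simplification, not Prometheus's implementation: it skips the extrapolation to the window edges that rate() and increase() perform, and the sample data and function names are made up for illustration.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix_seconds, counter_value)


def simple_increase(samples: List[Sample]) -> float:
    """Sum increments across the window, treating any drop as a reset to 0."""
    total = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # After a reset the counter restarted from 0, so the whole value is new.
        total += value if value < prev else value - prev
        prev = value
    return total


def simple_rate(samples: List[Sample]) -> Optional[float]:
    """Per-second average between the first and last sample (no extrapolation)."""
    if len(samples) < 2:
        return None
    return simple_increase(samples) / (samples[-1][0] - samples[0][0])


def simple_irate(samples: List[Sample]) -> Optional[float]:
    """Per-second rate from the last two samples only."""
    if len(samples) < 2:
        return None
    return simple_rate(samples[-2:])


# Hypothetical window: the counter resets between t=120 and t=180.
window = [(0, 100), (60, 130), (120, 160), (180, 5), (240, 35), (300, 65)]
print(simple_increase(window))  # 125.0
print(simple_rate(window))
print(simple_irate(window))     # 0.5
```

With the reset in the middle of the window, the sketch counts 125 increments, where a naive last-minus-first subtraction would give 65 - 100 = -35.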

Animation A: rate() over a 5-minute sliding window
Counter samples arrive over a 5-minute (300 s) window. In the animation the counter grows by 155 units, so rate() reports roughly (last - first) / window_seconds = 155 / 300 ≈ 0.52 ops/s, a per-second average. rate() works from the first and last samples in the window, adjusted for resets and extrapolated to the window edges; it does not fit a regression. Long windows smooth jitter; short windows expose spikes.

A frequent stumble is dividing before aggregating. A sum of per-series ratios is not the ratio of the sums: aggregate numerator and denominator separately, then divide.

# WRONG - per-instance ratios summed nonsensically
sum by (service) (
  rate(http_requests_total{code=~"5.."}[5m])
    /
  rate(http_requests_total[5m])
)

# RIGHT - aggregate first, then divide
sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
  /
sum by (service) (rate(http_requests_total[5m]))
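A small numeric example shows how far apart the two orderings can land. The figures below are invented: two instances of one service, one nearly idle and one busy, with per-second error and request rates already computed.

```python
# Hypothetical per-instance rates (errors/s and requests/s) for one service.
instances = {
    "api-1": {"errors": 1.0, "requests": 10.0},     # quiet instance, 10% errors
    "api-2": {"errors": 50.0, "requests": 1000.0},  # busy instance, 5% errors
}


def sum_of_ratios(data: dict) -> float:
    """Divide per instance, then sum: the WRONG ordering."""
    return sum(v["errors"] / v["requests"] for v in data.values())


def ratio_of_sums(data: dict) -> float:
    """Sum numerator and denominator separately, then divide: the RIGHT ordering."""
    errors = sum(v["errors"] for v in data.values())
    requests = sum(v["requests"] for v in data.values())
    return errors / requests


print(round(sum_of_ratios(instances), 4))  # 0.15
print(round(ratio_of_sums(instances), 4))  # 0.0505
```

The service-wide error ratio is about 5%, yet summing the per-instance ratios reports 15%, and the gap grows with every instance added.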

Window-size rule of thumb: 3-5x the scrape interval. With a 15 s scrape, [1m] is the minimum meaningful window for rate; [5m] is the standard dashboard window.

Figure: how rate, irate, and increase read the same [5m] window of counter samples (t-5m ... t)
  rate(v[5m]): first and last samples in the window, reset-adjusted and extrapolated -> smooth per-second rate, good for dashboards and alerts
  irate(v[5m]): only the LAST two samples -> spiky per-second rate, good for short spike detection
  increase(v[5m]) = rate(v[5m]) * 300s: total increments in window -> total event count over the window, good for SLO budgets

2.2 histogram_quantile and bucket math

Latency, response sizes, and queue depths use histograms. A classic histogram exposes three families per metric: *_bucket{le="..."}, cumulative counters at each upper bound; *_sum, the sum of all observed values; and *_count, the number of observations.

The histogram_quantile() function reconstructs an approximate distribution from buckets and linearly interpolates within the bucket containing the target rank, assuming a uniform distribution inside each bucket.

Four rules must hold for a meaningful answer:

  1. Apply rate() to the buckets first. Buckets are counters; you want a rate of observations, not raw cumulative count.
  2. Preserve le in aggregation. Dropping le destroys the histogram structure.
  3. Aggregate before taking the quantile. Quantiles are not linear — you cannot average p99s across instances.
  4. All instances must share bucket boundaries. Mixed layouts produce meaningless interpolation.
# RIGHT - aggregate buckets preserving le, then one quantile
histogram_quantile(
  0.99,
  sum by (le, service) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)
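The interpolation step is easier to follow with numbers. The Python sketch below mirrors the documented behaviour for classic histograms under simplifying assumptions: buckets are already aggregated, sorted, and cumulative; the lowest bucket starts at 0; and the function name and bucket counts are invented for illustration.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (le upper bound, cumulative count or rate)


def bucket_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate a quantile by linear interpolation inside the target bucket."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread uniformly inside this bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan  # unreachable for well-formed cumulative buckets


buckets = [(0.1, 50), (0.25, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]
print(bucket_quantile(0.99, buckets))  # 0.75
print(bucket_quantile(0.50, buckets))  # 0.1
```

The p99 rank (99 of 100) falls halfway through the 0.5 to 1.0 bucket, so the estimate is 0.75 s regardless of where in that bucket the slow requests really sit. That is why bucket boundaries near the SLO threshold matter.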

Bucket-design tip: cluster boundaries tightly around your SLO threshold. If your SLO is 300 ms p99, choose buckets like 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 1.0 rather than the defaults. If p99 falls in the +Inf bucket, histogram_quantile() returns the upper bound of the highest finite bucket, which caps the reported value and hides the true tail. Fix it with better buckets, not by changing the query.

2.3 Aggregation operators

PromQL has a fixed set: sum, avg, min, max, count, count_values, stddev, stdvar, topk, bottomk, quantile, group. Each is modified by by (...) (keep these labels) or without (...) (drop these). In high-cardinality environments, without is often safer — it doesn't hide labels you forgot to mention.
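The difference between by and without is easiest to see on a toy data set. The Python sketch below models sum with either modifier over a handful of labeled values; the series, label names, and the sum_series helper are all hypothetical, and real PromQL additionally drops the metric name from aggregated results.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Series = Tuple[Dict[str, str], float]
LabelSet = FrozenSet[Tuple[str, str]]


def sum_series(series: List[Series],
               by: Optional[Set[str]] = None,
               without: Optional[Set[str]] = None) -> Dict[LabelSet, float]:
    """Group series by the labels that survive, then sum each group."""
    if (by is None) == (without is None):
        raise ValueError("pass exactly one of by= or without=")
    totals: Dict[LabelSet, float] = defaultdict(float)
    for labels, value in series:
        if by is not None:
            kept = {k: v for k, v in labels.items() if k in by}
        else:
            kept = {k: v for k, v in labels.items() if k not in without}
        totals[frozenset(kept.items())] += value
    return dict(totals)


data = [
    ({"service": "a", "pod": "p1", "zone": "x"}, 1.0),
    ({"service": "a", "pod": "p2", "zone": "x"}, 2.0),
    ({"service": "b", "pod": "p3", "zone": "y"}, 4.0),
]
print(sum_series(data, by={"service"}))    # zone is silently dropped
print(sum_series(data, without={"pod"}))   # zone survives alongside service
```

Both calls produce the same two totals here, but only the without form keeps the zone label that was never mentioned.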

Pre-Quiz — Half 2 (Sections 3-4)

1. The Prometheus level:metric:operation naming convention is intended primarily to:

  a. Compress series storage by hashing the rule name.
  b. Identify the aggregation scope (e.g., job, service, cluster), the underlying metric, and the transformation applied.
  c. Allow Alertmanager to auto-route by the leading level token.
  d. Enable the Prometheus UI to colour-code recording rules differently from raw metrics.

2. An alert is flapping every 60 seconds. The team adds for: 30m to silence it. Why is this an anti-pattern?

  a. A for clause longer than 10 minutes is rejected by Alertmanager.
  b. The for clause delays alerts; it does not improve query accuracy. The root cause is a noisy query (too-short rate window, too little aggregation) that should be fixed instead.
  c. Long for values disable the alert's pending state in the Prometheus UI.
  d. Recording rules cannot feed alerts that have for values above 5 minutes.

3. You have a counter that resets at midnight (process restart). Which expression handles the reset correctly without distortion?

  a. rate(sum by (job)(http_requests_total)[5m:]): aggregate first, then rate.
  b. sum by (job)(rate(http_requests_total[5m])): rate per series first, then aggregate.
  c. increase(sum by (job)(http_requests_total)[5m]): increase ignores resets after aggregation.
  d. delta(http_requests_total[5m]): delta automatically accounts for restarts on counters.

4. A scrape target stops being targeted by Prometheus. After 5 minutes, an alert up == 0 stops firing. Why?

  a. The up series itself becomes stale and returns no result after the lookback delta, so the alert silently disables itself. Use absent(up{...}) to detect missing series.
  b. Alertmanager auto-resolves any alert that has not been refreshed in 5 minutes.
  c. The up metric flips to NaN, which is treated as truthy by PromQL.
  d. Prometheus assumes the target is healthy when no samples have been received.

5. Which label is the textbook example of a cardinality explosion waiting to happen?

  a. method with values GET, POST, PUT, DELETE.
  b. status_code with ~40 possible HTTP codes.
  c. service with tens of values.
  d. user_id or trace_id: effectively unbounded values that produce one series per user/request.

3. Recording and Alerting Rules

PromQL queries can be slow. A histogram_quantile() over a 30-day window touching millions of series may take seconds — far too slow for a 10 s dashboard refresh or a 30 s alert evaluation. The fix is rules: PromQL expressions evaluated on a fixed schedule, storing results as either new metrics (recording rules) or alert states (alerting rules).

3.1 Recording rules

groups:
  - name: http_slos
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))

      - record: job:http_error_ratio:5m
        expr: |
          job:http_errors:rate5m
            /
          job:http_requests:rate5m

Naming convention: level:metric:operation. The level identifies aggregation scope (job, namespace, cluster), the metric identifies what is measured, and the operation describes the transformation (rate5m, histogram_quantile99).
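A convention like this is easy to enforce mechanically, for example in a lint step over rule files. The Python sketch below splits a rule name into its three parts; the RuleName type and parse_rule_name helper are hypothetical and not part of promtool or any Prometheus library.

```python
from typing import NamedTuple


class RuleName(NamedTuple):
    level: str      # aggregation scope, e.g. job
    metric: str     # what is measured, e.g. http_requests
    operation: str  # transformation applied, e.g. rate5m


def parse_rule_name(name: str) -> RuleName:
    """Split a level:metric:operation name, rejecting anything else."""
    parts = name.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"{name!r} does not follow level:metric:operation")
    return RuleName(*parts)


print(parse_rule_name("job:http_requests:rate5m"))
```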

Rule-chain guidelines:

3.2 Alerting rules and the for clause

- alert: HighRequestErrorRate
  expr: job:http_error_ratio:5m > 0.05
  for: 5m
  labels:
    severity: critical
    team: payments
  annotations:
    summary: "High error rate on {{ $labels.job }}"
    runbook_url: "https://runbooks.example.com/HighRequestErrorRate"

The for clause requires the condition to be continuously true for the specified duration before firing. for: 5m turns a 60 s blip into a non-event; a sustained 5-minute issue still pages.
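The pending-to-firing behaviour can be sketched as a small state machine. The Python below is an illustration under assumptions, not Prometheus's rule evaluator: it assumes evenly spaced evaluations, a boolean condition per evaluation, and a made-up function name.

```python
from typing import List


def alert_states(condition: List[bool], interval_s: int, for_s: int) -> List[str]:
    """Return the alert state after each evaluation of the rule."""
    states = []
    active_since = None  # time at which the condition most recently became true
    for i, is_true in enumerate(condition):
        now = i * interval_s
        if not is_true:
            active_since = None  # any gap resets the for timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# 30 s evaluations, for: 2m. A one-evaluation gap restarts the countdown.
history = [True, True, False, True, True, True, True, True]
print(alert_states(history, interval_s=30, for_s=120))
```

The first two true evaluations never fire because the single false one clears the timer; the alert fires only after the condition has then held for a full 120 s.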

Alert category | Typical for | Reasoning
User-visible symptom (errors, latency) | 1-5m | Fast page; users already see it
Resource saturation (CPU, memory, disk) | 5-15m | Avoid paging on transient spikes
SLO burn rate (fast window) | 2-5m | Catches rapid budget burn
SLO burn rate (slow window) | 30-60m | Long, sustained budget drift
Capacity / filling-up trends | 1h+ | Days-ahead warnings, not pages

Anti-pattern: do not use for to mask noisy query design. If a query flaps because the rate window is too short, fix the query — for only delays alerts, it does not improve accuracy.

3.3 Best practices for rule organisation

Figure: rule evaluation pipeline (sequence)
  1. Target /metrics -> Prometheus scraper: scrape every 15s; samples are written to the TSDB.
  2. Recording rule (base) job:http_requests:rate5m, every 30s eval interval: reads raw counters, evaluates sum by (job) (rate(...)), writes a new series.
  3. Recording rule (ratio) job:http_error_ratio:5m, every 30s eval interval: reads the rate5m series, computes errors / requests, writes the ratio series.
  4. Alerting rule HighRequestErrorRate, every 30s eval interval: reads the ratio series and checks expr > 0.05 sustained for 5m.
  5. Alerting rule -> Alertmanager: fire alert with labels + annotations.

4. Common Pitfalls

4.1 Counter resets across restarts

Counters should never decrease, but processes restart. When a counter drops from 12345 to 0, every rate-family function detects the drop and treats it as a reset, counting only post-reset increase. This works automatically — one of PromQL's most pleasant surprises — but edge cases bite:

# WRONG - sum then rate. A reset on one instance corrupts the sum.
rate(sum by (job)(http_requests_total)[5m:])

# RIGHT - rate per series first, then aggregate
sum by (job)(rate(http_requests_total[5m]))
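The size of the distortion is easy to reproduce. The Python sketch below applies a simplified reset-aware increase (no extrapolation) to two invented instances, first to their sum and then per series; all names and sample values are assumptions for illustration.

```python
from typing import List


def reset_aware_increase(values: List[float]) -> float:
    """Sum increments, treating any drop as a counter reset to 0."""
    total, prev = 0.0, values[0]
    for value in values[1:]:
        total += value if value < prev else value - prev
        prev = value
    return total


# Four scrapes, 60 s apart. Instance b restarts before the third scrape.
instance_a = [100.0, 110.0, 120.0, 130.0]
instance_b = [500.0, 510.0, 5.0, 15.0]

# WRONG ordering: sum first, then look for resets in the summed series.
summed = [a + b for a, b in zip(instance_a, instance_b)]
wrong = reset_aware_increase(summed)

# RIGHT ordering: handle resets per series, then add the results.
right = reset_aware_increase(instance_a) + reset_aware_increase(instance_b)

print(summed)  # [600.0, 620.0, 125.0, 145.0]
print(wrong)   # 165.0
print(right)   # 55.0
```

In the summed series the restart of one instance looks like a reset of the whole job, so the 120 units still held by the healthy instance are counted a second time.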

Animation B: counter reset detection
The counter rises to ~110, restarts to 0, then rises again to 60. A naive delta (last - first = 60 - 0 = 60) misses the reset. PromQL's rate-family functions detect the negative jump and treat it as a reset, summing the pre-reset (110) and post-reset (60) increments over the window rather than subtracting the drop or reporting a negative total.

4.2 Staleness markers and missing samples

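When a target disappears or a series stops being reported, Prometheus 2.0+ writes an explicit staleness marker, and instant queries stop returning that series right away. Where no marker can be written, the series drops out once its last sample is older than the lookback delta (5 minutes by default). Either way, the query returns no result for that series, not the last known value. Two query patterns guard against the consequences: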

# Detect when a critical metric is missing
absent(up{job="payments-api"})

# Default to zero when no data
sum by (service)(rate(payment_failures_total[5m])) or vector(0)

The lookback-delta setting (default 5m) controls how far back PromQL searches for the most recent sample. Do not change it without strong reason — it cascades to every query in the environment.
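The lookback rule itself fits in a few lines. The Python sketch below picks the value an instant query would see for one series; it deliberately ignores staleness markers, treats the window as open on the left, and uses invented names and sample data, so read it as a model of the idea and not of the engine.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix_seconds, value)


def instant_value(samples: List[Sample], eval_time: float,
                  lookback_s: float = 300.0) -> Optional[float]:
    """Return the newest sample within the lookback window, or None if absent."""
    in_window = [s for s in samples if eval_time - lookback_s < s[0] <= eval_time]
    if not in_window:
        return None  # the series simply vanishes from the result
    return max(in_window, key=lambda s: s[0])[1]


# Hypothetical target that was last scraped at t=30 and then went away.
up_samples = [(0.0, 1.0), (15.0, 1.0), (30.0, 1.0)]
print(instant_value(up_samples, eval_time=60.0))   # 1.0
print(instant_value(up_samples, eval_time=331.0))  # None
```

A None here is what makes an expression such as up == 0 return nothing at all, which is why absent() is needed to alert on the gap.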

4.3 Cardinality explosions

Cardinality (the number of unique time series) is the single largest scaling constraint in Prometheus. A metric with 10 labels, each having 100 possible values, can in principle produce 100^10 = 10^20 series. A few hundred thousand series per Prometheus is healthy; millions is painful; tens of millions usually crashes it.
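The worst case is a plain product of per-label value counts, which a few lines of Python can confirm. The label names and counts below are illustrative assumptions, and real metrics rarely populate every combination.

```python
import math
from typing import Dict


def worst_case_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on series for one metric: the product of label value counts."""
    return math.prod(label_cardinalities.values())


print(worst_case_series({"method": 10, "status_code": 40}))              # 400
print(worst_case_series({"method": 10, "status_code": 40, "pod": 500}))  # 200000
print(worst_case_series({f"label_{i}": 100 for i in range(10)}) == 10**20)  # True
```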

Label | Cardinality | Safe in metrics?
method | ~10 | Yes
status_code | ~40 | Yes
service | tens | Yes
pod | hundreds-thousands, churns | Risky
request_path (raw URL) | unbounded | No
user_id / email | millions | No
trace_id | every request | Absolutely not

Normalise at the source: bucket paths into templates (/users/:id/posts/:id), drop high-cardinality labels before exposing them, or move that information into logs and traces (where cardinality is cheap) instead of metrics.
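Path templating is usually done in the instrumentation layer before the label is set. As a rough illustration, the Python sketch below collapses purely numeric path segments into :id. The regex and the normalise_path helper are assumptions; real services often need router-aware templates to cover UUIDs, slugs, and query strings.

```python
import re

# A slash followed by digits only, up to the next slash or the end of the path.
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")


def normalise_path(path: str) -> str:
    """Replace numeric path segments with a :id placeholder."""
    return _NUMERIC_SEGMENT.sub("/:id", path)


print(normalise_path("/users/42/posts/7"))  # /users/:id/posts/:id
print(normalise_path("/v2/users/9"))        # /v2/users/:id
print(normalise_path("/healthz"))           # /healthz
```

Every user and post now maps to the same label value, so the route label stays bounded by the number of routes and not by the number of records.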

Animation C: cardinality explosion
  1 metric, no labels: 1 series
  + method (~10 values): 10 series
  + status_code (~40 values): 400 series
  + pod (churns, ~500 values): 200,000 series
  + user_id (unbounded, millions): > 200,000,000 series, Prometheus OOM
Each new label multiplies series count by its cardinality. Bounded labels (method, status_code) are safe; unbounded labels (user_id, trace_id, raw paths) explode the series count and crash Prometheus.

Diagnosis tools:

# Top 20 metrics by series count
topk(20, count by (__name__)({__name__=~".+"}))

# Distinct values for a specific label
count(count by (label_name)(metric_name))

# Total active series - alert when this grows fast
prometheus_tsdb_head_series

Figure: label-by-label series growth for http_requests_total
  1 metric, no labels: 1 series
  + method (~10 values): 10 series
  + status_code (~40 values): 400 series
  + pod (~500 churning values): 200,000 series
  + request_path raw URL (unbounded, 50k+): 10,000,000+ series, Prometheus OOM
  Fix: normalize the path to a template (route="/users/:id/posts/:id") or drop the label entirely
