Study Guide: Chapter 4 — PromQL: Querying Time-Series Data

Pre-Quiz — Half 1 (Sections 1-2)

1. You write the expression http_requests_total[5m] > 100 in the Prometheus expression browser and it errors. Why?

The comparison operator > rejects range vectors; it only works on instant vectors and scalars.
The metric must be wrapped in rate() before any comparison.
The [5m] range modifier is invalid syntax in the expression browser.
Comparisons in PromQL require a scalar on the left side, not a vector.

2. Which expression correctly produces a smooth, dashboard-friendly per-second request rate aggregated by service?

irate(http_requests_total[5m])
sum by (service) (rate(http_requests_total[5m]))
rate(sum by (service) (http_requests_total)[5m])
increase(http_requests_total[5m]) / 300 grouped by service

3. The canonical pattern for a p99 latency from classic histogram buckets is:

sum by (service) (histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (service) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))
avg by (service) (histogram_quantile(0.99, http_request_duration_seconds_bucket))

4. You want a spike-detection alert that fires when the instantaneous 5xx rate exceeds 10%. Which function fits best?

rate with a 5-minute window: it is the smoothest signal available.
irate with a 1-minute window: uses the last two samples for fast, spiky detection.
increase with a 30-day window: the only function that handles counter resets.
delta with a 1-minute window: designed for monotonic counters.

5. To compute the request rate as it appeared one week ago, you use:

rate(http_requests_total[5m] offset 1w)
rate(http_requests_total[5m]) offset 1w at the outer level only.
rate(http_requests_total[5m] @ 1w): the @ modifier accepts durations.
rate(http_requests_total[5m] - 7d) using duration subtraction.

1. PromQL Fundamentals

PromQL is a spreadsheet-formula language for time. Where a spreadsheet operates on static rows and columns, PromQL operates on labeled streams of timestamps and floats. Every expression returns a typed result, and every result has a shape: number of series, set of labels, and temporal extent. Master the shape, and the language follows.

1.1 Instant vectors, range vectors, scalars

PromQL has four expression types; three dominate everyday work:

Type	What it is	Example	Use it for
Instant vector	Series with one sample at eval time	`http_requests_total`	Dashboard panels, alert conditions, most operators
Range vector	Series with a range of samples going back in time	`http_requests_total[5m]`	Input to `rate`, `increase`, `*_over_time`
Scalar	A single number, no labels	`0.99`, `time()`	Thresholds, quantile arguments, constants

The crucial rule: most "PromQL math" operators require an instant vector. Comparison (>), binary arithmetic (+), and aggregation all reject range vectors. Range vectors exist almost exclusively to feed time-windowed functions like rate(), avg_over_time(), and increase().
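
As a minimal sketch of the rule above (the threshold of 100 is arbitrary), the first expression is rejected at parse time, while the next two are accepted because the comparison sees an instant vector:

```promql
# Rejected: comparison applied directly to a range vector
# http_requests_total[5m] > 100

# Accepted: the bare selector is an instant vector
http_requests_total > 100

# Accepted: rate() turns the range vector into an instant vector first
rate(http_requests_total[5m]) > 100
```

The two accepted forms answer different questions: the first compares the raw counter value, the second compares the per-second rate.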

1.2 Selectors and label matchers

Every PromQL query begins with a selector: a metric name plus optional matchers in {...}. Four matcher operators exist: =, !=, =~ (regex), !~ (negative regex). Regexes are anchored on both ends: code=~"5.." matches 500 and 503 but not 5000.

# Filter exact + regex
http_requests_total{job="api", code=~"5.."}

# Negative regex
http_requests_total{code!~"2..|3.."}

# Metric name as label
{__name__=~"http_.*", job="api"}

1.3 Offset and `@` modifiers

The offset modifier shifts a query back by a relative duration; the @ modifier pins it to an absolute Unix timestamp (Prometheus 2.25+). Combined, they enable week-over-week comparisons and reproducible post-incident analysis.

# Week-over-week ratio
rate(http_requests_total[5m])
  /
rate(http_requests_total[5m] offset 1w)
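
The @ modifier can be sketched the same way. The timestamp 1609746000 below is an arbitrary Unix time chosen for illustration, not a value taken from this chapter:

```promql
# Counter value as of a fixed instant (Unix timestamp in seconds)
http_requests_total @ 1609746000

# 5-minute rate over the window ending at that same instant
rate(http_requests_total[5m] @ 1609746000)

# Combined with offset: the offset is applied relative to the @ time
http_requests_total @ 1609746000 offset 5m
```

Because the timestamp is absolute, these queries should return the same result whenever they are re-run, which is what makes them useful for post-incident analysis.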

Subqueries ([1h:1m]) build a range vector out of an instant-vector expression evaluated at a step you specify. Useful, but expensive — prefer recording rules for anything you run repeatedly.
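
A small illustration of the subquery syntax, assuming the same http_requests_total counter used elsewhere in this section:

```promql
# Peak 5-minute request rate seen over the last hour, sampled every minute
max_over_time(rate(http_requests_total[5m])[1h:1m])
```

The inner rate() is evaluated once per step for every series, roughly sixty times per series here, which is where the cost comes from.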

1.4 Data-shape transitions

graph TD
    RAW["Raw samples in TSDB<br/>(timestamp, value, labels)"]
    SEL["Selector<br/>http_requests_total{job='api'}"]
    IV["Instant Vector<br/>one sample per series at eval time"]
    RV["Range Vector<br/>append [5m] - many samples per series"]
    AGG["Aggregated Instant Vector<br/>sum/avg/topk by labels"]
    S["Scalar<br/>single number, no labels"]
    RAW --> SEL
    SEL --> IV
    IV -->|"append [duration]"| RV
    RV -->|"rate, increase, *_over_time"| IV
    IV -->|"sum by, avg by, topk"| AGG
    AGG -->|"scalar()"| S
    S -->|"comparison, arithmetic"| IV

Key Takeaway

Every PromQL expression has a shape: instant vector, range vector, or scalar.
Most operators require instant vectors; range vectors exist to feed time-window functions.
Selector regexes are auto-anchored on both ends.
Use offset for relative time-shifts, @ for absolute timestamps.

2. Functions and Operators

2.1 `rate`, `irate`, and `increase` on counters

Counters are monotonically increasing numbers (http_requests_total, node_network_transmit_bytes_total). Raw counter values are almost never useful — you care about growth rate. Three functions answer that question with subtly different semantics:

Function	Output	Best for	Avoid for
`rate(v[w])`	Per-second average over `w`, computed from the first and last samples in the window (adjusted for resets) and extrapolated to the window edges	Dashboards, most alerts	Brief spike detection
`irate(v[w])`	Instantaneous rate from the last two samples	Spike detection, short alerts	Long-term graphs (too noisy)
`increase(v[w])`	Total increments across `w` = `rate(v[w]) * w_seconds`	SLO budgets, "how many"	Real-time rates; expecting integers

All three operate only on counters and automatically handle counter resets — when a value drops (e.g., pod restart from 12345 back to 0), the negative jump is ignored and only post-reset increments count. None of them work correctly on gauges.
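
For gauges, a different family of functions applies. The sketch below uses node_memory_MemAvailable_bytes purely as an example gauge; substitute any gauge you actually scrape:

```promql
# Per-second slope of a gauge, fitted by simple linear regression
deriv(node_memory_MemAvailable_bytes[15m])

# Net change of a gauge across the window (can be negative)
delta(node_memory_MemAvailable_bytes[15m])

# Smoothed gauge level over the window
avg_over_time(node_memory_MemAvailable_bytes[15m])
```

None of these apply reset handling, which is exactly why they suit values that are allowed to go down.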

A frequent stumble is dividing before aggregating. The sum of per-instance ratios is not the ratio of the sums, so aggregate numerator and denominator separately, then divide.

# WRONG - per-instance ratios summed nonsensically
sum by (service) (
  rate(http_requests_total{code=~"5.."}[5m])
    /
  rate(http_requests_total[5m])
)

# RIGHT - aggregate first, then divide
sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
  /
sum by (service) (rate(http_requests_total[5m]))

Window-size rule of thumb: 3-5x the scrape interval. With a 15 s scrape, [1m] is the minimum meaningful window for rate; [5m] is the standard dashboard window.

flowchart LR
    C["Counter samples in [5m] window<br/>t-5m ... t-4m ... t-3m ... t-2m ... t-1m ... t"]
    C --> R["rate(v[5m])<br/>first and last samples in window,<br/>extrapolated to the window edges"]
    C --> I["irate(v[5m])<br/>uses only the LAST<br/>two samples"]
    C --> N["increase(v[5m])<br/>= rate(v[5m]) * 300s<br/>total increments in window"]
    R --> RO["Smooth per-second rate<br/>good for dashboards and alerts"]
    I --> IO["Spiky per-second rate<br/>good for short spike detection"]
    N --> NO["Total event count over window<br/>good for SLO budgets"]

2.2 `histogram_quantile` and bucket math

Latency, response sizes, and queue depths use histograms. A classic histogram exposes three families per metric: *_bucket{le="..."} cumulative counters at each upper bound, *_sum total of observations, and *_count.
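
The *_sum and *_count families are useful on their own. As a sketch, assuming the http_request_duration_seconds histogram used later in this section, mean latency per service is the rate of the sum divided by the rate of the count:

```promql
# Mean request duration per service over the last 5 minutes
sum by (service) (rate(http_request_duration_seconds_sum[5m]))
  /
sum by (service) (rate(http_request_duration_seconds_count[5m]))
```

A mean hides tail behaviour, which is why the bucket series and histogram_quantile() are needed for p99.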

The histogram_quantile() function reconstructs an approximate distribution from buckets and linearly interpolates within the bucket containing the target rank, assuming a uniform distribution inside each bucket.

Four rules must hold for a meaningful answer:

Apply rate() to the buckets first. Buckets are counters; you want a rate of observations, not raw cumulative count.
Preserve le in aggregation. Dropping le destroys the histogram structure.
Aggregate before taking the quantile. Quantiles are not linear — you cannot average p99s across instances.
All instances must share bucket boundaries. Mixed layouts produce meaningless interpolation.

# RIGHT - aggregate buckets preserving le, then one quantile
histogram_quantile(
  0.99,
  sum by (le, service) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)
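
For contrast, here are sketches of three common ways the rules get broken. Each one parses, which is what makes them easy to ship by accident:

```promql
# WRONG - le dropped by the aggregation, so no bucket structure is left
histogram_quantile(
  0.99,
  sum by (service) (rate(http_request_duration_seconds_bucket[5m]))
)

# WRONG - per-series quantiles averaged afterwards
avg by (service) (
  histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
)

# WRONG - no rate(); the quantile reflects counts since process start
histogram_quantile(
  0.99,
  sum by (le, service) (http_request_duration_seconds_bucket)
)
```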

Bucket-design tip: cluster boundaries tightly around your SLO threshold. If your SLO is 300 ms p99, choose buckets like 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 1.0 rather than the defaults. If p99 falls in the +Inf bucket, histogram_quantile() returns the upper bound of the highest finite bucket, so the result is capped and understates the true latency. Fix it with better buckets, not by changing the query.
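
A bucket boundary placed exactly on the SLO threshold also allows a query that needs no interpolation at all. This is a sketch that assumes a le="0.3" bucket exists and that the label value is exposed as the string "0.3":

```promql
# Fraction of requests completing within 300 ms, per service
sum by (service) (rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
  /
sum by (service) (rate(http_request_duration_seconds_count[5m]))
```

If the client library renders the boundary differently (for example "0.30"), the matcher has to use that exact string.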

2.3 Aggregation operators

PromQL has a fixed set: sum, avg, min, max, count, count_values, stddev, stdvar, topk, bottomk, quantile, group. Each is modified by by (...) (keep these labels) or without (...) (drop these). In high-cardinality environments, without is often safer — it doesn't hide labels you forgot to mention.
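
A short sketch of the two grouping modifiers side by side. The instance and pod labels are assumed to exist on the series; adjust to whatever your targets expose:

```promql
# Keep only job and service; every other label is aggregated away
sum by (job, service) (rate(http_requests_total[5m]))

# Drop only instance and pod; every other label survives
sum without (instance, pod) (rate(http_requests_total[5m]))

# Five busiest services by request rate
topk(5, sum by (service) (rate(http_requests_total[5m])))
```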

Key Takeaway

Use rate() for smooth dashboards, irate() for spike alerts, increase() for "how many events" SLO budgets.
For histograms: rate the buckets first, preserve le, compute quantile after aggregating.
For ratios: aggregate numerator and denominator separately, then divide.
Rate windows should be 3-5x the scrape interval.

Post-Quiz — Half 1 (Sections 1-2)

1. You write the expression http_requests_total[5m] > 100 in the Prometheus expression browser and it errors. Why?

2. Which expression correctly produces a smooth, dashboard-friendly per-second request rate aggregated by service?

irate(http_requests_total[5m])
sum by (service) (rate(http_requests_total[5m]))
rate(sum by (service) (http_requests_total)[5m])
increase(http_requests_total[5m]) / 300 grouped by service

3. The canonical pattern for a p99 latency from classic histogram buckets is:

4. You want a spike-detection alert that fires when the instantaneous 5xx rate exceeds 10%. Which function fits best?

5. To compute the request rate as it appeared one week ago, you use:

Pre-Quiz — Half 2 (Sections 3-4)

1. The Prometheus level:metric:operation naming convention is intended primarily to:

Compress series storage by hashing the rule name.
Identify the aggregation scope (e.g., job, service, cluster), the underlying metric, and the transformation applied.
Allow Alertmanager to auto-route by the leading level token.
Enable the Prometheus UI to colour-code recording rules differently from raw metrics.

2. An alert is flapping every 60 seconds. The team adds for: 30m to silence it. Why is this an anti-pattern?

A for clause longer than 10 minutes is rejected by Alertmanager.
The for clause delays alerts; it does not improve query accuracy. The root cause is a noisy query (too-short rate window, too little aggregation) that should be fixed instead.
Long for values disable the alert's pending state in the Prometheus UI.
Recording rules cannot feed alerts that have for values above 5 minutes.

3. You have a counter that resets at midnight (process restart). Which expression handles the reset correctly without distortion?

rate(sum by (job)(http_requests_total)[5m:]): aggregate first, then rate.
sum by (job)(rate(http_requests_total[5m])): rate per series first, then aggregate.
increase(sum by (job)(http_requests_total)[5m]): increase ignores resets after aggregation.
delta(http_requests_total[5m]): delta automatically accounts for restarts on counters.

4. A scrape target stops being targeted by Prometheus. After 5 minutes, an alert up == 0 stops firing. Why?

The up series itself becomes stale and returns no result after the lookback delta, so the alert silently disables itself. Use absent(up{...}) to detect missing series.
Alertmanager auto-resolves any alert that has not been refreshed in 5 minutes.
The up metric flips to NaN, which is treated as truthy by PromQL.
Prometheus assumes the target is healthy when no samples have been received.

5. Which label is the textbook example of a cardinality explosion waiting to happen?

method with values GET, POST, PUT, DELETE.
status_code with ~40 possible HTTP codes.
service with tens of values.
user_id or trace_id: effectively unbounded values that produce one series per user/request.

3. Recording and Alerting Rules

PromQL queries can be slow. A histogram_quantile() over a 30-day window touching millions of series may take seconds — far too slow for a 10 s dashboard refresh or a 30 s alert evaluation. The fix is rules: PromQL expressions evaluated on a fixed schedule, storing results as either new metrics (recording rules) or alert states (alerting rules).

3.1 Recording rules

groups:
  - name: http_slos
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))

      - record: job:http_error_ratio:5m
        expr: |
          job:http_errors:rate5m
            /
          job:http_requests:rate5m

Naming convention: level:metric:operation. The level identifies aggregation scope (job, namespace, cluster), the metric identifies what is measured, and the operation describes the transformation (rate5m, histogram_quantile99).

Rule-chain guidelines:

Keep chains 2-3 levels deep — raw → base → service → cluster.
Don't rate a rate. If job:http_requests:rate5m is already per-second, rate(job:http_requests:rate5m[5m]) is meaningless.
Group by data source and scrape interval so dependent rules evaluate consistently.
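
To illustrate the "don't rate a rate" guideline: a recorded rate behaves like a gauge, so the gauge-style *_over_time functions are the ones that fit. A sketch using the job:http_requests:rate5m rule defined above:

```promql
# WRONG - the recorded series is already per-second and is not a counter
rate(job:http_requests:rate5m[5m])

# Fine - smooth or summarise the recorded series like any gauge
avg_over_time(job:http_requests:rate5m[1h])
max_over_time(job:http_requests:rate5m[1h])
```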

3.2 Alerting rules and the `for` clause

- alert: HighRequestErrorRate
  expr: job:http_error_ratio:5m > 0.05
  for: 5m
  labels:
    severity: critical
    team: payments
  annotations:
    summary: "High error rate on {{ $labels.job }}"
    runbook_url: "https://runbooks.example.com/HighRequestErrorRate"

The for clause requires the condition to be continuously true for the specified duration before firing. for: 5m turns a 60 s blip into a non-event; a sustained 5-minute issue still pages.
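
The pending and firing states can be inspected from PromQL itself, because Prometheus exposes a synthetic ALERTS series for active alerts. A sketch using the HighRequestErrorRate alert above:

```promql
# Condition is true, but the for duration has not yet elapsed
ALERTS{alertname="HighRequestErrorRate", alertstate="pending"}

# Condition has held continuously for the full for duration
ALERTS{alertname="HighRequestErrorRate", alertstate="firing"}
```

An alert that keeps appearing as pending and never reaches firing is usually a sign of a flapping condition.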

Alert category	Typical `for`	Reasoning
User-visible symptom (errors, latency)	1-5m	Fast page; users already see it
Resource saturation (CPU, memory, disk)	5-15m	Avoid paging on transient spikes
SLO burn rate (fast window)	2-5m	Catches rapid budget burn
SLO burn rate (slow window)	30-60m	Long, sustained budget drift
Capacity / filling-up trends	1h+	Days-ahead warnings, not pages

Anti-pattern: do not use for to mask noisy query design. If a query flaps because the rate window is too short, fix the query — for only delays alerts, it does not improve accuracy.
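
As a hedged sketch of what "fix the query" can look like, assuming a 15 s scrape interval: the first expression uses a window that holds only a couple of samples, while the second uses the standard 5-minute window with the same threshold.

```promql
# Flappy - a 30 s window at a 15 s scrape contains very few samples
sum by (job) (rate(http_requests_total{code=~"5.."}[30s]))
  /
sum by (job) (rate(http_requests_total[30s]))
  > 0.05

# Steadier - same condition over a 5-minute window
sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
  /
sum by (job) (rate(http_requests_total[5m]))
  > 0.05
```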

3.3 Best practices for rule organisation

Version control — rules live in Git, reviewed by service owners and SREs.
Lint in CI — promtool check rules plus custom linters for naming, required labels (severity, team, runbook_url), and forbidden patterns (no topk in alert exprs).
Tier by severity — critical pages a human, warning opens a ticket, info logs to a channel. Every critical requires a runbook URL.
Safe rollout — new rules ship as recording rules first, then warning alerts in a non-paging channel, finally critical.

sequenceDiagram
    participant T as "Target /metrics"
    participant P as "Prometheus scraper"
    participant R1 as "Recording rule (base)<br/>job:http_requests:rate5m"
    participant R2 as "Recording rule (ratio)<br/>job:http_error_ratio:5m"
    participant A as "Alerting rule<br/>HighRequestErrorRate"
    participant AM as "Alertmanager"
    T->>P: "scrape every 15s"
    P->>P: "write samples to TSDB"
    Note over R1: "every 30s eval interval"
    P->>R1: "read raw counters"
    R1->>R1: "sum by (job) (rate(...))"
    R1->>P: "write new series"
    Note over R2: "every 30s eval interval"
    R1->>R2: "read rate5m series"
    R2->>R2: "errors / requests"
    R2->>P: "write ratio series"
    Note over A: "every 30s eval interval"
    R2->>A: "read ratio series"
    A->>A: "expr > 0.05 sustained for 5m"
    A->>AM: "fire alert with labels + annotations"

Key Takeaway

Recording rules pre-compute expensive expressions; alerting rules fire when their expression returns any series for the full for duration.
Use level:metric:operation naming. Keep chains 2-3 deep. Never rate a rate.
Size for by alert category — never to mask noisy queries.
Treat rules as code: Git, CI lint, runbook URLs in annotations.

4. Common Pitfalls

4.1 Counter resets across restarts

Counters should never decrease, but processes restart. When a counter drops from 12345 to 0, every rate-family function detects the drop and treats it as a reset, counting only post-reset increase. This works automatically — one of PromQL's most pleasant surprises — but edge cases bite:

Frequent restarts inside a short window. A pod restarting every 30 s inside a 1-minute rate window produces noisy, misleading rates.
Gauges disguised as counters. Some metrics named _total are actually gauges. Rate functions silently produce nonsense. Verify metric type.
Aggregating across reset boundaries. sum two counter series, one resets, the sum jumps. Always apply rate() before aggregating:

# WRONG - sum then rate. A reset on one instance corrupts the sum.
rate(sum by (job)(http_requests_total)[5m:])

# RIGHT - rate per series first, then aggregate
sum by (job)(rate(http_requests_total[5m]))
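
To check whether frequent restarts are behind a noisy rate, resets can be counted directly. This is a sketch; process_start_time_seconds is assumed to be exposed by the target, as most client libraries do by default:

```promql
# Number of counter resets per series over the last hour
resets(http_requests_total[1h])

# Restarts inferred from changes in the process start time
changes(process_start_time_seconds{job="api"}[1h])
```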

4.2 Staleness markers and missing samples

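Prometheus 2.0+ writes an explicit staleness marker when a target disappears or a series stops being reported; from that point, instant queries return no result for that series rather than the last known value. When no marker is written (for example, for samples exposed with explicit timestamps), the series instead drops out five minutes after its last sample, which is the default lookback delta. Consequences: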

Alerts can silently disable themselves. If the alert is up == 0 and the up series stops being reported entirely, the alert never fires. Detect with absent().
Dashboards show gaps. Sparse metrics appear missing rather than zero. Use or vector(0) to substitute defaults.
Subqueries skip gaps. avg_over_time skips missing steps rather than treating them as zero.

# Detect when a critical metric is missing
absent(up{job="payments-api"})

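# Default to zero when no data. vector(0) carries no labels, so aggregate down to
# one label-less series; with by (service), an extra unlabeled 0 is always appended.
sum(rate(payment_failures_total[5m])) or vector(0)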
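
A range-based variant can be less jumpy than the instant absent() check. This sketch assumes a 10-minute tolerance, which is an arbitrary choice:

```promql
# Returns a result only if no up sample for the job existed in the last 10 minutes
absent_over_time(up{job="payments-api"}[10m])
```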

The lookback-delta setting (default 5m) controls how far back PromQL searches for the most recent sample. Do not change it without strong reason — it cascades to every query in the environment.

4.3 Cardinality explosions

Cardinality — the number of unique time series — is the single largest scaling constraint in Prometheus. A metric with 10 labels each having 100 possible values can in principle produce 10²⁰ series. A few hundred thousand series per Prometheus is healthy; millions is painful; tens of millions usually crashes.

Label	Cardinality	Safe in metrics?
`method`	~10	Yes
`status_code`	~40	Yes
`service`	tens	Yes
`pod`	hundreds-thousands, churns	Risky
`request_path` (raw URL)	unbounded	No
`user_id` / `email`	millions	No
`trace_id`	every request	Absolutely not

Normalise at the source: bucket paths into templates (/users/:id/posts/:id), drop high-cardinality labels before exposing them, or move that information into logs and traces (where cardinality is cheap) instead of metrics.

Diagnosis tools:

# Top 20 metrics by series count
topk(20, count by (__name__)({__name__=~".+"}))

# Distinct values for a specific label
count(count by (label_name)(metric_name))

# Total active series - alert when this grows fast
prometheus_tsdb_head_series
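
Two further sketches along the same lines. The growth threshold of 100000 is a placeholder, not a recommendation:

```promql
# Net series growth over the last hour (head series is a gauge, so delta, not rate)
delta(prometheus_tsdb_head_series[1h]) > 100000

# Series count per job, largest first
sort_desc(count by (job) ({__name__=~".+"}))
```

Like the top-20 query above, the second expression touches every series, so it is better run ad hoc than on a dashboard refresh.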

graph LR
    M["http_requests_total<br/>1 metric, no labels<br/>1 series"]
    M --> L1["+ method<br/>(~10 values)<br/>10 series"]
    L1 --> L2["+ status_code<br/>(~40 values)<br/>400 series"]
    L2 --> L3["+ pod<br/>(~500 churning values)<br/>200,000 series"]
    L3 --> L4["+ request_path raw URL<br/>(unbounded, 50k+)<br/>10,000,000+ series<br/>Prometheus OOM"]
    L4 --> FIX["Fix: normalize path to template<br/>route='/users/:id/posts/:id'<br/>or drop label entirely"]

Key Takeaway

Counter resets: rate per series first, aggregate afterwards.
Staleness: guard with absent() and or vector(0); never edit lookback-delta lightly.
Cardinality: drop unbounded labels at the source; never expose user_id, trace_id, raw URLs as label values.

Post-Quiz — Half 2 (Sections 3-4)

1. The Prometheus level:metric:operation naming convention is intended primarily to:

2. An alert is flapping every 60 seconds. The team adds for: 30m to silence it. Why is this an anti-pattern?

3. You have a counter that resets at midnight (process restart). Which expression handles the reset correctly without distortion?

4. A scrape target stops being targeted by Prometheus. After 5 minutes, an alert up == 0 stops firing. Why?

5. Which label is the textbook example of a cardinality explosion waiting to happen?
