PromQL is a spreadsheet-formula language for time. Where a spreadsheet operates on static rows and columns, PromQL operates on labeled streams of timestamped floats. Almost every expression returns a vector, and every vector has a shape: the number of series, the set of labels, and a temporal extent. Master the shape, and the language follows.
1.1 Instant vectors, range vectors, scalars
PromQL has four expression types; three dominate everyday work:
| Type | What it is | Example | Use it for |
| Instant vector | Series with one sample at eval time | http_requests_total | Dashboard panels, alert conditions, most operators |
| Range vector | Series with a range of samples going back in time | http_requests_total[5m] | Input to rate, increase, *_over_time |
| Scalar | A single number, no labels | 0.99, time() | Thresholds, quantile arguments, constants |
The crucial rule: most "PromQL math" operators require an instant vector. Comparison (>), binary arithmetic (+), and aggregation all reject range vectors. Range vectors exist almost exclusively to feed time-windowed functions like rate(), avg_over_time(), and increase().
1.2 Selectors and label matchers
Every PromQL query begins with a selector: a metric name plus optional matchers in {...}. Four matcher operators exist: =, !=, =~ (regex), !~ (negative regex). Regexes are anchored on both ends: code=~"5.." matches 500 and 503 but not 5000.
# Filter exact + regex
http_requests_total{job="api", code=~"5.."}
# Negative regex
http_requests_total{code!~"2..|3.."}
# Metric name as label
{__name__=~"http_.*", job="api"}
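The anchoring rule can be modelled in a few lines of Python. This is only a sketch: Prometheus uses RE2 syntax rather than Python's `re` module, and `matcher_matches` is a hypothetical helper name, but the whole-value matching behaviour is the same idea.

```python
import re

def matcher_matches(pattern: str, label_value: str) -> bool:
    """Model of the =~ matcher: the pattern must match the entire label value."""
    return re.fullmatch(pattern, label_value) is not None

def negative_matcher_matches(pattern: str, label_value: str) -> bool:
    """Model of the !~ matcher: true when the anchored pattern does not match."""
    return not matcher_matches(pattern, label_value)

if __name__ == "__main__":
    for code in ["500", "503", "5000", "200"]:
        print(code, matcher_matches("5..", code))
```

Because the match is anchored, `5..` accepts `500` and `503` but rejects `5000`; an unanchored search would have accepted all three.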
1.3 Offset and @ modifiers
The offset modifier shifts a query back by a relative duration; the @ modifier pins it to an absolute Unix timestamp (introduced in Prometheus 2.25 behind a feature flag and enabled by default in later releases). Between them, they enable week-over-week comparisons and reproducible post-incident analysis.
# Week-over-week ratio
rate(http_requests_total[5m])
/
rate(http_requests_total[5m] offset 1w)
Subqueries ([1h:1m]) build a range vector out of an instant-vector expression evaluated at a step you specify. Useful, but expensive — prefer recording rules for anything you run repeatedly.
1.4 Data-shape transitions
graph TD
RAW["Raw samples in TSDB<br/>(timestamp, value, labels)"]
SEL["Selector<br/>http_requests_total{job='api'}"]
IV["Instant Vector<br/>one sample per series at eval time"]
RV["Range Vector<br/>append [5m] - many samples per series"]
AGG["Aggregated Instant Vector<br/>sum/avg/topk by labels"]
S["Scalar<br/>single number, no labels"]
RAW --> SEL
SEL --> IV
IV -->|"append [duration]"| RV
RV -->|"rate, increase, *_over_time"| IV
IV -->|"sum by, avg by, topk"| AGG
AGG -->|"scalar()"| S
S -->|"comparison, arithmetic"| IV
Key Takeaway
- Every PromQL expression has a shape: instant vector, range vector, or scalar.
- Most operators require instant vectors; range vectors exist to feed time-window functions.
- Selector regexes are auto-anchored on both ends.
- Use offset for relative time-shifts, @ for absolute timestamps.
2.1 rate, irate, and increase on counters
Counters are monotonically increasing numbers (http_requests_total, node_network_transmit_bytes_total). Raw counter values are almost never useful — you care about growth rate. Three functions answer that question with subtly different semantics:
| Function | Output | Best for | Avoid for |
| rate(v[w]) | Per-second average over w, computed from the first and last samples in the window (adjusted for resets) and extrapolated to the window edges | Dashboards, most alerts | Brief spike detection |
| irate(v[w]) | Instantaneous rate from the last two samples | Spike detection, short alerts | Long-term graphs (too noisy) |
| increase(v[w]) | Total increments across w = rate(v[w]) * w_seconds | SLO budgets, "how many" | Real-time rates; expecting integers |
All three operate only on counters and automatically handle counter resets — when a value drops (e.g., pod restart from 12345 back to 0), the negative jump is ignored and only post-reset increments count. None of them work correctly on gauges.
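The reset handling can be illustrated with a simplified Python model. It is a sketch, not Prometheus's implementation: `simple_increase` is a hypothetical name, and the real functions additionally extrapolate to the window edges, which this version omits.

```python
def simple_increase(values):
    """Total counter growth across consecutive samples, tolerating resets.

    Simplified model: no extrapolation to the window boundaries.
    """
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur < prev:
            # The value dropped, so treat it as a restart from zero:
            # only the post-reset value counts as growth.
            total += cur
        else:
            total += cur - prev
    return total

if __name__ == "__main__":
    # A pod restart takes the counter from 12345 back to 0.
    samples = [12335.0, 12345.0, 0.0, 7.0, 20.0]
    print(simple_increase(samples))  # 30.0
```

The 12345-to-0 drop contributes nothing negative: the result is the 10 increments before the restart plus the 20 after it.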
A frequent stumble is dividing before aggregating. A sum of per-instance ratios is not the ratio of the summed rates, so aggregate numerator and denominator separately, then divide.
# WRONG - per-instance ratios summed nonsensically
sum by (service) (
rate(http_requests_total{code=~"5.."}[5m])
/
rate(http_requests_total[5m])
)
# RIGHT - aggregate first, then divide
sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
/
sum by (service) (rate(http_requests_total[5m]))
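A small numeric example shows why the order matters. The per-instance rates below are invented for illustration, and the function names are hypothetical.

```python
# Hypothetical per-instance rates (per second) for one service.
INSTANCES = {
    "a": {"errors": 1.0, "requests": 100.0},  # busy instance, 1% errors
    "b": {"errors": 5.0, "requests": 10.0},   # quiet instance, 50% errors
}

def sum_of_ratios(instances):
    """Divide per instance, then sum: what the WRONG query computes."""
    return sum(i["errors"] / i["requests"] for i in instances.values())

def ratio_of_sums(instances):
    """Sum numerator and denominator separately, then divide: the RIGHT query."""
    errors = sum(i["errors"] for i in instances.values())
    requests = sum(i["requests"] for i in instances.values())
    return errors / requests

if __name__ == "__main__":
    print(round(sum_of_ratios(INSTANCES), 4))  # 0.51
    print(round(ratio_of_sums(INSTANCES), 4))  # 0.0545
```

The service as a whole fails about 5.5% of requests; summing the per-instance ratios reports 51%, a number with no physical meaning.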
Window-size rule of thumb: 3-5x the scrape interval. With a 15 s scrape, [1m] is the minimum meaningful window for rate; [5m] is the standard dashboard window.
flowchart LR
C["Counter samples in [5m] window<br/>t-5m ... t-4m ... t-3m ... t-2m ... t-1m ... t"]
C --> R["rate(v[5m])<br/>first-to-last increase, reset-adjusted,<br/>extrapolated to the window edges"]
C --> I["irate(v[5m])<br/>uses only the LAST<br/>two samples"]
C --> N["increase(v[5m])<br/>= rate(v[5m]) * 300s<br/>total increments in window"]
R --> RO["Smooth per-second rate<br/>good for dashboards and alerts"]
I --> IO["Spiky per-second rate<br/>good for short spike detection"]
N --> NO["Total event count over window<br/>good for SLO budgets"]
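The difference between the two per-second functions can be sketched in Python. Both functions are simplified models with hypothetical names: they assume at least two samples and no counter resets, and `simple_rate` skips the boundary extrapolation the real `rate()` performs.

```python
def simple_rate(samples):
    """Average per-second growth between the first and last sample in the window."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)

def simple_irate(samples):
    """Per-second growth between the last two samples only."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    return (v_last - v_prev) / (t_last - t_prev)

if __name__ == "__main__":
    # (timestamp_seconds, counter_value): steady 1/s, then a burst in the final 15 s.
    window = [(0.0, 0.0), (15.0, 15.0), (30.0, 30.0), (45.0, 45.0), (60.0, 105.0)]
    print(simple_rate(window))   # 1.75
    print(simple_irate(window))  # 4.0
```

The burst in the final scrape interval dominates the `irate`-style result but is averaged down across the whole window by the `rate`-style one, which is why the former suits spike detection and the latter suits dashboards.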
2.2 histogram_quantile and bucket math
Latency, response sizes, and queue depths use histograms. A classic histogram exposes three families per metric: *_bucket{le="..."}, cumulative counters at each upper bound; *_sum, the total of all observed values; and *_count, the number of observations.
The histogram_quantile() function reconstructs an approximate distribution from buckets and linearly interpolates within the bucket containing the target rank, assuming a uniform distribution inside each bucket.
Four rules must hold for a meaningful answer:
- Apply rate() to the buckets first. Buckets are counters; you want a rate of observations, not raw cumulative count.
- Preserve le in aggregation. Dropping le destroys the histogram structure.
- Aggregate before taking the quantile. Quantiles are not linear — you cannot average p99s across instances.
- All instances must share bucket boundaries. Mixed layouts produce meaningless interpolation.
# RIGHT - aggregate buckets preserving le, then one quantile
histogram_quantile(
0.99,
sum by (le, service) (
rate(http_request_duration_seconds_bucket[5m])
)
)
Bucket-design tip: cluster boundaries tightly around your SLO threshold. If your SLO is 300 ms p99, choose buckets like 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 1.0 rather than the defaults. If the target rank falls in the +Inf bucket, histogram_quantile() cannot interpolate and returns the upper bound of the second-highest bucket (1.0 in this layout), which silently understates the true latency. Fix it with better buckets, not by changing the query.
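The interpolation inside a classic histogram can be sketched in Python. This is a simplified model, not the Prometheus source: `classic_histogram_quantile` is a hypothetical name, the input is assumed to be already rated and aggregated, the lowest bucket is assumed to start at 0, and the bucket counts are invented.

```python
import math

def classic_histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper bound,
    ending with the +Inf bucket. Assumes at least two buckets.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1] for this sketch")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # No finite upper bound to interpolate towards:
        # fall back to the highest finite boundary.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    prev_count = buckets[i - 1][1] if i > 0 else 0.0
    # Assume observations are spread uniformly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)

if __name__ == "__main__":
    buckets = [(0.1, 50.0), (0.2, 80.0), (0.3, 95.0), (0.5, 99.0), (math.inf, 100.0)]
    print(classic_histogram_quantile(0.9, buckets))    # roughly 0.267
    print(classic_histogram_quantile(0.999, buckets))  # 0.5
```

The second call shows why bucket layout matters: once the target rank lands in the +Inf bucket, the estimate is pinned to the highest finite boundary regardless of how slow the slowest requests really were.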
2.3 Aggregation operators
PromQL has a fixed set: sum, avg, min, max, count, count_values, stddev, stdvar, topk, bottomk, quantile, group. Each is modified by by (...) (keep these labels) or without (...) (drop these). In high-cardinality environments, without is often safer — it doesn't hide labels you forgot to mention.
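The difference between by and without can be modelled on plain label dictionaries. This is an illustrative sketch with hypothetical function names and label values; it ignores the metric name, which aggregation drops anyway.

```python
from collections import defaultdict

def sum_by(series, keep):
    """Model of sum by (...): group on the listed labels only."""
    out = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep))
        out[key] += value
    return dict(out)

def sum_without(series, drop):
    """Model of sum without (...): group on every label except the listed ones."""
    out = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        out[key] += value
    return dict(out)

if __name__ == "__main__":
    series = [
        ({"job": "api", "instance": "a", "zone": "eu"}, 1.0),
        ({"job": "api", "instance": "b", "zone": "us"}, 2.0),
    ]
    print(sum_by(series, {"job"}))            # one group: zone is silently merged
    print(sum_without(series, {"instance"}))  # two groups: zone survives
```

With by (job), the zone label disappears even though nobody chose to drop it; with without (instance), only the named label goes and everything else is preserved.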
Key Takeaway
- Use rate() for smooth dashboards, irate() for spike alerts, increase() for "how many events" SLO budgets.
- For histograms: rate the buckets first, preserve le, compute quantile after aggregating.
- For ratios: aggregate numerator and denominator separately, then divide.
- Rate windows should be 3-5x the scrape interval.
PromQL queries can be slow. A histogram_quantile() over a 30-day window touching millions of series can take tens of seconds, far too slow for a 10 s dashboard refresh or a 30 s alert evaluation. The fix is rules: PromQL expressions evaluated on a fixed schedule, storing results as either new metrics (recording rules) or alert states (alerting rules).
3.1 Recording rules
groups:
  - name: http_slos
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
      - record: job:http_error_ratio:5m
        expr: |
          job:http_errors:rate5m
          /
          job:http_requests:rate5m
Naming convention: level:metric:operation. The level identifies aggregation scope (job, namespace, cluster), the metric identifies what is measured, and the operation describes the transformation (rate5m, histogram_quantile99).
Rule-chain guidelines:
- Keep chains 2-3 levels deep — raw → base → service → cluster.
- Don't rate a rate. If job:http_requests:rate5m is already per-second, rate(job:http_requests:rate5m[5m]) is meaningless.
- Group by data source and scrape interval so dependent rules evaluate consistently.
3.2 Alerting rules and the for clause
- alert: HighRequestErrorRate
  expr: job:http_error_ratio:5m > 0.05
  for: 5m
  labels:
    severity: critical
    team: payments
  annotations:
    summary: "High error rate on {{ $labels.job }}"
    runbook_url: "https://runbooks.example.com/HighRequestErrorRate"
The for clause requires the condition to be continuously true for the specified duration before firing. for: 5m turns a 60 s blip into a non-event; a sustained 5-minute issue still pages.
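The pending-then-firing behaviour can be sketched as a small state machine. This is a simplified model with a hypothetical function name: it tracks a single alert instance and assumes evaluations happen exactly on the interval.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_by_eval: one boolean per evaluation, True when the expression
    returned a result for this alert instance.
    """
    states = []
    active_since = None
    for i, condition in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not condition:
            # Any gap resets the clock; the alert must start over.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states

if __name__ == "__main__":
    # 30 s evaluations, for: 5m. A 60 s blip, one clear evaluation, then a sustained problem.
    conditions = [True, True, False] + [True] * 11
    for i, state in enumerate(alert_states(conditions, 30, 300)):
        print(f"t={i * 30:>3}s {state}")
```

The blip never leaves the pending state, and the sustained problem fires only once the condition has held for the full five minutes since it last became true.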
| Alert category | Typical for | Reasoning |
| User-visible symptom (errors, latency) | 1-5m | Fast page; users already see it |
| Resource saturation (CPU, memory, disk) | 5-15m | Avoid paging on transient spikes |
| SLO burn rate (fast window) | 2-5m | Catches rapid budget burn |
| SLO burn rate (slow window) | 30-60m | Long, sustained budget drift |
| Capacity / filling-up trends | 1h+ | Days-ahead warnings, not pages |
Anti-pattern: do not use for to mask noisy query design. If a query flaps because the rate window is too short, fix the query — for only delays alerts, it does not improve accuracy.
3.3 Best practices for rule organisation
- Version control: rules live in Git, reviewed by service owners and SREs.
- Lint in CI: promtool check rules plus custom linters for naming, required labels and annotations (severity, team, runbook_url), and forbidden patterns (no topk in alert exprs).
- Tier by severity: critical pages a human, warning opens a ticket, info logs to a channel. Every critical requires a runbook URL.
- Safe rollout: new rules ship as recording rules first, then warning alerts in a non-paging channel, finally critical.
sequenceDiagram
participant T as "Target /metrics"
participant P as "Prometheus scraper"
participant R1 as "Recording rule (base)<br/>job:http_requests:rate5m"
participant R2 as "Recording rule (ratio)<br/>job:http_error_ratio:5m"
participant A as "Alerting rule<br/>HighRequestErrorRate"
participant AM as "Alertmanager"
T->>P: "scrape every 15s"
P->>P: "write samples to TSDB"
Note over R1: "every 30s eval interval"
P->>R1: "read raw counters"
R1->>R1: "sum by (job) (rate(...))"
R1->>P: "write new series"
Note over R2: "every 30s eval interval"
R1->>R2: "read rate5m series"
R2->>R2: "errors / requests"
R2->>P: "write ratio series"
Note over A: "every 30s eval interval"
R2->>A: "read ratio series"
A->>A: "expr > 0.05 sustained for 5m"
A->>AM: "fire alert with labels + annotations"
Key Takeaway
- Recording rules pre-compute expensive expressions; alerting rules fire when their expression returns any series at all, whatever the sample values.
- Use level:metric:operation naming. Keep chains 2-3 deep. Never rate a rate.
- Size for by alert category, never to mask noisy queries.
- Treat rules as code: Git, CI lint, runbook URLs in annotations.
4.1 Counter resets across restarts
Counters should never decrease, but processes restart. When a counter drops from 12345 to 0, every rate-family function detects the drop and treats it as a reset, counting only post-reset increase. This works automatically — one of PromQL's most pleasant surprises — but edge cases bite:
- Frequent restarts inside a short window. A pod restarting every 30 s inside a 1-minute rate window produces noisy, misleading rates.
- Gauges disguised as counters. Some metrics named _total are actually gauges. Rate functions silently produce nonsense. Verify metric type.
- Aggregating across reset boundaries. If you sum two counter series and one of them resets, the sum drops, and rate() misreads that drop as a reset of the whole sum. Always apply rate() before aggregating:
# WRONG - sum then rate. A reset on one instance corrupts the sum.
rate(sum by (job)(http_requests_total)[5m:])
# RIGHT - rate per series first, then aggregate
sum by (job)(rate(http_requests_total[5m]))
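A numeric sketch makes the corruption visible. It uses the same simplified, extrapolation-free reset model as the rate family; the sample values and the helper name `increase_with_resets` are invented for illustration.

```python
def increase_with_resets(values):
    """Counter growth across samples; a drop is treated as a restart from zero."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += cur if cur < prev else cur - prev
    return total

# Two instances of the same job, sampled at the same three timestamps.
INSTANCE_A = [100.0, 110.0, 120.0]  # no restart
INSTANCE_B = [50.0, 60.0, 5.0]      # restarts before the third sample

def rate_then_sum():
    """RIGHT order: handle resets per series, then aggregate."""
    return increase_with_resets(INSTANCE_A) + increase_with_resets(INSTANCE_B)

def sum_then_rate():
    """WRONG order: aggregate raw counters, then look for resets in the sum."""
    summed = [a + b for a, b in zip(INSTANCE_A, INSTANCE_B)]
    return increase_with_resets(summed)

if __name__ == "__main__":
    print(rate_then_sum())  # 35.0
    print(sum_then_rate())  # 145.0
```

The true growth is 35 events. Summing first turns one instance's restart into an apparent reset of the whole job, and the reset logic then counts the entire post-drop sum as new growth.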
4.2 Staleness markers and missing samples
Prometheus 2.0+ writes an explicit staleness marker when a target disappears or a series stops being reported, and from that point instant queries return no result for the series rather than its last known value. When no marker is written (for example, for samples ingested with explicit timestamps), the series instead drops out once its last sample is older than the lookback delta, five minutes by default. Consequences:
- Alerts can silently disable themselves. If the alert is up == 0 and the up series stops being reported entirely, the alert never fires. Detect with absent().
- Dashboards show gaps. Sparse metrics appear missing rather than zero. Use or vector(0) to substitute defaults.
- Subqueries skip gaps. avg_over_time skips missing steps rather than treating them as zero.
# Detect when a critical metric is missing
absent(up{job="payments-api"})
# Default to zero when no data. vector(0) carries no labels, so pair it with a label-free aggregate
sum(rate(payment_failures_total[5m])) or vector(0)
The lookback-delta setting (default 5m) controls how far back PromQL searches for the most recent sample. Do not change it without strong reason — it cascades to every query in the environment.
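How an instant query picks a sample can be sketched as follows. This is a simplified model: `instant_value` and the `STALE` sentinel are hypothetical names, and the exact boundary handling (whether a sample precisely one lookback delta old still counts) is an assumption that may differ between Prometheus versions.

```python
STALE = None  # stand-in for a staleness marker in this sketch

def instant_value(samples, eval_time, lookback_s=300):
    """Return the value an instant query would see at eval_time, or None.

    samples: list of (timestamp_seconds, value_or_STALE), sorted by time.
    """
    latest = None
    for ts, value in samples:
        if ts > eval_time:
            break
        latest = (ts, value)
    if latest is None:
        return None  # nothing recorded yet
    ts, value = latest
    if value is STALE:
        return None  # the series was explicitly marked stale
    if eval_time - ts > lookback_s:
        return None  # last sample is older than the lookback delta
    return value

if __name__ == "__main__":
    no_marker = [(0, 1.0), (15, 2.0)]
    with_marker = [(0, 1.0), (15, 2.0), (30, STALE)]
    print(instant_value(no_marker, 100))    # 2.0
    print(instant_value(no_marker, 400))    # None
    print(instant_value(with_marker, 100))  # None
```

Without a marker the last value lingers until the lookback delta expires; with a marker the series vanishes at the next evaluation.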
4.3 Cardinality explosions
Cardinality, the number of unique time series, is the single largest scaling constraint in Prometheus. A metric with 10 labels each having 100 possible values can in principle produce 100^10 = 10^20 series. A few hundred thousand series per Prometheus is healthy; millions is painful; tens of millions usually ends in out-of-memory crashes.
| Label | Cardinality | Safe in metrics? |
| method | ~10 | Yes |
| status_code | ~40 | Yes |
| service | tens | Yes |
| pod | hundreds-thousands, churns | Risky |
| request_path (raw URL) | unbounded | No |
| user_id / email | millions | No |
| trace_id | every request | Absolutely not |
Normalise at the source: bucket paths into templates (/users/:id/posts/:id), drop high-cardinality labels before exposing them, or move that information into logs and traces (where cardinality is cheap) instead of metrics.
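Both the arithmetic and the path templating can be sketched in Python. The helper names are hypothetical, and the normaliser assumes that IDs are purely numeric path segments; real services often need UUID or slug handling as well.

```python
import math
import re

def worst_case_series(label_cardinalities):
    """Upper bound on series count: the product of distinct values per label."""
    return math.prod(label_cardinalities.values())

# A path segment made only of digits, followed by another segment or the end.
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")

def normalize_path(path):
    """Collapse numeric ID segments into a fixed template placeholder."""
    return _NUMERIC_SEGMENT.sub("/:id", path)

if __name__ == "__main__":
    print(worst_case_series({"method": 10, "status_code": 40, "pod": 500}))  # 200000
    raw_paths = ["/users/42/posts/7", "/users/43/posts/9", "/users/42/posts/8"]
    print({normalize_path(p) for p in raw_paths})  # {'/users/:id/posts/:id'}
```

Three distinct raw URLs collapse into a single label value, so the label's cardinality is bounded by the number of routes rather than the number of users and posts.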
Diagnosis tools:
# Top 20 metrics by series count
topk(20, count by (__name__)({__name__=~".+"}))
# Distinct values for a specific label
count(count by (label_name)(metric_name))
# Total active series - alert when this grows fast
prometheus_tsdb_head_series
graph LR
M["http_requests_total<br/>1 metric, no labels<br/>1 series"]
M --> L1["+ method<br/>(~10 values)<br/>10 series"]
L1 --> L2["+ status_code<br/>(~40 values)<br/>400 series"]
L2 --> L3["+ pod<br/>(~500 churning values)<br/>200,000 series"]
L3 --> L4["+ request_path raw URL<br/>(unbounded, 50k+)<br/>10,000,000+ series<br/>Prometheus OOM"]
L4 --> FIX["Fix: normalize path to template<br/>route='/users/:id/posts/:id'<br/>or drop label entirely"]
Key Takeaway
- Counter resets: rate per series first, aggregate afterwards.
- Staleness: guard with absent() and or vector(0); never edit lookback-delta lightly.
- Cardinality: normalise or drop unbounded labels at the source; never expose user_id, trace_id, or raw URLs as label values.