Prometheus

First published by Atif Alam

Prometheus is an open-source monitoring system that scrapes metrics from targets at regular intervals, stores them in a time-series database (TSDB), and provides a powerful query language (PromQL) to analyze them.

┌──────────────┐  scrape /metrics  ┌──────────────┐
│   Targets    │◄──────────────────│  Prometheus  │
│   (apps,     │                   │    Server    │
│  exporters)  │                   │              │
└──────────────┘                   │ ┌──────────┐ │
                                   │ │   TSDB   │ │ ← stores time-series
┌──────────────┐    push (rare)    │ └──────────┘ │
│     Push     │──────────────────►│ ┌──────────┐ │
│   Gateway    │                   │ │  Rules   │ │ ← alert + recording rules
└──────────────┘                   │ └──────────┘ │
                                   └──────┬───────┘
                         ┌────────────────┼──────────────┐
                         │                │              │
                    ┌────▼────┐     ┌─────▼──────┐  ┌────▼────┐
                    │ Grafana │     │Alertmanager│  │ PromQL  │
                    │  (viz)  │     │  (notify)  │  │ (ad-hoc)│
                    └─────────┘     └────────────┘  └─────────┘

Key points:

  • Pull-based — Prometheus scrapes targets. Targets expose a /metrics endpoint.
  • TSDB — Efficient local storage optimized for time-series data. Default retention is 15 days.
  • Service discovery — Automatically finds targets via Kubernetes, Consul, DNS, file-based, or static config.
  • Push Gateway — For short-lived batch jobs that can’t be scraped (push metrics, then Prometheus scrapes the gateway).

Scrape configuration lives in the Prometheus server’s config file (prometheus.yml). It tells Prometheus which targets to scrape, how often, and at what endpoint.

# prometheus.yml — on the Prometheus server
global:
  scrape_interval: 15s      # how often to scrape all targets (default)
  evaluation_interval: 15s  # how often to evaluate alert and recording rules

scrape_configs:
  # Prometheus scrapes itself — useful for monitoring Prometheus health
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Scrape Node exporters running on two hosts
  # Each target must be running node_exporter on port 9100
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node1:9100"
          - "node2:9100"

  # Scrape your application on two instances
  - job_name: "my-app"
    metrics_path: /metrics  # endpoint to scrape (this is the default)
    scrape_interval: 10s    # override the global interval for this job
    static_configs:
      - targets: ["app1:8080", "app2:8080"]
        labels:
          environment: production  # extra label added to all metrics from these targets
  1. For each job, Prometheus makes an HTTP GET request to http://<target>:<port><metrics_path> (e.g. http://app1:8080/metrics) at the configured interval.
  2. The target responds with plain-text metrics in the Prometheus exposition format.
  3. Prometheus parses the metrics, attaches the job label and any extra labels, and stores them in the TSDB.
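The exposition format in step 2 is plain text: one sample per line, with optional # HELP and # TYPE comments. A minimal sketch of such an endpoint using only the Python standard library (the metric name and counters are illustrative — a real service would use an official client library such as prometheus_client, which handles types, labels, and escaping):

```python
# Sketch of a /metrics endpoint that emits the Prometheus text exposition
# format. REQUEST_COUNT is a hypothetical in-process counter store.
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = {"GET": 0, "POST": 0}

def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests served.",
        "# TYPE http_requests_total counter",
    ]
    for method, count in sorted(REQUEST_COUNT.items()):
        lines.append(f'http_requests_total{{method="{method}"}} {count}')
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        body = render_metrics().encode()
        self.send_response(200)
        # version=0.0.4 identifies the plain-text exposition format
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve on port 8080 (matching the "my-app" job above):
# HTTPServer(("", 8080), MetricsHandler).serve_forever()
```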

Field                       Purpose
global.scrape_interval      Default interval for all jobs (e.g. every 15s)
global.evaluation_interval  How often alert/recording rules are checked
job_name                    Logical name for a group of targets — becomes the job label
metrics_path                URL path to scrape (default: /metrics)
scrape_interval             Per-job override of the global interval
static_configs.targets      List of host:port endpoints to scrape
static_configs.labels       Extra labels added to all metrics from these targets

The example above uses static_configs — you list targets by hand. This works for fixed infrastructure but doesn’t scale when hosts come and go. For dynamic environments, use service discovery — Prometheus queries an API (Kubernetes, AWS, Azure, etc.) and automatically finds targets.

Cloud VM Service Discovery (AWS EC2 Example)


Prometheus can query the AWS EC2 API to discover running instances automatically. When you launch or terminate instances, Prometheus picks up the changes — no config edits needed.

# prometheus.yml — on the Prometheus server
scrape_configs:
  - job_name: "ec2-nodes"
    ec2_sd_configs:
      - region: us-east-1
        port: 9100  # port where node_exporter is running
        filters:    # only discover instances matching these tags
          - name: "tag:Environment"
            values: ["production"]
          - name: "tag:Monitoring"
            values: ["enabled"]
    relabel_configs:
      # Use the instance's Name tag as the "instance" label
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance
      # Add the availability zone as a label
      - source_labels: [__meta_ec2_availability_zone]
        target_label: az
      # Use the private IP (the default uses private DNS, which may not resolve)
      - source_labels: [__meta_ec2_private_ip]
        target_label: __address__
        replacement: "${1}:9100"

How it works:

  1. Prometheus calls the EC2 API using IAM credentials (from an instance role, environment variables, or config).
  2. It discovers all instances matching the filters (e.g. tagged Monitoring=enabled).
  3. relabel_configs transform EC2 metadata (instance name, AZ, private IP) into Prometheus labels.
  4. Prometheus scrapes each discovered instance at <private_ip>:9100/metrics.

Other cloud providers work the same way:

Provider       Config Block              What It Queries
AWS EC2        ec2_sd_configs            EC2 instances
Azure          azure_sd_configs          Azure VMs
GCP            gce_sd_configs            GCE instances
DigitalOcean   digitalocean_sd_configs   Droplets
Hetzner        hetzner_sd_configs        Hetzner servers

Each provides metadata labels (IPs, tags, zones, instance types) that you can use in relabel_configs to filter and label targets.

Kubernetes Service Discovery

When Prometheus runs inside a Kubernetes cluster, it can auto-discover pods, services, and endpoints using the Kubernetes API. This config goes in the Prometheus server’s prometheus.yml (or, if you use the Prometheus Operator, this is handled automatically via ServiceMonitor CRDs — see Observability Setup).

# prometheus.yml — on the Prometheus server
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod  # discover all pods in the cluster
    relabel_configs:
      # Only scrape pods that have the annotation prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the pod's prometheus.io/path annotation as the metrics path (default: /metrics)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use the pod's prometheus.io/port annotation as the scrape port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

On the target pods (your application), you don’t install anything extra — just add annotations to make them discoverable:

# Your app's Deployment manifest — the annotations go on the pod template's
# metadata, since Prometheus discovers pods, not Deployments
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"  # optional, defaults to /metrics

Prometheus queries the Kubernetes API, finds all pods with prometheus.io/scrape: "true", and scrapes their /metrics endpoint automatically. No static target list needed — new pods are discovered as they appear.

Metric Types

Counter

A value that only goes up (or resets to zero on restart). Used for totals:

http_requests_total{method="GET", status="200"} 1542
http_requests_total{method="POST", status="500"} 3

Query the rate of increase, not the raw value:

rate(http_requests_total[5m])

Gauge

A value that can go up or down. Used for current state:

node_memory_MemAvailable_bytes 4294967296
temperature_celsius{location="server-room"} 23.5
active_connections 42

Query the current value directly:

node_memory_MemAvailable_bytes

Histogram

Measures the distribution of values (e.g. request latency). Automatically creates _bucket, _sum, and _count metrics:

http_request_duration_seconds_bucket{le="0.1"} 5000
http_request_duration_seconds_bucket{le="0.5"} 8000
http_request_duration_seconds_bucket{le="1.0"} 9500
http_request_duration_seconds_bucket{le="+Inf"} 10000
http_request_duration_seconds_sum 3500.5
http_request_duration_seconds_count 10000

Calculate percentiles:

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
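histogram_quantile estimates the quantile by finding the bucket where the target rank falls, then assuming observations are spread evenly inside that bucket. A simplified Python sketch of that interpolation, using the bucket values above (edge cases such as empty histograms and NaN handling are omitted):

```python
# Simplified sketch of histogram_quantile's linear interpolation.
def histogram_quantile(q: float, buckets: list) -> float:
    """buckets: (upper_bound, cumulative_count) pairs, sorted, ending at +Inf."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, cum_count in buckets:
        if cum_count >= rank:
            if upper_bound == float("inf"):
                # Rank falls in the +Inf bucket: return the highest finite bound
                return lower_bound
            in_bucket = cum_count - lower_count
            if in_bucket == 0:
                return upper_bound
            # Assume observations are spread evenly within the bucket
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, cum_count
    return lower_bound

# Buckets from the example above: 95% of 10000 observations = rank 9500,
# which lands exactly at the top of the (0.5, 1.0] bucket.
buckets = [(0.1, 5000), (0.5, 8000), (1.0, 9500), (float("inf"), 10000)]
print(histogram_quantile(0.95, buckets))  # 1.0
```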

Summary

Similar to histogram but calculates quantiles on the client side. Less flexible (can’t aggregate across instances) but more accurate for individual targets.

Use Case                                   Type
Total requests, errors, bytes              Counter
Current temperature, memory, connections   Gauge
Latency distribution, request sizes        Histogram (preferred)
Pre-calculated quantiles (single target)   Summary

PromQL (Prometheus Query Language) is how you ask questions about your metrics. You’ll write PromQL in several places:

  • Prometheus UI (http://prometheus:9090/graph) — ad-hoc exploration: type a query, see results as a table or graph. Great for debugging.
  • Grafana panels — dashboard visualizations: each panel has a PromQL query that powers its chart, gauge, or table.
  • Alert rules (rules/*.yml) — the expr field in alert rules is PromQL, e.g. “fire if error rate > 5%.”
  • Recording rules (rules/*.yml) — pre-compute expensive queries: the expr field stores the result as a new metric.
  • HTTP API (/api/v1/query) — programmatic access: scripts and tools query Prometheus over HTTP and get JSON back.

In all cases, the syntax is the same. The examples below work anywhere you can write PromQL.
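For the HTTP API case, a sketch of running an instant query from Python using only the standard library (the server address http://prometheus:9090 is an assumption — point it at your own server; official client libraries exist for most languages):

```python
# Sketch of an instant query against the Prometheus HTTP API.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def parse_instant_vector(payload: dict) -> list:
    """Extract (labels, value) pairs from an instant-query JSON response."""
    if payload["status"] != "success":
        raise RuntimeError(f"query failed: {payload}")
    # Each result carries a label set ("metric") and a [timestamp, "value"]
    # pair; Prometheus returns the sample value as a string.
    return [(r["metric"], float(r["value"][1]))
            for r in payload["data"]["result"]]

def instant_query(base_url: str, promql: str) -> list:
    url = f"{base_url}/api/v1/query?" + urlencode({"query": promql})
    with urlopen(url) as resp:
        return parse_instant_vector(json.load(resp))

# Usage (against a live server):
# for labels, value in instant_query("http://prometheus:9090",
#                                    "rate(http_requests_total[5m])"):
#     print(labels, value)
```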

Instant Vectors

A single value per time series at the current moment:

http_requests_total # all series
http_requests_total{method="GET"} # filter by label
http_requests_total{status=~"5.."} # regex match (5xx errors)
http_requests_total{status!="200"} # not equal

Range Vectors

Values over a time window (required by functions like rate):

http_requests_total[5m]  # last 5 minutes of data points
http_requests_total[1h]  # last 1 hour

# Rate of increase per second (for counters)
rate(http_requests_total[5m])

# Increase over a period (total count, not per-second)
increase(http_requests_total[1h])

# Average over time
avg_over_time(node_cpu_seconds_total[5m])

# Current value minus value 1 hour ago
node_memory_MemAvailable_bytes - node_memory_MemAvailable_bytes offset 1h

# Sum across all instances
sum(rate(http_requests_total[5m]))

# Sum by specific label
sum by (method) (rate(http_requests_total[5m]))

# Average by job
avg by (job) (node_memory_MemAvailable_bytes)

# Top 5 highest request rates
topk(5, rate(http_requests_total[5m]))

# Count of time series
count(up == 1)

Operator             What It Does
sum                  Total across series
avg                  Average
min / max            Minimum / maximum
count                Number of series
topk(n, ...)         Top N series by value
bottomk(n, ...)      Bottom N series
quantile(0.95, ...)  95th percentile across series

# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
  * 100

# Available memory percentage
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

Recording Rules

Pre-compute expensive queries and store the result as a new metric:

# rules/recording.yml
groups:
  - name: http_rules
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      - record: job:http_error_ratio
        expr: job:http_errors:rate5m / job:http_requests:rate5m

Recording rules speed up dashboards and make alert rules simpler to write.
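Alert rules live in the same rule files and use the same PromQL expr field. A hypothetical alert built on the job:http_error_ratio recording rule above (the threshold, duration, and labels are illustrative, not prescriptive):

```yaml
# rules/alerts.yml — hypothetical alert using the recording rule above
groups:
  - name: http_alerts
    rules:
      - alert: HighErrorRatio
        expr: job:http_error_ratio > 0.05  # fire if >5% of requests error
        for: 10m                           # must hold for 10 minutes first
        labels:
          severity: critical
        annotations:
          summary: "High HTTP error ratio on {{ $labels.job }}"
```

When the expression stays true for the full for duration, Prometheus fires the alert and hands it to Alertmanager for routing and notification.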

  • Prometheus is pull-based — targets expose /metrics, Prometheus scrapes them.
  • Four metric types: counter (totals), gauge (current state), histogram (distributions), summary (pre-calculated quantiles).
  • Always use rate() on counters — never query raw counter values.
  • PromQL supports label filtering, regex, aggregations (sum by, avg by), and binary operators.
  • Use recording rules to pre-compute expensive queries for dashboards and alerts.