Prometheus

First published by Atif Alam

Prometheus is an open-source monitoring system that scrapes metrics from targets at regular intervals, stores them in a time-series database (TSDB), and provides a powerful query language (PromQL) to analyze them.

┌──────────────┐  scrape /metrics  ┌──────────────┐
│   Targets    │◄──────────────────│  Prometheus  │
│   (apps,     │                   │    Server    │
│  exporters)  │                   │              │
└──────────────┘                   │ ┌──────────┐ │
                                   │ │   TSDB   │ │ ← stores time-series
┌──────────────┐    push (rare)    │ └──────────┘ │
│     Push     │──────────────────►│ ┌──────────┐ │
│   Gateway    │                   │ │  Rules   │ │ ← alert + recording rules
└──────────────┘                   │ └──────────┘ │
                                   └──────┬───────┘
                         ┌────────────────┼──────────────┐
                         │                │              │
                    ┌────▼────┐     ┌─────▼──────┐  ┌────▼────┐
                    │ Grafana │     │Alertmanager│  │ PromQL  │
                    │  (viz)  │     │  (notify)  │  │ (ad-hoc)│
                    └─────────┘     └────────────┘  └─────────┘

Key points:

  • Pull-based — Prometheus scrapes targets. Targets expose a /metrics endpoint.
  • TSDB — Efficient local storage optimized for time-series data. Default retention is 15 days.
  • Service discovery — Automatically finds targets via Kubernetes, Consul, DNS, file-based, or static config.
  • Push Gateway — For short-lived batch jobs that can’t be scraped (push metrics, then Prometheus scrapes the gateway).

Scrape configuration lives in the Prometheus server’s config file (prometheus.yml). It tells Prometheus which targets to scrape, how often, and at what endpoint.

# prometheus.yml — on the Prometheus server
global:
  scrape_interval: 15s      # how often to scrape all targets (default)
  evaluation_interval: 15s  # how often to evaluate alert and recording rules

scrape_configs:
  # Prometheus scrapes itself — useful for monitoring Prometheus health
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Scrape Node exporters running on two hosts
  # Each target must be running node_exporter on port 9100
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node1:9100"
          - "node2:9100"

  # Scrape your application on two instances
  - job_name: "my-app"
    metrics_path: /metrics  # endpoint to scrape (this is the default)
    scrape_interval: 10s    # override the global interval for this job
    static_configs:
      - targets: ["app1:8080", "app2:8080"]
        labels:
          environment: production  # extra label added to all metrics from these targets
  1. For each job, Prometheus makes an HTTP GET request to http://<target>:<port><metrics_path> (e.g. http://app1:8080/metrics) at the configured interval.
  2. The target responds with plain-text metrics in the Prometheus exposition format.
  3. Prometheus parses the metrics, attaches the job label and any extra labels, and stores them in the TSDB.
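The exposition format in step 2 is plain text: one sample per line, with optional # HELP and # TYPE comments. A minimal sketch of such an endpoint using only the Python standard library (the metric name and counters are illustrative — a real service would use an official client library such as prometheus_client, which handles types, labels, and escaping):

```python
# Sketch of a /metrics endpoint that emits the Prometheus text exposition
# format. REQUEST_COUNT is a hypothetical in-process counter store.
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = {"GET": 0, "POST": 0}

def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests served.",
        "# TYPE http_requests_total counter",
    ]
    for method, count in sorted(REQUEST_COUNT.items()):
        lines.append(f'http_requests_total{{method="{method}"}} {count}')
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        body = render_metrics().encode()
        self.send_response(200)
        # version=0.0.4 identifies the plain-text exposition format
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve on port 8080 (matching the "my-app" job above):
# HTTPServer(("", 8080), MetricsHandler).serve_forever()
```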

Field                       Purpose
global.scrape_interval      Default interval for all jobs (e.g. every 15s)
global.evaluation_interval  How often alert/recording rules are checked
job_name                    Logical name for a group of targets — becomes the job label
metrics_path                URL path to scrape (default: /metrics)
scrape_interval             Per-job override of the global interval
static_configs.targets      List of host:port endpoints to scrape
static_configs.labels       Extra labels added to all metrics from these targets

The example above uses static_configs — you list targets by hand. This works for fixed infrastructure but doesn’t scale when hosts come and go. For dynamic environments, use service discovery — Prometheus queries an API (Kubernetes, AWS, Azure, etc.) and automatically finds targets.

Cloud VM Service Discovery (AWS EC2 Example)


Prometheus can query the AWS EC2 API to discover running instances automatically. When you launch or terminate instances, Prometheus picks up the changes — no config edits needed.

# prometheus.yml — on the Prometheus server
scrape_configs:
  - job_name: "ec2-nodes"
    ec2_sd_configs:
      - region: us-east-1
        port: 9100  # port where node_exporter is running
        filters:    # only discover instances matching these tags
          - name: "tag:Environment"
            values: ["production"]
          - name: "tag:Monitoring"
            values: ["enabled"]
    relabel_configs:
      # Use the instance's Name tag as the "instance" label
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance
      # Add the availability zone as a label
      - source_labels: [__meta_ec2_availability_zone]
        target_label: az
      # Use the private IP (the default uses private DNS, which may not resolve)
      - source_labels: [__meta_ec2_private_ip]
        target_label: __address__
        replacement: "${1}:9100"

How it works:

  1. Prometheus calls the EC2 API using IAM credentials (from an instance role, environment variables, or config).
  2. It discovers all instances matching the filters (e.g. tagged Monitoring=enabled).
  3. relabel_configs transform EC2 metadata (instance name, AZ, private IP) into Prometheus labels.
  4. Prometheus scrapes each discovered instance at <private_ip>:9100/metrics.

Other cloud providers work the same way:

Provider       Config Block              What It Queries
AWS EC2        ec2_sd_configs            EC2 instances
Azure          azure_sd_configs          Azure VMs
GCP            gce_sd_configs            GCE instances
DigitalOcean   digitalocean_sd_configs   Droplets
Hetzner        hetzner_sd_configs        Hetzner servers

Each provides metadata labels (IPs, tags, zones, instance types) that you can use in relabel_configs to filter and label targets.

Kubernetes Service Discovery

When Prometheus runs inside a Kubernetes cluster, it can auto-discover pods, services, and endpoints using the Kubernetes API. This config goes in the Prometheus server’s prometheus.yml (or, if you use the Prometheus Operator, this is handled automatically via ServiceMonitor CRDs — see Observability Setup).

# prometheus.yml — on the Prometheus server
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod  # discover all pods in the cluster
    relabel_configs:
      # Only scrape pods that have the annotation prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the pod's prometheus.io/path annotation as the metrics path (default: /metrics)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use the pod's prometheus.io/port annotation as the scrape port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

On the target pods (your application), you don’t install anything extra — just add annotations to make them discoverable:

# Your app's Deployment manifest — the annotations go on the pod template's
# metadata, since Prometheus discovers pods, not Deployments
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"  # optional, defaults to /metrics

Prometheus queries the Kubernetes API, finds all pods with prometheus.io/scrape: "true", and scrapes their /metrics endpoint automatically. No static target list needed — new pods are discovered as they appear.

Metric Types

Counter

A value that only goes up (or resets to zero on restart). Used for totals:

http_requests_total{method="GET", status="200"} 1542
http_requests_total{method="POST", status="500"} 3

Query the rate of increase, not the raw value:

rate(http_requests_total[5m])

Gauge

A value that can go up or down. Used for current state:

node_memory_MemAvailable_bytes 4294967296
temperature_celsius{location="server-room"} 23.5
active_connections 42

Query the current value directly:

node_memory_MemAvailable_bytes

Histogram

Measures the distribution of values (e.g. request latency). Automatically creates _bucket, _sum, and _count metrics:

http_request_duration_seconds_bucket{le="0.1"} 5000
http_request_duration_seconds_bucket{le="0.5"} 8000
http_request_duration_seconds_bucket{le="1.0"} 9500
http_request_duration_seconds_bucket{le="+Inf"} 10000
http_request_duration_seconds_sum 3500.5
http_request_duration_seconds_count 10000

Calculate percentiles:

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
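histogram_quantile estimates the quantile by finding the bucket where the target rank falls, then assuming observations are spread evenly inside that bucket. A simplified Python sketch of that interpolation, using the bucket values above (edge cases such as empty histograms and NaN handling are omitted):

```python
# Simplified sketch of histogram_quantile's linear interpolation.
def histogram_quantile(q: float, buckets: list) -> float:
    """buckets: (upper_bound, cumulative_count) pairs, sorted, ending at +Inf."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, cum_count in buckets:
        if cum_count >= rank:
            if upper_bound == float("inf"):
                # Rank falls in the +Inf bucket: return the highest finite bound
                return lower_bound
            in_bucket = cum_count - lower_count
            if in_bucket == 0:
                return upper_bound
            # Assume observations are spread evenly within the bucket
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, cum_count
    return lower_bound

# Buckets from the example above: 95% of 10000 observations = rank 9500,
# which lands exactly at the top of the (0.5, 1.0] bucket.
buckets = [(0.1, 5000), (0.5, 8000), (1.0, 9500), (float("inf"), 10000)]
print(histogram_quantile(0.95, buckets))  # 1.0
```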

Summary

Similar to histogram but calculates quantiles on the client side. Less flexible (can’t aggregate across instances) but more accurate for individual targets.

Use Case                                   Type
Total requests, errors, bytes              Counter
Current temperature, memory, connections   Gauge
Latency distribution, request sizes        Histogram (preferred)
Pre-calculated quantiles (single target)   Summary

PromQL (Prometheus Query Language) is how you ask questions about your metrics. You’ll write PromQL in several places:

  • Prometheus UI (http://prometheus:9090/graph) — ad-hoc exploration: type a query, see results as a table or graph. Great for debugging.
  • Grafana panels — dashboard visualizations: each panel has a PromQL query that powers its chart, gauge, or table.
  • Alert rules (rules/*.yml) — the expr field in alert rules is PromQL, e.g. “fire if error rate > 5%.”
  • Recording rules (rules/*.yml) — pre-compute expensive queries: the expr field stores the result as a new metric.
  • HTTP API (/api/v1/query) — programmatic access: scripts and tools query Prometheus over HTTP and get JSON back.

In all cases, the syntax is the same. The examples below work anywhere you can write PromQL.
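For the HTTP API case, a sketch of running an instant query from Python using only the standard library (the server address http://prometheus:9090 is an assumption — point it at your own server; official client libraries exist for most languages):

```python
# Sketch of an instant query against the Prometheus HTTP API.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def parse_instant_vector(payload: dict) -> list:
    """Extract (labels, value) pairs from an instant-query JSON response."""
    if payload["status"] != "success":
        raise RuntimeError(f"query failed: {payload}")
    # Each result carries a label set ("metric") and a [timestamp, "value"]
    # pair; Prometheus returns the sample value as a string.
    return [(r["metric"], float(r["value"][1]))
            for r in payload["data"]["result"]]

def instant_query(base_url: str, promql: str) -> list:
    url = f"{base_url}/api/v1/query?" + urlencode({"query": promql})
    with urlopen(url) as resp:
        return parse_instant_vector(json.load(resp))

# Usage (against a live server):
# for labels, value in instant_query("http://prometheus:9090",
#                                    "rate(http_requests_total[5m])"):
#     print(labels, value)
```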

Instant Vectors

A single value per time series at the current moment:

http_requests_total # all series
http_requests_total{method="GET"} # filter by label
http_requests_total{status=~"5.."} # regex match (5xx errors)
http_requests_total{status!="200"} # not equal

Range Vectors

Values over a time window (required by functions like rate):

http_requests_total[5m]  # last 5 minutes of data points
http_requests_total[1h]  # last 1 hour

# Rate of increase per second (for counters)
rate(http_requests_total[5m])

# Increase over a period (total count, not per-second)
increase(http_requests_total[1h])

# Average over time
avg_over_time(node_cpu_seconds_total[5m])

# Current value minus value 1 hour ago
node_memory_MemAvailable_bytes - node_memory_MemAvailable_bytes offset 1h

# Sum across all instances
sum(rate(http_requests_total[5m]))

# Sum by specific label
sum by (method) (rate(http_requests_total[5m]))

# Average by job
avg by (job) (node_memory_MemAvailable_bytes)

# Top 5 highest request rates
topk(5, rate(http_requests_total[5m]))

# Count of time series
count(up == 1)

Operator             What It Does
sum                  Total across series
avg                  Average
min / max            Minimum / maximum
count                Number of series
topk(n, ...)         Top N series by value
bottomk(n, ...)      Bottom N series
quantile(0.95, ...)  95th percentile across series

# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
  * 100

# Available memory percentage
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

Recording Rules

Pre-compute expensive queries and store the result as a new metric:

# rules/recording.yml
groups:
  - name: http_rules
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      - record: job:http_error_ratio
        expr: job:http_errors:rate5m / job:http_requests:rate5m

Recording rules speed up dashboards and make alert rules simpler to write.
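Alert rules live in the same rule files and use the same PromQL expr field. A hypothetical alert built on the job:http_error_ratio recording rule above (the threshold, duration, and labels are illustrative, not prescriptive):

```yaml
# rules/alerts.yml — hypothetical alert using the recording rule above
groups:
  - name: http_alerts
    rules:
      - alert: HighErrorRatio
        expr: job:http_error_ratio > 0.05  # fire if >5% of requests error
        for: 10m                           # must hold for 10 minutes first
        labels:
          severity: critical
        annotations:
          summary: "High HTTP error ratio on {{ $labels.job }}"
```

When the expression stays true for the full for duration, Prometheus fires the alert and hands it to Alertmanager for routing and notification.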

  • Prometheus is pull-based — targets expose /metrics, Prometheus scrapes them.
  • Four metric types: counter (totals), gauge (current state), histogram (distributions), summary (pre-calculated quantiles).
  • Always use rate() on counters — never query raw counter values.
  • PromQL supports label filtering, regex, aggregations (sum by, avg by), and binary operators.
  • Use recording rules to pre-compute expensive queries for dashboards and alerts.