# Scaling Prometheus
A single Prometheus server works well for small-to-medium environments. But at scale you hit limits: local disk fills up, a single instance can’t scrape thousands of targets, and you need queries across multiple clusters. Thanos, Cortex, and Mimir solve these problems.
## When a Single Prometheus Isn't Enough

| Problem | Symptom |
|---|---|
| Retention | Local disk runs out; you can only keep 15–30 days of data |
| High availability | One Prometheus instance = single point of failure |
| Multi-cluster | Separate Prometheus per cluster; no unified view |
| Cardinality | Millions of time series overwhelm a single TSDB |
| Query performance | Large range queries on months of data are slow |
All three projects solve these problems, but with different architectures.
## Thanos

Thanos extends existing Prometheus instances with long-term storage and global querying. It's a sidecar-based approach — you keep your existing Prometheus servers and add Thanos components alongside them.
### Architecture

```
┌──────────────────┐        ┌──────────────────┐
│   Prometheus A   │        │   Prometheus B   │
│   (cluster-us)   │        │   (cluster-eu)   │
│ ┌────────────┐   │        │ ┌────────────┐   │
│ │   Thanos   │   │        │ │   Thanos   │   │
│ │   Sidecar  │   │        │ │   Sidecar  │   │
│ └────────────┘   │        │ └────────────┘   │
└────────┬─────────┘        └────────┬─────────┘
         │                           │
         ▼                           ▼
┌──────────────────────────────────────┐
│        Object Storage (S3/GCS)       │
└──────────────────────────────────────┘
         │
         ▼
┌──────────────────┐    ┌──────────────────┐
│   Thanos Store   │    │  Thanos Compact  │
│     Gateway      │    │  (downsampling)  │
└──────────────────┘    └──────────────────┘
         │
         ▼
┌──────────────────┐
│   Thanos Query   │ ◄── Grafana connects here
│  (global view)   │
└──────────────────┘
```

### Components

| Component | What It Does |
|---|---|
| Sidecar | Runs alongside each Prometheus. Uploads blocks to object storage and serves real-time data to Query. |
| Store Gateway | Reads historical data from object storage and serves it to Query. |
| Query | A Prometheus-compatible query endpoint that fans out to Sidecars and Store Gateways. Grafana points here. |
| Compactor | Compacts and downsamples blocks in object storage (raw → 5m → 1h resolution as data ages). Reduces storage costs. |
| Ruler | Evaluates recording and alerting rules across the global view (optional; you can keep rules on Prometheus). |
| Receive | Alternative to Sidecar — accepts remote-write from Prometheus instances. Useful when sidecars aren’t possible. |
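As a sketch of the Receive path (the addresses, paths, and replica label here are illustrative; check the flags against your Thanos version), Prometheus remote-writes to Receive instead of running a sidecar:

```shell
# Thanos Receive accepts remote_write and uploads blocks itself
thanos receive \
  --tsdb.path=/var/thanos/receive \
  --objstore.config-file=bucket.yaml \
  --remote-write.address=0.0.0.0:19291 \
  --grpc-address=0.0.0.0:10901 \
  --label='replica="receive-0"'
```

Prometheus then points its `remote_write` URL at the Receive endpoint (conventionally `/api/v1/receive` on the remote-write port).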
### How It Works

1. Each Prometheus scrapes its targets normally and writes TSDB blocks to local disk.
2. The Sidecar uploads completed 2-hour blocks to object storage (S3, GCS, Azure Blob).
3. The Store Gateway indexes those blocks and serves them for historical queries.
4. Thanos Query federates queries: recent data from Sidecars, historical data from the Store Gateway.
5. The Compactor runs periodically to merge small blocks and create downsampled data.
### Key Config: Sidecar

```shell
# Run Thanos Sidecar alongside Prometheus
thanos sidecar \
  --tsdb.path=/prometheus/data \
  --prometheus.url=http://localhost:9090 \
  --objstore.config-file=bucket.yaml \
  --grpc-address=0.0.0.0:10901
```
```yaml
# bucket.yaml
type: S3
config:
  bucket: thanos-metrics
  endpoint: s3.amazonaws.com
  region: us-east-1
  access_key: ${AWS_ACCESS_KEY_ID}
  secret_key: ${AWS_SECRET_ACCESS_KEY}
```

### Key Config: Query

```shell
# Thanos Query connects to all Sidecars and Store Gateways
thanos query \
  --store=prometheus-a-sidecar:10901 \
  --store=prometheus-b-sidecar:10901 \
  --store=store-gateway:10901 \
  --http-address=0.0.0.0:9090
```

Grafana connects to Thanos Query at port 9090 as if it were a single Prometheus.
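The Store Gateway side is similar; a minimal sketch reusing the same `bucket.yaml` (data and cache directories are left at their defaults):

```shell
# Thanos Store Gateway serves historical blocks from object storage
thanos store \
  --objstore.config-file=bucket.yaml \
  --grpc-address=0.0.0.0:10901
```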
## Cortex

Cortex is a horizontally scalable, multi-tenant Prometheus-compatible backend. Unlike Thanos (sidecar model), Cortex uses remote write — Prometheus pushes data to Cortex.
### Architecture

```
┌──────────────┐  remote write   ┌──────────────────────────────┐
│  Prometheus  │────────────────►│            Cortex            │
└──────────────┘                 │ ┌─────────┐   ┌───────────┐  │
                                 │ │Distribu-│──►│ Ingester  │  │
                                 │ │  tor    │   │           │  │
                                 │ └─────────┘   └─────┬─────┘  │
                                 │                     │        │
                                 │              ┌──────▼─────┐  │
                                 │              │   Object   │  │
                                 │              │  Storage   │  │
                                 │              └──────┬─────┘  │
                                 │                     │        │
                                 │ ┌──────────┐  ┌─────▼──────┐ │
                                 │ │  Query   │◄─│   Store    │ │
                                 │ │ Frontend │  │  Gateway   │ │
                                 │ └──────────┘  └────────────┘ │
                                 └──────────────────────────────┘
```

### Key Differences from Thanos

| Feature | Thanos | Cortex |
|---|---|---|
| Data flow | Sidecar uploads blocks | Remote write (push) |
| Multi-tenancy | No built-in tenancy | Native multi-tenancy (X-Scope-OrgID header) |
| Existing Prometheus | Keep as-is, add sidecar | Add remote_write config |
| Storage | Object storage (blocks) | Object storage (chunks or blocks) |
| HA | Deduplicate via external labels | Built-in replication factor |
| Complexity | Moderate (add sidecars) | Higher (more microservices) |
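For the Thanos HA row above: deduplication works by giving each HA replica an identical scrape config plus a distinguishing external label, then telling Query which label marks replicas. A sketch (the `cluster`/`replica` label names are a common convention, not a requirement):

```yaml
# prometheus.yml on each HA replica
global:
  external_labels:
    cluster: us-east
    replica: prom-a   # "prom-b" on the second replica
```

Thanos Query then deduplicates the pair with `--query.replica-label=replica`.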
### When to Use Cortex

- You need multi-tenancy (e.g. SaaS platform where each customer has isolated metrics).
- You want a fully push-based architecture.
- You’re already using remote write.
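Tenant isolation applies on the read path too; a hedged example (the hostname and URL prefix here are illustrative and depend on how your Cortex API is configured):

```shell
# Query as a specific tenant; the X-Scope-OrgID header selects it
curl -H 'X-Scope-OrgID: customer-a' \
  'http://cortex-query-frontend/prometheus/api/v1/query?query=up'
```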
## Grafana Mimir

Grafana Mimir is the successor to Cortex, built by Grafana Labs. It's architecturally similar to Cortex but with significant performance improvements and simpler operations.
### Why Mimir Over Cortex?

| Feature | Cortex | Mimir |
|---|---|---|
| Query performance | Good | Significantly faster (query sharding, shuffle sharding, split-and-merge compaction) |
| Storage | Chunks or blocks | Blocks only (simpler) |
| Cardinality limits | Basic | Advanced per-tenant limits |
| Out-of-order ingestion | No | Yes (handles late-arriving data) |
| Maintained by | Community (slower pace) | Grafana Labs (active development) |
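The per-tenant limits row can be sketched as a runtime-config override (the field names follow Mimir's limits block, but the values are illustrative; verify against your Mimir version):

```yaml
# runtime.yaml: hypothetical per-tenant overrides
overrides:
  customer-a:
    max_global_series_per_user: 1500000
    out_of_order_time_window: 30m
```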
Mimir is the recommended choice for new deployments. Grafana Labs considers Cortex essentially superseded by Mimir.
### Architecture (Same as Cortex, Improved Internals)

```
Prometheus ──remote_write──► Mimir Distributor ──► Ingester ──► Object Storage
                                                                     │
Grafana ◄── Query Frontend ◄── Querier ◄── Store Gateway ◄───────────┘
```

### Deploying Mimir with Helm
```shell
helm repo add grafana https://grafana.github.io/helm-charts
helm install mimir grafana/mimir-distributed \
  --set mimir.structuredConfig.common.storage.backend=s3 \
  --set mimir.structuredConfig.common.storage.s3.bucket_name=mimir-blocks \
  --set mimir.structuredConfig.common.storage.s3.endpoint=s3.amazonaws.com
```

### Prometheus Remote Write Config
```yaml
# prometheus.yml — send metrics to Mimir (or Cortex)
remote_write:
  - url: http://mimir-distributor:8080/api/v1/push
    headers:
      X-Scope-OrgID: my-tenant  # required for multi-tenancy
```

## Choosing Between Them

| Criteria | Thanos | Cortex | Mimir |
|---|---|---|---|
| Best for | Extending existing Prometheus | Multi-tenant SaaS | New deployments, Grafana stack |
| Data model | Keep Prometheus local, upload blocks | Push via remote write | Push via remote write |
| Multi-tenancy | No | Yes | Yes |
| Operational effort | Lower (sidecar is lightweight) | Higher | Moderate (good Helm chart) |
| Long-term storage | Yes (object storage) | Yes (object storage) | Yes (object storage) |
| Grafana integration | Good | Good | Native (same company) |
| Active development | Active | Slower | Very active |
| Migration path | Add sidecars to existing Prometheus | Requires remote write | Requires remote write; easy migration from Cortex |
### Decision Flow

```
Do you need multi-tenancy?
 ├─ Yes → Mimir (or Cortex if already using it)
 └─ No
     └─ Want to keep existing Prometheus as-is?
         ├─ Yes → Thanos (sidecar model)
         └─ No  → Mimir (remote write, best performance)
```

## Common Patterns
Section titled “Common Patterns”Long-Term Retention with Downsampling
Section titled “Long-Term Retention with Downsampling”Both Thanos and Mimir can downsample old data to reduce storage costs:
| Resolution | Retention | Use Case |
|---|---|---|
| Raw (15s scrape) | 14 days | Detailed troubleshooting |
| 5 minute | 90 days | Weekly reviews |
| 1 hour | 1+ year | Capacity planning, trend analysis |
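In Thanos, a tiering like the table above maps onto the Compactor's per-resolution retention flags (the durations shown match the table; adjust to your needs):

```shell
# Thanos Compactor with per-resolution retention
thanos compact \
  --objstore.config-file=bucket.yaml \
  --retention.resolution-raw=14d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=365d \
  --wait
```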
### Global View Across Clusters

```
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│ Cluster US-East │  │ Cluster EU-West │  │ Cluster AP-SE   │
│   Prometheus    │  │   Prometheus    │  │   Prometheus    │
└────────┬────────┘  └────────┬────────┘  └────────┬────────┘
         │                    │                    │
         └────────────────────┼────────────────────┘
                              ▼
                   ┌───────────────────┐
                   │  Thanos Query /   │
                   │  Mimir Querier    │
                   └─────────┬─────────┘
                             ▼
                   ┌───────────────────┐
                   │      Grafana      │
                   └───────────────────┘
```

One Grafana instance, one query endpoint, all clusters — without duplicating metrics.
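Wiring Grafana to that single endpoint can be done with datasource provisioning; a sketch (the URL assumes a `thanos-query` service; point it at your Mimir query frontend instead if you use Mimir):

```yaml
# provisioning/datasources/metrics.yaml
apiVersion: 1
datasources:
  - name: Metrics (global)
    type: prometheus
    url: http://thanos-query:9090
    isDefault: true
```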
## Key Takeaways

- A single Prometheus is fine for small/medium setups; scale out when you need long-term retention, HA, or multi-cluster queries.
- Thanos adds a sidecar to existing Prometheus — least disruptive, no multi-tenancy.
- Cortex is push-based (remote write) with native multi-tenancy — suited for SaaS platforms.
- Mimir is the successor to Cortex with better performance and active development — the recommended choice for new deployments using the Grafana stack.
- All three use object storage (S3/GCS) for cost-effective long-term retention with optional downsampling.