
Grafana

First published by Atif Alam

Grafana is an open-source visualization and analytics platform. It connects to data sources (Prometheus, Loki, Elasticsearch, CloudWatch, etc.) and turns queries into dashboards, charts, and alerts.

Grafana doesn’t store data — it queries external sources. Add a data source in Configuration → Data Sources:

| Data Source | Used For |
|---|---|
| Prometheus | Metrics (PromQL queries) |
| Loki | Logs (LogQL queries) |
| Elasticsearch | Logs, metrics, search |
| CloudWatch | AWS metrics and logs |
| InfluxDB | Time-series metrics |
| PostgreSQL / MySQL | Business data, custom queries |
| Tempo / Jaeger | Distributed traces |

You can have multiple data sources of the same type (e.g. one Prometheus for production, another for staging).
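As a sketch, a prod/staging pair of Prometheus data sources could be provisioned like this (the names and URLs below are placeholders, not defaults):

```yaml
# provisioning/datasources/prometheus.yml -- hypothetical prod/staging pair
apiVersion: 1
datasources:
  - name: Prometheus-Prod
    type: prometheus
    url: http://prometheus.prod.svc:9090     # assumed cluster-internal URL
    isDefault: true                          # used when a panel has no explicit source
  - name: Prometheus-Staging
    type: prometheus
    url: http://prometheus.staging.svc:9090
```

Panels then select a source by its name, so the same dashboard can be pointed at either environment.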

A dashboard is a collection of panels (charts, tables, stats) arranged in rows.

  1. Click + → New Dashboard.
  2. Add a panel.
  3. Choose a data source and write a query.
  4. Select a visualization type.
  5. Configure panel options (title, legend, thresholds).
  6. Save the dashboard.
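A heavily trimmed sketch of what step 6 saves (field names follow Grafana's dashboard schema; the panel title and query here are illustrative, and a real export contains many more fields such as `gridPos` and `datasource`):

```json
{
  "title": "My Service",
  "panels": [
    {
      "type": "timeseries",
      "title": "Request rate",
      "targets": [
        { "expr": "rate(http_requests_total[5m])" }
      ]
    }
  ]
}
```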

Dashboards are stored as JSON. You can:

  • Export a dashboard as JSON for version control.
  • Import a JSON file or paste a dashboard ID from grafana.com/dashboards.
  • Provision dashboards from files on disk (for GitOps / config-as-code).

Place YAML configs and JSON dashboards in Grafana’s provisioning directory:

provisioning/dashboards/dashboards.yml:

```yaml
apiVersion: 1
providers:
  - name: default
    folder: ""
    type: file
    options:
      path: /var/lib/grafana/dashboards
```

provisioning/datasources/prometheus.yml:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
```

This lets you deploy Grafana with dashboards and data sources pre-configured — no manual setup.

Time series: line, area, or bar chart over time. The most common panel:

rate(http_requests_total[5m])

Options: line width, fill opacity, gradient, stacking, point size, thresholds.

Stat: a single large number with an optional sparkline. Good for KPIs:

sum(rate(http_requests_total[5m]))

Shows: “2,345 req/s”

Gauge: a circular gauge showing a value against a range:

node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

Shows: 72% with color thresholds (green/yellow/red).

Bar chart: compare values across categories:

sum by (method) (rate(http_requests_total[5m]))

Table: tabular data with sortable columns:

topk(10, sum by (instance) (rate(http_requests_total[5m])))

Heatmap: visualize distributions over time (e.g. latency buckets):

sum by (le) (rate(http_request_duration_seconds_bucket[5m]))

Logs: display log lines from Loki or Elasticsearch:

{app="my-app"} |= "error"

Other useful panel types:

  • Pie chart — Proportions
  • State timeline — Status over time (up/down/degraded)
  • Alert list — Current firing alerts
  • Text — Markdown or HTML for notes and documentation

Variables make dashboards dynamic — users can switch between environments, hosts, or services without editing queries.

Dashboard Settings → Variables → Add variable:

Name: instance
Type: Query
Data source: Prometheus
Query: label_values(up, instance)

This populates a dropdown with all instance label values.

rate(http_requests_total{instance="$instance"}[5m])

The $instance is replaced with the selected value from the dropdown.

| Variable | Query | Purpose |
|---|---|---|
| job | label_values(up, job) | Select by job |
| instance | label_values(up{job="$job"}, instance) | Chain: instances for selected job |
| namespace | label_values(kube_pod_info, namespace) | Kubernetes namespace |
| interval | Custom: 1m, 5m, 15m, 1h | Adjustable time range |
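The interval variable from the table plugs directly into range selectors, and a multi-value variable needs a regex matcher (`=~`) because Grafana interpolates multiple selections as a pipe-separated pattern:

```promql
# $interval expands to e.g. 5m; $instance may expand to "host1|host2"
# when multi-value selection is enabled, so use =~ instead of =
rate(http_requests_total{instance=~"$instance"}[$interval])
```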

When one variable depends on another (e.g. namespace → pod):

  1. Create namespace variable: label_values(kube_pod_info, namespace)
  2. Create pod variable: label_values(kube_pod_info{namespace="$namespace"}, pod)

Selecting a namespace automatically filters the pod list.

Repeat a panel for each value of a variable:

  1. Set the variable to allow multi-value selection.
  2. In the panel, enable Repeat → Variable: instance.

Grafana creates one panel per selected instance — useful for “per-host” views.

Mark events on time-series panels (deploys, incidents, config changes):

# Query annotation source
ALERTS{alertname="HighErrorRate"}

Or add manual annotations by clicking on the graph and writing a note.

  • Share link — Direct URL with current time range and variables.
  • Snapshot — Static copy of the dashboard (no live data).
  • Export JSON — Full dashboard definition for version control.
  • Embed panel — iframe embed for external pages.
  • PDF/PNG — Via Grafana Image Renderer plugin.

Well-designed dashboards answer questions quickly. Poorly designed ones become “wall of graphs” that nobody reads. These patterns help you build dashboards that are actually useful.

The USE method: for every resource (CPU, memory, disk, network), show three things:

| Signal | Meaning | Example Panel |
|---|---|---|
| Utilization | How busy is it? (%) | node_cpu_seconds_total → CPU usage % |
| Saturation | How overloaded is it? (queue depth) | node_load1 → load average |
| Errors | Is it failing? | node_disk_io_time_weighted_seconds_total |

Layout: One row per resource, three panels per row.
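For the CPU row, the three panels might use queries like these (standard node_exporter metrics; exact label sets can vary by exporter version):

```promql
# Utilization: % of CPU time not spent idle, per instance
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

# Saturation: 1-minute load average relative to core count
# (counting idle-mode series per instance gives the number of CPUs)
node_load1 / count by (instance) (node_cpu_seconds_total{mode="idle"})

# Errors: CPU exposes no direct error metric; for disks, weighted I/O time
# is a common proxy signal
rate(node_disk_io_time_weighted_seconds_total[5m])
```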

The RED method: for every service (API, microservice), show three things:

| Signal | Meaning | Example Panel |
|---|---|---|
| Rate | Requests per second | rate(http_requests_total[5m]) |
| Errors | Error rate (% or count) | rate(http_requests_total{status=~"5.."}[5m]) |
| Duration | Latency (p50, p95, p99) | histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) |

Layout: One row per service, three panels per row. This is the most common pattern for microservice dashboards.
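The error panel is usually shown as a percentage, which combines the two rate queries from the table:

```promql
# Error rate as a percentage of all requests
100 * sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))
```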

The Four Golden Signals: Google’s SRE book recommends monitoring these for every user-facing system:

| Signal | What to Measure |
|---|---|
| Latency | Time to serve a request (separate success vs error latency) |
| Traffic | Requests per second |
| Errors | Rate of failed requests |
| Saturation | How “full” the service is (CPU, memory, queue depth) |

RED covers the first three; add a saturation panel (CPU/memory of the service pods) for the fourth.
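A saturation panel for the service’s pods could use queries like these (metric and label names here assume the cAdvisor and kube-state-metrics series shipped by default with kube-prometheus-stack; adjust to your setup):

```promql
# CPU saturation: usage relative to the container CPU limit, per pod
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="$namespace"}[5m]))
  / sum by (pod) (kube_pod_container_resource_limits{resource="cpu", namespace="$namespace"})

# Memory saturation: working set relative to the memory limit, per pod
sum by (pod) (container_memory_working_set_bytes{namespace="$namespace"})
  / sum by (pod) (kube_pod_container_resource_limits{resource="memory", namespace="$namespace"})
```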

Overview → Detail (Drill-Down):

┌─────────────────────────────────────────────────┐
│ Row 1: Key stats (stat panels) │
│ [Total RPS] [Error %] [p99 Latency] [Pods] │
├─────────────────────────────────────────────────┤
│ Row 2: Time series (trends) │
│ [Request rate over time] [Error rate] │
├─────────────────────────────────────────────────┤
│ Row 3: Per-instance breakdown │
│ [Latency by pod] [CPU by pod] │
├─────────────────────────────────────────────────┤
│ Row 4: Logs (Loki panel) │
│ [Recent errors from Loki] │
└─────────────────────────────────────────────────┘

This pattern gives you the summary at the top and lets you scroll down for detail.

Service Map (Multi-Service):

Create a dashboard per service (using the RED method), then link them:

  • A “Platform Overview” dashboard shows all services as stat panels.
  • Clicking a service stat links to that service’s detailed dashboard.
  • Use Grafana’s Dashboard Links and pass variables.
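Grafana passes variables between dashboards as var-<name> URL parameters, so a drill-down link might look like this (host and dashboard UID are placeholders):

```
https://grafana.example.com/d/svc-detail/service-detail?var-service=checkout&from=now-1h&to=now
```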

| Tip | Why |
|---|---|
| Put stat panels at the top | Instant overview of current state |
| Use thresholds and colors | Green/yellow/red makes problems visible without reading numbers |
| Label axes | “Requests per second”, not just “rate” |
| Set meaningful Y-axis limits | Don’t auto-scale from 0.001 to 0.002 — it looks like a crisis |
| Use the right unit | Grafana supports reqps, bytes, percent, seconds, etc. |
| Add descriptions to panels | Hover-text explaining what the panel shows and what “bad” looks like |
| Collapse rows | Group related panels; default-collapse less important sections |
| Limit to 10–15 panels | More than that = information overload |

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Wall of graphs | 30+ panels, no hierarchy | Use rows, collapse, and a summary row at top |
| No variables | Separate dashboard per environment | Add $environment, $namespace, $service variables |
| Raw metric names as titles | “node_cpu_seconds_total” means nothing to on-call | Use human-readable titles: “CPU Usage (%)” |
| Default time range too wide | 7-day view hides the last-10-minute spike | Set default to “Last 1 hour” for operational dashboards |
| No alerting link | Dashboard shows a problem but no way to see related alerts | Add an Alert List panel or link to alert rules |
| Mixing audiences | Dev metrics + business metrics on one dashboard | Separate: “Service Health” (ops) vs “Business KPIs” (product) |

Store dashboards in Git and provision them automatically:

  1. Export dashboard JSON from Grafana UI.
  2. Parameterize data source names using ${DS_PROMETHEUS} variables.
  3. Commit to a dashboards/ directory in your repo.
  4. Use Grafana provisioning or a Kubernetes ConfigMap to load on startup.
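After step 2, the top of the exported JSON declares the data source as an input; Grafana’s “Export for sharing externally” option typically generates this for you. A sketch trimmed to the relevant fields:

```json
{
  "__inputs": [
    {
      "name": "DS_PROMETHEUS",
      "label": "Prometheus",
      "type": "datasource",
      "pluginId": "prometheus"
    }
  ],
  "panels": [
    { "datasource": "${DS_PROMETHEUS}" }
  ]
}
```

On import or provisioning, the placeholder is resolved to a concrete data source, so the same file works across environments.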
```yaml
# Kubernetes ConfigMap for dashboard provisioning
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  labels:
    grafana_dashboard: "1"  # Grafana sidecar picks this up
data:
  service-health.json: |
    { ... exported dashboard JSON ... }
```

The kube-prometheus-stack Helm chart’s Grafana sidecar auto-discovers ConfigMaps with the grafana_dashboard label and loads them.

  • Grafana connects to data sources — it doesn’t store data itself.
  • Use variables to make dashboards dynamic (environment, host, namespace dropdowns).
  • Provision data sources and dashboards from files for config-as-code deployments.
  • Choose the right panel type: time series for trends, stat for KPIs, heatmap for distributions, table for top-N lists.
  • Export dashboards as JSON and commit to Git — treat dashboards as code.
  • Use the RED method (Rate, Errors, Duration) for service dashboards and the USE method (Utilization, Saturation, Errors) for infrastructure dashboards.
  • Design dashboards with a summary row at top, details below, and 10–15 panels max to avoid information overload.