Incident Tooling and Customer Communications

First published by Atif Alam

Incident response patterns are covered in Incident response and on-call — roles, first minutes, blameless postmortems. This page covers the tooling and communication side: how on-call schedules, escalation policies, status pages, and customer comms templates fit together, with concrete examples from common vendors.

Pattern-first, vendor-second. The concepts apply to PagerDuty, OpsGenie, FireHydrant, Incident.io, Statuspage, Atlassian, and roll-your-own equivalents.

Related: Incident response and on-call, Alerting, SLOs and error budgets.

A typical incident-tooling stack has three layers, often from different vendors:

┌──────────────────────────────────────────────────────────┐
│  Status Page               (Statuspage, Better Stack,    │
│  External + Internal       Instatus, FireHydrant)        │
└──────────────────────────────────────────────────────────┘
        │ severity, scope, comms updates
┌──────────────────────────────────────────────────────────┐
│  Incident Management       (FireHydrant, Incident.io,    │
│  War room, timeline,       Rootly, Jeli, internal tool)  │
│  postmortem                                              │
└──────────────────────────────────────────────────────────┘
        │ page, ack, escalate
┌──────────────────────────────────────────────────────────┐
│  Paging and Schedules      (PagerDuty, OpsGenie, Grafana │
│  Rotations, escalation     OnCall, VictorOps, Squadcast) │
│  policies, overrides                                     │
└──────────────────────────────────────────────────────────┘
        │ alert
┌──────────────────────────────────────────────────────────┐
│  Alerting                  (Alertmanager, Grafana Alerts,│
│  See observability         Datadog, etc.)                │
│  → /library/observability/alerting/                      │
└──────────────────────────────────────────────────────────┘

Some platforms (FireHydrant, Incident.io, PagerDuty) cover multiple layers. The pattern matters more than the SKU.

Most paging tools model your systems as services (sometimes called components or technical services). Each service has:

| Field | Purpose |
| --- | --- |
| Name | Human-readable; shows up in pages and the status page. |
| Owning team | Who gets paged. |
| Escalation policy | Who gets paged if no one acknowledges. |
| Integrations | Which alert sources route here (Alertmanager, Datadog, custom webhook). |
| Dependencies | Other services this one relies on. Helps in noisy-incident triage. |

Keep services aligned to ownership, not technology. “Checkout API” beats “Postgres cluster 3” because Checkout has an owner; Postgres cluster 3 may serve five teams.
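
The Dependencies field is what makes noisy-incident triage tractable. As a minimal sketch (not any vendor's data model), the catalog below uses hypothetical service, team, and policy names, and `likely_origins` is an illustrative helper: it keeps only the alerting services that have no alerting direct dependency, which is a rough first guess at where to look.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """One catalog entry; the fields follow the service fields listed above."""
    name: str
    owning_team: str
    escalation_policy: str
    integrations: tuple[str, ...] = ()
    dependencies: tuple[str, ...] = ()


def likely_origins(catalog: dict[str, Service], alerting: set[str]) -> list[str]:
    """Return alerting services with no alerting direct dependency."""
    origins = []
    for name in sorted(alerting):
        deps = catalog[name].dependencies
        if not any(dep in alerting for dep in deps):
            origins.append(name)
    return origins


# Hypothetical catalog, for illustration only.
catalog = {
    "checkout-api": Service("checkout-api", "payments", "payments-default",
                            integrations=("alertmanager",),
                            dependencies=("orders-db", "auth-api")),
    "auth-api": Service("auth-api", "identity", "identity-default",
                        integrations=("datadog",)),
    "orders-db": Service("orders-db", "payments", "payments-default",
                         integrations=("alertmanager",)),
}

print(likely_origins(catalog, {"checkout-api", "orders-db"}))  # ['orders-db']
```

Only direct dependencies are checked here; a real triage view would probably walk the graph transitively.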

A schedule is a calendar of who is on-call. An escalation policy says what happens when the primary doesn’t ack.

| Pattern | Looks Like | Notes |
| --- | --- | --- |
| Weekly primary | One engineer, Mon–Sun. | Simple; tiring; recovery day after each shift. |
| Weekly primary + secondary | Primary acks; secondary backs up. | Most common; secondary often does triage assist, not full pager load. |
| Follow-the-sun | Three regions, 8 hours each. | Best for global teams; requires runbook discipline so handoffs work. |
| Daily | Daily rotation, 7 engineers. | Smaller per-shift load; harder to keep context across shifts. |

An example escalation policy:

Tier 1: Primary on-call (page, wait 5 min for ack)
Tier 2: Secondary on-call (page, wait 5 min for ack)
Tier 3: Manager / team lead (page, wait 10 min for ack)
Tier 4: Director / VP (page; no further escalation)

Document the expected time-to-ack at each tier and the acceptable interruption hours for off-shift engineers explicitly. Vague policies create awkward escalations.
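
To make the timing explicit, the four-tier policy above can be written as data. This is a sketch under assumptions: `Tier`, `POLICY`, and `paged_tiers` are illustrative names, and it assumes each tier's wait starts when that tier is paged, which real tools may handle differently.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    target: str
    ack_wait_min: Optional[int]  # None marks the final tier (no further escalation)


POLICY = [
    Tier("primary on-call", 5),
    Tier("secondary on-call", 5),
    Tier("manager / team lead", 10),
    Tier("director / VP", None),
]


def paged_tiers(policy: list[Tier], minutes_unacked: float) -> list[str]:
    """Return every target paged so far if nobody has acknowledged."""
    paged = []
    page_time = 0.0  # minute at which the current tier gets paged
    for tier in policy:
        if minutes_unacked < page_time:
            break
        paged.append(tier.target)
        if tier.ack_wait_min is None:
            break
        page_time += tier.ack_wait_min
    return paged


print(paged_tiers(POLICY, 12))
# ['primary on-call', 'secondary on-call', 'manager / team lead']
```

With these waits, an unacknowledged page reaches the director tier at minute 20, which is a useful number to state in the written policy.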

Overrides cover vacations, conferences, parental leave, and “I have a doctor’s appointment Tuesday morning.” Norms that work:

  • Plan overrides one shift in advance, not the day of.
  • Visible to the whole team, not buried in the tool.
  • Reciprocal when possible — track who covers whom.
  • Manager review if one engineer is consistently giving up shifts.
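
A schedule plus overrides reduces to a simple lookup: check overrides first, then fall back to the rotation. The sketch below assumes a weekly rotation anchored on a Monday in UTC; the engineer names, dates, and the `on_call` helper are hypothetical.

```python
from datetime import datetime, timedelta, timezone

ROTATION = ["ana", "ben", "chi"]  # hypothetical engineers, in shift order
ROTATION_START = datetime(2024, 1, 1, tzinfo=timezone.utc)  # a Monday, 00:00 UTC
SHIFT = timedelta(weeks=1)

# Each override is (start, end, covering engineer); end is exclusive.
OVERRIDES = [
    (datetime(2024, 1, 9, 8, tzinfo=timezone.utc),
     datetime(2024, 1, 9, 12, tzinfo=timezone.utc),
     "chi"),  # covers a Tuesday-morning appointment
]


def on_call(at: datetime) -> str:
    """Return who holds the pager at `at`, with overrides taking precedence."""
    for start, end, cover in OVERRIDES:
        if start <= at < end:
            return cover
    shifts_elapsed = (at - ROTATION_START) // SHIFT  # whole shifts since the anchor
    return ROTATION[shifts_elapsed % len(ROTATION)]


print(on_call(datetime(2024, 1, 9, 10, tzinfo=timezone.utc)))  # chi (override)
print(on_call(datetime(2024, 1, 9, 13, tzinfo=timezone.utc)))  # ben (regular shift)
```

Keeping overrides as plain records like this also makes the "who covers whom" tally easy to compute for the reciprocity and manager-review norms.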

A status page communicates incidents to people who can’t see your dashboards. There are two kinds, and they have different audiences and rules.

| Type | Audience | Tone | Update Cadence |
| --- | --- | --- | --- |
| External status page | Customers, integrators, public. | Calm, factual, no internal jargon. | Every 30 min during major; faster early on. |
| Internal status page | Employees in support, sales, exec, and other engineering teams. | Slightly more detail; can name teams. | Every 15–30 min during major. |

External updates usually move through four states:

  • Investigating: “We are investigating elevated error rates affecting the checkout API.”
  • Identified: “We have identified a degraded database replica as the cause and are failing over.”
  • Monitoring: “We have completed the failover. Error rates have returned to normal. We are continuing to monitor.”
  • Resolved: “All systems are operating normally. A postmortem will be published within five business days.”

What not to post: blame (“a vendor outage caused this” without details), root cause speculation, internal team names, or anything that hasn’t been confirmed.

Internal updates can include:

  • Owning team and incident commander so support knows who to ask.
  • War room link (Slack channel, video bridge).
  • Customer-impact estimate so support can size their response.
  • ETA for the next update, even when the whole message is “nothing new to report; next update in 30 minutes.”

Severity drives who is woken up and how customers are told. Define both explicitly.

| Severity | Definition | Customer Comms |
| --- | --- | --- |
| SEV1 | Major customer-facing outage; data loss risk; security incident. | External status page within 15 min; CEO and PR aware; updates every 30 min. |
| SEV2 | Significant degradation; partial outage; SLO breach in progress. | External status page within 30 min; updates every hour. |
| SEV3 | Minor degradation; small subset of customers; no SLO breach yet. | Internal status page; targeted comms to affected customers if known. |
| SEV4 | Single customer; non-urgent; no immediate fix needed. | Support ticket; no status page entry. |
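
Encoding the severity matrix as data lets tooling compute comms deadlines instead of relying on memory. The sketch below is one possible encoding: `CommsPolicy`, `COMMS`, and `first_post_deadline` are illustrative names, and measuring the deadline from the moment the incident is declared is an assumption the source does not state.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class CommsPolicy:
    channel: str
    first_post_within: Optional[timedelta]  # None = no fixed deadline
    update_every: Optional[timedelta]       # None = no fixed cadence


COMMS = {
    "SEV1": CommsPolicy("external status page", timedelta(minutes=15), timedelta(minutes=30)),
    "SEV2": CommsPolicy("external status page", timedelta(minutes=30), timedelta(hours=1)),
    "SEV3": CommsPolicy("internal status page", None, None),
    "SEV4": CommsPolicy("support ticket", None, None),
}


def first_post_deadline(severity: str, declared_at: datetime) -> Optional[datetime]:
    """Latest time for the first post, assuming the clock starts at declaration."""
    policy = COMMS[severity]
    if policy.first_post_within is None:
        return None
    return declared_at + policy.first_post_within


declared = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(first_post_deadline("SEV1", declared))  # 2024-01-01 12:15:00+00:00
```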

Templates speed up comms when the incident commander has 100 other things to think about.

Investigating

We are currently experiencing [brief description of impact, e.g.
"elevated error rates on the [service name] API"]. Our engineering
team is actively investigating.

This issue began at approximately [time, with timezone] and is
affecting [scope, e.g. "all customers", "customers in the EU
region", "checkout flows"].

We will provide an update within [time, e.g. "30 minutes"].

Identified

We have identified the cause of the issue affecting [scope].
[One-sentence description of the cause at a customer-appropriate
level of detail.]

Our team is currently [action, e.g. "rolling back the recent
deployment", "failing over to a healthy replica"].

We expect service to be restored within [estimate, or "we will
provide an updated estimate in our next post"].

Resolved

The issue affecting [scope] has been resolved as of [time]. All
systems are now operating normally.

A full postmortem will be published within [timeframe, e.g. "five
business days"]. We apologize for the disruption.

Long Incident — “No New Information” Update

We continue to investigate the issue affecting [scope]. We do not
have a new update at this time, but our team remains actively
engaged. We will post the next update by [time].

These should be vetted by communications/legal before they ever go to a status page in a real incident — but having them ready saves precious minutes.
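
Pre-approved wording is easier to protect if the placeholders are filled by code rather than by hand. As a sketch, the first template above (the one beginning "We are currently experiencing") is stored here as a standard-library `string.Template`; the placeholder names and the `render` helper are illustrative. `substitute` raises `KeyError` when a field is missing, so a half-filled message cannot be posted by accident.

```python
from string import Template

INVESTIGATING = Template(
    "We are currently experiencing $impact. Our engineering team is "
    "actively investigating.\n"
    "This issue began at approximately $started and is affecting $scope.\n"
    "We will provide an update within $next_update."
)


def render(template: Template, **fields: str) -> str:
    """Fill every placeholder; raises KeyError if any field is missing."""
    return template.substitute(**fields)


print(render(
    INVESTIGATING,
    impact="elevated error rates on the checkout API",
    started="14:05 UTC",
    scope="checkout flows",
    next_update="30 minutes",
))
```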

Incidents that last more than an hour pull in stakeholders beyond the war room: support, sales, account management, executives. Patterns that work:

  • Single stakeholder channel separate from the war room. Engineers do not have to filter exec questions while debugging.
  • Hourly summary in that channel, posted by the incident commander or a comms lead.
  • Pre-built distribution lists for SEV1: who gets paged, who gets emailed, who gets a phone call.
  • Customer escalations route to the comms lead, not into the war room directly.

A useful long-incident summary template:

Incident: [name]
Severity: [SEV1/2]
Started: [time]
Current status: [investigating / mitigating / monitoring / resolved]
Customer impact: [scope and rough magnitude]
What we are doing: [one or two bullets]
Next update: [time]
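
The same summary can be held as a small record so every hourly post has the same shape. This is only a sketch: the `IncidentSummary` class and the example values are hypothetical, and the status check simply mirrors the four options in the template above.

```python
from __future__ import annotations

from dataclasses import dataclass

STATUSES = ("investigating", "mitigating", "monitoring", "resolved")


@dataclass
class IncidentSummary:
    name: str
    severity: str
    started: str
    status: str
    customer_impact: str
    actions: list[str]
    next_update: str

    def __post_init__(self) -> None:
        # Reject free-form statuses so stakeholders always see the same vocabulary.
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")

    def render(self) -> str:
        return "\n".join([
            f"Incident: {self.name}",
            f"Severity: {self.severity}",
            f"Started: {self.started}",
            f"Current status: {self.status}",
            f"Customer impact: {self.customer_impact}",
            "What we are doing: " + "; ".join(self.actions),
            f"Next update: {self.next_update}",
        ])


summary = IncidentSummary(
    name="Checkout API errors", severity="SEV1", started="14:05 UTC",
    status="mitigating", customer_impact="checkout flows, all regions",
    actions=["failing over to a healthy replica"], next_update="16:00 UTC",
)
print(summary.render())
```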

When evaluating tooling, the questions to ask map to capabilities, not features:

| Capability | What to Ask |
| --- | --- |
| Paging | Multiple notification channels (push, SMS, voice)? Acks across channels? Escalation timers? |
| Scheduling | Override workflow? Audit trail? API for swaps? Time-zone handling? |
| Routing | Match alerts to services by label, source, or content? Quiet hours per service? |
| Incident management | War room creation? Timeline auto-recording? Postmortem templates? Linked tickets? |
| Status page | Public and internal modes? Subscribers? Component-level updates? Templated incidents? |
| Integrations | Alerting source you use? Chat platform you use? Ticketing? IdP/SSO? |
| Reliability | What is their SLA? What happens if their service is down during your incident? |

The last row matters: an outage at your incident-tooling vendor during your own outage is a real failure mode. Keep a documented fallback (manual phone tree, secondary chat).
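
If notifications are sent from your own automation, the fallback can be an ordered list of channels tried in turn. The sketch below is deliberately generic: `notify_with_fallback` and the two stand-in senders are hypothetical and call no real vendor API.

```python
from __future__ import annotations

from typing import Callable, Sequence

Notifier = Callable[[str], None]


def notify_with_fallback(message: str, channels: Sequence[tuple[str, Notifier]]) -> str:
    """Try each channel in order; return the name of the first that succeeds."""
    errors = []
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:  # sketch only; real code would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all channels failed: " + "; ".join(errors))


# Stand-in senders: the primary simulates a vendor outage.
def vendor_pager(message: str) -> None:
    raise ConnectionError("paging vendor unreachable")


delivered: list[str] = []


def phone_tree(message: str) -> None:
    delivered.append(message)


used = notify_with_fallback(
    "SEV1: checkout API down",
    [("vendor pager", vendor_pager), ("phone tree", phone_tree)],
)
print(used)  # phone tree
```

Code like this does not replace the written fallback; the phone tree still has to exist on paper for the case where the automation is down too.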

  • Every production-paging service has a named owning team and escalation policy.
  • Severity definitions are written down, agreed across engineering and support, and tied to customer-comms expectations.
  • On-call schedules are visible to the whole team; overrides are planned, not emergency.
  • External status page templates exist and have been reviewed by comms/legal before any real incident.
  • Long incidents have a separate stakeholder channel from the war room.
  • You have a documented fallback for when the incident-tooling vendor itself is degraded.