Incident Tooling and Customer Communications
Incident response patterns are covered in Incident response and on-call — roles, first minutes, blameless postmortems. This page covers the tooling and communication side: how on-call schedules, escalation policies, status pages, and customer comms templates fit together, with concrete examples from common vendors.
Pattern-first, vendor-second. The concepts apply to PagerDuty, OpsGenie, FireHydrant, Incident.io, Atlassian Statuspage, and roll-your-own equivalents.
Related: Incident response and on-call, Alerting, SLOs and error budgets.
The Tooling Stack
A typical incident-tooling stack has three layers on top of alerting, often from different vendors:

    ┌──────────────────────────────────────────────────────────┐
    │ Status Page              (Statuspage, Better Stack,      │
    │ External + Internal       Instatus, FireHydrant)         │
    └──────────────────────────────────────────────────────────┘
                                 ▲
                                 │ severity, scope, comms updates
                                 │
    ┌──────────────────────────────────────────────────────────┐
    │ Incident Management      (FireHydrant, Incident.io,      │
    │ War room, timeline,       Rootly, Jeli, internal tool)   │
    │ postmortem                                               │
    └──────────────────────────────────────────────────────────┘
                                 ▲
                                 │ page, ack, escalate
                                 │
    ┌──────────────────────────────────────────────────────────┐
    │ Paging and Schedules     (PagerDuty, OpsGenie, Grafana   │
    │ Rotations, escalation     OnCall, VictorOps, Squadcast)  │
    │ policies, overrides                                      │
    └──────────────────────────────────────────────────────────┘
                                 ▲
                                 │ alert
                                 │
    ┌──────────────────────────────────────────────────────────┐
    │ Alerting                 (Alertmanager, Grafana Alerts,  │
    │ See observability         Datadog, etc.)                 │
    │ → /library/observability/alerting/                       │
    └──────────────────────────────────────────────────────────┘

Some platforms (FireHydrant, Incident.io, PagerDuty) cover multiple layers. The pattern matters more than the SKU.
Services and Components
Most paging tools model your systems as services (sometimes called components or technical services). Each service has:
| Field | Purpose |
|---|---|
| Name | Human-readable; shows up in pages and the status page. |
| Owning team | Who gets paged. |
| Escalation policy | Who gets paged if no one acknowledges. |
| Integrations | Which alert sources route here (Alertmanager, Datadog, custom webhook). |
| Dependencies | Other services this one relies on. Helps in noisy-incident triage. |
Keep services aligned to ownership, not technology. “Checkout API” beats “Postgres cluster 3” because Checkout has an owner; Postgres cluster 3 may serve five teams.
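The service fields above map naturally onto a small record type. The following is a minimal Python sketch, not any vendor's schema: the `Service` class, its field names, and the `catalog_problems` helper are hypothetical names chosen for illustration. It flags services with no owner, no escalation policy, or a dependency that is not in the catalog.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """One entry in a hypothetical service catalog (field names are illustrative)."""
    name: str
    owning_team: str = ""
    escalation_policy: str = ""
    integrations: list[str] = field(default_factory=list)
    dependencies: list[str] = field(default_factory=list)


def catalog_problems(services: list[Service]) -> list[str]:
    """Return readable problems: missing owners, missing policies, unknown dependencies."""
    known = {s.name for s in services}
    problems = []
    for s in services:
        if not s.owning_team:
            problems.append(f"{s.name}: no owning team")
        if not s.escalation_policy:
            problems.append(f"{s.name}: no escalation policy")
        for dep in s.dependencies:
            if dep not in known:
                problems.append(f"{s.name}: unknown dependency {dep!r}")
    return problems


# Example data is made up for illustration.
catalog = [
    Service("Checkout API", "payments", "payments-default",
            integrations=["alertmanager"], dependencies=["Orders DB"]),
    Service("Orders DB"),  # shared database with no owner recorded yet
]

for problem in catalog_problems(catalog):
    print(problem)
```

A check like this could run in CI against whatever file or API holds your service definitions, so an unowned service is caught before it pages anyone.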
Schedules and Rotations
A schedule is a calendar of who is on-call. An escalation policy says what happens when the primary doesn’t ack.
Schedule Patterns
| Pattern | Looks Like | Notes |
|---|---|---|
| Weekly primary | One engineer, Mon–Sun. | Simple but tiring; plan a recovery day after each shift. |
| Weekly primary + secondary | Primary acks; secondary backs up. | Most common; secondary often does triage assist, not full pager load. |
| Follow-the-sun | Three regions, 8 hours each. | Best for global teams; requires runbook discipline so handoffs work. |
| Daily | Daily rotation, 7 engineers. | Smaller per-shift load; harder to keep context across shifts. |
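To make the "weekly primary + secondary" row concrete, here is a small Python sketch that computes who holds each role on a given date. It assumes 7-day shifts counted from a rotation start date and that the secondary is simply the next engineer in the list. Both are assumptions for illustration, and `weekly_on_call` is a hypothetical helper, not a paging-tool API.

```python
from datetime import date


def weekly_on_call(engineers: list[str], rotation_start: date, day: date) -> tuple[str, str]:
    """Return (primary, secondary) for the 7-day shift that contains `day`.

    Assumes shifts start on `rotation_start` and that the secondary is the
    next engineer in the list, so this week's secondary is next week's primary.
    """
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    week = (day - rotation_start).days // 7
    primary = engineers[week % len(engineers)]
    secondary = engineers[(week + 1) % len(engineers)]
    return primary, secondary


team = ["ana", "bo", "chen"]      # made-up names
start = date(2024, 1, 1)          # a Monday

print(weekly_on_call(team, start, date(2024, 1, 3)))
print(weekly_on_call(team, start, date(2024, 1, 17)))
```

Making this week's secondary next week's primary is one way to carry context across the handoff. A real schedule would also need time zones and handoff hours, which this sketch ignores.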
Escalation Policy Example

    Tier 1: Primary on-call       (page, wait 5 min for ack)
    Tier 2: Secondary on-call     (page, wait 5 min for ack)
    Tier 3: Manager / team lead   (page, wait 10 min for ack)
    Tier 4: Director / VP         (page; no further escalation)

Document the expected time-to-ack at each tier and the acceptable interruption hours for off-shift engineers explicitly. Vague policies create awkward escalations.
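The four-tier policy can be expressed as data plus a small function that answers "who has been paged by now?". This is a Python sketch under stated assumptions: the `Tier` type and `paged_so_far` function are hypothetical, and escalation is assumed to fire exactly when a tier's ack window expires.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    target: str
    ack_wait_min: Optional[int]  # None marks the last tier: no further escalation


POLICY = [
    Tier("Primary on-call", 5),
    Tier("Secondary on-call", 5),
    Tier("Manager / team lead", 10),
    Tier("Director / VP", None),
]


def paged_so_far(policy: list[Tier], minutes_unacked: float) -> list[str]:
    """List the targets paged after `minutes_unacked` minutes without an ack."""
    paged = []
    elapsed = 0.0  # minutes at which the current tier gets paged
    for tier in policy:
        if minutes_unacked < elapsed:
            break
        paged.append(tier.target)
        if tier.ack_wait_min is None:
            break
        elapsed += tier.ack_wait_min
    return paged


print(paged_so_far(POLICY, 0))
print(paged_so_far(POLICY, 12))
```

With these timers, an unacknowledged page reaches the Director / VP tier 20 minutes after the first page. Writing the policy as data makes that worst case easy to compute and review.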
Overrides
Overrides cover vacations, conferences, parental leave, and “I have a doctor’s appointment Tuesday morning.” Norms that work:
- Plan overrides one shift in advance, not the day of.
- Visible to the whole team, not buried in the tool.
- Reciprocal when possible — track who covers whom.
- Manager review if one engineer is consistently giving up shifts.
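One way to model overrides is as time-bounded records layered over the base schedule. The Python sketch below is illustrative only: `Override`, `effective_on_call`, and `coverage_ledger` are hypothetical names, and the first matching override is assumed to win. The ledger counts who covered whom, which supports the reciprocity and manager-review norms above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Override:
    covering: str     # who takes the pager
    covered: str      # who is being relieved
    start: datetime   # inclusive
    end: datetime     # exclusive


def effective_on_call(scheduled: str, overrides: list[Override], at: datetime) -> str:
    """Return who actually holds the pager at `at`; the first matching override wins."""
    for o in overrides:
        if o.covered == scheduled and o.start <= at < o.end:
            return o.covering
    return scheduled


def coverage_ledger(overrides: list[Override]) -> dict[tuple[str, str], int]:
    """Count overrides per (covering, covered) pair to see whether swaps are reciprocal."""
    ledger: dict[tuple[str, str], int] = {}
    for o in overrides:
        key = (o.covering, o.covered)
        ledger[key] = ledger.get(key, 0) + 1
    return ledger


# Made-up example: bo covers ana for a Tuesday-morning appointment.
overrides = [Override("bo", "ana", datetime(2024, 1, 2, 8), datetime(2024, 1, 2, 12))]
print(effective_on_call("ana", overrides, datetime(2024, 1, 2, 9)))
print(coverage_ledger(overrides))
```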
Status Pages
A status page communicates incidents to people who can’t see your dashboards. There are two kinds, and they have different audiences and rules.
| Type | Audience | Tone | Update Cadence |
|---|---|---|---|
| External status page | Customers, integrators, public. | Calm, factual, no internal jargon. | Every 30 min during a major incident; faster early on. |
| Internal status page | Employees in support, sales, exec, and other engineering teams. | Slightly more detail; can name teams. | Every 15–30 min during a major incident. |
External Status Page — What to Post
- Investigating — “We are investigating elevated error rates affecting the checkout API.”
- Identified — “We have identified a degraded database replica as the cause and are failing over.”
- Monitoring — “We have completed the failover. Error rates have returned to normal. We are continuing to monitor.”
- Resolved — “All systems are operating normally. A postmortem will be published within five business days.”
What not to post: blame (“a vendor outage caused this” without details), root cause speculation, internal team names, or anything that hasn’t been confirmed.
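The four public states form a small state machine. The Python sketch below encodes them with an explicit transition table. The allowed transitions are an assumption for illustration (forward progress, plus a step back to Investigating when a fix does not hold), and `Status`, `ALLOWED`, and `advance` are hypothetical names rather than any status-page vendor's API.

```python
from enum import Enum


class Status(Enum):
    INVESTIGATING = "Investigating"
    IDENTIFIED = "Identified"
    MONITORING = "Monitoring"
    RESOLVED = "Resolved"


# Assumed transitions: move forward, or fall back to Investigating when a
# suspected cause or fix turns out to be wrong. Adjust to your own process.
ALLOWED = {
    Status.INVESTIGATING: {Status.IDENTIFIED, Status.MONITORING},
    Status.IDENTIFIED: {Status.MONITORING, Status.INVESTIGATING},
    Status.MONITORING: {Status.RESOLVED, Status.INVESTIGATING},
    Status.RESOLVED: set(),
}


def advance(current: Status, new: Status) -> Status:
    """Return `new` if the transition is allowed; otherwise raise ValueError."""
    if new not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {new.value}")
    return new


state = Status.INVESTIGATING
state = advance(state, Status.IDENTIFIED)
state = advance(state, Status.MONITORING)
print(state.value)
```

Rejecting a jump straight from Investigating to Resolved nudges responders to post a Monitoring update first, which matches the sequence of example messages above.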
Internal Status Page — What to Add
Internal updates can include:
- Owning team and incident commander so support knows who to ask.
- War room link (Slack channel, video bridge).
- Customer-impact estimate so support can size their response.
- ETA for next update, even if “next update in 30 minutes; nothing new to report” is the message.
Severity-Driven Customer Communications
Severity drives who is woken up and how customers are told. Define both explicitly.
| Severity | Definition | Customer Comms |
|---|---|---|
| SEV1 | Major customer-facing outage; data loss risk; security incident. | External status page within 15 min; CEO and PR aware; updates every 30 min. |
| SEV2 | Significant degradation; partial outage; SLO breach in progress. | External status page within 30 min; updates every hour. |
| SEV3 | Minor degradation; small subset of customers; no SLO breach yet. | Internal status page; targeted comms to affected customers if known. |
| SEV4 | Single customer; non-urgent; no immediate fix needed. | Support ticket; no status page entry. |
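The comms column of the severity table can be stored as data so tooling can compute when the next post is due. This is a hedged Python sketch: `CommsPolicy`, `COMMS`, and `next_post_due` are hypothetical names. The table gives no deadlines for SEV3 and SEV4, so the sketch represents those as `None` rather than inventing values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class CommsPolicy:
    channel: str
    first_post_within: Optional[timedelta]  # None: the table sets no deadline
    update_every: Optional[timedelta]       # None: the table sets no cadence


COMMS = {
    "SEV1": CommsPolicy("external status page", timedelta(minutes=15), timedelta(minutes=30)),
    "SEV2": CommsPolicy("external status page", timedelta(minutes=30), timedelta(hours=1)),
    "SEV3": CommsPolicy("internal status page", None, None),
    "SEV4": CommsPolicy("support ticket", None, None),
}


def next_post_due(severity: str, declared_at: datetime,
                  last_post: Optional[datetime] = None) -> Optional[datetime]:
    """Return when the next status update is due, or None if no deadline applies."""
    policy = COMMS[severity]
    if last_post is None:
        wait, start = policy.first_post_within, declared_at
    else:
        wait, start = policy.update_every, last_post
    return None if wait is None else start + wait


declared = datetime(2024, 1, 1, 12, 0)
print(next_post_due("SEV1", declared))
print(next_post_due("SEV2", declared, last_post=datetime(2024, 1, 1, 12, 20)))
```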
Customer Comms Templates
Templates speed up comms when the incident commander has 100 other things to think about.
Initial Public Update (Severity 1 or 2)

    We are currently experiencing [brief description of impact, e.g.
    "elevated error rates on the [service name] API"]. Our engineering
    team is actively investigating.

    This issue began at approximately [time, with timezone] and is
    affecting [scope, e.g. "all customers", "customers in the EU
    region", "checkout flows"].

    We will provide an update within [time, e.g. "30 minutes"].

Identified-Cause Update

    We have identified the cause of the issue affecting [scope].
    [One-sentence description of the cause at a customer-appropriate
    level of detail.]

    Our team is currently [action, e.g. "rolling back the recent
    deployment", "failing over to a healthy replica"].

    We expect service to be restored within [estimate, or "we will
    provide an updated estimate in our next post"].

Resolved Update

    The issue affecting [scope] has been resolved as of [time]. All
    systems are now operating normally.

    A full postmortem will be published within [timeframe, e.g. "five
    business days"]. We apologize for the disruption.

Long Incident — “No New Information” Update

    We continue to investigate the issue affecting [scope]. We do not
    have a new update at this time, but our team remains actively
    engaged. We will post the next update by [time].

These should be vetted by communications/legal before they ever go to a status page in a real incident, but having them ready saves precious minutes.
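Bracketed templates are easy to fill by hand and just as easy to post with a placeholder left in. The Python sketch below renders a simplified version of the initial update and refuses to return text that still contains an unfilled field. The short placeholder names (`[impact]`, `[start_time]`, `[scope]`, `[next_update]`) and the `render` helper are assumptions made for this example. The templates above use longer descriptive placeholders.

```python
import re

# Simplified initial update with short, machine-friendly placeholder names.
INITIAL_UPDATE = (
    "We are currently experiencing [impact]. Our engineering team is "
    "actively investigating.\n\n"
    "This issue began at approximately [start_time] and is affecting [scope].\n\n"
    "We will provide an update within [next_update]."
)

PLACEHOLDER = re.compile(r"\[([a-z_]+)\]")


def render(template: str, values: dict[str, str]) -> str:
    """Fill [placeholders]; raise ValueError if any placeholder has no value."""
    missing = sorted(set(PLACEHOLDER.findall(template)) - set(values))
    if missing:
        raise ValueError("unfilled placeholders: " + ", ".join(missing))
    # A function replacement avoids backslash-escape surprises in the values.
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], template)


print(render(INITIAL_UPDATE, {
    "impact": "elevated error rates on the checkout API",
    "start_time": "14:05 UTC",
    "scope": "checkout flows",
    "next_update": "30 minutes",
}))
```

Failing loudly on a missing value is a deliberate choice: a delayed post is usually less embarrassing than a public update that still reads "[scope]".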
Stakeholder Updates During Long Incidents
Incidents that last more than an hour pull in stakeholders beyond the war room: support, sales, account management, executives. Patterns that work:
- Single stakeholder channel separate from the war room. Engineers do not have to filter exec questions while debugging.
- Hourly summary in that channel, posted by the incident commander or a comms lead.
- Pre-built distribution lists for SEV1: who gets paged, who gets emailed, who gets a phone call.
- Customer escalations route to the comms lead, not into the war room directly.
A useful long-incident summary template:

    Incident: [name]
    Severity: [SEV1/2]
    Started: [time]
    Current status: [investigating / mitigating / monitoring / resolved]
    Customer impact: [scope and rough magnitude]
    What we are doing: [one or two bullets]
    Next update: [time]

Vendor-Agnostic Capability Map
When evaluating tooling, the questions to ask map to capabilities, not features:
| Capability | What to Ask |
|---|---|
| Paging | Multiple notification channels (push, SMS, voice)? Acks across channels? Escalation timers? |
| Scheduling | Override workflow? Audit trail? API for swaps? Time-zone handling? |
| Routing | Match alerts to services by label, source, or content? Quiet hours per service? |
| Incident management | War room creation? Timeline auto-recording? Postmortem templates? Linked tickets? |
| Status page | Public and internal modes? Subscribers? Component-level updates? Templated incidents? |
| Integrations | Alerting source you use? Chat platform you use? Ticketing? IdP/SSO? |
| Reliability | What is their SLA? What happens if their service is down during your incident? |
The last row matters: an incident-tooling vendor outage during your own outage is a real failure mode. Have a documented fallback (manual phone tree, secondary chat).
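A fallback is easier to trust when it is an ordered list that code (or a human following a runbook) walks top to bottom. The following Python sketch shows the idea with stub senders. `notify_with_fallback` and the channel names are hypothetical. Real senders would call your paging vendor, a secondary chat tool, or prompt someone to start the phone tree.

```python
from typing import Callable

Notifier = Callable[[str], None]


def notify_with_fallback(message: str, channels: list[tuple[str, Notifier]]) -> str:
    """Try each channel in order and return the name of the first that succeeds."""
    errors = []
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:  # any failure means: try the next channel
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all channels failed: " + "; ".join(errors))


# Stub senders for illustration only.
def paging_vendor(message: str) -> None:
    raise ConnectionError("vendor API unreachable")


delivered: list[str] = []


def secondary_chat(message: str) -> None:
    delivered.append(message)


used = notify_with_fallback(
    "SEV1 declared: checkout API errors",
    [("paging vendor", paging_vendor), ("secondary chat", secondary_chat)],
)
print(used)
```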
Checklist
- Every production-paging service has a named owning team and escalation policy.
- Severity definitions are written down, agreed across engineering and support, and tied to customer-comms expectations.
- On-call schedules are visible to the whole team; overrides are planned, not emergency.
- External status page templates exist and have been reviewed by comms/legal before any real incident.
- Long incidents have a separate stakeholder channel from the war room.
- You have a documented fallback for when the incident-tooling vendor itself is degraded.
Related
- Incident response and on-call — roles, first minutes, blameless postmortems
- Alerting — symptom-based alerts, severity, and noise control upstream of paging
- SLOs and error budgets — what makes an alert SEV1 vs SEV3
- QA reliability guide — the wider reliability practice