Agile for SRE and Platform Work
Most Agile guidance is written for product teams shipping features. SRE and platform teams have different traffic: on-call interrupts, toil, incidents, and long-lived infrastructure projects all compete for the same engineers. This page covers what to keep, what to drop, and what to invent when applying Agile to that mix.
“Agile” here is the lowercase practice — short iterations, written commitments, regular retros — not a particular vendor framework.
Related: QA and reliability guide, CI/CD best practices, Incident response and on-call, Leadership and mentoring.
Why Vanilla Scrum Hurts SRE Teams
Vanilla Scrum assumes the team can commit to a fixed scope for two weeks. That assumption breaks when:
- The on-call rotation absorbs an unpredictable amount of time per sprint.
- An incident can vaporize three days of planned work.
- Cross-team requests arrive constantly and many are legitimately urgent.
- Long-running infrastructure projects (cluster upgrades, migrations) span months and don’t fit a sprint.
Without adjustment, the team feels constantly behind, retros become “we got interrupted again,” and planning becomes theater.
Three Workable Patterns
| Pattern | When It Fits |
|---|---|
| Kanban with WIP limits | High interrupt rate; work arrives unpredictably; team is small. |
| Scrum with explicit interrupt capacity | Medium interrupt rate; team values cadence and demos. |
| Split rotation: project shift + interrupt shift | Larger team; can dedicate engineers to “shield” project work. |
The wrong question is “Scrum or Kanban?” The right question is “how do we make commitments that survive contact with on-call?”
Pattern 1 — Kanban With WIP Limits

    Backlog ──► Ready ──► In Progress (WIP: 3) ──► Review ──► Done
                  ▲            │
                  │            └── Interrupts get a separate swimlane
                  │                with its own WIP limit (e.g. 1)
                  └── Refilled weekly from prioritized backlog

| Element | Purpose |
|---|---|
| WIP limit on In Progress | Prevents starting more than the team can finish; surfaces blockers. |
| Separate interrupt swimlane | Makes interrupt cost visible; if it constantly hits its WIP limit, the rotation is overloaded. |
| Weekly refill, not sprint | Replaces “sprint commitment” with steady prioritization. |
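
The WIP rules in this pattern can be sketched in a few lines of Python. This is a minimal illustration, not a real board integration: the `Lane` class, its `blocked_pulls` counter, and the ticket names are all hypothetical, and the limits (3 for planned work, 1 for interrupts) are the example values from the diagram.

```python
from dataclasses import dataclass, field


@dataclass
class Lane:
    """One swimlane with its own In Progress WIP limit (hypothetical model)."""
    name: str
    wip_limit: int
    in_progress: list[str] = field(default_factory=list)
    blocked_pulls: int = 0  # how often a pull was refused at the limit

    def pull(self, item: str) -> bool:
        """Start work on an item only if the lane is under its WIP limit."""
        if len(self.in_progress) >= self.wip_limit:
            self.blocked_pulls += 1
            return False
        self.in_progress.append(item)
        return True

    def finish(self, item: str) -> None:
        """Free a WIP slot when an item moves on to Review."""
        self.in_progress.remove(item)


# Example limits: 3 for planned work, 1 for the interrupt swimlane.
project = Lane("project", wip_limit=3)
interrupts = Lane("interrupts", wip_limit=1)

for ticket in ["upgrade-cluster", "rotate-certs", "tune-alerts", "migrate-db"]:
    project.pull(ticket)  # the fourth pull is refused

interrupts.pull("dns-question")
interrupts.pull("access-request")  # refused: the interrupt lane is full

print(project.in_progress, project.blocked_pulls)
print(interrupts.in_progress, interrupts.blocked_pulls)
```

A steadily rising `blocked_pulls` count on the interrupt lane is one way to make the "rotation is overloaded" signal visible; the right threshold is a team decision.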
Pattern 2 — Scrum With Explicit Interrupt Capacity
If you keep sprints, plan with three budgets:
| Budget | Typical Allocation |
|---|---|
| Project work | 50–70% of team capacity |
| Interrupts and toil | 20–40% of team capacity |
| Slack (learning, on-call recovery, retros) | 10% |
Treat the interrupt budget like real capacity. If the sprint’s on-call engineer is expected to spend the whole sprint on interrupts, do not also commit them to project work. This is the most common failure mode.
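
As a rough sketch of the arithmetic, the split can be computed from total person-days. The function names, the 60/30/10 default shares (picked from inside the ranges in the table), and the 50 person-day team are assumptions for illustration only.

```python
def split_capacity(person_days: float, project: float = 0.6,
                   interrupts: float = 0.3, slack: float = 0.1) -> dict[str, float]:
    """Split sprint capacity into the three budgets; shares must sum to 1."""
    shares = {"project": project, "interrupts": interrupts, "slack": slack}
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("budget shares must sum to 100%")
    return {name: round(person_days * share, 1) for name, share in shares.items()}


def on_call_overcommitted(project_days: dict[str, float], on_call: str) -> bool:
    """Flag the failure mode above: project days assigned to the on-call engineer."""
    return project_days.get(on_call, 0) > 0


# Hypothetical team: 5 engineers x 10 working days = 50 person-days.
budgets = split_capacity(50)
print(budgets)  # {'project': 30.0, 'interrupts': 15.0, 'slack': 5.0}

# Engineer "dana" is on call this sprint but still has 4 project days planned.
print(on_call_overcommitted({"dana": 4, "eli": 8}, on_call="dana"))  # True
```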
Pattern 3 — Split Rotation
| Engineer | Week 1 | Week 2 | Week 3 | Week 4 |
|---|---|---|---|---|
| Engineer A | PRJ | ON | PRJ | PRJ |
| Engineer B | ON | PRJ | PRJ | ON |
| Engineer C | PRJ | PRJ | ON | PRJ |

Engineers rotate between “on” (interrupt + on-call) and “project” shifts. Project work is only committed to engineers in project shift. This is the cleanest way to make Scrum-style commitments work for SRE.
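
A schedule like the one above can be generated rather than maintained by hand. The sketch below assumes exactly one engineer is on shift per week and cycles through a fixed order; the function name and the single-letter engineer labels are illustrative.

```python
def split_rotation(engineers: list[str], on_call_order: list[str],
                   weeks: int) -> list[dict[str, str]]:
    """Return one {engineer: "ON" | "PRJ"} mapping per week.

    Assumes exactly one engineer is on the interrupt shift each week,
    cycling through on_call_order.
    """
    schedule = []
    for week in range(weeks):
        on_call = on_call_order[week % len(on_call_order)]
        schedule.append({e: ("ON" if e == on_call else "PRJ") for e in engineers})
    return schedule


# Reproduces the four-week example: B, then A, then C, then B again.
schedule = split_rotation(["A", "B", "C"], on_call_order=["B", "A", "C"], weeks=4)
for number, week in enumerate(schedule, start=1):
    committable = [e for e, shift in week.items() if shift == "PRJ"]
    print(f"Week {number}: {week} -> commit project work to {committable}")
```

Only the engineers listed as committable in a given week should carry sprint commitments for that week.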
Sprint Commitments With On-Call Reality
A few rules that prevent commitment theater:
- Subtract on-call capacity from sprint capacity before planning, every sprint.
- Do not assign the on-call engineer to time-sensitive project work in the same sprint.
- Carry-over is normal when an incident hits; re-estimate the remaining work, don’t blame.
- Post-incident days count as recovery, not slack. The engineer who ran a major incident should not be expected to also ship a feature that day.
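
The first and last rules come down to per-engineer subtraction before planning. A minimal sketch, assuming capacity is tracked in whole days; the `EngineerSprint` class and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EngineerSprint:
    """One engineer's availability for a single sprint (hypothetical model)."""
    name: str
    working_days: int
    on_call_days: int = 0
    recovery_days: int = 0  # post-incident recovery, not counted as slack

    @property
    def plannable_days(self) -> int:
        """Days that can carry project commitments; never negative."""
        return max(0, self.working_days - self.on_call_days - self.recovery_days)


team = [
    EngineerSprint("A", working_days=10, on_call_days=5, recovery_days=1),
    EngineerSprint("B", working_days=10),
    EngineerSprint("C", working_days=8),  # two days of leave
]

sprint_capacity = sum(e.plannable_days for e in team)
print(sprint_capacity)  # 4 + 10 + 8 = 22
```

Recovery days are usually only known after an incident, so in practice this number is recomputed mid-sprint and the carry-over re-estimated from it.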
Toil Budgets
Toil — manual, repetitive, automatable, value-neutral work — silently consumes a team. Cap it explicitly.
| Step | What It Looks Like |
|---|---|
| Measure | Categorize tickets and on-call work as “toil” vs “project” vs “incident response.” Track per sprint. |
| Cap | Common SRE convention: keep toil below 50% of team time. Above 50%, automate or reduce intake. |
| Convert | Each toil category gets a project to automate or eliminate it; the toil time funds the project. |
| Re-measure | Repeat. Toil is a moving target; new toil appears as systems grow. |
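
The Measure and Cap steps can be sketched as a small calculation over categorized time entries. The category labels follow the table; the hours, the `time_shares` helper, and the idea of logging time as `(category, hours)` pairs are assumptions for illustration.

```python
from collections import Counter

TOIL_CAP = 0.5  # the 50% convention from the table


def time_shares(entries: list[tuple[str, float]]) -> dict[str, float]:
    """Return each category's share of total logged hours."""
    hours: Counter = Counter()
    for category, spent in entries:
        hours[category] += spent
    total = sum(hours.values())
    return {c: h / total for c, h in hours.items()} if total else {}


# Hypothetical sprint log: (category, hours).
sprint_log = [
    ("toil", 30.0),
    ("project", 50.0),
    ("incident", 20.0),
    ("toil", 25.0),
]

shares = time_shares(sprint_log)
over_cap = shares.get("toil", 0.0) > TOIL_CAP
print(f"toil ratio: {shares['toil']:.0%}, over cap: {over_cap}")
```

Running the same calculation every sprint gives the Re-measure step a consistent number to compare.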
Toil reduction projects deserve named owners and roadmap slots, not a perpetual “we’ll get to it” status.
See also QA reliability guide §4 for how reliability validation work fits with toil budgets.
Ceremonies That Help vs Ceremony Theater
| Ceremony | Helps When | Becomes Theater When |
|---|---|---|
| Daily standup | 5–10 minutes; surfaces blockers and on-call status. | 30+ minutes; status reporting to manager; people zone out. |
| Sprint planning | Discusses tradeoffs and capacity honestly. | Mechanical backlog grooming; everyone agrees too quickly. |
| Retro | Concrete actions with owners; trends across sprints. | Same complaints every sprint; no actions tracked. |
| Demo / review | Shows running infrastructure improvements; cross-team learning. | Slideware of “things we did”; no one in the audience. |
| Backlog refinement | Sharpens upcoming items; sizes against actual capacity. | Endless re-prioritization without committing to anything. |
A useful retro pattern beyond “what went well / what didn’t”: trend tracking. Bring the toil ratio, incident count, carry-over %, and MTTR to retro. Retro on the trend, not the anecdotes.
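
One lightweight way to bring trends rather than anecdotes is to compare the most recent sprints with the ones before them. The sketch below is only one possible heuristic: the three-sprint window, the 5% tolerance, and every number in `metrics` are made-up values for illustration.

```python
from statistics import mean


def direction(series: list[float], window: int = 3, tolerance: float = 0.05) -> str:
    """Compare the mean of the latest `window` sprints with the `window` before it."""
    if len(series) < 2 * window:
        return "not enough data"
    recent = mean(series[-window:])
    previous = mean(series[-2 * window:-window])
    if previous == 0:
        return "up" if recent > 0 else "flat"
    change = (recent - previous) / previous
    if change > tolerance:
        return "up"
    if change < -tolerance:
        return "down"
    return "flat"


# Made-up sample data: one value per sprint, oldest first.
metrics = {
    "toil_ratio": [0.38, 0.41, 0.40, 0.44, 0.47, 0.49],
    "carry_over_pct": [20, 25, 15, 20, 22, 18],
    "mttr_minutes": [62, 58, 65, 50, 45, 41],
}

report = {name: direction(values) for name, values in metrics.items()}
print(report)
```

With these sample numbers the toil ratio reads as rising, carry-over as flat, and MTTR as falling, which gives the retro a starting point other than the most recent incident.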
Definition of Done for Infrastructure Tasks
A change is not done at “merged.” A reusable Definition of Done for platform/SRE work:
- Code or config merged with at least one review (per CI/CD compliance).
- Tests added or updated (where applicable).
- Deployed to all target environments — not just staging.
- Observability: metrics emitted, dashboard updated, alert rule reviewed (see Alerting).
- Runbook added or updated for the new failure modes (see Incident response).
- Rollback path documented and verified.
- Change communicated to the affected partner teams.
A change shipped without an updated runbook just turned itself into future toil for the on-call rotation.
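
If the Definition of Done lives in a ticket template or a CI check, the list can be encoded directly. This is a hypothetical sketch: the item keys and the `missing_items` helper are invented here and do not correspond to any particular tracker's API.

```python
# Item keys mirror the Definition of Done list; the names are illustrative.
DEFINITION_OF_DONE = (
    "merged_with_review",
    "tests_updated",
    "deployed_to_all_environments",
    "observability_reviewed",
    "runbook_updated",
    "rollback_verified",
    "partners_notified",
)


def missing_items(change: dict[str, bool],
                  not_applicable: frozenset[str] = frozenset()) -> list[str]:
    """Return the Definition of Done items a change has not yet satisfied."""
    return [
        item for item in DEFINITION_OF_DONE
        if item not in not_applicable and not change.get(item, False)
    ]


change = {
    "merged_with_review": True,
    "deployed_to_all_environments": True,
    "observability_reviewed": True,
    "rollback_verified": True,
    "partners_notified": True,
}

# Tests are marked "where applicable" in the list, so they can be waived.
print(missing_items(change, not_applicable=frozenset({"tests_updated"})))
# ['runbook_updated']
```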
Pairing With Product Teams
Platform teams serve product teams. Patterns that prevent the platform from becoming a bottleneck:
- Office hours — predictable window where any product engineer can ask questions; reduces ad-hoc DMs.
- Templates and self-service — golden paths that cover 80% of needs without a ticket. See CI/CD best practices.
- Embedded rotation — a platform engineer joins a product team for a sprint; learns their pain, ships what’s needed, returns.
- Shared SLOs — when both teams own a number, prioritization arguments shrink.
- Joint on-call for the shared dependency during launches; both teams have skin in the game.
Anti-Patterns to Watch
- “We do Agile” without measuring anything — no toil ratio, no carry-over %, no retro trends. Cargo-cult ceremonies.
- Sprint commitments that ignore on-call — guarantees a demoralized team and missed commitments.
- Endless backlog refinement — sharpening tickets that will never be picked up.
- Hero culture — one engineer carries the rotation; nobody else learns; they burn out.
- No project work — 100% interrupt mode is unsustainable; the team eventually loses senior engineers.
Checklist
- On-call capacity is subtracted from sprint capacity before planning.
- The team measures and reports a toil ratio at least monthly.
- Retros bring trends (toil, MTTR, carry-over) — not just anecdotes.
- Definition of Done includes observability, runbook, and rollback.
- Platform team has at least one self-service path that does not require a ticket.
- Engineers rotate between project and on-call work; nobody runs interrupts indefinitely.
Related
- Leadership and mentoring — how senior engineers protect mentoring and project time
- Incident response and on-call — sustainable rotations and post-incident recovery
- QA reliability guide — the wider reliability practice
- CI/CD best practices — self-service and platform guardrails