Agile for SRE and Platform Work

First published Apr 29, 2026, by Atif Alam

Most Agile guidance is written for product teams shipping features. SRE and platform teams have different traffic: on-call interrupts, toil, incidents, and long-lived infrastructure projects all compete for the same engineers. This page covers what to keep, what to drop, and what to invent when applying Agile to that mix.

“Agile” here is the lowercase practice (short iterations, written commitments, regular retros), not a particular vendor framework.

Why Vanilla Scrum Hurts SRE Teams

Vanilla Scrum assumes the team can commit to a fixed scope for two weeks. That assumption breaks when:

The on-call rotation absorbs an unpredictable amount of time per sprint.
An incident can vaporize three days of planned work.
Cross-team requests arrive constantly and many are legitimately urgent.
Long-running infrastructure projects (cluster upgrades, migrations) span months and don’t fit a sprint.

Without adjustment, the team feels constantly behind, retros become “we got interrupted again,” and planning becomes theater.

Three Workable Patterns

Pattern	When It Fits
Kanban with WIP limits	High interrupt rate; work arrives unpredictably; team is small.
Scrum with explicit interrupt capacity	Medium interrupt rate; team values cadence and demos.
Split rotation: project shift + interrupt shift	Larger team; can dedicate engineers to “shield” project work.

The wrong question is “Scrum or Kanban?” The right question is “how do we make commitments that survive contact with on-call?”

Pattern 1 — Kanban With WIP Limits

Backlog ──► Ready ──► In Progress (WIP: 3) ──► Review ──► Done
              ▲             │
              │             └── Interrupts get a separate swimlane
              │                  with its own WIP limit (e.g. 1)
              └── Refilled weekly from prioritized backlog

Element	Purpose
WIP limit on In Progress	Prevents starting more than the team can finish; surfaces blockers.
Separate interrupt swimlane	Makes interrupt cost visible; if it constantly hits its WIP limit, the rotation is overloaded.
Weekly refill, not sprint	Replaces “sprint commitment” with steady prioritization.
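
As a rough illustration of how a WIP limit behaves, here is a minimal Python sketch. The `Lane` class and `WipLimitExceeded` exception are hypothetical names invented for this example, not part of any tracking tool; the limits of 3 and 1 come from the diagram above.

```python
from dataclasses import dataclass, field


class WipLimitExceeded(Exception):
    """Raised when a lane is already at its WIP limit (hypothetical name)."""


@dataclass
class Lane:
    name: str
    wip_limit: int
    items: list[str] = field(default_factory=list)

    def pull(self, item: str) -> None:
        # Refuse new work instead of silently exceeding the limit,
        # so the overload becomes visible to the team.
        if len(self.items) >= self.wip_limit:
            raise WipLimitExceeded(
                f"{self.name} is at its WIP limit ({self.wip_limit})"
            )
        self.items.append(item)

    def finish(self, item: str) -> None:
        # Finishing an item frees a slot for the next pull.
        self.items.remove(item)


# Assumed limits, taken from the diagram: 3 for planned work, 1 for interrupts.
in_progress = Lane("In Progress", wip_limit=3)
interrupts = Lane("Interrupts", wip_limit=1)
```

If `interrupts.pull(...)` keeps raising, that is the signal described in the table: the rotation is overloaded, and the fix is a staffing or intake conversation rather than a higher limit.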

Pattern 2 — Scrum With Explicit Interrupt Capacity

If you keep sprints, plan with three budgets:

Budget	Typical Allocation
Project work	50–70% of team capacity
Interrupts and toil	20–40% of team capacity
Slack (learning, on-call recovery, retros)	10%

Treat the interrupt budget like real capacity. If the sprint’s on-call engineer spends 100% of their time on interrupts, do not also commit them to project work. Overcommitting the on-call engineer is the most common failure mode.
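
One way to make the budgets concrete is to compute project days per engineer before planning. The sketch below is illustrative only: the function name `project_days`, the roster format, and the sample percentages are assumptions, and the 10% slack default is taken from the table above.

```python
def project_days(
    roster: dict[str, int], sprint_days: int, slack_pct: int = 10
) -> dict[str, float]:
    """Return committable project days per engineer.

    roster maps engineer name -> expected interrupt share in percent (0..100).
    This is a hypothetical planning helper, not a standard formula.
    """
    plan = {}
    for name, interrupt_pct in roster.items():
        # An engineer at 100% interrupts ends up with zero project days.
        available_pct = max(0, 100 - interrupt_pct - slack_pct)
        plan[name] = sprint_days * available_pct / 100
    return plan


# Example roster (made-up values): "ana" is on call for the whole sprint.
plan = project_days({"ana": 100, "bo": 20, "cy": 20, "dee": 20}, sprint_days=10)
team_project_days = sum(plan.values())
```

Planning against `team_project_days` rather than headcount times sprint length keeps the on-call engineer out of the commitment by construction.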

Pattern 3 — Split Rotation

Week 1            Week 2            Week 3            Week 4
───────────       ───────────       ───────────       ───────────
Engineer A: PRJ   Engineer A: ON    Engineer A: PRJ   Engineer A: PRJ
Engineer B: ON    Engineer B: PRJ   Engineer B: PRJ   Engineer B: ON
Engineer C: PRJ   Engineer C: PRJ   Engineer C: ON    Engineer C: PRJ

Engineers rotate between “on” (interrupt + on-call) and “project” shifts. Project work is only committed to engineers in project shift. This is the cleanest way to make Scrum-style commitments work for SRE.
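
A schedule like the one above can be generated rather than maintained by hand. This is a minimal sketch assuming a plain round-robin with exactly one engineer on the interrupt shift per week; the function name `split_rotation` is hypothetical, and a real rotation would also need to handle vacations and swaps.

```python
def split_rotation(engineers: list[str], weeks: int) -> list[dict[str, str]]:
    """Return one {engineer: shift} mapping per week.

    Exactly one engineer is "ON" (interrupts + on-call) each week;
    everyone else is "PRJ" (project shift). Round-robin order is an
    assumption made for this sketch.
    """
    schedule = []
    for week in range(weeks):
        on_call = engineers[week % len(engineers)]
        schedule.append(
            {name: ("ON" if name == on_call else "PRJ") for name in engineers}
        )
    return schedule


schedule = split_rotation(["A", "B", "C"], weeks=4)
# Project work for a given week is committed only to the "PRJ" engineers.
project_shift_week_1 = [n for n, shift in schedule[0].items() if shift == "PRJ"]
```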

Sprint Commitments With On-Call Reality

A few rules that prevent commitment theater:

Subtract on-call capacity from sprint capacity before planning, every sprint.
Do not assign the on-call engineer to time-sensitive project work in the same sprint.
Carry-over is normal when an incident hits; re-estimate, don’t blame.
Post-incident days count as recovery, not slack. The engineer who ran a major incident should not be expected to also ship a feature that day.

Toil Budgets

Toil — manual, repetitive, automatable, value-neutral work — silently consumes a team. Cap it explicitly.

Step	What It Looks Like
Measure	Categorize tickets and on-call work as “toil” vs “project” vs “incident response.” Track per sprint.
Cap	Common SRE convention: keep toil below 50% of team time. Above 50%, automate or reduce intake.
Convert	Each toil category gets a project to automate or eliminate it; the toil time funds the project.
Re-measure	Repeat. Toil is a moving target; new toil appears as systems grow.

Toil reduction projects deserve named owners and roadmap slots, not a perpetual “we’ll get to it” status.
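
The Measure and Cap steps can be sketched in a few lines. The entry format, category names, and function names below are assumptions for illustration; the 50% cap is the convention from the table above.

```python
from collections import defaultdict

TOIL_CAP = 0.5  # the 50% convention cited in the table


def toil_ratio(entries: list[tuple[str, float]]) -> float:
    """Return the share of logged hours categorized as toil.

    entries is a list of (category, hours) pairs, where category is one of
    "toil", "project", or "incident" (hypothetical labels for this sketch).
    """
    totals: dict[str, float] = defaultdict(float)
    for category, hours in entries:
        totals[category] += hours
    total = sum(totals.values())
    return totals["toil"] / total if total else 0.0


def over_cap(entries: list[tuple[str, float]], cap: float = TOIL_CAP) -> bool:
    # True means: automate or reduce intake before taking on more work.
    return toil_ratio(entries) > cap


# Made-up sample data for one sprint.
sprint_log = [("toil", 60.0), ("project", 30.0), ("incident", 10.0)]
```

Running the same calculation every sprint gives the Re-measure step a number to compare against, instead of a general sense that toil is growing.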

See also QA reliability guide §4 for how reliability validation work fits with toil budgets.

Ceremonies That Help vs Ceremony Theater

Ceremony	Helps When	Becomes Theater When
Daily standup	5–10 minutes; surfaces blockers and on-call status.	30+ minutes; status reporting to manager; people zone out.
Sprint planning	Discusses tradeoffs and capacity honestly.	Mechanical backlog grooming; everyone agrees too quickly.
Retro	Concrete actions with owners; trends across sprints.	Same complaints every sprint; no actions tracked.
Demo / review	Shows running infrastructure improvements; cross-team learning.	Slideware of “things we did”; no one in the audience.
Backlog refinement	Sharpens upcoming items; sizes against actual capacity.	Endless re-prioritization without committing to anything.

A useful retro pattern beyond “what went well / what didn’t”: trend tracking. Bring the toil ratio, incident count, carry-over %, and MTTR to retro. Retro on the trend, not the anecdotes.
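
To retro on a trend, someone has to compute it first. The sketch below handles carry-over % only; the `SprintStats` fields and the rule of comparing the latest sprint against the mean of earlier ones are assumptions, and the same `trend` helper could be fed toil ratio, incident count, or MTTR.

```python
from dataclasses import dataclass


@dataclass
class SprintStats:
    name: str
    committed: int     # items committed at planning
    carried_over: int  # items not finished by sprint end


def carry_over_pct(stats: SprintStats) -> float:
    if stats.committed == 0:
        return 0.0
    return 100 * stats.carried_over / stats.committed


def trend(values: list[float]) -> str:
    """Compare the latest value with the mean of the earlier ones."""
    if len(values) < 2:
        return "flat"
    earlier = values[:-1]
    baseline = sum(earlier) / len(earlier)
    if values[-1] > baseline:
        return "up"
    if values[-1] < baseline:
        return "down"
    return "flat"


# Made-up history for three sprints.
history = [
    SprintStats("sprint-1", committed=10, carried_over=1),
    SprintStats("sprint-2", committed=10, carried_over=2),
    SprintStats("sprint-3", committed=10, carried_over=4),
]
carry_over_trend = trend([carry_over_pct(s) for s in history])
```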

Definition of Done for Infrastructure Tasks

A change is not done at “merged.” A reusable Definition of Done for platform/SRE work:

Code or config merged with at least one review (per CI/CD compliance).
Tests added or updated (where applicable).
Deployed to all target environments — not just staging.
Observability: metrics emitted, dashboard updated, alert rule reviewed (see Alerting).
Runbook added or updated for the new failure modes (see Incident response).
Rollback path documented and verified.
Change communicated to the affected partner teams.

A change shipped without an updated runbook just turned itself into future toil for the on-call rotation.
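
The Definition of Done above can also be encoded as a checklist that a script or review bot evaluates. This is a sketch under assumptions: the criterion keys and function names are invented here, and a criterion that does not apply to a change (for example, tests) is assumed to be recorded as `True`.

```python
# One key per item in the Definition of Done list (hypothetical key names).
DEFINITION_OF_DONE = (
    "merged_with_review",
    "tests_updated",
    "deployed_all_environments",
    "observability_updated",
    "runbook_updated",
    "rollback_verified",
    "partners_notified",
)


def missing_done_criteria(change: dict[str, bool]) -> list[str]:
    """Return the criteria a change has not yet met; absent keys count as unmet."""
    return [key for key in DEFINITION_OF_DONE if not change.get(key, False)]


def is_done(change: dict[str, bool]) -> bool:
    return not missing_done_criteria(change)


# Example: merged and deployed, but the runbook and rollback are outstanding.
change = {key: True for key in DEFINITION_OF_DONE}
change["runbook_updated"] = False
change["rollback_verified"] = False
outstanding = missing_done_criteria(change)
```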

Pairing With Product Teams

Platform teams serve product teams. Patterns that prevent the platform from becoming a bottleneck:

Office hours — predictable window where any product engineer can ask questions; reduces ad-hoc DMs.
Templates and self-service — golden paths that cover 80% of needs without a ticket. See CI/CD best practices.
Embedded rotation — a platform engineer joins a product team for a sprint; learns their pain, ships what’s needed, returns.
Shared SLOs — when both teams own a number, prioritization arguments shrink.
Joint on-call for the shared dependency during launches; both teams have skin in the game.

Anti-Patterns to Watch

“We do Agile” without measuring anything — no toil ratio, no carry-over %, no retro trends. Cargo-cult ceremonies.
Sprint commitments that ignore on-call — guarantees a demoralized team and missed commitments.
Endless backlog refinement — sharpening tickets that will never be picked up.
Hero culture — one engineer carries the rotation; nobody else learns; they burn out.
No project work — 100% interrupt mode is unsustainable; the team eventually loses senior engineers.

Checklist

On-call capacity is subtracted from sprint capacity before planning.
The team measures and reports a toil ratio at least monthly.
Retros bring trends (toil, MTTR, carry-over) — not just anecdotes.
Definition of Done includes observability, runbook, and rollback.
Platform team has at least one self-service path that does not require a ticket.
Engineers rotate between project and on-call work; nobody runs interrupts indefinitely.

Leadership and mentoring — how senior engineers protect mentoring and project time
Incident response and on-call — sustainable rotations and post-incident recovery
QA reliability guide — the wider reliability practice
CI/CD best practices — self-service and platform guardrails