QA and reliability — a guide for SRE engineers
This guide is meant to be read and applied on the job: principles, checklists, and pointers into this site’s CI/CD, Observability, Kubernetes, cloud, and AIOps content. It is not a substitute for your organization’s policies, regulated industries’ requirements, or hands-on training on specific test frameworks and facilities (e.g., a particular vendor tool or device lab).
1. Scope and principles
Learning outcomes: You can explain quality vs reliability, who typically owns what, and where this guide applies.
| Term | Practical meaning |
|---|---|
| Quality | Does the system meet requirements and behave correctly for users—often validated before and after release (tests, acceptance). |
| Reliability | Does the system stay correct under real-world conditions—errors, load, dependency failures, deploys, and time (SLOs, incidents, capacity). |
Ownership (typical patterns, not universal): product engineering builds features; QA may own test strategy and some automation; SRE / platform may own production SLOs, incident response, and shared infrastructure; everyone owns learning from incidents.
Strategy and influence: Priorities come from risk (user impact, compliance) and cost (people, infrastructure). Useful artifacts: a roadmap for reliability work, SLOs and error budgets so tradeoffs are numeric, and a published service readiness checklist so “production” means the same thing across teams.
When this guide fits: Distributed, often cloud-hosted services with CI/CD, observability, and cross-team coordination. Tighter physical or safety-critical systems may need additional standards not covered here.
Checklist — clarity before you standardize
- Written definitions of severity for incidents and priority for defects
- Agreed release criteria (what must pass before prod)
- Named owners for test automation, production monitoring, and postmortems
2. Test strategy and automation
Learning outcomes: You can outline a test pyramid, CI stages, and a flaky-test policy; you know where to read more in this library.
- Pyramid: many fast unit tests, fewer integration tests, still fewer end-to-end tests—E2E is valuable but slow and brittle if overused.
- CI stages: run cheap tests first; gate promotions to staging/prod with stronger suites. See Pipeline fundamentals.
- Flakiness: detect (history in CI), quarantine with ownership and a fix deadline, avoid silently retrying without fixing root cause.
- Environments: document gaps vs production (data volume, feature flags, dependencies); risky assumptions should be explicit in release notes.
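The flaky-test policy above (detect, quarantine, fix) can be made mechanical. A minimal sketch, assuming CI history is available as simple pass/fail records; the thresholds and labels are illustrative, not a standard:

```python
from collections import defaultdict

def classify_tests(history, min_runs=10, flaky_threshold=0.05):
    """Triage tests from CI history.

    history: iterable of (test_name, passed) records.
    A test is 'flaky' when it both passes and fails across recent runs
    and its failure rate meets the (illustrative) threshold.
    """
    runs = defaultdict(list)
    for name, passed in history:
        runs[name].append(passed)

    report = {}
    for name, results in runs.items():
        if len(results) < min_runs:
            report[name] = "insufficient-data"
            continue
        fail_rate = results.count(False) / len(results)
        if fail_rate == 0:
            report[name] = "stable"
        elif fail_rate == 1:
            report[name] = "consistently-failing"  # broken, not flaky
        elif fail_rate >= flaky_threshold:
            report[name] = "flaky"  # quarantine with an owner and a fix deadline
        else:
            report[name] = "mostly-stable"
    return report
```

Feeding this from your CI system’s test-report data gives the “detect” step; quarantine ownership and fix deadlines remain process decisions.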
Go deeper: CI/CD, GitOps, Deployment strategies.
Honest gap: This library does not catalog every test case management or device farm tool. Use your org’s toolchain; the ideas above still apply.
Optional exercise: Sketch your current CI as stages (build → unit → integration → E2E → deploy) and mark where you would add a quality gate.
3. Reliability validation
Learning outcomes: You can contrast load, stress, and soak tests; describe failure injection and game days at a high level.
| Activity | Focus |
|---|---|
| Load | Behavior at expected or peak traffic |
| Stress | Behavior beyond expected capacity (degradation, backpressure) |
| Soak | Stability over long duration (leaks, drift) |
Failure scenarios: dependency timeouts, partial outages, bad deploys, AZ impairment where applicable. Keep blast radius small in test (namespaces, canaries). See Kubernetes production patterns and Deployment strategies.
Observability during tests: use Distributed tracing and metrics to validate latency and error budgets—not just “HTTP 200.”
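To make “validate latency and error budgets, not just HTTP 200” concrete, load-test results can be reduced to SLI checks. A minimal sketch; the nearest-rank percentile method and the default targets are illustrative assumptions:

```python
import math

def check_slis(samples, p95_target_ms=300.0, error_rate_target=0.01):
    """samples: list of (latency_ms, ok) pairs from a test run.

    Returns the measured SLIs and whether each meets its target.
    """
    latencies = sorted(ms for ms, _ in samples)
    # nearest-rank p95: the smallest value covering 95% of samples
    rank = min(math.ceil(0.95 * len(latencies)), len(latencies))
    p95 = latencies[rank - 1]
    error_rate = sum(1 for _, ok in samples if not ok) / len(samples)
    return {
        "p95_ms": p95,
        "p95_ok": p95 <= p95_target_ms,
        "error_rate": error_rate,
        "error_rate_ok": error_rate <= error_rate_target,
    }
```

In practice the samples would come from your tracing or metrics backend rather than the load generator alone, so server-side queueing and dependency effects are included.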
Optional exercise: List three failure modes for a service you run; for each, note one metric and one log or trace signal you would check.
4. Production monitoring and incident learning
Learning outcomes: You can tie metrics, logs, and traces to alerting and post-incident improvement.
- Three pillars: metrics, logs, traces—see Observability overview.
- SLOs: define SLIs, SLOs, and error budgets so alerting and postmortems reference the same bar.
- Alerting: alert on symptoms that matter to users or SLOs; reduce noise. See Alerting.
- Incidents and on-call: roles, escalation, and blameless reviews—see Incident response and on-call.
- RCA and postmortems: timeline, contributing factors, action items with owners; focus on systems, not blame.
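The error-budget arithmetic behind “the same bar” is small enough to write down. A sketch for a ratio SLI, assuming a simple count of good vs total events in the window; burn rate 1.0 means spending exactly the budget over the window:

```python
def error_budget_status(slo, good_events, total_events):
    """slo: target success ratio, e.g. 0.999 for 'three nines'."""
    allowed_bad = (1 - slo) * total_events   # budget, in events
    bad = total_events - good_events
    burn_rate = (bad / total_events) / (1 - slo)
    return {
        "allowed_bad": allowed_bad,
        "bad": bad,
        "remaining": allowed_bad - bad,
        "burn_rate": burn_rate,  # >1 means the budget runs out before the window ends
    }
```

Burn-rate alerting over fast and slow windows builds directly on this ratio; see the SLOs and Alerting pages.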
Go deeper: Prometheus, Grafana, OpenTelemetry. For network-leaning incidents, add VPC Flow Logs, Network RCA, and Packet capture once metrics and application logs have narrowed the window.
Checklist — incident response (outline)
- Severity and comms channel agreed
- Mitigation first, then root cause
- Postmortem scheduled; actions tracked to completion
5. AI and automation (safely)
Learning outcomes: You can name safe uses of AI in QA/reliability (drafting, summarization, retrieval) vs risky uses (unreviewed production changes).
AI can assist with test ideas, log summarization, runbook retrieval, and postmortem drafts. It should not bypass review for anything that changes production state or security posture.
Use retrieval (RAG) over internal docs and evaluate outputs before widening automation. See AIOps, RAG for Incident Operations, Evaluating LLM Outputs.
Checklist — before AI touches production-adjacent workflows
- Human review path defined
- Auditability (what was retrieved, what was suggested)
- Rollback or stop condition if suggestions are wrong
6. Cloud and platform reality
Learning outcomes: You can relate Kubernetes, IaC, and multi-AZ ideas to reliability work.
- Kubernetes: deployments, probes, resource limits—Kubernetes section.
- AWS EKS (example): EKS overview, Terraform cluster.
- IaC: Terraform, Ansible for repeatability and reviewable change.
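A readiness probe that reflects real dependencies, as referenced under probes above, can be sketched as a check aggregator. The dependency names here are hypothetical, and the callables would wrap real client pings:

```python
def readiness(checks):
    """checks: dict of dependency name -> zero-arg callable returning
    truthy when healthy. A probe handler would map the verdict to
    HTTP 200/503; a raised exception counts as unhealthy.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return all(results.values()), results
```

Be deliberate about which dependencies gate readiness: failing readiness on a soft dependency can turn a partial outage into a full one.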
Optional exercise: List every external dependency your service needs to be “healthy” and how you test each dependency’s failure mode.
7. Domain lens — energy / grid–style systems
Learning outcomes: You can name types of constraints (operational windows, compliance, time-series rigor) and questions to ask domain experts—without claiming OT expertise this library does not provide.
Many energy and grid-adjacent systems combine software with operational and regulatory constraints: change control, evidence for audits, seasonal or market-driven peaks, and high cost of downtime.
Non-prescriptive considerations
- Time-series integrity — bad telemetry can look like a software bug; validate clocks, gaps, and unit consistency.
- Change windows — coordinated maintenance may limit deploy times.
- Safety and separation — IT patterns may not map 1:1 to field or control networks; follow your org’s architecture.
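The time-series integrity point can be checked mechanically before blaming software. A minimal sketch over raw timestamps; the interval and tolerance values are illustrative:

```python
def telemetry_issues(timestamps, expected_interval_s, tolerance=0.5):
    """Flag gaps and clock regressions in a telemetry series.

    timestamps: unix timestamps in arrival order. A gap is a step larger
    than the expected interval plus the tolerance fraction; a regression
    is time moving backwards (clock skew, duplicate ingestion, resets).
    """
    limit = expected_interval_s * (1 + tolerance)
    pairs = list(zip(timestamps, timestamps[1:]))
    return {
        "gaps": [(a, b) for a, b in pairs if b - a > limit],
        "clock_regressions": [(a, b) for a, b in pairs if b < a],
    }
```

Unit consistency (kW vs MW, °C vs °F) needs schema-level checks rather than timestamp math; treat it as a separate validation.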
If you lack domain history: partner with operators and compliance early; document assumptions.
8. Documentation and continuous improvement
Learning outcomes: You can list what to publish for cross-team visibility and metrics to improve over time.
Publish
- Test strategy and scope (what is automated vs manual)
- Release and rollback expectations
- Known limitations and risk accepted for a release
- Postmortem actions and verification
Metrics (examples)
- Escaped defects, MTTR, incident recurrence
- Flaky test rate and time-to-fix
- Change failure rate and deployment frequency (if you track DORA-style metrics)
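If you track DORA-style metrics, two of the examples above fall directly out of deploy records. A sketch, assuming each deploy is tagged with whether it caused a production failure (the tagging itself is the hard part):

```python
def dora_snapshot(deploys, window_days):
    """deploys: list of (deploy_id, caused_failure) within the window."""
    if not deploys:
        return {"deploys_per_day": 0.0, "change_failure_rate": 0.0}
    failures = sum(1 for _, caused_failure in deploys if caused_failure)
    return {
        "deploys_per_day": len(deploys) / window_days,
        "change_failure_rate": failures / len(deploys),
    }
```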
Recommended order (library)
- Pipeline fundamentals
- Observability pillars, SLOs and error budgets, and Alerting
- Kubernetes production patterns and Service readiness checklist
- Incident response and on-call
- AIOps for safe AI-assisted workflows
Back to the QA section
QA overview — how this section fits the rest of the library.