QA Overview
This section holds practical guides for engineers who own or share responsibility for quality and reliability in cloud-native, distributed systems. Examples include platforms serving grid, energy, or other operational workloads, where outages are costly and every change must be defensible.
The library does not replace formal QA certification or vendor-specific test tools; it connects reliability practices to the rest of the topics here (CI/CD, observability, Kubernetes, cloud, AIOps).
Main guide
QA and reliability: a guide for SRE engineers, organized into structured chapters with learning outcomes, checklists, optional exercises, and “go deeper” links across the library.
Focused guides
- Service readiness checklist: observability, Kubernetes, and CI/CD gates before production.
- Incident response and on-call: command, comms, escalation, postmortems, sustainable rotations.
Related Practices
Engineering practices that surround reliability work (leadership, Agile for platform teams, and incident tooling) live in the Practices section:
- Leadership and mentoring — mentoring structures, feedback patterns, technical judgment, roadmap influence, and cross-team prioritization.
- Agile for SRE and platform work — sprint commitments alongside on-call, toil budgets, and Definition of Done for infrastructure.
- Incident tooling and customer comms — paging schedules, status pages, and severity-driven customer comms templates.
Suggested reading order
- Read the main guide start to finish, or jump to the chapter that matches your current initiative (e.g. test strategy vs incident learning).
- Deepen foundations as needed:
  - CI/CD: pipelines, deployment strategies, GitOps
  - Observability: metrics, logs, traces, SLOs, alerting
  - Kubernetes: workloads, production patterns, EKS on AWS if relevant
  - Incident response and service readiness when you own production
  - AIOps: AI-assisted triage, RAG, evaluation guardrails
- Return to the guide’s documentation and continuous improvement chapter when you are ready to publish standards for your team.
Related sections
| Topic | Where to go |
|---|---|
| Pipelines and release safety | CI/CD, Pipeline fundamentals |
| Production signals | Observability, Alerting |
| AI in operations | AIOps |
| Cloud platforms | AWS, Azure |