QA Overview
This section holds practical guides for engineers who own or share responsibility for quality and reliability in cloud-native, distributed systems, for example platforms serving grid, energy, or other operational workloads where outages are costly and change must be defensible.
The library does not replace formal QA certification or vendor-specific test tools; it connects reliability practices to the rest of the topics here (CI/CD, observability, Kubernetes, cloud, AIOps).
Main guide
QA and reliability: a guide for SRE engineers — structured chapters with learning outcomes, checklists, optional exercises, and “go deeper” links across the library.
Focused guides
- Service readiness checklist — observability, Kubernetes, and CI/CD gates before production.
- Incident response and on-call — command, comms, escalation, postmortems, sustainable rotations.
Suggested reading order
- Read the main guide start to finish, or jump to the chapter that matches your current initiative (e.g. test strategy vs incident learning).
- Deepen foundations as needed:
  - CI/CD — pipelines, deployment strategies, GitOps
  - Observability — metrics, logs, traces, SLOs, alerting
  - Kubernetes — workloads, production patterns, EKS on AWS if relevant
  - Incident response and service readiness when you own production
  - AIOps — AI-assisted triage, RAG, evaluation guardrails
- Return to the guide’s documentation and continuous improvement chapter when you are ready to publish standards for your team.
Related sections
| Topic | Where to go |
|---|---|
| Pipelines and release safety | CI/CD, Pipeline fundamentals |
| Production signals | Observability, Alerting |
| AI in operations | AIOps |
| Cloud platforms | AWS, Azure |