# CI/CD Best Practices
Good CI/CD isn’t just about having a pipeline — it’s about having a pipeline that is fast, reliable, secure, and maintainable. This page covers the practices that separate a basic pipeline from a production-grade one.
## Self-service and platform guardrails

Platform teams often expose self-service pipelines or templates so product teams can ship without a ticket for every change. That only works with guardrails: approved base images, mandatory scans, environment promotion rules, and observability hooks. The goal is safe autonomy: speed with defaults that prevent repeated mistakes. See Pipeline fundamentals for stages and secrets; pair with Kubernetes production patterns and service readiness for what “done” means before production.
## Pipeline Design

### Fail Fast

Order stages so the quickest checks run first. If linting takes 10 seconds and e2e tests take 10 minutes, run linting first:
```text
lint (10s) ──► unit tests (60s) ──► integration tests (3m) ──► e2e tests (10m) ──► deploy
     │               │                      │                        │
     └───────────────┴──────────────────────┴────────────────────────┘
                     If any step fails, the pipeline stops here
```

### Parallelize
Run independent jobs simultaneously:

```text
             ┌── lint (10s)
             │
build (30s) ─┼── unit tests (60s)      Total: 30s + 60s = 90s
             │                         (not 30s + 10s + 60s + 30s = 130s)
             └── security scan (30s)
```

### Keep Pipelines Under 10 Minutes
Fast feedback is the core value of CI. If the pipeline takes 30+ minutes, developers stop waiting for it and context-switch.
| Technique | Impact |
|---|---|
| Cache dependencies | Save 30-60s per run (npm, pip, Go modules) |
| Parallelize tests | Divide test time by roughly N (number of parallel jobs) |
| Use faster runners | Larger VMs = faster builds |
| Skip unnecessary work | Path filtering for monorepos |
| Use incremental builds | Only recompile changed modules |
| Split test suites | Run unit tests in CI, e2e tests on merge to main only |
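Dependency caching, the first row above, is often a one-line change. A minimal sketch in GitHub Actions syntax, assuming a Node project with a `package-lock.json` (the equivalent exists for pip, Go modules, etc.):

```yaml
# Sketch of a job using setup-node's built-in npm cache,
# keyed on the lockfile — typically saves 30-60s per run.
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-node@v4
    with:
      node-version: '20'
      cache: 'npm'       # restore/save the npm download cache automatically
  - run: npm ci          # install from lockfile (cache-accelerated)
```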
## Pipeline Stages

A recommended stage ordering:
| Stage | What | When to Run |
|---|---|---|
| Lint / format | Code style, formatting | Every push and PR |
| Build | Compile, install deps, create artifact | Every push and PR |
| Unit tests | Fast, isolated tests | Every push and PR |
| Integration tests | Tests with real dependencies (DB, API) | Every push and PR (or on merge) |
| Security scan | SAST, dependency vulnerabilities, container scan | Every push and PR |
| E2E tests | Full system tests | On merge to main (or nightly) |
| Deploy staging | Deploy to staging, smoke test | On merge to main |
| Deploy production | Manual approval, deploy, monitor | After staging validation |
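The table above maps naturally onto job dependencies. A skeleton in GitHub Actions syntax — job names, commands, and the deploy script are illustrative, not prescriptive:

```yaml
# Illustrative workflow skeleton: fast checks gate slower ones via `needs`
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run lint
  unit-tests:
    needs: lint                             # runs only if lint passes
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm test
  deploy-staging:
    needs: unit-tests
    if: github.ref == 'refs/heads/main'     # only on merge to main
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh staging            # hypothetical deploy script
```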
## Security

### No Secrets in Code

- Bad: `AWS_SECRET_KEY = "AKIA..."` hardcoded in pipeline YAML or source code
- Good: Use the CI/CD platform's encrypted secret store
- Best: Use OIDC — no stored credentials at all

### OIDC Over Long-Lived Credentials
OIDC (OpenID Connect) lets the pipeline request a short-lived token from the cloud provider — no access keys to store, rotate, or leak:
| Platform | OIDC Support |
|---|---|
| GitHub Actions | `permissions: id-token: write` + cloud provider trust |
| GitLab CI | `id_tokens` keyword |
| Azure Pipelines | Workload Identity Federation |
See GitHub Actions OIDC and GitLab CI OIDC for setup.
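As an example, the AWS exchange on GitHub Actions looks roughly like this — the role ARN is a placeholder, and it assumes a trust policy has already been configured in AWS for this repository:

```yaml
permissions:
  id-token: write    # allow the job to request an OIDC token
  contents: read

steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789012:role/ci-deploy-role  # placeholder ARN
      aws-region: us-east-1
  # Subsequent steps run with short-lived credentials;
  # nothing is stored in the repository's secrets.
  - run: aws sts get-caller-identity
```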
### Least-Privilege Permissions

- GitHub Actions: Set `permissions` in the workflow to restrict `GITHUB_TOKEN` scope.
- GitLab CI: Use `protected` and `masked` variables, scoped to environments.
- Cloud roles: Grant only the permissions the pipeline needs (e.g. push to ECR, deploy to ECS — not full admin).
### Pin Dependencies and Actions

```yaml
# Bad: uses latest (could change without notice)
- uses: actions/checkout@main

# Good: pin to a version tag
- uses: actions/checkout@v4

# Best: pin to a full commit SHA (immutable)
- uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11
```

### Supply Chain Security
| Practice | What It Does |
|---|---|
| Pin action/image versions | Prevent unexpected changes from upstream |
| Dependency scanning | Detect known vulnerabilities in packages |
| Container scanning | Scan Docker images for CVEs |
| SBOM generation | Create a Software Bill of Materials for each build |
| Signed artifacts | Sign container images (cosign, Notary) to prove provenance |
| Dependabot / Renovate | Auto-update dependencies with PRs |
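The Dependabot row is enabled with a small config file committed to the repo. A minimal sketch for an npm project (ecosystems and schedule are illustrative choices):

```yaml
# .github/dependabot.yml — open weekly PRs for outdated dependencies
version: 2
updates:
  - package-ecosystem: "npm"
    directory: "/"
    schedule:
      interval: "weekly"
  - package-ecosystem: "github-actions"   # also keep pinned actions fresh
    directory: "/"
    schedule:
      interval: "weekly"
```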
## Testing Strategy in CI

### The Test Pyramid

```text
              ┌─────────┐
             / E2E Tests \         Slow, expensive, fragile
            / (few: ~10)  \        Run on merge to main
           /───────────────\
          /Integration Tests\      Medium speed
         /     (~50-100)     \     Run on every PR
        /─────────────────────\
       /      Unit Tests       \   Fast, cheap, reliable
      /      (~500-1000+)       \  Run on every push
     /───────────────────────────\
```

| Level | What It Tests | Speed | Run When |
|---|---|---|---|
| Unit | Individual functions/classes in isolation | Milliseconds | Every push |
| Integration | Components working together (DB, API) | Seconds | Every PR |
| E2E | Full user flows through the UI or API | Minutes | Merge to main, nightly |
### Test Splitting

For large test suites, split tests across parallel runners:

```yaml
# GitHub Actions — run tests in parallel shards
strategy:
  matrix:
    shard: [1, 2, 3, 4]
steps:
  - run: npm test -- --shard=${{ matrix.shard }}/4
```

### Flaky Test Management
Flaky tests (tests that sometimes pass, sometimes fail) erode confidence in the pipeline:
| Strategy | What It Does |
|---|---|
| Quarantine | Move flaky tests to a separate job (non-blocking) |
| Retry | Retry failed tests once (but track flake rate) |
| Track metrics | Dashboard of flaky tests — fix or delete them |
| No new flakes | Require new tests to pass 10 consecutive runs before merging |
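Quarantining can be as simple as a separate, non-blocking job. A GitHub Actions sketch — the `tests/quarantine` location is a convention you would have to adopt, not a standard:

```yaml
quarantined-tests:
  runs-on: ubuntu-latest
  continue-on-error: true    # failures here are reported but never block the pipeline
  steps:
    - uses: actions/checkout@v4
    - run: npm test -- tests/quarantine   # hypothetical quarantine directory
```

Keep the quarantine visible (dashboards, a tracking issue per test) so it stays a holding pen, not a graveyard.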
## Branch Strategies

### Trunk-Based Development (Recommended)

```text
main ───●───●───●───●───●───●───●───●──►  (always deployable)
         \ /         \ /
        feat-A      feat-B      (short-lived, 1-2 days)
```

- Everyone commits to `main` (or very short-lived feature branches).
- Feature flags hide incomplete work.
- CI runs on every push; CD deploys `main` continuously.
Best for: Teams with good test coverage and feature flags. Fastest feedback loop.
### GitHub Flow

```text
main ───●───────●───────●───────●──►  (protected, always deployable)
         \     /  \     /
         feat-A    feat-B
      (PR + review) (PR + review)
```

- Create a feature branch from `main`.
- Open a PR, get review, merge.
- `main` is always deployable.
Best for: Most teams. Simple, well-understood, works with GitHub PRs.
### GitLab Flow

```text
main ──────●──────●──────●──────●───►  (development)
            \              \
             ▼              ▼
staging ─────●──────────────●───────►  (staging environment)
              \              \
               ▼              ▼
production ────●──────────────●─────►  (production environment)
```

- `main` for development; `staging` and `production` branches for deployment.
- Merge from `main` → `staging` → `production`.
Best for: Teams that need environment branches and explicit promotion.
### Comparison

| Strategy | Branches | Merge Frequency | Complexity | Best For |
|---|---|---|---|---|
| Trunk-based | main only (+ short feature) | Multiple times/day | Low | High-performing teams |
| GitHub Flow | main + feature branches | Daily to weekly | Low | Most teams |
| GitLab Flow | main + env branches | Weekly | Medium | Teams needing env promotion |
| Git Flow | main + develop + feature + release + hotfix | Weekly to monthly | High | Versioned software (avoid if possible) |
## Monorepo CI

For repositories containing multiple services/packages:

### Path Filtering

Only run pipelines for the service that changed:

```yaml
# GitHub Actions
on:
  push:
    paths:
      - 'services/api/**'
      - 'shared/**'   # Also rebuild if shared code changes
```

```yaml
# GitLab CI
api-tests:
  rules:
    - changes:
        - services/api/**
        - shared/**
```

### Affected-Only Builds
Tools like Nx (JavaScript), Turborepo, Bazel, or Pants understand the dependency graph and only build/test what was affected:

```shell
# Nx: only test projects affected by changes since main
npx nx affected --target=test --base=origin/main
```

### Monorepo Best Practices
| Practice | Why |
|---|---|
| Path filters | Don’t rebuild everything on every change |
| Shared base image | Pre-built Docker image with common deps |
| Dependency graph tool | Only build/test affected packages |
| Separate deploy jobs per service | Don’t deploy the API when only the frontend changed |
| Cache aggressively | Share caches across services where possible |
## Pipeline as Code

Treat pipeline definitions like application code:
| Practice | What It Means |
|---|---|
| Version controlled | Pipeline YAML lives in the same repo as the code |
| Code reviewed | Pipeline changes go through PR review |
| Tested | Use `act` (GitHub Actions) or `gitlab-ci-lint` to validate locally |
| DRY | Reusable workflows (GitHub) / includes+extends (GitLab) / templates (Azure) |
| Documented | Comments explaining non-obvious steps |
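The DRY row, in GitHub Actions terms, means extracting shared jobs into a `workflow_call` workflow. A sketch with hypothetical file names and inputs:

```yaml
# .github/workflows/reusable-test.yml — callable from other workflows
on:
  workflow_call:
    inputs:
      node-version:
        type: string
        default: '20'
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
      - run: npm ci && npm test
```

```yaml
# Caller workflow: reuse instead of copy-paste
jobs:
  test:
    uses: ./.github/workflows/reusable-test.yml
    with:
      node-version: '22'
```

GitLab CI achieves the same with `include:` plus `extends:`; Azure Pipelines with templates.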
## Notifications and Observability

### Notifications

```yaml
# GitHub Actions — Slack notification on failure
- name: Notify Slack on failure
  if: failure()
  uses: slackapi/slack-github-action@v1
  with:
    channel-id: 'C0123456789'
    slack-message: "Pipeline failed: ${{ github.repository }}@${{ github.sha }}"
  env:
    SLACK_BOT_TOKEN: ${{ secrets.SLACK_TOKEN }}
```

| Channel | When | What |
|---|---|---|
| Slack / Teams | Failure | Pipeline failed, deployment failed |
| Email | Failure (optional) | Summary of failures |
| GitHub/GitLab comments | PR pipelines | Test results, coverage, plan output |
| Dashboard | Always | Pipeline success rate, duration trends |
### DORA Metrics

The DORA (DevOps Research and Assessment) metrics measure CI/CD effectiveness:
| Metric | What It Measures | Elite Benchmark |
|---|---|---|
| Deployment Frequency | How often you deploy to production | Multiple times per day |
| Lead Time for Changes | Time from commit to production | Less than 1 hour |
| Change Failure Rate | % of deployments that cause a failure | 0–15% |
| Time to Restore Service | Time to recover from a production failure | Less than 1 hour |
Track these metrics to understand and improve your CI/CD process:
```text
Deployment Frequency:     3x/day   ✓ Elite
Lead Time for Changes:    45 min   ✓ Elite
Change Failure Rate:      8%       ✓ Elite
Time to Restore Service:  30 min   ✓ Elite
```

Tools for DORA metrics: Sleuth, LinearB, Faros AI, GitLab Value Stream Analytics, GitHub-based custom dashboards.
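Even without a dedicated tool, some of these metrics fall out of data you already have. For instance, change failure rate from a deploy log — the log format here is made up for illustration (one line per deployment, `ok` or `failed`):

```shell
# Fabricated example log: 13 deployments, 2 of which failed
printf 'ok\nok\nfailed\nok\nok\nok\nok\nok\nok\nok\nok\nfailed\nok\n' > deploys.log

# Change failure rate = failed deployments / total deployments
awk '{ n++ } /failed/ { f++ } END { printf "%.0f%%\n", 100 * f / n }' deploys.log
# → 15%
```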
## Common Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| 30+ minute pipelines | Developers don’t wait, context-switch | Parallelize, cache, split test suites |
| Flaky tests | False failures erode trust | Quarantine, fix, or delete flaky tests |
| Manual gates everywhere | Slow deployments, bottleneck on approvers | Automate staging deploy; manual gate only for production |
| No rollback plan | Stuck when a deployment goes bad | Test rollback procedures regularly |
| Secrets in code | Credential leaks | Use platform secret store + OIDC |
| Pipeline YAML copy-paste | Inconsistent, hard to maintain | Reusable workflows / includes / templates |
| No path filtering in monorepo | Every change rebuilds everything | Add path filters and affected-only builds |
| Testing only in CI | Slow feedback for developers | Run fast tests locally too (pre-commit, husky) |
| Ignoring security scans | Vulnerabilities ship to production | Block merge if critical/high vulnerabilities found |
| No deployment observability | Don’t know if deploy succeeded or degraded | Smoke tests + monitoring after deploy |
## Checklist

A quick checklist for a healthy CI/CD setup:
- Pipeline runs on every push and PR.
- Pipeline completes in under 10 minutes (ideally under 5).
- Secrets are in the platform’s encrypted store, not in code.
- OIDC is used for cloud authentication (no long-lived keys).
- Actions and dependencies are pinned to specific versions.
- Tests follow the test pyramid (many unit, some integration, few e2e).
- Flaky tests are tracked and fixed.
- Path filtering is in place for monorepos.
- Staging deploys automatically; production has a manual approval gate.
- Rollback procedure is documented and tested.
- Pipeline failures send notifications (Slack/Teams).
- DORA metrics are tracked.
- Pipeline YAML is reviewed like application code.
## Key Takeaways

- Fail fast — lint and unit tests before long-running jobs.
- Parallelize — independent jobs should run simultaneously.
- Keep it under 10 minutes — fast feedback is the core value of CI.
- OIDC > stored credentials — short-lived tokens with no secrets to manage.
- Pin everything — actions, images, dependencies.
- Test pyramid — many unit tests, some integration, few e2e.
- Trunk-based or GitHub Flow — short-lived branches, frequent merges.
- Track DORA metrics — deployment frequency, lead time, change failure rate, time to restore.
- Treat pipeline YAML as code — version controlled, reviewed, DRY, documented.