
LLM Diagnostics and Intelligent Runbooks

First published by Atif Alam

Job descriptions often ask for LLM-based diagnostics and intelligent runbooks. This page maps those phrases to workflows you can describe and prototype.

By the end of this page, you should be able to sketch an LLM-assisted triage flow and list what to verify before automation touches production.

Tools such as Cursor, Continue.dev, or Aider help you author:

  • runbooks and Terraform,
  • scripts for one-off remediation,
  • queries for metrics and logs.

Minimum bar: Evaluate LLM-generated code for correctness, security (secrets, overly broad IAM), and edge cases—not just whether it “runs once.”
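As a minimal sketch of that review step, here is a dependency-free lint pass over generated text. The patterns are illustrative only; a real review would combine human reading with dedicated scanners (e.g. secret scanners and IaC policy checkers).

```python
import re

# Illustrative patterns, not exhaustive: hardcoded credentials and
# wildcard IAM statements are two of the most common LLM-generated issues.
SECRET_PATTERN = re.compile(
    r"(?i)(api[_-]?key|secret|password)\s*[:=]\s*['\"][^'\"]+['\"]"
)
BROAD_IAM_PATTERN = re.compile(r'"Action"\s*:\s*"\*"|"Resource"\s*:\s*"\*"')

def review_generated_code(text: str) -> list[str]:
    """Flag obvious problems in LLM-generated code before it is merged."""
    findings = []
    if SECRET_PATTERN.search(text):
        findings.append("possible hardcoded secret")
    if BROAD_IAM_PATTERN.search(text):
        findings.append("overly broad IAM statement (wildcard action/resource)")
    return findings
```

A pass with no findings is not a pass on correctness or edge cases; it only screens out the cheapest failures before human review.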

A static runbook is a fixed checklist. An intelligent runbook workflow:

  1. Ingests the symptom (alert text, ticket, short description).
  2. Retrieves relevant runbook sections and recent changes (deploys, config).
  3. Suggests next steps and links to dashboards or queries.
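The three steps above can be sketched end to end. Everything here is hypothetical: the in-memory keyword index stands in for a real retrieval layer (vector store plus deploy/change log), and the dashboard URLs are placeholders.

```python
from dataclasses import dataclass

@dataclass
class Suggestion:
    runbook_section: str
    next_step: str
    dashboard_link: str

# Hypothetical "retrieval" keyed by symptom keywords; a real system would
# query a vector store and recent deploys/config changes instead.
RUNBOOK_INDEX = {
    "disk": Suggestion("storage.md#disk-full",
                       "Check volume usage; prune old logs",
                       "https://grafana.example/d/disk"),
    "latency": Suggestion("api.md#latency",
                          "Compare p99 before and after the last deploy",
                          "https://grafana.example/d/latency"),
}

def triage(alert_text: str) -> list[Suggestion]:
    """Step 1: ingest the symptom. Step 2: retrieve matching runbook
    sections. Step 3: return suggested next steps with dashboard links."""
    text = alert_text.lower()
    return [s for key, s in RUNBOOK_INDEX.items() if key in text]
```

The return value is suggestions only; nothing in this flow executes a remediation.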

You may not have built one end-to-end; you should still be able to design the architecture: retrieval, policy gates, and human approval. That usually leads to RAG—see RAG for Incident Operations.

Common pattern: send log excerpts, stack traces, or trace IDs to an LLM with strict instructions to:

  • summarize what failed,
  • propose hypotheses ranked by likelihood,
  • suggest queries to confirm or deny.
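One way to encode those strict instructions is a fixed prompt template that forbids write actions and forces the three-section response shape. The template wording and the character budget below are assumptions, not a standard.

```python
TRIAGE_PROMPT = """You are assisting with incident triage. Do NOT propose any
write or remediation actions.

Log excerpt:
{log_excerpt}

Respond with exactly three sections:
1. SUMMARY: one paragraph on what failed.
2. HYPOTHESES: causes ranked by likelihood, most likely first.
3. QUERIES: read-only metric/log queries to confirm or rule out each hypothesis.
"""

def build_triage_prompt(log_excerpt: str, max_chars: int = 4000) -> str:
    # Truncate to stay within a context budget (the limit is illustrative).
    return TRIAGE_PROMPT.format(log_excerpt=log_excerpt[:max_chars])
```

Keeping the template in version control makes the instructions reviewable and auditable like any other operational artifact.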

Retrieval-Augmented Generation (RAG) improves grounding by pulling similar past incidents and runbooks into the prompt instead of relying on the model’s parametric memory alone.
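The retrieval half of that pattern can be shown with plain bag-of-words cosine similarity; real systems use embedding models, but word counts keep the sketch dependency-free.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank past incidents/runbook snippets by similarity to the symptom,
    so the top k can be placed into the prompt for grounding."""
    qv = Counter(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: cosine(qv, Counter(d.lower().split())),
                    reverse=True)
    return scored[:k]
```

Whatever the similarity backend, the output is the same: a few grounded snippets prepended to the prompt instead of relying on parametric memory.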

  • Default to read-only suggestions until reviewed.
  • Log prompts and outputs for audit on production-impacting paths.
  • Use Evaluating LLM Outputs for acceptance criteria.
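The audit-logging guardrail can be a thin wrapper around the model call. The list-based log sink here is a stand-in; production would write to durable, append-only storage.

```python
import json
import time

def audited_llm_call(prompt: str, llm, audit_log: list) -> str:
    """Record prompt and output before returning, so every
    production-impacting LLM interaction is reviewable after the fact.
    `llm` is any callable mapping a prompt string to a response string."""
    output = llm(prompt)
    audit_log.append(json.dumps({
        "ts": time.time(),
        "prompt": prompt,
        "output": output,
    }))
    return output
```

Because the wrapper only reads and records, it fits the read-only default above: approval gates for write actions sit outside it.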