AIOps Tooling and Stack

First published by Atif Alam

This page is a practical reference for tools and libraries mentioned in AI-assisted SRE contexts. By the end, you should be able to pick a minimal stack for a pilot and justify tradeoffs.

AIOps and intelligent observability (platforms)

Know these conceptually; you don't need every feature:

| Area | Examples |
| --- | --- |
| APM / monitoring AI | Datadog Watchdog, Dynatrace Davis |
| AWS | DevOps Guru |
| Incident correlation | Moogsoft, BigPanda, PagerDuty AIOps |
| Library | Typical use |
| --- | --- |
| prophet | Time-series forecasting (capacity, traffic) |
| scikit-learn | IsolationForest, clustering for alert grouping |
| statsmodels | Statistical tests, ARIMA |
| pyod | Outlier detection |
| river | Online/streaming models |
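As a sketch of the outlier-detection use case above, scikit-learn's `IsolationForest` can flag spikes in a latency series. The data here is synthetic and the `contamination` value is illustrative; in practice you would tune it per metric:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic latency metric: steady baseline with two injected spikes.
rng = np.random.default_rng(42)
latency = rng.normal(loc=120, scale=10, size=(200, 1))  # ms
latency[50] = 900   # incident spike
latency[130] = 750  # incident spike

# contamination is the expected fraction of anomalies in the data.
model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(latency)  # -1 = anomaly, 1 = normal

anomalies = np.where(labels == -1)[0]
print(anomalies)
```

The same fit/predict shape applies to most pyod detectors, so swapping algorithms later is cheap.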
| Layer | Examples |
| --- | --- |
| Orchestration | LangChain |
| RAG indexing | LlamaIndex |
| APIs | openai, anthropic SDKs |
| Component | Examples |
| --- | --- |
| Embeddings | sentence-transformers |
| Local / prototype | ChromaDB, FAISS |
| Managed | Pinecone, Weaviate |

Intelligent runbooks and coding assistants

| Tool | Notes |
| --- | --- |
| Cursor | Multi-file context |
| Aider | Terminal coding agent |
| Continue.dev | Open-source, self-hostable |
| Library | Notes |
| --- | --- |
| loguru | Structured logging |
| elasticsearch-py | Query logs for LLM context |
| tiktoken | Token counting for context limits |
| Library | Notes |
| --- | --- |
| instructor | Pydantic-structured LLM outputs |
| guidance (Microsoft) | Constrained generation |
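The Pydantic side of structured outputs works without any LLM call: you define the schema, and instructor (or your own parser) validates the model's JSON against it. A sketch with made-up field names; the `raw` string stands in for an LLM response:

```python
from typing import Literal
from pydantic import BaseModel, Field


class IncidentSummary(BaseModel):
    """Fields we want an LLM to extract from a log snippet (illustrative)."""

    severity: Literal["low", "medium", "high"]
    affected_service: str
    probable_cause: str = Field(description="One-sentence hypothesis")


# Validate JSON as if it had come back from an LLM.
raw = (
    '{"severity": "high", "affected_service": "checkout",'
    ' "probable_cause": "DB connection pool exhausted"}'
)
incident = IncidentSummary.model_validate_json(raw)
print(incident.severity)
```

Validation failures raise a `ValidationError`, which is the hook instructor uses to retry the LLM call with the error message attached.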
| Tool | Notes |
| --- | --- |
| ragas | RAG faithfulness, relevance |
| deepeval | Test-style assertions |
| promptfoo | Prompt regression across models |
| Tool | Notes |
| --- | --- |
| mlflow | Experiments, model registry |
| evidently | Drift monitoring |
| great-expectations | Data quality checks |
A minimal pilot stack, in order:

  1. LangChain + ChromaDB + Anthropic — small RAG over runbooks.
  2. Prophet or IsolationForest — anomalies on sample metrics.
  3. Instructor + Anthropic — structured fields from a log snippet.
  4. Ragas — evaluate the RAG pipeline.