HolmesGPT Wiki: AI-Powered Troubleshooting for Cloud Native

HolmesGPT is an open-source AI agent that investigates problems in your cloud infrastructure, finds root causes, and suggests remediations. Originally built by Robusta and now jointly maintained with Microsoft, it was accepted as a CNCF Sandbox project in October 2025. Unlike traditional monitoring tools that tell you what is broken, HolmesGPT tells you why it’s broken and how to fix it.

How It Works

HolmesGPT connects large language models with live observability data through an agentic loop. When you ask it a question or feed it an alert, it doesn’t just query a single data source and pass results to an LLM. Instead, it iteratively calls tools, gathers data from multiple sources, correlates findings, and builds a coherent root cause analysis.

The loop follows this pattern:

  1. Create a task list — break the problem into smaller investigation steps
  2. Query data sources — run Prometheus queries, collect Kubernetes events and logs, inspect pod specs, check deployment history
  3. Correlate context — detect that a recent deployment updated the image, a config change was made, or resource limits were hit
  4. Explain and suggest fixes — return a natural language diagnosis with concrete remediation steps

This agentic approach means HolmesGPT can chain together multiple queries without human intervention. If it finds an unhealthy pod, it checks the logs. If the logs mention a connection error, it checks the target service. If the service is backed by a database, it checks RDS metrics. Each step informs the next.

Architecture

HolmesGPT’s core is the ToolCallingLLM class, which implements the agentic loop. It orchestrates investigation through its call() method, iteratively invoking tools until reaching a conclusion.

User/Alert → HolmesGPT Agent → [Tool 1] → [Tool 2] → ... → Root Cause Analysis
                  ↑                                              |
                  └──────────── iterate ─────────────────────────┘
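The loop in the diagram can be sketched in a few lines of Python. This is a self-contained, illustrative stand-in, not HolmesGPT's actual ToolCallingLLM API: the scripted "LLM" and the two toy tools exist only to show how each tool result feeds the next decision.

```python
# A minimal, self-contained sketch of the agentic loop described above.
# The scripted LLM and the tool registry are illustrative stand-ins --
# they are NOT HolmesGPT's real ToolCallingLLM internals.

def run_agentic_loop(llm_step, tools, question, max_iterations=10):
    """Repeatedly ask the model; run any tool it requests; feed results back."""
    history = [("user", question)]
    for _ in range(max_iterations):
        action = llm_step(history)          # model decides: call a tool or answer
        if action["type"] == "tool_call":
            result = tools[action["tool"]](**action.get("args", {}))
            history.append(("tool", f"{action['tool']} -> {result}"))
        else:
            return action["answer"]         # no more tool calls: final analysis
    return "max iterations reached"

# Stub "LLM" scripting the chain from the text: unhealthy pod -> logs -> diagnosis.
def scripted_llm(history):
    tool_results = [h[1] for h in history if h[0] == "tool"]
    if not tool_results:
        return {"type": "tool_call", "tool": "list_unhealthy_pods"}
    if len(tool_results) == 1:
        return {"type": "tool_call", "tool": "get_logs", "args": {"pod": "api-7f9c"}}
    return {"type": "answer",
            "answer": "api-7f9c is in CrashLoopBackOff: connection refused to postgres"}

tools = {
    "list_unhealthy_pods": lambda: ["api-7f9c"],
    "get_logs": lambda pod: f"{pod}: connection refused to postgres:5432",
}

print(run_agentic_loop(scripted_llm, tools, "what pods are unhealthy and why?"))
```

The key property mirrored here is that each iteration appends tool output to the conversation history, so the model's next step is conditioned on everything gathered so far.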

Key architectural principles:

  • Read-only access — HolmesGPT never modifies your infrastructure. It respects RBAC permissions and only reads data, making it safe for production
  • Hub-and-spoke model — a central Config class acts as a factory for all components, with toolsets as spokes that can be added or removed independently
  • LLM-agnostic — supports OpenAI, Anthropic Claude, Azure OpenAI, AWS Bedrock, Google Gemini, Google Vertex AI, and Ollama for local models
  • Data privacy — no training on your data. Data sent to Robusta SaaS is private to your account. For extra isolation, bring your own LLM API key
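The hub-and-spoke principle can be illustrated with a minimal factory sketch. The class and method names below are assumptions for illustration only; HolmesGPT's real Config class differs in detail.

```python
# Illustrative hub-and-spoke sketch: a central Config "hub" that builds
# toolset "spokes" on demand. Names are assumptions, not HolmesGPT's API.

class Config:
    """Central hub: knows how to construct every registered component."""

    def __init__(self):
        self._toolset_factories = {}   # name -> zero-arg constructor

    def register_toolset(self, name, factory):
        # Spokes are registered independently, so they can be added or
        # removed without touching the hub or the other spokes.
        self._toolset_factories[name] = factory

    def create_toolsets(self, enabled):
        # Only build the spokes the user enabled; others are never constructed.
        return {n: f() for n, f in self._toolset_factories.items() if n in enabled}

config = Config()
config.register_toolset("kubernetes", lambda: "k8s-toolset")
config.register_toolset("prometheus", lambda: "prom-toolset")
active = config.create_toolsets(enabled={"kubernetes"})
print(active)  # only the enabled spoke is built
```

The payoff of this pattern is isolation: disabling or adding one integration never requires changes to the others.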

Installation

HolmesGPT supports multiple installation methods depending on your use case.

Homebrew (Mac/Linux)

brew tap robusta-dev/homebrew-holmesgpt
brew install holmesgpt

# verify
holmes ask --help

pipx

pipx install holmesgpt

# verify
holmes ask --help

From Source (Poetry)

git clone https://github.com/robusta-dev/holmesgpt.git
cd holmesgpt
poetry install --no-root

# verify
poetry run holmes ask --help

Docker

docker run -it --net=host \
  -e OPENAI_API_KEY="your-key" \
  -v ~/.holmes:/root/.holmes \
  -v $HOME/.kube/config:/root/.kube/config \
  us-central1-docker.pkg.dev/genuine-flight-317411/devel/holmes \
  ask "what pods are unhealthy and why?"

Mount additional credential directories as needed (~/.aws, ~/.config/gcloud).

Helm Chart (Kubernetes)

For deploying HolmesGPT inside your cluster, a Helm chart is available on Artifact Hub. If using Robusta, add enableHolmesGPT: true to your Helm values:

helm upgrade robusta robusta/robusta \
  --values=generated_values.yaml \
  --set clusterName=<YOUR_CLUSTER_NAME>

CLI Usage

Ask a Question

The simplest usage is asking a direct question. HolmesGPT investigates using whatever toolsets are available (Kubernetes by default if you have a kubeconfig):

holmes ask "what pods are unhealthy and why?"

Interactive Mode

For follow-up questions and deeper investigation, use interactive mode:

holmes ask "why is my deployment failing?" --interactive

Interactive mode supports commands like /run to execute additional tools, /show to display gathered data, and /clear to reset context.

Investigate Alerts

HolmesGPT can pull alerts directly from monitoring systems and investigate each one:

# AlertManager
holmes investigate alertmanager --alertmanager-url http://localhost:9093

# PagerDuty
holmes investigate pagerduty --pagerduty-api-key <KEY>

# OpsGenie
holmes investigate opsgenie --opsgenie-api-key <KEY>

Add --update to write investigation results back to the alert source (e.g., add a note to the PagerDuty incident).

File Context

Pass additional context from local files:

holmes ask "summarize the key issues" -f ./incident-report.txt

Configuration

Store common settings in ~/.holmes/config.yaml to avoid repeating CLI arguments:

# ~/.holmes/config.yaml
model: claude-sonnet-4-20250514
api_key: your-api-key

# or use environment variables
# OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, etc.

An example config file with all available settings is provided in the repository at config.example.yaml.

Toolsets

Toolsets are HolmesGPT’s integration layer — they define what data sources the agent can query during investigation. Over 25 built-in toolsets ship with the project.

Built-in Toolsets

Category                  Toolsets
Container/Orchestration   Kubernetes, Docker, ArgoCD, Helm
Observability             Prometheus, Grafana Loki, Tempo, Datadog, NewRelic, Coralogix
Messaging                 Kafka, RabbitMQ
Databases                 AWS RDS, OpenSearch
Documentation             Confluence, Slab
Cloud                     AWS, GCP, Azure

Alert Source Integrations

HolmesGPT connects to incident management systems for automated investigation:

  • Prometheus/AlertManager
  • PagerDuty
  • OpsGenie
  • Jira
  • Slack
  • Microsoft Teams
  • GitHub

Custom Toolsets

You can extend HolmesGPT with custom YAML-based toolsets for organization-specific data sources. A toolset defines prerequisites, a description, and a set of tools with their commands:

# grafana_toolset.yaml
name: grafana
description: "Query Grafana dashboards for investigation"
prerequisites:
  env:
    - GRAFANA_URL
    - GRAFANA_API_KEY
  command:
    - curl --version
tools:
  - name: grafana_get_dashboards
    description: "Get list of Grafana dashboards"
    command: >
      curl -s -H "Authorization: Bearer ${GRAFANA_API_KEY}"
      "${GRAFANA_URL}/api/search?type=dash-db"
  - name: grafana_get_dashboard
    description: "Get a specific Grafana dashboard by UID"
    command: >
      curl -s -H "Authorization: Bearer ${GRAFANA_API_KEY}"
      "${GRAFANA_URL}/api/dashboards/uid/{{ dashboard_uid }}"

Load custom toolsets with the -t flag:

holmes ask -t grafana_toolset.yaml "what grafana dashboard should I look at for high CPU?"

Use -t multiple times to load multiple toolsets. Community-contributed toolsets are available in the holmesgpt-community-toolsets repository.

Output Transformers

For tools that return large outputs (like kubectl describe on a complex deployment), HolmesGPT supports transformers. The most common is llm_summarize, which uses a fast secondary model to condense lengthy output while preserving critical information. This helps manage LLM context window limits during deep investigations.
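In toolset YAML, a transformer is typically attached per tool. The fragment below is an illustrative sketch only: the field names (transformers, input_threshold) are assumptions based on the description above, so verify them against the official documentation before copying.

```yaml
# Illustrative sketch -- field names are assumptions, check the docs.
tools:
  - name: describe_deployment
    description: "Describe a deployment (output can be very large)"
    command: kubectl describe deployment {{ deployment_name }}
    transformers:
      - name: llm_summarize          # condense output with a fast secondary model
        config:
          input_threshold: 1000      # only summarize outputs above this size
```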

Custom Runbooks

Runbooks let you codify investigation procedures for known alert patterns. When HolmesGPT encounters a matching alert, it follows your runbook instructions alongside its own investigation:

# custom_runbooks.yaml
runbooks:
  - alert_name: HighPodCPU
    instructions: |
      1. Check if this pod has HPA configured
      2. Look for recent deployment changes
      3. Search Grafana for the pod's CPU dashboard
      4. Check if the CPU limit is set too low for the workload

This transforms HolmesGPT into a virtual SRE tailored to your environment — it knows your team’s debugging procedures and applies them automatically.
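Conceptually, runbook matching works like a lookup keyed on the alert name, with the matched instructions injected into the investigation prompt. The sketch below mirrors the YAML above; it is not HolmesGPT's actual implementation.

```python
# Conceptual sketch of runbook matching: find the runbook whose alert_name
# matches the incoming alert and prepend its instructions to the prompt.
# This mirrors the YAML example above; NOT HolmesGPT's real code.

runbooks = [
    {"alert_name": "HighPodCPU",
     "instructions": ("1. Check if this pod has HPA configured\n"
                      "2. Look for recent deployment changes\n"
                      "3. Check if the CPU limit is set too low")},
]

def build_prompt(alert_name, alert_summary):
    for rb in runbooks:
        if rb["alert_name"] == alert_name:
            # Matched: the agent follows these steps alongside its own loop.
            return f"Follow this runbook:\n{rb['instructions']}\n\nAlert: {alert_summary}"
    return f"Alert: {alert_summary}"   # no runbook: free-form investigation

print(build_prompt("HighPodCPU", "pod api-7f9c CPU above 90%"))
```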

HolmesGPT vs K8sGPT

Both are CNCF Sandbox projects for AI-assisted Kubernetes troubleshooting, but they take different approaches:

Aspect          K8sGPT                                  HolmesGPT
Scope           Kubernetes-specific resource analysis   Broad cloud native troubleshooting
Approach        Built-in analyzers + AI explanation     Agentic loop with iterative tool calling
Data sources    Kubernetes API                          Kubernetes, Prometheus, Loki, cloud providers, ServiceNow, and 25+ more
Investigation   Analyze known resource patterns         Free-form investigation from any starting point
Extensibility   Analyzer plugins                        YAML toolsets, MCP servers, custom runbooks
Maintained by   Community                               Robusta + Microsoft

K8sGPT excels at surfacing and explaining Kubernetes-specific misconfigurations (CrashLoopBackOff, ImagePullBackOff, pending pods). HolmesGPT is broader — it can start from a Prometheus alert, trace through logs and metrics across services, query external systems, and produce a cross-cutting root cause analysis.

They’re complementary tools. K8sGPT is your quick health check; HolmesGPT is your incident investigator.

Supported LLM Providers

Provider           Notes
Anthropic Claude   Claude Sonnet 4 / Sonnet 4.5 recommended for best results
OpenAI             GPT-4o and later
Azure OpenAI       Enterprise deployments
AWS Bedrock        Claude and other models via AWS
Google Gemini      Direct API access
Google Vertex AI   Enterprise GCP deployments
Ollama             Local/self-hosted models for maximum privacy

Production Safety

HolmesGPT is designed for production use:

  • Read-only operations — it never modifies resources, only reads
  • RBAC-aware — respects your Kubernetes RBAC policies. If a service account can’t access certain namespaces, neither can HolmesGPT
  • No data training — your data is never used to train models
  • Bring your own LLM — for maximum privacy, use Ollama or your own API key

Community and Governance

HolmesGPT is governed as a CNCF Sandbox project with a joint roadmap maintained by Robusta and Microsoft. The project welcomes community contributions through GitHub issues and pull requests.

Getting Started

The fastest path to trying HolmesGPT:

# Install
brew tap robusta-dev/homebrew-holmesgpt && brew install holmesgpt

# Set your LLM API key
export OPENAI_API_KEY="your-key"

# Point at your cluster and ask
holmes ask "what pods are unhealthy and why?"

If you want to go deeper, set up a ~/.holmes/config.yaml, add custom toolsets for your observability stack, and wire it into your alert pipeline with holmes investigate alertmanager. The documentation at holmesgpt.dev covers each integration in detail.