Auto-verifying your AI-SRE's fixes (Part II): HolmesGPT end-to-end on a real cluster

Relevance: 8.2
Score breakdown: technical depth 9, novelty 8, actionability 9, community 5, strategic 6, personal 10

Scored daily by a customisable AI persona to surface the most relevant engineering leadership news.

End-to-end AI SRE fix verification on GKE with concrete results, directly relevant to AI ops and cloud infrastructure.

AI/ML dev.to
Summary

HolmesGPT, an open-source AI-SRE, autonomously investigates alerts on a GKE cluster, generates code patches via a Claude wrapper, and verifies each fix by using mirrord exec to run the patched code against real cluster dependencies. In a demo, it correctly diagnosed a ValueError that was causing 5% error-rate violations, produced a patch that passed verification by clearing the SLO without regressions, and rejected a second patch that failed verification. The system integrates with Alertmanager, Prometheus, and existing runbooks to ground investigations in cluster state and documentation.
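
The accept/reject decision described above can be illustrated with a minimal Python sketch. This is not HolmesGPT's or mirrord's actual code: RunStats, verify_patch, and the check names are hypothetical, and the 0.05 threshold is an assumed reading of the "5% error rate" figure in the summary. The sketch only shows the shape of a gate that passes a patch when the SLO is cleared and no new failures appear.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunStats:
    """Request counts from one verification run (hypothetical shape)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def verify_patch(patched: RunStats,
                 baseline_failures: set[str],
                 patched_failures: set[str],
                 slo_error_rate: float = 0.05) -> tuple[bool, str]:
    """Accept a patch only if it clears the SLO and breaks nothing new.

    slo_error_rate is an assumed threshold, not a value confirmed by the source.
    """
    if patched.requests == 0:
        return False, "no traffic observed, cannot verify"
    if patched.error_rate >= slo_error_rate:
        return False, f"error rate {patched.error_rate:.1%} still violates SLO"
    # A regression is any check that fails with the patch but passed before it.
    regressions = patched_failures - baseline_failures
    if regressions:
        return False, "regressions: " + ", ".join(sorted(regressions))
    return True, "SLO cleared with no regressions"
```

Under these assumptions, a run with 3 errors in 1000 requests and no new failing checks would be accepted, while a run with 80 errors in 1000 requests would be rejected for still violating the SLO.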

Author

Eyal Bukchin