Auto-verifying your AI-SRE's fixes (Part II): HolmesGPT end-to-end on a real cluster

End-to-end AI SRE fix verification on GKE with concrete results, directly relevant to AI ops and cloud infrastructure.

AI/ML dev.to

Summary

HolmesGPT, an open-source AI-SRE, autonomously investigates alerts on a GKE cluster, generates code patches via a Claude wrapper, and verifies each fix by using mirrord exec to run the patched code against real cluster dependencies. In a demo, it correctly diagnosed a ValueError causing 5% error-rate violations, produced a patch that passed verification by clearing the SLO without regressions, and rejected a second patch that failed verification. The system integrates with Alertmanager, Prometheus, and existing runbooks to ground its investigations in cluster state and documentation.
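
The accept/reject step described above can be illustrated with a minimal Python sketch. This is not HolmesGPT's code: the `VerificationRun` type, the `verdict` function, and the reading of 5% as the SLO threshold are all assumptions made for illustration. The sketch only shows the decision rule the summary implies, namely that a patch passes when the measured error rate clears the SLO and no regressions were observed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerificationRun:
    """Hypothetical record of one verification run of a patched service."""
    total_requests: int
    failed_requests: int
    regressions: tuple = ()  # names of checks that newly failed, if any


def error_rate(run: VerificationRun) -> float:
    """Fraction of requests that failed during the run."""
    if run.total_requests <= 0:
        raise ValueError("a verification run needs at least one request")
    return run.failed_requests / run.total_requests


def verdict(run: VerificationRun, slo_threshold: float = 0.05) -> str:
    """Accept a patch only if it clears the SLO and introduces no regressions.

    The 0.05 default is an assumption: it treats the 5% figure in the
    summary as the error-rate threshold of the SLO.
    """
    clears_slo = error_rate(run) < slo_threshold
    return "accept" if clears_slo and not run.regressions else "reject"


if __name__ == "__main__":
    first_patch = VerificationRun(total_requests=1000, failed_requests=3)
    second_patch = VerificationRun(total_requests=1000, failed_requests=80)
    print(verdict(first_patch))
    print(verdict(second_patch))
```

The numbers in the usage block are made up, not results from the demo. Keeping the SLO check and the regression check as separate conditions means a patch that clears the error rate but breaks something else is still rejected.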

Author

Eyal Bukchin