I built a vulnerable app and spent $1,500 seeing if LLMs could hack it

AI/ML kasra.blog

Summary

A researcher spent $1,500 testing 10 LLMs against a deliberately vulnerable React Native/Expo app with a FastAPI backend and an open Firebase Firestore database. GPT-5.5 achieved a 70% solve rate ($9.46/solve). DeepSeek V4 Pro solved 3/10 at $0.62/solve, and Claude Sonnet 4.6 solved only 2/10, though it was often cut off by budget limits. Most failures had one of three causes: the model never discovered the Firebase bypass, it hit security guardrails (Gemini refused nearly every run), or it exhausted its tokens without producing the exploit.
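To make the per-solve figures easier to compare, here is a minimal Python sketch of the arithmetic behind them. It rests on two assumptions that the summary does not confirm: that "cost per solve" means a model's total spend divided by its number of successful attempts, and that GPT-5.5's 70% corresponds to 7 successes out of 10 attempts. The `ModelResult` class and its field names are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    """One model's reported numbers (hypothetical structure for illustration)."""
    name: str
    solves: int            # successful attempts
    attempts: int          # ASSUMPTION: 10 attempts per model
    cost_per_solve: float  # USD per solve, as reported in the summary

    @property
    def solve_rate(self) -> float:
        return self.solves / self.attempts

    @property
    def implied_spend(self) -> float:
        # ASSUMPTION: cost/solve = total spend / solves,
        # so total spend = cost/solve * solves.
        return self.cost_per_solve * self.solves


results = [
    # ASSUMPTION: 70% solve rate read as 7 of 10 attempts.
    ModelResult("GPT-5.5", solves=7, attempts=10, cost_per_solve=9.46),
    ModelResult("DeepSeek V4 Pro", solves=3, attempts=10, cost_per_solve=0.62),
]

for r in results:
    print(f"{r.name}: {r.solve_rate:.0%} solved, "
          f"implied spend ${r.implied_spend:.2f}")
```

Under those assumptions, the two metrics pull in different directions: GPT-5.5 solves more often, while DeepSeek V4 Pro pays far less for each success. The implied spend values are derived estimates, not figures from the original post.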