I built a vulnerable app and spent $1,500 seeing if LLMs could hack it
A researcher spent $1,500 testing 10 LLMs against a deliberately vulnerable React Native/Expo app with a FastAPI backend and an open Firebase Firestore database. GPT-5.5 achieved a 70% solve rate at $9.46 per solve, DeepSeek V4 Pro solved 3/10 at $0.62 per solve, and Claude Sonnet 4.6 solved only 2/10, though its runs were often cut off by budget limits. Most failures had one of three causes: the model never discovered the Firebase bypass, it hit its own security guardrails (Gemini refused nearly every run), or it exhausted its tokens before producing the exploit.
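To make the cost figures easier to compare, here is a minimal Python sketch of the arithmetic. It assumes that "cost per solve" means total spend on a model divided by its number of successful runs; the summary does not define the metric, so treat that as a guess. The function names `cost_per_solve` and `implied_spend` are hypothetical helpers, not anything from the original write-up.

```python
from typing import Optional


def cost_per_solve(total_spend_usd: float, solves: int) -> Optional[float]:
    """Dollars spent per successful run, or None if the model never solved it.

    Assumption: the metric is total spend divided by successful runs.
    """
    if solves <= 0:
        return None
    return total_spend_usd / solves


def implied_spend(cost_per_solve_usd: float, solves: int) -> float:
    """Invert the metric: estimate total spend from a reported cost per solve."""
    return cost_per_solve_usd * solves


# DeepSeek V4 Pro: 3 solves at $0.62 per solve (both figures from the summary).
deepseek_total = implied_spend(0.62, 3)
print(f"Implied DeepSeek V4 Pro spend: ${deepseek_total:.2f}")

# A model with zero solves has no meaningful cost per solve.
print(cost_per_solve(25.0, 0))
```

Under that reading, a low cost per solve and a low solve rate can coexist: DeepSeek V4 Pro's implied total is small even though it solved fewer runs than GPT-5.5, so the two numbers are best read together.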