I built a vulnerable app and spent $1,500 seeing if LLMs could hack it

Relevance: 8.7
Score breakdown: technical depth 9, novelty 9, actionability 8, community 9, strategic 7, personal 10

Scored daily by a customisable AI persona to surface the most relevant engineering leadership news.

Novel experiment using LLMs for penetration testing; directly actionable and highly relevant.

Category: AI/ML. Source: kasra.blog
Summary

A researcher spent $1,500 testing 10 LLMs against a deliberately vulnerable React Native/Expo app with a FastAPI backend and an open Firebase Firestore database. GPT-5.5 achieved a 70% solve rate at $9.46 per solve, DeepSeek V4 Pro solved 3/10 at $0.62 per solve, and Claude Sonnet 4.6 solved only 2/10, often because budget limits cut it off. Most failures had one of three causes: the model never discovered the Firebase bypass, it hit security guardrails (Gemini refused nearly every run), or it exhausted its tokens before producing the exploit.
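The per-solve figures above are easier to compare once the arithmetic is explicit. The sketch below assumes that "cost per solve" means a model's total spend divided by its number of successful runs; the summary does not state the definition, so treat that as an assumption. The function names are hypothetical, and the implied spend it computes is derived from the quoted numbers, not a figure reported by the source.

```python
from typing import Optional


def cost_per_solve(total_spend_usd: float, solves: int) -> Optional[float]:
    """Spend divided by successful runs (assumed definition of the metric).

    Returns None when there were no solves, since the ratio is undefined.
    """
    if solves < 0:
        raise ValueError("solves must be non-negative")
    if solves == 0:
        return None
    return total_spend_usd / solves


def implied_total_spend(cost_per_solve_usd: float, solves: int) -> float:
    """Invert the metric: the spend implied by a quoted cost per solve."""
    return round(cost_per_solve_usd * solves, 2)


# Numbers quoted in the summary for DeepSeek V4 Pro: 3 solves at $0.62 each.
# The resulting spend is implied by the assumed definition, not reported.
deepseek_spend = implied_total_spend(0.62, 3)
print(deepseek_spend)
```

Under this reading, a low cost per solve says little about reliability on its own: a cheap model with few solves can still look better on this metric than an expensive model that solves most of the time, so the solve rate and the cost figure need to be read together.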