I stress-tested Gemma 4 E4B's 128K context on a laptop GPU — recall is great, prefill is not
Stress-testing Gemma 4 E4B (Q4_K_M, ~9.6 GB) on an RTX 5050 laptop GPU with 8 GB of VRAM showed perfect needle-in-a-haystack recall across the 5K–100K token range. Time to first token (prefill), however, scaled nearly linearly with context length, from 4 s at 5K to 72 s at 100K, while generation throughput dropped only 26% (9.2 → 6.8 tok/s). The author defines three practical zones: interactive (<20K), research-assistant (20–60K), and batch (60–100K), and provides a ~30-line Python rig on Ollama 0.24.0 to reproduce the results.
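The author's rig is not reproduced here. As a rough idea of what such a measurement can look like, here is a minimal sketch against Ollama's local `/api/generate` endpoint. The model tag, the needle sentence, the filler text, and the 4-characters-per-token estimate are all placeholder assumptions, not details from the original post. Prefill time is read from the `prompt_eval_duration` field of the response, which approximates time to first token but excludes model load time.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "your-gemma-tag-here"  # hypothetical placeholder; use the tag shown by `ollama list`
NEEDLE = "The secret passphrase is indigo-walrus-42."  # made-up needle for illustration
FILLER = "The quick brown fox jumps over the lazy dog. "


def build_prompt(approx_tokens, depth=0.5):
    """Bury NEEDLE at a relative depth inside roughly approx_tokens of filler."""
    # Assumption: about 4 characters per token for plain English text.
    n_sentences = max(1, approx_tokens * 4 // len(FILLER))
    sentences = [FILLER] * n_sentences
    sentences.insert(int(n_sentences * depth), NEEDLE + " ")
    question = "\n\nWhat is the secret passphrase? Answer with the passphrase only."
    return "".join(sentences) + question


def summarize(resp):
    """Reduce an /api/generate response to prefill time, throughput, and recall."""
    return {
        "prefill_s": resp["prompt_eval_duration"] / 1e9,  # Ollama reports nanoseconds
        "gen_tok_s": resp["eval_count"] / (resp["eval_duration"] / 1e9),
        "recalled": "indigo-walrus-42" in resp["response"],
    }


def run_once(approx_tokens, num_ctx=131072):
    """Send one haystack prompt to a running Ollama server and summarize the reply."""
    body = json.dumps({
        "model": MODEL,
        "prompt": build_prompt(approx_tokens),
        "stream": False,
        # num_ctx must be large enough to hold the whole prompt, or it gets truncated.
        "options": {"num_ctx": num_ctx, "temperature": 0},
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as r:
        return summarize(json.load(r))
```

With an Ollama server running and the model pulled, a sweep could be as simple as calling `run_once(size)` for sizes such as 5_000, 20_000, 60_000, and 100_000 and printing each result. Because identical filler repeats across runs, prompt caching may shorten later prefill times, so varying the filler or restarting the model between runs is a reasonable precaution.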