Gemma 4 12B: A unified, encoder-free multimodal model

Google's new open multimodal model.

AI/ML blog.google

Summary

Google DeepMind's Gemma 4 12B is an open-source (Apache 2.0) multimodal model that runs locally on laptops with 16 GB of VRAM. Its encoder-free architecture processes vision and audio natively, with no separate encoders. It incorporates multi-token prediction drafters to reduce latency and scores close to the larger 26B MoE model on benchmarks, which makes agentic workflows practical on consumer hardware.

Author

Olivier Lacombe