Gemma 4 12B Enables On-Device, Multimodal Agentic Workflows with an Encoder-free Architecture

AI/ML infoq.com

Summary

Google's Gemma 4 12B is an encoder-free, decoder-only transformer that ingests raw image patches and audio frames directly. Dropping the separate vision and audio encoders reduces latency and memory fragmentation. A 35M-parameter vision embedder projects 48×48-pixel patches into the LLM's hidden space with a single matrix multiplication, while audio is sliced into 40 ms frames and linearly projected. The model runs locally through Google AI Edge, LiteRT-LM, or llama.cpp, enabling on-device agentic workflows such as generating Python scripts from natural language.
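
To make the encoder-free idea concrete, here is a minimal sketch in plain Python of how image patches and audio frames could each be mapped to token embeddings with one linear projection. It is an illustration, not Gemma's implementation: the function names, the toy hidden size of 8, the 16 kHz sample rate, the three color channels, and the random weights are all assumptions. Only the 48×48 patch size and the 40 ms frame length come from the summary.

```python
import random

PATCH = 48            # patch side in pixels (from the summary)
FRAME_MS = 40         # audio frame length in milliseconds (from the summary)
SAMPLE_RATE = 16_000  # assumption: the summary does not state a sample rate
HIDDEN = 8            # assumption: toy hidden size, far smaller than a real model


def extract_patches(image, patch=PATCH):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append([
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ])
    return patches


def slice_frames(samples, sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """Cut a mono waveform into non-overlapping fixed-length frames."""
    frame_len = sample_rate * frame_ms // 1000
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def project(vectors, weights):
    """Multiply each input vector by an (in_dim x hidden) weight matrix."""
    columns = list(zip(*weights))
    return [
        [sum(x * w for x, w in zip(vec, col)) for col in columns]
        for vec in vectors
    ]


def random_matrix(rows, cols, rng):
    """Stand-in for learned projection weights (hypothetical values)."""
    return [[rng.uniform(-0.01, 0.01) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)

# A 96 x 144 RGB image yields a 2 x 3 grid of 48 x 48 patches.
image = [[[rng.random() for _ in range(3)] for _ in range(144)] for _ in range(96)]
# One second of audio at the assumed 16 kHz yields 25 frames of 640 samples.
audio = [rng.uniform(-1.0, 1.0) for _ in range(SAMPLE_RATE)]

patches = extract_patches(image)
frames = slice_frames(audio)

image_tokens = project(patches, random_matrix(PATCH * PATCH * 3, HIDDEN, rng))
audio_tokens = project(frames, random_matrix(len(frames[0]), HIDDEN, rng))

# Both modalities now share one embedding width and can be placed in a
# single sequence for the decoder, with no separate encoder stack in between.
sequence = image_tokens + audio_tokens
print(len(sequence), len(sequence[0]))  # 31 tokens, each of width 8
```

The point of the sketch is that each modality needs only a reshape and one matrix multiplication before it reaches the decoder, which is what lets the separate vision and audio encoders be dropped.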

Author

Sergio De Simone