Gemma 4 12B Enables On-Device, Multimodal Agentic Workflows with an Encoder-Free Architecture
Google's Gemma 4 12B introduces an encoder-free, decoder-only transformer that ingests raw image patches and audio frames directly. Dropping the separate vision and audio encoders reduces latency and memory fragmentation. Its 35M-parameter vision embedder projects 48×48-pixel patches into the LLM hidden space via a single matrix multiplication, while audio is sliced into 40 ms frames that are likewise linearly projected. The model runs locally via Google AI Edge, LiteRT-LM, or llama.cpp, enabling on-device agentic workflows such as generating Python scripts from natural-language prompts.
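The patch-and-frame projection described above can be illustrated with a toy sketch. This is not Gemma's actual implementation: the hidden size of 8, the 16 kHz sample rate, the three colour channels, the use of raw waveform samples as frame features, and the random weights are all assumptions made for illustration. Only the 48×48 patch size and the 40 ms frame length come from the article.

```python
import random

PATCH = 48            # patch side in pixels (from the article)
FRAME_MS = 40         # audio frame length in milliseconds (from the article)
HIDDEN = 8            # toy hidden size (assumption; the real width is not stated)
SAMPLE_RATE = 16_000  # assumed sample rate (not stated in the article)
CHANNELS = 3          # assumed RGB input


def make_weights(in_dim, out_dim, seed):
    """Random stand-in for a learned projection matrix of shape in_dim x out_dim."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.01, 0.01) for _ in range(out_dim)] for _ in range(in_dim)]


def project(vector, weights):
    """A single vector-matrix product: the entire 'embedder' in this sketch."""
    out = [0.0] * len(weights[0])
    for x, row in zip(vector, weights):
        for j, w in enumerate(row):
            out[j] += x * w
    return out


def image_to_patches(image, patch=PATCH):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append([
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ])
    return patches


def audio_to_frames(samples, sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """Slice a mono waveform into non-overlapping frames, dropping any remainder."""
    frame_len = sample_rate * frame_ms // 1000
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def embed_inputs(image, samples):
    """Return image patch embeddings followed by audio frame embeddings."""
    w_img = make_weights(PATCH * PATCH * CHANNELS, HIDDEN, seed=0)
    w_aud = make_weights(SAMPLE_RATE * FRAME_MS // 1000, HIDDEN, seed=1)
    image_tokens = [project(p, w_img) for p in image_to_patches(image)]
    audio_tokens = [project(f, w_aud) for f in audio_to_frames(samples)]
    return image_tokens + audio_tokens
```

Under these assumptions, a 96×96 image yields four patch tokens, and 1,600 samples at 16 kHz yield two 640-sample frames, with the trailing 320 samples dropped. Each token is a `HIDDEN`-length vector that a decoder could consume alongside text embeddings.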