Google LiteRT-LM Speeds Up Local Inference Up to 2.2x With Gemma 4 Multi-Token Prediction


Google LiteRT-LM achieves up to 2.2x faster local inference with multi-token prediction, a novel result that is directly relevant to ML inference optimization.

Languages infoq.com
Summary

Google's LiteRT-LM, built on LiteRT (formerly TensorFlow Lite), delivers up to 2.2x faster on-device inference for Gemma 4 by natively supporting multi-token prediction drafters with memory-local speculative decoding. Benchmarks show prefill and decode running 1.8x to 3.7x faster than llama.cpp, MLX, Cactus, and ONNX, while the Gemma 4 E2B model uses only 607 MB on Apple mobile CPUs. The runtime also adds Swift and JavaScript APIs, session management for KV cache persistence, and agentic features such as constrained decoding and function calling.
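
For readers unfamiliar with speculative decoding, the following is a minimal sketch of the generic greedy variant, not LiteRT-LM's API or Gemma 4's actual multi-token prediction drafter. The names `target_model`, `draft_model`, and `speculative_decode` are hypothetical, and both "models" are toy deterministic functions standing in for real networks. The idea it illustrates: a cheap drafter proposes several tokens, the large model verifies them together, and the output stays identical to what the large model would have produced alone.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token context to the greedy next token


def target_model(ctx: List[int]) -> int:
    # Hypothetical "large" model: a fixed deterministic rule, not a real LLM.
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_model(ctx: List[int]) -> int:
    # Hypothetical cheap drafter. For illustration it copies the target's answer
    # and deliberately gets it wrong whenever the context length is a multiple of 4.
    guess = target_model(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    # Baseline: one model call per generated token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    verify_rounds = 0
    while len(out) < goal:
        # 1. The drafter proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks every drafted position. A real runtime would score
        #    all k positions in one batched forward pass; this loop simulates that.
        verify_rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first wrong draft token
                break
            accepted.append(tok)
        else:
            accepted.append(target(out + accepted))  # all accepted: bonus token
        out.extend(accepted)
    return out[:goal], verify_rounds


if __name__ == "__main__":
    tokens, rounds = speculative_decode(target_model, draft_model, [1, 2, 3], 20)
    assert tokens == greedy_decode(target_model, [1, 2, 3], 20)
    print(f"generated 20 tokens in {rounds} verification rounds")
```

Because every emitted token is either a draft token the target agreed with or the target's own correction, the result matches plain greedy decoding; any speedup comes from needing fewer target passes than tokens generated, and it depends on how often the drafter is right.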

Author

Sergio De Simone
