CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
Python toolkit that converts Google's TranslateGemma 4B IT model (HuggingFace) into on-device inference bundles for Android. Google's official TFLite files only support WebGPU; this project produces CPU/XNNPACK-compatible .litertlm (LiteRT-LM) and .task (MediaPipe) files with proper KV-cache prefill/decode signatures.
Common Commands
Single quantization conversion (produces .task)
source /home/ubuntu/conv-venv/bin/activate
python convert_translategemma_android.py \
--model-dir ./translategemma-4b-it \
--tflite-dir ./tflite_output/dynamic_int8 \
--output-dir ./output \
--task-file ./output/translategemma-4b-it-native-dynamic_int8.task \
--quantize dynamic_int8 \
--prefill-seq-len 1024 --kv-cache-max-len 1024 --allow-no-token
Valid --quantize values: none, dynamic_int8, int8, int4, float16
Aliases accepted: fp16, f16, i8, q8, i4, q4, fp32
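As a rough illustration of how the aliases could resolve to the canonical values, here is a minimal sketch. The canonical names are taken from the list above, but the pairing of each alias with a canonical value (for example `q8` to `int8` rather than `dynamic_int8`, or `fp32` to `none`) and the function name `normalize_quantize` are assumptions, not read from the script.

```python
# Canonical --quantize values, as listed above.
CANONICAL = {"none", "dynamic_int8", "int8", "int4", "float16"}

# ASSUMPTION: this alias-to-canonical pairing is a plausible guess for
# illustration; check convert_translategemma_android.py for the real mapping.
ALIASES = {
    "fp16": "float16",
    "f16": "float16",
    "i8": "int8",
    "q8": "int8",
    "i4": "int4",
    "q4": "int4",
    "fp32": "none",
}


def normalize_quantize(value: str) -> str:
    """Map a user-supplied quantization name to a canonical value (hypothetical helper)."""
    key = value.strip().lower()
    key = ALIASES.get(key, key)
    if key not in CANONICAL:
        raise ValueError(f"unsupported --quantize value: {value!r}")
    return key
```

Under these assumptions, `normalize_quantize("FP16")` resolves to `float16`, and an unknown name raises `ValueError` instead of silently falling through.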
Bundle a TFLite into .litertlm (recommended for Google AI Edge Gallery)
# Requires /tmp/litert-lm-pkg; see bundle_litertlm.py setup block if missing
python bundle_litertlm.py \
--tflite ./tflite_output/dynamic_int8/*.tflite \
--tokenizer ./translategemma-4b-it/tokenizer.model \
--output ./output/translategemma-4b-it-native-dynamic_int8.litertlm \
--quant dynamic_int8
Bundle-only from existing TFLite (skip conversion)
python convert_translategemma_android.py \
--bundle-only \
--existing-tflite ./tflite_output/none/translategemma-4b-it-generic-none.tflite \
--quantize int8 \
--tflite-dir ./tflite_output/int8 \
--output-dir ./output \
--task-file ./output/translategemma-4b-it-int8.task
Batch multi-quant build + HF upload
python multi_quant_build_upload.py \
--model-dir ./translategemma-4b-it \
--quants "int4,int8,dynamic_int8" \
--repo-id barakplasma/translategemma-4b-it-android-task-quantized \
--no-upload # remove to upload
Architecture
Three-script design
convert_translategemma_android.py: single conversion run, three strategies in sequence:
- Strategy 1 (strategy1_litert_native): preferred; uses litert-torch with build_translategemma_4b(), a custom builder for the 4B architecture. Produces proper KV-cache TFLite with prefill/decode signatures. Quantization is applied natively by the converter via QUANT_MAP:
  - "int4" → "dynamic_int4_block128" (blockwise INT4, ~2 GB)
  - "int8" → "weight_only_int8" (~4 GB)
  - "dynamic_int8" → "dynamic_int8" (~4 GB)
  - "float16" → "fp16" (~8 GB)
- Strategy 2 (strategy2_generic): fallback; wraps the HF model in LogitsOnlyWrapper and exports via ai_edge_torch.convert(). Always exports float32. Important: outputs a flat input_ids → logits TFLite with NO KV cache; NOT compatible with MediaPipe LLM inference.
- Strategy 3 (strategy3_post_tflite_quantize): runs only when Strategy 2 was used (never after Strategy 1); applies post-hoc weight quantization to the TFLite flatbuffer via ai_edge_quantizer. Does NOT add KV cache but reduces file size.
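The mapping and ordering described above can be sketched as plain data plus a small planning function. The `QUANT_MAP` entries are copied from this section. The bytes-per-weight table, the nominal 4e9 parameter count, the function names, and the rule that Strategy 3 is skipped for `none` are assumptions made for illustration.

```python
# Recipe names as documented in this section.
QUANT_MAP = {
    "int4": "dynamic_int4_block128",
    "int8": "weight_only_int8",
    "dynamic_int8": "dynamic_int8",
    "float16": "fp16",
}

# ASSUMPTION: nominal storage cost per weight; ignores block scales,
# embeddings kept at higher precision, and file-format overhead.
BYTES_PER_WEIGHT = {"int4": 0.5, "int8": 1.0, "dynamic_int8": 1.0, "float16": 2.0}


def approx_size_gb(quant: str, n_params: float = 4e9) -> float:
    """Back-of-envelope file size for a nominal 4B-parameter model."""
    return n_params * BYTES_PER_WEIGHT[quant] / 1e9


def plan_strategies(strategy1_succeeded: bool, quant: str) -> list[str]:
    """Return the strategies that would run, in order (hypothetical helper)."""
    if strategy1_succeeded:
        # Strategy 3 never runs after Strategy 1.
        return ["strategy1_litert_native"]
    plan = ["strategy2_generic"]
    # ASSUMPTION: post-hoc quantization is skipped when no quantization is requested.
    if quant != "none":
        plan.append("strategy3_post_tflite_quantize")
    return plan
```

The estimates line up with the sizes quoted above: roughly 2 GB for `int4`, 4 GB for the two INT8 recipes, and 8 GB for `float16`.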
bundle_litertlm.py: takes a Strategy 1 TFLite + SentencePiece tokenizer and packages them into .litertlm format with LlmMetadata proto (Gemma3 model type, embedded Jinja chat template, BOS/EOS stop tokens, 2K max tokens). Requires /tmp/litert-lm-pkg/ with compiled FlatBuffers and proto Python bindings (see script header for setup).
multi_quant_build_upload.py: orchestrator; invokes convert_translategemma_android.py as a subprocess per quant level, handles timeouts/signals, writes output/quantization_summary.json and output/README.md, uploads artifacts to HuggingFace.
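To make the orchestration pattern concrete, the following sketch builds one converter command per quant level and runs it with a timeout. The flags mirror the single-conversion command shown earlier; the helper names, the default paths, and the shape of the returned result dictionary are hypothetical and are not taken from multi_quant_build_upload.py.

```python
import subprocess


def build_convert_command(quant, model_dir="./translategemma-4b-it",
                          tflite_root="./tflite_output", output_dir="./output"):
    """Build the argv list for one conversion run (hypothetical helper)."""
    task_file = f"{output_dir}/translategemma-4b-it-native-{quant}.task"
    return [
        "python", "convert_translategemma_android.py",
        "--model-dir", model_dir,
        "--tflite-dir", f"{tflite_root}/{quant}",
        "--output-dir", output_dir,
        "--task-file", task_file,
        "--quantize", quant,
        "--prefill-seq-len", "1024",
        "--kv-cache-max-len", "1024",
        "--allow-no-token",
    ]


def build_all(quants: str):
    """Split a comma-separated --quants string into one command per level."""
    return [build_convert_command(q.strip()) for q in quants.split(",") if q.strip()]


def run_one(cmd, timeout_s):
    """Run one conversion; report a timeout instead of raising."""
    try:
        proc = subprocess.run(cmd, timeout=timeout_s)
        return {"returncode": proc.returncode, "timed_out": False}
    except subprocess.TimeoutExpired:
        return {"returncode": None, "timed_out": True}
```

Keeping command construction separate from execution makes the per-quant commands easy to inspect or log before committing to a multi-hour run.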
Key function: build_translategemma_4b()
Critical: without this, litert-torch 0.8.0 falls back to wrong-architecture builders (1B/270m). Hardcodes the correct config:
- 34 layers, embedding_dim=2560, 8 heads, 4 KV heads, head_dim=256, intermediate=10240
- Sliding window 1024; global attention at layers where (idx+1) % 6 == 0 (indices 5,11,17,23,29)
- RMS norm with zero_centered=True, per-head QK normalization (q_norm, k_norm)
- Custom loader strips the language_model. prefix from TranslateGemma's multimodal safetensors keys (standard Gemma3 safetensors don't have this prefix)
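Two of the points above are easy to check in isolation: which layer indices use global attention, and what stripping the key prefix looks like. This is a minimal sketch, not the builder's actual code. It assumes the `language_model.` prefix sits at the very start of each key, and the example key names in the usage note are hypothetical.

```python
NUM_LAYERS = 34
PREFIX = "language_model."


def global_attention_layers(num_layers: int = NUM_LAYERS, period: int = 6):
    """Indices where (idx + 1) % period == 0; all other layers use the sliding window."""
    return [idx for idx in range(num_layers) if (idx + 1) % period == 0]


def strip_prefix(state_dict: dict, prefix: str = PREFIX) -> dict:
    """Drop a leading prefix from tensor names, leaving other keys untouched.

    ASSUMPTION: the prefix appears at the start of the key; the real
    safetensors layout may nest it differently.
    """
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


print(global_attention_layers())  # [5, 11, 17, 23, 29]
```

For example, `strip_prefix({"language_model.layers.0.q_norm.weight": 1})` yields `{"layers.0.q_norm.weight": 1}`, while a key without the prefix passes through unchanged.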
Output formats
| Format | Runtime | Notes |
|---|---|---|
| .litertlm | LiteRT-LM / Google AI Edge Gallery | Recommended; embeds Jinja prompt template and LlmMetadata |
| .task | MediaPipe GenAI | Legacy; no embedded template, so the user must manually add <start_of_turn> tokens |
Prompt format for on-device inference
TranslateGemma requires this exact format (trained with it):
<bos><start_of_turn>user
You are a professional English (en) to Spanish (es) translator...
Produce only the Spanish translation...Please translate:
{text}<end_of_turn>
<start_of_turn>model
In Google AI Edge Gallery Prompt Lab mode, paste this as the System Prompt with {{input}} as the placeholder. .litertlm files embed a simplified Jinja template for AI Chat mode.
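For .task bundles, where the turn markers have to be added by hand, a small wrapper can do it. This sketch only adds the control tokens shown in the template above. The instruction text is elided in this document, so it is left to the caller rather than reconstructed here. The function name and the trailing newline after `<start_of_turn>model` are assumptions.

```python
def wrap_user_turn(user_message: str) -> str:
    """Wrap an already-composed user message in Gemma turn markers (hypothetical helper).

    ASSUMPTION: a newline follows the final "<start_of_turn>model" marker,
    as in the usual Gemma generation prompt.
    """
    return (
        "<bos><start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


# The caller supplies the full instruction plus the text to translate.
prompt = wrap_user_turn("INSTRUCTION TEXT GOES HERE\nHello, world")
```

If the runtime already prepends `<bos>` itself, the leading token would need to be dropped to avoid duplicating it.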
Runtime notes
- conv-venv/: virtualenv with all deps (litert-torch==0.8.0, mediapipe, ai_edge_torch, transformers)
- /tmp/litert-lm-pkg/: manually assembled package from cloned LiteRT-LM repo with compiled FlatBuffer (flatc -p --gen-onefile) and proto (protoc) Python bindings; required by bundle_litertlm.py at runtime; NOT persistent across reboots
- /tmp/litert-lm/: cloned google-ai-edge/LiteRT-LM repo (schema source for rebuilding the package)
- Conversion requires ~128 GB RAM; 4B model loads ~46 GB
- translategemma-4b-it/tokenizer.model is the SentencePiece binary used by both .task and .litertlm bundlers; ensure_tokenizer_model() auto-converts from tokenizer.json if missing
- HuggingFace repo: barakplasma/translategemma-4b-it-android-task-quantized; upload token in HF_TOKEN env var
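Because /tmp/litert-lm-pkg/ does not survive a reboot, a quick preflight check before a long build can save time. The sketch below checks only the paths and the environment variable named in the notes above. The function name and its return format are hypothetical, and it does not verify that the package contents are actually importable.

```python
import os


def preflight(pkg_dir="/tmp/litert-lm-pkg",
              tokenizer="./translategemma-4b-it/tokenizer.model",
              env=os.environ):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if not os.path.isdir(pkg_dir):
        problems.append(f"missing {pkg_dir}; see the bundle_litertlm.py setup block")
    if not os.path.isfile(tokenizer):
        problems.append(f"missing {tokenizer}")
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set; uploads will fail")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

A missing tokenizer.model is not necessarily fatal, since ensure_tokenizer_model() can regenerate it from tokenizer.json, so a caller may prefer to treat that entry as a warning only.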