CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Python toolkit that converts Google's TranslateGemma 4B IT model (HuggingFace) into on-device inference bundles for Android. Google's official TFLite files only support WebGPU; this project produces CPU/XNNPACK-compatible .litertlm (LiteRT-LM) and .task (MediaPipe) files with proper KV-cache prefill/decode signatures.

Common Commands

Single quantization conversion (produces .task)

source /home/ubuntu/conv-venv/bin/activate

python convert_translategemma_android.py \
  --model-dir ./translategemma-4b-it \
  --tflite-dir ./tflite_output/dynamic_int8 \
  --output-dir ./output \
  --task-file ./output/translategemma-4b-it-native-dynamic_int8.task \
  --quantize dynamic_int8 \
  --prefill-seq-len 1024 --kv-cache-max-len 1024 --allow-no-token

Valid --quantize values: none, dynamic_int8, int8, int4, float16.

Accepted aliases: fp16, f16, i8, q8, i4, q4, fp32.
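A minimal sketch of how the aliases could be folded onto the canonical values before conversion. The pairing of each alias with a canonical value (for example fp32 with none) is an assumption based on the names, and normalize_quantize is a hypothetical helper, not a function taken from the script.

```python
# Canonical --quantize values as listed in this file.
CANONICAL = ("none", "dynamic_int8", "int8", "int4", "float16")

# ASSUMPTION: alias-to-canonical pairing inferred from the names only.
ALIASES = {
    "fp16": "float16",
    "f16": "float16",
    "i8": "int8",
    "q8": "int8",
    "i4": "int4",
    "q4": "int4",
    "fp32": "none",
}


def normalize_quantize(value: str) -> str:
    """Map a user-supplied value or alias onto a canonical value."""
    key = value.strip().lower()
    key = ALIASES.get(key, key)
    if key not in CANONICAL:
        raise ValueError(f"unsupported --quantize value: {value!r}")
    return key
```

Rejecting unknown values early avoids starting a long conversion run with a mistyped flag.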

Bundle a TFLite into .litertlm (recommended for Google AI Edge Gallery)

# Requires /tmp/litert-lm-pkg; see the setup block in bundle_litertlm.py if it is missing
python bundle_litertlm.py \
  --tflite ./tflite_output/dynamic_int8/*.tflite \
  --tokenizer ./translategemma-4b-it/tokenizer.model \
  --output ./output/translategemma-4b-it-native-dynamic_int8.litertlm \
  --quant dynamic_int8
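The --tflite argument above relies on shell glob expansion, so the shell passes several paths if the directory holds more than one .tflite file. A small sketch of resolving exactly one file up front; resolve_single_tflite is a hypothetical helper and not part of bundle_litertlm.py.

```python
from pathlib import Path


def resolve_single_tflite(directory: str) -> Path:
    """Return the only .tflite file in `directory`, or raise if there is not exactly one."""
    matches = sorted(Path(directory).glob("*.tflite"))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one .tflite in {directory}, found {len(matches)}"
        )
    return matches[0]
```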

Bundle-only from existing TFLite (skip conversion)

python convert_translategemma_android.py \
  --bundle-only \
  --existing-tflite ./tflite_output/none/translategemma-4b-it-generic-none.tflite \
  --quantize int8 \
  --tflite-dir ./tflite_output/int8 \
  --output-dir ./output \
  --task-file ./output/translategemma-4b-it-int8.task

Batch multi-quant build + HF upload

python multi_quant_build_upload.py \
  --model-dir ./translategemma-4b-it \
  --quants "int4,int8,dynamic_int8" \
  --repo-id barakplasma/translategemma-4b-it-android-task-quantized \
  --no-upload   # remove to upload

Architecture

Three-script design

convert_translategemma_android.py performs a single conversion run, trying three strategies in sequence:

  1. Strategy 1 (strategy1_litert_native), preferred: uses litert-torch with build_translategemma_4b(), a custom builder for the 4B architecture. Produces proper KV-cache TFLite with prefill/decode signatures. Quantization is applied natively by the converter via QUANT_MAP:

    • "int4" → "dynamic_int4_block128" (blockwise INT4, ~2 GB)
    • "int8" → "weight_only_int8" (~4 GB)
    • "dynamic_int8" → "dynamic_int8" (~4 GB)
    • "float16" → "fp16" (~8 GB)
  2. Strategy 2 (strategy2_generic), fallback: wraps the HF model in LogitsOnlyWrapper and exports via ai_edge_torch.convert(). Always exports float32. Important: outputs a flat input_ids → logits TFLite with NO KV cache, which is NOT compatible with MediaPipe LLM inference.

  3. Strategy 3 (strategy3_post_tflite_quantize): runs only when Strategy 2 was used (never after Strategy 1); applies post-hoc weight quantization to the TFLite flatbuffer via ai_edge_quantizer. Does NOT add KV cache but reduces file size.
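The ordering of the three strategies can be sketched as plain control flow. This is an illustration under stated assumptions, not the script's code: the callables native, generic, and post_quantize are hypothetical stand-ins for the strategy functions, Strategy 1 failure is assumed to surface as an exception or a None result, and "none" is assumed to mean no recipe is passed to the converter.

```python
from typing import Callable, List, Optional, Tuple

# Mapping as listed above: CLI value -> converter quantization recipe name.
QUANT_MAP = {
    "int4": "dynamic_int4_block128",
    "int8": "weight_only_int8",
    "dynamic_int8": "dynamic_int8",
    "float16": "fp16",
}


def run_conversion(
    quantize: str,
    native: Callable[[Optional[str]], Optional[str]],  # hypothetical Strategy 1
    generic: Callable[[], str],                        # hypothetical Strategy 2
    post_quantize: Callable[[str, str], str],          # hypothetical Strategy 3
) -> Tuple[str, List[str]]:
    """Return (tflite_path, strategies_that_ran)."""
    ran = ["strategy1"]
    try:
        # ASSUMPTION: "none" has no QUANT_MAP entry, so the recipe is None.
        path = native(QUANT_MAP.get(quantize))
    except Exception:
        path = None
    if path is not None:
        return path, ran  # Strategy 3 never runs after Strategy 1

    ran.append("strategy2")
    path = generic()  # always float32, no KV cache
    if quantize != "none":
        ran.append("strategy3")
        path = post_quantize(path, quantize)
    return path, ran
```

Returning the list of strategies that ran makes it easy for a caller to flag outputs that came from the fallback path and therefore lack a KV cache.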

bundle_litertlm.py takes a Strategy 1 TFLite plus a SentencePiece tokenizer and packages them into .litertlm format with an LlmMetadata proto (Gemma3 model type, embedded Jinja chat template, BOS/EOS stop tokens, 2K max tokens). Requires /tmp/litert-lm-pkg/ with compiled FlatBuffers and proto Python bindings (see script header for setup).

multi_quant_build_upload.py is the orchestrator: it invokes convert_translategemma_android.py as a subprocess per quant level, handles timeouts and signals, writes output/quantization_summary.json and output/README.md, and uploads artifacts to HuggingFace.
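A rough sketch of the per-quant subprocess pattern with a timeout and a JSON summary. The function names, the summary schema, and the default timeout are illustrative assumptions and do not reflect the actual contents of multi_quant_build_upload.py or quantization_summary.json.

```python
import json
import subprocess
from pathlib import Path
from typing import Callable, Dict, List


def run_one(cmd: List[str], timeout_s: float) -> dict:
    """Run one conversion command and classify the outcome."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising the timeout.
        return {"status": "timeout", "returncode": None}
    status = "ok" if proc.returncode == 0 else "failed"
    return {"status": status, "returncode": proc.returncode}


def build_all(
    quants: List[str],
    make_cmd: Callable[[str], List[str]],  # hypothetical: builds argv for one quant
    summary_path: str,
    timeout_s: float = 3600.0,             # ASSUMPTION: illustrative default
) -> Dict[str, dict]:
    """Run every quant level in turn and write a JSON summary (hypothetical schema)."""
    summary = {q: run_one(make_cmd(q), timeout_s) for q in quants}
    out = Path(summary_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(summary, indent=2))
    return summary
```

Running each quant level in its own process means a crash or timeout in one conversion does not take down the remaining builds.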

Key function: build_translategemma_4b()

Critical: without this, litert-torch 0.8.0 falls back to wrong-architecture builders (1B/270m). It hardcodes the correct config:

  • 34 layers, embedding_dim=2560, 8 heads, 4 KV heads, head_dim=256, intermediate=10240
  • Sliding window 1024; global attention at layers where (idx+1) % 6 == 0 (indices 5,11,17,23,29)
  • RMS norm with zero_centered=True, per-head QK normalization (q_norm, k_norm)
  • Custom loader strips language_model. prefix from TranslateGemma's multimodal safetensors keys (standard Gemma3 safetensors don't have this prefix)
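Two of the points above lend themselves to a short illustration: the global-attention schedule and the key-prefix stripping. The sketch below is standalone and does not use litert-torch. The function names are hypothetical, and it assumes the prefix sits at the very start of each key and that keys without it pass through unchanged.

```python
from typing import Dict, List

NUM_LAYERS = 34
GLOBAL_EVERY = 6  # global attention where (idx + 1) % 6 == 0


def attention_kinds(num_layers: int = NUM_LAYERS, every: int = GLOBAL_EVERY) -> List[str]:
    """Label each layer index as 'global' or 'sliding' (window 1024)."""
    return [
        "global" if (idx + 1) % every == 0 else "sliding"
        for idx in range(num_layers)
    ]


def strip_language_model_prefix(state: Dict[str, object]) -> Dict[str, object]:
    """Drop a leading 'language_model.' from checkpoint keys.

    ASSUMPTION: keys without the prefix are kept as they are.
    Requires Python 3.9+ for str.removeprefix.
    """
    prefix = "language_model."
    return {key.removeprefix(prefix): value for key, value in state.items()}


global_layers = [i for i, kind in enumerate(attention_kinds()) if kind == "global"]
print(global_layers)  # [5, 11, 17, 23, 29]
```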

Output formats

| Format | Runtime | Notes |
|---|---|---|
| .litertlm | LiteRT-LM / Google AI Edge Gallery | Recommended; embeds Jinja prompt template and LlmMetadata |
| .task | MediaPipe GenAI | Legacy; no embedded template, so the user must manually add <start_of_turn> tokens |

Prompt format for on-device inference

TranslateGemma requires this exact format (trained with it):

<bos><start_of_turn>user
You are a professional English (en) to Spanish (es) translator...
Produce only the Spanish translation...Please translate:


{text}<end_of_turn>
<start_of_turn>model

In Google AI Edge Gallery Prompt Lab mode, paste this as the System Prompt with {{input}} as the placeholder. .litertlm files embed a simplified Jinja template for AI Chat mode.
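For .task bundles, where the turn tokens must be added by hand, the wrapping can be sketched as a small helper. build_prompt is hypothetical. The instruction text is elided in this file, so it is taken as a parameter instead of being reproduced. The two blank lines before the text and the trailing newline after the model turn marker are assumptions read off the layout shown above.

```python
def build_prompt(instruction: str, text: str) -> str:
    """Wrap the translator instruction and source text in Gemma turn markers."""
    return (
        "<bos><start_of_turn>user\n"
        f"{instruction}\n\n\n"          # ASSUMPTION: two blank lines before the text
        f"{text}<end_of_turn>\n"
        "<start_of_turn>model\n"        # ASSUMPTION: trailing newline after 'model'
    )
```

If the runtime's tokenizer already prepends a BOS token, the literal <bos> here would need to be dropped to avoid a duplicate; whether that applies depends on the runtime configuration.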

Runtime notes

  • conv-venv/: virtualenv with all deps (litert-torch==0.8.0, mediapipe, ai_edge_torch, transformers)
  • /tmp/litert-lm-pkg/: manually assembled package from the cloned LiteRT-LM repo with compiled FlatBuffer (flatc -p --gen-onefile) and proto (protoc) Python bindings; required by bundle_litertlm.py at runtime; NOT persistent across reboots
  • /tmp/litert-lm/: cloned google-ai-edge/LiteRT-LM repo (schema source for rebuilding the package)
  • Conversion requires ~128 GB RAM; 4B model loads ~46 GB
  • translategemma-4b-it/tokenizer.model is the SentencePiece binary used by both .task and .litertlm bundlers; ensure_tokenizer_model() auto-converts from tokenizer.json if missing
  • HuggingFace repo: barakplasma/translategemma-4b-it-android-task-quantized; upload token in HF_TOKEN env var
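Because /tmp/litert-lm-pkg/ does not survive a reboot and uploads depend on HF_TOKEN, a quick check before a long run can save time. The sketch below is a hypothetical helper, not part of the repository's scripts; it only tests that the directory exists and the variable is set, not that either is valid.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def preflight(
    pkg_dir: str = "/tmp/litert-lm-pkg",
    need_upload: bool = False,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    if not Path(pkg_dir).is_dir():
        problems.append(f"{pkg_dir} is missing; rebuild it (see bundle_litertlm.py header)")
    if need_upload and not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set; upload will fail")
    return problems
```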