Qwen-2.5-1.5B-KCC-LiteRT-LM

This is an on-device farmer advisory language model fine-tuned on cleaned Kisan Call Centre (KCC) question–answer pairs from Indian smallholder farmers, then converted and packaged for efficient offline inference using Google's LiteRT / LiteRT-LM stack.

It is intended for low-connectivity edge scenarios, such as mobile advisory apps for Indian farmers.

Model Details

  • Base model: unsloth/Qwen2.5-1.5B-Instruct
  • Fine-tuning method: LoRA (parameter-efficient) via Unsloth + TRL SFTTrainer
  • Dataset: Cleaned "Farmers Call Query Data" by Das Koushik, based on data from data.gov.in
    → Only null/empty rows removed; no synthetic data, paraphrasing, or external augmentation
  • Training regime: Very short, step-limited runs (max_steps=60, warmup_steps=5) on a Colab Tesla T4 due to free-tier constraints; see the fine-tuning sketch after this list
    → A pilot for pipeline validation, not full convergence
  • Conversion: PyTorch → LiteRT (.tflite) using Google AI Edge Torch (v0.7.1); see the conversion sketch after this list
    → Static KV cache: 4096 tokens
    → Result: ~1.6 GB .tflite artifact
  • Packaging: .tflite → .litertlm using LiteRT-LM (v0.8.1) tools
  • Quantization: Quantized graph as produced by the AI Edge Torch conversion
  • Context length: 4096 tokens (fixed/static KV cache)
  • Intended use: Offline, interactive agricultural advisory in low-resource settings
  • Out-of-scope: General-purpose chat, high-precision agronomy, multi-turn memory beyond context limit, production-grade fluency
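
As a concrete reference for the regime above, here is a minimal sketch of the Unsloth + TRL fine-tuning run. The max_steps=60 and warmup_steps=5 values are the documented ones; the dataset file, text field, LoRA rank, batch size, and learning rate are illustrative assumptions, and the SFTTrainer signature shown matches the older TRL releases used in Unsloth's Colab notebooks.

```python
# Minimal sketch of the LoRA fine-tuning run (Unsloth + TRL SFTTrainer).
# max_steps=60 and warmup_steps=5 match the documented pilot regime;
# dataset path, text field, LoRA rank, and optimizer settings are assumptions.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-1.5B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,  # assumption: 4-bit loading to fit a free-tier Colab T4
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank: illustrative, not a documented value
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
)

# Cleaned KCC Q&A pairs: only null/empty rows were removed upstream.
dataset = load_dataset("json", data_files="kcc_cleaned.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # assumes pre-formatted chat-template strings
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,   # documented
        max_steps=60,     # documented: step-limited pilot run
        learning_rate=2e-4,
        fp16=True,
        optim="adamw_8bit",
        output_dir="outputs",
    ),
)
trainer.train()
```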
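
The LiteRT conversion follows AI Edge Torch's generative-model workflow. The sketch below uses the Qwen example module and converter utility bundled with ai_edge_torch; exact function names and argument signatures have shifted between releases (this project pins v0.7.1), so treat it as an outline with placeholder paths rather than verbatim commands.

```python
# Sketch of the PyTorch -> LiteRT (.tflite) conversion with AI Edge Torch.
# Module/function names follow ai_edge_torch's bundled Qwen example; exact
# signatures differ across releases, so treat this as an outline.
from ai_edge_torch.generative.examples.qwen import qwen
from ai_edge_torch.generative.utilities import converter

# Build the re-authored Qwen 2.5 1.5B graph from the merged checkpoint,
# with the static 4096-token KV cache documented above.
pytorch_model = qwen.build_1_5b_model(
    "merged-qwen2.5-1.5b-kcc",  # placeholder path to the merged checkpoint
    kv_cache_max_len=4096,
)

# Export a quantized .tflite with separate prefill/decode signatures.
converter.convert_to_tflite(
    pytorch_model,
    output_path="qwen2.5-1.5b-kcc.tflite",  # ~1.6 GB artifact in this project
    prefill_seq_len=1024,  # illustrative; not documented in this card
    quantize=True,
)
```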

Performance on Consumer Hardware

Tested on a Mac mini (Apple M4, 16 GB unified memory) using LiteRT-LM with the GPU backend:

  • Time-to-first-token (TTFT): < 1 second
  • End-to-end response time (50–150 token advisory answers): ~2.5–4 seconds
  • Throughput: Stable incremental decoding (steady token emission throughout the response)

Suitable for real-time, offline farmer-facing tools.

Important Limitations & Known Behaviors

This is an early engineering-validation release, not a production model.

Due to the extremely short training run (Colab free-tier constraints):

  • Strong mirroring of the KCC dataset's terse, bullet-list style → outputs often lack natural conversational flow
  • Occasional near-verbatim reuse of training phrases with limited adaptation to query variations
  • Mild repetition / incomplete reasoning (undertraining artifact)

LiteRT-LM-specific observations (compared to PyTorch inference):

  • Noticeably reduced coherence
  • Increased repetition, fragmentation, or looping in some generations
  • Responses sometimes feel more generic / less tightly grounded

→ These are runtime-specific behaviors (not present in the original PyTorch checkpoint).
The root cause has not yet been isolated (no controlled ablation yet); likely contributors include decoding configuration, stop-token alignment, fixed KV-cache constraints, and runtime-specific sampling behavior.

The project deliberately prioritizes a reproducible deployment path and transparent failure modes over peak quality. Full multi-epoch training and runtime debugging are expected to improve results significantly.

Comparison with Prior Gemma-3n Effort

Compared to earlier Gemma-3n-E2B fine-tuning on the same task:

  • Qwen2.5-1.5B wins on: conversion success, long-context stability, deployment reliability
  • Gemma-3n-E2B wins on: more natural dialogue style, broader multilingual starting point
  • Deciding factor: Gemma-3n could not be reliably converted to LiteRT-LM with public tooling → a hard dead end

See full comparison in the project documentation.

Usage

This model is packaged in the .litertlm format for use with the LiteRT-LM runtime (preview stage as of December 2025).

Refer to:

  • LiteRT-LM GitHub
  • Google AI Edge Gallery app (Android) for quick testing
  • LiteRT documentation for integration into Android/iOS/macOS/Linux apps
  • UAI.LiteRTLM β€” a Unity package wrapping LiteRT-LM inference, useful for building Android/Quest apps with this model
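
For a quick smoke test outside an app, the LiteRT-LM repository also ships a litert_lm_main command-line binary; an invocation along the lines of `litert_lm_main --backend=gpu --model_path=<path to the .litertlm file>` (flag names follow the LiteRT-LM README and may change while the runtime is in preview) loads the package and runs generation locally.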

Reproducibility

Full pipeline (data cleaning → LoRA fine-tuning → merge → LiteRT conversion → LiteRT-LM packaging) is documented with scripts and exact commands in the associated repository.
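
For the merge step specifically, Unsloth provides a helper that folds the LoRA adapters back into the base weights before conversion; a minimal sketch, continuing from the fine-tuned model object in the sketch above (the output directory name is a placeholder):

```python
# Merge the LoRA adapters into the base weights so AI Edge Torch can
# consume a plain checkpoint. save_pretrained_merged is Unsloth's helper;
# the output directory name is a placeholder.
model.save_pretrained_merged(
    "merged-qwen2.5-1.5b-kcc",
    tokenizer,
    save_method="merged_16bit",  # full 16-bit merge, not a GGUF/4-bit export
)
```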

Conversion requires a high-RAM CPU instance (~128 GB recommended). No GPUs are needed for conversion or packaging.

See the project documentation for step-by-step instructions (AWS EC2 r6i instances used in original work):
https://uralstech.github.io/Qwen-KCC-On-Device-Pipeline
