---
license: mit
language:
- en
tags:
- causal-lm
- scientific-language-model
- mathematics
- arxiv
- research
library_name: transformers
---

# Minnow-Math-1.5B

**Minnow-Math-1.5B** is a ~1.5B-parameter decoder-only transformer trained from scratch on raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics.

📄 **Paper:** https://arxiv.org/abs/2602.17288
💻 **GitHub:** https://github.com/kitefishai/Minnow-Math-1.5B

This is a **base scientific language model** (not instruction-tuned).

## Overview

Minnow-Math-1.5B explores what it takes to train a domain-specialized scientific language model directly from structured LaTeX archives.

**Training Scale**
- ~52B pretraining tokens
- ~5B additional post-training tokens
- ~200GB processed scientific corpus
- LLaMA-compatible tokenizer (~102k vocab)
- 2× NVIDIA A100 (80GB) GPUs
- 24 experimental training runs

The focus of this project is *scientific language modeling robustness*, not benchmark optimization.

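
To get a feel for how the LLaMA-compatible tokenizer segments raw LaTeX, you can inspect it directly. The snippet below is a minimal sketch: the example fragment is illustrative, and the exact token splits depend on the released vocabulary.

```python
from transformers import AutoTokenizer

# Load the model's LLaMA-compatible tokenizer (~102k vocabulary).
tokenizer = AutoTokenizer.from_pretrained("KiteFishAI/Minnow-Math-1.5B")

# A short LaTeX fragment of the kind found in arXiv source files (illustrative only).
latex = r"\begin{theorem} If $f, g \colon \mathbb{R} \to \mathbb{R}$ are continuous, then $f + g$ is continuous. \end{theorem}"

ids = tokenizer(latex)["input_ids"]
print(len(ids))                                   # token count for this fragment
print(tokenizer.convert_ids_to_tokens(ids)[:20])  # how LaTeX commands and math symbols are split
```
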

## Model Architecture

- 24 Transformer layers
- Hidden size: 2048
- FFN size: 5504
- 16 attention heads
- Context length: 4096 (pretraining used 768-token sequences)
- Dense LLaMA-style architecture (a matching `LlamaConfig` sketch is shown below)
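
These dimensions can be expressed with the `transformers` `LlamaConfig` class. The sketch below is illustrative only: the vocabulary size is approximate, and RoPE and other unspecified settings fall back to library defaults rather than the released configuration.

```python
from transformers import LlamaConfig

# Illustrative sketch of the architecture listed above. vocab_size is approximate
# (~102k) and all unspecified fields use LlamaConfig defaults, which may differ
# from the released checkpoint.
config = LlamaConfig(
    vocab_size=102_000,             # ~102k LLaMA-compatible vocabulary (approximate)
    hidden_size=2048,               # hidden size
    intermediate_size=5504,         # FFN size
    num_hidden_layers=24,           # transformer layers
    num_attention_heads=16,         # attention heads
    max_position_embeddings=4096,   # context length (pretraining used 768-token sequences)
)
```
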

**Optimization**
- AdamW
- Learning rate: 2e-4
- Warmup: 500 steps
- Weight decay: 0.1
- Gradient accumulation: 32 steps
- bf16 mixed precision
- Gradient checkpointing enabled (a matching `TrainingArguments` sketch is shown below)
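
For reference, a `transformers.TrainingArguments` configuration mirroring these hyperparameters might look like the sketch below. The output path and per-device batch size are placeholders, not values reported for this model; AdamW is the `Trainer` default optimizer.

```python
from transformers import TrainingArguments

# Sketch of the optimization settings listed above. output_dir and the per-device
# batch size are placeholders; they are not reported in this model card.
args = TrainingArguments(
    output_dir="minnow-math-pretrain",   # placeholder path
    learning_rate=2e-4,                  # AdamW learning rate
    weight_decay=0.1,
    warmup_steps=500,
    gradient_accumulation_steps=32,
    bf16=True,                           # bf16 mixed precision
    gradient_checkpointing=True,
    per_device_train_batch_size=8,       # placeholder, not the value used in training
)
```
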

**Validation Perplexity:** ~4.2 (held-out scientific corpus)
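
Perplexity here is the exponential of the mean token-level cross-entropy on held-out text. The held-out corpus itself is not distributed with this card, but the measurement can be reproduced on your own evaluation data along these lines:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "KiteFishAI/Minnow-Math-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Any held-out scientific text; this fragment is illustrative only.
text = r"Let $X$ be a compact Hausdorff space and let $f \colon X \to \mathbb{R}$ be continuous."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels returns the mean cross-entropy over the sequence.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(torch.exp(loss).item())  # perplexity = exp(mean cross-entropy)
```
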

## Intended Use

Minnow-Math-1.5B is suitable for:

- Scientific text modeling research
- Mathematical language modeling experiments
- Pretraining initialization for domain fine-tuning
- Tokenization and symbolic modeling research
- Studying LaTeX structure modeling

It is **not optimized for:**

- Instruction following
- Chat-based applications
- General conversational AI
- Benchmark leaderboard performance

## Performance Notes

This model was trained under moderate compute constraints and without instruction tuning or alignment stages.

Observed characteristics:

- Strong familiarity with scientific writing style
- Stable LaTeX structural modeling
- Reasonable symbolic fluency
- Limited reasoning depth
- Low downstream benchmark accuracy without fine-tuning

Performance improves significantly with supervised fine-tuning (SFT), LoRA adaptation, or domain-specific instruction tuning; a LoRA example is sketched below.
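
For instance, a LoRA adaptation with the `peft` library could be set up as follows. The rank, dropout, and target modules are illustrative assumptions (the module names follow the standard LLaMA naming convention) rather than recommended settings.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("KiteFishAI/Minnow-Math-1.5B")

# Hypothetical LoRA hyperparameters; tune the rank and target modules for your task.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style blocks
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights remain trainable
```
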

## Limitations

- Not instruction-tuned
- No RLHF or preference alignment
- Trained at 768-token sequence length
- Domain restricted to selected arXiv categories
- Not optimized for reasoning benchmarks
- General NLP benchmark scores may be low

This release is intended primarily for research and experimentation.

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "KiteFishAI/Minnow-Math-1.5B"

# Load the tokenizer and the base model weights.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# This is a base model, so it continues the prompt rather than following instructions.
prompt = "Prove that the sum of two continuous functions is continuous."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=200)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Citation

If you use this model in your research, please cite:

```
@article{kitefish_a1_2026,
  title={KiteFish-A1: Training a Scientific Language Model from Raw LaTeX Archives},
  author={...},
  year={2026},
  eprint={2602.17288},
  archivePrefix={arXiv}
}
```