NPC Fast 1.7B - GGUF

GGUF conversions of ramankrishna10/npc-fast-1.7b for llama.cpp / Ollama / LM Studio / local inference.

| File | Size | Precision | Best for |
|---|---|---|---|
| npc-fast-1.7b-q4_k_m.gguf | 1.1 GB | Q4_K_M | laptops, CPU inference, fastest |
| npc-fast-1.7b-q8_0.gguf | 1.8 GB | Q8_0 | best quality at quantized size |
| npc-fast-1.7b-f16.gguf | 3.4 GB | F16 | lossless, reference |

See the main repo for training recipe, evaluation results, and honest limitation disclosures (short version: trained to 16K context, 128K YaRN scaling is unvalidated).

Usage with llama.cpp

./build/bin/llama-cli \
  -m npc-fast-1.7b-q4_k_m.gguf \
  -sys "You are NPC Fast. Output exactly one JSON object with fields route and reason." \
  -p "What is 2+2?" \
  -n 60 --temp 0.0

Usage with Ollama

# Create a Modelfile:
cat > Modelfile << "MOD"
FROM ./npc-fast-1.7b-q4_k_m.gguf
PARAMETER temperature 0.0
PARAMETER num_predict 60
SYSTEM "You are NPC Fast, a routing model. Output exactly one JSON object with fields route and reason."
MOD

ollama create npc-fast -f Modelfile
ollama run npc-fast "Build a DCF for TSLA."

Intended use

See the main repo. Short version: decide between self (handle locally) and npc_fin (escalate to a 32B finance specialist) for every user request, then emit {"route": "self"|"npc_fin", "reason": "<short>"}.
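Because the model is expected to emit exactly one JSON object with a route of either self or npc_fin, a caller may want to check the output before dispatching on it. The following is a minimal sketch of such a check in plain shell. The helper name parse_route and the sample reply (including its reason text) are hypothetical and are not part of llama.cpp, Ollama, or the main repo. It uses a sed pattern rather than a real JSON parser, so it assumes the object arrives on a single line.

```shell
#!/bin/sh
# parse_route: hypothetical helper (an assumption for illustration).
# Prints the route and returns 0 if it is "self" or "npc_fin";
# prints a message to stderr and returns 1 for anything else.
parse_route() {
  # Capture the string value that follows the "route" key.
  route=$(printf '%s\n' "$1" |
    sed -n 's/.*"route"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
  case "$route" in
    self|npc_fin) printf '%s\n' "$route" ;;
    *) echo "unexpected router output: $1" >&2; return 1 ;;
  esac
}

# Made-up sample reply in the documented shape.
reply='{"route": "npc_fin", "reason": "valuation model requested"}'
if route=$(parse_route "$reply"); then
  echo "route=$route"
fi
```

Rejecting anything outside the two documented routes keeps a malformed or truncated generation from being silently treated as a routing decision. For production use, a proper JSON parser would be a safer choice than a regex.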

Limitations

  • Trained to 16K context. 128K YaRN scaling is configured but not validated.
  • Router fine-tuned on 500 synthetic pairs. Out-of-distribution (OOD) eval at 98.3% (n=60).
  • No RLHF. Refusal behavior inherited from SmolLM2-Instruct.
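To put the n=60 eval figure in perspective, the short sketch below prints the percentage for a few possible counts out of 60. It assumes the 98.3% figure is a plain fraction of correct examples, which this card does not state explicitly. The function name accuracy_table is hypothetical.

```shell
#!/bin/sh
# accuracy_table: hypothetical helper (an assumption for illustration).
# Assumes the reported score is correct / n with n = 60.
accuracy_table() {
  awk 'BEGIN {
    n = 60
    for (correct = 58; correct <= 60; correct++)
      printf "%d/%d correct = %.1f%%\n", correct, n, 100 * correct / n
  }'
}

accuracy_table
```

Under that assumption, 98.3% corresponds to 59 of 60, and a single example moves the score by about 1.7 percentage points, so the figure is best read as a coarse indicator.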

Credits

Built by Bottensor (a Falcon Hash company), creator: dude.npc. Base: HuggingFaceTB/SmolLM2-1.7B-Instruct.

Citation

If you use this model or build on its training recipe, please cite the accompanying preprint:

Bachu, R. K. (2026). NPC Fast 1.7B: Building a Usable Small Model on a Single H100. Zenodo. https://doi.org/10.5281/zenodo.19771040

@misc{bachu2026npcfast,
  title     = {NPC Fast 1.7B: Building a Usable Small Model on a Single H100},
  author    = {Bachu, Rama Krishna},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19771040},
  url       = {https://doi.org/10.5281/zenodo.19771040},
  note      = {Preprint},
}