Qwen2.5-1.5B-Instruct: GGUF Quantization Experiments
This repo contains Qwen2.5-1.5B-Instruct quantized into multiple GGUF formats using llama.cpp. It was created as part of a hands-on quantization experiment documenting the full process from raw Hugging Face weights → multiple GGUF formats → quality evaluation.
What's in This Repo
gguf/
├── qwen2.5-1.5b-f16.gguf       ~2.9 GB   source of truth, full precision
├── qwen2.5-1.5b-Q8_0.gguf      ~1.6 GB   near-lossless
├── qwen2.5-1.5b-Q5_K_M.gguf    ~1.0 GB   great quality/size tradeoff
├── qwen2.5-1.5b-Q4_K_M.gguf    ~935 MB   recommended, sweet spot
├── qwen2.5-1.5b-Q4_K_S.gguf    ~865 MB   leaner 4-bit variant
├── qwen2.5-1.5b-Q2_K.gguf      ~530 MB   aggressive K-quant baseline
├── qwen2.5-1.5b-Q2_K_S.gguf    ~530 MB   aggressive K-quant, needs imatrix
├── qwen2.5-1.5b-IQ3_M.gguf     ~680 MB   importance-weighted 3-bit
├── qwen2.5-1.5b-IQ2_XS.gguf    ~480 MB   importance-weighted 2-bit, needs imatrix
├── qwen2.5-1.5b-IQ2_XXS.gguf   ~420 MB   most aggressive, needs imatrix
├── qwen2.5-1.5b-IQ2_S.gguf     ~450 MB   importance-weighted 2.5-bit, needs imatrix
├── qwen2.5-1.5b-IQ1_M.gguf     ~300 MB   extreme 1.75-bit, needs imatrix
└── qwen2.5-1.5b-IQ1_S.gguf     ~280 MB   extreme 1.56-bit, needs imatrix
Note on f16: The F16 file is included as the reference baseline for perplexity comparisons. It is not intended for general inference use: at 2.9 GB it offers no practical advantage over Q8_0 for local deployment.
Which File Should I Use?
| Use Case | Recommended Format |
|---|---|
| Best quality, VRAM not a concern | Q8_0 |
| Daily driver, best quality/size tradeoff | Q4_K_M (start here) |
| Tight on memory, want decent quality | Q4_K_S or Q2_K |
| Edge deployment / very limited RAM | IQ2_XS or IQ2_XXS |
| Research / extreme compression testing | IQ1_M or IQ1_S |
| Partial GPU offload (CPU + GPU split) | Q4_K_M or IQ3_M |
If you're not sure, start with Q4_K_M. It's the most tested format in the community and gives you a ~68% size reduction relative to the F16 file with minimal quality loss.
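The ~68% figure can be reproduced from the approximate sizes in the file listing. A minimal sketch, assuming 1 GB = 1000 MB and using the rounded sizes quoted in this README (the helper name `reduction_pct` is illustrative, not part of any tool):

```python
# Approximate file sizes in MB, copied from the file listing (1 GB taken as 1000 MB).
SIZES_MB = {
    "F16": 2900,
    "Q8_0": 1600,
    "Q5_K_M": 1000,
    "Q4_K_M": 935,
    "Q4_K_S": 865,
    "IQ3_M": 680,
    "Q2_K": 530,
    "IQ2_XS": 480,
    "IQ1_S": 280,
}


def reduction_pct(size_mb: float, baseline_mb: float = SIZES_MB["F16"]) -> float:
    """Percentage by which a file is smaller than the F16 baseline."""
    return 100.0 * (1.0 - size_mb / baseline_mb)


for name, size in SIZES_MB.items():
    print(f"{name:8s} {size:5d} MB  {reduction_pct(size):5.1f}% smaller than F16")
```

Because the inputs are rounded, the percentages are only approximate.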
⚠️ The IQ1 and IQ2 formats (IQ1_S, IQ1_M, IQ2_S, IQ2_XS, IQ2_XXS) and Q2_K_S were all generated with an importance matrix. Without one, these formats produce significantly degraded output. See the Imatrix Calibration section below for details.
Format Guide
K-Quant Family (Q*_K_*)
Standard llama.cpp quantization using superblocks of 256 weights. The suffix means:
- _S (Small): more aggressive, smaller file
- _M (Medium): mixed-precision, smarter assignment of bits to sensitive layers

Despite the "4" in Q4_K_M, it is not uniform 4-bit. Critical tensors like the embedding table and output projection are bumped to 6-bit internally. The "4" is the nominal bit width of most weights; the average bits-per-weight across the whole file is higher.
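A rough way to check that Q4_K_M is not uniform 4-bit is to estimate the effective bits per weight from file size and parameter count. The sketch below assumes the rounded figures quoted in this README (~935 MB, ~1.5B parameters) and ignores GGUF metadata overhead, so the result is only indicative:

```python
def bits_per_weight(file_size_bytes: float, n_params: float) -> float:
    """Average number of bits stored per model parameter."""
    return file_size_bytes * 8 / n_params


# Assumption: rounded parameter count from this README; the exact count differs slightly.
N_PARAMS = 1.5e9

# Approximate file sizes in MB, taken from the file listing.
for name, size_mb in [("Q8_0", 1600), ("Q4_K_M", 935), ("Q4_K_S", 865), ("Q2_K", 530)]:
    bpw = bits_per_weight(size_mb * 1e6, N_PARAMS)
    print(f"{name:8s} ~{bpw:.1f} bits/weight")
```

With these rounded inputs Q4_K_M lands above 4 bits per weight, which is consistent with some tensors being stored at higher precision.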
IQ Family (IQ*_*)
Importance-weighted quantization. These formats use an importance matrix: calibration data was run through the base model to identify which weights matter most, and precision was distributed accordingly. This is why IQ formats punch above their weight class at the same file size compared to K-quants.
The IQ2 files in this repo were generated with a WikiText-2 calibration dataset (see below). Without an importance matrix, these formats produce near-incoherent output; the imatrix is what makes them viable.
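To illustrate the idea only (this is not llama.cpp's actual algorithm), the toy sketch below picks a quantization scale for one small block of weights by minimizing an importance-weighted squared error. The function names, the grid search, and the example numbers are all illustrative assumptions.

```python
def quantize(block, scale, levels=4):
    """Round each weight to a small signed integer grid, then rescale back to floats."""
    lo, hi = -levels, levels - 1
    return [max(lo, min(hi, round(w / scale))) * scale for w in block]


def weighted_error(block, dequantized, importance):
    """Squared reconstruction error, with each weight's error scaled by its importance."""
    return sum(i * (w - q) ** 2 for w, q, i in zip(block, dequantized, importance))


def best_scale(block, importance, levels=4, steps=64):
    """Grid-search scales between 0.5x and 1.5x of the naive max-abs scale."""
    base = max(abs(w) for w in block) / levels
    candidates = [base * (0.5 + k / steps) for k in range(steps + 1)]
    return min(
        candidates,
        key=lambda s: weighted_error(block, quantize(block, s, levels), importance),
    )


block = [0.9, -0.05, 0.31, -0.62, 0.08, 0.44]
importance = [0.1, 5.0, 0.2, 0.1, 4.0, 0.3]  # hypothetical per-weight importance
uniform = [1.0] * len(block)

plain_scale = best_scale(block, uniform)      # ignores importance
aware_scale = best_scale(block, importance)   # uses importance

err_plain = weighted_error(block, quantize(block, plain_scale), importance)
err_aware = weighted_error(block, quantize(block, aware_scale), importance)
print(f"importance-weighted error, plain scale: {err_plain:.5f}")
print(f"importance-weighted error, aware scale: {err_aware:.5f}")
```

Both scales come from the same candidate grid, so the importance-aware choice can never do worse on the weighted error; at very low bit widths that difference is what the calibration data buys.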
Quantization Details
Base model: Qwen/Qwen2.5-1.5B-Instruct
Quantization tool: llama.cpp build 7074 (commit 22e1ce2f8)
Source precision: F16 GGUF (converted from original SafeTensors)
Platform: Apple Silicon (arm64)
Imatrix Calibration
The IQ2 formats were quantized using an importance matrix generated from WikiText-2:
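from datasets import load_dataset

# Load the raw WikiText-2 training split to use as calibration text
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

with open("calibration.txt", "w", encoding="utf-8") as f:
    for row in dataset:
        text = row["text"].strip()
        # Skip headings, blank lines, and other short fragments
        if len(text) > 100:
            f.write(text + "\n")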
./build/bin/llama-imatrix \
-m qwen2.5-1.5b-f16.gguf \
-f calibration.txt \
-o imatrix.dat \
--ctx-size 512 \
-ngl -1 \
--chunks 100
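The resulting imatrix.dat is then supplied to llama-quantize when producing the low-bit files. This README does not show that step, so the following is only a sketch of how the commands could be assembled, assuming the usual `llama-quantize --imatrix <file> <input> <output> <type>` invocation and this repo's file naming. The helper name and the format list are illustrative.

```python
import shlex

# Formats that this README says were generated with an importance matrix.
IMATRIX_FORMATS = ["IQ2_S", "IQ2_XS", "IQ2_XXS", "IQ1_M", "IQ1_S", "Q2_K_S"]


def quantize_command(fmt, imatrix="imatrix.dat", source="qwen2.5-1.5b-f16.gguf",
                     binary="./build/bin/llama-quantize"):
    """Build the argument list for quantizing the F16 file into one target format."""
    output = f"qwen2.5-1.5b-{fmt}.gguf"
    return [binary, "--imatrix", imatrix, source, output, fmt]


for fmt in IMATRIX_FORMATS:
    # Print the command; to execute it, pass the list to subprocess.run(..., check=True).
    print(shlex.join(quantize_command(fmt)))
```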
How to Run
llama.cpp CLI
./build/bin/llama-cli \
-m qwen2.5-1.5b-Q4_K_M.gguf \
-n 512 \
-ngl 99 \
--prompt "Explain the difference between supervised and unsupervised learning."
llama.cpp Server (OpenAI-compatible)
./build/bin/llama-server \
-m qwen2.5-1.5b-Q4_K_M.gguf \
-ngl 99 \
--port 8080
Then hit http://localhost:8080/v1/chat/completions like any OpenAI endpoint.
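As a minimal client sketch using only the Python standard library, assuming the server above is running on port 8080 (the helper names `build_chat_request` and `chat` are illustrative):

```python
import json
import urllib.request

SERVER_URL = "http://localhost:8080/v1/chat/completions"  # matches --port 8080


def build_chat_request(prompt, url=SERVER_URL, max_tokens=256):
    """Build an OpenAI-style chat completion request for a single user message."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# With llama-server running:
# print(chat("What is the capital of France?"))
```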
Python (llama-cpp-python)
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-1.5b-Q4_K_M.gguf",
    n_gpu_layers=-1,  # full GPU offload
    n_ctx=4096,
)

output = llm(
    "Explain quantization in simple terms:",
    max_tokens=256,
    temperature=0.7,
)
print(output["choices"][0]["text"])
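Because this is an instruct model, going through the chat-completion API (which formats the messages with a chat template) may give better-behaved answers than a raw prompt. A small sketch, assuming an `llm` instance like the one created above; `ask` is a hypothetical helper, not part of llama-cpp-python:

```python
def ask(llm, question, system_prompt="You are a helpful assistant.", max_tokens=256):
    """Ask one question through the chat API and return the reply text."""
    response = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        max_tokens=max_tokens,
        temperature=0.7,
    )
    return response["choices"][0]["message"]["content"]


# Usage with a loaded model:
# print(ask(llm, "Explain quantization in simple terms."))
```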
Ollama
ollama run hf.co/rkumar70900/qwen2.5-1.5b-gguf-experiments:Q4_K_M
Model Architecture (from metadata)
| Parameter | Value |
|---|---|
| Architecture | Qwen2 |
| Parameters | 1.5B |
| Layers | 28 |
| Hidden dimension | 1536 |
| FFN intermediate | 8960 |
| Attention heads (Q) | 12 |
| Attention heads (KV) | 2 |
| Attention type | Grouped Query Attention (GQA) |
| Context length | 32768 |
| Vocabulary size | 151,936 |
| Tokenizer | GPT-2 BPE (Qwen2 variant) |
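These numbers also determine how much memory the KV cache needs, which is where GQA pays off. The sketch below is a back-of-the-envelope estimate, assuming an F16 KV cache (2 bytes per value) and a head dimension equal to hidden dimension divided by query heads; neither assumption is stated in the metadata table.

```python
# Values from the architecture table.
N_LAYERS = 28
HIDDEN_DIM = 1536
N_HEADS_Q = 12
N_HEADS_KV = 2
CONTEXT_LENGTH = 32768

# Assumption: head dimension = hidden dimension / number of query heads.
HEAD_DIM = HIDDEN_DIM // N_HEADS_Q


def kv_cache_bytes(n_tokens, n_kv_heads, bytes_per_value=2):
    """Bytes needed for the key and value caches (the leading 2 covers K and V)."""
    return 2 * N_LAYERS * n_kv_heads * HEAD_DIM * bytes_per_value * n_tokens


gqa = kv_cache_bytes(CONTEXT_LENGTH, N_HEADS_KV)
mha = kv_cache_bytes(CONTEXT_LENGTH, N_HEADS_Q)  # hypothetical: one KV head per query head

print(f"GQA KV cache at full context: {gqa / 2**20:.0f} MiB")
print(f"Without GQA (hypothetical):   {mha / 2**20:.0f} MiB")
```

Under these assumptions the full 32768-token context needs about 896 MiB of KV cache, versus about 5376 MiB if every query head had its own KV head.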
License
The quantized weights in this repo are derived from Qwen/Qwen2.5-1.5B-Instruct and inherit its Apache 2.0 license.
Citation
If you use these files in your work, please also cite the original Qwen2.5 model:
@misc{qwen2.5,
title = {Qwen2.5: A Party of Foundation Models},
author = {Qwen Team},
year = {2024},
url = {https://qwenlm.github.io/blog/qwen2.5/}
}