TinyLlama 1.1B - fraQtl KV Cache Optimized
Optimized with fraQtl for 3.5x less KV cache memory during inference.
Note: The model file size is the same as the original (~2.2GB). The optimization modifies V projection weights so that at inference time, the KV cache uses less GPU memory. The savings happen at runtime, not at download.
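To give a rough sense of what those runtime savings mean, the estimate below uses the published TinyLlama-1.1B config values (22 layers, 4 KV heads under grouped-query attention, head dim 64) together with the 3.5x figure above. This is back-of-the-envelope arithmetic, not a measurement:

```python
# Rough per-token KV cache estimate for TinyLlama-1.1B.
# Config values are from TinyLlama-1.1B-Chat-v1.0; the 3.5x
# reduction is the figure claimed by this model card.
num_layers = 22
num_kv_heads = 4        # grouped-query attention
head_dim = 64
bytes_per_elem = 2      # fp16

# Both K and V are cached, per layer, per token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(kv_bytes_per_token)  # 22528 bytes, ~22 KiB per token

context = 2048
baseline_mb = kv_bytes_per_token * context / 2**20
optimized_mb = baseline_mb / 3.5
print(f"baseline: {baseline_mb:.1f} MiB, optimized: {optimized_mb:.1f} MiB")
```

At a 2048-token context the cache drops from roughly 44 MiB to about 12.6 MiB; the gap grows linearly with context length and batch size.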
| Metric | Value |
|---|---|
| Original | TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
| File size | Same as original (~2.2GB) |
| PPL before | 15.5249 |
| PPL after | 15.8782 |
| Delta | +0.3533 (weight-level) |
| Config | k=16, INT3 |
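The config row mentions group size k=16 and INT3. fraQtl's actual scheme is not spelled out here, so the sketch below shows only generic group-wise symmetric 3-bit quantization with 16 weights per scale, as one plausible reading of that config:

```python
import numpy as np

def quantize_int3_groups(w, k=16):
    """Group-wise symmetric INT3 quantization (a sketch; fraQtl's
    actual scheme is not public). Each group of k weights shares one
    floating-point scale; codes are integers in [-4, 3]."""
    groups = w.reshape(-1, k)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 4.0
    scale[scale == 0] = 1.0                      # avoid divide-by-zero
    q = np.clip(np.round(groups / scale), -4, 3).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate weights from codes and per-group scales."""
    return (q * scale).reshape(-1)

w = np.random.randn(64).astype(np.float32)
q, s = quantize_int3_groups(w)
w_hat = dequantize(q, s)
print(np.abs(w - w_hat).max())  # bounded by max|w| / 4 given this scale choice
```

Storing 3-bit codes plus one scale per 16 weights is what keeps the cached V activations (and hence the KV cache) small at runtime, at the cost of the small perplexity delta shown above.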
Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("fraQtl/TinyLlama-1.1B-compressed")
tokenizer = AutoTokenizer.from_pretrained("fraQtl/TinyLlama-1.1B-compressed")
```
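To check the claimed runtime savings yourself, you can sum the bytes held by the cache that a forward pass returns. The helper below assumes the legacy tuple-of-(key, value)-per-layer `past_key_values` layout; newer transformers versions return a `Cache` object instead, whose tensors live in its `key_cache`/`value_cache` lists:

```python
def kv_cache_bytes(past_key_values):
    """Sum the memory held by all tensors in a legacy-format
    past_key_values: a tuple with one (key, value) pair per layer.
    Works on anything exposing .numel() and .element_size()."""
    return sum(
        t.numel() * t.element_size()
        for layer in past_key_values
        for t in layer
    )

# Example usage (assumes the model and tokenizer loaded above):
# out = model(**tokenizer("hello", return_tensors="pt"), use_cache=True)
# print(kv_cache_bytes(out.past_key_values))
```

Comparing this number against the same prompt on the original TinyLlama checkpoint is the direct way to confirm the runtime reduction.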
Runtime Compression
Our runtime compression achieves significantly better results on larger models. Contact us for integration.