# TinyLlama 1.1B - fraQtl KV Cache Optimized

Optimized with fraQtl: the KV cache uses 3.5x less memory during inference.

**Note:** The model file is the same size as the original (~2.2 GB). The optimization modifies the V projection weights so that the KV cache uses less GPU memory at inference time. The savings happen at runtime, not at download.
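fraQtl's internals aren't documented in this card, but the runtime claim is straightforward to check yourself. A minimal sketch (model id taken from the Usage section below) that sums the bytes held in `past_key_values` after a single forward pass:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("fraQtl/TinyLlama-1.1B-compressed")
tokenizer = AutoTokenizer.from_pretrained("fraQtl/TinyLlama-1.1B-compressed")

ids = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, use_cache=True)

pkv = out.past_key_values
if hasattr(pkv, "to_legacy_cache"):  # newer transformers return a Cache object
    pkv = pkv.to_legacy_cache()

# Sum bytes across every layer's (key, value) tensor pair.
cache_bytes = sum(t.numel() * t.element_size() for layer in pkv for t in layer)
print(f"KV cache: {cache_bytes / 2**20:.2f} MiB for {ids.shape[-1]} tokens")
```

Running the same snippet against the original `TinyLlama/TinyLlama-1.1B-Chat-v1.0` gives the baseline to compare against.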

| Metric | Value |
|---|---|
| Original model | TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
| File size | Same as original (~2.2 GB) |
| PPL before | 15.5249 |
| PPL after | 15.8782 |
| PPL delta | +0.353 (weight-level) |
| Config | k=16, INT3 |
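The exact fraQtl transform is not published here, so the following is only an illustrative sketch of one way to read the `k=16, INT3` config: factor the V projection to rank k so each cached value entry lives in a 16-dimensional space, store those entries with 3-bit quantization, and expand back at attention time. The dimensions and quantization scheme below are assumptions, not the actual fraQtl algorithm.

```python
import torch

def low_rank_v(w_v: torch.Tensor, k: int = 16):
    """Approximate a V projection (d_out x d_in) as B @ A with rank k via SVD."""
    u, s, vh = torch.linalg.svd(w_v, full_matrices=False)
    a = torch.diag(s[:k]) @ vh[:k, :]   # (k, d_in): produces k-dim cache entries
    b = u[:, :k]                        # (d_out, k): expands entries at attention time
    return a, b

def quantize_int3(x: torch.Tensor):
    """Symmetric 3-bit quantization (7 levels, -3..3) with a per-tensor scale."""
    scale = x.abs().max() / 3
    q = torch.clamp(torch.round(x / scale), -3, 3).to(torch.int8)
    return q, scale

# Shapes below are illustrative, not TinyLlama's real projection shapes.
w_v = torch.randn(256, 2048)            # assumed V projection: hidden dim -> KV dim
a, b = low_rank_v(w_v, k=16)

h = torch.randn(1, 2048)                # one token's hidden state
v_cached = h @ a.T                      # (1, 16) cached instead of (1, 256)
q, scale = quantize_int3(v_cached)      # stored packed at 3 bits + one fp scale
v_restored = (q.float() * scale) @ b.T  # (1, 256) reconstructed for attention
```

Under this reading, caching 16 INT3 values plus a scale per token instead of 256 FP16 values is where the runtime saving would come from; the real method may differ.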

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("fraQtl/TinyLlama-1.1B-compressed")
tokenizer = AutoTokenizer.from_pretrained("fraQtl/TinyLlama-1.1B-compressed")
```
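Continuing from the snippet above, generation works through the standard transformers chat-template flow; the prompt and generation settings here are just examples:

```python
messages = [{"role": "user", "content": "Explain KV caching in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```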

## Runtime Compression

Our runtime compression achieves significantly better results on larger models. Contact us for integration.
