TinyLlama 1.1B - fraQtl KV Cache Optimized
Optimized with fraQtl for 3.5x less KV cache memory during inference.
Note: The model file size is the same as the original (~2.2GB). The optimization modifies V projection weights so that at inference time, the KV cache uses less GPU memory. The savings happen at runtime, not at download.
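To give a rough sense of what those runtime savings mean, the estimate below uses the published TinyLlama-1.1B config values (22 layers, 4 KV heads under grouped-query attention, head dim 64) together with the 3.5x figure above. This is back-of-the-envelope arithmetic, not a measurement:

```python
# Rough per-token KV cache estimate for TinyLlama-1.1B.
# Config values are from TinyLlama-1.1B-Chat-v1.0; the 3.5x
# reduction is the figure claimed by this model card.
num_layers = 22
num_kv_heads = 4        # grouped-query attention
head_dim = 64
bytes_per_elem = 2      # fp16

# Both K and V are cached, per layer, per token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(kv_bytes_per_token)  # 22528 bytes, ~22 KiB per token

context = 2048
baseline_mb = kv_bytes_per_token * context / 2**20
optimized_mb = baseline_mb / 3.5
print(f"baseline: {baseline_mb:.1f} MiB, optimized: {optimized_mb:.1f} MiB")
```

At a 2048-token context the cache drops from roughly 44 MiB to about 12.6 MiB; the gap grows linearly with context length and batch size.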
| Metric | Value |
|---|---|
| Original | TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
| File size | Same as original (~2.2GB) |
| PPL before | 15.5249 |
| PPL after | 15.8782 |
| Delta | +0.3533 (weight-level) |
| Config | k=16, INT3 |
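The config row mentions group size k=16 and INT3. fraQtl's actual scheme is not spelled out here, so the sketch below shows only generic group-wise symmetric 3-bit quantization with 16 weights per scale, as one plausible reading of that config:

```python
import numpy as np

def quantize_int3_groups(w, k=16):
    """Group-wise symmetric INT3 quantization (a sketch; fraQtl's
    actual scheme is not public). Each group of k weights shares one
    floating-point scale; codes are integers in [-4, 3]."""
    groups = w.reshape(-1, k)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 4.0
    scale[scale == 0] = 1.0                      # avoid divide-by-zero
    q = np.clip(np.round(groups / scale), -4, 3).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate weights from codes and per-group scales."""
    return (q * scale).reshape(-1)

w = np.random.randn(64).astype(np.float32)
q, s = quantize_int3_groups(w)
w_hat = dequantize(q, s)
print(np.abs(w - w_hat).max())  # bounded by max|w| / 4 given this scale choice
```

Storing 3-bit codes plus one scale per 16 weights is what keeps the cached V activations (and hence the KV cache) small at runtime, at the cost of the small perplexity delta shown above.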
Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("fraQtl/TinyLlama-1.1B-compressed")
tokenizer = AutoTokenizer.from_pretrained("fraQtl/TinyLlama-1.1B-compressed")
```
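To check the claimed runtime savings yourself, you can sum the bytes held by the cache that a forward pass returns. The helper below assumes the legacy tuple-of-(key, value)-per-layer `past_key_values` layout; newer transformers versions return a `Cache` object instead, whose tensors live in its `key_cache`/`value_cache` lists:

```python
def kv_cache_bytes(past_key_values):
    """Sum the memory held by all tensors in a legacy-format
    past_key_values: a tuple with one (key, value) pair per layer.
    Works on anything exposing .numel() and .element_size()."""
    return sum(
        t.numel() * t.element_size()
        for layer in past_key_values
        for t in layer
    )

# Example usage (assumes the model and tokenizer loaded above):
# out = model(**tokenizer("hello", return_tensors="pt"), use_cache=True)
# print(kv_cache_bytes(out.past_key_values))
```

Comparing this number against the same prompt on the original TinyLlama checkpoint is the direct way to confirm the runtime reduction.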
Runtime Compression
Our runtime compression achieves significantly better results on larger models. Contact us for integration.