# GenomeOcean-500M-v1.2-GGUF
This is a GGUF (Q4_K_M) quantized version of GenomeOcean-500M-v1.2.
## Model Details
- Base Model: GenomeOcean-500M-v1.2
- Quantization Method: GGUF (Q4_K_M)
- Bits: 4-bit (Mixed 4-bit and 6-bit)
## Benchmark Results
| Metric | Original (FP16) | Quantized (GGUF Q4_K_M) | Change |
|---|---|---|---|
| VRAM Footprint | 1032.3 MB | 309.7 MB | -70.0% |
| NLL Loss | 5.9931 | 6.0295 | +0.0364 |
| Perplexity (PPL) | 400.6447 | 415.5238 | +3.71% (Drift) |
| Total Gen Time (50 seqs) | 34.2s | 89.3s | +161.0% (Slower) |
## Analysis
- Fidelity: Strong fidelity, with only +3.71% PPL drift. The Q4_K_M quantization scheme preserves model quality by allocating higher bit-rates to the most quantization-sensitive layers.
- Efficiency: Reduces VRAM requirements by 70%.
- Inference Latency Warning: This format is highly optimized for llama.cpp. Current vLLM and Transformers implementations lack optimized GPU kernels for GGUF, so inference may be significantly slower than FP16 (approx. 2.6x slower in our benchmarks). Recommended primarily for CPU-based or local llama.cpp inference.
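The headline figures above are mutually consistent and easy to re-derive: perplexity is `exp(NLL)`, and the percentage changes follow directly from the table. A quick sketch (numbers taken from the benchmark table; the slowdown rounds to 161.1% here versus the table's 161.0%, a difference attributable to rounding of the reported times):

```python
import math

# Reported benchmark numbers from the table above.
nll_fp16, nll_q4 = 5.9931, 6.0295
ppl_fp16, ppl_q4 = 400.6447, 415.5238

# Perplexity is exp(NLL), so the NLL and PPL rows agree.
assert abs(math.exp(nll_fp16) - ppl_fp16) < 0.5
assert abs(math.exp(nll_q4) - ppl_q4) < 0.5

# Relative PPL drift introduced by quantization.
drift_pct = (ppl_q4 / ppl_fp16 - 1) * 100
print(f"PPL drift: +{drift_pct:.2f}%")      # +3.71%

# VRAM reduction and generation-time slowdown.
vram_reduction_pct = (1 - 309.7 / 1032.3) * 100
slowdown_pct = (89.3 / 34.2 - 1) * 100
print(f"VRAM: -{vram_reduction_pct:.1f}%")  # -70.0%
print(f"Gen time: +{slowdown_pct:.1f}%")    # +161.1%
```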
## Quick Start
### Using llama.cpp
```bash
./llama-cli -m models/GenomeOcean-500M-v1.2-GGUF/model.gguf -p "ATGCGATCGATCGATCGATCG" -n 100
```
### Using Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Transformers dequantizes the GGUF weights on load; see the latency warning above.
model = AutoModelForCausalLM.from_pretrained(
    "models/GenomeOcean-500M-v1.2-GGUF",
    gguf_file="model.gguf",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("models/GenomeOcean-500M-v1.2-GGUF")

# Move inputs to wherever device_map placed the model (CPU or GPU).
inputs = tokenizer("ATGCGATCGATCGATCGATCG", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
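Since the model generates raw nucleotide text, it is worth sanity-checking decoded output before feeding it to downstream genomics tooling. The helpers below are our own additions (not part of the GenomeOcean or Transformers API), sketching one simple validation pass:

```python
# Hypothetical post-processing helpers, not part of the GenomeOcean API:
# verify a decoded continuation is a clean nucleotide string and compute
# its GC content as a quick plausibility check.
def is_valid_dna(seq: str) -> bool:
    """True if seq is non-empty and contains only A/C/G/T."""
    return bool(seq) and set(seq.upper()) <= set("ACGT")

def gc_content(seq: str) -> float:
    """Fraction of G/C bases in the sequence."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)

generated = "ATGCGATCGATCGATCGATCG"  # e.g. the prompt used in the examples above
assert is_valid_dna(generated)
print(f"GC content: {gc_content(generated):.2f}")
```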