GenomeOcean-500M-v1.2-GGUF

This is a GGUF (Q4_K_M) quantized version of GenomeOcean-500M-v1.2.

Model Details

  • Base Model: GenomeOcean-500M-v1.2
  • Quantization Method: GGUF (Q4_K_M)
  • Bits: 4-bit (Mixed 4-bit and 6-bit)
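For reference, a Q4_K_M file like this one is typically produced with llama.cpp's conversion and quantization tools. The commands below are an illustrative sketch, assuming a local llama.cpp checkout and the original FP16 checkpoint on disk; file names are placeholders.

```shell
# Convert the original HF checkpoint to an FP16 GGUF file
# (convert_hf_to_gguf.py ships with llama.cpp)
python convert_hf_to_gguf.py models/GenomeOcean-500M-v1.2 \
    --outfile GenomeOcean-500M-v1.2-f16.gguf --outtype f16

# Quantize to Q4_K_M (mixed 4-/6-bit k-quants)
./llama-quantize GenomeOcean-500M-v1.2-f16.gguf model.gguf Q4_K_M
```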

Benchmark Results

Metric                     Original (FP16)   Quantized (GGUF Q4_K_M)   Change
VRAM footprint             1032.3 MB         309.7 MB                  -70.0%
NLL loss                   5.9931            6.0295                    +0.0364
Perplexity (PPL)           400.6447          415.5238                  +3.71% (drift)
Total gen time (50 seqs)   34.2 s            89.3 s                    +161.0% (slower)
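The NLL and perplexity rows are consistent with the standard relation PPL = exp(NLL); a quick sanity check using the numbers from the table:

```python
import math

# Perplexity is the exponential of the mean negative log-likelihood
nll_fp16, nll_q4 = 5.9931, 6.0295

ppl_fp16 = math.exp(nll_fp16)   # ~400.66, matches the FP16 row
ppl_q4 = math.exp(nll_q4)       # ~415.51, matches the Q4_K_M row

drift = ppl_q4 / ppl_fp16 - 1   # relative PPL increase
print(f"{drift:+.2%}")          # +3.71%
```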

Analysis

  • Fidelity: Quality is well preserved, with only a +3.71% perplexity drift. The Q4_K_M scheme limits quantization error by keeping higher bit-widths (6-bit) for the most sensitive tensors.
  • Efficiency: Reduces VRAM requirements by 70%.
  • Inference Latency Warning: This format is optimized for llama.cpp. Under current vLLM or Transformers implementations it can be significantly slower than FP16 (about 2.6x slower in our benchmarks) because optimized GPU kernels for GGUF are lacking. It is recommended primarily for CPU-based or local llama.cpp inference.
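As a rough cross-check of the footprint numbers, the effective bits per weight can be back-computed from the table. The snippet below assumes exactly 500M parameters and decimal megabytes (MB = 10^6 bytes); both are our assumptions, so treat the results as approximate.

```python
# Back-of-the-envelope: effective bits per weight implied by the table.
# Assumes exactly 500e6 parameters and MB = 1e6 bytes (illustrative).
PARAMS = 500e6

def bits_per_param(size_mb: float) -> float:
    return size_mb * 1e6 * 8 / PARAMS

print(bits_per_param(1032.3))  # ~16.5 bits: FP16 weights plus overhead
print(bits_per_param(309.7))   # ~5.0 bits: consistent with Q4_K_M's mixed 4-/6-bit scheme
```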

Quick Start

Using llama.cpp

./llama-cli -m models/GenomeOcean-500M-v1.2-GGUF/model.gguf -p "ATGCGATCGATCGATCGATCG" -n 100

Using Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "models/GenomeOcean-500M-v1.2-GGUF",
    gguf_file="model.gguf",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "models/GenomeOcean-500M-v1.2-GGUF",
    gguf_file="model.gguf"
)

inputs = tokenizer("ATGCGATCGATCGATCGATCG", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))