GenomeOcean-500M-v1.2-GGUF

This is a GGUF (Q4_K_M) quantized version of GenomeOcean-500M-v1.2.

Model Details

  • Base Model: GenomeOcean-500M-v1.2
  • Quantization Method: GGUF (Q4_K_M)
  • Bits: 4-bit (Mixed 4-bit and 6-bit)
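For reference, a Q4_K_M file like this one is typically produced with llama.cpp's conversion and quantization tools. The commands below are an illustrative sketch, assuming a local llama.cpp checkout and the original FP16 checkpoint on disk; file names are placeholders.

```shell
# Convert the original HF checkpoint to an FP16 GGUF file
# (convert_hf_to_gguf.py ships with llama.cpp)
python convert_hf_to_gguf.py models/GenomeOcean-500M-v1.2 \
    --outfile GenomeOcean-500M-v1.2-f16.gguf --outtype f16

# Quantize to Q4_K_M (mixed 4-/6-bit k-quants)
./llama-quantize GenomeOcean-500M-v1.2-f16.gguf model.gguf Q4_K_M
```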

Benchmark Results

Metric                     Original (FP16)   Quantized (GGUF Q4_K_M)   Change
VRAM footprint             1032.3 MB         309.7 MB                  -70.0%
NLL loss                   5.9931            6.0295                    +0.0364
Perplexity (PPL)           400.6447          415.5238                  +3.71% (drift)
Total gen time (50 seqs)   34.2 s            89.3 s                    +161.0% (slower)
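The NLL and perplexity rows are consistent with the standard relation PPL = exp(NLL); a quick sanity check using the numbers from the table:

```python
import math

# Perplexity is the exponential of the mean negative log-likelihood
nll_fp16, nll_q4 = 5.9931, 6.0295

ppl_fp16 = math.exp(nll_fp16)   # ~400.66, matches the FP16 row
ppl_q4 = math.exp(nll_q4)       # ~415.51, matches the Q4_K_M row

drift = ppl_q4 / ppl_fp16 - 1   # relative PPL increase
print(f"{drift:+.2%}")          # +3.71%
```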

Analysis

  • Fidelity: Quality is well preserved, with only a +3.71% perplexity drift. The Q4_K_M scheme limits quantization error by keeping higher bit-widths (6-bit) for the most sensitive tensors.
  • Efficiency: Reduces VRAM requirements by 70%.
  • Inference Latency Warning: This format is optimized for llama.cpp. Under current vLLM or Transformers implementations it can be significantly slower than FP16 (about 2.6x slower in our benchmarks) because optimized GPU kernels for GGUF are lacking. It is recommended primarily for CPU-based or local llama.cpp inference.
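As a rough cross-check of the footprint numbers, the effective bits per weight can be back-computed from the table. The snippet below assumes exactly 500M parameters and decimal megabytes (MB = 10^6 bytes); both are our assumptions, so treat the results as approximate.

```python
# Back-of-the-envelope: effective bits per weight implied by the table.
# Assumes exactly 500e6 parameters and MB = 1e6 bytes (illustrative).
PARAMS = 500e6

def bits_per_param(size_mb: float) -> float:
    return size_mb * 1e6 * 8 / PARAMS

print(bits_per_param(1032.3))  # ~16.5 bits: FP16 weights plus overhead
print(bits_per_param(309.7))   # ~5.0 bits: consistent with Q4_K_M's mixed 4-/6-bit scheme
```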

Quick Start

Using llama.cpp

./llama-cli -m models/GenomeOcean-500M-v1.2-GGUF/model.gguf -p "ATGCGATCGATCGATCGATCG" -n 100

Using Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "models/GenomeOcean-500M-v1.2-GGUF",
    gguf_file="model.gguf",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "models/GenomeOcean-500M-v1.2-GGUF",
    gguf_file="model.gguf"
)

inputs = tokenizer("ATGCGATCGATCGATCGATCG", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))