Qwen3-8B-FineGrained-FP8 (Blackwell Optimized)

This repository contains a high-precision Fine-Grained FP8 quantization of huihui-ai/Qwen3-8B-Instruct-Abliterated.

The model has been specifically quantized using parameters optimized for next-generation hardware, particularly the NVIDIA Blackwell (RTX 50-series) architecture.

Model Highlights

  • Architecture: Qwen3-8B
  • Quantization: Fine-Grained FP8
  • Optimization: Tuned for Blackwell Tensor Cores (weight_block_size=(128, 128))
  • Abliterated: Based on the version by huihui-ai, where refusal mechanisms have been removed to provide more direct, unfiltered responses.

Technical Configuration

The quantization was performed using FineGrainedFP8Config with the following settings:

  • Weight Block Size: 128x128. This block size is chosen to map cleanly onto the FP8 Tensor Cores of the RTX 5090 and other Blackwell-based GPUs, allowing native execution with minimal overhead.
  • Precision: Unlike standard per-tensor FP8, which applies a single scale to an entire weight matrix, the fine-grained approach assigns each 128x128 block its own scale, limiting local quantization error and preserving significantly higher output quality (see the sketch below).
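
To make the precision point concrete, the following back-of-the-envelope sketch compares how many scale factors per-tensor and block-wise FP8 assign to a single weight matrix (the 4096x4096 shape is purely illustrative and not taken from this model):

import math

# Illustrative weight-matrix shape; not a specific layer of this model.
rows, cols = 4096, 4096
block = 128  # matches weight_block_size=(128, 128)

per_tensor_scales = 1  # standard per-tensor FP8: one scale for the whole matrix
block_scales = math.ceil(rows / block) * math.ceil(cols / block)  # fine-grained FP8

print(f"per-tensor scales: {per_tensor_scales}, block-wise scales: {block_scales}")
# -> 1 vs 1024: each block adapts to its own dynamic range, reducing quantization error.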

Hardware Requirements

  • Optimal: NVIDIA RTX 50-series (Blackwell) for native hardware acceleration.
  • Supported: NVIDIA RTX 40-series (Ada Lovelace), H100, and L40S.
  • VRAM: The FP8 weights occupy approximately 8-9 GB of VRAM. A card with 12 GB or more is recommended to leave headroom for longer context windows and the KV cache (a quick capability check is sketched below).
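
Before loading, a quick check can confirm the GPU meets these requirements. A minimal sketch (compute capability 8.9 corresponds to Ada Lovelace, 9.0 to Hopper, and 12.x to the consumer Blackwell cards):

import torch

assert torch.cuda.is_available(), "A CUDA GPU is required for FP8 inference."

major, minor = torch.cuda.get_device_capability(0)
total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"{torch.cuda.get_device_name(0)}: sm_{major}{minor}, {total_gib:.1f} GiB VRAM")

if (major, minor) < (8, 9):
    print("Warning: no native FP8 tensor-core support on this GPU.")
if total_gib < 12:
    print("Warning: below the recommended 12 GB of VRAM for longer contexts.")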

Usage

You can load this model directly with the transformers library. Make sure recent versions of transformers and accelerate are installed (pip install -U transformers accelerate).

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ikarius/Qwen3-8B-FineGrained-FP8"

# The checkpoint already contains FP8 weights, so no re-quantization happens at load time.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Explain the advantages of FP8 quantization for LLMs."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
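
Because this is an instruct-tuned checkpoint, you will generally get better results by wrapping the prompt in the tokenizer's chat template rather than passing raw text. A minimal sketch, reusing the model and tokenizer loaded above:

messages = [
    {"role": "user", "content": "Explain the advantages of FP8 quantization for LLMs."}
]

# apply_chat_template inserts the Qwen3 chat control tokens and the assistant prefix.
chat_inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

chat_outputs = model.generate(chat_inputs, max_new_tokens=256)

# Strip the prompt tokens so only the newly generated answer is printed.
print(tokenizer.decode(chat_outputs[0][chat_inputs.shape[-1]:], skip_special_tokens=True))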

Quantization Process

The model was quantized from the BF16 source using the following steps (see the sketch after this list):

  • Loaded the source model with dtype="auto" and device_map="auto".
  • Configured quantization with FineGrainedFP8Config(weight_block_size=(128, 128)).
  • Saved the weights in the optimized FP8 format so they can be loaded directly without re-quantization.
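
Put together, the process corresponds roughly to the sketch below (illustrative, not the exact script; the output directory name is arbitrary, and quantization happens on the fly while the BF16 weights are loaded):

from transformers import AutoModelForCausalLM, AutoTokenizer, FineGrainedFP8Config

source_id = "huihui-ai/Qwen3-8B-Instruct-Abliterated"

# Quantize on load: each 128x128 weight block receives its own FP8 (E4M3) scale.
quant_config = FineGrainedFP8Config(weight_block_size=(128, 128))

model = AutoModelForCausalLM.from_pretrained(
    source_id,
    dtype="auto",
    device_map="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(source_id)

# Save the already-quantized FP8 weights so they load later without re-quantization.
model.save_pretrained("Qwen3-8B-FineGrained-FP8")
tokenizer.save_pretrained("Qwen3-8B-FineGrained-FP8")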

Disclaimer

This is an abliterated model. It has fewer safety guardrails compared to the original Qwen3 release. Users are responsible for their own implementations of moderation layers and for using the model ethically and legally.

Credits

  • Original Model: Qwen Team
  • Abliteration: huihui-ai