Qwen3-8B-FineGrained-FP8 (Blackwell Optimized)
This repository contains a high-precision Fine-Grained FP8 quantization of huihui-ai/Qwen3-8B-Instruct-Abliterated.
The model was quantized with parameters chosen for next-generation hardware, particularly the NVIDIA Blackwell (RTX 50-series) architecture.
Model Highlights
- Architecture: Qwen3-8B
- Quantization: Fine-Grained FP8
- Optimization: Tuned for Blackwell Tensor Cores (weight_block_size=(128, 128))
- Abliterated: Based on the version by huihui-ai, where refusal mechanisms have been removed to provide more direct, unfiltered responses.
Technical Configuration
The quantization was performed using FineGrainedFP8Config with the following settings:
- Weight Block Size: 128x128. This specific block size is designed to align with the hardware throughput of RTX 5090 and other Blackwell-based GPUs, allowing for native execution with minimal overhead.
- Precision: Unlike standard per-tensor FP8, the fine-grained approach maintains significantly higher output quality by scaling weights in smaller blocks; see the conceptual sketch below.
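As a rough illustration of why block-wise scaling helps (a conceptual sketch only, not the transformers implementation): each 128x128 tile gets its own FP8 scale, so the scale tracks local weight magnitude instead of the single largest value in the whole tensor.

```python
# Conceptual sketch only (not the library's kernel): compute one FP8 scale per
# 128x128 tile of a weight matrix, versus a single per-tensor scale.
import torch

FP8_E4M3_MAX = 448.0  # largest magnitude representable in float8_e4m3fn


def per_block_scales(weight: torch.Tensor, block: int = 128) -> torch.Tensor:
    """Return one scale per (block x block) tile of a 2-D weight matrix."""
    rows, cols = weight.shape
    scales = torch.empty(rows // block, cols // block)
    for i in range(rows // block):
        for j in range(cols // block):
            tile = weight[i * block:(i + 1) * block, j * block:(j + 1) * block]
            scales[i, j] = tile.float().abs().max() / FP8_E4M3_MAX
    return scales


w = torch.randn(4096, 4096, dtype=torch.bfloat16)
per_tensor_scale = w.float().abs().max() / FP8_E4M3_MAX  # one scale for the whole tensor
block_scales = per_block_scales(w)                       # one scale per 128x128 tile
print(block_scales.shape)  # torch.Size([32, 32])
```

A single per-tensor scale is dominated by the largest weight anywhere in the matrix, which flattens small-magnitude regions; per-block scales avoid that at the cost of storing a small grid of scale factors.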
Hardware Requirements
- Optimal: NVIDIA RTX 50-series (Blackwell) for native hardware acceleration.
- Supported: NVIDIA RTX 40-series (Ada Lovelace), H100, and L40S.
- VRAM: The FP8 weights occupy approximately 8-9 GB. A card with 12 GB or more is recommended to accommodate longer context windows and the KV cache.
Usage
You can load this model directly using the transformers library. Ensure you have recent versions of transformers (with FineGrainedFP8Config support) and accelerate installed.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ikarius/Qwen3-8B-FineGrained-FP8"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Explain the advantages of FP8 quantization for LLMs."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
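Because the base model is instruction-tuned, chat-style prompting through the tokenizer's chat template typically yields better responses than a raw prompt. A minimal sketch, continuing from the snippet above and assuming the tokenizer ships with Qwen3's chat template:

```python
# Continuation of the snippet above: chat-style prompting via the chat template.
messages = [
    {"role": "user", "content": "Explain the advantages of FP8 quantization for LLMs."},
]
chat_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
chat_inputs = tokenizer(chat_text, return_tensors="pt").to(model.device)
chat_outputs = model.generate(**chat_inputs, max_new_tokens=256)

# Decode only the newly generated tokens.
new_tokens = chat_outputs[0][chat_inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```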
Quantization Process
The model was quantized from the BF16 source in the following steps:
1. Loaded the source model with dtype="auto" and device_map="auto".
2. Applied FineGrainedFP8Config(weight_block_size=(128, 128)) as the quantization config.
3. Saved the weights in the optimized FP8 format so they can be loaded immediately without re-quantization (see the sketch below).
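A minimal sketch of that workflow, assuming on-load quantization via transformers' FineGrainedFP8Config as described above; the output directory is illustrative, not a record of the exact commands used for this upload:

```python
# Sketch of the quantization workflow described above. Quantizing on load
# requires a CUDA GPU with enough memory for the BF16 source weights.
from transformers import AutoModelForCausalLM, AutoTokenizer, FineGrainedFP8Config

source_id = "huihui-ai/Qwen3-8B-Instruct-Abliterated"  # BF16 source model
output_dir = "Qwen3-8B-FineGrained-FP8"                # illustrative output path

quant_config = FineGrainedFP8Config(weight_block_size=(128, 128))

model = AutoModelForCausalLM.from_pretrained(
    source_id,
    dtype="auto",                       # load the BF16 weights
    device_map="auto",
    quantization_config=quant_config,   # quantize to fine-grained FP8 on load
)
tokenizer = AutoTokenizer.from_pretrained(source_id)

# Persist the FP8 weights so they can be reloaded without re-quantization.
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```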
Disclaimer
This is an abliterated model. It has fewer safety guardrails compared to the original Qwen3 release. Users are responsible for their own implementations of moderation layers and for using the model ethically and legally.
Credits
- Original Model: Qwen Team
- Abliteration: huihui-ai