Gemma 4 26B A4B IT Assistant MTP Draft - NVFP4
NVFP4 quantization of google/gemma-4-26B-A4B-it-assistant, the MTP/draft assistant model.
This is a quantized derivative, not a fine-tune. The repo metadata sets base_model_relation: quantized and tags the base model accordingly.
Format
- 2D BF16 weight tensors are stored as packed NVFP4 E2M1 codes.
- Per-block scales use FP8 E4M3, one scale per 16 values.
- 1D norm/scalar tensors remain BF16.
- The tied embedding/head tensor is quantized; there is no separate lm_head.weight in the source checkpoint.
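The layout above can be illustrated with a small pure-Python sketch of block dequantization. It is not the code in load_gemma4_nvfp4.py. The function names are hypothetical, the nibble order (low nibble first) is an assumption about the packing, and any extra per-tensor scale the checkpoint may carry is not modeled. The E2M1 value table and the E4M3 bit layout (bias 7, finite-only variant) follow the standard definitions of those formats.

```python
# Magnitudes of the 8 E2M1 codes (low 3 bits); bit 3 of the 4-bit code is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # one FP8 scale per 16 values, as described in the list above


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code (0..15) to a float."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return float("nan")  # the only non-finite encoding in this variant
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequantize_row(packed: bytes, scales: bytes) -> list:
    """Dequantize one row of packed codes.

    packed: two 4-bit codes per byte (ASSUMPTION: low nibble comes first).
    scales: one E4M3 byte per block of 16 values.
    """
    codes = []
    for b in packed:
        codes.append(b & 0xF)
        codes.append(b >> 4)
    return [
        decode_e2m1(code) * decode_e4m3(scales[i // BLOCK_SIZE])
        for i, code in enumerate(codes)
    ]


# Example: 8 packed bytes (16 codes) sharing one scale byte 0x40, which decodes to 2.0.
row = dequantize_row(bytes([0x21] * 8), bytes([0x40]))
print(row[:4])  # [1.0, 2.0, 1.0, 2.0]
```

With this layout each quantized weight costs 4 bits for its code plus 8/16 = 0.5 bits for its share of the block scale, so 4.5 bits per weight compared with 16 bits for BF16.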
Because this Gemma 4 assistant architecture currently requires Transformers installed from source or a newer release, the repo includes load_gemma4_nvfp4.py, which dequantizes the packed NVFP4 tensors and loads them into the upstream Gemma4AssistantForCausalLM module.
from load_gemma4_nvfp4 import load_model
model, tok = load_model("Reza2kn/gemma-4-26B-A4B-it-assistant-NVFP4", device="cuda")
prompt = "Explain in one sentence what a draft model does."
inputs = tok(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
Files
- nvfp4_model.safetensors: packed NVFP4 weights plus BF16 residual tensors.
- quantization_config.json: tensor map, block size, parameter counts, and format metadata.
- quant_error_report.json: per-tensor relative L2 quantization error.
- load_gemma4_nvfp4.py: loader/smoke-test helper.
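For readers interpreting quant_error_report.json, relative L2 error is commonly defined as the norm of the difference between the original and dequantized tensors divided by the norm of the original. The sketch below assumes the report uses that definition. The function name is hypothetical and the report's exact schema is not shown here.

```python
import math


def relative_l2_error(original, dequantized):
    """Return ||original - dequantized||_2 / ||original||_2 for two flat sequences."""
    if len(original) != len(dequantized):
        raise ValueError("tensors must have the same number of elements")
    diff_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(original, dequantized)))
    ref_norm = math.sqrt(sum(a * a for a in original))
    # An all-zero reference tensor has no meaningful relative error; report 0.0.
    return diff_norm / ref_norm if ref_norm > 0 else 0.0
```

A value of 0.0 means the dequantized tensor matches the original exactly, and larger values indicate more quantization loss for that tensor.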
Notes
This is a storage-format quantization for the new Gemma 4 assistant draft architecture. Native NVFP4 kernel acceleration depends on runtime support catching up to this architecture; until then, the included loader provides a correctness-first dequantization path.