# MiniMax-M2.7-abliterated-heretic-ara-AWQ
AWQ W4A16 (group_size 128, symmetric) quantization of Youssofal/MiniMax-M2.7-abliterated-BF16 — a Heretic-ARA abliterated derivative of MiniMaxAI/MiniMax-M2.7.
⚠️ Decensored model. Safety guardrails have been deliberately removed. Research and experimentation only. See full disclaimer below.
## Quantization Details
| Parameter | Value |
|---|---|
| Method | AWQ (Activation-aware Weight Quantization) |
| Scheme | W4A16 (symmetric) |
| Weight Bits | 4 |
| Activation Bits | 16 |
| Group Size | 128 |
| Format | compressed-tensors |
| Calibration Dataset | HuggingFaceH4/ultrachat_200k |
| Calibration Samples | 128 |
| Calibration Max Sequence Length | 512 |
| Router Gates | Unquantized (full precision) |
| LM Head | Unquantized (full precision) |
| Experts Calibrated | All 256 per layer (custom all-experts patch) |
| Compatible Inference Engine | vLLM |
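
For reference, a minimal llm-compressor sketch of a comparable recipe. The router-gate ignore pattern (`re:.*gate$`) is an assumption that must be checked against MiniMax-M2's actual module names, and the custom all-experts calibration patch (described under Quantization Notes) is omitted:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "Youssofal/MiniMax-M2.7-abliterated-BF16"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# 128 chat samples at max_seq_length=512, matching the table above.
ds = (
    load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
    .shuffle(seed=42)
    .select(range(128))
)
ds = ds.map(
    lambda ex: tokenizer(
        tokenizer.apply_chat_template(ex["messages"], tokenize=False),
        max_length=512, truncation=True, add_special_tokens=False,
    ),
    remove_columns=ds.column_names,
)

# W4A16 symmetric; group_size 128 is the preset default.
recipe = AWQModifier(
    targets=["Linear"],
    scheme="W4A16",
    ignore=["lm_head", "re:.*gate$"],  # keep routing + lm_head full precision
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=128,
)

model.save_pretrained("MiniMax-M2.7-AWQ-W4A16", save_compressed=True)
tokenizer.save_pretrained("MiniMax-M2.7-AWQ-W4A16")
```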
## Quantization Notes
- **Router gates kept full precision:** MiniMax-M2's MoE uses sigmoid routing with an `e_score_correction_bias`. Quantizing the gate or this bias destroys routing decisions and produces multilingual gibberish, so both are explicitly excluded from quantization.
- **All 256 experts calibrated:** With top-8 routing on a 256-expert model, naive AWQ leaves most experts with insufficient calibration data. This quantization uses a custom MoE forward patch during calibration that runs every expert on every batch: sparse for routed tokens, with a small dummy forward for unrouted experts to fire activation hooks (see the sketch after this list).
- **QK-Norm absorbs Q/K smoothing:** MiniMax-M2 has `use_qk_norm=true`. The RMSNorm weights of `q_norm`/`k_norm` absorb AWQ's per-channel scales applied to `q_proj`/`k_proj` outputs, preserving correctness.
- **GQA v→o smoothing skipped:** With 8 KV heads and 48 query heads, the `v_proj` output dimensions don't match the `o_proj` input dimensions for per-channel smoothing. llm-compressor correctly identifies this incompatibility and skips it.
- **MTP heads not preserved:** The base checkpoint contains multi-token prediction (MTP) module weights that HF's `MiniMaxM2ForCausalLM` doesn't use. These are dropped from the quantized model.
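
A minimal PyTorch sketch of the all-experts calibration idea (attribute names such as `moe_block.experts` are hypothetical; unlike the actual patch, which only dummy-forwards the unrouted experts, this version dummy-forwards every expert for brevity):

```python
import torch

def patch_moe_for_calibration(moe_block, dummy_tokens: int = 8):
    """Wrap a generic MoE block so every expert sees activations during
    calibration, letting AWQ's activation hooks observe all experts."""
    orig_forward = moe_block.forward

    def forward(hidden_states, *args, **kwargs):
        out = orig_forward(hidden_states, *args, **kwargs)
        # Small dummy forward per expert so activation hooks fire even
        # for experts the top-8 router never selected in this batch.
        flat = hidden_states.reshape(-1, hidden_states.shape[-1])
        sample = flat[: min(dummy_tokens, flat.shape[0])]
        with torch.no_grad():
            for expert in moe_block.experts:  # hypothetical attribute
                expert(sample)
        return out

    moe_block.forward = forward
```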
## Deployment
Recommended inference with vLLM:
```bash
vllm serve alonsoko/MiniMax-M2.7-abliterated-heretic-ara-AWQ \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --enable-auto-tool-choice
```
Recommended sampling: `temperature=1.0`, `top_p=0.95`, `top_k=40` (per upstream MiniMax-M2 guidance).
MiniMax-M2 is an interleaved thinking model: when chaining assistant turns, preserve `<think>...</think>` blocks from prior turns in the message history.
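
A minimal client sketch against the server above, applying the recommended sampling and carrying a prior turn's `<think>` block forward (URL and message contents are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

history = [
    {"role": "user", "content": "Outline a plan to profile a slow Python service."},
    # Keep the full assistant turn, <think> block included, when chaining.
    {"role": "assistant", "content": "<think>...</think>Start with a flame graph..."},
    {"role": "user", "content": "Expand on the first step."},
]

resp = client.chat.completions.create(
    model="alonsoko/MiniMax-M2.7-abliterated-heretic-ara-AWQ",
    messages=history,
    temperature=1.0,
    top_p=0.95,
    extra_body={"top_k": 40},  # top_k is a vLLM extension to the OpenAI API
)
print(resp.choices[0].message.content)
```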
## Hardware Requirements
Approximate VRAM for inference at this quantization (W4A16-G128):
- Weights: ~115 GB (see the sanity check after this list)
- KV cache (per request, varies with context length): ~2-8 GB
- Recommended: 2× 80GB GPUs (e.g., A100/H100) with `--tensor-parallel-size 2`, or 4× 48GB GPUs (e.g., L40S/A6000) with `--tensor-parallel-size 4`
- Minimum: a single 141GB H200 should fit weights + modest context
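
As a sanity check on the weight figure (ignoring per-group scale overhead and the unquantized router gates and LM head, which add a few GB on top):

```python
total_params = 230e9      # MiniMax-M2.7 total parameters (sparse MoE)
bits_per_weight = 4       # W4A16 packs each weight into 4 bits
weight_gb = total_params * bits_per_weight / 8 / 1e9
print(f"~{weight_gb:.0f} GB")  # ~115 GB
```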
This is a decensored version of MiniMaxAI/MiniMax-M2.7, made using Heretic v1.2.0+custom with the Arbitrary-Rank Ablation (ARA) method.
## ⚠️ Disclaimer
This model is intended for research, experimentation, and testing purposes only.
- This model may produce harmful, offensive, inappropriate, or otherwise objectionable content.
- The abliteration process removes safety guardrails that were intentionally built into the original model.
- Do not use this model in production systems, consumer-facing applications, or any context where harmful outputs could cause real-world harm.
- The authors and contributors of this toolkit bear no responsibility for any misuse of this model or any harm caused by outputs generated by this model.
- By using this model, you agree that you are solely responsible for ensuring its use complies with all applicable laws and ethical guidelines.
This model is shared purely for academic and technical exploration of model internals.
## Abliteration parameters
| Parameter | Value |
|---|---|
| start_layer_index | 30 |
| end_layer_index | 51 |
| preserve_good_behavior_weight | 0.4512 |
| steer_bad_behavior_weight | 0.0037 |
| overcorrect_relative_weight | 0.8804 |
| neighbor_count | 14 |
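
For intuition: directional ablation removes a learned "refusal" subspace from selected weight matrices, and ARA generalizes the usual rank-1 ablation to a subspace of arbitrary rank. A hypothetical PyTorch sketch of the core operation follows; this is not Heretic's actual implementation, and the parameters in the table above are weights in Heretic's optimization objective rather than inputs to this function.

```python
import torch

def ablate_subspace(weight: torch.Tensor, directions: torch.Tensor,
                    alpha: float = 1.0) -> torch.Tensor:
    """Damp or remove a rank-k subspace from a linear layer's weight.

    weight:     (d_out, d_in) weight matrix
    directions: (k, d_out) orthonormal basis of the subspace to ablate
                in the layer's output space (assumed convention)
    alpha:      strength; 1.0 removes the subspace entirely
    """
    d_out = weight.shape[0]
    # Projector P = I - alpha * V^T V zeroes (or damps) output
    # components lying in the ablated subspace.
    proj = torch.eye(d_out, dtype=weight.dtype) - alpha * directions.T @ directions
    return proj @ weight
```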
## About the Base Model
Original model: MiniMaxAI/MiniMax-M2.7
MiniMax-M2.7 is a 230B-parameter sparse MoE (10B active) built for agentic workflows, coding, and tool use. It uses interleaved thinking with `<think>...</think>` blocks. See the base model card for capabilities, benchmarks, and deployment details.