Instructions to use JANGQ-AI/MiniMax-M2.7-JANGTQ with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use JANGQ-AI/MiniMax-M2.7-JANGTQ with MLX:
```python
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("JANGQ-AI/MiniMax-M2.7-JANGTQ")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use JANGQ-AI/MiniMax-M2.7-JANGTQ with Pi:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "JANGQ-AI/MiniMax-M2.7-JANGTQ"
```
Configure the model in Pi
```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add to ~/.pi/agent/models.json:
```json
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "JANGQ-AI/MiniMax-M2.7-JANGTQ" }
      ]
    }
  }
}
```
Run Pi
```bash
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use JANGQ-AI/MiniMax-M2.7-JANGTQ with Hermes Agent:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "JANGQ-AI/MiniMax-M2.7-JANGTQ"
```
Configure Hermes
```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default JANGQ-AI/MiniMax-M2.7-JANGTQ
```
Run Hermes
hermes
- MLX LM
How to use JANGQ-AI/MiniMax-M2.7-JANGTQ with MLX LM:
Generate or start a chat session
```bash
# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "JANGQ-AI/MiniMax-M2.7-JANGTQ"
```
Run an OpenAI-compatible server
```bash
# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "JANGQ-AI/MiniMax-M2.7-JANGTQ"
# Call the OpenAI-compatible server with curl (port 8080, matching the Pi and Hermes configs)
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "JANGQ-AI/MiniMax-M2.7-JANGTQ",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
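The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server is reachable at `http://localhost:8080/v1` (the address used in the Pi and Hermes configs) and that the response follows the usual OpenAI chat-completions shape. The helper names `build_chat_request` and `chat` are illustrative, not part of mlx-lm.

```python
import json
import urllib.request

MODEL_ID = "JANGQ-AI/MiniMax-M2.7-JANGTQ"
# Assumption: the local server listens on port 8080, as in the Pi and Hermes configs.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(user_message: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_message: str) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_message)) as response:
        body = json.load(response)
    # Assumption: OpenAI-style responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# With the server running:
# print(chat("Hello"))
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.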

MiniMax-M2.7-JANGTQ
MiniMax M2.7 with 2-bit JANGTQ2 quantization in the JANGTQ-PRESTACK layout: 47 GB on disk, down from the ~230 GB FP8 source. Routed experts are stored pre-stacked on disk, so a cold load needs no restacking pass and no runtime cache sidecar.
- Source: MiniMaxAI/MiniMax-M2.7 (MiniMax M2 architecture, FP8 E4M3 block-128 native, 196K context, 62 layers, 256 routed experts top-8)
- Quantization: JANGTQ2, i.e. 2-bit MXTQ codebook (Hadamard-rotated, Lloyd-Max optimized) on routed-expert weights + 8-bit affine on attention / shared expert / embed / lm_head + fp16 passthrough on RMSNorms / router gate / expert_bias
- Routed-expert layout: pre-stacked along axis 0 (`block_sparse_moe.switch_mlp.<proj>.tq_packed` shape `[256, out, packed_in]`) per the JANGTQ-PRESTACK STANDARD; no runtime restacking, no `jangtq_stacked.safetensors` sidecar
- Bundle size: 47 GB on-disk across 51 shards
- Runs on: M3 Max 96 GB+ / M4 Max 128 GB / M5 Max 128 GB / Mac Studio
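To make the "Hadamard-rotated, Lloyd-Max optimized" 2-bit codebook idea concrete, here is a small pure-Python sketch of the general technique. It is not the JANGTQ/MXTQ codec: the per-row RMS scale, the toy dimension of 64, the Gaussian training samples, and all function names are assumptions chosen for illustration, and real kernels pack the 2-bit indices rather than keeping them in a list. The seed of 42 simply echoes the value quoted for the runtime sidecar.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; the length must be a power of two."""
    out = list(values)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def lloyd_max(samples, levels=4, iterations=25):
    """1-D Lloyd-Max: alternate nearest-level assignment and centroid update."""
    ordered = sorted(samples)
    # Start from evenly spaced quantiles of the data.
    codebook = [ordered[(2 * k + 1) * len(ordered) // (2 * levels)] for k in range(levels)]
    for _ in range(iterations):
        buckets = [[] for _ in range(levels)]
        for x in samples:
            idx = min(range(levels), key=lambda k: (x - codebook[k]) ** 2)
            buckets[idx].append(x)
        # Keep the old level if a bucket happens to be empty.
        codebook = [sum(b) / len(b) if b else c for b, c in zip(buckets, codebook)]
    return codebook


def quantize(row, signs, codebook):
    """Sign-flip, rotate, then keep one 2-bit index per element plus a row scale."""
    rotated = fwht([x * s for x, s in zip(row, signs)])
    scale = math.sqrt(sum(v * v for v in rotated) / len(rotated)) or 1.0
    indices = [
        min(range(len(codebook)), key=lambda k: (v / scale - codebook[k]) ** 2)
        for v in rotated
    ]
    return indices, scale


def dequantize(indices, scale, signs, codebook):
    """Look up levels, undo the rotation (the orthonormal WHT is its own inverse), undo the sign flips."""
    rotated = [codebook[i] * scale for i in indices]
    return [x * s for x, s in zip(fwht(rotated), signs)]


rng = random.Random(42)
dim = 64  # toy size; real expert matrices are far wider
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
# Assumption: fit the 4 levels on unit-variance Gaussian samples.
codebook = lloyd_max([rng.gauss(0.0, 1.0) for _ in range(4096)])
row = [rng.gauss(0.0, 0.02) for _ in range(dim)]
indices, scale = quantize(row, signs, codebook)
restored = dequantize(indices, scale, signs, codebook)
rel_error = math.sqrt(
    sum((a - b) ** 2 for a, b in zip(row, restored)) / sum(a * a for a in row)
)
print(f"codebook={codebook}, relative error={rel_error:.3f}")
```

The rotation spreads outliers across the row so that a single small codebook fits every element reasonably well, which is the usual motivation for pairing a Hadamard transform with a fixed low-bit codebook.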
What's new in this build (2026-05-04)
This bundle is shipped in JANGTQ-PRESTACK layout — the routed-expert TurboQuant tensors are stacked along axis 0 directly in the main shards. Wins vs the previous per-expert layout:
| Metric | Old (per-expert) | This (pre-stacked) |
|---|---|---|
| First-load time | ~5-10s restacking pass | mx.load() direct (~14 s incl warmup) |
| Decode tok/s | reference | identical (same MXTQ codec, same fused decode kernels) |
| Bundle size | ~57 GB | ~47 GB (smaller by virtue of removing per-expert metadata duplication) |
| Loader path | streaming hydrate + per-expert restack | generic loader's prestack branch |
What's in the bundle
| Module | Source dtype | Bundle dtype |
|---|---|---|
| Routed experts (256 × 3 mats × 62 layers, pre-stacked along axis 0) | FP8 E4M3 + F32 weight_scale_inv | 2-bit MXTQ + sidecar codebook |
| Attention (q/k/v/o, q/k norms) | FP8 E4M3 / BF16 | 8-bit affine g=64 |
| `embed_tokens` / `lm_head` | BF16 | 8-bit affine g=64 |
| RMSNorm / router gate / `e_score_correction_bias` | BF16 / F32 | fp16 / fp32 passthrough |
The bundle also ships a `jangtq_runtime.safetensors` sidecar (~25 KB) for Swift runtimes; it covers the (in_features={1536, 3072}, seed=42, bits=2) codebooks and sign-flip vectors.
Loading (Python)
```bash
pip install jang-tools mlx-lm
```
```python
from jang_tools.load_jangtq import load_jangtq_model

model, tokenizer = load_jangtq_model("JANGQ-AI/MiniMax-M2.7-JANGTQ")
```
The loader detects the pre-stacked layout via `jang_config.routed_expert_layout == "prestacked"` and routes through the generic JANGTQ loader's prestack branch. Decode applies the standard SwitchGLU fused gate+up + P15 router compile + P18 QKV fusion patches automatically.
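The layout check described above can be pictured roughly as follows. This is a sketch only, not the `jang_tools` implementation: it assumes the config is available as a JSON-style dict, the branch labels `"prestack"` and `"per_expert_restack"` are hypothetical, and nested lists stand in for real tensors. The `restack` helper shows the step that a pre-stacked bundle no longer needs at load time.

```python
import json


def select_loader_branch(jang_config: dict) -> str:
    """Pick a loading path from the layout declared in the config."""
    if jang_config.get("routed_expert_layout") == "prestacked":
        return "prestack"  # hypothetical label: tensors are used as stored
    return "per_expert_restack"  # hypothetical label: experts are stacked at load time


def restack(per_expert: dict) -> list:
    """Stack per-expert [out, packed_in] matrices along a new axis 0."""
    return [per_expert[i] for i in sorted(per_expert)]


def tensor_shape(nested) -> tuple:
    """Shape of a regular nested list, used here as a stand-in for a tensor."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


config = json.loads('{"routed_expert_layout": "prestacked"}')
branch = select_loader_branch(config)

# Toy sizes: 4 experts, each a 2 x 3 matrix of packed words (the real bundle stacks 256).
per_expert = {i: [[i] * 3 for _ in range(2)] for i in range(4)}
stacked = restack(per_expert)
print(branch, tensor_shape(stacked))
```

With the pre-stacked layout, the tensor on disk already has the `[num_experts, out, packed_in]` shape that `restack` would otherwise have to build.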
Reasoning + tools
- Reasoning parser: `qwen3` (extracts `<think>...</think>` blocks)
- Tool parser: `minimax`
- Default mode: thinking ON (chat template opens `<think>` for the assistant); pass `enable_thinking=False` to skip reasoning
- Cache: `kv` (standard MLA-free MoE attention cache)
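When calling the model without a serving layer that applies a reasoning parser, the `<think>` block has to be separated from the answer by hand. The sketch below shows one way to do that; it is not the `qwen3` parser itself, and `split_reasoning` is a hypothetical helper. Because the chat template opens `<think>` on the assistant's behalf, generated text may contain only the closing tag, so that case is handled first.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from generated text."""
    # Case 1: the template already opened <think>, so only </think> appears.
    if "<think>" not in text and "</think>" in text:
        reasoning, _, answer = text.partition("</think>")
        return reasoning.strip(), answer.strip()
    # Case 2: complete <think>...</think> blocks, or no reasoning at all.
    reasoning = "\n".join(part.strip() for part in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer


print(split_reasoning("2 + 2 is 4</think>\n\nThe answer is 4."))
```

Text without any think tags passes through unchanged as the answer, with an empty reasoning string.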
Credits
- Quantization + MLX runtime: Jinho Jang (eric@jangq.ai)
- Base model: MiniMaxAI — M2.7 architecture