Fixing Mistral-Common vs Hugging Face Tokenizer Mismatch
The current pre-tokenizer in the `tokenizer.json` is as follows:

```json
"pre_tokenizer": {
  "type": "Sequence",
  "pretokenizers": [
    {
      "type": "Split",
      "pattern": {
        "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
      },
      "behavior": "Isolated",
      "invert": false
    },
    {
      "type": "ByteLevel",
      "add_prefix_space": false,
      "trim_offsets": true,
      "use_regex": false
    }
  ]
},
```
However, the regex pattern

```
(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+
```

is wrong. The correct one is:

```
[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+
```
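One easily visible difference between the two patterns: the current one batches digits in runs of up to three (`\p{N}{1,3}`), while the correct pattern emits one pre-token per digit (`\p{N}`). A rough illustration only, since Python's `re` does not support `\p{N}` and `\d` stands in for it here:

```python
import re

text = "price 12345"

# current (wrong) pattern's digit branch: runs of up to three digits per pre-token
print(re.findall(r"\d{1,3}", text))  # ['123', '45']

# correct pattern's digit branch: one pre-token per digit
print(re.findall(r"\d", text))  # ['1', '2', '3', '4', '5']
```

Digit grouping matters because the vocabulary was built for single-digit number tokens; merging digits changes which IDs numbers map to.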
The ground truth for the regex pattern can be found in the `tekken.json` file for our officially supported tokenizer using mistral-common:
```json
"config": {
  "pattern": "[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
  "num_vocab_tokens": 150000,
  "default_vocab_size": 131072,
  "default_num_special_tokens": 1000,
  "version": "v11"
},
```
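Until a fix lands upstream, a local checkout can be patched by hand. A minimal sketch, assuming a `tokenizer.json` laid out like the one quoted above; the file path in the usage comment is hypothetical:

```python
import json

# The tekken.json pattern, as a raw (un-JSON-escaped) string.
CORRECT_PATTERN = (
    r"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+"
    r"|[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*"
    r"|\p{N}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n/]*"
    r"|\s*[\r\n]+"
    r"|\s+(?!\S)"
    r"|\s+"
)

def patch_pre_tokenizer(tokenizer_json: dict) -> dict:
    """Replace the Split pre-tokenizer's regex with the tekken.json pattern."""
    for pt in tokenizer_json["pre_tokenizer"]["pretokenizers"]:
        if pt.get("type") == "Split" and "Regex" in pt.get("pattern", {}):
            pt["pattern"]["Regex"] = CORRECT_PATTERN
    return tokenizer_json

# usage (path is hypothetical):
# with open("tokenizer.json") as f:
#     data = json.load(f)
# patch_pre_tokenizer(data)
# with open("tokenizer.json", "w") as f:
#     json.dump(data, f, ensure_ascii=False, indent=2)
```

`json.dump` re-applies the `\\r`-style escaping seen in the file, so the raw string above round-trips to the quoted `tekken.json` pattern.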
Also, here is an easy repro:

```python
# CORRECT
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tok_mc = MistralTokenizer.from_hf_hub("mistralai/Mistral-Small-3.1-24B-Instruct-2503")
tok_mc.instruct_tokenizer.tokenizer.encode("'The'", True, False)
# [1, 1039, 1784, 1039]

# INCORRECT
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503")
tok.encode("'The'")
# [1, 1039, 1084, 1268, 1039]
# incorrectly tokenizes to ["'", 'T', 'he', "'"]
```
Can this be merged? Any blockers?
@pandora-s @patrickvonplaten @BramVanroy Can someone confirm with Qwen3 team if they trained with the broken tokenizers? Qwen3 30B-A3B uses this tokenizer and maybe other Qwen3 models.
> @pandora-s @patrickvonplaten @BramVanroy Can someone confirm with Qwen3 team if they trained with the broken tokenizers? Qwen3 30B-A3B uses this tokenizer and maybe other Qwen3 models.

Good question, I have the same question too.
Hey,

> @pandora-s @patrickvonplaten @BramVanroy Can someone confirm with Qwen3 team if they trained with the broken tokenizers? Qwen3 30B-A3B uses this tokenizer and maybe other Qwen3 models.

Anyone who used the HF tokenizer rather than mistral-common prior to the fix in Transformers, and trained with Mistral-Small-3.1-24B-Instruct-2503 as a base, loaded a "broken" tokenizer for the model. That said, fine-tuning should diminish the impact over time, though you might see subpar performance and longer convergence.
We noticed a discrepancy for less than 1% of tokens.
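If you want to estimate a rate like that on your own data, here is a crude sketch; it counts disagreeing strings rather than individual tokens, and the two encoders are passed in as plain callables (e.g. the HF tokenizer's `encode` and the mistral-common one from the repro above):

```python
def token_mismatch_rate(encode_a, encode_b, texts):
    """Fraction of input texts whose two encodings disagree."""
    if not texts:
        return 0.0
    mismatches = sum(1 for t in texts if encode_a(t) != encode_b(t))
    return mismatches / len(texts)

# usage (encoder wiring is illustrative, not an official API):
# rate = token_mismatch_rate(
#     lambda s: tok.encode(s),
#     lambda s: tok_mc.instruct_tokenizer.tokenizer.encode(s, True, False),
#     corpus_lines,
# )
```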
As someone on the receiving end of this, I just want to chime in and say that I have read this message a few thousand times now, since it is spammed at me in my console even though I am not using a Mistral model at all, but a Llama 3 tokenizer:
```
The tokenizer you are loading from '/.../meta-llama/llama-3.3-8B-Instruct' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the fix_mistral_regex=True flag when loading this tokenizer to fix this issue.
```
I'm glad that this was so important that every tokenizer ever, when loaded, needs to say that it doesn't have the Mistral tokenizer fix.