Fixing Mistral-Common vs Hugging Face Tokenizer Mismatch
The current pre-tokenizer in the `tokenizer.json` is as follows:

```json
"pre_tokenizer": {
  "type": "Sequence",
  "pretokenizers": [
    {
      "type": "Split",
      "pattern": {
        "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
      },
      "behavior": "Isolated",
      "invert": false
    },
    {
      "type": "ByteLevel",
      "add_prefix_space": false,
      "trim_offsets": true,
      "use_regex": false
    }
  ]
},
```
However, the regex pattern

```
(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+
```

is wrong. The correct one is:

```
[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+
```
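One easily visible difference between the two patterns: the current one batches digits in runs of up to three (`\p{N}{1,3}`), while the correct pattern emits one pre-token per digit (`\p{N}`). A rough illustration only, since Python's `re` does not support `\p{N}` and `\d` stands in for it here:

```python
import re

text = "price 12345"

# current (wrong) pattern's digit branch: runs of up to three digits per pre-token
print(re.findall(r"\d{1,3}", text))  # ['123', '45']

# correct pattern's digit branch: one pre-token per digit
print(re.findall(r"\d", text))  # ['1', '2', '3', '4', '5']
```

Digit grouping matters because the vocabulary was built for single-digit number tokens; merging digits changes which IDs numbers map to.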
The ground truth for the regex pattern can be found in the `tekken.json` file for our officially supported tokenizer using mistral-common:
```json
"config": {
  "pattern": "[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
  "num_vocab_tokens": 150000,
  "default_vocab_size": 131072,
  "default_num_special_tokens": 1000,
  "version": "v11"
},
```
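Until a fix lands upstream, a local checkout can be patched by hand. A minimal sketch, assuming a `tokenizer.json` laid out like the one quoted above; the file path in the usage comment is hypothetical:

```python
import json

# The tekken.json pattern, as a raw (un-JSON-escaped) string.
CORRECT_PATTERN = (
    r"[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+"
    r"|[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]+[\p{Ll}\p{Lm}\p{Lo}\p{M}]*"
    r"|\p{N}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n/]*"
    r"|\s*[\r\n]+"
    r"|\s+(?!\S)"
    r"|\s+"
)

def patch_pre_tokenizer(tokenizer_json: dict) -> dict:
    """Replace the Split pre-tokenizer's regex with the tekken.json pattern."""
    for pt in tokenizer_json["pre_tokenizer"]["pretokenizers"]:
        if pt.get("type") == "Split" and "Regex" in pt.get("pattern", {}):
            pt["pattern"]["Regex"] = CORRECT_PATTERN
    return tokenizer_json

# usage (path is hypothetical):
# with open("tokenizer.json") as f:
#     data = json.load(f)
# patch_pre_tokenizer(data)
# with open("tokenizer.json", "w") as f:
#     json.dump(data, f, ensure_ascii=False, indent=2)
```

`json.dump` re-applies the `\\r`-style escaping seen in the file, so the raw string above round-trips to the quoted `tekken.json` pattern.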
Also, here is an easy repro:

```python
# CORRECT
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tok_mc = MistralTokenizer.from_hf_hub("mistralai/Mistral-Small-3.1-24B-Instruct-2503")
tok_mc.instruct_tokenizer.tokenizer.encode("'The'", True, False)
# [1, 1039, 1784, 1039]

# INCORRECT
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503")
tok.encode("'The'")
# [1, 1039, 1084, 1268, 1039]
# incorrectly tokenizes to ["'", 'T', 'he', "'"]
```
Can this be merged? Any blockers?
@pandora-s @patrickvonplaten @BramVanroy Can someone confirm with Qwen3 team if they trained with the broken tokenizers? Qwen3 30B-A3B uses this tokenizer and maybe other Qwen3 models.
> @pandora-s @patrickvonplaten @BramVanroy Can someone confirm with Qwen3 team if they trained with the broken tokenizers? Qwen3 30B-A3B uses this tokenizer and maybe other Qwen3 models.

Good question, I have the same question too.
Hey,

> @pandora-s @patrickvonplaten @BramVanroy Can someone confirm with Qwen3 team if they trained with the broken tokenizers? Qwen3 30B-A3B uses this tokenizer and maybe other Qwen3 models.

Anyone who used the HF tokenizer rather than mistral-common prior to the fix in Transformers, and trained with Mistral-Small-3.1-24B-Instruct-2503 as a base, loaded a "broken" tokenizer for the model. That said, fine-tuning should diminish the impact over time, though you might see subpar performance and longer convergence.
We noticed a discrepancy for less than 1% of tokens.
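If you want to estimate a rate like that on your own data, here is a crude sketch; it counts disagreeing strings rather than individual tokens, and the two encoders are passed in as plain callables (e.g. the HF tokenizer's `encode` and the mistral-common one from the repro above):

```python
def token_mismatch_rate(encode_a, encode_b, texts):
    """Fraction of input texts whose two encodings disagree."""
    if not texts:
        return 0.0
    mismatches = sum(1 for t in texts if encode_a(t) != encode_b(t))
    return mismatches / len(texts)

# usage (encoder wiring is illustrative, not an official API):
# rate = token_mismatch_rate(
#     lambda s: tok.encode(s),
#     lambda s: tok_mc.instruct_tokenizer.tokenizer.encode(s, True, False),
#     corpus_lines,
# )
```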
As someone on the receiving end of this, I just want to chime in and say that I have read this message a few thousand times now, since it is spammed at me in my console even though I am not using a Mistral model at all, but a Llama 3 tokenizer:
```
The tokenizer you are loading from '/.../meta-llama/llama-3.3-8B-Instruct' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the fix_mistral_regex=True flag when loading this tokenizer to fix this issue.
```
I'm glad that this was so important that every tokenizer ever, when loaded, needs to say that it doesn't have the Mistral tokenizer fix.