[Query-ISSUE] tokenizer.vocab_size is 128000 but len(tokenizer) is 128256, which prevents me from using the other tokens.
#34
by HV-Khurdula - opened
@HV-Khurdula The extra 256 entries are special tokens with token IDs ranging from 128000 to 128255.
They include <|begin_of_text|>, <|end_of_text|>, <|reserved_special_token_0|>, and so on. The first two are already in use as the BOS and EOS tokens.
You can find the complete list in the tokenizer_config.json file.
https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/tokenizer_config.json
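
To make the arithmetic concrete, here is a minimal sketch of how the two sizes relate. It does not load the real tokenizer: `StubTokenizer` and `added_token_id_range` are hypothetical names introduced only for illustration, and the stub hard-codes the two numbers reported in this thread. It also assumes the special tokens sit in one contiguous block right after the base vocabulary, as described above.

```python
class StubTokenizer:
    """Hypothetical stand-in that mimics the two sizes reported in this thread."""

    vocab_size = 128000  # base vocabulary only, without added special tokens

    def __len__(self):
        return 128256  # base vocabulary plus the added special tokens


def added_token_id_range(tokenizer):
    """Return (first_id, last_id) of the tokens beyond the base vocabulary.

    Assumes the added tokens occupy one contiguous block of IDs that starts
    right after the base vocabulary. Returns None if there are no extra tokens.
    """
    base, total = tokenizer.vocab_size, len(tokenizer)
    if total <= base:
        return None
    return base, total - 1


tok = StubTokenizer()
first_id, last_id = added_token_id_range(tok)
print(len(tok) - tok.vocab_size, first_id, last_id)  # prints: 256 128000 128255
```

The helper only reads `vocab_size` and `len(...)`, so it should work the same way on a real tokenizer object that exposes both, though that is untested here.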
