Sentencepiece model file

#9
by Bingsu - opened

Thank you for releasing the gemma4 model.
It appears that gemma4 uses a Sentencepiece tokenizer very similar to the one in gemma3. Would you be able to upload the Sentencepiece model file you used?

Google org

Hi @Bingsu ,

Thank you for reaching out and notifying us about this. We have escalated this issue to our Engineering team for further investigation. We are tracking this internally and will provide an update on this thread as soon as we have a resolution or next steps.

That file is 1 bit off at address 0x12. Byte 0x38 declares field 7 of enclosing NormalizerSpec, wire type 0 (varint).
The SentencePiece ProtoBuf schema defines no such field. NormalizerSpec in the schema defines fields 1..6 and allows extensions 200..max, so fields 7..199 are not permitted by the official schema.
The most likely intended field is 5 (escape_whitespaces), which would be encoded as byte 0x28: the wire type is unchanged, and the 1-byte varint that follows is 0, meaning false.
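
For anyone who wants to check the tag arithmetic, here is a minimal Python sketch. It assumes only the standard ProtoBuf rule that a one-byte tag stores the field number in the upper bits and the wire type in the low three bits; `decode_tag` is a helper name made up for this illustration, not part of any library.

```python
def decode_tag(byte):
    """Split a one-byte ProtoBuf tag into (field_number, wire_type)."""
    if byte & 0x80:
        # High bit set means the tag continues in the next byte;
        # this sketch deliberately handles only one-byte tags.
        raise ValueError("multi-byte tag not handled in this sketch")
    return byte >> 3, byte & 0x07


observed = 0x38  # byte reported at offset 0x12
intended = 0x28  # byte that would encode field 5, wire type 0

print(decode_tag(observed))                 # (7, 0)
print(decode_tag(intended))                 # (5, 0)
print(bin(observed ^ intended).count("1"))  # 1 differing bit
```

The two bytes differ only in bit 4 (0x10), which is what "1 bit off" refers to.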

"1 bit off at address 0x12. Byte 0x38 declares field 7 of enclosing NormalizerSpec, wire type 0 (varint)"

What is the exact error when you used SentencePiece to load that file?

I linked the official schema it violates. The "error" is an illegal field identifier.
I'm not "loading" it into anything — just reading the file per the official spec.
There is no field 7 of NormalizerSpec defined and the schema does not permit extensions with this field identifier.
image
Ignore the further misread fields in the left pane; that is expected when parsing a streaming serialization format like ProtoBuf. See its wire-format encoding specification for details.

Put simply: in whatever software you load it with, check whether NormalizerSpec.escape_whitespaces ends up enabled.
Google set it to disabled in the file, but because the field is misidentified, expect the value to be silently ignored and the field to fall back to its default, enabled.
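
The expected effect can be sketched with a toy reader. This is not SentencePiece's or ProtoBuf's actual parser; it is an illustration that assumes a two-byte fragment (tag followed by a varint 0) as described above, models only field 5 with its default of enabled, and skips unknown varint fields the way a schema-aware reader typically would. The names `read_varint` and `parse_escape_whitespaces` are hypothetical.

```python
ESCAPE_WHITESPACES_FIELD = 5  # field number of escape_whitespaces in NormalizerSpec


def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7


def parse_escape_whitespaces(buf):
    """Return the effective escape_whitespaces value for a varint-only fragment."""
    value = True  # default: enabled when the field is absent
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field, wire_type = tag >> 3, tag & 0x07
        if wire_type != 0:
            raise ValueError("this sketch handles only varint fields")
        raw, pos = read_varint(buf, pos)
        if field == ESCAPE_WHITESPACES_FIELD:
            value = bool(raw)
        # Any other field number is unknown to this reader and is skipped.
    return value


print(parse_escape_whitespaces(bytes([0x38, 0x00])))  # True: field 7 ignored, default kept
print(parse_escape_whitespaces(bytes([0x28, 0x00])))  # False: field 5 explicitly disabled
```

With the byte as found in the file, the explicit "disabled" never reaches escape_whitespaces, so the default applies.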
