[!NOTE] This is not a tutorial-style post. Think of it as the notes I took while working through the tokenizer-building phase.
First things first: let’s download the model files. I mean the tokenizer files, but they end with `.model`. These are serialized using ProtoBuf. You can find the specification of the file here. You can download the model file from here.
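To get a feel for what "serialized using ProtoBuf" means on disk, here is a rough sketch (not the approach these notes necessarily follow) that walks the top-level fields of a protobuf-encoded byte string using only the standard library. The helper names `read_varint`, `iter_fields`, and `count_top_level_fields` are made up for illustration, and the sketch only handles the four common wire types (varint, 64-bit, length-delimited, 32-bit).

```python
from collections import Counter


def read_varint(buf, pos):
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if not (byte & 0x80):             # high bit clear means last byte
            return result, pos
        shift += 7


def iter_fields(buf):
    """Yield (field_number, wire_type, value) for each top-level field."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited (strings, bytes, sub-messages)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field_number, wire_type, value


def count_top_level_fields(buf):
    """Count how many times each top-level field number appears."""
    return Counter(number for number, _, _ in iter_fields(buf))


# Tiny hand-built message: field 1 = "ab", field 1 = "c", field 2 = 150.
demo = b"\x0a\x02ab\x0a\x01c\x10\x96\x01"
print(count_top_level_fields(demo))
```

Pointing `count_top_level_fields` at the bytes of the downloaded `.model` file should show one field number repeated a great many times. My assumption is that this repeated field is the list of vocabulary entries, each stored as a length-delimited sub-message, but check that against the specification before relying on it.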
The Gemma-3 tokenizer is different from the Gemma-2 one. The algorithm remains the same, and the vocabulary size is more or less similar (256K for Gemma-2 vs ~262K for Gemma-3). All Gemma-3 variants use the same tokenizer.