(Re)building Gemma tokenizer in Python

[!NOTE] This is not a tutorial-type blog post. Think of it as my notes from when I was going through the tokenizer-building phase.

First things first, let’s download the model files. I mean the tokenizer files, but they end with .model. These are serialized using Protobuf. You can find the specification of the file format here. You can download the model file from here.
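
As a quick sanity check that the file really is a serialized Protobuf, it can be parsed with the `ModelProto` definition that ships with the `sentencepiece` pip package. A minimal sketch (the local file name `tokenizer.model` is an assumption):

```python
# Minimal sketch: parse the serialized tokenizer model with the ModelProto
# definition bundled in the `sentencepiece` pip package.
# Assumes the downloaded file is saved locally as "tokenizer.model".
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

model = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    model.ParseFromString(f.read())

print(model.trainer_spec.model_type)  # tokenization algorithm as an enum value (UNIGRAM, BPE, ...)
print(len(model.pieces))              # number of pieces in the vocabulary
```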

The Gemma-3 tokenizer is different from Gemma-2’s. The algorithm remains the same, and the vocabulary size is more or less similar (256K vs ~262K). Within the Gemma-3 variants, the same tokenizer is used.
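
One quick way to see the vocabulary-size difference, assuming both .model files have been downloaded locally (the file names below are placeholders):

```python
# Minimal sketch: load both tokenizers and compare vocabulary sizes.
# The file names are assumptions; point them at wherever you saved the models.
import sentencepiece as spm

gemma2 = spm.SentencePieceProcessor(model_file="gemma2_tokenizer.model")
gemma3 = spm.SentencePieceProcessor(model_file="gemma3_tokenizer.model")

print(gemma2.vocab_size())  # roughly 256K, per the comparison above
print(gemma3.vocab_size())  # roughly 262K
```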


NotAeroCalc Part 1: What is NotAeroCalc and how to use it?

I am a Computer Engineer by education and by training. Luckily, we computer engineers do not have to deal with many different units. All our units are powers of 2, and we have sensible names: Kilo, Mega, Giga, you get it. When we get unlucky, there can be confusion about whether Kilo in a given context means $ 2^{10} $ or $ 10^3 $. But that is it.

But in other engineering (and science) branches this is not the case. A while ago I was taking the course “Introduction to Aeronautical Engineering” on edX (the course is from TU Delft). Honestly, the homework problems there were annoying, at least for me. I can understand that students in the Aero department might need to practice the conversion process.


How to find the number of unique elements in a stream?

So, I’ve been reading about streaming algorithms. It seems like the journey into streaming algorithms (aka algorithms for Big Data) starts with the Flajolet-Martin algorithm.

The Problem

We are given a sequence $ \langle u_1, u_2, u_3, \ldots, u_n \rangle $ of $ n $ elements. Each $ u_i $ comes from a fixed set $ U $ of some finite size. We want to count how many distinct elements appear in the sequence.

A simple piece of Python code like the one below can solve the problem, if we have the required memory.
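
A minimal sketch of that exact-counting approach, using a plain Python set:

```python
# Exact distinct count: remember every element we have seen in a set and
# report its size. Memory grows with the number of distinct elements,
# which is exactly what streaming algorithms try to avoid.
def count_distinct(stream):
    seen = set()
    for u in stream:
        seen.add(u)
    return len(seen)

print(count_distinct([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]))  # 7
```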
