(Re)building Gemma tokenizer in Python

[!NOTE] This is not a tutorial-type blog post. Think of it as my notes from when I was going through the tokenizer-building phase.

First things first, let’s download the model files. I mean the tokenizer files, but they end with .model. These are serialized using Protobuf. You can find the specification of the file format here. You can download the model file from here.
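
As a quick sanity check that the file really is a serialized Protobuf, it can be parsed with the `ModelProto` definition that ships with the `sentencepiece` pip package. A minimal sketch (the local file name `tokenizer.model` is an assumption):

```python
# Minimal sketch: parse the serialized tokenizer model with the ModelProto
# definition bundled in the `sentencepiece` pip package.
# Assumes the downloaded file is saved locally as "tokenizer.model".
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

model = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:
    model.ParseFromString(f.read())

print(model.trainer_spec.model_type)  # tokenization algorithm as an enum value (UNIGRAM, BPE, ...)
print(len(model.pieces))              # number of pieces in the vocabulary
```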

The Gemma-3 tokenizer is different from Gemma-2’s. The algorithm remains the same, and the vocabulary size is more or less similar (256K vs ~262K). Within the Gemma-3 variants, the same tokenizer is used.
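
One quick way to see the vocabulary-size difference, assuming both .model files have been downloaded locally (the file names below are placeholders):

```python
# Minimal sketch: load both tokenizers and compare vocabulary sizes.
# The file names are assumptions; point them at wherever you saved the models.
import sentencepiece as spm

gemma2 = spm.SentencePieceProcessor(model_file="gemma2_tokenizer.model")
gemma3 = spm.SentencePieceProcessor(model_file="gemma3_tokenizer.model")

print(gemma2.vocab_size())  # roughly 256K, per the comparison above
print(gemma3.vocab_size())  # roughly 262K
```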


NotAeroCalc Part 1: What is NotAeroCalc and how to use it?

I am a Computer Engineer by education and by training. Luckily, we computer engineers do not have to deal with many different units. All our units are powers of 2, and we have sensible names: Kilo, Mega, Giga, you get it. When we get unlucky, there can be confusion about whether Kilo in a given context means $ 2^{10} $ or $ 10^3 $. But that is it.

But in other engineering (and science) branches this is not the case. A while ago I was taking the course “Introduction to Aeronautical Engineering” on edX (the course is from TU Delft). Honestly, the homework problems there were annoying, at least for me. I can understand that students in the Aero department might need to practice the conversion process.


How to find the number of unique elements in a stream?

So, I’ve been reading about streaming algorithms. It seems like the journey into streaming algorithms (aka algorithms for Big Data) starts with the Flajolet-Martin algorithm.

The Problem

We are given a sequence $ \langle u_1, u_2, u_3, \ldots, u_n \rangle $ of $ n $ elements. Each $ u_i $ comes from a fixed set $ U $ of some finite size. We want to count how many distinct elements appear in the sequence.

A simple piece of Python code like the one below can solve the problem, if we have the required memory.
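
A minimal sketch of that exact-counting approach, using a plain Python set:

```python
# Exact distinct count: remember every element we have seen in a set and
# report its size. Memory grows with the number of distinct elements,
# which is exactly what streaming algorithms try to avoid.
def count_distinct(stream):
    seen = set()
    for u in stream:
        seen.add(u)
    return len(seen)

print(count_distinct([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]))  # 7
```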
