Tokenizers are easy!

From words to bytes: why tokenizers don’t just split on spaces, how Byte-Pair Encoding (BPE) builds a practical vocabulary, and a hands-on look at optimizing BPE from a naïve O(V×M) implementation to one that runs 85× faster using an inverted index and a heap.
