Algorithms on giacolees

giacolees - Tech Blog (https://giacolees.github.io/tags/algorihms/)

Tokenizers are easy!
https://giacolees.github.io/posts/tokenizers/
Thu, 19 Mar 2026 22:58:56 +0100

TL;DR: Your LLM has never read a single word. It reads tokens, and the way text gets chopped up matters more than you'd think. Splitting on spaces explodes the vocabulary and chokes on anything outside English. The fix? Byte-Pair Encoding: start from raw bytes, greedily merge the most frequent pairs, repeat. Simple idea, nasty bottleneck: the naive version scans every word on every merge, costing O(V × M) for V distinct words and M merges.
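
To make the bottleneck concrete, here is a minimal sketch of the naive training loop described in the summary. It is not the post's own implementation: the names train_bpe and num_merges are hypothetical, and breaking ties by the larger pair is an assumption made only to keep the output deterministic. Each merge step does two full passes over every distinct word, which is where the O(V × M) cost comes from.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Naive byte-level BPE training: rescans every word on every merge."""
    # Each distinct word becomes a tuple of single-byte tokens, with its count.
    corpus = Counter(tuple(bytes([b]) for b in w.encode("utf-8")) for w in words)
    merges = []
    for _ in range(num_merges):
        # Full scan 1: count adjacent token pairs across all words.
        pairs = Counter()
        for tokens, freq in corpus.items():
            for pair in zip(tokens, tokens[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single token
        # Greedy choice: most frequent pair (assumed tie-break: larger pair wins).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Full scan 2: rewrite every word, fusing occurrences of the best pair.
        new_corpus = Counter()
        for tokens, freq in corpus.items():
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(tokens[i] + tokens[i + 1])
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges


if __name__ == "__main__":
    print(train_bpe(["low", "low", "lower", "lowest"], 2))
```

Faster variants typically avoid the rescans by tracking which words contain each pair and updating counts incrementally, but that bookkeeping is outside this sketch.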