<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Algorithms on giacolees - Tech Blog</title><link>https://giacolees.github.io/tags/algorihms/</link><description>Recent content in Algorithms on giacolees - Tech Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Thu, 19 Mar 2026 22:58:56 +0100</lastBuildDate><atom:link href="https://giacolees.github.io/tags/algorihms/index.xml" rel="self" type="application/rss+xml"/><item><title>Tokenizers are easy!</title><link>https://giacolees.github.io/posts/tokenizers/</link><pubDate>Thu, 19 Mar 2026 22:58:56 +0100</pubDate><guid>https://giacolees.github.io/posts/tokenizers/</guid><description>TL;DR: Your LLM has never read a single word. It reads tokens, and the way text gets chopped up matters more than you'd think. Splitting on spaces explodes the vocabulary and chokes on anything outside English. The fix? Byte-Pair Encoding: start from raw bytes, greedily merge the most frequent pair, and repeat. Simple idea, nasty bottleneck: the naive version scans every word on every merge, costing O(V × M).</description></item></channel></rss>