All LLMs use tokenization. Are we doing it totally wrong?

T-FREE challenges how we do tokenization. Image source: Vellum.ai.

What happens when you challenge one of the most basic assumptions in language AI? We’ve spent years making tokenizers bigger and more sophisticated, training them on more data, expanding their vocabularies. But what if that whole approach is fundamentally limiting us?

The researchers behind the T-FREE paper ask exactly this. Their answer could change how we build language models. Instead of using a fixed vocabulary of tokens, they show how to map words directly into sparse patterns – and in doing so, cut model size by 85% while matching standard performance.
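
To make that idea concrete, here is a minimal sketch of how a word could be mapped to a sparse activation pattern via hashed character trigrams, roughly in the spirit of the paper's description. The specific names and parameters (`EMBEDDING_SLOTS`, `HASHES_PER_TRIGRAM`) are illustrative choices of mine, not the paper's actual configuration.

```python
# Sketch: map a word to a sparse set of embedding-table indices by hashing
# its character trigrams. No trained tokenizer vocabulary is involved.
# EMBEDDING_SLOTS and HASHES_PER_TRIGRAM are illustrative assumptions.

import hashlib

EMBEDDING_SLOTS = 8192      # size of the shared embedding table (assumption)
HASHES_PER_TRIGRAM = 3      # slots activated per trigram (assumption)


def trigrams(word: str) -> list[str]:
    """Character trigrams of a whitespace-padded, lowercased word."""
    padded = f" {word.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def sparse_pattern(word: str) -> set[int]:
    """Map a word to a sparse set of active embedding-slot indices."""
    active = set()
    for gram in trigrams(word):
        for seed in range(HASHES_PER_TRIGRAM):
            digest = hashlib.sha256(f"{seed}:{gram}".encode()).hexdigest()
            active.add(int(digest, 16) % EMBEDDING_SLOTS)
    return active


if __name__ == "__main__":
    # Related surface forms share trigrams, so their patterns overlap.
    print(sorted(sparse_pattern("token")))
    print(sorted(sparse_pattern("tokens")))
```

One nice property of this scheme: morphologically related words like "token" and "tokens" share most of their trigrams, so their sparse patterns overlap heavily – similarity falls out of the representation itself rather than from a learned vocabulary.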

When I first read about their approach, I was skeptical. Language models have used tokenizers since their inception – it seemed like questioning whether cars need wheels. But as I dug into the paper, I found myself getting increasingly excited. The researchers are showing us how our standard solutions have trapped us in a particular way of thinking, and I think that’s a very refreshing way to look at LLM performance.
