Why Sanskrit Breaks Your Tokenizer
Feed a Sanskrit sentence into GPT-4's tokenizer and watch what happens. You can see it in action here.
The word "Sachchidananda", meaning a poise of Consciousness, a certain Whole, from which all play of multiplicity begins, is a foundational concept in Indian/Hindu Ontology. It means existence-consciousness-bliss, sat-chit-ananda. Feed this word into a tokenizer and see how it gets shredded into byte-level fragments. The tokenizer has never seen it as a single word, let alone recognise it as a concept. Also, every Devanagari character costs 3 bytes in UTF-8, compared to 1 for ASCII. The tokenizer, trained mostly on English text, treats Sanskrit like noise to be chunked.
This isn't just a Sanskrit problem btw. The deeper issue is in how we work with languages that have an inner structure. Or even how we ought to approach problems involving non linguistic data(proteins, music, art, genome sequences).
The Numbers
When we built SanskritKatha, 50,000 AI-generated Sanskrit stories for training small language models, we needed a tokenizer. The obvious thing to do was to use an existing multilingual tokenizer. The results were interesting.
GPT-4's tiktoken tokenizer averages about 2 characters per token on Sanskrit text. Our custom BPE tokenizer, trained on Sanskrit, averages 4.76 characters per token. That was a 2-3x efficiency gain. The same story that costs 500 tokens with tiktoken costs around 200 with ours.
96.3% of our tokenizer's vocabulary is Devanagari. Common Sanskrit words like अस्ति (asti, "is") and बालकः (bālakah, "boy") encode as single tokens instead of being split into meaningless byte sequences.
Why This Matters Beyond Sanskrit
Token efficiency wasn't just about cost. The nature of tokens impacts what a model learns, and perhaps also influences how it learns. The latter is an open question we are yet to explore.
If a model's context window is measured in tokens, and if our tokenizer wastes 2-3x more tokens representing Sanskrit, by implication the model is seeing 2-3x less text per training run. Shorter windows simply means less contextual info to use when generating. The tokenizer's ignorance about Sanskrit turns out to be the model's ignorance about Sanskrit.
This generalises too. Every language that isn't well-represented in the tokenizer's training data pays this price. Tamil, Telugu, Kannada, Malayalam, all Indic scripts share the 3-byte-per-character UTF-8 problem. The languages that could benefit most from AI enabled applications are being hobbled by this issue.
Multilingual models aren't multilingual at the tokenizer level. They're English models that have an affinity to other languages, like a cousin by 4-5 hops of separation. You sort of know him but then again, you really don't have much of a clue.
What We Did
BPE (Byte-Pair Encoding) trained exclusively on our Sanskrit corpus. 10,000 vocabulary. NFC Unicode normalization. Script-aware pre-tokenization that understands Devanagari conjuncts and doesn't split them at byte boundaries.
The training data was important to us. Similar to the TinyStories paper, we had a corpus of simple children's stories (BalaKatha tier, ages 4-5, constrained vocabulary). We decided to have an additional corpus with slightly more complex narratives (KishoraKatha tier, ages 14-15). The tokenizer learns vocabulary from both levels, so it handles the full range of Sanskrit expression, from basic nouns to compound philosophical terms.
We preserved all 7,091 Sanskrit diacritical characters in our training data. ā, ī, ū, ś, ñ, ṣ. They encode phonological distinctions that change meaning. A tokenizer that ignores diacritics produces a model that can't distinguish between words.
The Deeper Problem
Sanskrit is morphologically rich in a way that English isn't. Case relationships that English expresses with prepositions ("to the village") Sanskrit expresses with noun suffixes (grāmam). Subject marking that English handles with separate pronouns, Sanskrit encodes in verb inflection. Compound words (samāsa) pack entire phrases into single tokens.
150 Sanskrit words carry roughly the narrative payload of 200-250 English words. This has direct implications for how we design training data, set context window sizes, and evaluate output quality. Word count targets that make sense for English TinyStories don't transfer.
We did not know any of this, stumbled into it by trying to work on SanskritKatha. The gap between "Sanskrit is an ancient language" and "here's what happens when you feed Sanskrit to a modern NLP pipeline" seems a nice place to find interesting problems!
The Takeaway
If you're working with any language that isn't English (or one of the handful of well-represented European languages), do look at how your tokenizer treats it. Feed in a sentence, count the tokens. Compare it to the same concept expressed in English. The ratio will tell you how much of a handicap the model starts with.
For Sanskrit, building a custom tokenizer was our first step in not looking at a token as a statistically convenient grouping of letters. We are yet far from looking at this from the level of akshara or looking at a word as living leaf on a word family branch held by a dhatu root.