The Mula-Bhashya Problem

Every knowledge tradition has a hierarchy of texts.

In the Hindu tradition there are shruti, revealed scripture, and bhashya, commentary. The text a bhashya comments on is its mula, or root text. In the Buddhist canon: sutra and shastra. Even in software, if you think about it: the code and the documentation.

In each case, the relationship is the same. The primary text has authority. The commentary has utility. Both are valuable. But they are not the same kind of thing, and conflating them distorts both.

When you train a language model on texts from these traditions, this distinction begins to matter.

How We Stumbled Into It

We're training a small language model, SASLM, on the collected works of Sri Aurobindo. He was a 20th-century Indian Rishi whose writings on Integral Yoga, consciousness, and human evolution span 23 volumes. Dense, precise, distinctive prose. The kind of writing where word choices have a certain inevitability about them.

The primary corpus: ~7.5M tokens of Sri Aurobindo's own writing. That is an impressive body of work, but we quickly found that it is too little data to train even a small language model.

Then we found a treasure: 500+ video transcripts of Integral Yoga practitioners lecturing on Sri Aurobindo's works. Cleaned and processed, they came to ~16M tokens, more than twice the volume of the primary texts.

The obvious thing was to combine everything and train. The more data the merrier, we thought!

It did help, but not in the way we wanted it to.

The Displacement Problem

When we trained on the combined corpus with equal weighting, the model's voice shifted. It started sounding more like the commentators (more pedagogical, more explanatory, more discursive) and less like Sri Aurobindo. The Rishi's characteristic precision, his way of building arguments through carefully layered sentences, was being diluted by the commentators' more accessible style.

This makes sense when you think about it. The commentary corpus is roughly twice the size of the primary corpus. In a naive training regime, the model sees about two commentary tokens for every primary token. It learns what it sees most. The interpretation displaces the original.

This is an epistemological problem dressed up as a data problem.

What We Did

We built what we're calling hierarchical training data, a framework for encoding textual authority into the training pipeline.

Three components:

Provenance tokens. Every passage is wrapped in special tokens that mark its source: <mula> and </mula> for primary text, <bhashya> and </bhashya> for commentary. This gives the model an explicit signal, at the token level, of what kind of text it is processing.
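As a minimal sketch of the wrapping step: only the four tag strings below come from our pipeline as described; the helper names and the "primary"/"commentary" labels are hypothetical, chosen for illustration. The sketch also assumes the tags are registered with the tokenizer as special tokens, so that each one maps to a single token rather than being split into pieces.

```python
# Tag strings are taken from the text; everything else is illustrative.
PROVENANCE_TAGS = {
    "primary": ("<mula>", "</mula>"),
    "commentary": ("<bhashya>", "</bhashya>"),
}


def wrap_passage(text: str, source_type: str) -> str:
    """Wrap one passage in the open/close tags for its source type.

    Raises KeyError for an unknown source type, so a mislabeled passage
    fails loudly instead of entering the corpus untagged.
    """
    open_tag, close_tag = PROVENANCE_TAGS[source_type]
    return f"{open_tag}{text.strip()}{close_tag}"


def special_tokens() -> list[str]:
    """All tag strings, in the order they would be added to a tokenizer."""
    return [tag for pair in PROVENANCE_TAGS.values() for tag in pair]


# Placeholder passages, not real corpus text.
tagged_primary = wrap_passage("a passage of primary text", "primary")
tagged_commentary = wrap_passage("a lecture transcript excerpt", "commentary")
```

Keeping the tags in one table means the tokenizer's special-token list and the wrapping code cannot drift apart.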

Weighted sampling. The primary text gets roughly 4x the effective influence of the commentary through a combination of source-type weights, period weights (mature works weighted higher than early works), importance weights (core philosophical works weighted higher than reference material), and content-type weights (essays weighted differently from letters or poetry).

The result is an ~80/20 effective training split in favour of the primary text, despite the commentary being about 2x larger in raw volume. The primary voice dominates. The commentary informs without displacing.
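To make the arithmetic concrete, here is a back-of-the-envelope sketch. It simplifies heavily: it collapses the four weight families into a single per-token multiplier on the primary text and reads "effective split" as the share of sampled tokens. Both are assumptions for illustration, not a description of the actual pipeline. Under them, the 4x figure is the 80:20 ratio itself, and reaching it from a 7.5M / 16M corpus takes an average per-token weight of about 8.5 on the primary text. The function names are hypothetical.

```python
import random

# Corpus sizes quoted in the text (approximate token counts).
PRIMARY_TOKENS = 7.5e6
COMMENTARY_TOKENS = 16e6


def effective_share(primary_weight: float, commentary_weight: float = 1.0) -> float:
    """Fraction of sampled tokens that come from the primary text."""
    primary = PRIMARY_TOKENS * primary_weight
    commentary = COMMENTARY_TOKENS * commentary_weight
    return primary / (primary + commentary)


def weight_for_share(target_share: float) -> float:
    """Per-token primary weight (commentary weight = 1) giving target_share."""
    return target_share * COMMENTARY_TOKENS / ((1.0 - target_share) * PRIMARY_TOKENS)


def sample_sources(primary_weight: float, k: int, seed: int = 0) -> list[str]:
    """Draw k token sources, weighting each corpus by size times weight."""
    rng = random.Random(seed)
    weights = [PRIMARY_TOKENS * primary_weight, COMMENTARY_TOKENS]
    return rng.choices(["mula", "bhashya"], weights=weights, k=k)


naive = effective_share(1.0)        # about 0.32: commentary dominates
needed = weight_for_share(0.80)     # about 8.5 per primary token
draws = sample_sources(needed, k=10_000)
observed = draws.count("mula") / len(draws)   # close to 0.80
```

In a real pipeline the multiplier would vary per passage (period, importance, content type); the single number here is only the average those weights would need to produce.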

The epistemological framing. We categorised this as an "Immutable Primary" model: one where the primary text is fixed and authoritative, and commentary exists to explain, not to amend. This is the shruti-bhashya model. Other traditions might use different patterns: a "Living Interpretation" model where commentary evolves the primary text (common in legal traditions), or an "Evolving Consensus" model where the boundary between primary and commentary is fluid (some scientific fields work this way).

The Result

A 19% improvement in perplexity over training on the primary text alone.
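For readers less familiar with the metric, the sketch below shows how a relative perplexity improvement is usually computed, assuming perplexity is the exponential of the mean per-token negative log-likelihood on a held-out set. The numbers are placeholders chosen to make the arithmetic visible, not our measurements, and the function names are hypothetical.

```python
import math


def perplexity(token_nlls: list[float]) -> float:
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def relative_improvement(baseline_ppl: float, new_ppl: float) -> float:
    """Fractional drop in perplexity relative to the baseline (lower is better)."""
    return (baseline_ppl - new_ppl) / baseline_ppl


# Placeholder values only: a baseline perplexity of 100 falling to 81.
baseline = perplexity([math.log(100.0)] * 4)
combined = perplexity([math.log(81.0)] * 4)
gain = relative_improvement(baseline, combined)   # 0.19 up to rounding
```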

The commentary helped. It provided pedagogical scaffolding: explanations of concepts, cross-references between works, and applied examples, the kind of material the primary text assumes rather than states. The model trained on both sources handles Sri Aurobindo's vocabulary and conceptual framework better than the model trained only on his own writing.

But the voice stayed his. The provenance tokens and weighted sampling kept the primary text as the gravitational centre.

Where This Applies

The more I think about this, the more places I see the same pattern:

Religious and philosophical AI. Any project working with sacred or authoritative texts and their commentary traditions: shruti and bhashya, sutra and shastra. In each case, you want the model to understand through the lens of commentary but speak with the authority and precision of the primary text.

Legal AI. Statute and case law have fundamentally different authority. A model that conflates them, treating a judge's interpretation as equivalent to the law itself, makes a category error that could have real consequences.

Technical domains. The specification vs the tutorial. The API reference vs the blog post explaining the API. Stack Overflow answers vs the actual documentation. If you train a coding model on all of these equally, it will reproduce the tutorial voice (informal, sometimes imprecise) rather than the specification voice (formal, exact).

Historical research. Primary sources vs secondary analysis. A historian's interpretation of a letter is not the letter. Training a model on both without distinguishing them produces text that blends source and interpretation in ways a historian would find troubling.

The Default Is Wrong

The standard approach in NLP is to throw all your text into one pile and let the model sort it out. For most applications this works fine. A general-purpose chatbot rarely needs to distinguish between kinds of training data by their authority.

But for domains where textual authority matters, where the tradition itself makes careful distinctions about what is primary and what is derived, the default destroys information that matters. If the hierarchy is real in the tradition, it probably should be real in the training.

We built this for a specific project with specific needs. But the framework is general. Provenance tokens, weighted sampling, and explicit epistemological framing could apply to any corpus where not all text carries equal authority.

The interesting question, which we haven't answered yet, is whether the model actually learns the distinction between mula and bhashya at a representational level: whether the provenance tokens create separate conceptual spaces inside the model, or whether they just function as crude volume controls. That's a probing experiment we want to run. But even as a crude volume control, the improvement is large enough to be worth the engineering.
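As a rough sketch of the shape such a probing experiment could take (we have not run it): fit a linear classifier on hidden-state vectors labeled mula or bhashya, then check held-out accuracy against a control. Everything below is hypothetical. The "hidden states" are synthetic Gaussian vectors standing in for real activations, and the names, dimensions, and hyperparameters are illustrative.

```python
import math
import random


def make_synthetic_states(n_per_class: int, dim: int, shift: float, seed: int = 0):
    """Stand-in for real hidden states: two Gaussian clouds offset by `shift`."""
    rng = random.Random(seed)
    data = []
    for label in (0, 1):  # 0 = mula, 1 = bhashya
        for _ in range(n_per_class):
            vec = [rng.gauss(shift * label, 1.0) for _ in range(dim)]
            data.append((vec, label))
    rng.shuffle(data)
    return data


def train_probe(data, dim: int, lr: float = 0.1, epochs: int = 50):
    """Fit a logistic-regression probe with plain per-example SGD."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for vec, label in data:
            z = sum(wi * xi for wi, xi in zip(w, vec)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - label
            w = [wi - lr * err * xi for wi, xi in zip(w, vec)]
            b -= lr * err
    return w, b


def accuracy(data, w, b) -> float:
    """Share of examples whose predicted class matches the label."""
    hits = 0
    for vec, label in data:
        z = sum(wi * xi for wi, xi in zip(w, vec)) + b
        hits += int((z > 0) == (label == 1))
    return hits / len(data)


def run_probe(shift: float, dim: int = 8, n_per_class: int = 300) -> float:
    """Train on the first 400 examples, report accuracy on the remaining 200."""
    data = make_synthetic_states(n_per_class, dim, shift)
    train, held_out = data[:400], data[400:]
    w, b = train_probe(train, dim)
    return accuracy(held_out, w, b)


separable = run_probe(shift=2.0)   # classes differ: probe should do well
control = run_probe(shift=0.0)     # classes identical: expect near chance
```

With real activations, held-out accuracy well above the control would suggest the distinction is linearly readable from the representations; accuracy near chance would be consistent with the volume-control reading. Positions adjacent to the tags themselves would need to be excluded or handled separately, since the tag is trivially informative there.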