
Building Madhuram-Translate: From Tokenizer to Translation

July 31, 2025

Experimental Model: Madhuram-Translate was an experimental research project. A better, more capable version will be made available in the near future.

Creating effective tokenizers for multiple languages, especially those with different scripts like English and Indic languages, presents unique challenges. This article documents the journey of building "Madhuram," a multilingual tokenizer supporting English, Hindi, Bengali, Kannada, Tamil, Punjabi, and Telugu, tracing the path from expensive failures to an efficient, cost-effective solution that powers state-of-the-art translation performance.

Multilingual tokenization is particularly challenging because different language families have vastly different characteristics. While English uses straightforward alphabetic characters, Indic languages employ complex scripts with conjuncts, combining marks, and context-dependent character variations. These differences mean that tokenization strategies effective for one language family often fail catastrophically for others.

Failure #1: ByteLevel Tokenization Falls Short

The first approach seemed logical: use ByteLevel tokenization, which works well for English by treating text as raw bytes. However, this approach proved disastrous for Indic languages.

The Problem: Indic scripts have complex character compositions with combining marks, conjuncts, and multi-byte UTF-8 representations. ByteLevel tokenization fragments these meaningful linguistic units into meaningless byte sequences, destroying the semantic structure that's crucial for these languages.

Result: Poor tokenization quality with high fertility rates (too many tokens per word) and loss of linguistic meaning.
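The byte-level blow-up is easy to see in plain Python: ASCII characters occupy one byte each in UTF-8, while Devanagari code points occupy three, so a byte-level tokenizer starts from three times as many symbols per character before it has learned any merges, and it splits single graphemes across multiple bytes:

```python
# ASCII text: one byte per character, so byte-level tokenization
# starts from one symbol per character.
english = "language"
assert len(english) == len(english.encode("utf-8")) == 8

# Devanagari: every code point (including combining vowel signs such
# as the aa-matra) takes 3 bytes in UTF-8, tripling the starting
# sequence length before any BPE merges are applied.
hindi = "भाषा"  # "language" in Hindi: 4 code points, 2 of them combining marks
assert len(hindi) == 4
assert len(hindi.encode("utf-8")) == 12
```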

Failure #2: The Data Deluge Disaster

Learning from the ByteLevel failure, the second attempt took a different approach: throw more data at the problem.

The Approach:

This failure demonstrates the "data fallacy": the assumption that poor model performance can always be solved by adding more training data. Key Insight: More data doesn't automatically solve tokenization quality issues.

Success #1: Small-Scale Breakthrough

Frustrated with expensive failures, the third attempt took a minimalist approach.

The Pivot:

Results:

The Revelation: Quality data curation and appropriate vocabulary sizing matter more than brute-force scaling.

Success #2: The Madhuram Tokenizer

Building on the small-scale success, the final iteration expanded thoughtfully.

Technical Specifications:

Data Engineering Strategy

The key was intelligent data curation rather than volume. The filtering process removed low-quality text, ensuring each training sample contributed meaningful linguistic information.
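The article does not publish Madhuram's actual filtering rules, but a line-level quality filter of this kind can be sketched as below; the thresholds and heuristics are illustrative assumptions, not the values used in training:

```python
import unicodedata

def keep(line, min_chars=20, max_nonletter_ratio=0.3):
    """Hypothetical quality filter: drop very short lines and lines
    dominated by digits, punctuation, or markup. Thresholds are
    illustrative, not Madhuram's actual settings."""
    line = line.strip()
    if len(line) < min_chars:
        return False

    def is_wordchar(ch):
        # Count combining marks (Unicode category Mn) as letters so
        # Indic matras, anusvaras, and viramas don't penalize a line.
        return ch.isalpha() or unicodedata.category(ch) == "Mn"

    nonletter = sum(1 for ch in line if not (is_wordchar(ch) or ch.isspace()))
    return nonletter / len(line) <= max_nonletter_ratio
```

A clean Hindi or English sentence passes, while short fragments and markup-heavy lines are dropped before they can pollute the BPE merge statistics.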

Language Balancing

Instead of equal representation, languages were rebalanced based on tokenization complexity. Bengali received a 6x multiplier due to its complex script, while English received only a 2x multiplier.
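One simple way to realize such multipliers is to upsample each language's corpus by an integer factor before tokenizer training. The 6x (Bengali) and 2x (English) factors come from the article; the other factors below, and the helper itself, are illustrative assumptions rather than Madhuram's actual pipeline:

```python
def build_training_mix(corpora, multipliers, default=1):
    """Upsample each language's lines by its multiplier.
    corpora: {lang_code: [lines]} -> one mixed list of lines."""
    mixed = []
    for lang, lines in corpora.items():
        mixed.extend(lines * multipliers.get(lang, default))
    return mixed

# 'bn' and 'en' factors are from the article; the rest are placeholders.
multipliers = {"bn": 6, "en": 2, "hi": 4, "ta": 5}
```

For example, one English line and one Bengali line under these multipliers yield 2 + 6 = 8 lines in the mixed corpus, shifting merge statistics toward the harder script.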

Comprehensive Performance Comparison

Madhuram's performance was evaluated against three established multilingual tokenizers: Gemma-3 27B, TWO AI's SUTRA, and Sarvam's Sarvam-1.

Fertility Analysis (Tokens per Word)

| Language | Madhuram | SUTRA | Sarvam-1 | Gemma-3 |
|----------|----------|-------|----------|---------|
| English  | 1.35     | 1.14  | 1.43     | 1.28    |
| Hindi    | 1.47     | 1.46  | 1.40     | 1.43    |
| Punjabi  | 1.55     | 1.25  | 1.68     | 2.87    |
| Bengali  | 1.71     | 1.85  | 2.07     | 1.72    |
| Telugu   | 2.09     | 2.23  | 2.14     | 2.88    |
| Tamil    | 2.16     | 2.28  | 2.17     | 2.42    |
| Kannada  | 2.24     | 2.47  | 2.37     | 3.33    |

Table 1: Fertility Rate comparison on FLORES dataset.
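Fertility here is the average number of subword tokens produced per whitespace-delimited word, so lower is better (1.0 would mean one token per word). A minimal way to compute it, assuming a tokenizer exposed as a function from text to a token list:

```python
def fertility(tokenize, sentences):
    """Tokens per word over a corpus: total subword tokens divided by
    total whitespace-delimited words. Lower is better."""
    total_tokens = sum(len(tokenize(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words

# A plain whitespace tokenizer gives fertility exactly 1.0; any
# tokenizer that splits words into subwords pushes it above 1.0.
print(fertility(str.split, ["two words", "three more words"]))  # 1.0
```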

Key Findings:

Madhuram-Translate: Translation Performance

Building on our tokenizer, we developed Madhuram-Translate, a specialized translation model for English-to-Indic language translation.

Translation Performance on FLORES Dev

Evaluated using ChrF++ metric:

| Language Pair | Gemma-2-2B | LLaMA-3.2-3B | LLaMA-3.1-8B | Sarvam-1 | Madhuram |
|---------------|------------|--------------|--------------|----------|----------|
| en-bn         | 29.91      | 30.6         | 37.24        | 41.0     | 40.28    |
| en-hi         | 44.81      | 38.48        | 44.85        | 37.52    | 48.56    |
| en-kn         | 23.21      | 26.81        | 33.8         | 41.54    | 42.49    |
| en-pa         | 21.24      | 22.78        | 29.81        | 39.53    | 42.99    |
| en-ta         | 32.74      | 27.4         | 35.3         | 44.02    | 44.58    |
| en-te         | 26.05      | 24.58        | 32.1         | 45.76    | 44.47    |
| Average       | 29.66      | 28.44        | 35.52        | 41.56    | 43.86    |

Table 2: Translation performance on FLORES Dev dataset using ChrF++ metric.
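ChrF++ scores character n-gram overlap between hypothesis and reference (plus word n-grams, which supply the "++"), which makes it well suited to morphologically rich Indic languages where exact word matches are rare. A simplified, character-only chrF sketch is below; real evaluations should use a standard implementation such as sacreBLEU's:

```python
from collections import Counter

def char_ngrams(text, n):
    # chrF computes character n-grams with whitespace removed.
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified chrF: average F-beta over character n-gram orders
    1..max_order (no word n-grams, unlike full ChrF++)."""
    scores = []
    for n in range(1, max_order + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0
```

An identical hypothesis and reference score 100, and partially overlapping strings earn partial credit at the character level even when no whole word matches.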

Key Results:

Cost Analysis

The journey from failure to success shows dramatic improvements:

| Attempt    | Data Size | Compute  | Time     | Result           |
|------------|-----------|----------|----------|------------------|
| Failure #2 | 170 GB    | 64 vCPUs | 11 hours | Poor quality     |
| Success #1 | <500 MB   | 4 vCPUs  | 5 min    | Good quality     |
| Success #2 | <950 MB   | 4 vCPUs  | 22 min   | Superior quality |

Table 3: Development iteration comparison.

Cost reduction: 470x cheaper than the failed attempt while achieving superior quality.
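The headline figure checks out if cost is read as vCPU-hours from Table 3 (an assumption on our part; the article does not state its cost basis):

```python
# vCPU-hours for each run, taken from Table 3.
failure_2 = 64 * 11        # 704 vCPU-hours
success_2 = 4 * (22 / 60)  # ~1.47 vCPU-hours

print(failure_2 / success_2)  # ~480x, in line with the ~470x claim
```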

Key Lessons Learned

1. Quality Over Quantity

Aggressive data filtering and curation produced better results than massive, unfiltered datasets.

2. Language-Aware Balancing

Understanding each language's tokenization complexity enables better training balance.

3. Downstream Impact

Superior tokenization directly translates to better downstream performance, as demonstrated by Madhuram-Translate's exceptional translation quality.

Conclusion

Building effective multilingual tokenizers doesn't require massive resources or datasets. The Madhuram tokenizer demonstrates that thoughtful engineering, quality data curation, and iterative development can produce superior results compared to established tokenizers at a fraction of the cost.

The key insight: successful multilingual NLP isn't about having the most data or compute—it's about understanding your languages, curating quality data, and engineering solutions that respect linguistic diversity.


For collaboration opportunities or technical questions, please reach out to our team at [email protected].
