Experimental Model: Madhuram-Translate was an experimental research project. A better, more capable version will be made available in the near future.
Creating effective tokenizers for multiple languages, especially those with different scripts like English and the Indic languages, presents unique challenges. This article documents the journey of building "Madhuram," a multilingual tokenizer supporting English, Hindi, Bengali, Kannada, Tamil, Punjabi, and Telugu, tracing the path from expensive failures to an efficient, cost-effective solution that powers state-of-the-art translation performance.
Multilingual tokenization is particularly challenging because different language families have vastly different characteristics. While English uses straightforward alphabetic characters, Indic languages employ complex scripts with conjuncts, combining marks, and context-dependent character variations. These differences mean that tokenization strategies effective for one language family often fail catastrophically for others.
Failure #1: ByteLevel Tokenization Falls Short
The first approach seemed logical: use ByteLevel tokenization, which works well for English by treating text as raw bytes. However, this approach proved disastrous for Indic languages.
The Problem: Indic scripts have complex character compositions with combining marks, conjuncts, and multi-byte UTF-8 representations. ByteLevel tokenization fragments these meaningful linguistic units into meaningless byte sequences, destroying the semantic structure that's crucial for these languages.
Result: Poor tokenization quality with high fertility rates (too many tokens per word) and loss of linguistic meaning.
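The fragmentation is easy to see in plain Python: a Devanagari word occupies far more bytes than characters, so a byte-level tokenizer starts from several meaningless byte tokens per letter. A minimal illustration (not Madhuram code):

```python
# Compare character count vs. UTF-8 byte count, the unit a
# ByteLevel tokenizer actually operates on.
word_en = "language"
word_hi = "भाषा"  # "language" in Hindi (Devanagari script)

print(len(word_en), len(word_en.encode("utf-8")))  # 8 chars -> 8 bytes
print(len(word_hi), len(word_hi.encode("utf-8")))  # 4 chars -> 12 bytes
```

Each Devanagari code point takes three UTF-8 bytes, and combining vowel signs like ा are separate code points, so a byte-level vocabulary must spend many merges just to reassemble single letters before it can learn anything word-like.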
Failure #2: The Data Deluge Disaster
Learning from the ByteLevel failure, the second attempt took a different approach: throw more data at the problem.
The Approach:
- Data: 170 GB of multilingual data
- Tokenizer: Vanilla BPE tokenizer
- Massive Compute Resources: 64 vCPUs
- Training Time: 10 hours for data preparation + 1 hour for tokenizer training
Despite the scale, tokenization quality remained poor. This failure demonstrates the "data fallacy": the assumption that weak model performance can always be fixed by adding more training data. Key Insight: More data doesn't automatically solve tokenization quality issues.
Success #1: Small-Scale Breakthrough
After two expensive failures, the third attempt took a minimalist approach.
The Pivot:
- Languages Covered: Reduced to just English and Hindi
- Tiny vocabulary: 15,000 tokens
- Dataset: Small, curated public dataset
- Modest hardware: 4 vCPUs
Results:
- Quality: Significant quality improvement
- Training Time: Under 5 minutes
- Cost: 16x reduction compared to the previous attempt
The Revelation: Quality data curation and appropriate vocabulary sizing matter more than brute-force scaling.
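The merge loop at the heart of BPE training is small enough to sketch in full. The toy below (classic Sennrich-style BPE on a four-word corpus, not the actual Madhuram training code) also shows why vocabulary size is just a knob: training simply stops after a chosen number of merges.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Fuse the chosen pair into a single symbol everywhere it occurs."""
    pattern, fused = " ".join(pair), "".join(pair)
    return {word.replace(pattern, fused): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(10):  # vocabulary size = base symbols + number of merges
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    merges.append(best)
    corpus = merge_pair(best, corpus)

print(merges[:3])  # most frequent pairs get merged first
```

The first learned merge is `('e', 's')`, the most frequent adjacent pair in this corpus; real trainers differ mainly in scale, pre-tokenization, and tie-breaking.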
Success #2: The Madhuram Tokenizer
Building on the small-scale success, the final iteration expanded thoughtfully.
Technical Specifications:
- Languages: 7 languages (English, Hindi, Bengali, Kannada, Tamil, Punjabi, Telugu)
- Vocabulary Size: 65,536 tokens
- Training Data: 1.2M carefully curated samples
- Data Size: <950 MB after aggressive cleaning
- Hardware: 4 vCPUs only
- Training Time: <20 minutes data prep + <2 minutes tokenizer training
Data Engineering Strategy
The key was intelligent data curation rather than volume. The filtering process removed low-quality text, ensuring each training sample contributed meaningful linguistic information.
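A quality filter of this kind can be as simple as a few heuristics. The thresholds and rules below are illustrative assumptions, not Madhuram's actual filtering pipeline:

```python
import re

def keep_sample(text, min_chars=20, max_digit_ratio=0.3):
    """Toy quality filter: drop short, digit-heavy, or markup-laden samples."""
    text = text.strip()
    if len(text) < min_chars:
        return False  # too short to carry useful linguistic signal
    digits = sum(ch.isdigit() for ch in text)
    if digits / len(text) > max_digit_ratio:
        return False  # likely tables, IDs, or boilerplate
    if re.search(r"<[^>]+>", text):
        return False  # leftover HTML markup
    return True

samples = [
    "ok",
    "यह एक साफ़-सुथरा हिंदी वाक्य है जो फ़िल्टर पास करता है।",
    "<div>markup</div> junk 1234567890",
]
print([keep_sample(s) for s in samples])  # -> [False, True, False]
```

Production pipelines typically add language identification and deduplication on top of heuristics like these.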
Language Balancing
Instead of equal representation, languages were rebalanced based on tokenization complexity. Bengali received a 6x multiplier due to its complex script, while English received only a 2x multiplier.
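Concretely, rebalancing can be done by upsampling each language's corpus by its multiplier before training. Only the 6x Bengali and 2x English figures come from this work; the other multipliers below are hypothetical placeholders:

```python
# Per-language upsampling multipliers: bn and en from the article,
# hi and ta are hypothetical values for illustration only.
multipliers = {"bn": 6, "hi": 4, "ta": 4, "en": 2}
base_samples = {"bn": 100_000, "hi": 100_000, "ta": 100_000, "en": 100_000}

upsampled = {lang: n * multipliers[lang] for lang, n in base_samples.items()}
total = sum(upsampled.values())
shares = {lang: round(n / total, 3) for lang, n in upsampled.items()}

print(shares)  # Bengali ends up with the largest share of training text
```

The effect is that scripts whose words need more merges to compress well see proportionally more text during tokenizer training.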
Comprehensive Performance Comparison
Madhuram's performance was evaluated against the tokenizers of three established multilingual models: Google's Gemma-3 27B, TWO AI's SUTRA, and Sarvam's Sarvam-1.
Fertility Analysis (Tokens per Word)
| Language | Madhuram | SUTRA | Sarvam-1 | Gemma-3 |
|---|---|---|---|---|
| English | 1.35 | 1.14 | 1.43 | 1.28 |
| Hindi | 1.47 | 1.46 | 1.40 | 1.43 |
| Punjabi | 1.55 | 1.25 | 1.68 | 2.87 |
| Bengali | 1.71 | 1.85 | 2.07 | 1.72 |
| Telugu | 2.09 | 2.23 | 2.14 | 2.88 |
| Tamil | 2.16 | 2.28 | 2.17 | 2.42 |
| Kannada | 2.24 | 2.47 | 2.37 | 3.33 |
Table 1: Fertility Rate comparison on FLORES dataset.
Key Findings:
- Madhuram achieves the best overall fertility (1.796 average)
- Most consistent performance with lowest standard deviation (0.364)
- Exceptional performance on Kannada: 33% improvement over Gemma-3
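Fertility itself is straightforward to measure: total tokens emitted divided by total whitespace-delimited words over a corpus. A minimal sketch, with a stand-in tokenizer since the real one isn't reproduced here:

```python
def fertility(tokenize, sentences):
    """Average number of tokens emitted per whitespace-delimited word."""
    tokens = sum(len(tokenize(s)) for s in sentences)
    words = sum(len(s.split()) for s in sentences)
    return tokens / words

# Stand-in "tokenizer" that chops text into 2-character chunks,
# purely to make the arithmetic concrete.
toy_tok = lambda s: [s[i:i + 2] for i in range(0, len(s), 2)]

print(round(fertility(toy_tok, ["hello world"]), 2))  # -> 3.0
```

Lower fertility means fewer tokens per word, which translates directly into shorter sequences and cheaper training and inference.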
Madhuram-Translate: Translation Performance
Building on our tokenizer, we developed Madhuram-Translate, a specialized translation model for English-to-Indic language translation.
Translation Performance on FLORES Dev
Evaluated using the ChrF++ metric:
| Language Pair | Gemma-2-2B | LLaMA-3.2-3B | LLaMA-3.1-8B | Sarvam-1 | Madhuram |
|---|---|---|---|---|---|
| en-bn | 29.91 | 30.6 | 37.24 | 41.0 | 40.28 |
| en-hi | 44.81 | 38.48 | 44.85 | 37.52 | 48.56 |
| en-kn | 23.21 | 26.81 | 33.8 | 41.54 | 42.49 |
| en-pa | 21.24 | 22.78 | 29.81 | 39.53 | 42.99 |
| en-ta | 32.74 | 27.4 | 35.3 | 44.02 | 44.58 |
| en-te | 26.05 | 24.58 | 32.1 | 45.76 | 44.47 |
| Average | 29.66 | 28.44 | 35.52 | 41.56 | 43.86 |
Table 2: Translation performance on FLORES Dev dataset using ChrF++ metric.
Key Results:
- Overall Performance: Madhuram-Translate achieves an average ChrF++ score of 43.86
- Consistent Excellence: Best or near-best scores on every pair, leading on four of the six language pairs
- Tokenization Advantage: Superior performance correlates with efficient tokenization
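ChrF++ is a character n-gram F-score with word 1-2 grams added on top. The simplified, character-only sketch below conveys the idea; real evaluations should use a standard implementation such as sacreBLEU rather than this toy:

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-gram counts, ignoring whitespace."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hyp, ref, max_n=6, beta=2.0):
    """Simplified chrF: average character n-gram precision/recall (n=1..6),
    combined with F-beta (beta=2 weights recall higher). Real chrF++ also
    mixes in word 1-2 gram scores."""
    precs, recs = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if sum(h.values()) == 0 or sum(r.values()) == 0:
            continue  # sentence shorter than n characters
        overlap = sum((h & r).values())
        precs.append(overlap / sum(h.values()))
        recs.append(overlap / sum(r.values()))
    p, r = sum(precs) / len(precs), sum(recs) / len(recs)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)

print(chrf("hello", "hello"))  # -> 100.0 for an exact match
```

Because it scores character n-grams rather than whole words, ChrF++ is well suited to morphologically rich Indic languages, where word-level metrics penalize near-miss inflections too harshly.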
Cost Analysis
The journey from failure to success shows dramatic improvements:
| Attempt | Data Size | Compute | Time | Result |
|---|---|---|---|---|
| Failure #2 | 170 GB | 64 vCPUs | 11 hours | Poor quality |
| Success #1 | <500 MB | 4 vCPUs | 5 min | Good quality |
| Success #2 | <950 MB | 4 vCPUs | 22 min | Superior quality |
Table 3: Development iteration comparison.
Cost reduction: roughly 470x cheaper than the failed large-scale attempt (Failure #2) while achieving superior quality.
Key Lessons Learned
1. Quality Over Quantity
Aggressive data filtering and curation produced better results than massive, unfiltered datasets.
2. Language-Aware Balancing
Understanding each language's tokenization complexity enables better training balance.
3. Downstream Impact
Superior tokenization directly translates to better downstream performance, as demonstrated by Madhuram-Translate's exceptional translation quality.
Conclusion
Building effective multilingual tokenizers doesn't require massive resources or datasets. The Madhuram tokenizer demonstrates that thoughtful engineering, quality data curation, and iterative development can produce superior results compared to established tokenizers at a fraction of the cost.
The key insight: successful multilingual NLP isn't about having the most data or compute—it's about understanding your languages, curating quality data, and engineering solutions that respect linguistic diversity.
For collaboration opportunities or technical questions, please reach out to our team at [email protected].