Experimental Model: Madhuram-Translate was an experimental research project. A better, more capable version will be made available in the near future.
Creating effective tokenizers for multiple languages, especially those with different scripts like English and the Indic languages, presents unique challenges. This article documents the journey of building "Madhuram," a multilingual tokenizer supporting English, Hindi, Bengali, Kannada, Tamil, Punjabi, and Telugu, tracing the path from expensive failures to an efficient, cost-effective solution that powers state-of-the-art translation performance.
Multilingual tokenization is particularly challenging because different language families have vastly different characteristics. While English uses straightforward alphabetic characters, Indic languages employ complex scripts with conjuncts, combining marks, and context-dependent character variations. These differences mean that tokenization strategies effective for one language family often fail catastrophically for others.
Failure #1: ByteLevel Tokenization Falls Short
The first approach seemed logical: use ByteLevel tokenization, which works well for English by treating text as raw bytes. However, this approach proved disastrous for Indic languages.
The Problem: Indic scripts have complex character compositions with combining marks, conjuncts, and multi-byte UTF-8 representations. ByteLevel tokenization fragments these meaningful linguistic units into meaningless byte sequences, destroying the semantic structure that's crucial for these languages.
Result: Poor tokenization quality with high fertility rates (too many tokens per word) and loss of linguistic meaning.
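The fragmentation is easy to see in plain Python: a Devanagari word occupies far more bytes than characters, so a byte-level tokenizer starts from several meaningless byte tokens per letter. A minimal illustration (not Madhuram code):

```python
# Compare character count vs. UTF-8 byte count, the unit a
# ByteLevel tokenizer actually operates on.
word_en = "language"
word_hi = "भाषा"  # "language" in Hindi (Devanagari script)

print(len(word_en), len(word_en.encode("utf-8")))  # 8 chars -> 8 bytes
print(len(word_hi), len(word_hi.encode("utf-8")))  # 4 chars -> 12 bytes
```

Each Devanagari code point takes three UTF-8 bytes, and combining vowel signs like ा are separate code points, so a byte-level vocabulary must spend many merges just to reassemble single letters before it can learn anything word-like.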
Failure #2: The Data Deluge Disaster
Learning from the ByteLevel failure, the second attempt took a different approach: throw more data at the problem.
The Approach:
- Data: 170 GB of multilingual data
- Tokenizer: Vanilla BPE tokenizer
- Massive Compute Resources: 64 vCPUs
- Training Time: 10 hours for data preparation + 1 hour for tokenizer training
Despite the scale, tokenization quality remained poor. This failure demonstrates the "data fallacy": the assumption that weak model performance can always be fixed by adding more training data. Key Insight: More data doesn't automatically solve tokenization quality issues.
Success #1: Small-Scale Breakthrough
After two expensive failures, the third attempt took a minimalist approach.
The Pivot:
- Languages Covered: Reduced to just English and Hindi
- Tiny vocabulary: 15,000 tokens
- Dataset: Small, curated public dataset
- Modest hardware: 4 vCPUs
Results:
- Quality: Significant quality improvement
- Training Time: Under 5 minutes
- Cost: 16x reduction compared to the previous attempt
The Revelation: Quality data curation and appropriate vocabulary sizing matter more than brute-force scaling.
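The merge loop at the heart of BPE training is small enough to sketch in full. The toy below (classic Sennrich-style BPE on a four-word corpus, not the actual Madhuram training code) also shows why vocabulary size is just a knob: training simply stops after a chosen number of merges.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Fuse the chosen pair into a single symbol everywhere it occurs."""
    pattern, fused = " ".join(pair), "".join(pair)
    return {word.replace(pattern, fused): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(10):  # vocabulary size = base symbols + number of merges
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    merges.append(best)
    corpus = merge_pair(best, corpus)

print(merges[:3])  # most frequent pairs get merged first
```

The first learned merge is `('e', 's')`, the most frequent adjacent pair in this corpus; real trainers differ mainly in scale, pre-tokenization, and tie-breaking.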
Success #2: The Madhuram Tokenizer
Building on the small-scale success, the final iteration expanded thoughtfully.
Technical Specifications:
- Languages: 7 languages (English, Hindi, Bengali, Kannada, Tamil, Punjabi, Telugu)
- Vocabulary Size: 65,536 tokens
- Training Data: 1.2M carefully curated samples
- Data Size: <950 MB after aggressive cleaning
- Hardware: 4 vCPUs only
- Training Time: <20 minutes data prep + <2 minutes tokenizer training
Data Engineering Strategy
The key was intelligent data curation rather than volume. The filtering process removed low-quality text, ensuring each training sample contributed meaningful linguistic information.
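A quality filter of this kind can be as simple as a few heuristics. The thresholds and rules below are illustrative assumptions, not Madhuram's actual filtering pipeline:

```python
import re

def keep_sample(text, min_chars=20, max_digit_ratio=0.3):
    """Toy quality filter: drop short, digit-heavy, or markup-laden samples."""
    text = text.strip()
    if len(text) < min_chars:
        return False  # too short to carry useful linguistic signal
    digits = sum(ch.isdigit() for ch in text)
    if digits / len(text) > max_digit_ratio:
        return False  # likely tables, IDs, or boilerplate
    if re.search(r"<[^>]+>", text):
        return False  # leftover HTML markup
    return True

samples = [
    "ok",
    "यह एक साफ़-सुथरा हिंदी वाक्य है जो फ़िल्टर पास करता है।",
    "<div>markup</div> junk 1234567890",
]
print([keep_sample(s) for s in samples])  # -> [False, True, False]
```

Production pipelines typically add language identification and deduplication on top of heuristics like these.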
Language Balancing
Instead of equal representation, languages were rebalanced based on tokenization complexity. Bengali received a 6x multiplier due to its complex script, while English received only a 2x multiplier.
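Concretely, rebalancing can be done by upsampling each language's corpus by its multiplier before training. Only the 6x Bengali and 2x English figures come from this work; the other multipliers below are hypothetical placeholders:

```python
# Per-language upsampling multipliers: bn and en from the article,
# hi and ta are hypothetical values for illustration only.
multipliers = {"bn": 6, "hi": 4, "ta": 4, "en": 2}
base_samples = {"bn": 100_000, "hi": 100_000, "ta": 100_000, "en": 100_000}

upsampled = {lang: n * multipliers[lang] for lang, n in base_samples.items()}
total = sum(upsampled.values())
shares = {lang: round(n / total, 3) for lang, n in upsampled.items()}

print(shares)  # Bengali ends up with the largest share of training text
```

The effect is that scripts whose words need more merges to compress well see proportionally more text during tokenizer training.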
Comprehensive Performance Comparison
Madhuram's performance was evaluated against the tokenizers of three established multilingual models: Google's Gemma-3 27B, TWO AI's SUTRA, and Sarvam's Sarvam-1.
Fertility Analysis (Tokens per Word)
| Language | Madhuram | SUTRA | Sarvam-1 | Gemma-3 |
|---|---|---|---|---|
| English | 1.35 | 1.14 | 1.43 | 1.28 |
| Hindi | 1.47 | 1.46 | 1.40 | 1.43 |
| Punjabi | 1.55 | 1.25 | 1.68 | 2.87 |
| Bengali | 1.71 | 1.85 | 2.07 | 1.72 |
| Telugu | 2.09 | 2.23 | 2.14 | 2.88 |
| Tamil | 2.16 | 2.28 | 2.17 | 2.42 |
| Kannada | 2.24 | 2.47 | 2.37 | 3.33 |
Table 1: Fertility Rate comparison on FLORES dataset.
Key Findings:
- Madhuram achieves the best overall fertility (1.796 average)
- Most consistent performance with lowest standard deviation (0.364)
- Exceptional performance on Kannada: 33% improvement over Gemma-3
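Fertility itself is straightforward to measure: total tokens emitted divided by total whitespace-delimited words over a corpus. A minimal sketch, with a stand-in tokenizer since the real one isn't reproduced here:

```python
def fertility(tokenize, sentences):
    """Average number of tokens emitted per whitespace-delimited word."""
    tokens = sum(len(tokenize(s)) for s in sentences)
    words = sum(len(s.split()) for s in sentences)
    return tokens / words

# Stand-in "tokenizer" that chops text into 2-character chunks,
# purely to make the arithmetic concrete.
toy_tok = lambda s: [s[i:i + 2] for i in range(0, len(s), 2)]

print(round(fertility(toy_tok, ["hello world"]), 2))  # -> 3.0
```

Lower fertility means fewer tokens per word, which translates directly into shorter sequences and cheaper training and inference.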
Madhuram-Translate: Translation Performance
Building on our tokenizer, we developed Madhuram-Translate, a specialized translation model for English-to-Indic language translation.
Translation Performance on FLORES Dev
Evaluated using the ChrF++ metric:
| Language Pair | Gemma-2-2B | LLaMA-3.2-3B | LLaMA-3.1-8B | Sarvam-1 | Madhuram |
|---|---|---|---|---|---|
| en-bn | 29.91 | 30.6 | 37.24 | 41.0 | 40.28 |
| en-hi | 44.81 | 38.48 | 44.85 | 37.52 | 48.56 |
| en-kn | 23.21 | 26.81 | 33.8 | 41.54 | 42.49 |
| en-pa | 21.24 | 22.78 | 29.81 | 39.53 | 42.99 |
| en-ta | 32.74 | 27.4 | 35.3 | 44.02 | 44.58 |
| en-te | 26.05 | 24.58 | 32.1 | 45.76 | 44.47 |
| Average | 29.66 | 28.44 | 35.52 | 41.56 | 43.86 |
Table 2: Translation performance on FLORES Dev dataset using ChrF++ metric.
Key Results:
- Overall Performance: Madhuram-Translate achieves an average ChrF++ score of 43.86
- Consistent Excellence: Best or near-best scores on every pair, leading on four of the six language pairs
- Tokenization Advantage: Superior performance correlates with efficient tokenization
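ChrF++ is a character n-gram F-score with word 1-2 grams added on top. The simplified, character-only sketch below conveys the idea; real evaluations should use a standard implementation such as sacreBLEU rather than this toy:

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-gram counts, ignoring whitespace."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hyp, ref, max_n=6, beta=2.0):
    """Simplified chrF: average character n-gram precision/recall (n=1..6),
    combined with F-beta (beta=2 weights recall higher). Real chrF++ also
    mixes in word 1-2 gram scores."""
    precs, recs = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if sum(h.values()) == 0 or sum(r.values()) == 0:
            continue  # sentence shorter than n characters
        overlap = sum((h & r).values())
        precs.append(overlap / sum(h.values()))
        recs.append(overlap / sum(r.values()))
    p, r = sum(precs) / len(precs), sum(recs) / len(recs)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)

print(chrf("hello", "hello"))  # -> 100.0 for an exact match
```

Because it scores character n-grams rather than whole words, ChrF++ is well suited to morphologically rich Indic languages, where word-level metrics penalize near-miss inflections too harshly.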
Cost Analysis
The journey from failure to success shows dramatic improvements:
| Attempt | Data Size | Compute | Time | Result |
|---|---|---|---|---|
| Failure #2 | 170 GB | 64 vCPUs | 11 hours | Poor quality |
| Success #1 | <500 MB | 4 vCPUs | 5 min | Good quality |
| Success #2 | <950 MB | 4 vCPUs | 22 min | Superior quality |
Table 3: Development iteration comparison.
Cost reduction: roughly 470x cheaper than the failed large-scale attempt (Failure #2) while achieving superior quality.
Key Lessons Learned
1. Quality Over Quantity
Aggressive data filtering and curation produced better results than massive, unfiltered datasets.
2. Language-Aware Balancing
Understanding each language's tokenization complexity enables better training balance.
3. Downstream Impact
Superior tokenization directly translates to better downstream performance, as demonstrated by Madhuram-Translate's exceptional translation quality.
Conclusion
Building effective multilingual tokenizers doesn't require massive resources or datasets. The Madhuram tokenizer demonstrates that thoughtful engineering, quality data curation, and iterative development can produce superior results compared to established tokenizers at a fraction of the cost.
The key insight: successful multilingual NLP isn't about having the most data or compute—it's about understanding your languages, curating quality data, and engineering solutions that respect linguistic diversity.
For collaboration opportunities or technical questions, please reach out to our team at [email protected].