Performance vs. Scale: Our Benchmark Results

June 15, 2025
8 min read

At Maruth Labs, we've been on a mission to make advanced language models accessible to everyone, everywhere, including places with limited connectivity and computing resources. Today, we're pulling back the curtain on Madhuram, our compact 74M-parameter model that delivers impressive performance across several commonsense benchmark tasks.

Madhuram is a 74-million-parameter model trained on 130 billion tokens, and it shows how rigorous research and careful engineering can rival much larger architectures while requiring only a fraction of the computational resources. Optimized for on-device inference, Madhuram delivers robust zero-shot performance across standard NLP benchmarks, outperforming GPT3-Small in several settings despite using fewer parameters. Our mission centers on efficiency and accessibility: high-performance AI should run even on mobile and wearable devices without compromising quality or speed.

Intelligent Design over Massive Scale

We believe that model efficiency starts with architecture. Rather than scaling parameter counts into the billions, Madhuram combines mathematical enhancements to the core architecture with streamlined transformer blocks to maximize representational power per parameter. This approach lets Madhuram achieve competitive accuracy on language modeling and classification tasks with dramatically lower resource consumption, reducing both inference latency and energy usage.
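
We aren't detailing Madhuram's internals in this post, but as a rough illustration of what a streamlined transformer block looks like, here is a minimal, generic pre-norm decoder block in PyTorch. The class name and dimensions are hypothetical stand-ins, not Madhuram's actual configuration:

```python
import torch
import torch.nn as nn

class CompactDecoderBlock(nn.Module):
    """A generic pre-norm transformer decoder block.

    Illustrative only: the widths below are hypothetical
    stand-ins, not Madhuram's actual configuration.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 1536):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor, causal_mask: torch.Tensor) -> torch.Tensor:
        # Pre-norm residual attention: stabilizes training in small
        # models without adding any parameters.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out
        # Pre-norm residual feed-forward path.
        return x + self.ff(self.norm2(x))

block = CompactDecoderBlock()
x = torch.randn(1, 16, 512)
# Boolean mask: True marks positions attention may NOT look at.
mask = torch.triu(torch.ones(16, 16, dtype=torch.bool), diagonal=1)
print(block(x, mask).shape)  # torch.Size([1, 16, 512])
```

Pre-norm residual paths like these are a common choice in compact models precisely because they buy stability for free, leaving the parameter budget for the layers that carry representational power.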

Benchmark Performance

We've benchmarked Madhuram (our beta release) against several leading compact models across multiple standard NLP tasks. Here's what our comprehensive analysis reveals:

Madhuram vs GPT3-Small

We evaluated Madhuram in zero-shot mode against GPT3-Small in both zero-shot and few-shot settings under identical configurations:
| Task | Madhuram (Zero-shot) | GPT3-Small (Zero-shot) | GPT3-Small (Few-shot) |
|---|---|---|---|
| ARC (Average) | 36.84% | 35.10% | 34.10% |
| Hellaswag | 33.95% | 33.70% | 33.50% |
| OBQA | 31.40% | 35.60% | 37.00% |
| WinoGrande | 51.78% | 52.00% | 51.30% |
| PIQA | 63.76% | 64.60% | 64.30% |
| BoolQ | 53.12% | 49.70% | 43.10% |
| RTE | 54.86% | 47.70% | 52.30% |
| SuperGLUE | 53.07% | 40.60% | 50.20% |
| WSC | 63.46% | 59.60% | 58.70% |
| Average | 49.14% | 46.51% | 47.17% |

Table 1: Performance comparison between Madhuram and GPT3-Small across various NLP tasks. Madhuram in zero-shot mode leads on ARC, Hellaswag, BoolQ, RTE, SuperGLUE, and WSC.
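
We haven't specified our exact harness here, but for readers who want to reproduce this style of zero-shot comparison, EleutherAI's lm-evaluation-harness covers all of the tasks in Table 1. The sketch below assumes a hypothetical Hugging Face checkpoint id; swap in whichever model you want to test:

```python
# Minimal sketch using EleutherAI's lm-evaluation-harness
# (pip install lm_eval). The checkpoint id is a hypothetical
# placeholder, not a published repository.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face causal LM backend
    model_args="pretrained=maruth-labs/madhuram-74m",  # hypothetical id
    tasks=["arc_easy", "arc_challenge", "hellaswag", "openbookqa",
           "winogrande", "piqa", "boolq", "rte", "wsc"],
    num_fewshot=0,  # zero-shot, matching the Madhuram column above
    batch_size=8,
)

# Print per-task metric dictionaries (accuracy, stderr, etc.).
for task, metrics in results["results"].items():
    print(task, metrics)
```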

Comparison with Other Compact Models

Madhuram's performance becomes even more impressive when compared to other models in the compact LLM space:

| Task | Madhuram-74M | MobileLLM-125M | SmolLM2-135M | GPT3-Small |
|---|---|---|---|---|
| ARC | 36.84% | 37.25% | 42.40% | 35.10% |
| Hellaswag | 33.95% | 39.50% | 41.20% | 33.70% |
| OBQA | 31.40% | 41.10% | 34.00% | 35.60% |
| PIQA | 63.76% | 65.70% | 69.40% | 64.60% |
| WinoGrande | 51.78% | 52.10% | 51.30% | 52.00% |
| BoolQ | 53.12% | 60.40% | 60.30% | 49.70% |
| SuperGLUE | 53.07% | 54.90% | 53.37% | 40.60% |
| RTE | 54.86% | 54.15% | 49.82% | 47.70% |
| WSC | 63.46% | 36.54% | 36.54% | 59.60% |
| Average | 49.14% | 49.07% | 48.59% | 46.51% |

Table 2: Performance comparison of Madhuram with other compact models across different NLP tasks. Madhuram posts the best average and the top scores on RTE and WSC despite being the smallest model in the comparison.
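
The Average rows in both tables are unweighted macro-averages of the nine per-task accuracies. A few lines of Python reproduce Madhuram's figure:

```python
# Reproduce the "Average" row of Table 2: an unweighted
# macro-average of Madhuram's nine per-task accuracies.
madhuram = {
    "ARC": 36.84, "Hellaswag": 33.95, "OBQA": 31.40,
    "PIQA": 63.76, "WinoGrande": 51.78, "BoolQ": 53.12,
    "SuperGLUE": 53.07, "RTE": 54.86, "WSC": 63.46,
}

average = sum(madhuram.values()) / len(madhuram)
print(f"{average:.2f}%")  # -> 49.14%
```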

Comparison with STAR-5

The emerging STAR-5 model from Liquid AI represents another important benchmark in the on-device LLM space:

| Task | Madhuram | STAR-5/Quality | Difference |
|---|---|---|---|
| ARC (Easy) | 42.90% | 39.10% | +3.80% |
| Hellaswag | 33.95% | 29.20% | +4.75% |
| WinoGrande | 51.78% | 52.10% | -0.32% |
| PIQA | 63.76% | 62.10% | +1.66% |
| SciQ | 69.90% | 72.70% | -2.80% |
| Average | 52.40% | 51.00% | +1.40% |

Table 3: Performance comparison between Madhuram and STAR-5 across shared benchmark tasks.

Key Takeaways: Beyond the Numbers

The performance metrics tell only part of the story. What makes Madhuram truly stand out is that it achieves these results under significant resource constraints.

These benchmarks reinforce our core thesis: thoughtful model design and optimization can replace brute-force scaling. While the industry continues to push toward ever-larger models requiring specialized hardware, Madhuram demonstrates that impressive capabilities can be delivered in resource-constrained environments when efficiency is prioritized from the ground up.

Potential Deployment and Applications

Because of its compact size and low latency, Madhuram is ideal for on-device deployment, from mobile phones to wearables, in exactly the settings where connectivity and compute are limited. The sketch below illustrates one common preparation step.
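
As one concrete illustration of preparing a small model for CPU-bound devices (a generic sketch, not our actual deployment pipeline), dynamic int8 quantization in PyTorch shrinks a model's linear layers with no retraining; the stand-in module below is a placeholder for a real checkpoint:

```python
import torch
import torch.nn as nn

# Illustrative stand-in module, not Madhuram's deployment pipeline.
# Dynamic int8 quantization stores Linear weights as int8 and
# quantizes activations on the fly, cutting memory use and CPU
# latency without any retraining.
model = nn.Sequential(nn.Linear(512, 1536), nn.GELU(), nn.Linear(1536, 512))
model.eval()

quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)  # Linear layers replaced by dynamically quantized versions
```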

What's Next for Madhuram?

We're continuing to improve Madhuram in several directions.

We believe that AI should be accessible to everyone, and Madhuram is a significant step toward that vision. By bringing powerful language AI to devices people already own, we're democratizing access to these transformative technologies.