At Maruth Labs, we've been on a mission to make advanced language models accessible to everyone, everywhere - including places with limited connectivity and computing resources. Today, we're pulling back the curtain on Madhuram, our compact 74M-parameter model that delivers impressive performance across several commonsense benchmark tasks.
Madhuram is a 74 million-parameter model trained on 130 billion tokens, and it shows how rigorous research and smart engineering can rival much larger architectures while requiring only a fraction of the computational resources. Optimized for on-device inference, Madhuram delivers robust zero-shot performance across standard NLP benchmarks, outperforming GPT3-Small in several settings despite using fewer parameters. This reflects Maruth Labs' core mission of efficiency and accessibility: high-performance AI should be available even on mobile and wearable devices without compromising quality or speed.
Intelligent Design over Massive Scale
We believe that model efficiency starts with architecture. Rather than scaling parameter counts into the billions, Madhuram leverages mathematical enhancements to the architecture and streamlined transformer blocks to maximize representational power per parameter. This approach lets Madhuram achieve competitive accuracy on language modeling and classification tasks with dramatically lower resource consumption, reducing both inference latency and energy usage.
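To give a sense of what a 74M-parameter budget looks like, here is a rough parameter-count estimator for a generic decoder-only transformer. The dimensions in the sketch are hypothetical illustrations, not Madhuram's actual configuration, which this post does not detail.

```python
# Rough parameter-count estimator for a generic decoder-only
# transformer. All dimensions below are hypothetical; this post does
# not publish Madhuram's actual configuration.

def transformer_params(vocab: int, d_model: int, n_layers: int, d_ff: int) -> int:
    embed = vocab * d_model            # token embeddings (often tied with the LM head)
    attn = 4 * d_model * d_model       # Q, K, V, and output projections per layer
    ffn = 2 * d_model * d_ff           # up- and down-projections per layer
    norms = 4 * d_model                # two layer norms (scale + bias) per layer
    return embed + n_layers * (attn + ffn + norms)

# One hypothetical configuration that lands near a 74M budget:
total = transformer_params(vocab=32_000, d_model=640, n_layers=12, d_ff=2048)
print(f"{total / 1e6:.1f}M parameters")  # ~71.6M with these settings
```

As the breakdown suggests, at this scale the embedding table is a large share of the total, which is one reason compact models reward careful architectural choices.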
Benchmark Performance
We've benchmarked Madhuram (our beta release) against several leading compact models across multiple standard NLP tasks. Here's what our comprehensive analysis reveals:
Madhuram vs GPT3-Small
We evaluated Madhuram in zero-shot mode against GPT3-Small in both zero-shot and few-shot settings under identical configurations:

Task | Madhuram (Zero-shot) | GPT3-Small (Zero-shot) | GPT3-Small (Few-shot)
---|---|---|---
ARC (Average) | 36.84% | 35.10% | 34.10%
Hellaswag | 33.95% | 33.70% | 33.50%
OBQA | 31.40% | 35.60% | 37.00%
WinoGrande | 51.78% | 52.00% | 51.30%
PIQA | 63.76% | 64.60% | 64.30%
BoolQ | 53.12% | 49.70% | 43.10%
RTE | 54.86% | 47.70% | 52.30%
SuperGLUE | 53.07% | 40.60% | 50.20%
WSC | 63.46% | 59.60% | 58.70%
Average | 49.14% | 46.51% | 47.17%
Table 1: Performance comparison between Madhuram and GPT3-Small across various NLP tasks. Tasks where Madhuram holds a notable advantage are discussed below.
- Overall Performance: Madhuram achieves an average accuracy of 49.14% across all tasks, outperforming GPT3-Small's zero-shot performance (46.51%) and approaching its few-shot performance (47.17%).
- Task-Specific Strengths: Madhuram particularly excels in reasoning and knowledge-intensive tasks:
- BoolQ (Boolean Questions): Madhuram (53.12%) outperforms GPT3-Small in both zero-shot (49.70%) and few-shot (43.10%) settings.
- WSC (Winograd Schema Challenge): Madhuram demonstrates superior performance (63.46%) compared to GPT3-Small zero-shot (59.60%) and few-shot (58.70%).
- SuperGLUE: Madhuram (53.07%) substantially outperforms GPT3-Small zero-shot (40.60%) and achieves comparable results to few-shot (50.20%).
- Consistent Performance: Even in tasks where GPT3-Small shows improvement with few-shot learning, Madhuram's zero-shot capabilities remain competitive, demonstrating strong generalization even without context examples.
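The zero-shot numbers above come from multiple-choice benchmarks such as ARC, OBQA, and PIQA, which are conventionally scored by comparing the log-likelihood the model assigns to each answer choice. The sketch below illustrates that general procedure with Hugging Face transformers; the checkpoint ID is a placeholder, and this is an illustration of the standard scoring recipe rather than our exact evaluation harness.

```python
# Zero-shot multiple-choice scoring in the style used by standard
# evaluation harnesses: rank each answer choice by the log-likelihood
# the model assigns to it, given the question as context.
# NOTE: "maruth-labs/madhuram-74m" is a placeholder checkpoint ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("maruth-labs/madhuram-74m")
model = AutoModelForCausalLM.from_pretrained("maruth-labs/madhuram-74m").eval()

def choice_logprob(context: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to `choice` after `context`."""
    n_ctx = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(dim=-1)
    # Score only the continuation: the logits at position i predict token i+1.
    # (Real harnesses also handle BPE boundary effects; this sketch does not.)
    targets = full_ids[0, n_ctx:]
    preds = logprobs[0, n_ctx - 1 : -1]
    return preds[torch.arange(len(targets)), targets].sum().item()

question = "Which gas do plants absorb from the atmosphere? Answer:"
choices = [" carbon dioxide", " oxygen", " nitrogen", " helium"]
print(max(choices, key=lambda c: choice_logprob(question, c)))
```

Because no in-context examples are supplied, accuracy under this protocol is a direct measure of what the model learned during pretraining.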
Comparison with Other Compact Models
Madhuram's performance becomes even more impressive when compared to other models in the compact LLM space:
Task | Madhuram-74M | MobileLLM-125M | SmolLM2-135M | GPT3-Small
---|---|---|---|---
ARC | 36.84% | 37.25% | 42.40% | 35.10%
Hellaswag | 33.95% | 39.50% | 41.20% | 33.70%
OBQA | 31.40% | 41.10% | 34.00% | 35.60%
PIQA | 63.76% | 65.70% | 69.40% | 64.60%
WinoGrande | 51.78% | 52.10% | 51.30% | 52.00%
BoolQ | 53.12% | 60.40% | 60.30% | 49.70%
SuperGLUE | 53.07% | 54.90% | 53.37% | 40.60%
RTE | 54.86% | 54.15% | 49.82% | 47.70%
WSC | 63.46% | 36.54% | 36.54% | 59.60%
Average | 49.14% | 49.07% | 48.59% | 46.51%
Table 2: Performance comparison of Madhuram with other compact models across different NLP tasks.
Comparison with STAR-5
The emerging STAR-5 model from Liquid AI represents another important benchmark in the on-device LLM space:
Task | Madhuram | STAR-5/Quality | Difference
---|---|---|---
ARC (Easy) | 42.90% | 39.10% | +3.80%
Hellaswag | 33.95% | 29.20% | +4.75%
WinoGrande | 51.78% | 52.10% | -0.32%
PIQA | 63.76% | 62.10% | +1.66%
SciQ | 69.90% | 72.70% | -2.80%
Average | 52.40% | 51.00% | +1.40%

Table 3: Performance comparison between Madhuram and STAR-5 on shared benchmark tasks.
- Overall Edge: Madhuram demonstrates a slight edge in average performance (52.40% vs 51.00%).
- Knowledge-Intensive Tasks: Madhuram outperforms STAR-5 on ARC-Easy (42.9% vs 39.1%) and Hellaswag (33.95% vs 29.2%).
- Scientific Reasoning: STAR-5 shows stronger performance on SciQ (72.70% vs 69.90%).
- Comparable Performance: The two models are close on PIQA (63.76% vs 62.10%) and effectively tied on WinoGrande (51.78% vs 52.10%).
Key Takeaways from Our Benchmarks
- Efficiency Breakthrough: Madhuram delivers performance comparable to models with 1.5-2x more parameters, demonstrating the effectiveness of our architectural optimizations.
- Balanced Capabilities: Unlike some specialized models that excel in specific tasks but underperform in others, Madhuram maintains consistent performance across diverse challenges.
- Zero-Shot Reasoning: Madhuram's strong zero-shot performance suggests excellent generalization capabilities, reducing the need for example-based prompting that increases token usage and latency.
- Real-World Applicability: The benchmarks we've selected reflect practical usage scenarios for on-device deployment, focusing on commonsense reasoning, reading comprehension, and knowledge application.
Comparative Analysis: Beyond the Numbers
The performance metrics tell only part of the story. What makes Madhuram truly stand out is that it achieves these results under significant resource constraints:
- Parameter Efficiency: At just 74M parameters, Madhuram requires less than 300MB of storage space compared to 500MB+ for comparable models, making it viable for deployment on a wider range of devices (a quick back-of-the-envelope check follows this list).
- Training Efficiency: Madhuram was trained on 130 billion tokens, significantly less than many competing models, yet achieves comparable or superior results.
- Inference Speed: Our optimized architecture delivers competitive inference speed, with token generation speeds suitable for real-time applications even on mid-range mobile devices.
- Memory Footprint: During operation, Madhuram requires just 800MB of RAM, enabling deployment on devices with limited memory resources including older smartphones and IoT devices.
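The storage figure above follows directly from the parameter count and the numeric precision of the stored weights. A quick back-of-the-envelope check (the precision options here are generic; this post does not specify the format Madhuram ships in):

```python
# Storage for a 74M-parameter model at common weight precisions.
# The "<300MB" figure above is consistent with full fp32 weights;
# the format Madhuram actually ships in is not stated in this post.
PARAMS = 74e6
for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: {PARAMS * bytes_per_param / 2**20:.0f} MiB")
# fp32: 282 MiB, fp16: 141 MiB, int8: 71 MiB
```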
Potential Deployment and Applications
Because of its compact size and low latency, Madhuram is ideal for the following (a minimal inference sketch follows this list):
- Mobile AI Assistants: On-device natural language understanding for chatbots and virtual assistants without server round-trips.
- Wearable Interfaces: Real-time language processing on smartwatches and AR glasses where power and compute are highly constrained.
- Embedded Systems: NLP capabilities in edge devices such as IoT sensors and automotive control units.
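As a concrete starting point, the sketch below shows what local inference could look like through Hugging Face transformers. The checkpoint ID is a placeholder, and a production mobile or embedded deployment would typically use a quantized export and a dedicated on-device runtime rather than this setup.

```python
# Minimal local-inference sketch with Hugging Face transformers.
# "maruth-labs/madhuram-74m" is a placeholder checkpoint ID; a real
# mobile/embedded deployment would typically use a quantized export
# and a dedicated on-device runtime.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("maruth-labs/madhuram-74m")
model = AutoModelForCausalLM.from_pretrained(
    "maruth-labs/madhuram-74m",
    torch_dtype=torch.float16,  # halves weight memory on supported hardware
).eval()

prompt = "Set a reminder for tomorrow at 9am to"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```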
What's Next for Madhuram?
We're continuing to improve Madhuram in several directions:
- Developing specialized variants for specific use cases like education and healthcare.
- Creating tools and APIs to make integration simpler for developers.
- Deploying Madhuram on a range of edge devices to characterize its capabilities and limitations.
We believe that AI should be accessible to everyone, and Madhuram is a significant step toward that vision. By bringing powerful language AI to devices people already own, we're democratizing access to these transformative technologies.