
What are leaderboards?

Snorkel AI's leaderboards benchmark large language models (LLMs) on complex, domain-specific tasks using expert data. Built with Snorkel's proprietary Data-as-a-Service (DaaS) technology, these leaderboards showcase fine-grained model capabilities in realistic scenarios.

Specialized benchmarks matter

Generic benchmarks offer a baseline for comparing LLMs on common-sense or broad academic tasks, but they fall short in specialized domains such as insurance underwriting, legal contract review, and healthcare documentation. These domains require expert knowledge, contextual understanding, and task-specific reasoning that general-purpose models often lack.

The Snorkel AI leaderboards address this gap with benchmarks grounded in real-world tasks and workflows. The DaaS process incorporates domain-specific expert input for annotation, validation, and slicing, yielding high-quality evaluation pipelines.

Data development with human experts

Each leaderboard is built on datasets developed with Snorkel technology, including LLM-as-Judge (LLMAJ) evaluation, sketched below. This approach leverages:

  • Labeling and slicing: Data development tools ensure high-coverage training and evaluation sets.
  • SME-in-the-loop: Subject matter experts ensure high-quality datasets that are representative of real-world tasks.
  • Targeted evaluation criteria: Leaderboards focus on metrics that reflect practical utility, such as rationale quality, tool errors, or consistency.
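
As a rough illustration of the LLM-as-Judge pattern (not Snorkel's implementation), the Python sketch below grades one model answer against a simple rubric and records a rationale. The `call_judge_model` stub, the rubric text, and the underwriting example are all placeholders.

```python
import json

# Placeholder for a real LLM call (through whatever model provider you use).
# It returns a canned verdict here so the sketch runs end to end.
def call_judge_model(prompt: str) -> str:
    return json.dumps({"score": 4, "rationale": "Cites the relevant policy clause."})

# Hypothetical rubric; a production rubric would be written with domain SMEs.
JUDGE_PROMPT = """You are grading an AI assistant's answer to an insurance underwriting question.
Score the answer from 1-5 for factual accuracy and rationale quality.
Question: {question}
Answer: {answer}
Respond as JSON with keys "score" and "rationale"."""

def judge(question: str, answer: str) -> dict:
    """Ask the judge model to grade one (question, answer) pair."""
    raw = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    verdict = json.loads(raw)
    return {"question": question, "score": verdict["score"], "rationale": verdict["rationale"]}

print(judge(
    question="Should this commercial property application be referred to a senior underwriter?",
    answer="Yes: the flood-zone rating exceeds the automatic-approval threshold.",
))
```

In practice, judge verdicts like these are reviewed and calibrated with SMEs in the loop before they are trusted for benchmarking.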

In one example, Snorkel partnered with insurance SMEs to construct evaluation sets that mirror realistic risk assessments. These datasets can be used to evaluate AI agents for insurance underwriting, surfacing differences between models that generic benchmarks cannot detect.

Benefits

These leaderboards serve several audiences with real-world use cases:

  • Model buyers can compare vendor LLMs on domain-relevant tasks before integration.
  • AI builders can validate fine-tuned or customized models against realistic benchmarks.
  • Researchers gain access to curated datasets with rich evaluation protocols.

By focusing on data-first development, the leaderboards demonstrate how expert data drives performance in specialized AI.

Create your own benchmarks

You can use Snorkel Expert Data-as-a-Service to build task-specific evaluation sets, custom metrics, and rationales.
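
As a generic illustration (not the Snorkel SDK), the sketch below aggregates hypothetical per-example scores into a per-slice metric, which is the shape a task-specific benchmark report often takes; the record fields and slice names are made up.

```python
from collections import defaultdict

# Hypothetical evaluation records: each has a slice tag, a per-example score,
# and the judge's rationale. In practice these come from your evaluation run.
records = [
    {"slice": "flood_risk", "score": 1.0, "rationale": "Correct referral decision."},
    {"slice": "flood_risk", "score": 0.0, "rationale": "Missed the elevation requirement."},
    {"slice": "liability_limits", "score": 1.0, "rationale": "Limits match the policy schedule."},
]

def per_slice_accuracy(rows: list[dict]) -> dict[str, float]:
    """Aggregate example-level scores into a mean score per slice."""
    by_slice: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        by_slice[row["slice"]].append(row["score"])
    return {name: sum(scores) / len(scores) for name, scores in by_slice.items()}

print(per_slice_accuracy(records))  # {'flood_risk': 0.5, 'liability_limits': 1.0}
```

Slice-level numbers like these make it easier to see where a model improves or regresses as you iterate.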

You can also track model progress on your benchmarks with Snorkel Evaluate.

For more, explore how Snorkel’s platform enables this in Building the AI Data Development Platform for Specialized AI.