Version: 0.95

What is data-centric AI?

This document introduces data-centric artificial intelligence (AI) and how it differs from model-centric AI. It also explains how the data-centric approach can benefit machine learning (ML) applications and how it can be applied in an organization.

Data-centric AI vs. model-centric AI

In data-centric AI, data is the key factor in how well an ML application performs. A data-centric approach weights your effort toward making sure the AI is learning what you want it to learn. Snorkel Flow makes this process of data development scalable, with operations that accelerate labeling, managing, slicing, augmenting, and curating data. The model stays relatively fixed.

This contrasts with model-centric AI, where the choice of model is the differentiating factor. Model-centric AI treats training datasets as static collections of ground-truth labels: the ML model is trained to fit that fixed labeled data.

Improvements in the data and improvements in the model are not mutually exclusive. Successful AI requires both well-conceived models and good data.

A brief history of data and models

Machine learning has always been about data. Historically, data science and machine learning teams would start with a dataset, and then achieve gains in ML application performance by iterating on static aspects of the data like feature engineering, or by working on algorithm design and bespoke model architecture.

Now that sophisticated models are available as off-the-shelf commodities, data science teams can provide value by shifting their focus towards continuously improving the data that makes their applications unique.

The power of iterating on data

Curated data shouldn't be treated as a static artifact.

Much like code forms the building blocks of traditional software, data forms the building blocks of ML applications. Just like you can iterate on code to improve traditional software, you can iterate on data to achieve higher performance in ML applications.

This holds true whether you're using your data in context with prompt engineering and retrieval-augmented generation (RAG), or using it to fine-tune or train high-performance ML models. If your focus is on fine-tuning and training, today's models require significantly higher volumes of higher-quality training data.

Evaluating generative model responses is also a data task that benefits from iteration. For example, when you have a chatbot that can respond to user queries about your product, it may take several rounds of iteration to establish all of the criteria for an acceptable response, and then to label responses against those criteria.
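One way to make that iteration concrete is to encode each acceptance criterion as a check that can be run over every response. The sketch below is illustrative only; the criteria and function names are hypothetical, not part of any Snorkel Flow API.

```python
# Illustrative sketch: encode acceptance criteria for chatbot responses
# as simple checks, then label each response against all of them.
# The criteria themselves are hypothetical placeholders.

CRITERIA = {
    "answers_question": lambda resp: len(resp.strip()) > 0,
    "cites_product_docs": lambda resp: "docs" in resp.lower(),
    "reasonable_tone": lambda resp: not resp.isupper(),  # e.g., no all-caps shouting
}

def label_response(response: str) -> dict:
    """Score a response against each criterion; acceptable only if all pass."""
    results = {name: check(response) for name, check in CRITERIA.items()}
    results["acceptable"] = all(results.values())
    return results

print(label_response("Please see our docs for setup steps."))
```

As new failure modes surface during review, you add or refine criteria and re-label every response in minutes, which is the iteration loop the paragraph above describes.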

Data-centric AI and the data development lifecycle

The dataset is the interface for collaboration between subject matter experts (SMEs) and data scientists, who turn those experts’ knowledge into software.

The process of creating and improving curated datasets is called data development.

Just like software development, data development can be done programmatically, saving many hours of SME time. Snorkel Flow is a data development platform that enables data development as a discipline.
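To make "programmatic data development" concrete, here is a minimal sketch of the underlying idea: SME heuristics are written as small labeling functions whose votes are combined into training labels. The heuristics, label values, and combination rule below are illustrative assumptions, not Snorkel Flow's actual API.

```python
# Minimal sketch of programmatic labeling: each function encodes one SME
# heuristic and either votes for a label or abstains; a simple majority
# vote combines the votes. All names and rules here are illustrative.
from collections import Counter

POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

def lf_mentions_default(doc: str) -> int:
    # SME heuristic: documents mentioning "default" signal risk exposure.
    return POSITIVE if "default" in doc.lower() else ABSTAIN

def lf_mentions_repaid(doc: str) -> int:
    # SME heuristic: repaid loans are not risk exposure.
    return NEGATIVE if "repaid" in doc.lower() else ABSTAIN

def lf_short_doc(doc: str) -> int:
    # SME heuristic: very short documents rarely describe real exposure.
    return NEGATIVE if len(doc.split()) < 4 else ABSTAIN

LABELING_FUNCTIONS = (lf_mentions_default, lf_mentions_repaid, lf_short_doc)

def majority_label(doc: str) -> int:
    """Combine labeling-function votes; abstain when there is no majority."""
    votes = [lf(doc) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    top, count = Counter(votes).most_common(1)[0]
    return top if count > len(votes) / 2 else ABSTAIN

print(majority_label("Borrower is at risk of default on the loan"))  # → 1
```

Because the heuristics are code, fixing a labeling error means editing one function and re-running it over the whole corpus, rather than asking SMEs to relabel documents by hand.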

Example: Identify risk exposure using AI

To illustrate the value of data-centric AI, here’s an example from one of Snorkel Flow’s customers.

Snorkel Flow helped a top-three US bank reduce the time it takes to identify and triage risk exposure in their loan portfolio from months to days.

Previously, this process required:

  • 6 person-months of analyst time to manually label the bank’s documents
  • 1 day to develop a model for the specific project
  • 3 months to perform error analysis, validate the model, and review and improve the datasets

That six months of manual data labeling was a severe bottleneck. The bank could perform this risk exposure analysis only on an ad-hoc basis, and its datasets still suffered from inconsistent and unreliable labels. The labeling approaches weren't auditable and were slow to adapt as new issues emerged.

Before adopting Snorkel Flow, the bank had these options for data labeling:

  • Outsourcing to labeling vendors: Outside manual-labeling vendors brought significant privacy concerns regarding the confidentiality of the bank’s data. Additionally, most annotators lacked domain expertise, so their work required additional review before use.
  • Labeling with in-house experts: Using in-house SMEs solely for labeling was time-consuming and carried a high opportunity cost, as SME time could be spent on higher-value efforts.

When the bank decided to use Snorkel Flow for data labeling, they significantly reduced total labeling time while increasing label accuracy:

  • Increased speed: After developing their initial model in Snorkel Flow, the bank could label ~250k new documents in less than 24 hours.
  • Guided error analysis: Data scientists could rapidly iterate on model errors to improve results, collaborating with SMEs and codifying their input into the existing labeling approach.
  • Minimal validation effort: Label validation and iteration could be accomplished in minutes with minor code modifications.

Key principles of data-centric AI

Here are some principles that can guide an organization or individual towards a data-centric approach to AI:

  • Spend effort on data to efficiently improve ML applications: As models become more user-friendly and commoditized, gains in ML application quality increasingly rely on curated and updatable datasets.
  • Iterate on datasets: Rather than treating your datasets as static inputs to ML applications, treat datasets like code and keep developing them.
  • Adopt programmatic data labeling: Data development should be programmatic. This allows developers to cope with the volume of training data that today’s deep-learning models require. Manually labeling millions of data points is not practical. A programmatic process for labeling and iterating on data is crucial to make consistent progress.
  • Collaborate with subject matter experts (SMEs): Data-centric AI treats SMEs as integral to the development process. Include SMEs who understand how to label and curate your data so that data scientists can inject that domain expertise directly into the model. This expert knowledge can then be codified and deployed for programmatic supervision.

Benefits of data-centric AI

Organizations developing their AI solutions using data-centric AI capture additional value from creating higher-quality data:

  • Faster development: A Fortune 50 bank built a news analytics application 45x faster and with 25% higher accuracy than a previous system.
  • Cost savings: A large biotech firm saved an estimated $10 million on unstructured data extraction, achieving 99% accuracy using Snorkel Flow.
  • Higher accuracy: One of the largest custodian banks used Snorkel Flow to extract and add metadata to a dataset. The bank improved the accuracy of their AI agent answers from 25% to 95% by using RAG from the enhanced dataset.