Utilize embeddings
This page walks through how to create and use embeddings in Snorkel Flow. You can use embeddings across your end-to-end workflow: understanding your data, labeling programmatically, and training models.
There are two ways to compute embeddings:
- Compute embeddings during application creation
- Compute embeddings in embedding home
Once you have computed your embeddings, you can use them throughout Studio.
Compute embeddings during application creation
When you create an application using the guided workflow, the Compute embeddings option appears at the Development Settings step. It is set to Yes by default, and the selected field defaults to the primary text or PDF field. The embeddings are calculated using SimCSE.
Compute embeddings in embedding home
You can also compute additional embeddings during model development in Studio. To do so, click the Embeddings dropdown in the top-right corner of your screen, then click Add new embeddings. For sequence tagging applications, you can calculate RAG embeddings. For more information, see Prompting with document chunking (RAG).
In the modal, select the desired parameters, then click Compute.
You can track calculation progress in the table or by hovering over the Embeddings dropdown. Once the calculation is complete, you can use the new embeddings throughout Studio.
Embedding-powered features in Studio
Here is a list of features in Studio that are powered by embeddings:
- You can use the embedding-based cluster workflow to identify groups of similar datapoints, which can then be used to create cluster LFs.
- You can view an embedding map as well as the top n-grams in the Data summary pane.
- You can use the embedding field as an input for model training.
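To make the idea of "groups of similar datapoints" more concrete, here is a minimal Python sketch of grouping vectors by cosine similarity. It is only an illustration of the general concept: the functions `cosine_similarity` and `greedy_clusters`, the threshold value, and the toy vectors are all hypothetical, and this is not the clustering method Snorkel Flow uses in the cluster view.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def greedy_clusters(embeddings, threshold=0.9):
    """Assign each vector to the first cluster whose seed is similar enough.

    Hypothetical helper for illustration only; real clustering workflows
    typically use more robust algorithms.
    """
    seeds = []   # one representative vector per cluster
    labels = []  # cluster id for each input vector
    for vec in embeddings:
        for cluster_id, seed in enumerate(seeds):
            if cosine_similarity(vec, seed) >= threshold:
                labels.append(cluster_id)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels


# Toy 2-dimensional "embeddings" (made-up values, not real model output).
toy_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.95],
]
labels = greedy_clusters(toy_embeddings)
print(labels)
```

Datapoints that land in the same group are candidates for a shared labeling rule, which is the intuition behind building an LF from a cluster.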
Supported use cases
The table below shows the availability of embedding compute and embedding-powered features per use case in the 2024.R2 LTS (v0.93) release:
Feature | Raw text classification | Raw text candidate extraction | Raw text sequence tagging | PDF classification | PDF extraction | All multi-label cases |
---|---|---|---|---|---|---|
Compute embeddings (simcse, spacy, clip) | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ |
Compute RAG embeddings | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
Embedding-based cluster workflow (Cluster view) | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ |
Embedding map (in Data Summary Pane) | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
Top n-gram (in Data Summary Pane) | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
Embedding field for model training | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
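If you want to check this availability matrix from a script, one option is to encode the table as a plain lookup structure. The sketch below does that in Python; the identifiers (`USE_CASES`, `SUPPORT`, `is_supported`, `supported_features`) and the snake_case keys are hypothetical names chosen for illustration and are not part of any Snorkel Flow API. The values mirror the table above for the 2024.R2 LTS (v0.93) release.

```python
# Column order matches the table above (hypothetical snake_case keys).
USE_CASES = (
    "raw_text_classification",
    "raw_text_candidate_extraction",
    "raw_text_sequence_tagging",
    "pdf_classification",
    "pdf_extraction",
    "multi_label",
)

# One row per feature; each tuple follows the USE_CASES column order.
SUPPORT = {
    "compute_embeddings": (True, True, False, True, False, False),
    "compute_rag_embeddings": (False, False, True, False, False, False),
    "cluster_view": (True, True, False, True, False, False),
    "embedding_map": (True, True, False, False, False, False),
    "top_ngrams": (True, True, False, False, False, False),
    "embedding_field_for_training": (True, True, True, False, False, False),
}


def is_supported(feature, use_case):
    """Return True if the table marks this feature as available for the use case."""
    return SUPPORT[feature][USE_CASES.index(use_case)]


def supported_features(use_case):
    """List every feature the table marks as available for the use case."""
    return [feature for feature in SUPPORT if is_supported(feature, use_case)]


print(supported_features("pdf_classification"))
```

Keeping the matrix as data rather than as branching logic means a new release only requires updating the tuples.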