Analysis: Rinse and repeat
For a selected model in the Train pane, click Analysis to open the Analysis pane, where you can find analyses that help you understand the performance of both your labeling functions and your models, along with suggestions on how to improve your end results.
In Snorkel Flow, this feedback can be acted upon directly by adding or modifying LFs, or by training a new model to target specific errors. Improving your model becomes an iterative, systematic process.
Clicking Open Analysis opens the Analysis Modal, where you can also compare multiple models and view the different analysis tools over different splits of the data.
If a split does not have predictions or ground truth labels, some of the analysis tools will not be available.
The predictions from a given model are recorded at the time the model is trained. To get predictions for any new data points added to a dataset, you will need to train a new model. On the other hand, the metrics on the Analyze page are computed live. As a result, the metrics may change in the following ways:
| Scenario | Metrics Updated for Existing Model |
| --- | --- |
| New data point(s) added | No |
| New GT added | Yes |
| Existing GT changed | Yes |
Clarity matrix
The Clarity Matrix helps identify mistakes made by your model with respect to your label package. This allows you to identify a reasonable next step that would improve model performance.
Clicking on an element of the Clarity Matrix activates the relevant filters in Studio.
By default, “All labels” is selected and the Clarity Matrix displays information for all class labels. If you are interested in a specific class label, please choose it from the dropdown at the top of the Analysis Modal.
Confusion matrix
The Confusion Matrix helps to identify major buckets of errors being made by your model. There are two views available: Matrix View and Table View.
Matrix View is not available for high cardinality tasks.
Clicking on an error bucket will activate filters in Studio to see all the samples in that error bucket.
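For intuition, the same kind of error-bucket view can be reproduced outside the platform on exported predictions. This is a minimal sketch using scikit-learn; the label names and predictions below are made up.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and model predictions exported for one split
labels = ["employment", "loan", "other"]
y_true = ["employment", "loan", "other", "loan", "employment"]
y_pred = ["employment", "other", "other", "loan", "loan"]

# Rows are ground truth classes, columns are predicted classes;
# off-diagonal cells are the error buckets.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(pd.DataFrame(cm, index=labels, columns=labels))
```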
Margin confusion matrix
When you lack ground truth, the margin confusion matrix is a useful tool to identify likely errors in a similar way to the confusion matrix. Instead of using ground truth labels, the margin confusion matrix uses the model's predictions together with the margin, which is the difference between the two most confident predictions. For example, [0.5, 0.4, 0.1] has a margin of 0.5 - 0.4 = 0.1. The smaller the margin, the more confused the model is between its top two predictions. You can use this to find examples where the model is likely to be incorrect, or at least examples that would benefit from extra LF coverage.
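As a rough sketch of the margin calculation itself (not the platform's code), given each example's predicted class probabilities:

```python
import numpy as np

# Each row is the model's predicted probability distribution for one example
probs = np.array([
    [0.5, 0.4, 0.1],    # margin = 0.5 - 0.4 = 0.1 (model is confused)
    [0.9, 0.05, 0.05],  # margin = 0.85 (model is confident)
])

# Margin = highest probability minus second-highest probability
top_two = np.sort(probs, axis=1)[:, -2:]
margins = top_two[:, 1] - top_two[:, 0]
print(margins)  # [0.1  0.85]
```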
LF error contributions
LF Error Contributions lists all labeling functions whose incorrect predictions match the incorrect predictions of the end model.
Clicking on one of the LFs takes you to the Label page with a filter that shows examples where that LF is incorrect.
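Conceptually, an LF contributes to a model error when it voted, its vote was wrong, and that wrong vote matches the end model's wrong prediction. The sketch below assumes a vote matrix where -1 means abstain; it is an illustration, not the platform's implementation.

```python
import numpy as np

# Hypothetical LF votes: rows are data points, columns are LFs, -1 = abstain
L = np.array([
    [ 0,  1, -1],
    [ 1,  1,  0],
    [-1,  0,  0],
])
y_true = np.array([0, 0, 1])   # ground truth labels
y_model = np.array([0, 1, 0])  # end model predictions

model_errors = y_model != y_true
for j in range(L.shape[1]):
    votes = L[:, j]
    # The LF voted, was wrong, and its wrong vote matches the model's wrong prediction
    shared = (votes != -1) & (votes != y_true) & (votes == y_model) & model_errors
    print(f"LF {j}: {shared.sum()} incorrect predictions shared with the model")
```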
Label distributions
Label Distributions shows the raw counts (on hover) and percentages of each class based on the ground truth labels, the training set labels that the model uses (label model output), and the current model's predictions.
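Outside the platform, a comparable breakdown can be computed from exported label columns; here is a minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical label columns for the same data points
df = pd.DataFrame({
    "ground_truth": ["spam", "ham", "ham", "spam", "ham"],
    "label_model":  ["spam", "ham", "spam", "spam", "ham"],
    "end_model":    ["spam", "ham", "ham", "ham", "ham"],
})

for col in df.columns:
    counts = df[col].value_counts()                               # raw counts
    pcts = (df[col].value_counts(normalize=True) * 100).round(1)  # percentages
    print(col, counts.to_dict(), pcts.to_dict())
```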
Class level metrics
Class level metrics displays model performance on a per-class basis to help you decide where to best focus your efforts.
Clicking the icon next to a class name takes you to the Label page with the relevant filters activated.
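For intuition, similar per-class numbers can be computed on exported predictions with scikit-learn (a sketch with made-up labels, not the platform's own computation):

```python
from sklearn.metrics import classification_report

# Hypothetical ground truth and predictions for one split
y_true = ["spam", "ham", "ham", "spam", "ham", "spam"]
y_pred = ["spam", "ham", "spam", "spam", "ham", "ham"]

# Precision, recall, F1, and support broken out per class,
# plus macro and weighted averages
print(classification_report(y_true, y_pred, digits=3))
```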
To update the metric types you are interested in, click on Open Analysis to open the Analysis Modal, where you can click on Model Metrics to access Metric Types.
In addition to the built-in metric types, you can register custom metrics from a notebook. Refer to the snorkelflow.client.metrics.register_custom_metric method in the Python SDK for more details and examples.
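As a rough illustration only, a custom metric is typically a function that scores predictions against gold labels and is then registered under a name. The argument names and registration signature below are assumptions, so check the SDK reference for the actual contract.

```python
from snorkelflow.client import metrics as sf_metrics

def fraction_confident(golds, preds, probs, **kwargs):
    # Hypothetical metric: fraction of predictions made with >= 90% confidence.
    # The parameters here are assumed; the real signature is documented with
    # snorkelflow.client.metrics.register_custom_metric in the Python SDK.
    return float((probs.max(axis=1) >= 0.9).mean())

# Assumed registration call; refer to the SDK reference for exact parameters.
sf_metrics.register_custom_metric("fraction_confident", fraction_confident)
```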
Generalization
This provides two analyses:
- The improvement in data coverage by the end model compared to the label model created with your labeling functions. For example, if the labeling functions you created cover 80% of your training data, the improvement would be 20% because the end model can cover 100% of your training data.
- The generalization gap between your train and valid splits. If the difference in accuracies is greater than 5%, a warning about overfitting to the dev set is raised. Note that the train split score ignores the data points in the dev set, so in some cases this warning may not appear even if the dev and valid scores differ by more than 5%.
Some datasets can have large differences between dev and valid set performance due to differences in the underlying data distributions. If that’s the case, you might want to modify your data so that your dev and valid splits come from the same distribution.
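As a back-of-the-envelope sketch of these two numbers, using made-up coverage and accuracy values:

```python
# Coverage improvement: the end model predicts on every data point, while the
# label model only covers points where at least one LF voted.
label_model_coverage = 0.80   # fraction of training data covered by your LFs
end_model_coverage = 1.00
coverage_improvement = end_model_coverage - label_model_coverage  # 0.20

# Generalization gap between the train and valid splits
train_accuracy = 0.92
valid_accuracy = 0.85
gap = train_accuracy - valid_accuracy
if gap > 0.05:
    print(f"Gap of {gap:.2f} > 0.05: possible overfitting to the dev set")
```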
Precision-recall curve
The precision-recall curve plots the precision and the recall for different decision thresholds. This matters more for binary classification problems, and is especially useful when your data suffers from class imbalance, e.g. when there are more negative samples than positive samples.
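Outside the platform, the same curve can be computed from exported scores with scikit-learn; here is a minimal sketch with made-up binary labels and positive-class probabilities:

```python
from sklearn.metrics import precision_recall_curve

# Binary ground truth and predicted probabilities for the positive class
y_true = [0, 0, 1, 1, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9]

# Each decision threshold yields one (precision, recall) point on the curve
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
for p, r, t in zip(precision, recall, list(thresholds) + [None]):
    print(f"threshold={t}  precision={p:.2f}  recall={r:.2f}")
```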
LF quality view
These visualizations display the behavior of your labeling functions over your ground truth data and relate it to the downstream performance of the label model. You can use them to find which labeling functions should be refined. They group each data point into one of five categories:
- Correct/Correct (solid blue): all LFs voting on these points were correct, and so was the downstream label model
- Conflicting/Correct (shaded blue): only some LFs voting were correct, but the label model was able to denoise the labeling functions and produce a correct training set label
- Conflicting/Incorrect (shaded red): only some LFs voting were correct, and the label model was unable to denoise the labeling functions which then produced an incorrect training set label
- Incorrect/Incorrect (solid red): all LFs voting on these points were incorrect, and so was the downstream label model
- Uncovered (grey): no LFs voted on these points, and so no training set label was produced
By clicking on any of the colored bars, you can view the data points in each group. We recommend starting with the shaded red group and working your way down to the grey group in order to improve your training data.
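Below is a minimal sketch of how a single data point might be bucketed, given its LF votes, the label model output, and its ground truth label; this is a simplification for intuition, not the platform's exact logic.

```python
def lf_quality_bucket(lf_votes, label_model_label, gold, abstain=-1):
    """Assign one data point to an LF quality category."""
    votes = [v for v in lf_votes if v != abstain]
    if not votes:
        return "Uncovered"
    lm = "Correct" if label_model_label == gold else "Incorrect"
    if all(v == gold for v in votes):
        lfs = "Correct"
    elif any(v == gold for v in votes):
        lfs = "Conflicting"
    else:
        lfs = "Incorrect"
    return f"{lfs}/{lm}"

# One LF voted 1 (correct), one abstained, one voted 0 (incorrect);
# the label model recovered the correct label.
print(lf_quality_bucket([1, -1, 0], label_model_label=1, gold=1))  # Conflicting/Correct
```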
Filter metrics by tags
You can use the “Tags Filter” dropdown to view model metrics on subsets of the data. Here, you will see all of your error analysis tags as defined in Error Analysis Tags. You can then check or uncheck individual tags to include or exclude those examples from the analysis. Note that “Untagged Data” refers to all data points that do not have an error analysis tag applied to them.
Error analysis by tags
You can use error analysis methods like the tag-based analysis chart and table to survey your data in the iterative development loop.
Tag-based analysis: Chart view
The tag-based analysis chart shows the correct (blue) and incorrect (red) data count for each tag. To also see the count for data without GT, toggle the “Show GT is Unknown” button.
Incorrect data signifies data where the ground truth is not equal to the prediction. You can go through incorrect data and categorize them into smaller error buckets by adding tags to the data based on the error types.
Filters can also be used to display only the tags you’re interested in viewing. You can display all tags by checking that option, or any number of them by unselecting irrelevant ones under the “display all tags” option.
If you click on any of the bars, the dataviewer on the right side will show the results filtered based on the clicked bar. The bars correspond to the left y-axis.
To add new tags to the data, click on “Open tag editor” to bring up the tag editor for adding or removing tags.
The purple line above the bars represents the size of each error bucket accumulated from left to right. The values denoted by the purple line correspond to the percentages on the right y-axis.
If you iterate through the error buckets from left to right and fix all the errors in each bucket, the percentage of correct data over the entire dataset will match the value denoted on the purple line.
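A rough sketch of that cumulative calculation, assuming the error buckets do not overlap and using made-up counts:

```python
# Error buckets ordered left to right as on the chart: tag -> number of errors
buckets = {"formatting": 40, "ambiguous span": 25, "wrong label": 10}
total_points = 500
currently_correct = 400

cumulative_fixed = 0
for tag, errors in buckets.items():
    cumulative_fixed += errors
    pct_if_fixed = (currently_correct + cumulative_fixed) / total_points * 100
    print(f"{tag}: {pct_if_fixed:.1f}% correct if every bucket up to here is fixed")
```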
Tag-based analysis: Table view
The table view shows each tag and its maximum F1 macro growth over the entire selected dataset. The maximum F1 macro growth for a tag indicates the maximum increase in F1 macro score (model performance) that you can achieve if all the errors in that tag are fixed.
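Conceptually, this number can be estimated by recomputing the F1 macro score after replacing each tagged error's prediction with its ground truth label; here is a sketch under that assumption, with made-up labels:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1, 0])
# Mask of data points carrying the tag whose errors we would "fix"
tagged = np.array([False, True, False, True, False, False])

baseline = f1_score(y_true, y_pred, average="macro")

# Pretend every tagged error is fixed by setting its prediction to ground truth
fixed_pred = np.where(tagged & (y_pred != y_true), y_true, y_pred)
max_growth = f1_score(y_true, fixed_pred, average="macro") - baseline
print(f"baseline F1 macro={baseline:.3f}, max growth={max_growth:.3f}")
```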
One-click to add LFs from tags
The Add LFs button (right column) allows you to add up to 5 suggested labeling functions (LFs) for their corresponding tags. Suggested LFs are based on the tagged data and can be added with one click. A success notification appears if high-quality LFs are added successfully.
A successful LF is one that has precision > 85%, and unsuccessful LFs cannot be added.
Applying tags to filter
After selecting a filter, you can apply tags over the results after filtering instead of individually tagging each result under the filter. Click the three dot overflow menu in the top right corner of the dataviewer, then click Apply tag to filter result.
Each time the modal is opened, all the tags revert to unchecked. This enables you to apply tags to all of the results under the filter again. However, a tag that has already been applied to the same filter results won’t be applied again.
This dropdown is meant for applying tags only. Documents that have already been tagged cannot be deselected and untagged here.
Bulk apply tag in table view
You can also bulk apply tags in the table view, either by using multi-select or by using the select all/unselect all feature.
Multi-select rows in the table view, then click Edit tags to add or remove tags:
Postprocessors
When Apply postprocessors is turned on, all postprocessors for the model node are applied before the metrics are calculated, so the metrics reflect the postprocessed output. This is turned on by default.
Suggestions
In the right pane, you will find suggestions. Details on each message you may see:
- Try tuning on valid set: The Label Model’s distribution has a class whose distribution differs from the Ground Truth’s or Model’s output by more than 10 percentage points.
- The following LFs are correlated with model errors…: These LFs are causing a disproportionate number of errors.
- Try tuning or oversampling based on the valid set: The train score is 5 points higher than the valid score.
- Resample dev set and iterate on LFs: The valid score is 5 points higher than the train score.
- Try improving precision of LFs in Label: More than half the errors are from the LFs outputting incorrect predictions.
- Try training a model using LFs as features: More than half the errors are caused by the model incorrectly predicting a correct LF output.
- Try writing more LFs: The model predicts ABSTAIN on more than half of errors.