Configure and train models
This page walks through how to configure, train, and tune machine learning models on your labeled dataset in Studio.
After you create and develop labeling functions for your dataset, you can begin training models. Once you kick off a model run, the sections below describe the training modes and options available to you.
Model training in Studio
To train a model, select Train a model in the Models pane. If you have already trained a model and want to make minor adjustments without recreating it from scratch, select the model from the dropdown, then click the icon. This automatically populates the model options based on the selected model.
We provide three modes for model training within the Snorkel Flow application:
- Fast model: Runs a single logistic regression model with no customizations for quick model results and iterations.
- Custom model: Allows you to customize your model architecture, hyperparameters, and hyperparameter searches.
- AutoML: With one click, allows you to run an exhaustive search over hyperparameters and model architectures, returning the best performing model.
See Supported modeling libraries for more information about the different modeling libraries that we support.
Fast model training
The Fast model option runs a basic logistic regression model on a sample of your training data. Because it trains quickly, it is an excellent option for rapidly iterating on and assessing models while actively developing new labeling functions (LFs). You can select the following training options:
- Training Set size: The number of data points on which to run the model.
- Auto-train Fast Models: Turn this on to automatically train a fast model on a sample of your data when your set of active LFs changes.
Custom model training
The Custom model option gives you full control over the model architecture, hyperparameters, and hyperparameter searches. We offer default model configs for several commonly used models, which provide a good starting point for model training.
The following sections detail the options that are available to customize your model.
General options
- Build from previous model: Select a previously executed model to automatically populate the model options based on its settings. This allows you to make minor adjustments to the existing model without the need to recreate it from scratch.
- Name: A name for your model. The name must be unique within the application.
- Description: An optional description of your model.
- Model architecture: The model type (e.g., logistic regression).
- Training sets: The training set of Snorkel labels that you'd like to train on. Choose from an existing set, or create a new training set.
- Input fields: The columns of the data that you want to train your model on. This option is unavailable for span-aware models.
Note
The order of the fields matters for BERT models because they truncate the input to the specified `max_seq_len`. Therefore, you should put the columns with more important information, or those of shorter length, first.
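The note above can be illustrated with a small sketch. This is not Snorkel Flow internals, and the whitespace tokenization is a simplification (real BERT tokenizers truncate on subword tokens), but it shows why field order interacts with truncation:

```python
def build_model_input(row, fields, max_seq_len):
    """Concatenate the selected fields in order, then truncate.

    Fields listed first survive truncation; trailing fields may be cut off.
    Whitespace tokenization is used here for simplicity.
    """
    tokens = []
    for field in fields:
        tokens.extend(str(row[field]).split())
    return tokens[:max_seq_len]

row = {"title": "Invoice overdue notice", "body": "Please remit payment " * 50}

# With the short, information-dense field first, it is fully retained:
kept = build_model_input(row, ["title", "body"], max_seq_len=16)

# With the long field first, the title is truncated away entirely:
lost = build_model_input(row, ["body", "title"], max_seq_len=16)
```

Here `kept` still begins with the title tokens, while `lost` contains only tokens from the long `body` field.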
Features libraries
- Add LF labels for model training and inference: This option is only available for Scikit-Learn models. If selected, the LF labels (i.e., the labels that the LFs output in the selected training set) are made available to the model as features. The model is likely to overfit to the LF labels, so this option is only recommended for LFs with high coverage and accuracy. Models that are trained with this option are currently not available for application export.
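The idea of using LF labels as features can be sketched as follows. The variable names (`base_features`, `lf_votes`) are illustrative, not Snorkel APIs:

```python
# Append each labeling function's vote as an extra feature column.
ABSTAIN = -1

base_features = [[0.2, 1.0], [0.9, 0.0], [0.5, 0.5]]  # e.g., vectorizer output
lf_votes = [[1, ABSTAIN], [0, 0], [ABSTAIN, 1]]        # one column per LF

# Each row now carries both the original features and the LF votes, which is
# why a downstream model can overfit to high-accuracy, high-coverage LFs.
augmented = [feats + votes for feats, votes in zip(base_features, lf_votes)]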
Train options
Model configs that are computationally expensive are marked with a `*` to indicate that they may have long training times without a GPU.
- Oversample data to match class distribution on valid split: If checked, Snorkel Flow will oversample the generated labels to match the class distribution of the valid split. This option is useful for applications with imbalanced classes.
- Override LF labels with ground truth labels where available: If checked, Snorkel Flow will override the generated labels with ground truth labels for model training, whenever they are available.
- Tune decision threshold on valid split: This option is only available for binary applications. If checked, it will use the valid split to pick a decision threshold that maximizes the primary metric (either accuracy or F1 score) on the valid split. This is especially useful for applications where there is significant class imbalance, as often occurs in text extraction applications.
- Train noise-aware model using probabilistic labels: This option is only available for Scikit-Learn models and cannot be combined with the Oversample data to match class distribution on valid split option. If selected, it will train the selected model using probabilistic training labels that the label model outputs, instead of integer predictions. This option can be helpful for applications with high cardinality (a large number of classes).
- Filter out low confidence labels: This option filters out Snorkel-generated labels that have low confidence (i.e., are close to random).
- Include dev split when training: This option uses data points in the dev set to train the model, in addition to data points in the train set. Note that using this option can result in the trained model being overfit to the dev set, and therefore the suggestions in the Analysis pane will not be as effective.
- Predict on dev only: This option runs predictions on the dev split only. This speeds up the model inference step. If you select True, then to treat the dev split as a holdout dataset, also set Include dev split when training to False.
- Max. Runs: This option specifies the maximum number of models to train with configs that are sampled from the specified hyperparameter search space.
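The "Tune decision threshold on valid split" option described above can be sketched as a simple scan over candidate thresholds, keeping the one that maximizes F1 on the valid split. This mirrors the documented idea only; Snorkel Flow's actual search is internal, and the helper names here are ours:

```python
def f1(y_true, y_pred):
    """F1 score for binary labels (1 = positive class)."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def tune_threshold(valid_probs, valid_labels, candidates):
    """Return the candidate threshold with the best F1 on the valid split."""
    return max(
        candidates,
        key=lambda t: f1(valid_labels, [int(p >= t) for p in valid_probs]),
    )

probs = [0.95, 0.40, 0.35, 0.10]   # model scores on the valid split
labels = [1, 1, 0, 0]              # ground truth
threshold = tune_threshold(probs, labels, [i / 10 for i in range(1, 10)])
```

With an imbalanced dataset, the threshold chosen this way can differ substantially from the default 0.5, which is why this option helps extraction-style applications.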
Model options
As you make selections for your model architecture, train options, etc., the model config will automatically update to reflect those changes. You can also manually customize the JSON to define your desired model or to define a hyperparameter search. Clicking the Reset button will reset the config back to the default values for the options that you selected.
Hyperparameter search
You can define a grid search over hyperparameter options in the model config. To specify a set of hyperparameters to search over, set the value of the hyperparameter in the config to a dictionary with a single key `"SEARCH"`, whose value is the list of values to search over.
You can only do hyperparameter search if you provide a valid split.
For example, to search over models with regularization parameter `C` of 1, 10, and 100, set `"C": {"SEARCH": [1, 10, 100]}`. Snorkel Flow will then train three models:
- one with `C = 1`
- one with `C = 10`
- one with `C = 100`
Snorkel Flow will then report the hyperparameter options and metrics for the model with the best performance on the valid split.
You can also search over combinations of hyperparameters. For example, if you set `"penalty": {"SEARCH": ["l1", "l2"]}` and `"C": {"SEARCH": [1, 10, 100]}` in the model config, Snorkel Flow will train six models, one for each combination of `"penalty"` in `["l1", "l2"]` and `"C"` in `[1, 10, 100]`.
By default, Snorkel Flow searches a maximum of 8 trials at a time, so if you specify more than 8 settings, it will randomly sample 8 models to run.
An example configuration of a hyperparameter search is shown below:
```json
{
  "classifier": {
    "classifier_cls": "LogisticRegression",
    "classifier_kwargs": {
      "C": {"SEARCH": [1, 10, 100]},
      "penalty": {"SEARCH": ["l1", "l2"]},
      "solver": "liblinear"
    }
  },
  "text_vectorizer": {
    "vectorizer_cls": "CountVectorizer",
    "vectorizer_kwargs": {
      "ngram_range": [1, 2],
      "n_features": 250000,
      "lowercase": false
    }
  }
}
```
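To make the grid expansion concrete, here is a sketch of how `{"SEARCH": [...]}` values expand into individual model configs, and how a cap of 8 trials could be enforced by random sampling. This mirrors the documented behavior; the helper names are ours, not Snorkel Flow APIs:

```python
import itertools
import random

def expand_search(kwargs):
    """Yield one concrete kwargs dict per combination of SEARCH values."""
    search_keys = [k for k, v in kwargs.items()
                   if isinstance(v, dict) and "SEARCH" in v]
    fixed = {k: v for k, v in kwargs.items() if k not in search_keys}
    grids = [kwargs[k]["SEARCH"] for k in search_keys]
    for combo in itertools.product(*grids):
        yield {**fixed, **dict(zip(search_keys, combo))}

classifier_kwargs = {
    "C": {"SEARCH": [1, 10, 100]},
    "penalty": {"SEARCH": ["l1", "l2"]},
    "solver": "liblinear",
}
configs = list(expand_search(classifier_kwargs))  # 3 x 2 = 6 configs

# With more than 8 combinations, randomly sample 8 to run:
trials = configs if len(configs) <= 8 else random.sample(configs, 8)
```

With two `"SEARCH"` lists of sizes 3 and 2, this yields the six combinations described above, each sharing the fixed `"solver"` value.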
AutoML model training
The AutoML training option allows you to run an exhaustive search over hyperparameters and model architectures with one click, returning the best performing model.
To run AutoML, you must have ground truth labels in your valid split.
To start an AutoML run:
- From the Models pane, click Train new model, then select the AutoML tab.
- The default selection for search strategy is Grid Search. You can also choose Bayesian Optimization from the dropdown, which starts a smart range search over the corresponding search space. See the Ray Tune API docs for more information about Bayesian optimization tuning.
- The default selection for model architecture is Logistic Regression, but you can search over multiple model architectures using the Model Architectures dropdown. Additional model architectures that are available are XGBoost, DistilBERT, and Fast Logistic Regression.
- The default selection for input fields is preset, but you can choose which fields are used in the Input Fields dropdown. Your field preferences are saved.
AutoML defaults to the latest training set. The progress of model runs can be tracked from the Models sidebar. AutoML jobs are large and can take 10-12 hours to complete. However, while AutoML runs, you can:
- See the best model that has been trained so far. You can register that model, and start working with it as the AutoML job continues.
- View all the configs that have been trained so far under the best model trained so far panel. This allows you to view the search space for the AutoML run, and understand how different hyperparameters impact the model score.
Once an AutoML run is complete, you can see the best performing model across all the models trained in the job. If you hover next to the model name, you can see which hyperparameters produced the best performing model.
Span aware models
For extraction and entity classification applications, we provide span-aware models that use features from the extracted spans and their contexts. To adjust the default options, set the values under `extraction_options` in the model config (under Model options):
- `left_context_len`: The number of characters to the left of the span to include in the context.
- `right_context_len`: The number of characters to the right of the span to include in the context.
- `mask_span`: Replace the span text itself with a special `[Span]` token. This helps prevent overfitting to the exact values of the spans in the dataset.
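These options can be illustrated with a small sketch. The parameter names follow the config keys above, but the function itself is ours, not Snorkel Flow internals:

```python
def span_features(text, start, end,
                  left_context_len=20, right_context_len=20, mask_span=False):
    """Build the model input from a span plus surrounding character context."""
    left = text[max(0, start - left_context_len):start]
    span = "[Span]" if mask_span else text[start:end]
    right = text[end:end + right_context_len]
    return left + span + right

doc = "Total amount due: $1,234.56 by March 1."

# Context window around the "$1,234.56" span (characters 18-27):
window = span_features(doc, 18, 27, left_context_len=12, right_context_len=9)

# With mask_span=True the model sees "[Span]" instead of "$1,234.56",
# which discourages memorizing exact span values:
masked = span_features(doc, 18, 27, mask_span=True)
```

Masking is especially useful when span values are near-unique (amounts, IDs), since the model must then learn from the context rather than the value itself.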
Distributed training with Ray Train
Beginning in version 0.91, Snorkel Flow supports distributed training with Ray Train for certain models. Enabling this feature moves model training from a single-threaded process in model-trainer to a distributed job on the ray-gpu cluster. Initially, this supports the HuggingFace BERT family of models for single-label classification.
Given its experimental nature, this feature may lead to issues or unexpected behaviors. If problems arise, disable Ray Train in your user settings and retry the training job. Additionally, for support with large training jobs, administrators can contact the success team. Any feedback on the feature is welcome.
When to consider enabling Ray Train
This feature is experimental, and turned off by default. Due to its development stage, it is recommended that you keep this feature turned off to maintain stability in your model training processes. That being said, here are some use cases where you may want to turn on Ray Train:
- You have a single label classification application. Currently this is the only application type that is supported.
- You are training a BERT or DistilBERT model. Logistic regression, XGBoost, and all other model frameworks will not be affected by Ray Train.
- You are facing out-of-memory errors with large datasets. This may show up as a snackbar error or crash in your model training job.
- Snorkel Flow has access to two or more GPUs. Ray Train will only lead to performance gains if your instance has access to multiple GPUs.
How to enable Ray Train
Follow these steps to enable Ray Train on your Snorkel Flow instance:
- Click your user name in the bottom right corner of your screen.
- Click User Settings, then click Feature Flag Management.
- Toggle the RAY_TRAIN flag to the on position.
After activation, model training can be configured as usual. Training jobs for BERT and DistilBERT will automatically run on the distributed system. They may run slower, but with a reduced memory footprint.