snorkelflow.sdk.FTDataset
- class snorkelflow.sdk.FTDataset(df, dataset_uid, label_schema_uid, model_node_uid)
Bases: object
- __init__(df, dataset_uid, label_schema_uid, model_node_uid)
Methods

| Method | Description |
| --- | --- |
| __init__(df, dataset_uid, label_schema_uid, ...) | |
| append(ft_dataset) | Append the given FTDataset to the current FTDataset. |
| create_annotation_batches([assignees]) | Create an annotation batch for the ft dataset. |
| export_data(format, filepath) | Export the data in the FTDataset to the specified format and write to the provided filepath. |
| filter([source_uids, splits, x_uids, ...]) | Filter the dataset based on the given filters. |
| get_data() | Get the data associated with the fine tuning dataset. |
| get_x_uids() | Get the x_uids in the FTDataset. |
| mix(mix_on, weights, n_samples[, seed]) | Mix the dataset by split, source_uid, or slice based on the given weights, returning up to n_samples samples. |
| sample(n[, seed]) | Sample n samples from the FTDataset. |
| save(name) | Save the FTDataset as a slice. |
| | Resample the x_uids within the FTDataset as the dev set for the fine tuning application. |

- append(ft_dataset)
Append the given FTDataset to the current FTDataset.
- create_annotation_batches(assignees=None)
Create annotation batches for the FTDataset. One batch is created for each split that contains x_uids from the dataset.
- export_data(format, filepath)
Export the data in the FTDataset to the specified format and write to the provided filepath.
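
As a rough usage sketch, the helper below chains `filter`, `sample`, and `export_data` using only the signatures documented on this page. The helper name `export_labeled_sample` is hypothetical, the split name `"train"` is an assumption, and the code assumes `sample` returns an FTDataset that can be exported. The accepted values of `format` are not listed here, so the value is passed through unchanged.

```python
def export_labeled_sample(ft_dataset, n, export_format, filepath, seed=0):
    """Filter to labeled training rows, sample n of them, and export the result."""
    # "train" is an assumed split name; substitute the splits your dataset uses.
    labeled = ft_dataset.filter(splits=["train"], has_gt=True)
    # Assumption: sample() returns an FTDataset, so export_data() can be chained.
    subset = labeled.sample(n, seed=seed)
    # export_format is passed through unchanged; supported values are not listed here.
    subset.export_data(export_format, filepath)
    return subset
```

Passing an explicit `seed` keeps the sampled subset stable across runs, which may be useful when the exported file is compared between experiments.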
- filter(source_uids=None, splits=None, x_uids=None, feature_hashes=None, slices=None, has_gt=None)
Filter the dataset based on the given filters.
Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| source_uids | Optional[List[int]] | None | The source uids to filter by. |
| splits | Optional[List[str]] | None | The splits to filter by. |
| x_uids | Optional[List[str]] | None | The x uids to filter by. |
| feature_hashes | Optional[List[str]] | None | The feature hashes to filter by. |
| slices | Optional[List[Slice]] | None | The slices to filter by; rows within at least one slice will be included. |
| has_gt | Optional[bool] | None | Filter by the existence / non-existence of ground truth. |

Returns
The filtered dataset
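
The filter semantics can be illustrated with a small plain-Python sketch. This is not the SDK's implementation: the row dictionaries, the `keep_row` helper, and the modeling of a slice as a set of x_uids are all hypothetical, and the sketch assumes that separate filters combine with AND. What it does take from the parameter descriptions is that `None` disables a filter, that a row passes the slice filter if it is in at least one slice, and that `has_gt` selects on the presence or absence of ground truth.

```python
from typing import Dict, List, Optional, Set


def keep_row(
    row: Dict[str, object],
    splits: Optional[List[str]] = None,
    slices: Optional[List[Set[str]]] = None,
    has_gt: Optional[bool] = None,
) -> bool:
    """Hypothetical stand-in showing how the filter arguments could combine."""
    # None means "do not filter on this attribute".
    if splits is not None and row["split"] not in splits:
        return False
    # A row passes the slice filter if it belongs to at least one slice.
    if slices is not None and not any(row["x_uid"] in s for s in slices):
        return False
    # has_gt=True keeps labeled rows; has_gt=False keeps unlabeled rows.
    if has_gt is not None and (row["label"] is not None) != has_gt:
        return False
    return True


rows = [
    {"x_uid": "a", "split": "train", "label": 1},
    {"x_uid": "b", "split": "train", "label": None},
    {"x_uid": "c", "split": "valid", "label": 0},
]
kept = [r["x_uid"] for r in rows if keep_row(r, splits=["train"], has_gt=True)]
print(kept)  # ['a']
```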
- get_data()
Get the data associated with the fine tuning dataset.
- get_x_uids()
Get the x_uids in the FTDataset.
- mix(mix_on, weights, n_samples, seed=123)
Mix the dataset by split, source_uid, or slice based on the given weights, returning up to n_samples datapoints. Stratified sampling determines which datapoints are included in the returned FTDataset, and datapoints are sampled without replacement. Note that you may receive fewer than n_samples datapoints if there is not enough data to sample while exactly respecting the weights. If the weight distribution cannot be honored, a warning is shown when datapoints from a given population cannot be included because its weight is too low. If the subsets of data being mixed overlap, a subset's share of the returned data may be higher than its weight specifies.
Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| mix_on | MixOn | | The attribute to mix the dataset by. |
| weights | Dict[Union[str, int], int] | | The weights to use for the mix, e.g. {"slice1": 3, "slice2": 5}, in which case 3/8ths of the returned data comes from "slice1" and 5/8ths from "slice2". |
| n_samples | int | | The number of samples to sample. |
| seed | int | 123 | The seed to use for the random sampling. |

Returns
The mixed dataset
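
To make the weight arithmetic concrete, here is a minimal plain-Python sketch of weighted mixing without replacement. It is an illustration only and not the SDK's algorithm: the `mix_by_weights` helper and the example x_uid strings are hypothetical, and the sketch simply caps each group at its available size, which may differ from how the SDK handles shortfalls.

```python
import random
from typing import Dict, Hashable, List


def mix_by_weights(
    groups: Dict[Hashable, List[str]],
    weights: Dict[Hashable, int],
    n_samples: int,
    seed: int = 123,
) -> List[str]:
    """Illustrative stand-in for weighted mixing; not the SDK's implementation."""
    rng = random.Random(seed)
    total_weight = sum(weights.values())
    mixed: List[str] = []
    for key, weight in weights.items():
        members = groups.get(key, [])
        # Target share of this group, rounded down to a whole number of rows.
        target = n_samples * weight // total_weight
        # Sampling without replacement cannot return more rows than exist.
        take = min(target, len(members))
        mixed.extend(rng.sample(members, take))
    return mixed


groups = {
    "slice1": [f"s1-{i}" for i in range(100)],
    "slice2": [f"s2-{i}" for i in range(100)],
}
mixed = mix_by_weights(groups, {"slice1": 3, "slice2": 5}, n_samples=80)
slice1_ids = set(groups["slice1"])
from_slice1 = sum(uid in slice1_ids for uid in mixed)
print(from_slice1, len(mixed) - from_slice1)  # 30 50
```

With weights 3 and 5 and 80 requested samples, the targets are 80 * 3 / 8 = 30 and 80 * 5 / 8 = 50. If "slice1" held only 10 rows, this sketch would return 10 + 50 = 60 rows, which mirrors the note that fewer than n_samples datapoints may be returned.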
- sample(n, seed=None)
Sample n samples from the FTDataset.
- save(name)
Save the FTDataset as a slice. Use the FTDataset.filter method to restore the FTDataset from the slice.