Version: 25.3

snorkelflow.sdk.FTDataset

class snorkelflow.sdk.FTDataset(df, dataset_uid, label_schema_uid, model_node_uid)

Bases: object

__init__

__init__(df, dataset_uid, label_schema_uid, model_node_uid)

Methods

__init__(df, dataset_uid, label_schema_uid, ...)

append(ft_dataset)

Append the given FTDataset to the current FTDataset.

create_annotation_batches([assignees])

Create annotation batches for the FT dataset, one per split.

export_data(format, filepath)

Export the data in the FTDataset to the specified format and write to the provided filepath.

filter([source_uids, splits, x_uids, ...])

Filter the dataset based on the given filters.

get_data()

Get the data associated with the fine tuning dataset.

get_x_uids()

Get the x_uids in the FTDataset.

mix(mix_on, weights, n_samples[, seed])

Mix the dataset by split, source_uid, or slice based on the given weights, returning up to n_samples samples.

sample(n[, seed])

Sample n samples from the FTDataset.

save(name)

Save the FTDataset as a slice.

set_as_dev_set()

Resample the x_uids within the FTDataset as the dev set for the fine tuning application.

append

append(ft_dataset)

Append the given FTDataset to the current FTDataset.

Parameters

Name | Type | Default | Info
ft_dataset | FTDataset | | The FTDataset to append to the current FTDataset.

Returns

The appended FTDataset

Return type

FTDataset

create_annotation_batches

create_annotation_batches(assignees=None)

Create annotation batches for the FT dataset. One batch is created for each split that contains x_uids from the dataset.

Parameters

Name | Type | Default | Info
assignees | Optional[List[int]] | None | The user uids of the assignees of the annotation batch.

Returns

The created annotation batches

Return type

List[Dict[str, Any]]

export_data

export_data(format, filepath)

Export the data in the FTDataset to the specified format and write to the provided filepath.

Parameters

Name | Type | Default | Info
format | ExportFormat | | The format to export the data to.
filepath | str | | The filepath to write the exported data to.

Return type

None

filter

filter(source_uids=None, splits=None, x_uids=None, feature_hashes=None, slices=None, has_gt=None)

Filter the dataset based on the given filters.

Parameters

Name | Type | Default | Info
source_uids | Optional[List[int]] | None | The source uids to filter by.
splits | Optional[List[str]] | None | The splits to filter by.
x_uids | Optional[List[str]] | None | The x uids to filter by.
feature_hashes | Optional[List[str]] | None | The feature hashes to filter by.
slices | Optional[List[Slice]] | None | The slices to filter by. Rows within at least one slice are included.
has_gt | Optional[bool] | None | Filter by the existence or non-existence of ground truth.

Returns

The filtered dataset

Return type

FTDataset
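
The filter semantics described above can be illustrated with a small, self-contained sketch that does not call the SDK. It assumes that separate filter arguments combine with AND (the reference does not state this), and it follows the documented rule that a row matching at least one of the given slices is included. The row layout, the `filter_rows` helper, the `doc::N` identifiers, and the representation of a slice as a set of x_uids are all hypothetical.

```python
from typing import Any, Dict, List, Optional, Set

Row = Dict[str, Any]


def filter_rows(
    rows: List[Row],
    splits: Optional[List[str]] = None,
    x_uids: Optional[List[str]] = None,
    slices: Optional[List[Set[str]]] = None,
    has_gt: Optional[bool] = None,
) -> List[Row]:
    """Keep rows that pass every filter that was provided (None means 'no filter')."""
    kept = []
    for row in rows:
        if splits is not None and row["split"] not in splits:
            continue
        if x_uids is not None and row["x_uid"] not in x_uids:
            continue
        # Slices are OR-ed: membership in any one slice is enough.
        if slices is not None and not any(row["x_uid"] in s for s in slices):
            continue
        # has_gt=True keeps labeled rows, has_gt=False keeps unlabeled rows.
        if has_gt is not None and (row["label"] is not None) != has_gt:
            continue
        kept.append(row)
    return kept


# Hypothetical rows; a real FTDataset holds these in a DataFrame.
rows = [
    {"x_uid": "doc::0", "split": "train", "label": "yes"},
    {"x_uid": "doc::1", "split": "train", "label": None},
    {"x_uid": "doc::2", "split": "valid", "label": "no"},
]

in_any_slice = filter_rows(rows, slices=[{"doc::0"}, {"doc::2"}])
train_with_gt = filter_rows(rows, splits=["train"], has_gt=True)

print([r["x_uid"] for r in in_any_slice])   # ['doc::0', 'doc::2']
print([r["x_uid"] for r in train_with_gt])  # ['doc::0']
```

Leaving an argument as None skips that filter entirely, which mirrors the None defaults in the signature.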

get_data

get_data()

Get the data associated with the fine tuning dataset.

Returns

The data associated with the fine tuning dataset

Return type

pd.DataFrame

get_x_uids

get_x_uids()

Get the x_uids in the FTDataset.

Returns

The x_uids in the FTDataset

Return type

List[str]

mix

mix(mix_on, weights, n_samples, seed=123)

Mix the dataset by split, source_uid, or slice based on the given weights, returning up to n_samples samples. Stratified sampling determines which datapoints to include in the returned FTDataset, and datapoints are sampled without replacement. Note that you may receive fewer than n_samples datapoints if there is not enough data to sample while exactly respecting the weights. If the weight distribution cannot be honored because a population's weight is too low to include any of its datapoints, you will see a warning message. If the subsets of data being mixed overlap, some data may be represented at a higher proportion than the weights specify.

Parameters

Name | Type | Default | Info
mix_on | MixOn | | The attribute to mix the dataset by.
weights | Dict[Union[str, int], int] | | The weights to use for the mix (see the example below).
n_samples | int | | The number of samples to sample.
seed | int | 123 | The seed to use for the random sampling.

Example weights:

    {
        "slice1": 3,
        "slice2": 5
    }
    # The resulting data returned would be 3/8ths from "slice1" and 5/8ths from "slice2".

Returns

The mixed dataset

Return type

FTDataset
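
The weight arithmetic behind mix can be sketched in plain Python. This is an illustration only, not the SDK's implementation: the helper names `proportional_counts` and `fit_to_available`, the rounding-down rule, and the way the request is shrunk when a group runs short are all assumptions made for the example. It shows how integer weights translate into per-group counts, why a very low weight can leave a group with no datapoints, and why the result can contain fewer than n_samples datapoints.

```python
from typing import Dict, Union

Key = Union[str, int]


def proportional_counts(weights: Dict[Key, int], n_samples: int) -> Dict[Key, int]:
    """Split n_samples across groups in proportion to their weights, rounding down."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {key: (n_samples * w) // total for key, w in weights.items()}


def fit_to_available(
    weights: Dict[Key, int], n_samples: int, available: Dict[Key, int]
) -> Dict[Key, int]:
    """Shrink the request until every group's share fits the rows it actually has."""
    total = sum(weights.values())
    # Largest overall request each group could support at its weight share.
    limits = [available.get(key, 0) * total // w for key, w in weights.items() if w > 0]
    feasible = min([n_samples] + limits)
    return proportional_counts(weights, feasible)


# 3/8 of 80 and 5/8 of 80.
print(proportional_counts({"slice1": 3, "slice2": 5}, 80))
# {'slice1': 30, 'slice2': 50}

# A weight that is too low yields no datapoints for that group.
print(proportional_counts({"rare": 1, "common": 99}, 10))
# {'rare': 0, 'common': 9}

# Only 12 rows exist in slice1, so keeping the 3:5 ratio caps the total at 32.
print(fit_to_available({"slice1": 3, "slice2": 5}, 80, {"slice1": 12, "slice2": 100}))
# {'slice1': 12, 'slice2': 20}
```

The last call shows the trade-off the description mentions: respecting the weights exactly takes priority over reaching the requested sample count.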

sample

sample(n, seed=None)

Sample n samples from the FTDataset.

Parameters

Name | Type | Default | Info
n | int | | The number of samples to sample.
seed | Union[int, RandomState, None] | None | The seed to use for the random sampling.

Returns

The sampled dataset

Return type

FTDataset
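
The role of the seed argument can be shown with a standard-library sketch. It does not use the SDK, and it assumes sampling without replacement and a cap at the number of available x_uids, neither of which the reference confirms for sample. The `sample_x_uids` helper and the `doc::N` identifiers are hypothetical.

```python
import random
from typing import List, Optional


def sample_x_uids(x_uids: List[str], n: int, seed: Optional[int] = None) -> List[str]:
    """Draw up to n distinct x_uids; the same seed gives the same draw."""
    rng = random.Random(seed)  # seed=None falls back to a non-deterministic seed
    return rng.sample(x_uids, min(n, len(x_uids)))


x_uids = [f"doc::{i}" for i in range(10)]  # hypothetical identifiers

first = sample_x_uids(x_uids, 3, seed=7)
second = sample_x_uids(x_uids, 3, seed=7)

print(first == second)   # True: a fixed seed makes the sample reproducible
print(len(set(first)))   # 3: no x_uid is drawn twice
```

Passing a fixed seed is what makes a sampled subset repeatable across runs; with the default of None, each call may return a different subset.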

save

save(name)

Save the FTDataset as a slice. Use the FTDataset.filter method to restore the FTDataset from the slice.

Parameters

Name | Type | Default | Info
name | str | | The name of the slice to save.

Returns

The created slice

Return type

Slice

set_as_dev_set

set_as_dev_set()

Resample the x_uids within the FTDataset as the dev set for the fine tuning application. Note that a dev set must only contain x_uids that are in the train set.

Return type

None
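
Because a dev set must only contain x_uids that are in the train set, it can help to check that constraint before calling set_as_dev_set. The sketch below is a plain-Python pre-check, not part of the SDK. The `x_uids_missing_from_train` helper and the example identifiers are hypothetical, and in practice the two lists would presumably come from get_x_uids on the candidate dataset and on a train-only dataset.

```python
from typing import Iterable, List


def x_uids_missing_from_train(
    dev_x_uids: Iterable[str], train_x_uids: Iterable[str]
) -> List[str]:
    """Return the candidate dev x_uids that are not part of the train set."""
    return sorted(set(dev_x_uids) - set(train_x_uids))


train = ["doc::0", "doc::1", "doc::2"]  # hypothetical train-split x_uids

print(x_uids_missing_from_train(["doc::0", "doc::2"], train))  # []
print(x_uids_missing_from_train(["doc::1", "doc::9"], train))  # ['doc::9']
```

An empty result means every candidate x_uid belongs to the train set, so the constraint is satisfied.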