Version: 0.96

snorkelflow.sdk.FTDataset

class snorkelflow.sdk.FTDataset(df, dataset_uid, label_schema_uid, model_node_uid)

Bases: object

__init__(df, dataset_uid, label_schema_uid, model_node_uid)

Methods

__init__(df, dataset_uid, label_schema_uid, ...)

append(ft_dataset)

Append the given FTDataset to the current FTDataset.

create_annotation_batches([assignees])

Create annotation batches for the FTDataset.

export_data(format, filepath)

Export the data in the FTDataset to the specified format and write to the provided filepath.

filter([source_uids, splits, x_uids, ...])

Filter the dataset based on the given filters.

get_data()

Get the data associated with the fine tuning dataset.

get_x_uids()

Get the x_uids in the FTDataset.

mix(mix_on, weights, n_samples[, seed])

Mix the dataset by split, source_uid, or slice based on the given weights, returning up to n_samples samples.

sample(n[, seed])

Sample n samples from the FTDataset.

save(name)

Save the FTDataset as a slice.

set_as_dev_set()

Resample the x_uids within the FTDataset as the dev set for the fine tuning application.

append(ft_dataset)

Append the given FTDataset to the current FTDataset.

Parameters:

ft_dataset (FTDataset) – The FTDataset to append to the current FTDataset

Returns:

The appended FTDataset

Return type:

FTDataset

create_annotation_batches(assignees=None)

Create annotation batches for the FTDataset. One batch is created for each split that contains x_uids from the dataset.

Parameters:

assignees (Optional[List[int]], default: None) – The user uids of the assignees of the annotation batch

Returns:

The created annotation batches, one per split

Return type:

List[Dict[str, Any]]

export_data(format, filepath)

Export the data in the FTDataset to the specified format and write to the provided filepath.

Parameters:
  • format (ExportFormat) – The format to export the data to.

  • filepath (str) – The filepath to write the exported data to.

Return type:

None

filter(source_uids=None, splits=None, x_uids=None, feature_hashes=None, slices=None, has_gt=None)

Filter the dataset based on the given filters.

Parameters:
  • source_uids (Optional[List[int]], default: None) – The source uids to filter by

  • splits (Optional[List[str]], default: None) – The splits to filter by

  • x_uids (Optional[List[str]], default: None) – The x uids to filter by

  • feature_hashes (Optional[List[str]], default: None) – The feature hashes to filter by

  • slices (Optional[List[Slice]], default: None) – The slices to filter by. Rows that belong to at least one of the slices are included

  • has_gt (Optional[bool], default: None) – Filter by the existence / non-existence of ground truth

Returns:

The filtered dataset

Return type:

FTDataset
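
The filter semantics described above can be illustrated with a small standalone sketch. It does not call the SDK. It assumes that the individual filters are combined with AND (not stated on this page), models each slice as a plain set of x_uids, and uses made-up row fields (`x_uid`, `source_uid`, `split`, `gt`) and x_uid values. Only the "at least one slice" rule and the meaning of `has_gt` come from the parameter descriptions.

```python
from typing import Any, Dict, List, Optional, Set

Row = Dict[str, Any]

# Hypothetical rows; the field names and values are illustrative only.
ROWS: List[Row] = [
    {"x_uid": "doc::0", "source_uid": 1, "split": "train", "gt": "yes"},
    {"x_uid": "doc::1", "source_uid": 2, "split": "train", "gt": None},
    {"x_uid": "doc::2", "source_uid": 1, "split": "valid", "gt": None},
]


def filter_sketch(
    rows: List[Row],
    source_uids: Optional[List[int]] = None,
    splits: Optional[List[str]] = None,
    slices: Optional[List[Set[str]]] = None,
    has_gt: Optional[bool] = None,
) -> List[Row]:
    """Keep rows that pass every filter that is not None (assumed AND)."""
    kept = []
    for row in rows:
        if source_uids is not None and row["source_uid"] not in source_uids:
            continue
        if splits is not None and row["split"] not in splits:
            continue
        # A row passes the slice filter if it is in at least one slice.
        if slices is not None and not any(row["x_uid"] in s for s in slices):
            continue
        # has_gt=True keeps rows with ground truth, has_gt=False keeps rows without.
        if has_gt is not None and (row["gt"] is not None) != has_gt:
            continue
        kept.append(row)
    return kept


train_without_gt = filter_sketch(ROWS, splits=["train"], has_gt=False)
in_either_slice = filter_sketch(ROWS, slices=[{"doc::0"}, {"doc::2"}])
```

Passing no filters returns every row, which matches the `None` defaults in the signature.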

get_data()

Get the data associated with the fine tuning dataset.

Returns:

The data associated with the fine tuning dataset

Return type:

pd.DataFrame

get_x_uids()

Get the x_uids in the FTDataset.

Returns:

The x_uids in the FTDataset

Return type:

List[str]

mix(mix_on, weights, n_samples, seed=123)

Mix the dataset by split, source_uid, or slice based on the given weights, returning up to n_samples samples. Stratified sampling determines which datapoints are included in the returned FTDataset, and datapoints are sampled without replacement. You may receive fewer than n_samples datapoints if there is not enough data to sample while exactly respecting the weights. If the weight distribution cannot be honored, a warning is shown when datapoints from a given population cannot be included because its weight is too low. If the subsets of data being mixed overlap, some subsets may make up a larger share of the returned data than the weights specify.

Parameters:
  • mix_on (MixOn) – The attribute to mix the dataset by.

  • weights (Dict[Union[str, int], int]) –

    The weights to use for the mix, e.g.:

    {
    "slice1": 3,
    "slice2": 5
    }
    # The resulting data returned would be 3/8ths from "slice1" and 5/8ths from "slice2".

  • n_samples (int) – The maximum number of samples to return.

  • seed (int, default: 123) – The seed to use for the random sampling.

Returns:

The mixed dataset

Return type:

FTDataset
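
The weight arithmetic above (3/8 and 5/8) and the "fewer than n_samples" case can be worked through with a standalone sketch. This is not the SDK's implementation: `mix_quotas` and `mix_sketch` are hypothetical helpers, groups are plain lists of x_uids, quotas are rounded down, and a group that is too small is simply capped at its size. The SDK may resolve shortfalls differently.

```python
import random
from typing import Dict, Hashable, List


def mix_quotas(weights: Dict[Hashable, int], n_samples: int) -> Dict[Hashable, int]:
    """Split n_samples across groups in proportion to integer weights (rounded down)."""
    total = sum(weights.values())
    return {group: (n_samples * w) // total for group, w in weights.items()}


def mix_sketch(
    groups: Dict[Hashable, List[str]],
    weights: Dict[Hashable, int],
    n_samples: int,
    seed: int = 123,
) -> List[str]:
    """Draw each group's quota without replacement, capped at the group size."""
    rng = random.Random(seed)
    picked: List[str] = []
    for group, quota in mix_quotas(weights, n_samples).items():
        pool = groups.get(group, [])
        picked.extend(rng.sample(pool, min(quota, len(pool))))
    return picked


weights = {"slice1": 3, "slice2": 5}
quotas = mix_quotas(weights, 8)  # {'slice1': 3, 'slice2': 5}

# "slice2" holds only 4 items, so the result has 3 + 4 = 7 items, not 8.
groups = {
    "slice1": [f"a{i}" for i in range(10)],
    "slice2": [f"b{i}" for i in range(4)],
}
mixed = mix_sketch(groups, weights, n_samples=8)
```

Because the random generator is seeded, calling `mix_sketch` again with the same inputs returns the same selection.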

sample(n, seed=None)

Sample n samples from the FTDataset.

Parameters:
  • n (int) – The number of samples to sample.

  • seed (Union[int, RandomState, None], default: None) – The seed to use for the random sampling.

Returns:

The sampled dataset

Return type:

FTDataset
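
The role of `seed` can be illustrated with a standalone sketch that uses only the Python standard library rather than the SDK. `sample_sketch` is a hypothetical helper. Sampling without replacement is an assumption here, since this page does not say how `sample` draws.

```python
import random
from typing import List, Optional


def sample_sketch(x_uids: List[str], n: int, seed: Optional[int] = None) -> List[str]:
    """Draw n distinct x_uids; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    # random.sample raises ValueError if n exceeds len(x_uids); how
    # FTDataset.sample treats that case is not documented on this page.
    return rng.sample(x_uids, n)


x_uids = [f"doc::{i}" for i in range(20)]  # made-up x_uid values
first = sample_sketch(x_uids, 5, seed=7)
second = sample_sketch(x_uids, 5, seed=7)  # same seed, same selection
```

With `seed=None` the generator is seeded from system entropy, so repeated calls generally return different selections.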

save(name)

Save the FTDataset as a slice. Use the FTDataset.filter method to restore the FTDataset from the slice.

Parameters:

name (str) – The name of the slice to save

Returns:

The created slice

Return type:

Slice

set_as_dev_set()

Resample the x_uids within the FTDataset as the dev set for the fine tuning application. Note that a dev set may contain only x_uids that are in the train set.

Return type:

None