Version: 25.3

snorkelflow.sdk.FTDataset

class snorkelflow.sdk.FTDataset(df, dataset_uid, label_schema_uid, model_node_uid)

Bases: object

__init__(df, dataset_uid, label_schema_uid, model_node_uid)

Methods

`__init__`(df, dataset_uid, label_schema_uid, ...)
`append`(ft_dataset)	Append the given FTDataset to the current FTDataset.
`create_annotation_batches`([assignees])	Create an annotation batch for the ft dataset.
`export_data`(format, filepath)	Export the data in the FTDataset to the specified format and write to the provided filepath.
`filter`([source_uids, splits, x_uids, ...])	Filter the dataset based on the given filters.
`get_data`()	Get the data associated with the fine tuning dataset.
`get_x_uids`()	Get the x_uids in the FTDataset.
`mix`(mix_on, weights, n_samples[, seed])	Mix the dataset by split, source_uid, or slice based on the given weights, returning up to n_samples samples.
`sample`(n[, seed])	Sample n samples from the FTDataset.
`save`(name)	Save the FTDataset as a slice.
`set_as_dev_set`()	Resample the x_uids within the FTDataset as the dev set for the fine tuning application.

append

append(ft_dataset)

Append the given FTDataset to the current FTDataset.

Parameters

Name	Type	Default	Info
ft_dataset	`FTDataset`		The FTDataset to append to the current FTDataset.

Returns: The appended FTDataset
Return type: FTDataset

create_annotation_batches

create_annotation_batches(assignees=None)

Create annotation batches for the FTDataset. One batch is created for each split that contains x_uids from the dataset.

Parameters

Name	Type	Default	Info
assignees	`Optional[List[int]]`	`None`	The user uids of the assignees of the annotation batches.

Returns: The created annotation batches
Return type: List[Dict[str, Any]]

export_data

export_data(format, filepath)

Export the data in the FTDataset to the specified format and write to the provided filepath.

Parameters

Name	Type	Default	Info
format	`ExportFormat`		The format to export the data to.
filepath	`str`		The filepath to write the exported data to.

Return type: None

filter

filter(source_uids=None, splits=None, x_uids=None, feature_hashes=None, slices=None, has_gt=None)

Filter the dataset based on the given filters.

Parameters

Name	Type	Default	Info
source_uids	`Optional[List[int]]`	`None`	The source uids to filter by.
splits	`Optional[List[str]]`	`None`	The splits to filter by.
x_uids	`Optional[List[str]]`	`None`	The x uids to filter by.
feature_hashes	`Optional[List[str]]`	`None`	The feature hashes to filter by.
slices	`Optional[List[Slice]]`	`None`	The slices to filter by. Rows that fall within at least one of the slices are included.
has_gt	`Optional[bool]`	`None`	Filter by the existence or non-existence of ground truth.

Returns: The filtered dataset
Return type: FTDataset
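As a minimal sketch of how the keyword filters might be combined, the helper below calls `filter` with only the `splits` and `has_gt` keywords documented above. The helper name is hypothetical, it assumes `has_gt=True` selects rows that have ground truth, and this page does not state how multiple filters interact, so treat the combined behavior as something to verify on your own data.

```python
from typing import Any, List


def labeled_rows_in_splits(ft_dataset: Any, splits: List[str]) -> Any:
    """Return the rows of ft_dataset that are in `splits` and have ground truth.

    `ft_dataset` is expected to be an FTDataset. Only the documented
    filter(splits=..., has_gt=...) keywords are used; the helper itself
    is hypothetical and not part of the SDK.
    """
    return ft_dataset.filter(splits=splits, has_gt=True)


# Example call (the split name "train" is an assumption about your application):
# labeled_train = labeled_rows_in_splits(ft_dataset, ["train"])
```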

get_data

get_data()

Get the data associated with the fine tuning dataset.

Returns: The data associated with the fine tuning dataset
Return type: pd.DataFrame

get_x_uids

get_x_uids()

Get the x_uids in the FTDataset.

Returns: The x_uids in the FTDataset
Return type: List[str]

mix

mix(mix_on, weights, n_samples, seed=123)

Mix the dataset by split, source_uid, or slice based on the given weights, returning up to n_samples samples. Stratified sampling determines which datapoints are included in the returned FTDataset, and datapoints are sampled without replacement. You may receive fewer than n_samples datapoints if there is not enough data to sample while exactly respecting the weights. If the weight distribution cannot be honored, a warning is shown when datapoints from a given population cannot be included because its weight is too low. If the subsets of data being mixed overlap, a population may end up with a larger share of the result than its weight specifies.

Parameters

Name	Type	Default	Info
mix_on	`MixOn`		The attribute to mix the dataset by.
weights	`Dict[Union[str, int], int]`		The weights to use for the mix, e.g. `{ "slice1": 3, "slice2": 5 }`: the returned data would be 3/8ths from "slice1" and 5/8ths from "slice2".
n_samples	`int`		The number of samples to sample.
seed	`int`	`123`	The seed to use for the random sampling.

Returns: The mixed dataset
Return type: FTDataset
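To make the weight arithmetic concrete, here is a small pure-Python sketch of one possible reading of the description above: per-population counts are kept in the exact ratio of the weights, and the total shrinks when a population does not have enough data. This is not the SDK's implementation. The function name, the `available` argument, and the rounding rule are all assumptions for illustration, and overlap between populations is ignored.

```python
from typing import Dict, Union

Key = Union[str, int]


def planned_counts(
    weights: Dict[Key, int],
    n_samples: int,
    available: Dict[Key, int],
) -> Dict[Key, int]:
    """Illustrative only: per-population counts in the exact ratio of `weights`.

    The result never exceeds `n_samples` in total and never asks a
    population for more datapoints than `available` says it has.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # One "unit" holds `weight` datapoints from each population.
    units = n_samples // total_weight
    for key, weight in weights.items():
        if weight > 0:
            units = min(units, available.get(key, 0) // weight)
    return {key: units * weight for key, weight in weights.items()}


# Enough data everywhere: 3/8 and 5/8 of 80 samples.
full = planned_counts({"slice1": 3, "slice2": 5}, 80, {"slice1": 100, "slice2": 100})
print(full)   # {'slice1': 30, 'slice2': 50}

# "slice2" only has 20 datapoints, so the total shrinks to keep the 3:5 ratio.
short = planned_counts({"slice1": 3, "slice2": 5}, 80, {"slice1": 100, "slice2": 20})
print(short)  # {'slice1': 12, 'slice2': 20}
```

In the second call the result holds 32 datapoints rather than 80, which is the kind of shortfall the description warns about.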

sample

sample(n, seed=None)

Sample n samples from the FTDataset.

Parameters

Name	Type	Default	Info
n	`int`		The number of samples to sample.
seed	`Union[int, RandomState, None]`	`None`	The seed to use for the random sampling.

Returns: The sampled dataset
Return type: FTDataset

save

save(name)

Save the FTDataset as a slice. Use the FTDataset.filter method to restore the FTDataset from the slice.

Parameters

Name	Type	Default	Info
name	`str`		The name of the slice to save.

Returns: The created slice
Return type: Slice
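The save-and-restore round trip described above could be sketched as follows. The helper name is hypothetical, and it assumes you still hold a broader FTDataset (`full_dataset`) that contains the saved rows, since `filter` is called on an existing FTDataset. Only the documented `save(name)` and `filter(slices=...)` calls are used.

```python
from typing import Any, Tuple


def save_then_restore(full_dataset: Any, subset: Any, name: str) -> Tuple[Any, Any]:
    """Save `subset` as a slice, then rebuild it by filtering `full_dataset`.

    Both arguments are expected to be FTDataset objects; this helper is
    hypothetical and not part of the SDK.
    """
    saved_slice = subset.save(name)  # documented to return a Slice
    restored = full_dataset.filter(slices=[saved_slice])
    return saved_slice, restored
```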

set_as_dev_set

set_as_dev_set()

Resample the x_uids within the FTDataset as the dev set for the fine tuning application. Note that a dev set must only contain x_uids that are in the train set.

Return type: None
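Because a dev set must only contain x_uids that are in the train set, it may be worth checking that constraint before calling `set_as_dev_set`. The sketch below is one way to do so using only `get_x_uids` and `set_as_dev_set`. Both helper names are hypothetical, and it assumes you already have an FTDataset holding the train split (for example, obtained through `filter`).

```python
from typing import Any, List


def x_uids_missing_from_train(candidate: Any, train: Any) -> List[str]:
    """Return the candidate x_uids that do not appear in the train dataset."""
    train_uids = set(train.get_x_uids())
    return sorted({uid for uid in candidate.get_x_uids() if uid not in train_uids})


def set_dev_set_if_valid(candidate: Any, train: Any) -> None:
    """Call set_as_dev_set() only if every candidate x_uid is in the train set."""
    missing = x_uids_missing_from_train(candidate, train)
    if missing:
        raise ValueError(f"{len(missing)} x_uids are not in the train set: {missing[:5]}")
    candidate.set_as_dev_set()
```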
