Version: 0.93

snorkelflow.client.nodes.get_dataset

snorkelflow.client.nodes.get_dataset(node, split='dev', batch_uid=None, combiner='AND', show_filtered_flag=False, gt_label=None, all_lfs_filter=None, no_lfs_filter=None, model_filters=None, lf_filters=None, training_set_filters=None, include_tag_uids=None, include_tag_type_uids=None, exclude_tag_uids=None, exclude_tag_type_uids=None)

Filter the dataset and return it as a DataFrame that includes GT labels and is indexed by data point UID.

You can also filter for only data points that have a certain ground truth label or a certain predicted label by a machine learning model. This filter is handy when you want to do your own analysis on an error bucket.

By default, multiple conditions are combined with AND semantics (see the combiner parameter).
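To make the AND semantics concrete, the sketch below reproduces the idea locally on a toy pandas DataFrame. It does not call Snorkel Flow: the column names (`gt`, `model_1_pred`), the UID strings, and the helper `combine_and` are hypothetical and chosen only for illustration. The actual filtering is performed by `get_dataset` itself.

```python
from functools import reduce
import operator

import pandas as pd

# Toy stand-in for a dataset; column names and uids are hypothetical.
toy = pd.DataFrame(
    {
        "gt": ["SPAM", "SPAM", "HAM", "HAM"],
        "model_1_pred": ["HAM", "SPAM", "HAM", "SPAM"],
    },
    index=["doc::0", "doc::1", "doc::2", "doc::3"],
)


def combine_and(masks):
    """Combine boolean masks so a row is kept only if every mask is True."""
    return reduce(operator.and_, masks)


# Two conditions: ground truth is SPAM, and the model predicted HAM.
masks = [toy["gt"] == "SPAM", toy["model_1_pred"] == "HAM"]
error_bucket = toy[combine_and(masks)]
print(error_bucket)
```

Combining a ground truth condition with a disagreeing model prediction, as here, is one way to isolate an error bucket for further analysis.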

Examples

# Keep data points where all LFs abstain.
df = sf.get_dataset(node, all_lfs_filter="UNKNOWN")

# Keep data points whose GT label is "LABEL".
df = sf.get_dataset(node, gt_label="LABEL")

# Keep data points that satisfy a set of rules.
df = sf.get_dataset(
    node,
    gt_label="LABEL",
    model_filters=[(1, "LABEL"), (2, "LABEL")],
    lf_filters=[("my_lf", "LABEL")],
    training_set_filters=[(1, "LABEL")],
    combiner="AND",
)

# Generate column statistics and summaries for the dataset.
df = sf.get_dataset(node)
df.describe(include="all")

Parameters

Name | Type | Default | Info
node | int | (required) | UID of the node.
split | str | 'dev' | The split to load. Only "dev", "valid", and "test" splits are allowed. Defaults to "dev".
batch_uid | Optional[int] | None | UID of the annotation batch to filter by.
combiner | str | 'AND' | Combiner to apply across all filters. Defaults to "AND".
show_filtered_flag | bool | False | If True, return all data points and add a column called "filtered_flag" with "True" or "False" values specifying whether each data point matches the provided filters.
gt_label | Optional[str] | None | If set, include only data points whose ground truth is this label string.
all_lfs_filter | Optional[str] | None | If set, include only data points where all LFs vote for the given label string.
no_lfs_filter | Optional[str] | None | If set, include only data points where no LFs vote for the given label string.
model_filters | Optional[List[Tuple[int, str]]] | None | List of tuples where the first value is the model_id and the second is the predicted label or voting pattern. If provided, include only data points that match this pattern.
lf_filters | Optional[List[Tuple[str, str]]] | None | List of tuples where the first value is the LF name and the second is the assigned label or voting pattern. If provided, include only data points that match this pattern.
training_set_filters | Optional[List[Tuple[int, str]]] | None | List of tuples where the first value is the training set ID and the second is the assigned label or voting pattern. If provided, include only data points that match this pattern.
include_tag_uids | Optional[List[int]] | None | List of tag UIDs. Only data points that carry these tags are included.
include_tag_type_uids | Optional[List[int]] | None | List of tag type UIDs. Only data points that carry tags of these types are included.
exclude_tag_uids | Optional[List[int]] | None | List of tag UIDs. Only data points that do not carry these tags are included. Exclusion supersedes inclusion in case of collision.
exclude_tag_type_uids | Optional[List[int]] | None | List of tag type UIDs. Only data points that do not carry tags of these types are included. Exclusion supersedes inclusion in case of collision.
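The rule that exclusion supersedes inclusion can be sketched in plain Python. This is only an illustration of the stated precedence, not the library's implementation: the helper `passes_tag_filters` is hypothetical, and the sketch assumes that carrying any one of the included tags is enough to be included, which the reference above does not spell out.

```python
def passes_tag_filters(point_tags, include=None, exclude=None):
    """Return True if a data point with `point_tags` survives the tag filters.

    Assumed semantics (hypothetical helper, not part of snorkelflow):
    - a point carrying any excluded tag is dropped, even if it also
      carries an included tag (exclusion wins on collision);
    - if an include list is given, the point must carry at least one
      of the included tags.
    """
    point_tags = set(point_tags)
    if exclude and point_tags & set(exclude):
        return False  # exclusion supersedes inclusion
    if include:
        return bool(point_tags & set(include))
    return True


# Tag UIDs 1 and 2 are made up for the example.
print(passes_tag_filters({1, 2}, include=[1], exclude=[2]))
print(passes_tag_filters({1}, include=[1], exclude=[2]))
```

The first call represents a collision: the point carries both an included and an excluded tag, so it is dropped.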

Returns

A pandas DataFrame for the given split, filtered according to the parameters.

Return type

pd.DataFrame
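When show_filtered_flag=True, the returned DataFrame keeps every data point and marks matches in a "filtered_flag" column. The sketch below shows one way to split such a result into matching and non-matching rows. It uses a hand-built toy DataFrame rather than a real call, the "gt" column name and UID strings are hypothetical, and because the reference does not state whether the flag is a boolean or the string "True"/"False", the code normalizes both cases.

```python
import pandas as pd

# Toy stand-in for the result of a call with show_filtered_flag=True.
# The "gt" column name and the uid strings are hypothetical.
toy = pd.DataFrame(
    {"gt": ["SPAM", "HAM", "SPAM"], "filtered_flag": [True, False, True]},
    index=["doc::0", "doc::1", "doc::2"],
)

# Normalize to a boolean mask whether the flag arrives as bool or as
# the strings "True"/"False".
matched_mask = toy["filtered_flag"].astype(str) == "True"

matched = toy[matched_mask]     # rows that satisfy the filters
unmatched = toy[~matched_mask]  # rows that do not
match_rate = matched_mask.mean()
print(f"{len(matched)} of {len(toy)} rows match")
```

Keeping the unmatched rows around makes it easy to compare the filtered subset against the rest of the split without issuing a second request.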