snorkelflow.client.nodes.get_dataset
- snorkelflow.client.nodes.get_dataset(node, split='dev', batch_uid=None, combiner='AND', show_filtered_flag=False, gt_label=None, all_lfs_filter=None, no_lfs_filter=None, model_filters=None, lf_filters=None, training_set_filters=None, include_tag_uids=None, include_tag_type_uids=None, exclude_tag_uids=None, exclude_tag_type_uids=None)
Filter the dataset and return it with ground truth (GT) labels, indexed by data point UID.
You can also keep only the data points that have a certain ground truth label or a certain label predicted by a machine learning model. This filter is handy when you want to do your own analysis on an error bucket.
Multiple conditions are combined with AND semantics.
Examples
# Filter data where all LFs abstain.
df = sf.get_dataset(node, all_lfs_filter="UNKNOWN")
# Filter data whose GT label is "LABEL".
df = sf.get_dataset(node, gt_label="LABEL")
# Filter data with a set of rules.
df = sf.get_dataset(
    node,
    gt_label="LABEL",
    model_filters=[(1, "LABEL"), (2, "LABEL")],
    lf_filters=[("my_lf", "LABEL")],
    training_set_filters=[(1, "LABEL")],
    combiner="AND",
)
# Generate column statistics and summaries for the dataset.
df = sf.get_dataset(node)
df.describe(include="all")
Parameters
Returns
A pandas DataFrame for the given split, filtered according to the parameters.
Return type
pd.DataFrame
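The error-bucket analysis mentioned in the description can be sketched as a small wrapper that combines a ground truth filter with a model prediction filter. This is only an illustration: the helper name `error_bucket` is hypothetical, and `get_dataset` is passed in as an argument so the sketch runs without a Snorkel Flow connection. It uses only the `split`, `gt_label`, `model_filters`, and `combiner` parameters documented on this page.

```python
from typing import Any, Callable


def error_bucket(
    get_dataset: Callable[..., Any],
    node: int,
    gt_label: str,
    model_uid: int,
    predicted_label: str,
    split: str = "dev",
) -> Any:
    """Fetch data points whose GT label is `gt_label` but which the
    given model predicted as `predicted_label`.

    `get_dataset` is expected to accept the keyword arguments documented
    for get_dataset on this page (hypothetical wrapper, not part of the SDK).
    """
    return get_dataset(
        node,
        split=split,
        gt_label=gt_label,
        # One (model ID, predicted label) tuple, as in the examples above.
        model_filters=[(model_uid, predicted_label)],
        # Both conditions must hold for a data point to be returned.
        combiner="AND",
    )
```

With the client available as `sf`, as in the examples above, a call might look like `error_bucket(sf.get_dataset, node, "LABEL_A", 1, "LABEL_B")`, where the label names are placeholders.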
| Name | Type | Default | Info |
| --- | --- | --- | --- |
| node | int | | UID of the node. |
| split | str | 'dev' | The split to be loaded. Only "dev", "valid", and "test" splits are allowed. Defaults to "dev". |
| batch_uid | Optional[int] | None | UID of the annotation batch to filter. |
| combiner | str | 'AND' | Combiner to apply across all filters. Defaults to "AND". |
| show_filtered_flag | bool | False | If True, return all data points and a column called "filtered_flag" with "True" or "False" values specifying whether each data point is included by the provided filters. |
| gt_label | Optional[str] | None | If set, include only data points where the ground truth is this label string. |
| all_lfs_filter | Optional[str] | None | If set, include only data points where all LFs vote for the label string passed. |
| no_lfs_filter | Optional[str] | None | If set, include only data points where no LFs vote for the label string passed. |
| model_filters | Optional[List[Tuple[int, str]]] | None | List of tuples where the first value is the model ID and the second value is the predicted label or voting pattern. If provided, include only data points that match this pattern. |
| lf_filters | Optional[List[Tuple[str, str]]] | None | List of tuples where the first value is the LF name and the second value is the assigned label or voting pattern. If provided, include only data points that match this pattern. |
| training_set_filters | Optional[List[Tuple[int, str]]] | None | List of tuples where the first value is the training set ID and the second value is the assigned label or voting pattern. If provided, include only data points that match this pattern. |
| include_tag_uids | Optional[List[int]] | None | List of tag UIDs. Only data points that have one of these tags will be included. |
| include_tag_type_uids | Optional[List[int]] | None | List of tag type UIDs. Only data points that have a tag of one of these types will be included. |
| exclude_tag_uids | Optional[List[int]] | None | List of tag UIDs. Only data points that do not have these tags will be included. Exclusion supersedes inclusion in case of collision. |
| exclude_tag_type_uids | Optional[List[int]] | None | List of tag type UIDs. Only data points that do not have a tag of these types will be included. Exclusion supersedes inclusion in case of collision. |
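The rule that exclusion supersedes inclusion can be illustrated with a small, self-contained sketch. It is not Snorkel Flow's implementation: the function `passes_tag_filters` is hypothetical, tags are modeled as plain integer UIDs, and treating inclusion as "has at least one of the listed tags" is an assumption that the documentation above does not confirm.

```python
from typing import Iterable, Optional


def passes_tag_filters(
    tags: Iterable[int],
    include_tag_uids: Optional[Iterable[int]] = None,
    exclude_tag_uids: Optional[Iterable[int]] = None,
) -> bool:
    """Decide whether a data point with the given tag UIDs survives the filters.

    Hypothetical model of the documented behavior, for illustration only.
    """
    tag_set = set(tags)
    # Exclusion is checked first, so it wins when a data point carries
    # both an included tag and an excluded tag.
    if exclude_tag_uids is not None and tag_set & set(exclude_tag_uids):
        return False
    # Assumption: a data point is included if it has ANY of the listed tags.
    if include_tag_uids is not None:
        return bool(tag_set & set(include_tag_uids))
    # No inclusion filter: everything not excluded passes.
    return True


# A data point tagged with both 1 and 2 collides: included via 1, excluded via 2.
collision = passes_tag_filters({1, 2}, include_tag_uids=[1], exclude_tag_uids=[2])
```

Here `collision` evaluates to `False`, mirroring the note that exclusion takes precedence over inclusion.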