snorkelflow.client.nodes.get_node_data
- snorkelflow.client.nodes.get_node_data(node, split=None, data=True, data_columns=None, ground_truth=True, tags=False, comments=False, training_set_labels=False, training_set_probs=False, training_set_uid=None, training_set_overwrite_with_gt=False, training_set_filter_unlabeled=False, training_set_filter_uncertain_labels=True, training_set_tie_break_policy='abstain', training_set_sampler_config=None, model_predictions=False, model_probabilities=False, model_confidences=False, model_uid=None, user_format=False, rename_columns=None, lfs=None, lfs_column_prefix='lf_', start_date=None, end_date=None, max_input_rows=None, apply_postprocessors=True)
Get a dataframe from a model node, optionally including annotations and labels.
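For orientation, here is a minimal usage sketch based on the signature above. It assumes the function can be imported from the documented module path inside a Snorkel Flow Python SDK environment (for example, a notebook connected to a Snorkel Flow instance); the node UID and column names are placeholders.

```python
from snorkelflow.client.nodes import get_node_data

# Fetch the "dev" split for a model node (placeholder UID 123), restricting the
# returned columns to speed up loading and including ground truth labels.
df = get_node_data(
    node=123,                           # UID of the model node (placeholder)
    split="dev",                        # one of "train", "dev", "valid", "test"
    data_columns=["text", "label_id"],  # placeholder column names
    ground_truth=True,                  # adds a ground truth column (missing = -1)
)
print(df.shape)  # an [n_data_points, n_fields] pandas DataFrame
```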
Parameters

node (int) – UID of the model node to get a dataframe from.
split (Union[str, List[str], None], default: None) – Name of the data split(s) to load, one or more of ["train", "dev", "valid", "test"]. Fetching "train" excludes examples in the current dev set. If None, load all available splits.
data (bool, default: True) – If True, include the columns specified in data_columns in the results.
data_columns (Optional[List[str]], default: None) – Optional list of columns to include in the dataframe. Defaults to all columns. Specifying a limited set of data_columns can significantly speed up loading.
ground_truth (bool, default: True) – If True, add a column containing ground truth labels (missing = -1).
tags (bool, default: False) – If True, add a column containing the tags assigned to each data point (missing = np.nan).
comments (bool, default: False) – If True, add a column containing the comments assigned to each data point (missing = np.nan).
training_set_uid (Optional[int], default: None) – The UID of the training set whose labels and/or probabilities will be loaded. Required if training_set_labels or training_set_probs is True.
training_set_labels (bool, default: False) – If True, add a column containing (int) training set labels (missing = -1).
training_set_probs (bool, default: False) – If True, add a column containing (float) training set probabilities (missing = np.nan).
training_set_overwrite_with_gt (bool, default: False) – If True, replace training set labels/probabilities with ground truth values where possible.
training_set_filter_unlabeled (bool, default: False) – If True, drop data points whose training set label is -1 (i.e., "abstain"). If training_set_overwrite_with_gt is True, filter out points with no LF or GT labels.
training_set_filter_uncertain_labels (bool, default: True) – If True, set uncertain labels to "abstain", where uncertain labels are those whose probability falls below the threshold.
training_set_tie_break_policy (str, default: 'abstain') – How to derive (int) training set labels from (float) training set probs. If "random", return a random choice among the tied classes; if "abstain", return -1 as the training set label.
training_set_sampler_config (Optional[Dict[str, Any]], default: None) – A dictionary with fields "strategy" (required), "params" (optional), and "class_counts" (optional) representing a sampler configuration. For details, see sampler-config. This setting only pertains to the returned labels, not the probs.
model_predictions (bool, default: False) – If True, add a column containing (int) model predictions.
model_probabilities (bool, default: False) – If True, add a column containing (List[float]) model probabilities. For high-cardinality problems this column can become very large; consider fetching only model_predictions and/or model_confidences instead.
model_confidences (bool, default: False) – If True, add a column containing (float) model confidences for the given predictions. Model confidence is defined here as the probability assigned to the predicted class.
model_uid (Optional[int], default: None) – The UID of the model whose predictions, probabilities, and/or confidences will be loaded. Required if model_predictions or model_confidences is True.
user_format (bool, default: False) – If True, return labels in user format.
rename_columns (Optional[Dict[str, str]], default: None) – Optional dict of alternate column names. Accepted keys are "ground_truth", "tags", "comments", "training_set_labels", "training_set_probs", "model_predictions", "model_probabilities", "model_confidences".
lfs (Optional[List[str]], default: None) – String names of the labeling functions whose labels should be included in the final output. For LFs that are not part of any training set, their labels will be UNKNOWN (e.g., -1 when user_format=False for single-label) on splits other than "dev".
lfs_column_prefix (Optional[str], default: 'lf_') – String prefix to prepend to the LF label column names. Defaults to "lf_", so the resulting default column names are "lf_{label_function_name}".
start_date (Optional[str], default: None) – Include only data sources added after this date (e.g., YYYY-MM-DD).
end_date (Optional[str], default: None) – Include only data sources added before this date (e.g., YYYY-MM-DD).
max_input_rows (Optional[int], default: None) – Controls the number of rows returned from reading the dataset.
apply_postprocessors (bool, default: True) – If True, apply postprocessors.

Returns

An [n_data_points, n_fields] pd.DataFrame containing the task data.

Return type

DataFrame

Raises

RuntimeError – If the data split does not exist.
ValueError – If an unrecognized column name is passed in through rename_columns.
ValueError – If training-related arguments are set to True and no training_set_uid is specified.
ValueError – If model-related arguments are set to True and no model_uid is specified.
ValueError – If the given node UID doesn’t point to a model node.
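As a further sketch, the training-set and model-related flags can be combined with rename_columns to control the output column names. All UIDs and values below are placeholders; as noted under Raises, training_set_uid and model_uid must be supplied whenever the corresponding flags are set to True.

```python
from snorkelflow.client.nodes import get_node_data

# Pull the "train" split with training set labels plus model predictions and
# confidences. The node, training set, and model UIDs are placeholders;
# omitting training_set_uid or model_uid here would raise a ValueError.
df = get_node_data(
    node=123,                                 # placeholder model node UID
    split="train",
    training_set_labels=True,                 # (int) labels, missing = -1
    training_set_uid=45,                      # placeholder training set UID (required here)
    training_set_filter_unlabeled=True,       # drop rows whose training set label is -1
    training_set_tie_break_policy="abstain",  # ties resolve to -1
    model_predictions=True,                   # (int) model predictions
    model_confidences=True,                   # (float) probability of the predicted class
    model_uid=67,                             # placeholder model UID (required here)
    rename_columns={
        "training_set_labels": "ts_label",    # accepted rename_columns keys only
        "model_predictions": "pred",
    },
)
```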