snorkelflow.client.nodes.get_node_data
- snorkelflow.client.nodes.get_node_data(node, split=None, data=True, data_columns=None, ground_truth=True, tags=False, comments=False, training_set_labels=False, training_set_probs=False, training_set_uid=None, training_set_overwrite_with_gt=False, training_set_filter_unlabeled=False, training_set_filter_uncertain_labels=True, training_set_tie_break_policy='abstain', training_set_sampler_config=None, model_predictions=False, model_probabilities=False, model_confidences=False, model_uid=None, user_format=False, rename_columns=None, lfs=None, lfs_column_prefix='lf_', start_date=None, end_date=None, max_input_rows=None, apply_postprocessors=True)
Get a dataframe from a model node, optionally including annotations and labels.
Parameters
Returns
An [n_data_points, n_fields] pd.DataFrame containing the task data.
Return type
DataFrame
Raises
RuntimeError – If the data split does not exist.
ValueError – If an unrecognized column name is passed in through rename_columns.
ValueError – If training-related arguments are set to True and no training_set_uid is specified.
ValueError – If model-related arguments are set to True and no model_uid is specified.
ValueError – If the given node UID doesn’t point to a model node.
node (int): UID of the model node to get a dataframe from.
split (Union[str, List[str], None], default: None): Name of the data split(s) to load, one or more of "train", "dev", "valid", "test". Fetching "train" excludes examples in the current dev set. If None, load all available splits.
data (bool, default: True): If True, include the columns specified in data_columns in the results.
data_columns (Optional[List[str]], default: None): Optional list of columns to include in the dataframe. Defaults to all columns. Limiting data_columns can significantly speed up loading.
ground_truth (bool, default: True): If True, add a column containing ground truth labels (missing = -1).
tags (bool, default: False): If True, add a column containing the tags assigned to each data point (missing = np.nan).
comments (bool, default: False): If True, add a column containing the comments assigned to each data point (missing = np.nan).
training_set_uid (Optional[int], default: None): The UID of the training set whose labels and/or probabilities will be loaded. Required if training_set_labels or training_set_probs is True.
training_set_labels (bool, default: False): If True, add a column containing (int) training set labels (missing = -1).
training_set_probs (bool, default: False): If True, add a column containing (float) training set probabilities (missing = np.nan).
training_set_overwrite_with_gt (bool, default: False): If True, replace training set labels/probabilities with ground truth values where possible.
training_set_filter_unlabeled (bool, default: False): If True, drop data points whose training set label is -1 (i.e., "abstain"). If training_set_overwrite_with_gt is True, drop points with neither LF nor GT labels.
training_set_filter_uncertain_labels (bool, default: True): If True, set uncertain labels to "abstain", where uncertain labels are those with probabilities below the threshold.
training_set_tie_break_policy (str, default: 'abstain'): How to derive (int) training set labels from (float) training set probabilities when classes tie. If "random", return a random choice among the tied classes; if "abstain", return -1 as the training set label.
training_set_sampler_config (Optional[Dict[str, Any]], default: None): A dictionary with fields "strategy" (required), "params" (optional), and "class_counts" (optional) representing a sampler configuration. For details, see sampler-config. This setting only affects the returned labels, not the probabilities.
model_predictions (bool, default: False): If True, add a column containing (int) model predictions.
model_probabilities (bool, default: False): If True, add a column containing (List[float]) model probabilities. For high-cardinality problems this column can become very large; consider fetching only model_predictions and/or model_confidences instead.
model_confidences (bool, default: False): If True, add a column containing (float) model confidences for the given predictions. Model confidence is defined here as the probability assigned to the predicted class.
model_uid (Optional[int], default: None): The UID of the model whose predictions, probabilities, and/or confidences will be loaded. Required if model_predictions or model_confidences is True.
user_format (bool, default: False): If True, return labels in user format; otherwise, return them in internal format.
rename_columns (Optional[Dict[str, str]], default: None): Optional dict of alternate column names. Accepted keys are "ground_truth", "tags", "comments", "training_set_labels", "training_set_probs", "model_predictions", "model_probabilities", "model_confidences".
lfs (Optional[List[str]], default: None): Names of the labeling functions whose labels to include in the final output. For LFs that are not part of any training set, labels will be UNKNOWN (e.g., -1 when user_format=False for single-label tasks) on splits other than "dev".
lfs_column_prefix (Optional[str], default: 'lf_'): String prefix to prepend to the LF label column names. Defaults to "lf_", so the resulting default column names are "lf_{label_function_name}".
start_date (Optional[str], default: None): Include only data sources added after this date (e.g., YYYY-MM-DD).
end_date (Optional[str], default: None): Include only data sources added before this date (e.g., YYYY-MM-DD).
max_input_rows (Optional[int], default: None): Limits the number of rows read from the dataset.
apply_postprocessors (bool, default: True): If True, apply postprocessors to the returned data.
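A minimal usage sketch is shown below. It assumes a configured Snorkel Flow client session; the node and model UIDs (123, 45) are placeholders, and the exact import path and context-initialization call may differ by deployment, so treat this as illustrative rather than copy-paste ready. Note that model_uid is supplied because model_predictions=True, and training-set arguments would likewise require training_set_uid.

```python
import snorkelflow.client.nodes as nodes  # import path per this page's title

MODEL_NODE_UID = 123  # hypothetical model node UID
MODEL_UID = 45        # hypothetical model UID

# Fetch only the dev split, limited to the columns we need, with
# ground truth and model predictions, renaming the prediction column.
df = nodes.get_node_data(
    node=MODEL_NODE_UID,
    split="dev",
    data_columns=["text"],          # limiting columns speeds up loading
    ground_truth=True,
    model_predictions=True,
    model_uid=MODEL_UID,            # required when model_predictions=True
    rename_columns={"model_predictions": "preds"},
)

# df is a pandas DataFrame of shape [n_data_points, n_fields]
print(df[["text", "preds"]].head())
```

Omitting model_uid in the call above would raise a ValueError, per the Raises section; passing an unrecognized key in rename_columns would as well.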