Version: 0.96

snorkelflow.client.nodes.get_node_data

snorkelflow.client.nodes.get_node_data(node, split=None, data=True, data_columns=None, ground_truth=True, tags=False, comments=False, training_set_labels=False, training_set_probs=False, training_set_uid=None, training_set_overwrite_with_gt=False, training_set_filter_unlabeled=False, training_set_filter_uncertain_labels=True, training_set_tie_break_policy='abstain', training_set_sampler_config=None, model_predictions=False, model_probabilities=False, model_confidences=False, model_uid=None, user_format=False, rename_columns=None, lfs=None, lfs_column_prefix='lf_', start_date=None, end_date=None, max_input_rows=None, apply_postprocessors=True)

Get a dataframe from a model node, optionally including annotations and labels.
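
A minimal usage sketch follows, assuming this runs in an environment where snorkelflow.client is importable; the node UID and column names are placeholders for illustration, not values from this reference.

```python
# Minimal sketch: fetch the dev split of a model node's data, limited to two columns.
from snorkelflow.client import nodes

NODE_UID = 123  # hypothetical model node UID

df = nodes.get_node_data(
    NODE_UID,
    split="dev",                     # load only the dev split
    data_columns=["text", "title"],  # hypothetical columns; limiting columns speeds up loading
)
print(df.shape)  # (n_data_points, n_fields)
```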

Parameters

  • node (int) – UID of the model node to get a dataframe from.
  • split (Union[str, List[str], None], default: None) – Name of the data split(s) to load; one or more of ["train", "dev", "valid", "test"]. Fetching "train" excludes examples in the current dev set. If None, all available splits are loaded.
  • data (bool, default: True) – If True, include the columns specified in data_columns in the results.
  • data_columns (Optional[List[str]], default: None) – Optional list of columns to include in the dataframe. Defaults to all columns. Specifying a limited set of data_columns can significantly speed up loading.
  • ground_truth (bool, default: True) – If True, add a column containing ground truth labels (missing = -1).
  • tags (bool, default: False) – If True, add a column containing the tags assigned to each data point (missing = np.nan).
  • comments (bool, default: False) – If True, add a column containing the comments assigned to each data point (missing = np.nan).
  • training_set_uid (Optional[int], default: None) – UID of the training set whose labels and/or probabilities will be loaded. Required if training_set_labels or training_set_probs is True (see the sketch after this parameter list).
  • training_set_labels (bool, default: False) – If True, add a column containing (int) training set labels (missing = -1).
  • training_set_probs (bool, default: False) – If True, add a column containing (float) training set probabilities (missing = np.nan).
  • training_set_overwrite_with_gt (bool, default: False) – If True, replace training set labels/probabilities with ground truth values where possible.
  • training_set_filter_unlabeled (bool, default: False) – If True, drop data points whose training set label is -1 (i.e., "abstain"). If training_set_overwrite_with_gt is True, drop points that have neither LF nor GT labels.
  • training_set_filter_uncertain_labels (bool, default: True) – If True, set uncertain labels to "abstain", where uncertain labels are those whose probabilities fall below the threshold.
  • training_set_tie_break_policy (str, default: 'abstain') – How to derive (int) training set labels from (float) training set probabilities. If "random", return a random choice among the tied classes. If "abstain", return -1 as the training set label.
  • training_set_sampler_config (Optional[Dict[str, Any]], default: None) – Dictionary with fields "strategy" (required), "params" (optional), and "class_counts" (optional) representing a sampler configuration. For details, see sampler-config. This setting only affects the returned labels, not the probabilities.
  • model_predictions (bool, default: False) – If True, add a column containing (int) model predictions.
  • model_probabilities (bool, default: False) – If True, add a column containing (List[float]) model probabilities. For high-cardinality problems this column can become very large; consider fetching only model_predictions and/or model_confidences instead.
  • model_confidences (bool, default: False) – If True, add a column containing (float) model confidences for the given predictions. Model confidence is defined here as the probability assigned to the predicted class.
  • model_uid (Optional[int], default: None) – UID of the model whose predictions, probabilities, and/or confidences will be loaded. Required if model_predictions or model_confidences is True.
  • user_format (bool, default: False) – If True, return labels in user format.
  • rename_columns (Optional[Dict[str, str]], default: None) – Optional dict of alternate column names. Accepted keys are "ground_truth", "tags", "comments", "training_set_labels", "training_set_probs", "model_predictions", "model_probabilities", and "model_confidences".
  • lfs (Optional[List[str]], default: None) – Names of the labeling functions whose labels to include in the final output. For LFs that are not part of any training set, labels will be UNKNOWN (e.g., -1 when user_format=False for single-label tasks) on splits other than "dev".
  • lfs_column_prefix (Optional[str], default: 'lf_') – String prefix to prepend to the LF label column names. Defaults to "lf_", so the resulting default column names are "lf_{label_function_name}".
  • start_date (Optional[str], default: None) – Include only data sources added after this date (e.g., "YYYY-MM-DD").
  • end_date (Optional[str], default: None) – Include only data sources added before this date (e.g., "YYYY-MM-DD").
  • max_input_rows (Optional[int], default: None) – Maximum number of rows returned from reading the dataset.
  • apply_postprocessors (bool, default: True) – If True, apply postprocessors.
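
To show how the training-set and model flags combine, here is a hedged sketch; all UIDs and renamed column names below are placeholders, not values from this reference.

```python
from snorkelflow.client import nodes  # module path as documented above

NODE_UID = 123          # hypothetical model node UID
TRAINING_SET_UID = 456  # hypothetical training set UID (required because training_set_labels=True)
MODEL_UID = 789         # hypothetical model UID (required because model_predictions=True)

df = nodes.get_node_data(
    NODE_UID,
    split="train",
    training_set_labels=True,
    training_set_uid=TRAINING_SET_UID,
    training_set_filter_unlabeled=True,  # drop rows whose training set label is -1 ("abstain")
    model_predictions=True,
    model_confidences=True,
    model_uid=MODEL_UID,
    rename_columns={
        "training_set_labels": "ts_label",
        "model_predictions": "pred",
        "model_confidences": "confidence",
    },
)
```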

Returns

An [n_data_points, n_fields] pd.DataFrame containing the task data.

Return type

DataFrame

Raises

  • RuntimeError – If the data split does not exist.

  • ValueError – If an unrecognized column name is passed in through rename_columns.

  • ValueError – If training-related arguments are set to True and no training_set_uid is specified.

  • ValueError – If model-related arguments are set to True and no model_uid is specified.

  • ValueError – If the given node UID doesn’t point to a model node.
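
As a hedged illustration of the ValueError cases above, requesting training set labels without specifying training_set_uid is documented to fail; the node UID below is a placeholder.

```python
from snorkelflow.client import nodes

try:
    # training_set_labels=True without training_set_uid should raise ValueError
    nodes.get_node_data(123, training_set_labels=True)
except ValueError as e:
    print(f"Expected failure: {e}")
```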