Version: 0.96

snorkelflow.client.nodes.get_node_data

snorkelflow.client.nodes.get_node_data(node, split=None, data=True, data_columns=None, ground_truth=True, tags=False, comments=False, training_set_labels=False, training_set_probs=False, training_set_uid=None, training_set_overwrite_with_gt=False, training_set_filter_unlabeled=False, training_set_filter_uncertain_labels=True, training_set_tie_break_policy='abstain', training_set_sampler_config=None, model_predictions=False, model_probabilities=False, model_confidences=False, model_uid=None, user_format=False, rename_columns=None, lfs=None, lfs_column_prefix='lf_', start_date=None, end_date=None, max_input_rows=None, apply_postprocessors=True)

Get a dataframe from a model node, optionally including annotations and labels.
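
A minimal usage sketch follows, assuming this runs in an environment where snorkelflow.client is importable; the node UID and column names are placeholders for illustration, not values from this reference.

```python
# Minimal sketch: fetch the dev split of a model node's data, limited to two columns.
from snorkelflow.client import nodes

NODE_UID = 123  # hypothetical model node UID

df = nodes.get_node_data(
    NODE_UID,
    split="dev",                     # load only the dev split
    data_columns=["text", "title"],  # hypothetical columns; limiting columns speeds up loading
)
print(df.shape)  # (n_data_points, n_fields)
```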

Parameters

  • node (int) – UID of the model node to get a dataframe from.
  • split (Union[str, List[str], None], default: None) – Name of the data split(s) to load; one or more of ["train", "dev", "valid", "test"]. Fetching "train" excludes examples in the current dev set. If None, all available splits are loaded.
  • data (bool, default: True) – If True, include the columns specified in data_columns in the results.
  • data_columns (Optional[List[str]], default: None) – Optional list of columns to include in the dataframe. Defaults to all columns. Specifying a limited set of data_columns can significantly speed up loading.
  • ground_truth (bool, default: True) – If True, add a column containing ground truth labels (missing = -1).
  • tags (bool, default: False) – If True, add a column containing the tags assigned to each data point (missing = np.nan).
  • comments (bool, default: False) – If True, add a column containing the comments assigned to each data point (missing = np.nan).
  • training_set_uid (Optional[int], default: None) – UID of the training set whose labels and/or probabilities will be loaded. Required if training_set_labels or training_set_probs is True (see the sketch after this parameter list).
  • training_set_labels (bool, default: False) – If True, add a column containing (int) training set labels (missing = -1).
  • training_set_probs (bool, default: False) – If True, add a column containing (float) training set probabilities (missing = np.nan).
  • training_set_overwrite_with_gt (bool, default: False) – If True, replace training set labels/probabilities with ground truth values where possible.
  • training_set_filter_unlabeled (bool, default: False) – If True, drop data points whose training set label is -1 (i.e., "abstain"). If training_set_overwrite_with_gt is True, drop points that have neither LF nor GT labels.
  • training_set_filter_uncertain_labels (bool, default: True) – If True, set uncertain labels to "abstain", where uncertain labels are those whose probabilities fall below the threshold.
  • training_set_tie_break_policy (str, default: 'abstain') – How to derive (int) training set labels from (float) training set probabilities. If "random", return a random choice among the tied classes. If "abstain", return -1 as the training set label.
  • training_set_sampler_config (Optional[Dict[str, Any]], default: None) – Dictionary with fields "strategy" (required), "params" (optional), and "class_counts" (optional) representing a sampler configuration. For details, see sampler-config. This setting only affects the returned labels, not the probabilities.
  • model_predictions (bool, default: False) – If True, add a column containing (int) model predictions.
  • model_probabilities (bool, default: False) – If True, add a column containing (List[float]) model probabilities. For high-cardinality problems this column can become very large; consider fetching only model_predictions and/or model_confidences instead.
  • model_confidences (bool, default: False) – If True, add a column containing (float) model confidences for the given predictions. Model confidence is defined here as the probability assigned to the predicted class.
  • model_uid (Optional[int], default: None) – UID of the model whose predictions, probabilities, and/or confidences will be loaded. Required if model_predictions or model_confidences is True.
  • user_format (bool, default: False) – If True, return labels in user format.
  • rename_columns (Optional[Dict[str, str]], default: None) – Optional dict of alternate column names. Accepted keys are "ground_truth", "tags", "comments", "training_set_labels", "training_set_probs", "model_predictions", "model_probabilities", and "model_confidences".
  • lfs (Optional[List[str]], default: None) – Names of the labeling functions whose labels to include in the final output. For LFs that are not part of any training set, labels will be UNKNOWN (e.g., -1 when user_format=False for single-label tasks) on splits other than "dev".
  • lfs_column_prefix (Optional[str], default: 'lf_') – String prefix to prepend to the LF label column names. Defaults to "lf_", so the resulting default column names are "lf_{label_function_name}".
  • start_date (Optional[str], default: None) – Include only data sources added after this date (e.g., "YYYY-MM-DD").
  • end_date (Optional[str], default: None) – Include only data sources added before this date (e.g., "YYYY-MM-DD").
  • max_input_rows (Optional[int], default: None) – Maximum number of rows returned from reading the dataset.
  • apply_postprocessors (bool, default: True) – If True, apply postprocessors.
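
To show how the training-set and model flags combine, here is a hedged sketch; all UIDs and renamed column names below are placeholders, not values from this reference.

```python
from snorkelflow.client import nodes  # module path as documented above

NODE_UID = 123          # hypothetical model node UID
TRAINING_SET_UID = 456  # hypothetical training set UID (required because training_set_labels=True)
MODEL_UID = 789         # hypothetical model UID (required because model_predictions=True)

df = nodes.get_node_data(
    NODE_UID,
    split="train",
    training_set_labels=True,
    training_set_uid=TRAINING_SET_UID,
    training_set_filter_unlabeled=True,  # drop rows whose training set label is -1 ("abstain")
    model_predictions=True,
    model_confidences=True,
    model_uid=MODEL_UID,
    rename_columns={
        "training_set_labels": "ts_label",
        "model_predictions": "pred",
        "model_confidences": "confidence",
    },
)
```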

Returns

An [n_data_points, n_fields] pd.DataFrame containing the task data.

Return type

DataFrame

Raises

  • RuntimeError – If the data split does not exist.

  • ValueError – If an unrecognized column name is passed in through rename_columns.

  • ValueError – If training-related arguments are set to True and no training_set_uid is specified.

  • ValueError – If model-related arguments are set to True and no model_uid is specified.

  • ValueError – If the given node UID doesn’t point to a model node.
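
As a hedged illustration of the ValueError cases above, requesting training set labels without specifying training_set_uid is documented to fail; the node UID below is a placeholder.

```python
from snorkelflow.client import nodes

try:
    # training_set_labels=True without training_set_uid should raise ValueError
    nodes.get_node_data(123, training_set_labels=True)
except ValueError as e:
    print(f"Expected failure: {e}")
```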