snorkelai.sdk.client.gts.align_external_ground_truth
- snorkelai.sdk.client.gts.align_external_ground_truth(node_uid, x_uids, labels, user_format=False, scheduler=None)
Note: This function is only necessary when you pass a non-dataframe format to sf.add_ground_truth for sequence tagging applications. Starting with version 0.95, if you use a dataframe with sf.add_ground_truth, the labels are aligned automatically.
(Sequence Tagging Only) This function changes external ground truth spans to compensate for the offsets caused by a text preprocessor.
Text preprocessors may sometimes remove characters, resulting in misalignments between externally collected ground truth spans and the preprocessed text. For example, the default AsciiCharFilter preprocessor removes non-ASCII characters, which causes some spans to shift leftwards, while external annotations still carry the original spans. Thus, it is necessary to use this function for applications whose text contains non-ASCII characters.
Examples
An example for a sequence tagging application with AsciiCharFilter as the preprocessor:
node_uid = sf.get_model_node(APP_NAME)
x_uids = ["doc::0", "doc::1"] # all x_uids with labels
labels = [
[[0, 20, "COMPANY"]], # labels for doc::0
[[10, 15, "COMPANY"], [20, 25, "COMPANY"]], # labels for doc::1
...
]
aligned_labels = sf.align_external_ground_truth(node_uid, x_uids, labels, user_format=True)
Then, use sf.add_ground_truth to add aligned_labels:
sf.add_ground_truth(node_uid, x_uids, aligned_labels, user_format=True)
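To make the misalignment described above concrete, the following is a minimal, standalone sketch of how removing non-ASCII characters shifts a span leftwards. It does not call the SDK and is not how align_external_ground_truth is implemented. The helpers strip_non_ascii and shift_span are hypothetical names, and the sketch assumes that the filter simply drops every character with a code point of 128 or higher and that spans are [start, end) character offsets.

```python
from typing import Tuple


def strip_non_ascii(text: str) -> str:
    """Drop every non-ASCII character (assumed behavior of an ASCII filter)."""
    return "".join(ch for ch in text if ord(ch) < 128)


def shift_span(text: str, start: int, end: int) -> Tuple[int, int]:
    """Map a [start, end) span on `text` onto strip_non_ascii(text).

    Hypothetical helper: each offset moves left by the number of
    non-ASCII characters that precede it in the original text.
    """
    removed_before_start = sum(1 for ch in text[:start] if ord(ch) >= 128)
    removed_before_end = sum(1 for ch in text[:end] if ord(ch) >= 128)
    return start - removed_before_start, end - removed_before_end


# Original text contains one non-ASCII character (U+00E9) before the span.
original = "Caf\u00e9 owner joined Acme Corp"
start = original.index("Acme Corp")
end = start + len("Acme Corp")

filtered = strip_non_ascii(original)
new_start, new_end = shift_span(original, start, end)

# The external span no longer matches the filtered text; the shifted one does.
print(repr(filtered[start:end]))
print(repr(filtered[new_start:new_end]))
```

In this sketch the span moves left by exactly one position because a single character was removed before it. Spans that precede every removed character stay where they are.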
Parameters
node_uid (int): The UID of the model node.
x_uids (List[str]): UIDs of data points. Can be a list or a 1D numpy array of strings.
labels (List[Any]): Label values. List or numpy array of labels. Must be the same length as x_uids. If user_format is True, check that labels have not been JSON serialized.
user_format (bool, default False): True if labels are provided in user format, False otherwise.
scheduler (Optional[str], default None): Dask scheduler (threads, client, or group) to use.
Returns
List of aligned labels corresponding to x_uids.
Return type
List[Any]
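The constraints on labels stated in the parameter descriptions (same length as x_uids, and not JSON serialized when user_format is True) can be checked before calling the function. The following is a minimal sketch of such a check. check_alignment_inputs is a hypothetical helper, not part of the SDK, and treating any str or bytes entry as "JSON serialized" is an assumption made for illustration.

```python
from typing import Any, Sequence


def check_alignment_inputs(x_uids: Sequence[str], labels: Sequence[Any]) -> None:
    """Raise if the inputs violate the documented constraints.

    Hypothetical pre-flight check; it does not call the SDK.
    """
    if len(x_uids) != len(labels):
        raise ValueError(
            f"labels has {len(labels)} entries but x_uids has {len(x_uids)}"
        )
    for x_uid, doc_labels in zip(x_uids, labels):
        # Assumption: a str/bytes entry means the spans were JSON serialized
        # instead of being passed as Python lists.
        if isinstance(doc_labels, (str, bytes)):
            raise TypeError(f"labels for {x_uid} look JSON serialized")


x_uids = ["doc::0", "doc::1"]
labels = [
    [[0, 20, "COMPANY"]],
    [[10, 15, "COMPANY"], [20, 25, "COMPANY"]],
]
check_alignment_inputs(x_uids, labels)  # passes silently
```

Running a check like this first surfaces shape problems locally, before any request is sent to the application.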