Built-in Operators.
Built-in operators are available in the full SDK and can be added to the DAG programmatically, as in the following example:
# Add the operator to the DAG.
# Assumes `sf` is the imported SDK client and APP_NAME names an existing application.
sf.add_node(
    application=APP_NAME,
    input_node_uids=[123],
    output_node_uid=456,
    op_type="ColumnRenamer",
    op_config={"column_map": {"body": "email-body"}},
)
Featurizers
Text-based
Truncates a column by a given amount.
Operator that adds a field with nearby textual features for each extracted span.
Preprocessor that normalizes whitespace.
Preprocessor that parses a document and adds a JSON doc column.
Preprocessor that parses a document and adds a tokens JSON column.
A Featurizer that yields all noun phrases according to spaCy.
A Featurizer that yields all verb phrases according to a simple part-of-speech verb match.
A Featurizer that yields all sentences according to spaCy.
Featurizer that converts text to an embedding.
Featurizer that converts text to an embedding.
Preprocessor that removes non-ASCII characters from the selected column in place.
Preprocessor that removes non-Latin characters from the selected column in place.
Adds a column with aggregated spans for the current context_uid.
A SpanFeaturizer that yields a single empty span from (0,0) for each row.
A SpanFeaturizer that yields all matches for a given regular expression.
Extracts spans (slices of documents) that contain dates (using regex).
Extracts spans (slices of documents) that contain paragraphs (using regex).
Extracts spans (slices of documents) that contain numeric values.
Extracts spans (slices of documents) that contain email addresses (using regex).
Extracts spans (slices of documents) that contain US currency (using regex).
A SpanFeaturizer that reads spans directly from its config.
A SpanFeaturizer that reads spans directly from a file with the expected span columns.
A SpanFeaturizer that yields (and optionally links) spans in an entity-to-aliases dictionary.
A SpanFeaturizer that yields (and optionally links) spans in an entity-to-aliases dictionary, which supports regexes.
(Optimized for keyword aliases) A SpanFeaturizer that yields (and optionally links) spans, given an entity-to-aliases dictionary and a doc-id-to-entity dictionary.
A SpanFeaturizer that yields every token, given a selected tokenization strategy.
A SpanFeaturizer that yields all noun phrases according to spaCy.
A SpanFeaturizer that yields all matches for a given NER tag according to spaCy.
A SpanFeaturizer that yields all matches for a default list of NER tags according to spaCy.
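To make the simpler text preprocessors above concrete, here is a minimal sketch of whitespace normalization, non-ASCII removal, and truncation on plain strings. It is not the SDK's implementation: the function names are hypothetical, and the exact rules (for example, whether whitespace runs are collapsed to a single space) are assumptions.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse each run of whitespace to one space and trim both ends.

    Assumption: the built-in operator may apply different rules.
    """
    return re.sub(r"\s+", " ", text).strip()


def remove_non_ascii(text: str) -> str:
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def truncate(text: str, max_chars: int) -> str:
    """Keep at most max_chars leading characters (hypothetical parameter)."""
    return text[:max_chars]
```

In a DataFrame-based pipeline, functions like these would typically be applied to one column at a time.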
PDF-based
Featurizer that detects n-grams matching a regex pattern.
Operator that adds a list of pages to retain based on the regex pattern provided.
Operator that computes basic features for each span using the associated RichDoc object (e.g., bounding box values of the span, page numbers, etc.).
Operator that computes row-level features for a span (e.g.
Operator that computes structural RichDoc features for a span.
Operator that computes visual RichDoc features for a span.
Operator that filters out all RichDoc pages except the one containing the span to show on the frontend (for the sake of performance).
A featurizer that identifies checkboxes in PDF documents.
An operator that assigns checkbox-related features to the spans in the document.
Operator that parses hOCR and creates a RichDoc, a Snorkel-Flow-native representation of a PDF document with formatting preserved.
Truncates an hOCR document to a given number of pages.
A featurizer that identifies horizontal and vertical lines in PDF documents.
Operator that filters horizontal and vertical lines to a subset of pages.
Splits PDFs into pages for subsequent filtering and faster processing.
Operator that parses a PDF into a RichDoc (see docs for details).
Operator that clusters horizontally aligned words using word spacing.
Extractor that creates one span per horizontal text cluster.
Featurizer that creates a list of spans with one span per horizontal text cluster.
Truncates a PDF to a given number of pages.
A filter that removes rows without tables.
A featurizer that detects tables in PDF documents.
An operator that maps table predictions to spans.
OCR
Operator that takes in a PDF URL and outputs the hOCR text.
Operator that takes in a PDF URL and runs Azure Form Recognizer on it.
Filters
A filter that includes/excludes all datapoints corresponding to specified label ints.
A filter that includes/excludes all datapoints corresponding to specified label strings.
Filters rows based on the specified boolean column.
Includes or excludes rows based on a pandas query.
A filter that excludes rows with a text column larger than the specified size (in KB).
Filters rows based on the regex pattern provided.
A filter that removes all candidates that match a given regular expression.
A filter that removes all spans with a negative prediction.
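The regex-based and size-based filters above can be illustrated with a short sketch over rows represented as dictionaries. This is an illustration only, not the SDK's code: the function names, the `keep_matches` flag, and the choice of 1 KB = 1024 UTF-8 bytes are all assumptions.

```python
import re
from typing import Dict, List

Row = Dict[str, str]


def filter_by_regex(
    rows: List[Row], field: str, pattern: str, keep_matches: bool = True
) -> List[Row]:
    """Keep rows whose field matches the pattern, or the rows that do not."""
    compiled = re.compile(pattern)
    return [
        row for row in rows
        if bool(compiled.search(row[field])) == keep_matches
    ]


def filter_by_text_size(rows: List[Row], field: str, max_kb: float) -> List[Row]:
    """Exclude rows whose text exceeds max_kb.

    Assumption: size is measured in UTF-8 bytes and 1 KB is 1024 bytes.
    """
    limit_bytes = max_kb * 1024
    return [row for row in rows if len(row[field].encode("utf-8")) <= limit_bytes]
```

Setting `keep_matches=False` gives the behavior of a filter that removes matching candidates instead of keeping them.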
Extractors
A SpanExtractor that explodes a field containing a list of spans.
A SpanExtractor that yields a single empty span from (0,0) for each row.
A SpanExtractor that yields all matches for a given regular expression.
Extracts spans (slices of documents) that contain dates (using regex).
Extracts spans (slices of documents) that contain paragraphs (using regex).
Extracts spans (slices of documents) that contain numbers (using regex).
Extracts spans (slices of documents) that contain email addresses (using regex).
Extracts spans (slices of documents) that contain US currency (using regex).
A SpanExtractor that reads spans directly from its config.
A SpanExtractor that reads spans directly from a file with columns ["char_start", "char_end", "context_uid", "span_field", "initial_label", "span_entity"].
A SpanExtractor that yields (and optionally links) spans in an entity-to-aliases dictionary.
A SpanExtractor that yields (and optionally links) spans in an entity-to-aliases dictionary, which supports regexes.
(Optimized for keyword aliases) A SpanExtractor that yields (and optionally links) spans, given an entity-to-aliases dictionary and a doc-id-to-entity dictionary.
A SpanExtractor that yields every token, given a selected tokenization strategy.
A SpanExtractor that yields all noun phrases according to spaCy.
A SpanExtractor that yields all matches for a given NER tag according to spaCy.
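As an illustration of what a regex-based span extractor produces, the sketch below yields one record per match, using the span column names listed above (`context_uid`, `span_field`, `char_start`, `char_end`). The function name and the `span_text` key are hypothetical, and treating `char_end` as an exclusive offset (Python slicing convention) is an assumption; the SDK may use a different convention.

```python
import re
from typing import Dict, Iterator, Union

SpanRecord = Dict[str, Union[int, str]]


def extract_regex_spans(
    context_uid: int, span_field: str, text: str, pattern: str
) -> Iterator[SpanRecord]:
    """Yield one span record for every non-overlapping regex match in text."""
    for match in re.finditer(pattern, text):
        yield {
            "context_uid": context_uid,
            "span_field": span_field,
            "char_start": match.start(),
            # Assumption: exclusive end offset, so text[start:end] is the span.
            "char_end": match.end(),
            "span_text": match.group(0),
        }
```

For example, calling it with the pattern `r"[\w.]+@[\w.]+"` on an email body would yield one record per address-like match.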
Normalizers/linkers
Normalizes date spans into their canonical forms, e.g. 2020-01-01.
Normalizes US currency spans into their numerical values.
Normalizes spans by lowercasing them, then capitalizing the first letter of each word.
Normalizes numerical ordinals (1st, 2nd, etc.) to string ordinals (first, second, etc.).
Normalizes numerical cardinals (1, 2, etc.) to string values (one, two, etc.).
Copies the linked span entity from the extractor as the normalized span.
Copies span text as is.
Maps span_text to span_entity given an entity-to-aliases dictionary.
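Two of the normalizers above are easy to sketch in plain Python: date normalization to the canonical `YYYY-MM-DD` form, and mapping span text to an entity through an entity-to-aliases dictionary. The function names, the list of accepted date formats, and case-insensitive alias matching are assumptions made for illustration, not documented SDK behavior.

```python
from datetime import datetime
from typing import Dict, List, Optional, Sequence

# Hypothetical set of input formats; the built-in normalizer may accept others.
DEFAULT_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y")


def normalize_date(
    span_text: str, formats: Sequence[str] = DEFAULT_FORMATS
) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(span_text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_by_alias(
    span_text: str, entity_to_aliases: Dict[str, List[str]]
) -> Optional[str]:
    """Map span text to the entity whose alias list contains it.

    Assumption: matching ignores case and surrounding whitespace.
    """
    needle = span_text.strip().lower()
    for entity, aliases in entity_to_aliases.items():
        if needle in (alias.lower() for alias in aliases):
            return entity
    return None
```

Returning `None` for unmatched input is a design choice of this sketch; a real pipeline might instead keep the original span text.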
Model postprocessors
Filters positive spans with length <= the provided span_length.
Merges same-label nearby spans if the negative-label text in between the spans matches this regex pattern.
Merges same-label nearby spans if the negative-label text in between the spans is in [lower, upper] number of characters (inclusive).
Postprocessor that labels anything that matches the provided pattern with the provided label.
Removes leading and trailing whitespace in positive spans.
Expands predictions to the token boundaries should a prediction boundary fall mid-token.
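The gap-based merge and the whitespace-trimming postprocessors described above can be sketched as follows. Spans are modeled as `(char_start, char_end, label)` tuples with an exclusive end offset; that representation and the function names are assumptions for illustration, not the SDK's data model.

```python
from typing import List, Tuple

# (char_start, char_end, label); char_end is exclusive in this sketch.
Span = Tuple[int, int, str]


def merge_nearby_spans(spans: List[Span], lower: int, upper: int) -> List[Span]:
    """Merge consecutive same-label spans whose gap is in [lower, upper] chars."""
    merged: List[Span] = []
    for start, end, label in sorted(spans):
        if merged:
            prev_start, prev_end, prev_label = merged[-1]
            gap = start - prev_end
            if label == prev_label and lower <= gap <= upper:
                # Extend the previous span to cover the gap and this span.
                merged[-1] = (prev_start, max(prev_end, end), label)
                continue
        merged.append((start, end, label))
    return merged


def strip_span_whitespace(text: str, span: Span) -> Span:
    """Shrink a span so it neither starts nor ends on whitespace."""
    start, end, label = span
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return (start, end, label)
```

The merge runs in a single pass over the sorted spans, so a chain of closely spaced spans collapses into one.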
Reducers
No-op reducer that passes all spans through.
Reduces span predictions in a document to the span which occurs first.
Reduces span predictions in a document to the span which occurs last.
Reduces span predictions in a document to the span with the most confident model prediction.
Reduces span predictions in a document to the span which occurs most frequently.
Reduces span predictions for a document entity by computing the mean prediction.
Reduces span predictions for a document entity to the prediction of the majority vote class.
Reduces span predictions for a document entity to the most confident prediction.
Reduces span predictions for a document entity to the first occurring span of that entity.
Reduces span predictions for a document entity to the last occurring span of that entity.
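The document-level reduction strategies above (first, last, most confident, most frequent) each pick one span from a document's predicted spans. The sketch below shows one way to express them; the record keys `char_start`, `span_text`, and `confidence` and the function names are hypothetical, and tie-breaking rules in the built-in reducers may differ.

```python
from collections import Counter
from typing import Any, Dict, List

SpanPrediction = Dict[str, Any]


def reduce_first(spans: List[SpanPrediction]) -> SpanPrediction:
    """Pick the span with the smallest start offset."""
    return min(spans, key=lambda s: s["char_start"])


def reduce_last(spans: List[SpanPrediction]) -> SpanPrediction:
    """Pick the span with the largest start offset."""
    return max(spans, key=lambda s: s["char_start"])


def reduce_most_confident(spans: List[SpanPrediction]) -> SpanPrediction:
    """Pick the span with the highest model confidence."""
    return max(spans, key=lambda s: s["confidence"])


def reduce_most_frequent(spans: List[SpanPrediction]) -> SpanPrediction:
    """Pick the first listed span whose text occurs most often."""
    counts = Counter(s["span_text"] for s in spans)
    top_text = counts.most_common(1)[0][0]
    return next(s for s in spans if s["span_text"] == top_text)
```

All four functions assume a non-empty list of spans for a single document.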
Miscellaneous
Preprocessor that renames columns.
Processor that drops given columns from the DataFrame.
Processor that concatenates DataFrames along the index axis.
Processor that concatenates columns of DataFrames.
Preprocessor that changes the datapoint type/columns.
Featurizer that converts an array of dicts into a custom Table object.
No-op operator.
Preprocessor that scales a numerical column to mean=0 and std=1.
Operator that replaces values in either a new or existing column in the DataFrame with the given value.
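For intuition, column renaming with a `column_map` and scaling to mean=0 and std=1 can be sketched with the standard library alone. The function names are hypothetical, and the use of the population standard deviation is an assumption; the built-in scaler may use the sample standard deviation instead.

```python
import statistics
from typing import Any, Dict, List


def rename_columns(row: Dict[str, Any], column_map: Dict[str, str]) -> Dict[str, Any]:
    """Rename keys found in column_map and leave all other keys unchanged."""
    return {column_map.get(name, name): value for name, value in row.items()}


def standard_scale(values: List[float]) -> List[float]:
    """Scale values to mean 0 and standard deviation 1.

    Assumption: population standard deviation; constant columns map to zeros.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]
```

With `column_map={"body": "email-body"}`, `rename_columns` mirrors the configuration passed to the `ColumnRenamer` example at the top of this page.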