Version: 0.93

operators

Built-in Operators.

Built-in operators are available in the full SDK and can be added to the DAG programmatically, as shown below:

# Add the operator to the DAG
sf.add_node(
    application=APP_NAME,
    input_node_uids=[123],
    output_node_uid=456,
    op_type="ColumnRenamer",
    op_config={"column_map": {"body": "email-body"}},
)

Featurizers

Text-based

operators.truncate.TruncatePreprocessor(field)

Truncates a column by a given amount.

operators.candidates.span_preview.SpanPreviewPreprocessor([...])

Operator that adds a field with nearby textual features for each extracted span.

operators.whitespace.WhitespacePreprocessor(fields)

Preprocessor that normalizes whitespace.
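As a rough illustration of what whitespace normalization usually means, the sketch below collapses runs of whitespace into a single space and trims both ends. It is a standalone approximation, not the SDK's implementation; the function name `normalize_whitespace` is hypothetical, and the exact rules applied by `WhitespacePreprocessor` may differ.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse any run of whitespace to one space and trim both ends."""
    # Assumption: tabs and newlines are treated like ordinary spaces.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_whitespace("  Hello,\n\tworld  "))
```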

operators.spacy.SpacyPreprocessor(field[, ...])

Preprocessor that parses a document and adds a JSON doc column.

operators.spacy.SpacyTokenizer(text_field[, ...])

Preprocessor that parses a document and adds a tokens JSON column.

operators.spacy.NounChunkFeaturizer(field[, ...])

A Featurizer that yields all noun phrases according to spaCy.

operators.spacy.VerbPhraseFeaturizer(field)

A Featurizer that yields all verb phrases according to a simple part-of-speech verb match.

operators.spacy.SentenceFeaturizer(field[, ...])

A Featurizer that yields all sentences according to spaCy.

operators.embedding.EmbeddingFeaturizer(field)

Featurizer that converts text to an embedding.

operators.embedding.EmbeddingCandidateFeaturizer(...)

Featurizer that converts text to an embedding.

operators.special_char.AsciiCharFilter(field)

Preprocessor that removes non-ASCII characters from the selected column in place.
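A minimal sketch of character filtering, assuming "ASCII" means code points below 128. The helper `remove_non_ascii` is hypothetical and only illustrates the idea on a single string; the operator itself works on a DataFrame column.

```python
def remove_non_ascii(text: str) -> str:
    """Keep only characters whose code point is in the ASCII range."""
    return "".join(ch for ch in text if ord(ch) < 128)


cleaned = remove_non_ascii("café ☕ ok")
print(repr(cleaned))
```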

operators.special_char.LatinCharFilter(field)

Preprocessor that removes non-Latin characters from the selected column in place.

operators.candidates.context.ContextAggregator()

Adds a column with aggregated spans for the current context_uid.

operators.candidates.extractor.EmptySpanFeaturizer(field)

A SpanFeaturizer that yields a single empty span from (0,0) for each row

operators.candidates.extractor.RegexSpanFeaturizer(...)

A SpanFeaturizer that yields all matches for a given regular expression

operators.candidates.extractor.DateSpanFeaturizer(field)

Extracts spans (slices of documents) that contain dates (using regex)

operators.candidates.extractor.ParagraphSpanFeaturizer(field)

Extracts spans (slices of documents) that contain paragraphs (using regex)

operators.candidates.extractor.NumericSpanFeaturizer(field)

Extracts spans (slices of documents) that contain numeric values

operators.candidates.extractor.EmailAddressSpanFeaturizer(field)

Extracts spans (slices of documents) that contain email addresses (using regex)

operators.candidates.extractor.USCurrencySpanFeaturizer(field)

Extracts spans (slices of documents) that contain US currency (using regex)

operators.candidates.extractor.HardCodedSpanFeaturizer(...)

A SpanFeaturizer that reads spans directly from its config.

operators.candidates.extractor.SpansFileFeaturizer(path)

A SpanFeaturizer that reads spans directly from a file with the expected span columns

operators.candidates.extractor.EntityDictSpanFeaturizer(...)

SpanFeaturizer that yields (and optionally links) spans in an entity-to-aliases dictionary

operators.candidates.extractor.EntityDictRegexSpanFeaturizer(...)

A SpanFeaturizer that yields (and optionally links) spans in an entity-to-aliases dictionary, which supports regexes.

operators.candidates.extractor.DocEntityDictSpanFeaturizer(...)

(Optimized for keyword aliases) SpanFeaturizer that yields (and optionally links) spans, given an entity-to-aliases dictionary and doc-id-to-entity dictionary.

operators.candidates.extractor_spacy.TokenSpanFeaturizer(field)

A SpanFeaturizer that yields every token, given a selected tokenization strategy

operators.candidates.extractor_spacy.NounChunkSpanFeaturizer(...)

A SpanFeaturizer that yields all noun phrases according to spaCy

operators.candidates.extractor_spacy.TagSpanFeaturizer(...)

A SpanFeaturizer that yields all matches for a given NER tag according to spaCy

operators.candidates.extractor_spacy.SpacyNERSpanFeaturizer(...)

A SpanFeaturizer that yields all matches for a default list of NER tags according to spaCy

PDF-based

operators.candidates.rich_doc_features.RichDocRegexNGramDetector(regex)

Featurizer that detects ngrams matching a regex pattern.

operators.candidates.rich_doc_features.RichDocRegexPageFeaturizer(...)

This operator adds a list of pages to retain based on the regex pattern provided.

operators.candidates.rich_doc_features.RichDocSpanBaseFeaturesPreprocessor()

Operator that computes basic features for each span using the associated RichDoc object (e.g., bounding box values of the span, page numbers, etc.)

operators.candidates.rich_doc_features.RichDocSpanRowFeaturesPreprocessor([...])

Operator that computes row-level features for a span.

operators.candidates.rich_doc_features.RichDocSpanStructuralPreprocessor([...])

Operator that computes structural RichDoc features for a span.

operators.candidates.rich_doc_features.RichDocSpanVisualPreprocessor([...])

Operator that computes visual RichDoc features for a span.

operators.candidates.rich_doc_page.RichDocPagePreprocessor()

Operator that filters out all RichDoc pages except the one containing the span to show on the frontend (for the sake of performance)

operators.pdf.checkbox.CheckboxFeaturizer([...])

A featurizer that identifies checkboxes in PDF documents.

operators.pdf.checkbox.CheckboxSpanMapper([...])

An operator that assigns checkbox-related features to the spans in the document.

operators.pdf.hocr.HocrToRichDocParser(field)

Operator that parses hOCR and creates a RichDoc, a Snorkel-Flow-native representation of a pdf document with formatting preserved.

operators.pdf.hocr.TruncateHOCR(field[, ...])

Truncates an hOCR document to a given number of pages.

operators.pdf.lines.LinesFeaturizer(field[, ...])

A featurizer that identifies horizontal and vertical lines in PDF documents.

operators.pdf.lines.LinesPageFilterFeaturizer(...)

Operator that filters horizontal and vertical lines to a subset of pages.

operators.pdf.page_splitter.PageSplitter([...])

Split PDFs into pages for subsequent filtering and faster processing.

operators.pdf.parser.PDFToRichDocParser(field)

Operator that parses a PDF into a RichDoc (see docs for details).

operators.pdf.text_cluster.TextClusterer([...])

Operator that clusters horizontally aligned words using word spacing.
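To make "clusters horizontally aligned words using word spacing" concrete, here is a small sketch that groups words on one line whenever the horizontal gap between neighbors stays under a threshold. Everything here is an assumption for illustration: the `(text, x_left, x_right)` word format, the `cluster_line` helper, and the `max_gap` parameter are hypothetical and do not reflect the operator's actual inputs or configuration.

```python
from typing import List, Optional, Tuple

Word = Tuple[str, float, float]  # (text, x_left, x_right), hypothetical format


def cluster_line(words: List[Word], max_gap: float) -> List[List[str]]:
    """Group words on one line; start a new cluster when the gap exceeds max_gap."""
    clusters: List[List[str]] = []
    prev_right: Optional[float] = None
    for text, x_left, x_right in sorted(words, key=lambda w: w[1]):
        if prev_right is None or x_left - prev_right > max_gap:
            clusters.append([])  # gap too wide: begin a new cluster
        clusters[-1].append(text)
        prev_right = x_right
    return clusters


line = [("Total", 10, 40), ("due", 44, 60), ("$120.00", 200, 245)]
print(cluster_line(line, max_gap=10))
```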

operators.pdf.text_cluster.TextClusterSpanExtractor()

Extractor that creates one span per horizontal text cluster.

operators.pdf.text_cluster.TextClusterSpanFeaturizer()

Featurizer that creates a list of spans with one span per horizontal text cluster.

operators.pdf.truncate_pdf.TruncatePDF(...)

Truncates a PDF to a given number of pages.

operators.row_filter.TableRowFilter([...])

A filter that filters out rows without tables.

operators.pdf.table.TableFeaturizer([field, ...])

A featurizer that detects tables in PDF documents.

operators.pdf.table.TableSpanMapper([...])

An operator that maps table predictions to spans.

OCR

operators.ocr.tesseract_featurizer.TesseractFeaturizer(...)

Operator that takes in a PDF URL and outputs the hOCR text.

operators.azure.azure_form_recognizer_parser.AzureFormRecognizerParser(...)

Takes in a PDF URL and runs Azure Form Recognizer on it.

Filters

operators.filter.LabelIntFilter(label_ints)

A filter that includes/excludes all datapoints corresponding to specified label ints.

operators.filter.LabelFilter(label_strs[, ...])

A filter that includes/excludes all datapoints corresponding to specified label strings.

operators.row_filter.BooleanColumnBasedRowFilter(...)

Filters rows based on the specified boolean column.

operators.row_filter.PandasQueryFilter(query)

Includes or excludes rows based on a pandas query

operators.row_filter.TextSizeFilter(field, ...)

A filter that excludes rows with a text column larger than the specified size (in KB)
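The size check can be sketched as follows. This assumes that size is measured on the UTF-8 encoding and that 1 KB is 1024 bytes; neither is confirmed by this page, and `within_size_limit` is a hypothetical helper.

```python
def within_size_limit(text: str, max_kb: float) -> bool:
    """Return True if the UTF-8 encoded text is at most max_kb kilobytes."""
    # Assumption: 1 KB = 1024 bytes, measured after UTF-8 encoding.
    return len(text.encode("utf-8")) / 1024 <= max_kb


rows = [{"body": "short"}, {"body": "x" * 4096}]
kept = [row for row in rows if within_size_limit(row["body"], max_kb=2)]
print(len(kept))
```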

operators.row_filter.RegexRowFilter(...[, ...])

Filters rows based on the regex pattern provided.

operators.candidates.filter.RegexSpanFilter(regex)

A filter that REMOVES all candidates that match a given regular expression

operators.candidates.filter.ExtractedSpanFilter([...])

A filter that removes all spans with a negative prediction.

Extractors

operators.candidates.extractor.ListToRowsExploder(...)

A SpanExtractor that explodes a field containing a list of spans.

operators.candidates.extractor.EmptySpanExtractor(field)

A SpanExtractor that yields a single empty span from (0,0) for each row

operators.candidates.extractor.RegexSpanExtractor(...)

A SpanExtractor that yields all matches for a given regular expression
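A minimal sketch of regex-based span extraction using only the standard library. The `char_start` and `char_end` keys mirror the span column names listed for `SpansFileExtractor`; the `regex_spans` helper and the `span_text` key are hypothetical, and the sketch uses Python's exclusive end offset, which may not match the SDK's convention.

```python
import re
from typing import Dict, List, Union


def regex_spans(text: str, pattern: str) -> List[Dict[str, Union[int, str]]]:
    """Return one span record per non-overlapping regex match."""
    spans: List[Dict[str, Union[int, str]]] = []
    for match in re.finditer(pattern, text):
        spans.append(
            {
                "char_start": match.start(),
                "char_end": match.end(),  # exclusive end (assumption)
                "span_text": match.group(0),
            }
        )
    return spans


spans = regex_spans("Invoice 42 was paid on day 7.", r"\d+")
print(spans)
```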

operators.candidates.extractor.DateSpanExtractor(field)

Extracts spans (slices of documents) that contain dates (using regex)

operators.candidates.extractor.ParagraphSpanExtractor(field)

Extracts spans (slices of documents) that contain paragraphs (using regex)

operators.candidates.extractor.NumericSpanExtractor(field)

Extracts spans (slices of documents) that contain numbers (using regex)

operators.candidates.extractor.EmailAddressSpanExtractor(field)

Extracts spans (slices of documents) that contain email addresses (using regex)

operators.candidates.extractor.USCurrencySpanExtractor(field)

Extracts spans (slices of documents) that contain US currency (using regex)

operators.candidates.extractor.HardCodedSpanExtractor(...)

A SpanExtractor that reads spans directly from its config.

operators.candidates.extractor.SpansFileExtractor(path)

A SpanExtractor that reads spans directly from a file with columns ["char_start", "char_end", "context_uid", "span_field", "initial_label", "span_entity"]

operators.candidates.extractor.EntityDictSpanExtractor(...)

SpanExtractor that yields (and optionally links) spans in an entity-to-aliases dictionary

operators.candidates.extractor.EntityDictRegexSpanExtractor(...)

A SpanExtractor that yields (and optionally links) spans in an entity-to-aliases dictionary, which supports regexes.

operators.candidates.extractor.DocEntityDictSpanExtractor(...)

(Optimized for keyword aliases) SpanExtractor that yields (and optionally links) spans, given an entity-to-aliases dictionary and doc-id-to-entity dictionary.

operators.candidates.extractor_spacy.TokenSpanExtractor(field)

A SpanExtractor that yields every token, given a selected tokenization strategy

operators.candidates.extractor_spacy.NounChunkSpanExtractor(field)

A SpanExtractor that yields all noun phrases according to spaCy

operators.candidates.extractor_spacy.TagSpanExtractor(...)

A SpanExtractor that yields all matches for a given NER tag according to spaCy

operators.candidates.extractor_spacy.SpacyNERSpanExtractor(...)

A SpanExtractor that yields all matches for a default list of NER tags according to spaCy

Normalizers/linkers

operators.candidates.normalizer.DateSpanNormalizer()

Normalizes date spans into their canonical forms, e.g. 2020-01-01.
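The sketch below shows one way a date span could be mapped to the canonical `YYYY-MM-DD` form mentioned above. The list of accepted input formats and the `normalize_date` helper are assumptions for illustration; the formats the operator actually recognizes are not listed on this page.

```python
from datetime import datetime
from typing import Optional

# Hypothetical set of input formats to try, in order.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%d %B %Y")


def normalize_date(span_text: str) -> Optional[str]:
    """Return the ISO 8601 date for span_text, or None if no format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(span_text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(normalize_date("01/01/2020"))
```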

operators.candidates.normalizer.USCurrencySpanNormalizer()

Normalizes US currency spans into their numerical values.
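As an illustration, a currency span such as `$1,234.50` can be reduced to a number by stripping the dollar sign and thousands separators. This is a simplified sketch with a hypothetical `normalize_us_currency` helper; it does not handle words like "million" or negative amounts, which the operator may or may not support.

```python
import re
from typing import Optional


def normalize_us_currency(span_text: str) -> Optional[float]:
    """Convert text like '$1,234.50' to 1234.5, or None if it does not parse."""
    cleaned = span_text.strip().replace("$", "").replace(",", "")
    if not re.fullmatch(r"\d+(\.\d+)?", cleaned):
        return None
    return float(cleaned)


print(normalize_us_currency("$1,234.50"))
```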

operators.candidates.normalizer.TextCasingSpanNormalizer()

Normalizes spans by lowercasing them, then capitalizing the first letter of each word.
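The casing rule described above can be sketched in one line of standard Python. The `normalize_casing` helper is hypothetical, and splitting on single spaces is an assumption about what counts as a word.

```python
def normalize_casing(span_text: str) -> str:
    """Lowercase the text, then capitalize the first letter of each word."""
    # Assumption: words are separated by single space characters.
    return " ".join(word.capitalize() for word in span_text.lower().split(" "))


print(normalize_casing("aCME holdings LLC"))
```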

operators.candidates.normalizer.OrdinalSpanNormalizer()

Normalizes numerical ordinals (1st, 2nd, etc) to string ordinals (first, second, etc).

operators.candidates.normalizer.NumericalSpanNormalizer()

Normalizes numerical cardinals (1, 2, etc) to string values (one, two, etc).

operators.candidates.normalizer.SpanEntityNormalizer()

Copies the linked span entity from the extractor as the normalized span.

operators.candidates.normalizer.IdentitySpanNormalizer()

Copies span text as is.

operators.candidates.linker.EntityDictLinker(...)

Maps span_text to span_entity given an entity to alias dictionary.
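Linking with an entity-to-aliases dictionary amounts to inverting the dictionary and looking up each span's text. The sketch below assumes case-insensitive matching and treats the entity name itself as an alias; both are assumptions, and `build_alias_index` and `link_span` are hypothetical helpers.

```python
from typing import Dict, List, Optional


def build_alias_index(entity_to_aliases: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert {entity: [aliases]} into {lowercased alias: entity}."""
    index: Dict[str, str] = {}
    for entity, aliases in entity_to_aliases.items():
        for alias in [entity, *aliases]:
            index[alias.lower()] = entity
    return index


def link_span(span_text: str, index: Dict[str, str]) -> Optional[str]:
    """Return the linked entity for span_text, or None if no alias matches."""
    return index.get(span_text.strip().lower())


index = build_alias_index({"United States": ["US", "USA", "U.S."]})
print(link_span("usa", index))
```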

Model postprocessors

operators.post_processors.sequence_tagging_post_processors.SpanFilterByLengthPostProcessor(...)

Filter positive spans with length <= the provided span_length

operators.post_processors.sequence_tagging_post_processors.SpanMergeByRegexPatternPostProcessor(...)

Merge same-label nearby spans if the negative-label text in between spans matches this regex pattern.

operators.post_processors.sequence_tagging_post_processors.SpanMergeByNumberCharacterPostProcessor(...)

Merge same-label nearby spans if the negative-label text in between spans is in [lower, upper] number of characters (inclusive).
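The gap-based merge rule can be sketched as below. Spans are represented as hypothetical `(start, end, label)` tuples with an exclusive end offset, so the gap between two spans is `next_start - prev_end`; the representation and the `merge_by_gap` helper are assumptions, not the SDK's data model.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label), hypothetical format


def merge_by_gap(spans: List[Span], lower: int, upper: int) -> List[Span]:
    """Merge adjacent same-label spans whose gap is within [lower, upper]."""
    merged: List[Span] = []
    for start, end, label in sorted(spans):
        if merged:
            prev_start, prev_end, prev_label = merged[-1]
            gap = start - prev_end
            if prev_label == label and lower <= gap <= upper:
                # Extend the previous span to cover the current one.
                merged[-1] = (prev_start, end, label)
                continue
        merged.append((start, end, label))
    return merged


print(merge_by_gap([(0, 4, "NAME"), (5, 10, "NAME"), (30, 35, "NAME")], 0, 2))
```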

operators.post_processors.sequence_tagging_post_processors.SpanRegexPostProcessor(...)

Postprocessor that labels anything matching the provided pattern with the provided label

operators.post_processors.sequence_tagging_post_processors.SpanRemoveWhitespacePostProcessor(field)

Remove leading & trailing whitespace in positive spans.
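Trimming whitespace from a span means moving its offsets inward rather than editing the document text. A minimal sketch, assuming character offsets with an exclusive end and a hypothetical `strip_span` helper:

```python
from typing import Tuple


def strip_span(text: str, start: int, end: int) -> Tuple[int, int]:
    """Shrink [start, end) so it no longer begins or ends with whitespace."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


text = "Pay  Acme Corp  now"
start, end = strip_span(text, 3, 16)
print(text[start:end])
```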

operators.post_processors.sequence_tagging_post_processors.SubstringExpansionPostProcessor(...)

This postprocessor expands predictions to the token boundaries should a prediction boundary fall mid-token.
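Expanding a prediction to token boundaries can be illustrated as follows. The sketch assumes tokens are whitespace-delimited and that offsets use an exclusive end; the tokenization the postprocessor actually uses is not stated here, and `expand_to_token_boundaries` is a hypothetical helper.

```python
from typing import Tuple


def expand_to_token_boundaries(text: str, start: int, end: int) -> Tuple[int, int]:
    """Grow [start, end) outward until both edges sit on whitespace or the text edge."""
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end


text = "contact Jonathan today"
start, end = expand_to_token_boundaries(text, 10, 14)  # covers "nath"
print(text[start:end])
```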

Reducers

operators.candidates.reducer.IdentityReducer([...])

No-Op Reducer that passes all Spans through

operators.candidates.reducer.DocumentFirstReducer([...])

Reduces span predictions in a document to the span which occurs first.

operators.candidates.reducer.DocumentLastReducer([...])

Reduces span predictions in a document to the span which occurs last.

operators.candidates.reducer.DocumentMostConfidentReducer([...])

Reduces span predictions in a document to the span with the most confident model prediction.

operators.candidates.reducer.DocumentMostCommonReducer([...])

Reduces span predictions in a document to the span which occurs most frequently.
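The document-level reducers each pick one span per document by a different criterion. The sketch below applies the four criteria to the spans of a single document; the record layout (`char_start`, `span_text`, `confidence`) is a hypothetical stand-in for the SDK's span data.

```python
from collections import Counter

# Spans predicted for one document (hypothetical record layout).
spans = [
    {"char_start": 40, "span_text": "Acme Corp", "confidence": 0.71},
    {"char_start": 5, "span_text": "Acme Inc", "confidence": 0.64},
    {"char_start": 90, "span_text": "Acme Corp", "confidence": 0.93},
]

first = min(spans, key=lambda s: s["char_start"])           # occurs first
last = max(spans, key=lambda s: s["char_start"])            # occurs last
most_confident = max(spans, key=lambda s: s["confidence"])  # highest confidence
most_common_text = Counter(s["span_text"] for s in spans).most_common(1)[0][0]

print(first["span_text"], last["span_text"], most_common_text)
```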

operators.candidates.reducer.EntityMeanPredictionReducer([...])

Reduces span predictions for a document entity by computing the mean prediction.

operators.candidates.reducer.EntityMostCommonReducer([...])

Reduces span predictions for a document entity to the prediction of the majority vote class.
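Entity-level reducers group predictions by entity before reducing. The sketch below computes a mean prediction and a majority vote per entity, assuming a binary task, a 0.5 decision threshold, and a hypothetical `prob_positive` field; none of these are confirmed by this page.

```python
from collections import Counter, defaultdict
from statistics import fmean

rows = [
    {"span_entity": "acme", "prob_positive": 0.9},
    {"span_entity": "acme", "prob_positive": 0.6},
    {"span_entity": "acme", "prob_positive": 0.2},
    {"span_entity": "globex", "prob_positive": 0.3},
]

# Group the per-span probabilities by entity.
by_entity = defaultdict(list)
for row in rows:
    by_entity[row["span_entity"]].append(row["prob_positive"])

mean_prediction = {entity: fmean(probs) for entity, probs in by_entity.items()}
majority_vote = {
    entity: Counter(p >= 0.5 for p in probs).most_common(1)[0][0]
    for entity, probs in by_entity.items()
}
print(mean_prediction, majority_vote)
```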

operators.candidates.reducer.EntityMostConfidentReducer([...])

Reduces span predictions for a document entity to the most confident prediction.

operators.candidates.reducer.EntityFirstReducer([...])

Reduces span predictions for a document entity to the first occurring span of that entity.

operators.candidates.reducer.EntityLastReducer([...])

Reduces span predictions for a document entity to the last occurring span of that entity.

Miscellaneous

operators.rename.ColumnRenamer(column_map)

Preprocessor that renames columns
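The effect of a `column_map` such as `{"body": "email-body"}` can be illustrated on a plain dictionary row. This is only a sketch of the renaming semantics with a hypothetical `rename_columns` helper; the operator itself runs on DataFrames inside the DAG.

```python
from typing import Any, Dict


def rename_columns(row: Dict[str, Any], column_map: Dict[str, str]) -> Dict[str, Any]:
    """Rename keys found in column_map and leave all other keys unchanged."""
    return {column_map.get(name, name): value for name, value in row.items()}


row = {"subject": "Hi", "body": "See attached."}
renamed = rename_columns(row, {"body": "email-body"})
print(renamed)
```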

operators.drop.ColumnDropper(fields)

Processor that drops given columns from the DataFrame.

operators.concat.ConcatRows()

Processor that concatenates dataframes along the index axis.

operators.concat.ConcatColumns([join_type, ...])

Processor that concatenates columns of dataframes.

operators.change_datapoint.ChangeDatapoint(...)

Preprocessor that changes the datapoint type/columns.

operators.table.TableConverter(field[, ...])

Featurizer that converts an array of dicts into a custom Table object

operators.identity.IdentityOperator()

No-op operator

operators.scaler.StandardScaler(field[, ...])

Preprocessor that scales a numerical column to mean=0 and std=1.
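Scaling to mean 0 and standard deviation 1 follows the usual formula z = (x - mean) / std. The sketch below uses the population standard deviation, which is an assumption (the operator may use the sample estimate), and a hypothetical `standard_scale` helper.

```python
from statistics import fmean, pstdev
from typing import List


def standard_scale(values: List[float]) -> List[float]:
    """Return z-scores of values using the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # Constant column: every z-score is defined as 0 here (assumption).
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


scaled = standard_scale([2.0, 4.0, 6.0])
print(scaled)
```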

operators.filler.ColumnFiller(field, value)

Operator that replaces values in either a new or existing column in the DataFrame with the given value