Version: 0.94

operators

Built-in Operators.

Built-in operators are available in the full SDK and can be added to the DAG programmatically, as shown below:

# Add the operator to the DAG.
# Assumes `sf` is the SDK client and APP_NAME holds your application name.
sf.add_node(
    application=APP_NAME,
    input_node_uids=[123],
    output_node_uid=456,
    op_type="ColumnRenamer",
    op_config={"column_map": {"body": "email-body"}},
)
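
To make the op_config above concrete, here is a minimal sketch of what a column_map of {"body": "email-body"} means for a single record. The helper rename_columns is hypothetical and only imitates the renaming on a plain dict; it is not the SDK's implementation.

```python
# Illustrative only: imitates the effect of a column_map on one record.
# rename_columns is a hypothetical helper, not part of the SDK.
def rename_columns(row: dict, column_map: dict) -> dict:
    """Return a copy of row whose keys are renamed according to column_map."""
    return {column_map.get(key, key): value for key, value in row.items()}


row = {"subject": "Hello", "body": "See attached."}
renamed = rename_columns(row, {"body": "email-body"})
print(renamed)
```

Keys that do not appear in the map are passed through unchanged.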

Featurizers

Text-based

operators.truncate.TruncatePreprocessor(field): Truncates a column by a given amount.
operators.candidates.span_preview.SpanPreviewPreprocessor([...]): Operator that adds a field with nearby textual features for each extracted span.
operators.whitespace.WhitespacePreprocessor(fields): Preprocessor that normalizes whitespace.
operators.spacy.SpacyPreprocessor(field[, ...]): Preprocessor that parses document and adds json doc column.
operators.spacy.SpacyTokenizer(text_field[, ...]): Preprocessor that parses document and adds tokens json column.
operators.spacy.NounChunkFeaturizer(field[, ...]): A Featurizer that yields all noun phrases according to spaCy.
operators.spacy.VerbPhraseFeaturizer(field): A Featurizer that yields all verb phrases according to a simple part-of-speech verb match.
operators.spacy.SentenceFeaturizer(field[, ...]): A Featurizer that yields all sentences according to spaCy.
operators.embedding.EmbeddingFeaturizer(field): Featurizer that converts text to an embedding.
operators.embedding.EmbeddingCandidateFeaturizer(...): Featurizer that converts text to an embedding.
operators.special_char.AsciiCharFilter(field): Preprocessor that removes non-ascii chars from selected column in place.
operators.special_char.LatinCharFilter(field): Preprocessor that removes non-latin chars from selected column in place.
operators.candidates.context.ContextAggregator(): Adds a column with aggregated spans for the current context_uid.
operators.candidates.extractor.EmptySpanFeaturizer(field): A SpanFeaturizer that yields a single empty span from (0,0) for each row
operators.candidates.extractor.RegexSpanFeaturizer(...): A SpanFeaturizer that yields all matches for a given regular expression
operators.candidates.extractor.DateSpanFeaturizer(field): Extracts spans (slices of documents) that contain dates (using regex)
operators.candidates.extractor.ParagraphSpanFeaturizer(field): Extracts spans (slices of documents) that contain paragraphs (using regex)
operators.candidates.extractor.NumericSpanFeaturizer(field): Extracts spans (slices of documents) that contain numeric values
operators.candidates.extractor.EmailAddressSpanFeaturizer(field): Extracts spans (slices of documents) that contain email addresses (using regex)
operators.candidates.extractor.USCurrencySpanFeaturizer(field): Extracts spans (slices of documents) that contain US currency (using regex)
operators.candidates.extractor.HardCodedSpanFeaturizer(...): A SpanFeaturizer that reads spans directly from its config.
operators.candidates.extractor.SpansFileFeaturizer(path): A SpanFeaturizer that reads spans directly from a file with the expected span columns
operators.candidates.extractor.EntityDictSpanFeaturizer(...): SpanFeaturizer that yields (and optionally links) spans in an entity-to-aliases dictionary
operators.candidates.extractor.EntityDictRegexSpanFeaturizer(...): A SpanFeaturizer that yields (and optionally links) spans in an entity-to-aliases dictionary, which supports regexes.
operators.candidates.extractor.DocEntityDictSpanFeaturizer(...): (Optimized for keyword aliases) SpanFeaturizer that yields (and optionally links) spans, given an entity-to-aliases dictionary and doc-id-to-entity dictionary.
operators.candidates.extractor_spacy.TokenSpanFeaturizer(field): A SpanFeaturizer that yields every token, given a selected tokenization strategy
operators.candidates.extractor_spacy.NounChunkSpanFeaturizer(...): A SpanFeaturizer that yields all noun phrases according to spaCy
operators.candidates.extractor_spacy.TagSpanFeaturizer(...): A SpanFeaturizer that yields all matches for a given NER tag according to spaCy
operators.candidates.extractor_spacy.SpacyNERSpanFeaturizer(...): A SpanFeaturizer that yields all matches for a default list of NER tags according to spaCy
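
As a rough illustration of the simplest text preprocessors in this list, the sketch below imitates whitespace normalization and non-ASCII removal on plain strings. Both function names are hypothetical, and the exact rules are assumptions: runs of whitespace collapse to a single space, and "ASCII" means code points below 128. The SDK operators act on DataFrame columns and may behave differently.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace to one space and trim both ends (assumed rule)."""
    return re.sub(r"\s+", " ", text).strip()


def keep_ascii(text: str) -> str:
    """Drop every character whose code point is 128 or higher."""
    return "".join(ch for ch in text if ord(ch) < 128)


cleaned = normalize_whitespace("  a\t\tb\n c ")
ascii_only = keep_ascii("café ☕ ok")
print(repr(cleaned), repr(ascii_only))
```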

PDF-based

operators.candidates.rich_doc_features.RichDocRegexNGramDetector(regex): Featurizer that detects ngrams matching a regex pattern.
operators.candidates.rich_doc_features.RichDocRegexPageFeaturizer(...): This operator adds a list of pages to retain based on the regex pattern provided.
operators.candidates.rich_doc_features.RichDocSpanBaseFeaturesPreprocessor(): Operator that computes basic features for each span using the associated RichDoc object (e.g., bounding box values of the span, page numbers, etc.)
operators.candidates.rich_doc_features.RichDocSpanRowFeaturesPreprocessor([...]): Operator that computes row-level features for a span (eg.
operators.candidates.rich_doc_features.RichDocSpanStructuralPreprocessor([...]): Operator to compute structural Rich Doc features for span.
operators.candidates.rich_doc_features.RichDocSpanVisualPreprocessor([...]): Operator to compute visual Rich Doc features for span.
operators.candidates.rich_doc_page.RichDocPagePreprocessor(): Operator that filters out all RichDoc pages except the one containing the span to show on the frontend (for the sake of performance)
operators.pdf.checkbox.CheckboxFeaturizer([...]): A featurizer that identifies checkboxes in PDF documents.
operators.pdf.checkbox.CheckboxSpanMapper([...]): An operator that assigns checkbox-related features to the spans in the document.
operators.pdf.hocr.HocrToRichDocParser(field): Operator that parses hOCR and creates a RichDoc, a Snorkel-Flow-native representation of a PDF document with formatting preserved.
operators.pdf.hocr.TruncateHOCR(field[, ...]): Truncates an hOCR document to a certain number of pages.
operators.pdf.lines.LinesFeaturizer(field[, ...]): A featurizer that identifies horizontal and vertical lines in PDF documents.
operators.pdf.lines.LinesPageFilterFeaturizer(...): Operator that filters horizontal and vertical lines to a subset of pages.
operators.pdf.page_splitter.PageSplitter([...]): Split PDFs into pages for subsequent filtering and faster processing.
operators.pdf.parser.PDFToRichDocParser(field): Operator that parses a PDF into a RichDoc (see docs for details).
operators.pdf.parser2.PDFToRichDocParser2(field): Operator that parses a PDF into Snorkel's RichDoc representation.
operators.pdf.text_cluster.TextClusterer([...]): Operator that clusters horizontally aligned words using word spacing.
operators.pdf.text_cluster.TextClusterSpanExtractor(): Extractor that creates one span per horizontal text cluster.
operators.pdf.text_cluster.TextClusterSpanFeaturizer(): Featurizer that creates a list of spans with one span per horizontal text cluster.
operators.pdf.truncate_pdf.TruncatePDF(...): Truncates a PDF to a certain number of pages.
operators.row_filter.TableRowFilter([...]): A filter that filters out rows without tables.
operators.pdf.table.TableFeaturizer([field, ...]): A featurizer that detects tables in PDF documents.
operators.pdf.table.TableSpanMapper([...]): An operator that maps table predictions to spans.
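
The idea of clustering horizontally aligned words by their spacing can be sketched on toy data. Everything below is an assumption made for illustration: the word records with x0/x1 coordinates, the max_gap threshold, and the function name cluster_words are hypothetical and do not reflect the RichDoc format or the TextClusterer configuration.

```python
def cluster_words(words: list, max_gap: float) -> list:
    """Group words on one text line; a horizontal gap above max_gap starts a new cluster."""
    clusters = []
    for word in sorted(words, key=lambda w: w["x0"]):
        if clusters and word["x0"] - clusters[-1][-1]["x1"] <= max_gap:
            clusters[-1].append(word)  # close enough to the previous word
        else:
            clusters.append([word])  # gap too wide (or first word): new cluster
    return [" ".join(w["text"] for w in cluster) for cluster in clusters]


# Toy words from a single line of a page, with left (x0) and right (x1) edges.
line = [
    {"text": "Invoice", "x0": 10, "x1": 60},
    {"text": "No.", "x0": 65, "x1": 85},
    {"text": "1042", "x0": 90, "x1": 120},
    {"text": "Total", "x0": 300, "x1": 340},
    {"text": "$99", "x0": 345, "x1": 370},
]
clusters = cluster_words(line, max_gap=10)
print(clusters)
```

Each resulting cluster corresponds to what the span-oriented operators above would treat as one span.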

OCR

operators.ocr.tesseract_featurizer.TesseractFeaturizer(...): Operator that takes in a PDF URL and outputs the hOCR text.
operators.azure.azure_form_recognizer_parser.AzureFormRecognizerParser(...): Takes in a PDF URL and runs Azure Form Recognizer on it.

Filters

operators.filter.LabelIntFilter(label_ints): A filter that includes/excludes all datapoints corresponding to specified label ints.
operators.filter.LabelFilter(label_strs[, ...]): A filter that includes/excludes all datapoints corresponding to specified label strings.
operators.row_filter.BooleanColumnBasedRowFilter(...): Filters rows based on the specified boolean column.
operators.row_filter.PandasQueryFilter(query): Includes or excludes rows based on a pandas query
operators.row_filter.TextSizeFilter(field, ...): A filter that excludes rows with a text column larger than the specified size (in KB)
operators.row_filter.RegexRowFilter(...[, ...]): Filters rows based on the regex pattern provided.
operators.candidates.filter.RegexSpanFilter(regex): A filter that REMOVES all candidates that match a given regular expression
operators.candidates.filter.ExtractedSpanFilter([...]): A filter that removes all spans with a negative prediction.
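
The two row filters that work on raw text can be imitated on a list of dicts. This is a minimal sketch, not the SDK code: the function names and the exclude flag are hypothetical, and it assumes that size is measured in UTF-8 bytes with 1 KB = 1024 bytes.

```python
import re


def text_size_filter(rows: list, field: str, max_kb: float) -> list:
    """Keep rows whose text in `field` is at most max_kb kilobytes (UTF-8, 1 KB = 1024 bytes)."""
    limit = max_kb * 1024
    return [row for row in rows if len(str(row[field]).encode("utf-8")) <= limit]


def regex_row_filter(rows: list, field: str, pattern: str, exclude: bool = False) -> list:
    """Keep rows whose `field` matches pattern, or the non-matching rows when exclude=True."""
    compiled = re.compile(pattern)
    return [row for row in rows if bool(compiled.search(str(row[field]))) != exclude]


rows = [
    {"id": 1, "body": "invoice 42"},
    {"id": 2, "body": "x" * 3000},
    {"id": 3, "body": "hello"},
]
small_ids = [row["id"] for row in text_size_filter(rows, "body", max_kb=2)]
invoice_ids = [row["id"] for row in regex_row_filter(rows, "body", r"invoice")]
other_ids = [row["id"] for row in regex_row_filter(rows, "body", r"invoice", exclude=True)]
print(small_ids, invoice_ids, other_ids)
```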

Extractors

operators.candidates.extractor.ListToRowsExploder(...): A SpanExtractor that explodes a field containing a list of spans.
operators.candidates.extractor.EmptySpanExtractor(field): A SpanExtractor that yields a single empty span from (0,0) for each row
operators.candidates.extractor.RegexSpanExtractor(...): A SpanExtractor that yields all matches for a given regular expression
operators.candidates.extractor.DateSpanExtractor(field): Extracts spans (slices of documents) that contain dates (using regex)
operators.candidates.extractor.ParagraphSpanExtractor(field): Extracts spans (slices of documents) that contain paragraphs (using regex)
operators.candidates.extractor.NumericSpanExtractor(field): Extracts spans (slices of documents) that contain numbers (using regex)
operators.candidates.extractor.EmailAddressSpanExtractor(field): Extracts spans (slices of documents) that contain email addresses (using regex)
operators.candidates.extractor.USCurrencySpanExtractor(field): Extracts spans (slices of documents) that contain US currency (using regex)
operators.candidates.extractor.HardCodedSpanExtractor(...): A SpanExtractor that reads spans directly from its config.
operators.candidates.extractor.SpansFileExtractor(path): A SpanExtractor that reads spans directly from a file with columns ["char_start", "char_end", "context_uid", "span_field", "initial_label", "span_entity"]
operators.candidates.extractor.EntityDictSpanExtractor(...): SpanExtractor that yields (and optionally links) spans in an entity-to-aliases dictionary
operators.candidates.extractor.EntityDictRegexSpanExtractor(...): A SpanExtractor that yields (and optionally links) spans in an entity-to-aliases dictionary, which supports regexes.
operators.candidates.extractor.DocEntityDictSpanExtractor(...): (Optimized for keyword aliases) SpanExtractor that yields (and optionally links) spans, given an entity-to-aliases dictionary and doc-id-to-entity dictionary.
operators.candidates.extractor_spacy.TokenSpanExtractor(field): A SpanExtractor that yields every token, given a selected tokenization strategy
operators.candidates.extractor_spacy.NounChunkSpanExtractor(field): A SpanExtractor that yields all noun phrases according to spaCy
operators.candidates.extractor_spacy.TagSpanExtractor(...): A SpanExtractor that yields all matches for a given NER tag according to spaCy
operators.candidates.extractor_spacy.SpacyNERSpanExtractor(...)
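
Regex-based span extraction, the pattern behind several extractors in this list, can be sketched with the standard re module. The Span tuple, the regex_spans helper, and the simplified e-mail pattern are hypothetical; the field names char_start and char_end are borrowed from the span columns listed for SpansFileExtractor, and treating char_end as exclusive is an assumption.

```python
import re
from typing import Iterator, NamedTuple


class Span(NamedTuple):
    char_start: int  # offset of the first character of the match
    char_end: int    # offset one past the last character (assumed exclusive)
    span_text: str


def regex_spans(text: str, pattern: str) -> Iterator[Span]:
    """Yield one Span per non-overlapping match of pattern in text."""
    for match in re.finditer(pattern, text):
        yield Span(match.start(), match.end(), match.group(0))


# Deliberately simplified e-mail pattern, for illustration only.
EMAIL_PATTERN = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

text = "Contact ann@example.com or bob@example.org."
spans = list(regex_spans(text, EMAIL_PATTERN))
for span in spans:
    print(span)
```

Slicing the document with text[char_start:char_end] recovers the span text, which is what "slices of documents" refers to in the descriptions above.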

Normalizers/linkers

operators.candidates.normalizer.DateSpanNormalizer(): Normalizes date spans into their canonical forms, e.g. 2020-01-01.
operators.candidates.normalizer.USCurrencySpanNormalizer(): Normalizes US currency spans into their numerical values.
operators.candidates.normalizer.TextCasingSpanNormalizer(): Normalizes spans by lowercasing them, then capitalizing the first letter of each word.
operators.candidates.normalizer.OrdinalSpanNormalizer(): Normalizes numerical ordinals (1st, 2nd, etc) to string ordinals (first, second, etc).
operators.candidates.normalizer.NumericalSpanNormalizer(): Normalizes numerical cardinals (1, 2, etc) to string values (one, two, etc).
operators.candidates.normalizer.SpanEntityNormalizer(): Copies the linked span entity from the extractor as the normalized span.
operators.candidates.normalizer.IdentitySpanNormalizer(): Copies span text as is.
operators.candidates.linker.EntityDictLinker(...): Maps span_text to span_entity given an entity to alias dictionary.
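
The normalizers above map a span's raw text to a canonical value. The sketch below imitates three of them on plain strings. The function names and the list of accepted date formats are assumptions chosen for illustration; the SDK operators may accept other inputs and handle edge cases differently.

```python
from datetime import datetime
from typing import Optional

# Assumed input formats; the real normalizer may recognize a different set.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y")


def normalize_date(span_text: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(span_text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_us_currency(span_text: str) -> float:
    """Strip the dollar sign and thousands separators, then parse the number."""
    return float(span_text.replace("$", "").replace(",", "").strip())


def normalize_text_casing(span_text: str) -> str:
    """Lowercase the span, then capitalize the first letter of each word."""
    return " ".join(word.capitalize() for word in span_text.lower().split())


print(normalize_date("01/01/2020"))
print(normalize_us_currency("$1,234.50"))
print(normalize_text_casing("aCME hOLDINGS llc"))
```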

Model postprocessors

operators.post_processors.sequence_tagging_post_processors.SpanFilterByLengthPostProcessor(...): Filter positive spans with length <= the provided span_length.
operators.post_processors.sequence_tagging_post_processors.SpanMergeByRegexPatternPostProcessor(...): Merge same-label nearby spans if the negative-label text in between spans matches this regex pattern.
operators.post_processors.sequence_tagging_post_processors.SpanMergeByNumberCharacterPostProcessor(...): Merge same-label nearby spans if the negative-label text in between spans is in [lower, upper] number of characters (inclusive).
operators.post_processors.sequence_tagging_post_processors.SpanRegexPostProcessor(...): Postprocessor that labels anything that matches the provided pattern with the provided label.
operators.post_processors.sequence_tagging_post_processors.SpanRemoveWhitespacePostProcessor(field): Remove leading & trailing whitespace in positive spans.
operators.post_processors.sequence_tagging_post_processors.SubstringExpansionPostProcessor(...): This postprocessor expands predictions to the token boundaries should a prediction boundary fall mid-token.
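
Two of these postprocessing rules are easy to sketch on (char_start, char_end, label) tuples: merging same-label spans separated by a gap of [lower, upper] characters, and trimming whitespace from span boundaries. The tuple layout and both function names are hypothetical, and the sketch assumes non-overlapping spans with unlabeled (negative) text between them.

```python
def merge_by_gap(spans: list, lower: int, upper: int) -> list:
    """Merge consecutive same-label spans whose gap length is within [lower, upper]."""
    merged = []
    for start, end, label in sorted(spans):
        if merged:
            prev_start, prev_end, prev_label = merged[-1]
            gap = start - prev_end  # number of characters between the two spans
            if prev_label == label and lower <= gap <= upper:
                merged[-1] = (prev_start, end, label)  # extend the previous span
                continue
        merged.append((start, end, label))
    return merged


def strip_span(text: str, start: int, end: int) -> tuple:
    """Shrink [start, end) so that it neither begins nor ends with whitespace."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


spans = [(0, 4, "ORG"), (5, 9, "ORG"), (20, 25, "ORG"), (26, 30, "PER")]
merged = merge_by_gap(spans, lower=1, upper=2)
trimmed = strip_span("  Acme Corp \n", 0, 13)
print(merged, trimmed)
```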

Reducers

operators.candidates.reducer.IdentityReducer([...]): No-Op Reducer that passes all Spans through
operators.candidates.reducer.DocumentFirstReducer([...]): Reduces span predictions in a document to the span which occurs first.
operators.candidates.reducer.DocumentLastReducer([...]): Reduces span predictions in a document to the span which occurs last.
operators.candidates.reducer.DocumentMostConfidentReducer([...]): Reduces span predictions in a document to the span with the most confident model prediction.
operators.candidates.reducer.DocumentMostCommonReducer([...]): Reduces span predictions in a document to the span which occurs most frequently.
operators.candidates.reducer.EntityMeanPredictionReducer([...]): Reduces span predictions for a document entity by computing the mean prediction.
operators.candidates.reducer.EntityMostCommonReducer([...]): Reduces span predictions for a document entity to the prediction of the majority vote class.
operators.candidates.reducer.EntityMostConfidentReducer([...]): Reduces span predictions for a document entity to the most confident prediction.
operators.candidates.reducer.EntityFirstReducer([...]): Reduces span predictions for a document entity to the first occurring span of that entity.
operators.candidates.reducer.EntityLastReducer([...]): Reduces span predictions for a document entity to the last occurring span of that entity.
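
The document-level reducers each pick one representative from the spans predicted for a document. The sketch below shows four selection rules on toy predictions for a single document. The record fields (char_start, span_text, confidence) and the function names are hypothetical stand-ins, not the SDK's schema.

```python
from collections import Counter

# Toy span predictions for one document.
predictions = [
    {"char_start": 40, "span_text": "Acme", "confidence": 0.70},
    {"char_start": 5, "span_text": "Globex", "confidence": 0.90},
    {"char_start": 90, "span_text": "Acme", "confidence": 0.60},
]


def first_span(spans: list) -> dict:
    """Span that occurs earliest in the document."""
    return min(spans, key=lambda s: s["char_start"])


def last_span(spans: list) -> dict:
    """Span that occurs latest in the document."""
    return max(spans, key=lambda s: s["char_start"])


def most_confident_span(spans: list) -> dict:
    """Span with the highest model confidence."""
    return max(spans, key=lambda s: s["confidence"])


def most_common_text(spans: list) -> str:
    """Span text that appears most often among the predictions."""
    return Counter(s["span_text"] for s in spans).most_common(1)[0][0]


print(first_span(predictions)["span_text"])
print(last_span(predictions)["span_text"])
print(most_confident_span(predictions)["span_text"])
print(most_common_text(predictions))
```

The entity-level reducers apply the same kind of rule after grouping spans by their linked entity rather than by document alone.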

Miscellaneous

operators.rename.ColumnRenamer(column_map): Preprocessor that renames columns
operators.drop.ColumnDropper(fields): Processor that drops given columns from the DataFrame.
operators.concat.ConcatRows(): Processor that concatenates dataframes along the index axis.
operators.concat.ConcatColumns([join_type, ...]): Processor that concatenates columns of dataframes.
operators.change_datapoint.ChangeDatapoint(...): Preprocessor that changes the datapoint type/columns.
operators.table.TableConverter(field[, ...]): Featurizer that converts an array of dicts into a custom Table object
operators.identity.IdentityOperator(): No-op operator
operators.scaler.StandardScaler(field[, ...]): Preprocessor that scales a numerical column to mean=0 and std=1.
operators.filler.ColumnFiller(field, value): Operator that replaces values in either a new or existing column in the DataFrame with the given value
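
Scaling a numerical column to mean=0 and std=1, as described for StandardScaler, amounts to subtracting the mean and dividing by the standard deviation. This is a minimal sketch on a plain list; the function name is hypothetical, and the use of the population standard deviation and the handling of constant columns are assumptions.

```python
from statistics import fmean, pstdev


def standard_scale(values: list) -> list:
    """Shift values to mean 0 and scale to unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # Constant column: every value equals the mean, so map all of them to 0.
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, population std 2
scaled = standard_scale(column)
print(scaled)
```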