Version: 0.94

operators

Built-in Operators.

Built-in operators are available in the full SDK and can be added to the DAG programmatically, as in the following example:

# Add the operator to the DAG.
# Assumes `sf` is the imported SDK client and APP_NAME is the name of an existing application.
sf.add_node(
    application=APP_NAME,
    input_node_uids=[123],
    output_node_uid=456,
    op_type="ColumnRenamer",
    op_config={"column_map": {"body": "email-body"}},
)
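
To make the effect of the `op_config` above more concrete, the following sketch mimics it on plain Python rows. It assumes, based only on the operator's summary ("renames columns"), that `column_map` maps old column names to new ones. The helper `rename_columns` and the sample row are hypothetical and are not part of the SDK.

```python
def rename_columns(rows, column_map):
    """Return new rows whose keys are renamed according to column_map (old -> new)."""
    return [
        {column_map.get(key, key): value for key, value in row.items()}
        for row in rows
    ]


# Hypothetical input; only the "body" -> "email-body" mapping comes from the example above.
rows = [{"subject": "Hello", "body": "Please see attached."}]
renamed = rename_columns(rows, {"body": "email-body"})
print(renamed)
```

Columns that do not appear in the mapping pass through unchanged in this sketch.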

Featurizers

Text-based

`operators.truncate.TruncatePreprocessor`(field)	Truncates a column by given amount.
`operators.candidates.span_preview.SpanPreviewPreprocessor`([...])	Operator that adds a field with nearby textual features for each extracted span.
`operators.whitespace.WhitespacePreprocessor`(fields)	Preprocessor that normalizes whitespace.
`operators.spacy.SpacyPreprocessor`(field[, ...])	Preprocessor that parses document and adds json doc column.
`operators.spacy.SpacyTokenizer`(text_field[, ...])	Preprocessor that parses document and adds tokens json column.
`operators.spacy.NounChunkFeaturizer`(field[, ...])	A Featurizer that yields all noun phrases according to spaCy.
`operators.spacy.VerbPhraseFeaturizer`(field)	A Featurizer that yields all verb phrases according to a simple part-of-speech verb match.
`operators.spacy.SentenceFeaturizer`(field[, ...])	A Featurizer that yields all sentences according to spaCy.
`operators.embedding.EmbeddingFeaturizer`(field)	Featurizer that converts text to an embedding.
`operators.embedding.EmbeddingCandidateFeaturizer`(...)	Featurizer that converts text to an embedding.
`operators.special_char.AsciiCharFilter`(field)	Preprocessor that removes non-ascii chars from selected column in place.
`operators.special_char.LatinCharFilter`(field)	Preprocessor that removes non-latin chars from selected column in place.
`operators.candidates.context.ContextAggregator`()	Adds a column with aggregated spans for the current context_uid.
`operators.candidates.extractor.EmptySpanFeaturizer`(field)	A SpanFeaturizer that yields a single empty span from (0,0) for each row
`operators.candidates.extractor.RegexSpanFeaturizer`(...)	A SpanFeaturizer that yields all matches for a given regular expression
`operators.candidates.extractor.DateSpanFeaturizer`(field)	Extracts spans (slices of documents) that contain dates (using regex)
`operators.candidates.extractor.ParagraphSpanFeaturizer`(field)	Extracts spans (slices of documents) that contain paragraphs (using regex)
`operators.candidates.extractor.NumericSpanFeaturizer`(field)	Extracts spans (slices of documents) that contain numeric values
`operators.candidates.extractor.EmailAddressSpanFeaturizer`(field)	Extracts spans (slices of documents) that contain email addresses (using regex)
`operators.candidates.extractor.USCurrencySpanFeaturizer`(field)	Extracts spans (slices of documents) that contain US currency (using regex)
`operators.candidates.extractor.HardCodedSpanFeaturizer`(...)	A SpanFeaturizer that reads spans directly from its config.
`operators.candidates.extractor.SpansFileFeaturizer`(path)	A SpanFeaturizer that reads spans directly from a file with the expected span columns
`operators.candidates.extractor.EntityDictSpanFeaturizer`(...)	SpanFeaturizer that yields (and optionally links) spans in an entity-to-aliases dictionary
`operators.candidates.extractor.EntityDictRegexSpanFeaturizer`(...)	A SpanFeaturizer that yields (and optionally links) spans in an entity-to-aliases dictionary, which supports regexes.
`operators.candidates.extractor.DocEntityDictSpanFeaturizer`(...)	(Optimized for keyword aliases) SpanFeaturizer that yields (and optionally links) spans, given an entity-to-aliases dictionary and doc-id-to-entity dictionary.
`operators.candidates.extractor_spacy.TokenSpanFeaturizer`(field)	A SpanFeaturizer that yields every token, given a selected tokenization strategy
`operators.candidates.extractor_spacy.NounChunkSpanFeaturizer`(...)	A SpanFeaturizer that yields all noun phrases according to spaCy
`operators.candidates.extractor_spacy.TagSpanFeaturizer`(...)	A SpanFeaturizer that yields all matches for a given NER tag according to spaCy
`operators.candidates.extractor_spacy.SpacyNERSpanFeaturizer`(...)	A SpanFeaturizer that yields all matches for a default list of NER tags according to spaCy
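
As a rough illustration of the kind of text cleanup that preprocessors such as `WhitespacePreprocessor` and `AsciiCharFilter` describe, here is a minimal standard-library sketch. The exact normalization rules of the built-in operators are not stated on this page, so collapsing whitespace runs to a single space and silently dropping non-ASCII characters are both assumptions, and the function names are hypothetical.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace into one space and trim both ends (assumed behavior)."""
    return re.sub(r"\s+", " ", text).strip()


def remove_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")


print(normalize_whitespace("  Hello\t\tworld \n"))
print(remove_non_ascii("café ☕ menu"))
```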

PDF-based

`operators.candidates.rich_doc_features.RichDocRegexNGramDetector`(regex)	Featurizer that detects ngrams matching a regex pattern.
`operators.candidates.rich_doc_features.RichDocRegexPageFeaturizer`(...)	This operator adds a list of pages to retain based on the regex pattern provided.
`operators.candidates.rich_doc_features.RichDocSpanBaseFeaturesPreprocessor`()	Operator that computes basic features for each span using the associated RichDoc object (e.g., bounding box values of the span, page numbers, etc.)
`operators.candidates.rich_doc_features.RichDocSpanRowFeaturesPreprocessor`([...])	Operator that computes row-level features for a span (eg.
`operators.candidates.rich_doc_features.RichDocSpanStructuralPreprocessor`([...])	Operator to compute structural Rich Doc features for span.
`operators.candidates.rich_doc_features.RichDocSpanVisualPreprocessor`([...])	Operator to compute visual Rich Doc features for span.
`operators.candidates.rich_doc_page.RichDocPagePreprocessor`()	Operator that filters out all RichDoc pages except the one containing the span to show on the frontend (for the sake of performance)
`operators.pdf.checkbox.CheckboxFeaturizer`([...])	A featurizer that identifies checkboxes in PDF documents.
`operators.pdf.checkbox.CheckboxSpanMapper`([...])	An operator that assigns checkbox-related features to the spans in the document.
`operators.pdf.hocr.HocrToRichDocParser`(field)	Operator that parses hOCR and creates a RichDoc, a Snorkel-Flow-native representation of a pdf document with formatting preserved.
`operators.pdf.hocr.TruncateHOCR`(field[, ...])	Truncates an hOCR document to a certain number of pages.
`operators.pdf.lines.LinesFeaturizer`(field[, ...])	A featurizer that identifies horizontal and vertical lines in PDF documents.
`operators.pdf.lines.LinesPageFilterFeaturizer`(...)	Operator that filters horizontal and vertical lines to subset of pages.
`operators.pdf.page_splitter.PageSplitter`([...])	Split PDFs into pages for subsequent filtering and faster processing.
`operators.pdf.parser.PDFToRichDocParser`(field)	Operator that parses a PDF into a RichDoc (see docs for details).
`operators.pdf.parser2.PDFToRichDocParser2`(field)	Operator that parses a PDF into Snorkel's RichDoc representation.
`operators.pdf.text_cluster.TextClusterer`([...])	Operator that clusters horizontally aligned words using word spacing.
`operators.pdf.text_cluster.TextClusterSpanExtractor`()	Extractor that creates one span per horizontal text cluster.
`operators.pdf.text_cluster.TextClusterSpanFeaturizer`()	Featurizer that creates list of spans with one span per horizontal text cluster.
`operators.pdf.truncate_pdf.TruncatePDF`(...)	Truncates a PDF to a certain number of pages.
`operators.row_filter.TableRowFilter`([...])	A filter that filters out rows without tables.
`operators.pdf.table.TableFeaturizer`([field, ...])	A featurizer that detects tables in PDF documents.
`operators.pdf.table.TableSpanMapper`([...])	An operator that maps table predictions to spans.

OCR

`operators.ocr.tesseract_featurizer.TesseractFeaturizer`(...)	Operator that takes in a PDF URL and outputs the hOCR text.
`operators.azure.azure_form_recognizer_parser.AzureFormRecognizerParser`(...)	Takes in a PDF URL and runs Azure Form Recognizer on it.

Filters

`operators.filter.LabelIntFilter`(label_ints)	A filter that includes/excludes all datapoints corresponding to specified label ints.
`operators.filter.LabelFilter`(label_strs[, ...])	A filter that includes/excludes all datapoints corresponding to specified label strings.
`operators.row_filter.BooleanColumnBasedRowFilter`(...)	Filters rows based on the specified boolean column.
`operators.row_filter.PandasQueryFilter`(query)	Includes or excludes rows based on a pandas query
`operators.row_filter.TextSizeFilter`(field, ...)	A filter that excludes rows with a text column larger than the specified size (in KB)
`operators.row_filter.RegexRowFilter`(...[, ...])	Filters rows based on the regex pattern provided.
`operators.candidates.filter.RegexSpanFilter`(regex)	A filter that REMOVES all candidates that match a given regular expression
`operators.candidates.filter.ExtractedSpanFilter`([...])	A filter that removes all spans with a negative prediction.
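
The size-based row filter can be pictured with a few lines of plain Python. This is only a sketch of the idea behind `TextSizeFilter`: it assumes size is measured in UTF-8 bytes and that 1 KB is 1024 bytes, neither of which is confirmed here, and the `max_kb` parameter name is hypothetical.

```python
def text_size_filter(rows, field, max_kb):
    """Keep only rows whose text in `field` is at most max_kb kilobytes (UTF-8, 1 KB = 1024 bytes)."""
    limit_bytes = max_kb * 1024
    return [row for row in rows if len(row[field].encode("utf-8")) <= limit_bytes]


# Hypothetical rows: one short document and one of roughly 5 KB.
rows = [
    {"doc_id": 1, "body": "a" * 100},
    {"doc_id": 2, "body": "a" * 5000},
]
kept = text_size_filter(rows, field="body", max_kb=1)
print([row["doc_id"] for row in kept])
```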

Extractors

`operators.candidates.extractor.ListToRowsExploder`(...)	A SpanExtractor that explodes a field containing a list of spans.
`operators.candidates.extractor.EmptySpanExtractor`(field)	A SpanExtractor that yields a single empty span from (0,0) for each row
`operators.candidates.extractor.RegexSpanExtractor`(...)	A SpanExtractor that yields all matches for a given regular expression
`operators.candidates.extractor.DateSpanExtractor`(field)	Extracts spans (slices of documents) that contain dates (using regex)
`operators.candidates.extractor.ParagraphSpanExtractor`(field)	Extracts spans (slices of documents) that contain paragraphs (using regex)
`operators.candidates.extractor.NumericSpanExtractor`(field)	Extracts spans (slices of documents) that contain numbers (using regex)
`operators.candidates.extractor.EmailAddressSpanExtractor`(field)	Extracts spans (slices of documents) that contain email addresses (using regex)
`operators.candidates.extractor.USCurrencySpanExtractor`(field)	Extracts spans (slices of documents) that contain US currency (using regex)
`operators.candidates.extractor.HardCodedSpanExtractor`(...)	A SpanExtractor that reads spans directly from its config.
`operators.candidates.extractor.SpansFileExtractor`(path)	A SpanExtractor that reads spans directly from a file with columns ["char_start", "char_end", "context_uid", "span_field", "initial_label", "span_entity"]
`operators.candidates.extractor.EntityDictSpanExtractor`(...)	SpanExtractor that yields (and optionally links) spans in an entity-to-aliases dictionary
`operators.candidates.extractor.EntityDictRegexSpanExtractor`(...)	A SpanExtractor that yields (and optionally links) spans in an entity-to-aliases dictionary, which supports regexes.
`operators.candidates.extractor.DocEntityDictSpanExtractor`(...)	(Optimized for keyword aliases) SpanExtractor that yields (and optionally links) spans, given an entity-to-aliases dictionary and doc-id-to-entity dictionary.
`operators.candidates.extractor_spacy.TokenSpanExtractor`(field)	A SpanExtractor that yields every token, given a selected tokenization strategy
`operators.candidates.extractor_spacy.NounChunkSpanExtractor`(field)	A SpanExtractor that yields all noun phrases according to spaCy
`operators.candidates.extractor_spacy.TagSpanExtractor`(...)	A SpanExtractor that yields all matches for a given NER tag according to spaCy
`operators.candidates.extractor_spacy.SpacyNERSpanExtractor`(...)	A SpanExtractor that yields all matches for a default list of NER tags according to spaCy
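
To show what "yields all matches for a given regular expression" amounts to, the sketch below extracts spans with Python's `re` module. The column names `context_uid`, `span_field`, `char_start`, and `char_end` are taken from the `SpansFileExtractor` summary above, and `span_text` from the `EntityDictLinker` summary; everything else is an assumption. In particular, `char_end` here follows Python's exclusive-end convention, which may differ from the SDK's, and `extract_regex_spans` is a hypothetical helper rather than an SDK function.

```python
import re


def extract_regex_spans(rows, field, pattern):
    """Return one span record per regex match found in row[field]."""
    regex = re.compile(pattern)
    spans = []
    for row in rows:
        for match in regex.finditer(row[field]):
            spans.append(
                {
                    "context_uid": row["context_uid"],
                    "span_field": field,
                    "char_start": match.start(),
                    "char_end": match.end(),  # exclusive end (assumption)
                    "span_text": match.group(0),
                }
            )
    return spans


# Hypothetical document containing two ISO-formatted dates.
rows = [{"context_uid": 0, "body": "Invoice due 2020-01-01, paid 2020-02-15."}]
spans = extract_regex_spans(rows, "body", r"\d{4}-\d{2}-\d{2}")
for span in spans:
    print(span["char_start"], span["char_end"], span["span_text"])
```

Each input row can produce zero, one, or many output records, which is the main difference between an extractor and a row-preserving preprocessor in this sketch.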

Normalizers/linkers

`operators.candidates.normalizer.DateSpanNormalizer`()	Normalizes date spans into their canonical forms, e.g. 2020-01-01.
`operators.candidates.normalizer.USCurrencySpanNormalizer`()	Normalizes US currency spans into their numerical values.
`operators.candidates.normalizer.TextCasingSpanNormalizer`()	Normalizes spans by lowercasing them, then capitalizing the first letter of each word.
`operators.candidates.normalizer.OrdinalSpanNormalizer`()	Normalizes numerical ordinals (1st, 2nd, etc) to string ordinals (first, second, etc).
`operators.candidates.normalizer.NumericalSpanNormalizer`()	Normalizes numerical cardinals (1, 2, etc) to string values (one, two, etc).
`operators.candidates.normalizer.SpanEntityNormalizer`()	Copies the linked span entity from the extractor as the normalized span.
`operators.candidates.normalizer.IdentitySpanNormalizer`()	Copies span text as is.
`operators.candidates.linker.EntityDictLinker`(...)	Maps span_text to span_entity given an entity to alias dictionary.
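
The two ideas in this table, normalizing span text and linking it to an entity, can be sketched in plain Python. The first function follows the `TextCasingSpanNormalizer` summary (lowercase, then capitalize the first letter of each word); the linker assumes a case-insensitive exact match against an entity-to-aliases dictionary, which is an assumption about matching rules. All function names and the sample dictionary are hypothetical.

```python
def text_casing_normalize(span_text):
    """Lowercase the span, then capitalize the first letter of each word."""
    return " ".join(word.capitalize() for word in span_text.lower().split(" "))


def build_alias_index(entity_to_aliases):
    """Invert an entity -> aliases dictionary into a lowercase alias -> entity lookup."""
    index = {}
    for entity, aliases in entity_to_aliases.items():
        for alias in aliases:
            index[alias.lower()] = entity
    return index


def link_span(span_text, alias_index):
    """Return the linked entity for span_text, or None when no alias matches."""
    return alias_index.get(span_text.lower())


# Hypothetical entity dictionary.
entity_to_aliases = {"Acme Corporation": ["acme", "acme corp"]}
alias_index = build_alias_index(entity_to_aliases)
print(text_casing_normalize("aCME corp"))
print(link_span("ACME Corp", alias_index))
```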

Model postprocessors

`operators.post_processors.sequence_tagging_post_processors.SpanFilterByLengthPostProcessor`(...)	Filter positive spans with length <= the provided span_length
`operators.post_processors.sequence_tagging_post_processors.SpanMergeByRegexPatternPostProcessor`(...)	Merge same-label nearby spans if the negative-label text in between spans matches this regex pattern.
`operators.post_processors.sequence_tagging_post_processors.SpanMergeByNumberCharacterPostProcessor`(...)	Merge same-label nearby spans if the negative-label text in between spans is in [lower, upper] number of characters (inclusive).
`operators.post_processors.sequence_tagging_post_processors.SpanRegexPostProcessor`(...)	Postprocessor that labels anything that matches the provided pattern with the provided label
`operators.post_processors.sequence_tagging_post_processors.SpanRemoveWhitespacePostProcessor`(field)	Remove leading & trailing whitespace in positive spans.
`operators.post_processors.sequence_tagging_post_processors.SubstringExpansionPostProcessor`(...)	This postprocessor expands predictions to the token boundaries should a prediction boundary fall mid-token.
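
The merge rule described for `SpanMergeByNumberCharacterPostProcessor` (merge same-label neighbors when the gap between them is within `[lower, upper]` characters, inclusive) can be sketched as follows. The tuple layout `(char_start, char_end, label)` with an exclusive `char_end` is an assumption made for this illustration, and `merge_spans_by_gap` is a hypothetical helper, not the SDK implementation.

```python
def merge_spans_by_gap(spans, lower, upper):
    """Merge adjacent same-label spans whose gap length lies in [lower, upper].

    spans: non-overlapping (char_start, char_end, label) tuples, char_end exclusive.
    """
    merged = []
    for start, end, label in sorted(spans):
        if merged:
            prev_start, prev_end, prev_label = merged[-1]
            gap = start - prev_end
            if prev_label == label and lower <= gap <= upper:
                # Extend the previous span to cover the gap and the current span.
                merged[-1] = (prev_start, end, label)
                continue
        merged.append((start, end, label))
    return merged


# Hypothetical predictions: the first two ORG spans are one character apart.
spans = [(0, 4, "ORG"), (5, 9, "ORG"), (20, 25, "ORG"), (26, 30, "PER")]
merged = merge_spans_by_gap(spans, lower=1, upper=2)
print(merged)
```

Spans with different labels are never merged in this sketch, even when the gap between them is small.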

Reducers

`operators.candidates.reducer.IdentityReducer`([...])	No-Op Reducer that passes all Spans through
`operators.candidates.reducer.DocumentFirstReducer`([...])	Reduces span predictions in a document to the span which occurs first.
`operators.candidates.reducer.DocumentLastReducer`([...])	Reduces span predictions in a document to the span which occurs last.
`operators.candidates.reducer.DocumentMostConfidentReducer`([...])	Reduces span predictions in a document to the span with the most confident model prediction.
`operators.candidates.reducer.DocumentMostCommonReducer`([...])	Reduces span predictions in a document to the span which occurs most frequently.
`operators.candidates.reducer.EntityMeanPredictionReducer`([...])	Reduces span predictions for a document entity by computing the mean prediction.
`operators.candidates.reducer.EntityMostCommonReducer`([...])	Reduces span predictions for a document entity to the prediction of the majority vote class.
`operators.candidates.reducer.EntityMostConfidentReducer`([...])	Reduces span predictions for a document entity to the most confident prediction.
`operators.candidates.reducer.EntityFirstReducer`([...])	Reduces span predictions for a document entity to the first occurring span of that entity.
`operators.candidates.reducer.EntityLastReducer`([...])	Reduces span predictions for a document entity to the last occurring span of that entity.
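
A reducer collapses many span-level predictions into one result per document (or per document entity). The sketch below illustrates the document-level "first" and "most confident" strategies in plain Python. It assumes each span carries a `context_uid` identifying its document; the `confidence` field and the helper `reduce_per_document` are hypothetical names chosen for this example.

```python
def reduce_per_document(spans, sort_key):
    """Keep, for each context_uid, the span with the smallest sort_key value."""
    best = {}
    for span in spans:
        doc = span["context_uid"]
        if doc not in best or sort_key(span) < sort_key(best[doc]):
            best[doc] = span
    return list(best.values())


# Hypothetical span predictions for two documents.
spans = [
    {"context_uid": 1, "char_start": 40, "span_text": "B", "confidence": 0.9},
    {"context_uid": 1, "char_start": 5, "span_text": "A", "confidence": 0.6},
    {"context_uid": 2, "char_start": 0, "span_text": "C", "confidence": 0.7},
]

# "First" strategy: earliest character offset wins.
first = reduce_per_document(spans, sort_key=lambda s: s["char_start"])
# "Most confident" strategy: highest confidence wins.
most_confident = reduce_per_document(spans, sort_key=lambda s: -s["confidence"])

print([s["span_text"] for s in first])
print([s["span_text"] for s in most_confident])
```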

Miscellaneous

`operators.rename.ColumnRenamer`(column_map)	Preprocessor that renames columns
`operators.drop.ColumnDropper`(fields)	Processor that drops given columns from the DataFrame.
`operators.concat.ConcatRows`()	Processor that concatenates dataframes along the index axis.
`operators.concat.ConcatColumns`([join_type, ...])	Processor that concatenates columns of dataframes.
`operators.change_datapoint.ChangeDatapoint`(...)	Preprocessor that changes the datapoint type/columns.
`operators.table.TableConverter`(field[, ...])	Featurizer that converts an array of dicts into a custom Table object
`operators.identity.IdentityOperator`()	No-op operator
`operators.scaler.StandardScaler`(field[, ...])	Preprocessor that scales a numerical column to mean=0 and std=1.
`operators.filler.ColumnFiller`(field, value)	Operator that replaces values in either a new or existing column in the DataFrame with the given value
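
Scaling a numerical column to mean 0 and standard deviation 1, as in the `StandardScaler` summary, follows the usual formula z = (x - mean) / std. Here is a minimal standard-library sketch. Whether the built-in operator uses the population or the sample standard deviation is not stated on this page, so the population form used below is an assumption, as is the handling of constant columns.

```python
import statistics


def standard_scale(values):
    """Scale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column has no spread; map every value to 0.0 (assumed handling).
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


# Hypothetical numeric column.
scaled = standard_scale([2.0, 4.0, 6.0])
print(scaled)
```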
