Built-in Operators.
Built-in operators are available in the full SDK and can be added to the DAG programmatically, as in the following example:
# Add the operator to the DAG
sf.add_node(
    application=APP_NAME,
    input_node_uids=[123],
    output_node_uids=[456],
    op_type="ColumnRenamer",
    op_config={"column_map": {"body": "email-body"}},
)
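To make the effect of the column_map in the example above concrete, here is a minimal plain-Python sketch of a column rename applied to rows stored as dictionaries. The helper rename_columns is hypothetical and only illustrates the mapping semantics; it is not the SDK implementation of ColumnRenamer.

```python
from typing import Any, Dict, List


def rename_columns(
    rows: List[Dict[str, Any]], column_map: Dict[str, str]
) -> List[Dict[str, Any]]:
    """Rename keys according to column_map; unmapped keys are kept as-is.

    Hypothetical helper for illustration only (not part of the SDK).
    """
    return [{column_map.get(key, key): value for key, value in row.items()} for row in rows]


rows = [{"body": "Hello", "subject": "Hi"}]
renamed = rename_columns(rows, {"body": "email-body"})
```

Here the "body" column becomes "email-body" while "subject" passes through unchanged.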
Featurizers
Text-based
operators.truncate.TruncatePreprocessor (field) | Truncates a column by a given amount. |
operators.candidates.span_preview.SpanPreviewPreprocessor ([...]) | Operator that adds a field with nearby textual features for each extracted span. |
operators.whitespace.WhitespacePreprocessor (fields) | Preprocessor that normalizes whitespace. |
operators.spacy.SpacyPreprocessor (field[, ...]) | Preprocessor that parses document and adds json doc column. |
operators.spacy.SpacyTokenizer (text_field[, ...]) | Preprocessor that parses document and adds tokens column. |
operators.spacy.NounChunkFeaturizer (field[, ...]) | A Featurizer that yields all noun phrases according to spaCy. |
operators.spacy.VerbPhraseFeaturizer (field) | A Featurizer that yields all verb phrases according to a simple part-of-speech verb match. |
operators.spacy.SentenceFeaturizer (field[, ...]) | A Featurizer that yields all sentences according to spaCy. |
operators.embedding.EmbeddingFeaturizer (field) | Featurizer that converts text to an embedding. |
operators.embedding.EmbeddingCandidateFeaturizer (...) | Featurizer that converts text to an embedding. |
operators.special_char.AsciiCharFilter (field) | Preprocessor that removes non-ascii chars from selected column in place. |
operators.special_char.LatinCharFilter (field) | Preprocessor that removes non-latin chars from selected column in place. |
operators.candidates.context.ContextAggregator () | Adds a column with aggregated spans for the current context_uid. |
operators.candidates.extractor.EmptySpanFeaturizer (field) | A SpanFeaturizer that yields a single empty span from (0,0) for each row |
operators.candidates.extractor.RegexSpanFeaturizer (...) | A SpanFeaturizer that yields all matches for a given regular expression |
operators.candidates.extractor.DateSpanFeaturizer (field) | Extracts spans (slices of documents) that contain dates (using regex) |
operators.candidates.extractor.ParagraphSpanFeaturizer (field) | Extracts spans (slices of documents) that contain paragraphs (using regex) |
operators.candidates.extractor.NumericSpanFeaturizer (field) | Extracts spans (slices of documents) that contain numeric values |
operators.candidates.extractor.EmailAddressSpanFeaturizer (field) | Extracts spans (slices of documents) that contain email addresses (using regex) |
operators.candidates.extractor.USCurrencySpanFeaturizer (field) | Extracts spans (slices of documents) that contain US currency (using regex) |
operators.candidates.extractor.HardCodedSpanFeaturizer (...) | A SpanFeaturizer that reads spans directly from its config. |
operators.candidates.extractor.SpansFileFeaturizer (path) | A SpanFeaturizer that reads spans directly from a file with the expected span columns |
operators.candidates.extractor.EntityDictSpanFeaturizer (...) | SpanFeaturizer that yields (and optionally links) spans in an entity-to-aliases dictionary |
operators.candidates.extractor.EntityDictRegexSpanFeaturizer (...) | A SpanFeaturizer that yields (and optionally links) spans in an entity-to-aliases dictionary, which supports regexes. |
operators.candidates.extractor.DocEntityDictSpanFeaturizer (...) | (Optimized for keyword aliases) SpanFeaturizer that yields (and optionally links) spans, given an entity-to-aliases dictionary and doc-id-to-entity dictionary. |
operators.candidates.extractor_spacy.TokenSpanFeaturizer (field) | A SpanFeaturizer that yields every token, given a selected tokenization strategy |
operators.candidates.extractor_spacy.NounChunkSpanFeaturizer (...) | A SpanFeaturizer that yields all noun phrases according to spaCy |
operators.candidates.extractor_spacy.TagSpanFeaturizer (...) | A SpanFeaturizer that yields all matches for a given NER tag according to spaCy |
operators.candidates.extractor_spacy.SpacyNERSpanFeaturizer (...) | A SpanFeaturizer that yields all matches for a default list of NER tags according to spaCy |
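As a rough illustration of what a regex-based span featurizer produces, the sketch below uses Python's re module to yield one span per match. The function name regex_spans and the span_text key are hypothetical, and treating char_end as an exclusive offset is an assumption; this is not the SDK implementation.

```python
import re
from typing import Dict, Iterator


def regex_spans(text: str, pattern: str) -> Iterator[Dict[str, object]]:
    """Yield one span per regex match (hypothetical helper, not the SDK API)."""
    for match in re.finditer(pattern, text):
        yield {
            "char_start": match.start(),
            "char_end": match.end(),  # assumption: exclusive end offset
            "span_text": match.group(0),
        }


# Find ISO-style dates in a short document.
spans = list(regex_spans("Invoice 42 is due on 2020-01-01.", r"\d{4}-\d{2}-\d{2}"))
```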
PDF-based
operators.candidates.rich_doc_features.RichDocRegexNGramDetector (regex) | Featurizer that detects ngrams matching a regex pattern. |
operators.candidates.rich_doc_features.RichDocRegexPageFeaturizer (...) | This operator adds a list of pages to retain based on the regex pattern provided. |
operators.candidates.rich_doc_features.RichDocSpanBaseFeaturesPreprocessor () | Operator that computes basic features for each span using the associated RichDoc object (e.g., bounding box values of the span, page numbers, etc.) |
operators.candidates.rich_doc_features.RichDocSpanRowFeaturesPreprocessor ([...]) | Operator that computes row-level features for a span (eg. |
operators.candidates.rich_doc_features.RichDocSpanStructuralPreprocessor ([...]) | Operator to compute structural Rich Doc features for span. |
operators.candidates.rich_doc_features.RichDocSpanVisualPreprocessor ([...]) | Operator to compute visual Rich Doc features for span. |
operators.candidates.rich_doc_page.RichDocPagePreprocessor () | Operator that filters out all RichDoc pages except the one containing the span to show on the frontend (for the sake of performance) |
operators.pdf.checkbox.CheckboxFeaturizer ([...]) | A featurizer that identifies checkboxes in PDF documents. |
operators.pdf.checkbox.CheckboxSpanMapper ([...]) | An operator that assigns checkbox-related features to the spans in the document. |
operators.pdf.hocr.HocrToRichDocParser (field) | Operator that parses hOCR and creates a RichDoc, a Snorkel-Flow-native representation of a pdf document with formatting preserved. |
operators.pdf.hocr.TruncateHOCR (field[, ...]) | Truncates an hOCR document to a certain number of pages. |
operators.pdf.lines.LinesFeaturizer (field[, ...]) | A featurizer that identifies horizontal and vertical lines in PDF documents. |
operators.pdf.lines.LinesPageFilterFeaturizer (...) | Operator that filters horizontal and vertical lines to subset of pages. |
operators.pdf.page_splitter.PageSplitter ([...]) | Split PDFs into pages for subsequent filtering and faster processing. |
operators.pdf.parser.PDFToRichDocParser (field) | Operator that parses a PDF into a RichDoc (see docs for details). |
operators.pdf.parser2.PDFToRichDocParser2 (field) | Operator that parses a PDF into Snorkel's RichDoc representation. |
operators.pdf.text_cluster.TextClusterer ([...]) | Operator that clusters horizontally aligned words using word spacing. |
operators.pdf.text_cluster.TextClusterSpanExtractor () | Extractor that creates one span per horizontal text cluster. |
operators.pdf.text_cluster.TextClusterSpanFeaturizer () | Featurizer that creates list of spans with one span per horizontal text cluster. |
operators.pdf.truncate_pdf.TruncatePDF (...) | Truncates a PDF to a certain number of pages. |
operators.row_filter.TableRowFilter ([...]) | A filter that filters out rows without tables. |
operators.pdf.table.TableFeaturizer ([field, ...]) | A featurizer that detects tables in PDF documents. |
operators.pdf.table.TableSpanMapper ([...]) | An operator that maps table predictions to spans. |
OCR
operators.ocr.tesseract_featurizer.TesseractFeaturizer (...) | Operator that takes in a PDF URL and outputs the hOCR text. |
operators.azure.azure_form_recognizer_parser.AzureFormRecognizerParser (...) | Takes in a PDF URL and runs Azure Form Recognizer on it. |
Filters
operators.filter.LabelIntFilter (label_ints) | A filter that includes/excludes all datapoints corresponding to specified label ints. |
operators.filter.LabelFilter (label_strs[, ...]) | A filter that includes/excludes all datapoints corresponding to specified label strings. |
operators.filter.MultiLabelFilter (label_strs) | A filter that includes/excludes all datapoints corresponding to specified label strings for multilabel classification. |
operators.row_filter.BooleanColumnBasedRowFilter (...) | Filters rows based on the specified boolean column. |
operators.row_filter.PandasQueryFilter (query) | Includes or excludes rows based on a pandas query |
operators.row_filter.TextSizeFilter (field, ...) | A filter that excludes rows with a text column larger than the specified size (in KB) |
operators.row_filter.RegexRowFilter (...[, ...]) | Filters rows based on the regex pattern provided. |
operators.candidates.filter.RegexSpanFilter (regex) | A filter that REMOVES all candidates that match a given regular expression |
operators.candidates.filter.ExtractedSpanFilter ([...]) | A filter that removes all spans with a negative prediction. |
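The row filters above share one idea: evaluate a predicate per row, then include or exclude the rows that satisfy it. The sketch below shows that idea for a regex predicate over rows stored as dictionaries. The helper regex_row_filter and its keep_matches flag are hypothetical names chosen for illustration; they do not mirror the actual RegexRowFilter signature.

```python
import re
from typing import Any, Dict, List


def regex_row_filter(
    rows: List[Dict[str, Any]], field: str, pattern: str, keep_matches: bool = True
) -> List[Dict[str, Any]]:
    """Keep rows whose field matches pattern, or drop them if keep_matches is False.

    Hypothetical helper for illustration only (not part of the SDK).
    """
    compiled = re.compile(pattern)
    return [
        row
        for row in rows
        if bool(compiled.search(str(row.get(field, "")))) == keep_matches
    ]


rows = [
    {"body": "Please reset my password"},
    {"body": "Quarterly report attached"},
]
included = regex_row_filter(rows, "body", r"(?i)password")
excluded = regex_row_filter(rows, "body", r"(?i)password", keep_matches=False)
```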
Extractors
operators.candidates.extractor.ListToRowsExploder (...) | A SpanExtractor that explodes a field containing a list of spans. |
operators.candidates.extractor.EmptySpanExtractor (field) | A SpanExtractor that yields a single empty span from (0,0) for each row |
operators.candidates.extractor.RegexSpanExtractor (...) | A SpanExtractor that yields all matches for a given regular expression |
operators.candidates.extractor.DateSpanExtractor (field) | Extracts spans (slices of documents) that contain dates (using regex) |
operators.candidates.extractor.ParagraphSpanExtractor (field) | Extracts spans (slices of documents) that contain paragraphs (using regex) |
operators.candidates.extractor.NumericSpanExtractor (field) | Extracts spans (slices of documents) that contain numbers (using regex) |
operators.candidates.extractor.EmailAddressSpanExtractor (field) | Extracts spans (slices of documents) that contain email addresses (using regex) |
operators.candidates.extractor.USCurrencySpanExtractor (field) | Extracts spans (slices of documents) that contain US currency (using regex) |
operators.candidates.extractor.HardCodedSpanExtractor (...) | A SpanExtractor that reads spans directly from its config. |
operators.candidates.extractor.SpansFileExtractor (path) | A SpanExtractor that reads spans directly from a file with columns ["char_start", "char_end", "context_uid", "span_field", "initial_label", "span_entity"] |
operators.candidates.extractor.EntityDictSpanExtractor (...) | SpanExtractor that yields (and optionally links) spans in an entity-to-aliases dictionary |
operators.candidates.extractor.EntityDictRegexSpanExtractor (...) | A SpanExtractor that yields (and optionally links) spans in an entity-to-aliases dictionary, which supports regexes. |
operators.candidates.extractor.DocEntityDictSpanExtractor (...) | (Optimized for keyword aliases) SpanExtractor that yields (and optionally links) spans, given an entity-to-aliases dictionary and doc-id-to-entity dictionary. |
operators.candidates.extractor_spacy.TokenSpanExtractor (field) | A SpanExtractor that yields every token, given a selected tokenization strategy |
operators.candidates.extractor_spacy.NounChunkSpanExtractor (field) | A SpanExtractor that yields all noun phrases according to spaCy |
operators.candidates.extractor_spacy.TagSpanExtractor (...) | A SpanExtractor that yields all matches for a given NER tag according to spaCy |
operators.candidates.extractor_spacy.SpacyNERSpanExtractor (...) |
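Unlike a featurizer, which adds a column, an extractor changes the datapoint granularity from documents to spans. A minimal sketch of the "explode" step described for ListToRowsExploder, assuming rows are plain dictionaries: each element of a list-valued field becomes its own row. The helper explode_rows and its parameter names are hypothetical, and dropping rows whose list is empty is an assumption rather than documented behavior.

```python
from typing import Any, Dict, List


def explode_rows(
    rows: List[Dict[str, Any]], list_field: str, out_field: str
) -> List[Dict[str, Any]]:
    """Emit one output row per element of row[list_field].

    Hypothetical helper for illustration only (not part of the SDK).
    Assumption: rows with an empty or missing list produce no output rows.
    """
    exploded = []
    for row in rows:
        for item in row.get(list_field, []):
            new_row = {key: value for key, value in row.items() if key != list_field}
            new_row[out_field] = item
            exploded.append(new_row)
    return exploded


docs = [
    {"context_uid": 1, "spans": [(0, 4), (10, 14)]},
    {"context_uid": 2, "spans": []},
]
span_rows = explode_rows(docs, list_field="spans", out_field="span")
```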
Normalizers/linkers
operators.candidates.normalizer.DateSpanNormalizer () | Normalizes date spans into their canonical forms, e.g. 2020-01-01. |
operators.candidates.normalizer.USCurrencySpanNormalizer () | Normalizes US currency spans into their numerical values. |
operators.candidates.normalizer.TextCasingSpanNormalizer () | Normalizes spans by lowercasing them, then capitalizing the first letter of each word. |
operators.candidates.normalizer.OrdinalSpanNormalizer () | Normalizes numerical ordinals (1st, 2nd, etc) to string ordinals (first, second, etc). |
operators.candidates.normalizer.NumericalSpanNormalizer () | Normalizes numerical cardinals (1, 2, etc) to string values (one, two, etc). |
operators.candidates.normalizer.SpanEntityNormalizer () | Copies the linked span entity from the extractor as the normalized span. |
operators.candidates.normalizer.IdentitySpanNormalizer () | Copies span text as is. |
operators.candidates.linker.EntityDictLinker (...) | Maps span_text to span_entity given an entity to alias dictionary. |
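The normalizer descriptions above translate fairly directly into small string functions. The sketch below illustrates two of them in plain Python: text casing (lowercase, then capitalize the first letter of each word) and US currency (string to numerical value). Both function names are hypothetical, and the accepted currency format is an assumption; the SDK operators may handle more cases.

```python
import re
from typing import Optional


def normalize_text_casing(span_text: str) -> str:
    """Lowercase the span, then capitalize the first letter of each word."""
    words = span_text.lower().split(" ")
    return " ".join(word[:1].upper() + word[1:] for word in words)


def normalize_us_currency(span_text: str) -> Optional[float]:
    """Convert a string such as "$1,234.50" to 1234.5.

    Assumption: only plain dollar amounts with optional commas and a decimal
    part are supported; anything else returns None.
    """
    cleaned = span_text.replace("$", "").replace(",", "").strip()
    if not re.fullmatch(r"\d+(\.\d+)?", cleaned):
        return None
    return float(cleaned)


company = normalize_text_casing("ACME holdings LLC")
amount = normalize_us_currency("$1,234.50")
```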
Model postprocessors
operators.post_processors.sequence_tagging_post_processors.SpanFilterByLengthPostProcessor (...) | Filter positive spans with length <= the provided span_length |
operators.post_processors.sequence_tagging_post_processors.SpanMergeByRegexPatternPostProcessor (...) | Merge same-label nearby spans if the negative-label text in between spans matches this regex pattern. |
operators.post_processors.sequence_tagging_post_processors.SpanMergeByNumberCharacterPostProcessor (...) | Merge same-label nearby spans if the negative-label text in between spans is in [lower, upper] number of characters (inclusive). |
operators.post_processors.sequence_tagging_post_processors.SpanRegexPostProcessor (...) | Postprocessor that labels anything that matches the provided pattern with the provided label |
operators.post_processors.sequence_tagging_post_processors.SpanRemoveWhitespacePostProcessor (field) | Remove leading & trailing whitespace in positive spans. |
operators.post_processors.sequence_tagging_post_processors.SubstringExpansionPostProcessor (...) | This postprocessor expands predictions to the token boundaries should a prediction boundary fall mid-token. |
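Two of the postprocessing rules above are easy to picture on character offsets: merging same-label spans when the gap between them has between lower and upper characters (inclusive), and trimming leading and trailing whitespace from a span. The sketch below assumes spans are (char_start, char_end, label) tuples with an exclusive end offset; the function names and this representation are hypothetical, not the SDK's.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (char_start, char_end exclusive, label)


def merge_by_gap(spans: List[Span], lower: int, upper: int) -> List[Span]:
    """Merge consecutive same-label spans whose gap length is in [lower, upper]."""
    merged: List[Span] = []
    for start, end, label in sorted(spans):
        if merged:
            prev_start, prev_end, prev_label = merged[-1]
            gap = start - prev_end
            if label == prev_label and lower <= gap <= upper:
                merged[-1] = (prev_start, end, label)
                continue
        merged.append((start, end, label))
    return merged


def trim_whitespace(text: str, span: Span) -> Span:
    """Shrink the span so it neither starts nor ends on whitespace."""
    start, end, label = span
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return (start, end, label)


text = "Contact John  Smith today"
name_spans = [(8, 12, "NAME"), (14, 19, "NAME")]  # "John" and "Smith", two spaces apart
merged = merge_by_gap(name_spans, lower=0, upper=2)
trimmed = trim_whitespace(text, (7, 20, "NAME"))
```

With upper=1 the two-character gap would fall outside the range and the spans would stay separate.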
Reducers
operators.candidates.reducer.IdentityReducer ([...]) | No-Op Reducer that passes all Spans through |
operators.candidates.reducer.DocumentFirstReducer ([...]) | Reduces span predictions in a document to the span which occurs first. |
operators.candidates.reducer.DocumentLastReducer ([...]) | Reduces span predictions in a document to the span which occurs last. |
operators.candidates.reducer.DocumentMostConfidentReducer ([...]) | Reduces span predictions in a document to the span with the most confident model prediction. |
operators.candidates.reducer.DocumentMostCommonReducer ([...]) | Reduces span predictions in a document to the span which occurs most frequently. |
operators.candidates.reducer.EntityMeanPredictionReducer ([...]) | Reduces span predictions for a document entity by computing the mean prediction. |
operators.candidates.reducer.EntityMostCommonReducer ([...]) | Reduces span predictions for a document entity to the prediction of the majority vote class. |
operators.candidates.reducer.EntityMostConfidentReducer ([...]) | Reduces span predictions for a document entity to the most confident prediction. |
operators.candidates.reducer.EntityFirstReducer ([...]) | Reduces span predictions for a document entity to the first occurring span of that entity. |
operators.candidates.reducer.EntityLastReducer ([...]) | Reduces span predictions for a document entity to the last occurring span of that entity. |
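The document-level reducers above all follow the same pattern: group span predictions by document, then pick one result per group. The sketch below shows three selection rules in plain Python. The dictionary keys span_text and confidence, and all function names, are hypothetical; only context_uid and char_start appear as column names elsewhere on this page.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, List

SpanRow = Dict[str, Any]


def group_by_document(spans: List[SpanRow]) -> Dict[Any, List[SpanRow]]:
    """Group span rows by their context_uid."""
    groups: Dict[Any, List[SpanRow]] = defaultdict(list)
    for span in spans:
        groups[span["context_uid"]].append(span)
    return groups


def document_first(spans: List[SpanRow]) -> Dict[Any, SpanRow]:
    """Per document, keep the span with the smallest char_start."""
    return {
        uid: min(group, key=lambda s: s["char_start"])
        for uid, group in group_by_document(spans).items()
    }


def document_most_confident(spans: List[SpanRow]) -> Dict[Any, SpanRow]:
    """Per document, keep the span with the highest confidence."""
    return {
        uid: max(group, key=lambda s: s["confidence"])
        for uid, group in group_by_document(spans).items()
    }


def document_most_common(spans: List[SpanRow]) -> Dict[Any, str]:
    """Per document, return the most frequent span text."""
    return {
        uid: Counter(s["span_text"] for s in group).most_common(1)[0][0]
        for uid, group in group_by_document(spans).items()
    }


predictions = [
    {"context_uid": 1, "char_start": 30, "span_text": "Acme", "confidence": 0.9},
    {"context_uid": 1, "char_start": 5, "span_text": "Beta", "confidence": 0.6},
    {"context_uid": 1, "char_start": 50, "span_text": "Acme", "confidence": 0.7},
]
first = document_first(predictions)
most_confident = document_most_confident(predictions)
most_common = document_most_common(predictions)
```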
Miscellaneous
operators.rename.ColumnRenamer (column_map) | Preprocessor that renames columns |
operators.drop.ColumnDropper (fields) | Processor that drops given columns from the DataFrame. |
operators.concat.ConcatRows () | Processor that concatenates dataframes along the index axis. |
operators.concat.ConcatColumns ([join_type, ...]) | Processor that concatenates columns of dataframes. |
operators.change_datapoint.ChangeDatapoint (...) | Preprocessor that changes the datapoint type/columns. |
operators.table.TableConverter (field[, ...]) | Featurizer that converts an array of dicts into a custom Table object |
operators.identity.IdentityOperator () | No-op operator |
operators.scaler.StandardScaler (field[, ...]) | Preprocessor that scales a numerical column to mean=0 and std=1. |
operators.filler.ColumnFiller (field, value) | Operator that fills a new or existing column in the DataFrame with the given value |
operators.trace.TraceToStepFlattener (field) | Flattens a nested JSON field that contains a hierarchical trace into individual steps. |
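For reference, scaling a numerical column to mean=0 and std=1, as described for StandardScaler above, amounts to subtracting the mean and dividing by the standard deviation. The sketch below uses only the standard library. The function name standard_scale is hypothetical, and using the population standard deviation (and returning zeros for a constant column) are assumptions, not documented SDK behavior.

```python
import statistics
from typing import List


def standard_scale(values: List[float]) -> List[float]:
    """Scale values to mean 0 and standard deviation 1.

    Assumptions: population standard deviation; a constant column maps to zeros.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


scaled = standard_scale([2.0, 4.0, 6.0])
```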