operators.candidates.rich_doc_features.RichDocRegexPageFeaturizer
- class operators.candidates.rich_doc_features.RichDocRegexPageFeaturizer(regex_pattern, case_sensitive=False)
This operator searches the text of each page of the document for the provided regex pattern and, based on the result, adds a list of pages to retain. The list is stored in the context_pages field.
This operator is the first step in filtering out pages based on keywords in native PDF extraction applications. Add this operator, then follow it with a PDFToRichDocParser, and provide the context_pages field as the “Pages field” input to the PDFToRichDocParser.
Parameters
| Name | Type | Default | Info |
| --- | --- | --- | --- |
| regex_pattern | str | | The regular expression pattern we use to filter rows. |
| case_sensitive | bool | False | If False, ignore case when considering regular expression matches (defaults to False). |
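
The page selection described above can be illustrated with a minimal sketch in plain Python. This is not the operator's actual implementation: the helper name pages_to_retain, the use of a list of page strings as input, the 0-based page indices, and the assumption that a page is retained when the pattern matches anywhere in its text are all hypothetical choices made for illustration.

```python
import re
from typing import List


def pages_to_retain(
    page_texts: List[str],
    regex_pattern: str,
    case_sensitive: bool = False,
) -> List[int]:
    """Return the (0-based, assumed) indices of pages whose text matches the pattern.

    Hypothetical helper: it mirrors the documented parameters but is not
    part of the library.
    """
    # Mirror the documented case_sensitive flag: ignore case unless it is True.
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(regex_pattern, flags)
    # Search (not full-match) the pattern over the text of each page.
    return [i for i, text in enumerate(page_texts) if pattern.search(text)]


pages = [
    "Cover page",
    "Total INVOICE amount: $120",
    "Appendix: invoice terms",
]

print(pages_to_retain(pages, r"invoice"))                       # [1, 2]
print(pages_to_retain(pages, r"invoice", case_sensitive=True))  # [2]
```

With the default case_sensitive=False, both the upper-case and lower-case occurrences match; setting it to True keeps only the page containing the exact-case text.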