Version: 0.91

operators.candidates.rich_doc_features.RichDocRegexPageFeaturizer

class operators.candidates.rich_doc_features.RichDocRegexPageFeaturizer(regex_pattern, case_sensitive=False)

This operator searches the text of each page of the document for the provided regex pattern and builds a list of the pages to retain, that is, the pages where the pattern matches. The list is stored in the context_pages field.

This operator is the first step in keyword-based page filtering for native PDF extraction applications. Add this operator, then add a PDFToRichDocParser after it, and provide the context_pages field as the “Pages field” input to the PDFToRichDocParser.

Parameters:
  • regex_pattern (str) – The regular expression pattern searched for in the text of each page to decide which pages to retain

  • case_sensitive (bool, default: False) – If False, ignore case when evaluating regular expression matches
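
The page-selection behavior described above can be illustrated with a minimal, standalone sketch that uses only Python's standard re module. It is not the operator's actual implementation. The function name find_context_pages, the page_texts input, and the 0-based page indices are assumptions made for illustration only; the operator's real input format and page numbering may differ.

```python
import re
from typing import List


def find_context_pages(
    page_texts: List[str],
    regex_pattern: str,
    case_sensitive: bool = False,
) -> List[int]:
    """Return the indices of pages whose text matches regex_pattern.

    Hypothetical helper: page indices are 0-based here, which is an
    assumption and not a documented property of the operator.
    """
    # Ignore case unless case_sensitive is True, mirroring the parameter above.
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(regex_pattern, flags)
    # search() finds the pattern anywhere in the page text, not only at the start.
    return [i for i, text in enumerate(page_texts) if pattern.search(text)]


# Made-up page texts standing in for the per-page text of one document.
pages = [
    "Cover page",
    "Total amount due: $1,200",
    "Terms and conditions",
    "TOTAL AMOUNT payable on receipt",
]

context_pages = find_context_pages(pages, r"total\s+amount")
print(context_pages)  # [1, 3]
```

With case_sensitive=True and the pattern r"Total\s+amount", only the second page (index 1) in this sample would be retained, because the all-caps occurrence on the last page no longer matches.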