Version: 0.94

operators.candidates.extractor.ParagraphSpanExtractor

class operators.candidates.extractor.ParagraphSpanExtractor(field, col_suffix=None)

Extracts paragraph spans (slices of a document) using a regex.

This operator uses a regex pattern to extract every paragraph in the parent document as a span. Each span keeps its paragraph's trailing newline characters.

Parameters:
  • field (str) – The dataframe column to extract paragraph spans from

  • col_suffix (Optional[str], default: None) – An optional suffix for the column containing the extracted spans
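
The behavior described above can be illustrated with a minimal standalone sketch. It is not the operator's implementation: the regex, the helper name extract_paragraph_spans, and the (char_start, char_end, text) tuple layout are all assumptions made for illustration, since this page does not state the actual pattern or output schema. It only shows what "paragraphs as spans, with trailing newlines preserved" could look like on a raw string.

```python
import re
from typing import List, Tuple

# Hypothetical pattern (an assumption, not the operator's real regex):
# a paragraph is a run of non-empty lines, followed by any newline
# characters that trail it. Those newlines stay inside the span.
PARAGRAPH_PATTERN = re.compile(r"[^\n]+(?:\n[^\n]+)*\n*")


def extract_paragraph_spans(text: str) -> List[Tuple[int, int, str]]:
    """Return (char_start, char_end, span_text) for each paragraph.

    char_end is exclusive, so text[char_start:char_end] == span_text.
    """
    return [(m.start(), m.end(), m.group()) for m in PARAGRAPH_PATTERN.finditer(text)]


doc = "First paragraph.\nStill first.\n\nSecond paragraph.\n"
spans = extract_paragraph_spans(doc)

for start, end, span_text in spans:
    # repr() makes the preserved trailing newlines visible
    print(start, end, repr(span_text))
```

Because each span keeps its trailing newlines in this sketch, concatenating the spans reproduces the original document whenever the document does not begin with blank lines, and the character offsets can be used to slice the span back out of the parent text.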