Version: 25.1

operators.candidates.rich_doc_features.RichDocSpanRowFeaturesPreprocessor

class operators.candidates.rich_doc_features.RichDocSpanRowFeaturesPreprocessor(row_id=False, row_text_before=0, row_text_inline=False, row_text_after=0, row_header=False, inferred_row_headers=False, row_header_json='{"scope": "page", "multi_row": true, "min_margin": 10, "max_gap": 20, "max_left_page_pct": 50}', mask_span=True, feature_suffix='')

Operator that computes row-level features for a span (e.g., text from the span’s row, or text before and after the span’s row).

This operator computes row-level features that give richer information about each span. The computed features are listed in the Returns section; each feature name may optionally carry a suffix.

This operator is usually used alongside RichDocSpanBaseFeaturesPreprocessor to create the RichDoc representation and features.

Parameters:
  • row_id (bool) – If True, calculate the rich_doc_row_id feature

  • row_text_before (int) – If positive, include this many rows before the span’s row in rich_doc_row_text_before

  • row_text_inline (bool) – If True, calculate the rich_doc_row_text_inline feature

  • row_text_after (int) – If positive, include this many rows after the span’s row in rich_doc_row_text_after

  • row_header (bool) – If True, calculate the rich_doc_row_header feature

  • inferred_row_headers (bool) – If True, calculate the rich_doc_inferred_row_headers feature

  • mask_span (bool) – If True, replace the span content with ‘-SPAN-’ in rich_doc_row_text_inline

  • row_header_json (str) – JSON string containing additional settings for row header features.

  • feature_suffix (str) – If None, auto-generate suffixes for features based on their parameters. Otherwise, append this string to each feature name (use an empty string for no suffix)
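
Because row_header_json is passed as a JSON string rather than a dict, it can be convenient to build it with the standard json module. The following is a minimal sketch that uses only the keys from the default value in the class signature. The meaning of each key is not described in this reference, so the overridden value is purely illustrative, and the `preprocessor_kwargs` name is a hypothetical helper, not part of the library.

```python
import json

# Default value of row_header_json, copied from the class signature.
DEFAULT_ROW_HEADER_JSON = (
    '{"scope": "page", "multi_row": true, "min_margin": 10, '
    '"max_gap": 20, "max_left_page_pct": 50}'
)

# Parse the default into a dict so a single setting can be overridden.
row_header_settings = json.loads(DEFAULT_ROW_HEADER_JSON)
row_header_settings["max_gap"] = 30  # illustrative value only

# Hypothetical keyword arguments, using only parameter names from the signature.
preprocessor_kwargs = {
    "row_id": True,
    "row_text_before": 2,
    "row_text_inline": True,
    "row_text_after": 2,
    "row_header": True,
    "row_header_json": json.dumps(row_header_settings),  # must be a str
    "mask_span": True,
    "feature_suffix": "",
}
```

These keyword arguments could then be passed to RichDocSpanRowFeaturesPreprocessor; serializing with json.dumps keeps the string valid JSON (for example, Python's True becomes true).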

Returns:

  • rich_doc_row_id – The (int) index of the span’s row

  • rich_doc_row_text_before – The text in the 1 to X rows before the span’s row, where X is row_text_before

  • rich_doc_row_text_after – The text in the 1 to X rows after the span’s row, where X is row_text_after

  • rich_doc_row_text_inline – The text from the span’s row

  • rich_doc_row_header – The text in the span’s row header

  • rich_doc_inferred_row_headers – The text in the span’s inferred row headers
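
To make the row-text features concrete, here is a toy sketch of how they relate to each other. It is not the operator's implementation: the function `sketch_row_features`, the representation of a document as a list of row strings, and the choice to join rows with a space are all assumptions made for illustration.

```python
from typing import Dict, List


def sketch_row_features(
    rows: List[str],
    span_row: int,
    span_start: int,
    span_end: int,
    before: int = 0,
    after: int = 0,
    mask_span: bool = True,
) -> Dict[str, object]:
    """Hypothetical illustration of the row-level features for one span."""
    features: Dict[str, object] = {"rich_doc_row_id": span_row}

    # Inline text: the span's own row, optionally with the span masked out.
    row_text = rows[span_row]
    if mask_span:
        row_text = row_text[:span_start] + "-SPAN-" + row_text[span_end:]
    features["rich_doc_row_text_inline"] = row_text

    # Up to `before` rows preceding the span's row (fewer near the top).
    if before > 0:
        start = max(0, span_row - before)
        features["rich_doc_row_text_before"] = " ".join(rows[start:span_row])

    # Up to `after` rows following the span's row (fewer near the bottom).
    if after > 0:
        stop = span_row + 1 + after
        features["rich_doc_row_text_after"] = " ".join(rows[span_row + 1:stop])

    return features


rows = ["Invoice 1042", "Item Qty Price", "Widget 3 9.99", "Total 29.97"]
# The span covers "Widget" (characters 0 to 6) in the third row.
features = sketch_row_features(
    rows, span_row=2, span_start=0, span_end=6, before=1, after=1
)
```

With mask_span enabled, the inline text keeps the rest of the row but hides the span itself, so the row context can be used as a feature without repeating the span's own content.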