Version: 25.1

operators.candidates.rich_doc_features.RichDocSpanRowFeaturesPreprocessor

class operators.candidates.rich_doc_features.RichDocSpanRowFeaturesPreprocessor(row_id=False, row_text_before=0, row_text_inline=False, row_text_after=0, row_header=False, inferred_row_headers=False, row_header_json='{"scope": "page", "multi_row": true, "min_margin": 10, "max_gap": 20, "max_left_page_pct": 50}', mask_span=True, feature_suffix='')

Operator that computes row-level features for a span (e.g., the text of the span’s row, or the text of the rows before and after it).

This operator computes row-level features that add richer context to each span. The computed features are listed in the Returns section; a suffix can optionally be appended to each feature name.

This operator usually co-exists with RichDocSpanBaseFeaturesPreprocessor to create the RichDoc representation and features.

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| row_id | bool | False | If True, calculate the rich_doc_row_id feature. |
| row_text_before | int | 0 | If positive, include this many rows before the span’s row in rich_doc_row_text_before. |
| row_text_inline | bool | False | If True, calculate the rich_doc_row_text_inline feature. |
| row_text_after | int | 0 | If positive, include this many rows after the span’s row in rich_doc_row_text_after. |
| row_header | bool | False | If True, calculate the rich_doc_row_header feature. |
| inferred_row_headers | bool | False | If True, calculate the rich_doc_inferred_row_headers feature. |
| mask_span | bool | True | If True, replace the span content with ‘-SPAN-’ in rich_doc_row_text_inline. |
| row_header_json | str | '{"scope": "page", "multi_row": true, "min_margin": 10, "max_gap": 20, "max_left_page_pct": 50}' | JSON string containing additional settings for row header features. |
| feature_suffix | str | '' | If None, auto-generate suffixes for features based on their parameters. Otherwise, append this string to each feature name (use an empty string for no suffix). |
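Because row_header_json is passed as a JSON string rather than a dict, it can be convenient to start from the default shown in the signature and change individual settings. The sketch below uses only Python's standard json module; the helper name override_row_header_settings is hypothetical, and rejecting keys that are absent from the default is an assumption made for illustration, not documented behavior of the operator.

```python
import json

# Default value copied from the class signature above.
DEFAULT_ROW_HEADER_JSON = (
    '{"scope": "page", "multi_row": true, "min_margin": 10, '
    '"max_gap": 20, "max_left_page_pct": 50}'
)


def override_row_header_settings(base_json: str, **overrides) -> str:
    """Return a JSON string with selected settings replaced.

    Hypothetical helper: it only accepts keys already present in base_json,
    which is an assumption, since the page does not list the valid keys.
    """
    settings = json.loads(base_json)
    unknown = set(overrides) - set(settings)
    if unknown:
        raise KeyError(f"Unknown row header settings: {sorted(unknown)}")
    settings.update(overrides)
    return json.dumps(settings)


# Example: widen max_gap while keeping every other default.
custom_row_header_json = override_row_header_settings(
    DEFAULT_ROW_HEADER_JSON, max_gap=30
)
```

The resulting string could then be supplied as the row_header_json argument when row_header or inferred_row_headers is enabled.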

Returns

  • rich_doc_row_id – The (int) index of the span’s row

  • rich_doc_row_text_before – The text in the rows 1 to X before the span’s row, where X is the value of row_text_before

  • rich_doc_row_text_after – The text in the rows 1 to X after the span’s row, where X is the value of row_text_after

  • rich_doc_row_text_inline – The text from the span’s row

  • rich_doc_row_header – The text in the span’s row header

  • rich_doc_inferred_row_headers – The text in the span’s inferred row headers
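
To make the row text features listed above concrete, here is a minimal, self-contained sketch of what they could look like on a toy document. It does not call the operator and is not its implementation: the function row_features, the representation of a document as a list of row strings, joining rows with a single space, and appending the suffix directly to the feature name are all assumptions made for illustration. The header features are omitted because their logic depends on the row_header_json settings.

```python
from typing import Dict, List, Union


def row_features(
    rows: List[str],
    row_idx: int,
    span_start: int,
    span_end: int,
    row_text_before: int = 0,
    row_text_after: int = 0,
    mask_span: bool = True,
    feature_suffix: str = "",
) -> Dict[str, Union[int, str]]:
    """Hypothetical illustration of row-level features for one span.

    rows is the document as a list of row strings; the span covers
    rows[row_idx][span_start:span_end].
    """
    row = rows[row_idx]
    # With mask_span, the span's characters are replaced by the -SPAN- token.
    inline = row[:span_start] + "-SPAN-" + row[span_end:] if mask_span else row
    first_before = max(0, row_idx - row_text_before)
    features = {
        "rich_doc_row_id": row_idx,
        "rich_doc_row_text_before": " ".join(rows[first_before:row_idx]),
        "rich_doc_row_text_inline": inline,
        "rich_doc_row_text_after": " ".join(
            rows[row_idx + 1 : row_idx + 1 + row_text_after]
        ),
    }
    # Assumption: the suffix is appended to the name with no separator.
    return {name + feature_suffix: value for name, value in features.items()}


ROWS = ["Item Qty Price", "Widget 2 9.99", "Gadget 1 4.50", "Total 14.49"]

# The span is "Widget" (characters 0 to 6 of row 1).
features = row_features(
    ROWS, row_idx=1, span_start=0, span_end=6,
    row_text_before=1, row_text_after=2,
)
```

In this toy case the inline feature is "-SPAN- 2 9.99", the before feature is the header row, and the after feature concatenates the two rows that follow the span's row.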