operators.candidates.rich_doc_features.RichDocSpanRowFeaturesPreprocessor
- class operators.candidates.rich_doc_features.RichDocSpanRowFeaturesPreprocessor(row_id=False, row_text_before=0, row_text_inline=False, row_text_after=0, row_header=False, inferred_row_headers=False, row_header_json='{"scope": "page", "multi_row": true, "min_margin": 10, "max_gap": 20, "max_left_page_pct": 50}', mask_span=True, feature_suffix='')
Operator that computes row-level features for a span (e.g., the text of the span’s row and the text of the rows before and after it).
These features give each span richer context. The computed features are listed in the Returns section; each feature name can optionally carry a suffix.
This operator is usually used together with RichDocSpanBaseFeaturesPreprocessor to create the RichDoc representation and features.
Parameters
Returns
rich_doc_row_id – The (int) index of the span’s row
rich_doc_row_text_before – The text in the X rows before the span’s row, where X is the value of row_text_before
rich_doc_row_text_after – The text in the X rows after the span’s row, where X is the value of row_text_after
rich_doc_row_text_inline – The text from the span’s row
rich_doc_row_header – The text in the span’s row header
rich_doc_inferred_row_headers – The text in the span’s inferred row headers
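To make the row-text features above concrete, here is a minimal sketch in plain Python. It is not the operator's implementation: the helper `sketch_row_features`, its `rows` and `span_text` arguments, the space used to join rows, and the choice to mask only the first occurrence of the span are all assumptions made for illustration. Only the feature names, the parameter names, and the ‘-SPAN-’ placeholder come from this page.

```python
from typing import Dict, List, Union


def sketch_row_features(
    rows: List[str],
    row_idx: int,
    span_text: str,
    row_text_before: int = 0,
    row_text_after: int = 0,
    mask_span: bool = True,
    feature_suffix: str = "",
) -> Dict[str, Union[int, str]]:
    """Toy illustration of the row features; hypothetical, not the real operator."""
    inline = rows[row_idx]
    if mask_span:
        # Assumption: only the first occurrence of the span text is masked.
        inline = inline.replace(span_text, "-SPAN-", 1)

    features: Dict[str, Union[int, str]] = {
        "rich_doc_row_id": row_idx,
        "rich_doc_row_text_inline": inline,
    }
    if row_text_before > 0:
        # Up to `row_text_before` rows immediately preceding the span's row.
        start = max(0, row_idx - row_text_before)
        features["rich_doc_row_text_before"] = " ".join(rows[start:row_idx])
    if row_text_after > 0:
        # Up to `row_text_after` rows immediately following the span's row.
        end = row_idx + 1 + row_text_after
        features["rich_doc_row_text_after"] = " ".join(rows[row_idx + 1:end])

    # Append the suffix to every feature name (empty string means no suffix).
    return {name + feature_suffix: value for name, value in features.items()}


rows = ["Invoice 1042", "Date: 2021-03-04", "Total due: $310.00", "Thank you"]
feats = sketch_row_features(
    rows, row_idx=2, span_text="$310.00", row_text_before=2, row_text_after=1
)
for name, value in feats.items():
    print(f"{name}: {value!r}")
```

With these toy rows, the inline feature becomes 'Total due: -SPAN-', the before feature joins the two preceding rows, and the after feature holds the single following row. The row header features are left out because this page does not describe how headers are located.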
row_id (bool, default False) – If True, calculate the rich_doc_row_id feature.
row_text_before (int, default 0) – If positive, include this many rows before the span’s row in rich_doc_row_text_before.
row_text_inline (bool, default False) – If True, calculate the rich_doc_row_text_inline feature.
row_text_after (int, default 0) – If positive, include this many rows after the span’s row in rich_doc_row_text_after.
row_header (bool, default False) – If True, calculate the rich_doc_row_header feature.
inferred_row_headers (bool, default False) – If True, calculate the rich_doc_inferred_row_headers feature.
mask_span (bool, default True) – If True, replace the span content with ‘-SPAN-’ in rich_doc_row_text_inline.
row_header_json (str, default '{"scope": "page", "multi_row": true, "min_margin": 10, "max_gap": 20, "max_left_page_pct": 50}') – JSON string containing additional settings for row header features.
feature_suffix (str, default '') – If None, auto-generate suffixes for features based on their parameters. Otherwise, append this string to each feature name (use an empty string for no suffix).
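Because row_header_json is passed as a JSON string rather than a dict, it can be convenient to build it with the standard json module. The sketch below assembles a set of keyword arguments that match the signature above. The override of max_gap to 30 is a hypothetical value chosen only to show the round trip, and the final constructor call is left as a comment because it requires the library to be installed.

```python
import json

# Default settings string, copied from the class signature above.
DEFAULT_ROW_HEADER_JSON = (
    '{"scope": "page", "multi_row": true, "min_margin": 10, '
    '"max_gap": 20, "max_left_page_pct": 50}'
)

# Parse the defaults, change one setting, and serialize back to a string.
settings = json.loads(DEFAULT_ROW_HEADER_JSON)
settings["max_gap"] = 30  # hypothetical override, for illustration only

operator_kwargs = dict(
    row_id=True,
    row_text_before=2,
    row_text_inline=True,
    row_text_after=1,
    row_header=True,
    row_header_json=json.dumps(settings),
    mask_span=True,
    feature_suffix="",  # empty string: no suffix on feature names
)

# With the library available, the arguments would be passed as:
# RichDocSpanRowFeaturesPreprocessor(**operator_kwargs)
print(operator_kwargs["row_header_json"])
```

Serializing with json.dumps avoids hand-written quoting mistakes, such as writing Python's True instead of JSON's true.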