operators.candidates.rich_doc_features.RichDocSpanRowFeaturesPreprocessor
- class operators.candidates.rich_doc_features.RichDocSpanRowFeaturesPreprocessor(row_id=False, row_text_before=0, row_text_inline=False, row_text_after=0, row_header=False, inferred_row_headers=False, row_header_json='{"scope": "page", "multi_row": true, "min_margin": 10, "max_gap": 20, "max_left_page_pct": 50}', mask_span=True, feature_suffix='')
Operator that computes row-level features for a span (e.g. text from the span’s row, or text from the rows before and after it).
This operator computes row-level features that give each span richer context. The computed features are listed in the Returns section; each feature name can optionally carry a suffix.
This operator usually co-exists with RichDocSpanBaseFeaturesPreprocessor to create the RichDoc representation and features.
- Parameters:
row_id (bool) – If True, calculate the rich_doc_row_id feature
row_text_before (int) – If positive, include this many rows before the span’s row in rich_doc_row_text_before
row_text_inline (bool) – If True, calculate the rich_doc_row_text_inline feature
row_text_after (int) – If positive, include this many rows after the span’s row in rich_doc_row_text_after
row_header (bool) – If True, calculate the rich_doc_row_header feature
inferred_row_headers (bool) – If True, calculate the rich_doc_inferred_row_headers feature
mask_span (bool) – If True, replace the span content with ‘-SPAN-’ in rich_doc_row_text_inline
row_header_json (str) – JSON string containing additional settings for row header features.
feature_suffix (str) – If None, auto-generate suffixes for features based on their parameters. Otherwise, append this string to each feature name (use an empty string for no suffix)
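Because row_header_json is passed as a JSON string rather than a dict, it can be convenient to build it from the default shown in the signature. The following is a minimal sketch using only the standard library; the helper name build_row_header_json is hypothetical, and rejecting keys that are not in the default is an assumption of this sketch, not documented behavior of the operator.

```python
import json

# Default value copied verbatim from the class signature above.
DEFAULT_ROW_HEADER_JSON = (
    '{"scope": "page", "multi_row": true, "min_margin": 10, '
    '"max_gap": 20, "max_left_page_pct": 50}'
)


def build_row_header_json(**overrides):
    """Hypothetical helper: return the default settings with overrides applied,
    serialized back to a JSON string."""
    settings = json.loads(DEFAULT_ROW_HEADER_JSON)
    # Assumption: only keys present in the default are valid settings.
    unknown = set(overrides) - set(settings)
    if unknown:
        raise KeyError(f"unknown row header settings: {sorted(unknown)}")
    settings.update(overrides)
    return json.dumps(settings)


# Override a single setting and keep the remaining defaults.
row_header_json = build_row_header_json(max_gap=30)
settings = json.loads(row_header_json)
```

The resulting string can then be passed as the row_header_json argument. Round-tripping through json.loads and json.dumps also catches malformed JSON before it reaches the operator.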
- Returns:
rich_doc_row_id – The (int) index of the span’s row
rich_doc_row_text_before – The text in the row_text_before rows immediately before the span’s row
rich_doc_row_text_after – The text in the row_text_after rows immediately after the span’s row
rich_doc_row_text_inline – The text from the span’s row
rich_doc_row_header – The text in the span’s row header
rich_doc_inferred_row_headers – The text in the span’s inferred row headers
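To make the text features above concrete, here is a small illustrative sketch of what they could look like for a toy document. It is not the operator's implementation: the function sketch_row_features, the representation of a document as a list of row strings, the 0-based row index, and joining rows with a space are all assumptions made for illustration. Row header features are omitted because their computation is not described here.

```python
def sketch_row_features(rows, row_idx, start, end, *, row_text_before=0,
                        row_text_after=0, mask_span=True, feature_suffix=""):
    """Illustrative only: compute row-level text features for a span located
    at rows[row_idx][start:end]."""
    inline = rows[row_idx]
    if mask_span:
        # Replace the span content with the -SPAN- placeholder.
        inline = inline[:start] + "-SPAN-" + inline[end:]

    features = {
        "rich_doc_row_id": row_idx,
        "rich_doc_row_text_inline": inline,
    }
    if row_text_before > 0:
        first = max(0, row_idx - row_text_before)
        features["rich_doc_row_text_before"] = " ".join(rows[first:row_idx])
    if row_text_after > 0:
        last = row_idx + 1 + row_text_after
        features["rich_doc_row_text_after"] = " ".join(rows[row_idx + 1:last])

    # Append the suffix to every feature name (empty string means no suffix).
    return {name + feature_suffix: value for name, value in features.items()}


rows = ["Invoice 1042", "Date: 2021-03-04", "Total due: $310.00", "Thank you"]
span_text = "$310.00"
start = rows[2].index(span_text)

features = sketch_row_features(
    rows, 2, start, start + len(span_text),
    row_text_before=1, row_text_after=1,
)
# features["rich_doc_row_text_inline"] is "Total due: -SPAN-"
# features["rich_doc_row_text_before"] is "Date: 2021-03-04"
# features["rich_doc_row_text_after"] is "Thank you"
```

Passing a non-empty feature_suffix, for example "_ctx", would rename every key in the same way, such as rich_doc_row_id_ctx.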