operators.candidates.rich_doc_features.RichDocSpanVisualPreprocessor
- class operators.candidates.rich_doc_features.RichDocSpanVisualPreprocessor(location='center', scope='page', threshold=50, threshold_unit='pixels', threshold_dir='left_or_right', mask_span_ngrams=True, ngram_range_min=1, ngram_range_max=2, feature_name_override=None)
Operator that computes visual Rich Doc features for a span. Available features (optionally with a suffix on the feature name):
- rich_doc_aligned_ngrams: The ngrams in the given [scope] whose [location] is within [threshold] [threshold_unit]s of the span's [location].
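To illustrate what "aligned" could mean here, the following is a minimal sketch, not the operator's actual implementation. It assumes that each word carries a pixel bounding box, that [location] selects one coordinate of that box, and that "within [threshold]" means an absolute difference no larger than the threshold. The `Box` class and the `location_value` and `aligned_words` helpers are hypothetical names introduced only for this example, and `threshold_dir` is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Hypothetical word record: text plus a pixel bounding box."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def location_value(box: Box, location: str) -> float:
    """Return the single coordinate that the given location refers to."""
    if location == "left":
        return box.left
    if location == "center":
        return (box.left + box.right) / 2
    if location == "right":
        return box.right
    if location == "top":
        return box.top
    if location == "middle":
        return (box.top + box.bottom) / 2
    if location == "bottom":
        return box.bottom
    raise ValueError(f"unknown location: {location!r}")


def aligned_words(span: Box, words: List[Box],
                  location: str = "center", threshold: float = 50) -> List[str]:
    """Keep words whose location value is within `threshold` pixels of the span's."""
    target = location_value(span, location)
    return [w.text for w in words
            if abs(location_value(w, location) - target) <= threshold]


# Assumed toy layout: the span sits in the same column as "Total" and "USD".
span = Box("1,250.00", 400, 300, 480, 320)
words = [
    Box("Total", 410, 100, 470, 120),    # horizontal center 440
    Box("Invoice", 50, 100, 130, 120),   # horizontal center 90
    Box("USD", 430, 340, 460, 360),      # horizontal center 445
]
print(aligned_words(span, words))  # words sharing the span's column
```

With `location='center'` the comparison uses horizontal centers, so words stacked in the same column as the span are kept regardless of their vertical distance.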
Parameters
| Name | Type | Default | Info |
| --- | --- | --- | --- |
| location | str | 'center' | The location of the span and ngrams to compare (left / center / right / top / middle / bottom). |
| scope | str | 'page' | The scope to search for ngrams within (word / line / par / area / page). |
| threshold | int | 50 | The maximum threshold used when comparing two location values. |
| threshold_dir | str | 'left_or_right' | A specific direction for restricting the search for aligned ngrams (left_only, right_only, left_or_right, up_only, down_only, up_or_down). |
| mask_span_ngrams | bool | True | If True, replace the span with -SPAN- in all ngrams. |
| ngram_range_min | int | 1 | The lower bound of ngrams to include (e.g., 1 = unigrams, 2 = bigrams, etc.). |
| ngram_range_max | int | 2 | The upper bound of ngrams to include (e.g., 1 = unigrams, 2 = bigrams, etc.). |
| feature_name_override | Optional[str] | None | If not None, use this as the generated column name (instead of an auto-generated name). |
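The interaction between `mask_span_ngrams`, `ngram_range_min`, and `ngram_range_max` can be sketched as follows. This is an illustrative approximation under stated assumptions, not the operator's code: it assumes whitespace-tokenized text, treats both range bounds as inclusive, and collapses each occurrence of the span into a single `-SPAN-` token. The `mask_span` and `build_ngrams` helpers are hypothetical names used only here.

```python
from typing import List

SPAN_MASK = "-SPAN-"


def mask_span(tokens: List[str], span_tokens: List[str]) -> List[str]:
    """Replace every occurrence of the span's token sequence with one mask token."""
    if not span_tokens:
        return list(tokens)
    out: List[str] = []
    n = len(span_tokens)
    i = 0
    while i < len(tokens):
        if tokens[i:i + n] == span_tokens:
            out.append(SPAN_MASK)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


def build_ngrams(tokens: List[str], span_tokens: List[str],
                 mask_span_ngrams: bool = True,
                 ngram_range_min: int = 1,
                 ngram_range_max: int = 2) -> List[str]:
    """Build all ngrams with lengths from ngram_range_min to ngram_range_max."""
    if mask_span_ngrams:
        tokens = mask_span(tokens, span_tokens)
    ngrams: List[str] = []
    for n in range(ngram_range_min, ngram_range_max + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


tokens = "Total due 1,250.00 USD".split()
example = build_ngrams(tokens, ["1,250.00"])
print(example)
```

Masking keeps the generated ngrams independent of the span's literal value, so a bigram such as "due -SPAN-" describes the context rather than a specific amount.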