operators.candidates.rich_doc_features.RichDocSpanVisualPreprocessor
- class operators.candidates.rich_doc_features.RichDocSpanVisualPreprocessor(location='center', scope='page', threshold=50, threshold_unit='pixels', threshold_dir='left_or_right', mask_span_ngrams=True, ngram_range_min=1, ngram_range_max=2, feature_name_override=None)
Operator to compute visual Rich Doc features for a span.
Available features (optionally with a suffix on the feature name):
- rich_doc_aligned_ngrams: The ngrams in the given [scope] whose [location] is within [threshold] [threshold_unit]s
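To make the alignment idea concrete, here is a minimal sketch in plain Python. It is not the operator's implementation: the `Box` type and the `coordinate` and `aligned_words` helpers are hypothetical names, and it assumes that 'center' means the horizontal midpoint, 'middle' the vertical midpoint, and that "within" is an inclusive comparison in pixels.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical bounding box in pixels (not a library type)."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def coordinate(box: Box, location: str) -> float:
    """Return the single coordinate of `box` selected by `location`."""
    if location == "left":
        return box.left
    if location == "right":
        return box.right
    if location == "center":  # assumed: horizontal midpoint
        return (box.left + box.right) / 2
    if location == "top":
        return box.top
    if location == "bottom":
        return box.bottom
    if location == "middle":  # assumed: vertical midpoint
        return (box.top + box.bottom) / 2
    raise ValueError(f"unknown location: {location!r}")


def aligned_words(span: Box, words: List[Box],
                  location: str = "center", threshold: float = 50) -> List[str]:
    """Keep words whose `location` coordinate is within `threshold` of the span's."""
    target = coordinate(span, location)
    return [w.text for w in words
            if abs(coordinate(w, location) - target) <= threshold]


span = Box("Total", 100, 200, 160, 220)
words = [
    Box("Amount", 90, 100, 170, 120),   # center x = 130
    Box("Date", 400, 100, 440, 120),    # center x = 420
    Box("Due", 120, 300, 200, 320),     # center x = 160
]
result = aligned_words(span, words, location="center", threshold=50)
```

In this sketch, "Amount" and "Due" count as aligned with the span because their horizontal midpoints fall within 50 pixels of the span's midpoint, while "Date" does not.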
- Parameters:
  - location (str, default: 'center') – The location of the span and ngrams to compare (left / center / right / top / middle / bottom).
  - scope (str, default: 'page') – The scope to search for ngrams within (word / line / par / area / page).
  - threshold (int, default: 50) – The maximum threshold used when comparing two location values.
  - threshold_dir (str, default: 'left_or_right') – A specific direction for restricting the search for aligned ngrams (left_only, right_only, left_or_right, up_only, down_only, up_or_down).
  - mask_span_ngrams (bool, default: True) – If True, replace the span with -SPAN- in all ngrams.
  - ngram_range_min (int, default: 1) – The lower bound of ngrams to include (e.g., 1 = unigrams, 2 = bigrams, etc.).
  - ngram_range_max (int, default: 2) – The upper bound of ngrams to include (e.g., 1 = unigrams, 2 = bigrams, etc.).
  - feature_name_override (Optional[str], default: None) – If not None, use this as the generated column name (instead of an auto-generated name).
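The interaction between mask_span_ngrams, ngram_range_min, and ngram_range_max can be illustrated with a short plain-Python sketch. This is an illustration only, not the operator's code: `masked_ngrams` is a hypothetical helper, and it assumes that the upper bound is inclusive, that the span is given as a token index range, and that a multi-token span collapses to a single -SPAN- token.

```python
from typing import List


def masked_ngrams(tokens: List[str], span_start: int, span_end: int,
                  ngram_range_min: int = 1, ngram_range_max: int = 2,
                  mask_span_ngrams: bool = True) -> List[str]:
    """Build ngrams of each size from min to max (inclusive, by assumption).

    `span_start` and `span_end` are token indices (end exclusive) marking the span.
    """
    if mask_span_ngrams:
        # Assumed behavior: the whole span becomes one -SPAN- placeholder token.
        tokens = tokens[:span_start] + ["-SPAN-"] + tokens[span_end:]
    ngrams = []
    for n in range(ngram_range_min, ngram_range_max + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


tokens = ["Invoice", "total", ":", "42.00"]
ngrams = masked_ngrams(tokens, span_start=3, span_end=4)
```

With the defaults, the sketch yields four unigrams followed by three bigrams, and the span value "42.00" appears only as -SPAN-.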