Version: 0.91

operators.candidates.rich_doc_features.RichDocSpanVisualPreprocessor

class operators.candidates.rich_doc_features.RichDocSpanVisualPreprocessor(location='center', scope='page', threshold=50, threshold_unit='pixels', threshold_dir='left_or_right', mask_span_ngrams=True, ngram_range_min=1, ngram_range_max=2, feature_name_override=None)

Operator to compute visual Rich Doc features for a span.

Available Features (optionally with a suffix on the feature name):

rich_doc_aligned_ngrams: The ngrams in the given [scope] whose [location] is within [threshold] [threshold_unit]s of the span's [location].
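The alignment test described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual implementation: the `Box` class, the `location_value` and `aligned_words` helpers, and the coordinate convention are all hypothetical, and treating "within" as an inclusive comparison is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical word with a pixel bounding box (not part of the library)."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def location_value(box, location):
    """Return the coordinate named by `location` (assumed semantics)."""
    if location == "left":
        return box.left
    if location == "right":
        return box.right
    if location == "center":  # horizontal midpoint
        return (box.left + box.right) / 2
    if location == "top":
        return box.top
    if location == "bottom":
        return box.bottom
    if location == "middle":  # vertical midpoint
        return (box.top + box.bottom) / 2
    raise ValueError(f"unknown location: {location!r}")


def aligned_words(span, words, location="center", threshold=50):
    """Keep words whose `location` lies within `threshold` pixels of the span's.

    Assumption: the comparison is inclusive (<= threshold).
    """
    ref = location_value(span, location)
    return [
        w.text
        for w in words
        if w is not span and abs(location_value(w, location) - ref) <= threshold
    ]


span = Box("Total", left=100, top=200, right=160, bottom=220)
words = [
    Box("Invoice", left=90, top=50, right=170, bottom=70),
    Box("Date", left=400, top=50, right=450, bottom=70),
    Box("Amount", left=120, top=300, right=200, bottom=320),
]

print(aligned_words(span, words))                # center-aligned within 50 px
print(aligned_words(span, words, threshold=10))  # tighter tolerance
```

With these sample boxes, "Invoice" and "Amount" have horizontal centers within 50 pixels of the span's center, while "Date" does not; lowering the threshold to 10 keeps only "Invoice".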

Parameters

| Name | Type | Default | Info |
| --- | --- | --- | --- |
| location | str | 'center' | The location of the span and ngrams to compare (left / center / right / top / middle / bottom). |
| scope | str | 'page' | The scope to search for ngrams within (word / line / par / area / page). |
| threshold | int | 50 | The maximum threshold used when comparing two location values. |
| threshold_dir | str | 'left_or_right' | A specific direction for restricting the search for aligned ngrams (left_only, right_only, left_or_right, up_only, down_only, up_or_down). |
| mask_span_ngrams | bool | True | If True, replace the span with -SPAN- in all ngrams. |
| ngram_range_min | int | 1 | The lower bound of ngrams to include (e.g., 1 = unigrams, 2 = bigrams, etc.). |
| ngram_range_max | int | 2 | The upper bound of ngrams to include (e.g., 1 = unigrams, 2 = bigrams, etc.). |
| feature_name_override | Optional[str] | None | If not None, use this as the generated column name (instead of an auto-generated name). |
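To make the interaction of mask_span_ngrams, ngram_range_min, and ngram_range_max concrete, here is a small standalone sketch. The `masked_ngrams` function and its token-index arguments are hypothetical and only mirror the parameter names above; it assumes both bounds are inclusive and that the whole span collapses to a single -SPAN- token before ngrams are built.

```python
def masked_ngrams(tokens, span_start, span_end,
                  ngram_range_min=1, ngram_range_max=2,
                  mask_span_ngrams=True):
    """Build ngrams over `tokens`, optionally masking tokens[span_start:span_end].

    Assumptions: the ngram bounds are inclusive, and the span is replaced
    by one "-SPAN-" token before any ngram is generated.
    """
    if mask_span_ngrams:
        tokens = tokens[:span_start] + ["-SPAN-"] + tokens[span_end:]
    ngrams = []
    for n in range(ngram_range_min, ngram_range_max + 1):
        # Slide a window of size n across the (possibly masked) tokens.
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


tokens = ["Total", "due", ":", "$42"]
# The span is the last token ("$42"), i.e. tokens[3:4].
print(masked_ngrams(tokens, span_start=3, span_end=4))
```

With the defaults (unigrams and bigrams), the span text never appears in the output; every ngram that would have contained it carries -SPAN- instead.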