Version: 0.91

operators.candidates.rich_doc_features.RichDocSpanVisualPreprocessor

class operators.candidates.rich_doc_features.RichDocSpanVisualPreprocessor(location='center', scope='page', threshold=50, threshold_unit='pixels', threshold_dir='left_or_right', mask_span_ngrams=True, ngram_range_min=1, ngram_range_max=2, feature_name_override=None)

Operator to compute visual Rich Doc features for a span.

Available features (optionally with a suffix on the feature name):

Note: rich_doc_aligned_ngrams: The ngrams in the given [scope] whose [location] is within [threshold] [threshold_unit]s of the span's [location], searched in the direction given by [threshold_dir]
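
The alignment test described above can be sketched in plain Python. This is only an illustration of the idea, not the operator's implementation: the `Box` class, `location_value`, and `is_aligned` are hypothetical names, and the sketch assumes that "within threshold" means the absolute difference between the span's and the ngram's coordinate at the chosen location, measured in pixels. It does not model `scope` or `threshold_dir`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box in pixel coordinates (not part of the library)."""
    left: float
    top: float
    right: float
    bottom: float


def location_value(box: Box, location: str) -> float:
    """Return the single coordinate that a `location` name refers to."""
    if location == "left":
        return box.left
    if location == "center":
        return (box.left + box.right) / 2
    if location == "right":
        return box.right
    if location == "top":
        return box.top
    if location == "middle":
        return (box.top + box.bottom) / 2
    if location == "bottom":
        return box.bottom
    raise ValueError(f"unknown location: {location!r}")


def is_aligned(span_box: Box, ngram_box: Box,
               location: str = "center", threshold: float = 50) -> bool:
    """Assumed rule: coordinates at `location` differ by at most `threshold`."""
    diff = location_value(span_box, location) - location_value(ngram_box, location)
    return abs(diff) <= threshold


# Toy page: one span and two candidate ngrams with made-up coordinates.
span = Box(left=100, top=200, right=180, bottom=220)
ngrams = {
    "Invoice": Box(left=110, top=100, right=190, bottom=120),
    "Total": Box(left=400, top=300, right=460, bottom=320),
}
aligned = [text for text, box in ngrams.items() if is_aligned(span, box)]
print(aligned)
```

With the defaults (`location='center'`, `threshold=50`), only ngrams whose horizontal center lies close to the span's horizontal center are kept in this sketch.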

Parameters:
  • location (str, default: 'center') – The location of the span and ngrams to compare (left / center / right / top / middle / bottom)

  • scope (str, default: 'page') – The scope to search for ngrams within (word / line / par / area / page)

  • threshold (int, default: 50) – The maximum allowed difference between two location values for them to be considered aligned

  • threshold_unit (str, default: 'pixels') – The unit in which threshold is expressed

  • threshold_dir (str, default: 'left_or_right') – A specific direction for restricting the search for aligned ngrams (left_only, right_only, left_or_right, up_only, down_only, up_or_down)

  • mask_span_ngrams (bool, default: True) – If True, replace the span with -SPAN- in all ngrams

  • ngram_range_min (int, default: 1) – The lower bound of ngrams to include (e.g., 1 = unigrams, 2 = bigrams, etc.)

  • ngram_range_max (int, default: 2) – The upper bound of ngrams to include (e.g., 1 = unigrams, 2 = bigrams, etc.)

  • feature_name_override (Optional[str], default: None) – If not None, use this as the generated column name (instead of an auto-generated name).
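
The interaction of `mask_span_ngrams`, `ngram_range_min`, and `ngram_range_max` can be illustrated with a small standalone sketch. The function `span_ngrams` below is hypothetical and is not the operator's code; it assumes whitespace-joined tokens and that the whole span collapses to a single `-SPAN-` token when masking is on.

```python
from typing import List


def span_ngrams(tokens: List[str], span_start: int, span_end: int,
                ngram_range_min: int = 1, ngram_range_max: int = 2,
                mask_span_ngrams: bool = True) -> List[str]:
    """Build ngrams of sizes ngram_range_min..ngram_range_max (inclusive).

    tokens[span_start:span_end] is the span. When mask_span_ngrams is True,
    the span is replaced by one -SPAN- token before ngrams are generated
    (an assumption made for this illustration).
    """
    if mask_span_ngrams:
        tokens = tokens[:span_start] + ["-SPAN-"] + tokens[span_end:]
    ngrams = []
    for n in range(ngram_range_min, ngram_range_max + 1):
        # Slide a window of n tokens across the (possibly masked) sequence.
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


tokens = ["Amount", "due", "$42"]
masked = span_ngrams(tokens, span_start=2, span_end=3)
unmasked = span_ngrams(tokens, span_start=2, span_end=3, mask_span_ngrams=False)
bigrams_only = span_ngrams(tokens, 2, 3, ngram_range_min=2, ngram_range_max=2)
print(masked)
print(unmasked)
print(bigrams_only)
```

Masking keeps the generated ngrams independent of the span's literal text, so the same ngram (for example `due -SPAN-`) can recur across documents with different span values.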