Version: 0.94

operators.pdf.text_cluster.TextClusterer

class operators.pdf.text_cluster.TextClusterer(wordspacing_tolerance=0.75, merge_words_between_vertical_lines=False, merge_rows_between_horizontal_lines=False, pages_field='context_pages')

Operator that clusters horizontally aligned words using word spacing.

This operator clusters horizontally aligned words that stay within a predefined word spacing. A text cluster is a group of words separated by single spaces. The heuristic relies on the fact that, in standard typography, words separated by a gap wider than max_width belong to separate word clusters, where max_width = wordspacing_tolerance * vertical width between 2 previous words.
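The spacing heuristic above can be illustrated with a minimal sketch. This is not the operator's actual implementation: the `Word` class and `cluster_row` function are hypothetical names introduced here, and the sketch assumes that "vertical width" means the height of the previous word's bounding box and that all words already lie on the same text row.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical word box; not part of the operator's API."""
    text: str
    x0: float      # left edge
    x1: float      # right edge
    height: float  # assumed meaning of "vertical width"


def cluster_row(words: List[Word], wordspacing_tolerance: float = 0.75) -> List[List[Word]]:
    """Group the words of one text row into clusters by horizontal gap."""
    clusters: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.x0):
        if clusters:
            prev = clusters[-1][-1]
            # Assumption: the threshold scales with the previous word's height.
            max_width = wordspacing_tolerance * prev.height
            if word.x0 - prev.x1 <= max_width:
                clusters[-1].append(word)
                continue
        # Gap wider than max_width (or first word): start a new cluster.
        clusters.append([word])
    return clusters


row = [
    Word("Invoice", 10, 50, 10),
    Word("number", 54, 90, 10),
    Word("Total", 140, 170, 10),
    Word("due", 175, 195, 10),
]
clusters = cluster_row(row)
cluster_texts = [" ".join(w.text for w in c) for c in clusters]
print(cluster_texts)  # ['Invoice number', 'Total due']
```

With the default tolerance of 0.75 and a word height of 10, the threshold is 7.5, so the gaps of 4 and 5 keep words together while the gap of 50 starts a new cluster. Lowering the tolerance splits clusters more aggressively.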

Optionally merges word clusters vertically into regions using bounding horizontal lines. Requires LinesFeaturizer if merge_words_between_vertical_lines or merge_rows_between_horizontal_lines is set.

Parameters:
  • wordspacing_tolerance (float, default: 0.75) – The ratio, relative to the vertical width, used to decide whether 2 words belong to the same cluster.

  • merge_words_between_vertical_lines (bool, default: False) – If True, and LinesFeaturizer is provided, clusters words between vertical lines together.

  • merge_rows_between_horizontal_lines (bool, default: False) – If True, and LinesFeaturizer is provided, clusters words between horizontal lines together.

  • pages_field (Optional[str], default: 'context_pages') – The name of the column containing the page numbers on which to run the operator. Defaults to RichDocCols.CONTEXT_PAGES.

Returns:

A serialized TextClusters object containing all the text cluster information.

Return type:

{RichDocCols.TEXT_CLUSTERS}