operators.pdf.text_cluster.TextClusterer
- class operators.pdf.text_cluster.TextClusterer(wordspacing_tolerance=0.75, merge_words_between_vertical_lines=False, merge_rows_between_horizontal_lines=False, pages_field='context_pages')
Operator that clusters horizontally aligned words using word spacing.
This operator clusters horizontally aligned words that stay within a predefined word spacing. Text clusters are groups of words separated by a single space. The heuristic relies on the fact that, in standard typography, two words separated by a width greater than max_width belong to separate word clusters, where max_width = wordspacing_tolerance * vertical width between the 2 previous words.
Optionally merges word clusters vertically into regions using bounding horizontal lines. Requires LinesFeaturizer if merge_words_between_vertical_lines or merge_rows_between_horizontal_lines is set.
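The word-spacing heuristic described above can be sketched as follows. This is a minimal illustration and not the operator's actual implementation: the `Word` class and the `cluster_row` function are hypothetical names, and the sketch assumes that the "vertical width" is the height of the previous word's bounding box.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical word box; not the library's data model."""
    text: str
    x0: float      # left edge
    x1: float      # right edge
    height: float  # vertical extent of the word box


def cluster_row(words: List[Word], wordspacing_tolerance: float = 0.75) -> List[List[Word]]:
    """Group one row of horizontally aligned words into clusters by gap size."""
    clusters: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.x0):
        if clusters:
            prev = clusters[-1][-1]
            # Assumption: the reference size is the previous word's height.
            max_width = wordspacing_tolerance * prev.height
            if word.x0 - prev.x1 <= max_width:
                # Gap is within tolerance: same cluster as the previous word.
                clusters[-1].append(word)
                continue
        # First word, or gap too wide: start a new cluster.
        clusters.append([word])
    return clusters


row = [
    Word("Invoice", 0, 40, 10),
    Word("number", 45, 80, 10),
    Word("Total", 120, 150, 10),
]
clusters = cluster_row(row)
cluster_texts = [" ".join(w.text for w in c) for c in clusters]
print(cluster_texts)  # ['Invoice number', 'Total']
```

With the default tolerance of 0.75 and a word height of 10, any gap wider than 7.5 units starts a new cluster. Lowering the tolerance splits text more aggressively.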
Parameters
Returns
A serialized TextClusters class containing all the text cluster information.
Return type
{RichDocCols.TEXT_CLUSTERS}
Name | Type | Default | Info
wordspacing_tolerance | float | 0.75 | The ratio (relative to the vertical width) used to decide whether 2 words belong to the same cluster.
merge_words_between_vertical_lines | bool | False | If True and LinesFeaturizer is provided, clusters words between vertical lines together.
merge_rows_between_horizontal_lines | bool | False | If True and LinesFeaturizer is provided, clusters words between horizontal lines together.
pages_field | Optional[str] | 'context_pages' | The name of the column containing the page numbers on which to run the operator. Defaults to RichDocCols.CONTEXT_PAGES.