operators.pdf.text_cluster.TextClusterer
- class operators.pdf.text_cluster.TextClusterer(wordspacing_tolerance=0.75, merge_words_between_vertical_lines=False, merge_rows_between_horizontal_lines=False, pages_field='context_pages')
Operator that clusters horizontally aligned words using word spacing.
This operator clusters horizontally aligned words that stay within a predefined word spacing. A text cluster is a group of words separated by single spaces. The heuristic relies on the fact that, in standard typography, words separated by a width greater than max_width belong to separate word clusters, where max_width = wordspacing_tolerance * vertical width between the 2 previous words.
Optionally merges word clusters vertically into regions using bounding horizontal lines. Requires LinesFeaturizer if merge_words_between_vertical_lines or merge_rows_between_horizontal_lines is set.
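The spacing heuristic can be illustrated with a minimal, self-contained sketch. This is not the operator's actual implementation: the Word dataclass and the cluster_row function are hypothetical names introduced for illustration, and the sketch assumes that the "vertical width" is the height of the previous word on the row.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical word box; not a type from the library."""
    text: str
    x0: float      # left edge
    x1: float      # right edge
    height: float  # assumed stand-in for the "vertical width"


def cluster_row(words: List[Word], wordspacing_tolerance: float = 0.75) -> List[List[Word]]:
    """Group the words of one text row into clusters by horizontal gap."""
    clusters: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.x0):
        if clusters:
            prev = clusters[-1][-1]
            # Assumption: the gap threshold scales with the previous word's height.
            max_width = wordspacing_tolerance * prev.height
            if word.x0 - prev.x1 <= max_width:
                clusters[-1].append(word)
                continue
        # Gap too wide (or first word): start a new cluster.
        clusters.append([word])
    return clusters


row = [
    Word("Invoice", 10, 50, 10),
    Word("number", 54, 95, 10),
    Word("Total", 200, 230, 10),
]
clusters = cluster_row(row)
cluster_texts = [" ".join(w.text for w in c) for c in clusters]
print(cluster_texts)
```

Under these assumptions, lowering wordspacing_tolerance splits the row into more clusters, and raising it merges words that sit farther apart.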
- Parameters:
wordspacing_tolerance (float, default: 0.75) – The ratio, relative to the vertical width, used to decide whether two words belong to the same cluster.
merge_words_between_vertical_lines (bool, default: False) – If True and a LinesFeaturizer is provided, clusters words between vertical lines together.
merge_rows_between_horizontal_lines (bool, default: False) – If True and a LinesFeaturizer is provided, clusters words between horizontal lines together.
pages_field (Optional[str], default: 'context_pages') – The name of the column containing the page numbers on which to run the operator. Defaults to RichDocCols.CONTEXT_PAGES.
- Returns:
A serialized TextClusters class containing all the text cluster information.
- Return type:
{RichDocCols.TEXT_CLUSTERS}