Version: 0.95

operators.whitespace.WhitespacePreprocessor

class operators.whitespace.WhitespacePreprocessor(fields, to_replace=None, output_field_suffix='')

Preprocessor that normalizes whitespace.

This operator finds the different types of whitespace in the given text fields and normalizes them to the regular space character (U+0020). By default, the following non-standard space characters are replaced with the regular space: U+00A0, U+2000 to U+200A, U+202F, U+205F, U+3000. See https://en.wikipedia.org/wiki/Whitespace_character for more details on what these Unicode code points mean.
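
As a rough illustration of the default behavior described above, the following standalone sketch builds the listed character set and maps each character to U+0020. It does not use the operator itself: `DEFAULT_WHITESPACE` and `normalize_whitespace` are hypothetical names, and the one-to-one replacement (no collapsing of repeated spaces) is an assumption rather than documented behavior.

```python
# Default code points as listed above:
# U+00A0, U+2000 to U+200A, U+202F, U+205F, U+3000 (15 characters in total).
DEFAULT_CODE_POINTS = [0x00A0, *range(0x2000, 0x200B), 0x202F, 0x205F, 0x3000]
DEFAULT_WHITESPACE = "".join(chr(cp) for cp in DEFAULT_CODE_POINTS)

# Translation table that maps every listed character to a regular space.
_TABLE = str.maketrans(dict.fromkeys(DEFAULT_WHITESPACE, " "))


def normalize_whitespace(text: str) -> str:
    """Replace each listed non-standard space with U+0020 (hypothetical helper)."""
    return text.translate(_TABLE)


# NO-BREAK SPACE, THIN SPACE, and IDEOGRAPHIC SPACE all become regular spaces.
sample = "price:\u00a0100\u2009USD\u3000total"
print(normalize_whitespace(sample))  # -> price: 100 USD total
```

Note that characters outside the listed set, such as tabs and newlines, are left untouched in this sketch.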

Parameters:
  • fields (List[str]) – The fields to apply whitespace pre-processing to.

  • to_replace (Optional[str], default: None) – A string containing all characters to be replaced with a regular space (U+0020). If None, the default set of characters listed above is used.

  • output_field_suffix (Optional[str], default: '') – Suffix appended to each specified field name to form the output field name. With the default empty suffix, the specified fields are updated in place.
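
To make the interaction of the three parameters concrete, here is a minimal standalone sketch that mimics them on a plain dictionary record. It is not the operator's implementation: `normalize_fields` is a hypothetical function, the dictionary record is an assumed stand-in for whatever data structure the operator actually receives, and treating `to_replace` as a full override of the default set is an assumption.

```python
from typing import Dict, List, Optional

# Default set from the description: U+00A0, U+2000 to U+200A, U+202F, U+205F, U+3000.
_DEFAULT = (
    "\u00a0"
    + "".join(chr(cp) for cp in range(0x2000, 0x200B))
    + "\u202f\u205f\u3000"
)


def normalize_fields(
    row: Dict[str, str],
    fields: List[str],
    to_replace: Optional[str] = None,
    output_field_suffix: str = "",
) -> Dict[str, str]:
    """Hypothetical stand-in that mirrors the documented parameters."""
    # Assumption: a user-supplied string replaces the default set entirely.
    chars = _DEFAULT if to_replace is None else to_replace
    table = str.maketrans(dict.fromkeys(chars, " "))
    out = dict(row)  # copy, so the caller's record is not mutated
    for field in fields:
        # An empty suffix overwrites the field; otherwise a new field is added.
        out[field + output_field_suffix] = row[field].translate(table)
    return out


row = {"body": "a\u00a0b\tc"}

# Defaults: the no-break space is replaced in place, the tab is kept.
print(normalize_fields(row, ["body"]))

# Custom set plus suffix: only the tab is replaced, and the result is
# written to "body_clean" while "body" keeps its original value.
print(normalize_fields(row, ["body"], to_replace="\t", output_field_suffix="_clean"))
```

Using a non-empty suffix keeps the original text available for comparison, which can help when checking that a custom `to_replace` string matches the intended characters.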