Version: 0.95

operators.whitespace.WhitespacePreprocessor

class operators.whitespace.WhitespacePreprocessor(fields, to_replace=None, output_field_suffix='')

Preprocessor that normalizes whitespace.

This operator finds the different types of whitespace characters in a given text field and normalizes them to the regular space character (U+0020). By default, the following non-standard space characters are replaced with the regular space: U+00A0, U+2000 to U+200A, U+202F, U+205F, U+3000. See https://en.wikipedia.org/wiki/Whitespace_character for more details on what these Unicode code points mean.
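The normalization described above can be illustrated in plain Python. This is a minimal sketch, not the operator's actual implementation: the names `DEFAULT_TO_REPLACE` and `normalize_whitespace` are hypothetical, and the sketch assumes `to_replace=None` falls back to the default character set listed above.

```python
# Hypothetical stand-in for the behaviour described above; not the library's code.
# Default set: U+00A0, U+2000 to U+200A, U+202F, U+205F, U+3000.
DEFAULT_TO_REPLACE = (
    "\u00a0"
    + "".join(chr(cp) for cp in range(0x2000, 0x200B))  # U+2000..U+200A inclusive
    + "\u202f\u205f\u3000"
)


def normalize_whitespace(text, to_replace=None):
    """Replace every character in `to_replace` with a regular space (U+0020)."""
    # Assumption: None means "use the default character set".
    chars = DEFAULT_TO_REPLACE if to_replace is None else to_replace
    table = str.maketrans({ch: " " for ch in chars})
    return text.translate(table)


sample = "price:\u00a010\u2009USD\u3000ok"
print(normalize_whitespace(sample))  # price: 10 USD ok
```

Passing a custom string, for example `normalize_whitespace(text, to_replace="\t")`, would in this sketch replace only the characters in that string and leave the default set untouched.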

Parameters

Name | Type | Default | Info
fields | List[str] | | The fields to apply whitespace pre-processing to.
to_replace | Optional[str] | None | A string containing all characters to be replaced with a regular whitespace (U+0020).
output_field_suffix | Optional[str] | '' | To avoid updating in place, optionally specify a suffix to add to specified fields.
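The effect of `fields` and `output_field_suffix` can be sketched as follows. This is only an illustration under stated assumptions: records are modelled as plain dictionaries, the helper `apply_to_fields` is hypothetical, and only U+00A0 is replaced to keep the example short. How the real operator represents records is not specified here.

```python
# Hypothetical illustration of the suffix behaviour; not the library's API.
def apply_to_fields(record, fields, output_field_suffix=""):
    """Return a copy of `record` with normalized text written to each
    field name plus the suffix (an empty suffix overwrites the field)."""
    out = dict(record)  # leave the caller's record untouched
    for name in fields:
        # Simplification: only the no-break space (U+00A0) is replaced here.
        out[name + output_field_suffix] = record[name].replace("\u00a0", " ")
    return out


record = {"id": 1, "title": "hello\u00a0world"}

# Empty suffix: the normalized text replaces the original field.
print(apply_to_fields(record, ["title"]))

# Non-empty suffix: the original field is kept, a new field is added.
print(apply_to_fields(record, ["title"], output_field_suffix="_clean"))
```

With a suffix such as `_clean` (an arbitrary choice for this sketch), the original `title` value survives alongside the normalized `title_clean`, which is the "avoid updating in place" case the table describes.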