operators.whitespace.WhitespacePreprocessor
- class operators.whitespace.WhitespacePreprocessor(fields, to_replace=None, output_field_suffix='')
Preprocessor that normalizes whitespace.
This operator finds the different types of whitespace in a given text field and normalizes them to the regular space character (U+0020). By default, the following non-standard space characters are replaced with the regular space: U+00A0, U+2000 to U+200A, U+202F, U+205F, U+3000. See https://en.wikipedia.org/wiki/Whitespace_character for more details on what these Unicode code points mean.
- Parameters:
  - fields (List[str]) – The fields to apply whitespace pre-processing to.
  - to_replace (Optional[str], default: None) – A string containing all characters to be replaced with a regular whitespace (U+0020).
  - output_field_suffix (Optional[str], default: '') – To avoid updating in place, optionally specify a suffix to add to specified fields.
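The behavior described above can be approximated with the standard library alone. The following is a minimal sketch, not the operator's actual implementation: the helper `normalize_whitespace`, the constant `DEFAULT_TO_REPLACE`, and the dict-based record are hypothetical names introduced for illustration. It also assumes that a non-None `to_replace` overrides the default character set rather than extending it, which the description does not state explicitly.

```python
from typing import Dict, List, Optional

# Default characters listed in the description above:
# U+00A0, U+2000 to U+200A, U+202F, U+205F, U+3000.
DEFAULT_TO_REPLACE = (
    "\u00a0"
    + "".join(chr(cp) for cp in range(0x2000, 0x200B))
    + "\u202f\u205f\u3000"
)


def normalize_whitespace(
    record: Dict[str, str],
    fields: List[str],
    to_replace: Optional[str] = None,
    output_field_suffix: str = "",
) -> Dict[str, str]:
    """Hypothetical stand-in that mirrors the documented parameters."""
    # Assumption: an explicit to_replace replaces the default set entirely.
    chars = DEFAULT_TO_REPLACE if to_replace is None else to_replace
    # Map every listed character to the regular space (U+0020).
    table = str.maketrans({ch: " " for ch in chars})
    out = dict(record)  # copy so the caller's record is left untouched
    for field in fields:
        # An empty suffix overwrites the field; otherwise a new field is added.
        out[field + output_field_suffix] = record[field].translate(table)
    return out


record = {"text": "a\u00a0b\u2003c\u3000d"}
in_place = normalize_whitespace(record, ["text"])
suffixed = normalize_whitespace(record, ["text"], output_field_suffix="_clean")
```

With the default empty suffix, `in_place["text"]` holds the normalized string. With `output_field_suffix="_clean"`, the original `text` value is kept and the normalized string is written to `text_clean`.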