Tokenizers documentation

Normalizers

BertNormalizer

class tokenizers.normalizers.BertNormalizer

( clean_text = True, handle_chinese_chars = True, strip_accents = None, lowercase = True )

Parameters

  • clean_text (bool, optional, defaults to True) — Whether to clean the text by removing control characters and replacing every whitespace character with a plain space.
  • handle_chinese_chars (bool, optional, defaults to True) — Whether to put spaces around Chinese characters.
  • strip_accents (bool, optional) — Whether to strip all accents. If left unspecified (i.e. None), the value of lowercase is used instead, as in the original BERT.
  • lowercase (bool, optional, defaults to True) — Whether to lowercase the text.

BertNormalizer

Normalizes raw text before it is passed to a BERT model. This includes cleaning the text, handling accents and Chinese characters, and lowercasing.
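
As a rough illustration of the options above, the sketch below runs normalize_str() (documented further down this page) on a few inputs. The sample strings are invented for this example, and the results in the comments follow from the parameter descriptions rather than from any official test suite.

```python
from tokenizers.normalizers import BertNormalizer

# Defaults: clean_text=True, handle_chinese_chars=True,
# strip_accents=None (so it follows lowercase), lowercase=True.
default_norm = BertNormalizer()
lowered = default_norm.normalize_str("Héllò hôw are ü?")
print(lowered)  # hello how are u?

# With lowercase=False and strip_accents left unset, accents are kept too.
cased_norm = BertNormalizer(lowercase=False)
cased = cased_norm.normalize_str("Héllò")
print(cased)  # Héllò

# clean_text turns whitespace such as a tab into a plain space.
cleaned = default_norm.normalize_str("a\tb")
print(cleaned)  # a b

# handle_chinese_chars surrounds each Chinese character with spaces.
spaced = default_norm.normalize_str("你好")
print(repr(spaced))
```

To lowercase while keeping accents, pass strip_accents=False explicitly; otherwise the accent setting follows lowercase.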

Lowercase

class tokenizers.normalizers.Lowercase

( )

Lowercase Normalizer

NFC

class tokenizers.normalizers.NFC

( )

NFC Unicode Normalizer

NFD

class tokenizers.normalizers.NFD

( )

NFD Unicode Normalizer

NFKC

class tokenizers.normalizers.NFKC

( )

NFKC Unicode Normalizer

NFKD

class tokenizers.normalizers.NFKD

( )

NFKD Unicode Normalizer
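
The four Unicode normalizers above differ in whether they compose or decompose characters (C vs. D) and whether they also fold compatibility characters (the K forms). The following is a minimal sketch of that difference; the two sample characters are chosen for this example, and the comparison with the standard library's unicodedata module is only a sanity check on these inputs, not a guarantee for every string.

```python
import unicodedata

from tokenizers.normalizers import NFC, NFD, NFKC, NFKD

composed = "\u00e9"  # "é" stored as a single code point
ligature = "\ufb01"  # the "fi" ligature, a compatibility character

# NFD splits "é" into "e" followed by a combining acute accent (U+0301).
decomposed = NFD().normalize_str(composed)
print([hex(ord(c)) for c in decomposed])  # ['0x65', '0x301']

# NFC recombines the pair into the single code point.
recomposed = NFC().normalize_str(decomposed)
print(recomposed == composed)  # True

# Canonical forms leave the ligature alone; compatibility forms expand it.
print(NFC().normalize_str(ligature) == ligature)  # True
print(NFKC().normalize_str(ligature))  # fi
print(NFKD().normalize_str(ligature))  # fi

# Sanity check against Python's own implementation on the same inputs.
forms = [("NFC", NFC()), ("NFD", NFD()), ("NFKC", NFKC()), ("NFKD", NFKD())]
same_as_stdlib = all(
    norm.normalize_str(s) == unicodedata.normalize(name, s)
    for name, norm in forms
    for s in (composed, ligature)
)
print(same_as_stdlib)  # True
```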

Nmt

class tokenizers.normalizers.Nmt

( )

Nmt normalizer

Normalizer

class tokenizers.normalizers.Normalizer

( )

Base class for all normalizers

This class is not supposed to be instantiated directly. Instead, any implementation of a Normalizer will return an instance of this class when instantiated.

normalize

( normalized )

Parameters

  • normalized (NormalizedString) — The normalized string on which to apply this Normalizer

Normalize a NormalizedString in-place

This method modifies a NormalizedString in place, which keeps track of the alignment information. If you only want to see the result of normalizing a raw string, use normalize_str().

normalize_str

( sequence ) → str

Parameters

  • sequence (str) — A string to normalize

Returns

str

A string after normalization

Normalize the given string

This method provides a way to visualize the effect of a Normalizer, but it does not keep track of the alignment information. If you need to get or convert offsets, use normalize().
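
The two methods can be compared side by side. This sketch uses Lowercase as a stand-in for any normalizer. It assumes that NormalizedString can be imported from the top-level tokenizers package and exposes normalized and original attributes, which is how the Python bindings appear to work; check this against your installed version.

```python
from tokenizers import NormalizedString
from tokenizers.normalizers import Lowercase

normalizer = Lowercase()

# normalize_str: a quick look at the result, without alignment information.
preview = normalizer.normalize_str("Hello World")
print(preview)  # hello world

# normalize: modifies the NormalizedString in place.
text = NormalizedString("Hello World")
normalizer.normalize(text)
print(text.normalized)  # hello world
print(text.original)    # Hello World
```

Because the NormalizedString keeps the original text next to the normalized one, it is the variant to use when offsets must map back to the raw input.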

Precompiled

class tokenizers.normalizers.Precompiled

( precompiled_charsmap )

Precompiled normalizer. Do not use it manually; it exists for compatibility with SentencePiece.

Replace

class tokenizers.normalizers.Replace

( pattern, content )

Replace normalizer: replaces every match of pattern with content.
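
A short sketch of Replace with both a literal string and a regular expression as the pattern. The page does not describe the accepted pattern types; passing a tokenizers.Regex is an assumption based on how the library is commonly used, and the sample strings are invented for this example.

```python
from tokenizers import Regex
from tokenizers.normalizers import Replace

# Literal pattern: every occurrence of the substring is replaced.
dashes = Replace("-", " ").normalize_str("state-of-the-art")
print(dashes)  # state of the art

# Regex pattern: collapse runs of two or more spaces into a single space.
collapsed = Replace(Regex(" {2,}"), " ").normalize_str("too    many   spaces")
print(collapsed)  # too many spaces
```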

Sequence

class tokenizers.normalizers.Sequence

( normalizers )

Parameters

  • normalizers (List[Normalizer]) — A list of Normalizer instances to run in order

Combines several Normalizer instances into a single one. The normalizers run one after another, in the order given.
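
The sketch below builds two small pipelines to show that the order of the list matters. The inputs are invented for this example.

```python
from tokenizers.normalizers import (
    NFD,
    Lowercase,
    Replace,
    Sequence,
    Strip,
    StripAccents,
)

# A typical chain: decompose, drop accents, then lowercase.
normalizer = Sequence([NFD(), StripAccents(), Lowercase()])
result = normalizer.normalize_str("Héllò hôw are ü?")
print(result)  # hello how are u?

# Order matters: replacing underscores before stripping removes the
# resulting edge spaces, while stripping first leaves them in place.
replace_then_strip = Sequence([Replace("_", " "), Strip()]).normalize_str("_hi_")
strip_then_replace = Sequence([Strip(), Replace("_", " ")]).normalize_str("_hi_")
print(repr(replace_then_strip))  # 'hi'
print(repr(strip_then_replace))  # ' hi '
```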

Strip

class tokenizers.normalizers.Strip

( left = True, right = True )

Strip normalizer: removes whitespace from the left and/or right side of the input, as selected by left and right.
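
A minimal sketch of the left and right flags, using a made-up padded string:

```python
from tokenizers.normalizers import Strip

padded = "  hello  "

# Default: strip both sides.
both = Strip().normalize_str(padded)
# Strip only leading or only trailing whitespace.
left_only = Strip(left=True, right=False).normalize_str(padded)
right_only = Strip(left=False, right=True).normalize_str(padded)

print(repr(both))        # 'hello'
print(repr(left_only))   # 'hello  '
print(repr(right_only))  # '  hello'
```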

StripAccents

class tokenizers.normalizers.StripAccents

( )

StripAccents normalizer
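
As far as I can tell, StripAccents only removes combining marks, so on its own it leaves precomposed characters such as "é" (U+00E9) untouched. The sketch below, with a sample word chosen for this example, shows why it is usually paired with a decomposing normalizer such as NFD.

```python
from tokenizers.normalizers import NFD, Sequence, StripAccents

precomposed = "caf\u00e9"  # "é" stored as one code point

# Alone: there is no separate combining mark to remove.
alone = StripAccents().normalize_str(precomposed)
print(alone)  # café

# After NFD: the accent is a separate code point and gets dropped.
combined = Sequence([NFD(), StripAccents()]).normalize_str(precomposed)
print(combined)  # cafe
```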