flair.tokenization.StaccatoTokenizer
- class flair.tokenization.StaccatoTokenizer

Bases: Tokenizer
A string-based tokenizer that splits text into tokens based on the following rules:

- Punctuation characters are split into individual tokens.
- Sequences of numbers are kept together as single tokens.
- Kanji characters are split into individual tokens.
- Uninterrupted sequences of letters (Latin, Cyrillic, etc.) and kana are preserved as single tokens.
- Whitespace and common zero-width characters are ignored.
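The rules above can be sketched as a single regular expression applied with re.findall. This is an illustrative approximation, not flair's actual pattern; the `STACCATO_PATTERN` name and `staccato_tokenize` function are hypothetical:

```python
import re

# Alternatives are tried in order, so single-kanji matches take
# precedence over the general letter-run alternative.
STACCATO_PATTERN = re.compile(
    r"\d+"                              # number sequences kept together
    r"|[\u4e00-\u9fff]"                 # each kanji (CJK ideograph) alone
    r"|[^\W\d_\u4e00-\u9fff]+"          # letter runs (Latin, Cyrillic, kana, ...)
    r"|[^\w\s\u200b-\u200d\ufeff]"      # other visible characters (punctuation) alone
)

def staccato_tokenize(text: str) -> list[str]:
    """Split text into tokens; whitespace and zero-width characters
    fall between matches and are silently dropped."""
    return STACCATO_PATTERN.findall(text)

print(staccato_tokenize("Hello, world! 2024年"))
# ['Hello', ',', 'world', '!', '2024', '年']
```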
- __init__()
Methods

- __init__()
- from_dict(config): Instantiate the tokenizer from a configuration dictionary.
- to_dict(): Serialize the tokenizer's configuration to a dictionary.
- tokenize(text): Tokenize the input text using re.findall to extract valid tokens.

Attributes

- name
- tokenize(text)

Tokenize the input text using re.findall to extract valid tokens.

- Parameters:
  text (str) – The input text to tokenize.
- Return type:
  list[str]
- Returns:
  A list of tokens (strings).
- to_dict()
Serialize the tokenizer’s configuration to a dictionary.
- Return type:
dict
- classmethod from_dict(config)
Instantiate the tokenizer from a configuration dictionary.
- Return type:
  Tokenizer
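A minimal sketch of the to_dict / from_dict round trip, assuming (since StaccatoTokenizer takes no constructor arguments) that the configuration dictionary only needs to record which tokenizer class to instantiate. The sketch class is hypothetical, not flair's implementation:

```python
class StaccatoTokenizerSketch:
    """Illustrative stand-in for a parameterless tokenizer."""

    def to_dict(self) -> dict:
        # Serialize the configuration; with no constructor parameters,
        # the class name is enough to reconstruct the tokenizer.
        return {"class": type(self).__name__}

    @classmethod
    def from_dict(cls, config: dict) -> "StaccatoTokenizerSketch":
        # Instantiate from a configuration dictionary, checking that
        # the config actually describes this tokenizer class.
        if config.get("class") != cls.__name__:
            raise ValueError(f"config does not describe {cls.__name__}")
        return cls()

tok = StaccatoTokenizerSketch()
restored = StaccatoTokenizerSketch.from_dict(tok.to_dict())
```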