flair.tokenization.StaccatoTokenizer
- class flair.tokenization.StaccatoTokenizer
Bases: Tokenizer
A string-based tokenizer that splits text into tokens according to the following rules:
  - Punctuation characters are split into individual tokens.
  - Sequences of numbers are kept together as single tokens.
  - Kanji characters are split into individual tokens.
  - Uninterrupted sequences of letters (Latin, Cyrillic, etc.) and kana are preserved as single tokens.
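The rules above can be approximated with a single regular expression. The following is a minimal sketch for illustration only, not the library's implementation: the name `sketch_tokenize`, the pattern, and the choice of U+4E00 to U+9FFF as the kanji range are all assumptions, and the real tokenizer may treat edge cases (symbols, underscores, rarer CJK blocks) differently.

```python
import re

# Assumption for this sketch: "kanji" means the basic CJK Unified Ideographs
# block (U+4E00 to U+9FFF). The library may cover a different range.
KANJI = "\u4e00-\u9fff"

TOKEN_RE = re.compile(
    rf"\d+"                 # a run of digits stays one token
    rf"|[{KANJI}]"          # each kanji character is its own token
    rf"|[^\W\d_{KANJI}]+"   # a run of letters or kana (word chars minus digits, "_", kanji)
    rf"|[^\w\s]|_"          # any single punctuation or symbol character
)


def sketch_tokenize(text: str) -> list[str]:
    """Hypothetical helper: split text by the four rules; whitespace is dropped."""
    return TOKEN_RE.findall(text)


print(sketch_tokenize("Hello, world! 123"))  # ['Hello', ',', 'world', '!', '123']
print(sketch_tokenize("東京タワー"))            # ['東', '京', 'タワー']
```

With the class itself, the equivalent call would be `StaccatoTokenizer().tokenize(text)`, which returns a `list[str]` as documented below.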
- __init__()
Methods
__init__()
tokenize(text): Tokenize the input text according to the defined rules.
Attributes
name
- tokenize(text)
Tokenize the input text according to the defined rules.
- Parameters:
  text (str) – The input text to tokenize
- Return type:
  list[str]
- Returns:
  A list of tokens