flair.tokenization.StaccatoTokenizer

class flair.tokenization.StaccatoTokenizer

Bases: Tokenizer

A string-based tokenizer that splits text into tokens based on the following rules:

- Punctuation characters are split into individual tokens
- Sequences of numbers are kept together as single tokens
- Kanji characters are split into individual tokens
- Uninterrupted sequences of letters (Latin, Cyrillic, etc.) and kana are preserved as single tokens
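A minimal usage sketch, assuming the class is importable from flair.tokenization as documented here; the output shown in the comment is what the rules above suggest, not a verified result:

```python
from flair.tokenization import StaccatoTokenizer

# Instantiate with no arguments, matching __init__() below.
tokenizer = StaccatoTokenizer()

# Mixed input: letter runs, a digit run, and punctuation.
tokens = tokenizer.tokenize("Flair released in 2024, great!")
print(tokens)
# The rules above suggest letter and digit runs stay together while
# punctuation is split off, e.g.:
# ['Flair', 'released', 'in', '2024', ',', 'great', '!']
```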

__init__()

Methods

__init__()

tokenize(text)

Tokenize the input text according to the defined rules.

Attributes

name

tokenize(text)

Tokenize the input text according to the defined rules.

Parameters:

text (str) – The input text to tokenize

Return type:

list[str]

Returns:

A list of tokens
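A short sketch of tokenize(text) on mixed scripts, illustrating the Kanji and kana rules listed above; the expected tokens in the comment follow from those rules and are illustrative rather than verified:

```python
from flair.tokenization import StaccatoTokenizer

tokenizer = StaccatoTokenizer()

# Kanji characters should split individually, the katakana run should
# stay together, and the digit run should remain a single token.
tokens = tokenizer.tokenize("東京タワー2024")
print(tokens)
# Suggested by the rules above: ['東', '京', 'タワー', '2024']
```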