flair.tokenization.StaccatoTokenizer

class flair.tokenization.StaccatoTokenizer

Bases: Tokenizer

A string-based tokenizer that splits text into tokens based on the following rules:

- Punctuation characters are split into individual tokens.
- Sequences of numbers are kept together as single tokens.
- Kanji characters are split into individual tokens.
- Uninterrupted sequences of letters (Latin, Cyrillic, etc.) and kana are preserved as single tokens.
- Whitespace and common zero-width characters are ignored.
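As an illustration, rules like these can be expressed as a single regular expression and extracted with re.findall. The pattern below is a hypothetical approximation for the sake of example, not flair's actual implementation:

```python
import re

# Hypothetical approximation of the rules above -- not flair's actual pattern.
PATTERN = re.compile(
    r"[0-9]+"                       # digit runs kept together
    r"|[\u4e00-\u9fff]"             # each kanji matched individually
    r"|[^\W\d_\u4e00-\u9fff]+"      # letter/kana runs kept together
    r"|[^\w\s]"                     # each punctuation mark matched alone
)

def staccato_like_tokenize(text: str) -> list[str]:
    # Whitespace is skipped automatically: no alternative matches it.
    return PATTERN.findall(text)

print(staccato_like_tokenize("Hello, 東京 2024!"))
# ['Hello', ',', '東', '京', '2024', '!']
```

Note that alternative order matters: the single-kanji alternative is tried before the letter-run alternative, so kanji are never swallowed into a longer match.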

__init__()

Methods

__init__()

from_dict(config)

Instantiate the tokenizer from a configuration dictionary.

to_dict()

Serialize the tokenizer's configuration to a dictionary.

tokenize(text)

Tokenize the input text using re.findall to extract valid tokens.

Attributes

name

tokenize(text)

Tokenize the input text using re.findall to extract valid tokens.

Parameters:

text (str) – The input text to tokenize

Return type:

list[str]

Returns:

A list of tokens (strings)

to_dict()

Serialize the tokenizer’s configuration to a dictionary.

Return type:

dict

classmethod from_dict(config)

Instantiate the tokenizer from a configuration dictionary.

Return type:

StaccatoTokenizer
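Together, to_dict and from_dict support a serialize-then-restore round trip. A minimal sketch of that contract (the "class_name" config key shown here is hypothetical; flair's actual dictionary format may differ):

```python
# Minimal sketch of the to_dict / from_dict round-trip contract.
# The "class_name" key is a hypothetical example, not flair's real format.
class SketchStaccatoTokenizer:
    def to_dict(self) -> dict:
        # Serialize the tokenizer's configuration to a dictionary.
        return {"class_name": type(self).__name__}

    @classmethod
    def from_dict(cls, config: dict) -> "SketchStaccatoTokenizer":
        # Instantiate the tokenizer from a configuration dictionary,
        # rejecting configs that belong to a different tokenizer class.
        if config.get("class_name") != cls.__name__:
            raise ValueError(f"config is for {config.get('class_name')!r}")
        return cls()

config = SketchStaccatoTokenizer().to_dict()
restored = SketchStaccatoTokenizer.from_dict(config)
```

This pattern lets a saved model record which tokenizer it was built with and reconstruct an equivalent instance on load.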