flair.tokenization.StaccatoTokenizer

class flair.tokenization.StaccatoTokenizer

Bases: Tokenizer

A string-based tokenizer that splits text into tokens based on the following rules:

- Punctuation characters are split into individual tokens.
- Sequences of numbers are kept together as single tokens.
- Kanji characters are split into individual tokens.
- Uninterrupted sequences of letters (Latin, Cyrillic, etc.) and kana are preserved as single tokens.
- Whitespace and common zero-width characters are ignored.
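As an illustration, rules like these can be expressed as a single regular expression and extracted with re.findall. The pattern below is a hypothetical approximation for the sake of example, not flair's actual implementation:

```python
import re

# Hypothetical approximation of the rules above -- not flair's actual pattern.
PATTERN = re.compile(
    r"[0-9]+"                       # digit runs kept together
    r"|[\u4e00-\u9fff]"             # each kanji matched individually
    r"|[^\W\d_\u4e00-\u9fff]+"      # letter/kana runs kept together
    r"|[^\w\s]"                     # each punctuation mark matched alone
)

def staccato_like_tokenize(text: str) -> list[str]:
    # Whitespace is skipped automatically: no alternative matches it.
    return PATTERN.findall(text)

print(staccato_like_tokenize("Hello, 東京 2024!"))
# ['Hello', ',', '東', '京', '2024', '!']
```

Note that alternative order matters: the single-kanji alternative is tried before the letter-run alternative, so kanji are never swallowed into a longer match.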

__init__()

Methods

__init__()

from_dict(config)

Instantiate the tokenizer from a configuration dictionary.

to_dict()

Serialize the tokenizer's configuration to a dictionary.

tokenize(text)

Tokenize the input text using re.findall to extract valid tokens.

Attributes

name

tokenize(text)

Tokenize the input text using re.findall to extract valid tokens.

Parameters:

text (str) – The input text to tokenize

Return type:

list[str]

Returns:

A list of tokens (strings)

to_dict()

Serialize the tokenizer’s configuration to a dictionary.

Return type:

dict

classmethod from_dict(config)

Instantiate the tokenizer from a configuration dictionary.

Return type:

StaccatoTokenizer
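Together, to_dict and from_dict support a serialize-then-restore round trip. A minimal sketch of that contract (the "class_name" config key shown here is hypothetical; flair's actual dictionary format may differ):

```python
# Minimal sketch of the to_dict / from_dict round-trip contract.
# The "class_name" key is a hypothetical example, not flair's real format.
class SketchStaccatoTokenizer:
    def to_dict(self) -> dict:
        # Serialize the tokenizer's configuration to a dictionary.
        return {"class_name": type(self).__name__}

    @classmethod
    def from_dict(cls, config: dict) -> "SketchStaccatoTokenizer":
        # Instantiate the tokenizer from a configuration dictionary,
        # rejecting configs that belong to a different tokenizer class.
        if config.get("class_name") != cls.__name__:
            raise ValueError(f"config is for {config.get('class_name')!r}")
        return cls()

config = SketchStaccatoTokenizer().to_dict()
restored = SketchStaccatoTokenizer.from_dict(config)
```

This pattern lets a saved model record which tokenizer it was built with and reconstruct an equivalent instance on load.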