flair.tokenization

JapaneseTokenizer

Tokenizer using konoha to support popular Japanese tokenizers.
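A minimal usage sketch, assuming konoha is installed together with one of its backends (here janome); the sample sentence is illustrative:

    from flair.data import Sentence
    from flair.tokenization import JapaneseTokenizer

    # janome is one konoha backend; mecab and sudachi are common alternatives
    # (install with: pip install konoha[janome])
    tokenizer = JapaneseTokenizer("janome")
    sentence = Sentence("私はベルリンが好きです。", use_tokenizer=tokenizer)
    print([token.text for token in sentence])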

SciSpacyTokenizer

Tokenizer that uses the en_core_sci_sm spaCy model plus some special heuristics.
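A minimal sketch, assuming scispacy and its en_core_sci_sm model are installed; the biomedical sample sentence is illustrative:

    from flair.data import Sentence
    from flair.tokenization import SciSpacyTokenizer

    # requires scispacy plus the en_core_sci_sm model to be installed
    sentence = Sentence(
        "The patient was given 5mg of haloperidol.",
        use_tokenizer=SciSpacyTokenizer(),
    )
    print([token.text for token in sentence])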

SegtokTokenizer

Tokenizer using segtok, a third-party rule-based tokenization library for Indo-European languages.
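A minimal sketch; segtok is a core flair dependency, so no extra install should be needed (the sample text is illustrative):

    from flair.tokenization import SegtokTokenizer

    # segtok handles abbreviations and contractions better than naive splitting
    tokens = SegtokTokenizer().tokenize("Mr. Smith didn't arrive.")
    print(tokens)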

SpaceTokenizer

Tokenizer that splits text on the space character only.
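A minimal sketch; because only the space character delimits tokens, punctuation stays attached to adjacent words:

    from flair.tokenization import SpaceTokenizer

    tokens = SpaceTokenizer().tokenize("Hello, world!")
    print(tokens)  # ['Hello,', 'world!']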

SpacyTokenizer

Tokenizer using spaCy under the hood.
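A minimal sketch, assuming spaCy is installed and the en_core_web_sm model has been downloaded (model name and sample text are illustrative):

    from flair.data import Sentence
    from flair.tokenization import SpacyTokenizer

    # download the model first: python -m spacy download en_core_web_sm
    tokenizer = SpacyTokenizer("en_core_web_sm")
    sentence = Sentence("Let's tokenize this properly.", use_tokenizer=tokenizer)
    print([token.text for token in sentence])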

StaccatoTokenizer

A string-based tokenizer that splits text into tokens based on the following rules:

- Punctuation characters are split into individual tokens
- Sequences of numbers are kept together as single tokens
- Kanji characters are split into individual tokens
- Uninterrupted sequences of letters (Latin, Cyrillic, etc.) and kana are preserved as single tokens
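A sketch of the rules above on mixed text; the commented output is what the listed rules imply for this illustrative input, not verified output:

    from flair.tokenization import StaccatoTokenizer

    tokens = StaccatoTokenizer().tokenize("Release 2024: 東京 rocks!!")
    # per the rules above, expected:
    # ['Release', '2024', ':', '東', '京', 'rocks', '!', '!']
    print(tokens)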

Tokenizer

An abstract base class defining the tokenizer interface.
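Custom tokenizers subclass Tokenizer and implement tokenize(), which in current flair versions takes a string and returns a list of token strings. A hypothetical minimal subclass (CommaTokenizer is invented for illustration):

    from typing import List

    from flair.tokenization import Tokenizer

    class CommaTokenizer(Tokenizer):
        """Hypothetical tokenizer that splits text on commas."""

        def tokenize(self, text: str) -> List[str]:
            return [part.strip() for part in text.split(",") if part.strip()]

    print(CommaTokenizer().tokenize("one, two, three"))  # ['one', 'two', 'three']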

TokenizerWrapper

Helper class that wraps a plain tokenizer function in the class-based Tokenizer interface.
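A short sketch, assuming TokenizerWrapper accepts any callable that maps a string to a list of strings; the wrapped function below is hypothetical:

    from typing import List

    from flair.data import Sentence
    from flair.tokenization import TokenizerWrapper

    def my_tokenizer(text: str) -> List[str]:
        # hypothetical tokenizer function: split on whitespace only
        return text.split()

    sentence = Sentence("Flair is simple", use_tokenizer=TokenizerWrapper(my_tokenizer))
    print([token.text for token in sentence])  # ['Flair', 'is', 'simple']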