flair.tokenization

class flair.tokenization.Tokenizer

Bases: ABC

An abstract class representing a Tokenizer.

Tokenizers represent algorithms and models that split plain text into individual tokens / words. All subclasses should override tokenize(), which splits the given plain text into tokens. Moreover, subclasses may override name(), returning a unique identifier representing the tokenizer’s configuration.

abstract tokenize(text)
Return type:

list[str]

property name: str
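To make the interface concrete, here is a minimal sketch of what the abstract base looks like and how a subclass fills it in. The `Tokenizer` shape (abstract `tokenize()`, a `name` property) follows the description above; `WhitespaceTokenizer` and its default `name` behavior are illustrative assumptions, not part of flair.

```python
from abc import ABC, abstractmethod


class Tokenizer(ABC):
    """Sketch of the abstract interface: subclasses implement tokenize()."""

    @abstractmethod
    def tokenize(self, text: str) -> list[str]:
        raise NotImplementedError

    @property
    def name(self) -> str:
        # Assumed default: identify the tokenizer by its class name.
        return self.__class__.__name__


class WhitespaceTokenizer(Tokenizer):
    """Illustrative subclass that splits on any whitespace run."""

    def tokenize(self, text: str) -> list[str]:
        return text.split()


tokenizer = WhitespaceTokenizer()
print(tokenizer.tokenize("Hello world"))  # ['Hello', 'world']
print(tokenizer.name)  # WhitespaceTokenizer
```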
class flair.tokenization.SpacyTokenizer(model)

Bases: Tokenizer

Tokenizer using spaCy under the hood.

Parameters:

model – a spaCy v2 model or the name of the model to load.

tokenize(text)
Return type:

list[str]

property name: str
class flair.tokenization.SegtokTokenizer

Bases: Tokenizer

Tokenizer using segtok, a third-party library providing rule-based tokenization for Indo-European languages.

For further details see: fnl/segtok

tokenize(text)
Return type:

list[str]

static run_tokenize(text)
Return type:

list[str]

class flair.tokenization.SpaceTokenizer

Bases: Tokenizer

Tokenizer that splits on the space character only.

tokenize(text)
Return type:

list[str]

static run_tokenize(text)
Return type:

list[str]
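Splitting on the space character only differs from generic whitespace splitting: tabs and newlines are not treated as separators. The following standalone function is a plain-Python approximation of that behavior (the name `run_tokenize` mirrors the static method above); the exact handling of consecutive spaces in flair’s implementation may differ.

```python
def run_tokenize(text: str) -> list[str]:
    # Split on the single space character only, dropping the empty
    # strings produced by consecutive spaces. Tabs and newlines are
    # NOT separators here, unlike str.split() with no argument.
    return [token for token in text.split(" ") if token]


print(run_tokenize("one  two\tthree"))  # ['one', 'two\tthree']
```

Note how the tab-joined `"two\tthree"` survives as a single token, which a whitespace-based tokenizer would split.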

class flair.tokenization.JapaneseTokenizer(tokenizer, sudachi_mode='A')

Bases: Tokenizer

Tokenizer using konoha, a third-party library which supports multiple popular Japanese tokenizers such as MeCab, Janome and SudachiPy.

For further details see:

himkt/konoha

tokenize(text)
Return type:

list[str]

property name: str
class flair.tokenization.TokenizerWrapper(tokenizer_func)

Bases: Tokenizer

Helper class to wrap plain tokenizer functions into the class-based tokenizer interface.

tokenize(text)
Return type:

list[str]

property name: str
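The wrapper pattern is simple: store the callable and delegate to it. The sketch below shows the idea without depending on flair; the `name` format that embeds the wrapped function’s `__name__` is an assumption for illustration, not flair’s documented behavior.

```python
from typing import Callable


class TokenizerWrapper:
    """Sketch: adapts a plain tokenizer function to the class-based interface."""

    def __init__(self, tokenizer_func: Callable[[str], list[str]]):
        self.tokenizer_func = tokenizer_func

    def tokenize(self, text: str) -> list[str]:
        # Delegate straight to the wrapped function.
        return self.tokenizer_func(text)

    @property
    def name(self) -> str:
        # Assumed scheme: include the wrapped function's name so that
        # different wrapped functions yield distinct identifiers.
        return f"{self.__class__.__name__}({self.tokenizer_func.__name__})"


def split_on_comma(text: str) -> list[str]:
    return text.split(",")


wrapper = TokenizerWrapper(split_on_comma)
print(wrapper.tokenize("a,b,c"))  # ['a', 'b', 'c']
print(wrapper.name)  # TokenizerWrapper(split_on_comma)
```

This lets any existing function, e.g. one from another NLP library, plug into code that expects a class-based tokenizer.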
class flair.tokenization.SciSpacyTokenizer

Bases: Tokenizer

Tokenizer that uses the en_core_sci_sm spaCy model and some special heuristics.

Implementation of Tokenizer which uses the en_core_sci_sm spaCy model, extended by special heuristics that treat characters such as “(”, “)” and “-” as additional token separators. The latter distinguishes this implementation from SpacyTokenizer.

Note: if you want to use the “normal” SciSpacy tokenization, just use SpacyTokenizer.

tokenize(text)
Return type:

list[str]

property name: str