flair.tokenization
- class flair.tokenization.Tokenizer
- Bases: ABC
An abstract class representing a Tokenizer.
Tokenizers are used to represent algorithms and models to split plain text into individual tokens / words. All subclasses should overwrite tokenize(), which splits the given plain text into tokens. Moreover, subclasses may overwrite name(), returning a unique identifier representing the tokenizer’s configuration.
- abstract tokenize(text)
- Return type: List[str]
- property name: str
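A custom tokenizer can be plugged into flair by subclassing Tokenizer and implementing tokenize(). A minimal sketch (the CommaTokenizer class and its splitting rule are illustrative, not part of the library):

```python
from typing import List

from flair.tokenization import Tokenizer


class CommaTokenizer(Tokenizer):
    """Illustrative Tokenizer subclass that splits text on commas."""

    def tokenize(self, text: str) -> List[str]:
        # Split on commas and drop empty or whitespace-only fragments.
        return [part.strip() for part in text.split(",") if part.strip()]

    @property
    def name(self) -> str:
        # A unique identifier for this tokenizer's configuration.
        return self.__class__.__name__


tokens = CommaTokenizer().tokenize("one, two, three")
print(tokens)  # ['one', 'two', 'three']
```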
- class flair.tokenization.SpacyTokenizer(model)
- Bases: Tokenizer
Tokenizer using spaCy under the hood.
- Parameters: model – a spaCy V2 model or the name of the model to load.
- tokenize(text)
- Return type: List[str]
- property name: str
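A usage sketch, assuming spaCy and the en_core_web_sm model are installed (pip install spacy, then python -m spacy download en_core_web_sm):

```python
from flair.data import Sentence
from flair.tokenization import SpacyTokenizer

# Load the tokenizer by model name; a pre-loaded spaCy model works too.
tokenizer = SpacyTokenizer("en_core_web_sm")

# Pass the tokenizer to a Sentence to control how it is split into tokens.
sentence = Sentence("The quick brown fox jumps.", use_tokenizer=tokenizer)
print([token.text for token in sentence])
```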
- class flair.tokenization.SegtokTokenizer
- Bases: Tokenizer
Tokenizer using segtok, a third-party library providing rule-based tokenization for Indo-European languages.
For further details see: fnl/segtok
- tokenize(text)
- Return type: List[str]
- static run_tokenize(text)
- Return type: List[str]
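SegtokTokenizer requires no model download and can be used directly; run_tokenize is also available as a static helper. A short usage sketch:

```python
from flair.tokenization import SegtokTokenizer

# Instance-based call.
tokenizer = SegtokTokenizer()
print(tokenizer.tokenize("Dr. Smith arrived at 3 p.m., didn't he?"))

# Equivalent static call, no instantiation needed.
print(SegtokTokenizer.run_tokenize("A second, equivalent call."))
```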
- class flair.tokenization.SpaceTokenizer
- Bases: Tokenizer
Tokenizer that splits on the space character only.
- tokenize(text)
- Return type: List[str]
- static run_tokenize(text)
- Return type: List[str]
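Because only the space character delimits tokens, punctuation stays attached to adjacent words. A short sketch:

```python
from flair.tokenization import SpaceTokenizer

tokenizer = SpaceTokenizer()
# Punctuation is not separated, since only spaces delimit tokens.
print(tokenizer.tokenize("Hello, world!"))  # ['Hello,', 'world!']
```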
- class flair.tokenization.JapaneseTokenizer(tokenizer, sudachi_mode='A')
- Bases: Tokenizer
Tokenizer using konoha, a third-party library which supports popular Japanese tokenizers such as MeCab, Janome, and SudachiPy.
For further details see: himkt/konoha
- tokenize(text)
- Return type: List[str]
- property name: str
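A usage sketch, assuming konoha and one of its tokenizer backends are installed; the choice of the janome backend here is an illustrative assumption, not a default:

```python
from flair.tokenization import JapaneseTokenizer

# Assumes konoha and the chosen backend are installed
# (e.g. pip install konoha janome).
tokenizer = JapaneseTokenizer("janome")
print(tokenizer.tokenize("私はベルリンが好きです。"))
```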
- class flair.tokenization.TokenizerWrapper(tokenizer_func)
- Bases: Tokenizer
Helper class to wrap tokenizer functions in the class-based tokenizer interface.
- tokenize(text)
- Return type: List[str]
- property name: str
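For example, wrapping a plain function (the whitespace_tokenize helper below is illustrative):

```python
from typing import List

from flair.tokenization import TokenizerWrapper


def whitespace_tokenize(text: str) -> List[str]:
    """Illustrative tokenizer function: split on any whitespace."""
    return text.split()


# Wrap the function so it can be used wherever a Tokenizer is expected.
tokenizer = TokenizerWrapper(whitespace_tokenize)
print(tokenizer.tokenize("wrap a plain function in the Tokenizer interface"))
```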
- class flair.tokenization.SciSpacyTokenizer
- Bases: Tokenizer
Tokenizer that uses the en_core_sci_sm spaCy model and some special heuristics.
Implementation of Tokenizer which uses the en_core_sci_sm spaCy model, extended by special heuristics that treat characters such as “(”, “)” and “-” as additional token separators. The latter distinguishes this implementation from SpacyTokenizer. Note that if you want the “normal” SciSpacy tokenization, just use SpacyTokenizer.
- tokenize(text)
- Return type: List[str]
- property name: str
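A usage sketch, assuming scispacy and its en_core_sci_sm model are installed (the model is distributed by the scispacy project, not via a plain pip package name):

```python
from flair.tokenization import SciSpacyTokenizer

# Takes no arguments; loads en_core_sci_sm with the extra separator heuristics.
tokenizer = SciSpacyTokenizer()
print(tokenizer.tokenize("Induction of cytokine expression (IL-2/IL-4) in T-cells."))
```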