flair.tokenization
- class flair.tokenization.Tokenizer
  Bases: ABC
  An abstract class representing a Tokenizer.
  Tokenizers represent algorithms and models that split plain text into individual tokens / words. All subclasses should overwrite tokenize(), which splits the given plain text into tokens. Moreover, subclasses may overwrite name(), returning a unique identifier representing the tokenizer's configuration.
  - abstract tokenize(text)
    - Return type: list[str]
  - property name: str
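The interface above can be sketched without flair installed. The following is a minimal stand-in (not flair's actual implementation) that mirrors the documented contract: an abstract tokenize() that subclasses must overwrite, and a name property with a default identifier. CommaTokenizer is a purely illustrative subclass.

```python
from abc import ABC, abstractmethod


class Tokenizer(ABC):
    """Minimal stand-in for the documented Tokenizer interface."""

    @abstractmethod
    def tokenize(self, text: str) -> list[str]:
        """Split the given plain text into tokens."""

    @property
    def name(self) -> str:
        # Default identifier: the subclass name.
        return self.__class__.__name__


class CommaTokenizer(Tokenizer):
    """Illustrative subclass: splits text on commas."""

    def tokenize(self, text: str) -> list[str]:
        return [part.strip() for part in text.split(",") if part.strip()]


print(CommaTokenizer().tokenize("one, two, three"))  # ['one', 'two', 'three']
print(CommaTokenizer().name)  # CommaTokenizer
```

A real subclass would typically also overwrite name to encode its configuration (for example, the model it loads), so that two differently configured tokenizers are distinguishable.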
- class flair.tokenization.SpacyTokenizer(model)
  Bases: Tokenizer
  Tokenizer using spaCy under the hood.
  - Parameters:
    model – a Spacy V2 model, or the name of the model to load.
  - tokenize(text)
    - Return type: list[str]
  - property name: str
- class flair.tokenization.SegtokTokenizer
  Bases: Tokenizer
  Tokenizer using segtok, a third-party rule-based tokenization library for Indo-European languages.
  For further details see: fnl/segtok
  - tokenize(text)
    - Return type: list[str]
  - static run_tokenize(text)
    - Return type: list[str]
- class flair.tokenization.SpaceTokenizer
  Bases: Tokenizer
  Tokenizer that splits text on the space character only.
  - tokenize(text)
    - Return type: list[str]
  - static run_tokenize(text)
    - Return type: list[str]
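Since this tokenizer splits on the space character only, its behavior can be sketched in plain Python. The function below is an assumption about the behavior (splitting on " " and dropping the empty strings that consecutive spaces produce), not flair's actual code; consult the source for edge cases.

```python
def run_tokenize(text: str) -> list[str]:
    # Space-only tokenization sketch: split on the space character
    # and drop empty strings produced by consecutive spaces.
    return [token for token in text.split(" ") if token]


print(run_tokenize("Berlin  is a  city ."))  # ['Berlin', 'is', 'a', 'city', '.']
```

Note that punctuation is not separated from adjacent words: "city." without a preceding space would stay a single token, which is the trade-off of a whitespace-only strategy.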
- class flair.tokenization.JapaneseTokenizer(tokenizer, sudachi_mode='A')
  Bases: Tokenizer
  Tokenizer using konoha, a third-party library which supports multiple popular Japanese tokenizers such as MeCab, Janome and SudachiPy.
  - tokenize(text)
    - Return type: list[str]
  - property name: str
- class flair.tokenization.TokenizerWrapper(tokenizer_func)
  Bases: Tokenizer
  Helper class that wraps a plain tokenizer function in the class-based Tokenizer interface.
  - tokenize(text)
    - Return type: list[str]
  - property name: str
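The wrapper pattern can be sketched as follows. This is a stand-in illustrating the adapter idea (a plain function str -> list[str] exposed through tokenize() and name), not flair's actual implementation; the exact name format is an assumption.

```python
from typing import Callable


class TokenizerWrapper:
    """Stand-in: adapt a plain tokenizer function to the
    class-based tokenizer interface."""

    def __init__(self, tokenizer_func: Callable[[str], list[str]]):
        self.tokenizer_func = tokenizer_func

    def tokenize(self, text: str) -> list[str]:
        # Delegate tokenization to the wrapped function.
        return self.tokenizer_func(text)

    @property
    def name(self) -> str:
        # Include the wrapped function's name so differently
        # configured wrappers are distinguishable.
        return f"{self.__class__.__name__}({self.tokenizer_func.__name__})"


whitespace_tokenizer = TokenizerWrapper(str.split)
print(whitespace_tokenizer.tokenize("Hello brave new world"))  # ['Hello', 'brave', 'new', 'world']
print(whitespace_tokenizer.name)  # TokenizerWrapper(split)
```

This lets any callable that maps a string to a list of strings be used wherever the library expects a Tokenizer instance.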
- class flair.tokenization.SciSpacyTokenizer
  Bases: Tokenizer
  Tokenizer that uses the en_core_sci_sm Spacy model and some special heuristics.
  Implementation of Tokenizer which uses the en_core_sci_sm Spacy model, extended by special heuristics that treat characters such as “(”, “)” and “-” as additional token separators. The latter distinguishes this implementation from SpacyTokenizer. Note: if you want the “normal” SciSpacy tokenization, just use SpacyTokenizer.
  - tokenize(text)
    - Return type: list[str]
  - property name: str