flair.tokenization.SegtokTokenizer

class flair.tokenization.SegtokTokenizer(additional_split_characters=None)

Bases: Tokenizer

Tokenizer using segtok, a third-party library for rule-based tokenization of Indo-European languages.

For further details see: fnl/segtok
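
A minimal usage sketch, assuming the standard flair API for building a Sentence with a custom tokenizer (the example sentence is illustrative):

from flair.data import Sentence
from flair.tokenization import SegtokTokenizer

# Tokenize the sentence text with segtok instead of flair's default tokenizer.
sentence = Sentence("The quick brown fox jumps over the lazy dog.",
                    use_tokenizer=SegtokTokenizer())
print([token.text for token in sentence])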

__init__(additional_split_characters=None)

Initializes the SegtokTokenizer with an optional parameter for additional characters that should always be split.

The default behavior uses simple rules to split text into tokens. If you want to ensure that certain characters always become their own token, you can change the default behavior by setting the additional_split_characters parameter.

Parameters:

additional_split_characters (Optional[list[str]]) – An optional list of characters that should always be split. For instance, if you want to make sure that paragraph symbols always become their own token, instantiate with additional_split_characters = ['§'].
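
A minimal sketch of the effect of this parameter; the example text and the exact token boundaries shown in the comment are illustrative assumptions:

from flair.tokenization import SegtokTokenizer

# Force the paragraph symbol to always be split off as its own token.
tokenizer = SegtokTokenizer(additional_split_characters=["§"])

tokens = tokenizer.tokenize("The rule is defined in §12 of the statute.")
# "§12" is split into the two tokens "§" and "12".
print(tokens)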

Methods

__init__([additional_split_characters])

Initializes the SegtokTokenizer with an optional parameter for additional characters that should always be split.

run_tokenize(text)

tokenize(text)

Attributes

name

tokenize(text)
Return type:

list[str]

static run_tokenize(text)
Return type:

list[str]
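
Both methods return plain token strings rather than flair Token objects. A small sketch, with the output in the comment being illustrative rather than guaranteed:

from flair.tokenization import SegtokTokenizer

# Instance method: applies the segtok-based splitting rules.
tokens = SegtokTokenizer().tokenize("Hello, world!")

# Static variant: can be called without constructing a tokenizer.
same_tokens = SegtokTokenizer.run_tokenize("Hello, world!")

print(tokens)  # e.g. ['Hello', ',', 'world', '!']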