flair.tokenization.SegtokTokenizer

class flair.tokenization.SegtokTokenizer(additional_split_characters=None)

Bases: Tokenizer

Tokenizer using segtok, a third-party library for rule-based tokenization of Indo-European languages.

For further details see: fnl/segtok
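
A minimal usage sketch, assuming the standard flair API for building a Sentence with a custom tokenizer (the example sentence is illustrative):

from flair.data import Sentence
from flair.tokenization import SegtokTokenizer

# Tokenize the sentence text with segtok instead of flair's default tokenizer.
sentence = Sentence("The quick brown fox jumps over the lazy dog.",
                    use_tokenizer=SegtokTokenizer())
print([token.text for token in sentence])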

__init__(additional_split_characters=None)

Initializes the SegtokTokenizer with an optional parameter for additional characters that should always be split.

The default behavior uses simple rules to split text into tokens. If you want to ensure that certain characters always become their own token, you can change the default behavior by setting the additional_split_characters parameter.

Parameters:

additional_split_characters (Optional[list[str]]) – An optional list of characters that should always be split. For instance, if you want to make sure that paragraph symbols always become their own token, instantiate with additional_split_characters = ['§'].
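
A minimal sketch of the effect of this parameter; the example text and the exact token boundaries shown in the comment are illustrative assumptions:

from flair.tokenization import SegtokTokenizer

# Force the paragraph symbol to always be split off as its own token.
tokenizer = SegtokTokenizer(additional_split_characters=["§"])

tokens = tokenizer.tokenize("The rule is defined in §12 of the statute.")
# "§12" is split into the two tokens "§" and "12".
print(tokens)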

Methods

__init__([additional_split_characters])

Initializes the SegtokTokenizer with an optional parameter for additional characters that should always be split.

run_tokenize(text)

tokenize(text)

Attributes

name

tokenize(text)
Return type:

list[str]

static run_tokenize(text)
Return type:

list[str]
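
Both methods return plain token strings rather than flair Token objects. A small sketch, with the output in the comment being illustrative rather than guaranteed:

from flair.tokenization import SegtokTokenizer

# Instance method: applies the segtok-based splitting rules.
tokens = SegtokTokenizer().tokenize("Hello, world!")

# Static variant: can be called without constructing a tokenizer.
same_tokens = SegtokTokenizer.run_tokenize("Hello, world!")

print(tokens)  # e.g. ['Hello', ',', 'world', '!']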