flair.tokenization.SegtokTokenizer
- class flair.tokenization.SegtokTokenizer(additional_split_characters=None)
Bases: Tokenizer
Tokenizer using segtok, a third-party library for rule-based tokenization of Indo-European languages.
For further details see: fnl/segtok
- __init__(additional_split_characters=None)
Initializes the SegtokTokenizer with an optional parameter for additional characters that should always be split.
The default behavior uses simple rules to split text into tokens. If you want to ensure that certain characters always become their own token, you can change the default behavior by setting the additional_split_characters parameter.
- Parameters:
additional_split_characters (Optional[list[str]]) – An optional list of characters that should always be split. For instance, if you want to make sure that paragraph symbols (§) always become their own token, instantiate with additional_split_characters=['§'].
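To illustrate what "always split" means, here is a minimal, library-independent sketch. It does not use segtok or flair: `whitespace_tokenize` is a hypothetical stand-in for the underlying tokenizer, and `tokenize_with_splits` is a hypothetical helper that assumes the extra characters are isolated by padding them with spaces before tokenizing. The actual SegtokTokenizer may handle this differently.

```python
from typing import List, Optional


def whitespace_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for the real tokenizer: split on whitespace only."""
    return text.split()


def tokenize_with_splits(
    text: str, additional_split_characters: Optional[List[str]] = None
) -> List[str]:
    """Force each listed character to become its own token.

    Assumption for illustration: pad every occurrence of the character
    with spaces, then hand the text to the base tokenizer.
    """
    if additional_split_characters:
        for char in additional_split_characters:
            text = text.replace(char, f" {char} ")
    return whitespace_tokenize(text)


# Without extra split characters, "§12" stays attached in this stand-in.
default_tokens = tokenize_with_splits("See §12 now")

# With "§" listed, the symbol is separated from the number.
split_tokens = tokenize_with_splits("See §12 now", ["§"])
```

With flair installed, the corresponding call would be `SegtokTokenizer(additional_split_characters=['§']).tokenize(text)`, using the constructor and method documented on this page. The exact tokens depend on segtok's rules.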
Methods
__init__([additional_split_characters]): Initializes the SegtokTokenizer with an optional parameter for additional characters that should always be split.
run_tokenize(text)
tokenize(text)
Attributes
name
- tokenize(text)
- Return type: list[str]
- static run_tokenize(text)
- Return type: list[str]