flair.datasets.sequence_labeling.JsonlDataset#
- class flair.datasets.sequence_labeling.JsonlDataset(path_to_jsonl_file, encoding='utf-8', text_column_name='data', label_column_name='label', metadata_column_name='metadata', label_type='ner', use_tokenizer=True)View on GitHub#
Bases:
FlairDataset
- __init__(path_to_jsonl_file, encoding='utf-8', text_column_name='data', label_column_name='label', metadata_column_name='metadata', label_type='ner', use_tokenizer=True)View on GitHub#
Instantiates a JsonlDataset and converts all annotated char spans to token tags using the IOB scheme.
The expected file format is:
{ "<text_column_name>": "<text>", "<label_column_name>": [[<start_char_index>, <end_char_index>, <label>],...], "<metadata_column_name>": [[<metadata_key>, <metadata_value>],...] }
- Parameters:
path_to_jsonl_file (
Union
[str
,Path
]) – File to readencoding (
str
) – file encoding (default “utf-8”)text_column_name (
str
) – Name of the text columnlabel_column_name (
str
) – Name of the label columnmetadata_column_name (
str
) – Name of the metadata columnlabel_type (
str
) – The type of label to predict (default “ner”)use_tokenizer (
Union
[bool
,Tokenizer
]) – Specify a custom tokenizer to split the text into tokens.
Methods
__init__
(path_to_jsonl_file[, encoding, ...])Instantiates a JsonlDataset and converts all annotated char spans to token tags using the IOB scheme.
- is_in_memory()View on GitHub#
- Return type:
bool