flair.datasets.sequence_labeling.JsonlDataset#

class flair.datasets.sequence_labeling.JsonlDataset(path_to_jsonl_file, encoding='utf-8', text_column_name='data', label_column_name='label', metadata_column_name='metadata', label_type='ner', use_tokenizer=True)View on GitHub #

Bases: FlairDataset

__init__(path_to_jsonl_file, encoding='utf-8', text_column_name='data', label_column_name='label', metadata_column_name='metadata', label_type='ner', use_tokenizer=True)View on GitHub #

Instantiates a JsonlDataset and converts all annotated char spans to token tags using the IOB scheme.

The expected file format is:

{
    "<text_column_name>": "<text>",
    "<label_column_name>": [[<start_char_index>, <end_char_index>, <label>],...],
    "<metadata_column_name>": [[<metadata_key>, <metadata_value>],...]
}

Parameters:

path_to_jsonl_file (Union[str, Path]) – File to read
encoding (str) – file encoding (default “utf-8”)
text_column_name (str) – Name of the text column
label_column_name (str) – Name of the label column
metadata_column_name (str) – Name of the metadata column
label_type (str) – The type of label to predict (default “ner”)
use_tokenizer (Union[bool, Tokenizer]) – Specify a custom tokenizer to split the text into tokens.

Methods

`__init__`(path_to_jsonl_file[, encoding, ...])	Instantiates a JsonlDataset and converts all annotated char spans to token tags using the IOB scheme.
`is_in_memory`()	Returns True if the entire dataset is currently loaded in memory, False otherwise.

is_in_memory()View on GitHub #

Returns True if the entire dataset is currently loaded in memory, False otherwise.

Return type:: bool

Table of Contents

flair.datasets.sequence_labeling.JsonlDataset#