flair.datasets.sequence_labeling.JsonlDataset

class flair.datasets.sequence_labeling.JsonlDataset(path_to_jsonl_file, encoding='utf-8', text_column_name='data', label_column_name='label', metadata_column_name='metadata', label_type='ner', use_tokenizer=True)

Bases: FlairDataset

__init__(path_to_jsonl_file, encoding='utf-8', text_column_name='data', label_column_name='label', metadata_column_name='metadata', label_type='ner', use_tokenizer=True)

Instantiates a JsonlDataset and converts all annotated char spans to token tags using the IOB scheme.
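To illustrate what converting character spans to token tags involves, here is a minimal sketch in plain Python. It is not Flair's implementation: the helper name char_spans_to_iob is hypothetical, tokens are split on whitespace only, the end index is assumed to be exclusive, and the first token of every span is tagged B- (an IOB2-style choice made for this example).

```python
from typing import List, Sequence, Tuple


def char_spans_to_iob(
    text: str, spans: Sequence[Tuple[int, int, str]]
) -> List[Tuple[str, str]]:
    """Hypothetical helper: tag whitespace tokens with IOB labels.

    Assumes each span is (start_char, end_char, label) with an
    exclusive end index. This is an illustration, not Flair's code.
    """
    # Locate every whitespace-separated token and record its character offsets.
    tokens = []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        tokens.append((tok, start, end))
        pos = end

    # Every token starts as "O" (outside any span).
    tags = ["O"] * len(tokens)
    for span_start, span_end, label in spans:
        inside = False
        for i, (_, start, end) in enumerate(tokens):
            # A token belongs to the span if it lies fully within it.
            if start >= span_start and end <= span_end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True

    return [(tok, tag) for (tok, _, _), tag in zip(tokens, tags)]


example = char_spans_to_iob(
    "George Washington went to Washington",
    [(0, 17, "PER"), (26, 36, "LOC")],
)
print(example)
```

The two-token span "George Washington" yields a B-PER tag followed by an I-PER tag, while the single-token span at the end yields only B-LOC.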

Each line of the file is expected to be a JSON object of the form:

{
    "<text_column_name>": "<text>",
    "<label_column_name>": [[<start_char_index>, <end_char_index>, <label>],...],
    "<metadata_column_name>": [[<metadata_key>, <metadata_value>],...]
}
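As a concrete instance of this format, the sketch below writes two records to a temporary file using only the standard library, with the default column names ("data", "label", "metadata"). The sample sentences, the "source" metadata key, and the file name are invented for illustration, and the end index is assumed to be exclusive.

```python
import json
import tempfile
from pathlib import Path

# Two illustrative records using the default column names.
records = [
    {
        "data": "George Washington went to Washington",
        "label": [[0, 17, "PER"], [26, 36, "LOC"]],
        "metadata": [["source", "example"]],
    },
    {
        "data": "Berlin is a city",
        "label": [[0, 6, "LOC"]],
        "metadata": [],
    },
]

# JSONL: one JSON object per line.
jsonl_path = Path(tempfile.mkdtemp()) / "sample.jsonl"
with jsonl_path.open("w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back to confirm every line parses on its own.
with jsonl_path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

# Under the exclusive-end assumption, slicing recovers the span text.
first_start, first_end, first_label = loaded[0]["label"][0]
print(loaded[0]["data"][first_start:first_end], first_label)
```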
Parameters:
  • path_to_jsonl_file (Union[str, Path]) – Path to the JSONL file to read

  • encoding (str) – Encoding of the file (default “utf-8”)

  • text_column_name (str) – Name of the text column (default “data”)

  • label_column_name (str) – Name of the label column (default “label”)

  • metadata_column_name (str) – Name of the metadata column (default “metadata”)

  • label_type (str) – The type of label to predict (default “ner”)

  • use_tokenizer (Union[bool, Tokenizer]) – Tokenizer used to split the text into tokens. Pass a Tokenizer instance to use a custom tokenizer, or a bool to select one of Flair's built-in tokenizers (default True).
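The parameters above can be combined as in the following sketch, which loads a file whose columns do not use the default names. The column names "text", "entities", and "meta", the sample row, and the build_dataset helper are hypothetical. Flair is imported inside the helper so that the snippet can be defined without Flair installed.

```python
import json
import tempfile
from pathlib import Path


def build_dataset(path):
    """Hypothetical helper: load a JSONL file with non-default column names."""
    # Imported lazily so this function can be defined without Flair installed.
    from flair.datasets.sequence_labeling import JsonlDataset

    return JsonlDataset(
        path,
        encoding="utf-8",
        text_column_name="text",          # key holding the raw text
        label_column_name="entities",     # key holding [start, end, label] triples
        metadata_column_name="meta",      # key holding [key, value] pairs
        label_type="ner",
        use_tokenizer=True,
    )


# Write a one-line sample file that matches the column names used above.
sample_path = Path(tempfile.mkdtemp()) / "train.jsonl"
row = {
    "text": "Berlin is a city",
    "entities": [[0, 6, "LOC"]],
    "meta": [["source", "demo"]],
}
sample_path.write_text(json.dumps(row) + "\n", encoding="utf-8")
```

With Flair installed, calling build_dataset(sample_path) should return a dataset that holds one sentence for this file, and calling is_in_memory() on it returns a bool, as documented under Methods.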

Methods

__init__(path_to_jsonl_file[, encoding, ...])

Instantiates a JsonlDataset and converts all annotated char spans to token tags using the IOB scheme.

is_in_memory()

is_in_memory()
Return type:

bool