flair.datasets.text_text.DataPairDataset

class flair.datasets.text_text.DataPairDataset(path_to_data, columns=[0, 1, 2], max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, label_type=None, skip_first_line=False, separator='\t', encoding='utf-8', label=True)

Bases: FlairDataset

__init__(path_to_data, columns=[0, 1, 2], max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, label_type=None, skip_first_line=False, separator='\t', encoding='utf-8', label=True)

Creates a Dataset for pairs of sentences/paragraphs.

The file must be in a column format, where each line has one column for the first sentence/paragraph, one for the second sentence/paragraph, and one for the label, separated by e.g. '\t' (as in the GLUE RTE dataset, https://gluebenchmark.com/tasks). For each data pair we create a flair.data.DataPair object.

Parameters:
  • path_to_data (Union[str, Path]) – path to the data file

  • columns (list[int]) – list of integers that indicate the respective columns. The first entry is the column for the first sentence, the second for the second sentence and the third for the label. Default [0,1,2]

  • max_tokens_per_doc – If set, shortens sentences to this maximum number of tokens

  • max_chars_per_doc – If set, shortens sentences to this maximum number of characters

  • use_tokenizer – Whether to use the built-in tokenizer

  • in_memory (bool) – If True, the data is stored as a list of flair.data.DataPair objects; otherwise it is kept as lists of plain strings, which need less memory

  • label_type (Optional[str]) – Name of the label of the data pairs

  • skip_first_line (bool) – If True, first line of data file will be ignored

  • separator (str) – Separator between columns in the data file

  • encoding (str) – Encoding of the data file

  • label (bool) – If False, the dataset expects unlabeled data
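As a rough illustration of the column format and parameters described above, the sketch below writes a small tab-separated file with a header line and shows how it might be passed to DataPairDataset. The helper names (write_pair_file, load_pairs), the example sentences, the header names, and the label_type value "entailment" are assumptions made for this example and are not part of flair. The flair import is kept inside load_pairs so that the file-writing part runs without flair installed.

```python
import tempfile
from pathlib import Path


def write_pair_file(path: Path, rows: list[tuple[str, str, str]]) -> None:
    """Write a header plus (first text, second text, label) rows, tab-separated."""
    with path.open("w", encoding="utf-8") as handle:
        # Header line; it is skipped at load time via skip_first_line=True.
        handle.write("sentence1\tsentence2\tlabel\n")
        for first, second, label in rows:
            handle.write(f"{first}\t{second}\t{label}\n")


def load_pairs(path: Path):
    """Load the file with flair (hypothetical helper; requires flair to be installed)."""
    from flair.datasets.text_text import DataPairDataset

    return DataPairDataset(
        path,
        columns=[0, 1, 2],        # first text, second text, label
        skip_first_line=True,     # ignore the header written above
        separator="\t",
        encoding="utf-8",
        label_type="entailment",  # assumed label name for this example
    )


# Made-up example rows in the three-column layout.
rows = [
    ("A man is playing a guitar.", "Someone is making music.", "entailment"),
    ("A man is playing a guitar.", "The room is silent.", "not_entailment"),
]
data_file = Path(tempfile.mkdtemp()) / "pairs.tsv"
write_pair_file(data_file, rows)
```

With flair installed, calling load_pairs(data_file) should build the dataset from this file. For unlabeled data, the label column would be left out of the file and label=False passed instead.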

Methods

__init__(path_to_data[, columns, ...])

Creates a Dataset for pairs of sentences/paragraphs.

is_in_memory()

is_in_memory()
Return type:

bool