flair.datasets.text_text.DataTripleDataset

class flair.datasets.text_text.DataTripleDataset(path_to_data, columns=[0, 1, 2, 3], max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, label_type=None, skip_first_line=False, separator='\t', encoding='utf-8', label=True)

Bases: FlairDataset

__init__(path_to_data, columns=[0, 1, 2, 3], max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, label_type=None, skip_first_line=False, separator='\t', encoding='utf-8', label=True)

Creates a Dataset for triples of sentences/paragraphs.

The file needs to be in a column format, where each line has a column for the first sentence/paragraph, the second sentence/paragraph, the third sentence/paragraph, and the label, separated by e.g. '\t' (similar to the GLUE RTE dataset, https://gluebenchmark.com/tasks). For each data triple, a flair.data.DataTriple object is created.

Parameters:
  • path_to_data (Union[str, Path]) – path to the data file

  • columns (list[int]) – list of integers that indicate the respective columns. The first entry is the column for the first sentence, the second for the second sentence, the third for the third sentence, and the fourth for the label. Default: [0, 1, 2, 3]

  • max_tokens_per_doc – If set, shortens sentences to this maximum number of tokens

  • max_chars_per_doc – If set, shortens sentences to this maximum number of characters

  • use_tokenizer – Whether or not to use the built-in tokenizer

  • in_memory (bool) – If True, data will be saved in a list of flair.data.DataTriple objects; otherwise lists of plain strings are used, which need less space

  • label_type (Optional[str]) – Name of the label of the data triples

  • skip_first_line (bool) – If True, the first line of the data file will be ignored

  • separator (str) – Separator between columns in the data file

  • encoding (str) – Encoding of the data file

  • label (bool) – If False, the dataset expects unlabeled data
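The column layout described above can be illustrated with a minimal sketch that uses only the Python standard library. It does not call flair, and `read_triples` is a hypothetical helper, not part of the library; it only mimics how `columns`, `separator`, `skip_first_line`, `encoding`, and `label` might select fields from each line. The example sentences and labels are invented for illustration.

```python
import tempfile
from pathlib import Path


def read_triples(path, columns=(0, 1, 2, 3), separator="\t",
                 skip_first_line=False, encoding="utf-8", label=True):
    """Hypothetical helper (not flair's implementation): parse a column file
    into (first, second, third, label) tuples."""
    triples = []
    with open(path, encoding=encoding) as f:
        if skip_first_line:
            f.readline()  # drop the header row
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # ignore empty lines
            fields = line.split(separator)
            first = fields[columns[0]]
            second = fields[columns[1]]
            third = fields[columns[2]]
            # For unlabeled data there is no label column to read.
            tag = fields[columns[3]] if label else None
            triples.append((first, second, third, tag))
    return triples


# Invented example data: a header row followed by two labeled triples.
rows = [
    "first\tsecond\tthird\tlabel",
    "A man plays guitar.\tA person makes music.\tSomeone is performing.\tentailment",
    "The sky is clear.\tIt is raining.\tThe ground is wet.\tcontradiction",
]

with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "triples.tsv"
    data_file.write_text("\n".join(rows) + "\n", encoding="utf-8")
    triples = read_triples(data_file, skip_first_line=True)

for first, second, third, tag in triples:
    print(tag, "|", first, "|", second, "|", third)
```

Going by the signature documented here, a file of this shape would be passed to the dataset as `DataTripleDataset(data_file, columns=[0, 1, 2, 3], skip_first_line=True, label_type="entailment")`, where the `label_type` value is an arbitrary name chosen for this example.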

Methods

__init__(path_to_data[, columns, ...])

Creates a Dataset for triples of sentences/paragraphs.

is_in_memory()

is_in_memory()
Return type:

bool