flair.datasets.text_text.DataPairDataset
- class flair.datasets.text_text.DataPairDataset(path_to_data, columns=[0, 1, 2], max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, label_type=None, skip_first_line=False, separator='\\t', encoding='utf-8', label=True)
Bases: FlairDataset
- __init__(path_to_data, columns=[0, 1, 2], max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, label_type=None, skip_first_line=False, separator='\\t', encoding='utf-8', label=True)
Creates a Dataset for pairs of sentences/paragraphs.
The file needs to be in a column format, where each line has a column for the first sentence/paragraph, a column for the second sentence/paragraph, and a column for the label, separated by e.g. '\t' (just like in the GLUE RTE dataset, https://gluebenchmark.com/tasks). For each data pair we create a flair.data.DataPair object.
- Parameters:
path_to_data (Union[str, Path]) – Path to the data file
columns (list[int]) – List of integers that indicate the respective columns. The first entry is the column for the first sentence, the second for the second sentence and the third for the label. Default [0, 1, 2]
max_tokens_per_doc – If set, shortens sentences to this maximum number of tokens
max_chars_per_doc – If set, shortens sentences to this maximum number of characters
use_tokenizer – Whether to use the built-in tokenizer
in_memory (bool) – If True, the data is stored as a list of flair.data.DataPair objects; otherwise it is kept as lists of plain strings, which need less space
label_type (Optional[str]) – Name of the label of the data pairs
skip_first_line (bool) – If True, the first line of the data file is ignored
separator (str) – Separator between columns in the data file
encoding (str) – Encoding of the data file
label (bool) – If False, the dataset expects unlabeled data
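The column layout described above can be illustrated with a small standard-library sketch. This is not flair's implementation: `read_pairs` is a hypothetical helper, and the sample rows and the label names are made up for illustration. It only mirrors how `columns`, `skip_first_line`, `separator`, `encoding` and `label` are documented to select fields from each line.

```python
import tempfile
from pathlib import Path


def read_pairs(path, columns=(0, 1, 2), skip_first_line=False,
               separator="\t", encoding="utf-8", label=True):
    """Hypothetical helper (not part of flair): read (first, second, label) tuples."""
    pairs = []
    with open(path, encoding=encoding) as f:
        if skip_first_line:
            next(f, None)  # drop the header row
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            fields = line.rstrip("\n").split(separator)
            first = fields[columns[0]]
            second = fields[columns[1]]
            # For unlabeled data the third column index is never read.
            tag = fields[columns[2]] if label else None
            pairs.append((first, second, tag))
    return pairs


# Made-up sample in a GLUE-RTE-like layout: index, sentence1, sentence2, label.
sample = (
    "index\tsentence1\tsentence2\tlabel\n"
    "0\tA man is eating.\tSomeone eats.\tentailment\n"
    "1\tA dog runs.\tA cat sleeps.\tnot_entailment\n"
)
data_file = Path(tempfile.mkdtemp()) / "pairs.tsv"
data_file.write_text(sample, encoding="utf-8")

# The text columns are 1 and 2 and the label is column 3, so columns=(1, 2, 3).
pairs = read_pairs(data_file, columns=(1, 2, 3), skip_first_line=True)
unlabeled = read_pairs(data_file, columns=(1, 2, 3), skip_first_line=True, label=False)
print(pairs)
```

Assuming flair is installed, the same file would presumably be loaded with `DataPairDataset(data_file, columns=[1, 2, 3], skip_first_line=True, label_type="entailment")`, where the `label_type` value is an arbitrary name chosen for this example.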
Methods
__init__(path_to_data[, columns, ...]) – Creates a Dataset for pairs of sentences/paragraphs.
- is_in_memory()
- Return type:
bool