flair.datasets.text_text#

class flair.datasets.text_text.ParallelTextCorpus(source_file, target_file, name, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, **corpusargs)#

Bases: Corpus

__init__(source_file, target_file, name, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, **corpusargs)#

Instantiates a Corpus from line-aligned source and target text files.

Parameters:
  • source_file – path to the file containing the source-language text (one sentence per line)

  • target_file – path to the file containing the target-language text, aligned line by line with the source file

  • name – name of the corpus

  • use_tokenizer – whether to use the in-built tokenizer

  • max_tokens_per_doc – if set, shortens sentences to this maximum number of tokens

  • max_chars_per_doc – if set, shortens sentences to this maximum number of characters

  • in_memory – if True, keeps the dataset fully in memory

Returns:

a Corpus with train, dev and test data
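For illustration, a minimal usage sketch (the file paths and corpus name below are hypothetical; the source and target files are expected to be aligned line by line):

```python
from flair.datasets.text_text import ParallelTextCorpus

# hypothetical paths: line i of the source file must correspond to
# line i of the target file
corpus = ParallelTextCorpus(
    source_file="data/news.en",
    target_file="data/news.de",
    name="news_en_de",
    max_tokens_per_doc=128,  # truncate overly long sentences
)
print(corpus)
```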

is_in_memory()#
Return type:

bool

class flair.datasets.text_text.OpusParallelCorpus(dataset, l1, l2, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, **corpusargs)#

Bases: ParallelTextCorpus

__init__(dataset, l1, l2, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, **corpusargs)#

Instantiates a Parallel Corpus from OPUS.

See http://opus.nlpl.eu/.

Parameters:
  • dataset (str) – Name of the dataset (one of “tatoeba”)

  • l1 (str) – Language code of first language in pair (“en”, “de”, etc.)

  • l2 (str) – Language code of second language in pair (“en”, “de”, etc.)

  • use_tokenizer (bool) – Whether or not to use in-built tokenizer

  • max_tokens_per_doc – If set, shortens sentences to this maximum number of tokens

  • max_chars_per_doc – If set, shortens sentences to this maximum number of characters

  • in_memory (bool) – If True, keeps dataset fully in memory
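A minimal sketch of loading the Tatoeba English-German pair from OPUS:

```python
from flair.datasets.text_text import OpusParallelCorpus

# "tatoeba" is the dataset name supported per the parameter list above
corpus = OpusParallelCorpus(dataset="tatoeba", l1="en", l2="de")
print(corpus)
```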

class flair.datasets.text_text.ParallelTextDataset(path_to_source, path_to_target, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True)#

Bases: FlairDataset

is_in_memory()#
Return type:

bool

class flair.datasets.text_text.DataPairCorpus(data_folder, columns=[0, 1, 2], train_file=None, test_file=None, dev_file=None, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, label_type=None, autofind_splits=True, sample_missing_splits=True, skip_first_line=False, separator='\t', encoding='utf-8')#

Bases: Corpus

__init__(data_folder, columns=[0, 1, 2], train_file=None, test_file=None, dev_file=None, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, label_type=None, autofind_splits=True, sample_missing_splits=True, skip_first_line=False, separator='\t', encoding='utf-8')#

Corpus for tasks involving pairs of sentences or paragraphs.

The data files are expected to be in column format where each line has a column for the first sentence/paragraph, the second sentence/paragraph and the label, respectively. The columns must be separated by a given separator (default: '\t').

Parameters:
  • data_folder (Union[str, Path]) – base folder with the task data

  • columns (list[int]) – List that indicates the columns for the first sentence (first entry in the list), the second sentence (second entry) and label (last entry). default = [0,1,2]

  • train_file – the name of the train file

  • test_file – the name of the test file, if None, test data is sampled from train (if sample_missing_splits is true)

  • dev_file – the name of the dev file, if None, dev data is sampled from train (if sample_missing_splits is true)

  • use_tokenizer (bool) – Whether or not to use in-built tokenizer

  • max_tokens_per_doc – If set, shortens sentences to this maximum number of tokens

  • max_chars_per_doc – If set, shortens sentences to this maximum number of characters

  • in_memory (bool) – If True, data will be saved in list of flair.data.DataPair objects, otherwise we use lists with simple strings which need less space

  • label_type (Optional[str]) – Name of the label of the data pairs

  • autofind_splits – If True, train/test/dev files will be automatically identified in the given data_folder

  • sample_missing_splits (bool) – If True, a missing train/test/dev file will be sampled from the available data

  • skip_first_line (bool) – If True, first line of data files will be ignored

  • separator (str) – Separator between columns in data files

  • encoding (str) – Encoding of data files

Returns:

a Corpus with annotated train, dev and test data
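A minimal usage sketch, assuming a hypothetical folder data/pair_task containing tab-separated train/dev/test files whose columns are first text, second text, label:

```python
from flair.datasets.text_text import DataPairCorpus

# hypothetical folder; with autofind_splits=True (the default), files
# whose names contain "train", "dev" and "test" are picked up automatically
corpus = DataPairCorpus(
    "data/pair_task",
    columns=[0, 1, 2],
    label_type="entailment",
    skip_first_line=True,  # skip a header row, if present
)
print(corpus)
```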

class flair.datasets.text_text.DataPairDataset(path_to_data, columns=[0, 1, 2], max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, label_type=None, skip_first_line=False, separator='\t', encoding='utf-8', label=True)#

Bases: FlairDataset

__init__(path_to_data, columns=[0, 1, 2], max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, label_type=None, skip_first_line=False, separator='\t', encoding='utf-8', label=True)#

Creates a Dataset for pairs of sentences/paragraphs.

The file needs to be in a column format, where each line has a column for the first sentence/paragraph, the second sentence/paragraph and the label, separated by a given separator, e.g. '\t' (as in the GLUE RTE dataset, https://gluebenchmark.com/tasks). For each data pair we create a flair.data.DataPair object.

Parameters:
  • path_to_data (Union[str, Path]) – path to the data file

  • columns (list[int]) – list of integers that indicate the respective columns. The first entry is the column for the first sentence, the second for the second sentence and the third for the label. Default [0,1,2]

  • max_tokens_per_doc – If set, shortens sentences to this maximum number of tokens

  • max_chars_per_doc – If set, shortens sentences to this maximum number of characters

  • use_tokenizer – Whether to use in-built tokenizer

  • in_memory (bool) – If True, data will be saved in list of flair.data.DataPair objects, otherwise we use lists with simple strings which need less space

  • label_type (Optional[str]) – Name of the label of the data pairs

  • skip_first_line (bool) – If True, first line of data file will be ignored

  • separator (str) – Separator between columns in the data file

  • encoding (str) – Encoding of the data file

  • label (bool) – If False, the dataset expects unlabeled data
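A single file can also be loaded directly as a dataset (path, separator and label name below are hypothetical):

```python
from flair.datasets.text_text import DataPairDataset

# hypothetical tab-separated file: <first text>\t<second text>\t<label>
dataset = DataPairDataset(
    "data/pairs.tsv",
    columns=[0, 1, 2],
    label_type="entailment",
)
print(len(dataset), dataset[0])
```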

is_in_memory()#
Return type:

bool

class flair.datasets.text_text.DataTripleCorpus(data_folder, columns=[0, 1, 2, 3], train_file=None, test_file=None, dev_file=None, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, label_type=None, autofind_splits=True, sample_missing_splits=True, skip_first_line=False, separator='\t', encoding='utf-8')#

Bases: Corpus

__init__(data_folder, columns=[0, 1, 2, 3], train_file=None, test_file=None, dev_file=None, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, label_type=None, autofind_splits=True, sample_missing_splits=True, skip_first_line=False, separator='\t', encoding='utf-8')#

Corpus for tasks involving triples of sentences or paragraphs.

The data files are expected to be in column format where each line has a column for the first sentence/paragraph, the second sentence/paragraph, the third sentence/paragraph and the label, respectively. The columns must be separated by a given separator (default: '\t').

Parameters:
  • data_folder (Union[str, Path]) – base folder with the task data

  • columns (list[int]) – List that indicates the columns for the first sentence (first entry in the list), the second sentence (second entry), the third sentence (third entry), and label (last entry). default = [0,1,2,3]

  • train_file – the name of the train file

  • test_file – the name of the test file, if None, test data is sampled from train (if sample_missing_splits is true)

  • dev_file – the name of the dev file, if None, dev data is sampled from train (if sample_missing_splits is true)

  • use_tokenizer (bool) – Whether or not to use in-built tokenizer

  • max_tokens_per_doc – If set, shortens sentences to this maximum number of tokens

  • max_chars_per_doc – If set, shortens sentences to this maximum number of characters

  • in_memory (bool) – If True, data will be saved in list of flair.data.DataTriple objects, otherwise we use lists with simple strings which need less space

  • label_type (Optional[str]) – Name of the label of the data triples

  • autofind_splits – If True, train/test/dev files will be automatically identified in the given data_folder

  • sample_missing_splits (bool) – If True, a missing train/test/dev file will be sampled from the available data

  • skip_first_line (bool) – If True, the first line of data files will be ignored

  • separator (str) – Separator between columns in data files

  • encoding (str) – Encoding of data files

Returns:

a Corpus with annotated train, dev, and test data
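Usage mirrors DataPairCorpus, with one extra text column (folder and label name below are hypothetical):

```python
from flair.datasets.text_text import DataTripleCorpus

# hypothetical folder with tab-separated files of the form
# <first text>\t<second text>\t<third text>\t<label>
corpus = DataTripleCorpus(
    "data/triple_task",
    columns=[0, 1, 2, 3],
    label_type="relevance",  # hypothetical label name
)
print(corpus)
```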

class flair.datasets.text_text.DataTripleDataset(path_to_data, columns=[0, 1, 2, 3], max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, label_type=None, skip_first_line=False, separator='\t', encoding='utf-8', label=True)#

Bases: FlairDataset

__init__(path_to_data, columns=[0, 1, 2, 3], max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, label_type=None, skip_first_line=False, separator='\t', encoding='utf-8', label=True)#

Creates a Dataset for triples of sentences/paragraphs.

The file needs to be in a column format, where each line has a column for the first sentence/paragraph, the second sentence/paragraph, the third sentence/paragraph and the label, separated by a given separator, e.g. '\t' (as in the GLUE RTE dataset, https://gluebenchmark.com/tasks). For each data triple we create a flair.data.DataTriple object.

Parameters:
  • path_to_data (Union[str, Path]) – path to the data file

  • columns (list[int]) – list of integers that indicate the respective columns. The first entry is the column for the first sentence, the second for the second sentence, the third for the third sentence, and the fourth for the label. Default [0, 1, 2, 3]

  • max_tokens_per_doc – If set, shortens sentences to this maximum number of tokens

  • max_chars_per_doc – If set, shortens sentences to this maximum number of characters

  • use_tokenizer – Whether or not to use the in-built tokenizer

  • in_memory (bool) – If True, data will be saved in a list of flair.data.DataTriple objects, otherwise we use lists with simple strings which need less space

  • label_type (Optional[str]) – Name of the label of the data triples

  • skip_first_line (bool) – If True, the first line of the data file will be ignored

  • separator (str) – Separator between columns in the data file

  • encoding (str) – Encoding of the data file

  • label (bool) – If False, the dataset expects unlabeled data

is_in_memory()#
Return type:

bool

class flair.datasets.text_text.GLUE_RTE(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)#

Bases: DataPairCorpus

__init__(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)#

Creates a DataPairCorpus for the GLUE Recognizing Textual Entailment (RTE) data.

See https://gluebenchmark.com/tasks. In addition to the Corpus, there is an eval_dataset containing the test file of the GLUE data. This file contains unlabeled test data to evaluate models on the GLUE RTE task.

tsv_from_eval_dataset(folder_path)#
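
A sketch of the intended workflow: load the corpus, predict on the unlabeled eval_dataset with a trained model, then write the predictions to a tsv file. The prediction steps are commented out because `classifier` is a hypothetical stand-in for a trained flair text-pair model:

```python
from flair.datasets.text_text import GLUE_RTE

corpus = GLUE_RTE()  # downloads the GLUE RTE data on first use

# `classifier` stands in for a trained flair model (hypothetical):
# classifier.predict(corpus.eval_dataset, label_name="entailment")
# corpus.tsv_from_eval_dataset("submission/")
```
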
class flair.datasets.text_text.GLUE_MNLI(label_type='entailment', evaluate_on_matched=True, base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)#

Bases: DataPairCorpus

__init__(label_type='entailment', evaluate_on_matched=True, base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)#

Creates a DataPairCorpus for the Multi-Genre Natural Language Inference Corpus (MNLI) from the GLUE benchmark.

See https://gluebenchmark.com/tasks. Entailment annotations are: entailment, contradiction, neutral. This corpus includes two dev sets (matched/mismatched) and two unlabeled test sets: eval_dataset_matched and eval_dataset_mismatched.

tsv_from_eval_dataset(folder_path)#
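
A minimal sketch; evaluate_on_matched presumably selects whether the matched or mismatched dev data is used for evaluation, while both unlabeled test sets stay accessible:

```python
from flair.datasets.text_text import GLUE_MNLI

# use the matched dev data for evaluation
corpus = GLUE_MNLI(evaluate_on_matched=True)

# both unlabeled test sets remain available as attributes
print(len(corpus.eval_dataset_matched))
print(len(corpus.eval_dataset_mismatched))
```
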
class flair.datasets.text_text.GLUE_MRPC(label_type='paraphrase', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)#

Bases: DataPairCorpus

__init__(label_type='paraphrase', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)#

Creates a DataPairCorpus for the Microsoft Research Paraphrase Corpus (MRPC) from the GLUE benchmark.

See https://gluebenchmark.com/tasks. MRPC includes annotated train and test sets; the dev set is sampled from train each time the corpus is created.

tsv_from_eval_dataset(folder_path)#
class flair.datasets.text_text.GLUE_QNLI(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)#

Bases: DataPairCorpus

__init__(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)#

Creates a DataPairCorpus for the Question-answering Natural Language Inference dataset (QNLI) from GLUE.

See https://gluebenchmark.com/tasks. In addition to the Corpus, there is an eval_dataset containing the test file of the GLUE data. This file contains unlabeled test data to evaluate models on the GLUE QNLI task.

tsv_from_eval_dataset(folder_path)#
class flair.datasets.text_text.GLUE_QQP(label_type='paraphrase', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)#

Bases: DataPairCorpus

__init__(label_type='paraphrase', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)#

Creates a Quora Question Pairs (QQP) Corpus from the GLUE benchmark.

See https://gluebenchmark.com/tasks. The task is to determine whether a pair of questions are semantically equivalent. In addition to the Corpus, there is an eval_dataset containing the test file of the GLUE data. This file contains unlabeled test data to evaluate models on the GLUE QQP task.

tsv_from_eval_dataset(folder_path)#
class flair.datasets.text_text.GLUE_WNLI(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)#

Bases: DataPairCorpus

__init__(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)#

Creates a Winograd Schema Challenge Corpus formatted as a Natural Language Inference task (WNLI).

The task is to predict if the sentence with the pronoun substituted is entailed by the original sentence. In addition to the Corpus, there is an eval_dataset containing the test file of the GLUE data. This file contains unlabeled test data to evaluate models on the GLUE WNLI task.

tsv_from_eval_dataset(folder_path)#
class flair.datasets.text_text.GLUE_STSB(label_type='similarity', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)#

Bases: DataPairCorpus

tsv_from_eval_dataset(folder_path)#

Create a tsv file of the predictions on the eval_dataset.

After calling classifier.predict(corpus.eval_dataset, label_name='similarity'), this function can be used to produce a file called STS-B.tsv suitable for submission to the GLUE benchmark.
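A sketch of that submission workflow; the prediction steps are commented out because `regressor` is a hypothetical stand-in for a trained flair model:

```python
from flair.datasets.text_text import GLUE_STSB

corpus = GLUE_STSB()  # downloads the GLUE STS-B data on first use

# `regressor` stands in for a trained flair model (hypothetical):
# regressor.predict(corpus.eval_dataset, label_name="similarity")
#
# writes STS-B.tsv into the given folder, ready for GLUE submission:
# corpus.tsv_from_eval_dataset("submission/")
```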

class flair.datasets.text_text.SUPERGLUE_RTE(base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)#

Bases: DataPairCorpus

__init__(base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)#

Creates a DataPairCorpus for the SuperGLUE Recognizing Textual Entailment (RTE) data.

See https://super.gluebenchmark.com/tasks. In addition to the Corpus, there is an eval_dataset containing the test file of the SuperGLUE data. This file contains unlabeled test data to evaluate models on the SuperGLUE RTE task.

jsonl_from_eval_dataset(folder_path)#
flair.datasets.text_text.rte_jsonl_to_tsv(file_path, label=True, remove=False, encoding='utf-8')#
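
A sketch tying these together: jsonl_from_eval_dataset is the jsonl counterpart of tsv_from_eval_dataset, and the module-level helper rte_jsonl_to_tsv converts an RTE jsonl file into the tsv format used by the corpora above. The conversion line is commented out because the path is hypothetical:

```python
from flair.datasets.text_text import SUPERGLUE_RTE, rte_jsonl_to_tsv

corpus = SUPERGLUE_RTE()  # downloads the SuperGLUE RTE data on first use

# convert an RTE jsonl file to tsv; label=False for unlabeled data
# (path hypothetical):
# rte_jsonl_to_tsv("data/RTE/test.jsonl", label=False)
```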