flair.datasets.text_text
- class flair.datasets.text_text.ParallelTextCorpus(source_file, target_file, name, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, **corpusargs)
Bases:
Corpus
- __init__(source_file, target_file, name, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, **corpusargs)
Instantiates a Corpus from two line-aligned text files, pairing line i of the source file with line i of the target file.
- Parameters:
source_file – path to the file with the source-language sentences
target_file – path to the file with the aligned target-language sentences
name – name of the corpus
use_tokenizer – Whether or not to use in-built tokenizer
max_tokens_per_doc – If set, shortens sentences to this maximum number of tokens
max_chars_per_doc – If set, shortens sentences to this maximum number of characters
in_memory – If True, keeps dataset fully in memory
- Returns:
a Corpus of sentence pairs with train, dev and test data
- is_in_memory()
- Return type:
bool
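A minimal usage sketch, assuming two line-aligned text files exist; the paths and corpus name here are illustrative:

```python
from flair.datasets.text_text import ParallelTextCorpus

# Hypothetical line-aligned files: line i of the source file is the
# translation of line i of the target file.
corpus = ParallelTextCorpus(
    source_file="data/train.en",  # assumed path
    target_file="data/train.de",  # assumed path
    name="en-de-demo",
)
print(corpus)           # split sizes
print(corpus.train[0])  # first sentence pair
```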
- class flair.datasets.text_text.OpusParallelCorpus(dataset, l1, l2, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, **corpusargs)
Bases:
ParallelTextCorpus
- __init__(dataset, l1, l2, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, **corpusargs)
Instantiates a Parallel Corpus from OPUS (see http://opus.nlpl.eu/).
- Parameters:
dataset (str) – Name of the dataset (one of “tatoeba”)
l1 (str) – Language code of first language in pair (“en”, “de”, etc.)
l2 (str) – Language code of second language in pair (“en”, “de”, etc.)
use_tokenizer (bool) – Whether or not to use in-built tokenizer
max_tokens_per_doc – If set, shortens sentences to this maximum number of tokens
max_chars_per_doc – If set, shortens sentences to this maximum number of characters
in_memory (bool) – If True, keeps dataset fully in memory
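For instance, loading the Tatoeba English-German pairs (downloaded from OPUS on first use):

```python
from flair.datasets.text_text import OpusParallelCorpus

# "tatoeba" is the dataset name documented above.
corpus = OpusParallelCorpus(dataset="tatoeba", l1="en", l2="de")
print(corpus)
```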
- class flair.datasets.text_text.ParallelTextDataset(path_to_source, path_to_target, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True)
Bases:
FlairDataset
- is_in_memory()
- Return type:
bool
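To wrap a single pair of files without the corpus machinery, the dataset can be built directly; the paths are illustrative assumptions:

```python
from flair.datasets.text_text import ParallelTextDataset

dataset = ParallelTextDataset(
    path_to_source="data/dev.en",  # assumed path
    path_to_target="data/dev.de",  # assumed path
)
print(len(dataset), dataset.is_in_memory())
```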
- class flair.datasets.text_text.DataPairCorpus(data_folder, columns=[0, 1, 2], train_file=None, test_file=None, dev_file=None, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, label_type=None, autofind_splits=True, sample_missing_splits=True, skip_first_line=False, separator='\t', encoding='utf-8')
Bases:
Corpus
- __init__(data_folder, columns=[0, 1, 2], train_file=None, test_file=None, dev_file=None, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, label_type=None, autofind_splits=True, sample_missing_splits=True, skip_first_line=False, separator='\t', encoding='utf-8')
Corpus for tasks involving pairs of sentences or paragraphs.
The data files are expected to be in column format where each line has a column for the first sentence/paragraph, the second sentence/paragraph and the label, respectively. The columns must be separated by a given separator (default: '\t').
- Parameters:
data_folder (Union[str, Path]) – base folder with the task data
columns (List[int]) – List that indicates the columns for the first sentence (first entry in the list), the second sentence (second entry) and the label (last entry). Default: [0, 1, 2]
train_file – the name of the train file
test_file – the name of the test file; if None, test data is sampled from train (if sample_missing_splits is true)
dev_file – the name of the dev file; if None, dev data is sampled from train (if sample_missing_splits is true)
use_tokenizer (bool) – Whether or not to use in-built tokenizer
max_tokens_per_doc – If set, shortens sentences to this maximum number of tokens
max_chars_per_doc – If set, shortens sentences to this maximum number of characters
in_memory (bool) – If True, data will be saved as a list of flair.data.DataPair objects; otherwise lists of plain strings are used, which need less space
label_type (Optional[str]) – Name of the label of the data pairs
autofind_splits – If True, train/test/dev files will be automatically identified in the given data_folder
sample_missing_splits (bool) – If True, a missing train/test/dev file will be sampled from the available data
skip_first_line (bool) – If True, the first line of data files will be ignored
separator (str) – Separator between columns in data files
encoding (str) – Encoding of data files
- Returns:
a Corpus with annotated train, dev and test data
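A minimal sketch, assuming a folder of tab-separated files whose names autofind_splits can detect; the folder layout and label name are illustrative:

```python
from flair.datasets.text_text import DataPairCorpus

# Assumed layout: data/pairs/{train,dev,test}.tsv, each line being
# <sentence 1>\t<sentence 2>\t<label>
corpus = DataPairCorpus(
    data_folder="data/pairs",
    columns=[0, 1, 2],
    label_type="entailment",  # illustrative label name
)
print(corpus)
print(corpus.train[0])  # a flair.data.DataPair with its label
```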
- class flair.datasets.text_text.DataPairDataset(path_to_data, columns=[0, 1, 2], max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, label_type=None, skip_first_line=False, separator='\t', encoding='utf-8', label=True)
Bases:
FlairDataset
- __init__(path_to_data, columns=[0, 1, 2], max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, label_type=None, skip_first_line=False, separator='\t', encoding='utf-8', label=True)
Creates a Dataset for pairs of sentences/paragraphs.
The file needs to be in a column format, where each line has a column for the first sentence/paragraph, the second sentence/paragraph and the label, separated by e.g. '\t' (just like in the GLUE RTE dataset, https://gluebenchmark.com/tasks). For each data pair we create a flair.data.DataPair object.
- Parameters:
path_to_data (Union[str, Path]) – path to the data file
columns (List[int]) – list of integers that indicate the respective columns. The first entry is the column for the first sentence, the second for the second sentence and the third for the label. Default: [0, 1, 2]
max_tokens_per_doc – If set, shortens sentences to this maximum number of tokens
max_chars_per_doc – If set, shortens sentences to this maximum number of characters
use_tokenizer – Whether to use in-built tokenizer
in_memory (bool) – If True, data will be saved as a list of flair.data.DataPair objects; otherwise lists of plain strings are used, which need less space
label_type (Optional[str]) – Name of the label of the data pairs
skip_first_line (bool) – If True, the first line of the data file will be ignored
separator (str) – Separator between columns in the data file
encoding (str) – Encoding of the data file
label (bool) – If False, the dataset expects unlabeled data
- is_in_memory()
- Return type:
bool
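Directly wrapping a single file; path and label name are again illustrative:

```python
from flair.datasets.text_text import DataPairDataset

dataset = DataPairDataset(
    path_to_data="data/pairs/dev.tsv",  # assumed path
    columns=[0, 1, 2],
    label_type="entailment",  # illustrative label name
)
print(len(dataset), dataset.is_in_memory())
```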
- class flair.datasets.text_text.GLUE_RTE(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)
Bases:
DataPairCorpus
- __init__(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)
Creates a DataPairCorpus for the GLUE Recognizing Textual Entailment (RTE) data.
See https://gluebenchmark.com/tasks. In addition to the Corpus, there is an eval_dataset containing the test file of the GLUE data. This file contains unlabeled test data to evaluate models on the GLUE RTE task.
- tsv_from_eval_dataset(folder_path)
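A sketch for RTE; the same pattern applies to the other GLUE pair corpora below (MRPC, QNLI, QQP, WNLI). Only the output folder is an assumption:

```python
from flair.datasets.text_text import GLUE_RTE

corpus = GLUE_RTE()  # downloads the GLUE RTE data on first use
print(corpus)
print(len(corpus.eval_dataset))  # unlabeled GLUE test set

# Once a trained model has predicted labels on corpus.eval_dataset,
# write a GLUE submission file:
corpus.tsv_from_eval_dataset("submissions/")  # assumed output folder
```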
- class flair.datasets.text_text.GLUE_MNLI(label_type='entailment', evaluate_on_matched=True, base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)
Bases:
DataPairCorpus
- __init__(label_type='entailment', evaluate_on_matched=True, base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)
Creates a DataPairCorpus for the Multi-Genre Natural Language Inference Corpus (MNLI) from the GLUE benchmark.
See https://gluebenchmark.com/tasks. Entailment annotations are: entailment, contradiction, neutral. This corpus includes two dev sets (matched/mismatched) and two unlabeled test sets: eval_dataset_matched and eval_dataset_mismatched.
- tsv_from_eval_dataset(folder_path)
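A sketch of the matched/mismatched handling; the exact effect of evaluate_on_matched is an assumption based on the parameter name:

```python
from flair.datasets.text_text import GLUE_MNLI

# evaluate_on_matched presumably selects whether the matched or the
# mismatched dev set is used for evaluation (assumption).
corpus = GLUE_MNLI(evaluate_on_matched=True)
print(len(corpus.eval_dataset_matched), len(corpus.eval_dataset_mismatched))
```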
- class flair.datasets.text_text.GLUE_MRPC(label_type='paraphrase', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)
Bases:
DataPairCorpus
- __init__(label_type='paraphrase', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)
Creates a DataPairCorpus for the Microsoft Research Paraphrase Corpus (MRPC) from the GLUE benchmark.
See https://gluebenchmark.com/tasks. MRPC includes annotated train and test sets; a dev set is sampled each time the corpus is created.
- tsv_from_eval_dataset(folder_path)
- class flair.datasets.text_text.GLUE_QNLI(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)
Bases:
DataPairCorpus
- __init__(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)
Creates a DataPairCorpus for the Question-answering Natural Language Inference dataset (QNLI) from GLUE.
See https://gluebenchmark.com/tasks. In addition to the Corpus, there is an eval_dataset containing the test file of the GLUE data. This file contains unlabeled test data to evaluate models on the GLUE QNLI task.
- tsv_from_eval_dataset(folder_path)
- class flair.datasets.text_text.GLUE_QQP(label_type='paraphrase', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)
Bases:
DataPairCorpus
- __init__(label_type='paraphrase', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)
Creates a Quora Question Pairs (QQP) Corpus from the GLUE benchmark.
See https://gluebenchmark.com/tasks. The task is to determine whether a pair of questions are semantically equivalent. In addition to the Corpus, there is an eval_dataset containing the test file of the GLUE data. This file contains unlabeled test data to evaluate models on the GLUE QQP task.
- tsv_from_eval_dataset(folder_path)
- class flair.datasets.text_text.GLUE_WNLI(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)
Bases:
DataPairCorpus
- __init__(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)
Creates a Winograd Schema Challenge Corpus formatted as a Natural Language Inference task (WNLI).
The task is to predict if the sentence with the pronoun substituted is entailed by the original sentence. In addition to the Corpus, there is an eval_dataset containing the test file of the GLUE data. This file contains unlabeled test data to evaluate models on the GLUE WNLI task.
- tsv_from_eval_dataset(folder_path)
- class flair.datasets.text_text.GLUE_STSB(label_type='similarity', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)
Bases:
DataPairCorpus
- tsv_from_eval_dataset(folder_path)
Create a tsv file of the predictions of the eval_dataset.
After calling classifier.predict(corpus.eval_dataset, label_name='similarity'), this function can be used to produce a file called STS-B.tsv suitable for submission to the GLUE benchmark.
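The flow described above, sketched end to end; the output folder is an assumption:

```python
from flair.datasets.text_text import GLUE_STSB

corpus = GLUE_STSB()

# After a trained model has predicted the "similarity" label, e.g.
#   classifier.predict(corpus.eval_dataset, label_name="similarity")
# this writes STS-B.tsv for submission to the GLUE benchmark:
corpus.tsv_from_eval_dataset("submissions/")  # assumed output folder
```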
- class flair.datasets.text_text.SUPERGLUE_RTE(base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)
Bases:
DataPairCorpus
- __init__(base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)
Creates a DataPairCorpus for the SuperGLUE Recognizing Textual Entailment (RTE) data.
See https://super.gluebenchmark.com/tasks. In addition to the Corpus, there is an eval_dataset containing the test file of the SuperGLUE data. This file contains unlabeled test data to evaluate models on the SuperGLUE RTE task.
- jsonl_from_eval_dataset(folder_path)
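As above, but SuperGLUE expects jsonl submissions rather than tsv; the output folder is an assumption:

```python
from flair.datasets.text_text import SUPERGLUE_RTE

corpus = SUPERGLUE_RTE()
corpus.jsonl_from_eval_dataset("submissions/")  # assumed output folder
```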
- flair.datasets.text_text.rte_jsonl_to_tsv(file_path, label=True, remove=False, encoding='utf-8')
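A sketch of the conversion helper; the file path is illustrative:

```python
from flair.datasets.text_text import rte_jsonl_to_tsv

# Converts an RTE-style jsonl file into the tab-separated column
# format used by the pair corpora above.
rte_jsonl_to_tsv("data/RTE/val.jsonl", label=True)  # assumed path
```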