flair.datasets
Classes
|
A simple Dataset object to wrap a List of Datapoints, for example Sentences. |
|
A Dataset taking string as input and returning Sentence during iteration. |
|
Base class for downloading and reading of dictionaries for entity linking. |
|
The AG's News Topic Classification Corpus, classifying news into 4 coarse-grained topics. |
|
Corpus for anatomical named entity mention recognition. |
|
Arizona Disease Corpus from the Biomedical Informatics Lab at Arizona State University. |
|
Original BioCreative-II-GM corpus containing gene annotations. |
|
Original BioInfer corpus. |
|
BC4CHEMD corpus with chemical annotations as used in the evaluation of BioBERT. |
|
BC5CDR corpus with chemical annotations as used in the evaluation of BioBERT. |
|
BC5CDR corpus with disease annotations as used in the evaluation of BioBERT. |
|
NCBI disease corpus as used in the evaluation of BioBERT. |
|
BC2GM corpus with gene annotations as used in the evaluation of BioBERT. |
|
JNLPBA corpus with gene annotations as used in the evaluation of BioBERT. |
|
LINNAEUS corpus with species annotations as used in the evaluation of BioBERT. |
|
S800 corpus with species annotations as used in the evaluation of BioBERT. |
|
Corpus of the BioNLP'2013 Cancer Genetics shared task. |
|
Corpus of the BioNLP'2013 Pathway Curation shared task. |
|
Original Biosemantics corpus. |
|
CDR corpus as provided by JHnlp/BioCreative-V-CDR-Corpus. |
|
Original CellFinder corpus containing cell line, species and gene annotations. |
|
Original CEMP corpus containing chemical annotations. |
|
Original corpus of the CHEMDNER shared task. |
|
Original CLL corpus containing cell line annotations. |
|
Original CRAFT corpus (version 2.0) containing all but the coreference and sections/typography annotations. |
|
Version 4.0.1 of the CRAFT corpus containing all but the co-reference and structural annotations. |
|
Original DECA corpus containing gene annotations. |
|
Original FSU corpus containing protein and derived annotations. |
|
Original Gellus corpus containing cell line annotations. |
|
Original GPRO corpus containing gene annotations. |
|
Base dictionary with data already in huner format. |
|
Union of all HUNER cell line data sets. |
|
HUNER version of the CellFinder corpus containing only cell line annotations. |
|
HUNER version of the CLL corpus containing cell line annotations. |
|
HUNER version of the Gellus corpus containing cell line annotations. |
|
HUNER version of the JNLPBA corpus containing cell line annotations. |
|
Union of all HUNER chemical data sets. |
|
HUNER version of the CDR corpus containing chemical annotations. |
|
HUNER version of the CEMP corpus containing chemical annotations. |
|
HUNER version of the CHEBI corpus containing chemical annotations. |
|
HUNER version of the CHEMDNER corpus containing chemical annotations. |
|
HUNER version of the CRAFT corpus containing (only) chemical annotations. |
|
HUNER version of the SCAI chemicals corpus containing chemical annotations. |
|
Union of all HUNER disease data sets. |
|
HUNER version of the CDR corpus containing disease annotations. |
|
HUNER version of the miRNA corpus containing disease annotations. |
|
HUNER version of the NCBI corpus containing disease annotations. |
|
PDR Dataset with only Disease annotations. |
|
HUNER version of the SCAI disease corpus containing disease annotations. |
|
HUNER version of the Variome corpus containing disease annotations. |
|
Union of all HUNER gene data sets. |
|
HUNER version of the BioCreative-II-GM corpus containing gene annotations. |
|
HUNER version of the BioInfer corpus containing only gene/protein annotations. |
|
HUNER version of the CellFinder corpus containing only gene annotations. |
|
HUNER version of the CHEBI corpus containing gene annotations. |
|
HUNER version of the CRAFT corpus containing (only) gene annotations. |
|
HUNER version of the DECA corpus containing gene annotations. |
|
HUNER version of the FSU corpus containing (only) gene annotations. |
|
HUNER version of the GPRO corpus containing gene annotations. |
|
HUNER version of the IEPA corpus containing gene annotations. |
|
HUNER version of the JNLPBA corpus containing gene annotations. |
|
HUNER version of the Loctext corpus containing protein annotations. |
|
HUNER version of the miRNA corpus containing protein / gene annotations. |
|
HUNER version of the OSIRIS corpus containing (only) gene annotations. |
|
HUNER version of the Variome corpus containing gene annotations. |
|
Union of all HUNER species data sets. |
|
HUNER version of the CellFinder corpus containing only species annotations. |
|
HUNER version of the CHEBI corpus containing species annotations. |
|
HUNER version of the CRAFT corpus containing (only) species annotations. |
|
HUNER version of the LINNAEUS corpus containing species annotations. |
|
HUNER version of the Loctext corpus containing species annotations. |
|
HUNER version of the miRNA corpus containing species annotations. |
|
HUNER version of the S800 corpus containing species annotations. |
|
HUNER version of the Variome corpus containing species annotations. |
|
IEPA corpus as provided by http://corpora.informatik.hu-berlin.de/. |
|
Original corpus of the JNLPBA shared task. |
|
Original LINNAEUS corpus containing species annotations. |
|
Original LOCTEXT corpus containing species annotations. |
|
Original miRNA corpus. |
|
Dictionary for named entity linking on genes using the NCBI Gene ontology. |
|
Dictionary for named entity linking on organisms / species using the NCBI taxonomy ontology. |
|
Dictionary for named entity linking on diseases using the Comparative Toxicogenomics Database (CTD). |
|
Dictionary for named entity linking on chemicals using the Comparative Toxicogenomics Database (CTD). |
|
Original NCBI disease corpus containing disease annotations. |
|
Original OSIRIS corpus containing variation and gene annotations. |
|
Corpus of plant-disease relations. |
|
S800 corpus. |
|
Original SCAI chemicals corpus containing chemical annotations. |
|
Original SCAI disease corpus containing disease annotations. |
|
Variome corpus as provided by http://corpora.informatik.hu-berlin.de/corpora/brat2bioc/hvp_bioc.xml.zip. |
|
A very large corpus of Amazon reviews with positivity ratings. |
|
The Communicative Functions Classification Corpus. |
|
GermEval 2018 corpus for identification of offensive language. |
|
Corpus of Linguistic Acceptability from GLUE benchmark. |
|
GoEmotions dataset containing 58k Reddit comments labeled with 27 emotion categories. |
|
Corpus of IMDB movie reviews labeled by sentiment (POSITIVE, NEGATIVE). |
|
20 newsgroups corpus, classifying news items into one of 20 categories. |
|
Stackoverflow corpus classifying questions into one of 20 labels. |
|
The customer reviews dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment. |
|
The opinion-polarity dataset of SentEval, classified into NEGATIVE or POSITIVE polarity. |
|
The movie reviews dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment. |
|
The Stanford sentiment treebank dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment. |
|
The Stanford sentiment treebank dataset of SentEval, classified into 5 sentiment classes. |
|
The subjectivity dataset of SentEval, classified into SUBJECTIVE or OBJECTIVE sentiment. |
|
Twitter sentiment corpus. |
|
The TREC Question Classification Corpus, classifying questions into 6 coarse-grained answer types. |
|
The TREC Question Classification Corpus, classifying questions into 50 fine-grained answer types. |
|
WASSA-2017 anger emotion-intensity corpus. |
|
WASSA-2017 fear emotion-intensity corpus. |
|
WASSA-2017 joy emotion-intensity corpus. |
|
WASSA-2017 sadness emotion-intensity corpus. |
|
The YAHOO Question Classification Corpus, classifying questions into 10 coarse-grained answer types. |
|
A classification corpus from FastText-formatted text files. |
|
Dataset for classification instantiated from a single FastText-formatted file. |
|
Classification corpus instantiated from CSV data files. |
|
Dataset for text classification from CSV column formatted data. |
|
This treebank includes the Faroese treebank dataset. |
- class flair.datasets.DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, drop_last=False, timeout=0, worker_init_fn=None)View on GitHub#
Bases:
DataLoader
- dataset: Dataset[_T_co]#
- batch_size: Optional[int]#
- num_workers: int#
- pin_memory: bool#
- drop_last: bool#
- timeout: float#
- sampler: Union[Sampler, Iterable]#
- pin_memory_device: str#
- prefetch_factor: Optional[int]#
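The listing above only names the parameters. As a rough illustration of what `batch_size` and `drop_last` control in a PyTorch-style loader, here is a minimal pure-Python sketch. The helper `iter_batches` is hypothetical and is not part of flair or PyTorch; it only mimics the batching behaviour.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(
    items: Sequence[T], batch_size: int = 1, drop_last: bool = False
) -> Iterator[list[T]]:
    """Yield consecutive batches; optionally skip a final, incomplete batch."""
    for start in range(0, len(items), batch_size):
        batch = list(items[start:start + batch_size])
        # Only the last batch can be short, so stopping here is safe.
        if drop_last and len(batch) < batch_size:
            return
        yield batch


# Five items with batch_size=2 leave one item over.
batches_kept = list(iter_batches(range(5), batch_size=2))
batches_dropped = list(iter_batches(range(5), batch_size=2, drop_last=True))
```

With `drop_last=False` the leftover item forms its own smaller batch; with `drop_last=True` it is discarded.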
- class flair.datasets.OcrJsonDataset(path_to_split_directory, label_type='ner', in_memory=True, encoding='utf-8', load_images=False, normalize_coords_to_thousands=True, label_name_map=None)View on GitHub#
Bases:
FlairDataset
- is_in_memory()View on GitHub#
- Return type:
bool
- class flair.datasets.SROIE(base_path=None, encoding='utf-8', label_type='ner', in_memory=True, load_images=False, normalize_coords_to_thousands=True, label_name_map=None, **corpusargs)View on GitHub#
Bases:
OcrCorpus
- class flair.datasets.FlairDatapointDataset(datapoints)View on GitHub#
Bases:
FlairDataset, Generic[DT]
A simple Dataset object to wrap a List of Datapoints, for example Sentences.
- is_in_memory()View on GitHub#
- Return type:
bool
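To make the idea of wrapping a list of datapoints concrete, the sketch below shows a stand-in class with the same general shape. `ListDataset` is a hypothetical name and this is not flair's implementation; plain strings stand in for Sentence objects.

```python
from typing import Generic, Sequence, TypeVar

DT = TypeVar("DT")


class ListDataset(Generic[DT]):
    """Hypothetical stand-in: wraps a list so it can be indexed like a dataset."""

    def __init__(self, datapoints: Sequence[DT]) -> None:
        self.datapoints = list(datapoints)

    def __len__(self) -> int:
        return len(self.datapoints)

    def __getitem__(self, index: int) -> DT:
        return self.datapoints[index]

    def is_in_memory(self) -> bool:
        # Everything is held in the wrapped list, so this is always True.
        return True


# Plain strings used here instead of real datapoint objects.
dataset = ListDataset(["first sentence", "second sentence"])
```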
- class flair.datasets.SentenceDataset(sentences)View on GitHub#
Bases:
FlairDatapointDataset
- class flair.datasets.MongoDataset(query, host, port, database, collection, text_field, categories_field=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, tag_type='class')View on GitHub#
Bases:
FlairDataset
- is_in_memory()View on GitHub#
- Return type:
bool
- class flair.datasets.StringDataset(texts, use_tokenizer=<flair.tokenization.SpaceTokenizer object>)View on GitHub#
Bases:
FlairDataset
A Dataset taking string as input and returning Sentence during iteration.
- abstract is_in_memory()View on GitHub#
- Return type:
bool
- class flair.datasets.EntityLinkingDictionary(candidates, dataset_name=None)View on GitHub#
Bases:
object
Base class for downloading and reading of dictionaries for entity linking.
A dictionary represents all entities of a knowledge base and their associated ids.
- property database_name: str#
Name of the database represented by the dictionary.
- property text_to_index: dict[str, list[str]]#
- property candidates: list[EntityCandidate]#
- to_in_memory_dictionary()View on GitHub#
- Return type:
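The `text_to_index` type, `dict[str, list[str]]`, suggests a lookup from an entity name to the ids that share that name. The listing does not state this, so the sketch below is an assumption. `Candidate` and `build_text_to_index` are hypothetical names, and the entries are made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for an entity candidate (not flair's EntityCandidate)."""

    concept_id: str
    concept_name: str


def build_text_to_index(candidates: list[Candidate]) -> dict[str, list[str]]:
    """Map each entity name to the list of concept ids that carry that name."""
    index: dict[str, list[str]] = defaultdict(list)
    for cand in candidates:
        index[cand.concept_name].append(cand.concept_id)
    return dict(index)


# Made-up entries for illustration only.
candidates = [
    Candidate("ID:1", "aspirin"),
    Candidate("ID:2", "aspirin"),
    Candidate("ID:3", "ibuprofen"),
]
text_to_index = build_text_to_index(candidates)
```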
- class flair.datasets.AGNEWS(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='partial', **corpusargs)View on GitHub#
Bases:
ClassificationCorpus
The AG’s News Topic Classification Corpus, classifying news into 4 coarse-grained topics.
Labels: World, Sports, Business, Sci/Tech.
- class flair.datasets.ANAT_EM(base_path=None, in_memory=True, tokenizer=None)View on GitHub#
Bases:
ColumnCorpus
Corpus for anatomical named entity mention recognition.
For further information see Pyysalo and Ananiadou: Anatomical entity mention recognition at literature scale https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3957068/ http://nactem.ac.uk/anatomytagger/#AnatEM
Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- abstract static download_corpus(data_folder)View on GitHub#
- static parse_input_files(input_dir, sentence_separator)View on GitHub#
- Return type:
- class flair.datasets.AZDZ(base_path=None, in_memory=True, tokenizer=None)View on GitHub#
Bases:
ColumnCorpus
Arizona Disease Corpus from the Biomedical Informatics Lab at Arizona State University.
For further information see: http://diego.asu.edu/index.php
- classmethod download_corpus(data_dir)View on GitHub#
- Return type:
Path
- static parse_corpus(input_file)View on GitHub#
- Return type:
- class flair.datasets.BC2GM(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Original BioCreative-II-GM corpus containing gene annotations.
For further information see Smith et al.: Overview of BioCreative II gene mention recognition https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2559986/
Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- static download_dataset(data_dir)View on GitHub#
- Return type:
Path
- classmethod parse_train_dataset(data_folder)View on GitHub#
- Return type:
- classmethod parse_test_dataset(data_folder)View on GitHub#
- Return type:
- static parse_dataset(text_file, ann_file)View on GitHub#
- Return type:
- class flair.datasets.BIO_INFER(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
Original BioInfer corpus.
For further information see Pyysalo et al.: BioInfer: a corpus for information extraction in the biomedical domain https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-50
- classmethod download_dataset(data_dir)View on GitHub#
- Return type:
Path
- classmethod parse_dataset(original_file)View on GitHub#
- class flair.datasets.BIOBERT_CHEMICAL_BC4CHEMD(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
BC4CHEMD corpus with chemical annotations as used in the evaluation of BioBERT.
For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert
- class flair.datasets.BIOBERT_CHEMICAL_BC5CDR(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
BC5CDR corpus with chemical annotations as used in the evaluation of BioBERT.
For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert
- class flair.datasets.BIOBERT_DISEASE_BC5CDR(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
BC5CDR corpus with disease annotations as used in the evaluation of BioBERT.
For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert
- class flair.datasets.BIOBERT_DISEASE_NCBI(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
NCBI disease corpus as used in the evaluation of BioBERT.
For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert
- class flair.datasets.BIOBERT_GENE_BC2GM(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
BC2GM corpus with gene annotations as used in the evaluation of BioBERT.
For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert
- class flair.datasets.BIOBERT_GENE_JNLPBA(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
JNLPBA corpus with gene annotations as used in the evaluation of BioBERT.
For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert
- class flair.datasets.BIOBERT_SPECIES_LINNAEUS(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
LINNAEUS corpus with species annotations as used in the evaluation of BioBERT.
For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert
- class flair.datasets.BIOBERT_SPECIES_S800(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
S800 corpus with species annotations as used in the evaluation of BioBERT.
For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert
- class flair.datasets.BIONLP2013_CG(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
BioNLPCorpus
Corpus of the BioNLP’2013 Cancer Genetics shared task.
For further information see Pyysalo, Ohta & Ananiadou 2013 Overview of the Cancer Genetics (CG) task of BioNLP Shared Task 2013 https://www.aclweb.org/anthology/W13-2008/
Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- static download_corpus(download_folder)View on GitHub#
- Return type:
tuple[Path, Path, Path]
- class flair.datasets.BIONLP2013_PC(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
BioNLPCorpus
Corpus of the BioNLP’2013 Pathway Curation shared task.
For further information see Ohta et al. Overview of the pathway curation (PC) task of bioNLP shared task 2013. https://www.aclweb.org/anthology/W13-2009/
Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- static download_corpus(download_folder)View on GitHub#
- Return type:
tuple[Path, Path, Path]
- class flair.datasets.BIOSEMANTICS(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Original Biosemantics corpus.
For further information see Akhondi et al.: Annotated chemical patent corpus: a gold standard for text mining https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4182036/
- static download_dataset(data_dir)View on GitHub#
- Return type:
Path
- static parse_dataset(data_dir)View on GitHub#
- Return type:
- class flair.datasets.CDR(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
CDR corpus as provided by JHnlp/BioCreative-V-CDR-Corpus.
For further information see Li et al.: BioCreative V CDR task corpus: a resource for chemical disease relation extraction https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4860626/
Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- static download_dataset(data_dir)View on GitHub#
- class flair.datasets.CELL_FINDER(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Original CellFinder corpus containing cell line, species and gene annotations.
For further information see Neves et al.: Annotating and evaluating text for stem cell research https://pdfs.semanticscholar.org/38e3/75aeeeb1937d03c3c80128a70d8e7a74441f.pdf
- classmethod download_and_prepare(data_folder)View on GitHub#
- Return type:
- classmethod read_folder(data_folder)View on GitHub#
- Return type:
- class flair.datasets.CEMP(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Original CEMP corpus containing chemical annotations.
For further information see: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/cemp-detailed-task-description/
- classmethod download_train_corpus(data_dir)View on GitHub#
- Return type:
Path
- classmethod download_dev_corpus(data_dir)View on GitHub#
- Return type:
Path
- static parse_input_file(text_file, ann_file)View on GitHub#
- Return type:
- class flair.datasets.CHEMDNER(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Original corpus of the CHEMDNER shared task.
For further information see Krallinger et al.: The CHEMDNER corpus of chemicals and drugs and its annotation principles https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-7-S1-S2
Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- static download_dataset(data_dir)View on GitHub#
- class flair.datasets.CLL(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
Original CLL corpus containing cell line annotations.
For further information, see Kaewphan et al.: Cell line name recognition in support of the identification of synthetic lethality in cancer from text https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4708107/
- class flair.datasets.CRAFT(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Original CRAFT corpus (version 2.0) containing all but the coreference and sections/typography annotations.
For further information see Bada et al.: Concept annotation in the craft corpus https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-161
- classmethod download_corpus(data_dir)View on GitHub#
- Return type:
Path
- static parse_corpus(corpus_dir)View on GitHub#
- Return type:
- class flair.datasets.CRAFT_V4(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Version 4.0.1 of the CRAFT corpus containing all but the co-reference and structural annotations.
For further information see: UCDenver-ccp/CRAFT
- filter_entities(corpus)View on GitHub#
- Return type:
- classmethod download_corpus(data_dir)View on GitHub#
- Return type:
Path
- static prepare_splits(data_dir, corpus)View on GitHub#
- Return type:
tuple[InternalBioNerDataset, InternalBioNerDataset, InternalBioNerDataset]
- static parse_corpus(corpus_dir)View on GitHub#
- Return type:
- class flair.datasets.DECA(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Original DECA corpus containing gene annotations.
For further information see Wang et al.: Disambiguating the species of biomedical named entities using natural language parsers https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2828111/
- classmethod download_corpus(data_dir)View on GitHub#
- Return type:
Path
- static parse_corpus(text_dir, gold_file)View on GitHub#
- Return type:
- class flair.datasets.FSU(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
Original FSU corpus containing protein and derived annotations.
For further information see Hahn et al.: A proposal for a configurable silver standard https://www.aclweb.org/anthology/W10-1838/
- classmethod download_corpus(data_dir)View on GitHub#
- Return type:
Path
- static parse_corpus(corpus_dir, sentence_separator)View on GitHub#
- Return type:
- class flair.datasets.GELLUS(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
Original Gellus corpus containing cell line annotations.
For further information, see Kaewphan et al.: Cell line name recognition in support of the identification of synthetic lethality in cancer from text https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4708107/
- class flair.datasets.GPRO(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Original GPRO corpus containing gene annotations.
For further information see: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/gpro-detailed-task-description/
- classmethod download_train_corpus(data_dir)View on GitHub#
- Return type:
Path
- classmethod download_dev_corpus(data_dir)View on GitHub#
- Return type:
Path
- static parse_input_file(text_file, ann_file)View on GitHub#
- Return type:
- class flair.datasets.HunerEntityLinkingDictionary(path, dataset_name)View on GitHub#
Bases:
EntityLinkingDictionary
Base dictionary with data already in huner format.
Every line in the file must be formatted as follows:
concept_id||concept_name
If multiple concept ids are associated with a given name, they must be separated by a |, e.g.
7157||TP53|tumor protein p53
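The line format described above can be read with a small parser. This is a minimal sketch and not flair's loader: `parse_huner_line` is a hypothetical helper, the sample lines are made up, and it assumes that only the id part (left of `||`) is split on `|`. How the part after `||` should be split further is not specified here, so the sketch leaves it untouched.

```python
def parse_huner_line(line: str) -> tuple[list[str], str]:
    """Split one `concept_id||concept_name` line into (ids, name)."""
    ids_part, sep, name = line.strip().partition("||")
    if not sep:
        raise ValueError(f"expected 'concept_id||concept_name', got: {line!r}")
    # Assumption: several ids for one name are joined with a single '|'.
    concept_ids = [cid for cid in ids_part.split("|") if cid]
    return concept_ids, name


# Hypothetical sample lines, made up for illustration only.
sample_lines = [
    "7157||TP53",
    "D000001|D000002||example compound",
]

parsed = [parse_huner_line(line) for line in sample_lines]
```

A line without the `||` separator raises a `ValueError`, which makes malformed dictionary files fail early rather than produce silently wrong entries.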
- class flair.datasets.HUNER_CELL_LINE(sentence_splitter=None)View on GitHub#
Bases:
HunerMultiCorpus
Union of all HUNER cell line data sets.
- class flair.datasets.HUNER_CELL_LINE_CELL_FINDER(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CellFinder corpus containing only cell line annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.HUNER_CELL_LINE_CLL(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CLL corpus containing cell line annotations.
- static split_url()View on GitHub#
- Return type:
str
- get_corpus_sentence_splitter()View on GitHub#
Return the pre-defined sentence splitter if defined, otherwise return None.
- Return type:
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.HUNER_CELL_LINE_GELLUS(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the Gellus corpus containing cell line annotations.
- static split_url()View on GitHub#
- Return type:
str
- get_corpus_sentence_splitter()View on GitHub#
Return the pre-defined sentence splitter if defined, otherwise return None.
- Return type:
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.HUNER_CELL_LINE_JNLPBA(*args, **kwargs)View on GitHub#
Bases:
HUNER_JNLPBA
HUNER version of the JNLPBA corpus containing cell line annotations.
- class flair.datasets.HUNER_CHEMICAL(sentence_splitter=None)View on GitHub#
Bases:
HunerMultiCorpus
Union of all HUNER chemical data sets.
- class flair.datasets.HUNER_CHEMICAL_CDR(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CDR corpus containing chemical annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- get_entity_type_mapping()View on GitHub#
- Return type:
Optional[dict]
- class flair.datasets.HUNER_CHEMICAL_CEMP(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CEMP corpus containing chemical annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- get_entity_type_mapping()View on GitHub#
- Return type:
Optional[dict]
- class flair.datasets.HUNER_CHEMICAL_CHEBI(*args, **kwargs)View on GitHub#
Bases:
HUNER_CHEBI
HUNER version of the CHEBI corpus containing chemical annotations.
- class flair.datasets.HUNER_CHEMICAL_CHEMDNER(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CHEMDNER corpus containing chemical annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.HUNER_CHEMICAL_CRAFT_V4(*args, **kwargs)View on GitHub#
Bases:
HUNER_CRAFT_V4
HUNER version of the CRAFT corpus containing (only) chemical annotations.
- class flair.datasets.HUNER_CHEMICAL_SCAI(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the SCAI chemicals corpus containing chemical annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- get_entity_type_mapping()View on GitHub#
- Return type:
Optional[dict]
- class flair.datasets.HUNER_DISEASE(sentence_splitter=None)View on GitHub#
Bases:
HunerMultiCorpus
Union of all HUNER disease data sets.
- class flair.datasets.HUNER_DISEASE_CDR(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CDR corpus containing disease annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- get_entity_type_mapping()View on GitHub#
- Return type:
Optional[dict]
- class flair.datasets.HUNER_DISEASE_MIRNA(*args, **kwargs)View on GitHub#
Bases:
HUNER_MIRNA
HUNER version of the miRNA corpus containing disease annotations.
- class flair.datasets.HUNER_DISEASE_NCBI(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the NCBI corpus containing disease annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- get_entity_type_mapping()View on GitHub#
- Return type:
Optional[dict]
- class flair.datasets.HUNER_DISEASE_PDR(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
PDR Dataset with only Disease annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- get_entity_type_mapping()View on GitHub#
- Return type:
Optional[dict]
- class flair.datasets.HUNER_DISEASE_SCAI(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the SCAI disease corpus containing disease annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- get_entity_type_mapping()View on GitHub#
- Return type:
Optional[dict]
- class flair.datasets.HUNER_DISEASE_VARIOME(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the Variome corpus containing disease annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- get_entity_type_mapping()View on GitHub#
- Return type:
Optional[dict]
- class flair.datasets.HUNER_GENE(sentence_splitter=None)View on GitHub#
Bases:
HunerMultiCorpus
Union of all HUNER gene data sets.
- class flair.datasets.HUNER_GENE_BC2GM(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the BioCreative-II-GM corpus containing gene annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.HUNER_GENE_BIO_INFER(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the BioInfer corpus containing only gene/protein annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- get_entity_type_mapping()View on GitHub#
- Return type:
Optional[dict]
- class flair.datasets.HUNER_GENE_CELL_FINDER(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CellFinder corpus containing only gene annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.HUNER_GENE_CHEBI(*args, **kwargs)View on GitHub#
Bases:
HUNER_CHEBI
HUNER version of the CHEBI corpus containing gene annotations.
- class flair.datasets.HUNER_GENE_CRAFT_V4(*args, **kwargs)View on GitHub#
Bases:
HUNER_CRAFT_V4
HUNER version of the CRAFT corpus containing (only) gene annotations.
- class flair.datasets.HUNER_GENE_DECA(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the DECA corpus containing gene annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- get_entity_type_mapping()View on GitHub#
- Return type:
Optional[dict]
- class flair.datasets.HUNER_GENE_FSU(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the FSU corpus containing (only) gene annotations.
- static split_url()View on GitHub#
- Return type:
str
- get_corpus_sentence_splitter()View on GitHub#
Return the pre-defined sentence splitter if defined, otherwise return None.
- Return type:
- to_internal(data_dir)View on GitHub#
- Return type:
- get_entity_type_mapping()View on GitHub#
- Return type:
Optional[dict]
- class flair.datasets.HUNER_GENE_GPRO(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the GPRO corpus containing gene annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- get_entity_type_mapping()View on GitHub#
- Return type:
Optional[dict]
- class flair.datasets.HUNER_GENE_IEPA(*args, **kwargs)
Bases:
HunerDataset
HUNER version of the IEPA corpus containing gene annotations.
- static split_url()
- Return type:
str
- to_internal(data_dir)
- Return type:
- class flair.datasets.HUNER_GENE_JNLPBA(*args, **kwargs)
Bases:
HUNER_JNLPBA
HUNER version of the JNLPBA corpus containing gene annotations.
- class flair.datasets.HUNER_GENE_LOCTEXT(*args, **kwargs)
Bases:
HUNER_LOCTEXT
HUNER version of the Loctext corpus containing protein annotations.
- class flair.datasets.HUNER_GENE_MIRNA(*args, **kwargs)
Bases:
HUNER_MIRNA
HUNER version of the miRNA corpus containing protein / gene annotations.
- class flair.datasets.HUNER_GENE_OSIRIS(*args, **kwargs)
Bases:
HunerDataset
HUNER version of the OSIRIS corpus containing (only) gene annotations.
- static split_url()
- Return type:
str
- to_internal(data_dir)
- Return type:
- get_entity_type_mapping()
- Return type:
Optional[dict]
- class flair.datasets.HUNER_GENE_VARIOME(*args, **kwargs)
Bases:
HunerDataset
HUNER version of the Variome corpus containing gene annotations.
- static split_url()
- Return type:
str
- to_internal(data_dir)
- Return type:
- get_entity_type_mapping()
- Return type:
Optional[dict]
- class flair.datasets.HUNER_SPECIES(sentence_splitter=None)
Bases:
HunerMultiCorpus
Union of all HUNER species data sets.
- class flair.datasets.HUNER_SPECIES_CELL_FINDER(*args, **kwargs)
Bases:
HunerDataset
HUNER version of the CellFinder corpus containing only species annotations.
- static split_url()
- Return type:
str
- to_internal(data_dir)
- Return type:
- class flair.datasets.HUNER_SPECIES_CHEBI(*args, **kwargs)
Bases:
HUNER_CHEBI
HUNER version of the CHEBI corpus containing species annotations.
- class flair.datasets.HUNER_SPECIES_CRAFT_V4(*args, **kwargs)
Bases:
HUNER_CRAFT_V4
HUNER version of the CRAFT corpus containing (only) species annotations.
- class flair.datasets.HUNER_SPECIES_LINNEAUS(*args, **kwargs)
Bases:
HunerDataset
HUNER version of the LINNEAUS corpus containing species annotations.
- static split_url()
- Return type:
str
- to_internal(data_dir)
- Return type:
- get_entity_type_mapping()
- Return type:
Optional[dict]
- class flair.datasets.HUNER_SPECIES_LOCTEXT(*args, **kwargs)
Bases:
HUNER_LOCTEXT
HUNER version of the Loctext corpus containing species annotations.
- class flair.datasets.HUNER_SPECIES_MIRNA(*args, **kwargs)
Bases:
HUNER_MIRNA
HUNER version of the miRNA corpus containing species annotations.
- class flair.datasets.HUNER_SPECIES_S800(*args, **kwargs)
Bases:
HunerDataset
HUNER version of the S800 corpus containing species annotations.
- static split_url()
- Return type:
str
- to_internal(data_dir)
- Return type:
- get_entity_type_mapping()
- Return type:
Optional[dict]
- class flair.datasets.HUNER_SPECIES_VARIOME(*args, **kwargs)
Bases:
HunerDataset
HUNER version of the Variome corpus containing species annotations.
- static split_url()
- Return type:
str
- to_internal(data_dir)
- Return type:
- get_entity_type_mapping()
- Return type:
Optional[dict]
- class flair.datasets.IEPA(base_path=None, in_memory=True)
Bases:
ColumnCorpus
IEPA corpus as provided by http://corpora.informatik.hu-berlin.de/.
For further information see Ding, Berleant, Nettleton, Wurtele: Mining MEDLINE: abstracts, sentences, or phrases? https://www.ncbi.nlm.nih.gov/pubmed/11928487
Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- static download_dataset(data_dir)
- classmethod parse_dataset(original_file)
- class flair.datasets.JNLPBA(base_path=None, in_memory=True)
Bases:
ColumnCorpus
Original corpus of the JNLPBA shared task.
For further information see Kim et al.: Introduction to the Bio-Entity Recognition Task at JNLPBA https://www.aclweb.org/anthology/W04-1213.pdf
Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- class flair.datasets.LINNEAUS(base_path=None, in_memory=True, tokenizer=None)
Bases:
ColumnCorpus
Original LINNEAUS corpus containing species annotations.
For further information see Gerner et al.: LINNAEUS: a species name identification system for biomedical literature https://www.ncbi.nlm.nih.gov/pubmed/20149233
Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- static download_and_parse_dataset(data_dir)
- class flair.datasets.LOCTEXT(base_path=None, in_memory=True, sentence_splitter=None)
Bases:
ColumnCorpus
Original LOCTEXT corpus containing species annotations.
For further information see Cejuela et al.: LocText: relation extraction of protein localizations to assist database curation https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2021-9
- static download_dataset(data_dir)
- static parse_dataset(data_dir)
- Return type:
- class flair.datasets.MIRNA(base_path=None, in_memory=True, sentence_splitter=None)
Bases:
ColumnCorpus
Original miRNA corpus.
For further information see Bagewadi et al.: Detecting miRNA Mentions and Relations in Biomedical Literature https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4602280/
- classmethod download_and_prepare_train(data_folder, sentence_separator)
- classmethod download_and_prepare_test(data_folder, sentence_separator)
- classmethod parse_file(input_file, split, sentence_separator)
- Return type:
- class flair.datasets.NCBI_GENE_HUMAN_DICTIONARY(base_path=None)
Bases:
EntityLinkingDictionary
Dictionary for named entity linking on genes using the NCBI Gene ontology.
Note that this dictionary only represents human genes; genes from other species are not included.
Further information can be found at https://www.ncbi.nlm.nih.gov/gene/
- download_dictionary(data_dir)
- Return type:
Path
- parse_dictionary(original_file)
- Return type:
Iterator[EntityCandidate]
- class flair.datasets.NCBI_TAXONOMY_DICTIONARY(base_path=None)
Bases:
EntityLinkingDictionary
Dictionary for named entity linking on organisms / species using the NCBI taxonomy ontology.
Further information about the ontology can be found at https://www.ncbi.nlm.nih.gov/taxonomy
- download_dictionary(data_dir)
- Return type:
Path
- parse_dictionary(original_file)
- Return type:
Iterator[EntityCandidate]
- class flair.datasets.CTD_DISEASES_DICTIONARY(base_path=None)
Bases:
EntityLinkingDictionary
Dictionary for named entity linking on diseases using the Comparative Toxicogenomics Database (CTD).
Further information can be found at https://ctdbase.org/
- download_dictionary(data_dir)
- Return type:
Path
- parse_file(original_file)
- Return type:
Iterator[EntityCandidate]
- class flair.datasets.CTD_CHEMICALS_DICTIONARY(base_path=None)
Bases:
EntityLinkingDictionary
Dictionary for named entity linking on chemicals using the Comparative Toxicogenomics Database (CTD).
Further information can be found at https://ctdbase.org/
- download_dictionary(data_dir)
- Return type:
Path
- parse_file(original_file)
- Return type:
Iterator[EntityCandidate]
- class flair.datasets.NCBI_DISEASE(base_path=None, in_memory=True, sentence_splitter=None)
Bases:
ColumnCorpus
Original NCBI disease corpus containing disease annotations.
For further information see Dogan et al.: NCBI disease corpus: a resource for disease name recognition and concept normalization https://www.ncbi.nlm.nih.gov/pubmed/24393765
- classmethod download_corpus(data_dir)
- Return type:
Path
- static patch_training_file(orig_train_file, patched_file)
- static parse_input_file(input_file)
- class flair.datasets.ONTONOTES(base_path=None, version='v4', language='english', domain=None, in_memory=True, **corpusargs)
Bases:
MultiFileColumnCorpus
- archive_url = 'https://data.mendeley.com/public-files/datasets/zmycy7t9h9/files/b078e1c4-f7a4-4427-be7f-9389967831ef/file_downloaded'
- classmethod get_available_domains(base_path=None, version='v4', language='english', split='train')
- Return type:
list[str]
- classmethod dataset_document_iterator(file_path)
An iterator over CoNLL-formatted files which yields documents, regardless of the number of document annotations in a particular file.
This is useful for CoNLL data which has been preprocessed, such as the preprocessing which takes place for the 2012 CoNLL Coreference Resolution task.
- Return type:
Iterator[list[dict]]
- classmethod sentence_iterator(file_path)
An iterator over the sentences in an individual CoNLL-formatted file.
- Return type:
Iterator
- name: str
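The document-iterator idea described for dataset_document_iterator can be illustrated with a small, self-contained sketch. It does not call flair. It assumes the CoNLL-2012 convention that documents are delimited by "#begin document" and "#end document" lines and that sentences are separated by blank lines. The function name iter_documents and the sample rows are hypothetical, and sentences are kept as plain token rows rather than the dicts the real method returns.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[List[List[str]]]]:
    """Yield one document at a time; each document is a list of sentences,
    each sentence a list of whitespace-split token rows (hypothetical helper)."""
    document: List[List[List[str]]] = []
    sentence: List[List[str]] = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("#begin document"):
            # A new document starts: reset the buffers.
            document, sentence = [], []
        elif line.startswith("#end document"):
            # Flush a trailing sentence that was not closed by a blank line.
            if sentence:
                document.append(sentence)
            yield document
            document, sentence = [], []
        elif not line:
            # A blank line closes the current sentence.
            if sentence:
                document.append(sentence)
                sentence = []
        else:
            sentence.append(line.split())


# Made-up sample: one file holding two document parts.
sample = [
    "#begin document (bc/cctv/00/cctv_0000); part 000",
    "bc/cctv/00/cctv_0000 0 0 Hello UH",
    "bc/cctv/00/cctv_0000 0 1 . .",
    "",
    "bc/cctv/00/cctv_0000 0 0 Bye UH",
    "",
    "#end document",
    "#begin document (bc/cctv/00/cctv_0000); part 001",
    "bc/cctv/00/cctv_0000 1 0 Again RB",
    "#end document",
]
documents = list(iter_documents(sample))
```

Because the generator yields at every "#end document" marker, a single file containing several document parts produces several documents, which matches the "regardless of the number of document annotations" wording above.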
- class flair.datasets.OSIRIS(base_path=None, in_memory=True, sentence_splitter=None, load_original_unfixed_annotation=False)
Bases:
ColumnCorpus
Original OSIRIS corpus containing variation and gene annotations.
For further information see Furlong et al.: Osiris v1.2: a named entity recognition system for sequence variants of genes in biomedical literature https://www.ncbi.nlm.nih.gov/pubmed/18251998
Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- classmethod download_dataset(data_dir)
- Return type:
Path
- classmethod parse_dataset(corpus_folder, fix_annotation=True)
- class flair.datasets.PDR(base_path=None, in_memory=True, sentence_splitter=None)
Bases:
ColumnCorpus
Corpus of plant-disease relations.
For further information see Kim et al.: A corpus of plant-disease relations in the biomedical domain https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0221582 http://gcancer.org/pdr/
Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- classmethod download_corpus(data_dir)
- Return type:
Path
- class flair.datasets.S800(base_path=None, in_memory=True, sentence_splitter=None)
Bases:
ColumnCorpus
S800 corpus.
For further information see Pafilis et al.: The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0065390.
- static download_dataset(data_dir)
- static parse_dataset(data_dir)
- Return type:
- class flair.datasets.SCAI_CHEMICALS(*args, **kwargs)
Bases:
ScaiCorpus
Original SCAI chemicals corpus containing chemical annotations.
For further information see Kolářik et al.: Chemical Names: Terminological Resources and Corpora Annotation https://pub.uni-bielefeld.de/record/2603498
Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- download_corpus(data_dir)
- Return type:
Path
- static perform_corpus_download(data_dir)
- Return type:
Path
- class flair.datasets.SCAI_DISEASE(*args, **kwargs)
Bases:
ScaiCorpus
Original SCAI disease corpus containing disease annotations.
For further information see Gurulingappa et al.: An Empirical Evaluation of Resources for the Identification of Diseases and Adverse Effects in Biomedical Literature https://pub.uni-bielefeld.de/record/2603398
Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- download_corpus(data_dir)
- Return type:
Path
- static perform_corpus_download(data_dir)
- Return type:
Path
- class flair.datasets.VARIOME(base_path=None, in_memory=True, sentence_splitter=None)
Bases:
ColumnCorpus
Variome corpus as provided by http://corpora.informatik.hu-berlin.de/corpora/brat2bioc/hvp_bioc.xml.zip.
For further information see Verspoor et al.: Annotating the biomedical literature for the human variome https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3676157/
- static download_dataset(data_dir)
- static parse_corpus(corpus_xml)
- Return type:
- class flair.datasets.AMAZON_REVIEWS(split_max=30000, label_name_map={'1.0': 'NEGATIVE', '2.0': 'NEGATIVE', '3.0': 'NEGATIVE', '4.0': 'POSITIVE', '5.0': 'POSITIVE'}, skip_labels=['3.0', '4.0'], fraction_of_5_star_reviews=10, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Bases:
ClassificationCorpus
A very large corpus of Amazon reviews with positivity ratings.
Corpus is downloaded from and documented at https://nijianmo.github.io/amazon/index.html. We download the 5-core subset, which still contains tens of millions of reviews.
- download_and_prepare_amazon_product_file(data_folder, part_name, max_data_points=None, fraction_of_5_star_reviews=None)
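How the default label_name_map and skip_labels in the AMAZON_REVIEWS signature might interact can be sketched in plain Python. This is not flair's implementation. It assumes that skip_labels is checked against the raw star rating before any renaming takes place, and resolve_label is a hypothetical helper introduced only for illustration.

```python
from typing import Mapping, Optional, Sequence

# Default values copied from the AMAZON_REVIEWS signature.
LABEL_NAME_MAP = {
    "1.0": "NEGATIVE",
    "2.0": "NEGATIVE",
    "3.0": "NEGATIVE",
    "4.0": "POSITIVE",
    "5.0": "POSITIVE",
}
SKIP_LABELS = ("3.0", "4.0")


def resolve_label(
    raw_label: str,
    label_name_map: Mapping[str, str] = LABEL_NAME_MAP,
    skip_labels: Sequence[str] = SKIP_LABELS,
) -> Optional[str]:
    """Return the renamed label, or None when the example would be dropped.

    Assumption: skipping is decided on the raw label, before renaming.
    """
    if raw_label in skip_labels:
        return None
    # Labels missing from the map are passed through unchanged.
    return label_name_map.get(raw_label, raw_label)


ratings = ["1.0", "3.0", "4.0", "5.0"]
resolved = [resolve_label(rating) for rating in ratings]
```

Under these assumptions the defaults keep only clearly negative (1 and 2 star) and clearly positive (5 star) reviews, and the ambiguous 3 and 4 star reviews are left out.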
- class flair.datasets.COMMUNICATIVE_FUNCTIONS(base_path=None, memory_mode='full', tokenizer=<flair.tokenization.SpaceTokenizer object>, **corpusargs)
Bases:
ClassificationCorpus
The Communicative Functions Classification Corpus.
Classifying sentences from scientific papers into 39 communicative functions.
- class flair.datasets.GERMEVAL_2018_OFFENSIVE_LANGUAGE(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='full', fine_grained_classes=False, **corpusargs)
Bases:
ClassificationCorpus
GermEval 2018 corpus for identification of offensive language.
Classifying German tweets into 2 coarse-grained categories OFFENSIVE and OTHER or 4 fine-grained categories ABUSE, INSULT, PROFANITY and OTHER.
- class flair.datasets.GLUE_COLA(label_type='acceptability', base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)
Bases:
ClassificationCorpus
Corpus of Linguistic Acceptability from GLUE benchmark.
See https://gluebenchmark.com/tasks
The task is to predict whether an English sentence is grammatically correct. In addition to the corpus, eval_dataset contains the unlabeled test data for GLUE evaluation.
- tsv_from_eval_dataset(folder_path)
Create eval prediction file.
This function creates a TSV file with predictions for eval_dataset (after calling classifier.predict(corpus.eval_dataset, label_name='acceptability')). The resulting file is called CoLA.tsv and is in the format required for submission to the GLUE benchmark.
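The kind of file tsv_from_eval_dataset produces can be sketched as follows. This is an illustration only and does not reproduce flair's code. It assumes that a GLUE submission file is a tab-separated table with an index column and a prediction column, and that CoLA predictions are coded as 0 or 1. The helper write_glue_predictions is hypothetical.

```python
import csv
import io
from typing import Iterable, TextIO


def write_glue_predictions(handle: TextIO, predictions: Iterable[int]) -> None:
    """Write one row per prediction as `index<TAB>prediction` (assumed layout)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["index", "prediction"])
    for index, prediction in enumerate(predictions):
        writer.writerow([index, prediction])


# Write to an in-memory buffer; a real run would open a CoLA.tsv file instead.
buffer = io.StringIO()
write_glue_predictions(buffer, [1, 0, 1])
tsv_text = buffer.getvalue()
```

Row order matters in such a file, because each index has to line up with the corresponding unlabeled test sentence.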
- class flair.datasets.GO_EMOTIONS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Bases:
ClassificationCorpus
GoEmotions dataset containing 58k Reddit comments labeled with 27 emotion categories.
- class flair.datasets.IMDB(base_path=None, rebalance_corpus=True, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Bases:
ClassificationCorpus
Corpus of IMDB movie reviews labeled by sentiment (POSITIVE, NEGATIVE).
Downloaded from and documented at http://ai.stanford.edu/~amaas/data/sentiment/.
- class flair.datasets.NEWSGROUPS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Bases:
ClassificationCorpus
20 newsgroups corpus, classifying news items into one of 20 categories.
Downloaded from http://qwone.com/~jason/20Newsgroups
Each data point is a full news article so documents may be very long.
- class flair.datasets.STACKOVERFLOW(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Bases:
ClassificationCorpus
Stackoverflow corpus classifying questions into one of 20 labels.
The data will be downloaded from “jacoxu/StackOverflow”.
Each data point is a question.
- class flair.datasets.SENTEVAL_CR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases:
ClassificationCorpus
The customer reviews dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.
- class flair.datasets.SENTEVAL_MPQA(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases:
ClassificationCorpus
The opinion-polarity dataset of SentEval, classified into NEGATIVE or POSITIVE polarity.
- class flair.datasets.SENTEVAL_MR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases:
ClassificationCorpus
The movie reviews dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.
- class flair.datasets.SENTEVAL_SST_BINARY(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases:
ClassificationCorpus
The Stanford sentiment treebank dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.
- class flair.datasets.SENTEVAL_SST_GRANULAR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases:
ClassificationCorpus
The Stanford sentiment treebank dataset of SentEval, classified into 5 sentiment classes.
- class flair.datasets.SENTEVAL_SUBJ(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases:
ClassificationCorpus
The subjectivity dataset of SentEval, classified into SUBJECTIVE or OBJECTIVE sentiment.
- class flair.datasets.SENTIMENT_140(label_name_map=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Bases:
ClassificationCorpus
Twitter sentiment corpus.
See http://help.sentiment140.com/for-students
Two sentiments in train data (POSITIVE, NEGATIVE) and three sentiments in test data (POSITIVE, NEGATIVE, NEUTRAL).
- class flair.datasets.TREC_6(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases:
ClassificationCorpus
The TREC Question Classification Corpus, classifying questions into 6 coarse-grained answer types.
- class flair.datasets.TREC_50(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases:
ClassificationCorpus
The TREC Question Classification Corpus, classifying questions into 50 fine-grained answer types.
- class flair.datasets.WASSA_ANGER(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)
Bases:
ClassificationCorpus
WASSA-2017 anger emotion-intensity corpus.
See https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.
- class flair.datasets.WASSA_FEAR(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)
Bases:
ClassificationCorpus
WASSA-2017 fear emotion-intensity corpus.
See https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.
- class flair.datasets.WASSA_JOY(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)
Bases:
ClassificationCorpus
WASSA-2017 joy emotion-intensity corpus.
See https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.
- class flair.datasets.WASSA_SADNESS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)
Bases:
ClassificationCorpus
WASSA-2017 sadness emotion-intensity corpus.
See https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.
- class flair.datasets.YAHOO_ANSWERS(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='partial', **corpusargs)
Bases:
ClassificationCorpus
The YAHOO Question Classification Corpus, classifying questions into 10 coarse-grained answer types.
- class flair.datasets.ClassificationCorpus(data_folder, label_type='class', train_file=None, test_file=None, dev_file=None, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, sample_missing_splits=True, encoding='utf-8')
Bases:
Corpus
A classification corpus from FastText-formatted text files.
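For readers unfamiliar with the FastText text format, here is a minimal sketch of how one line is laid out: zero or more leading `__label__<name>` tokens followed by the document text. The parser is a stand-alone illustration and not flair's own reader. The function name parse_fasttext_line and the sample sentence are made up.

```python
from typing import List, Tuple

LABEL_PREFIX = "__label__"


def parse_fasttext_line(line: str) -> Tuple[List[str], str]:
    """Split a FastText-style line into (labels, text)."""
    tokens = line.split()
    labels: List[str] = []
    position = 0
    # Labels are the leading tokens that start with the label prefix.
    while position < len(tokens) and tokens[position].startswith(LABEL_PREFIX):
        labels.append(tokens[position][len(LABEL_PREFIX):])
        position += 1
    # Everything after the labels is the document text.
    return labels, " ".join(tokens[position:])


sample = "__label__POSITIVE __label__REVIEW A surprisingly good film ."
labels, text = parse_fasttext_line(sample)
```

A folder of files in this layout, one per split, is what the data_folder, train_file, dev_file and test_file parameters of ClassificationCorpus refer to.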
- class flair.datasets.ClassificationDataset(path_to_file, label_type, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, encoding='utf-8')
Bases:
FlairDataset
Dataset for classification instantiated from a single FastText-formatted file.
- is_in_memory()
- Return type:
bool
- class flair.datasets.CSVClassificationCorpus(data_folder, column_name_map, label_type, name='csv_corpus', train_file=None, test_file=None, dev_file=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=False, skip_header=False, encoding='utf-8', no_class_label=None, sample_missing_splits=True, **fmtparams)
Bases:
Corpus
Classification corpus instantiated from CSV data files.
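The role of column_name_map, skip_header and **fmtparams can be sketched with the standard csv module. This is an illustration under assumptions, not flair's reader. It assumes that the map goes from column index to a role name, that columns named "text" hold the document text, and that columns whose name starts with "label" hold labels. read_csv_examples and the sample rows are hypothetical.

```python
import csv
import io
from typing import Dict, Iterator, List, TextIO, Tuple


def read_csv_examples(
    handle: TextIO,
    column_name_map: Dict[int, str],
    skip_header: bool = False,
    **fmtparams,
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (text, labels) pairs from CSV rows (hypothetical helper)."""
    reader = csv.reader(handle, **fmtparams)  # fmtparams go to csv.reader as-is
    if skip_header:
        next(reader, None)
    for row in reader:
        text_parts: List[str] = []
        labels: List[str] = []
        for index, role in sorted(column_name_map.items()):
            if role == "text":
                text_parts.append(row[index])
            elif role.startswith("label"):
                labels.append(row[index])
        yield " ".join(text_parts), labels


raw = "id,topic,body\n1,sports,The match ended in a draw\n2,tech,New chip announced\n"
examples = list(
    read_csv_examples(io.StringIO(raw), {1: "label", 2: "text"}, skip_header=True, delimiter=",")
)
```

Columns that are not mentioned in the map, such as the id column here, are simply ignored.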
- class flair.datasets.CSVClassificationDataset(path_to_file, column_name_map, label_type, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, skip_header=False, encoding='utf-8', no_class_label=None, **fmtparams)
Bases:
FlairDataset
Dataset for text classification from CSV column formatted data.
- is_in_memory()
- Return type:
bool
- class flair.datasets.NEL_ENGLISH_AIDA(base_path=None, in_memory=True, use_ids_and_check_existence=False, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NEL_ENGLISH_AQUAINT(base_path=None, in_memory=True, agreement_threshold=0.5, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NEL_ENGLISH_IITB(base_path=None, in_memory=True, ignore_disagreements=False, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NEL_ENGLISH_REDDIT(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NEL_ENGLISH_TWEEKI(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NEL_GERMAN_HIPE(base_path=None, in_memory=True, wiki_language='dewiki', **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.WSD_MASC(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)
Bases:
ColumnCorpus
- class flair.datasets.WSD_OMSTI(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)
Bases:
ColumnCorpus
- class flair.datasets.WSD_RAGANATO_ALL(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True)
Bases:
ColumnCorpus
- class flair.datasets.WSD_SEMCOR(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)
Bases:
ColumnCorpus
- class flair.datasets.WSD_TRAINOMATIC(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, use_raganato_ALL_as_test_data=False)
Bases:
ColumnCorpus
- class flair.datasets.WSD_UFSAC(filenames=['masc', 'semcor'], base_path=None, in_memory=True, cut_multisense=True, columns={0: 'text', 3: 'sense'}, banned_sentences=None, sample_missing_splits_in_multicorpus=True, sample_missing_splits_in_each_corpus=True, use_raganato_ALL_as_test_data=False, name='multicorpus')
Bases:
MultiCorpus
- class flair.datasets.WSD_WORDNET_GLOSS_TAGGED(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, use_raganato_ALL_as_test_data=False)
Bases:
ColumnCorpus
- class flair.datasets.RE_ENGLISH_CONLL04(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- convert_to_conllu(source_data_folder, data_folder)
- class flair.datasets.RE_ENGLISH_DRUGPROT(base_path=None, in_memory=True, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)
Bases:
ColumnCorpus
- extract_and_convert_to_conllu(data_file, data_folder)
- char_spans_to_token_spans(char_spans, token_offsets)
- has_overlap(a, b)
- drugprot_document_to_tokenlists(pmid, title_sentences, abstract_sentences, abstract_offset, entities, relations)
- Return type:
list[TokenList]
- class flair.datasets.RE_ENGLISH_SEMEVAL2010(base_path=None, in_memory=True, augment_train=False, **corpusargs)
Bases:
ColumnCorpus
- extract_and_convert_to_conllu(data_file, data_folder, augment_train)
- class flair.datasets.RE_ENGLISH_TACRED(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- extract_and_convert_to_conllu(data_file, data_folder)
- class flair.datasets.BIOSCOPE(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.CONLL_03(base_path=None, column_format={0: 'text', 1: 'pos', 3: 'ner'}, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
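The column_format default shown for CONLL_03, {0: 'text', 1: 'pos', 3: 'ner'}, maps column positions in a whitespace-separated file to annotation names. The sketch below illustrates that idea in plain Python without using flair. It assumes one token per line and a blank line between sentences. read_column_sentences and the sample rows are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List


def read_column_sentences(
    lines: Iterable[str], column_format: Dict[int, str]
) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {annotation name: value} token dicts."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split()
        # Keep only the columns named in column_format (column 2 is dropped here).
        sentence.append({name: fields[index] for index, name in column_format.items()})
    if sentence:
        yield sentence


sample = """EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()
sentences = list(read_column_sentences(sample, {0: "text", 1: "pos", 3: "ner"}))
```

The WSD corpora listed above follow the same pattern with columns={0: 'text', 3: 'sense'}.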
- class flair.datasets.CONLL_03_DUTCH(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.CONLL_03_GERMAN(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.CONLL_03_SPANISH(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.CLEANCONLL(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- static download_and_prepare_data(data_folder)
- class flair.datasets.CONLL_2000(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.FEWNERD(setting='supervised', **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.KEYPHRASE_INSPEC(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.KEYPHRASE_SEMEVAL2010(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.KEYPHRASE_SEMEVAL2017(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.MASAKHA_POS(languages='bam', version='v1', base_path=None, in_memory=True, **corpusargs)
Bases:
MultiCorpus
- class flair.datasets.NER_ARABIC_ANER(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_ARABIC_AQMAR(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_BASQUE(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_CHINESE_WEIBO(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_DANISH_DANE(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_ENGLISH_MOVIE_COMPLEX(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_ENGLISH_MOVIE_SIMPLE(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_ENGLISH_PERSON(base_path=None, in_memory=True)
Bases:
ColumnCorpus
- class flair.datasets.NER_ENGLISH_RESTAURANT(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_ENGLISH_SEC_FILLINGS(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_ENGLISH_STACKOVERFLOW(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_ENGLISH_TWITTER(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_ENGLISH_WEBPAGES(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_ENGLISH_WIKIGOLD(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_ENGLISH_WNUT_2020(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_FINNISH(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_GERMAN_BIOFID(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_GERMAN_EUROPARL(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_GERMAN_GERMEVAL(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_GERMAN_LEGAL(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_GERMAN_MOBIE(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_GERMAN_POLITICS(base_path=None, column_delimiter='\\\\s+', in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_HIPE_2022(dataset_name, language, base_path=None, in_memory=True, version='v2.1', branch_name='main', dev_split_name='dev', add_document_separator=False, sample_missing_splits=False, preproc_fn=None, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_NOISEBENCH(noise='clean', base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- label_url = 'https://raw.githubusercontent.com/elenamer/NoiseBench/main/data/annotations/'
- SAVE_TRAINDEV_FILE = False
- name: str
- class flair.datasets.NER_HUNGARIAN(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_ICDAR_EUROPEANA(language, base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_ICELANDIC(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_JAPANESE(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_NERMUD(domains='all', base_path=None, in_memory=False, **corpusargs)
Bases:
MultiCorpus
- class flair.datasets.NER_MASAKHANE(languages='luo', version='v2', base_path=None, in_memory=True, **corpusargs)
Bases:
MultiCorpus
- class flair.datasets.NER_MULTI_WIKIANN(languages='en', base_path=None, in_memory=False, **corpusargs)
Bases:
MultiCorpus
- class flair.datasets.NER_MULTI_WIKINER(languages='en', base_path=None, in_memory=False, **corpusargs)
Bases:
MultiCorpus
- class flair.datasets.NER_MULTI_XTREME(languages='en', base_path=None, in_memory=False, **corpusargs)
Bases:
MultiCorpus
- class flair.datasets.NER_SWEDISH(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_TURKU(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_UKRAINIAN(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.NER_ESTONIAN_NOISY(version=0, base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- data_url = 'https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/patnlp/estner.cnll.zip'
- label_url = 'https://raw.githubusercontent.com/uds-lsv/NoisyNER/master/data/only_labels'
- name: str
- class flair.datasets.UP_CHINESE(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.UP_ENGLISH(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.UP_FINNISH(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.UP_FRENCH(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.UP_GERMAN(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.UP_ITALIAN(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.UP_SPANISH(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.UP_SPANISH_ANCORA(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.WNUT_17(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.ColumnCorpus(data_folder, column_format, train_file=None, test_file=None, dev_file=None, autofind_splits=True, name=None, comment_symbol='# ', **corpusargs)
Bases:
MultiFileColumnCorpus
- class flair.datasets.ColumnDataset(path_to_column_file, column_name_map, column_delimiter='\\s+', comment_symbol=None, banned_sentences=None, in_memory=True, document_separator_token=None, encoding='utf-8', skip_first_line=False, label_name_map=None, default_whitespace_after=1)View on GitHub#
Bases:
FlairDataset
- SPACE_AFTER_KEY = 'space-after'#
- FEATS = ['feats', 'misc']#
- HEAD = ['head', 'head_id']#
- text_column: int#
- head_id_column: Optional[int]#
- sentences_raw: list[list[str]]#
- total_sentence_count: int#
- is_in_memory()View on GitHub#
- Return type:
bool
- class flair.datasets.NER_MULTI_CONER(task='multi', base_path=None, in_memory=True, **corpusargs)View on GitHub#
Bases:
MultiFileColumnCorpus
- class flair.datasets.NER_MULTI_CONER_V2(task='multi', base_path=None, in_memory=True, use_dev_as_test=True, **corpusargs)View on GitHub#
Bases:
MultiFileColumnCorpus
- class flair.datasets.FeideggerCorpus(**kwargs)View on GitHub#
Bases:
Corpus
- class flair.datasets.FeideggerDataset(dataset_info, **kwargs)View on GitHub#
Bases:
FlairDataset
- is_in_memory()View on GitHub#
- Return type:
bool
- class flair.datasets.GLUE_MNLI(label_type='entailment', evaluate_on_matched=True, base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#
Bases:
DataPairCorpus
- tsv_from_eval_dataset(folder_path)View on GitHub#
- class flair.datasets.GLUE_MRPC(label_type='paraphrase', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#
Bases:
DataPairCorpus
- tsv_from_eval_dataset(folder_path)View on GitHub#
- class flair.datasets.GLUE_QNLI(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#
Bases:
DataPairCorpus
- tsv_from_eval_dataset(folder_path)View on GitHub#
- class flair.datasets.GLUE_QQP(label_type='paraphrase', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#
Bases:
DataPairCorpus
- tsv_from_eval_dataset(folder_path)View on GitHub#
- class flair.datasets.GLUE_RTE(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#
Bases:
DataPairCorpus
- tsv_from_eval_dataset(folder_path)View on GitHub#
- class flair.datasets.GLUE_WNLI(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#
Bases:
DataPairCorpus
- tsv_from_eval_dataset(folder_path)View on GitHub#
- class flair.datasets.GLUE_SST2(label_type='sentiment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=False, encoding='utf-8', sample_missing_splits=True, **datasetargs)View on GitHub#
Bases:
CSVClassificationCorpus
- label_map = {0: 'negative', 1: 'positive'}#
- tsv_from_eval_dataset(folder_path)View on GitHub#
Create eval prediction file.
- name: str#
- class flair.datasets.GLUE_STSB(label_type='similarity', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#
Bases:
DataPairCorpus
- tsv_from_eval_dataset(folder_path)View on GitHub#
Create a tsv file of the predictions of the eval_dataset.
After calling classifier.predict(corpus.eval_dataset, label_name='similarity'), this function can be used to produce a file called STS-B.tsv suitable for submission to the GLUE benchmark.
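The shape of such a submission file can be sketched without flair. The helper below is hypothetical (write_stsb_tsv is not part of flair.datasets), and the header names and the three-decimal formatting are assumptions made for illustration, not a statement of what tsv_from_eval_dataset writes. It only shows the general idea: one row per evaluation pair, indexed in dataset order, with the predicted similarity score next to it.

```python
import csv
import tempfile
from pathlib import Path


def write_stsb_tsv(scores, folder_path):
    """Hypothetical helper: write one predicted score per row to <folder_path>/STS-B.tsv."""
    out_path = Path(folder_path) / "STS-B.tsv"
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # Assumed header; check the benchmark's submission format before relying on it.
        writer.writerow(["index", "prediction"])
        for index, score in enumerate(scores):
            # Rows follow the order of the evaluation dataset.
            writer.writerow([index, f"{score:.3f}"])
    return out_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = write_stsb_tsv([4.25, 0.5, 3.0], tmp)
        print(path.read_text(encoding="utf-8"))
```

In a real run the scores would come from the labels that classifier.predict attaches to each pair in corpus.eval_dataset, and tsv_from_eval_dataset(folder_path) would be called instead of this helper.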
- class flair.datasets.SUPERGLUE_RTE(base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#
Bases:
DataPairCorpus
- jsonl_from_eval_dataset(folder_path)View on GitHub#
- class flair.datasets.DataPairCorpus(data_folder, columns=[0, 1, 2], train_file=None, test_file=None, dev_file=None, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, label_type=None, autofind_splits=True, sample_missing_splits=True, skip_first_line=False, separator='\t', encoding='utf-8')View on GitHub#
Bases:
Corpus
- class flair.datasets.DataPairDataset(path_to_data, columns=[0, 1, 2], max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, label_type=None, skip_first_line=False, separator='\t', encoding='utf-8', label=True)View on GitHub#
Bases:
FlairDataset
- is_in_memory()View on GitHub#
- Return type:
bool
- class flair.datasets.DataTripleCorpus(data_folder, columns=[0, 1, 2, 3], train_file=None, test_file=None, dev_file=None, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, label_type=None, autofind_splits=True, sample_missing_splits=True, skip_first_line=False, separator='\t', encoding='utf-8')View on GitHub#
Bases:
Corpus
- class flair.datasets.DataTripleDataset(path_to_data, columns=[0, 1, 2, 3], max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, label_type=None, skip_first_line=False, separator='\t', encoding='utf-8', label=True)View on GitHub#
Bases:
FlairDataset
- is_in_memory()View on GitHub#
- Return type:
bool
- class flair.datasets.OpusParallelCorpus(dataset, l1, l2, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, **corpusargs)View on GitHub#
Bases:
ParallelTextCorpus
- class flair.datasets.ParallelTextCorpus(source_file, target_file, name, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, **corpusargs)View on GitHub#
Bases:
Corpus
- is_in_memory()View on GitHub#
- Return type:
bool
- class flair.datasets.ParallelTextDataset(path_to_source, path_to_target, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True)View on GitHub#
Bases:
FlairDataset
- is_in_memory()View on GitHub#
- Return type:
bool
- class flair.datasets.UD_AFRIKAANS(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_ANCIENT_GREEK(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_ARABIC(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_ARMENIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_BASQUE(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_BAVARIAN_MAIBAAM(base_path=None, in_memory=True, split_multiwords=True, revision='dev')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_BELARUSIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_BULGARIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_BURYAT(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_CATALAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_CHINESE(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_CHINESE_KYOTO(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_COPTIC(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_CROATIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_CZECH(base_path=None, in_memory=False, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_DANISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_DUTCH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_ENGLISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_ESTONIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_FAROESE(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
This treebank includes the Faroese treebank dataset.
The data is obtained from the following link: UniversalDependencies/UD_Faroese-FarPaHC/{revision}
Faroese is a small Western Scandinavian language with 60,000-100,000 speakers, related to Icelandic and Old Norse.
- class flair.datasets.UD_FINNISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_FRENCH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_GALICIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_GERMAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_GERMAN_HDT(base_path=None, in_memory=False, split_multiwords=True, revision='dev')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_GOTHIC(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_GREEK(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_HEBREW(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_HINDI(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_INDONESIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_IRISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_ITALIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_JAPANESE(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_KAZAKH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_KOREAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_LATIN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_LATVIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_LITHUANIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_LIVVI(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_MALTESE(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_MARATHI(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_NAIJA(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_NORTH_SAMI(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_NORWEGIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_OLD_CHURCH_SLAVONIC(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_OLD_FRENCH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_PERSIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_POLISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_PORTUGUESE(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_ROMANIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_RUSSIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_SERBIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_SLOVAK(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_SLOVENIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_SPANISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_SWEDISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_TURKISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_UKRAINIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UD_WOLOF(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#
Bases:
UniversalDependenciesCorpus
- class flair.datasets.UniversalDependenciesCorpus(data_folder, train_file=None, test_file=None, dev_file=None, in_memory=True, split_multiwords=True)View on GitHub#
Bases:
Corpus
- class flair.datasets.UniversalDependenciesDataset(path_to_conll_file, in_memory=True, split_multiwords=True)View on GitHub#
Bases:
FlairDataset
- is_in_memory()View on GitHub#
- Return type:
bool
- class flair.datasets.ZELDA(base_path=None, in_memory=False, column_format={0: 'text', 2: 'nel'}, **corpusargs)View on GitHub#
Bases:
MultiFileColumnCorpus