flair.datasets#

Classes

DataLoader(dataset[, batch_size, shuffle, ...])

OcrJsonDataset(path_to_split_directory[, ...])

SROIE([base_path, encoding, label_type, ...])

FlairDatapointDataset(datapoints)

A simple Dataset object to wrap a List of Datapoints, for example Sentences.

SentenceDataset(sentences)

MongoDataset(query, host, port, database, ...)

StringDataset(texts[, use_tokenizer])

A Dataset taking strings as input and returning Sentences during iteration.

EntityLinkingDictionary(candidates[, ...])

Base class for downloading and reading dictionaries for entity linking.

AGNEWS([base_path, tokenizer, memory_mode])

The AG's News Topic Classification Corpus, classifying news into 4 coarse-grained topics.

ANAT_EM([base_path, in_memory, tokenizer])

Corpus for anatomical named entity mention recognition.

AZDZ([base_path, in_memory, tokenizer])

Arizona Disease Corpus from the Biomedical Informatics Lab at Arizona State University.

BC2GM([base_path, in_memory, sentence_splitter])

Original BioCreative-II-GM corpus containing gene annotations.

BIO_INFER([base_path, in_memory])

Original BioInfer corpus.

BIOBERT_CHEMICAL_BC4CHEMD([base_path, in_memory])

BC4CHEMD corpus with chemical annotations as used in the evaluation of BioBERT.

BIOBERT_CHEMICAL_BC5CDR([base_path, in_memory])

BC5CDR corpus with chemical annotations as used in the evaluation of BioBERT.

BIOBERT_DISEASE_BC5CDR([base_path, in_memory])

BC5CDR corpus with disease annotations as used in the evaluation of BioBERT.

BIOBERT_DISEASE_NCBI([base_path, in_memory])

NCBI disease corpus as used in the evaluation of BioBERT.

BIOBERT_GENE_BC2GM([base_path, in_memory])

BC2GM corpus with gene annotations as used in the evaluation of BioBERT.

BIOBERT_GENE_JNLPBA([base_path, in_memory])

JNLPBA corpus with gene annotations as used in the evaluation of BioBERT.

BIOBERT_SPECIES_LINNAEUS([base_path, in_memory])

LINNAEUS corpus with species annotations as used in the evaluation of BioBERT.

BIOBERT_SPECIES_S800([base_path, in_memory])

S800 corpus with species annotations as used in the evaluation of BioBERT.

BIONLP2013_CG([base_path, in_memory, ...])

Corpus of the BioNLP'2013 Cancer Genetics shared task.

BIONLP2013_PC([base_path, in_memory, ...])

Corpus of the BioNLP'2013 Pathway Curation shared task.

BIOSEMANTICS([base_path, in_memory, ...])

Original Biosemantics corpus.

CDR([base_path, in_memory, sentence_splitter])

CDR corpus as provided by JHnlp/BioCreative-V-CDR-Corpus.

CELL_FINDER([base_path, in_memory, ...])

Original CellFinder corpus containing cell line, species and gene annotations.

CEMP([base_path, in_memory, sentence_splitter])

Original CEMP corpus containing chemical annotations.

CHEMDNER([base_path, in_memory, ...])

Original corpus of the CHEMDNER shared task.

CLL([base_path, in_memory])

Original CLL corpus containing cell line annotations.

CRAFT([base_path, in_memory, sentence_splitter])

Original CRAFT corpus (version 2.0) containing all but the coreference and sections/typography annotations.

CRAFT_V4([base_path, in_memory, ...])

Version 4.0.1 of the CRAFT corpus containing all but the co-reference and structural annotations.

DECA([base_path, in_memory, sentence_splitter])

Original DECA corpus containing gene annotations.

FSU([base_path, in_memory])

Original FSU corpus containing protein and derived annotations.

GELLUS([base_path, in_memory])

Original Gellus corpus containing cell line annotations.

GPRO([base_path, in_memory, sentence_splitter])

Original GPRO corpus containing gene annotations.

HunerEntityLinkingDictionary(path, dataset_name)

Base dictionary with data already in huner format.

HUNER_CELL_LINE([sentence_splitter])

Union of all HUNER cell line data sets.

HUNER_CELL_LINE_CELL_FINDER(*args, **kwargs)

HUNER version of the CellFinder corpus containing only cell line annotations.

HUNER_CELL_LINE_CLL(*args, **kwargs)

HUNER version of the CLL corpus containing cell line annotations.

HUNER_CELL_LINE_GELLUS(*args, **kwargs)

HUNER version of the Gellus corpus containing cell line annotations.

HUNER_CELL_LINE_JNLPBA(*args, **kwargs)

HUNER version of the JNLPBA corpus containing cell line annotations.

HUNER_CHEMICAL([sentence_splitter])

Union of all HUNER chemical data sets.

HUNER_CHEMICAL_CDR(*args, **kwargs)

HUNER version of the CDR corpus containing chemical annotations.

HUNER_CHEMICAL_CEMP(*args, **kwargs)

HUNER version of the CEMP corpus containing chemical annotations.

HUNER_CHEMICAL_CHEBI(*args, **kwargs)

HUNER version of the CHEBI corpus containing chemical annotations.

HUNER_CHEMICAL_CHEMDNER(*args, **kwargs)

HUNER version of the CHEMDNER corpus containing chemical annotations.

HUNER_CHEMICAL_CRAFT_V4(*args, **kwargs)

HUNER version of the CRAFT corpus containing (only) chemical annotations.

HUNER_CHEMICAL_SCAI(*args, **kwargs)

HUNER version of the SCAI chemicals corpus containing chemical annotations.

HUNER_DISEASE([sentence_splitter])

Union of all HUNER disease data sets.

HUNER_DISEASE_CDR(*args, **kwargs)

HUNER version of the CDR corpus containing disease annotations.

HUNER_DISEASE_MIRNA(*args, **kwargs)

HUNER version of the miRNA corpus containing disease annotations.

HUNER_DISEASE_NCBI(*args, **kwargs)

HUNER version of the NCBI corpus containing disease annotations.

HUNER_DISEASE_PDR(*args, **kwargs)

PDR Dataset with only Disease annotations.

HUNER_DISEASE_SCAI(*args, **kwargs)

HUNER version of the SCAI disease corpus containing disease annotations.

HUNER_DISEASE_VARIOME(*args, **kwargs)

HUNER version of the Variome corpus containing disease annotations.

HUNER_GENE([sentence_splitter])

Union of all HUNER gene data sets.

HUNER_GENE_BC2GM(*args, **kwargs)

HUNER version of the BioCreative-II-GM corpus containing gene annotations.

HUNER_GENE_BIO_INFER(*args, **kwargs)

HUNER version of the BioInfer corpus containing only gene/protein annotations.

HUNER_GENE_CELL_FINDER(*args, **kwargs)

HUNER version of the CellFinder corpus containing only gene annotations.

HUNER_GENE_CHEBI(*args, **kwargs)

HUNER version of the CHEBI corpus containing gene annotations.

HUNER_GENE_CRAFT_V4(*args, **kwargs)

HUNER version of the CRAFT corpus containing (only) gene annotations.

HUNER_GENE_DECA(*args, **kwargs)

HUNER version of the DECA corpus containing gene annotations.

HUNER_GENE_FSU(*args, **kwargs)

HUNER version of the FSU corpus containing (only) gene annotations.

HUNER_GENE_GPRO(*args, **kwargs)

HUNER version of the GPRO corpus containing gene annotations.

HUNER_GENE_IEPA(*args, **kwargs)

HUNER version of the IEPA corpus containing gene annotations.

HUNER_GENE_JNLPBA(*args, **kwargs)

HUNER version of the JNLPBA corpus containing gene annotations.

HUNER_GENE_LOCTEXT(*args, **kwargs)

HUNER version of the Loctext corpus containing protein annotations.

HUNER_GENE_MIRNA(*args, **kwargs)

HUNER version of the miRNA corpus containing protein / gene annotations.

HUNER_GENE_OSIRIS(*args, **kwargs)

HUNER version of the OSIRIS corpus containing (only) gene annotations.

HUNER_GENE_VARIOME(*args, **kwargs)

HUNER version of the Variome corpus containing gene annotations.

HUNER_SPECIES([sentence_splitter])

Union of all HUNER species data sets.

HUNER_SPECIES_CELL_FINDER(*args, **kwargs)

HUNER version of the CellFinder corpus containing only species annotations.

HUNER_SPECIES_CHEBI(*args, **kwargs)

HUNER version of the CHEBI corpus containing species annotations.

HUNER_SPECIES_CRAFT_V4(*args, **kwargs)

HUNER version of the CRAFT corpus containing (only) species annotations.

HUNER_SPECIES_LINNEAUS(*args, **kwargs)

HUNER version of the LINNEAUS corpus containing species annotations.

HUNER_SPECIES_LOCTEXT(*args, **kwargs)

HUNER version of the Loctext corpus containing species annotations.

HUNER_SPECIES_MIRNA(*args, **kwargs)

HUNER version of the miRNA corpus containing species annotations.

HUNER_SPECIES_S800(*args, **kwargs)

HUNER version of the S800 corpus containing species annotations.

HUNER_SPECIES_VARIOME(*args, **kwargs)

HUNER version of the Variome corpus containing species annotations.

IEPA([base_path, in_memory])

IEPA corpus as provided by http://corpora.informatik.hu-berlin.de/.

JNLPBA([base_path, in_memory])

Original corpus of the JNLPBA shared task.

LINNEAUS([base_path, in_memory, tokenizer])

Original LINNEAUS corpus containing species annotations.

LOCTEXT([base_path, in_memory, ...])

Original LOCTEXT corpus containing species annotations.

MIRNA([base_path, in_memory, sentence_splitter])

Original miRNA corpus.

NCBI_GENE_HUMAN_DICTIONARY([base_path])

Dictionary for named entity linking on genes using the NCBI Gene ontology.

NCBI_TAXONOMY_DICTIONARY([base_path])

Dictionary for named entity linking on organisms / species using the NCBI taxonomy ontology.

CTD_DISEASES_DICTIONARY([base_path])

Dictionary for named entity linking on diseases using the Comparative Toxicogenomics Database (CTD).

CTD_CHEMICALS_DICTIONARY([base_path])

Dictionary for named entity linking on chemicals using the Comparative Toxicogenomics Database (CTD).

NCBI_DISEASE([base_path, in_memory, ...])

Original NCBI disease corpus containing disease annotations.

ONTONOTES([base_path, version, language, ...])

OSIRIS([base_path, in_memory, ...])

Original OSIRIS corpus containing variation and gene annotations.

PDR([base_path, in_memory, sentence_splitter])

Corpus of plant-disease relations.

S800([base_path, in_memory, sentence_splitter])

S800 corpus.

SCAI_CHEMICALS(*args, **kwargs)

Original SCAI chemicals corpus containing chemical annotations.

SCAI_DISEASE(*args, **kwargs)

Original SCAI disease corpus containing disease annotations.

VARIOME([base_path, in_memory, ...])

Variome corpus as provided by http://corpora.informatik.hu-berlin.de/corpora/brat2bioc/hvp_bioc.xml.zip.

AMAZON_REVIEWS([split_max, label_name_map, ...])

A very large corpus of Amazon reviews with positivity ratings.

COMMUNICATIVE_FUNCTIONS([base_path, ...])

The Communicative Functions Classification Corpus.

GERMEVAL_2018_OFFENSIVE_LANGUAGE([...])

GermEval 2018 corpus for identification of offensive language.

GLUE_COLA([label_type, base_path, tokenizer])

Corpus of Linguistic Acceptability from GLUE benchmark.

GO_EMOTIONS([base_path, tokenizer, memory_mode])

GoEmotions dataset containing 58k Reddit comments labeled with 27 emotion categories.

IMDB([base_path, rebalance_corpus, ...])

Corpus of IMDB movie reviews labeled by sentiment (POSITIVE, NEGATIVE).

NEWSGROUPS([base_path, tokenizer, memory_mode])

20 newsgroups corpus, classifying news items into one of 20 categories.

STACKOVERFLOW([base_path, tokenizer, ...])

Stackoverflow corpus classifying questions into one of 20 labels.

SENTEVAL_CR([tokenizer, memory_mode])

The customer reviews dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.

SENTEVAL_MPQA([tokenizer, memory_mode])

The opinion-polarity dataset of SentEval, classified into NEGATIVE or POSITIVE polarity.

SENTEVAL_MR([tokenizer, memory_mode])

The movie reviews dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.

SENTEVAL_SST_BINARY([tokenizer, memory_mode])

The Stanford sentiment treebank dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.

SENTEVAL_SST_GRANULAR([tokenizer, memory_mode])

The Stanford sentiment treebank dataset of SentEval, classified into 5 sentiment classes.

SENTEVAL_SUBJ([tokenizer, memory_mode])

The subjectivity dataset of SentEval, classified into SUBJECTIVE or OBJECTIVE sentiment.

SENTIMENT_140([label_name_map, tokenizer, ...])

Twitter sentiment corpus.

TREC_6([base_path, tokenizer, memory_mode])

The TREC Question Classification Corpus, classifying questions into 6 coarse-grained answer types.

TREC_50([base_path, tokenizer, memory_mode])

The TREC Question Classification Corpus, classifying questions into 50 fine-grained answer types.

WASSA_ANGER([base_path, tokenizer])

WASSA-2017 anger emotion-intensity corpus.

WASSA_FEAR([base_path, tokenizer])

WASSA-2017 fear emotion-intensity corpus.

WASSA_JOY([base_path, tokenizer])

WASSA-2017 joy emotion-intensity dataset corpus.

WASSA_SADNESS([base_path, tokenizer])

WASSA-2017 sadness emotion-intensity corpus.

YAHOO_ANSWERS([base_path, tokenizer, ...])

The YAHOO Question Classification Corpus, classifying questions into 10 coarse-grained answer types.

ClassificationCorpus(data_folder[, ...])

A classification corpus from FastText-formatted text files.

ClassificationDataset(path_to_file, label_type)

Dataset for classification instantiated from a single FastText-formatted file.

CSVClassificationCorpus(data_folder, ...[, ...])

Classification corpus instantiated from CSV data files.

CSVClassificationDataset(path_to_file, ...)

Dataset for text classification from CSV column formatted data.

NEL_ENGLISH_AIDA([base_path, in_memory, ...])

NEL_ENGLISH_AQUAINT([base_path, in_memory, ...])

NEL_ENGLISH_IITB([base_path, in_memory, ...])

NEL_ENGLISH_REDDIT([base_path, in_memory])

NEL_ENGLISH_TWEEKI([base_path, in_memory])

NEL_GERMAN_HIPE([base_path, in_memory, ...])

WSD_MASC([base_path, in_memory, columns, ...])

WSD_OMSTI([base_path, in_memory, columns, ...])

WSD_RAGANATO_ALL([base_path, in_memory, ...])

WSD_SEMCOR([base_path, in_memory, columns, ...])

WSD_TRAINOMATIC([base_path, in_memory, ...])

WSD_UFSAC([filenames, base_path, in_memory, ...])

WSD_WORDNET_GLOSS_TAGGED([base_path, ...])

RE_ENGLISH_CONLL04([base_path, in_memory])

RE_ENGLISH_DRUGPROT([base_path, in_memory, ...])

RE_ENGLISH_SEMEVAL2010([base_path, ...])

RE_ENGLISH_TACRED([base_path, in_memory])

BIOSCOPE([base_path, in_memory])

CONLL_03([base_path, column_format, in_memory])

CONLL_03_DUTCH([base_path, in_memory])

CONLL_03_GERMAN([base_path, in_memory])

CONLL_03_SPANISH([base_path, in_memory])

CLEANCONLL([base_path, in_memory])

CONLL_2000([base_path, in_memory])

FEWNERD([setting])

KEYPHRASE_INSPEC([base_path, in_memory])

KEYPHRASE_SEMEVAL2010([base_path, in_memory])

KEYPHRASE_SEMEVAL2017([base_path, in_memory])

MASAKHA_POS([languages, version, base_path, ...])

NER_ARABIC_ANER([base_path, in_memory, ...])

NER_ARABIC_AQMAR([base_path, in_memory, ...])

NER_BASQUE([base_path, in_memory])

NER_CHINESE_WEIBO([base_path, in_memory, ...])

NER_DANISH_DANE([base_path, in_memory])

NER_ENGLISH_MOVIE_COMPLEX([base_path, in_memory])

NER_ENGLISH_MOVIE_SIMPLE([base_path, in_memory])

NER_ENGLISH_PERSON([base_path, in_memory])

NER_ENGLISH_RESTAURANT([base_path, in_memory])

NER_ENGLISH_SEC_FILLINGS([base_path, in_memory])

NER_ENGLISH_STACKOVERFLOW([base_path, in_memory])

NER_ENGLISH_TWITTER([base_path, in_memory])

NER_ENGLISH_WEBPAGES([base_path, in_memory])

NER_ENGLISH_WIKIGOLD([base_path, in_memory, ...])

NER_ENGLISH_WNUT_2020([base_path, ...])

NER_FINNISH([base_path, in_memory])

NER_GERMAN_BIOFID([base_path, in_memory])

NER_GERMAN_EUROPARL([base_path, in_memory])

NER_GERMAN_GERMEVAL([base_path, in_memory])

NER_GERMAN_LEGAL([base_path, in_memory])

NER_GERMAN_MOBIE([base_path, in_memory])

NER_GERMAN_POLITICS([base_path, ...])

NER_HIPE_2022(dataset_name, language[, ...])

NER_NOISEBENCH([noise, base_path, in_memory])

NER_HUNGARIAN([base_path, in_memory, ...])

NER_ICDAR_EUROPEANA(language[, base_path, ...])

NER_ICELANDIC([base_path, in_memory])

NER_JAPANESE([base_path, in_memory])

NER_NERMUD([domains, base_path, in_memory])

NER_MASAKHANE([languages, version, ...])

NER_MULTI_WIKIANN([languages, base_path, ...])

NER_MULTI_WIKINER([languages, base_path, ...])

NER_MULTI_XTREME([languages, base_path, ...])

NER_SWEDISH([base_path, in_memory])

NER_TURKU([base_path, in_memory])

NER_UKRAINIAN([base_path, in_memory])

NER_ESTONIAN_NOISY([version, base_path, ...])

UP_CHINESE([base_path, in_memory, ...])

UP_ENGLISH([base_path, in_memory, ...])

UP_FINNISH([base_path, in_memory, ...])

UP_FRENCH([base_path, in_memory, ...])

UP_GERMAN([base_path, in_memory, ...])

UP_ITALIAN([base_path, in_memory, ...])

UP_SPANISH([base_path, in_memory, ...])

UP_SPANISH_ANCORA([base_path, in_memory, ...])

WNUT_17([base_path, in_memory])

ColumnCorpus(data_folder, column_format[, ...])

ColumnDataset(path_to_column_file, ...[, ...])

NER_MULTI_CONER([task, base_path, in_memory])

NER_MULTI_CONER_V2([task, base_path, ...])

FeideggerCorpus(**kwargs)

FeideggerDataset(dataset_info, **kwargs)

GLUE_MNLI([label_type, evaluate_on_matched, ...])

GLUE_MRPC([label_type, base_path, ...])

GLUE_QNLI([label_type, base_path, ...])

GLUE_QQP([label_type, base_path, ...])

GLUE_RTE([label_type, base_path, ...])

GLUE_WNLI([label_type, base_path, ...])

GLUE_SST2([label_type, base_path, ...])

GLUE_STSB([label_type, base_path, ...])

SUPERGLUE_RTE([base_path, ...])

DataPairCorpus(data_folder[, columns, ...])

DataPairDataset(path_to_data[, columns, ...])

DataTripleCorpus(data_folder[, columns, ...])

DataTripleDataset(path_to_data[, columns, ...])

OpusParallelCorpus(dataset, l1, l2[, ...])

ParallelTextCorpus(source_file, target_file, ...)

ParallelTextDataset(path_to_source, ...[, ...])

UD_AFRIKAANS([base_path, in_memory, ...])

UD_ANCIENT_GREEK([base_path, in_memory, ...])

UD_ARABIC([base_path, in_memory, ...])

UD_ARMENIAN([base_path, in_memory, ...])

UD_BASQUE([base_path, in_memory, ...])

UD_BAVARIAN_MAIBAAM([base_path, in_memory, ...])

UD_BELARUSIAN([base_path, in_memory, ...])

UD_BULGARIAN([base_path, in_memory, ...])

UD_BURYAT([base_path, in_memory, ...])

UD_CATALAN([base_path, in_memory, ...])

UD_CHINESE([base_path, in_memory, ...])

UD_CHINESE_KYOTO([base_path, in_memory, ...])

UD_COPTIC([base_path, in_memory, ...])

UD_CROATIAN([base_path, in_memory, ...])

UD_CZECH([base_path, in_memory, ...])

UD_DANISH([base_path, in_memory, ...])

UD_DUTCH([base_path, in_memory, ...])

UD_ENGLISH([base_path, in_memory, ...])

UD_ESTONIAN([base_path, in_memory, ...])

UD_FAROESE([base_path, in_memory, ...])

The Faroese Universal Dependencies treebank dataset.

UD_FINNISH([base_path, in_memory, ...])

UD_FRENCH([base_path, in_memory, ...])

UD_GALICIAN([base_path, in_memory, ...])

UD_GERMAN([base_path, in_memory, ...])

UD_GERMAN_HDT([base_path, in_memory, ...])

UD_GOTHIC([base_path, in_memory, ...])

UD_GREEK([base_path, in_memory, ...])

UD_HEBREW([base_path, in_memory, ...])

UD_HINDI([base_path, in_memory, ...])

UD_INDONESIAN([base_path, in_memory, ...])

UD_IRISH([base_path, in_memory, ...])

UD_ITALIAN([base_path, in_memory, ...])

UD_JAPANESE([base_path, in_memory, ...])

UD_KAZAKH([base_path, in_memory, ...])

UD_KOREAN([base_path, in_memory, ...])

UD_LATIN([base_path, in_memory, ...])

UD_LATVIAN([base_path, in_memory, ...])

UD_LITHUANIAN([base_path, in_memory, ...])

UD_LIVVI([base_path, in_memory, ...])

UD_MALTESE([base_path, in_memory, ...])

UD_MARATHI([base_path, in_memory, ...])

UD_NAIJA([base_path, in_memory, ...])

UD_NORTH_SAMI([base_path, in_memory, ...])

UD_NORWEGIAN([base_path, in_memory, ...])

UD_OLD_CHURCH_SLAVONIC([base_path, ...])

UD_OLD_FRENCH([base_path, in_memory, ...])

UD_PERSIAN([base_path, in_memory, ...])

UD_POLISH([base_path, in_memory, ...])

UD_PORTUGUESE([base_path, in_memory, ...])

UD_ROMANIAN([base_path, in_memory, ...])

UD_RUSSIAN([base_path, in_memory, ...])

UD_SERBIAN([base_path, in_memory, ...])

UD_SLOVAK([base_path, in_memory, ...])

UD_SLOVENIAN([base_path, in_memory, ...])

UD_SPANISH([base_path, in_memory, ...])

UD_SWEDISH([base_path, in_memory, ...])

UD_TURKISH([base_path, in_memory, ...])

UD_UKRAINIAN([base_path, in_memory, ...])

UD_WOLOF([base_path, in_memory, ...])

UniversalDependenciesCorpus(data_folder[, ...])

UniversalDependenciesDataset(path_to_conll_file)

ZELDA([base_path, in_memory, column_format])

class flair.datasets.DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, drop_last=False, timeout=0, worker_init_fn=None)View on GitHub#

Bases: DataLoader

dataset: Dataset[_T_co]#
batch_size: Optional[int]#
num_workers: int#
pin_memory: bool#
drop_last: bool#
timeout: float#
sampler: Union[Sampler, Iterable]#
pin_memory_device: str#
prefetch_factor: Optional[int]#
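
A minimal usage sketch, assuming flair is installed: a handful of Sentences are wrapped in a FlairDatapointDataset and iterated in mini-batches; the example texts and batch size are purely illustrative.

    from flair.data import Sentence
    from flair.datasets import DataLoader, FlairDatapointDataset

    # wrap a few Sentences in a simple dataset (illustrative examples)
    dataset = FlairDatapointDataset([Sentence("The grass is green."), Sentence("The sky is blue.")])

    loader = DataLoader(dataset, batch_size=2, shuffle=False)
    for batch in loader:
        print(len(batch), batch)  # each batch is a list of Sentence objects
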
class flair.datasets.OcrJsonDataset(path_to_split_directory, label_type='ner', in_memory=True, encoding='utf-8', load_images=False, normalize_coords_to_thousands=True, label_name_map=None)View on GitHub#

Bases: FlairDataset

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.SROIE(base_path=None, encoding='utf-8', label_type='ner', in_memory=True, load_images=False, normalize_coords_to_thousands=True, label_name_map=None, **corpusargs)View on GitHub#

Bases: OcrCorpus

class flair.datasets.FlairDatapointDataset(datapoints)View on GitHub#

Bases: FlairDataset, Generic[DT]

A simple Dataset object to wrap a List of Datapoints, for example Sentences.

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.SentenceDataset(sentences)View on GitHub#

Bases: FlairDatapointDataset

class flair.datasets.MongoDataset(query, host, port, database, collection, text_field, categories_field=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, tag_type='class')View on GitHub#

Bases: FlairDataset

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.StringDataset(texts, use_tokenizer=<flair.tokenization.SpaceTokenizer object>)View on GitHub#

Bases: FlairDataset

A Dataset taking strings as input and returning Sentences during iteration.

abstract is_in_memory()View on GitHub#
Return type:

bool
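
A short sketch of wrapping raw strings; the texts are illustrative, and passing SegtokTokenizer instead of the default SpaceTokenizer is optional.

    from flair.datasets import StringDataset
    from flair.tokenization import SegtokTokenizer

    texts = ["Berlin is the capital of Germany.", "Flair ships many ready-made datasets."]
    dataset = StringDataset(texts, use_tokenizer=SegtokTokenizer())

    print(len(dataset))  # number of wrapped strings
    print(dataset[0])    # the string is tokenized on access and returned as a Sentence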

class flair.datasets.EntityLinkingDictionary(candidates, dataset_name=None)View on GitHub#

Bases: object

Base class for downloading and reading dictionaries for entity linking.

A dictionary represents all entities of a knowledge base and their associated ids.

property database_name: str#

Name of the database represented by the dictionary.

property text_to_index: dict[str, list[str]]#
property candidates: list[EntityCandidate]#
to_in_memory_dictionary()View on GitHub#
Return type:

InMemoryEntityLinkingDictionary

class flair.datasets.AGNEWS(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='partial', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The AG’s News Topic Classification Corpus, classifying news into 4 coarse-grained topics.

Labels: World, Sports, Business, Sci/Tech.
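
A minimal loading sketch; the corpus is downloaded automatically on first use.

    from flair.datasets import AGNEWS

    corpus = AGNEWS()
    print(corpus)           # number of train/dev/test sentences
    print(corpus.train[0])  # first training Sentence with its topic label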

class flair.datasets.ANAT_EM(base_path=None, in_memory=True, tokenizer=None)View on GitHub#

Bases: ColumnCorpus

Corpus for anatomical named entity mention recognition.

For further information see Pyysalo and Ananiadou: Anatomical entity mention recognition at literature scale https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3957068/ http://nactem.ac.uk/anatomytagger/#AnatEM

Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

abstract static download_corpus(data_folder)View on GitHub#
static parse_input_files(input_dir, sentence_separator)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.AZDZ(base_path=None, in_memory=True, tokenizer=None)View on GitHub#

Bases: ColumnCorpus

Arizona Disease Corpus from the Biomedical Informatics Lab at Arizona State University.

For further information see: http://diego.asu.edu/index.php

classmethod download_corpus(data_dir)View on GitHub#
Return type:

Path

static parse_corpus(input_file)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.BC2GM(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original BioCreative-II-GM corpus containing gene annotations.

For further information see Smith et al.: Overview of BioCreative II gene mention recognition https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2559986/

Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

static download_dataset(data_dir)View on GitHub#
Return type:

Path

classmethod parse_train_dataset(data_folder)View on GitHub#
Return type:

InternalBioNerDataset

classmethod parse_test_dataset(data_folder)View on GitHub#
Return type:

InternalBioNerDataset

static parse_dataset(text_file, ann_file)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.BIO_INFER(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

Original BioInfer corpus.

For further information see Pyysalo et al.:

BioInfer: a corpus for information extraction in the biomedical domain https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-50

classmethod download_dataset(data_dir)View on GitHub#
Return type:

Path

classmethod parse_dataset(original_file)View on GitHub#
class flair.datasets.BIOBERT_CHEMICAL_BC4CHEMD(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

BC4CHEMD corpus with chemical annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.BIOBERT_CHEMICAL_BC5CDR(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

BC5CDR corpus with chemical annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.BIOBERT_DISEASE_BC5CDR(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

BC5CDR corpus with disease annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.BIOBERT_DISEASE_NCBI(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

NCBI disease corpus as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.BIOBERT_GENE_BC2GM(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

BC2GM corpus with gene annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.BIOBERT_GENE_JNLPBA(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

JNLPBA corpus with gene annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.BIOBERT_SPECIES_LINNAEUS(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

LINNAEUS corpus with species annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.BIOBERT_SPECIES_S800(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

S800 corpus with species annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.BIONLP2013_CG(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: BioNLPCorpus

Corpus of the BioNLP’2013 Cancer Genetics shared task.

For further information see Pyysalo, Ohta & Ananiadou 2013 Overview of the Cancer Genetics (CG) task of BioNLP Shared Task 2013 https://www.aclweb.org/anthology/W13-2008/

Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

static download_corpus(download_folder)View on GitHub#
Return type:

tuple[Path, Path, Path]

class flair.datasets.BIONLP2013_PC(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: BioNLPCorpus

Corpus of the BioNLP’2013 Pathway Curation shared task.

For further information see Ohta et al. Overview of the pathway curation (PC) task of bioNLP shared task 2013. https://www.aclweb.org/anthology/W13-2009/

Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

static download_corpus(download_folder)View on GitHub#
Return type:

tuple[Path, Path, Path]

class flair.datasets.BIOSEMANTICS(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original Biosemantics corpus.

For further information see Akhondi et al.: Annotated chemical patent corpus: a gold standard for text mining https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4182036/

static download_dataset(data_dir)View on GitHub#
Return type:

Path

static parse_dataset(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.CDR(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

CDR corpus as provided by JHnlp/BioCreative-V-CDR-Corpus.

For further information see Li et al.: BioCreative V CDR task corpus: a resource for chemical disease relation extraction https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4860626/

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

static download_dataset(data_dir)View on GitHub#
class flair.datasets.CELL_FINDER(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original CellFinder corpus containing cell line, species and gene annotations.

For further information see Neves et al.: Annotating and evaluating text for stem cell research https://pdfs.semanticscholar.org/38e3/75aeeeb1937d03c3c80128a70d8e7a74441f.pdf

classmethod download_and_prepare(data_folder)View on GitHub#
Return type:

InternalBioNerDataset

classmethod read_folder(data_folder)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.CEMP(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original CEMP corpus containing chemical annotations.

For further information see: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/cemp-detailed-task-description/

classmethod download_train_corpus(data_dir)View on GitHub#
Return type:

Path

classmethod download_dev_corpus(data_dir)View on GitHub#
Return type:

Path

static parse_input_file(text_file, ann_file)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.CHEMDNER(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original corpus of the CHEMDNER shared task.

For further information see Krallinger et al.: The CHEMDNER corpus of chemicals and drugs and its annotation principles https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-7-S1-S2

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

static download_dataset(data_dir)View on GitHub#
class flair.datasets.CLL(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

Original CLL corpus containing cell line annotations.

For further information, see Kaewphan et al.: Cell line name recognition in support of the identification of synthetic lethality in cancer from text https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4708107/

class flair.datasets.CRAFT(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original CRAFT corpus (version 2.0) containing all but the coreference and sections/typography annotations.

For further information see Bada et al.: Concept annotation in the craft corpus https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-161

classmethod download_corpus(data_dir)View on GitHub#
Return type:

Path

static parse_corpus(corpus_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.CRAFT_V4(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Version 4.0.1 of the CRAFT corpus containing all but the co-reference and structural annotations.

For further information see: UCDenver-ccp/CRAFT

filter_entities(corpus)View on GitHub#
Return type:

InternalBioNerDataset

classmethod download_corpus(data_dir)View on GitHub#
Return type:

Path

static prepare_splits(data_dir, corpus)View on GitHub#
Return type:

tuple[InternalBioNerDataset, InternalBioNerDataset, InternalBioNerDataset]

static parse_corpus(corpus_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.DECA(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original DECA corpus containing gene annotations.

For further information see Wang et al.: Disambiguating the species of biomedical named entities using natural language parsers https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2828111/

classmethod download_corpus(data_dir)View on GitHub#
Return type:

Path

static parse_corpus(text_dir, gold_file)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.FSU(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

Original FSU corpus containing protein and derived annotations.

For further information see Hahn et al.: A proposal for a configurable silver standard https://www.aclweb.org/anthology/W10-1838/

classmethod download_corpus(data_dir)View on GitHub#
Return type:

Path

static parse_corpus(corpus_dir, sentence_separator)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.GELLUS(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

Original Gellus corpus containing cell line annotations.

For further information, see Kaewphan et al.: Cell line name recognition in support of the identification of synthetic lethality in cancer from text https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4708107/

class flair.datasets.GPRO(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original GPRO corpus containing gene annotations.

For further information see: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/gpro-detailed-task-description/

classmethod download_train_corpus(data_dir)View on GitHub#
Return type:

Path

classmethod download_dev_corpus(data_dir)View on GitHub#
Return type:

Path

static parse_input_file(text_file, ann_file)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.HunerEntityLinkingDictionary(path, dataset_name)View on GitHub#

Bases: EntityLinkingDictionary

Base dictionary with data already in huner format.

Every line in the file must be formatted as follows:

concept_id||concept_name

If multiple names are associated with a given concept id, they have to be separated by a |, e.g.

7157||TP53|tumor protein p53
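
A small loading sketch; the path and dataset name below are placeholders for a file in the format shown above.

    from flair.datasets import HunerEntityLinkingDictionary

    dictionary = HunerEntityLinkingDictionary(path="genes.dict", dataset_name="my_gene_dictionary")
    print(len(dictionary.candidates))  # all entity candidates read from the file
    print(dictionary.candidates[0])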

class flair.datasets.HUNER_CELL_LINE(sentence_splitter=None)View on GitHub#

Bases: HunerMultiCorpus

Union of all HUNER cell line data sets.
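
A loading sketch; note that this downloads and merges several corpora on first use, and the label type name "ner" is an assumption that holds for the HUNER column corpora in recent flair versions.

    from flair.datasets import HUNER_CELL_LINE

    corpus = HUNER_CELL_LINE()
    print(corpus)
    tag_dictionary = corpus.make_label_dictionary(label_type="ner")  # "ner" is an assumption
    print(tag_dictionary)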

class flair.datasets.HUNER_CELL_LINE_CELL_FINDER(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the CellFinder corpus containing only cell line annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.HUNER_CELL_LINE_CLL(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the CLL corpus containing cell line annotations.

static split_url()View on GitHub#
Return type:

str

get_corpus_sentence_splitter()View on GitHub#

Return the pre-defined sentence splitter if defined, otherwise return None.

Return type:

SentenceSplitter

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.HUNER_CELL_LINE_GELLUS(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the Gellus corpus containing cell line annotations.

static split_url()View on GitHub#
Return type:

str

get_corpus_sentence_splitter()View on GitHub#

Return the pre-defined sentence splitter if defined, otherwise return None.

Return type:

SentenceSplitter

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.HUNER_CELL_LINE_JNLPBA(*args, **kwargs)View on GitHub#

Bases: HUNER_JNLPBA

HUNER version of the JNLPBA corpus containing cell line annotations.

class flair.datasets.HUNER_CHEMICAL(sentence_splitter=None)View on GitHub#

Bases: HunerMultiCorpus

Union of all HUNER chemical data sets.

class flair.datasets.HUNER_CHEMICAL_CDR(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the CDR corpus containing chemical annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_CHEMICAL_CEMP(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the CEMP corpus containing chemical annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_CHEMICAL_CHEBI(*args, **kwargs)View on GitHub#

Bases: HUNER_CHEBI

HUNER version of the CHEBI corpus containing chemical annotations.

class flair.datasets.HUNER_CHEMICAL_CHEMDNER(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the CHEMDNER corpus containing chemical annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.HUNER_CHEMICAL_CRAFT_V4(*args, **kwargs)View on GitHub#

Bases: HUNER_CRAFT_V4

HUNER version of the CRAFT corpus containing (only) chemical annotations.

class flair.datasets.HUNER_CHEMICAL_SCAI(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the SCAI chemicals corpus containing chemical annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_DISEASE(sentence_splitter=None)View on GitHub#

Bases: HunerMultiCorpus

Union of all HUNER disease data sets.

class flair.datasets.HUNER_DISEASE_CDR(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the CDR corpus containing disease annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_DISEASE_MIRNA(*args, **kwargs)View on GitHub#

Bases: HUNER_MIRNA

HUNER version of the miRNA corpus containing disease annotations.

class flair.datasets.HUNER_DISEASE_NCBI(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the NCBI corpus containing disease annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_DISEASE_PDR(*args, **kwargs)View on GitHub#

Bases: HunerDataset

PDR Dataset with only Disease annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_DISEASE_SCAI(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the SCAI disease corpus containing disease annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_DISEASE_VARIOME(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the Variome corpus containing disease annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_GENE(sentence_splitter=None)View on GitHub#

Bases: HunerMultiCorpus

Union of all HUNER gene data sets.

class flair.datasets.HUNER_GENE_BC2GM(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the BioCreative-II-GM corpus containing gene annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.HUNER_GENE_BIO_INFER(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the BioInfer corpus containing only gene/protein annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_GENE_CELL_FINDER(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the CellFinder corpus containing only gene annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.HUNER_GENE_CHEBI(*args, **kwargs)View on GitHub#

Bases: HUNER_CHEBI

HUNER version of the CHEBI corpus containing gene annotations.

class flair.datasets.HUNER_GENE_CRAFT_V4(*args, **kwargs)View on GitHub#

Bases: HUNER_CRAFT_V4

HUNER version of the CRAFT corpus containing (only) gene annotations.

class flair.datasets.HUNER_GENE_DECA(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the DECA corpus containing gene annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_GENE_FSU(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the FSU corpus containing (only) gene annotations.

static split_url()View on GitHub#
Return type:

str

get_corpus_sentence_splitter()View on GitHub#

Return the pre-defined sentence splitter if defined, otherwise return None.

Return type:

SentenceSplitter

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_GENE_GPRO(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the GPRO corpus containing gene annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_GENE_IEPA(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the IEPA corpus containing gene annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.HUNER_GENE_JNLPBA(*args, **kwargs)View on GitHub#

Bases: HUNER_JNLPBA

HUNER version of the JNLPBA corpus containing gene annotations.

class flair.datasets.HUNER_GENE_LOCTEXT(*args, **kwargs)View on GitHub#

Bases: HUNER_LOCTEXT

HUNER version of the Loctext corpus containing protein annotations.

class flair.datasets.HUNER_GENE_MIRNA(*args, **kwargs)View on GitHub#

Bases: HUNER_MIRNA

HUNER version of the miRNA corpus containing protein / gene annotations.

class flair.datasets.HUNER_GENE_OSIRIS(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the OSIRIS corpus containing (only) gene annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_GENE_VARIOME(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the Variome corpus containing gene annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_SPECIES(sentence_splitter=None)View on GitHub#

Bases: HunerMultiCorpus

Union of all HUNER species data sets.

class flair.datasets.HUNER_SPECIES_CELL_FINDER(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the CellFinder corpus containing only species annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.HUNER_SPECIES_CHEBI(*args, **kwargs)View on GitHub#

Bases: HUNER_CHEBI

HUNER version of the CHEBI corpus containing species annotations.

class flair.datasets.HUNER_SPECIES_CRAFT_V4(*args, **kwargs)View on GitHub#

Bases: HUNER_CRAFT_V4

HUNER version of the CRAFT corpus containing (only) species annotations.

class flair.datasets.HUNER_SPECIES_LINNEAUS(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the LINNEAUS corpus containing species annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_SPECIES_LOCTEXT(*args, **kwargs)View on GitHub#

Bases: HUNER_LOCTEXT

HUNER version of the Loctext corpus containing species annotations.

class flair.datasets.HUNER_SPECIES_MIRNA(*args, **kwargs)View on GitHub#

Bases: HUNER_MIRNA

HUNER version of the miRNA corpus containing species annotations.

class flair.datasets.HUNER_SPECIES_S800(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the S800 corpus containing species annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_SPECIES_VARIOME(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the Variome corpus containing species annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.IEPA(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

IEPA corpus as provided by http://corpora.informatik.hu-berlin.de/.

For further information see Ding, Berleant, Nettleton, Wurtele: Mining MEDLINE: abstracts, sentences, or phrases? https://www.ncbi.nlm.nih.gov/pubmed/11928487

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

static download_dataset(data_dir)View on GitHub#
classmethod parse_dataset(original_file)View on GitHub#
class flair.datasets.JNLPBA(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

Original corpus of the JNLPBA shared task.

For further information see Kim et al.: Introduction to the Bio- Entity Recognition Task at JNLPBA https://www.aclweb.org/anthology/W04-1213.pdf

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

class flair.datasets.LINNEAUS(base_path=None, in_memory=True, tokenizer=None)View on GitHub#

Bases: ColumnCorpus

Original LINNEAUS corpus containing species annotations.

For further information see Gerner et al.:

LINNAEUS: a species name identification system for biomedical literature https://www.ncbi.nlm.nih.gov/pubmed/20149233

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

static download_and_parse_dataset(data_dir)View on GitHub#
class flair.datasets.LOCTEXT(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original LOCTEXT corpus containing species annotations.

For further information see Cejuela et al.:

LocText: relation extraction of protein localizations to assist database curation https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2021-9

static download_dataset(data_dir)View on GitHub#
static parse_dataset(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.MIRNA(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original miRNA corpus.

For further information see Bagewadi et al.: Detecting miRNA Mentions and Relations in Biomedical Literature https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4602280/

classmethod download_and_prepare_train(data_folder, sentence_separator)View on GitHub#
classmethod download_and_prepare_test(data_folder, sentence_separator)View on GitHub#
classmethod parse_file(input_file, split, sentence_separator)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.NCBI_GENE_HUMAN_DICTIONARY(base_path=None)View on GitHub#

Bases: EntityLinkingDictionary

Dictionary for named entity linking on genes using the NCBI Gene ontology.

Note that this dictionary only represents human genes; genes from other species aren't included!

Further information can be found at https://www.ncbi.nlm.nih.gov/gene/

download_dictionary(data_dir)View on GitHub#
Return type:

Path

parse_dictionary(original_file)View on GitHub#
Return type:

Iterator[EntityCandidate]

class flair.datasets.NCBI_TAXONOMY_DICTIONARY(base_path=None)View on GitHub#

Bases: EntityLinkingDictionary

Dictionary for named entity linking on organisms / species using the NCBI taxonomy ontology.

Further information about the ontology can be found at https://www.ncbi.nlm.nih.gov/taxonomy

download_dictionary(data_dir)View on GitHub#
Return type:

Path

parse_dictionary(original_file)View on GitHub#
Return type:

Iterator[EntityCandidate]

class flair.datasets.CTD_DISEASES_DICTIONARY(base_path=None)View on GitHub#

Bases: EntityLinkingDictionary

Dictionary for named entity linking on diseases using the Comparative Toxicogenomics Database (CTD).

Further information can be found at https://ctdbase.org/

download_dictionary(data_dir)View on GitHub#
Return type:

Path

parse_file(original_file)View on GitHub#
Return type:

Iterator[EntityCandidate]
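
A minimal sketch of reading the dictionary and inspecting its candidates; the download happens on first use.

    from flair.datasets import CTD_DISEASES_DICTIONARY

    dictionary = CTD_DISEASES_DICTIONARY()
    print(dictionary.database_name)
    print(dictionary.candidates[0])  # a single EntityCandidate from the dictionary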

class flair.datasets.CTD_CHEMICALS_DICTIONARY(base_path=None)View on GitHub#

Bases: EntityLinkingDictionary

Dictionary for named entity linking on chemicals using the Comparative Toxicogenomics Database (CTD).

Further information can be found at https://ctdbase.org/

download_dictionary(data_dir)View on GitHub#
Return type:

Path

parse_file(original_file)View on GitHub#
Return type:

Iterator[EntityCandidate]

class flair.datasets.NCBI_DISEASE(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original NCBI disease corpus containing disease annotations.

For further information see Dogan et al.: NCBI disease corpus: a resource for disease name recognition and concept normalization https://www.ncbi.nlm.nih.gov/pubmed/24393765

classmethod download_corpus(data_dir)View on GitHub#
Return type:

Path

static patch_training_file(orig_train_file, patched_file)View on GitHub#
static parse_input_file(input_file)View on GitHub#
class flair.datasets.ONTONOTES(base_path=None, version='v4', language='english', domain=None, in_memory=True, **corpusargs)View on GitHub#

Bases: MultiFileColumnCorpus

archive_url = 'https://data.mendeley.com/public-files/datasets/zmycy7t9h9/files/b078e1c4-f7a4-4427-be7f-9389967831ef/file_downloaded'#
classmethod get_available_domains(base_path=None, version='v4', language='english', split='train')View on GitHub#
Return type:

list[str]

classmethod dataset_document_iterator(file_path)View on GitHub#

An iterator over CONLL formatted files which yields documents, regardless of the number of document annotations in a particular file.

This is useful for CONLL data that has been preprocessed, for example by the preprocessing that takes place for the 2012 CONLL coreference resolution task.

Return type:

Iterator[list[dict]]

classmethod sentence_iterator(file_path)View on GitHub#

An iterator over the sentences in an individual CONLL formatted file.

Return type:

Iterator

name: str#
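
A loading sketch; the version and language values mirror the defaults, and the data is fetched from the archive URL above on first use.

    from flair.datasets import ONTONOTES

    print(ONTONOTES.get_available_domains(version="v4", language="english"))
    corpus = ONTONOTES(version="v4", language="english")
    print(corpus)
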
class flair.datasets.OSIRIS(base_path=None, in_memory=True, sentence_splitter=None, load_original_unfixed_annotation=False)View on GitHub#

Bases: ColumnCorpus

Original OSIRIS corpus containing variation and gene annotations.

For further information see Furlong et al.: Osiris v1.2: a named entity recognition system for sequence variants of genes in biomedical literature https://www.ncbi.nlm.nih.gov/pubmed/18251998

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

classmethod download_dataset(data_dir)View on GitHub#
Return type:

Path

classmethod parse_dataset(corpus_folder, fix_annotation=True)View on GitHub#
class flair.datasets.PDR(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Corpus of plant-disease relations.

For further information see Kim et al.: A corpus of plant-disease relations in the biomedical domain https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0221582 http://gcancer.org/pdr/

Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

classmethod download_corpus(data_dir)View on GitHub#
Return type:

Path

class flair.datasets.S800(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

S800 corpus.

For further information see Pafilis et al.: The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0065390.

static download_dataset(data_dir)View on GitHub#
static parse_dataset(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.SCAI_CHEMICALS(*args, **kwargs)View on GitHub#

Bases: ScaiCorpus

Original SCAI chemicals corpus containing chemical annotations.

For further information see Kolářik et al.: Chemical Names: Terminological Resources and Corpora Annotation https://pub.uni-bielefeld.de/record/2603498

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

download_corpus(data_dir)View on GitHub#
Return type:

Path

static perform_corpus_download(data_dir)View on GitHub#
Return type:

Path

class flair.datasets.SCAI_DISEASE(*args, **kwargs)View on GitHub#

Bases: ScaiCorpus

Original SCAI disease corpus containing disease annotations.

For further information see Gurulingappa et al.: An Empirical Evaluation of Resources for the Identification of Diseases and Adverse Effects in Biomedical Literature https://pub.uni-bielefeld.de/record/2603398

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

download_corpus(data_dir)View on GitHub#
Return type:

Path

static perform_corpus_download(data_dir)View on GitHub#
Return type:

Path

class flair.datasets.VARIOME(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Variome corpus as provided by http://corpora.informatik.hu-berlin.de/corpora/brat2bioc/hvp_bioc.xml.zip.

For further information see Verspoor et al.: Annotating the biomedical literature for the human variome https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3676157/

static download_dataset(data_dir)View on GitHub#
static parse_corpus(corpus_xml)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.AMAZON_REVIEWS(split_max=30000, label_name_map={'1.0': 'NEGATIVE', '2.0': 'NEGATIVE', '3.0': 'NEGATIVE', '4.0': 'POSITIVE', '5.0': 'POSITIVE'}, skip_labels=['3.0', '4.0'], fraction_of_5_star_reviews=10, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

A very large corpus of Amazon reviews with positivity ratings.

The corpus is downloaded from and documented at https://nijianmo.github.io/amazon/index.html. We download the 5-core subset, which still contains tens of millions of reviews.

download_and_prepare_amazon_product_file(data_folder, part_name, max_data_points=None, fraction_of_5_star_reviews=None)View on GitHub#
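
A down-sampled loading sketch; the split_max and fraction_of_5_star_reviews values below are illustrative.

    from flair.datasets import AMAZON_REVIEWS

    corpus = AMAZON_REVIEWS(split_max=5000, fraction_of_5_star_reviews=10)
    print(corpus)
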
class flair.datasets.COMMUNICATIVE_FUNCTIONS(base_path=None, memory_mode='full', tokenizer=<flair.tokenization.SpaceTokenizer object>, **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The Communicative Functions Classification Corpus.

Classifying sentences from scientific papers into 39 communicative functions.

class flair.datasets.GERMEVAL_2018_OFFENSIVE_LANGUAGE(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='full', fine_grained_classes=False, **corpusargs)View on GitHub#

Bases: ClassificationCorpus

GermEval 2018 corpus for identification of offensive language.

Classifying German tweets into 2 coarse-grained categories (OFFENSIVE and OTHER) or 4 fine-grained categories (ABUSE, INSULT, PROFANITY and OTHER).
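
A minimal sketch (an illustration, not from the original documentation) of how the fine_grained_classes flag from the signature above switches between the two label sets:

from flair.datasets import GERMEVAL_2018_OFFENSIVE_LANGUAGE

coarse_corpus = GERMEVAL_2018_OFFENSIVE_LANGUAGE()                         # OFFENSIVE / OTHER
fine_corpus = GERMEVAL_2018_OFFENSIVE_LANGUAGE(fine_grained_classes=True)  # ABUSE, INSULT, PROFANITY, OTHER

print(fine_corpus.train[0])  # a tweet Sentence with its fine-grained label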

class flair.datasets.GLUE_COLA(label_type='acceptability', base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)View on GitHub#

Bases: ClassificationCorpus

Corpus of Linguistic Acceptability from GLUE benchmark.

see https://gluebenchmark.com/tasks

The task is to predict whether an English sentence is grammatically correct. In addition to the regular corpus splits, an eval_dataset containing the unlabeled test data is provided for GLUE evaluation.

tsv_from_eval_dataset(folder_path)View on GitHub#

Create eval prediction file.

This function creates a tsv file with the predictions for eval_dataset (after calling classifier.predict(corpus.eval_dataset, label_name='acceptability')). The resulting file is called CoLA.tsv and is in the format required for submission to the GLUE benchmark.
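
The workflow described above as a hedged sketch (the model path is a placeholder, not a real resource):

from flair.datasets import GLUE_COLA
from flair.models import TextClassifier

corpus = GLUE_COLA()
classifier = TextClassifier.load("path/to/trained-cola-model.pt")  # hypothetical local model

# predict acceptability labels for the unlabeled GLUE test split
classifier.predict(corpus.eval_dataset, label_name="acceptability")

# writes CoLA.tsv into the given folder, in the GLUE submission format
corpus.tsv_from_eval_dataset("glue_submission/")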

class flair.datasets.GO_EMOTIONS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

GoEmotions dataset containing 58k Reddit comments labeled with 27 emotion categories.

see google-research/google-research

class flair.datasets.IMDB(base_path=None, rebalance_corpus=True, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

Corpus of IMDB movie reviews labeled by sentiment (POSITIVE, NEGATIVE).

Downloaded from and documented at http://ai.stanford.edu/~amaas/data/sentiment/.

class flair.datasets.NEWSGROUPS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

20 newsgroups corpus, classifying news items into one of 20 categories.

Downloaded from http://qwone.com/~jason/20Newsgroups

Each data point is a full news article, so documents may be very long.

class flair.datasets.STACKOVERFLOW(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

Stackoverflow corpus classifying questions into one of 20 labels.

The data is downloaded from the jacoxu/StackOverflow repository.

Each data point is a question.

class flair.datasets.SENTEVAL_CR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The customer reviews dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.

see facebookresearch/SentEval

class flair.datasets.SENTEVAL_MPQA(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The opinion-polarity dataset of SentEval, classified into NEGATIVE or POSITIVE polarity.

see facebookresearch/SentEval

class flair.datasets.SENTEVAL_MR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The movie reviews dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.

see facebookresearch/SentEval

class flair.datasets.SENTEVAL_SST_BINARY(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The Stanford sentiment treebank dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.

see facebookresearch/SentEval

class flair.datasets.SENTEVAL_SST_GRANULAR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The Stanford sentiment treebank dataset of SentEval, classified into 5 sentiment classes.

see facebookresearch/SentEval

class flair.datasets.SENTEVAL_SUBJ(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The subjectivity dataset of SentEval, classified as SUBJECTIVE or OBJECTIVE.

see facebookresearch/SentEval

class flair.datasets.SENTIMENT_140(label_name_map=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

Twitter sentiment corpus.

See http://help.sentiment140.com/for-students

The training data contains two sentiment labels (POSITIVE, NEGATIVE); the test data contains three (POSITIVE, NEGATIVE, NEUTRAL).

class flair.datasets.TREC_6(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The TREC Question Classification Corpus, classifying questions into 6 coarse-grained answer types.

class flair.datasets.TREC_50(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The TREC Question Classification Corpus, classifying questions into 50 fine-grained answer types.

class flair.datasets.WASSA_ANGER(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)View on GitHub#

Bases: ClassificationCorpus

WASSA-2017 anger emotion-intensity corpus.

see https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.

class flair.datasets.WASSA_FEAR(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)View on GitHub#

Bases: ClassificationCorpus

WASSA-2017 fear emotion-intensity corpus.

see https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.

class flair.datasets.WASSA_JOY(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)View on GitHub#

Bases: ClassificationCorpus

WASSA-2017 joy emotion-intensity corpus.

see https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html

class flair.datasets.WASSA_SADNESS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)View on GitHub#

Bases: ClassificationCorpus

WASSA-2017 sadness emotion-intensity corpus.

see https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.

class flair.datasets.YAHOO_ANSWERS(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='partial', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The YAHOO Question Classification Corpus, classifying questions into 10 coarse-grained answer types.

class flair.datasets.ClassificationCorpus(data_folder, label_type='class', train_file=None, test_file=None, dev_file=None, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, sample_missing_splits=True, encoding='utf-8')View on GitHub#

Bases: Corpus

A classification corpus from FastText-formatted text files.
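
For illustration (folder, file names and label name are assumptions): each line of a FastText-formatted file starts with one or more __label__<LABEL> prefixes followed by the text, e.g. "__label__POSITIVE I really enjoyed this film ." A folder of such files can be loaded as follows:

from flair.datasets import ClassificationCorpus

corpus = ClassificationCorpus(
    "path/to/data_folder",   # hypothetical folder containing the files below
    label_type="sentiment",
    train_file="train.txt",
    dev_file="dev.txt",
    test_file="test.txt",
)
label_dictionary = corpus.make_label_dictionary(label_type="sentiment")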

class flair.datasets.ClassificationDataset(path_to_file, label_type, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, encoding='utf-8')View on GitHub#

Bases: FlairDataset

Dataset for classification instantiated from a single FastText-formatted file.

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.CSVClassificationCorpus(data_folder, column_name_map, label_type, name='csv_corpus', train_file=None, test_file=None, dev_file=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=False, skip_header=False, encoding='utf-8', no_class_label=None, sample_missing_splits=True, **fmtparams)View on GitHub#

Bases: Corpus

Classification corpus instantiated from CSV data files.
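
An illustrative sketch (folder, column layout and label name are assumptions): column_name_map maps CSV column indices to their roles, and additional csv.reader options are forwarded via **fmtparams:

from flair.datasets import CSVClassificationCorpus

corpus = CSVClassificationCorpus(
    "path/to/csv_folder",                     # hypothetical folder with train/dev/test CSV files
    column_name_map={0: "text", 1: "label"},  # column 0 holds the text, column 1 the label
    label_type="sentiment",
    skip_header=True,
    delimiter=",",                            # forwarded to csv.reader via **fmtparams
)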

class flair.datasets.CSVClassificationDataset(path_to_file, column_name_map, label_type, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, skip_header=False, encoding='utf-8', no_class_label=None, **fmtparams)View on GitHub#

Bases: FlairDataset

Dataset for text classification from CSV column formatted data.

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.NEL_ENGLISH_AIDA(base_path=None, in_memory=True, use_ids_and_check_existence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NEL_ENGLISH_AQUAINT(base_path=None, in_memory=True, agreement_threshold=0.5, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NEL_ENGLISH_IITB(base_path=None, in_memory=True, ignore_disagreements=False, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NEL_ENGLISH_REDDIT(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NEL_ENGLISH_TWEEKI(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NEL_GERMAN_HIPE(base_path=None, in_memory=True, wiki_language='dewiki', **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.WSD_MASC(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.WSD_OMSTI(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.WSD_RAGANATO_ALL(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.WSD_SEMCOR(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.WSD_TRAINOMATIC(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, use_raganato_ALL_as_test_data=False)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.WSD_UFSAC(filenames=['masc', 'semcor'], base_path=None, in_memory=True, cut_multisense=True, columns={0: 'text', 3: 'sense'}, banned_sentences=None, sample_missing_splits_in_multicorpus=True, sample_missing_splits_in_each_corpus=True, use_raganato_ALL_as_test_data=False, name='multicorpus')View on GitHub#

Bases: MultiCorpus

class flair.datasets.WSD_WORDNET_GLOSS_TAGGED(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, use_raganato_ALL_as_test_data=False)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.RE_ENGLISH_CONLL04(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

convert_to_conllu(source_data_folder, data_folder)View on GitHub#
class flair.datasets.RE_ENGLISH_DRUGPROT(base_path=None, in_memory=True, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)View on GitHub#

Bases: ColumnCorpus

extract_and_convert_to_conllu(data_file, data_folder)View on GitHub#
char_spans_to_token_spans(char_spans, token_offsets)View on GitHub#
has_overlap(a, b)View on GitHub#
drugprot_document_to_tokenlists(pmid, title_sentences, abstract_sentences, abstract_offset, entities, relations)View on GitHub#
Return type:

list[TokenList]

class flair.datasets.RE_ENGLISH_SEMEVAL2010(base_path=None, in_memory=True, augment_train=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

extract_and_convert_to_conllu(data_file, data_folder, augment_train)View on GitHub#
class flair.datasets.RE_ENGLISH_TACRED(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

extract_and_convert_to_conllu(data_file, data_folder)View on GitHub#
class flair.datasets.BIOSCOPE(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.CONLL_03(base_path=None, column_format={0: 'text', 1: 'pos', 3: 'ner'}, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.CONLL_03_DUTCH(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.CONLL_03_GERMAN(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.CONLL_03_SPANISH(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.CLEANCONLL(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

static download_and_prepare_data(data_folder)View on GitHub#
class flair.datasets.CONLL_2000(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.FEWNERD(setting='supervised', **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.KEYPHRASE_INSPEC(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.KEYPHRASE_SEMEVAL2010(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.KEYPHRASE_SEMEVAL2017(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.MASAKHA_POS(languages='bam', version='v1', base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: MultiCorpus

class flair.datasets.NER_ARABIC_ANER(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ARABIC_AQMAR(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_BASQUE(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_CHINESE_WEIBO(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_DANISH_DANE(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ENGLISH_MOVIE_COMPLEX(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ENGLISH_MOVIE_SIMPLE(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ENGLISH_PERSON(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ENGLISH_RESTAURANT(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ENGLISH_SEC_FILLINGS(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ENGLISH_STACKOVERFLOW(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ENGLISH_TWITTER(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ENGLISH_WEBPAGES(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ENGLISH_WIKIGOLD(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ENGLISH_WNUT_2020(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_FINNISH(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_GERMAN_BIOFID(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_GERMAN_EUROPARL(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_GERMAN_GERMEVAL(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_GERMAN_MOBIE(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_GERMAN_POLITICS(base_path=None, column_delimiter='\\s+', in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_HIPE_2022(dataset_name, language, base_path=None, in_memory=True, version='v2.1', branch_name='main', dev_split_name='dev', add_document_separator=False, sample_missing_splits=False, preproc_fn=None, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_NOISEBENCH(noise='clean', base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

label_url = 'https://raw.githubusercontent.com/elenamer/NoiseBench/main/data/annotations/'#
SAVE_TRAINDEV_FILE = False#
name: str#
class flair.datasets.NER_HUNGARIAN(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ICDAR_EUROPEANA(language, base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ICELANDIC(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_JAPANESE(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_NERMUD(domains='all', base_path=None, in_memory=False, **corpusargs)View on GitHub#

Bases: MultiCorpus

class flair.datasets.NER_MASAKHANE(languages='luo', version='v2', base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: MultiCorpus

class flair.datasets.NER_MULTI_WIKIANN(languages='en', base_path=None, in_memory=False, **corpusargs)View on GitHub#

Bases: MultiCorpus

class flair.datasets.NER_MULTI_WIKINER(languages='en', base_path=None, in_memory=False, **corpusargs)View on GitHub#

Bases: MultiCorpus

class flair.datasets.NER_MULTI_XTREME(languages='en', base_path=None, in_memory=False, **corpusargs)View on GitHub#

Bases: MultiCorpus

class flair.datasets.NER_SWEDISH(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_TURKU(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_UKRAINIAN(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ESTONIAN_NOISY(version=0, base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

data_url = 'https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/patnlp/estner.cnll.zip'#
label_url = 'https://raw.githubusercontent.com/uds-lsv/NoisyNER/master/data/only_labels'#
name: str#
class flair.datasets.UP_CHINESE(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.UP_ENGLISH(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.UP_FINNISH(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.UP_FRENCH(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.UP_GERMAN(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.UP_ITALIAN(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.UP_SPANISH(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.UP_SPANISH_ANCORA(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.WNUT_17(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.ColumnCorpus(data_folder, column_format, train_file=None, test_file=None, dev_file=None, autofind_splits=True, name=None, comment_symbol='# ', **corpusargs)View on GitHub#

Bases: MultiFileColumnCorpus
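
A minimal sketch (paths are placeholders) for loading a custom CoNLL-style corpus: column_format maps column indices to annotation layers, here token text in column 0 and BIO-encoded NER tags in column 1:

from flair.datasets import ColumnCorpus

columns = {0: "text", 1: "ner"}
corpus = ColumnCorpus(
    "path/to/conll_folder",   # hypothetical folder
    columns,
    train_file="train.txt",
    dev_file="dev.txt",
    test_file="test.txt",
)
print(corpus.train[0])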

class flair.datasets.ColumnDataset(path_to_column_file, column_name_map, column_delimiter='\\s+', comment_symbol=None, banned_sentences=None, in_memory=True, document_separator_token=None, encoding='utf-8', skip_first_line=False, label_name_map=None, default_whitespace_after=1)View on GitHub#

Bases: FlairDataset

SPACE_AFTER_KEY = 'space-after'#
FEATS = ['feats', 'misc']#
HEAD = ['head', 'head_id']#
text_column: int#
head_id_column: Optional[int]#
sentences: list[Sentence]#
sentences_raw: list[list[str]]#
total_sentence_count: int#
is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.NER_MULTI_CONER(task='multi', base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: MultiFileColumnCorpus

class flair.datasets.NER_MULTI_CONER_V2(task='multi', base_path=None, in_memory=True, use_dev_as_test=True, **corpusargs)View on GitHub#

Bases: MultiFileColumnCorpus

class flair.datasets.FeideggerCorpus(**kwargs)View on GitHub#

Bases: Corpus

class flair.datasets.FeideggerDataset(dataset_info, **kwargs)View on GitHub#

Bases: FlairDataset

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.GLUE_MNLI(label_type='entailment', evaluate_on_matched=True, base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#

Bases: DataPairCorpus

tsv_from_eval_dataset(folder_path)View on GitHub#
class flair.datasets.GLUE_MRPC(label_type='paraphrase', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#

Bases: DataPairCorpus

tsv_from_eval_dataset(folder_path)View on GitHub#
class flair.datasets.GLUE_QNLI(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#

Bases: DataPairCorpus

tsv_from_eval_dataset(folder_path)View on GitHub#
class flair.datasets.GLUE_QQP(label_type='paraphrase', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#

Bases: DataPairCorpus

tsv_from_eval_dataset(folder_path)View on GitHub#
class flair.datasets.GLUE_RTE(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#

Bases: DataPairCorpus

tsv_from_eval_dataset(folder_path)View on GitHub#
class flair.datasets.GLUE_WNLI(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#

Bases: DataPairCorpus

tsv_from_eval_dataset(folder_path)View on GitHub#
class flair.datasets.GLUE_SST2(label_type='sentiment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=False, encoding='utf-8', sample_missing_splits=True, **datasetargs)View on GitHub#

Bases: CSVClassificationCorpus

label_map = {0: 'negative', 1: 'positive'}#
tsv_from_eval_dataset(folder_path)View on GitHub#

Create eval prediction file.

name: str#
class flair.datasets.GLUE_STSB(label_type='similarity', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#

Bases: DataPairCorpus

tsv_from_eval_dataset(folder_path)View on GitHub#

Create a tsv file of the predictions of the eval_dataset.

After calling classifier.predict(corpus.eval_dataset, label_name='similarity'), this function can be used to produce a file called STS-B.tsv, suitable for submission to the GLUE benchmark.
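
Sketched usage mirroring the description above (the regressor is assumed to be a model already trained for the 'similarity' label type; see the GLUE_COLA example earlier for the analogous classification workflow):

from flair.datasets import GLUE_STSB

corpus = GLUE_STSB()
# regressor.predict(corpus.eval_dataset, label_name="similarity")  # assumed trained model
corpus.tsv_from_eval_dataset("glue_submission/")  # writes STS-B.tsv for GLUE submission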

class flair.datasets.SUPERGLUE_RTE(base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#

Bases: DataPairCorpus

jsonl_from_eval_dataset(folder_path)View on GitHub#
class flair.datasets.DataPairCorpus(data_folder, columns=[0, 1, 2], train_file=None, test_file=None, dev_file=None, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, label_type=None, autofind_splits=True, sample_missing_splits=True, skip_first_line=False, separator='\\t', encoding='utf-8')View on GitHub#

Bases: Corpus

class flair.datasets.DataPairDataset(path_to_data, columns=[0, 1, 2], max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, label_type=None, skip_first_line=False, separator='\\t', encoding='utf-8', label=True)View on GitHub#

Bases: FlairDataset

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.DataTripleCorpus(data_folder, columns=[0, 1, 2, 3], train_file=None, test_file=None, dev_file=None, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, label_type=None, autofind_splits=True, sample_missing_splits=True, skip_first_line=False, separator='\\t', encoding='utf-8')View on GitHub#

Bases: Corpus

class flair.datasets.DataTripleDataset(path_to_data, columns=[0, 1, 2, 3], max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, label_type=None, skip_first_line=False, separator='\\t', encoding='utf-8', label=True)View on GitHub#

Bases: FlairDataset

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.OpusParallelCorpus(dataset, l1, l2, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, **corpusargs)View on GitHub#

Bases: ParallelTextCorpus

class flair.datasets.ParallelTextCorpus(source_file, target_file, name, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, **corpusargs)View on GitHub#

Bases: Corpus

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.ParallelTextDataset(path_to_source, path_to_target, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True)View on GitHub#

Bases: FlairDataset

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.UD_AFRIKAANS(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_ANCIENT_GREEK(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_ARABIC(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_ARMENIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_BASQUE(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_BAVARIAN_MAIBAAM(base_path=None, in_memory=True, split_multiwords=True, revision='dev')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_BELARUSIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_BULGARIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_BURYAT(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_CATALAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_CHINESE(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_CHINESE_KYOTO(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_COPTIC(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_CROATIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_CZECH(base_path=None, in_memory=False, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_DANISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_DUTCH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_ENGLISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_ESTONIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_FAROESE(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

This corpus provides the Faroese treebank dataset.

The data is obtained from the following link: UniversalDependencies/UD_Faroese-FarPaHC/{revision}

Faroese is a small Western Scandinavian language with 60,000-100,000 speakers, related to Icelandic and Old Norse.

class flair.datasets.UD_FINNISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_FRENCH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_GALICIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_GERMAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_GERMAN_HDT(base_path=None, in_memory=False, split_multiwords=True, revision='dev')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_GOTHIC(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_GREEK(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_HEBREW(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_HINDI(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_INDONESIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_IRISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_ITALIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_JAPANESE(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_KAZAKH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_KOREAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_LATIN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_LATVIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_LITHUANIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_LIVVI(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_MALTESE(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_MARATHI(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_NAIJA(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_NORTH_SAMI(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_NORWEGIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_OLD_CHURCH_SLAVONIC(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_OLD_FRENCH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_PERSIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_POLISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_PORTUGUESE(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_ROMANIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_RUSSIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_SERBIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_SLOVAK(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_SLOVENIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_SPANISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_SWEDISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_TURKISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_UKRAINIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_WOLOF(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UniversalDependenciesCorpus(data_folder, train_file=None, test_file=None, dev_file=None, in_memory=True, split_multiwords=True)View on GitHub#

Bases: Corpus

class flair.datasets.UniversalDependenciesDataset(path_to_conll_file, in_memory=True, split_multiwords=True)View on GitHub#

Bases: FlairDataset

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.ZELDA(base_path=None, in_memory=False, column_format={0: 'text', 2: 'nel'}, **corpusargs)View on GitHub#

Bases: MultiFileColumnCorpus