flair.datasets#

Classes

DataLoader(dataset[, batch_size, shuffle, ...])

OcrJsonDataset(path_to_split_directory[, ...])

SROIE([base_path, encoding, label_type, ...])

FlairDatapointDataset(datapoints)

A simple Dataset object to wrap a List of Datapoints, for example Sentences.

SentenceDataset(sentences)

MongoDataset(query, host, port, database, ...)

StringDataset(texts[, use_tokenizer])

A Dataset taking strings as input and returning Sentences during iteration.

EntityLinkingDictionary(candidates[, ...])

Base class for downloading and reading dictionaries for entity linking.

AGNEWS([base_path, tokenizer, memory_mode])

The AG's News Topic Classification Corpus, classifying news into 4 coarse-grained topics.

ANAT_EM([base_path, in_memory, tokenizer])

Corpus for anatomical named entity mention recognition.

AZDZ([base_path, in_memory, tokenizer])

Arizona Disease Corpus from the Biomedical Informatics Lab at Arizona State University.

BC2GM([base_path, in_memory, sentence_splitter])

Original BioCreative-II-GM corpus containing gene annotations.

BIO_INFER([base_path, in_memory])

Original BioInfer corpus.

BIOBERT_CHEMICAL_BC4CHEMD([base_path, in_memory])

BC4CHEMD corpus with chemical annotations as used in the evaluation of BioBERT.

BIOBERT_CHEMICAL_BC5CDR([base_path, in_memory])

BC5CDR corpus with chemical annotations as used in the evaluation of BioBERT.

BIOBERT_DISEASE_BC5CDR([base_path, in_memory])

BC5CDR corpus with disease annotations as used in the evaluation of BioBERT.

BIOBERT_DISEASE_NCBI([base_path, in_memory])

NCBI disease corpus as used in the evaluation of BioBERT.

BIOBERT_GENE_BC2GM([base_path, in_memory])

BC2GM corpus with gene annotations as used in the evaluation of BioBERT.

BIOBERT_GENE_JNLPBA([base_path, in_memory])

JNLPBA corpus with gene annotations as used in the evaluation of BioBERT.

BIOBERT_SPECIES_LINNAEUS([base_path, in_memory])

LINNAEUS corpus with species annotations as used in the evaluation of BioBERT.

BIOBERT_SPECIES_S800([base_path, in_memory])

S800 corpus with species annotations as used in the evaluation of BioBERT.

BIONLP2013_CG([base_path, in_memory, ...])

Corpus of the BioNLP'2013 Cancer Genetics shared task.

BIONLP2013_PC([base_path, in_memory, ...])

Corpus of the BioNLP'2013 Pathway Curation shared task.

BIOSEMANTICS([base_path, in_memory, ...])

Original Biosemantics corpus.

CDR([base_path, in_memory, sentence_splitter])

CDR corpus as provided by JHnlp/BioCreative-V-CDR-Corpus.

CELL_FINDER([base_path, in_memory, ...])

Original CellFinder corpus containing cell line, species and gene annotations.

CEMP([base_path, in_memory, sentence_splitter])

Original CEMP corpus containing chemical annotations.

CHEMDNER([base_path, in_memory, ...])

Original corpus of the CHEMDNER shared task.

CLL([base_path, in_memory])

Original CLL corpus containing cell line annotations.

CRAFT([base_path, in_memory, sentence_splitter])

Original CRAFT corpus (version 2.0) containing all but the coreference and sections/typography annotations.

CRAFT_V4([base_path, in_memory, ...])

Version 4.0.1 of the CRAFT corpus containing all but the co-reference and structural annotations.

DECA([base_path, in_memory, sentence_splitter])

Original DECA corpus containing gene annotations.

FSU([base_path, in_memory])

Original FSU corpus containing protein and derived annotations.

GELLUS([base_path, in_memory])

Original Gellus corpus containing cell line annotations.

GPRO([base_path, in_memory, sentence_splitter])

Original GPRO corpus containing gene annotations.

HunerEntityLinkingDictionary(path, dataset_name)

Base dictionary with data already in huner format.

HUNER_CELL_LINE([sentence_splitter])

Union of all HUNER cell line data sets.

HUNER_CELL_LINE_CELL_FINDER(*args, **kwargs)

HUNER version of the CellFinder corpus containing only cell line annotations.

HUNER_CELL_LINE_CLL(*args, **kwargs)

HUNER version of the CLL corpus containing cell line annotations.

HUNER_CELL_LINE_GELLUS(*args, **kwargs)

HUNER version of the Gellus corpus containing cell line annotations.

HUNER_CELL_LINE_JNLPBA(*args, **kwargs)

HUNER version of the JNLPBA corpus containing cell line annotations.

HUNER_CHEMICAL([sentence_splitter])

Union of all HUNER chemical data sets.

HUNER_CHEMICAL_CDR(*args, **kwargs)

HUNER version of the CDR corpus containing chemical annotations.

HUNER_CHEMICAL_CEMP(*args, **kwargs)

HUNER version of the CEMP corpus containing chemical annotations.

HUNER_CHEMICAL_CHEBI(*args, **kwargs)

HUNER version of the CHEBI corpus containing chemical annotations.

HUNER_CHEMICAL_CHEMDNER(*args, **kwargs)

HUNER version of the CHEMDNER corpus containing chemical annotations.

HUNER_CHEMICAL_CRAFT_V4(*args, **kwargs)

HUNER version of the CRAFT corpus containing (only) chemical annotations.

HUNER_CHEMICAL_SCAI(*args, **kwargs)

HUNER version of the SCAI chemicals corpus containing chemical annotations.

HUNER_DISEASE([sentence_splitter])

Union of all HUNER disease data sets.

HUNER_DISEASE_CDR(*args, **kwargs)

HUNER version of the CDR corpus containing disease annotations.

HUNER_DISEASE_MIRNA(*args, **kwargs)

HUNER version of the miRNA corpus containing disease annotations.

HUNER_DISEASE_NCBI(*args, **kwargs)

HUNER version of the NCBI corpus containing disease annotations.

HUNER_DISEASE_PDR(*args, **kwargs)

PDR Dataset with only Disease annotations.

HUNER_DISEASE_SCAI(*args, **kwargs)

HUNER version of the SCAI disease corpus containing disease annotations.

HUNER_DISEASE_VARIOME(*args, **kwargs)

HUNER version of the Variome corpus containing disease annotations.

HUNER_GENE([sentence_splitter])

Union of all HUNER gene data sets.

HUNER_GENE_BC2GM(*args, **kwargs)

HUNER version of the BioCreative-II-GM corpus containing gene annotations.

HUNER_GENE_BIO_INFER(*args, **kwargs)

HUNER version of the BioInfer corpus containing only gene/protein annotations.

HUNER_GENE_CELL_FINDER(*args, **kwargs)

HUNER version of the CellFinder corpus containing only gene annotations.

HUNER_GENE_CHEBI(*args, **kwargs)

HUNER version of the CHEBI corpus containing gene annotations.

HUNER_GENE_CRAFT_V4(*args, **kwargs)

HUNER version of the CRAFT corpus containing (only) gene annotations.

HUNER_GENE_DECA(*args, **kwargs)

HUNER version of the DECA corpus containing gene annotations.

HUNER_GENE_FSU(*args, **kwargs)

HUNER version of the FSU corpus containing (only) gene annotations.

HUNER_GENE_GPRO(*args, **kwargs)

HUNER version of the GPRO corpus containing gene annotations.

HUNER_GENE_IEPA(*args, **kwargs)

HUNER version of the IEPA corpus containing gene annotations.

HUNER_GENE_JNLPBA(*args, **kwargs)

HUNER version of the JNLPBA corpus containing gene annotations.

HUNER_GENE_LOCTEXT(*args, **kwargs)

HUNER version of the Loctext corpus containing protein annotations.

HUNER_GENE_MIRNA(*args, **kwargs)

HUNER version of the miRNA corpus containing protein / gene annotations.

HUNER_GENE_OSIRIS(*args, **kwargs)

HUNER version of the OSIRIS corpus containing (only) gene annotations.

HUNER_GENE_VARIOME(*args, **kwargs)

HUNER version of the Variome corpus containing gene annotations.

HUNER_SPECIES([sentence_splitter])

Union of all HUNER species data sets.

HUNER_SPECIES_CELL_FINDER(*args, **kwargs)

HUNER version of the CellFinder corpus containing only species annotations.

HUNER_SPECIES_CHEBI(*args, **kwargs)

HUNER version of the CHEBI corpus containing species annotations.

HUNER_SPECIES_CRAFT_V4(*args, **kwargs)

HUNER version of the CRAFT corpus containing (only) species annotations.

HUNER_SPECIES_LINNEAUS(*args, **kwargs)

HUNER version of the LINNEAUS corpus containing species annotations.

HUNER_SPECIES_LOCTEXT(*args, **kwargs)

HUNER version of the Loctext corpus containing species annotations.

HUNER_SPECIES_MIRNA(*args, **kwargs)

HUNER version of the miRNA corpus containing species annotations.

HUNER_SPECIES_S800(*args, **kwargs)

HUNER version of the S800 corpus containing species annotations.

HUNER_SPECIES_VARIOME(*args, **kwargs)

HUNER version of the Variome corpus containing species annotations.

IEPA([base_path, in_memory])

IEPA corpus as provided by http://corpora.informatik.hu-berlin.de/.

JNLPBA([base_path, in_memory])

Original corpus of the JNLPBA shared task.

LINNEAUS([base_path, in_memory, tokenizer])

Original LINNEAUS corpus containing species annotations.

LOCTEXT([base_path, in_memory, ...])

Original LOCTEXT corpus containing species annotations.

MIRNA([base_path, in_memory, sentence_splitter])

Original miRNA corpus.

NCBI_GENE_HUMAN_DICTIONARY([base_path])

Dictionary for named entity linking on genes using the NCBI Gene ontology.

NCBI_TAXONOMY_DICTIONARY([base_path])

Dictionary for named entity linking on organisms / species using the NCBI taxonomy ontology.

CTD_DISEASES_DICTIONARY([base_path])

Dictionary for named entity linking on diseases using the Comparative Toxicogenomics Database (CTD).

CTD_CHEMICALS_DICTIONARY([base_path])

Dictionary for named entity linking on chemicals using the Comparative Toxicogenomics Database (CTD).

NCBI_DISEASE([base_path, in_memory, ...])

Original NCBI disease corpus containing disease annotations.

ONTONOTES([base_path, version, language, ...])

OSIRIS([base_path, in_memory, ...])

Original OSIRIS corpus containing variation and gene annotations.

PDR([base_path, in_memory, sentence_splitter])

Corpus of plant-disease relations.

S800([base_path, in_memory, sentence_splitter])

S800 corpus.

SCAI_CHEMICALS(*args, **kwargs)

Original SCAI chemicals corpus containing chemical annotations.

SCAI_DISEASE(*args, **kwargs)

Original SCAI disease corpus containing disease annotations.

VARIOME([base_path, in_memory, ...])

Variome corpus as provided by http://corpora.informatik.hu-berlin.de/corpora/brat2bioc/hvp_bioc.xml.zip.

AMAZON_REVIEWS([split_max, label_name_map, ...])

A very large corpus of Amazon reviews with positivity ratings.

COMMUNICATIVE_FUNCTIONS([base_path, ...])

The Communicative Functions Classification Corpus.

GERMEVAL_2018_OFFENSIVE_LANGUAGE([...])

GermEval 2018 corpus for identification of offensive language.

GLUE_COLA([label_type, base_path, tokenizer])

Corpus of Linguistic Acceptability from GLUE benchmark.

GO_EMOTIONS([base_path, tokenizer, memory_mode])

GoEmotions dataset containing 58k Reddit comments labeled with 27 emotion categories.

IMDB([base_path, rebalance_corpus, ...])

Corpus of IMDB movie reviews labeled by sentiment (POSITIVE, NEGATIVE).

NEWSGROUPS([base_path, tokenizer, memory_mode])

20 newsgroups corpus, classifying news items into one of 20 categories.

STACKOVERFLOW([base_path, tokenizer, ...])

Stackoverflow corpus classifying questions into one of 20 labels.

SENTEVAL_CR([tokenizer, memory_mode])

The customer reviews dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.

SENTEVAL_MPQA([tokenizer, memory_mode])

The opinion-polarity dataset of SentEval, classified into NEGATIVE or POSITIVE polarity.

SENTEVAL_MR([tokenizer, memory_mode])

The movie reviews dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.

SENTEVAL_SST_BINARY([tokenizer, memory_mode])

The Stanford sentiment treebank dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.

SENTEVAL_SST_GRANULAR([tokenizer, memory_mode])

The Stanford sentiment treebank dataset of SentEval, classified into 5 sentiment classes.

SENTEVAL_SUBJ([tokenizer, memory_mode])

The subjectivity dataset of SentEval, classified into SUBJECTIVE or OBJECTIVE sentiment.

SENTIMENT_140([label_name_map, tokenizer, ...])

Twitter sentiment corpus.

TREC_6([base_path, tokenizer, memory_mode])

The TREC Question Classification Corpus, classifying questions into 6 coarse-grained answer types.

TREC_50([base_path, tokenizer, memory_mode])

The TREC Question Classification Corpus, classifying questions into 50 fine-grained answer types.

WASSA_ANGER([base_path, tokenizer])

WASSA-2017 anger emotion-intensity corpus.

WASSA_FEAR([base_path, tokenizer])

WASSA-2017 fear emotion-intensity corpus.

WASSA_JOY([base_path, tokenizer])

WASSA-2017 joy emotion-intensity dataset corpus.

WASSA_SADNESS([base_path, tokenizer])

WASSA-2017 sadness emotion-intensity corpus.

YAHOO_ANSWERS([base_path, tokenizer, ...])

The YAHOO Question Classification Corpus, classifying questions into 10 coarse-grained answer types.

ClassificationCorpus(data_folder[, ...])

A classification corpus from FastText-formatted text files.

ClassificationDataset(path_to_file, label_type)

Dataset for classification instantiated from a single FastText-formatted file.

CSVClassificationCorpus(data_folder, ...[, ...])

Classification corpus instantiated from CSV data files.

CSVClassificationDataset(path_to_file, ...)

Dataset for text classification from CSV column formatted data.

NEL_ENGLISH_AIDA([base_path, in_memory, ...])

NEL_ENGLISH_AQUAINT([base_path, in_memory, ...])

NEL_ENGLISH_IITB([base_path, in_memory, ...])

NEL_ENGLISH_REDDIT([base_path, in_memory])

NEL_ENGLISH_TWEEKI([base_path, in_memory])

NEL_GERMAN_HIPE([base_path, in_memory, ...])

WSD_MASC([base_path, in_memory, columns, ...])

WSD_OMSTI([base_path, in_memory, columns, ...])

WSD_RAGANATO_ALL([base_path, in_memory, ...])

WSD_SEMCOR([base_path, in_memory, columns, ...])

WSD_TRAINOMATIC([base_path, in_memory, ...])

WSD_UFSAC([filenames, base_path, in_memory, ...])

WSD_WORDNET_GLOSS_TAGGED([base_path, ...])

RE_ENGLISH_CONLL04([base_path, in_memory])

RE_ENGLISH_DRUGPROT([base_path, in_memory, ...])

RE_ENGLISH_SEMEVAL2010([base_path, ...])

RE_ENGLISH_TACRED([base_path, in_memory])

BIOSCOPE([base_path, in_memory])

CONLL_03([base_path, column_format, in_memory])

CONLL_03_DUTCH([base_path, in_memory])

CONLL_03_GERMAN([base_path, in_memory])

CONLL_03_SPANISH([base_path, in_memory])

CLEANCONLL([base_path, in_memory])

CONLL_2000([base_path, in_memory])

FEWNERD([setting])

KEYPHRASE_INSPEC([base_path, in_memory])

KEYPHRASE_SEMEVAL2010([base_path, in_memory])

KEYPHRASE_SEMEVAL2017([base_path, in_memory])

MASAKHA_POS([languages, version, base_path, ...])

NER_ARABIC_ANER([base_path, in_memory, ...])

NER_ARABIC_AQMAR([base_path, in_memory, ...])

NER_BASQUE([base_path, in_memory])

NER_CHINESE_WEIBO([base_path, in_memory, ...])

NER_DANISH_DANE([base_path, in_memory])

NER_ENGLISH_MOVIE_COMPLEX([base_path, in_memory])

NER_ENGLISH_MOVIE_SIMPLE([base_path, in_memory])

NER_ENGLISH_PERSON([base_path, in_memory])

NER_ENGLISH_RESTAURANT([base_path, in_memory])

NER_ENGLISH_SEC_FILLINGS([base_path, in_memory])

NER_ENGLISH_STACKOVERFLOW([base_path, in_memory])

NER_ENGLISH_TWITTER([base_path, in_memory])

NER_ENGLISH_WEBPAGES([base_path, in_memory])

NER_ENGLISH_WIKIGOLD([base_path, in_memory, ...])

NER_ENGLISH_WNUT_2020([base_path, ...])

NER_FINNISH([base_path, in_memory])

NER_GERMAN_BIOFID([base_path, in_memory])

NER_GERMAN_EUROPARL([base_path, in_memory])

NER_GERMAN_GERMEVAL([base_path, in_memory])

NER_GERMAN_LEGAL([base_path, in_memory])

NER_GERMAN_MOBIE([base_path, in_memory])

NER_GERMAN_POLITICS([base_path, ...])

NER_HIPE_2022(dataset_name, language[, ...])

NER_NOISEBENCH([noise, base_path, in_memory])

NER_HUNGARIAN([base_path, in_memory, ...])

NER_ICDAR_EUROPEANA(language[, base_path, ...])

NER_ICELANDIC([base_path, in_memory])

NER_JAPANESE([base_path, in_memory])

NER_NERMUD([domains, base_path, in_memory])

NER_MASAKHANE([languages, version, ...])

NER_MULTI_WIKIANN([languages, base_path, ...])

NER_MULTI_WIKINER([languages, base_path, ...])

NER_MULTI_XTREME([languages, base_path, ...])

NER_SWEDISH([base_path, in_memory])

NER_TURKU([base_path, in_memory])

NER_UKRAINIAN([base_path, in_memory])

NER_ESTONIAN_NOISY([version, base_path, ...])

UP_CHINESE([base_path, in_memory, ...])

UP_ENGLISH([base_path, in_memory, ...])

UP_FINNISH([base_path, in_memory, ...])

UP_FRENCH([base_path, in_memory, ...])

UP_GERMAN([base_path, in_memory, ...])

UP_ITALIAN([base_path, in_memory, ...])

UP_SPANISH([base_path, in_memory, ...])

UP_SPANISH_ANCORA([base_path, in_memory, ...])

WNUT_17([base_path, in_memory])

ColumnCorpus(data_folder, column_format[, ...])

ColumnDataset(path_to_column_file, ...[, ...])

NER_MULTI_CONER([task, base_path, in_memory])

NER_MULTI_CONER_V2([task, base_path, ...])

FeideggerCorpus(**kwargs)

FeideggerDataset(dataset_info, **kwargs)

GLUE_MNLI([label_type, evaluate_on_matched, ...])

GLUE_MRPC([label_type, base_path, ...])

GLUE_QNLI([label_type, base_path, ...])

GLUE_QQP([label_type, base_path, ...])

GLUE_RTE([label_type, base_path, ...])

GLUE_WNLI([label_type, base_path, ...])

GLUE_SST2([label_type, base_path, ...])

GLUE_STSB([label_type, base_path, ...])

SUPERGLUE_RTE([base_path, ...])

DataPairCorpus(data_folder[, columns, ...])

DataPairDataset(path_to_data[, columns, ...])

DataTripleCorpus(data_folder[, columns, ...])

DataTripleDataset(path_to_data[, columns, ...])

OpusParallelCorpus(dataset, l1, l2[, ...])

ParallelTextCorpus(source_file, target_file, ...)

ParallelTextDataset(path_to_source, ...[, ...])

UD_AFRIKAANS([base_path, in_memory, ...])

UD_ANCIENT_GREEK([base_path, in_memory, ...])

UD_ARABIC([base_path, in_memory, ...])

UD_ARMENIAN([base_path, in_memory, ...])

UD_BASQUE([base_path, in_memory, ...])

UD_BAVARIAN_MAIBAAM([base_path, in_memory, ...])

UD_BELARUSIAN([base_path, in_memory, ...])

UD_BULGARIAN([base_path, in_memory, ...])

UD_BURYAT([base_path, in_memory, ...])

UD_CATALAN([base_path, in_memory, ...])

UD_CHINESE([base_path, in_memory, ...])

UD_CHINESE_KYOTO([base_path, in_memory, ...])

UD_COPTIC([base_path, in_memory, ...])

UD_CROATIAN([base_path, in_memory, ...])

UD_CZECH([base_path, in_memory, ...])

UD_DANISH([base_path, in_memory, ...])

UD_DUTCH([base_path, in_memory, ...])

UD_ENGLISH([base_path, in_memory, ...])

UD_ESTONIAN([base_path, in_memory, ...])

UD_FAROESE([base_path, in_memory, ...])

The Faroese Universal Dependencies treebank dataset.

UD_FINNISH([base_path, in_memory, ...])

UD_FRENCH([base_path, in_memory, ...])

UD_GALICIAN([base_path, in_memory, ...])

UD_GERMAN([base_path, in_memory, ...])

UD_GERMAN_HDT([base_path, in_memory, ...])

UD_GOTHIC([base_path, in_memory, ...])

UD_GREEK([base_path, in_memory, ...])

UD_HEBREW([base_path, in_memory, ...])

UD_HINDI([base_path, in_memory, ...])

UD_INDONESIAN([base_path, in_memory, ...])

UD_IRISH([base_path, in_memory, ...])

UD_ITALIAN([base_path, in_memory, ...])

UD_JAPANESE([base_path, in_memory, ...])

UD_KAZAKH([base_path, in_memory, ...])

UD_KOREAN([base_path, in_memory, ...])

UD_LATIN([base_path, in_memory, ...])

UD_LATVIAN([base_path, in_memory, ...])

UD_LITHUANIAN([base_path, in_memory, ...])

UD_LIVVI([base_path, in_memory, ...])

UD_MALTESE([base_path, in_memory, ...])

UD_MARATHI([base_path, in_memory, ...])

UD_NAIJA([base_path, in_memory, ...])

UD_NORTH_SAMI([base_path, in_memory, ...])

UD_NORWEGIAN([base_path, in_memory, ...])

UD_OLD_CHURCH_SLAVONIC([base_path, ...])

UD_OLD_FRENCH([base_path, in_memory, ...])

UD_PERSIAN([base_path, in_memory, ...])

UD_POLISH([base_path, in_memory, ...])

UD_PORTUGUESE([base_path, in_memory, ...])

UD_ROMANIAN([base_path, in_memory, ...])

UD_RUSSIAN([base_path, in_memory, ...])

UD_SERBIAN([base_path, in_memory, ...])

UD_SLOVAK([base_path, in_memory, ...])

UD_SLOVENIAN([base_path, in_memory, ...])

UD_SPANISH([base_path, in_memory, ...])

UD_SWEDISH([base_path, in_memory, ...])

UD_TURKISH([base_path, in_memory, ...])

UD_UKRAINIAN([base_path, in_memory, ...])

UD_WOLOF([base_path, in_memory, ...])

UniversalDependenciesCorpus(data_folder[, ...])

UniversalDependenciesDataset(path_to_conll_file)

ZELDA([base_path, in_memory, column_format])

class flair.datasets.DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, drop_last=False, timeout=0, worker_init_fn=None)View on GitHub#

Bases: DataLoader

dataset: Dataset[_T_co]#
batch_size: Optional[int]#
num_workers: int#
pin_memory: bool#
drop_last: bool#
timeout: float#
sampler: Union[Sampler, Iterable]#
pin_memory_device: str#
prefetch_factor: Optional[int]#
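
A minimal usage sketch, assuming flair is installed: a handful of Sentences are wrapped in a FlairDatapointDataset and iterated in mini-batches; the example texts and batch size are purely illustrative.

    from flair.data import Sentence
    from flair.datasets import DataLoader, FlairDatapointDataset

    # wrap a few Sentences in a simple dataset (illustrative examples)
    dataset = FlairDatapointDataset([Sentence("The grass is green."), Sentence("The sky is blue.")])

    loader = DataLoader(dataset, batch_size=2, shuffle=False)
    for batch in loader:
        print(len(batch), batch)  # each batch is a list of Sentence objects
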
class flair.datasets.OcrJsonDataset(path_to_split_directory, label_type='ner', in_memory=True, encoding='utf-8', load_images=False, normalize_coords_to_thousands=True, label_name_map=None)View on GitHub#

Bases: FlairDataset

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.SROIE(base_path=None, encoding='utf-8', label_type='ner', in_memory=True, load_images=False, normalize_coords_to_thousands=True, label_name_map=None, **corpusargs)View on GitHub#

Bases: OcrCorpus

class flair.datasets.FlairDatapointDataset(datapoints)View on GitHub#

Bases: FlairDataset, Generic[DT]

A simple Dataset object to wrap a List of Datapoints, for example Sentences.

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.SentenceDataset(sentences)View on GitHub#

Bases: FlairDatapointDataset

class flair.datasets.MongoDataset(query, host, port, database, collection, text_field, categories_field=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, tag_type='class')View on GitHub#

Bases: FlairDataset

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.StringDataset(texts, use_tokenizer=<flair.tokenization.SpaceTokenizer object>)View on GitHub#

Bases: FlairDataset

A Dataset taking strings as input and returning Sentences during iteration.

abstract is_in_memory()View on GitHub#
Return type:

bool
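
A short sketch of wrapping raw strings; the texts are illustrative, and passing SegtokTokenizer instead of the default SpaceTokenizer is optional.

    from flair.datasets import StringDataset
    from flair.tokenization import SegtokTokenizer

    texts = ["Berlin is the capital of Germany.", "Flair ships many ready-made datasets."]
    dataset = StringDataset(texts, use_tokenizer=SegtokTokenizer())

    print(len(dataset))  # number of wrapped strings
    print(dataset[0])    # the string is tokenized on access and returned as a Sentence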

class flair.datasets.EntityLinkingDictionary(candidates, dataset_name=None)View on GitHub#

Bases: object

Base class for downloading and reading dictionaries for entity linking.

A dictionary represents all entities of a knowledge base and their associated ids.

property database_name: str#

Name of the database represented by the dictionary.

property text_to_index: dict[str, list[str]]#
property candidates: list[EntityCandidate]#
to_in_memory_dictionary()View on GitHub#
Return type:

InMemoryEntityLinkingDictionary

class flair.datasets.AGNEWS(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='partial', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The AG’s News Topic Classification Corpus, classifying news into 4 coarse-grained topics.

Labels: World, Sports, Business, Sci/Tech.
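
A minimal loading sketch; the corpus is downloaded automatically on first use.

    from flair.datasets import AGNEWS

    corpus = AGNEWS()
    print(corpus)           # number of train/dev/test sentences
    print(corpus.train[0])  # first training Sentence with its topic label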

class flair.datasets.ANAT_EM(base_path=None, in_memory=True, tokenizer=None)View on GitHub#

Bases: ColumnCorpus

Corpus for anatomical named entity mention recognition.

For further information see Pyysalo and Ananiadou: Anatomical entity mention recognition at literature scale https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3957068/ http://nactem.ac.uk/anatomytagger/#AnatEM

Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

abstract static download_corpus(data_folder)View on GitHub#
static parse_input_files(input_dir, sentence_separator)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.AZDZ(base_path=None, in_memory=True, tokenizer=None)View on GitHub#

Bases: ColumnCorpus

Arizona Disease Corpus from the Biomedical Informatics Lab at Arizona State University.

For further information see: http://diego.asu.edu/index.php

classmethod download_corpus(data_dir)View on GitHub#
Return type:

Path

static parse_corpus(input_file)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.BC2GM(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original BioCreative-II-GM corpus containing gene annotations.

For further information see Smith et al.: Overview of BioCreative II gene mention recognition https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2559986/

Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

static download_dataset(data_dir)View on GitHub#
Return type:

Path

classmethod parse_train_dataset(data_folder)View on GitHub#
Return type:

InternalBioNerDataset

classmethod parse_test_dataset(data_folder)View on GitHub#
Return type:

InternalBioNerDataset

static parse_dataset(text_file, ann_file)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.BIO_INFER(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

Original BioInfer corpus.

For further information see Pyysalo et al.:

BioInfer: a corpus for information extraction in the biomedical domain https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-50

classmethod download_dataset(data_dir)View on GitHub#
Return type:

Path

classmethod parse_dataset(original_file)View on GitHub#
class flair.datasets.BIOBERT_CHEMICAL_BC4CHEMD(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

BC4CHEMD corpus with chemical annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.BIOBERT_CHEMICAL_BC5CDR(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

BC5CDR corpus with chemical annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.BIOBERT_DISEASE_BC5CDR(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

BC5CDR corpus with disease annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.BIOBERT_DISEASE_NCBI(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

NCBI disease corpus as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.BIOBERT_GENE_BC2GM(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

BC2GM corpus with gene annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.BIOBERT_GENE_JNLPBA(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

JNLPBA corpus with gene annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.BIOBERT_SPECIES_LINNAEUS(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

LINNAEUS corpus with species annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.BIOBERT_SPECIES_S800(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

S800 corpus with species annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.BIONLP2013_CG(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: BioNLPCorpus

Corpus of the BioNLP’2013 Cancer Genetics shared task.

For further information see Pyysalo, Ohta & Ananiadou 2013 Overview of the Cancer Genetics (CG) task of BioNLP Shared Task 2013 https://www.aclweb.org/anthology/W13-2008/

Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

static download_corpus(download_folder)View on GitHub#
Return type:

tuple[Path, Path, Path]

class flair.datasets.BIONLP2013_PC(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: BioNLPCorpus

Corpus of the BioNLP’2013 Pathway Curation shared task.

For further information see Ohta et al. Overview of the pathway curation (PC) task of bioNLP shared task 2013. https://www.aclweb.org/anthology/W13-2009/

Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

static download_corpus(download_folder)View on GitHub#
Return type:

tuple[Path, Path, Path]

class flair.datasets.BIOSEMANTICS(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original Biosemantics corpus.

For further information see Akhondi et al.: Annotated chemical patent corpus: a gold standard for text mining https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4182036/

static download_dataset(data_dir)View on GitHub#
Return type:

Path

static parse_dataset(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.CDR(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

CDR corpus as provided by JHnlp/BioCreative-V-CDR-Corpus.

For further information see Li et al.: BioCreative V CDR task corpus: a resource for chemical disease relation extraction https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4860626/

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

static download_dataset(data_dir)View on GitHub#
class flair.datasets.CELL_FINDER(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original CellFinder corpus containing cell line, species and gene annotations.

For further information see Neves et al.: Annotating and evaluating text for stem cell research https://pdfs.semanticscholar.org/38e3/75aeeeb1937d03c3c80128a70d8e7a74441f.pdf

classmethod download_and_prepare(data_folder)View on GitHub#
Return type:

InternalBioNerDataset

classmethod read_folder(data_folder)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.CEMP(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original CEMP corpus containing chemical annotations.

For further information see: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/cemp-detailed-task-description/

classmethod download_train_corpus(data_dir)View on GitHub#
Return type:

Path

classmethod download_dev_corpus(data_dir)View on GitHub#
Return type:

Path

static parse_input_file(text_file, ann_file)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.CHEMDNER(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original corpus of the CHEMDNER shared task.

For further information see Krallinger et al.: The CHEMDNER corpus of chemicals and drugs and its annotation principles https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-7-S1-S2

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

static download_dataset(data_dir)View on GitHub#
class flair.datasets.CLL(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

Original CLL corpus containing cell line annotations.

For further information, see Kaewphan et al.: Cell line name recognition in support of the identification of synthetic lethality in cancer from text https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4708107/

class flair.datasets.CRAFT(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original CRAFT corpus (version 2.0) containing all but the coreference and sections/typography annotations.

For further information see Bada et al.: Concept annotation in the craft corpus https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-161

classmethod download_corpus(data_dir)View on GitHub#
Return type:

Path

static parse_corpus(corpus_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.CRAFT_V4(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Version 4.0.1 of the CRAFT corpus containing all but the co-reference and structural annotations.

For further information see: UCDenver-ccp/CRAFT

filter_entities(corpus)View on GitHub#
Return type:

InternalBioNerDataset

classmethod download_corpus(data_dir)View on GitHub#
Return type:

Path

static prepare_splits(data_dir, corpus)View on GitHub#
Return type:

tuple[InternalBioNerDataset, InternalBioNerDataset, InternalBioNerDataset]

static parse_corpus(corpus_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.DECA(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original DECA corpus containing gene annotations.

For further information see Wang et al.: Disambiguating the species of biomedical named entities using natural language parsers https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2828111/

classmethod download_corpus(data_dir)View on GitHub#
Return type:

Path

static parse_corpus(text_dir, gold_file)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.FSU(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

Original FSU corpus containing protein and derived annotations.

For further information see Hahn et al.: A proposal for a configurable silver standard https://www.aclweb.org/anthology/W10-1838/

classmethod download_corpus(data_dir)View on GitHub#
Return type:

Path

static parse_corpus(corpus_dir, sentence_separator)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.GELLUS(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

Original Gellus corpus containing cell line annotations.

For further information, see Kaewphan et al.: Cell line name recognition in support of the identification of synthetic lethality in cancer from text https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4708107/

class flair.datasets.GPRO(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original GPRO corpus containing gene annotations.

For further information see: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/gpro-detailed-task-description/

classmethod download_train_corpus(data_dir)View on GitHub#
Return type:

Path

classmethod download_dev_corpus(data_dir)View on GitHub#
Return type:

Path

static parse_input_file(text_file, ann_file)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.HunerEntityLinkingDictionary(path, dataset_name)View on GitHub#

Bases: EntityLinkingDictionary

Base dictionary with data already in huner format.

Every line in the file must be formatted as follows:

concept_id||concept_name

If multiple names are associated with a given concept id, they have to be separated by a |, e.g.

7157||TP53|tumor protein p53
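
A small loading sketch; the path and dataset name below are placeholders for a file in the format shown above.

    from flair.datasets import HunerEntityLinkingDictionary

    dictionary = HunerEntityLinkingDictionary(path="genes.dict", dataset_name="my_gene_dictionary")
    print(len(dictionary.candidates))  # all entity candidates read from the file
    print(dictionary.candidates[0])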

class flair.datasets.HUNER_CELL_LINE(sentence_splitter=None)View on GitHub#

Bases: HunerMultiCorpus

Union of all HUNER cell line data sets.
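
A loading sketch; note that this downloads and merges several corpora on first use, and the label type name "ner" is an assumption that holds for the HUNER column corpora in recent flair versions.

    from flair.datasets import HUNER_CELL_LINE

    corpus = HUNER_CELL_LINE()
    print(corpus)
    tag_dictionary = corpus.make_label_dictionary(label_type="ner")  # "ner" is an assumption
    print(tag_dictionary)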

class flair.datasets.HUNER_CELL_LINE_CELL_FINDER(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the CellFinder corpus containing only cell line annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.HUNER_CELL_LINE_CLL(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the CLL corpus containing cell line annotations.

static split_url()View on GitHub#
Return type:

str

get_corpus_sentence_splitter()View on GitHub#

Return the pre-defined sentence splitter if defined, otherwise return None.

Return type:

SentenceSplitter

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.HUNER_CELL_LINE_GELLUS(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the Gellus corpus containing cell line annotations.

static split_url()View on GitHub#
Return type:

str

get_corpus_sentence_splitter()View on GitHub#

Return the pre-defined sentence splitter if defined, otherwise return None.

Return type:

SentenceSplitter

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.HUNER_CELL_LINE_JNLPBA(*args, **kwargs)View on GitHub#

Bases: HUNER_JNLPBA

HUNER version of the JNLPBA corpus containing cell line annotations.

class flair.datasets.HUNER_CHEMICAL(sentence_splitter=None)View on GitHub#

Bases: HunerMultiCorpus

Union of all HUNER chemical data sets.

class flair.datasets.HUNER_CHEMICAL_CDR(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the CDR corpus containing chemical annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_CHEMICAL_CEMP(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the CEMP corpus containing chemical annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_CHEMICAL_CHEBI(*args, **kwargs)View on GitHub#

Bases: HUNER_CHEBI

HUNER version of the CHEBI corpus containing chemical annotations.

class flair.datasets.HUNER_CHEMICAL_CHEMDNER(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the CHEMDNER corpus containing chemical annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.HUNER_CHEMICAL_CRAFT_V4(*args, **kwargs)View on GitHub#

Bases: HUNER_CRAFT_V4

HUNER version of the CRAFT corpus containing (only) chemical annotations.

class flair.datasets.HUNER_CHEMICAL_SCAI(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the SCAI chemicals corpus containing chemical annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_DISEASE(sentence_splitter=None)View on GitHub#

Bases: HunerMultiCorpus

Union of all HUNER disease data sets.

class flair.datasets.HUNER_DISEASE_CDR(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the CDR corpus containing disease annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_DISEASE_MIRNA(*args, **kwargs)View on GitHub#

Bases: HUNER_MIRNA

HUNER version of the miRNA corpus containing disease annotations.

class flair.datasets.HUNER_DISEASE_NCBI(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the NCBI corpus containing disease annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_DISEASE_PDR(*args, **kwargs)View on GitHub#

Bases: HunerDataset

PDR Dataset with only Disease annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_DISEASE_SCAI(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the SCAI disease corpus containing disease annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_DISEASE_VARIOME(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the Variome corpus containing disease annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_GENE(sentence_splitter=None)View on GitHub#

Bases: HunerMultiCorpus

Union of all HUNER gene data sets.

class flair.datasets.HUNER_GENE_BC2GM(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the BioCreative-II-GM corpus containing gene annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.HUNER_GENE_BIO_INFER(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the BioInfer corpus containing only gene/protein annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_GENE_CELL_FINDER(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the CellFinder corpus containing only gene annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.HUNER_GENE_CHEBI(*args, **kwargs)View on GitHub#

Bases: HUNER_CHEBI

HUNER version of the CHEBI corpus containing gene annotations.

class flair.datasets.HUNER_GENE_CRAFT_V4(*args, **kwargs)View on GitHub#

Bases: HUNER_CRAFT_V4

HUNER version of the CRAFT corpus containing (only) gene annotations.

class flair.datasets.HUNER_GENE_DECA(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the DECA corpus containing gene annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_GENE_FSU(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the FSU corpus containing (only) gene annotations.

static split_url()View on GitHub#
Return type:

str

get_corpus_sentence_splitter()View on GitHub#

Return the pre-defined sentence splitter if defined, otherwise return None.

Return type:

SentenceSplitter

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_GENE_GPRO(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the GPRO corpus containing gene annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_GENE_IEPA(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the IEPA corpus containing gene annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.HUNER_GENE_JNLPBA(*args, **kwargs)View on GitHub#

Bases: HUNER_JNLPBA

HUNER version of the JNLPBA corpus containing gene annotations.

class flair.datasets.HUNER_GENE_LOCTEXT(*args, **kwargs)View on GitHub#

Bases: HUNER_LOCTEXT

HUNER version of the Loctext corpus containing protein annotations.

class flair.datasets.HUNER_GENE_MIRNA(*args, **kwargs)View on GitHub#

Bases: HUNER_MIRNA

HUNER version of the miRNA corpus containing protein / gene annotations.

class flair.datasets.HUNER_GENE_OSIRIS(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the OSIRIS corpus containing (only) gene annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_GENE_VARIOME(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the Variome corpus containing gene annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_SPECIES(sentence_splitter=None)View on GitHub#

Bases: HunerMultiCorpus

Union of all HUNER species data sets.

class flair.datasets.HUNER_SPECIES_CELL_FINDER(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the CellFinder corpus containing only species annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.HUNER_SPECIES_CHEBI(*args, **kwargs)View on GitHub#

Bases: HUNER_CHEBI

HUNER version of the CHEBI corpus containing species annotations.

class flair.datasets.HUNER_SPECIES_CRAFT_V4(*args, **kwargs)View on GitHub#

Bases: HUNER_CRAFT_V4

HUNER version of the CRAFT corpus containing (only) species annotations.

class flair.datasets.HUNER_SPECIES_LINNEAUS(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the LINNEAUS corpus containing species annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_SPECIES_LOCTEXT(*args, **kwargs)View on GitHub#

Bases: HUNER_LOCTEXT

HUNER version of the Loctext corpus containing species annotations.

class flair.datasets.HUNER_SPECIES_MIRNA(*args, **kwargs)View on GitHub#

Bases: HUNER_MIRNA

HUNER version of the miRNA corpus containing species annotations.

class flair.datasets.HUNER_SPECIES_S800(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the S800 corpus containing species annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.HUNER_SPECIES_VARIOME(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the Variome corpus containing species annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[dict]

class flair.datasets.IEPA(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

IEPA corpus as provided by http://corpora.informatik.hu-berlin.de/.

For further information see Ding, Berleant, Nettleton, Wurtele: Mining MEDLINE: abstracts, sentences, or phrases? https://www.ncbi.nlm.nih.gov/pubmed/11928487

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

static download_dataset(data_dir)View on GitHub#
classmethod parse_dataset(original_file)View on GitHub#
class flair.datasets.JNLPBA(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

Original corpus of the JNLPBA shared task.

For further information see Kim et al.: Introduction to the Bio- Entity Recognition Task at JNLPBA https://www.aclweb.org/anthology/W04-1213.pdf

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

class flair.datasets.LINNEAUS(base_path=None, in_memory=True, tokenizer=None)View on GitHub#

Bases: ColumnCorpus

Original LINNEAUS corpus containing species annotations.

For further information see Gerner et al.:

LINNAEUS: a species name identification system for biomedical literature https://www.ncbi.nlm.nih.gov/pubmed/20149233

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

static download_and_parse_dataset(data_dir)View on GitHub#
class flair.datasets.LOCTEXT(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original LOCTEXT corpus containing species annotations.

For further information see Cejuela et al.:

LocText: relation extraction of protein localizations to assist database curation https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2021-9

static download_dataset(data_dir)View on GitHub#
static parse_dataset(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.MIRNA(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original miRNA corpus.

For further information see Bagewadi et al.: Detecting miRNA Mentions and Relations in Biomedical Literature https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4602280/

classmethod download_and_prepare_train(data_folder, sentence_separator)View on GitHub#
classmethod download_and_prepare_test(data_folder, sentence_separator)View on GitHub#
classmethod parse_file(input_file, split, sentence_separator)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.NCBI_GENE_HUMAN_DICTIONARY(base_path=None)View on GitHub#

Bases: EntityLinkingDictionary

Dictionary for named entity linking on genes using the NCBI Gene ontology.

Note that this dictionary only represents human genes; genes from other species aren't included!

Further information can be found at https://www.ncbi.nlm.nih.gov/gene/

download_dictionary(data_dir)View on GitHub#
Return type:

Path

parse_dictionary(original_file)View on GitHub#
Return type:

Iterator[EntityCandidate]

class flair.datasets.NCBI_TAXONOMY_DICTIONARY(base_path=None)View on GitHub#

Bases: EntityLinkingDictionary

Dictionary for named entity linking on organisms / species using the NCBI taxonomy ontology.

Further information about the ontology can be found at https://www.ncbi.nlm.nih.gov/taxonomy

download_dictionary(data_dir)View on GitHub#
Return type:

Path

parse_dictionary(original_file)View on GitHub#
Return type:

Iterator[EntityCandidate]

class flair.datasets.CTD_DISEASES_DICTIONARY(base_path=None)View on GitHub#

Bases: EntityLinkingDictionary

Dictionary for named entity linking on diseases using the Comparative Toxicogenomics Database (CTD).

Further information can be found at https://ctdbase.org/

download_dictionary(data_dir)View on GitHub#
Return type:

Path

parse_file(original_file)View on GitHub#
Return type:

Iterator[EntityCandidate]
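
A minimal sketch of reading the dictionary and inspecting its candidates; the download happens on first use.

    from flair.datasets import CTD_DISEASES_DICTIONARY

    dictionary = CTD_DISEASES_DICTIONARY()
    print(dictionary.database_name)
    print(dictionary.candidates[0])  # a single EntityCandidate from the dictionary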

class flair.datasets.CTD_CHEMICALS_DICTIONARY(base_path=None)View on GitHub#

Bases: EntityLinkingDictionary

Dictionary for named entity linking on chemicals using the Comparative Toxicogenomics Database (CTD).

Further information can be found at https://ctdbase.org/

download_dictionary(data_dir)View on GitHub#
Return type:

Path

parse_file(original_file)View on GitHub#
Return type:

Iterator[EntityCandidate]

class flair.datasets.NCBI_DISEASE(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original NCBI disease corpus containing disease annotations.

For further information see Dogan et al.: NCBI disease corpus: a resource for disease name recognition and concept normalization https://www.ncbi.nlm.nih.gov/pubmed/24393765

classmethod download_corpus(data_dir)View on GitHub#
Return type:

Path

static patch_training_file(orig_train_file, patched_file)View on GitHub#
static parse_input_file(input_file)View on GitHub#
class flair.datasets.ONTONOTES(base_path=None, version='v4', language='english', domain=None, in_memory=True, **corpusargs)View on GitHub#

Bases: MultiFileColumnCorpus

archive_url = 'https://data.mendeley.com/public-files/datasets/zmycy7t9h9/files/b078e1c4-f7a4-4427-be7f-9389967831ef/file_downloaded'#
classmethod get_available_domains(base_path=None, version='v4', language='english', split='train')View on GitHub#
Return type:

list[str]

classmethod dataset_document_iterator(file_path)View on GitHub#

An iterator over CONLL formatted files which yields documents, regardless of the number of document annotations in a particular file.

This is useful for CONLL data that has been preprocessed, for example by the preprocessing that takes place for the 2012 CONLL coreference resolution task.

Return type:

Iterator[list[dict]]

classmethod sentence_iterator(file_path)View on GitHub#

An iterator over the sentences in an individual CONLL formatted file.

Return type:

Iterator

name: str#
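
A loading sketch; the version and language values mirror the defaults, and the data is fetched from the archive URL above on first use.

    from flair.datasets import ONTONOTES

    print(ONTONOTES.get_available_domains(version="v4", language="english"))
    corpus = ONTONOTES(version="v4", language="english")
    print(corpus)
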
class flair.datasets.OSIRIS(base_path=None, in_memory=True, sentence_splitter=None, load_original_unfixed_annotation=False)View on GitHub#

Bases: ColumnCorpus

Original OSIRIS corpus containing variation and gene annotations.

For further information see Furlong et al.: Osiris v1.2: a named entity recognition system for sequence variants of genes in biomedical literature https://www.ncbi.nlm.nih.gov/pubmed/18251998

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

classmethod download_dataset(data_dir)View on GitHub#
Return type:

Path

classmethod parse_dataset(corpus_folder, fix_annotation=True)View on GitHub#
class flair.datasets.PDR(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Corpus of plant-disease relations.

For further information see Kim et al.: A corpus of plant-disease relations in the biomedical domain https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0221582 http://gcancer.org/pdr/

Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

classmethod download_corpus(data_dir)View on GitHub#
Return type:

Path

class flair.datasets.S800(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

S800 corpus.

For further information see Pafilis et al.: The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0065390.

static download_dataset(data_dir)View on GitHub#
static parse_dataset(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.SCAI_CHEMICALS(*args, **kwargs)View on GitHub#

Bases: ScaiCorpus

Original SCAI chemicals corpus containing chemical annotations.

For further information see Kolářik et al.: Chemical Names: Terminological Resources and Corpora Annotation https://pub.uni-bielefeld.de/record/2603498

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

download_corpus(data_dir)View on GitHub#
Return type:

Path

static perform_corpus_download(data_dir)View on GitHub#
Return type:

Path

class flair.datasets.SCAI_DISEASE(*args, **kwargs)View on GitHub#

Bases: ScaiCorpus

Original SCAI disease corpus containing disease annotations.

For further information see Gurulingappa et al.: An Empirical Evaluation of Resources for the Identification of Diseases and Adverse Effects in Biomedical Literature https://pub.uni-bielefeld.de/record/2603398

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

download_corpus(data_dir)View on GitHub#
Return type:

Path

static perform_corpus_download(data_dir)View on GitHub#
Return type:

Path

class flair.datasets.VARIOME(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Variome corpus as provided by http://corpora.informatik.hu-berlin.de/corpora/brat2bioc/hvp_bioc.xml.zip.

For further information see Verspoor et al.: Annotating the biomedical literature for the human variome https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3676157/

static download_dataset(data_dir)View on GitHub#
static parse_corpus(corpus_xml)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.AMAZON_REVIEWS(split_max=30000, label_name_map={'1.0': 'NEGATIVE', '2.0': 'NEGATIVE', '3.0': 'NEGATIVE', '4.0': 'POSITIVE', '5.0': 'POSITIVE'}, skip_labels=['3.0', '4.0'], fraction_of_5_star_reviews=10, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

A very large corpus of Amazon reviews with positivity ratings.

The corpus is downloaded from and documented at https://nijianmo.github.io/amazon/index.html. We download the 5-core subset, which still contains tens of millions of reviews.

download_and_prepare_amazon_product_file(data_folder, part_name, max_data_points=None, fraction_of_5_star_reviews=None)View on GitHub#
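
A down-sampled loading sketch; the split_max and fraction_of_5_star_reviews values below are illustrative.

    from flair.datasets import AMAZON_REVIEWS

    corpus = AMAZON_REVIEWS(split_max=5000, fraction_of_5_star_reviews=10)
    print(corpus)
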
class flair.datasets.COMMUNICATIVE_FUNCTIONS(base_path=None, memory_mode='full', tokenizer=<flair.tokenization.SpaceTokenizer object>, **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The Communicative Functions Classification Corpus.

Classifying sentences from scientific papers into 39 communicative functions.

class flair.datasets.GERMEVAL_2018_OFFENSIVE_LANGUAGE(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='full', fine_grained_classes=False, **corpusargs)View on GitHub#

Bases: ClassificationCorpus

GermEval 2018 corpus for identification of offensive language.

Classifying German tweets into 2 coarse-grained categories (OFFENSIVE and OTHER) or 4 fine-grained categories (ABUSE, INSULT, PROFANITY and OTHER).
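
A minimal sketch (an illustration, not from the original documentation) of how the fine_grained_classes flag from the signature above switches between the two label sets:

from flair.datasets import GERMEVAL_2018_OFFENSIVE_LANGUAGE

coarse_corpus = GERMEVAL_2018_OFFENSIVE_LANGUAGE()                         # OFFENSIVE / OTHER
fine_corpus = GERMEVAL_2018_OFFENSIVE_LANGUAGE(fine_grained_classes=True)  # ABUSE, INSULT, PROFANITY, OTHER

print(fine_corpus.train[0])  # a tweet Sentence with its fine-grained label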

class flair.datasets.GLUE_COLA(label_type='acceptability', base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)View on GitHub#

Bases: ClassificationCorpus

Corpus of Linguistic Acceptability from GLUE benchmark.

see https://gluebenchmark.com/tasks

The task is to predict whether an English sentence is grammatically correct. In addition to the regular corpus splits, an eval_dataset containing the unlabeled test data is provided for GLUE evaluation.

tsv_from_eval_dataset(folder_path)View on GitHub#

Create eval prediction file.

This function creates a tsv file with the predictions for eval_dataset (after calling classifier.predict(corpus.eval_dataset, label_name='acceptability')). The resulting file is called CoLA.tsv and is in the format required for submission to the GLUE benchmark.
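
The workflow described above as a hedged sketch (the model path is a placeholder, not a real resource):

from flair.datasets import GLUE_COLA
from flair.models import TextClassifier

corpus = GLUE_COLA()
classifier = TextClassifier.load("path/to/trained-cola-model.pt")  # hypothetical local model

# predict acceptability labels for the unlabeled GLUE test split
classifier.predict(corpus.eval_dataset, label_name="acceptability")

# writes CoLA.tsv into the given folder, in the GLUE submission format
corpus.tsv_from_eval_dataset("glue_submission/")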

class flair.datasets.GO_EMOTIONS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

GoEmotions dataset containing 58k Reddit comments labeled with 27 emotion categories.

see google-research/google-research

class flair.datasets.IMDB(base_path=None, rebalance_corpus=True, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

Corpus of IMDB movie reviews labeled by sentiment (POSITIVE, NEGATIVE).

Downloaded from and documented at http://ai.stanford.edu/~amaas/data/sentiment/.

class flair.datasets.NEWSGROUPS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

20 newsgroups corpus, classifying news items into one of 20 categories.

Downloaded from http://qwone.com/~jason/20Newsgroups

Each data point is a full news article, so documents may be very long.

class flair.datasets.STACKOVERFLOW(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

Stackoverflow corpus classifying questions into one of 20 labels.

The data is downloaded from the jacoxu/StackOverflow repository.

Each data point is a question.

class flair.datasets.SENTEVAL_CR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The customer reviews dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.

see facebookresearch/SentEval

class flair.datasets.SENTEVAL_MPQA(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The opinion-polarity dataset of SentEval, classified into NEGATIVE or POSITIVE polarity.

see facebookresearch/SentEval

class flair.datasets.SENTEVAL_MR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The movie reviews dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.

see facebookresearch/SentEval

class flair.datasets.SENTEVAL_SST_BINARY(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The Stanford sentiment treebank dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.

see facebookresearch/SentEval

class flair.datasets.SENTEVAL_SST_GRANULAR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The Stanford sentiment treebank dataset of SentEval, classified into 5 sentiment classes.

see facebookresearch/SentEval

class flair.datasets.SENTEVAL_SUBJ(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The subjectivity dataset of SentEval, classified as SUBJECTIVE or OBJECTIVE.

see facebookresearch/SentEval

class flair.datasets.SENTIMENT_140(label_name_map=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

Twitter sentiment corpus.

See http://help.sentiment140.com/for-students

The training data contains two sentiment labels (POSITIVE, NEGATIVE); the test data contains three (POSITIVE, NEGATIVE, NEUTRAL).

class flair.datasets.TREC_6(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The TREC Question Classification Corpus, classifying questions into 6 coarse-grained answer types.

class flair.datasets.TREC_50(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The TREC Question Classification Corpus, classifying questions into 50 fine-grained answer types.

class flair.datasets.WASSA_ANGER(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)View on GitHub#

Bases: ClassificationCorpus

WASSA-2017 anger emotion-intensity corpus.

see https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.

class flair.datasets.WASSA_FEAR(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)View on GitHub#

Bases: ClassificationCorpus

WASSA-2017 fear emotion-intensity corpus.

see https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.

class flair.datasets.WASSA_JOY(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)View on GitHub#

Bases: ClassificationCorpus

WASSA-2017 joy emotion-intensity corpus.

see https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html

class flair.datasets.WASSA_SADNESS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)View on GitHub#

Bases: ClassificationCorpus

WASSA-2017 sadness emotion-intensity corpus.

see https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.

class flair.datasets.YAHOO_ANSWERS(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='partial', **corpusargs)View on GitHub#

Bases: ClassificationCorpus

The YAHOO Question Classification Corpus, classifying questions into 10 coarse-grained answer types.

class flair.datasets.ClassificationCorpus(data_folder, label_type='class', train_file=None, test_file=None, dev_file=None, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, sample_missing_splits=True, encoding='utf-8')View on GitHub#

Bases: Corpus

A classification corpus from FastText-formatted text files.
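
For illustration (folder, file names and label name are assumptions): each line of a FastText-formatted file starts with one or more __label__<LABEL> prefixes followed by the text, e.g. "__label__POSITIVE I really enjoyed this film ." A folder of such files can be loaded as follows:

from flair.datasets import ClassificationCorpus

corpus = ClassificationCorpus(
    "path/to/data_folder",   # hypothetical folder containing the files below
    label_type="sentiment",
    train_file="train.txt",
    dev_file="dev.txt",
    test_file="test.txt",
)
label_dictionary = corpus.make_label_dictionary(label_type="sentiment")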

class flair.datasets.ClassificationDataset(path_to_file, label_type, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, encoding='utf-8')View on GitHub#

Bases: FlairDataset

Dataset for classification instantiated from a single FastText-formatted file.

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.CSVClassificationCorpus(data_folder, column_name_map, label_type, name='csv_corpus', train_file=None, test_file=None, dev_file=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=False, skip_header=False, encoding='utf-8', no_class_label=None, sample_missing_splits=True, **fmtparams)View on GitHub#

Bases: Corpus

Classification corpus instantiated from CSV data files.
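
An illustrative sketch (folder, column layout and label name are assumptions): column_name_map maps CSV column indices to their roles, and additional csv.reader options are forwarded via **fmtparams:

from flair.datasets import CSVClassificationCorpus

corpus = CSVClassificationCorpus(
    "path/to/csv_folder",                     # hypothetical folder with train/dev/test CSV files
    column_name_map={0: "text", 1: "label"},  # column 0 holds the text, column 1 the label
    label_type="sentiment",
    skip_header=True,
    delimiter=",",                            # forwarded to csv.reader via **fmtparams
)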

class flair.datasets.CSVClassificationDataset(path_to_file, column_name_map, label_type, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, skip_header=False, encoding='utf-8', no_class_label=None, **fmtparams)View on GitHub#

Bases: FlairDataset

Dataset for text classification from CSV column formatted data.

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.NEL_ENGLISH_AIDA(base_path=None, in_memory=True, use_ids_and_check_existence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NEL_ENGLISH_AQUAINT(base_path=None, in_memory=True, agreement_threshold=0.5, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NEL_ENGLISH_IITB(base_path=None, in_memory=True, ignore_disagreements=False, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NEL_ENGLISH_REDDIT(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NEL_ENGLISH_TWEEKI(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NEL_GERMAN_HIPE(base_path=None, in_memory=True, wiki_language='dewiki', **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.WSD_MASC(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.WSD_OMSTI(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.WSD_RAGANATO_ALL(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.WSD_SEMCOR(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.WSD_TRAINOMATIC(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, use_raganato_ALL_as_test_data=False)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.WSD_UFSAC(filenames=['masc', 'semcor'], base_path=None, in_memory=True, cut_multisense=True, columns={0: 'text', 3: 'sense'}, banned_sentences=None, sample_missing_splits_in_multicorpus=True, sample_missing_splits_in_each_corpus=True, use_raganato_ALL_as_test_data=False, name='multicorpus')View on GitHub#

Bases: MultiCorpus

class flair.datasets.WSD_WORDNET_GLOSS_TAGGED(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, use_raganato_ALL_as_test_data=False)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.RE_ENGLISH_CONLL04(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

convert_to_conllu(source_data_folder, data_folder)View on GitHub#
class flair.datasets.RE_ENGLISH_DRUGPROT(base_path=None, in_memory=True, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)View on GitHub#

Bases: ColumnCorpus

extract_and_convert_to_conllu(data_file, data_folder)View on GitHub#
char_spans_to_token_spans(char_spans, token_offsets)View on GitHub#
has_overlap(a, b)View on GitHub#
drugprot_document_to_tokenlists(pmid, title_sentences, abstract_sentences, abstract_offset, entities, relations)View on GitHub#
Return type:

list[TokenList]

class flair.datasets.RE_ENGLISH_SEMEVAL2010(base_path=None, in_memory=True, augment_train=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

extract_and_convert_to_conllu(data_file, data_folder, augment_train)View on GitHub#
class flair.datasets.RE_ENGLISH_TACRED(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

extract_and_convert_to_conllu(data_file, data_folder)View on GitHub#
class flair.datasets.BIOSCOPE(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.CONLL_03(base_path=None, column_format={0: 'text', 1: 'pos', 3: 'ner'}, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.CONLL_03_DUTCH(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.CONLL_03_GERMAN(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.CONLL_03_SPANISH(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.CLEANCONLL(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

static download_and_prepare_data(data_folder)View on GitHub#
class flair.datasets.CONLL_2000(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.FEWNERD(setting='supervised', **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.KEYPHRASE_INSPEC(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.KEYPHRASE_SEMEVAL2010(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.KEYPHRASE_SEMEVAL2017(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.MASAKHA_POS(languages='bam', version='v1', base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: MultiCorpus

class flair.datasets.NER_ARABIC_ANER(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ARABIC_AQMAR(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_BASQUE(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_CHINESE_WEIBO(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_DANISH_DANE(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ENGLISH_MOVIE_COMPLEX(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ENGLISH_MOVIE_SIMPLE(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ENGLISH_PERSON(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ENGLISH_RESTAURANT(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ENGLISH_SEC_FILLINGS(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ENGLISH_STACKOVERFLOW(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ENGLISH_TWITTER(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ENGLISH_WEBPAGES(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ENGLISH_WIKIGOLD(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ENGLISH_WNUT_2020(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_FINNISH(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_GERMAN_BIOFID(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_GERMAN_EUROPARL(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_GERMAN_GERMEVAL(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_GERMAN_MOBIE(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_GERMAN_POLITICS(base_path=None, column_delimiter='\\s+', in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_HIPE_2022(dataset_name, language, base_path=None, in_memory=True, version='v2.1', branch_name='main', dev_split_name='dev', add_document_separator=False, sample_missing_splits=False, preproc_fn=None, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_NOISEBENCH(noise='clean', base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

label_url = 'https://raw.githubusercontent.com/elenamer/NoiseBench/main/data/annotations/'#
SAVE_TRAINDEV_FILE = False#
name: str#
class flair.datasets.NER_HUNGARIAN(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ICDAR_EUROPEANA(language, base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ICELANDIC(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_JAPANESE(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_NERMUD(domains='all', base_path=None, in_memory=False, **corpusargs)View on GitHub#

Bases: MultiCorpus

class flair.datasets.NER_MASAKHANE(languages='luo', version='v2', base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: MultiCorpus

class flair.datasets.NER_MULTI_WIKIANN(languages='en', base_path=None, in_memory=False, **corpusargs)View on GitHub#

Bases: MultiCorpus

class flair.datasets.NER_MULTI_WIKINER(languages='en', base_path=None, in_memory=False, **corpusargs)View on GitHub#

Bases: MultiCorpus

class flair.datasets.NER_MULTI_XTREME(languages='en', base_path=None, in_memory=False, **corpusargs)View on GitHub#

Bases: MultiCorpus

class flair.datasets.NER_SWEDISH(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_TURKU(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_UKRAINIAN(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.NER_ESTONIAN_NOISY(version=0, base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

data_url = 'https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/patnlp/estner.cnll.zip'#
label_url = 'https://raw.githubusercontent.com/uds-lsv/NoisyNER/master/data/only_labels'#
name: str#
class flair.datasets.UP_CHINESE(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.UP_ENGLISH(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.UP_FINNISH(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.UP_FRENCH(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.UP_GERMAN(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.UP_ITALIAN(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.UP_SPANISH(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.UP_SPANISH_ANCORA(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.WNUT_17(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.ColumnCorpus(data_folder, column_format, train_file=None, test_file=None, dev_file=None, autofind_splits=True, name=None, comment_symbol='# ', **corpusargs)View on GitHub#

Bases: MultiFileColumnCorpus
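
A minimal sketch (paths are placeholders) for loading a custom CoNLL-style corpus: column_format maps column indices to annotation layers, here token text in column 0 and BIO-encoded NER tags in column 1:

from flair.datasets import ColumnCorpus

columns = {0: "text", 1: "ner"}
corpus = ColumnCorpus(
    "path/to/conll_folder",   # hypothetical folder
    columns,
    train_file="train.txt",
    dev_file="dev.txt",
    test_file="test.txt",
)
print(corpus.train[0])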

class flair.datasets.ColumnDataset(path_to_column_file, column_name_map, column_delimiter='\\s+', comment_symbol=None, banned_sentences=None, in_memory=True, document_separator_token=None, encoding='utf-8', skip_first_line=False, label_name_map=None, default_whitespace_after=1)View on GitHub#

Bases: FlairDataset

SPACE_AFTER_KEY = 'space-after'#
FEATS = ['feats', 'misc']#
HEAD = ['head', 'head_id']#
text_column: int#
head_id_column: Optional[int]#
sentences: list[Sentence]#
sentences_raw: list[list[str]]#
total_sentence_count: int#
is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.NER_MULTI_CONER(task='multi', base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: MultiFileColumnCorpus

class flair.datasets.NER_MULTI_CONER_V2(task='multi', base_path=None, in_memory=True, use_dev_as_test=True, **corpusargs)View on GitHub#

Bases: MultiFileColumnCorpus

class flair.datasets.FeideggerCorpus(**kwargs)View on GitHub#

Bases: Corpus

class flair.datasets.FeideggerDataset(dataset_info, **kwargs)View on GitHub#

Bases: FlairDataset

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.GLUE_MNLI(label_type='entailment', evaluate_on_matched=True, base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#

Bases: DataPairCorpus

tsv_from_eval_dataset(folder_path)View on GitHub#
class flair.datasets.GLUE_MRPC(label_type='paraphrase', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#

Bases: DataPairCorpus

tsv_from_eval_dataset(folder_path)View on GitHub#
class flair.datasets.GLUE_QNLI(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#

Bases: DataPairCorpus

tsv_from_eval_dataset(folder_path)View on GitHub#
class flair.datasets.GLUE_QQP(label_type='paraphrase', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#

Bases: DataPairCorpus

tsv_from_eval_dataset(folder_path)View on GitHub#
class flair.datasets.GLUE_RTE(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#

Bases: DataPairCorpus

tsv_from_eval_dataset(folder_path)View on GitHub#
class flair.datasets.GLUE_WNLI(label_type='entailment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#

Bases: DataPairCorpus

tsv_from_eval_dataset(folder_path)View on GitHub#
class flair.datasets.GLUE_SST2(label_type='sentiment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=False, encoding='utf-8', sample_missing_splits=True, **datasetargs)View on GitHub#

Bases: CSVClassificationCorpus

label_map = {0: 'negative', 1: 'positive'}#
tsv_from_eval_dataset(folder_path)View on GitHub#

Create eval prediction file.

name: str#
class flair.datasets.GLUE_STSB(label_type='similarity', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#

Bases: DataPairCorpus

tsv_from_eval_dataset(folder_path)View on GitHub#

Create a tsv file of the predictions of the eval_dataset.

After calling classifier.predict(corpus.eval_dataset, label_name='similarity'), this function can be used to produce a file called STS-B.tsv, suitable for submission to the GLUE benchmark.
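
Sketched usage mirroring the description above (the regressor is assumed to be a model already trained for the 'similarity' label type; see the GLUE_COLA example earlier for the analogous classification workflow):

from flair.datasets import GLUE_STSB

corpus = GLUE_STSB()
# regressor.predict(corpus.eval_dataset, label_name="similarity")  # assumed trained model
corpus.tsv_from_eval_dataset("glue_submission/")  # writes STS-B.tsv for GLUE submission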

class flair.datasets.SUPERGLUE_RTE(base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, sample_missing_splits=True)View on GitHub#

Bases: DataPairCorpus

jsonl_from_eval_dataset(folder_path)View on GitHub#
class flair.datasets.DataPairCorpus(data_folder, columns=[0, 1, 2], train_file=None, test_file=None, dev_file=None, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, label_type=None, autofind_splits=True, sample_missing_splits=True, skip_first_line=False, separator='\\t', encoding='utf-8')View on GitHub#

Bases: Corpus

class flair.datasets.DataPairDataset(path_to_data, columns=[0, 1, 2], max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, label_type=None, skip_first_line=False, separator='\\t', encoding='utf-8', label=True)View on GitHub#

Bases: FlairDataset

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.DataTripleCorpus(data_folder, columns=[0, 1, 2, 3], train_file=None, test_file=None, dev_file=None, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, label_type=None, autofind_splits=True, sample_missing_splits=True, skip_first_line=False, separator='\\t', encoding='utf-8')View on GitHub#

Bases: Corpus

class flair.datasets.DataTripleDataset(path_to_data, columns=[0, 1, 2, 3], max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True, label_type=None, skip_first_line=False, separator='\\t', encoding='utf-8', label=True)View on GitHub#

Bases: FlairDataset

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.OpusParallelCorpus(dataset, l1, l2, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, **corpusargs)View on GitHub#

Bases: ParallelTextCorpus

class flair.datasets.ParallelTextCorpus(source_file, target_file, name, use_tokenizer=True, max_tokens_per_doc=-1, max_chars_per_doc=-1, in_memory=True, **corpusargs)View on GitHub#

Bases: Corpus

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.ParallelTextDataset(path_to_source, path_to_target, max_tokens_per_doc=-1, max_chars_per_doc=-1, use_tokenizer=True, in_memory=True)View on GitHub#

Bases: FlairDataset

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.UD_AFRIKAANS(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_ANCIENT_GREEK(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_ARABIC(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_ARMENIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_BASQUE(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_BAVARIAN_MAIBAAM(base_path=None, in_memory=True, split_multiwords=True, revision='dev')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_BELARUSIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_BULGARIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_BURYAT(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_CATALAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_CHINESE(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_CHINESE_KYOTO(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_COPTIC(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_CROATIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_CZECH(base_path=None, in_memory=False, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_DANISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_DUTCH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_ENGLISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_ESTONIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_FAROESE(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

This corpus provides the Faroese treebank dataset.

The data is obtained from the following link: UniversalDependencies/UD_Faroese-FarPaHC/{revision}

Faroese is a small Western Scandinavian language with 60,000-100,000 speakers, related to Icelandic and Old Norse.

class flair.datasets.UD_FINNISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_FRENCH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_GALICIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_GERMAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_GERMAN_HDT(base_path=None, in_memory=False, split_multiwords=True, revision='dev')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_GOTHIC(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_GREEK(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_HEBREW(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_HINDI(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_INDONESIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_IRISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_ITALIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_JAPANESE(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_KAZAKH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_KOREAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_LATIN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_LATVIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_LITHUANIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_LIVVI(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_MALTESE(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_MARATHI(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_NAIJA(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_NORTH_SAMI(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_NORWEGIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_OLD_CHURCH_SLAVONIC(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_OLD_FRENCH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_PERSIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_POLISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_PORTUGUESE(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_ROMANIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_RUSSIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_SERBIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_SLOVAK(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_SLOVENIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_SPANISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_SWEDISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_TURKISH(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_UKRAINIAN(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UD_WOLOF(base_path=None, in_memory=True, split_multiwords=True, revision='master')View on GitHub#

Bases: UniversalDependenciesCorpus

class flair.datasets.UniversalDependenciesCorpus(data_folder, train_file=None, test_file=None, dev_file=None, in_memory=True, split_multiwords=True)View on GitHub#

Bases: Corpus

class flair.datasets.UniversalDependenciesDataset(path_to_conll_file, in_memory=True, split_multiwords=True)View on GitHub#

Bases: FlairDataset

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.ZELDA(base_path=None, in_memory=False, column_format={0: 'text', 2: 'nel'}, **corpusargs)View on GitHub#

Bases: MultiFileColumnCorpus