# How to load a prepared dataset
This part of the tutorial shows how you can load a corpus for training a model.
## The Corpus Object

The Corpus represents a dataset that you use to train a model. It consists of a list of train sentences, a list of dev sentences and a list of test sentences, which correspond to the training, validation and testing splits used during model training.
The following example snippet instantiates the Universal Dependency Treebank for English as a corpus object:
```python
import flair.datasets

corpus = flair.datasets.UD_ENGLISH()
```
The first time you call this snippet, it triggers a download of the Universal Dependency Treebank for English onto your hard drive. It then reads the train, test and dev splits into the Corpus object it returns. Check the length of the three splits to see how many Sentences there are:
```python
# print the number of Sentences in the train split
print(len(corpus.train))

# print the number of Sentences in the test split
print(len(corpus.test))

# print the number of Sentences in the dev split
print(len(corpus.dev))
```
You can also access the Sentence objects in each split directly. For instance, let us look at the first Sentence in the test split of the English UD:

```python
# get the first Sentence in the test split
sentence = corpus.test[0]

# print with all annotations
print(sentence)

# print only with POS annotations (better readability)
print(sentence.to_tagged_string('pos'))
```
The sentence is fully tagged with syntactic and morphological information. The last line prints only the POS tags:

```
Sentence: "What if Google Morphed Into GoogleOS ?" → ["What"/WP, "if"/IN, "Google"/NNP, "Morphed"/VBD, "Into"/IN, "GoogleOS"/NNP, "?"/.]
```
So the corpus is tagged and ready for training.
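The split-and-index access pattern above can be sketched without Flair at all. In the following sketch, `SimpleCorpus` and `Sentence` are hypothetical stand-ins for the real `flair.data` classes, shown only to illustrate that a corpus is essentially a container of three sentence lists:

```python
from dataclasses import dataclass, field

@dataclass
class Sentence:
    # hypothetical stand-in: a tokenized sentence with one POS tag per token
    tokens: list
    tags: list

@dataclass
class SimpleCorpus:
    # hypothetical stand-in for flair.data.Corpus: three sentence lists
    train: list = field(default_factory=list)
    dev: list = field(default_factory=list)
    test: list = field(default_factory=list)

toy_corpus = SimpleCorpus(
    train=[Sentence(["Al", "-", "Zaman"], ["NNP", "HYPH", "NNP"])],
    test=[Sentence(["What", "if", "Google", "Morphed", "Into", "GoogleOS", "?"],
                   ["WP", "IN", "NNP", "VBD", "IN", "NNP", "."])],
)

# same access pattern as the real Corpus: index a split, inspect the Sentence
first = toy_corpus.test[0]
print(len(toy_corpus.train), len(toy_corpus.test))  # 1 1
print(" ".join(f'"{t}"/{p}' for t, p in zip(first.tokens, first.tags)))
```

The real Corpus offers much more (downsampling, label dictionaries, statistics), but the train/dev/test triple is the core of it.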
## Helper functions

A Corpus contains a number of useful helper functions. For instance, you can downsample the data by calling downsample() and passing a ratio. So, if you normally get a corpus like this:
```python
import flair.datasets

corpus = flair.datasets.UD_ENGLISH()
```
then you can downsample the corpus like this:
```python
import flair.datasets

downsampled_corpus = flair.datasets.UD_ENGLISH().downsample(0.1)
```
If you print both corpora, you see that the second one has been downsampled to 10% of the data.
```python
print("--- 1 Original ---")
print(corpus)

print("--- 2 Downsampled ---")
print(downsampled_corpus)
```
This should print:
```
--- 1 Original ---
Corpus: 12543 train + 2002 dev + 2077 test sentences

--- 2 Downsampled ---
Corpus: 1255 train + 201 dev + 208 test sentences
```
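Conceptually, downsampling just keeps a random fraction of each split. The following is a minimal sketch of that idea, not Flair's actual implementation; `downsample_split` is a hypothetical helper written for illustration:

```python
import random

def downsample_split(sentences, percentage, seed=42):
    # keep roughly `percentage` of the sentences, chosen at random
    random.seed(seed)
    k = round(len(sentences) * percentage)
    return random.sample(sentences, k)

train = list(range(12543))  # stand-in for the 12,543 train sentences
small_train = downsample_split(train, 0.1)
print(len(small_train))  # 1254
```

Exact counts can differ by one or two depending on how the library rounds, as in the 1255 sentences shown above; the principle is the same.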
## Creating label dictionaries
For many learning tasks you need to create a "dictionary" that contains all the labels you want to predict.
You can generate this dictionary directly out of the Corpus by calling the method make_label_dictionary and passing the desired label_type.
For instance, the UD_ENGLISH corpus instantiated above has multiple layers of annotation, such as regular POS tags ('pos'), universal POS tags ('upos'), morphological tags ('tense', 'number', ...) and so on. Create a label dictionary for universal POS tags by passing label_type='upos' like this:
```python
# create label dictionary for a Universal Part-of-Speech tagging task
upos_dictionary = corpus.make_label_dictionary(label_type='upos')

# print dictionary
print(upos_dictionary)
```
This will print out the created dictionary:
```
Dictionary with 17 tags: PROPN, PUNCT, ADJ, NOUN, VERB, DET, ADP, AUX, PRON, PART, SCONJ, NUM, ADV, CCONJ, X, INTJ, SYM
```
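Conceptually, building such a dictionary means scanning every sentence and collecting the distinct values of the requested label type into an item-to-index mapping. A rough sketch under that assumption (plain dicts stand in for Flair's Sentence and Dictionary objects, and this `make_label_dictionary` is a hypothetical re-implementation for illustration):

```python
def make_label_dictionary(sentences, label_type):
    # collect distinct label values in order of first appearance
    item2idx = {}
    for sentence in sentences:
        for token_labels in sentence:  # one {label_type: value} dict per token
            value = token_labels.get(label_type)
            if value is not None and value not in item2idx:
                item2idx[value] = len(item2idx)
    return item2idx

# two toy sentences, each token carrying 'upos' and sometimes 'number' labels
sentences = [
    [{"upos": "PRON"}, {"upos": "VERB"}, {"upos": "NOUN", "number": "Sing"}],
    [{"upos": "DET"}, {"upos": "NOUN", "number": "Plur"}],
]
print(make_label_dictionary(sentences, "upos"))    # {'PRON': 0, 'VERB': 1, 'NOUN': 2, 'DET': 3}
print(make_label_dictionary(sentences, "number"))  # {'Sing': 0, 'Plur': 1}
```

This also shows why one corpus can yield dictionaries for several label types: each scan simply looks up a different annotation layer.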
### Dictionaries for other label types
If you don't know the label types in a corpus, just call make_label_dictionary with any random label name (e.g. corpus.make_label_dictionary(label_type='abcd')). This will print out statistics on all label types in the corpus:
```
The corpus contains the following label types: 'lemma' (in 12543 sentences), 'upos' (in 12543 sentences), 'pos' (in 12543 sentences), 'dependency' (in 12543 sentences), 'number' (in 12036 sentences), 'verbform' (in 10122 sentences), 'prontype' (in 9744 sentences), 'person' (in 9381 sentences), 'mood' (in 8911 sentences), 'tense' (in 8747 sentences), 'degree' (in 7148 sentences), 'definite' (in 6851 sentences), 'case' (in 6486 sentences), 'gender' (in 2824 sentences), 'numtype' (in 2771 sentences), 'poss' (in 2516 sentences), 'voice' (in 1085 sentences), 'typo' (in 399 sentences), 'extpos' (in 185 sentences), 'abbr' (in 168 sentences), 'reflex' (in 98 sentences), 'style' (in 31 sentences), 'foreign' (in 5 sentences)
```
This means that you can create dictionaries for any of these label types for the UD_ENGLISH corpus. Let's create dictionaries for regular part of speech tags and a morphological number tagging task:
```python
# create label dictionary for a regular POS tagging task
pos_dictionary = corpus.make_label_dictionary(label_type='pos')

# create label dictionary for a morphological number tagging task
number_dictionary = corpus.make_label_dictionary(label_type='number')
```
If you print these dictionaries, you will find that the POS dictionary contains 50 tags and the number dictionary only 2 for this corpus (singular and plural).
### Dictionaries for other corpus types
The method make_label_dictionary can be used for any corpus, including text classification corpora:
```python
# create label dictionary for a text classification task
corpus = flair.datasets.TREC_6()
corpus.make_label_dictionary('question_class')
```
## The MultiCorpus Object
If you want to train multiple tasks at once, you can use the MultiCorpus object. To initiate the MultiCorpus you first need to create any number of Corpus objects. Afterwards, you can pass a list of Corpus objects to the MultiCorpus. For instance, the following snippet loads a combination corpus consisting of the English, German and Dutch Universal Dependency Treebanks.
```python
import flair.datasets
from flair.data import MultiCorpus

english_corpus = flair.datasets.UD_ENGLISH()
german_corpus = flair.datasets.UD_GERMAN()
dutch_corpus = flair.datasets.UD_DUTCH()

# make a multi corpus consisting of three UDs
multi_corpus = MultiCorpus([english_corpus, german_corpus, dutch_corpus])
```
The MultiCorpus inherits from Corpus, so you can use it like any other corpus to train your models.
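Conceptually, a multi-corpus behaves as if the corresponding splits of its members were concatenated, so downstream training code sees one combined train, dev and test set. A sketch of that idea, where `SimpleCorpus` and `SimpleMultiCorpus` are hypothetical stand-ins written for illustration, not Flair's actual classes:

```python
class SimpleCorpus:
    # hypothetical stand-in: a corpus is three lists of sentences
    def __init__(self, train, dev, test):
        self.train, self.dev, self.test = train, dev, test

class SimpleMultiCorpus(SimpleCorpus):
    # a corpus whose splits are the concatenation of all member splits
    def __init__(self, corpora):
        super().__init__(
            train=[s for c in corpora for s in c.train],
            dev=[s for c in corpora for s in c.dev],
            test=[s for c in corpora for s in c.test],
        )

english = SimpleCorpus(train=["en1", "en2"], dev=["en3"], test=["en4"])
german = SimpleCorpus(train=["de1"], dev=["de2"], test=["de3"])

multi = SimpleMultiCorpus([english, german])
print(len(multi.train), len(multi.dev), len(multi.test))  # 3 2 2
```

Because the combined object exposes the same train/dev/test interface as a single corpus, any code written against Corpus works on it unchanged, which is the point of the inheritance mentioned above.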
## Datasets included in Flair
Flair supports many datasets out of the box. It automatically downloads and sets up the data the first time you call the corresponding constructor ID.
The following datasets are supported:
### Named Entity Recognition

| Object | Languages | Description |
|---|---|---|
'CONLL_03' | English | CoNLL-03 4-class NER (requires manual download) |
'CONLL_03_GERMAN' | German | CoNLL-03 4-class NER (requires manual download) |
'CONLL_03_DUTCH' | Dutch | CoNLL-03 4-class NER |
'CONLL_03_SPANISH' | Spanish | CoNLL-03 4-class NER |
'ONTONOTES' | Arabic, English, Chinese | Ontonotes 18-class NER |
'FEWNERD' | English | FewNERD 66-class NER |
'NER_ARABIC_ANER' | Arabic | Arabic Named Entity Recognition Corpus 4-class NER |
'NER_ARABIC_AQMAR' | Arabic | American and Qatari Modeling of Arabic 4-class NER (modified) |
'NER_BASQUE' | Basque | NER dataset for Basque |
'NER_CHINESE_WEIBO' | Chinese | Weibo NER corpus. |
'NER_DANISH_DANE' | Danish | DaNE dataset |
'NER_ENGLISH_MOVIE_SIMPLE' | English | NER dataset for movie reviews - simple NER |
'NER_ENGLISH_MOVIE_COMPLEX' | English | NER dataset for movie reviews - complex NER |
'NER_ENGLISH_PERSON' | English | PERSON_NER NER with person names |
'NER_ENGLISH_RESTAURANT' | English | NER dataset for restaurant reviews |
'NER_ENGLISH_SEC_FILLINGS' | English | SEC filings with 4-class NER labels from [Alvarado et al., 2015](https://aclanthology.org/U15-1010/) |
'NER_ENGLISH_STACKOVERFLOW' | English | NER on StackOverflow posts |
'NER_ENGLISH_TWITTER' | English | Twitter NER dataset |
'NER_ENGLISH_WIKIGOLD' | English | Wikigold a manually annotated collection of Wikipedia text |
'NER_ENGLISH_WNUT_2020' | English | WNUT-20 named entity extraction |
'NER_ENGLISH_WEBPAGES' | English | 4-class NER on web pages from Ratinov and Roth (2009) |
'NER_FINNISH' | Finnish | Finer-data |
'NER_GERMAN_BIOFID' | German | CoNLL-03 Biodiversity literature NER |
'NER_GERMAN_EUROPARL' | German | German Europarl dataset NER in German EU parliament speeches |
'NER_GERMAN_GERMEVAL' | German | GermEval 14 NER corpus |
'NER_GERMAN_LEGAL' | German | Legal Entity Recognition NER in German Legal Documents |
'NER_GERMAN_POLITICS' | German | NEMGP corpus |
'NER_HIPE_2022' | 5 languages | NER dataset for HIPE-2022 (Identifying Historical People, Places and other Entities) |
'NER_HUNGARIAN' | Hungarian | NER on Hungarian business news |
'NER_ICELANDIC' | Icelandic | NER on Icelandic |
'NER_JAPANESE' | Japanese | Japanese NER dataset automatically generated from Wikipedia |
'NER_MASAKHANE' | 10 languages | MasakhaNER: Named Entity Recognition for African Languages corpora |
'NER_SWEDISH' | Swedish | Swedish Spraakbanken NER 4-class NER |
'NER_TURKU' | Finnish | TURKU_NER NER corpus created by the Turku NLP Group, University of Turku, Finland |
'NER_UKRAINIAN' | Ukrainian | lang-uk NER corpus created by the Lang-uk community |
'NER_MULTI_WIKIANN' | 282 languages | Gigantic corpus for cross-lingual NER derived from Wikipedia. |
'NER_MULTI_WIKINER' | 8 languages | WikiNER NER dataset automatically generated from Wikipedia (English, German, French, Italian, Spanish, Portuguese, Polish, Russian) |
'NER_MULTI_XTREME' | 176 languages | Xtreme corpus by Google Research for cross-lingual NER consisting of datasets of a total of 176 languages |
'WNUT_17' | English | WNUT-17 emerging entity detection |
### Biomedical Named Entity Recognition

We support 31 biomedical NER datasets; they are listed separately.

### Entity Linking

| Object | Languages | Description |
|---|---|---|
'NEL_ENGLISH_AIDA' | English | AIDA CoNLL-YAGO Entity Linking corpus on the CoNLL-03 corpus |
'NEL_ENGLISH_AQUAINT' | English | Aquaint Entity Linking corpus introduced in Milne and Witten (2008) |
'NEL_ENGLISH_IITB' | English | IITB Entity Linking corpus introduced in Sayali et al. (2009) |
'NEL_ENGLISH_REDDIT' | English | Reddit Entity Linking corpus introduced in Botzer et al. (2021) (only gold annotations) |
'NEL_ENGLISH_TWEEKI' | English | Tweeki Entity Linking corpus introduced in Harandizadeh and Singh (2020) |
'NEL_GERMAN_HIPE' | German | HIPE Entity Linking corpus for historical German as a sentence-segmented version |
### Relation Extraction

| Object | Languages | Description |
|---|---|---|
'RE_ENGLISH_CONLL04' | English | CoNLL-04 Relation Extraction |
'RE_ENGLISH_SEMEVAL2010' | English | SemEval-2010 Task 8 on Multi-Way Classification of Semantic Relations Between Pairs of Nominals |
'RE_ENGLISH_TACRED' | English | TAC Relation Extraction Dataset with 41 relations (download required) |
'RE_ENGLISH_DRUGPROT' | English | DrugProt corpus: Biocreative VII Track 1 - drug and chemical-protein interactions |
### GLUE Benchmark

| Object | Languages | Description |
|---|---|---|
'GLUE_COLA' | English | The Corpus of Linguistic Acceptability from GLUE benchmark |
'GLUE_MNLI' | English | The Multi-Genre Natural Language Inference Corpus from the GLUE benchmark |
'GLUE_RTE' | English | The RTE task from the GLUE benchmark |
'GLUE_QNLI' | English | The Stanford Question Answering Dataset formatted as an NLI task from the GLUE benchmark |
'GLUE_WNLI' | English | The Winograd Schema Challenge formatted as an NLI task from the GLUE benchmark |
'GLUE_MRPC' | English | The MRPC task from GLUE benchmark |
'GLUE_QQP' | English | The Quora Question Pairs dataset where the task is to determine whether a pair of questions are semantically equivalent |
'SUPERGLUE_RTE' | English | The RTE task from the SuperGLUE benchmark |
### Universal Proposition Banks

We also support loading the Universal Proposition Banks for the purpose of training multilingual frame detection systems.

| Object | Languages | Description |
|---|---|---|
'UP_CHINESE' | Chinese | Universal Propositions for Chinese |
'UP_ENGLISH' | English | Universal Propositions for English |
'UP_FINNISH' | Finnish | Universal Propositions for Finnish |
'UP_FRENCH' | French | Universal Propositions for French |
'UP_GERMAN' | German | Universal Propositions for German |
'UP_ITALIAN' | Italian | Universal Propositions for Italian |
'UP_SPANISH' | Spanish | Universal Propositions for Spanish |
'UP_SPANISH_ANCORA' | Spanish (Ancora Corpus) | Universal Propositions for Spanish |
### Universal Dependency Treebanks

| Object | Languages | Description |
|---|---|---|
'UD_ARABIC' | Arabic | Universal Dependency Treebank for Arabic |
'UD_BASQUE' | Basque | Universal Dependency Treebank for Basque |
'UD_BULGARIAN' | Bulgarian | Universal Dependency Treebank for Bulgarian |
'UD_CATALAN' | Catalan | Universal Dependency Treebank for Catalan |
'UD_CHINESE' | Chinese | Universal Dependency Treebank for Chinese |
'UD_CHINESE_KYOTO' | Classical Chinese | Universal Dependency Treebank for Classical Chinese |
'UD_CROATIAN' | Croatian | Universal Dependency Treebank for Croatian |
'UD_CZECH' | Czech | Very large Universal Dependency Treebank for Czech |
'UD_DANISH' | Danish | Universal Dependency Treebank for Danish |
'UD_DUTCH' | Dutch | Universal Dependency Treebank for Dutch |
'UD_ENGLISH' | English | Universal Dependency Treebank for English |
'UD_FINNISH' | Finnish | Universal Dependency Treebank for Finnish |
'UD_FRENCH' | French | Universal Dependency Treebank for French |
'UD_GERMAN' | German | Universal Dependency Treebank for German |
'UD_GERMAN-HDT' | German | Very large Universal Dependency Treebank for German |
'UD_HEBREW' | Hebrew | Universal Dependency Treebank for Hebrew |
'UD_HINDI' | Hindi | Universal Dependency Treebank for Hindi |
'UD_INDONESIAN' | Indonesian | Universal Dependency Treebank for Indonesian |
'UD_ITALIAN' | Italian | Universal Dependency Treebank for Italian |
'UD_JAPANESE' | Japanese | Universal Dependency Treebank for Japanese |
'UD_KOREAN' | Korean | Universal Dependency Treebank for Korean |
'UD_NORWEGIAN' | Norwegian | Universal Dependency Treebank for Norwegian |
'UD_PERSIAN' | Persian / Farsi | Universal Dependency Treebank for Persian |
'UD_POLISH' | Polish | Universal Dependency Treebank for Polish |
'UD_PORTUGUESE' | Portuguese | Universal Dependency Treebank for Portuguese |
'UD_ROMANIAN' | Romanian | Universal Dependency Treebank for Romanian |
'UD_RUSSIAN' | Russian | Universal Dependency Treebank for Russian |
'UD_SERBIAN' | Serbian | Universal Dependency Treebank for Serbian |
'UD_SLOVAK' | Slovak | Universal Dependency Treebank for Slovak |
'UD_SLOVENIAN' | Slovenian | Universal Dependency Treebank for Slovenian |
'UD_SPANISH' | Spanish | Universal Dependency Treebank for Spanish |
'UD_SWEDISH' | Swedish | Universal Dependency Treebank for Swedish |
'UD_TURKISH' | Turkish | Universal Dependency Treebank for Turkish |
'UD_UKRAINIAN' | Ukrainian | Universal Dependency Treebank for Ukrainian |
### Text Classification

| Object | Languages | Description |
|---|---|---|
'AMAZON_REVIEWS' | English | Amazon product reviews dataset with sentiment annotation |
'COMMUNICATIVE_FUNCTIONS' | English | Communicative functions of sentences in scholarly papers |
'GERMEVAL_2018_OFFENSIVE_LANGUAGE' | German | Offensive language detection for German |
'GO_EMOTIONS' | English | GoEmotions dataset Reddit comments labeled with 27 emotions |
'IMDB' | English | IMDB dataset of movie reviews with sentiment annotation |
'NEWSGROUPS' | English | The popular 20 newsgroups classification dataset |
'YAHOO_ANSWERS' | English | The 10 largest main categories from Yahoo! Answers |
'SENTIMENT_140' | English | Tweets dataset with sentiment annotation |
'SENTEVAL_CR' | English | Customer reviews dataset of SentEval with sentiment annotation |
'SENTEVAL_MR' | English | Movie reviews dataset of SentEval with sentiment annotation |
'SENTEVAL_SUBJ' | English | Subjectivity dataset of SentEval |
'SENTEVAL_MPQA' | English | Opinion-polarity dataset of SentEval with opinion-polarity annotation |
'SENTEVAL_SST_BINARY' | English | Stanford Sentiment Treebank dataset of SentEval with sentiment annotation |
'SENTEVAL_SST_GRANULAR' | English | Stanford Sentiment Treebank dataset of SentEval with fine-grained sentiment annotation |
'TREC_6', 'TREC_50' | English | The TREC question classification dataset |
### Text Regression

| Object | Languages | Description |
|---|---|---|
'WASSA_ANGER' | English | The WASSA emotion-intensity detection challenge (anger) |
'WASSA_FEAR' | English | The WASSA emotion-intensity detection challenge (fear) |
'WASSA_JOY' | English | The WASSA emotion-intensity detection challenge (joy) |
'WASSA_SADNESS' | English | The WASSA emotion-intensity detection challenge (sadness) |
### Other Sequence Labeling

| Object | Languages | Description |
|---|---|---|
'CONLL_2000' | English | Syntactic chunking with CoNLL-2000 |
'BIOSCOPE' | English | Negation and speculation scoping with BioScope: biomedical texts annotated for uncertainty, negation and their scopes |
'KEYPHRASE_INSPEC' | English | Keyphrase detection with the INSPEC original corpus (2000 docs) from the INSPEC database, adapted by midas-research |
'KEYPHRASE_SEMEVAL2017' | English | Keyphrase detection with the SEMEVAL2017 dataset (500 docs) from ScienceDirect, adapted by midas-research |
'KEYPHRASE_SEMEVAL2010' | English | Keyphrase detection with the SEMEVAL2010 dataset (~250 docs) from the ACM Digital Library, adapted by midas-research |
### Experimental: Similarity Learning

| Object | Languages | Description |
|---|---|---|
'FeideggerCorpus' | German | Feidegger dataset of fashion images and German-language descriptions |
'OpusParallelCorpus' | Any language pair | Parallel corpora of the OPUS project; currently supports only the Tatoeba corpus |