How to load a prepared dataset

This part of the tutorial shows how you can load a corpus for training a model.

The Corpus Object

The Corpus represents a dataset that you use to train a model. It consists of a list of train sentences, a list of dev sentences, and a list of test sentences, which correspond to the training, validation and testing split during model training.
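As a rough mental model, a corpus can be pictured as three lists of sentences. The sketch below is a plain-Python illustration only, not Flair's implementation; the class name ToyCorpus, its fields and the example sentences are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyCorpus:
    """Hypothetical stand-in for a corpus: three lists of sentences."""
    train: List[str] = field(default_factory=list)
    dev: List[str] = field(default_factory=list)
    test: List[str] = field(default_factory=list)

    def __str__(self) -> str:
        # Summarize the size of each split in one line
        return (f"Corpus: {len(self.train)} train + "
                f"{len(self.dev)} dev + {len(self.test)} test sentences")


# Made-up sentences, one list per split
toy = ToyCorpus(
    train=["I love Berlin .", "The grass is green ."],
    dev=["Paris is nice ."],
    test=["What time is it ?"],
)
print(toy)  # Corpus: 2 train + 1 dev + 1 test sentences
```

Keeping the three splits as separate lists is what makes it possible to inspect or resize each one independently.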

The following example snippet instantiates the Universal Dependency Treebank for English as a corpus object:

import flair.datasets
corpus = flair.datasets.UD_ENGLISH()

The first time you call this snippet, it triggers a download of the Universal Dependency Treebank for English onto your hard drive. It then reads the train, test and dev splits into the Corpus which it returns. Check the length of the three splits to see how many Sentences are there:

# print the number of Sentences in the train split
print(len(corpus.train))

# print the number of Sentences in the test split
print(len(corpus.test))

# print the number of Sentences in the dev split
print(len(corpus.dev))

You can also access the Sentence objects in each split directly. For instance, let us look at the first Sentence in the test split of the English UD:

# get the first Sentence in the test split
sentence = corpus.test[0]

# print with all annotations
print(sentence)

# print only with POS annotations (better readability)
print(sentence.to_tagged_string('pos'))

The sentence is fully tagged with syntactic and morphological information. With the latter line, you print out only the POS tags:

Sentence: "What if Google Morphed Into GoogleOS ?" → ["What"/WP, "if"/IN, "Google"/NNP, "Morphed"/VBD, "Into"/IN, "GoogleOS"/NNP, "?"/.]

So the corpus is tagged and ready for training.
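The tagged output shown above follows a simple pattern: the plain text in quotes, followed by a list of "token"/TAG pairs. The following sketch reproduces that layout from a list of (token, tag) pairs. The helper format_tagged is hypothetical and only illustrates how to read the output; it is not part of Flair.

```python
from typing import List, Tuple


def format_tagged(tokens: List[Tuple[str, str]]) -> str:
    """Render (token, tag) pairs in the layout of the output shown above."""
    text = " ".join(token for token, _ in tokens)
    tagged = ", ".join(f'"{token}"/{tag}' for token, tag in tokens)
    return f'Sentence: "{text}" → [{tagged}]'


# Tokens and POS tags copied from the example output
tokens = [("What", "WP"), ("if", "IN"), ("Google", "NNP"), ("Morphed", "VBD"),
          ("Into", "IN"), ("GoogleOS", "NNP"), ("?", ".")]
print(format_tagged(tokens))
```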

Helper functions

A Corpus contains a bunch of useful helper functions. For instance, you can downsample the data by calling downsample() and passing a ratio. So, if you normally get a corpus like this:

import flair.datasets
corpus = flair.datasets.UD_ENGLISH()

then you can downsample the corpus, simply like this:

import flair.datasets
downsampled_corpus = flair.datasets.UD_ENGLISH().downsample(0.1)

If you print both corpora, you see that the second one has been downsampled to 10% of the data.

print("--- 1 Original ---")
print(corpus)

print("--- 2 Downsampled ---")
print(downsampled_corpus)

This should print:

--- 1 Original ---
Corpus: 12543 train + 2002 dev + 2077 test sentences

--- 2 Downsampled ---
Corpus: 1255 train + 201 dev + 208 test sentences
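To make the effect of the ratio concrete, here is a minimal plain-Python sketch that samples each split independently. The function downsample_splits is hypothetical, and rounding up is an assumption chosen so that the sizes match the numbers printed above; Flair's own sampling and rounding may differ.

```python
import math
import random
from typing import Dict, List


def downsample_splits(splits: Dict[str, List[int]], ratio: float,
                      seed: int = 0) -> Dict[str, List[int]]:
    """Keep a random `ratio` share of each split, sampled independently."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return {
        # Assumption: round the target size up to the next integer
        name: rng.sample(items, math.ceil(len(items) * ratio))
        for name, items in splits.items()
    }


# Integer ids stand in for sentences; split sizes mirror the output above
splits = {
    "train": list(range(12543)),
    "dev": list(range(2002)),
    "test": list(range(2077)),
}
small = downsample_splits(splits, 0.1)
print({name: len(items) for name, items in small.items()})
# {'train': 1255, 'dev': 201, 'test': 208}
```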

Creating label dictionaries

For many learning tasks you need to create a "dictionary" that contains all the labels you want to predict. You can generate this dictionary directly out of the Corpus by calling the method make_label_dictionary and passing the desired label_type.
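Conceptually, such a dictionary maps every distinct label to an index. The sketch below builds one from made-up tagged sentences. The function build_label_dictionary and the data are hypothetical and only illustrate the idea; the dictionary object that Flair returns is richer than a plain dict.

```python
from typing import Dict, List, Tuple

# Each sentence is a list of (token, label) pairs; the data is made up
sentences: List[List[Tuple[str, str]]] = [
    [("I", "PRON"), ("love", "VERB"), ("Berlin", "PROPN"), (".", "PUNCT")],
    [("Berlin", "PROPN"), ("is", "AUX"), ("nice", "ADJ"), (".", "PUNCT")],
]


def build_label_dictionary(
    sentences: List[List[Tuple[str, str]]]
) -> Dict[str, int]:
    """Assign each distinct label an index, in order of first appearance."""
    label_to_index: Dict[str, int] = {}
    for sentence in sentences:
        for _, label in sentence:
            if label not in label_to_index:
                label_to_index[label] = len(label_to_index)
    return label_to_index


label_dictionary = build_label_dictionary(sentences)
print(label_dictionary)
# {'PRON': 0, 'VERB': 1, 'PROPN': 2, 'PUNCT': 3, 'AUX': 4, 'ADJ': 5}
```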

For instance, the UD_ENGLISH corpus instantiated above has multiple layers of annotation like regular POS tags ('pos'), universal POS tags ('upos'), morphological tags ('tense', 'number'..) and so on. Create label dictionaries for universal POS tags by passing label_type='upos' like this:

# create label dictionary for a Universal Part-of-Speech tagging task
upos_dictionary = corpus.make_label_dictionary(label_type='upos')

# print dictionary
print(upos_dictionary)

This will print out the created dictionary.


Dictionaries for other label types

If you don't know the label types in a corpus, just call make_label_dictionary with any random label name (e.g. corpus.make_label_dictionary(label_type='abcd')). This will print out statistics on all label types in the corpus:

The corpus contains the following label types: 'lemma' (in 12543 sentences), 'upos' (in 12543 sentences), 'pos' (in 12543 sentences), 'dependency' (in 12543 sentences), 'number' (in 12036 sentences), 'verbform' (in 10122 sentences), 'prontype' (in 9744 sentences), 'person' (in 9381 sentences), 'mood' (in 8911 sentences), 'tense' (in 8747 sentences), 'degree' (in 7148 sentences), 'definite' (in 6851 sentences), 'case' (in 6486 sentences), 'gender' (in 2824 sentences), 'numtype' (in 2771 sentences), 'poss' (in 2516 sentences), 'voice' (in 1085 sentences), 'typo' (in 399 sentences), 'extpos' (in 185 sentences), 'abbr' (in 168 sentences), 'reflex' (in 98 sentences), 'style' (in 31 sentences), 'foreign' (in 5 sentences)

This means that you can create dictionaries for any of these label types for the UD_ENGLISH corpus. Let's create dictionaries for regular part of speech tags and a morphological number tagging task:

# create label dictionary for a regular POS tagging task
pos_dictionary = corpus.make_label_dictionary(label_type='pos')

# create label dictionary for a morphological number tagging task
number_dictionary = corpus.make_label_dictionary(label_type='number')

If you print these dictionaries, you will find that the POS dictionary contains 50 tags and the number dictionary only 2 for this corpus (singular and plural).
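The label-type statistics shown earlier in this section count, for each label type, the number of sentences in which it occurs at least once. A minimal sketch of that kind of count, using made-up data and a hypothetical function count_label_types (not Flair's code), could look like this:

```python
from collections import Counter
from typing import Dict, List

# Each sentence is a list of tokens; each token maps label types to values.
# The data is made up for illustration.
sentences: List[List[Dict[str, str]]] = [
    [{"upos": "PRON", "number": "Sing"}, {"upos": "VERB", "tense": "Pres"}],
    [{"upos": "PROPN", "number": "Sing"}, {"upos": "PUNCT"}],
    [{"upos": "INTJ"}],
]


def count_label_types(sentences: List[List[Dict[str, str]]]) -> Counter:
    """Count, per label type, the number of sentences in which it occurs."""
    counts: Counter = Counter()
    for sentence in sentences:
        # A set ensures each label type is counted once per sentence
        types_in_sentence = {label_type for token in sentence for label_type in token}
        counts.update(types_in_sentence)
    return counts


counts = count_label_types(sentences)
for label_type, n in counts.most_common():
    print(f"'{label_type}' (in {n} sentences)")
```

Label types that appear in only a fraction of the sentences, such as 'tense' here, are the ones for which a dictionary covers only part of the corpus.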

Dictionaries for other corpora types

The method make_label_dictionary can be used for any corpus, including text classification corpora:

# create label dictionary for a text classification task
corpus = flair.datasets.TREC_6()
label_dictionary = corpus.make_label_dictionary(label_type='question_class')

The MultiCorpus Object

If you want to train multiple tasks at once, you can use the MultiCorpus object. To instantiate the MultiCorpus, you first need to create any number of Corpus objects. Afterwards, you can pass a list of Corpus objects to the MultiCorpus constructor. For instance, the following snippet loads a combined corpus consisting of the English, German and Dutch Universal Dependency Treebanks.

english_corpus = flair.datasets.UD_ENGLISH()
german_corpus = flair.datasets.UD_GERMAN()
dutch_corpus = flair.datasets.UD_DUTCH()

# make a multi corpus consisting of three UDs
from flair.data import MultiCorpus
multi_corpus = MultiCorpus([english_corpus, german_corpus, dutch_corpus])

The MultiCorpus inherits from Corpus, so you can use it like any other corpus to train your models.
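One way to picture a combined corpus is as the concatenation of the matching splits of its parts. The sketch below shows this with plain dictionaries. The function combine_corpora and the example sentences are hypothetical, and the assumption that splits are simply concatenated is for illustration only; it is not a description of how MultiCorpus is implemented.

```python
from typing import Dict, List

Splits = Dict[str, List[str]]


def combine_corpora(corpora: List[Splits]) -> Splits:
    """Concatenate the train, dev and test splits of several corpora."""
    return {
        name: [sentence for corpus in corpora for sentence in corpus[name]]
        for name in ("train", "dev", "test")
    }


# Made-up miniature corpora in three languages
english = {"train": ["I love Berlin ."], "dev": ["Hello ."], "test": ["Goodbye ."]}
german = {"train": ["Ich liebe Berlin ."], "dev": ["Hallo ."], "test": ["Tschüss ."]}
dutch = {"train": ["Ik hou van Berlijn ."], "dev": ["Hallo ."], "test": ["Dag ."]}

combined = combine_corpora([english, german, dutch])
print({name: len(sentences) for name, sentences in combined.items()})
# {'train': 3, 'dev': 3, 'test': 3}
```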

Datasets included in Flair

Flair supports many datasets out of the box. It automatically downloads and sets up the data the first time you call the corresponding constructor ID.

The following datasets are supported:

Named Entity Recognition

ID | Language | Description
'CONLL_03' | English | CoNLL-03 4-class NER (requires manual download)
'CONLL_03_GERMAN' | German | CoNLL-03 4-class NER (requires manual download)
'CONLL_03_DUTCH' | Dutch | CoNLL-03 4-class NER
'CONLL_03_SPANISH' | Spanish | CoNLL-03 4-class NER
'ONTONOTES' | Arabic, English, Chinese | Ontonotes 18-class NER
'FEWNERD' | English | FewNERD 66-class NER
'NER_ARABIC_ANER' | Arabic | Arabic Named Entity Recognition Corpus 4-class NER
'NER_ARABIC_AQMAR' | Arabic | American and Qatari Modeling of Arabic 4-class NER (modified)
'NER_BASQUE' | Basque | NER dataset for Basque
'NER_CHINESE_WEIBO' | Chinese | Weibo NER corpus
'NER_DANISH_DANE' | Danish | DaNE dataset
'NER_ENGLISH_MOVIE_SIMPLE' | English | NER dataset for movie reviews - simple NER
'NER_ENGLISH_MOVIE_COMPLEX' | English | NER dataset for movie reviews - complex NER
'NER_ENGLISH_PERSON' | English | PERSON_NER NER with person names
'NER_ENGLISH_RESTAURANT' | English | NER dataset for restaurant reviews
'NER_ENGLISH_SEC_FILLINGS' | English | SEC filings with 4-class NER labels from Alvarado et al. (2015)
'NER_ENGLISH_STACKOVERFLOW' | English | NER on StackOverflow posts
'NER_ENGLISH_TWITTER' | English | Twitter NER dataset
'NER_ENGLISH_WIKIGOLD' | English | Wikigold, a manually annotated collection of Wikipedia text
'NER_ENGLISH_WNUT_2020' | English | WNUT-20 named entity extraction
'NER_ENGLISH_WEBPAGES' | English | 4-class NER on web pages from Ratinov and Roth (2009)
'NER_GERMAN_BIOFID' | German | CoNLL-03 Biodiversity literature NER
'NER_GERMAN_EUROPARL' | German | German Europarl dataset NER in German EU parliament speeches
'NER_GERMAN_GERMEVAL' | German | GermEval 14 NER corpus
'NER_GERMAN_LEGAL' | German | Legal Entity Recognition NER in German Legal Documents
'NER_HIPE_2022' | 5 languages | NER dataset for HIPE-2022 (Identifying Historical People, Places and other Entities)
'NER_HUNGARIAN' | Hungarian | NER on Hungarian business news
'NER_ICELANDIC' | Icelandic | NER on Icelandic
'NER_JAPANESE' | Japanese | Japanese NER dataset automatically generated from Wikipedia
'NER_MASAKHANE' | 10 languages | MasakhaNER: Named Entity Recognition for African Languages corpora
'NER_SWEDISH' | Swedish | Swedish Spraakbanken NER 4-class NER
'NER_TURKU' | Finnish | TURKU_NER NER corpus created by the Turku NLP Group, University of Turku, Finland
'NER_UKRAINIAN' | Ukrainian | lang-uk NER corpus created by the Lang-uk community
'NER_MULTI_WIKIANN' | 282 languages | Gigantic corpus for cross-lingual NER derived from Wikipedia
'NER_MULTI_WIKINER' | 8 languages | WikiNER NER dataset automatically generated from Wikipedia (English, German, French, Italian, Spanish, Portuguese, Polish, Russian)
'NER_MULTI_XTREME' | 176 languages | Xtreme corpus by Google Research for cross-lingual NER consisting of datasets of a total of 176 languages
'WNUT_17' | English | WNUT-17 emerging entity detection

Biomedical Named Entity Recognition

We support 31 biomedical NER datasets.

Entity Linking

ID | Language | Description
'NEL_ENGLISH_AIDA' | English | AIDA CoNLL-YAGO Entity Linking corpus on the CoNLL-03 corpus
'NEL_ENGLISH_AQUAINT' | English | Aquaint Entity Linking corpus introduced in Milne and Witten (2008)
'NEL_ENGLISH_IITB' | English | IITB Entity Linking corpus introduced in Sayali et al. (2009)
'NEL_ENGLISH_REDDIT' | English | Reddit Entity Linking corpus introduced in Botzer et al. (2021) (only gold annotations)
'NEL_ENGLISH_TWEEKI' | English | Tweeki Entity Linking corpus introduced in Harandizadeh and Singh (2020)
'NEL_GERMAN_HIPE' | German | HIPE Entity Linking corpus for historical German as a sentence-segmented version

Relation Extraction

ID | Language | Description
'RE_ENGLISH_CONLL04' | English | CoNLL-04 Relation Extraction
'RE_ENGLISH_SEMEVAL2010' | English | SemEval-2010 Task 8 on Multi-Way Classification of Semantic Relations Between Pairs of Nominals
'RE_ENGLISH_TACRED' | English | TAC Relation Extraction Dataset with 41 relations (download required)
'RE_ENGLISH_DRUGPROT' | English | DrugProt corpus: Biocreative VII Track 1 - drug and chemical-protein interactions

GLUE Benchmark

ID | Language | Description
'GLUE_COLA' | English | The Corpus of Linguistic Acceptability from the GLUE benchmark
'GLUE_MNLI' | English | The Multi-Genre Natural Language Inference Corpus from the GLUE benchmark
'GLUE_RTE' | English | The RTE task from the GLUE benchmark
'GLUE_QNLI' | English | The Stanford Question Answering Dataset formatted as an NLI task from the GLUE benchmark
'GLUE_WNLI' | English | The Winograd Schema Challenge formatted as an NLI task from the GLUE benchmark
'GLUE_MRPC' | English | The MRPC task from the GLUE benchmark
'GLUE_QQP' | English | The Quora Question Pairs dataset where the task is to determine whether a pair of questions are semantically equivalent
'SUPERGLUE_RTE' | English | The RTE task from the SuperGLUE benchmark

Universal Proposition Banks

We also support loading the Universal Proposition Banks for the purpose of training multilingual frame detection systems.

ID | Language | Description
'UP_CHINESE' | Chinese | Universal Propositions for Chinese
'UP_ENGLISH' | English | Universal Propositions for English
'UP_FINNISH' | Finnish | Universal Propositions for Finnish
'UP_FRENCH' | French | Universal Propositions for French
'UP_GERMAN' | German | Universal Propositions for German
'UP_ITALIAN' | Italian | Universal Propositions for Italian
'UP_SPANISH' | Spanish | Universal Propositions for Spanish
'UP_SPANISH_ANCORA' | Spanish (Ancora Corpus) | Universal Propositions for Spanish

Universal Dependency Treebanks

ID | Language | Description
'UD_ARABIC' | Arabic | Universal Dependency Treebank for Arabic
'UD_BASQUE' | Basque | Universal Dependency Treebank for Basque
'UD_BULGARIAN' | Bulgarian | Universal Dependency Treebank for Bulgarian
'UD_CATALAN' | Catalan | Universal Dependency Treebank for Catalan
'UD_CHINESE' | Chinese | Universal Dependency Treebank for Chinese
'UD_CHINESE_KYOTO' | Classical Chinese | Universal Dependency Treebank for Classical Chinese
'UD_CROATIAN' | Croatian | Universal Dependency Treebank for Croatian
'UD_CZECH' | Czech | Very large Universal Dependency Treebank for Czech
'UD_DANISH' | Danish | Universal Dependency Treebank for Danish
'UD_DUTCH' | Dutch | Universal Dependency Treebank for Dutch
'UD_ENGLISH' | English | Universal Dependency Treebank for English
'UD_FINNISH' | Finnish | Universal Dependency Treebank for Finnish
'UD_FRENCH' | French | Universal Dependency Treebank for French
'UD_GERMAN' | German | Universal Dependency Treebank for German
'UD_GERMAN_HDT' | German | Very large Universal Dependency Treebank for German
'UD_HEBREW' | Hebrew | Universal Dependency Treebank for Hebrew
'UD_HINDI' | Hindi | Universal Dependency Treebank for Hindi
'UD_INDONESIAN' | Indonesian | Universal Dependency Treebank for Indonesian
'UD_ITALIAN' | Italian | Universal Dependency Treebank for Italian
'UD_JAPANESE' | Japanese | Universal Dependency Treebank for Japanese
'UD_KOREAN' | Korean | Universal Dependency Treebank for Korean
'UD_NORWEGIAN' | Norwegian | Universal Dependency Treebank for Norwegian
'UD_PERSIAN' | Persian / Farsi | Universal Dependency Treebank for Persian
'UD_POLISH' | Polish | Universal Dependency Treebank for Polish
'UD_PORTUGUESE' | Portuguese | Universal Dependency Treebank for Portuguese
'UD_ROMANIAN' | Romanian | Universal Dependency Treebank for Romanian
'UD_RUSSIAN' | Russian | Universal Dependency Treebank for Russian
'UD_SERBIAN' | Serbian | Universal Dependency Treebank for Serbian
'UD_SLOVAK' | Slovak | Universal Dependency Treebank for Slovak
'UD_SLOVENIAN' | Slovenian | Universal Dependency Treebank for Slovenian
'UD_SPANISH' | Spanish | Universal Dependency Treebank for Spanish
'UD_SWEDISH' | Swedish | Universal Dependency Treebank for Swedish
'UD_TURKISH' | Turkish | Universal Dependency Treebank for Turkish
'UD_UKRAINIAN' | Ukrainian | Universal Dependency Treebank for Ukrainian

Text Classification

ID | Language | Description
'AMAZON_REVIEWS' | English | Amazon product reviews dataset with sentiment annotation
'COMMUNICATIVE_FUNCTIONS' | English | Communicative functions of sentences in scholarly papers
'GERMEVAL_2018_OFFENSIVE_LANGUAGE' | German | Offensive language detection for German
'GO_EMOTIONS' | English | GoEmotions dataset: Reddit comments labeled with 27 emotions
'IMDB' | English | IMDB dataset of movie reviews with sentiment annotation
'NEWSGROUPS' | English | The popular 20 newsgroups classification dataset
'YAHOO_ANSWERS' | English | The 10 largest main categories from Yahoo! Answers
'SENTIMENT_140' | English | Tweets dataset with sentiment annotation
'SENTEVAL_CR' | English | Customer reviews dataset of SentEval with sentiment annotation
'SENTEVAL_MR' | English | Movie reviews dataset of SentEval with sentiment annotation
'SENTEVAL_SUBJ' | English | Subjectivity dataset of SentEval
'SENTEVAL_MPQA' | English | Opinion-polarity dataset of SentEval with opinion-polarity annotation
'SENTEVAL_SST_BINARY' | English | Stanford sentiment treebank dataset of SentEval with sentiment annotation
'SENTEVAL_SST_GRANULAR' | English | Stanford sentiment treebank dataset of SentEval with fine-grained sentiment annotation
'TREC_6', 'TREC_50' | English | The TREC question classification dataset

Text Regression

ID | Language | Description
'WASSA_ANGER' | English | The WASSA emotion-intensity detection challenge (anger)
'WASSA_FEAR' | English | The WASSA emotion-intensity detection challenge (fear)
'WASSA_JOY' | English | The WASSA emotion-intensity detection challenge (joy)
'WASSA_SADNESS' | English | The WASSA emotion-intensity detection challenge (sadness)

Other Sequence Labeling

ID | Language | Description
'CONLL_2000' | English | Syntactic chunking with CoNLL-2000
'BIOSCOPE' | English | Negation and speculation scoping with BioScope: biomedical texts annotated for uncertainty, negation and their scopes
'KEYPHRASE_INSPEC' | English | Keyphrase detection with INSPEC original corpus (2000 docs) from INSPEC database, adapted by midas-research
'KEYPHRASE_SEMEVAL2017' | English | Keyphrase detection with SEMEVAL2017 dataset (500 docs) from ScienceDirect, adapted by midas-research
'KEYPHRASE_SEMEVAL2010' | English | Keyphrase detection with SEMEVAL2010 dataset (~250 docs) from ACM Digital Library, adapted by midas-research

Experimental: Similarity Learning

ID | Language | Description
'FeideggerCorpus' | German | Feidegger dataset: fashion images and German-language descriptions
'OpusParallelCorpus' | Any language pair | Parallel corpora of the OPUS project, currently supports only Tatoeba corpus