# How to load a prepared dataset

This part of the tutorial shows how you can load a corpus for training a model. 

## The Corpus Object

The [`Corpus`](#flair.data.Corpus) represents a dataset that you use to train a model. It consists of a list of `train` sentences,
a list of `dev` sentences, and a list of `test` sentences, which correspond to the training, validation and testing
split during model training.

The following example snippet instantiates the Universal Dependency Treebank for English as a corpus object:

```python
from flair.datasets import UD_ENGLISH
corpus = UD_ENGLISH()
```

The first time you call this snippet, it triggers a download of the Universal Dependency Treebank for English onto your
hard drive. It then reads the train, test and dev splits into the [`Corpus`](#flair.data.Corpus) which it returns. Check the length of
the three splits to see how many Sentences are there:

```python
# print the number of Sentences in the train split
print(len(corpus.train))

# print the number of Sentences in the test split
print(len(corpus.test))

# print the number of Sentences in the dev split
print(len(corpus.dev))
```

You can also access the [`Sentence`](#flair.data.Sentence) objects in each split directly. For instance, let us look at the first Sentence in
the training split of the English UD:

```python
# get the first Sentence in the training split
sentence = corpus.test[0]

# print with all annotations
print(sentence)

# print only with POS annotations (better readability)
print(sentence.to_tagged_string('pos'))
```

The sentence is fully tagged with syntactic and morphological information. With the latter line,
you print out only the POS tags:

```console
Sentence: "What if Google Morphed Into GoogleOS ?" → ["What"/WP, "if"/IN, "Google"/NNP, "Morphed"/VBD, "Into"/IN, "GoogleOS"/NNP, "?"/.]
```

So the corpus is tagged and ready for training.

### Helper functions

A [`Corpus`](#flair.data.Corpus) contains a bunch of useful helper functions.
For instance, you can downsample the data by calling [`Corpus.downsample()`](#flair.data.Corpus.downsample) and passing a ratio. So, if you normally get a
corpus like this:

```python
from flair.datasets import UD_ENGLISH
corpus = UD_ENGLISH()
```

then you can downsample the corpus, simply like this:

```python
from flair.datasets import UD_ENGLISH
downsampled_corpus = UD_ENGLISH().downsample(0.1)
```

If you print both corpora, you see that the second one has been downsampled to 10% of the data.

```python
print("--- 1 Original ---")
print(corpus)

print("--- 2 Downsampled ---")
print(downsampled_corpus)
```

This should print:

```console
--- 1 Original ---
Corpus: 12543 train + 2002 dev + 2077 test sentences

--- 2 Downsampled ---
Corpus: 1255 train + 201 dev + 208 test sentences
```

### Creating label dictionaries

For many learning tasks you need to create a "dictionary" that contains all the labels you want to predict.
You can generate this dictionary directly out of the [`Corpus`](#flair.data.Corpus) by calling the method [`Corpus.make_label_dictionary`](#flair.data.Corpus.make_label_dictionary)
and passing the desired `label_type`.

For instance, the UD_ENGLISH corpus instantiated above has multiple layers of annotation like regular
POS tags ('pos'), universal POS tags ('upos'), morphological tags ('tense', 'number'..) and so on.
Create label dictionaries for universal POS tags by passing `label_type='upos'` like this:

```python
# create label dictionary for a Universal Part-of-Speech tagging task
upos_dictionary = corpus.make_label_dictionary(label_type='upos')

# print dictionary
print(upos_dictionary)
```

This will print out the created dictionary:

```console
Dictionary with 17 tags: PROPN, PUNCT, ADJ, NOUN, VERB, DET, ADP, AUX, PRON, PART, SCONJ, NUM, ADV, CCONJ, X, INTJ, SYM
```

#### Dictionaries for other label types

If you don't know the label types in a corpus, just call [`Corpus.make_label_dictionary`](#flair.data.Corpus.make_label_dictionary) with
any random label name (e.g. `corpus.make_label_dictionary(label_type='abcd')`). This will print
out statistics on all label types in the corpus:

```console
The corpus contains the following label types: 'lemma' (in 12543 sentences), 'upos' (in 12543 sentences), 'pos' (in 12543 sentences), 'dependency' (in 12543 sentences), 'number' (in 12036 sentences), 'verbform' (in 10122 sentences), 'prontype' (in 9744 sentences), 'person' (in 9381 sentences), 'mood' (in 8911 sentences), 'tense' (in 8747 sentences), 'degree' (in 7148 sentences), 'definite' (in 6851 sentences), 'case' (in 6486 sentences), 'gender' (in 2824 sentences), 'numtype' (in 2771 sentences), 'poss' (in 2516 sentences), 'voice' (in 1085 sentences), 'typo' (in 399 sentences), 'extpos' (in 185 sentences), 'abbr' (in 168 sentences), 'reflex' (in 98 sentences), 'style' (in 31 sentences), 'foreign' (in 5 sentences)
```

This means that you can create dictionaries for any of these label types for the [`UD_ENGLISH`](#flair.datasets.treebanks.UD_ENGLISH) corpus. Let's create dictionaries for regular part of speech tags
and a morphological number tagging task:

```python
# create label dictionary for a regular POS tagging task
pos_dictionary = corpus.make_label_dictionary(label_type='pos')

# create label dictionary for a morphological number tagging task
tense_dictionary = corpus.make_label_dictionary(label_type='number')
```

If you print these dictionaries, you will find that the POS dictionary contains 50 tags and the number dictionary only 2 for this corpus (singular and plural).


#### Dictionaries for other corpora types

The method [`Corpus.make_label_dictionary`](#flair.data.Corpus.make_label_dictionary) can be used for any corpus, including text classification corpora:

```python
# create label dictionary for a text classification task
from flair.datasets import TREC_6
corpus = TREC_6()
corpus.make_label_dictionary('question_class')
```

### The MultiCorpus Object

If you want to train multiple tasks at once, you can use the [`MultiCorpus`](#flair.data.MultiCorpus) object.
To initiate the [`MultiCorpus`](#flair.data.MultiCorpus) you first need to create any number of [`Corpus`](#flair.data.Corpus) objects. Afterwards, you can pass
a list of [`Corpus`](#flair.data.Corpus) to the [`MultiCorpus`](#flair.data.MultiCorpus) object. For instance, the following snippet loads a combination corpus
consisting of the English, German and Dutch Universal Dependency Treebanks.

```python
from flair.datasets import UD_ENGLISH, UD_GERMAN, UD_DUTCH
english_corpus = UD_ENGLISH()
german_corpus = UD_GERMAN()
dutch_corpus = UD_DUTCH()

# make a multi corpus consisting of three UDs
from flair.data import MultiCorpus
multi_corpus = MultiCorpus([english_corpus, german_corpus, dutch_corpus])
```

The [`MultiCorpus`](#flair.data.MultiCorpus) inherits from `[`Corpus`](#flair.data.Corpus), so you can use it like any other corpus to train your models.

## Datasets included in Flair

Flair supports many datasets out of the box. It usually automatically downloads and sets up the data the first time you
call the corresponding constructor ID.
The datasets are split into multiple modules, however they all can be imported from `flair.datasets` too.
You can look up the respective modules to find the possible datasets.

The following datasets are supported:

| Task                                | Module                                                                                                                                      |
|-------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|
| Named Entity Recognition            | [flair.datasets.sequence_labeling](#flair.datasets.sequence_labeling)                                                                       |
| Text Classification                 | [flair.datasets.document_classification](#flair.datasets.document_classification)                                                           |
| Text Regression                     | [flair.datasets.document_classification](#flair.datasets.document_classification)                                                           |
| Biomedical Named Entity Recognition | [flair.datasets.biomedical](#flair.datasets.biomedical)                                                                                     |
| Entity Linking                      | [flair.datasets.entity_linking](#flair.datasets.entity_linking)                                                                             |
| Relation Extraction                 | [flair.datasets.relation_extraction](#flair.datasets.relation_extraction)                                                                   |
| Sequence Labeling                   | [flair.datasets.sequence_labeling](#flair.datasets.sequence_labeling)                                                                       |
| Glue Benchmark                      | [flair.datasets.text_text](#flair.datasets.text_text) and [flair.datasets.document_classification](#flair.datasets.document_classification) |
| Universal Proposition Banks         | [flair.datasets.treebanks](#flair.datasets.treebanks)                                                                                       |
| Universal Dependency Treebanks      | [flair.datasets.treebanks](#flair.datasets.treebanks)                                                                                       |
| OCR-Layout-NER                      | [flair.datasets.ocr](#flair.datasets.ocr)                                                                                                   |