# Flair embeddings

Contextual string embeddings are [powerful embeddings](https://drive.google.com/file/d/17yVpFA7MmXaQFTe-HDpZuqw9fJlmzg56/view?usp=sharing) that capture latent syntactic-semantic information going beyond standard word embeddings. Key differences are:

1. They are trained without any explicit notion of words and thus fundamentally model words as sequences of characters.
2. They are **contextualized** by their surrounding text, meaning that the *same word will have different embeddings depending on its contextual use*.

With Flair, you can use these embeddings simply by instantiating the appropriate embedding class, just as you would standard word embeddings:

```python
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

# init embedding
flair_embedding_forward = FlairEmbeddings('news-forward')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
flair_embedding_forward.embed(sentence)
```
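Because these embeddings are contextual, the same surface form receives a different vector in each sentence it appears in. The following is a minimal illustrative sketch (the exact similarity value will depend on the model and Flair version) comparing the vectors for the word "green" in two different contexts:

```python
import torch

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

embedding = FlairEmbeddings('news-forward')

# the same word "green" in two different contexts
sentence_1 = Sentence('The grass is green .')
sentence_2 = Sentence('The traffic light turned green .')
embedding.embed(sentence_1)
embedding.embed(sentence_2)

# the two vectors for "green" will be similar but not identical
green_1 = sentence_1[3].embedding  # "green" is the 4th token (index 3)
green_2 = sentence_2[4].embedding  # "green" is the 5th token (index 4)
print(torch.nn.functional.cosine_similarity(green_1, green_2, dim=0))
```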
You choose which embeddings you load by passing the appropriate string to the constructor of the [`FlairEmbeddings`](#flair.embeddings.token.FlairEmbeddings) class. Currently, the following contextual string embeddings are provided (note: replace '*X*' with either '*forward*' or '*backward*'):

| ID | Language | Embedding |
|----|----------|-----------|
| 'multi-X' | 300+ languages | Trained on the [JW300 corpus](http://opus.nlpl.eu/JW300.php), as proposed by [Agić and Vulić (2019)](https://www.aclweb.org/anthology/P19-1310/). The corpus is licensed under CC-BY-NC-SA |
| 'multi-X-fast' | English, German, French, Italian, Dutch, Polish | Mix of corpora (Web, Wikipedia, Subtitles, News), CPU-friendly |
| 'news-X' | English | Trained with 1 billion word corpus |
| 'news-X-fast' | English | Trained with 1 billion word corpus, CPU-friendly |
| 'mix-X' | English | Trained with mixed corpus (Web, Wikipedia, Subtitles) |
| 'ar-X' | Arabic | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'bg-X' | Bulgarian | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'bg-X-fast' | Bulgarian | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): Trained with various sources (Europarl, Wikipedia or SETimes) |
| 'cs-X' | Czech | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'cs-v0-X' | Czech | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): LM embeddings (earlier version) |
| 'de-X' | German | Trained with mixed corpus (Web, Wikipedia, Subtitles) |
| 'de-historic-ha-X' | German (historical) | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): Historical German trained over *Hamburger Anzeiger* |
| 'de-historic-wz-X' | German (historical) | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): Historical German trained over *Wiener Zeitung* |
| 'de-historic-rw-X' | German (historical) | Added by [@redewiedergabe](https://github.com/redewiedergabe): Historical German trained over 100 million tokens |
| 'es-X' | Spanish | Added by [@iamyihwa](https://github.com/zalandoresearch/flair/issues/80): Trained with Wikipedia |
| 'es-X-fast' | Spanish | Added by [@iamyihwa](https://github.com/zalandoresearch/flair/issues/80): Trained with Wikipedia, CPU-friendly |
| 'es-clinical-X' | Spanish (clinical) | Added by [@matirojasg](https://github.com/flairNLP/flair/issues/2292): Trained with Wikipedia |
| 'eu-X' | Basque | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'eu-v0-X' | Basque | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): LM embeddings (earlier version) |
| 'fa-X' | Persian | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'fi-X' | Finnish | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'fr-X' | French | Added by [@mhham](https://github.com/mhham): Trained with French Wikipedia |
| 'he-X' | Hebrew | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'hi-X' | Hindi | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'hr-X' | Croatian | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'id-X' | Indonesian | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'it-X' | Italian | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'ja-X' | Japanese | Added by [@frtacoa](https://github.com/zalandoresearch/flair/issues/527): Trained with 439M words of Japanese Web crawls (2048 hidden states, 2 layers) |
| 'nl-X' | Dutch | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'nl-v0-X' | Dutch | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): LM embeddings (earlier version) |
| 'no-X' | Norwegian | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'pl-X' | Polish | Added by [@borchmann](https://github.com/applicaai/poleval-2018): Trained with web crawls (Polish part of CommonCrawl) |
| 'pl-opus-X' | Polish | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'pt-X' | Portuguese | Added by [@ericlief](https://github.com/ericlief/language_models): LM embeddings |
| 'sl-X' | Slovenian | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'sl-v0-X' | Slovenian | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): Trained with various sources (Europarl, Wikipedia and OpenSubtitles2018) |
| 'sv-X' | Swedish | Added by [@stefan-it](https://github.com/zalandoresearch/flair/issues/614): Trained with Wikipedia/OPUS |
| 'sv-v0-X' | Swedish | Added by [@stefan-it](https://github.com/stefan-it/flair-lms): Trained with various sources (Europarl, Wikipedia or OpenSubtitles2018) |
| 'ta-X' | Tamil | Added by [@stefan-it](https://github.com/stefan-it/plur) |
| 'pubmed-X' | English | Added by [@jessepeng](https://github.com/zalandoresearch/flair/pull/519): Trained with 5% of PubMed abstracts until 2015 (1150 hidden states, 3 layers) |
| 'de-impresso-hipe-v1-X' | German (historical) | In-domain data (Swiss and Luxembourgish newspapers) for the [CLEF HIPE Shared task](https://impresso.github.io/CLEF-HIPE-2020). More information on the shared task can be found in [this paper](https://zenodo.org/record/3752679#.XqgzxXUzZzU) |
| 'en-impresso-hipe-v1-X' | English (historical) | In-domain data (Chronicling America material) for the [CLEF HIPE Shared task](https://impresso.github.io/CLEF-HIPE-2020). More information on the shared task can be found in [this paper](https://zenodo.org/record/3752679#.XqgzxXUzZzU) |
| 'fr-impresso-hipe-v1-X' | French (historical) | In-domain data (Swiss and Luxembourgish newspapers) for the [CLEF HIPE Shared task](https://impresso.github.io/CLEF-HIPE-2020). More information on the shared task can be found in [this paper](https://zenodo.org/record/3752679#.XqgzxXUzZzU) |
| 'am-X' | Amharic | Based on a 6.5m Amharic text corpus crawled from different sources. See [this paper](https://www.mdpi.com/1999-5903/13/11/275) and the official [GitHub repository](https://github.com/uhh-lt/amharicmodels) for more information |
| 'uk-X' | Ukrainian | Added by [@dchaplinsky](https://github.com/dchaplinsky): Trained with the [UberText](https://lang.org.ua/en/corpora/) corpus |

So, if you want to load embeddings from the German forward LM model, instantiate the class as follows:

```python
flair_de_forward = FlairEmbeddings('de-forward')
```

And if you want to load embeddings from the Bulgarian backward LM model, instantiate the class as follows:

```python
flair_bg_backward = FlairEmbeddings('bg-backward')
```
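The 'multi-X' models listed above cover text in 300+ languages with a single model, which can be convenient for mixed-language corpora. A minimal sketch, using the model IDs from the table (here embedding a German sentence):

```python
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

# one multilingual model handles many languages
flair_multi_forward = FlairEmbeddings('multi-forward')

# embed a German sentence with the same model you would use for any other language
sentence = Sentence('Das Gras ist grün .')
flair_multi_forward.embed(sentence)
```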
## Recommended Flair usage

We recommend combining both forward and backward Flair embeddings. Depending on the task, we also recommend adding standard [`WordEmbeddings`](#flair.embeddings.token.WordEmbeddings) into the mix. So, our recommended [`StackedEmbeddings`](#flair.embeddings.token.StackedEmbeddings) setup for most English tasks is:

```python
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# create a StackedEmbeddings object that combines GloVe and forward/backward Flair embeddings
stacked_embeddings = StackedEmbeddings([
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
])
```

That's it! Now just use this embedding like any other embedding, i.e. call the [`embed()`](#flair.embeddings.base.Embeddings.embed) method over your sentences.

```python
from flair.data import Sentence

sentence = Sentence('The grass is green .')

# just embed a sentence using the StackedEmbeddings as you would with any single embedding
stacked_embeddings.embed(sentence)

# now check out the embedded tokens
for token in sentence:
    print(token)
    print(token.embedding)
```

Words are now embedded using a concatenation of three different embeddings. This combination often gives state-of-the-art accuracy.
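Because the stacked embedding is a concatenation, its dimensionality is the sum of the dimensionalities of its parts, which you can verify via the `embedding_length` property. A quick sanity check continuing the snippet above (the exact sizes shown in the comment are assumptions for these particular models and may vary by model version):

```python
# the stacked embedding length is the sum of its parts,
# e.g. GloVe (100) + news-forward (2048) + news-backward (2048) = 4196
print(stacked_embeddings.embedding_length)

# each token's vector has exactly this length
print(len(sentence[0].embedding) == stacked_embeddings.embedding_length)
```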
## Pooled Flair embeddings

We also developed a pooled variant of [`FlairEmbeddings`](#flair.embeddings.token.FlairEmbeddings). These embeddings differ in that they *constantly evolve over time*, even at prediction time (i.e. after training is complete). This means that the same words in the same sentence at two different points in time may have different embeddings. [`PooledFlairEmbeddings`](#flair.embeddings.token.PooledFlairEmbeddings) maintain a 'global' representation of each distinct word by applying a pooling operation over all of its past occurrences. More details on how this works can be found in [Akbik et al. (2019)](https://www.aclweb.org/anthology/N19-1078/).

You can instantiate and use [`PooledFlairEmbeddings`](#flair.embeddings.token.PooledFlairEmbeddings) just like [`FlairEmbeddings`](#flair.embeddings.token.FlairEmbeddings):

```python
from flair.data import Sentence
from flair.embeddings import PooledFlairEmbeddings

# init embedding
flair_embedding_forward = PooledFlairEmbeddings('news-forward')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
flair_embedding_forward.embed(sentence)
```

Note that while we get some of our best results with [`PooledFlairEmbeddings`](#flair.embeddings.token.PooledFlairEmbeddings), they are very memory-inefficient, since they keep past embeddings of all words in memory. In many cases, regular [`FlairEmbeddings`](#flair.embeddings.token.FlairEmbeddings) will be nearly as good at a much lower memory cost.
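If you want to experiment with how the global word representation is built, the pooling operation itself can be configured. A minimal sketch, assuming your Flair version exposes a `pooling` argument on the constructor (Akbik et al. (2019) describe min, max and mean pooling; check the `PooledFlairEmbeddings` signature of your installed version):

```python
from flair.embeddings import PooledFlairEmbeddings

# assumed configuration: pool past occurrences of a word with 'mean'
# instead of the default pooling operation
pooled_mean_embedding = PooledFlairEmbeddings('news-forward', pooling='mean')
```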