Flair embeddings#

Contextual string embeddings are powerful embeddings that capture latent syntactic-semantic information that goes beyond standard word embeddings. Key differences are:

  1. they are trained without any explicit notion of words and thus fundamentally model words as sequences of characters.

  2. they are contextualized by their surrounding text, meaning that the same word will have different embeddings depending on its contextual use.
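To make these two properties concrete, here is a toy sketch of the underlying idea (this is NOT Flair's actual model, and the update rule is invented for illustration): a character-level recurrent state is run over the whole text, and each word's embedding is read off from the state after its last character. Flair does the same thing with a trained character-level LSTM language model.

```python
# Toy sketch of contextual string embeddings (NOT Flair's actual model):
# a recurrent state over characters, read off at the end of each word.

def char_state(prev, ch):
    # Hypothetical recurrent update mixing the previous state with the character.
    return [0.6 * p + 0.4 * (ord(ch) % (i + 7)) / 7.0 for i, p in enumerate(prev)]

def word_embeddings(text, dim=4):
    """Embed each word as the recurrent state after its last character."""
    state = [0.0] * dim
    embeddings = []
    for word in text.split():
        for ch in word + ' ':
            state = char_state(state, ch)
        embeddings.append((word, list(state)))
    return embeddings

emb = word_embeddings('Washington went to Washington')
first, last = emb[0][1], emb[3][1]
# Same character sequence, different surrounding context -> different vectors.
print(first != last)  # True
```

Because the state carries everything read so far, the two occurrences of "Washington" receive different vectors, and no word vocabulary is needed at any point.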

With Flair, you can use these embeddings simply by instantiating the appropriate embedding class, just as you would with classic word embeddings:

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

# init embedding
flair_embedding_forward = FlairEmbeddings('news-forward')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
flair_embedding_forward.embed(sentence)

You choose which embeddings you load by passing the appropriate string to the constructor of the FlairEmbeddings class. Currently, the following contextual string embeddings are provided (note: replace ‘X’ with either ‘forward’ or ‘backward’):

| ID | Language | Embedding |
| --- | --- | --- |
| ‘multi-X’ | 300+ | JW300 corpus, as proposed by Agić and Vulić (2019). The corpus is licensed under CC-BY-NC-SA |
| ‘multi-X-fast’ | English, German, French, Italian, Dutch, Polish | Mix of corpora (Web, Wikipedia, Subtitles, News), CPU-friendly |
| ‘news-X’ | English | Trained with 1 billion word corpus |
| ‘news-X-fast’ | English | Trained with 1 billion word corpus, CPU-friendly |
| ‘mix-X’ | English | Trained with mixed corpus (Web, Wikipedia, Subtitles) |
| ‘ar-X’ | Arabic | Added by @stefan-it: Trained with Wikipedia/OPUS |
| ‘bg-X’ | Bulgarian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| ‘bg-X-fast’ | Bulgarian | Added by @stefan-it: Trained with various sources (Europarl, Wikipedia or SETimes) |
| ‘cs-X’ | Czech | Added by @stefan-it: Trained with Wikipedia/OPUS |
| ‘cs-v0-X’ | Czech | Added by @stefan-it: LM embeddings (earlier version) |
| ‘de-X’ | German | Trained with mixed corpus (Web, Wikipedia, Subtitles) |
| ‘de-historic-ha-X’ | German (historical) | Added by @stefan-it: Historical German trained over Hamburger Anzeiger |
| ‘de-historic-wz-X’ | German (historical) | Added by @stefan-it: Historical German trained over Wiener Zeitung |
| ‘de-historic-rw-X’ | German (historical) | Added by @redewiedergabe: Historical German trained over 100 million tokens |
| ‘es-X’ | Spanish | Added by @iamyihwa: Trained with Wikipedia |
| ‘es-X-fast’ | Spanish | Added by @iamyihwa: Trained with Wikipedia, CPU-friendly |
| ‘es-clinical-’ | Spanish (clinical) | Added by @matirojasg: Trained with Wikipedia |
| ‘eu-X’ | Basque | Added by @stefan-it: Trained with Wikipedia/OPUS |
| ‘eu-v0-X’ | Basque | Added by @stefan-it: LM embeddings (earlier version) |
| ‘fa-X’ | Persian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| ‘fi-X’ | Finnish | Added by @stefan-it: Trained with Wikipedia/OPUS |
| ‘fr-X’ | French | Added by @mhham: Trained with French Wikipedia |
| ‘he-X’ | Hebrew | Added by @stefan-it: Trained with Wikipedia/OPUS |
| ‘hi-X’ | Hindi | Added by @stefan-it: Trained with Wikipedia/OPUS |
| ‘hr-X’ | Croatian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| ‘id-X’ | Indonesian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| ‘it-X’ | Italian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| ‘ja-X’ | Japanese | Added by @frtacoa: Trained with 439M words of Japanese Web crawls (2048 hidden states, 2 layers) |
| ‘nl-X’ | Dutch | Added by @stefan-it: Trained with Wikipedia/OPUS |
| ‘nl-v0-X’ | Dutch | Added by @stefan-it: LM embeddings (earlier version) |
| ‘no-X’ | Norwegian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| ‘pl-X’ | Polish | Added by @borchmann: Trained with web crawls (Polish part of CommonCrawl) |
| ‘pl-opus-X’ | Polish | Added by @stefan-it: Trained with Wikipedia/OPUS |
| ‘pt-X’ | Portuguese | Added by @ericlief: LM embeddings |
| ‘sl-X’ | Slovenian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| ‘sl-v0-X’ | Slovenian | Added by @stefan-it: Trained with various sources (Europarl, Wikipedia and OpenSubtitles2018) |
| ‘sv-X’ | Swedish | Added by @stefan-it: Trained with Wikipedia/OPUS |
| ‘sv-v0-X’ | Swedish | Added by @stefan-it: Trained with various sources (Europarl, Wikipedia or OpenSubtitles2018) |
| ‘ta-X’ | Tamil | Added by @stefan-it |
| ‘pubmed-X’ | English | Added by @jessepeng: Trained with 5% of PubMed abstracts until 2015 (1150 hidden states, 3 layers) |
| ‘de-impresso-hipe-v1-X’ | German (historical) | In-domain data (Swiss and Luxembourgish newspapers) for CLEF HIPE Shared task. More information on the shared task can be found in this paper |
| ‘en-impresso-hipe-v1-X’ | English (historical) | In-domain data (Chronicling America material) for CLEF HIPE Shared task. More information on the shared task can be found in this paper |
| ‘fr-impresso-hipe-v1-X’ | French (historical) | In-domain data (Swiss and Luxembourgish newspapers) for CLEF HIPE Shared task. More information on the shared task can be found in this paper |
| ‘am-X’ | Amharic | Based on 6.5m Amharic text corpus crawled from different sources. See this paper and the official GitHub Repository for more information. |
| ‘uk-X’ | Ukrainian | Added by @dchaplinsky: Trained with UberText corpus. |

So, if you want to load embeddings from the German forward LM model, instantiate the class as follows:

flair_de_forward = FlairEmbeddings('de-forward')

And if you want to load embeddings from the Bulgarian backward LM model, instantiate the class as follows:

flair_bg_backward = FlairEmbeddings('bg-backward')

Pooled Flair embeddings#

We also developed a pooled variant of the FlairEmbeddings. These embeddings differ in that they constantly evolve over time, even at prediction time (i.e. after training is complete). This means that the same words in the same sentence at two different points in time may have different embeddings.

PooledFlairEmbeddings manage a ‘global’ representation of each distinct word by applying a pooling operation over all past occurrences. More details on how this works may be found in Akbik et al. (2019).
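The pooling idea can be sketched in a few lines of plain Python (this is a simplified illustration, not Flair's implementation): every contextual embedding seen for a word is remembered, the ‘global’ vector is their mean, and — as in Akbik et al. (2019) — the contextual and pooled vectors are concatenated.

```python
# Minimal sketch of the pooling behind PooledFlairEmbeddings
# (illustration only, not Flair's implementation).
from collections import defaultdict

class PooledEmbedder:
    def __init__(self):
        self.memory = defaultdict(list)  # word -> all contextual embeddings seen

    def embed(self, word, contextual_vec):
        # Record this occurrence, then pool (mean) over all occurrences so far.
        self.memory[word].append(contextual_vec)
        history = self.memory[word]
        pooled = [sum(vals) / len(history) for vals in zip(*history)]
        # Concatenate the contextual and pooled parts.
        return contextual_vec + pooled

pooler = PooledEmbedder()
a = pooler.embed('nails', [1.0, 0.0])  # pooled part equals the single occurrence
b = pooler.embed('nails', [0.0, 1.0])  # pooled part is now the mean of both
```

Because the memory grows with every occurrence, the same word in the same sentence gets a different (pooled) embedding at two different points in time — which is exactly the evolving behavior described above.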

You can instantiate and use PooledFlairEmbeddings like FlairEmbeddings:

from flair.data import Sentence
from flair.embeddings import PooledFlairEmbeddings

# init embedding
flair_embedding_forward = PooledFlairEmbeddings('news-forward')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
flair_embedding_forward.embed(sentence)

Note that while we get some of our best results with PooledFlairEmbeddings, they are very memory-inefficient since they keep the past embeddings of all words in memory. In many cases, regular FlairEmbeddings will be nearly as good but with much lower memory requirements.