Flair embeddings#
Contextual string embeddings are powerful embeddings that capture latent syntactic-semantic information that goes beyond standard word embeddings. Key differences are:
1. they are trained without any explicit notion of words and thus fundamentally model words as sequences of characters.
2. they are contextualized by their surrounding text, meaning that the same word will have different embeddings depending on its contextual use.
With Flair, you can use these embeddings simply by instantiating the appropriate embedding class, same as standard word embeddings:
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

# init embedding
flair_embedding_forward = FlairEmbeddings('news-forward')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
flair_embedding_forward.embed(sentence)
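This also illustrates the second point above: because the embeddings are contextualized, the same word receives a different vector in a different sentence. The following sketch is purely illustrative (the sentences, token indices and cosine-similarity check are not part of the official documentation), but the pattern works with any FlairEmbeddings model:

import torch
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

embedding = FlairEmbeddings('news-forward')

# the same surface form "bank" in two different contexts
sentence_1 = Sentence('The bank approved the loan .')
sentence_2 = Sentence('We sat on the bank of the river .')
embedding.embed(sentence_1)
embedding.embed(sentence_2)

# token indices are 0-based: index 1 and index 4 point at "bank" in these sentences
vector_1 = sentence_1[1].embedding
vector_2 = sentence_2[4].embedding

# the two vectors differ because the surrounding characters differ
print(torch.cosine_similarity(vector_1, vector_2, dim=0))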
You choose which embeddings you load by passing the appropriate string to the constructor of the FlairEmbeddings class.
Currently, the following contextual string embeddings are provided (note: replace ‘X’ with either ‘forward’ or ‘backward’):
| ID | Language | Embedding |
|---|---|---|
| 'multi-X' | 300+ | JW300 corpus, as proposed by Agić and Vulić (2019). The corpus is licensed under CC-BY-NC-SA |
| 'multi-X-fast' | English, German, French, Italian, Dutch, Polish | Mix of corpora (Web, Wikipedia, Subtitles, News), CPU-friendly |
| 'news-X' | English | Trained with 1 billion word corpus |
| 'news-X-fast' | English | Trained with 1 billion word corpus, CPU-friendly |
| 'mix-X' | English | Trained with mixed corpus (Web, Wikipedia, Subtitles) |
| 'ar-X' | Arabic | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'bg-X' | Bulgarian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'bg-X-fast' | Bulgarian | Added by @stefan-it: Trained with various sources (Europarl, Wikipedia or SETimes) |
| 'cs-X' | Czech | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'cs-v0-X' | Czech | Added by @stefan-it: LM embeddings (earlier version) |
| 'de-X' | German | Trained with mixed corpus (Web, Wikipedia, Subtitles) |
| 'de-historic-ha-X' | German (historical) | Added by @stefan-it: Historical German trained over Hamburger Anzeiger |
| 'de-historic-wz-X' | German (historical) | Added by @stefan-it: Historical German trained over Wiener Zeitung |
| 'de-historic-rw-X' | German (historical) | Added by @redewiedergabe: Historical German trained over 100 million tokens |
| 'es-X' | Spanish | Added by @iamyihwa: Trained with Wikipedia |
| 'es-X-fast' | Spanish | Added by @iamyihwa: Trained with Wikipedia, CPU-friendly |
| 'es-clinical-' | Spanish (clinical) | Added by @matirojasg: Trained with Wikipedia |
| 'eu-X' | Basque | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'eu-v0-X' | Basque | Added by @stefan-it: LM embeddings (earlier version) |
| 'fa-X' | Persian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'fi-X' | Finnish | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'fr-X' | French | Added by @mhham: Trained with French Wikipedia |
| 'he-X' | Hebrew | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'hi-X' | Hindi | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'hr-X' | Croatian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'id-X' | Indonesian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'it-X' | Italian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'ja-X' | Japanese | Added by @frtacoa: Trained with 439M words of Japanese Web crawls (2048 hidden states, 2 layers) |
| 'nl-X' | Dutch | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'nl-v0-X' | Dutch | Added by @stefan-it: LM embeddings (earlier version) |
| 'no-X' | Norwegian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'pl-X' | Polish | Added by @borchmann: Trained with web crawls (Polish part of CommonCrawl) |
| 'pl-opus-X' | Polish | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'pt-X' | Portuguese | Added by @ericlief: LM embeddings |
| 'sl-X' | Slovenian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'sl-v0-X' | Slovenian | Added by @stefan-it: Trained with various sources (Europarl, Wikipedia and OpenSubtitles2018) |
| 'sv-X' | Swedish | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'sv-v0-X' | Swedish | Added by @stefan-it: Trained with various sources (Europarl, Wikipedia or OpenSubtitles2018) |
| 'ta-X' | Tamil | Added by @stefan-it |
| 'pubmed-X' | English | Added by @jessepeng: Trained with 5% of PubMed abstracts until 2015 (1150 hidden states, 3 layers) |
| 'de-impresso-hipe-v1-X' | German (historical) | In-domain data (Swiss and Luxembourgish newspapers) for the CLEF HIPE shared task. More information on the shared task can be found in this paper |
| 'en-impresso-hipe-v1-X' | English (historical) | In-domain data (Chronicling America material) for the CLEF HIPE shared task. More information on the shared task can be found in this paper |
| 'fr-impresso-hipe-v1-X' | French (historical) | In-domain data (Swiss and Luxembourgish newspapers) for the CLEF HIPE shared task. More information on the shared task can be found in this paper |
| 'am-X' | Amharic | Based on a 6.5m Amharic text corpus crawled from different sources. See this paper and the official GitHub repository for more information. |
| 'uk-X' | Ukrainian | Added by @dchaplinsky: Trained with UberText corpus. |
So, if you want to load embeddings from the German forward LM model, instantiate the class as follows:
flair_de_forward = FlairEmbeddings('de-forward')
And if you want to load embeddings from the Bulgarian backward LM model, instantiate the class as follows:
flair_bg_backward = FlairEmbeddings('bg-backward')
Recommended Flair usage#
We recommend combining both forward and backward Flair embeddings. Depending on the task, we also recommend adding standard WordEmbeddings into the mix. So, our recommended StackedEmbeddings for most English tasks is:
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# create a StackedEmbedding object that combines glove and forward/backward flair embeddings
stacked_embeddings = StackedEmbeddings([
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
])
That's it! Now just use this embedding like all the other embeddings, i.e. call the embed() method over your sentences.
sentence = Sentence('The grass is green .')

# just embed a sentence using the StackedEmbedding as you would with any single embedding.
stacked_embeddings.embed(sentence)

# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)
Words are now embedded using a concatenation of three different embeddings. This combination often gives state-of-the-art accuracy.
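If you want to verify that the stack really is a concatenation, you can inspect the embedding lengths. This is a small sketch assuming the stacked_embeddings object and embedded sentence from above, and assuming the embedding_length size attribute that Flair embedding classes expose and the embeddings list of components held by the StackedEmbeddings object:

# the stacked length should equal the sum of the component lengths,
# since the per-token vectors are simply concatenated
print(stacked_embeddings.embedding_length)
print(sum(emb.embedding_length for emb in stacked_embeddings.embeddings))

# each embedded token carries a vector of exactly that length
print(len(sentence[0].embedding))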
Pooled Flair embeddings#
We also developed a pooled variant of the FlairEmbeddings. These embeddings differ in that they constantly evolve over time, even at prediction time (i.e. after training is complete). This means that the same words in the same sentence at two different points in time may have different embeddings.
PooledFlairEmbeddings manage a 'global' representation of each distinct word by applying a pooling operation over all past occurrences. More details on how this works may be found in Akbik et al. (2019).
You can instantiate and use PooledFlairEmbeddings just like FlairEmbeddings:
from flair.embeddings import PooledFlairEmbeddings

# init embedding
flair_embedding_forward = PooledFlairEmbeddings('news-forward')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
flair_embedding_forward.embed(sentence)
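To see this evolving behaviour in practice, you can embed the same sentence twice with other text in between. The sketch below is only an illustration (the sentences are made up): because the pool for a word grows with every occurrence it sees, the second pass may produce a different vector for the same word in an identical sentence.

import torch
from flair.data import Sentence
from flair.embeddings import PooledFlairEmbeddings

embedding = PooledFlairEmbeddings('news-forward')

# first pass: embed "green" and keep a copy of its vector
first = Sentence('The grass is green .')
embedding.embed(first)
vector_before = first[3].embedding.clone()

# embed more text containing "green", which grows the pooled memory for that word
embedding.embed(Sentence('She wore a green coat .'))

# second pass over the identical sentence: the pooled part may now differ
second = Sentence('The grass is green .')
embedding.embed(second)
vector_after = second[3].embedding

print(torch.equal(vector_before, vector_after))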
Note that while we get some of our best results with PooledFlairEmbeddings, they are quite memory-inefficient since they keep past embeddings of all words in memory. In many cases, regular FlairEmbeddings will be nearly as good but with much lower memory requirements.
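If you do opt for the pooled variant, it can be combined with classic word embeddings in a StackedEmbeddings just like the regular FlairEmbeddings shown above. The sketch below simply swaps the pooled variant into the recommended stack; keep the memory note in mind, since each pooled embedding maintains a growing cache of word representations:

from flair.embeddings import WordEmbeddings, PooledFlairEmbeddings, StackedEmbeddings

# the recommended stack from above, with the pooled variant of the Flair embeddings
stacked_pooled_embeddings = StackedEmbeddings([
    WordEmbeddings('glove'),
    PooledFlairEmbeddings('news-forward'),
    PooledFlairEmbeddings('news-backward'),
])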