Basics
This tutorial explains the basic concepts used in Flair:
- what is a Sentence
- what is a Label
You should be familiar with these two concepts in order to get the most out of Flair.
What is a Sentence
If you want to tag a sentence, you first need to make a Sentence object for it.
For example, say you want to tag the text "The grass is green.".
Let's start by making a Sentence object for this sentence.
# The sentence object holds a sentence that we may want to embed or tag
from flair.data import Sentence
# Make a sentence object by passing a string
sentence = Sentence('The grass is green.')
# Print the object to see what's in there
print(sentence)
This should print:
Sentence[5]: "The grass is green."
The printout tells us that the sentence consists of 5 tokens.
A token is an atomic unit of the text, usually a word or a punctuation mark.
Iterating over the tokens in a Sentence
So what are the 5 tokens in this example sentence?
You can iterate over all tokens in a sentence like this:
for token in sentence:
    print(token)
This should print:
Token[0]: "The"
Token[1]: "grass"
Token[2]: "is"
Token[3]: "green"
Token[4]: "."
This printout is telling us that the 5 tokens in the text are the words "The", "grass", "is", "green", with a separate token for the full stop at the end. The tokens therefore correspond to the words and the punctuation of the text.
Directly accessing a token
You can access a token of a sentence either by its token id (counted from 1) or by its index (counted from 0):
# using the token id
print(sentence.get_token(4))
# using the index itself
print(sentence[3])
Both calls should print:
Token[3]: "green"
This printout includes the token index (3) and the lexical value of the token ("green").
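The example above suggests that token ids count from 1 while indices count from 0, which is why id 4 and index 3 reach the same token. Assuming that reading is right, the relationship can be sketched with a plain Python list. The helper get_token_by_id below is hypothetical and is not part of Flair.

```python
# Plain-list stand-in for the tokens of "The grass is green."
tokens = ["The", "grass", "is", "green", "."]


def get_token_by_id(tokens, token_id):
    """Hypothetical helper: look up a token by a 1-based id."""
    # Assumption: ids start at 1, so subtract 1 to get the list index
    return tokens[token_id - 1]


by_id = get_token_by_id(tokens, 4)  # id 4 is the fourth token
by_index = tokens[3]                # index 3 is also the fourth token
print(by_id, by_index)
```

Both lookups land on "green" because the id is simply the index plus one.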
Tokenization
When you create a Sentence as above, the text is automatically tokenized (segmented into words) using the segtok library.
You can also use a different tokenizer if you like. To learn more about this, check out our tokenization tutorial.
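To get a feel for what tokenization does, here is a rough sketch that uses only Python's standard library. It is not segtok and not what Flair runs internally. The function simple_tokenize is a hypothetical name, and the regular expression is a deliberately naive rule that treats runs of word characters as words and every other non-space character as its own token.

```python
import re
from typing import List


def simple_tokenize(text: str) -> List[str]:
    """Naive tokenizer for illustration only (not segtok)."""
    # \w+ matches a run of word characters (a word);
    # [^\w\s] matches one character that is neither word nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("The grass is green.")
for index, token in enumerate(tokens):
    print(f'Token[{index}]: "{token}"')
```

On this sentence the naive rule yields the same 5 tokens as shown above, with the full stop split off from "green". Real tokenizers handle many cases this sketch does not, such as abbreviations, URLs and contractions.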
What is a Label
All Flair models predict labels. For instance, our sentiment analysis models will predict labels for a sentence. Our NER models will predict labels for tokens in a sentence.
Example 1: Labeling a token in a sentence
To illustrate how labels work, let's use the same example sentence as above: "The grass is green.".
Let us label all "color words" in this sentence. Since the sentence contains only one color word (namely "green"), we only need to add a label to one of the tokens.
We access token 3 in the sentence, and set a label for it:
# Make a sentence object by passing a string
sentence = Sentence('The grass is green.')
# add an NER tag to token 3 in the sentence
sentence[3].add_label('ner', 'color')
# print the sentence (now with this annotation)
print(sentence)
This should print:
Sentence[5]: "The grass is green." → ["green"/color]
The output indicates that the word "green" in this sentence is labeled as a "color". You can also iterate through each token and print it to see if it has labels:
for token in sentence:
    print(token)
This should print:
Token[0]: "The"
Token[1]: "grass"
Token[2]: "is"
Token[3]: "green" → color (1.0)
Token[4]: "."
This shows that there are 5 tokens in the sentence, one of which has a label.
The add_label method used here has two mandatory parameters: a typename and a value.
- The typename is a string that indicates the type of annotation (in this case: ner).
- The value is the label value to assign (in this case: color).
Example 2: Labeling a whole sentence
Sometimes you want to label an entire sentence instead of only a token. Do this by calling add_label for the whole sentence.
For example, say we want to add a sentiment label to the sentence "The grass is green.":
sentence = Sentence('The grass is green.')
# add a label to a sentence
sentence.add_label('sentiment', 'POSITIVE')
print(sentence)
This should print:
Sentence[5]: "The grass is green." → POSITIVE (1.0)
This indicates that the sentence is now labeled as having a positive sentiment.
Multiple labels
Importantly, in Flair you can add as many labels to a sentence as you like.
Let's bring the two examples above together: We will label the sentence "The grass is green." with an overall positive sentiment, and also add a "color" tag to the token "green":
sentence = Sentence('The grass is green.')
# add a sentiment label to the sentence
sentence.add_label('sentiment', 'POSITIVE')
# add an NER tag to token 3 in the sentence
sentence[3].add_label('ner', 'color')
# print the sentence with all annotations
print(sentence)
This will print:
Sentence[5]: "The grass is green." → POSITIVE (1.0) → ["green"/color]
This indicates that the sentence is now labeled with two different types of information.
Accessing labels
You can iterate through all labels of a sentence using the .get_labels() method:
# iterate over all labels and print
for label in sentence.get_labels():
    print(label)
This will get each label and print it. For instance, let's re-use the previous example in which we add two different labels to the same sentence:
sentence = Sentence('The grass is green.')
# add a sentiment label to the sentence
sentence.add_label('sentiment', 'POSITIVE')
# add an NER tag to token 3 in the sentence
sentence[3].add_label('ner', 'color')
# iterate over all labels and print
for label in sentence.get_labels():
    print(label)
This will now print the following two lines:
Sentence[5]: "The grass is green." → POSITIVE (1.0)
Token[3]: "green" → color (1.0)
This printout tells us that there are two labels: The first is for the whole sentence, tagged as POSITIVE. The second is only for the token "green", tagged as "color".
If you only want to iterate over labels of a specific type, pass the label type as a parameter to get_labels(). For instance, to iterate over the NER labels only, do:
# iterate over all NER labels only
for label in sentence.get_labels('ner'):
    print(label)
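One way to picture how typed labels can be stored and filtered is a dictionary that maps each label type to a list of labels. The sketch below is a toy model using only the standard library. ToyLabel and ToyDataPoint are hypothetical classes and do not reflect how Flair is implemented. As a simplification, both labels are attached to a single object here, whereas in Flair the 'ner' label from the example lives on the token.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ToyLabel:
    """Hypothetical stand-in for a label: a value plus a confidence score."""
    value: str
    score: float = 1.0


@dataclass
class ToyDataPoint:
    """Hypothetical stand-in for anything that can carry labels."""
    text: str
    labels: dict = field(default_factory=lambda: defaultdict(list))

    def add_label(self, typename: str, value: str, score: float = 1.0) -> None:
        # File the label under its annotation type, e.g. 'ner' or 'sentiment'
        self.labels[typename].append(ToyLabel(value, score))

    def get_labels(self, typename=None):
        # No type given: return the labels of every type, in insertion order
        if typename is None:
            return [label for group in self.labels.values() for label in group]
        # Type given: return only the labels filed under that type
        return list(self.labels.get(typename, []))


point = ToyDataPoint("The grass is green.")
point.add_label("sentiment", "POSITIVE")
point.add_label("ner", "color")

ner_values = [label.value for label in point.get_labels("ner")]
all_values = [label.value for label in point.get_labels()]
print(ner_values)
print(all_values)
```

Grouping by type up front makes the filtered lookup a single dictionary access, which is the design choice this toy is meant to illustrate.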
Information for each label
Each label is an instance of the Label class. Besides its value, it has a score indicating confidence and a pointer back to the data point to which it is attached.
This means that you can print the value, the confidence and the labeled text of each label:
sentence = Sentence('The grass is green.')
# add an NER tag to token 3 in the sentence
sentence[3].add_label('ner', 'color')
# iterate over all labels and print
for label in sentence.get_labels():
    # Print the text, the label value and the label score
    print(f'"{label.data_point.text}" is classified as "{label.value}" with score {label.score}')
This should print:
"green" is classified as "color" with score 1.0
Our color tag has a score of 1.0 since we manually added it. If a tag is predicted by our sequence labeler, the score value will indicate classifier confidence.
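Since predicted labels carry a confidence score, a common follow-up is to keep only the labels above some threshold. The sketch below shows the idea with a plain dataclass rather than Flair objects. ScoredLabel, the threshold of 0.5 and the scores themselves are all made up for illustration and do not come from any model.

```python
from dataclasses import dataclass


@dataclass
class ScoredLabel:
    """Hypothetical record mirroring the three fields printed above."""
    text: str
    value: str
    score: float


# Made-up predictions with invented scores, for illustration only
predictions = [
    ScoredLabel("green", "color", 0.97),
    ScoredLabel("grass", "color", 0.41),
]

THRESHOLD = 0.5  # arbitrary cut-off chosen for this sketch

# Keep only the labels whose confidence reaches the threshold
confident = [p for p in predictions if p.score >= THRESHOLD]
for p in confident:
    print(f'"{p.text}" is classified as "{p.value}" with score {p.score}')
```

With real Flair labels the same filter would read the value and score attributes shown in the example above.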