Tagging biomedical entities

In this tutorial, we show how to use our pre-trained HunFlair2 models to tag your text.

Tagging with pre-trained HunFlair2-models

Let's use the pre-trained HunFlair2 model for biomedical named entity recognition (NER). This model was trained on multiple biomedical NER data sets and can recognize 5 different entity types: cell lines, chemicals, diseases, genes / proteins, and species.

from flair.nn import Classifier

tagger = Classifier.load("biomed")

All you need to do is call the predict() method of the tagger on a sentence. This adds the predicted tags to the tokens in the sentence. Let's use a sentence with four named entities:

from flair.data import Sentence

sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome")

# predict NER tags
tagger.predict(sentence)

# print the predicted tags
for entity in sentence.get_labels():
    print(entity)

This should print:

Span[0:2]: "Behavioral abnormalities" → Disease (1.0)
Span[4:5]: "Fmr1" → Gene (1.0)
Span[6:7]: "Mouse" → Species (1.0)
Span[9:12]: "Fragile X Syndrome" → Disease (1.0)

The output indicates that two diseases are mentioned in the text ("Behavioral abnormalities" and "Fragile X Syndrome"), as well as one gene ("Fmr1") and one species ("Mouse"). For each entity, the output gives the text span of the mention in the sentence and a label consisting of a value and a score (the confidence of the prediction). You can also get additional information, such as the position offsets of each entity in the sentence, in a structured way by calling the to_dict() method:

print(sentence.to_dict())

This should print:

{
    'text': 'Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome',
    'labels': [],
    'entities': [
        {'text': 'Behavioral abnormalities', 'start_pos': 0, 'end_pos': 24, 'labels': [{'value': 'Disease', 'confidence': 0.9999860525131226}]},
        {'text': 'Fmr1', 'start_pos': 32, 'end_pos': 36, 'labels': [{'value': 'Gene', 'confidence': 0.9999895095825195}]},
        {'text': 'Mouse', 'start_pos': 41, 'end_pos': 46, 'labels': [{'value': 'Species', 'confidence': 0.9999873638153076}]},
        {'text': 'Fragile X Syndrome', 'start_pos': 56, 'end_pos': 74, 'labels': [{'value': 'Disease', 'confidence': 0.9999928871790568}]}
    ],
    # further sentence information
}
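
The dictionary form is convenient for post-processing. The following is a minimal sketch of filtering entities by type and confidence. It works on a plain dictionary copied from the to_dict() output above, so it runs without Flair installed. The helper name entities_of_type and its min_confidence parameter are hypothetical and not part of Flair.

```python
# Copied from the to_dict() output shown above (other keys omitted).
sentence_dict = {
    "text": "Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome",
    "entities": [
        {"text": "Behavioral abnormalities", "start_pos": 0, "end_pos": 24,
         "labels": [{"value": "Disease", "confidence": 0.9999860525131226}]},
        {"text": "Fmr1", "start_pos": 32, "end_pos": 36,
         "labels": [{"value": "Gene", "confidence": 0.9999895095825195}]},
        {"text": "Mouse", "start_pos": 41, "end_pos": 46,
         "labels": [{"value": "Species", "confidence": 0.9999873638153076}]},
        {"text": "Fragile X Syndrome", "start_pos": 56, "end_pos": 74,
         "labels": [{"value": "Disease", "confidence": 0.9999928871790568}]},
    ],
}


def entities_of_type(data, entity_type, min_confidence=0.0):
    """Hypothetical helper: return (text, start, end) for matching entities."""
    hits = []
    for entity in data["entities"]:
        for label in entity["labels"]:
            if label["value"] == entity_type and label["confidence"] >= min_confidence:
                hits.append((entity["text"], entity["start_pos"], entity["end_pos"]))
    return hits


diseases = entities_of_type(sentence_dict, "Disease", min_confidence=0.9)
for text, start, end in diseases:
    # start_pos / end_pos are character offsets into the original text
    print(text, "->", sentence_dict["text"][start:end])
```

Because start_pos and end_pos are character offsets, slicing the original text with them returns the mention itself, which is useful when mapping annotations back onto a source document.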

Using a biomedical tokenizer

Tokenization, i.e. separating a text into tokens / words, is an important issue in natural language processing in general and biomedical text mining in particular. So far, we used a tokenizer for general domain text. This can be unfavourable if applied to biomedical texts.
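
To illustrate why the choice of tokenizer matters, here is a small sketch comparing two deliberately naive strategies on a biomedical phrase. Neither is what Flair's default tokenizer or SciSpaCy actually does. The example only shows how differently the same text can be segmented, and the example phrase is made up for illustration.

```python
import re

text = "Mutation in the X-linked FMR1 gene (FXS)."

# Naive strategy 1: split on whitespace only.
# Punctuation stays glued to the neighbouring word.
whitespace_tokens = text.split()

# Naive strategy 2: treat every non-alphanumeric character as its own token.
# Hyphenated terms such as "X-linked" are broken into three pieces.
punct_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(punct_tokens)
```

The first strategy keeps "(FXS)." as a single token, while the second splits "X-linked" into "X", "-" and "linked". Since an NER model assigns tags per token, such differences directly affect which entity boundaries it can predict.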

HunFlair2 integrates SciSpaCy, a library specially designed to work with scientific text. To use the library we first have to install it and download one of its models:

pip install scispacy==0.5.1
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz

Then we can use the SciSpacyTokenizer. We just have to pass it as a parameter when instantiating a sentence:

from flair.tokenization import SciSpacyTokenizer

tokenizer = SciSpacyTokenizer()

sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome",
                    use_tokenizer=tokenizer)

Working with longer texts

Often, we are concerned with complete scientific abstracts or full-texts when performing biomedical text mining, e.g.

abstract = "Fragile X syndrome (FXS) is a developmental disorder caused by a mutation in the X-linked FMR1 gene, " \
"coding for the FMRP protein which is largely involved in synaptic function. FXS patients present several " \
"behavioral abnormalities, including hyperactivity, anxiety, sensory hyper-responsiveness, and cognitive " \
"deficits. Autistic symptoms, e.g., altered social interaction and communication, are also often observed: " \
"FXS is indeed the most common monogenic cause of autism."

To work with complete abstracts or full-text, we first have to split them into separate sentences. We can apply the SciSpacySentenceSplitter, an integration of the SciSpaCy library:

from flair.splitter import SciSpacySentenceSplitter

# initialize the sentence splitter
splitter = SciSpacySentenceSplitter()

# split text into a list of Sentence objects
sentences = splitter.split(abstract)

# you can apply the HunFlair2 tagger directly to this list
tagger.predict(sentences)

We can access the annotations of the individual sentences by iterating over the list:

for sentence in sentences:
    print(sentence.to_tagged_string())

This should print:

Sentence[35]: "Fragile X syndrome (FXS) is a developmental disorder caused by a mutation in the X-linked FMR1 gene, coding for the FMRP protein which is largely involved in synaptic function." \
→ ["Fragile X syndrome"/Disease, "FXS"/Disease, "developmental disorder"/Disease, "X-linked"/Gene, "FMR1"/Gene, "FMRP"/Gene]
Sentence[23]: "FXS patients present several behavioral abnormalities, including hyperactivity, anxiety, sensory hyper-responsiveness, and cognitive deficits." \
→ ["FXS"/Disease, "patients"/Species, "behavioral abnormalities"/Disease, "hyperactivity"/Disease, "anxiety"/Disease, "sensory hyper-responsiveness"/Disease, "cognitive deficits"/Disease]
Sentence[27]: "Autistic symptoms, e.g., altered social interaction and communication, are also often observed: FXS is indeed the most common monogenic cause of autism." \
→ ["Autistic symptoms"/Disease, "altered social interaction and communication"/Disease, "FXS"/Disease, "autism"/Disease]
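
For a whole abstract, it is often handy to collect the distinct mentions per entity type across all sentences. The following is a minimal sketch of that idea. The helper group_mentions_by_type is hypothetical and not part of Flair. It assumes each label exposes value and data_point.text, as the labels returned by get_labels() do for tagged spans. So that the snippet runs without downloading a model, it is demonstrated on small stand-in objects rather than real Sentence objects.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_mentions_by_type(sentences):
    """Hypothetical helper: map each entity type to its sorted distinct mentions."""
    grouped = defaultdict(set)
    for sentence in sentences:
        for label in sentence.get_labels():
            # Assumption: label.value is the entity type and
            # label.data_point.text is the mention text.
            grouped[label.value].add(label.data_point.text)
    return {entity_type: sorted(mentions) for entity_type, mentions in grouped.items()}


def fake_sentence(pairs):
    """Stand-in for a tagged sentence, built from (mention, type) pairs."""
    labels = [
        SimpleNamespace(value=entity_type, data_point=SimpleNamespace(text=mention))
        for mention, entity_type in pairs
    ]
    return SimpleNamespace(get_labels=lambda: labels)


demo_sentences = [
    fake_sentence([("FXS", "Disease"), ("FMR1", "Gene")]),
    fake_sentence([("FXS", "Disease"), ("autism", "Disease")]),
]

result = group_mentions_by_type(demo_sentences)
print(result)
```

With real data, the same function would be called as group_mentions_by_type(sentences) after tagger.predict(sentences).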