Tagging biomedical entities
In this tutorial, we show how to use our pre-trained HunFlair2 models to tag your text.
Tagging with pre-trained HunFlair2-models
Let's use the pre-trained HunFlair2 model for biomedical named entity recognition (NER). This model was trained on multiple biomedical NER datasets and can recognize five entity types: cell lines, chemicals, diseases, genes/proteins, and species.
from flair.nn import Classifier
tagger = Classifier.load("biomed")
All you need to do is call the predict() method of the tagger on a sentence. This will add predicted tags to the tokens in the sentence. Let's use a sentence with four named entities:
from flair.data import Sentence
sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome")
# predict NER tags
tagger.predict(sentence)
# print the predicted tags
for entity in sentence.get_labels():
    print(entity)
This should print:
Span[0:2]: "Behavioral abnormalities" → Disease (1.0)
Span[4:5]: "Fmr1" → Gene (1.0)
Span[6:7]: "Mouse" → Species (1.0)
Span[9:12]: "Fragile X Syndrome" → Disease (1.0)
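The Span[i:j] prefix on each line is a half-open range of token positions, counted from zero. As a rough check, the ranges can be mapped back to words. This sketch assumes that plain whitespace splitting reproduces the tokenization of this punctuation-free sentence, which would not hold for text containing punctuation:

```python
# Illustration only: whitespace splitting stands in for the real tokenizer,
# which is safe here because the sentence contains no punctuation.
text = "Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome"
tokens = text.split()

# (start, end) token ranges copied from the printed output above
spans = [(0, 2), (4, 5), (6, 7), (9, 12)]

# Slice the token list with each half-open range and re-join the words
mentions = [" ".join(tokens[start:end]) for start, end in spans]
for (start, end), mention in zip(spans, mentions):
    print(f"Span[{start}:{end}] -> {mention}")
```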
The output indicates that there are two diseases mentioned in the text ("Behavioral abnormalities" and
"Fragile X Syndrome") as well as one gene ("Fmr1") and one species ("Mouse"). For each entity, the
output gives the text span of the mention in the sentence and a label with a value and a score (the
confidence in the prediction). You can also get additional information, such as the character offsets
of each entity in the sentence, in a structured way by calling the to_dict() method:
print(sentence.to_dict())
This should print:
{
'text': 'Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome',
'labels': [],
'entities': [
{'text': 'Behavioral abnormalities', 'start_pos': 0, 'end_pos': 24, 'labels': [{'value': 'Disease', 'confidence': 0.9999860525131226}]},
{'text': 'Fmr1', 'start_pos': 32, 'end_pos': 36, 'labels': [{'value': 'Gene', 'confidence': 0.9999895095825195}]},
{'text': 'Mouse', 'start_pos': 41, 'end_pos': 46, 'labels': [{'value': 'Species', 'confidence': 0.9999873638153076}]},
{'text': 'Fragile X Syndrome', 'start_pos': 56, 'end_pos': 74, 'labels': [{'value': 'Disease', 'confidence': 0.9999928871790568}]}
],
# further sentence information
}
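Because to_dict() returns plain Python data, the result can be post-processed without any Flair-specific code. The following sketch works on a hand-copied version of the dictionary shown above (confidence values rounded) rather than on a live prediction. The helper name entities_of_type and the default threshold of 0.5 are illustrative choices, not part of Flair. It first checks that start_pos and end_pos are character offsets into the sentence text and then filters entities by type:

```python
# Hand-copied from the to_dict() output above; confidences are rounded.
result = {
    "text": "Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome",
    "entities": [
        {"text": "Behavioral abnormalities", "start_pos": 0, "end_pos": 24,
         "labels": [{"value": "Disease", "confidence": 0.99999}]},
        {"text": "Fmr1", "start_pos": 32, "end_pos": 36,
         "labels": [{"value": "Gene", "confidence": 0.99999}]},
        {"text": "Mouse", "start_pos": 41, "end_pos": 46,
         "labels": [{"value": "Species", "confidence": 0.99999}]},
        {"text": "Fragile X Syndrome", "start_pos": 56, "end_pos": 74,
         "labels": [{"value": "Disease", "confidence": 0.99999}]},
    ],
}


def entities_of_type(result, entity_type, min_confidence=0.5):
    """Return the text of every entity labeled entity_type with enough confidence."""
    return [
        entity["text"]
        for entity in result["entities"]
        for label in entity["labels"]
        if label["value"] == entity_type and label["confidence"] >= min_confidence
    ]


# start_pos / end_pos slice the original text (end_pos is exclusive)
for entity in result["entities"]:
    assert result["text"][entity["start_pos"]:entity["end_pos"]] == entity["text"]

diseases = entities_of_type(result, "Disease")
print(diseases)  # ['Behavioral abnormalities', 'Fragile X Syndrome']
```

With a live prediction, the same helper could be called as entities_of_type(sentence.to_dict(), "Disease").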
Using a biomedical tokenizer
Tokenization, i.e. separating a text into tokens / words, is an important issue in natural language processing in general and biomedical text mining in particular. So far, we used a tokenizer for general domain text. This can be unfavourable if applied to biomedical texts.
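To make the effect of tokenization concrete, the sketch below applies two deliberately naive tokenizers to a fragment of the abstract used later in this tutorial. Neither is the tokenizer used by Flair or SciSpaCy. They only illustrate that different rules yield different token boundaries, which in turn changes the units a tagger can label:

```python
import re

# Fragment of the example abstract from the section on longer texts
text = "a mutation in the X-linked FMR1 gene, coding for the FMRP protein"

# Naive tokenizer 1: split on whitespace only.
# Punctuation stays glued to the neighbouring word ("gene,").
whitespace_tokens = text.split()

# Naive tokenizer 2: every non-alphanumeric character becomes its own token.
# Hyphenated terms such as "X-linked" are broken into three tokens.
punct_tokens = re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", text)

print(whitespace_tokens)
print(punct_tokens)
```

The first variant yields 12 tokens and keeps "X-linked" intact but produces "gene,". The second yields 15 tokens and separates the comma but splits "X-linked" into "X", "-" and "linked".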
HunFlair2 integrates SciSpaCy, a library specially designed to work with scientific text. To use the library we first have to install it and download one of its models:
pip install scispacy==0.5.1
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz
Then we can use the SciSpacyTokenizer. We just have to pass it as a parameter when instantiating a sentence:
from flair.tokenization import SciSpacyTokenizer
tokenizer = SciSpacyTokenizer()
sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome",
                    use_tokenizer=tokenizer)
Working with longer texts
Often, we are concerned with complete scientific abstracts or full texts when performing biomedical text mining, e.g.:
abstract = "Fragile X syndrome (FXS) is a developmental disorder caused by a mutation in the X-linked FMR1 gene, " \
"coding for the FMRP protein which is largely involved in synaptic function. FXS patients present several " \
"behavioral abnormalities, including hyperactivity, anxiety, sensory hyper-responsiveness, and cognitive " \
"deficits. Autistic symptoms, e.g., altered social interaction and communication, are also often observed: " \
"FXS is indeed the most common monogenic cause of autism."
To work with complete abstracts or full texts, we first have to split them into separate sentences.
We can apply the SciSpacySentenceSplitter, an integration of the SciSpaCy library:
from flair.splitter import SciSpacySentenceSplitter
# initialize the sentence splitter
splitter = SciSpacySentenceSplitter()
# split text into a list of Sentence objects
sentences = splitter.split(abstract)
# you can apply the HunFlair2 tagger directly to this list
tagger.predict(sentences)
We can access the annotations of the individual sentences by iterating over the list:
for sentence in sentences:
    print(sentence.to_tagged_string())
This should print:
Sentence[35]: "Fragile X syndrome (FXS) is a developmental disorder caused by a mutation in the X-linked FMR1 gene, coding for the FMRP protein which is largely involved in synaptic function." \
→ ["Fragile X syndrome"/Disease, "FXS"/Disease, "developmental disorder"/Disease, "X-linked"/Gene, "FMR1"/Gene, "FMRP"/Gene]
Sentence[23]: "FXS patients present several behavioral abnormalities, including hyperactivity, anxiety, sensory hyper-responsiveness, and cognitive deficits." \
→ ["FXS"/Disease, "patients"/Species, "behavioral abnormalities"/Disease, "hyperactivity"/Disease, "anxiety"/Disease, "sensory hyper-responsiveness"/Disease, "cognitive deficits"/Disease]
Sentence[27]: "Autistic symptoms, e.g., altered social interaction and communication, are also often observed: FXS is indeed the most common monogenic cause of autism." \
→ ["Autistic symptoms"/Disease, "altered social interaction and communication"/Disease, "FXS"/Disease, "autism"/Disease]
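When tagging whole abstracts, it is often useful to aggregate mentions across sentences. The sketch below counts how often each (mention, type) pair occurs. It assumes one dictionary per sentence in the shape returned by to_dict(), for example [s.to_dict() for s in sentences]. To stay runnable without a model, the input here is abridged by hand from the output above (entity text and type only, no offsets or confidences). The names count_mentions and _entity are hypothetical helpers, not part of Flair:

```python
from collections import Counter


def _entity(text, value):
    """Build an abridged entity dict in the shape used by to_dict()."""
    return {"text": text, "labels": [{"value": value}]}


# Abridged by hand from the tagged output above: one dict per sentence.
sentence_dicts = [
    {"entities": [
        _entity("Fragile X syndrome", "Disease"),
        _entity("FXS", "Disease"),
        _entity("developmental disorder", "Disease"),
        _entity("X-linked", "Gene"),
        _entity("FMR1", "Gene"),
        _entity("FMRP", "Gene"),
    ]},
    {"entities": [
        _entity("FXS", "Disease"),
        _entity("patients", "Species"),
        _entity("behavioral abnormalities", "Disease"),
        _entity("hyperactivity", "Disease"),
        _entity("anxiety", "Disease"),
        _entity("sensory hyper-responsiveness", "Disease"),
        _entity("cognitive deficits", "Disease"),
    ]},
    {"entities": [
        _entity("Autistic symptoms", "Disease"),
        _entity("altered social interaction and communication", "Disease"),
        _entity("FXS", "Disease"),
        _entity("autism", "Disease"),
    ]},
]


def count_mentions(sentence_dicts):
    """Count occurrences of each (mention text, entity type) pair."""
    counts = Counter()
    for sentence in sentence_dicts:
        for entity in sentence["entities"]:
            for label in entity["labels"]:
                counts[(entity["text"], label["value"])] += 1
    return counts


counts = count_mentions(sentence_dicts)
print(counts.most_common(1))  # [(('FXS', 'Disease'), 3)]
```

Keying the counter on the pair rather than on the text alone keeps mentions apart if the same string is tagged with different types in different sentences.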