Linking biomedical entities

Entity linking goes one step beyond identifying biomedical entity names in text: it resolves each name to its unique identifier in a knowledge base.

Documents from different biomedical (sub-)fields may use different terms to refer to the exact same concept, e.g., “tumor protein p53”, “tumor suppressor p53”, “TRP53” are all valid names for the gene “TP53” (NCBI Gene:7157).
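To make the idea concrete, here is a toy sketch of what linking does conceptually. The lookup table and the `link_mention` helper are hypothetical and hand-written for illustration only; the actual linking models match mentions against a full ontology rather than a fixed dictionary.

```python
from typing import Optional

# Hypothetical, hand-written synonym table (illustration only).
# All names below refer to the same gene, as in the example above.
SYNONYM_TO_ID = {
    "tp53": "NCBI Gene:7157",
    "tumor protein p53": "NCBI Gene:7157",
    "tumor suppressor p53": "NCBI Gene:7157",
    "trp53": "NCBI Gene:7157",
}


def link_mention(mention: str) -> Optional[str]:
    """Return the identifier for a mention, or None if it is not in the table."""
    return SYNONYM_TO_ID.get(mention.strip().lower())


for name in ["tumor protein p53", "Tumor suppressor p53", "TRP53"]:
    # Every surface form resolves to the same identifier.
    print(name, "->", link_mention(name))
```

An exact-match table like this breaks as soon as a mention is spelled slightly differently, which is one reason to use trained linking models instead.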

Linking with pre-trained HunFlair2 models

After adding named entity recognition tags to your sentence, you can link the entities to standard ontologies using distinct, type-specific linking models:

from flair.models import EntityMentionLinker
from flair.nn import Classifier
from flair.data import Sentence

sentence = Sentence(
    "The mutation in the ABCD1 gene causes X-linked adrenoleukodystrophy, "
    "a neurodegenerative disease, which is exacerbated by exposure to high "
    "levels of mercury in mouse populations."
)

# Tag named entities in the text
ner_tagger = Classifier.load("hunflair2")
ner_tagger.predict(sentence)

# Load disease linker and perform disease linking
disease_linker = EntityMentionLinker.load("disease-linker")
disease_linker.predict(sentence)

# Load gene linker and perform gene linking
gene_linker = EntityMentionLinker.load("gene-linker")
gene_linker.predict(sentence)

# Load chemical linker and perform chemical linking
chemical_linker = EntityMentionLinker.load("chemical-linker")
chemical_linker.predict(sentence)

# Load species linker and perform species linking
species_linker = EntityMentionLinker.load("species-linker")
species_linker.predict(sentence)

The ontologies and knowledge bases used are pre-processed the first time linking (normalisation) is executed, which might take some time. All further calls reuse this pre-processing and run much faster.

After running the code we can inspect and output the linked entities via:

for tag in sentence.get_labels("link"):
    print(tag)

This should print:

Span[4:5]: "ABCD1" → 215/name=ABCD1 (210.89810180664062)
Span[7:9]: "X-linked adrenoleukodystrophy" → MESH:D000326/name=Adrenoleukodystrophy (195.30780029296875)
Span[11:13]: "neurodegenerative disease" → MESH:D019636/name=Neurodegenerative Diseases (201.1804962158203)
Span[23:24]: "mercury" → MESH:D008628/name=Mercury (220.39199829101562)
Span[25:26]: "mouse" → 10090/name=Mus musculus (213.6201934814453)

For each entity, the output contains the NER mention annotation and the ontology identifier to which the mention was mapped. It also gives the official name of the entity in the ontology and the similarity score between the entity mention and the ontology concept. For instance, the official name for the disease "X-linked adrenoleukodystrophy" is "Adrenoleukodystrophy". The similarity scores are specific to the entity type, ontology, and linking model used, so they can only be compared with scores produced by the exact same setup.
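If you only have the printed lines (for example, from a log file), the fields described above can be pulled apart with a small parser. The following is a minimal sketch that assumes exactly the output format shown in this section; `LinkedMention`, `LINE_PATTERN`, and `parse_link_line` are hypothetical helper names and not part of Flair.

```python
import re
from typing import NamedTuple, Optional


class LinkedMention(NamedTuple):
    """One parsed output line (hypothetical helper type, not a Flair class)."""
    start: int        # token offset where the mention starts
    end: int          # token offset where the mention ends (exclusive)
    mention: str      # text of the NER mention
    identifier: str   # ontology identifier the mention was mapped to
    name: str         # official name of the concept in the ontology
    score: float      # similarity score between mention and concept


# Assumes lines of the form:
#   Span[7:9]: "mention text" → IDENTIFIER/name=Official Name (score)
LINE_PATTERN = re.compile(
    r'^Span\[(\d+):(\d+)\]: "(.+)" → (\S+?)/name=(.+) \(([-+0-9.eE]+)\)$'
)


def parse_link_line(line: str) -> Optional[LinkedMention]:
    """Parse one printed link line; return None if the line does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    start, end, mention, identifier, name, score = match.groups()
    return LinkedMention(int(start), int(end), mention, identifier, name, float(score))


line = (
    'Span[7:9]: "X-linked adrenoleukodystrophy" → '
    "MESH:D000326/name=Adrenoleukodystrophy (195.30780029296875)"
)
record = parse_link_line(line)
print(record.mention, "|", record.identifier, "|", record.name, "|", record.score)
```

Because similarity scores are only comparable within the same setup, any threshold applied to `score` should be chosen separately for each linker.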