# HunFlair2 - Tutorial 1: Tagging

This is part 1 of the tutorial, in which we show how to use our pre-trained *HunFlair2* models to tag your text.

## Tagging with Pre-trained HunFlair2-Models
Let's use the pre-trained *HunFlair2* model for biomedical named entity recognition (NER).
This model was trained on multiple biomedical NER data sets and can recognize five entity types:
cell lines, chemicals, diseases, genes/proteins and species.
```python
from flair.nn import Classifier

tagger = Classifier.load("hunflair2")
```
All you need to do is call the `predict()` method of the tagger on a sentence.
This will add predicted tags to the tokens in the sentence.
Let's use a sentence with four named entities:
```python
from flair.data import Sentence

sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome")

# predict NER tags
tagger.predict(sentence)

# print the predicted tags
for entity in sentence.get_labels():
    print(entity)
```
This should print:
```console
Span[0:2]: "Behavioral abnormalities" → Disease (1.0)
Span[4:5]: "Fmr1" → Gene (1.0)
Span[6:7]: "Mouse" → Species (1.0)
Span[9:12]: "Fragile X Syndrome" → Disease (1.0)
```
The output indicates that there are two diseases mentioned in the text ("_Behavioral abnormalities_" and 
"_Fragile X Syndrome_") as well as one gene ("_Fmr1_") and one species ("_Mouse_"). For each entity, the
output gives the token span and the text of the mention, followed by a label consisting of a value (the
entity type) and a score (the confidence in the prediction). You can also get additional information,
such as the character offsets of each entity in the sentence, in a structured way by calling the `to_dict()` method:

```python
print(sentence.to_dict())
```
This should print:
```python
{
    'text': 'Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome', 
    'labels': [], 
    'entities': [
        {'text': 'Behavioral abnormalities', 'start_pos': 0, 'end_pos': 24, 'labels': [{'value': 'Disease', 'confidence': 0.9999860525131226}]}, 
        {'text': 'Fmr1', 'start_pos': 32, 'end_pos': 36, 'labels': [{'value': 'Gene', 'confidence': 0.9999895095825195}]}, 
        {'text': 'Mouse', 'start_pos': 41, 'end_pos': 46, 'labels': [{'value': 'Species', 'confidence': 0.9999873638153076}]}, 
        {'text': 'Fragile X Syndrome', 'start_pos': 56, 'end_pos': 74, 'labels': [{'value': 'Disease', 'confidence': 0.9999928871790568}]}
      ],
    # further sentence information
}
```
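Because `to_dict()` returns plain Python data, the result can be post-processed without further Flair calls. The following is a minimal sketch, assuming the dictionary has the shape printed above. The helper `group_entities` and its `min_confidence` threshold are illustrative names and not part of Flair; the dictionary is copied by hand from the output so the snippet runs on its own.

```python
from collections import defaultdict

# Hand-copied from the output above; in practice use: result = sentence.to_dict()
result = {
    "text": "Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome",
    "entities": [
        {"text": "Behavioral abnormalities", "start_pos": 0, "end_pos": 24,
         "labels": [{"value": "Disease", "confidence": 0.9999860525131226}]},
        {"text": "Fmr1", "start_pos": 32, "end_pos": 36,
         "labels": [{"value": "Gene", "confidence": 0.9999895095825195}]},
        {"text": "Mouse", "start_pos": 41, "end_pos": 46,
         "labels": [{"value": "Species", "confidence": 0.9999873638153076}]},
        {"text": "Fragile X Syndrome", "start_pos": 56, "end_pos": 74,
         "labels": [{"value": "Disease", "confidence": 0.9999928871790568}]},
    ],
}


def group_entities(result, min_confidence=0.5):
    """Group entity mentions by predicted type (hypothetical helper, not a Flair API)."""
    grouped = defaultdict(list)
    for entity in result["entities"]:
        # start_pos / end_pos are character offsets into the sentence text
        mention = result["text"][entity["start_pos"]:entity["end_pos"]]
        for label in entity["labels"]:
            if label["confidence"] >= min_confidence:
                grouped[label["value"]].append(mention)
    return dict(grouped)


grouped = group_entities(result)
print(grouped)
```

Slicing the text with `start_pos` and `end_pos` recovers each mention, which is a quick way to check that the offsets line up with your original string.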

## Using a Biomedical Tokenizer
Tokenization, i.e. separating a text into tokens / words, is an important issue in natural language processing
in general and biomedical text mining in particular. So far, we have used a tokenizer designed for general-domain text.
Such a tokenizer can be a poor fit for biomedical text.
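
To get a feeling for the problem, consider a deliberately simplistic tokenizer. The sketch below is a toy stand-in: it is neither Flair's default tokenizer nor SciSpaCy's, and the example phrase is made up for illustration. It treats every punctuation mark as its own token.

```python
import re


def naive_tokenize(text):
    """Toy tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# Hypothetical example phrase, not taken from the tutorial's data
tokens = naive_tokenize("1,2-dichloroethane exposure")
print(tokens)
# ['1', ',', '2', '-', 'dichloroethane', 'exposure']
```

The chemical name ends up scattered over five tokens, so a tagger working on this output would have to reassemble the mention from fragments. A tokenizer built for scientific text aims to avoid this kind of fragmentation.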

*HunFlair2* integrates [SciSpaCy](https://allenai.github.io/scispacy/), a library specifically designed to work with scientific text.
To use the library, we first have to install it and download one of its models:
~~~
pip install scispacy==0.5.1
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz
~~~

Then we can use the [`SciSpacyTokenizer`](#flair.tokenization.SciSpacyTokenizer) by passing it as the `use_tokenizer` parameter when creating a sentence:
```python
from flair.tokenization import SciSpacyTokenizer

tokenizer = SciSpacyTokenizer()

sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome",
                    use_tokenizer=tokenizer)
```

## Working with longer Texts
Biomedical text mining often involves complete scientific abstracts or full texts, for example:
```python
abstract = "Fragile X syndrome (FXS) is a developmental disorder caused by a mutation in the X-linked FMR1 gene, " \
           "coding for the FMRP protein which is largely involved in synaptic function. FXS patients present several " \
           "behavioral abnormalities, including hyperactivity, anxiety, sensory hyper-responsiveness, and cognitive " \
           "deficits. Autistic symptoms, e.g., altered social interaction and communication, are also often observed: " \
           "FXS is indeed the most common monogenic cause of autism."
```

To work with complete abstracts or full texts, we first have to split them into separate sentences.
For this, we can apply the [`SciSpacySentenceSplitter`](#flair.splitter.SciSpacySentenceSplitter), an integration of the [SciSpaCy](https://allenai.github.io/scispacy/) library:
```python
from flair.splitter import SciSpacySentenceSplitter

# initialize the sentence splitter
splitter = SciSpacySentenceSplitter()

# split text into a list of Sentence objects
sentences = splitter.split(abstract)

# you can apply the HunFlair2 tagger directly to this list
tagger.predict(sentences)
```
We can access the annotations of the individual sentences by iterating over the list:
```python
for sentence in sentences:
    print(sentence.to_tagged_string())
```
This should print:
~~~
Sentence[35]: "Fragile X syndrome (FXS) is a developmental disorder caused by a mutation in the X-linked FMR1 gene, coding for the FMRP protein which is largely involved in synaptic function." \
              → ["Fragile X syndrome"/Disease, "FXS"/Disease, "developmental disorder"/Disease, "X-linked"/Gene, "FMR1"/Gene, "FMRP"/Gene]
Sentence[23]: "FXS patients present several behavioral abnormalities, including hyperactivity, anxiety, sensory hyper-responsiveness, and cognitive deficits." \
              → ["FXS"/Disease, "patients"/Species, "behavioral abnormalities"/Disease, "hyperactivity"/Disease, "anxiety"/Disease, "sensory hyper-responsiveness"/Disease, "cognitive deficits"/Disease]
Sentence[27]: "Autistic symptoms, e.g., altered social interaction and communication, are also often observed: FXS is indeed the most common monogenic cause of autism." \
              → ["Autistic symptoms"/Disease, "altered social interaction and communication"/Disease, "FXS"/Disease, "autism"/Disease]
~~~
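
For longer texts, it is often useful to summarize the predictions over all sentences. The following is a minimal sketch that counts entity types and repeated mentions. To keep it runnable without loading a model, the (mention, type) pairs are copied by hand from the output above. In a real pipeline they would be collected from the tagged sentences, for example via `sentence.get_labels()`; reading the mention from `label.data_point.text` and the type from `label.value` is an assumption about the label objects that you should check against your Flair version.

```python
from collections import Counter

# (mention, type) pairs per sentence, hand-copied from the tagged output above
tagged_sentences = [
    [("Fragile X syndrome", "Disease"), ("FXS", "Disease"),
     ("developmental disorder", "Disease"), ("X-linked", "Gene"),
     ("FMR1", "Gene"), ("FMRP", "Gene")],
    [("FXS", "Disease"), ("patients", "Species"),
     ("behavioral abnormalities", "Disease"), ("hyperactivity", "Disease"),
     ("anxiety", "Disease"), ("sensory hyper-responsiveness", "Disease"),
     ("cognitive deficits", "Disease")],
    [("Autistic symptoms", "Disease"),
     ("altered social interaction and communication", "Disease"),
     ("FXS", "Disease"), ("autism", "Disease")],
]

# Flatten the per-sentence lists into one list of pairs
all_pairs = [pair for sentence in tagged_sentences for pair in sentence]

# Number of mentions per entity type across the whole abstract
type_counts = Counter(entity_type for _, entity_type in all_pairs)

# Number of times each (mention, type) pair occurs
mention_counts = Counter(all_pairs)

print(type_counts)
print(mention_counts.most_common(1))
```

For this abstract, the counts come to 13 disease mentions, 3 gene mentions and 1 species mention, with "FXS" being the only mention that occurs more than once.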