# Train a span classifier

Span classification models are used for problems such as entity linking, where you have already extracted some
relevant spans within the {term}`Sentence` and want to predict more fine-grained labels for them.

This tutorial section shows you how to train such models using the [Span Classifier](#flair.models.SpanClassifier) in Flair.

## Training an entity linker (NEL) model with transformers

For a state-of-the-art NEL system, you should fine-tune transformer embeddings and use full document context
(see our [FLERT](https://arxiv.org/abs/2011.06993) paper for details).

Use the following script:

```python
from flair.datasets import ZELDA
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SpanClassifier
from flair.models.entity_linker_model import CandidateGenerator
from flair.trainers import ModelTrainer
from flair.nn.decoder import PrototypicalDecoder


# 1. get the corpus
corpus = ZELDA()
print(corpus)

# 2. what label do we want to predict?
label_type = 'nel'

# 3. make the label dictionary from the corpus
label_dict = corpus.make_label_dictionary(label_type=label_type, add_unk=True)
print(label_dict)

# 4. initialize fine-tuneable transformer embeddings WITH document context
embeddings = TransformerWordEmbeddings(
    model="bert-base-uncased",
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=True,
)

# 5. initialize the span classifier with a prototypical decoder and a candidate generator
tagger = SpanClassifier(
    embeddings=embeddings,
    label_dictionary=label_dict,
    label_type=label_type,
    decoder=PrototypicalDecoder(
        num_prototypes=len(label_dict),
        embeddings_size=embeddings.embedding_length * 2, # we use "first_last" encoding for spans
        distance_function="dot_product",
    ),
    candidates=CandidateGenerator("zelda"),
)

# 6. initialize trainer
trainer = ModelTrainer(tagger, corpus)

# 7. run fine-tuning
trainer.fine_tune(
    "resources/taggers/zelda-nel",
    learning_rate=5.0e-6,
    mini_batch_size=4,
    mini_batch_chunk_size=1,  # remove this parameter to speed up computation if you have a big GPU
)
```

As you can see, we use [`TransformerWordEmbeddings`](#flair.embeddings.token.TransformerWordEmbeddings) based on [bert-base-uncased](https://huggingface.co/bert-base-uncased) embeddings. We enable fine-tuning and set `use_context` to `True`.
We use [Prototypical Networks](https://arxiv.org/abs/1703.05175) to generalize better in the few-shot classification setting.
Also, we set a `CandidateGenerator` in the [`SpanClassifier`](#flair.models.SpanClassifier).
This way we limit the classification to a small set of candidates that are chosen depending on the text of the respective span.
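
To make the interplay of span encoding, prototypes and candidates more concrete, here is a minimal plain-Python sketch. It is a conceptual illustration only, not Flair's implementation: the helper names (`first_last_encoding`, `link_span`), the toy vectors and the candidate table are all made up for this example, and returning `None` for a mention without candidates is an assumption of the sketch.

```python
from typing import Dict, List, Optional, Sequence, Set

Vector = List[float]


def first_last_encoding(token_vectors: Sequence[Vector]) -> Vector:
    """Concatenate the first and last token vector of a span.

    This is why the decoder input is twice the embedding length.
    """
    return list(token_vectors[0]) + list(token_vectors[-1])


def dot(a: Vector, b: Vector) -> float:
    """Dot product, used here as the similarity between span and prototype."""
    return sum(x * y for x, y in zip(a, b))


def link_span(
    span_vector: Vector,
    prototypes: Dict[str, Vector],
    candidates: Set[str],
) -> Optional[str]:
    """Return the best-scoring label among the candidates only."""
    allowed = [label for label in prototypes if label in candidates]
    if not allowed:
        # assumption of this sketch: no candidates means no prediction
        return None
    return max(allowed, key=lambda label: dot(span_vector, prototypes[label]))


# toy 2-dimensional token vector for the single-token span "Washington"
span_vector = first_last_encoding([[1.0, 0.0]])

# one (hypothetical) prototype vector per entity label
prototypes = {
    "George_Washington": [0.9, 0.1, 0.9, 0.1],
    "Washington_D_C": [0.5, 0.5, 0.5, 0.5],
    "Sam_Houston": [2.0, 2.0, 2.0, 2.0],
}

# hypothetical candidate table: mention text -> plausible entities
candidate_table = {"washington": {"George_Washington", "Washington_D_C"}}

best = link_span(span_vector, prototypes, candidate_table.get("washington", set()))
unrestricted = link_span(span_vector, prototypes, set(prototypes))
print(best, unrestricted)
```

In this toy setup, `Sam_Houston` has the highest raw dot product, but it is never considered for the mention "Washington" because it is not among the candidates.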

## Loading a ColumnCorpus

If you want to train on a custom named entity linking dataset, you can load it with the [`ColumnCorpus`](#flair.datasets.sequence_labeling.ColumnCorpus) object.
Most sequence labeling datasets in NLP use some sort of column format in which each line is a word and each column is
one level of linguistic annotation. See for instance this sentence:

```console
George B-George_Washington
Washington I-George_Washington
went O
to O
Washington B-Washington_D_C

Sam B-Sam_Houston
Houston I-Sam_Houston
stayed O
home O
```

The first column is the word itself; the second contains BIO-annotated tags that specify the spans to be classified. To read such a
dataset, define the column structure as a dictionary and instantiate a [`ColumnCorpus`](#flair.datasets.sequence_labeling.ColumnCorpus).

```python
from flair.data import Corpus
from flair.datasets import ColumnCorpus

# define columns
columns = {0: "text", 1: "nel"}

# this is the folder in which train, test and dev files reside
data_folder = '/path/to/data/folder'

# init a corpus using the column format and the data folder
# (file names can be passed explicitly via train_file, dev_file and test_file)
corpus: Corpus = ColumnCorpus(data_folder, columns)
```
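
To see which spans the BIO tags in the example sentence describe, the following plain-Python sketch decodes a tag sequence into `(start_token, end_token, label)` tuples with inclusive token indices. The helper `bio_to_spans` is hypothetical and written only for this illustration; it is not part of Flair, which handles this decoding for you when reading the corpus.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Decode BIO tags into (start, end, label) tuples with inclusive indices."""
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        # an I- tag with the same label continues the currently open span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i - 1, label))
            start, label = None, None
        # a B- tag (or a stray I- tag) opens a new span
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags) - 1, label))
    return spans


sample = """George B-George_Washington
Washington I-George_Washington
went O
to O
Washington B-Washington_D_C"""

# split every line into its two columns: word and tag
rows = [line.split() for line in sample.splitlines()]
tokens = [row[0] for row in rows]
tags = [row[1] for row in rows]

spans = bio_to_spans(tags)
for start, end, label in spans:
    print(" ".join(tokens[start:end + 1]), "->", label)
```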

## Constructing a dataset in memory

If you have a pipeline where you need to construct your dataset from a different data source,
you can always construct a [Corpus](#flair.data.Corpus) with [FlairDatapointDataset](#flair.datasets.base.FlairDatapointDataset) by hand.
Let's assume you create a function `create_sentence(datapoint) -> Sentence` that looks somewhat like this:
```python
from flair.data import Sentence

def create_sentence(datapoint) -> Sentence:
    tokens = ...  # calculate the tokens from your internal data structure (e.g. pandas dataframe or json dictionary)
    spans = ...  # create a list of tuples (start_token, end_token, label) from your data structure
    sentence = Sentence(tokens)
    for (start, end, label) in spans:
        sentence[start:end+1].add_label("nel", label)  # end is inclusive, hence the +1
    return sentence
```
Then you can use this function to create a full dataset:
```python
from flair.data import Corpus
from flair.datasets import FlairDatapointDataset

def construct_corpus(data):
    return Corpus(
        train=FlairDatapointDataset([create_sentence(datapoint) for datapoint in data["train"]]),
        dev=FlairDatapointDataset([create_sentence(datapoint) for datapoint in data["dev"]]),
        test=FlairDatapointDataset([create_sentence(datapoint) for datapoint in data["test"]]),
    )
```
And use this to construct a corpus instead of loading a dataset.
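
How you compute `tokens` and `spans` depends entirely on your data source. As one possible illustration, the sketch below assumes a made-up record format in which mentions are given as character offsets (`start` inclusive, `end` exclusive) and the text is tokenized on whitespace. Both the record layout and the helper names are assumptions for this example, not something Flair prescribes.

```python
from typing import Dict, List, Tuple


def tokenize_with_offsets(text: str) -> List[Tuple[str, int, int]]:
    """Whitespace-tokenize and keep (token, start_char, end_char) per token."""
    tokens = []
    position = 0
    for word in text.split():
        start = text.index(word, position)
        end = start + len(word)
        tokens.append((word, start, end))
        position = end
    return tokens


def char_to_token_spans(
    text: str, mentions: List[Dict]
) -> Tuple[List[str], List[Tuple[int, int, str]]]:
    """Map character-level mentions to inclusive (start_token, end_token, label)."""
    tokens = tokenize_with_offsets(text)
    spans = []
    for mention in mentions:
        # indices of all tokens that overlap the mention's character range
        covered = [
            i
            for i, (_, tok_start, tok_end) in enumerate(tokens)
            if tok_start < mention["end"] and tok_end > mention["start"]
        ]
        if covered:
            spans.append((covered[0], covered[-1], mention["label"]))
    return [token for token, _, _ in tokens], spans


# hypothetical record, e.g. one row of a JSON export
record = {
    "text": "Sam Houston stayed home",
    "mentions": [{"start": 0, "end": 11, "label": "Sam_Houston"}],
}

tokens, spans = char_to_token_spans(record["text"], record["mentions"])
print(tokens, spans)
```

The resulting `tokens` and `spans` have the shape that `create_sentence` expects, with inclusive end indices.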


## Combining NEL with Mention Detection

Often, you don't want to use a Named Entity Linking model on its own, but combine it with a Mention Detection or Named Entity Recognition model.
For this, you can use a [Multitask Model](#flair.models.MultitaskModel) to combine a [SequenceTagger](#flair.models.SequenceTagger) and a [Span Classifier](#flair.models.SpanClassifier).

```python
from flair.datasets import NER_MULTI_WIKINER, ZELDA
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger, SpanClassifier
from flair.models.entity_linker_model import CandidateGenerator
from flair.trainers import ModelTrainer
from flair.nn import PrototypicalDecoder
from flair.nn.multitask import make_multitask_model_and_corpus

# 1. get the corpus
ner_corpus = NER_MULTI_WIKINER()
nel_corpus = ZELDA(column_format={0: "text", 2: "nel"})  # need to set the label type to be the same as the ner one

# --- Embeddings that are shared by both models --- #
shared_embeddings = TransformerWordEmbeddings("distilbert-base-uncased", fine_tune=True)

ner_label_dict = ner_corpus.make_label_dictionary("ner", add_unk=False)

ner_model = SequenceTagger(
    embeddings=shared_embeddings,
    tag_dictionary=ner_label_dict,
    tag_type="ner",
    use_rnn=False,
    use_crf=False,
    reproject_embeddings=False,
)


nel_label_dict = nel_corpus.make_label_dictionary("nel", add_unk=True)

nel_model = SpanClassifier(
    embeddings=shared_embeddings,
    label_dictionary=nel_label_dict,
    label_type="nel",
    span_label_type="ner",
    decoder=PrototypicalDecoder(
        num_prototypes=len(nel_label_dict),
        embeddings_size=shared_embeddings.embedding_length * 2, # we use "first_last" encoding for spans
        distance_function="dot_product",
    ),
    candidates=CandidateGenerator("zelda"),
)


# -- Define mapping (which tagger should train on which model) -- #
multitask_model, multicorpus = make_multitask_model_and_corpus(
    [
        (ner_model, ner_corpus),
        (nel_model, nel_corpus),
    ]
)

# -- Create model trainer and train -- #
trainer = ModelTrainer(multitask_model, multicorpus)
trainer.fine_tune("resources/taggers/zelda_with_mention")
```

Here, the [make_multitask_model_and_corpus](#flair.nn.multitask.make_multitask_model_and_corpus) method creates a multitask model and a multicorpus in which each sub-model is aligned with its sub-corpus.
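
Conceptually, this alignment works by tagging every sentence with a task ID and sending it only to the model registered under that ID. The following plain-Python sketch illustrates the idea with dictionaries instead of Flair objects. It is not Flair's implementation: the helpers `assign_task_ids` and `route_batch` are hypothetical, and the `Task_<index>` naming is taken from the task IDs used in this tutorial.

```python
from collections import defaultdict
from typing import Dict, List


def assign_task_ids(corpora: List[List[str]]) -> List[Dict]:
    """Tag every sentence with the ID of the corpus (and thus model) it belongs to."""
    tagged = []
    for index, sentences in enumerate(corpora):
        task_id = f"Task_{index}"
        for text in sentences:
            tagged.append({"text": text, "multitask_ids": [task_id]})
    return tagged


def route_batch(batch: List[Dict]) -> Dict[str, List[str]]:
    """Group the sentences of a mixed batch by the task(s) they are tagged with."""
    per_task: Dict[str, List[str]] = defaultdict(list)
    for sentence in batch:
        for task_id in sentence["multitask_ids"]:
            per_task[task_id].append(sentence["text"])
    return dict(per_task)


# toy stand-ins for the NER corpus (index 0) and the NEL corpus (index 1)
ner_sentences = ["George Washington went to Washington", "Sam Houston stayed home"]
nel_sentences = ["Houston is in Texas"]

multicorpus = assign_task_ids([ner_sentences, nel_sentences])
routed = route_batch(multicorpus)
print(routed)
```

Each sub-model then only computes its loss on the sentences routed to its own task ID.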

### Multitask with aligned training data

If you have sentences with annotations for both NER and NEL, you might want to use a single corpus for both models.

This means that you need to manually add the `multitask_id` labels to the sentences:

```python
from flair.data import Sentence

def create_sentence(datapoint) -> Sentence:
    tokens = ...  # calculate the tokens from your internal data structure (e.g. pandas dataframe or json dictionary)
    spans = ...  # create a list of tuples (start_token, end_token, label) from your data structure
    sentence = Sentence(tokens)
    for (start, end, ner_label, nel_label) in spans:
        sentence[start:end+1].add_label("ner", ner_label)
        sentence[start:end+1].add_label("nel", nel_label)
    sentence.add_label("multitask_id", "Task_0")  # Task_0 for the NER model
    sentence.add_label("multitask_id", "Task_1")  # Task_1 for the NEL model
    return sentence
```

Then you can run the multitask training script with the exception that you create the [MultitaskModel](#flair.models.MultitaskModel) directly.

```python
from flair.models import MultitaskModel

...
multitask_model = MultitaskModel([ner_model, nel_model], use_all_tasks=True)
```

Here, setting `use_all_tasks=True` means that we train jointly on both tasks at the same time. This saves a lot of training time,
as the shared embeddings are computed once but used twice (once for each model).
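
The saving can be illustrated with a toy encoder that counts how often it is called. This is a conceptual sketch in plain Python, not Flair code: `CountingEncoder`, `run_separately` and `run_jointly` are hypothetical names, and the two "heads" are simple stand-ins for the NER and NEL models.

```python
from typing import Callable, List

Vector = List[float]


class CountingEncoder:
    """Toy stand-in for a shared embedding layer that counts its forward passes."""

    def __init__(self) -> None:
        self.calls = 0

    def embed(self, tokens: List[str]) -> Vector:
        self.calls += 1
        return [float(len(token)) for token in tokens]


def run_separately(
    encoder: CountingEncoder, heads: List[Callable[[Vector], float]], tokens: List[str]
) -> List[float]:
    """Each head triggers its own embedding pass."""
    return [head(encoder.embed(tokens)) for head in heads]


def run_jointly(
    encoder: CountingEncoder, heads: List[Callable[[Vector], float]], tokens: List[str]
) -> List[float]:
    """Embed once, then reuse the result for every head."""
    vectors = encoder.embed(tokens)
    return [head(vectors) for head in heads]


heads = [sum, max]  # stand-ins for the two task-specific models
tokens = ["Sam", "Houston", "stayed", "home"]

separate_encoder = CountingEncoder()
separate_outputs = run_separately(separate_encoder, heads, tokens)

joint_encoder = CountingEncoder()
joint_outputs = run_jointly(joint_encoder, heads, tokens)

print(separate_encoder.calls, joint_encoder.calls)
```

Both variants produce the same outputs, but the joint variant runs the expensive embedding step only once per sentence.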