flair.datasets.entity_linking#

class flair.datasets.entity_linking.EntityLinkingDictionary(candidates, dataset_name=None)View on GitHub#

Bases: object

Base class for downloading and reading dictionaries for entity linking.

A dictionary represents all entities of a knowledge base and their associated ids.

__init__(candidates, dataset_name=None)View on GitHub#

Initialize the entity linking dictionary.

Parameters:
  • candidates (Iterable[EntityCandidate]) – An iterable of all candidates contained in the knowledge base.

  • dataset_name (Optional[str]) – string to prefix concept IDs. To be used for custom dictionaries.

property database_name: str#

Name of the database represented by the dictionary.

property text_to_index: dict[str, list[str]]#
property candidates: list[EntityCandidate]#
to_in_memory_dictionary()View on GitHub#
Return type:

InMemoryEntityLinkingDictionary
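The interface above boils down to a list of candidates plus a lookup from surface forms to concept ids. A minimal self-contained sketch of that data model (the Candidate class below is an illustrative stand-in; flair's real EntityCandidate may carry additional fields):

```python
from dataclasses import dataclass, field

# Illustrative stand-in for flair's EntityCandidate; the real class
# may carry additional fields (database name, extra ids, ...).
@dataclass
class Candidate:
    concept_id: str
    concept_name: str
    synonyms: list[str] = field(default_factory=list)

def build_text_to_index(candidates):
    """Map each surface form (name or synonym) to the concept ids it
    may refer to, mirroring the documented text_to_index property."""
    index: dict[str, list[str]] = {}
    for c in candidates:
        for text in [c.concept_name, *c.synonyms]:
            index.setdefault(text, []).append(c.concept_id)
    return index

candidates = [
    Candidate("7157", "TP53", ["tumor protein p53"]),
    Candidate("672", "BRCA1"),
]
index = build_text_to_index(candidates)
print(index["tumor protein p53"])  # ['7157']
```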

class flair.datasets.entity_linking.InMemoryEntityLinkingDictionary(candidates, dataset_name)View on GitHub#

Bases: EntityLinkingDictionary

to_state()View on GitHub#
Return type:

dict[str, Any]

classmethod from_state(state)View on GitHub#
Return type:

InMemoryEntityLinkingDictionary

class flair.datasets.entity_linking.HunerEntityLinkingDictionary(path, dataset_name)View on GitHub#

Bases: EntityLinkingDictionary

Base dictionary for data already in HUNER format.

Every line in the file must be formatted as follows:

concept_id||concept_name

If multiple concept ids are associated with a given name, they have to be separated by a |, e.g.

7157||TP53|tumor protein p53
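A minimal sketch of parsing one such line, under the assumption that values on either side of the || delimiter are |-separated:

```python
def parse_huner_line(line: str):
    """Split a HUNER dictionary line into its id part and name part;
    both sides may contain several |-separated values."""
    id_part, name_part = line.rstrip("\n").split("||", 1)
    return id_part.split("|"), name_part.split("|")

ids, names = parse_huner_line("7157||TP53|tumor protein p53\n")
print(ids, names)  # ['7157'] ['TP53', 'tumor protein p53']
```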

class flair.datasets.entity_linking.CTD_DISEASES_DICTIONARY(base_path=None)View on GitHub#

Bases: EntityLinkingDictionary

Dictionary for named entity linking on diseases using the Comparative Toxicogenomics Database (CTD).

Further information can be found at https://ctdbase.org/

download_dictionary(data_dir)View on GitHub#
Return type:

Path

parse_file(original_file)View on GitHub#
Return type:

Iterator[EntityCandidate]

class flair.datasets.entity_linking.CTD_CHEMICALS_DICTIONARY(base_path=None)View on GitHub#

Bases: EntityLinkingDictionary

Dictionary for named entity linking on chemicals using the Comparative Toxicogenomics Database (CTD).

Further information can be found at https://ctdbase.org/

download_dictionary(data_dir)View on GitHub#
Return type:

Path

parse_file(original_file)View on GitHub#
Return type:

Iterator[EntityCandidate]

class flair.datasets.entity_linking.NCBI_GENE_HUMAN_DICTIONARY(base_path=None)View on GitHub#

Bases: EntityLinkingDictionary

Dictionary for named entity linking on genes using the NCBI Gene ontology.

Note that this dictionary only represents human genes - genes from other species aren’t included!

Further information can be found at https://www.ncbi.nlm.nih.gov/gene/

_is_invalid_name(name)View on GitHub#

Determine if a name should be skipped.

Return type:

bool

download_dictionary(data_dir)View on GitHub#
Return type:

Path

parse_dictionary(original_file)View on GitHub#
Return type:

Iterator[EntityCandidate]

class flair.datasets.entity_linking.NCBI_TAXONOMY_DICTIONARY(base_path=None)View on GitHub#

Bases: EntityLinkingDictionary

Dictionary for named entity linking on organisms / species using the NCBI taxonomy ontology.

Further information about the ontology can be found at https://www.ncbi.nlm.nih.gov/taxonomy

download_dictionary(data_dir)View on GitHub#
Return type:

Path

parse_dictionary(original_file)View on GitHub#
Return type:

Iterator[EntityCandidate]

class flair.datasets.entity_linking.ZELDA(base_path=None, in_memory=False, column_format={0: 'text', 2: 'nel'}, **corpusargs)View on GitHub#

Bases: MultiFileColumnCorpus

__init__(base_path=None, in_memory=False, column_format={0: 'text', 2: 'nel'}, **corpusargs)View on GitHub#

Initialize ZELDA Entity Linking corpus.

Introduced in “ZELDA: A Comprehensive Benchmark for Supervised Entity Disambiguation” (Milich and Akbik, 2023). When the constructor is called for the first time, the dataset is automatically downloaded.

Parameters:
  • base_path (Union[str, Path], optional) – Default is None, meaning that the corpus is auto-downloaded and loaded. You can override this to point to a different folder, but typically this should not be necessary.

  • in_memory (bool) – If True, keeps the dataset in memory, giving speedups in training.

  • column_format (dict[int, str]) – The column format specifying which columns correspond to the text and label types.

class flair.datasets.entity_linking.NEL_ENGLISH_AQUAINT(base_path=None, in_memory=True, agreement_threshold=0.5, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, agreement_threshold=0.5, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)View on GitHub#

Initialize Aquaint Entity Linking corpus.

Introduced in: D. Milne and I. H. Witten, “Learning to Link with Wikipedia” (https://www.cms.waikato.ac.nz/~ihw/papers/08-DNM-IHW-LearningToLinkWithWikipedia.pdf). The first time you call the constructor, the dataset is automatically downloaded and transformed into tab-separated column format (aquaint.txt).

Parameters:
  • base_path (Union[str, Path], optional) – Default is None, meaning that the corpus is auto-downloaded and loaded. You can override this to point to a different folder, but typically this should not be necessary.

  • in_memory (bool) – If True, keeps the dataset in memory, giving speedups in training.

  • agreement_threshold (float) – Some link annotations come with an agreement_score representing the agreement among the human annotators. The score ranges from 0.2 (lowest) to 1.0 (highest). The lower the score, the less “important” the entity is, because fewer annotators thought it was worth linking. The default is 0.5, which means the majority of annotators must have annotated the respective entity mention.

  • sentence_splitter (SentenceSplitter) – The sentence splitter that is used to split the articles into sentences.
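The agreement_threshold rule amounts to a simple filter over scored annotations. A sketch with hypothetical (mention, agreement_score) pairs; whether mentions exactly at the threshold are kept is an assumption here:

```python
# Hypothetical (mention, agreement_score) pairs; scores range from
# 0.2 to 1.0 as described above.
annotations = [
    ("Wikipedia", 1.0),
    ("hyperlink", 0.4),
    ("New Zealand", 0.8),
]

agreement_threshold = 0.5  # default: the majority of annotators agreed

# Assumption: mentions at or above the threshold are kept.
kept = [mention for mention, score in annotations if score >= agreement_threshold]
print(kept)  # ['Wikipedia', 'New Zealand']
```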

class flair.datasets.entity_linking.NEL_GERMAN_HIPE(base_path=None, in_memory=True, wiki_language='dewiki', **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, wiki_language='dewiki', **corpusargs)View on GitHub#

Initialize a sentence-segmented version of the HIPE entity linking corpus for historical German.

See the description of HIPE at https://impresso.github.io/CLEF-HIPE-2020/.

This version was segmented by @stefan-it and is hosted at stefan-it/clef-hipe. The first time you call the constructor, the dataset is automatically downloaded and transformed into tab-separated column format.

Parameters:
  • base_path (Union[str, Path], optional) – Default is None, meaning that the corpus is auto-downloaded and loaded. You can override this to point to a different folder, but typically this should not be necessary.

  • in_memory (bool) – If True, keeps the dataset in memory, giving speedups in training.

  • wiki_language (str) – Specify the language of the names of the Wikipedia pages, i.e. which language version of Wikipedia URLs to use. Since the text is in German, the default language is German.

class flair.datasets.entity_linking.NEL_ENGLISH_AIDA(base_path=None, in_memory=True, use_ids_and_check_existence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, use_ids_and_check_existence=False, **corpusargs)View on GitHub#

Initialize AIDA CoNLL-YAGO Entity Linking corpus.

The corpus was introduced at https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/ambiverse-nlu/aida/downloads. License: https://creativecommons.org/licenses/by-sa/3.0/deed.en_US. The first time you call the constructor, the dataset is automatically downloaded.

Parameters:
  • base_path (Union[str, Path], optional) – Default is None, meaning that the corpus is auto-downloaded and loaded. You can override this to point to a different folder, but typically this should not be necessary.

  • in_memory (bool) – If True, keeps the dataset in memory, giving speedups in training.

  • use_ids_and_check_existence (bool) – If True, the existence of the given Wikipedia ids/page names is checked and non-existent ids/names are ignored. This also means that one works with current Wikipedia article names, which may differ from some of the outdated ones in the original dataset.

class flair.datasets.entity_linking.NEL_ENGLISH_IITB(base_path=None, in_memory=True, ignore_disagreements=False, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, ignore_disagreements=False, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)View on GitHub#

Initialize IITB Entity Linking corpus.

The corpus was introduced in “Collective Annotation of Wikipedia Entities in Web Text” by Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and Soumen Chakrabarti.

The first time you call the constructor, the dataset is automatically downloaded.

Parameters:
  • base_path (Union[str, Path], optional) – Default is None, meaning that the corpus is auto-downloaded and loaded. You can override this to point to a different folder, but typically this should not be necessary.

  • in_memory (bool) – If True, keeps the dataset in memory, giving speedups in training.

  • ignore_disagreements (bool) – If True, annotations with annotator disagreement will be ignored.

  • sentence_splitter (SentenceSplitter) – The sentence splitter that is used to split the articles into sentences.

class flair.datasets.entity_linking.NEL_ENGLISH_TWEEKI(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize Tweeki Entity Linking corpus.

The dataset was introduced in “Tweeki: Linking Named Entities on Twitter to a Knowledge Graph” by Harandizadeh and Singh. The data consists of tweets with manually annotated Wikipedia links. The first time you call the constructor, the dataset is automatically downloaded and transformed into tab-separated column format.

Parameters:
  • base_path (Union[str, Path], optional) – Default is None, meaning that the corpus is auto-downloaded and loaded. You can override this to point to a different folder, but typically this should not be necessary.

  • in_memory (bool) – If True, keeps the dataset in memory, giving speedups in training.

class flair.datasets.entity_linking.NEL_ENGLISH_REDDIT(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the Reddit Entity Linking corpus containing gold annotations only.

see https://arxiv.org/abs/2101.01228v2

The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that the corpus is auto-downloaded and loaded. You can override this to point to a different folder, but typically this should not be necessary.

  • in_memory (bool) – If True, keeps the dataset in memory, giving speedups in training.

  • document_as_sequence – If True, all sentences of a document are read into a single Sentence object.

_text_to_cols(sentence, links, outfile)View on GitHub#

Convert a tokenized sentence into column format.

Parameters:
  • sentence (Sentence) – Flair Sentence object containing a tokenized post title or comment thread

  • links (list) – array containing information about the starting and ending positions of entity mentions, as well as their corresponding wiki tags

  • outfile – file to which the output is written

_fill_annot_array(annot_array, key, post_flag)View on GitHub#

Fills the array containing information about the entity mention annotations.

Parameters:
  • annot_array (list) – array to be filled

  • key (str) – reddit id, on which the post title/comment thread is matched with its corresponding annotation

  • post_flag (bool) – flag indicating whether the annotations are collected for the post titles or comment threads

Return type:

list

_fill_curr_comment(fix_flag)View on GitHub#

Extends the string containing the current comment thread, which is passed to the _text_to_cols method when the comments are parsed.

Parameters:

fix_flag (bool) – flag indicating whether the method is called while parsing the incorrectly imported rows or the regular rows

flair.datasets.entity_linking.from_ufsac_to_tsv(xml_file, conll_file, datasetname, encoding='utf8', cut_multisense=True)View on GitHub#

Converts the UFSAC format into tab-separated column format in a new file.

Parameters:
  • xml_file (Union[str, Path]) – Path to the xml file.

  • conll_file (Union[str, Path]) – Path for the new conll file.

  • datasetname (str) – Name of the dataset from UFSAC; needed because multi-word spans are handled differently across the datasets.

  • encoding (str, optional) – Encoding used in the open function. The default is “utf8”.

  • cut_multisense (bool, optional) – Determines whether the wn30_key tag should be cut if it contains multiple possible senses. If True, only the first listed sense will be used. Otherwise, the whole list of senses will be treated as one new sense. The default is True.
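The cut_multisense behavior amounts to truncating a multi-sense key to its first listed sense. A hedged sketch, with the sense separator left as a parameter since it is not specified here:

```python
def cut_sense(wn30_key: str, cut_multisense: bool = True, sep: str = ";") -> str:
    """Keep only the first listed sense when cut_multisense is True;
    otherwise treat the whole list as a single label. The separator
    is an assumption -- it is not documented here."""
    if cut_multisense:
        return wn30_key.split(sep)[0]
    return wn30_key

print(cut_sense("long%3:00:02::;long%5:00:00:tall:00"))  # long%3:00:02::
```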

flair.datasets.entity_linking.determine_tsv_file(filename, data_folder, cut_multisense=True)View on GitHub#

Checks if the converted .tsv file already exists and if not, creates it.

Parameters:
  • filename (str) – The name of the file.

  • data_folder (Path) – The name of the folder in which the CoNLL file should reside.

  • cut_multisense (bool) – Determines whether the wn30_key tag should be cut if it contains multiple possible senses. If True, only the first listed sense will be used. Otherwise, the whole list of senses will be treated as one new sense. The default is True.

Return type:

str

Returns:

the name of the file.

class flair.datasets.entity_linking.WSD_UFSAC(filenames=['masc', 'semcor'], base_path=None, in_memory=True, cut_multisense=True, columns={0: 'text', 3: 'sense'}, banned_sentences=None, sample_missing_splits_in_multicorpus=True, sample_missing_splits_in_each_corpus=True, use_raganato_ALL_as_test_data=False, name='multicorpus')View on GitHub#

Bases: MultiCorpus

__init__(filenames=['masc', 'semcor'], base_path=None, in_memory=True, cut_multisense=True, columns={0: 'text', 3: 'sense'}, banned_sentences=None, sample_missing_splits_in_multicorpus=True, sample_missing_splits_in_each_corpus=True, use_raganato_ALL_as_test_data=False, name='multicorpus')View on GitHub#

Initialize a custom corpus with any Word Sense Disambiguation (WSD) datasets in the UFSAC format.

If the constructor is called for the first time, the data is automatically downloaded and transformed from XML into a tab-separated column format. Since only the WordNet 3.0 sense inventory is consistently available for all provided datasets, we only consider this version. We also ignore the id annotation used in datasets that were originally created for evaluation tasks.

Parameters:
  • filenames (Union[str, list[str]]) – Here you can pass a single dataset name or a list of dataset names. The available names are: ‘masc’, ‘omsti’, ‘raganato_ALL’, ‘raganato_semeval2007’, ‘raganato_semeval2013’, ‘raganato_semeval2015’, ‘raganato_senseval2’, ‘raganato_senseval3’, ‘semcor’, ‘semeval2007task17’, ‘semeval2007task7’, ‘semeval2013task12’, ‘semeval2015task13’, ‘senseval2’, ‘senseval2_lexical_sample_test’, ‘senseval2_lexical_sample_train’, ‘senseval3task1’, ‘senseval3task6_test’, ‘senseval3task6_train’, ‘trainomatic’, ‘wngt’.

  • base_path (Union[str, Path, None]) – You can override this to point to a specific folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps the dataset in memory, giving speedups in training.

  • cut_multisense (bool) – Boolean that determines whether the wn30_key tag should be cut if it contains multiple possible senses. If True only the first listed sense will be used and the suffix ‘_cut’ will be added to the name of the CoNLL file. Otherwise the whole list of senses will be detected as one new sense. The default is True.

  • columns – Columns to consider when loading the dataset. You can add 1: “lemma” or 2: “pos” to the default dict {0: “text”, 3: “sense”} if you want to use additional pos and/or lemma for the words.

  • banned_sentences (Optional[list[str]]) – Optionally remove sentences from the corpus. Works only if in_memory is True.

  • sample_missing_splits_in_multicorpus (Union[bool, str]) – Whether to sample missing splits when loading the multicorpus (this is redundant if sample_missing_splits_in_each_corpus is True)

  • sample_missing_splits_in_each_corpus (Union[bool, str]) – Whether to sample missing splits when loading each single corpus given in filenames.

  • use_raganato_ALL_as_test_data (bool) – If True, the raganato_ALL dataset (Raganato et al., “Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison”) will be used as test data. Note that in this case the sample_missing_splits parameters are set to ‘only_dev’.

  • name (str) – Name of your corpus
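The columns mapping addresses each tab-separated field by index. A self-contained sketch of reading rows under the default {0: 'text', 3: 'sense'} layout (the sample lines and the "O" placeholder for unannotated tokens are illustrative assumptions, not the corpus's actual content):

```python
# Illustrative reader for the documented four-column layout; columns
# 1 and 2 would hold lemma and POS. The sample lines and the "O"
# placeholder for unannotated tokens are assumptions.
columns = {0: "text", 3: "sense"}

def read_rows(lines, columns):
    """Yield one {column_name: value} dict per non-empty line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields == [""]:
            continue  # blank line = sentence boundary
        yield {name: fields[idx] for idx, name in columns.items()}

sample = ["The\tthe\tDET\tO", "bank\tbank\tNOUN\tbank%1:14:00::", ""]
rows = list(read_rows(sample, columns))
print(rows[1]["sense"])  # bank%1:14:00::
```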

class flair.datasets.entity_linking.WSD_RAGANATO_ALL(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True)View on GitHub#

Initialize raganato_ALL (concatenation of all SensEval and SemEval all-words tasks) provided in UFSAC.

See getalp/UFSAC. When first initializing the corpus, the whole UFSAC data is downloaded.

class flair.datasets.entity_linking.WSD_SEMCOR(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)View on GitHub#

Initialize SemCor provided in UFSAC.

See getalp/UFSAC. When first initializing the corpus, the whole UFSAC data is downloaded.

class flair.datasets.entity_linking.WSD_WORDNET_GLOSS_TAGGED(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, use_raganato_ALL_as_test_data=False)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, use_raganato_ALL_as_test_data=False)View on GitHub#

Initialize Princeton WordNet Gloss Corpus provided in UFSAC.

See getalp/UFSAC. When first initializing the corpus, the whole UFSAC data is downloaded.

class flair.datasets.entity_linking.WSD_MASC(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)View on GitHub#

Initialize MASC (Manually Annotated Sub-Corpus) provided in UFSAC.

See getalp/UFSAC. When first initializing the corpus, the whole UFSAC data is downloaded.

class flair.datasets.entity_linking.WSD_OMSTI(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)View on GitHub#

Initialize OMSTI (One Million Sense-Tagged Instances) provided in UFSAC.

See getalp/UFSAC. When first initializing the corpus, the whole UFSAC data is downloaded.

class flair.datasets.entity_linking.WSD_TRAINOMATIC(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, use_raganato_ALL_as_test_data=False)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, use_raganato_ALL_as_test_data=False)View on GitHub#

Initialize Train-O-Matic provided in UFSAC.

See getalp/UFSAC. When first initializing the corpus, the whole UFSAC data is downloaded.

class flair.datasets.entity_linking.BigBioEntityLinkingCorpus(base_path=None, label_type='el', norm_keys=['db_name', 'db_id'], **kwargs)View on GitHub#

Bases: Corpus, ABC

This class implements an adapter to datasets implemented in the BigBio framework.

See: bigscience-workshop/biomedical

The BigBio framework harmonizes over 120 biomedical datasets and provides a uniform programming API to access them. This adapter makes it possible to use all named entity recognition datasets via the bigbio_kb schema.

class flair.datasets.entity_linking.BIGBIO_EL_NCBI_DISEASE(base_path=None, label_type='el-diseases', **kwargs)View on GitHub#

Bases: BigBioEntityLinkingCorpus

This class implements the adapter for the NCBI Disease corpus.

See: - Reference: https://www.sciencedirect.com/science/article/pii/S1532046413001974 - Link: https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/

class flair.datasets.entity_linking.BIGBIO_EL_BC5CDR_CHEMICAL(base_path=None, label_type='el-chemical', **kwargs)View on GitHub#

Bases: BigBioEntityLinkingCorpus

This class implements the adapter for the BC5CDR corpus (only chemical annotations).

See: - Reference: https://academic.oup.com/database/article/doi/10.1093/database/baw068/2630414 - Link: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/track-3-cdr/

class flair.datasets.entity_linking.BIGBIO_EL_GNORMPLUS(base_path=None, label_type='el-genes', **kwargs)View on GitHub#

Bases: BigBioEntityLinkingCorpus

This class implements the adapter for the GNormPlus corpus.

See: - Reference: https://www.hindawi.com/journals/bmri/2015/918710/ - Link: https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/gnormplus/