flair.datasets.entity_linking

class flair.datasets.entity_linking.EntityLinkingDictionary(candidates, dataset_name=None)

Bases: object

Base class for downloading and reading dictionaries for entity linking.

A dictionary represents all entities of a knowledge base and their associated ids.

property database_name: str

Name of the database represented by the dictionary.

property text_to_index: Dict[str, List[str]]

property candidates: List[EntityCandidate]

to_in_memory_dictionary()
Return type:

InMemoryEntityLinkingDictionary
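
To make the type Dict[str, List[str]] of text_to_index concrete, here is a minimal sketch. It assumes the property maps each name to the ids of all concepts that carry that name; that reading is an assumption, and the helper build_text_to_index, the (concept_id, concept_name) pairs and the ids C1/C2 are hypothetical, not part of flair.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_text_to_index(entries: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each name to the ids of all entries that carry it.

    `entries` holds hypothetical (concept_id, concept_name) pairs.
    """
    index: Dict[str, List[str]] = defaultdict(list)
    for concept_id, concept_name in entries:
        # One name may belong to several concepts, so ids are collected in a list.
        index[concept_name].append(concept_id)
    return dict(index)


# Made-up entries: the name "cold" is shared by two concepts.
entries = [("C1", "cold"), ("C2", "cold"), ("C1", "common cold")]
index = build_text_to_index(entries)
```

A list-valued mapping keeps ambiguous names resolvable to every matching concept instead of silently keeping only one.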

class flair.datasets.entity_linking.InMemoryEntityLinkingDictionary(candidates, dataset_name)

Bases: EntityLinkingDictionary

to_state()
Return type:

Dict[str, Any]

classmethod from_state(state)
Return type:

InMemoryEntityLinkingDictionary
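
The to_state / from_state pair suggests a round trip through a plain dictionary. The sketch below illustrates that pattern with a hypothetical stand-in class, TinyDictionary; the state keys and the candidate layout are assumptions and do not reflect the actual state produced by InMemoryEntityLinkingDictionary.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class TinyDictionary:
    """Hypothetical stand-in, not flair's InMemoryEntityLinkingDictionary."""

    dataset_name: str
    candidates: List[Dict[str, str]]

    def to_state(self) -> Dict[str, Any]:
        # Only plain Python containers go into the state, so it is easy to serialize.
        return {"dataset_name": self.dataset_name, "candidates": list(self.candidates)}

    @classmethod
    def from_state(cls, state: Dict[str, Any]) -> "TinyDictionary":
        # Rebuild the object from the keys written by to_state.
        return cls(dataset_name=state["dataset_name"], candidates=list(state["candidates"]))


original = TinyDictionary("demo", [{"concept_id": "C1", "concept_name": "cold"}])
restored = TinyDictionary.from_state(original.to_state())
```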

class flair.datasets.entity_linking.HunerEntityLinkingDictionary(path, dataset_name)

Bases: EntityLinkingDictionary

Base dictionary with data already in huner format.

Every line in the file must be formatted as follows:

concept_id||concept_name

If multiple concept ids are associated with a given name, they have to be separated by a |, e.g.

7157||TP53|tumor protein p53
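
As a rough illustration of the line format, the following sketch splits a line at the first || into its id field and name field. The function parse_huner_line is hypothetical and is not flair's parser; splitting both fields on a single | is an assumption made only for this example.

```python
from typing import List, Tuple


def parse_huner_line(line: str) -> Tuple[List[str], List[str]]:
    """Split one `concept_id||concept_name` line into its two fields."""
    line = line.strip()
    if "||" not in line:
        raise ValueError(f"expected 'concept_id||concept_name', got: {line!r}")
    id_field, name_field = line.split("||", 1)
    # Assumption: a single `|` separates multiple values within either field.
    return id_field.split("|"), name_field.split("|")


ids, names = parse_huner_line("7157||TP53|tumor protein p53")
```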

class flair.datasets.entity_linking.CTD_DISEASES_DICTIONARY(base_path=None)

Bases: EntityLinkingDictionary

Dictionary for named entity linking on diseases using the Comparative Toxicogenomics Database (CTD).

Further information can be found at https://ctdbase.org/

download_dictionary(data_dir)
Return type:

Path

parse_file(original_file)
Return type:

Iterator[EntityCandidate]

class flair.datasets.entity_linking.CTD_CHEMICALS_DICTIONARY(base_path=None)

Bases: EntityLinkingDictionary

Dictionary for named entity linking on chemicals using the Comparative Toxicogenomics Database (CTD).

Further information can be found at https://ctdbase.org/

download_dictionary(data_dir)
Return type:

Path

parse_file(original_file)
Return type:

Iterator[EntityCandidate]

class flair.datasets.entity_linking.NCBI_GENE_HUMAN_DICTIONARY(base_path=None)

Bases: EntityLinkingDictionary

Dictionary for named entity linking on genes using the NCBI Gene ontology.

Note that this dictionary only represents human genes; genes from other species are not included.

Further information can be found at https://www.ncbi.nlm.nih.gov/gene/

download_dictionary(data_dir)
Return type:

Path

parse_dictionary(original_file)
Return type:

Iterator[EntityCandidate]

class flair.datasets.entity_linking.NCBI_TAXONOMY_DICTIONARY(base_path=None)

Bases: EntityLinkingDictionary

Dictionary for named entity linking on organisms / species using the NCBI taxonomy ontology.

Further information about the ontology can be found at https://www.ncbi.nlm.nih.gov/taxonomy

download_dictionary(data_dir)
Return type:

Path

parse_dictionary(original_file)
Return type:

Iterator[EntityCandidate]

class flair.datasets.entity_linking.ZELDA(base_path=None, in_memory=False, column_format={0: 'text', 2: 'nel'}, **corpusargs)

Bases: MultiFileColumnCorpus

class flair.datasets.entity_linking.NEL_ENGLISH_AQUAINT(base_path=None, in_memory=True, agreement_threshold=0.5, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)

Bases: ColumnCorpus

class flair.datasets.entity_linking.NEL_GERMAN_HIPE(base_path=None, in_memory=True, wiki_language='dewiki', **corpusargs)

Bases: ColumnCorpus

class flair.datasets.entity_linking.NEL_ENGLISH_AIDA(base_path=None, in_memory=True, use_ids_and_check_existence=False, **corpusargs)

Bases: ColumnCorpus

class flair.datasets.entity_linking.NEL_ENGLISH_IITB(base_path=None, in_memory=True, ignore_disagreements=False, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)

Bases: ColumnCorpus

class flair.datasets.entity_linking.NEL_ENGLISH_TWEEKI(base_path=None, in_memory=True, **corpusargs)

Bases: ColumnCorpus

class flair.datasets.entity_linking.NEL_ENGLISH_REDDIT(base_path=None, in_memory=True, **corpusargs)

Bases: ColumnCorpus

flair.datasets.entity_linking.from_ufsac_to_tsv(xml_file, conll_file, datasetname, encoding='utf8', cut_multisense=True)

Converts a file in UFSAC format into a tab-separated column format and writes the result to a new file.

Parameters:
  • xml_file (Union[str, Path]) – Path to the xml file.

  • conll_file (Union[str, Path]) – Path for the new conll file.

  • datasetname (str) – Name of the dataset from UFSAC, needed because multi-word spans are handled differently across the datasets.

  • encoding (str, optional) – Encoding used in open function. The default is “utf8”.

  • cut_multisense (bool, optional) – Determines whether the wn30_key tag should be cut if it contains multiple possible senses. If True, only the first listed sense will be used. Otherwise, the whole list of senses will be treated as one new sense. The default is True.
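
The effect of cut_multisense can be sketched as follows. This is an illustration only: the helper select_sense is hypothetical, the ";" separator between senses is an assumption, and the example keys are made up to resemble WordNet sense keys.

```python
def select_sense(wn30_key: str, cut_multisense: bool = True, separator: str = ";") -> str:
    """Return the label to use for a token that may list several senses.

    Assumption: multiple senses in one tag are joined by `separator`.
    """
    if cut_multisense and separator in wn30_key:
        # Keep only the first listed sense.
        return wn30_key.split(separator, 1)[0]
    # Otherwise the whole tag is kept and acts as one combined label.
    return wn30_key


multi = "bank%1:14:00::;bank%1:17:01::"  # illustrative keys, not taken from a corpus
first_only = select_sense(multi, cut_multisense=True)
combined = select_sense(multi, cut_multisense=False)
```

Cutting to the first sense keeps the label set small, while keeping the full tag creates a distinct label for every combination of senses that occurs.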

flair.datasets.entity_linking.determine_tsv_file(filename, data_folder, cut_multisense=True)

Checks if the converted .tsv file already exists and if not, creates it.

Parameters:
  • filename (str) – The name of the file.

  • data_folder (Path) – The name of the folder in which the CoNLL file should reside.

  • cut_multisense (bool) – Determines whether the wn30_key tag should be cut if it contains multiple possible senses. If True, only the first listed sense will be used. Otherwise, the whole list of senses will be treated as one new sense. The default is True.

Return type:

str

Returns:

The name of the file.
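
The "check if the converted file exists, otherwise create it" behaviour is a common caching pattern. The sketch below shows the general idea with a hypothetical helper, ensure_converted, and a fake converter; it is not the implementation of determine_tsv_file, and unlike that function it returns a Path rather than a file name.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def ensure_converted(source: Path, target: Path, convert: Callable[[Path, Path], None]) -> Path:
    """Run `convert` only when `target` does not exist yet, then return `target`."""
    if not target.exists():
        convert(source, target)
    return target


calls: List[str] = []


def fake_convert(src: Path, dst: Path) -> None:
    # Stand-in for a real converter: records the call and writes a dummy file.
    calls.append(src.name)
    dst.write_text("token\tsense\n", encoding="utf8")


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    first = ensure_converted(folder / "demo.xml", folder / "demo.tsv", fake_convert)
    second = ensure_converted(folder / "demo.xml", folder / "demo.tsv", fake_convert)
    converted_once = len(calls) == 1
```

The second call finds the existing file and skips the conversion, which is the point of the check.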

class flair.datasets.entity_linking.WSD_UFSAC(filenames=['masc', 'semcor'], base_path=None, in_memory=True, cut_multisense=True, columns={0: 'text', 3: 'sense'}, banned_sentences=None, sample_missing_splits_in_multicorpus=True, sample_missing_splits_in_each_corpus=True, use_raganato_ALL_as_test_data=False, name='multicorpus')

Bases: MultiCorpus

class flair.datasets.entity_linking.WSD_RAGANATO_ALL(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True)

Bases: ColumnCorpus

class flair.datasets.entity_linking.WSD_SEMCOR(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)

Bases: ColumnCorpus

class flair.datasets.entity_linking.WSD_WORDNET_GLOSS_TAGGED(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, use_raganato_ALL_as_test_data=False)

Bases: ColumnCorpus

class flair.datasets.entity_linking.WSD_MASC(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)

Bases: ColumnCorpus

class flair.datasets.entity_linking.WSD_OMSTI(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)

Bases: ColumnCorpus

class flair.datasets.entity_linking.WSD_TRAINOMATIC(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, use_raganato_ALL_as_test_data=False)

Bases: ColumnCorpus

class flair.datasets.entity_linking.BigBioEntityLinkingCorpus(base_path=None, label_type='el', norm_keys=['db_name', 'db_id'], **kwargs)

Bases: Corpus, ABC

This class implements an adapter to data sets implemented in the BigBio framework.

See: bigscience-workshop/biomedical

The BigBio framework harmonizes over 120 biomedical data sets and provides a uniform programming API to access them. This adapter allows all named entity recognition data sets to be used via the bigbio_kb schema.

class flair.datasets.entity_linking.BIGBIO_EL_NCBI_DISEASE(base_path=None, label_type='el-diseases', **kwargs)

Bases: BigBioEntityLinkingCorpus

This class implements the adapter for the NCBI Disease corpus.

See:
  • Reference: https://www.sciencedirect.com/science/article/pii/S1532046413001974

  • Link: https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/

class flair.datasets.entity_linking.BIGBIO_EL_BC5CDR_CHEMICAL(base_path=None, label_type='el-chemical', **kwargs)

Bases: BigBioEntityLinkingCorpus

This class implements the adapter for the BC5CDR corpus (only chemical annotations).

See:
  • Reference: https://academic.oup.com/database/article/doi/10.1093/database/baw068/2630414

  • Link: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/track-3-cdr/

class flair.datasets.entity_linking.BIGBIO_EL_GNORMPLUS(base_path=None, label_type='el-genes', **kwargs)

Bases: BigBioEntityLinkingCorpus

This class implements the adapter for the GNormPlus corpus.

See:
  • Reference: https://www.hindawi.com/journals/bmri/2015/918710/

  • Link: https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/gnormplus/