flair.datasets.entity_linking
- class flair.datasets.entity_linking.EntityLinkingDictionary(candidates, dataset_name=None)
Bases:
object
Base class for downloading and reading dictionaries for entity linking.
A dictionary represents all entities of a knowledge base and their associated ids.
- property database_name: str
Name of the database represented by the dictionary.
- property text_to_index: Dict[str, List[str]]
- property candidates: List[EntityCandidate]
- to_in_memory_dictionary()
- Return type:
InMemoryEntityLinkingDictionary
- class flair.datasets.entity_linking.InMemoryEntityLinkingDictionary(candidates, dataset_name)
Bases:
EntityLinkingDictionary
- to_state()
- Return type:
Dict[str, Any]
- classmethod from_state(state)
- Return type:
InMemoryEntityLinkingDictionary
- class flair.datasets.entity_linking.HunerEntityLinkingDictionary(path, dataset_name)
Bases:
EntityLinkingDictionary
Base dictionary with data already in huner format.
Every line in the file must be formatted as follows:
concept_id||concept_name
If multiple concept ids are associated with a given name, they have to be separated by a |, e.g.
7157||TP53|tumor protein p53
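As an illustration of the line format described above, here is a minimal pure-Python sketch of how such a file could be read. It is not flair's implementation: `ParsedEntry` and `parse_huner_lines` are hypothetical names, the sample ids are made up, and the sketch assumes that everything after the first `||` is the concept name and that several ids before it are joined by a single `|`.

```python
from typing import Iterable, Iterator, List, NamedTuple


class ParsedEntry(NamedTuple):
    """Hypothetical container for one parsed line (not flair's EntityCandidate)."""

    concept_id: str
    additional_ids: List[str]
    concept_name: str


def parse_huner_lines(lines: Iterable[str]) -> Iterator[ParsedEntry]:
    """Yield one entry per non-empty line of the form `concept_id||concept_name`."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Split once on the first "||": ids on the left, the name on the right.
        identifier, name = line.split("||", 1)
        # Assumption: several ids for the same name are joined by a single "|".
        first_id, *other_ids = identifier.split("|")
        yield ParsedEntry(first_id, other_ids, name)


# Made-up sample lines, for illustration only.
entries = list(parse_huner_lines(["ID1||name a", "ID2|ID3||name b"]))
```

Each entry keeps the first id separately from any additional ids, so a caller can decide which one to treat as the primary identifier.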
- class flair.datasets.entity_linking.CTD_DISEASES_DICTIONARY(base_path=None)
Bases:
EntityLinkingDictionary
Dictionary for named entity linking on diseases using the Comparative Toxicogenomics Database (CTD).
Further information can be found at https://ctdbase.org/
- download_dictionary(data_dir)
- Return type:
Path
- parse_file(original_file)
- Return type:
Iterator[EntityCandidate]
- class flair.datasets.entity_linking.CTD_CHEMICALS_DICTIONARY(base_path=None)
Bases:
EntityLinkingDictionary
Dictionary for named entity linking on chemicals using the Comparative Toxicogenomics Database (CTD).
Further information can be found at https://ctdbase.org/
- download_dictionary(data_dir)
- Return type:
Path
- parse_file(original_file)
- Return type:
Iterator[EntityCandidate]
- class flair.datasets.entity_linking.NCBI_GENE_HUMAN_DICTIONARY(base_path=None)
Bases:
EntityLinkingDictionary
Dictionary for named entity linking on genes using the NCBI Gene ontology.
Note that this dictionary only represents human genes; genes from other species are not included.
Further information can be found at https://www.ncbi.nlm.nih.gov/gene/
- download_dictionary(data_dir)
- Return type:
Path
- parse_dictionary(original_file)
- Return type:
Iterator[EntityCandidate]
- class flair.datasets.entity_linking.NCBI_TAXONOMY_DICTIONARY(base_path=None)
Bases:
EntityLinkingDictionary
Dictionary for named entity linking on organisms / species using the NCBI taxonomy ontology.
Further information about the ontology can be found at https://www.ncbi.nlm.nih.gov/taxonomy
- download_dictionary(data_dir)
- Return type:
Path
- parse_dictionary(original_file)
- Return type:
Iterator[EntityCandidate]
- class flair.datasets.entity_linking.ZELDA(base_path=None, in_memory=False, column_format={0: 'text', 2: 'nel'}, **corpusargs)
Bases:
MultiFileColumnCorpus
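The `column_format` default in the signature above maps column indices to annotation names: column 0 holds the token text and column 2 holds the `nel` label. The following sketch shows what that mapping means for a single row. It is only an illustration, not flair's reader: `read_columns` is a hypothetical helper, the sample row is made up, and whitespace-separated columns are an assumption.

```python
from typing import Dict

# Default taken from the ZELDA signature: column index -> annotation name.
column_format: Dict[int, str] = {0: "text", 2: "nel"}


def read_columns(line: str, column_format: Dict[int, str]) -> Dict[str, str]:
    """Pick the configured columns out of one row of a column-formatted file."""
    fields = line.split()  # assumption: columns are separated by whitespace
    return {name: fields[index] for index, name in column_format.items()}


# Made-up row with three columns; the middle column is ignored by the mapping.
row = read_columns("Berlin X B-Berlin", column_format)
```

Columns that do not appear in the mapping (column 1 here) are skipped.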
- class flair.datasets.entity_linking.NEL_ENGLISH_AQUAINT(base_path=None, in_memory=True, agreement_threshold=0.5, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.entity_linking.NEL_GERMAN_HIPE(base_path=None, in_memory=True, wiki_language='dewiki', **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.entity_linking.NEL_ENGLISH_AIDA(base_path=None, in_memory=True, use_ids_and_check_existence=False, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.entity_linking.NEL_ENGLISH_IITB(base_path=None, in_memory=True, ignore_disagreements=False, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.entity_linking.NEL_ENGLISH_TWEEKI(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- class flair.datasets.entity_linking.NEL_ENGLISH_REDDIT(base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- flair.datasets.entity_linking.from_ufsac_to_tsv(xml_file, conll_file, datasetname, encoding='utf8', cut_multisense=True)
Function that converts the UFSAC format into a tab-separated column format in a new file.
- Parameters:
xml_file (Union[str, Path]) – Path to the xml file.
conll_file (Union[str, Path]) – Path for the new conll file.
datasetname (str) – Name of the UFSAC dataset; needed because the datasets handle multi-word spans differently.
encoding (str, optional) – Encoding used in open function. The default is “utf8”.
cut_multisense (bool, optional) – Determines whether the wn30_key tag should be cut if it contains multiple possible senses. If True, only the first listed sense will be used. Otherwise, the whole list of senses will be treated as one new sense. The default is True.
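To make the effect of `cut_multisense` concrete, here is a minimal sketch of the selection rule described above. It is an assumption-laden illustration rather than the library's code: `select_sense` is a hypothetical helper, the sample keys are made up, and the `;` separator between multiple senses is assumed, not taken from this page.

```python
def select_sense(wn30_key: str, cut_multisense: bool = True, separator: str = ";") -> str:
    """Return the sense label to keep for one wn30_key value.

    Assumption: multiple possible senses are joined by `separator` (";" here).
    """
    if cut_multisense and separator in wn30_key:
        # Keep only the first listed sense.
        return wn30_key.split(separator)[0]
    # Otherwise the whole string, including all senses, acts as one label.
    return wn30_key


# Made-up keys, for illustration only.
first_only = select_sense("key_a;key_b")
combined = select_sense("key_a;key_b", cut_multisense=False)
```

With `cut_multisense=False`, the combined string becomes a label of its own, which increases the number of distinct sense labels in the resulting file.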
- flair.datasets.entity_linking.determine_tsv_file(filename, data_folder, cut_multisense=True)
Checks if the converted .tsv file already exists and if not, creates it.
- Parameters:
filename (str) – The name of the file.
data_folder (Path) – The name of the folder in which the CoNLL file should reside.
cut_multisense (bool) – Determines whether the wn30_key tag should be cut if it contains multiple possible senses. If True, only the first listed sense will be used. Otherwise, the whole list of senses will be treated as one new sense. The default is True.
- Return type:
str
- Returns:
the name of the file.
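The "check whether the converted file exists, otherwise create it" behaviour described above can be sketched as follows. This is a generic illustration under stated assumptions, not the function's actual body: `ensure_tsv`, `fake_convert`, and the `<filename>.tsv` naming scheme are all hypothetical, and the converter is passed in as a callable so the example stays self-contained.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def ensure_tsv(filename: str, data_folder: Path, convert: Callable[[Path], None]) -> str:
    """Create the .tsv file only if it is missing, then return its name."""
    tsv_name = f"{filename}.tsv"  # hypothetical naming scheme
    tsv_path = data_folder / tsv_name
    if not tsv_path.exists():
        convert(tsv_path)  # conversion runs once; later calls reuse the file
    return tsv_name


calls: List[str] = []


def fake_convert(target: Path) -> None:
    """Stand-in converter that records the call and writes a placeholder file."""
    calls.append(target.name)
    target.write_text("token\tsense\n", encoding="utf8")


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    first = ensure_tsv("demo", folder, fake_convert)
    second = ensure_tsv("demo", folder, fake_convert)  # file exists, no conversion
```

The second call returns the same name without invoking the converter again, which is the point of checking for the file first.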
- class flair.datasets.entity_linking.WSD_UFSAC(filenames=['masc', 'semcor'], base_path=None, in_memory=True, cut_multisense=True, columns={0: 'text', 3: 'sense'}, banned_sentences=None, sample_missing_splits_in_multicorpus=True, sample_missing_splits_in_each_corpus=True, use_raganato_ALL_as_test_data=False, name='multicorpus')
Bases:
MultiCorpus
- class flair.datasets.entity_linking.WSD_RAGANATO_ALL(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True)
Bases:
ColumnCorpus
- class flair.datasets.entity_linking.WSD_SEMCOR(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)
Bases:
ColumnCorpus
- class flair.datasets.entity_linking.WSD_WORDNET_GLOSS_TAGGED(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, use_raganato_ALL_as_test_data=False)
Bases:
ColumnCorpus
- class flair.datasets.entity_linking.WSD_MASC(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)
Bases:
ColumnCorpus
- class flair.datasets.entity_linking.WSD_OMSTI(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)
Bases:
ColumnCorpus
- class flair.datasets.entity_linking.WSD_TRAINOMATIC(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, use_raganato_ALL_as_test_data=False)
Bases:
ColumnCorpus
- class flair.datasets.entity_linking.BigBioEntityLinkingCorpus(base_path=None, label_type='el', norm_keys=['db_name', 'db_id'], **kwargs)
Bases:
Corpus, ABC
This class implements an adapter to data sets implemented in the BigBio framework.
See: bigscience-workshop/biomedical
The BigBio framework harmonizes over 120 biomedical data sets and provides a uniform programming API to access them. This adapter allows the use of all named entity recognition data sets via the bigbio_kb schema.
- class flair.datasets.entity_linking.BIGBIO_EL_NCBI_DISEASE(base_path=None, label_type='el-diseases', **kwargs)
Bases:
BigBioEntityLinkingCorpus
This class implements the adapter for the NCBI Disease corpus.
See:
- Reference: https://www.sciencedirect.com/science/article/pii/S1532046413001974
- Link: https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/
- class flair.datasets.entity_linking.BIGBIO_EL_BC5CDR_CHEMICAL(base_path=None, label_type='el-chemical', **kwargs)
Bases:
BigBioEntityLinkingCorpus
This class implements the adapter for the BC5CDR corpus (only chemical annotations).
See:
- Reference: https://academic.oup.com/database/article/doi/10.1093/database/baw068/2630414
- Link: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/track-3-cdr/
- class flair.datasets.entity_linking.BIGBIO_EL_GNORMPLUS(base_path=None, label_type='el-genes', **kwargs)
Bases:
BigBioEntityLinkingCorpus
This class implements the adapter for the GNormPlus corpus.
See:
- Reference: https://www.hindawi.com/journals/bmri/2015/918710/
- Link: https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/gnormplus/