flair.datasets.entity_linking
- class flair.datasets.entity_linking.ZELDA(base_path=None, in_memory=False, column_format={0: 'text', 2: 'nel'}, **corpusargs)
Bases: MultiFileColumnCorpus
- __init__(base_path=None, in_memory=False, column_format={0: 'text', 2: 'nel'}, **corpusargs)
Initialize the ZELDA entity linking corpus.
The corpus was introduced in “ZELDA: A Comprehensive Benchmark for Supervised Entity Disambiguation” (Milich and Akbik, 2023). The first time you call the constructor, the dataset is downloaded automatically.
- Parameters:
base_path (Union[str, Path], optional) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
column_format (Dict[int, str]) – Mapping from column index to the text or label type stored in that column.
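To illustrate what a column_format mapping such as {0: 'text', 2: 'nel'} expresses, here is a minimal sketch in plain Python. It is not Flair's reader: the helper `parse_column_line` and the sample line are hypothetical, and the sketch assumes that columns are separated by whitespace.

```python
from typing import Dict


def parse_column_line(line: str, column_format: Dict[int, str]) -> Dict[str, str]:
    """Map the fields of one column-formatted line to their names.

    Hypothetical helper for illustration only; assumes whitespace-separated columns.
    """
    fields = line.split()
    return {name: fields[index] for index, name in column_format.items()}


# Made-up sample line: a token, an unused middle column, and an entity label.
sample = "Berlin\t-\tB-Berlin"
parsed = parse_column_line(sample, {0: "text", 2: "nel"})
print(parsed)  # {'text': 'Berlin', 'nel': 'B-Berlin'}
```

Column 1 is simply skipped because the mapping does not name it.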
- class flair.datasets.entity_linking.NEL_ENGLISH_AQUAINT(base_path=None, in_memory=True, agreement_threshold=0.5, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)
Bases: ColumnCorpus
- __init__(base_path=None, in_memory=True, agreement_threshold=0.5, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)
Initialize the Aquaint entity linking corpus.
The corpus was introduced in D. Milne and I. H. Witten, “Learning to link with Wikipedia” (https://www.cms.waikato.ac.nz/~ihw/papers/08-DNM-IHW-LearningToLinkWithWikipedia.pdf). The first time you call the constructor, the dataset is downloaded automatically and transformed into tab-separated column format (aquaint.txt).
- Parameters:
base_path (Union[str, Path], optional) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
agreement_threshold (float) – Some link annotations come with an agreement_score representing the agreement among the human annotators. The score ranges from 0.2 (lowest) to 1.0 (highest). The lower the score, the less “important” the entity is, because fewer annotators thought it was worth linking. The default is 0.5, which means that the majority of annotators must have annotated the respective entity mention.
sentence_splitter (SentenceSplitter) – The sentence splitter used to split the articles into sentences.
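The effect of agreement_threshold can be sketched as a simple filter. This is an illustration in plain Python, not the library's code: the helper `filter_by_agreement`, the tuple layout, and the sample annotations are made up, and keeping scores equal to the threshold (>=) is an assumption.

```python
from typing import List, Tuple

# (mention, Wikipedia title, agreement score); made-up values for illustration.
Annotation = Tuple[str, str, float]


def filter_by_agreement(
    annotations: List[Annotation], agreement_threshold: float = 0.5
) -> List[Annotation]:
    """Keep annotations whose agreement score reaches the threshold.

    Assumption: a score equal to the threshold is kept (>=).
    """
    return [a for a in annotations if a[2] >= agreement_threshold]


annotations = [
    ("Waikato", "University_of_Waikato", 1.0),
    ("paper", "Paper", 0.2),
    ("linking", "Hyperlink", 0.6),
]
kept = filter_by_agreement(annotations)
print([mention for mention, _, _ in kept])  # ['Waikato', 'linking']
```

Lowering the threshold to 0.2 would keep every annotation in this sample, since 0.2 is the lowest score.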
- class flair.datasets.entity_linking.NEL_GERMAN_HIPE(base_path=None, in_memory=True, wiki_language='dewiki', **corpusargs)
Bases: ColumnCorpus
- __init__(base_path=None, in_memory=True, wiki_language='dewiki', **corpusargs)
Initialize a sentence-segmented version of the HIPE entity linking corpus for historical German.
See the description of HIPE at https://impresso.github.io/CLEF-HIPE-2020/.
This version was segmented by @stefan-it and is hosted at stefan-it/clef-hipe. The first time you call the constructor, the dataset is downloaded automatically and transformed into tab-separated column format.
- Parameters:
base_path (Union[str, Path], optional) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
wiki_language (str) – Language of the Wikipedia page names, i.e. which language version of Wikipedia URLs to use. Since the text is in German, the default is the German Wikipedia ('dewiki').
- class flair.datasets.entity_linking.NEL_ENGLISH_AIDA(base_path=None, in_memory=True, use_ids_and_check_existence=False, **corpusargs)
Bases: ColumnCorpus
- __init__(base_path=None, in_memory=True, use_ids_and_check_existence=False, **corpusargs)
Initialize the AIDA CoNLL-YAGO entity linking corpus.
The corpus was introduced at https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/ambiverse-nlu/aida/downloads. License: https://creativecommons.org/licenses/by-sa/3.0/deed.en_US. The first time you call the constructor, the dataset is downloaded automatically.
- Parameters:
base_path (Union[str, Path], optional) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
use_ids_and_check_existence (bool) – If True, the existence of the given Wikipedia ids/page names is checked and non-existent ids/names are ignored. This also means that one works with current Wikipedia article names, which may alter some of the outdated names in the original dataset.
- class flair.datasets.entity_linking.NEL_ENGLISH_IITB(base_path=None, in_memory=True, ignore_disagreements=False, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)
Bases: ColumnCorpus
- __init__(base_path=None, in_memory=True, ignore_disagreements=False, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)
Initialize the IITB entity linking corpus.
The corpus was introduced in “Collective Annotation of Wikipedia Entities in Web Text” by Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and Soumen Chakrabarti.
The first time you call the constructor, the dataset is downloaded automatically.
- Parameters:
base_path (Union[str, Path], optional) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
ignore_disagreements (bool) – If True, annotations with annotator disagreement are ignored.
sentence_splitter (SentenceSplitter) – The sentence splitter used to split the articles into sentences.
- class flair.datasets.entity_linking.NEL_ENGLISH_TWEEKI(base_path=None, in_memory=True, **corpusargs)
Bases: ColumnCorpus
- __init__(base_path=None, in_memory=True, **corpusargs)
Initialize the Tweeki entity linking corpus.
The dataset was introduced in “Tweeki: Linking Named Entities on Twitter to a Knowledge Graph” by Harandizadeh and Singh. The data consists of tweets with manually annotated Wikipedia links. The first time you call the constructor, the dataset is downloaded automatically and transformed into tab-separated column format.
- Parameters:
base_path (Union[str, Path], optional) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
- class flair.datasets.entity_linking.NEL_ENGLISH_REDDIT(base_path=None, in_memory=True, **corpusargs)
Bases: ColumnCorpus
- __init__(base_path=None, in_memory=True, **corpusargs)
Initialize the Reddit entity linking corpus containing gold annotations only.
See https://arxiv.org/abs/2101.01228v2.
The first time you call the constructor, the dataset is downloaded automatically.
- Parameters:
base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
document_as_sequence – If True, all sentences of a document are read into a single Sentence object.
- _text_to_cols(sentence, links, outfile)
Convert a tokenized sentence into column format.
- Parameters:
sentence (Sentence) – Flair Sentence object containing a tokenized post title or comment thread.
links (list) – Array containing information about the starting and ending position of an entity mention, as well as its corresponding wiki tag.
outfile – File to which the output is written.
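The conversion described above can be sketched in plain Python. This is not the method's actual code: the function `tokens_to_cols`, the character-offset tuples, the BIO-style label prefixes, and the tab separator are all assumptions made for illustration, and the sample sentence is invented.

```python
import io
from typing import List, TextIO, Tuple

Token = Tuple[str, int, int]  # (text, start offset, end offset)
Link = Tuple[int, int, str]   # (mention start, mention end, wiki tag)


def tokens_to_cols(tokens: List[Token], links: List[Link], outfile: TextIO) -> None:
    """Write one token per line with a BIO-style label, then a blank line.

    Hypothetical sketch: tokens inside a link span get B-/I- plus the wiki tag,
    all other tokens get "O".
    """
    for text, start, end in tokens:
        label = "O"
        for link_start, link_end, wiki in links:
            if start >= link_start and end <= link_end:
                prefix = "B-" if start == link_start else "I-"
                label = prefix + wiki
                break
        outfile.write(f"{text}\t{label}\n")
    outfile.write("\n")  # blank line marks the end of the sentence


tokens = [("I", 0, 1), ("love", 2, 6), ("New", 7, 10), ("York", 11, 15)]
links = [(7, 15, "New_York_City")]
buffer = io.StringIO()
tokens_to_cols(tokens, links, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the sketch free of file handling; any object with a `write` method works as `outfile`.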
- _fill_annot_array(annot_array, key, post_flag)
Fill the array containing information about the entity mention annotations.
- Parameters:
annot_array (list) – Array to be filled.
key (str) – Reddit id on which the post title/comment thread is matched with its corresponding annotation.
post_flag (bool) – Flag indicating whether the annotations are collected for the post titles or for the comment threads.
- Return type: list
- _fill_curr_comment(fix_flag)
Extend the string containing the current comment thread while the comments are parsed. The string is then passed to the _text_to_cols method.
- Parameters:
fix_flag (bool) – Flag indicating whether the method is called while parsing the incorrectly imported rows or the regular rows.
- flair.datasets.entity_linking.from_ufsac_to_tsv(xml_file, conll_file, datasetname, encoding='utf8', cut_multisense=True)
Convert a file in UFSAC format into tab-separated column format, written to a new file.
- Parameters:
xml_file (Union[str, Path]) – Path to the XML file.
conll_file (Union[str, Path]) – Path for the new CoNLL file.
datasetname (str) – Name of the dataset from UFSAC, needed because multi-word spans are handled differently across the datasets.
encoding (str, optional) – Encoding used in the open function. The default is “utf8”.
cut_multisense (bool, optional) – Determines whether the wn30_key tag should be cut if it contains multiple possible senses. If True, only the first listed sense is used. Otherwise, the whole list of senses is treated as one new sense. The default is True.
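The two behaviours of cut_multisense can be sketched on a single wn30_key value. The helper `resolve_sense` is hypothetical, and the sketch assumes that multiple senses are separated by ';' in the key; the sample key is made up.

```python
def resolve_sense(wn30_key: str, cut_multisense: bool = True) -> str:
    """Return the sense label derived from a wn30_key value.

    Assumption: multiple senses are separated by ';'.
    """
    if cut_multisense:
        # Keep only the first listed sense.
        return wn30_key.split(";")[0]
    # Keep the whole list, which then acts as one combined sense label.
    return wn30_key


key = "bank%1:14:00::;bank%1:17:01::"
print(resolve_sense(key, cut_multisense=True))   # bank%1:14:00::
print(resolve_sense(key, cut_multisense=False))  # bank%1:14:00::;bank%1:17:01::
```

With cut_multisense=False, every distinct combination of senses becomes its own label, which enlarges the label set.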
- flair.datasets.entity_linking.determine_tsv_file(filename, data_folder, cut_multisense=True)
Check whether the converted .tsv file already exists and, if not, create it.
- Parameters:
filename (str) – The name of the file.
data_folder (Path) – The name of the folder in which the CoNLL file should reside.
cut_multisense (bool) – Determines whether the wn30_key tag should be cut if it contains multiple possible senses. If True, only the first listed sense is used. Otherwise, the whole list of senses is treated as one new sense. The default is True.
- Return type: str
- Returns: The name of the file.
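The check-then-create pattern described here can be sketched with pathlib. This is an illustration under stated assumptions, not the function's code: `ensure_converted` and `fake_convert` are hypothetical, and the '_cut' suffix and '.tsv' extension are naming assumptions for this sketch.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def ensure_converted(
    filename: str,
    data_folder: Path,
    convert: Callable[[Path], None],
    cut_multisense: bool = True,
) -> str:
    """Return the name of the converted file, creating it only if it is missing.

    The '_cut' suffix and '.tsv' extension are naming assumptions for this sketch.
    """
    suffix = "_cut" if cut_multisense else ""
    target_name = f"{filename}{suffix}.tsv"
    target = data_folder / target_name
    if not target.exists():
        convert(target)  # only runs on the first call
    return target_name


calls: List[str] = []


def fake_convert(path: Path) -> None:
    """Stand-in for the real conversion; records the call and writes a stub file."""
    calls.append(path.name)
    path.write_text("token\tsense\n", encoding="utf8")


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    first = ensure_converted("semcor", folder, fake_convert)
    second = ensure_converted("semcor", folder, fake_convert)

print(first, second, len(calls))  # semcor_cut.tsv semcor_cut.tsv 1
```

The second call finds the existing file and skips the conversion, which is why `calls` holds a single entry.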
- class flair.datasets.entity_linking.WSD_UFSAC(filenames=['masc', 'semcor'], base_path=None, in_memory=True, cut_multisense=True, columns={0: 'text', 3: 'sense'}, banned_sentences=None, sample_missing_splits_in_multicorpus=True, sample_missing_splits_in_each_corpus=True, use_raganato_ALL_as_test_data=False, name='multicorpus')
Bases: MultiCorpus
- __init__(filenames=['masc', 'semcor'], base_path=None, in_memory=True, cut_multisense=True, columns={0: 'text', 3: 'sense'}, banned_sentences=None, sample_missing_splits_in_multicorpus=True, sample_missing_splits_in_each_corpus=True, use_raganato_ALL_as_test_data=False, name='multicorpus')
Initialize a custom corpus from any Word Sense Disambiguation (WSD) datasets in the UFSAC format.
See getalp/UFSAC.
The first time the constructor is called, the data is downloaded automatically and transformed from XML into a tab-separated column format. Since only the WordNet 3.0 version of the senses is consistently available for all provided datasets, only this version is considered. The id annotation used in datasets that were originally created for evaluation tasks is ignored.
- Parameters:
filenames (Union[str, List[str]]) – A single dataset name or a list of dataset names. The available names are: ‘masc’, ‘omsti’, ‘raganato_ALL’, ‘raganato_semeval2007’, ‘raganato_semeval2013’, ‘raganato_semeval2015’, ‘raganato_senseval2’, ‘raganato_senseval3’, ‘semcor’, ‘semeval2007task17’, ‘semeval2007task7’, ‘semeval2013task12’, ‘semeval2015task13’, ‘senseval2’, ‘senseval2_lexical_sample_test’, ‘senseval2_lexical_sample_train’, ‘senseval3task1’, ‘senseval3task6_test’, ‘senseval3task6_train’, ‘trainomatic’, ‘wngt’.
base_path (Union[str, Path, None]) – You can override this to point to a specific folder but typically this should not be necessary.
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
cut_multisense (bool) – Determines whether the wn30_key tag should be cut if it contains multiple possible senses. If True, only the first listed sense is used and the suffix ‘_cut’ is added to the name of the CoNLL file. Otherwise, the whole list of senses is treated as one new sense. The default is True.
columns – Columns to consider when loading the dataset. You can add 1: “lemma” or 2: “pos” to the default dict {0: “text”, 3: “sense”} if you want to use additional pos and/or lemma for the words.
banned_sentences (Optional[List[str]]) – Optionally remove sentences from the corpus. Works only if in_memory is True.
sample_missing_splits_in_multicorpus (Union[bool, str]) – Whether to sample missing splits when loading the multicorpus (this is redundant if sample_missing_splits_in_each_corpus is True).
sample_missing_splits_in_each_corpus (Union[bool, str]) – Whether to sample missing splits when loading each single corpus given in filenames.
use_raganato_ALL_as_test_data (bool) – If True, the raganato_ALL dataset (Raganato et al., “Word Sense Disambiguation: A unified evaluation framework and empirical comparison”) is used as test data. Note that in this case the sample_missing_splits parameters are set to ‘only_dev’ if they were set to True.
name (str) – Name of your corpus.
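As a usage sketch for the parameters above, the following builds a columns dict that adds the optional lemma and pos columns and passes it to WSD_UFSAC. The helpers `wsd_ufsac_kwargs` and `load_wsd_corpus` are hypothetical conveniences, not part of Flair. The constructor call itself is left commented out because it requires flair to be installed and downloads the UFSAC data on first use.

```python
from typing import Any, Dict, List


def wsd_ufsac_kwargs(
    filenames: List[str], with_lemma: bool = False, with_pos: bool = False
) -> Dict[str, Any]:
    """Build keyword arguments for WSD_UFSAC (hypothetical helper)."""
    columns = {0: "text", 3: "sense"}  # default columns from the signature
    if with_lemma:
        columns[1] = "lemma"
    if with_pos:
        columns[2] = "pos"
    return {
        "filenames": filenames,
        "columns": dict(sorted(columns.items())),
        "cut_multisense": True,
    }


def load_wsd_corpus(**kwargs: Any):
    """Create the corpus; requires flair and triggers the download on first use."""
    # Imported lazily so that this sketch can be read and run without flair.
    from flair.datasets.entity_linking import WSD_UFSAC

    return WSD_UFSAC(**kwargs)


kwargs = wsd_ufsac_kwargs(["semcor", "masc"], with_pos=True)
print(kwargs["columns"])  # {0: 'text', 2: 'pos', 3: 'sense'}
# corpus = load_wsd_corpus(**kwargs)  # uncomment to download and load the data
```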
- class flair.datasets.entity_linking.WSD_RAGANATO_ALL(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True)
Bases: ColumnCorpus
- __init__(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True)
Initialize raganato_ALL (concatenation of all SensEval and SemEval all-words tasks) provided in UFSAC.
See getalp/UFSAC. When the corpus is first initialized, the whole UFSAC data is downloaded.
- class flair.datasets.entity_linking.WSD_SEMCOR(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)
Bases: ColumnCorpus
- __init__(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)
Initialize SemCor provided in UFSAC.
See getalp/UFSAC. When the corpus is first initialized, the whole UFSAC data is downloaded.
- class flair.datasets.entity_linking.WSD_WORDNET_GLOSS_TAGGED(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, use_raganato_ALL_as_test_data=False)
Bases: ColumnCorpus
- __init__(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, use_raganato_ALL_as_test_data=False)
Initialize the Princeton WordNet Gloss Corpus provided in UFSAC.
See getalp/UFSAC. When the corpus is first initialized, the whole UFSAC data is downloaded.
- class flair.datasets.entity_linking.WSD_MASC(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)
Bases: ColumnCorpus
- __init__(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)
Initialize MASC (Manually Annotated Sub-Corpus) provided in UFSAC.
See getalp/UFSAC. When the corpus is first initialized, the whole UFSAC data is downloaded.
- class flair.datasets.entity_linking.WSD_OMSTI(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)
Bases: ColumnCorpus
- __init__(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, cut_multisense=True, use_raganato_ALL_as_test_data=False)
Initialize OMSTI (One Million Sense-Tagged Instances) provided in UFSAC.
See getalp/UFSAC. When the corpus is first initialized, the whole UFSAC data is downloaded.
- class flair.datasets.entity_linking.WSD_TRAINOMATIC(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, use_raganato_ALL_as_test_data=False)
Bases: ColumnCorpus
- __init__(base_path=None, in_memory=True, columns={0: 'text', 3: 'sense'}, label_name_map=None, banned_sentences=None, sample_missing_splits=True, use_raganato_ALL_as_test_data=False)
Initialize Train-O-Matic provided in UFSAC.
See getalp/UFSAC. When the corpus is first initialized, the whole UFSAC data is downloaded.