flair.datasets.entity_linking.WSD_UFSAC
- class flair.datasets.entity_linking.WSD_UFSAC(filenames=['masc', 'semcor'], base_path=None, in_memory=True, cut_multisense=True, columns={0: 'text', 3: 'sense'}, banned_sentences=None, sample_missing_splits_in_multicorpus=True, sample_missing_splits_in_each_corpus=True, use_raganato_ALL_as_test_data=False, name='multicorpus')
Bases: MultiCorpus
- __init__(filenames=['masc', 'semcor'], base_path=None, in_memory=True, cut_multisense=True, columns={0: 'text', 3: 'sense'}, banned_sentences=None, sample_missing_splits_in_multicorpus=True, sample_missing_splits_in_each_corpus=True, use_raganato_ALL_as_test_data=False, name='multicorpus')
Initialize a custom corpus with any Word Sense Disambiguation (WSD) datasets in the UFSAC format.
See getalp/UFSAC.
If the constructor is called for the first time, the data is automatically downloaded and transformed from XML to a tab-separated column format. Since only the WordNet 3.0 sense inventory is consistently available for all provided datasets, we only consider this version. We also ignore the id annotation used in datasets that were originally created for evaluation tasks.
- Parameters:
  - filenames (Union[str, list[str]]) – A single dataset name or a list of dataset names. The available names are: 'masc', 'omsti', 'raganato_ALL', 'raganato_semeval2007', 'raganato_semeval2013', 'raganato_semeval2015', 'raganato_senseval2', 'raganato_senseval3', 'semcor', 'semeval2007task17', 'semeval2007task7', 'semeval2013task12', 'semeval2015task13', 'senseval2', 'senseval2_lexical_sample_test', 'senseval2_lexical_sample_train', 'senseval3task1', 'senseval3task6_test', 'senseval3task6_train', 'trainomatic', 'wngt'.
  - base_path (Union[str, Path, None]) – You can override this to point to a specific folder, but typically this should not be necessary.
  - in_memory (bool) – If True, keeps the dataset in memory, giving speedups in training.
  - cut_multisense (bool) – Determines whether the wn30_key tag should be cut if it contains multiple possible senses. If True, only the first listed sense is used and the suffix '_cut' is added to the name of the CoNLL file. Otherwise, the whole list of senses is treated as one new sense. The default is True.
  - columns – Columns to consider when loading the dataset. You can add 1: "lemma" or 2: "pos" to the default dict {0: "text", 3: "sense"} if you want to use additional lemma and/or POS information for the words.
  - banned_sentences (Optional[list[str]]) – Optionally remove sentences from the corpus. Works only if in_memory is True.
  - sample_missing_splits_in_multicorpus (Union[bool, str]) – Whether to sample missing splits when loading the multicorpus (this is redundant if sample_missing_splits_in_each_corpus is True).
  - sample_missing_splits_in_each_corpus (Union[bool, str]) – Whether to sample missing splits when loading each single corpus given in filenames.
  - use_raganato_ALL_as_test_data (bool) – If True, the raganato_ALL dataset (Raganato et al., "Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison") is used as test data. Note that if this is set to True, the sample_missing_splits parameters are set to 'only_dev'.
  - name (str) – Name of your corpus.
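A minimal usage sketch based on the signature above; the dataset names come from the filenames list and the import path follows this class's module:

```python
from flair.datasets.entity_linking import WSD_UFSAC

# Load the MASC and SemCor datasets in UFSAC format. On first use, the
# data is downloaded and converted from XML to the tab-separated column
# format described above.
corpus = WSD_UFSAC(filenames=["masc", "semcor"])

# Print a summary of the train/dev/test splits.
print(corpus)
```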
Methods
- __init__([filenames, base_path, in_memory, ...]) – Initialize a custom corpus with any Word Sense Disambiguation (WSD) datasets in the UFSAC format.
- add_label_noise(label_type, labels[, ...]) – Generates uniform label noise distribution in the chosen dataset split.
- downsample([percentage, downsample_train, ...]) – Randomly downsample the corpus to the given percentage (by removing data points).
- filter_empty_sentences() – A method that filters all sentences consisting of 0 tokens.
- filter_long_sentences(max_charlength) – A method that filters all sentences for which the plain text is longer than a specified number of characters.
- get_all_sentences() – Returns all sentences (spanning all three splits) in the Corpus.
- get_label_distribution() – Counts occurrences of each label in the corpus and returns them as a dictionary object.
- make_label_dictionary(label_type[, ...]) – Creates a dictionary of all labels assigned to the sentences in the corpus.
- make_tag_dictionary(tag_type) – Create a tag dictionary of a given label type.
- make_vocab_dictionary([max_tokens, min_freq]) – Creates a Dictionary of all tokens contained in the corpus.
- obtain_statistics([label_type, pretty_print]) – Print statistics about the corpus, including the length of the sentences and the labels in the corpus.
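For illustration, a hedged sketch combining two of the methods above on the corpus object from the earlier example:

```python
# Randomly keep 10% of the data points for a quick experiment;
# downsample() modifies the corpus in place and returns it.
corpus.downsample(0.1)

# Build a dictionary of all sense labels; the label type "sense"
# corresponds to the default column mapping {0: "text", 3: "sense"}.
sense_dictionary = corpus.make_label_dictionary(label_type="sense")
print(sense_dictionary)
```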
Attributes
- dev – The dev split as a torch.utils.data.Dataset object.
- test – The test split as a torch.utils.data.Dataset object.
- train – The training split as a torch.utils.data.Dataset object.
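A short sketch, again assuming the corpus object from the examples above, showing how the splits can be accessed:

```python
# Each split behaves like a torch.utils.data.Dataset of flair Sentences.
print(len(corpus.train), len(corpus.dev), len(corpus.test))

# Inspect the first training sentence together with its annotations.
print(corpus.train[0])
```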