flair.datasets.entity_linking.WSD_UFSAC

class flair.datasets.entity_linking.WSD_UFSAC(filenames=['masc', 'semcor'], base_path=None, in_memory=True, cut_multisense=True, columns={0: 'text', 3: 'sense'}, banned_sentences=None, sample_missing_splits_in_multicorpus=True, sample_missing_splits_in_each_corpus=True, use_raganato_ALL_as_test_data=False, name='multicorpus')

Bases: MultiCorpus

__init__(filenames=['masc', 'semcor'], base_path=None, in_memory=True, cut_multisense=True, columns={0: 'text', 3: 'sense'}, banned_sentences=None, sample_missing_splits_in_multicorpus=True, sample_missing_splits_in_each_corpus=True, use_raganato_ALL_as_test_data=False, name='multicorpus')

Initialize a custom corpus with any Word Sense Disambiguation (WSD) datasets in the UFSAC format.

If the constructor is called for the first time, the data is automatically downloaded and transformed from XML to a tab-separated column format. Since only the WordNet 3.0 sense inventory is consistently available for all provided datasets, only this version is considered. The id annotations used in datasets that were originally created for evaluation tasks are also ignored.

Parameters:
  • filenames (Union[str, list[str]]) – A single dataset name or a list of dataset names. The available names are: ‘masc’, ‘omsti’, ‘raganato_ALL’, ‘raganato_semeval2007’, ‘raganato_semeval2013’, ‘raganato_semeval2015’, ‘raganato_senseval2’, ‘raganato_senseval3’, ‘semcor’, ‘semeval2007task17’, ‘semeval2007task7’, ‘semeval2013task12’, ‘semeval2015task13’, ‘senseval2’, ‘senseval2_lexical_sample_test’, ‘senseval2_lexical_sample_train’, ‘senseval3task1’, ‘senseval3task6_test’, ‘senseval3task6_train’, ‘trainomatic’, ‘wngt’.

  • base_path (Union[str, Path, None]) – You can override this to point to a specific folder, but typically this should not be necessary.

  • in_memory (bool) – If True, keeps the dataset in memory, which yields speedups during training.

  • cut_multisense (bool) – Determines whether the wn30_key tag should be cut if it contains multiple possible senses. If True, only the first listed sense is used and the suffix ‘_cut’ is added to the name of the CoNLL file. Otherwise the whole list of senses is treated as one new sense. The default is True.

  • columns – Columns to consider when loading the dataset. You can add 1: “lemma” or 2: “pos” to the default dict {0: “text”, 3: “sense”} if you additionally want to use lemma and/or POS information for the words (as in the usage sketch below).

  • banned_sentences (Optional[list[str]]) – Optionally remove sentences from the corpus. Works only if in_memory is True.

  • sample_missing_splits_in_multicorpus (Union[bool, str]) – Whether to sample missing splits when loading the multicorpus (this is redundant if sample_missing_splits_in_each_corpus is True).

  • sample_missing_splits_in_each_corpus (Union[bool, str]) – Whether to sample missing splits when loading each individual corpus given in filenames.

  • use_raganato_ALL_as_test_data (bool) – If True, the raganato_ALL dataset (Raganato et al., “Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison”) is used as test data. Note that in this case the sample_missing_splits parameters are set to ‘only_dev’.

  • name (str) – Name of your corpus.
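
Below is a minimal usage sketch, assuming WSD_UFSAC is re-exported from flair.datasets as in current flair releases; the dataset names and column indices follow the parameter descriptions above.

from flair.datasets import WSD_UFSAC

# Load SemCor and MASC in UFSAC format. cut_multisense=True keeps only the
# first sense of multi-sense annotations; columns 1 and 2 additionally read
# lemma and POS information next to the default text and sense columns.
corpus = WSD_UFSAC(
    filenames=["semcor", "masc"],
    cut_multisense=True,
    columns={0: "text", 1: "lemma", 2: "pos", 3: "sense"},
)
print(corpus)  # prints the sizes of the train/dev/test splits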

Methods

__init__([filenames, base_path, in_memory, ...]) – Initialize a custom corpus with any Word Sense Disambiguation (WSD) datasets in the UFSAC format.

add_label_noise(label_type, labels[, ...]) – Generates a uniform label noise distribution in the chosen dataset split.

downsample([percentage, downsample_train, ...]) – Randomly downsample the corpus to the given percentage (by removing data points).

filter_empty_sentences() – Filters out all sentences consisting of 0 tokens.

filter_long_sentences(max_charlength) – Filters out all sentences whose plain text is longer than the specified number of characters.

get_all_sentences() – Returns all sentences (spanning all three splits) in the Corpus.

get_label_distribution() – Counts occurrences of each label in the corpus and returns them as a dictionary object.

make_label_dictionary(label_type[, ...]) – Creates a dictionary of all labels assigned to the sentences in the corpus.

make_tag_dictionary(tag_type) – Creates a tag dictionary of a given label type.

make_vocab_dictionary([max_tokens, min_freq]) – Creates a Dictionary of all tokens contained in the corpus.

obtain_statistics([label_type, pretty_print]) – Prints statistics about the corpus, including the length of the sentences and the labels in the corpus.

Attributes

dev – The dev split as a torch.utils.data.Dataset object.

test – The test split as a torch.utils.data.Dataset object.

train – The training split as a torch.utils.data.Dataset object.