flair.datasets.entity_linking.NEL_ENGLISH_AQUAINT
- class flair.datasets.entity_linking.NEL_ENGLISH_AQUAINT(base_path=None, in_memory=True, agreement_threshold=0.5, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)
Bases: ColumnCorpus
- __init__(base_path=None, in_memory=True, agreement_threshold=0.5, sentence_splitter=<flair.splitter.SegtokSentenceSplitter object>, **corpusargs)
Initialize the AQUAINT Entity Linking corpus,
introduced in: D. Milne and I. H. Witten, "Learning to Link with Wikipedia" (https://www.cms.waikato.ac.nz/~ihw/papers/08-DNM-IHW-LearningToLinkWithWikipedia.pdf). The first time you call the constructor, the dataset is automatically downloaded and transformed into tab-separated column format (aquaint.txt).
- Parameters:
base_path (Union[str, Path], optional) – Default is None, meaning the corpus is automatically downloaded and loaded. You can override this to point to a different folder, but typically this should not be necessary.
in_memory (bool) – If True, keeps the dataset in memory, which speeds up training.
agreement_threshold (float) – Some link annotations come with an agreement_score representing the agreement among the human annotators. The score ranges from 0.2 (lowest) to 1.0 (highest). The lower the score, the less "important" the entity, because fewer annotators thought it was worth linking. The default is 0.5, meaning a majority of annotators must have annotated the respective entity mention.
sentence_splitter (SentenceSplitter) – The sentence splitter used to split the articles into sentences.
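As a minimal sketch of how the agreement_threshold cutoff behaves: an annotation survives only if its annotator agreement score meets the threshold. The helper below (keep_annotation) is purely illustrative and not part of the flair API; the exact comparison flair uses internally is an assumption here. The commented-out corpus construction shows the documented constructor, which downloads the data on first use.

```python
def keep_annotation(agreement_score: float, threshold: float = 0.5) -> bool:
    """Illustrative helper (not part of flair): keep a link annotation
    only if enough annotators agreed on it. Whether the cutoff is
    inclusive (>=) is an assumption for this sketch."""
    return agreement_score >= threshold

# Scores range from 0.2 (few annotators linked the mention) to 1.0 (all did).
scores = [0.2, 0.4, 0.5, 0.8, 1.0]
kept = [s for s in scores if keep_annotation(s)]
print(kept)  # [0.5, 0.8, 1.0] -- only scores at or above the 0.5 default survive

# Actual corpus loading (requires flair; downloads the data on first call):
# from flair.datasets import NEL_ENGLISH_AQUAINT
# corpus = NEL_ENGLISH_AQUAINT(agreement_threshold=0.7)  # stricter cutoff
```

Raising agreement_threshold trades recall for precision: fewer, but more confidently annotated, entity mentions remain in the corpus.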
Methods
__init__([base_path, in_memory, ...])
    Initialize the AQUAINT Entity Linking corpus.
add_label_noise(label_type, labels[, ...])
    Generates uniform label noise distribution in the chosen dataset split.
downsample([percentage, downsample_train, ...])
    Randomly downsample the corpus to the given percentage (by removing data points).
filter_empty_sentences()
    A method that filters all sentences consisting of 0 tokens.
filter_long_sentences(max_charlength)
    A method that filters all sentences whose plain text is longer than a specified number of characters.
get_all_sentences()
    Returns all sentences (spanning all three splits) in the Corpus.
get_label_distribution()
    Counts occurrences of each label in the corpus and returns them as a dictionary object.
make_label_dictionary(label_type[, ...])
    Creates a dictionary of all labels assigned to the sentences in the corpus.
make_tag_dictionary(tag_type)
    Create a tag dictionary of a given label type.
make_vocab_dictionary([max_tokens, min_freq])
    Creates a Dictionary of all tokens contained in the corpus.
obtain_statistics([label_type, pretty_print])
    Print statistics about the corpus, including the length of the sentences and the labels in the corpus.
Attributes
dev
    The dev split as a torch.utils.data.Dataset object.
test
    The test split as a torch.utils.data.Dataset object.
train
    The training split as a torch.utils.data.Dataset object.