flair.datasets.sequence_labeling.NER_NERMUD

class flair.datasets.sequence_labeling.NER_NERMUD(domains='all', base_path=None, in_memory=False, **corpusargs)

Bases: MultiCorpus

__init__(domains='all', base_path=None, in_memory=False, **corpusargs)

Initialize the NERMuD 2023 dataset.

NERMuD is a task presented at EVALITA 2023 that consists of extracting and classifying named entities in a document, such as persons, organizations, and locations. NERMuD 2023 includes two sub-tasks:

Domain-agnostic classification (DAC). Participants are asked to select and classify entities among three categories (person, organization, location) in different types of texts (news, fiction, political speeches) using a single general model.
Domain-specific classification (DSC). Participants are asked to deploy a different model for each of the above text types, with the aim of improving accuracy on each type.

Parameters:

domains (Union[str, list[str]]) – Domains to be used. Supported values are “WN” (Wikinews), “FIC” (fiction), “ADG” (De Gasperi subset) and “all”.
base_path (Union[str, Path, None]) – Defaults to None, in which case the corpus is downloaded automatically and loaded. You can override this to point to a different folder, but this is typically not necessary.
in_memory (bool) – If True, keeps the dataset in memory, which speeds up training. Not recommended due to heavy RAM usage.
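
The following is a minimal usage sketch, not part of the library itself. It assumes flair is installed and that the corpus can be downloaded on first use. The names `SUPPORTED_DOMAINS`, `resolve_domains` and `load_nermud` are hypothetical helpers introduced here for illustration. They only mirror the domain codes listed above and do not claim to reproduce the constructor's internal validation.

```python
from __future__ import annotations

from typing import Union

# Domain codes as listed in the parameter description above.
SUPPORTED_DOMAINS = ["WN", "FIC", "ADG"]


def resolve_domains(domains: Union[str, list[str]] = "all") -> list[str]:
    """Hypothetical helper: expand "all" and reject unknown domain codes."""
    if domains == "all":
        return list(SUPPORTED_DOMAINS)
    # Accept a single code such as "WN" as well as a list of codes.
    selected = [domains] if isinstance(domains, str) else list(domains)
    unknown = [d for d in selected if d not in SUPPORTED_DOMAINS]
    if unknown:
        raise ValueError(f"Unsupported domain(s): {unknown}")
    return selected


def load_nermud(domains: Union[str, list[str]] = "all"):
    """Hypothetical wrapper: build the corpus for the requested domains."""
    # Imported lazily so that resolve_domains stays usable without flair.
    from flair.datasets.sequence_labeling import NER_NERMUD

    # base_path and in_memory are left at their defaults (None, False).
    return NER_NERMUD(domains=resolve_domains(domains))
```

Under these assumptions, `load_nermud("all")` would yield one corpus covering every domain, which matches the domain-agnostic (DAC) setting, while calling `load_nermud("WN")`, `load_nermud("FIC")` and `load_nermud("ADG")` separately would give one corpus per domain for the domain-specific (DSC) setting.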

Methods

`__init__`([domains, base_path, in_memory])	Initialize the NERMuD 2023 dataset.
`add_label_noise`(label_type, labels[, ...])	Adds artificial label noise to a specified split (in-place).
`downsample`([percentage, downsample_train, ...])	Randomly downsample the corpus to the given percentage (by removing data points).
`filter_empty_sentences`()	A method that filters all sentences consisting of 0 tokens.
`filter_long_sentences`(max_charlength)	A method that filters all sentences for which the plain text is longer than a specified number of characters.
`get_all_sentences`()	Returns all sentences (spanning all three splits) in the `Corpus`.
`get_label_distribution`()	Counts occurrences of each label in the corpus and returns them as a dictionary object.
`make_label_dictionary`(label_type[, ...])	Creates a Dictionary for a specific label type from the corpus.
`make_tag_dictionary`(tag_type)	DEPRECATED: Creates tag dictionary ensuring 'O', '<START>', '<STOP>'.
`make_vocab_dictionary`([max_tokens, min_freq])	Creates a `Dictionary` of all tokens contained in the corpus.
`obtain_statistics`([label_type, pretty_print])	Print statistics about the corpus, including the length of the sentences and the labels in the corpus.

Attributes

`corpus_tokenizer`	Returns the custom tokenizer provided during corpus initialization for retokenization, if any.
`dev`	The dev split as a `torch.utils.data.Dataset` object.
`test`	The test split as a `torch.utils.data.Dataset` object.
`train`	The training split as a `torch.utils.data.Dataset` object.
