flair.datasets.sequence_labeling.NER_GERMAN_MOBIE#

class flair.datasets.sequence_labeling.NER_GERMAN_MOBIE(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the German MobIE NER dataset.

The German MobIE Dataset was introduced in the MobIE paper (https://aclanthology.org/2021.konvens-1.22/).

This is a German-language dataset that has been human-annotated with 20 coarse- and fine-grained entity types, and it includes entity linking information for geographically linkable entities. The dataset comprises 3,232 social media texts and traffic reports, totaling 91K tokens, with 20.5K annotated entities, of which 13.1K are linked to a knowledge base. In total, 20 different named entities are annotated. :type base_path: Union[str, Path, None] :param base_path: Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary. :type in_memory: bool :param in_memory: If True, keeps dataset in memory giving speedups in training. Not recommended due to heavy RAM usage.

Methods

__init__([base_path, in_memory])

Initialize the German MobIE NER dataset.

add_label_noise(label_type, labels[, ...])

Generates uniform label noise distribution in the chosen dataset split.

downsample([percentage, downsample_train, ...])

Randomly downsample the corpus to the given percentage (by removing data points).

filter_empty_sentences()

A method that filters all sentences consisting of 0 tokens.

filter_long_sentences(max_charlength)

A method that filters all sentences for which the plain text is longer than a specified number of characters.

get_all_sentences()

Returns all sentences (spanning all three splits) in the Corpus.

get_label_distribution()

Counts occurrences of each label in the corpus and returns them as a dictionary object.

make_label_dictionary(label_type[, ...])

Creates a dictionary of all labels assigned to the sentences in the corpus.

make_tag_dictionary(tag_type)

Create a tag dictionary of a given label type.

make_vocab_dictionary([max_tokens, min_freq])

Creates a Dictionary of all tokens contained in the corpus.

obtain_statistics([label_type, pretty_print])

Print statistics about the corpus, including the length of the sentences and the labels in the corpus.

Attributes

dev

The dev split as a torch.utils.data.Dataset object.

test

The test split as a torch.utils.data.Dataset object.

train

The training split as a torch.utils.data.Dataset object.