flair.datasets.sequence_labeling.NER_ICDAR_EUROPEANA
- class flair.datasets.sequence_labeling.NER_ICDAR_EUROPEANA(language, base_path=None, in_memory=True, **corpusargs)
Bases:
ColumnCorpus
- __init__(language, base_path=None, in_memory=True, **corpusargs)
Initialize the ICDAR Europeana NER dataset.
The dataset is based on the French and Dutch Europeana NER corpora from the Europeana Newspapers NER dataset (https://lab.kb.nl/dataset/europeana-newspapers-ner), with additional preprocessing steps applied (sentence splitting, punctuation normalization, training/development/test splits). The resulting dataset was released with the ICDAR paper “Data Centric Domain Adaptation for Historical Text with OCR Errors” by Luisa März, Stefan Schweter, Nina Poerner, Benjamin Roth and Hinrich Schütze.
Parameters:
- language (str) – Language of the dataset to load. Supported languages are “fr” (French) and “nl” (Dutch).
- base_path (Union[str, Path, None]) – Default is None, meaning the corpus is auto-downloaded and loaded. You can override this to point to a different folder, but this is typically not necessary.
- in_memory (bool) – If True, keeps the dataset in memory, giving speedups in training. Not recommended due to heavy RAM usage.
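A minimal usage sketch (assuming flair is installed; the corpus is downloaded automatically on first use):

```python
# Load the French ICDAR Europeana NER corpus via flair's dataset API.
from flair.datasets import NER_ICDAR_EUROPEANA

corpus = NER_ICDAR_EUROPEANA(language="fr")

# A corpus holds train/dev/test splits of labeled sentences.
print(corpus)           # summary of split sizes
print(corpus.train[0])  # first training sentence, with its NER annotations
```

Pass language="nl" instead for the Dutch variant.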
Methods
- __init__(language[, base_path, in_memory]) – Initialize the ICDAR Europeana NER dataset.
- add_label_noise(label_type, labels[, ...]) – Generates uniform label noise distribution in the chosen dataset split.
- downsample([percentage, downsample_train, ...]) – Randomly downsample the corpus to the given percentage (by removing data points).
- filter_empty_sentences() – Filters out all sentences consisting of 0 tokens.
- filter_long_sentences(max_charlength) – Filters out all sentences whose plain text is longer than the specified number of characters.
- get_all_sentences() – Returns all sentences (spanning all three splits) in the Corpus.
- get_label_distribution() – Counts occurrences of each label in the corpus and returns them as a dictionary object.
- make_label_dictionary(label_type[, ...]) – Creates a dictionary of all labels assigned to the sentences in the corpus.
- make_tag_dictionary(tag_type) – Create a tag dictionary of a given label type.
- make_vocab_dictionary([max_tokens, min_freq]) – Creates a Dictionary of all tokens contained in the corpus.
- obtain_statistics([label_type, pretty_print]) – Print statistics about the corpus, including the length of the sentences and the labels in the corpus.
Attributes
- dev – The dev split as a torch.utils.data.Dataset object.
- test – The test split as a torch.utils.data.Dataset object.
- train – The training split as a torch.utils.data.Dataset object.
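The splits and inherited ColumnCorpus methods combine naturally when preparing a tagger. A sketch (assuming flair is installed and the "ner" label type, which the corpus provides):

```python
# Inspect the splits and build a label dictionary for sequence tagging.
from flair.datasets import NER_ICDAR_EUROPEANA

corpus = NER_ICDAR_EUROPEANA(language="nl")

# dev/test/train are torch.utils.data.Dataset objects, so len() works.
print(len(corpus.train), len(corpus.dev), len(corpus.test))

# Collect all NER tags in the corpus, e.g. to configure a SequenceTagger.
label_dict = corpus.make_label_dictionary(label_type="ner")
print(label_dict)
```

For quick experiments, corpus.downsample(0.1) shrinks all splits to roughly 10% of their original size before training.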