flair.datasets.text_image.FeideggerCorpus#
class flair.datasets.text_image.FeideggerCorpus(**kwargs)#
Bases: Corpus

__init__(**kwargs)#
Initializes a Corpus, potentially sampling missing dev/test splits from train.
You can define the train, dev, and test splits by passing the corresponding Dataset objects to the constructor. At least one split should be defined. If the option sample_missing_splits is set to True, missing splits will be randomly sampled from the train split. In most cases, you will not use the constructor yourself. Rather, you will create a corpus using one of our helper methods that read common NLP file types. For instance, you can use flair.datasets.sequence_labeling.ColumnCorpus to read CoNLL-formatted files directly into a Corpus.

Parameters:
train (Optional[Dataset[T_co]], optional) – Training data. Defaults to None.
dev (Optional[Dataset[T_co]], optional) – Development data. Defaults to None.
test (Optional[Dataset[T_co]], optional) – Testing data. Defaults to None.
name (str, optional) – Corpus name. Defaults to “corpus”.
sample_missing_splits (Union[bool, str], optional) – Policy for handling missing splits. True (default): sample dev(10%)/test(10%) from train. False: keep None. “only_dev”: sample only dev. “only_test”: sample only test.
random_seed (Optional[int], optional) – Seed for reproducible sampling. Defaults to None.
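The sample_missing_splits policy above can be sketched in plain Python. This is an illustrative stand-in, not flair's implementation; the function and variable names here are assumptions:

```python
import random

def resolve_splits(train, dev=None, test=None,
                   sample_missing_splits=True, random_seed=None):
    """Sketch of the split-sampling policy described above (hypothetical)."""
    rng = random.Random(random_seed)
    remaining = list(train)

    def sample_fraction(fraction):
        # Draw a random 10% subset and remove it from the remaining train data.
        nonlocal remaining
        k = round(len(remaining) * fraction)
        idx = set(rng.sample(range(len(remaining)), k))
        drawn = [x for i, x in enumerate(remaining) if i in idx]
        remaining = [x for i, x in enumerate(remaining) if i not in idx]
        return drawn

    if test is None and sample_missing_splits in (True, "only_test"):
        test = sample_fraction(0.1)   # 10% of train becomes test
    if dev is None and sample_missing_splits in (True, "only_dev"):
        dev = sample_fraction(0.1)    # 10% of the remainder becomes dev
    return remaining, dev, test
```

With the default policy and a 100-point train split, this yields 10 test points, then 9 dev points from the remaining 90, leaving 81 for training.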
Methods
__init__(**kwargs) Initializes a Corpus, potentially sampling missing dev/test splits from train.
add_label_noise(label_type, labels[, ...]) Adds artificial label noise to a specified split (in-place).
downsample([percentage, downsample_train, ...]) Randomly downsamples the corpus to the given percentage (by removing data points).
filter_empty_sentences() Filters out all sentences consisting of 0 tokens.
filter_long_sentences(max_charlength) Filters out all sentences whose plain text is longer than the specified number of characters.
get_all_sentences() Returns all sentences (spanning all three splits) in the Corpus.
get_label_distribution() Counts occurrences of each label in the corpus and returns them as a dictionary object.
make_label_dictionary(label_type[, ...]) Creates a Dictionary for a specific label type from the corpus.
make_tag_dictionary(tag_type) DEPRECATED: Creates a tag dictionary ensuring 'O', '<START>', '<STOP>'.
make_vocab_dictionary([max_tokens, min_freq]) Creates a Dictionary of all tokens contained in the corpus.
obtain_statistics([label_type, pretty_print]) Prints statistics about the corpus, including the length of the sentences and the labels in the corpus.
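The vocabulary-building behaviour listed above can be illustrated with a small sketch. This is a hypothetical helper mimicking the described semantics of make_vocab_dictionary (count tokens, drop those below min_freq, cap the vocabulary at max_tokens), not flair's actual code:

```python
from collections import Counter

def make_vocab(token_lists, max_tokens=-1, min_freq=1):
    """Hypothetical sketch of make_vocab_dictionary's described behaviour."""
    # Count every token across all sentences (here: plain lists of strings).
    counts = Counter(tok for sent in token_lists for tok in sent)
    # Keep tokens meeting the frequency threshold, most frequent first.
    vocab = [tok for tok, c in counts.most_common() if c >= min_freq]
    # A positive max_tokens caps the vocabulary size; -1 means unlimited.
    if max_tokens > 0:
        vocab = vocab[:max_tokens]
    return vocab
```

In flair itself you would call make_vocab_dictionary on the Corpus instance and receive a Dictionary object rather than a plain list.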
Attributes
corpus_tokenizer Returns the custom tokenizer provided during corpus initialization for retokenization, if any.
dev The dev split as a torch.utils.data.Dataset object.
test The test split as a torch.utils.data.Dataset object.
train The training split as a torch.utils.data.Dataset object.
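A minimal stand-in class can illustrate how the three split attributes relate to get_all_sentences(). The class below is hypothetical; in the real Corpus each split is a torch.utils.data.Dataset rather than a plain list:

```python
class MiniCorpus:
    """Hypothetical stand-in mirroring the attribute layout above."""

    def __init__(self, train, dev=None, test=None):
        # Splits that were not provided (and not sampled) remain None.
        self.train = train
        self.dev = dev
        self.test = test

    def get_all_sentences(self):
        # Concatenates all defined splits, like Corpus.get_all_sentences().
        parts = [self.train, self.dev, self.test]
        return [s for split in parts if split is not None for s in split]
```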