flair.datasets.document_classification.ClassificationCorpus

class flair.datasets.document_classification.ClassificationCorpus(data_folder, label_type='class', train_file=None, test_file=None, dev_file=None, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, sample_missing_splits=True, encoding='utf-8')

Bases: Corpus

A classification corpus from FastText-formatted text files.
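
To make the expected input concrete, the following is a minimal sketch of what a FastText-formatted file looks like: each line starts with one or more `__label__<name>` markers followed by the document text. The helper `parse_fasttext_line` and the sample sentences are hypothetical and only illustrate the layout; they are not flair's own parser.

```python
import tempfile
from pathlib import Path

LABEL_PREFIX = "__label__"  # marker that precedes every label in FastText format


def parse_fasttext_line(line):
    """Illustrative helper (not part of flair): split a line into (labels, text)."""
    words = line.strip().split()
    labels = []
    idx = 0
    # Leading words that carry the prefix are labels; the rest is the text.
    while idx < len(words) and words[idx].startswith(LABEL_PREFIX):
        labels.append(words[idx][len(LABEL_PREFIX):])
        idx += 1
    return labels, " ".join(words[idx:])


# Made-up sample data: one single-label line and one multi-label line.
sample_lines = [
    "__label__positive I really enjoyed this film.",
    "__label__negative __label__boring The plot dragged on forever.",
]

with tempfile.TemporaryDirectory() as tmp:
    train_path = Path(tmp) / "train.txt"
    train_path.write_text("\n".join(sample_lines) + "\n", encoding="utf-8")
    parsed = [
        parse_fasttext_line(line)
        for line in train_path.read_text(encoding="utf-8").splitlines()
    ]

for labels, text in parsed:
    print(labels, "|", text)
```

A file laid out like `train.txt` above is the kind of input the corpus class reads from `data_folder`.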

__init__(data_folder, label_type='class', train_file=None, test_file=None, dev_file=None, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, sample_missing_splits=True, encoding='utf-8')

Instantiates a Corpus from text classification-formatted task data.

Parameters:

data_folder (Union[str, Path]) – base folder with the task data
label_type (str) – name of the label type under which the loaded labels are stored (default is 'class')
train_file – the name of the train file
test_file – the name of the test file
dev_file – the name of the dev file; if None, dev data is sampled from train
truncate_to_max_tokens (int) – If set, truncates each Sentence to a maximum number of tokens
truncate_to_max_chars (int) – If set, truncates each Sentence to a maximum number of chars
filter_if_longer_than (int) – If set, filters out documents that are longer than the specified number of tokens.
tokenizer (Union[bool, Tokenizer]) – Tokenizer for dataset, default is SegtokTokenizer
memory_mode (str) – Sets to what degree the corpus is kept in memory (‘full’, ‘partial’ or ‘disk’). Use ‘full’ if the full corpus and all embeddings fit into memory, which speeds up training. Otherwise use ‘partial’; if even that is too much for your memory, use ‘disk’.
label_name_map (Optional[dict[str, str]]) – Optionally map label names to a different schema.
allow_examples_without_labels – set to True to allow Sentences without a label in the corpus.
encoding (str) – Default is ‘utf-8’, but some datasets are in ‘latin-1’.
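
The parameters above can be combined as in the following sketch. It writes three tiny FastText-formatted splits to a temporary folder and wraps the constructor call in a function, `load_corpus`, which needs flair installed when it is called. The helper names (`write_split`, `load_corpus`), the file names, the `'sentiment'` label type, and the sample sentences are illustrative assumptions; only the keyword arguments come from the signature documented here.

```python
import tempfile
from pathlib import Path


def write_split(folder, name, rows):
    """Hypothetical helper: write (label, text) pairs as FastText-formatted lines."""
    path = Path(folder) / name
    lines = [f"__label__{label} {text}\n" for label, text in rows]
    path.write_text("".join(lines), encoding="utf-8")
    return path


def load_corpus(data_folder):
    """Build the corpus from the three split files (requires flair)."""
    from flair.datasets import ClassificationCorpus

    return ClassificationCorpus(
        data_folder,
        label_type="sentiment",      # assumed label type name for this example
        train_file="train.txt",
        dev_file="dev.txt",
        test_file="test.txt",
        truncate_to_max_tokens=512,  # cap each Sentence at 512 tokens
        memory_mode="partial",
        encoding="utf-8",
    )


# Made-up data, just enough to give each split one positive and one negative example.
data_folder = Path(tempfile.mkdtemp())
written = [
    write_split(data_folder, "train.txt", [
        ("positive", "A wonderful, warm story."),
        ("negative", "Two hours I will never get back."),
    ]),
    write_split(data_folder, "dev.txt", [
        ("positive", "Great acting throughout."),
        ("negative", "The ending made no sense."),
    ]),
    write_split(data_folder, "test.txt", [
        ("positive", "I would watch it again."),
        ("negative", "Dull and predictable."),
    ]),
]
```

With flair installed, `corpus = load_corpus(data_folder)` should yield a corpus whose splits are available as `corpus.train`, `corpus.dev` and `corpus.test`, and `corpus.make_label_dictionary(label_type="sentiment")` builds the label dictionary for the label type chosen here.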

Methods

`__init__`(data_folder[, label_type, ...])	Instantiates a Corpus from text classification-formatted task data.
`add_label_noise`(label_type, labels[, ...])	Adds artificial label noise to a specified split (in-place).
`downsample`([percentage, downsample_train, ...])	Randomly downsample the corpus to the given percentage (by removing data points).
`filter_empty_sentences`()	A method that filters all sentences consisting of 0 tokens.
`filter_long_sentences`(max_charlength)	A method that filters all sentences for which the plain text is longer than a specified number of characters.
`get_all_sentences`()	Returns all sentences (spanning all three splits) in the `Corpus`.
`get_label_distribution`()	Counts occurrences of each label in the corpus and returns them as a dictionary object.
`make_label_dictionary`(label_type[, ...])	Creates a Dictionary for a specific label type from the corpus.
`make_tag_dictionary`(tag_type)	DEPRECATED: Creates tag dictionary ensuring 'O', '<START>', '<STOP>'.
`make_vocab_dictionary`([max_tokens, min_freq])	Creates a `Dictionary` of all tokens contained in the corpus.
`obtain_statistics`([label_type, pretty_print])	Print statistics about the corpus, including the length of the sentences and the labels in the corpus.

Attributes

`corpus_tokenizer`	Returns the custom tokenizer provided during corpus initialization for retokenization, if any.
`dev`	The dev split as a `torch.utils.data.Dataset` object.
`test`	The test split as a `torch.utils.data.Dataset` object.
`train`	The training split as a `torch.utils.data.Dataset` object.
