flair.datasets.document_classification.GO_EMOTIONS

class flair.datasets.document_classification.GO_EMOTIONS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)

Bases: ClassificationCorpus

GoEmotions dataset containing 58k Reddit comments labeled with 27 emotion categories.

See the google-research/google-research repository on GitHub for details.

__init__(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)

Initializes the GoEmotions corpus.

Parameters:
  • base_path (Union[str, Path]) – Provide this only if you want to store the corpus in a specific folder; otherwise the default data folder is used.

  • tokenizer (Union[bool, Tokenizer]) – Specify which tokenizer to use; the default is SegtokTokenizer().

  • memory_mode (str) – Sets to what degree the corpus is kept in memory (‘full’, ‘partial’ or ‘disk’). Use ‘full’ if the full corpus and all embeddings fit into memory, for speed-ups during training. Otherwise use ‘partial’, and if even that is too much for your memory, use ‘disk’.
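
For illustration, a minimal loading sketch (assuming a standard flair installation; the dataset is downloaded and cached automatically on first use):

    from flair.datasets import GO_EMOTIONS

    # Downloads GoEmotions on first use and caches it under the default
    # flair data folder (or under base_path, if one is given).
    corpus = GO_EMOTIONS(memory_mode="partial")

    # Prints the sizes of the train/dev/test splits.
    print(corpus)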

Methods

__init__([base_path, tokenizer, memory_mode])
    Initializes the GoEmotions corpus.

add_label_noise(label_type, labels[, ...])
    Adds artificial label noise to a specified split (in-place).

downsample([percentage, downsample_train, ...])
    Randomly downsamples the corpus to the given percentage (by removing data points).

filter_empty_sentences()
    Filters out all sentences consisting of 0 tokens.

filter_long_sentences(max_charlength)
    Filters out all sentences whose plain text is longer than a specified number of characters.

get_all_sentences()
    Returns all sentences (spanning all three splits) in the Corpus.

get_label_distribution()
    Counts occurrences of each label in the corpus and returns them as a dictionary.

make_label_dictionary(label_type[, ...])
    Creates a Dictionary for a specific label type from the corpus.

make_tag_dictionary(tag_type)
    DEPRECATED: Creates a tag dictionary, ensuring 'O', '<START>' and '<STOP>' are included.

make_vocab_dictionary([max_tokens, min_freq])
    Creates a Dictionary of all tokens contained in the corpus.

obtain_statistics([label_type, pretty_print])
    Prints statistics about the corpus, including sentence lengths and the labels in the corpus.
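
To illustrate a few of the methods above, a short prototyping sketch (the label type "class" is the ClassificationCorpus default and is an assumption here; adjust it if this corpus registers a different label type):

    from flair.datasets import GO_EMOTIONS

    corpus = GO_EMOTIONS()

    # Work on a 10% sample while prototyping.
    corpus = corpus.downsample(0.1)

    # Build the label dictionary needed to train a classifier.
    # "class" is assumed here as the default ClassificationCorpus label type.
    label_dict = corpus.make_label_dictionary(label_type="class")

    # Inspect how often each label occurs in the training split.
    print(corpus.get_label_distribution())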

Attributes

dev
    The dev split as a torch.utils.data.Dataset object.

test
    The test split as a torch.utils.data.Dataset object.

train
    The training split as a torch.utils.data.Dataset object.
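
For example, the splits can be accessed directly (a sketch; each split is a regular dataset of flair Sentence objects):

    from flair.datasets import GO_EMOTIONS

    corpus = GO_EMOTIONS()

    # Each split is a torch.utils.data.Dataset of Sentence objects.
    print(len(corpus.train), len(corpus.dev), len(corpus.test))

    # Inspect the first training example and its emotion labels.
    sentence = corpus.train[0]
    print(sentence.text)
    print(sentence.labels)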