flair.datasets.document_classification.AMAZON_REVIEWS
- class flair.datasets.document_classification.AMAZON_REVIEWS(split_max=30000, label_name_map={'1.0': 'NEGATIVE', '2.0': 'NEGATIVE', '3.0': 'NEGATIVE', '4.0': 'POSITIVE', '5.0': 'POSITIVE'}, skip_labels=['3.0', '4.0'], fraction_of_5_star_reviews=10, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Bases: ClassificationCorpus
A very large corpus of Amazon reviews with positivity ratings.
The corpus is downloaded from and documented at https://nijianmo.github.io/amazon/index.html. We download the 5-core subset, which still contains tens of millions of reviews.
- __init__(split_max=30000, label_name_map={'1.0': 'NEGATIVE', '2.0': 'NEGATIVE', '3.0': 'NEGATIVE', '4.0': 'POSITIVE', '5.0': 'POSITIVE'}, skip_labels=['3.0', '4.0'], fraction_of_5_star_reviews=10, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Constructs corpus object.
Parameters:
- split_max (int) – Indicates how many data points from each of the 28 splits are used, so set this higher or lower to increase or decrease corpus size.
- label_name_map (dict[str, str]) – Map label names to a different schema. By default, the 5-star ratings are mapped onto two classes (POSITIVE and NEGATIVE), with 3-star and 4-star reviews skipped via skip_labels.
- memory_mode (str) – Set to what degree to keep the corpus in memory ('full', 'partial' or 'disk'). Use 'full' if the full corpus and all embeddings fit into memory, for speedups during training. Otherwise use 'partial', and if even that is too much for your memory, use 'disk'.
- tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer).
- corpusargs – Arguments for ClassificationCorpus.
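A minimal construction sketch (the parameter values below are illustrative, not recommendations). The first call downloads and prepares the data, which can take a while given the corpus size:

```python
from flair.datasets import AMAZON_REVIEWS

# Cap each of the 28 splits at 10,000 data points to get a smaller
# corpus, and keep it only partially in memory (the default mode).
corpus = AMAZON_REVIEWS(split_max=10000, memory_mode="partial")

# Printing the corpus shows the sizes of the train/dev/test splits.
print(corpus)
```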
Methods
- __init__([split_max, label_name_map, ...]): Constructs corpus object.
- add_label_noise(label_type, labels[, ...]): Generates uniform label noise distribution in the chosen dataset split.
- downsample([percentage, downsample_train, ...]): Randomly downsample the corpus to the given percentage (by removing data points).
- filter_empty_sentences(): A method that filters all sentences consisting of 0 tokens.
- filter_long_sentences(max_charlength): A method that filters all sentences for which the plain text is longer than a specified number of characters.
- get_all_sentences(): Returns all sentences (spanning all three splits) in the Corpus.
- get_label_distribution(): Counts occurrences of each label in the corpus and returns them as a dictionary object.
- make_label_dictionary(label_type[, ...]): Creates a dictionary of all labels assigned to the sentences in the corpus.
- make_tag_dictionary(tag_type): Create a tag dictionary of a given label type.
- make_vocab_dictionary([max_tokens, min_freq]): Creates a Dictionary of all tokens contained in the corpus.
- obtain_statistics([label_type, pretty_print]): Print statistics about the corpus, including the length of the sentences and the labels in the corpus.
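A short sketch of typical inspection calls on the corpus constructed above; the label type 'class' is an assumption (the usual default for flair classification corpora), not something this page states:

```python
# Count occurrences of each label across the corpus.
print(corpus.get_label_distribution())

# Build a label dictionary for downstream training.
# label_type='class' is assumed here; adjust if your flair version
# uses a different label type for classification corpora.
label_dict = corpus.make_label_dictionary(label_type="class")

# Print sentence-length and label statistics.
corpus.obtain_statistics(label_type="class")
```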
Attributes
- dev: The dev split as a torch.utils.data.Dataset object.
- test: The test split as a torch.utils.data.Dataset object.
- train: The training split as a torch.utils.data.Dataset object.
- download_and_prepare_amazon_product_file(data_folder, part_name, max_data_points=None, fraction_of_5_star_reviews=None)
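To round off, a hedged sketch of accessing the splits listed under Attributes; indexing behavior and speed depend on the chosen memory_mode:

```python
# Each split is a torch.utils.data.Dataset object.
print(len(corpus.train), len(corpus.dev), len(corpus.test))

# Individual data points are flair Sentence objects carrying a label.
sentence = corpus.train[0]
print(sentence)

# Randomly downsample to 10% for quick experiments
# (argument name per the method summary above).
corpus.downsample(percentage=0.1)
```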