flair.datasets.document_classification

class flair.datasets.document_classification.ClassificationCorpus(data_folder, label_type='class', train_file=None, test_file=None, dev_file=None, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, sample_missing_splits=True, encoding='utf-8')

Bases: Corpus

A classification corpus from FastText-formatted text files.
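
As a rough illustration of the FastText format mentioned above, the sketch below writes one document per line, with each class name prefixed by `__label__` and followed by the text. The helper names (`to_fasttext_line`, `write_split`, `load_corpus`), the file names, and the toy data are hypothetical; only the `ClassificationCorpus` arguments come from the signature above. `load_corpus` needs flair installed and is not called here.

```python
import tempfile
from pathlib import Path


def to_fasttext_line(labels, text):
    """Format one document as '__label__<class> [...] <text>'."""
    prefix = " ".join(f"__label__{label}" for label in labels)
    return f"{prefix} {text}"


def write_split(path, examples):
    """Write (labels, text) pairs to a file, one document per line."""
    lines = [to_fasttext_line(labels, text) for labels, text in examples]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def load_corpus(data_folder):
    """Load the folder with flair (hypothetical helper, requires flair)."""
    from flair.datasets import ClassificationCorpus

    return ClassificationCorpus(
        data_folder,
        label_type="class",
        train_file="train.txt",
        dev_file="dev.txt",
        test_file="test.txt",
    )


# Toy data in a temporary folder; file names are illustrative assumptions.
data_folder = Path(tempfile.mkdtemp())
write_split(
    data_folder / "train.txt",
    [
        (["POSITIVE"], "A great movie ."),
        (["NEGATIVE"], "Dull and far too long ."),
    ],
)
train_lines = (data_folder / "train.txt").read_text(encoding="utf-8").splitlines()
```

A document with several classes simply carries several `__label__` prefixes on the same line.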

class flair.datasets.document_classification.ClassificationDataset(path_to_file, label_type, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, encoding='utf-8')

Bases: FlairDataset

Dataset for classification instantiated from a single FastText-formatted file.

is_in_memory()

Return type: bool

class flair.datasets.document_classification.CSVClassificationCorpus(data_folder, column_name_map, label_type, name='csv_corpus', train_file=None, test_file=None, dev_file=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=False, skip_header=False, encoding='utf-8', no_class_label=None, sample_missing_splits=True, **fmtparams)

Bases: Corpus

Classification corpus instantiated from CSV data files.
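
A minimal sketch of how a CSV-style corpus might be prepared and loaded. It assumes, without confirmation from this page, that `column_name_map` maps column indices to the names "text" and "label", and that extra keyword arguments such as `delimiter` are passed on to Python's `csv` reader through `**fmtparams`. The helpers `write_tsv` and `load_csv_corpus`, the file name, and the rows are hypothetical. `load_csv_corpus` needs flair installed and is not called here.

```python
import csv
import tempfile
from pathlib import Path


def write_tsv(path, rows, header=("label", "text")):
    """Write a header row plus (label, text) rows as tab-separated values."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)


def load_csv_corpus(data_folder):
    """Load the folder with flair (hypothetical helper, requires flair)."""
    from flair.datasets import CSVClassificationCorpus

    # Assumption: column 0 holds the label and column 1 holds the text.
    column_name_map = {0: "label", 1: "text"}
    return CSVClassificationCorpus(
        data_folder,
        column_name_map,
        label_type="topic",
        train_file="train.tsv",
        skip_header=True,  # the first row written above is a header
        delimiter="\t",  # assumed to reach csv.reader via **fmtparams
    )


# Toy data in a temporary folder; the file name is an illustrative assumption.
data_folder = Path(tempfile.mkdtemp())
write_tsv(
    data_folder / "train.tsv",
    [("sports", "The match ended in a draw ."), ("business", "Shares fell sharply .")],
)

with open(data_folder / "train.tsv", newline="", encoding="utf-8") as handle:
    rows_read = list(csv.reader(handle, delimiter="\t"))
```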

class flair.datasets.document_classification.CSVClassificationDataset(path_to_file, column_name_map, label_type, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, skip_header=False, encoding='utf-8', no_class_label=None, **fmtparams)

Bases: FlairDataset

Dataset for text classification from CSV-formatted (column-based) data.

is_in_memory()

Return type: bool

class flair.datasets.document_classification.AMAZON_REVIEWS(split_max=30000, label_name_map={'1.0': 'NEGATIVE', '2.0': 'NEGATIVE', '3.0': 'NEGATIVE', '4.0': 'POSITIVE', '5.0': 'POSITIVE'}, skip_labels=['3.0', '4.0'], fraction_of_5_star_reviews=10, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)

Bases: ClassificationCorpus

A very large corpus of Amazon reviews with positivity ratings.

The corpus is downloaded from and documented at https://nijianmo.github.io/amazon/index.html. We download the 5-core subset, which still contains tens of millions of reviews.
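
The default `label_name_map` and `skip_labels` in the signature above suggest how star ratings become sentiment labels. The sketch below is one plausible reading, not the library's implementation: it assumes `skip_labels` is checked against the raw rating first and that the remaining ratings are then renamed through `label_name_map`. The function `resolve_label` is hypothetical.

```python
# Defaults copied from the AMAZON_REVIEWS signature.
LABEL_NAME_MAP = {
    "1.0": "NEGATIVE",
    "2.0": "NEGATIVE",
    "3.0": "NEGATIVE",
    "4.0": "POSITIVE",
    "5.0": "POSITIVE",
}
SKIP_LABELS = ["3.0", "4.0"]


def resolve_label(raw_label, label_name_map=LABEL_NAME_MAP, skip_labels=SKIP_LABELS):
    """Return the mapped label, or None when the raw label is skipped.

    Assumption: skipping is applied to the raw label before renaming.
    """
    if skip_labels is not None and raw_label in skip_labels:
        return None
    if label_name_map is not None:
        # Labels without a mapping entry are kept unchanged.
        return label_name_map.get(raw_label, raw_label)
    return raw_label


ratings = ["1.0", "2.0", "3.0", "4.0", "5.0"]
resolved = [resolve_label(rating) for rating in ratings]
```

Under this reading, only 1-, 2-, and 5-star reviews contribute labels, which keeps the middle of the rating scale out of the binary task.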

download_and_prepare_amazon_product_file(data_folder, part_name, max_data_points=None, fraction_of_5_star_reviews=None)

class flair.datasets.document_classification.IMDB(base_path=None, rebalance_corpus=True, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)

Bases: ClassificationCorpus

Corpus of IMDB movie reviews labeled by sentiment (POSITIVE, NEGATIVE).

Downloaded from and documented at http://ai.stanford.edu/~amaas/data/sentiment/.

class flair.datasets.document_classification.NEWSGROUPS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)

Bases: ClassificationCorpus

20 newsgroups corpus, classifying news items into one of 20 categories.

Downloaded from http://qwone.com/~jason/20Newsgroups

Each data point is a full news article, so documents may be very long.

class flair.datasets.document_classification.AGNEWS(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='partial', **corpusargs)

Bases: ClassificationCorpus

The AG’s News Topic Classification Corpus, classifying news into 4 coarse-grained topics.

Labels: World, Sports, Business, Sci/Tech.

class flair.datasets.document_classification.STACKOVERFLOW(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)

Bases: ClassificationCorpus

Stack Overflow corpus classifying questions into one of 20 labels.

The data is downloaded from “jacoxu/StackOverflow”.

Each data point is a question.

class flair.datasets.document_classification.SENTIMENT_140(label_name_map=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)

Bases: ClassificationCorpus

Twitter sentiment corpus.

See http://help.sentiment140.com/for-students

The training data contains two sentiments (POSITIVE, NEGATIVE); the test data contains three (POSITIVE, NEGATIVE, NEUTRAL).

class flair.datasets.document_classification.SENTEVAL_CR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Bases: ClassificationCorpus

The customer reviews dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.

see facebookresearch/SentEval

class flair.datasets.document_classification.SENTEVAL_MR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Bases: ClassificationCorpus

The movie reviews dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.

see facebookresearch/SentEval

class flair.datasets.document_classification.SENTEVAL_SUBJ(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Bases: ClassificationCorpus

The subjectivity dataset of SentEval, classified as SUBJECTIVE or OBJECTIVE.

see facebookresearch/SentEval

class flair.datasets.document_classification.SENTEVAL_MPQA(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Bases: ClassificationCorpus

The opinion-polarity dataset of SentEval, classified into NEGATIVE or POSITIVE polarity.

see facebookresearch/SentEval

class flair.datasets.document_classification.SENTEVAL_SST_BINARY(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Bases: ClassificationCorpus

The Stanford sentiment treebank dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.

see facebookresearch/SentEval

class flair.datasets.document_classification.SENTEVAL_SST_GRANULAR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Bases: ClassificationCorpus

The Stanford sentiment treebank dataset of SentEval, classified into 5 sentiment classes.

see facebookresearch/SentEval

class flair.datasets.document_classification.GLUE_COLA(label_type='acceptability', base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)

Bases: ClassificationCorpus

Corpus of Linguistic Acceptability from the GLUE benchmark.

see https://gluebenchmark.com/tasks

The task is to predict whether an English sentence is grammatically correct. In addition to the Corpus, there is an eval_dataset containing the unlabeled test data for GLUE evaluation.

tsv_from_eval_dataset(folder_path)

Create eval prediction file.

This function creates a TSV file with the predictions for eval_dataset. Call classifier.predict(corpus.eval_dataset, label_name='acceptability') first. The resulting file is named CoLA.tsv and follows the format required for submission to the GLUE benchmark.
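
The submission layout is not spelled out on this page. As a hedged sketch, the code below assumes the two-column `index` / `prediction` layout commonly used for GLUE uploads and writes it with the standard library only. The function `write_glue_tsv` and the sample predictions are hypothetical and do not reproduce flair's own `tsv_from_eval_dataset`.

```python
import csv
import tempfile
from pathlib import Path


def write_glue_tsv(predictions, folder_path, file_name="CoLA.tsv"):
    """Write predictions as a TSV with an 'index' and a 'prediction' column.

    Assumption: rows are numbered from 0 in the order of the eval data.
    """
    out_path = Path(folder_path) / file_name
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["index", "prediction"])
        for index, label in enumerate(predictions):
            writer.writerow([index, label])
    return out_path


# Hypothetical predicted labels for three unlabeled sentences.
out_path = write_glue_tsv(["1", "0", "1"], tempfile.mkdtemp())
content = out_path.read_text(encoding="utf-8")
```

With flair itself, the predicted labels would be read from the sentences in `corpus.eval_dataset` after `classifier.predict(...)` has been called on it.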

class flair.datasets.document_classification.GLUE_SST2(label_type='sentiment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=False, encoding='utf-8', sample_missing_splits=True, **datasetargs)

Bases: CSVClassificationCorpus

label_map = {0: 'negative', 1: 'positive'}

tsv_from_eval_dataset(folder_path)

Create eval prediction file.

class flair.datasets.document_classification.GO_EMOTIONS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)

Bases: ClassificationCorpus

GoEmotions dataset containing 58k Reddit comments labeled with 27 emotion categories.

see google-research/google-research

class flair.datasets.document_classification.TREC_50(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Bases: ClassificationCorpus

The TREC Question Classification Corpus, classifying questions into 50 fine-grained answer types.

class flair.datasets.document_classification.TREC_6(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Bases: ClassificationCorpus

The TREC Question Classification Corpus, classifying questions into 6 coarse-grained answer types.

class flair.datasets.document_classification.YAHOO_ANSWERS(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='partial', **corpusargs)

Bases: ClassificationCorpus

The YAHOO Question Classification Corpus, classifying questions into 10 coarse-grained answer types.

class flair.datasets.document_classification.GERMEVAL_2018_OFFENSIVE_LANGUAGE(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='full', fine_grained_classes=False, **corpusargs)

Bases: ClassificationCorpus

GermEval 2018 corpus for identification of offensive language.

Classifies German tweets into 2 coarse-grained categories (OFFENSIVE, OTHER) or 4 fine-grained categories (ABUSE, INSULT, PROFANITY, OTHER).

class flair.datasets.document_classification.COMMUNICATIVE_FUNCTIONS(base_path=None, memory_mode='full', tokenizer=<flair.tokenization.SpaceTokenizer object>, **corpusargs)

Bases: ClassificationCorpus

The Communicative Functions Classification Corpus.

Classifying sentences from scientific papers into 39 communicative functions.

class flair.datasets.document_classification.WASSA_ANGER(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)

Bases: ClassificationCorpus

WASSA-2017 anger emotion-intensity corpus.

see https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.

class flair.datasets.document_classification.WASSA_FEAR(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)

Bases: ClassificationCorpus

WASSA-2017 fear emotion-intensity corpus.

see https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.

class flair.datasets.document_classification.WASSA_JOY(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)

Bases: ClassificationCorpus

WASSA-2017 joy emotion-intensity corpus.

see https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.

class flair.datasets.document_classification.WASSA_SADNESS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)

Bases: ClassificationCorpus

WASSA-2017 sadness emotion-intensity corpus.

see https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.