flair.datasets.document_classification
- class flair.datasets.document_classification.ClassificationCorpus(data_folder, label_type='class', train_file=None, test_file=None, dev_file=None, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, sample_missing_splits=True, encoding='utf-8')
Bases:
Corpus
A classification corpus from FastText-formatted text files.
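As a rough illustration of the FastText format mentioned above, the sketch below writes three small split files in which each line starts with one or more `__label__<name>` prefixes followed by the text. The file names, label names, example texts and the `label_type` value are assumptions made for this example. `load_corpus` is a hypothetical helper that is only defined here, not called, so the formatting part runs without Flair installed.

```python
import tempfile
from pathlib import Path


def format_fasttext_line(text, labels):
    """Render one example as '__label__<A> __label__<B> <text>'."""
    prefix = " ".join(f"__label__{label}" for label in labels)
    return f"{prefix} {text}"


def write_split(path, examples):
    """Write (text, labels) pairs to a file, one example per line."""
    lines = [format_fasttext_line(text, labels) for text, labels in examples]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def load_corpus(data_folder):
    """Hypothetical helper: load the folder written below with Flair."""
    # Imported lazily so the helpers above work without Flair installed.
    from flair.datasets import ClassificationCorpus

    return ClassificationCorpus(
        data_folder,
        label_type="sentiment",  # assumed name for the label layer
        train_file="train.txt",
        dev_file="dev.txt",
        test_file="test.txt",
    )


# Toy data, invented for illustration only.
data_folder = Path(tempfile.mkdtemp())
write_split(data_folder / "train.txt", [
    ("I loved this film", ["POSITIVE"]),
    ("Dull and far too long", ["NEGATIVE"]),
])
write_split(data_folder / "dev.txt", [("A pleasant surprise", ["POSITIVE"])])
write_split(data_folder / "test.txt", [("Not worth the ticket", ["NEGATIVE"])])

print((data_folder / "train.txt").read_text(encoding="utf-8"))
```

With Flair installed, `corpus = load_corpus(data_folder)` should then build the corpus from these three files.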
- class flair.datasets.document_classification.ClassificationDataset(path_to_file, label_type, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, encoding='utf-8')
Bases:
FlairDataset
Dataset for classification instantiated from a single FastText-formatted file.
- is_in_memory()
- Return type:
bool
- class flair.datasets.document_classification.CSVClassificationCorpus(data_folder, column_name_map, label_type, name='csv_corpus', train_file=None, test_file=None, dev_file=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=False, skip_header=False, encoding='utf-8', no_class_label=None, sample_missing_splits=True, **fmtparams)
Bases:
Corpus
Classification corpus instantiated from CSV data files.
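The sketch below illustrates the idea behind `column_name_map`: a dictionary from zero-based column index to a role, here `"label"` and `"text"`. The file name, its contents and the `label_type` value are invented for the example. `read_with_map` is a plain-Python stand-in that only mimics the mapping and is not Flair's actual reader. `load_csv_corpus` is likewise a hypothetical helper that is defined but not called, and it assumes that `delimiter` is forwarded to Python's `csv` reader through `**fmtparams`.

```python
import csv
import tempfile
from pathlib import Path

# Column index -> role. Column 0 holds the label, column 1 the text.
column_name_map = {0: "label", 1: "text"}


def read_with_map(path, column_name_map, skip_header=False, **fmtparams):
    """Plain-Python stand-in: yield (text, labels) pairs from a CSV file."""
    columns = sorted(column_name_map.items())
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, **fmtparams)
        if skip_header:
            next(reader, None)  # drop the header row
        for row in reader:
            text = " ".join(row[i] for i, role in columns if role == "text")
            labels = [row[i] for i, role in columns if role.startswith("label")]
            yield text, labels


def load_csv_corpus(data_folder):
    """Hypothetical helper: load the folder written below with Flair."""
    # Imported lazily so the stand-in above works without Flair installed.
    from flair.datasets import CSVClassificationCorpus

    return CSVClassificationCorpus(
        data_folder,
        column_name_map,
        label_type="topic",  # assumed name for the label layer
        train_file="train.csv",
        skip_header=True,
        delimiter=",",  # assumed to reach csv.reader via **fmtparams
    )


# Toy data, invented for illustration only.
data_folder = Path(tempfile.mkdtemp())
csv_path = data_folder / "train.csv"
with open(csv_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["topic", "headline"])
    writer.writerow(["sports", "Local team wins the cup"])
    writer.writerow(["science", "New comet spotted, astronomers say"])

rows = list(read_with_map(csv_path, column_name_map, skip_header=True, delimiter=","))
print(rows)
```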
- class flair.datasets.document_classification.CSVClassificationDataset(path_to_file, column_name_map, label_type, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, skip_header=False, encoding='utf-8', no_class_label=None, **fmtparams)
Bases:
FlairDataset
Dataset for text classification from CSV column formatted data.
- is_in_memory()
- Return type:
bool
- class flair.datasets.document_classification.AMAZON_REVIEWS(split_max=30000, label_name_map={'1.0': 'NEGATIVE', '2.0': 'NEGATIVE', '3.0': 'NEGATIVE', '4.0': 'POSITIVE', '5.0': 'POSITIVE'}, skip_labels=['3.0', '4.0'], fraction_of_5_star_reviews=10, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Bases:
ClassificationCorpus
A very large corpus of Amazon reviews with positivity ratings.
The corpus is downloaded from and documented at https://nijianmo.github.io/amazon/index.html. The 5-core subset is downloaded, which still contains tens of millions of reviews.
- download_and_prepare_amazon_product_file(data_folder, part_name, max_data_points=None, fraction_of_5_star_reviews=None)
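To make the default `label_name_map` and `skip_labels` arguments of this class concrete, the sketch below applies them to raw star ratings. It assumes that skipping is checked against the raw label before the name mapping is applied. `resolve_label` is a hypothetical helper written for this example, not a Flair function. Under that assumption, 1- and 2-star reviews become NEGATIVE, 5-star reviews become POSITIVE, and 3- and 4-star reviews are dropped.

```python
# Defaults copied from the AMAZON_REVIEWS signature.
label_name_map = {
    "1.0": "NEGATIVE",
    "2.0": "NEGATIVE",
    "3.0": "NEGATIVE",
    "4.0": "POSITIVE",
    "5.0": "POSITIVE",
}
skip_labels = ["3.0", "4.0"]


def resolve_label(raw_label, label_name_map=None, skip_labels=None):
    """Hypothetical helper: return the mapped label, or None if skipped.

    Assumption: the skip check uses the raw label and runs before mapping.
    """
    if skip_labels and raw_label in skip_labels:
        return None
    if label_name_map and raw_label in label_name_map:
        return label_name_map[raw_label]
    return raw_label


for rating in ["1.0", "2.0", "3.0", "4.0", "5.0"]:
    print(rating, "->", resolve_label(rating, label_name_map, skip_labels))
```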
- class flair.datasets.document_classification.IMDB(base_path=None, rebalance_corpus=True, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Bases:
ClassificationCorpus
Corpus of IMDB movie reviews labeled by sentiment (POSITIVE, NEGATIVE).
Downloaded from and documented at http://ai.stanford.edu/~amaas/data/sentiment/.
- class flair.datasets.document_classification.NEWSGROUPS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Bases:
ClassificationCorpus
20 newsgroups corpus, classifying news items into one of 20 categories.
Downloaded from http://qwone.com/~jason/20Newsgroups
Each data point is a full news article, so documents may be very long.
- class flair.datasets.document_classification.STACKOVERFLOW(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Bases:
ClassificationCorpus
Stack Overflow corpus, classifying questions into one of 20 labels.
The data is downloaded from “jacoxu/StackOverflow”.
Each data point is a question.
- class flair.datasets.document_classification.SENTIMENT_140(label_name_map=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Bases:
ClassificationCorpus
Twitter sentiment corpus.
See http://help.sentiment140.com/for-students
Two sentiments in train data (POSITIVE, NEGATIVE) and three sentiments in test data (POSITIVE, NEGATIVE, NEUTRAL).
- class flair.datasets.document_classification.SENTEVAL_CR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases:
ClassificationCorpus
The customer reviews dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.
- class flair.datasets.document_classification.SENTEVAL_MR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases:
ClassificationCorpus
The movie reviews dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.
- class flair.datasets.document_classification.SENTEVAL_SUBJ(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases:
ClassificationCorpus
The subjectivity dataset of SentEval, classified into SUBJECTIVE or OBJECTIVE sentiment.
- class flair.datasets.document_classification.SENTEVAL_MPQA(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases:
ClassificationCorpus
The opinion-polarity dataset of SentEval, classified into NEGATIVE or POSITIVE polarity.
- class flair.datasets.document_classification.SENTEVAL_SST_BINARY(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases:
ClassificationCorpus
The Stanford sentiment treebank dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.
- class flair.datasets.document_classification.SENTEVAL_SST_GRANULAR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases:
ClassificationCorpus
The Stanford sentiment treebank dataset of SentEval, classified into 5 sentiment classes.
- class flair.datasets.document_classification.GLUE_COLA(label_type='acceptability', base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)
Bases:
ClassificationCorpus
Corpus of Linguistic Acceptability from the GLUE benchmark.
See https://gluebenchmark.com/tasks
The task is to predict whether an English sentence is grammatically correct. In addition to the corpus, eval_dataset contains the unlabeled test data for GLUE evaluation.
- tsv_from_eval_dataset(folder_path)
Create eval prediction file.
This function creates a TSV file with the predictions for eval_dataset. Call it after running classifier.predict(corpus.eval_dataset, label_name='acceptability'). The resulting file is named CoLA.tsv and is in the format required for submission to the GLUE benchmark.
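The sketch below shows the kind of file such a step produces, assuming the two-column `index` / `prediction` tab-separated layout used for GLUE submissions. `write_prediction_tsv` and the example predictions are hypothetical. They stand in for the predictions stored on `corpus.eval_dataset` and are not Flair's implementation.

```python
import csv
import tempfile
from pathlib import Path


def write_prediction_tsv(path, predictions):
    """Hypothetical helper: write one 'index<TAB>prediction' row per item."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["index", "prediction"])  # assumed header row
        for index, prediction in enumerate(predictions):
            writer.writerow([index, prediction])


# Invented predictions: 1 = acceptable, 0 = not acceptable.
predictions = [1, 0, 1]
out_path = Path(tempfile.mkdtemp()) / "CoLA.tsv"
write_prediction_tsv(out_path, predictions)

print(out_path.read_text(encoding="utf-8"))
```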
- class flair.datasets.document_classification.GLUE_SST2(label_type='sentiment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=False, encoding='utf-8', sample_missing_splits=True, **datasetargs)
Bases:
CSVClassificationCorpus
- label_map = {0: 'negative', 1: 'positive'}
- tsv_from_eval_dataset(folder_path)
Create eval prediction file.
- class flair.datasets.document_classification.GO_EMOTIONS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Bases:
ClassificationCorpus
GoEmotions dataset containing 58k Reddit comments labeled with 27 emotion categories.
- class flair.datasets.document_classification.TREC_50(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases:
ClassificationCorpus
The TREC Question Classification Corpus, classifying questions into 50 fine-grained answer types.
- class flair.datasets.document_classification.TREC_6(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases:
ClassificationCorpus
The TREC Question Classification Corpus, classifying questions into 6 coarse-grained answer types.
- class flair.datasets.document_classification.YAHOO_ANSWERS(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='partial', **corpusargs)
Bases:
ClassificationCorpus
The YAHOO Question Classification Corpus, classifying questions into 10 coarse-grained answer types.
- class flair.datasets.document_classification.GERMEVAL_2018_OFFENSIVE_LANGUAGE(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='full', fine_grained_classes=False, **corpusargs)
Bases:
ClassificationCorpus
GermEval 2018 corpus for identification of offensive language.
Classifies German tweets into 2 coarse-grained categories (OFFENSIVE and OTHER) or 4 fine-grained categories (ABUSE, INSULT, PROFANITY and OTHER).
- class flair.datasets.document_classification.COMMUNICATIVE_FUNCTIONS(base_path=None, memory_mode='full', tokenizer=<flair.tokenization.SpaceTokenizer object>, **corpusargs)
Bases:
ClassificationCorpus
The Communicative Functions Classification Corpus.
Classifying sentences from scientific papers into 39 communicative functions.
- class flair.datasets.document_classification.WASSA_ANGER(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)
Bases:
ClassificationCorpus
WASSA-2017 anger emotion-intensity corpus.
See https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.
- class flair.datasets.document_classification.WASSA_FEAR(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)
Bases:
ClassificationCorpus
WASSA-2017 fear emotion-intensity corpus.
See https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.
- class flair.datasets.document_classification.WASSA_JOY(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)
Bases:
ClassificationCorpus
WASSA-2017 joy emotion-intensity corpus.
See https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.
- class flair.datasets.document_classification.WASSA_SADNESS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)
Bases:
ClassificationCorpus
WASSA-2017 sadness emotion-intensity corpus.
See https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.