flair.datasets.document_classification

class flair.datasets.document_classification.ClassificationCorpus(data_folder, label_type='class', train_file=None, test_file=None, dev_file=None, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, sample_missing_splits=True, encoding='utf-8')

Bases: Corpus

A classification corpus from FastText-formatted text files.

__init__(data_folder, label_type='class', train_file=None, test_file=None, dev_file=None, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, sample_missing_splits=True, encoding='utf-8')

Instantiates a Corpus from text classification-formatted task data.

Parameters:
  • data_folder (Union[str, Path]) – base folder with the task data

  • label_type (str) – name of the label

  • train_file – the name of the train file

  • test_file – the name of the test file

  • dev_file – the name of the dev file; if None, dev data is sampled from train

  • truncate_to_max_tokens (int) – If set, truncates each Sentence to a maximum number of tokens

  • truncate_to_max_chars (int) – If set, truncates each Sentence to a maximum number of chars

  • filter_if_longer_than (int) – If set, filters documents that are longer than the specified number of tokens.

  • tokenizer (Union[bool, Tokenizer]) – Tokenizer for dataset, default is SegtokTokenizer

  • memory_mode (str) – Set to what degree to keep corpus in memory (‘full’, ‘partial’ or ‘disk’). Use ‘full’ if full corpus and all embeddings fit into memory for speedups during training. Otherwise use ‘partial’ and if even this is too much for your memory, use ‘disk’.

  • label_name_map (Optional[Dict[str, str]]) – Optionally map label names to different schema.

  • allow_examples_without_labels – set to True to allow Sentences without label in the corpus.

  • encoding (str) – Default is ‘utf-8’ but some datasets are in ‘latin-1’
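A minimal sketch of a data folder that this constructor could read, assuming the FastText line format described under ClassificationDataset. The folder name, file names, label type `topic`, and sample texts are made up for illustration. The flair call sits inside a function that is not invoked here, so the file-writing part runs without flair installed.

```python
import tempfile
from pathlib import Path

# Hypothetical toy data: one document per line, labels prefixed with __label__.
SPLITS = {
    "train.txt": [
        "__label__sports The home team won the final.",
        "__label__business Shares rose after the earnings call.",
    ],
    "test.txt": ["__label__sports The striker scored twice."],
}


def write_toy_corpus(folder: Path) -> Path:
    """Write the toy splits into `folder` and return it."""
    folder.mkdir(parents=True, exist_ok=True)
    for file_name, lines in SPLITS.items():
        (folder / file_name).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return folder


def load_with_flair(folder: Path):
    """Load the folder as a corpus; requires flair to be installed."""
    from flair.datasets import ClassificationCorpus

    # No dev_file is passed, so dev data is sampled from train (see dev_file).
    return ClassificationCorpus(
        folder,
        label_type="topic",
        train_file="train.txt",
        test_file="test.txt",
    )


data_folder = write_toy_corpus(Path(tempfile.mkdtemp()) / "toy_corpus")
print(sorted(p.name for p in data_folder.iterdir()))
```

Calling `load_with_flair(data_folder)` would then build the corpus from these files; a realistic training set needs far more than two lines.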

class flair.datasets.document_classification.ClassificationDataset(path_to_file, label_type, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, encoding='utf-8')

Bases: FlairDataset

Dataset for classification instantiated from a single FastText-formatted file.

__init__(path_to_file, label_type, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, encoding='utf-8')

Reads a data file for text classification.

The file should contain one document/text per line. Each line should have the following format: __label__<class_name> <text>

If you have a multi-label task, you can put as many labels as you want at the beginning of the line, e.g., __label__<class_name_1> __label__<class_name_2> <text>

Parameters:
  • path_to_file (Union[str, Path]) – the path to the data file

  • label_type (str) – name of the label

  • truncate_to_max_tokens – If set, truncates each Sentence to a maximum number of tokens

  • truncate_to_max_chars – If set, truncates each Sentence to a maximum number of chars

  • filter_if_longer_than (int) – If set, filters documents that are longer than the specified number of tokens.

  • tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SegtokTokenizer)

  • memory_mode (str) – Set to what degree to keep corpus in memory (‘full’, ‘partial’ or ‘disk’). Use ‘full’ if full corpus and all embeddings fit into memory for speedups during training. Otherwise use ‘partial’ and if even this is too much for your memory, use ‘disk’.

  • label_name_map (Optional[Dict[str, str]]) – Optionally map label names to different schema.

  • allow_examples_without_labels – set to True to allow Sentences without label in the Dataset.

  • encoding (str) – Default is ‘utf-8’ but some datasets are in ‘latin-1’

Returns:

list of sentences
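The line format can be illustrated with a small stand-alone parser. This is a sketch of the format only, not flair's own parsing code; the function name `parse_fasttext_line` is hypothetical.

```python
from typing import List, Tuple

LABEL_PREFIX = "__label__"


def parse_fasttext_line(line: str) -> Tuple[List[str], str]:
    """Split one line into its leading labels and the remaining text."""
    words = line.strip().split(" ")
    labels: List[str] = []
    index = 0
    # Consume every leading token that carries the label prefix.
    while index < len(words) and words[index].startswith(LABEL_PREFIX):
        labels.append(words[index][len(LABEL_PREFIX):])
        index += 1
    return labels, " ".join(words[index:])


print(parse_fasttext_line("__label__POSITIVE great movie"))
print(parse_fasttext_line("__label__news __label__sports the match ended 2-1"))
```

The second call shows a line that carries two labels before the text.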

is_in_memory()

Return type:

bool

class flair.datasets.document_classification.CSVClassificationCorpus(data_folder, column_name_map, label_type, name='csv_corpus', train_file=None, test_file=None, dev_file=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=False, skip_header=False, encoding='utf-8', no_class_label=None, sample_missing_splits=True, **fmtparams)

Bases: Corpus

Classification corpus instantiated from CSV data files.

__init__(data_folder, column_name_map, label_type, name='csv_corpus', train_file=None, test_file=None, dev_file=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=False, skip_header=False, encoding='utf-8', no_class_label=None, sample_missing_splits=True, **fmtparams)

Instantiates a Corpus for text classification from CSV column formatted data.

Parameters:
  • data_folder (Union[str, Path]) – base folder with the task data

  • column_name_map (Dict[int, str]) – a column name map that indicates which column is text and which the label(s)

  • label_type (str) – name of the label

  • train_file – the name of the train file

  • test_file – the name of the test file

  • dev_file – the name of the dev file; if None, dev data is sampled from train

  • max_tokens_per_doc – If set, truncates each Sentence to a maximum number of Tokens

  • max_chars_per_doc – If set, truncates each Sentence to a maximum number of chars

  • tokenizer (Tokenizer) – Tokenizer for dataset, default is SegtokTokenizer

  • in_memory (bool) – If True, keeps dataset as Sentences in memory, otherwise only keeps strings

  • skip_header (bool) – If True, skips first line because it is header

  • encoding (str) – Default is ‘utf-8’ but some datasets are in ‘latin-1’

  • fmtparams – additional parameters for the CSV file reader

Returns:

a Corpus with annotated train, dev and test data
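How column_name_map, skip_header and fmtparams interact can be sketched with the standard csv module. This is an illustration, not flair's implementation. It assumes the text column is named "text" and label columns start with "label"; the sample data and the helper `read_rows` are hypothetical.

```python
import csv
import io
from typing import Dict, List, Tuple

# Hypothetical tab-separated data with a header row.
RAW = (
    "id\ttopic\ttext\n"
    "1\tsports\tThe match ended 2-1.\n"
    "2\tbusiness\tShares rose sharply.\n"
)

# Assumed naming convention: "text" marks text, "label..." marks labels.
column_name_map: Dict[int, str] = {2: "text", 1: "label_topic"}


def read_rows(
    raw: str,
    column_map: Dict[int, str],
    skip_header: bool = False,
    **fmtparams,
) -> List[Tuple[str, List[str]]]:
    """Return (text, labels) pairs selected by column index."""
    # fmtparams (e.g. delimiter) are forwarded to the CSV reader.
    reader = csv.reader(io.StringIO(raw), **fmtparams)
    if skip_header:
        next(reader, None)  # drop the header row
    rows = []
    for row in reader:
        columns = sorted(column_map.items())
        text = " ".join(row[i] for i, name in columns if name.startswith("text"))
        labels = [row[i] for i, name in columns if name.startswith("label")]
        rows.append((text, labels))
    return rows


rows = read_rows(RAW, column_name_map, skip_header=True, delimiter="\t")
print(rows)
```

Columns that are not listed in the map, such as `id` here, are simply ignored.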

class flair.datasets.document_classification.CSVClassificationDataset(path_to_file, column_name_map, label_type, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, skip_header=False, encoding='utf-8', no_class_label=None, **fmtparams)

Bases: FlairDataset

Dataset for text classification from CSV column formatted data.

__init__(path_to_file, column_name_map, label_type, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, skip_header=False, encoding='utf-8', no_class_label=None, **fmtparams)

Instantiates a Dataset for text classification from CSV column formatted data.

Parameters:
  • path_to_file (Union[str, Path]) – path to the file with the CSV data

  • column_name_map (Dict[int, str]) – a column name map that indicates which column is text and which the label(s)

  • label_type (str) – name of the label

  • max_tokens_per_doc (int) – If set, truncates each Sentence to a maximum number of Tokens

  • max_chars_per_doc (int) – If set, truncates each Sentence to a maximum number of chars

  • tokenizer (Tokenizer) – Tokenizer for dataset, default is SegtokTokenizer

  • in_memory (bool) – If True, keeps dataset as Sentences in memory, otherwise only keeps strings

  • skip_header (bool) – If True, skips first line because it is header

  • encoding (str) – Most datasets are ‘utf-8’ but some are ‘latin-1’

  • fmtparams – additional parameters for the CSV file reader

Returns:

a Dataset with the annotated data from the CSV file
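The two truncation limits can be pictured as follows. This sketch assumes the character limit is applied to the raw text before tokenization and the token limit afterwards; that order is an assumption, and a whitespace split stands in for the real tokenizer. The function name `truncate` is hypothetical.

```python
from typing import List


def truncate(text: str, max_chars_per_doc: int = -1, max_tokens_per_doc: int = -1) -> List[str]:
    """Cut the raw text to a character budget, tokenize, then cut to a token budget."""
    if max_chars_per_doc > 0:
        text = text[:max_chars_per_doc]
    tokens = text.split()  # whitespace split stands in for a real tokenizer
    if 0 < max_tokens_per_doc < len(tokens):
        tokens = tokens[:max_tokens_per_doc]
    return tokens


doc = "the quick brown fox jumps over the lazy dog"
print(len(truncate(doc)))                    # default -1 means no limit
print(truncate(doc, max_tokens_per_doc=3))   # keeps the first three tokens
print(truncate(doc, max_chars_per_doc=9))    # keeps the first nine characters
```

With the default value of -1, neither limit is applied.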

is_in_memory()

Return type:

bool

class flair.datasets.document_classification.AMAZON_REVIEWS(split_max=30000, label_name_map={'1.0': 'NEGATIVE', '2.0': 'NEGATIVE', '3.0': 'NEGATIVE', '4.0': 'POSITIVE', '5.0': 'POSITIVE'}, skip_labels=['3.0', '4.0'], fraction_of_5_star_reviews=10, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)

Bases: ClassificationCorpus

A very large corpus of Amazon reviews with positivity ratings.

Corpus is downloaded from and documented at https://nijianmo.github.io/amazon/index.html. We download the 5-core subset which is still tens of millions of reviews.

__init__(split_max=30000, label_name_map={'1.0': 'NEGATIVE', '2.0': 'NEGATIVE', '3.0': 'NEGATIVE', '4.0': 'POSITIVE', '5.0': 'POSITIVE'}, skip_labels=['3.0', '4.0'], fraction_of_5_star_reviews=10, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)

Constructs corpus object.

Parameters:
  • label_name_map (Dict[str, str]) – Map label names to different schema. By default, the 5-star ratings are mapped onto two classes (POSITIVE, NEGATIVE), as shown in the signature.

  • split_max (int) – Indicates how many data points from each of the 28 splits are used, so set this higher or lower to increase/decrease corpus size.

  • memory_mode – Set to what degree to keep corpus in memory (‘full’, ‘partial’ or ‘disk’). Use ‘full’ if full corpus and all embeddings fit into memory for speedups during training. Otherwise use ‘partial’ and if even this is too much for your memory, use ‘disk’.

  • tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer)

  • corpusargs – Arguments for ClassificationCorpus
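The effect of the default label_name_map and skip_labels from the signature can be sketched in plain Python. The helper `map_label` is hypothetical, and the sketch assumes that skipping is decided on the raw rating before the map is applied.

```python
from typing import Dict, List, Optional

# Defaults copied from the AMAZON_REVIEWS signature.
LABEL_NAME_MAP: Dict[str, str] = {
    "1.0": "NEGATIVE",
    "2.0": "NEGATIVE",
    "3.0": "NEGATIVE",
    "4.0": "POSITIVE",
    "5.0": "POSITIVE",
}
SKIP_LABELS: List[str] = ["3.0", "4.0"]


def map_label(
    raw_label: str,
    label_name_map: Dict[str, str],
    skip_labels: List[str],
) -> Optional[str]:
    """Return the mapped label, or None if the example should be skipped."""
    # Assumption: skipping is decided on the raw label, before mapping.
    if raw_label in skip_labels:
        return None
    return label_name_map.get(raw_label, raw_label)


ratings = ["1.0", "2.0", "3.0", "4.0", "5.0"]
mapped = {r: map_label(r, LABEL_NAME_MAP, SKIP_LABELS) for r in ratings}
print(mapped)
```

Under that assumption, only 1-, 2- and 5-star reviews contribute examples with the default settings.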

download_and_prepare_amazon_product_file(data_folder, part_name, max_data_points=None, fraction_of_5_star_reviews=None)

class flair.datasets.document_classification.IMDB(base_path=None, rebalance_corpus=True, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)

Bases: ClassificationCorpus

Corpus of IMDB movie reviews labeled by sentiment (POSITIVE, NEGATIVE).

Downloaded from and documented at http://ai.stanford.edu/~amaas/data/sentiment/.

__init__(base_path=None, rebalance_corpus=True, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)

Initialize the IMDB movie review sentiment corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Provide this only if you store the IMDB corpus in a specific folder, otherwise use default.

  • tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer)

  • rebalance_corpus (bool) – Whether to use an 80/10/10 data split instead of the original 50/0/50 split.

  • memory_mode – Set to ‘partial’ because this is a huge corpus, but you can also set to ‘full’ for faster processing or ‘disk’ for less memory.

  • corpusargs – Other args for ClassificationCorpus.
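An 80/10/10 split like the one rebalance_corpus refers to can be sketched as follows. How flair pools and shuffles the original files is not stated here, so the shuffling, the seed, and the helper `split_80_10_10` are assumptions for illustration only.

```python
import random
from typing import List, Sequence, Tuple


def split_80_10_10(items: Sequence[str], seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle the items and cut them into train/dev/test parts of 80/10/10."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


reviews = [f"review {i}" for i in range(50)]
train, dev, test = split_80_10_10(reviews)
print(len(train), len(dev), len(test))
```

Every review lands in exactly one of the three parts, so the part sizes add up to the original count.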

class flair.datasets.document_classification.NEWSGROUPS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)

Bases: ClassificationCorpus

20 newsgroups corpus, classifying news items into one of 20 categories.

Downloaded from http://qwone.com/~jason/20Newsgroups

Each data point is a full news article so documents may be very long.

__init__(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)

Instantiates 20 newsgroups corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Provide this only if you store the 20 newsgroups corpus in a specific folder, otherwise use default.

  • tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer)

  • memory_mode (str) – Set to ‘partial’ because this is a big corpus, but you can also set to ‘full’ for faster processing or ‘disk’ for less memory.

  • corpusargs – Other args for ClassificationCorpus.

class flair.datasets.document_classification.AGNEWS(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='partial', **corpusargs)

Bases: ClassificationCorpus

The AG’s News Topic Classification Corpus, classifying news into 4 coarse-grained topics.

Labels: World, Sports, Business, Sci/Tech.

__init__(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='partial', **corpusargs)

Instantiates AGNews Classification Corpus with 4 classes.

Parameters:
  • base_path (Union[str, Path, None]) – Provide this only if you store the AGNEWS corpus in a specific folder, otherwise use default.

  • tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer)

  • memory_mode – Set to ‘partial’ by default. Can also be ‘full’ or ‘disk’.

  • corpusargs – Other args for ClassificationCorpus.

class flair.datasets.document_classification.STACKOVERFLOW(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)

Bases: ClassificationCorpus

Stackoverflow corpus classifying questions into one of 20 labels.

The data will be downloaded from “jacoxu/StackOverflow”.

Each data point is a question.

__init__(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)

Instantiates Stackoverflow corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Provide this only if you store the Stackoverflow corpus in a specific folder, otherwise use default.

  • tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer)

  • memory_mode (str) – Set to ‘partial’ because this is a big corpus, but you can also set to ‘full’ for faster processing or ‘disk’ for less memory.

  • corpusargs – Other args for ClassificationCorpus.

class flair.datasets.document_classification.SENTIMENT_140(label_name_map=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)

Bases: ClassificationCorpus

Twitter sentiment corpus.

See http://help.sentiment140.com/for-students

Two sentiments in train data (POSITIVE, NEGATIVE) and three sentiments in test data (POSITIVE, NEGATIVE, NEUTRAL).

__init__(label_name_map=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)

Instantiates twitter sentiment corpus.

Parameters:
  • label_name_map – By default, the numeric values are mapped to (‘NEGATIVE’, ‘POSITIVE’ and ‘NEUTRAL’)

  • tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer)

  • memory_mode (str) – Set to ‘partial’ because this is a big corpus, but you can also set to ‘full’ for faster processing or ‘disk’ for less memory.

  • corpusargs – Other args for ClassificationCorpus.

class flair.datasets.document_classification.SENTEVAL_CR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Bases: ClassificationCorpus

The customer reviews dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.

see facebookresearch/SentEval

__init__(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Instantiates SentEval customer reviews dataset.

Parameters:
  • corpusargs – Other args for ClassificationCorpus.

  • tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer())

  • memory_mode (str) – Set to ‘full’ by default since this is a small corpus. Can also be ‘partial’ or ‘disk’.

class flair.datasets.document_classification.SENTEVAL_MR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Bases: ClassificationCorpus

The movie reviews dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.

see facebookresearch/SentEval

__init__(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Instantiates SentEval movie reviews dataset.

Parameters:
  • corpusargs – Other args for ClassificationCorpus.

  • tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer)

  • memory_mode (str) – Set to ‘full’ by default since this is a small corpus. Can also be ‘partial’ or ‘disk’.

class flair.datasets.document_classification.SENTEVAL_SUBJ(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Bases: ClassificationCorpus

The subjectivity dataset of SentEval, classified into SUBJECTIVE or OBJECTIVE sentiment.

see facebookresearch/SentEval

__init__(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Instantiates SentEval subjectivity dataset.

Parameters:
  • corpusargs – Other args for ClassificationCorpus.

  • tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer)

  • memory_mode (str) – Set to ‘full’ by default since this is a small corpus. Can also be ‘partial’ or ‘disk’.

class flair.datasets.document_classification.SENTEVAL_MPQA(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Bases: ClassificationCorpus

The opinion-polarity dataset of SentEval, classified into NEGATIVE or POSITIVE polarity.

see facebookresearch/SentEval

__init__(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Instantiates SentEval opinion polarity dataset.

Parameters:
  • corpusargs – Other args for ClassificationCorpus.

  • tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer)

  • memory_mode (str) – Set to ‘full’ by default since this is a small corpus. Can also be ‘partial’ or ‘disk’.

class flair.datasets.document_classification.SENTEVAL_SST_BINARY(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Bases: ClassificationCorpus

The Stanford sentiment treebank dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.

see facebookresearch/SentEval

__init__(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Instantiates SentEval Stanford sentiment treebank dataset.

Parameters:
  • memory_mode (str) – Set to ‘full’ by default since this is a small corpus. Can also be ‘partial’ or ‘disk’.

  • tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer)

  • corpusargs – Other args for ClassificationCorpus.

class flair.datasets.document_classification.SENTEVAL_SST_GRANULAR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Bases: ClassificationCorpus

The Stanford sentiment treebank dataset of SentEval, classified into 5 sentiment classes.

see facebookresearch/SentEval

__init__(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Instantiates SentEval Stanford sentiment treebank dataset.

Parameters:
  • memory_mode (str) – Set to ‘full’ by default since this is a small corpus. Can also be ‘partial’ or ‘disk’.

  • tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer)

  • corpusargs – Other args for ClassificationCorpus.

class flair.datasets.document_classification.GLUE_COLA(label_type='acceptability', base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)

Bases: ClassificationCorpus

Corpus of Linguistic Acceptability from GLUE benchmark.

see https://gluebenchmark.com/tasks

The task is to predict whether an English sentence is grammatically correct. In addition to the Corpus, eval_dataset contains the unlabeled test data for GLUE evaluation.

__init__(label_type='acceptability', base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)

Instantiates CoLA dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Provide this only if you store the COLA corpus in a specific folder.

  • tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer)

  • corpusargs – Other args for ClassificationCorpus.

tsv_from_eval_dataset(folder_path)

Create eval prediction file.

This function creates a tsv file with predictions of the eval_dataset (after calling classifier.predict(corpus.eval_dataset, label_name='acceptability')). The resulting file is called CoLA.tsv and is in the format required for submission to the GLUE benchmark.
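The shape of such a prediction file can be sketched with the standard csv module. This is not the method's actual code. The two-column layout with an `index`/`prediction` header is an assumption about the submission format, and the helper `write_predictions_tsv` and the sample predictions are hypothetical.

```python
import csv
import tempfile
from pathlib import Path
from typing import List


def write_predictions_tsv(folder_path: Path, predictions: List[int], file_name: str = "CoLA.tsv") -> Path:
    """Write one row per predicted label, numbered from zero."""
    folder_path.mkdir(parents=True, exist_ok=True)
    out_file = folder_path / file_name
    with out_file.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["index", "prediction"])  # assumed header row
        for index, label in enumerate(predictions):
            writer.writerow([index, label])
    return out_file


out = write_predictions_tsv(Path(tempfile.mkdtemp()), [1, 0, 1])
print(out.read_text(encoding="utf-8"))
```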

class flair.datasets.document_classification.GLUE_SST2(label_type='sentiment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=False, encoding='utf-8', sample_missing_splits=True, **datasetargs)

Bases: CSVClassificationCorpus

label_map = {0: 'negative', 1: 'positive'}

tsv_from_eval_dataset(folder_path)

Create eval prediction file.

class flair.datasets.document_classification.GO_EMOTIONS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)

Bases: ClassificationCorpus

GoEmotions dataset containing 58k Reddit comments labeled with 27 emotion categories.

see google-research/google-research

__init__(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)

Initializes the GoEmotions corpus.

Parameters:
  • base_path (Union[str, Path]) – Provide this only if you want to store the corpus in a specific folder, otherwise use default.

  • tokenizer (Union[bool, Tokenizer]) – Specify which tokenizer to use, the default is SegtokTokenizer().

  • memory_mode (str) – Set to what degree to keep corpus in memory (‘full’, ‘partial’ or ‘disk’). Use ‘full’ if full corpus and all embeddings fit into memory for speedups during training. Otherwise use ‘partial’ and if even this is too much for your memory, use ‘disk’.
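One way to picture the three memory_mode settings is a toy dataset that stores parsed documents (‘full’), raw lines (‘partial’), or only byte offsets into the file (‘disk’). This is a conceptual sketch under that assumption, not flair's implementation; the class `LineDataset` and the sample file are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


class LineDataset:
    """Toy dataset illustrating three storage strategies for one text file."""

    def __init__(self, path: Path, memory_mode: str = "partial") -> None:
        if memory_mode not in ("full", "partial", "disk"):
            raise ValueError(f"unknown memory_mode: {memory_mode}")
        self.path = path
        self.memory_mode = memory_mode
        self.parsed: List[List[str]] = []  # 'full': parsed documents
        self.lines: List[str] = []         # 'partial': raw strings
        self.offsets: List[int] = []       # 'disk': byte offsets only
        with path.open("rb") as handle:
            while True:
                offset = handle.tell()
                raw = handle.readline()
                if not raw:
                    break
                line = raw.decode("utf-8").rstrip("\n")
                if memory_mode == "full":
                    self.parsed.append(line.split())
                elif memory_mode == "partial":
                    self.lines.append(line)
                else:
                    self.offsets.append(offset)

    def __len__(self) -> int:
        # Only the list that belongs to the active mode is filled.
        return len(self.parsed) or len(self.lines) or len(self.offsets)

    def __getitem__(self, index: int) -> List[str]:
        if self.memory_mode == "full":
            return self.parsed[index]
        if self.memory_mode == "partial":
            return self.lines[index].split()  # parse on access
        with self.path.open("rb") as handle:
            handle.seek(self.offsets[index])  # re-read from disk on access
            return handle.readline().decode("utf-8").split()


path = Path(tempfile.mkdtemp()) / "data.txt"
path.write_text("__label__joy so happy\n__label__anger so mad\n", encoding="utf-8")
for mode in ("full", "partial", "disk"):
    print(mode, LineDataset(path, mode)[1])
```

All three modes return the same item; they differ only in how much is held in memory and how much work each access costs.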

class flair.datasets.document_classification.TREC_50(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Bases: ClassificationCorpus

The TREC Question Classification Corpus, classifying questions into 50 fine-grained answer types.

__init__(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Instantiates TREC Question Classification Corpus with 50 classes.

Parameters:
  • base_path (Union[str, Path, None]) – Provide this only if you store the TREC corpus in a specific folder, otherwise use default.

  • tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer)

  • memory_mode – Set to ‘full’ by default since this is a small corpus. Can also be ‘partial’ or ‘disk’.

  • corpusargs – Other args for ClassificationCorpus.

class flair.datasets.document_classification.TREC_6(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Bases: ClassificationCorpus

The TREC Question Classification Corpus, classifying questions into 6 coarse-grained answer types.

__init__(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)

Instantiates TREC Question Classification Corpus with 6 classes.

Parameters:
  • base_path (Union[str, Path, None]) – Provide this only if you store the TREC corpus in a specific folder, otherwise use default.

  • tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer)

  • memory_mode – Set to ‘full’ by default since this is a small corpus. Can also be ‘partial’ or ‘disk’.

  • corpusargs – Other args for ClassificationCorpus.

class flair.datasets.document_classification.YAHOO_ANSWERS(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='partial', **corpusargs)

Bases: ClassificationCorpus

The YAHOO Question Classification Corpus, classifying questions into 10 coarse-grained answer types.

__init__(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='partial', **corpusargs)

Instantiates YAHOO Question Classification Corpus with 10 classes.

Parameters:
  • base_path (Union[str, Path, None]) – Provide this only if you store the YAHOO corpus in a specific folder, otherwise use default.

  • tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer)

  • memory_mode – Set to ‘partial’ by default since this is a rather big corpus. Can also be ‘full’ or ‘disk’.

  • corpusargs – Other args for ClassificationCorpus.

class flair.datasets.document_classification.GERMEVAL_2018_OFFENSIVE_LANGUAGE(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='full', fine_grained_classes=False, **corpusargs)

Bases: ClassificationCorpus

GermEval 2018 corpus for identification of offensive language.

Classifying German tweets into 2 coarse-grained categories OFFENSIVE and OTHER or 4 fine-grained categories ABUSE, INSULT, PROFANITY and OTHER.

__init__(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='full', fine_grained_classes=False, **corpusargs)

Instantiates GermEval 2018 Offensive Language Classification Corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Provide this only if you store the Offensive Language corpus in a specific folder, otherwise use default.

  • tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SegtokTokenizer)

  • memory_mode (str) – Set to ‘full’ by default since this is a small corpus. Can also be ‘partial’ or ‘disk’.

  • fine_grained_classes (bool) – Set to True to load the dataset with 4 fine-grained classes

  • corpusargs – Other args for ClassificationCorpus.

class flair.datasets.document_classification.COMMUNICATIVE_FUNCTIONS(base_path=None, memory_mode='full', tokenizer=<flair.tokenization.SpaceTokenizer object>, **corpusargs)

Bases: ClassificationCorpus

The Communicative Functions Classification Corpus.

Classifying sentences from scientific papers into 39 communicative functions.

__init__(base_path=None, memory_mode='full', tokenizer=<flair.tokenization.SpaceTokenizer object>, **corpusargs)

Instantiates Communicative Functions Classification Corpus with 39 classes.

Parameters:
  • base_path (Union[str, Path, None]) – Provide this only if you store the Communicative Functions data in a specific folder, otherwise use default.

  • tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer)

  • memory_mode (str) – Set to ‘full’ by default since this is a small corpus. Can also be ‘partial’ or ‘disk’.

  • corpusargs – Other args for ClassificationCorpus.

class flair.datasets.document_classification.WASSA_ANGER(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)

Bases: ClassificationCorpus

WASSA-2017 anger emotion-intensity corpus.

see https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.

__init__(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)

Instantiates WASSA-2017 anger emotion-intensity corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Provide this only if you store the WASSA corpus in a specific folder, otherwise use default.

  • tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer)

  • corpusargs – Other args for ClassificationCorpus.

class flair.datasets.document_classification.WASSA_FEAR(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)

Bases: ClassificationCorpus

WASSA-2017 fear emotion-intensity corpus.

see https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.

__init__(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)

Instantiates WASSA-2017 fear emotion-intensity corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Provide this only if you store the WASSA corpus in a specific folder, otherwise use default.

  • tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer)

  • corpusargs – Other args for ClassificationCorpus.

class flair.datasets.document_classification.WASSA_JOY(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)

Bases: ClassificationCorpus

WASSA-2017 joy emotion-intensity corpus.

see https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html

__init__(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)

Instantiates WASSA-2017 joy emotion-intensity corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Provide this only if you store the WASSA corpus in a specific folder, otherwise use default.

  • tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer)

  • corpusargs – Other args for ClassificationCorpus.

class flair.datasets.document_classification.WASSA_SADNESS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)

Bases: ClassificationCorpus

WASSA-2017 sadness emotion-intensity corpus.

see https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.

__init__(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)

Instantiates WASSA-2017 sadness emotion-intensity dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Provide this only if you store the WASSA corpus in a specific folder, otherwise use default.

  • tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer)

  • corpusargs – Other args for ClassificationCorpus.