flair.datasets.document_classification
- class flair.datasets.document_classification.ClassificationCorpus(data_folder, label_type='class', train_file=None, test_file=None, dev_file=None, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, sample_missing_splits=True, encoding='utf-8')
Bases: Corpus
A classification corpus from FastText-formatted text files.
- __init__(data_folder, label_type='class', train_file=None, test_file=None, dev_file=None, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, sample_missing_splits=True, encoding='utf-8')
Instantiates a Corpus from text classification-formatted task data.
- Parameters:
data_folder (Union[str, Path]) – base folder with the task data
label_type (str) – name of the label
train_file – the name of the train file
test_file – the name of the test file
dev_file – the name of the dev file; if None, dev data is sampled from train
truncate_to_max_tokens (int) – If set, truncates each Sentence to a maximum number of tokens
truncate_to_max_chars (int) – If set, truncates each Sentence to a maximum number of chars
filter_if_longer_than (int) – If set, filters out documents that are longer than the specified number of tokens.
tokenizer (Union[bool, Tokenizer]) – Tokenizer for dataset, default is SegtokTokenizer
memory_mode (str) – Set to what degree to keep corpus in memory (‘full’, ‘partial’ or ‘disk’). Use ‘full’ if the full corpus and all embeddings fit into memory for speedups during training. Otherwise use ‘partial’, and if even this is too much for your memory, use ‘disk’.
label_name_map (Optional[Dict[str, str]]) – Optionally map label names to different schema.
allow_examples_without_labels – set to True to allow Sentences without label in the corpus.
encoding (str) – Default is ‘utf-8’ but some datasets are in ‘latin-1’
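As a rough illustration of the parameters above, the following sketch writes two small FastText-formatted files and wraps the corpus construction in a function. The temporary folder, the file names `train.txt` and `test.txt`, the label type `"topic"`, and both helper names are assumptions made for this example; only the `ClassificationCorpus` arguments come from the signature above. The flair import is deferred so that the file-writing helper runs without flair installed.

```python
import tempfile
from pathlib import Path


def write_fasttext_file(path, examples):
    """Write (labels, text) pairs, one FastText-formatted document per line."""
    lines = []
    for labels, text in examples:
        prefix = " ".join(f"__label__{label}" for label in labels)
        # Collapse whitespace so each document stays on a single line.
        single_line_text = " ".join(text.split())
        lines.append(f"{prefix} {single_line_text}")
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def load_corpus(data_folder):
    """Build a ClassificationCorpus from the files written above (needs flair)."""
    from flair.datasets import ClassificationCorpus  # deferred import

    return ClassificationCorpus(
        data_folder,
        label_type="topic",      # hypothetical label name
        train_file="train.txt",  # hypothetical file names
        test_file="test.txt",
        # dev_file is left as None, so dev data is sampled from train.
    )


# Hypothetical toy data in a temporary folder.
data_folder = Path(tempfile.mkdtemp())
train_lines = write_fasttext_file(
    data_folder / "train.txt",
    [(["sports"], "Great  match\ntoday"), (["world", "business"], "Markets fall")],
)
write_fasttext_file(data_folder / "test.txt", [(["sports"], "Final score")])
```

With flair installed, calling `load_corpus(data_folder)` on a realistically sized folder should yield a corpus with train, dev and test splits; the two-line toy file here is only meant to show the layout.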
- class flair.datasets.document_classification.ClassificationDataset(path_to_file, label_type, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, encoding='utf-8')
Bases: FlairDataset
Dataset for classification instantiated from a single FastText-formatted file.
- __init__(path_to_file, label_type, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, encoding='utf-8')
Reads a data file for text classification.
The file should contain one document/text per line. Each line should have the following format: __label__<class_name> <text>. For a multi-label task, you can put as many labels as you want at the beginning of the line, e.g., __label__<class_name_1> __label__<class_name_2> <text>.
- Parameters:
path_to_file (Union[str, Path]) – the path to the data file
label_type (str) – name of the label
truncate_to_max_tokens – If set, truncates each Sentence to a maximum number of tokens
truncate_to_max_chars – If set, truncates each Sentence to a maximum number of chars
filter_if_longer_than (int) – If set, filters out documents that are longer than the specified number of tokens.
tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SegtokTokenizer)
memory_mode (str) – Set to what degree to keep corpus in memory (‘full’, ‘partial’ or ‘disk’). Use ‘full’ if the full corpus and all embeddings fit into memory for speedups during training. Otherwise use ‘partial’, and if even this is too much for your memory, use ‘disk’.
label_name_map (Optional[Dict[str, str]]) – Optionally map label names to different schema.
allow_examples_without_labels – set to True to allow Sentences without label in the Dataset.
encoding (str) – Default is ‘utf-8’ but some datasets are in ‘latin-1’
- Returns:
list of sentences
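To make the line format concrete, here is a minimal standalone sketch of how such a line decomposes into labels and text. It is not flair's parser: `parse_fasttext_line` and its whitespace-based splitting are assumptions for illustration, and the optional `label_name_map` argument only mimics the renaming described for the parameter of the same name.

```python
from typing import Dict, List, Optional, Tuple

LABEL_PREFIX = "__label__"


def parse_fasttext_line(
    line: str, label_name_map: Optional[Dict[str, str]] = None
) -> Tuple[List[str], str]:
    """Split one FastText-formatted line into (labels, text)."""
    tokens = line.strip().split()
    labels: List[str] = []
    position = 0
    # Leading tokens that carry the prefix are labels; the rest is text.
    while position < len(tokens) and tokens[position].startswith(LABEL_PREFIX):
        name = tokens[position][len(LABEL_PREFIX):]
        if label_name_map is not None:
            # Rename the label if a mapping is given, otherwise keep it.
            name = label_name_map.get(name, name)
        labels.append(name)
        position += 1
    return labels, " ".join(tokens[position:])


single = parse_fasttext_line("__label__POSITIVE I loved this film")
multi = parse_fasttext_line(
    "__label__1.0 __label__spam buy now", label_name_map={"1.0": "NEGATIVE"}
)
```

A line without any leading label yields an empty label list, which corresponds to the case that `allow_examples_without_labels` controls.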
- is_in_memory()
- Return type:
bool
- class flair.datasets.document_classification.CSVClassificationCorpus(data_folder, column_name_map, label_type, name='csv_corpus', train_file=None, test_file=None, dev_file=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=False, skip_header=False, encoding='utf-8', no_class_label=None, sample_missing_splits=True, **fmtparams)
Bases: Corpus
Classification corpus instantiated from CSV data files.
- __init__(data_folder, column_name_map, label_type, name='csv_corpus', train_file=None, test_file=None, dev_file=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=False, skip_header=False, encoding='utf-8', no_class_label=None, sample_missing_splits=True, **fmtparams)
Instantiates a Corpus for text classification from CSV column formatted data.
- Parameters:
data_folder (Union[str, Path]) – base folder with the task data
column_name_map (Dict[int, str]) – a column name map that indicates which column is text and which the label(s)
label_type (str) – name of the label
train_file – the name of the train file
test_file – the name of the test file
dev_file – the name of the dev file; if None, dev data is sampled from train
max_tokens_per_doc – If set, truncates each Sentence to a maximum number of Tokens
max_chars_per_doc – If set, truncates each Sentence to a maximum number of chars
tokenizer (Tokenizer) – Tokenizer for dataset, default is SegtokTokenizer
in_memory (bool) – If True, keeps dataset as Sentences in memory, otherwise only keeps strings
skip_header (bool) – If True, skips first line because it is header
encoding (str) – Default is ‘utf-8’ but some datasets are in ‘latin-1’
fmtparams – additional parameters for the CSV file reader
- Returns:
a Corpus with annotated train, dev and test data
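A small sketch of how `column_name_map` lines up with a file may help. It writes a tab-separated file with the standard `csv` module and wraps the corpus construction in a function. The file name, the label type `"topic"`, and the map values `"label"` and `"text"` are assumptions for this example (the values follow the convention used in flair's tutorials, so check them against your installed version); `delimiter` is passed through `**fmtparams` to the CSV reader.

```python
import csv
import tempfile
from pathlib import Path

# Column 0 holds the label, column 1 the text (assumed value names).
column_name_map = {0: "label", 1: "text"}

rows = [
    ["topic", "headline"],  # header row, skipped via skip_header=True
    ["sports", "Home team wins the final"],
    ["business", "Shares rally after earnings"],
]

data_folder = Path(tempfile.mkdtemp())
train_path = data_folder / "train.tsv"  # hypothetical file name
with train_path.open("w", newline="", encoding="utf-8") as handle:
    csv.writer(handle, delimiter="\t").writerows(rows)

# Read the file back to confirm the column layout on disk.
with train_path.open(newline="", encoding="utf-8") as handle:
    read_back = list(csv.reader(handle, delimiter="\t"))


def load_csv_corpus(folder):
    """Build a CSVClassificationCorpus from the file above (needs flair)."""
    from flair.datasets import CSVClassificationCorpus  # deferred import

    return CSVClassificationCorpus(
        folder,
        column_name_map,
        label_type="topic",      # hypothetical label name
        train_file="train.tsv",
        skip_header=True,        # first line is the header
        delimiter="\t",          # forwarded to the CSV reader as a fmtparam
    )
```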
- class flair.datasets.document_classification.CSVClassificationDataset(path_to_file, column_name_map, label_type, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, skip_header=False, encoding='utf-8', no_class_label=None, **fmtparams)
Bases: FlairDataset
Dataset for text classification from CSV column formatted data.
- __init__(path_to_file, column_name_map, label_type, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, skip_header=False, encoding='utf-8', no_class_label=None, **fmtparams)
Instantiates a Dataset for text classification from CSV column formatted data.
- Parameters:
path_to_file (Union[str, Path]) – path to the file with the CSV data
column_name_map (Dict[int, str]) – a column name map that indicates which column is text and which the label(s)
label_type (str) – name of the label
max_tokens_per_doc (int) – If set, truncates each Sentence to a maximum number of Tokens
max_chars_per_doc (int) – If set, truncates each Sentence to a maximum number of chars
tokenizer (Tokenizer) – Tokenizer for dataset, default is SegtokTokenizer
in_memory (bool) – If True, keeps dataset as Sentences in memory, otherwise only keeps strings
skip_header (bool) – If True, skips first line because it is header
encoding (str) – Most datasets are ‘utf-8’ but some are ‘latin-1’
fmtparams – additional parameters for the CSV file reader
- Returns:
a Corpus with annotated train, dev and test data
- is_in_memory()
- Return type:
bool
- class flair.datasets.document_classification.AMAZON_REVIEWS(split_max=30000, label_name_map={'1.0': 'NEGATIVE', '2.0': 'NEGATIVE', '3.0': 'NEGATIVE', '4.0': 'POSITIVE', '5.0': 'POSITIVE'}, skip_labels=['3.0', '4.0'], fraction_of_5_star_reviews=10, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Bases: ClassificationCorpus
A very large corpus of Amazon reviews with positivity ratings.
Corpus is downloaded from and documented at https://nijianmo.github.io/amazon/index.html. We download the 5-core subset, which still contains tens of millions of reviews.
- __init__(split_max=30000, label_name_map={'1.0': 'NEGATIVE', '2.0': 'NEGATIVE', '3.0': 'NEGATIVE', '4.0': 'POSITIVE', '5.0': 'POSITIVE'}, skip_labels=['3.0', '4.0'], fraction_of_5_star_reviews=10, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Constructs corpus object.
- Parameters:
label_name_map (Dict[str, str]) – Map label names to different schema. By default, the 5-star rating is mapped onto two classes (POSITIVE, NEGATIVE), as the default value in the signature shows.
split_max (int) – Indicates how many data points from each of the 28 splits are used, so set this higher or lower to increase/decrease corpus size.
memory_mode – Set to what degree to keep corpus in memory (‘full’, ‘partial’ or ‘disk’). Use ‘full’ if the full corpus and all embeddings fit into memory for speedups during training. Otherwise use ‘partial’, and if even this is too much for your memory, use ‘disk’.
tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer)
corpusargs – Arguments for ClassificationCorpus
- download_and_prepare_amazon_product_file(data_folder, part_name, max_data_points=None, fraction_of_5_star_reviews=None)
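The default arguments of AMAZON_REVIEWS can be read as a two-step label policy: some star ratings are dropped and the rest are renamed. The sketch below reproduces that reading on plain strings. It is an assumption about how `skip_labels` and `label_name_map` interact (skipping is applied to the raw rating before renaming), not flair's implementation, and `relabel` is a hypothetical helper.

```python
from typing import Dict, List, Optional

# Defaults copied from the AMAZON_REVIEWS signature.
LABEL_NAME_MAP: Dict[str, str] = {
    "1.0": "NEGATIVE",
    "2.0": "NEGATIVE",
    "3.0": "NEGATIVE",
    "4.0": "POSITIVE",
    "5.0": "POSITIVE",
}
SKIP_LABELS: List[str] = ["3.0", "4.0"]


def relabel(rating: str) -> Optional[str]:
    """Return the mapped label, or None if the rating is skipped."""
    if rating in SKIP_LABELS:  # assumed: skip is checked on the raw rating
        return None
    return LABEL_NAME_MAP.get(rating, rating)


ratings = ["1.0", "2.0", "3.0", "4.0", "5.0"]
kept = [label for label in map(relabel, ratings) if label is not None]
```

Under this reading, the defaults keep only 1-, 2- and 5-star reviews and produce two classes.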
- class flair.datasets.document_classification.IMDB(base_path=None, rebalance_corpus=True, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Bases: ClassificationCorpus
Corpus of IMDB movie reviews labeled by sentiment (POSITIVE, NEGATIVE).
Downloaded from and documented at http://ai.stanford.edu/~amaas/data/sentiment/.
- __init__(base_path=None, rebalance_corpus=True, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Initialize the IMDB movie review sentiment corpus.
- Parameters:
base_path (Union[str, Path, None]) – Provide this only if you store the IMDB corpus in a specific folder, otherwise use default.
tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer)
rebalance_corpus (bool) – Whether to use an 80/10/10 data split instead of the original 50/0/50 split.
memory_mode – Set to ‘partial’ because this is a huge corpus, but you can also set to ‘full’ for faster processing or ‘none’ for less memory.
corpusargs – Other args for ClassificationCorpus.
- class flair.datasets.document_classification.NEWSGROUPS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Bases: ClassificationCorpus
20 newsgroups corpus, classifying news items into one of 20 categories.
Downloaded from http://qwone.com/~jason/20Newsgroups
Each data point is a full news article so documents may be very long.
- __init__(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Instantiates 20 newsgroups corpus.
- Parameters:
base_path (Union[str, Path, None]) – Provide this only if you store the 20 newsgroups corpus in a specific folder, otherwise use default.
tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer)
memory_mode (str) – Set to ‘partial’ because this is a big corpus, but you can also set to ‘full’ for faster processing or ‘none’ for less memory.
corpusargs – Other args for ClassificationCorpus.
- class flair.datasets.document_classification.AGNEWS(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='partial', **corpusargs)
Bases: ClassificationCorpus
The AG’s News Topic Classification Corpus, classifying news into 4 coarse-grained topics.
Labels: World, Sports, Business, Sci/Tech.
- __init__(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='partial', **corpusargs)
Instantiates AGNews Classification Corpus with 4 classes.
- Parameters:
base_path (Union[str, Path, None]) – Provide this only if you store the AGNEWS corpus in a specific folder, otherwise use default.
tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer)
memory_mode – Set to ‘partial’ by default. Can also be ‘full’ or ‘none’.
corpusargs – Other args for ClassificationCorpus.
- class flair.datasets.document_classification.STACKOVERFLOW(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Bases: ClassificationCorpus
Stackoverflow corpus classifying questions into one of 20 labels.
The data will be downloaded from “jacoxu/StackOverflow”.
Each data point is a question.
- __init__(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Instantiates Stackoverflow corpus.
- Parameters:
base_path (Union[str, Path, None]) – Provide this only if you store the Stackoverflow corpus in a specific folder, otherwise use default.
tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer)
memory_mode (str) – Set to ‘partial’ because this is a big corpus, but you can also set to ‘full’ for faster processing or ‘none’ for less memory.
corpusargs – Other args for ClassificationCorpus.
- class flair.datasets.document_classification.SENTIMENT_140(label_name_map=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Bases: ClassificationCorpus
Twitter sentiment corpus.
See http://help.sentiment140.com/for-students
Two sentiments in train data (POSITIVE, NEGATIVE) and three sentiments in test data (POSITIVE, NEGATIVE, NEUTRAL).
- __init__(label_name_map=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Instantiates twitter sentiment corpus.
- Parameters:
label_name_map – By default, the numeric values are mapped to (‘NEGATIVE’, ‘POSITIVE’ and ‘NEUTRAL’)
tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer)
memory_mode (str) – Set to ‘partial’ because this is a big corpus, but you can also set to ‘full’ for faster processing or ‘none’ for less memory.
corpusargs – Other args for ClassificationCorpus.
- class flair.datasets.document_classification.SENTEVAL_CR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases: ClassificationCorpus
The customer reviews dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.
- __init__(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Instantiates SentEval customer reviews dataset.
- Parameters:
corpusargs – Other args for ClassificationCorpus.
tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer())
memory_mode (str) – Set to ‘full’ by default since this is a small corpus. Can also be ‘partial’ or ‘none’.
- class flair.datasets.document_classification.SENTEVAL_MR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases: ClassificationCorpus
The movie reviews dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.
- __init__(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Instantiates SentEval movie reviews dataset.
- Parameters:
corpusargs – Other args for ClassificationCorpus.
tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer)
memory_mode (str) – Set to ‘full’ by default since this is a small corpus. Can also be ‘partial’ or ‘none’.
- class flair.datasets.document_classification.SENTEVAL_SUBJ(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases: ClassificationCorpus
The subjectivity dataset of SentEval, classified into SUBJECTIVE or OBJECTIVE sentiment.
- __init__(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Instantiates SentEval subjectivity dataset.
- Parameters:
corpusargs – Other args for ClassificationCorpus.
tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer)
memory_mode (str) – Set to ‘full’ by default since this is a small corpus. Can also be ‘partial’ or ‘none’.
- class flair.datasets.document_classification.SENTEVAL_MPQA(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases: ClassificationCorpus
The opinion-polarity dataset of SentEval, classified into NEGATIVE or POSITIVE polarity.
- __init__(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Instantiates SentEval opinion polarity dataset.
- Parameters:
corpusargs – Other args for ClassificationCorpus.
tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer)
memory_mode (str) – Set to ‘full’ by default since this is a small corpus. Can also be ‘partial’ or ‘none’.
- class flair.datasets.document_classification.SENTEVAL_SST_BINARY(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases: ClassificationCorpus
The Stanford sentiment treebank dataset of SentEval, classified into NEGATIVE or POSITIVE sentiment.
- __init__(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Instantiates SentEval Stanford sentiment treebank dataset.
- Parameters:
memory_mode (str) – Set to ‘full’ by default since this is a small corpus. Can also be ‘partial’ or ‘none’.
tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer)
corpusargs – Other args for ClassificationCorpus.
- class flair.datasets.document_classification.SENTEVAL_SST_GRANULAR(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases: ClassificationCorpus
The Stanford sentiment treebank dataset of SentEval, classified into 5 sentiment classes.
- __init__(tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Instantiates SentEval Stanford sentiment treebank dataset.
- Parameters:
memory_mode (str) – Set to ‘full’ by default since this is a small corpus. Can also be ‘partial’ or ‘none’.
tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer)
corpusargs – Other args for ClassificationCorpus.
- class flair.datasets.document_classification.GLUE_COLA(label_type='acceptability', base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)
Bases: ClassificationCorpus
Corpus of Linguistic Acceptability from GLUE benchmark.
see https://gluebenchmark.com/tasks
The task is to predict whether an English sentence is grammatically correct. In addition to the Corpus, eval_dataset contains the unlabeled test data for GLUE evaluation.
- __init__(label_type='acceptability', base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)
Instantiates CoLA dataset.
- Parameters:
base_path (Union[str, Path, None]) – Provide this only if you store the COLA corpus in a specific folder.
tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer)
corpusargs – Other args for ClassificationCorpus.
- tsv_from_eval_dataset(folder_path)
Create eval prediction file.
This function creates a tsv file with predictions of the eval_dataset (after calling classifier.predict(corpus.eval_dataset, label_name='acceptability')). The resulting file is called CoLA.tsv and is in the format required for submission to the GLUE Benchmark.
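The order of calls described for tsv_from_eval_dataset can be sketched as a small helper. `write_cola_submission` is a hypothetical name, and the two stand-in classes exist only so the sketch runs without a trained model; in practice `classifier` would be a trained flair classifier and `corpus` a GLUE_COLA instance.

```python
from pathlib import Path


def write_cola_submission(classifier, corpus, folder_path):
    """Predict on the unlabeled eval data, then export the tsv file."""
    # Step 1: annotate the eval sentences with 'acceptability' predictions.
    classifier.predict(corpus.eval_dataset, label_name="acceptability")
    # Step 2: write the predictions in the GLUE submission format.
    corpus.tsv_from_eval_dataset(folder_path)


class _StubClassifier:
    """Stand-in for a trained classifier; records the call it receives."""

    def __init__(self):
        self.calls = []

    def predict(self, sentences, label_name):
        self.calls.append((sentences, label_name))


class _StubCorpus:
    """Stand-in for GLUE_COLA; records the export folder."""

    def __init__(self):
        self.eval_dataset = ["stub sentence"]
        self.exported_to = None

    def tsv_from_eval_dataset(self, folder_path):
        self.exported_to = folder_path


stub_classifier, stub_corpus = _StubClassifier(), _StubCorpus()
write_cola_submission(stub_classifier, stub_corpus, Path("glue_out"))
```

The point of the helper is the ordering: predictions have to be attached to `eval_dataset` before the export is called, otherwise there is nothing to write.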
- class flair.datasets.document_classification.GLUE_SST2(label_type='sentiment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=False, encoding='utf-8', sample_missing_splits=True, **datasetargs)
Bases: CSVClassificationCorpus
- label_map = {0: 'negative', 1: 'positive'}
- tsv_from_eval_dataset(folder_path)
Create eval prediction file.
- class flair.datasets.document_classification.GO_EMOTIONS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Bases: ClassificationCorpus
GoEmotions dataset containing 58k Reddit comments labeled with 27 emotion categories.
see google-research/google-research
- __init__(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', **corpusargs)
Initializes the GoEmotions corpus.
- Parameters:
base_path (Union[str, Path]) – Provide this only if you want to store the corpus in a specific folder, otherwise use default.
tokenizer (Union[bool, Tokenizer]) – Specify which tokenizer to use, the default is SegtokTokenizer().
memory_mode (str) – Set to what degree to keep corpus in memory (‘full’, ‘partial’ or ‘disk’). Use ‘full’ if the full corpus and all embeddings fit into memory for speedups during training. Otherwise use ‘partial’, and if even this is too much for your memory, use ‘disk’.
- class flair.datasets.document_classification.TREC_50(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases: ClassificationCorpus
The TREC Question Classification Corpus, classifying questions into 50 fine-grained answer types.
- __init__(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Instantiates TREC Question Classification Corpus with 50 classes.
- Parameters:
base_path (Union[str, Path, None]) – Provide this only if you store the TREC corpus in a specific folder, otherwise use default.
tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer)
memory_mode – Set to ‘full’ by default since this is a small corpus. Can also be ‘partial’ or ‘none’.
corpusargs – Other args for ClassificationCorpus.
- class flair.datasets.document_classification.TREC_6(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Bases: ClassificationCorpus
The TREC Question Classification Corpus, classifying questions into 6 coarse-grained answer types.
- __init__(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='full', **corpusargs)
Instantiates TREC Question Classification Corpus with 6 classes.
- Parameters:
base_path (Union[str, Path, None]) – Provide this only if you store the TREC corpus in a specific folder, otherwise use default.
tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer)
memory_mode – Set to ‘full’ by default since this is a small corpus. Can also be ‘partial’ or ‘none’.
corpusargs – Other args for ClassificationCorpus.
- class flair.datasets.document_classification.YAHOO_ANSWERS(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='partial', **corpusargs)
Bases: ClassificationCorpus
The YAHOO Question Classification Corpus, classifying questions into 10 coarse-grained answer types.
- __init__(base_path=None, tokenizer=<flair.tokenization.SpaceTokenizer object>, memory_mode='partial', **corpusargs)
Instantiates YAHOO Question Classification Corpus with 10 classes.
- Parameters:
base_path (Union[str, Path, None]) – Provide this only if you store the YAHOO corpus in a specific folder, otherwise use default.
tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer)
memory_mode – Set to ‘partial’ by default since this is a rather big corpus. Can also be ‘full’ or ‘none’.
corpusargs – Other args for ClassificationCorpus.
- class flair.datasets.document_classification.GERMEVAL_2018_OFFENSIVE_LANGUAGE(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='full', fine_grained_classes=False, **corpusargs)
Bases: ClassificationCorpus
GermEval 2018 corpus for identification of offensive language.
Classifying German tweets into 2 coarse-grained categories (OFFENSIVE and OTHER) or 4 fine-grained categories (ABUSE, INSULT, PROFANITY and OTHER).
- __init__(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='full', fine_grained_classes=False, **corpusargs)
Instantiates GermEval 2018 Offensive Language Classification Corpus.
- Parameters:
base_path (Union[str, Path, None]) – Provide this only if you store the Offensive Language corpus in a specific folder, otherwise use default.
tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SegtokTokenizer)
memory_mode (str) – Set to ‘full’ by default since this is a small corpus. Can also be ‘partial’ or ‘none’.
fine_grained_classes (bool) – Set to True to load the dataset with 4 fine-grained classes
corpusargs – Other args for ClassificationCorpus.
- class flair.datasets.document_classification.COMMUNICATIVE_FUNCTIONS(base_path=None, memory_mode='full', tokenizer=<flair.tokenization.SpaceTokenizer object>, **corpusargs)
Bases: ClassificationCorpus
The Communicative Functions Classification Corpus.
Classifying sentences from scientific papers into 39 communicative functions.
- __init__(base_path=None, memory_mode='full', tokenizer=<flair.tokenization.SpaceTokenizer object>, **corpusargs)
Instantiates Communicative Functions Classification Corpus with 39 classes.
- Parameters:
base_path (Union[str, Path, None]) – Provide this only if you store the Communicative Functions data in a specific folder, otherwise use default.
tokenizer (Union[bool, Tokenizer]) – Custom tokenizer to use (default is SpaceTokenizer)
memory_mode (str) – Set to ‘full’ by default since this is a small corpus. Can also be ‘partial’ or ‘none’.
corpusargs – Other args for ClassificationCorpus.
- class flair.datasets.document_classification.WASSA_ANGER(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)
Bases: ClassificationCorpus
WASSA-2017 anger emotion-intensity corpus.
see https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.
- __init__(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)
Instantiates WASSA-2017 anger emotion-intensity corpus.
- Parameters:
base_path (Union[str, Path, None]) – Provide this only if you store the WASSA corpus in a specific folder, otherwise use default.
tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer)
corpusargs – Other args for ClassificationCorpus.
- class flair.datasets.document_classification.WASSA_FEAR(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)
Bases: ClassificationCorpus
WASSA-2017 fear emotion-intensity corpus.
see https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.
- __init__(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)
Instantiates WASSA-2017 fear emotion-intensity corpus.
- Parameters:
base_path (Union[str, Path, None]) – Provide this only if you store the WASSA corpus in a specific folder, otherwise use default.
tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer)
corpusargs – Other args for ClassificationCorpus.
- class flair.datasets.document_classification.WASSA_JOY(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)
Bases: ClassificationCorpus
WASSA-2017 joy emotion-intensity corpus.
see https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.
- __init__(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)
Instantiates WASSA-2017 joy emotion-intensity corpus.
- Parameters:
base_path (Union[str, Path, None]) – Provide this only if you store the WASSA corpus in a specific folder, otherwise use default.
tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer)
corpusargs – Other args for ClassificationCorpus.
- class flair.datasets.document_classification.WASSA_SADNESS(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)
Bases: ClassificationCorpus
WASSA-2017 sadness emotion-intensity corpus.
see https://saifmohammad.com/WebPages/EmotionIntensity-SharedTask.html.
- __init__(base_path=None, tokenizer=<flair.tokenization.SegtokTokenizer object>, **corpusargs)
Instantiates WASSA-2017 sadness emotion-intensity dataset.
- Parameters:
base_path (Union[str, Path, None]) – Provide this only if you store the WASSA corpus in a specific folder, otherwise use default.
tokenizer (Tokenizer) – Custom tokenizer to use (default is SegtokTokenizer)
corpusargs – Other args for ClassificationCorpus.