flair.datasets.document_classification.GLUE_SST2
class flair.datasets.document_classification.GLUE_SST2(label_type='sentiment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=False, encoding='utf-8', sample_missing_splits=True, **datasetargs)

Bases: CSVClassificationCorpus

__init__(label_type='sentiment', base_path=None, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=False, encoding='utf-8', sample_missing_splits=True, **datasetargs)
Instantiates a Corpus for text classification from CSV column-formatted data.

Parameters:
- data_folder – base folder with the task data
- column_name_map – a column name map that indicates which column contains the text and which the label(s)
- label_type (str) – name of the label
- train_file – the name of the train file
- test_file – the name of the test file
- dev_file – the name of the dev file; if None, dev data is sampled from train
- max_tokens_per_doc – If set, truncates each Sentence to a maximum number of Tokens
- max_chars_per_doc – If set, truncates each Sentence to a maximum number of characters
- tokenizer (Tokenizer) – Tokenizer for the dataset; default is SegtokTokenizer
- in_memory (bool) – If True, keeps dataset as Sentences in memory, otherwise only keeps strings
- skip_header – If True, skips the first line because it is a header
- encoding (str) – Default is 'utf-8', but some datasets are in 'latin-1'
- fmtparams – additional parameters for the CSV file reader
 
Returns:
a Corpus with annotated train, dev and test data
 
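Example (a minimal sketch; assumes a working flair installation and network access for the first download, which fetches the GLUE SST-2 data automatically):

```python
from flair.datasets import GLUE_SST2

# Downloads the GLUE SST-2 data on first use and loads it as a corpus;
# each Sentence carries a 'sentiment' label ('negative' or 'positive').
corpus = GLUE_SST2()

# The three splits behave like torch.utils.data.Dataset objects.
print(len(corpus.train), len(corpus.dev), len(corpus.test))

# Inspect one training sentence together with its sentiment label.
sentence = corpus.train[0]
print(sentence)
print(sentence.get_labels("sentiment"))
```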
Methods

- __init__([label_type, base_path, ...]) – Instantiates a Corpus for text classification from CSV column-formatted data.
- add_label_noise(label_type, labels[, ...]) – Generates uniform label noise distribution in the chosen dataset split.
- downsample([percentage, downsample_train, ...]) – Randomly downsample the corpus to the given percentage (by removing data points).
- filter_empty_sentences() – A method that filters all sentences consisting of 0 tokens.
- filter_long_sentences(max_charlength) – A method that filters all sentences for which the plain text is longer than a specified number of characters.
- get_all_sentences() – Returns all sentences (spanning all three splits) in the Corpus.
- get_label_distribution() – Counts occurrences of each label in the corpus and returns them as a dictionary object.
- make_label_dictionary(label_type[, ...]) – Creates a dictionary of all labels assigned to the sentences in the corpus.
- make_tag_dictionary(tag_type) – Create a tag dictionary of a given label type.
- make_vocab_dictionary([max_tokens, min_freq]) – Creates a Dictionary of all tokens contained in the corpus.
- obtain_statistics([label_type, pretty_print]) – Print statistics about the corpus, including the length of the sentences and the labels in the corpus.
- tsv_from_eval_dataset(folder_path) – Create eval prediction file.

Attributes

- dev – The dev split as a torch.utils.data.Dataset object.
- test – The test split as a torch.utils.data.Dataset object.
- train – The training split as a torch.utils.data.Dataset object.
- label_map = {0: 'negative', 1: 'positive'}
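The inherited Corpus helpers listed above can be combined like this (a sketch under the same assumptions as the previous example; signatures are as listed):

```python
from flair.datasets import GLUE_SST2

corpus = GLUE_SST2()

# Work on a 10% sample while prototyping (removes data points).
corpus = corpus.downsample(0.1)

# Build the label dictionary needed to train a classifier on this task.
label_dict = corpus.make_label_dictionary(label_type="sentiment")

# Occurrences of each label in the corpus, as a dictionary object.
print(corpus.get_label_distribution())

# Corpus statistics (sentence lengths, label counts) across the splits.
print(corpus.obtain_statistics(label_type="sentiment"))
```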
tsv_from_eval_dataset(folder_path)

Create eval prediction file.
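A hedged sketch of the intended workflow: the flair GLUE corpora also hold the unlabeled GLUE evaluation split (assumed here to be exposed as eval_dataset), and this method writes that split's predicted 'sentiment' labels to a TSV file under folder_path in the index/prediction format used for GLUE submissions. The model path below is a placeholder, not a published model:

```python
from pathlib import Path

from flair.datasets import GLUE_SST2
from flair.models import TextClassifier

corpus = GLUE_SST2()

# Placeholder path: a TextClassifier previously trained on this corpus.
classifier = TextClassifier.load("path/to/sst2-classifier.pt")

# Predict sentiment labels for the unlabeled evaluation split
# (eval_dataset is assumed; iterating it yields Sentence objects).
classifier.predict(list(corpus.eval_dataset), label_name="sentiment")

# Write the predictions as a TSV file into the given folder.
Path("glue_predictions").mkdir(exist_ok=True)
corpus.tsv_from_eval_dataset("glue_predictions")
```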