flair.datasets.document_classification.CSVClassificationDataset

class flair.datasets.document_classification.CSVClassificationDataset(path_to_file, column_name_map, label_type, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=SegtokTokenizer(), in_memory=True, skip_header=False, encoding='utf-8', no_class_label=None, **fmtparams)

Bases: FlairDataset

Dataset for text classification from column-formatted CSV data.

__init__(path_to_file, column_name_map, label_type, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=SegtokTokenizer(), in_memory=True, skip_header=False, encoding='utf-8', no_class_label=None, **fmtparams)

Instantiates a Dataset for text classification from column-formatted CSV data.

Parameters:
  • path_to_file (Union[str, Path]) – path to the file with the CSV data

  • column_name_map (dict[int, str]) – a column name map that indicates which column holds the text and which column(s) hold the label(s)

  • label_type (str) – name of the label

  • max_tokens_per_doc (int) – If set, truncates each Sentence to a maximum number of Tokens

  • max_chars_per_doc (int) – If set, truncates each Sentence to a maximum number of chars

  • tokenizer (Tokenizer) – Tokenizer for the dataset; default is SegtokTokenizer

  • in_memory (bool) – If True, keeps the dataset as Sentences in memory; otherwise only keeps strings

  • skip_header (bool) – If True, skips the first line because it is a header

  • encoding (str) – file encoding; most datasets are 'utf-8', but some are 'latin-1'

  • fmtparams – additional format parameters for the underlying CSV reader (e.g. delimiter, quotechar)

Returns:

a Dataset of Sentences annotated with the given label type
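For orientation, a minimal usage sketch follows. The file name "reviews.csv", its column layout, and the label name "sentiment" are illustrative assumptions, not part of this reference; keyword arguments such as delimiter are forwarded to the CSV reader via **fmtparams.

from flair.datasets import CSVClassificationDataset

# Assumed CSV layout (header row, then one example per line):
#   sentiment,text
#   positive,"Great film, would watch again."
dataset = CSVClassificationDataset(
    path_to_file="reviews.csv",                # hypothetical file
    column_name_map={0: "label", 1: "text"},   # column 0 -> label, column 1 -> text
    label_type="sentiment",                    # assumed label name
    skip_header=True,                          # first line is the header
    delimiter=",",                             # passed through **fmtparams
)

# Each item is a flair Sentence carrying a label of the given label_type.
first = dataset[0]
print(first.to_plain_string())
print(first.get_labels("sentiment"))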

Methods

__init__(path_to_file, column_name_map, ...)

Instantiates a Dataset for text classification from column-formatted CSV data.

is_in_memory()

Returns whether the dataset is held fully in memory.

is_in_memory()
Return type:

bool
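The flag mirrors the in_memory constructor argument; a short sketch of how it might be queried, using the dataset constructed in the example above:

if dataset.is_in_memory():
    # Sentences were fully built up front and are held in RAM.
    print("Dataset is materialized in memory")
else:
    # Only raw strings are kept; Sentences are built on access.
    print("Dataset keeps strings and builds Sentences lazily")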