flair.datasets.document_classification.CSVClassificationDataset
- class flair.datasets.document_classification.CSVClassificationDataset(path_to_file, column_name_map, label_type, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, skip_header=False, encoding='utf-8', no_class_label=None, **fmtparams)
Bases: FlairDataset
Dataset for text classification from CSV column-formatted data.
- __init__(path_to_file, column_name_map, label_type, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, skip_header=False, encoding='utf-8', no_class_label=None, **fmtparams)
Instantiates a Dataset for text classification from CSV column-formatted data.
- Parameters:
  - path_to_file (Union[str, Path]) – path to the file with the CSV data
  - column_name_map (dict[int, str]) – a column name map that indicates which column is the text and which the label(s)
  - label_type (str) – name of the label
  - max_tokens_per_doc (int) – if set, truncates each Sentence to a maximum number of Tokens
  - max_chars_per_doc (int) – if set, truncates each Sentence to a maximum number of characters
  - tokenizer (Tokenizer) – tokenizer for the dataset; defaults to SegtokTokenizer
  - in_memory (bool) – if True, keeps the dataset as Sentences in memory, otherwise only keeps strings
  - skip_header (bool) – if True, skips the first line because it is a header
  - encoding (str) – most datasets are 'utf-8', but some are 'latin-1'
  - fmtparams – additional parameters for the CSV file reader
- Returns:
  a CSVClassificationDataset of annotated Sentences parsed from the CSV file
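As a sketch of how these parameters fit together, the snippet below prepares a small CSV file in the layout the dataset expects, using only the standard library. The file name, column contents, and the `column_name_map` values shown are illustrative assumptions, not taken from the Flair documentation; the dataset construction itself is shown as a comment because it requires Flair to be installed.

```python
import csv
import tempfile
from pathlib import Path

# Build a small CSV file: one column holds the label, another the text.
# The first row is a header, so the dataset would be created with
# skip_header=True.
rows = [
    ["label", "text"],                          # header row
    ["POSITIVE", "I really enjoyed this."],
    ["NEGATIVE", "This was a waste of time."],
]
path = Path(tempfile.mkdtemp()) / "train.csv"
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# column_name_map maps 0-based column indices to their roles: here the
# assumption is that column 0 carries the label and column 1 the text.
column_name_map = {0: "label", 1: "text"}

# Constructing the dataset (requires flair; shown as an assumption):
# from flair.datasets import CSVClassificationDataset
# dataset = CSVClassificationDataset(
#     path, column_name_map, label_type="sentiment", skip_header=True
# )

print(path.read_text(encoding="utf-8").splitlines()[0])
```

Extra keyword arguments such as `delimiter=";"` would be passed through `**fmtparams` to Python's CSV reader.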
Methods
- __init__(path_to_file, column_name_map, ...) – Instantiates a Dataset for text classification from CSV column-formatted data.
- is_in_memory()
  - Return type: bool