flair.datasets.document_classification.CSVClassificationDataset
- class flair.datasets.document_classification.CSVClassificationDataset(path_to_file, column_name_map, label_type, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, skip_header=False, encoding='utf-8', no_class_label=None, **fmtparams)
Bases: FlairDataset
Dataset for text classification from CSV column-formatted data.
- __init__(path_to_file, column_name_map, label_type, max_tokens_per_doc=-1, max_chars_per_doc=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, in_memory=True, skip_header=False, encoding='utf-8', no_class_label=None, **fmtparams)
Instantiates a Dataset for text classification from CSV column-formatted data.
- Parameters:
  - path_to_file (Union[str, Path]) – path to the file with the CSV data
  - column_name_map (dict[int, str]) – a column name map that indicates which column is the text and which the label(s)
  - label_type (str) – name of the label
  - max_tokens_per_doc (int) – if set, truncates each Sentence to a maximum number of Tokens
  - max_chars_per_doc (int) – if set, truncates each Sentence to a maximum number of characters
  - tokenizer (Tokenizer) – tokenizer for the dataset; defaults to SegtokTokenizer
  - in_memory (bool) – if True, keeps the dataset as Sentences in memory, otherwise only keeps strings
  - skip_header (bool) – if True, skips the first line because it is a header
  - encoding (str) – most datasets are 'utf-8', but some are 'latin-1'
  - fmtparams – additional parameters for the CSV file reader
- Returns:
  a CSVClassificationDataset of annotated Sentences parsed from the CSV file
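As a sketch of how these parameters fit together, the snippet below prepares a small CSV file in the layout the dataset expects, using only the standard library. The file name, column contents, and the `column_name_map` values shown are illustrative assumptions, not taken from the Flair documentation; the dataset construction itself is shown as a comment because it requires Flair to be installed.

```python
import csv
import tempfile
from pathlib import Path

# Build a small CSV file: one column holds the label, another the text.
# The first row is a header, so the dataset would be created with
# skip_header=True.
rows = [
    ["label", "text"],                          # header row
    ["POSITIVE", "I really enjoyed this."],
    ["NEGATIVE", "This was a waste of time."],
]
path = Path(tempfile.mkdtemp()) / "train.csv"
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# column_name_map maps 0-based column indices to their roles: here the
# assumption is that column 0 carries the label and column 1 the text.
column_name_map = {0: "label", 1: "text"}

# Constructing the dataset (requires flair; shown as an assumption):
# from flair.datasets import CSVClassificationDataset
# dataset = CSVClassificationDataset(
#     path, column_name_map, label_type="sentiment", skip_header=True
# )

print(path.read_text(encoding="utf-8").splitlines()[0])
```

Extra keyword arguments such as `delimiter=";"` would be passed through `**fmtparams` to Python's CSV reader.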
Methods
- __init__(path_to_file, column_name_map, ...) – Instantiates a Dataset for text classification from CSV column-formatted data.
- is_in_memory()
  - Return type: bool