flair.datasets.document_classification.ClassificationDataset
- class flair.datasets.document_classification.ClassificationDataset(path_to_file, label_type, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, encoding='utf-8')
Bases: FlairDataset
Dataset for classification instantiated from a single FastText-formatted file.
- __init__(path_to_file, label_type, truncate_to_max_tokens=-1, truncate_to_max_chars=-1, filter_if_longer_than=-1, tokenizer=<flair.tokenization.SegtokTokenizer object>, memory_mode='partial', label_name_map=None, skip_labels=None, allow_examples_without_labels=False, encoding='utf-8')
Reads a data file for text classification.
The file should contain one document/text per line. Each line should have the following format: __label__<class_name> <text>. If you have a multi-class task, you can place as many labels as you want at the beginning of the line, e.g., __label__<class_name_1> __label__<class_name_2> <text>. A minimal usage sketch follows the parameter list below.

- Parameters:
  - path_to_file (Union[str, Path]) – the path to the data file
  - label_type (str) – name of the label
  - truncate_to_max_tokens (int) – if set, truncates each Sentence to a maximum number of tokens
  - truncate_to_max_chars (int) – if set, truncates each Sentence to a maximum number of characters
  - filter_if_longer_than (int) – if set, filters out documents that are longer than the specified number of tokens
  - tokenizer (Union[bool, Tokenizer]) – custom tokenizer to use (default is SegtokTokenizer)
  - memory_mode (str) – to what degree to keep the corpus in memory ('full', 'partial' or 'disk'). Use 'full' if the full corpus and all embeddings fit into memory, for speedups during training. Otherwise use 'partial', and if even this is too much for your memory, use 'disk'.
  - label_name_map (Optional[dict[str, str]]) – optionally map label names to a different schema
  - allow_examples_without_labels (bool) – set to True to allow Sentences without a label in the Dataset
  - encoding (str) – default is 'utf-8', but some datasets are in 'latin-1'
- Returns: list of sentences
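A minimal usage sketch, assuming a FastText-formatted file on disk; the file path, label names, and label mapping below are hypothetical, while the class and its parameters come from the signature above:

```python
from pathlib import Path
from flair.datasets.document_classification import ClassificationDataset

# Hypothetical file data/train.txt with lines such as:
#   __label__POSITIVE this movie was great
#   __label__NEGATIVE __label__LONG the plot dragged on forever
dataset = ClassificationDataset(
    path_to_file=Path("data/train.txt"),   # hypothetical path to the data file
    label_type="sentiment",                # hypothetical label name
    memory_mode="partial",                 # 'full', 'partial' or 'disk'
    label_name_map={"POSITIVE": "pos", "NEGATIVE": "neg"},  # optional relabeling
)

print(len(dataset))   # number of documents read from the file
print(dataset[0])     # a flair Sentence with its label(s) attached
```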
Methods
- __init__(path_to_file, label_type[, ...]): Reads a data file for text classification.
- is_in_memory()
- Return type: bool
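Continuing the hypothetical dataset from the sketch above, is_in_memory() can be queried before training; whether it returns True is presumably tied to the memory_mode chosen at construction:

```python
# Assumption: a dataset built with memory_mode='full' is the one expected to
# report True here; with 'partial' or 'disk' examples may be read lazily.
if dataset.is_in_memory():
    print("corpus fully loaded into memory")
else:
    print("examples are (partly) read lazily from disk")
```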