flair.datasets.sequence_labeling.ColumnDataset
- class flair.datasets.sequence_labeling.ColumnDataset(path_to_column_file, column_name_map, column_delimiter='\\s+', comment_symbol=None, banned_sentences=None, in_memory=True, document_separator_token=None, encoding='utf-8', skip_first_line=False, label_name_map=None, default_whitespace_after=1)
Bases: FlairDataset
- __init__(path_to_column_file, column_name_map, column_delimiter='\\s+', comment_symbol=None, banned_sentences=None, in_memory=True, document_separator_token=None, encoding='utf-8', skip_first_line=False, label_name_map=None, default_whitespace_after=1)
Instantiates a column dataset.
- Parameters:
path_to_column_file (Union[str, Path]) – Path to the file with the column-formatted data.
column_name_map (dict[int, str]) – A map from column index to column name, specifying the column format.
column_delimiter (str) – Regular expression used to split each line into columns. The default splits on any whitespace; override it, for instance with "\t", to split only on tabs.
comment_symbol (Optional[str]) – If set, lines that begin with this symbol are treated as comments.
in_memory (bool) – If True, the dataset is kept in memory as Sentence objects; otherwise sentences are read from disk.
document_separator_token (Optional[str]) – If provided, sentences that function as document boundaries are marked as such.
skip_first_line (bool) – Set to True if your dataset has a header line.
label_name_map (Optional[dict[str, str]]) – Optionally map tag names to a different schema.
banned_sentences (Optional[list[str]]) – Optionally remove sentences from the corpus. Works only if in_memory is True.
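The effect of column_name_map, column_delimiter and comment_symbol can be illustrated without Flair itself. The following is a minimal standard-library sketch of the behaviour described by the parameters above, not Flair's actual implementation; the helper `read_columns` and the sample data are hypothetical.

```python
import re

# Illustrative column-formatted data: one token per line, tab-separated columns,
# a blank line between sentences. Note the token "New York" contains a space.
SAMPLE = (
    "# a comment line\n"
    "George\tB-PER\n"
    "New York\tB-LOC\n"
    "\n"
    "Sam\tB-PER\n"
)


def read_columns(text, column_name_map, column_delimiter=r"\s+", comment_symbol=None):
    """Hypothetical helper: split column-formatted text into sentences of named fields."""
    sentences, current = [], []
    for line in text.splitlines():
        # Lines starting with the comment symbol are skipped.
        if comment_symbol is not None and line.startswith(comment_symbol):
            continue
        # A blank line closes the current sentence.
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        # The delimiter is treated as a regular expression.
        fields = re.split(column_delimiter, line.rstrip())
        current.append(
            {name: fields[idx] for idx, name in column_name_map.items() if idx < len(fields)}
        )
    if current:
        sentences.append(current)
    return sentences


column_name_map = {0: "text", 1: "ner"}

# Default delimiter: any whitespace, so "New York" is split into two columns.
by_whitespace = read_columns(SAMPLE, column_name_map, comment_symbol="#")

# Tab-only delimiter: "New York" stays a single token.
by_tab = read_columns(SAMPLE, column_name_map, column_delimiter="\t", comment_symbol="#")

print(by_whitespace[0][1])
print(by_tab[0][1])
```

With the default delimiter the token "New York" is broken apart and the tag column is misread, which is the case where overriding column_delimiter with "\t" is useful.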
Methods
__init__(path_to_column_file, column_name_map) – Instantiates a column dataset.
Attributes
- SPACE_AFTER_KEY = 'space-after'
- FEATS = ['feats', 'misc']
- HEAD = ['head', 'head_id']
- is_in_memory()
- Return type:
bool