flair.datasets.sequence_labeling.ColumnDataset

class flair.datasets.sequence_labeling.ColumnDataset(path_to_column_file, column_name_map, column_delimiter='\\s+', comment_symbol=None, banned_sentences=None, in_memory=True, document_separator_token=None, encoding='utf-8', skip_first_line=False, label_name_map=None, default_whitespace_after=1)

Bases: FlairDataset

__init__(path_to_column_file, column_name_map, column_delimiter='\\s+', comment_symbol=None, banned_sentences=None, in_memory=True, document_separator_token=None, encoding='utf-8', skip_first_line=False, label_name_map=None, default_whitespace_after=1)

Instantiates a column dataset.

Parameters:

path_to_column_file (Union[str, Path]) – path to the file with the column-formatted data
column_name_map (dict[int, str]) – a map specifying the column format
column_delimiter (str) – regular expression used to split each line into columns; the default splits on any run of whitespace, but you can override it, for instance with "\t" to split only on tabs
comment_symbol (Optional[str]) – if set, lines that begin with this symbol are treated as comments
in_memory (bool) – if set to True, the dataset is kept in memory as Sentence objects; otherwise sentences are read from disk
document_separator_token (Optional[str]) – if provided, sentences that function as document boundaries are marked as such
skip_first_line (bool) – set to True if your dataset has a header line
label_name_map (Optional[dict[str, str]]) – optionally maps tag names to a different schema
banned_sentences (Optional[list[str]]) – optionally removes sentences from the corpus; works only if in_memory is True
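
To make the interplay of column_name_map and column_delimiter concrete, here is a minimal pure-Python sketch. It does not call Flair and is not Flair's implementation: the helper split_columns and the sample lines are hypothetical, and the sketch assumes that column_delimiter is applied as a regular expression to each line.

```python
import re


def split_columns(line, column_name_map, column_delimiter=r"\s+"):
    """Split one data line into named fields.

    Hypothetical helper for illustration only; not part of Flair.
    """
    fields = re.split(column_delimiter, line.rstrip("\n"))
    # Keep only the columns named in the map, keyed by their names.
    return {
        name: fields[index]
        for index, name in column_name_map.items()
        if index < len(fields)
    }


# Column 0 holds the token text, column 1 a POS tag, column 2 an NER tag.
column_name_map = {0: "text", 1: "pos", 2: "ner"}

# Default delimiter: any run of whitespace separates columns.
row = split_columns("George NNP B-PER", column_name_map)

# A token that contains a space is cut in two by the whitespace default,
# which shifts every following column by one position.
shifted_row = split_columns("New York\tNNP\tB-LOC", column_name_map)

# Splitting only on tabs keeps the token intact.
tab_row = split_columns("New York\tNNP\tB-LOC", column_name_map, column_delimiter="\t")

print(row)
print(shifted_row)
print(tab_row)
```

If tokens in your data can contain spaces, a tab-only delimiter is the safer choice, since the whitespace default would misalign the annotation columns.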

Methods

`__init__`(path_to_column_file, column_name_map)	Instantiates a column dataset.
`is_in_memory`()

Attributes

`FEATS`
`HEAD`
`SPACE_AFTER_KEY`

SPACE_AFTER_KEY = 'space-after'

FEATS = ['feats', 'misc']

HEAD = ['head', 'head_id']

is_in_memory()

Return type: bool
