flair.datasets.sequence_labeling.ColumnDataset

class flair.datasets.sequence_labeling.ColumnDataset(path_to_column_file, column_name_map, column_delimiter='\\s+', comment_symbol=None, banned_sentences=None, in_memory=True, document_separator_token=None, encoding='utf-8', skip_first_line=False, label_name_map=None, default_whitespace_after=1)

Bases: FlairDataset

__init__(path_to_column_file, column_name_map, column_delimiter='\\s+', comment_symbol=None, banned_sentences=None, in_memory=True, document_separator_token=None, encoding='utf-8', skip_first_line=False, label_name_map=None, default_whitespace_after=1)

Instantiates a column dataset.

Parameters:
  • path_to_column_file (Union[str, Path]) – path to the file with the column-formatted data

  • column_name_map (dict[int, str]) – a map specifying the column format

  • column_delimiter (str) – pattern used to split each line into columns; the default splits on any whitespace, but you can override it, for instance with “\t” to split only on tabs

  • comment_symbol (Optional[str]) – if set, lines that begin with this symbol are treated as comments

  • in_memory (bool) – if set to True, the dataset is kept in memory as Sentence objects; otherwise sentences are read from disk when accessed

  • document_separator_token (Optional[str]) – if provided, sentences that act as document boundaries (identified by this token) are marked as such

  • skip_first_line (bool) – set to True if your dataset has a header line

  • label_name_map (Optional[dict[str, str]]) – optionally maps tag names to a different schema

  • banned_sentences (Optional[list[str]]) – optionally removes the listed sentences from the corpus; works only if in_memory is True
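
To make the interplay of column_name_map, column_delimiter, comment_symbol and skip_first_line more concrete, here is a minimal standard-library sketch of how a column-formatted file might be interpreted. It is not flair's implementation: parse_column_text is a hypothetical helper, the sample data is made up, and the sketch assumes that blank lines separate sentences.

```python
import re


def parse_column_text(text, column_name_map, column_delimiter=r"\s+",
                      comment_symbol=None, skip_first_line=False):
    """Hypothetical helper (not part of flair): split column-formatted
    text into sentences, each a list of {column name: value} dicts."""
    lines = text.splitlines()
    if skip_first_line:
        lines = lines[1:]  # drop the header line

    sentences, current = [], []
    for line in lines:
        # Lines that begin with the comment symbol are ignored.
        if comment_symbol is not None and line.startswith(comment_symbol):
            continue
        # Assumption: a blank line ends the current sentence.
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        fields = re.split(column_delimiter, line.strip())
        # Keep only the columns named in column_name_map.
        current.append({name: fields[idx]
                        for idx, name in column_name_map.items()
                        if idx < len(fields)})
    if current:
        sentences.append(current)
    return sentences


# Made-up sample: a header line, a comment, and two sentences.
sample = (
    "token\ttag\n"
    "# a comment\n"
    "George\tB-PER\n"
    "Washington\tI-PER\n"
    "went\tO\n"
    "\n"
    "New York\tB-LOC\n"
)
columns = {0: "text", 1: "ner"}

# Splitting only on tabs keeps "New York" together as one token.
sentences = parse_column_text(sample, columns, column_delimiter="\t",
                              comment_symbol="#", skip_first_line=True)

# The default whitespace pattern also splits on the space inside "New York",
# so "York" ends up in the tag column.
default_split = parse_column_text(sample, columns,
                                  comment_symbol="#", skip_first_line=True)
```

With the tab delimiter the last line yields one token with text "New York"; with the default pattern the same line is split into three fields and the tag column is misread, which is the kind of case where overriding column_delimiter helps. The equivalent ColumnDataset arguments would be column_name_map={0: 'text', 1: 'ner'}, column_delimiter='\t', comment_symbol='#' and skip_first_line=True.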

Methods

__init__(path_to_column_file, column_name_map)

Instantiates a column dataset.

is_in_memory()

Attributes

FEATS

HEAD

SPACE_AFTER_KEY

SPACE_AFTER_KEY = 'space-after'
FEATS = ['feats', 'misc']
HEAD = ['head', 'head_id']
is_in_memory()
Return type:

bool