flair.datasets.sequence_labeling#

class flair.datasets.sequence_labeling.MultiFileJsonlCorpus(train_files=None, test_files=None, dev_files=None, encoding='utf-8', text_column_name='data', label_column_name='label', metadata_column_name='metadata', label_type='ner', use_tokenizer=True, **corpusargs)View on GitHub#

Bases: Corpus

This class represents a generic Jsonl corpus with multiple train, dev, and test files.

__init__(train_files=None, test_files=None, dev_files=None, encoding='utf-8', text_column_name='data', label_column_name='label', metadata_column_name='metadata', label_type='ner', use_tokenizer=True, **corpusargs)View on GitHub#

Instantiates a MultiFileJsonlCorpus as, e.g., created with doccano's JSONL export.

Note that at least one of train_files, test_files, and dev_files must contain at least one path. Otherwise, the initialization will fail.

Parameters:
  • corpusargs – Additional arguments for Corpus initialization

  • train_files – the name of the train files

  • test_files – the name of the test files

  • dev_files – the name of the dev files, if empty, dev data is sampled from train

  • encoding (str) – file encoding (default “utf-8”)

  • text_column_name (str) – Name of the text column inside the jsonl files.

  • label_column_name (str) – Name of the label column inside the jsonl files.

  • metadata_column_name (str) – Name of the metadata column inside the jsonl files.

  • label_type (str) – The type of label to predict (default “ner”)

  • use_tokenizer (Union[bool, Tokenizer]) – Specify a custom tokenizer to split the text into tokens.

Raises:

RuntimeError – If no paths are given

class flair.datasets.sequence_labeling.JsonlCorpus(data_folder, train_file=None, test_file=None, dev_file=None, encoding='utf-8', text_column_name='data', label_column_name='label', metadata_column_name='metadata', label_type='ner', autofind_splits=True, name=None, use_tokenizer=True, **corpusargs)View on GitHub#

Bases: MultiFileJsonlCorpus

__init__(data_folder, train_file=None, test_file=None, dev_file=None, encoding='utf-8', text_column_name='data', label_column_name='label', metadata_column_name='metadata', label_type='ner', autofind_splits=True, name=None, use_tokenizer=True, **corpusargs)View on GitHub#

Instantiates a JsonlCorpus with one file per Dataset (train, dev, and test).

Parameters:
  • data_folder (Union[str, Path]) – Path to the folder containing the JSONL corpus

  • train_file (Union[str, Path, None]) – the name of the train file

  • test_file (Union[str, Path, None]) – the name of the test file

  • dev_file (Union[str, Path, None]) – the name of the dev file, if None, dev data is sampled from train

  • encoding (str) – file encoding (default “utf-8”)

  • text_column_name (str) – Name of the text column inside the JSONL file.

  • label_column_name (str) – Name of the label column inside the JSONL file.

  • metadata_column_name (str) – Name of the metadata column inside the JSONL file.

  • label_type (str) – The type of label to predict (default “ner”)

  • autofind_splits (bool) – Whether train, test and dev file should be determined automatically

  • name (Optional[str]) – name of the Corpus; see flair.data.Corpus

  • use_tokenizer (Union[bool, Tokenizer]) – Specify a custom tokenizer to split the text into tokens.
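As a minimal usage sketch, the snippet below writes a tiny train file and shows how it might be passed to JsonlCorpus. The file name train.jsonl, the sample sentence, and the helper names write_example_corpus and load_jsonl_corpus are assumptions made for illustration. The flair import is deferred into load_jsonl_corpus so that the file-writing part runs without flair installed.

```python
import json
import tempfile
from pathlib import Path


def write_example_corpus(folder: Path) -> Path:
    """Write a one-line JSONL train file using the default column names."""
    record = {
        "data": "Berlin is a city",
        "label": [[0, 6, "LOC"]],  # [start_char_index, end_char_index, label]
        "metadata": [["source", "example"]],  # [metadata_key, metadata_value]
    }
    train_path = folder / "train.jsonl"
    with train_path.open("w", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return train_path


def load_jsonl_corpus(folder: Path):
    """Build a JsonlCorpus from the folder (requires flair to be installed)."""
    from flair.datasets.sequence_labeling import JsonlCorpus

    # No dev_file is given, so dev data would be sampled from train.
    return JsonlCorpus(folder, train_file="train.jsonl", label_type="ner")


example_folder = Path(tempfile.mkdtemp())
example_train_file = write_example_corpus(example_folder)
```

Calling load_jsonl_corpus(example_folder) would then construct the corpus from the folder written above.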

class flair.datasets.sequence_labeling.JsonlDataset(path_to_jsonl_file, encoding='utf-8', text_column_name='data', label_column_name='label', metadata_column_name='metadata', label_type='ner', use_tokenizer=True)View on GitHub#

Bases: FlairDataset

__init__(path_to_jsonl_file, encoding='utf-8', text_column_name='data', label_column_name='label', metadata_column_name='metadata', label_type='ner', use_tokenizer=True)View on GitHub#

Instantiates a JsonlDataset and converts all annotated char spans to token tags using the IOB scheme.

The expected file format is:

{
    "<text_column_name>": "<text>",
    "<label_column_name>": [[<start_char_index>, <end_char_index>, <label>],...],
    "<metadata_column_name>": [[<metadata_key>, <metadata_value>],...]
}
Parameters:
  • path_to_jsonl_file (Union[str, Path]) – File to read

  • encoding (str) – file encoding (default “utf-8”)

  • text_column_name (str) – Name of the text column

  • label_column_name (str) – Name of the label column

  • metadata_column_name (str) – Name of the metadata column

  • label_type (str) – The type of label to predict (default “ner”)

  • use_tokenizer (Union[bool, Tokenizer]) – Specify a custom tokenizer to split the text into tokens.
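To make the expected file format concrete, here is a small pure-Python sketch that reads lines in this layout. It is not flair's implementation. The function name read_jsonl_records is hypothetical, and treating end_char_index as exclusive is an assumption used only for the slicing in the example.

```python
import json


def read_jsonl_records(
    lines,
    text_column_name="data",
    label_column_name="label",
    metadata_column_name="metadata",
):
    """Yield (text, spans, metadata) for each non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        text = row[text_column_name]
        # Each label entry is [start_char_index, end_char_index, label].
        spans = [tuple(entry) for entry in row.get(label_column_name, [])]
        # Each metadata entry is [metadata_key, metadata_value].
        metadata = {key: value for key, value in row.get(metadata_column_name, [])}
        yield text, spans, metadata


example_lines = [
    '{"data": "Angela Merkel visited Paris", '
    '"label": [[0, 13, "PER"], [22, 27, "LOC"]], '
    '"metadata": [["source", "demo"]]}'
]
records = list(read_jsonl_records(example_lines))
text, spans, metadata = records[0]
# Slice the text with each span (end index assumed exclusive).
covered = [text[start:end] for start, end, _ in spans]
```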

_add_label_to_sentence(text, sentence, start, end, label)View on GitHub#

Adds a NE label to a given sentence.

Parameters:
  • text (str) – raw sentence (with all whitespaces etc.). Is used to determine the token indices.

  • sentence (Sentence) – Tokenized flair Sentence.

  • start (int) – Start character index of the label.

  • end (int) – End character index of the label.

  • label (str) – Label to assign to the given range.

Returns:

Nothing. The sentence is modified in place.
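The idea of mapping a character span onto tokens can be sketched in plain Python as follows. This is an illustration under simplifying assumptions, not flair's code: tokens are split on whitespace, end indices are treated as exclusive, and the function name char_spans_to_iob is hypothetical.

```python
import re


def char_spans_to_iob(text, spans):
    """Return (tokens, tags), tagging whitespace tokens with IOB labels."""
    # Record each token together with its character offsets in the raw text.
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        inside = False
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # A token belongs to the span if the two character ranges overlap.
            if tok_start < end and tok_end > start:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return [tok for tok, _, _ in tokens], tags


iob_tokens, iob_tags = char_spans_to_iob(
    "Angela Merkel visited Paris", [(0, 13, "PER"), (22, 27, "LOC")]
)
```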

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.sequence_labeling.MultiFileColumnCorpus(column_format, train_files=None, test_files=None, dev_files=None, column_delimiter='\\\\s+', comment_symbol=None, encoding='utf-8', document_separator_token=None, skip_first_line=False, in_memory=True, label_name_map=None, banned_sentences=None, default_whitespace_after=1, **corpusargs)View on GitHub#

Bases: Corpus

__init__(column_format, train_files=None, test_files=None, dev_files=None, column_delimiter='\\\\s+', comment_symbol=None, encoding='utf-8', document_separator_token=None, skip_first_line=False, in_memory=True, label_name_map=None, banned_sentences=None, default_whitespace_after=1, **corpusargs)View on GitHub#

Instantiates a Corpus from CoNLL column-formatted task data such as CoNLL03 or CoNLL2000.

Parameters:
  • column_format (dict[int, str]) – a map specifying the column format

  • train_files – the name of the train files

  • test_files – the name of the test files

  • dev_files – the name of the dev files, if empty, dev data is sampled from train

  • column_delimiter (str) – default is to split on any whitespace, but you can override this, for instance with “\t” to split only on tabs

  • comment_symbol (Optional[str]) – if set, lines that begin with this symbol are treated as comments

  • encoding (str) – file encoding (default “utf-8”)

  • document_separator_token (Optional[str]) – If provided, sentences that function as document boundaries are so marked

  • skip_first_line (bool) – set to True if your dataset has a header line

  • in_memory (bool) – If set to True, the dataset is kept in memory as Sentence objects, otherwise does disk reads

  • label_name_map (Optional[dict[str, str]]) – Optionally map tag names to different schema.

  • banned_sentences (Optional[list[str]]) – Optionally remove sentences from the corpus. Works only if in_memory is true

class flair.datasets.sequence_labeling.ColumnCorpus(data_folder, column_format, train_file=None, test_file=None, dev_file=None, autofind_splits=True, name=None, comment_symbol='# ', **corpusargs)View on GitHub#

Bases: MultiFileColumnCorpus

__init__(data_folder, column_format, train_file=None, test_file=None, dev_file=None, autofind_splits=True, name=None, comment_symbol='# ', **corpusargs)View on GitHub#

Instantiates a Corpus from CoNLL column-formatted task data such as CoNLL03 or CoNLL2000.

Parameters:
  • data_folder (Union[str, Path]) – base folder with the task data

  • column_format (dict[int, str]) – a map specifying the column format

  • train_file – the name of the train file

  • test_file – the name of the test file

  • dev_file – the name of the dev file, if None, dev data is sampled from train

  • column_delimiter – default is to split on any whitespace, but you can override this, for instance with “\t” to split only on tabs

  • comment_symbol – if set, lines that begin with this symbol are treated as comments

  • document_separator_token – If provided, sentences that function as document boundaries are so marked

  • skip_first_line – set to True if your dataset has a header line

  • in_memory – If set to True, the dataset is kept in memory as Sentence objects, otherwise does disk reads

  • label_name_map – Optionally map tag names to different schema.

  • banned_sentences – Optionally remove sentences from the corpus. Works only if in_memory is true
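A minimal usage sketch for ColumnCorpus follows. The file name train.txt, the two-column layout, the sample sentences, and the helper names are assumptions chosen for illustration. The flair import sits inside load_column_corpus so the rest of the snippet runs without flair installed.

```python
import tempfile
from pathlib import Path

# Column 0 holds the token text, column 1 the NER tag.
example_column_format = {0: "text", 1: "ner"}


def write_example_column_file(folder: Path) -> Path:
    """Write two short sentences in column format, separated by a blank line."""
    lines = [
        "George B-PER",
        "Washington I-PER",
        "went O",
        "to O",
        "Washington B-LOC",
        "",
        "Sam B-PER",
        "slept O",
        "",
    ]
    path = folder / "train.txt"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


def load_column_corpus(folder: Path):
    """Build a ColumnCorpus from the folder (requires flair to be installed)."""
    from flair.datasets.sequence_labeling import ColumnCorpus

    return ColumnCorpus(folder, example_column_format, train_file="train.txt")


column_folder = Path(tempfile.mkdtemp())
column_train_file = write_example_column_file(column_folder)
```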

class flair.datasets.sequence_labeling.ColumnDataset(path_to_column_file, column_name_map, column_delimiter='\\\\s+', comment_symbol=None, banned_sentences=None, in_memory=True, document_separator_token=None, encoding='utf-8', skip_first_line=False, label_name_map=None, default_whitespace_after=1)View on GitHub#

Bases: FlairDataset

SPACE_AFTER_KEY = 'space-after'#
FEATS = ['feats', 'misc']#
HEAD = ['head', 'head_id']#
__init__(path_to_column_file, column_name_map, column_delimiter='\\\\s+', comment_symbol=None, banned_sentences=None, in_memory=True, document_separator_token=None, encoding='utf-8', skip_first_line=False, label_name_map=None, default_whitespace_after=1)View on GitHub#

Instantiates a column dataset.

Parameters:
  • path_to_column_file (Union[str, Path]) – path to the file with the column-formatted data

  • column_name_map (dict[int, str]) – a map specifying the column format

  • column_delimiter (str) – default is to split on any whitespace, but you can override this, for instance with “\t” to split only on tabs

  • comment_symbol (Optional[str]) – if set, lines that begin with this symbol are treated as comments

  • in_memory (bool) – If set to True, the dataset is kept in memory as Sentence objects, otherwise does disk reads

  • document_separator_token (Optional[str]) – If provided, sentences that function as document boundaries are so marked

  • skip_first_line (bool) – set to True if your dataset has a header line

  • label_name_map (Optional[dict[str, str]]) – Optionally map tag names to different schema.

  • banned_sentences (Optional[list[str]]) – Optionally remove sentences from the corpus. Works only if in_memory is true
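How these options interact can be sketched with a small stand-alone parser. It is a simplified illustration, not the ColumnDataset implementation: the function name parse_column_lines is hypothetical, blank lines are assumed to separate sentences, and in_memory, document_separator_token, label_name_map and banned_sentences are left out.

```python
import re


def parse_column_lines(
    lines,
    column_name_map,
    column_delimiter=r"\s+",
    comment_symbol=None,
    skip_first_line=False,
):
    """Group column-formatted lines into sentences of {column name: value} dicts."""
    splitter = re.compile(column_delimiter)
    sentences, current = [], []
    for index, raw in enumerate(lines):
        if skip_first_line and index == 0:
            continue  # header line
        line = raw.rstrip("\n")
        if comment_symbol is not None and line.startswith(comment_symbol):
            continue  # comment line
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = splitter.split(line.strip())
        current.append(
            {name: fields[col] for col, name in column_name_map.items() if col < len(fields)}
        )
    if current:
        sentences.append(current)
    return sentences


parsed_sentences = parse_column_lines(
    ["word tag", "# comment", "George B-PER", "slept O", "", "Paris B-LOC"],
    {0: "text", 1: "ner"},
    comment_symbol="# ",
    skip_first_line=True,
)
```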

is_in_memory()View on GitHub#
Return type:

bool

class flair.datasets.sequence_labeling.ONTONOTES(base_path=None, version='v4', language='english', domain=None, in_memory=True, **corpusargs)View on GitHub#

Bases: MultiFileColumnCorpus

archive_url = 'https://data.mendeley.com/public-files/datasets/zmycy7t9h9/files/b078e1c4-f7a4-4427-be7f-9389967831ef/file_downloaded'#
classmethod get_available_domains(base_path=None, version='v4', language='english', split='train')View on GitHub#
Return type:

list[str]

classmethod _process_coref_span_annotations_for_word(label, word_index, clusters, coref_stacks)View on GitHub#

For a given coref label, add it to a currently open span(s), complete a span(s) or ignore it, if it is outside of all spans.

This method mutates the clusters and coref_stacks dictionaries.

Parameters:
  • label (str) – The coref label for this word.

  • word_index (int) – The word index into the sentence.

  • clusters (defaultdict[int, list[tuple[int, int]]]) – A dictionary mapping cluster ids to lists of inclusive spans into the sentence.

  • coref_stacks (defaultdict[int, list[int]]) – Stacks for each cluster id to hold the start indices of open spans. Spans with the same id can be nested, which is why we collect these opening spans on a stack, e.g: [Greg, the baker who referred to [himself]_ID1 as ‘the bread man’]_ID1

Return type:

None
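The stack-based handling of nested spans can be sketched as below. The label syntax is an assumption based on the CoNLL-2012 convention: "(1" opens a span for cluster 1, "1)" closes it, "(1)" marks a single-word span, "|" separates several annotations on one word, and "-" means no annotation. The function name process_coref_label is hypothetical.

```python
from collections import defaultdict


def process_coref_label(label, word_index, clusters, coref_stacks):
    """Update clusters and coref_stacks in place for one word's coref label."""
    if label == "-":
        return
    for segment in label.split("|"):
        if segment.startswith("("):
            if segment.endswith(")"):
                # Single-word span: opens and closes on the same word.
                clusters[int(segment[1:-1])].append((word_index, word_index))
            else:
                # Span opens here: remember the start index on the stack.
                coref_stacks[int(segment[1:])].append(word_index)
        else:
            # Span closes here: pair it with the most recent open start.
            cluster_id = int(segment[:-1])
            start = coref_stacks[cluster_id].pop()
            clusters[cluster_id].append((start, word_index))


# "Greg , the baker who referred to himself as the bread man"
coref_labels = ["(1", "-", "-", "-", "-", "-", "-", "(1)", "-", "-", "-", "1)"]
coref_clusters = defaultdict(list)
open_spans = defaultdict(list)
for position, coref_label in enumerate(coref_labels):
    process_coref_label(coref_label, position, coref_clusters, open_spans)
```

In this example the inner span for "himself" is completed first, and the outer span is completed only when its closing label is reached.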

classmethod dataset_document_iterator(file_path)View on GitHub#

An iterator over CONLL formatted files which yields documents, regardless of the number of document annotations in a particular file.

This is useful for conll data which has been preprocessed, such as the preprocessing which takes place for the 2012 CONLL Coreference Resolution task.

Return type:

Iterator[list[dict]]

classmethod sentence_iterator(file_path)View on GitHub#

An iterator over the sentences in an individual CONLL formatted file.

Return type:

Iterator

class flair.datasets.sequence_labeling.CONLL_03(base_path=None, column_format={0: 'text', 1: 'pos', 3: 'ner'}, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, column_format={0: 'text', 1: 'pos', 3: 'ner'}, in_memory=True, **corpusargs)View on GitHub#

Initialize the CoNLL-03 corpus.

This is only possible if you’ve manually downloaded it to your machine. Obtain the corpus from https://www.clips.uantwerpen.be/conll2003/ner/ and put the eng.testa, eng.testb, and eng.train files in a folder called ‘conll_03’. Then set the base_path parameter in the constructor to the path to the parent directory where the conll_03 folder resides. If using entity linking, the CoNLL-03 dataset is reduced by about 20 documents, which are not part of the YAGO dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the CoNLL-03 corpus (i.e. ‘conll_03’ folder) on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence – If True, all sentences of a document are read into a single Sentence object
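Because this corpus has to be placed on disk by hand, it can help to check the folder layout before constructing it. The sketch below assumes the layout described above (a conll_03 folder inside base_path holding eng.train, eng.testa and eng.testb). The helper names missing_conll03_files and load_conll03 are hypothetical, and the flair import is deferred so the check runs without flair installed.

```python
import tempfile
from pathlib import Path

EXPECTED_CONLL03_FILES = ("eng.train", "eng.testa", "eng.testb")


def missing_conll03_files(base_path):
    """Return the expected files that are absent from <base_path>/conll_03."""
    corpus_folder = Path(base_path) / "conll_03"
    return [name for name in EXPECTED_CONLL03_FILES if not (corpus_folder / name).is_file()]


def load_conll03(base_path):
    """Construct CONLL_03 once all files are in place (requires flair)."""
    from flair.datasets.sequence_labeling import CONLL_03

    missing = missing_conll03_files(base_path)
    if missing:
        raise FileNotFoundError(f"Missing CoNLL-03 files: {missing}")
    return CONLL_03(base_path=base_path)


# Demonstrate the check on a temporary folder that holds only the train file.
demo_base = Path(tempfile.mkdtemp())
(demo_base / "conll_03").mkdir()
(demo_base / "conll_03" / "eng.train").write_text("", encoding="utf-8")
demo_missing = missing_conll03_files(demo_base)
```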

class flair.datasets.sequence_labeling.CONLL_03_GERMAN(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the CoNLL-03 corpus for German.

This is only possible if you’ve manually downloaded it to your machine. Obtain the corpus from https://www.clips.uantwerpen.be/conll2003/ner/ and put the respective files in a folder called ‘conll_03_german’. Then set the base_path parameter in the constructor to the path to the parent directory where the conll_03_german folder resides.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the CoNLL-03 corpus (i.e. ‘conll_03_german’ folder) on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.CONLL_03_DUTCH(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the CoNLL-03 corpus for Dutch.

The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

class flair.datasets.sequence_labeling.CONLL_03_SPANISH(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the CoNLL-03 corpus for Spanish.

The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

class flair.datasets.sequence_labeling.CONLL_2000(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the CoNLL-2000 corpus for English chunking.

The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

class flair.datasets.sequence_labeling.WNUT_17(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.sequence_labeling.FEWNERD(setting='supervised', **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.sequence_labeling.BIOSCOPE(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.sequence_labeling.NER_ARABIC_ANER(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Initialize a preprocessed version of the Arabic Named Entity Recognition Corpus (ANERCorp).

The dataset is downloaded from http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp. The column order is swapped. The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence (bool) – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.NER_ARABIC_AQMAR(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Initialize a preprocessed and modified version of the American and Qatari Modeling of Arabic (AQMAR) dataset.

The dataset is downloaded from http://www.cs.cmu.edu/~ark/AQMAR/

Modifications from the original dataset:

  • Miscellaneous tags (MIS0, MIS1, MIS2, MIS3) are merged into one tag “MISC”, as these categories deviate across the original dataset

  • The 28 original Wikipedia articles are merged into a single file containing the articles in alphabetical order

The first time you call this constructor it will automatically download the dataset.

This dataset is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Please cite: “Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer, and Noah A. Smith (2012), Recall-Oriented Learning of Named Entities in Arabic Wikipedia. Proceedings of EACL.”

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence (bool) – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.NER_BASQUE(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.sequence_labeling.NER_CHINESE_WEIBO(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Initialize the WEIBO_NER corpus.

The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence (bool) – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.NER_DANISH_DANE(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.sequence_labeling.NER_ENGLISH_MOVIE_SIMPLE(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the eng corpus of the MIT Movie Corpus.

The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

class flair.datasets.sequence_labeling.NER_ENGLISH_MOVIE_COMPLEX(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the trivia10k13 corpus of the MIT Movie Corpus.

The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

class flair.datasets.sequence_labeling.NER_ENGLISH_SEC_FILLINGS(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the corpus of SEC filings annotated with English NER tags.

See paper “Domain Adaption of Named Entity Recognition to Support Credit Risk Assessment” by Alvarado et al, 2015: https://aclanthology.org/U15-1010/

Parameters:
  • base_path (Union[str, Path, None]) – Path to the CoNLL-03 corpus (i.e. ‘conll_03’ folder) on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

class flair.datasets.sequence_labeling.NER_ENGLISH_RESTAURANT(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the MIT Restaurant corpus.

The corpus will be downloaded from https://groups.csail.mit.edu/sls/downloads/restaurant/. The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.NER_ENGLISH_STACKOVERFLOW(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the STACKOVERFLOW_NER corpus.

The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.NER_ENGLISH_TWITTER(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the twitter_ner corpus.

The corpus will be downloaded from https://raw.githubusercontent.com/aritter/twitter_nlp/master/data/annotated/ner.txt. The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.NER_ENGLISH_PERSON(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True)View on GitHub#

Initialize the PERSON_NER corpus for person names.

The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

class flair.datasets.sequence_labeling.NER_ENGLISH_WEBPAGES(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the WEBPAGES_NER corpus.

The corpus was introduced in the paper “Design Challenges and Misconceptions in Named Entity Recognition” by Ratinov and Roth (2009): https://aclanthology.org/W09-1119/. The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.NER_ENGLISH_WNUT_2020(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Initialize the WNUT_2020_NER corpus.

The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence (bool) – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.NER_ENGLISH_WIKIGOLD(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Initialize the wikigold corpus.

The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence (bool) – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.NER_FINNISH(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.sequence_labeling.NER_GERMAN_BIOFID(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.sequence_labeling.NER_GERMAN_EUROPARL(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the EUROPARL_NER_GERMAN corpus.

The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training. Not recommended due to heavy RAM usage.

  • document_as_sequence – If True, all sentences of a document are read into a single Sentence object.

_add_IOB_tags(data_file, encoding='utf8', ner_column=1)View on GitHub#

Function that adds IOB tags if only chunk names are provided.

e.g. words are tagged PER instead of B-PER or I-PER. Replaces ‘0’ with ‘O’ as the no-chunk tag since ColumnCorpus expects the letter ‘O’. Additionally it removes lines with no tags in the data file and can also be used if the data is only partially IOB tagged.

Parameters:
  • data_file (Union[str, Path]) – Path to the data file.

  • encoding (str, optional) – Encoding used in open function. The default is “utf8”.

  • ner_column (int, optional) – Specifies the ner-tagged column. The default is 1 (the second column).
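The tag conversion described above can be sketched for a single sentence as follows. This is an illustrative reimplementation under stated assumptions, not the library's code: the function name add_iob_prefixes is hypothetical, it works on a list of tags rather than on a data file, and it does not cover the removal of untagged lines.

```python
def add_iob_prefixes(tags):
    """Add B-/I- prefixes to bare chunk tags, leaving existing IOB tags as they are."""
    fixed = []
    previous_type = None
    for tag in tags:
        if tag in ("O", "0"):
            # Normalise the digit zero to the letter O used as the no-chunk tag.
            fixed.append("O")
            previous_type = None
        elif tag.startswith(("B-", "I-")):
            # Already IOB tagged: keep it and remember its entity type.
            fixed.append(tag)
            previous_type = tag[2:]
        else:
            # Bare chunk name: continue the chunk if the type repeats, else start one.
            prefix = "I-" if tag == previous_type else "B-"
            fixed.append(prefix + tag)
            previous_type = tag
    return fixed


converted_tags = add_iob_prefixes(["PER", "PER", "0", "LOC", "B-ORG", "I-ORG"])
```

One consequence of this simple rule is that two adjacent entities of the same type are merged into a single chunk, since bare tags carry no boundary information.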

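class flair.datasets.sequence_labeling.NER_GERMAN_LEGAL(base_path=None, in_memory=True, **corpusargs)

Bases: ColumnCorpus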

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the LER_GERMAN (Legal Entity Recognition) corpus.

The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training. Not recommended due to heavy RAM usage.

class flair.datasets.sequence_labeling.NER_GERMAN_GERMEVAL(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the GermEval NER corpus for German.

This is only possible if you’ve manually downloaded it to your machine. Obtain the corpus from https://sites.google.com/site/germeval2014ner/data and put it into some folder. Then point the base_path parameter in the constructor to this folder.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the GermEval corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

class flair.datasets.sequence_labeling.NER_GERMAN_POLITICS(base_path=None, column_delimiter='\\\\s+', in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, column_delimiter='\\\\s+', in_memory=True, **corpusargs)View on GitHub#

Initialize corpus with Named Entity Model for German Politics (NEMGP).

The data comes from https://www.thomas-zastrow.de/nlp/.

The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.NER_HUNGARIAN(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Initialize the NER Business corpus for Hungarian.

The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence (bool) – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.NER_ICELANDIC(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the ICELANDIC_NER corpus.

The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.NER_JAPANESE(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the Hironsan/IOB2 corpus for Japanese.

The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

class flair.datasets.sequence_labeling.NER_MASAKHANE(languages='luo', version='v2', base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: MultiCorpus

__init__(languages='luo', version='v2', base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the Masakhane corpus available on masakhane-io/masakhane-ner.

It consists of ten African languages. Pass a language code or a list of language codes to initialize the corpus with the languages you require. If you pass “all”, all languages will be initialized.

Parameters:
  • version – Specifies version of the dataset. Currently, only “v1” and “v2” are supported, using “v2” as default.

  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

class flair.datasets.sequence_labeling.NER_MULTI_CONER(task='multi', base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: MultiFileColumnCorpus

__init__(task='multi', base_path=None, in_memory=True, **corpusargs)View on GitHub#

Download and initialize the MultiCoNer corpus.

Parameters:
  • task (str) – either ‘multi’, ‘code-switch’, or the language code for one of the mono tasks.

  • base_path (Union[str, Path, None]) – Path to the CoNLL-03 corpus (i.e. ‘conll_03’ folder) on your machine.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

class flair.datasets.sequence_labeling.NER_MULTI_CONER_V2(task='multi', base_path=None, in_memory=True, use_dev_as_test=True, **corpusargs)View on GitHub#

Bases: MultiFileColumnCorpus

__init__(task='multi', base_path=None, in_memory=True, use_dev_as_test=True, **corpusargs)View on GitHub#

Initialize the MultiCoNer V2 corpus for the Semeval2023 workshop.

This is only possible if you’ve applied for the corpus and downloaded it to your machine. Apply for the corpus at https://multiconer.github.io/dataset and unpack the .zip file’s content into a folder called ‘ner_multi_coner_v2’. Then set the base_path parameter in the constructor to the path of the parent directory in which the ner_multi_coner_v2 folder resides. Alternatively, place the folder in {FLAIR_CACHE_ROOT}/datasets and leave base_path empty.

Parameters:
  • task (str) – either ‘multi’, ‘code-switch’, or the language code for one of the mono tasks.

  • base_path (Union[str, Path, None]) – Path to the ner_multi_coner_v2 corpus (i.e. ‘ner_multi_coner_v2’ folder) on your machine.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • use_dev_as_test (bool) – If True, it uses the dev set as test set and samples random training data for a dev split.
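The folder layout described above can be checked before the corpus is constructed. The following is a minimal sketch under stated assumptions, not part of flair: the helper names `resolve_multiconer_base_path` and `load_multiconer_v2` are hypothetical, and the flair import is deferred so that the path check runs without the library installed.

```python
from pathlib import Path


def resolve_multiconer_base_path(parent_dir):
    """Return parent_dir if it contains a 'ner_multi_coner_v2' folder, else raise.

    Hypothetical helper for illustration; flair does not ship this function.
    """
    parent = Path(parent_dir)
    corpus_dir = parent / "ner_multi_coner_v2"
    if not corpus_dir.is_dir():
        raise FileNotFoundError(
            f"Expected the unpacked corpus at {corpus_dir}; "
            "apply for the data and unpack the .zip file there first."
        )
    return parent


def load_multiconer_v2(parent_dir, task="multi"):
    """Construct the corpus once the folder layout has been verified."""
    # Imported lazily so the path check above works without flair installed.
    from flair.datasets.sequence_labeling import NER_MULTI_CONER_V2

    base_path = resolve_multiconer_base_path(parent_dir)
    return NER_MULTI_CONER_V2(task=task, base_path=base_path)
```

Failing early with a clear message is a design choice of this sketch; it avoids a less obvious error later if the data was unpacked into the wrong directory.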

class flair.datasets.sequence_labeling.NER_MULTI_WIKIANN(languages='en', base_path=None, in_memory=False, **corpusargs)View on GitHub#

Bases: MultiCorpus

__init__(languages='en', base_path=None, in_memory=False, **corpusargs)View on GitHub#

Initialize the WikiAnn corpus for cross-lingual NER consisting of datasets from 282 languages that exist in Wikipedia.

See https://elisa-ie.github.io/wikiann/ for details and for the languages and their respective abbreviations, e.g. “en” for English. (license: https://opendatacommons.org/licenses/by/)

Parameters:
  • languages (Union[str, list[str]]) – Should be an abbreviation of a language (“en”, “de”,..) or a list of abbreviations. The datasets of all passed languages will be saved in one MultiCorpus. (Note that, even though listed on https://elisa-ie.github.io/wikiann/ some datasets are empty. This includes “aa”, “cho”, “ho”, “hz”, “ii”, “jam”, “kj”, “kr”, “mus”, “olo” and “tcy”.)

  • base_path (Union[str, Path], optional) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary. The data is in bio-format. It will by default (with the string “ner” as value) be transformed into the bioes format. If you do not want that, set it to None.

  • in_memory (bool, optional) – Specify that the dataset should be loaded in memory, which speeds up the training process but increases RAM usage significantly.
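Because some listed WikiAnn datasets are empty, it can help to filter the requested language codes before building the corpus. This is a minimal sketch, assuming the empty codes are exactly those named in the `languages` description above; `usable_wikiann_languages` and `load_wikiann` are hypothetical helpers, not flair functions.

```python
# Language codes that the documentation above lists as empty WikiAnn datasets.
EMPTY_WIKIANN_LANGUAGES = {
    "aa", "cho", "ho", "hz", "ii", "jam", "kj", "kr", "mus", "olo", "tcy",
}


def usable_wikiann_languages(languages):
    """Accept one code or a list of codes and drop the known-empty datasets."""
    if isinstance(languages, str):
        languages = [languages]
    return [code for code in languages if code not in EMPTY_WIKIANN_LANGUAGES]


def load_wikiann(languages="en"):
    """Build a NER_MULTI_WIKIANN corpus from the filtered language codes."""
    # Imported lazily; constructing the corpus downloads data on first use.
    from flair.datasets.sequence_labeling import NER_MULTI_WIKIANN

    return NER_MULTI_WIKIANN(languages=usable_wikiann_languages(languages))
```

For instance, `usable_wikiann_languages(["en", "aa", "de"])` keeps only `"en"` and `"de"`.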

class flair.datasets.sequence_labeling.NER_MULTI_XTREME(languages='en', base_path=None, in_memory=False, **corpusargs)View on GitHub#

Bases: MultiCorpus

__init__(languages='en', base_path=None, in_memory=False, **corpusargs)View on GitHub#

Xtreme corpus for cross-lingual NER consisting of datasets of a total of 40 languages.

The data comes from the Google Research XTREME project (google-research/xtreme) and is derived from the WikiAnn dataset https://elisa-ie.github.io/wikiann/ (license: https://opendatacommons.org/licenses/by/)

Parameters:
  • languages (Union[str, list[str]], optional) – Specify the languages you want to load. Provide an empty list or string to select all languages.

  • base_path (Union[str, Path], optional) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool, optional) – Specify that the dataset should be loaded in memory, which speeds up the training process but increases RAM usage significantly.

class flair.datasets.sequence_labeling.NER_MULTI_WIKINER(languages='en', base_path=None, in_memory=False, **corpusargs)View on GitHub#

Bases: MultiCorpus

class flair.datasets.sequence_labeling.NER_SWEDISH(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the NER_SWEDISH corpus for Swedish.

The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence – If True, all sentences of a document are read into a single Sentence object

_add_IOB2_tags(data_file, encoding='utf8')View on GitHub#

Function that adds IOB2 tags if only chunk names are provided.

For example, words may be tagged PER instead of B-PER or I-PER. The function replaces ‘0’ with ‘O’ as the no-chunk tag, since ColumnCorpus expects the letter ‘O’. It also removes lines with no tags from the data file, and it can be used if the data is only partially IOB tagged.

Parameters:
  • data_file (Union[str, Path]) – Path to the data file.

  • encoding (str, optional) – Encoding used in open function. The default is “utf8”.
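The conversion described above can be illustrated on a list of lines instead of a file. This is a simplified sketch of the idea, not flair’s `_add_IOB2_tags` implementation: the function name `add_iob2_tags` and the assumed “token tag” line layout (with an empty string as sentence boundary) are choices made for this example.

```python
def add_iob2_tags(lines):
    """Convert bare chunk names (e.g. PER) to IOB2 tags (B-PER / I-PER).

    `lines` holds "token tag" strings; an empty string marks a sentence
    boundary. Illustrative only; not flair's _add_IOB2_tags.
    """
    output = []
    previous_type = None  # entity type of the previous token, None if outside
    for line in lines:
        fields = line.split()
        if not fields:
            # Sentence boundary: keep it and reset the chunk context.
            output.append("")
            previous_type = None
            continue
        if len(fields) < 2:
            # Token without a tag: the line is dropped.
            continue
        token, tag = fields[0], fields[-1]
        if tag in ("O", "0"):
            # Normalize the digit zero to the letter O.
            output.append(f"{token} O")
            previous_type = None
        elif tag.startswith(("B-", "I-")):
            # Already IOB-tagged: keep as is and remember the entity type.
            output.append(f"{token} {tag}")
            previous_type = tag[2:]
        else:
            # Bare chunk name: continue the chunk if the type repeats.
            prefix = "I-" if tag == previous_type else "B-"
            output.append(f"{token} {prefix}{tag}")
            previous_type = tag
    return output
```

With bare chunk names, two adjacent entities of the same type cannot be told apart, so this sketch merges them into one chunk.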

class flair.datasets.sequence_labeling.NER_TURKU(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the Finnish TurkuNER corpus.

The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.NER_UKRAINIAN(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the Ukrainian NER corpus from lang-uk project.

The first time you call this constructor it will automatically download the dataset.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.KEYPHRASE_SEMEVAL2017(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.sequence_labeling.KEYPHRASE_INSPEC(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.sequence_labeling.KEYPHRASE_SEMEVAL2010(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

class flair.datasets.sequence_labeling.UP_CHINESE(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Initialize the Chinese dataset from the Universal Propositions Bank.

The dataset is downloaded from System-T/UniversalPropositions

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence (bool) – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.UP_ENGLISH(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Initialize the English dataset from the Universal Propositions Bank.

The dataset is downloaded from System-T/UniversalPropositions

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence (bool) – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.UP_FRENCH(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Initialize the French dataset from the Universal Propositions Bank.

The dataset is downloaded from System-T/UniversalPropositions

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence (bool) – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.UP_FINNISH(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Initialize the Finnish dataset from the Universal Propositions Bank.

The dataset is downloaded from System-T/UniversalPropositions

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence (bool) – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.UP_GERMAN(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Initialize the German dataset from the Universal Propositions Bank.

The dataset is downloaded from System-T/UniversalPropositions

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence (bool) – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.UP_ITALIAN(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Initialize the Italian dataset from the Universal Propositions Bank.

The dataset is downloaded from System-T/UniversalPropositions

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence (bool) – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.UP_SPANISH(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Initialize the Spanish dataset from the Universal Propositions Bank.

The dataset is downloaded from System-T/UniversalPropositions

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence (bool) – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.UP_SPANISH_ANCORA(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, document_as_sequence=False, **corpusargs)View on GitHub#

Initialize the Spanish AnCora dataset from the Universal Propositions Bank.

The dataset is downloaded from System-T/UniversalPropositions

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • document_as_sequence (bool) – If True, all sentences of a document are read into a single Sentence object

class flair.datasets.sequence_labeling.NER_HIPE_2022(dataset_name, language, base_path=None, in_memory=True, version='v2.1', branch_name='main', dev_split_name='dev', add_document_separator=False, sample_missing_splits=False, preproc_fn=None, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(dataset_name, language, base_path=None, in_memory=True, version='v2.1', branch_name='main', dev_split_name='dev', add_document_separator=False, sample_missing_splits=False, preproc_fn=None, **corpusargs)View on GitHub#

Initialize the CLEF-HIPE 2022 NER dataset.

The first time you call this constructor it will automatically download the specified dataset for the given language.

Parameters:
  • dataset_name – Supported datasets are: ajmc, hipe2020, letemps, newseye, sonar and topres19th.

  • language – Language for a supported dataset.

  • base_path – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory – If True, keeps dataset in memory giving speedups in training.

  • version – Version of CLEF-HIPE dataset. Currently only v1.0 is supported and available.

  • branch_name – Defines git branch name of HIPE data repository (main by default).

  • dev_split_name – Defines default name of development split (dev by default). Only the NewsEye dataset currently has two development splits: dev and dev2.

  • add_document_separator – If True, a special document separator will be introduced. This is highly recommended when using our FLERT approach.

  • sample_missing_splits – If True, data is automatically sampled when certain data splits are None.

  • preproc_fn – Function that is used for dataset preprocessing. If None, default preprocessing will be performed.

class flair.datasets.sequence_labeling.NER_ICDAR_EUROPEANA(language, base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(language, base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the ICDAR Europeana NER dataset.

The dataset is based on the French and Dutch Europeana NER corpora from the Europeana Newspapers NER dataset (https://lab.kb.nl/dataset/europeana-newspapers-ner), with additional preprocessing steps being performed (sentence splitting, punctuation normalizing, training/development/test splits). The resulting dataset is released in the “Data Centric Domain Adaptation for Historical Text with OCR Errors” ICDAR paper by Luisa März, Stefan Schweter, Nina Poerner, Benjamin Roth and Hinrich Schütze.

Parameters:
  • language (str) – Language for a supported dataset. Supported languages are “fr” (French) and “nl” (Dutch).

  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training. Not recommended due to heavy RAM usage.

class flair.datasets.sequence_labeling.NER_NERMUD(domains='all', base_path=None, in_memory=False, **corpusargs)View on GitHub#

Bases: MultiCorpus

__init__(domains='all', base_path=None, in_memory=False, **corpusargs)View on GitHub#

Initialize the NERMuD 2023 dataset.

NERMuD is a task presented at EVALITA 2023 consisting of the extraction and classification of named entities in a document, such as persons, organizations, and locations. NERMuD 2023 includes two different sub-tasks:

  • Domain-agnostic classification (DAC). Participants will be asked to select and classify entities among three categories (person, organization, location) in different types of texts (news, fiction, political speeches) using one single general model.

  • Domain-specific classification (DSC). Participants will be asked to deploy a different model for each of the above types, trying to increase the accuracy for each considered type.

Parameters:
  • domains (Union[str, list[str]]) – Domains to be used. Supported are “WN” (Wikinews), “FIC” (fiction), “ADG” (De Gasperi subset) and “all”.

  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training. Not recommended due to heavy RAM usage.
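The two sub-tasks map naturally onto the `domains` parameter: one corpus over all domains for a single general model, or one corpus per domain for separate models. The following is a sketch under that reading; `expand_domains`, `load_domain_agnostic` and `load_domain_specific` are hypothetical helpers, and the domain codes are taken from the parameter description above.

```python
NERMUD_DOMAINS = ("WN", "FIC", "ADG")  # domain codes listed above


def expand_domains(domains="all"):
    """Turn "all", a single code, or a list of codes into a validated list."""
    if domains == "all":
        return list(NERMUD_DOMAINS)
    if isinstance(domains, str):
        domains = [domains]
    unknown = [d for d in domains if d not in NERMUD_DOMAINS]
    if unknown:
        raise ValueError(f"Unsupported NERMuD domains: {unknown}")
    return list(domains)


def load_domain_agnostic():
    """One corpus over all domains, as a single general model (DAC) would use."""
    # Imported lazily; constructing the corpus downloads data on first use.
    from flair.datasets.sequence_labeling import NER_NERMUD

    return NER_NERMUD(domains="all")


def load_domain_specific(domains="all"):
    """One corpus per domain, as separate models (DSC) would use."""
    from flair.datasets.sequence_labeling import NER_NERMUD

    return {d: NER_NERMUD(domains=d) for d in expand_domains(domains)}
```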

class flair.datasets.sequence_labeling.NER_GERMAN_MOBIE(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

__init__(base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the German MobIE NER dataset.

The German MobIE Dataset was introduced in the MobIE paper (https://aclanthology.org/2021.konvens-1.22/).

This is a German-language dataset that has been human-annotated with 20 coarse- and fine-grained entity types, and it includes entity linking information for geographically linkable entities. The dataset comprises 3,232 social media texts and traffic reports, totaling 91K tokens, with 20.5K annotated entities, of which 13.1K are linked to a knowledge base. In total, 20 different named entities are annotated.

Parameters:
  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training. Not recommended due to heavy RAM usage.

class flair.datasets.sequence_labeling.NER_ESTONIAN_NOISY(version=0, base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: ColumnCorpus

data_url = 'https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/patnlp/estner.cnll.zip'#
label_url = 'https://raw.githubusercontent.com/uds-lsv/NoisyNER/master/data/only_labels'#
__init__(version=0, base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the NoisyNER corpus.

Parameters:
  • version (int) – Chooses the label set for the data: 0 (default) selects the clean labels, and 1 to 7 select different kinds of noisy label sets (details: https://ojs.aaai.org/index.php/AAAI/article/view/16938)

  • base_path (Optional[Union[str, Path]]) – Path to the data. Default is None, meaning the corpus gets automatically downloaded and saved. You can override this by passing a path to a directory containing the unprocessed files but typically this should not be necessary.

  • in_memory (bool) – If True the dataset is kept in memory achieving speedups in training.

  • **corpusargs – The arguments propagated to flair.datasets.ColumnCorpus.__init__.
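A typical use of the `version` parameter is to load the clean label set next to one noisy variant for comparison. This is a minimal sketch, assuming the valid range is 0 to 7 as described above; `check_noisy_ner_version` and `load_clean_and_noisy` are hypothetical helpers, not part of flair.

```python
def check_noisy_ner_version(version):
    """Return version if it is one of the documented label sets 0 to 7."""
    is_int = isinstance(version, int) and not isinstance(version, bool)
    if not is_int or not 0 <= version <= 7:
        raise ValueError(f"version must be an integer from 0 to 7, got {version!r}")
    return version


def load_clean_and_noisy(noisy_version):
    """Load the clean corpus (version 0) and one noisy variant."""
    # Imported lazily; constructing the corpus downloads data on first use.
    from flair.datasets.sequence_labeling import NER_ESTONIAN_NOISY

    clean = NER_ESTONIAN_NOISY(version=0)
    noisy = NER_ESTONIAN_NOISY(version=check_noisy_ner_version(noisy_version))
    return clean, noisy
```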

class flair.datasets.sequence_labeling.MASAKHA_POS(languages='bam', version='v1', base_path=None, in_memory=True, **corpusargs)View on GitHub#

Bases: MultiCorpus

__init__(languages='bam', version='v1', base_path=None, in_memory=True, **corpusargs)View on GitHub#

Initialize the MasakhaPOS corpus available on masakhane-io/masakhane-pos.

It consists of 20 African languages. Pass a language code or a list of language codes to initialize the corpus with the languages you require. If you pass “all”, all languages will be initialized.

Parameters:
  • version – Specifies version of the dataset. Currently, only “v1” is supported.

  • base_path (Union[str, Path, None]) – Default is None, meaning that corpus gets auto-downloaded and loaded. You can override this to point to a different folder but typically this should not be necessary.

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.