flair.datasets.entity_linking.BigBioEntityLinkingCorpus

class flair.datasets.entity_linking.BigBioEntityLinkingCorpus(base_path=None, label_type='el', norm_keys=['db_name', 'db_id'], **kwargs)

Bases: Corpus, ABC

This class implements an adapter for datasets provided through the BigBio framework.

See: bigscience-workshop/biomedical

The BigBio framework harmonizes over 120 biomedical datasets and provides a uniform programming API to access them. This adapter makes all named entity recognition datasets that follow the bigbio_kb schema usable in flair.
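Because this adapter inherits from ABC, it is not instantiated directly; each BigBio dataset is exposed through a concrete subclass. The following sketch shows the general shape of such a subclass based only on the constructor contract documented on this page; the class name is hypothetical, and any abstract data-loading hooks required by the base class are omitted here:

    from flair.datasets.entity_linking import BigBioEntityLinkingCorpus

    class NCBIDiseaseLinkingCorpus(BigBioEntityLinkingCorpus):
        """Hypothetical adapter for a single BigBio dataset.

        Any abstract hooks the base class requires for loading the
        underlying bigbio_kb data are not documented here and are omitted.
        """

        def __init__(self, base_path=None, **kwargs):
            # label_type="el" stores entity-linking annotations under that key;
            # norm_keys selects the fields of a BigBio normalization entry
            # that form the label value (database name + database id).
            super().__init__(
                base_path=base_path,
                label_type="el",
                norm_keys=["db_name", "db_id"],
                **kwargs,
            )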

__init__(base_path=None, label_type='el', norm_keys=['db_name', 'db_id'], **kwargs)

Initializes a Corpus, potentially sampling missing dev/test splits from train.

You can define the train, dev and test split by passing the corresponding Dataset object to the constructor. At least one split should be defined. If the option sample_missing_splits is set to True, missing splits will be randomly sampled from the train split. In most cases, you will not use the constructor yourself. Rather, you will create a corpus using one of our helper methods that read common NLP filetypes. For instance, you can use flair.datasets.sequence_labeling.ColumnCorpus to read CoNLL-formatted files directly into a Corpus.

Parameters:
  • train (Optional[Dataset[T_co]], optional) – Training data. Defaults to None.

  • dev (Optional[Dataset[T_co]], optional) – Development data. Defaults to None.

  • test (Optional[Dataset[T_co]], optional) – Testing data. Defaults to None.

  • name (str, optional) – Corpus name. Defaults to “corpus”.

  • sample_missing_splits (Union[bool, str], optional) – Policy for handling missing splits. True (default): sample dev(10%)/test(10%) from train. False: keep None. “only_dev”: sample only dev. “only_test”: sample only test.

  • random_seed (Optional[int], optional) – Seed for reproducible sampling. Defaults to None.
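
The sampling policy is easiest to see with one of the helper classes mentioned above. A minimal sketch using flair.datasets.sequence_labeling.ColumnCorpus follows; the folder and file names are placeholders, and it is assumed that the helper forwards these corpus-level arguments to this constructor:

    from flair.datasets import ColumnCorpus

    # Read a CoNLL-formatted file; column_format maps file columns to
    # annotation layers. Only a train file is given, so dev and test
    # are sampled from it.
    corpus = ColumnCorpus(
        "resources/my_dataset",              # placeholder folder
        column_format={0: "text", 1: "ner"},
        train_file="train.txt",              # placeholder file name
        sample_missing_splits=True,          # dev (10%) / test (10%) from train
        random_seed=42,                      # reproducible sampling
    )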

Methods

  • __init__([base_path, label_type, norm_keys]) – Initializes a Corpus, potentially sampling missing dev/test splits from train.

  • add_label_noise(label_type, labels[, ...]) – Adds artificial label noise to a specified split (in-place).

  • downsample([percentage, downsample_train, ...]) – Randomly downsamples the corpus to the given percentage (by removing data points).

  • filter_empty_sentences() – Removes all sentences consisting of zero tokens.

  • filter_long_sentences(max_charlength) – Removes all sentences whose plain text is longer than the specified number of characters.

  • get_all_sentences() – Returns all sentences (spanning all three splits) in the Corpus.

  • get_label_distribution() – Counts occurrences of each label in the corpus and returns them as a dictionary.

  • make_label_dictionary(label_type[, ...]) – Creates a Dictionary for a specific label type from the corpus.

  • make_tag_dictionary(tag_type) – DEPRECATED: Creates a tag dictionary ensuring 'O', '<START>' and '<STOP>' are included.

  • make_vocab_dictionary([max_tokens, min_freq]) – Creates a Dictionary of all tokens contained in the corpus.

  • obtain_statistics([label_type, pretty_print]) – Prints statistics about the corpus, including sentence lengths and the labels it contains.
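A short sketch chaining a few of these methods, assuming corpus is an instance of any concrete subclass with label type "el":

    # Assuming `corpus` was constructed as in the sketches above.
    corpus.filter_empty_sentences()    # drop zero-token sentences in place
    corpus.downsample(percentage=0.1)  # keep ~10% of each split, in place

    # Build a Dictionary of all "el" label values seen in the corpus.
    label_dict = corpus.make_label_dictionary(label_type="el")

    # Summary statistics (sentence lengths, label counts) for the corpus.
    stats = corpus.obtain_statistics(label_type="el")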

Attributes

  • dev – The dev split as a torch.utils.data.Dataset object.

  • test – The test split as a torch.utils.data.Dataset object.

  • train – The training split as a torch.utils.data.Dataset object.
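
The splits behave like any other dataset; a quick sketch, assuming the splits are flair's in-memory datasets that support len() and indexing:

    print(len(corpus.train))  # number of training sentences
    print(corpus.train[0])    # first training sentence with its "el" labels
    print(len(corpus.dev), len(corpus.test))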