
class flair.datasets.biomedical.BIGBIO_NER_CORPUS(dataset_name, base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None, trust_remote_code=False)

Bases: ColumnCorpus

This class is an adapter for data sets provided by the BigBio framework.

See bigscience-workshop/biomedical on GitHub.

The BigBio framework harmonizes over 120 biomedical data sets and provides a uniform programming API to access them. This adapter gives access to all of its named entity recognition data sets through the bigbio_kb schema.

__init__(dataset_name, base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None, trust_remote_code=False)

Initialize the BigBio Corpus.

  • dataset_name (str) – Name of the dataset on the Hugging Face Hub (e.g. nlmchem or bigbio/nlmchem)

  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps the dataset in memory, which speeds up training.

  • sentence_splitter (Optional[SentenceSplitter]) – Custom implementation of SentenceSplitter which segments the text into sentences and tokens (default SciSpacySentenceSplitter)

  • train_split_name (Optional[str]) – Name of the training split in bigbio, usually train (default: None)

  • dev_split_name (Optional[str]) – Name of the development split in bigbio, usually validation (default: None)

  • test_split_name (Optional[str]) – Name of the test split in bigbio, usually test (default: None)
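As an illustration of the parameters above, here is a minimal sketch of constructing the corpus. The wrapper function `load_bigbio_corpus` is hypothetical and not part of flair; the dataset name is the example given above, and the split names are the "usual" values from the parameter descriptions, so they may need adjusting for a particular dataset. The import sits inside the function so that defining it does not require flair to be installed.

```python
def load_bigbio_corpus(dataset_name="bigbio/nlmchem"):
    """Hypothetical helper: build a BIGBIO_NER_CORPUS with explicit split names."""
    # Imported lazily; constructing the corpus downloads the data set on first use.
    from flair.datasets.biomedical import BIGBIO_NER_CORPUS

    return BIGBIO_NER_CORPUS(
        dataset_name=dataset_name,
        in_memory=True,               # keep sentences in memory for faster training
        train_split_name="train",     # assumed split names ("usually" these values)
        dev_split_name="validation",
        test_split_name="test",
    )
```

Calling `load_bigbio_corpus()` returns a corpus whose `train`, `dev` and `test` attributes hold the three splits. Because `sentence_splitter` is left at `None`, the default SciSpacySentenceSplitter is used.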


__init__(dataset_name[, base_path, ...])

Initialize the BigBio Corpus.

add_label_noise(label_type, labels[, ...])

Generates uniform label noise distribution in the chosen dataset split.

bin_search_passage(passages, low, high, entity)

Helper method to find the passage containing a given entity mention (incl. offset).

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

downsample([percentage, downsample_train, ...])

Randomly downsample the corpus to the given percentage (by removing data points).

filter_empty_sentences()

A method that filters all sentences consisting of 0 tokens.

filter_long_sentences(max_charlength)

A method that filters all sentences for which the plain text is longer than a specified number of characters.

get_all_sentences()

Returns all sentences (spanning all three splits) in the Corpus.

get_entity_type_mapping()

Return the mapping of entity type given in the dataset to canonical types.

get_label_distribution()

Counts occurrences of each label in the corpus and returns them as a dictionary object.

make_label_dictionary(label_type[, ...])

Creates a dictionary of all labels assigned to the sentences in the corpus.

make_tag_dictionary(tag_type)

Create a tag dictionary of a given label type.

make_vocab_dictionary([max_tokens, min_freq])

Creates a Dictionary of all tokens contained in the corpus.

obtain_statistics([label_type, pretty_print])

Print statistics about the corpus, including the length of the sentences and the labels in the corpus.

to_internal_dataset(dataset, split)

Converts a dataset given in Hugging Face datasets format to our internal corpus representation.

dev

The dev split as a torch.utils.data.Dataset object.

test

The test split as a torch.utils.data.Dataset object.

train

The training split as a torch.utils.data.Dataset object.
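Several of the methods summarized above are typically chained when preparing a corpus for an experiment. The following is a minimal sketch; the helper `prepare_corpus` is hypothetical, and the label type `"ner"` is an assumption about how this corpus names its entity annotations.

```python
def prepare_corpus(corpus, label_type="ner", percentage=0.1):
    """Hypothetical helper: shrink a corpus and collect its labels and statistics."""
    # Keep only a fraction of the data points, e.g. for a quick trial run.
    corpus.downsample(percentage)

    # Collect all labels of the given type that occur in the corpus.
    label_dictionary = corpus.make_label_dictionary(label_type=label_type)

    # Gather sentence-length and label statistics for the corpus.
    statistics = corpus.obtain_statistics(label_type=label_type, pretty_print=True)

    return label_dictionary, statistics
```

The helper accepts any corpus object that offers these three methods, so it is not tied to BIGBIO_NER_CORPUS specifically.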

get_entity_type_mapping()

Return the mapping of entity type given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.
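The discard rule can be illustrated with a small standalone sketch. This is not flair's internal code: the function `apply_type_mapping`, the `(text, type)` tuple layout and the example type names are all assumptions made for illustration.

```python
def apply_type_mapping(entities, mapping):
    """Map dataset-specific entity types to canonical ones, dropping unmapped types."""
    mapped = []
    for text, entity_type in entities:
        canonical = mapping.get(entity_type)
        if canonical is None:
            # Entity type is not in the map, so the entity is discarded.
            continue
        mapped.append((text, canonical))
    return mapped


# Hypothetical mapping and entities, for illustration only.
mapping = {"Chemical": "chemical", "Disease": "disease"}
entities = [("aspirin", "Chemical"), ("BRCA1", "Gene"), ("asthma", "Disease")]
result = apply_type_mapping(entities, mapping)
# result == [("aspirin", "chemical"), ("asthma", "disease")]
```

Here the "Gene" entity is dropped because the map has no entry for that type.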

Return type:


build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:


to_internal_dataset(dataset, split)

Converts a dataset given in Hugging Face datasets format to our internal corpus representation.

Return type:


bin_search_passage(passages, low, high, entity)

Helper method to find the passage containing a given entity mention (incl. offset).

The implementation uses binary search to find the passage in the ordered sequence passages.
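The idea can be sketched as a standalone function. This is an illustrative sketch, not flair's actual implementation: it assumes that passages are `(text, (start, end))` tuples sorted by start offset and non-overlapping, and that a mention is a `(start, end)` character span; the name `find_passage` is hypothetical.

```python
def find_passage(passages, mention_span):
    """Binary search for the passage whose offsets enclose the mention span.

    Returns the passage text and its start offset.
    """
    mention_start, mention_end = mention_span
    low, high = 0, len(passages) - 1

    while low <= high:
        mid = (low + high) // 2
        text, (passage_start, passage_end) = passages[mid]

        # Found: the mention lies completely inside this passage.
        if passage_start <= mention_start and mention_end <= passage_end:
            return text, passage_start

        if mention_start < passage_start:
            high = mid - 1  # mention starts before this passage: search left half
        else:
            low = mid + 1   # otherwise: search right half

    raise ValueError(f"No passage contains the span {mention_span}")


# Hypothetical document with two passages, for illustration only.
passages = [("Title text", (0, 10)), ("Abstract body here", (11, 29))]
found = find_passage(passages, (15, 23))
# found == ("Abstract body here", 11)
```

The returned start offset lets a caller convert the mention's document-level offsets into passage-relative ones by subtraction. A mention that straddles two passages is not contained in either, so this sketch raises an error for it.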