flair.datasets.biomedical

class flair.datasets.biomedical.Entity(char_span, entity_type)

Bases: object

Internal class to represent entities while converting biomedical NER corpora to a standardized format.

Each entity consists of the char span it addresses in the original text as well as the type of entity (e.g. Chemical, Gene, and so on).

is_before(other_entity)

Checks whether this entity is located before the given one.

Parameters:

other_entity – Entity to check

Return type:

bool

contains(other_entity)

Checks whether the given entity is fully contained in this entity.

Parameters:

other_entity – Entity to check

Return type:

bool

overlaps(other_entity)

Checks whether this and the given entity overlap.

Parameters:

other_entity – Entity to check

Return type:

bool
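The three span checks can be illustrated with a minimal, self-contained sketch. It does not import flair: `SpanEntity` is a hypothetical stand-in for `Entity`, and the half-open `[start, end)` span convention is an assumption rather than something this page confirms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanEntity:
    """Hypothetical stand-in for Entity: a half-open char span [start, end) plus a type."""

    start: int
    end: int
    entity_type: str

    def is_before(self, other: "SpanEntity") -> bool:
        # This span ends at or before the position where the other one starts.
        return self.end <= other.start

    def contains(self, other: "SpanEntity") -> bool:
        # The other span lies completely inside this one.
        return self.start <= other.start and other.end <= self.end

    def overlaps(self, other: "SpanEntity") -> bool:
        # The two spans share at least one character.
        return self.start < other.end and other.start < self.end


gene = SpanEntity(0, 10, "Gene")
inner = SpanEntity(2, 5, "Gene")
later = SpanEntity(12, 20, "Chemical")
```

With these definitions, `gene.contains(inner)` and `gene.is_before(later)` both hold, while `gene.overlaps(later)` does not.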

class flair.datasets.biomedical.InternalBioNerDataset(documents, entities_per_document, entity_types=[])

Bases: object

Internal class to represent a corpus and its entities.

class flair.datasets.biomedical.DpEntry(position_end, entity_count, entity_lengths_sum, last_entity)

Bases: tuple

position_end: int

Alias for field number 0

entity_count: int

Alias for field number 1

entity_lengths_sum: int

Alias for field number 2

last_entity: Optional[Entity]

Alias for field number 3
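Because `DpEntry` derives from `tuple` and each attribute is an alias for a field number, it behaves like a named tuple. A minimal sketch of that behaviour follows; `DpEntrySketch` is a hypothetical name, and a plain `(start, end, type)` tuple stands in for `Entity`.

```python
from typing import NamedTuple, Optional, Tuple


class DpEntrySketch(NamedTuple):
    """Hypothetical named tuple mirroring the four documented fields."""

    position_end: int
    entity_count: int
    entity_lengths_sum: int
    last_entity: Optional[Tuple[int, int, str]]  # placeholder for an Entity


entry = DpEntrySketch(position_end=10, entity_count=1, entity_lengths_sum=10, last_entity=(0, 10, "Gene"))

# Named tuples are immutable, so _replace returns a modified copy.
updated = entry._replace(entity_count=2)
```

Fields can be read by name (`entry.position_end`) or by index (`entry[0]`).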

flair.datasets.biomedical.merge_datasets(data_sets)

flair.datasets.biomedical.filter_and_map_entities(dataset, entity_type_to_canonical)
Return type:

InternalBioNerDataset
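The page does not describe what `filter_and_map_entities` does. Judging only from its name and the `entity_type_to_canonical` parameter, one plausible reading is that it keeps entities whose type has a canonical name and renames them. The sketch below illustrates that reading on a simplified, hypothetical data model (document id mapped to `(start, end, type)` tuples); it is not flair's implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical simplified data model: document id -> list of (start, end, type).
Annotations = Dict[str, List[Tuple[int, int, str]]]


def filter_and_map(annotations: Annotations, type_to_canonical: Dict[str, str]) -> Annotations:
    """Keep entities whose type has a canonical name and rename them to it."""
    mapped: Annotations = {}
    for doc_id, entities in annotations.items():
        mapped[doc_id] = [
            (start, end, type_to_canonical[entity_type])
            for start, end, entity_type in entities
            if entity_type in type_to_canonical  # drop types without a canonical name
        ]
    return mapped


raw = {"doc1": [(0, 4, "protein"), (10, 17, "cell_line"), (20, 25, "DNA")]}
result = filter_and_map(raw, {"protein": "Gene", "DNA": "Gene"})
```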

flair.datasets.biomedical.filter_nested_entities(dataset)
Return type:

None

flair.datasets.biomedical.bioc_to_internal(bioc_file)

Helper function to parse corpora that are given in BioC format. See http://bioc.sourceforge.net/ for details.
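To give a feel for the input, here is a minimal sketch that reads entity annotations from a BioC-style XML string using only the standard library. It is not flair's parser: the `parse_bioc` helper, its return shape, and the sample document are assumptions made for illustration, and the sketch converts offsets so that they are relative to each passage.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<collection>
  <document>
    <id>doc1</id>
    <passage>
      <offset>0</offset>
      <text>BRCA1 is linked to breast cancer.</text>
      <annotation id="1">
        <infon key="type">Gene</infon>
        <location offset="0" length="5"/>
        <text>BRCA1</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def parse_bioc(xml_text):
    """Return a list of (document id, passage text, [(start, end, type)]) triples."""
    passages = []
    root = ET.fromstring(xml_text)
    for doc in root.iter("document"):
        doc_id = doc.findtext("id")
        for passage in doc.iter("passage"):
            passage_offset = int(passage.findtext("offset") or 0)
            spans = []
            for ann in passage.iter("annotation"):
                # The entity type is stored in the infon element with key "type".
                entity_type = next((i.text for i in ann.iter("infon") if i.get("key") == "type"), None)
                location = ann.find("location")
                start = int(location.get("offset")) - passage_offset
                spans.append((start, start + int(location.get("length")), entity_type))
            passages.append((doc_id, passage.findtext("text") or "", spans))
    return passages


parsed = parse_bioc(SAMPLE)
```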

flair.datasets.biomedical.brat_to_internal(corpus_dir, ann_file_suffixes=None)

Helper function to parse corpora that are annotated using BRAT. See https://brat.nlplab.org/ for details.

Return type:

InternalBioNerDataset
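BRAT stores annotations in standoff `.ann` files, where text-bound annotations are tab-separated lines such as `T1<TAB>Chemical 0 7<TAB>Aspirin`. The sketch below parses such lines from a string. `parse_brat_ann` is a hypothetical helper rather than part of flair, and it deliberately skips discontinuous spans.

```python
def parse_brat_ann(ann_text):
    """Parse text-bound annotations (lines starting with 'T') from BRAT standoff text."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, attributes and notes
        _, type_and_span, _mention = line.split("\t", 2)
        if ";" in type_and_span:
            continue  # discontinuous span, ignored in this sketch
        entity_type, start, end = type_and_span.split(" ")
        entities.append((int(start), int(end), entity_type))
    return entities


text = "Aspirin inhibits COX-1."
ann = "T1\tChemical 0 7\tAspirin\nT2\tGene 17 22\tCOX-1\nR1\tInhibits Arg1:T1 Arg2:T2"
entities = parse_brat_ann(ann)
```

Each resulting `(start, end, type)` tuple indexes directly into the document text, for example `text[17:22]`.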

class flair.datasets.biomedical.CoNLLWriter(sentence_splitter)

Bases: object

Utility class for writing InternalBioNerDataset to CoNLL files.

__init__(sentence_splitter)

Initialize CoNLLWriter.

Parameters:

sentence_splitter (SentenceSplitter) – Sentence splitter which segments the text into sentences and tokens.

process_dataset(datasets, out_dir)

write_to_conll(dataset, output_file)
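As a rough illustration of what writing span annotations to a CoNLL-style file involves, the sketch below emits one token per line with a BIO tag. Everything here is an assumption for illustration: the `write_conll` helper is hypothetical, whitespace tokenization replaces the configurable sentence splitter, and the exact column layout used by `CoNLLWriter` is not stated on this page.

```python
import io
import re


def write_conll(text, entities, out):
    """Write one whitespace-tokenized sentence as 'token tag' lines using BIO tags."""
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, entity_type in entities:
            if start <= match.start() < end:
                # The first token of an entity gets B-, later tokens get I-.
                tag = ("B-" if match.start() == start else "I-") + entity_type
                break
        out.write(f"{match.group()} {tag}\n")
    out.write("\n")  # a blank line ends the sentence


buffer = io.StringIO()
write_conll("Chronic lymphocytic leukemia is common", [(0, 28, "Disease")], buffer)
```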
class flair.datasets.biomedical.HunerDataset(base_path=None, in_memory=True, sentence_splitter=None)

Bases: ColumnCorpus, ABC

Base class for HUNER datasets.

Every subclass has to implement the following methods:
  • “to_internal”, which reads the complete data set (incl. train, dev, test) and returns the corpus as InternalBioNerDataset

  • “split_url”, which returns the base url (i.e. without ‘.train’, ‘.dev’, ‘.test’) to the HUNER split files
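The two-method contract above can be sketched with a plain abstract base class. The sketch does not import flair: `HunerLikeBase`, `ToyCorpus`, the `split_file_urls` helper, and the example URL are all hypothetical, and the real base class additionally derives from `ColumnCorpus`.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Union


class HunerLikeBase(ABC):
    """Hypothetical stand-in for the base class; it does not inherit from ColumnCorpus."""

    @abstractmethod
    def to_internal(self, data_folder: str) -> Dict[str, str]:
        """Read the complete data set and return it in an internal representation."""

    @staticmethod
    @abstractmethod
    def split_url() -> Union[str, List[str]]:
        """Return the base URL(s) of the split files, without '.train', '.dev', '.test'."""

    def split_file_urls(self) -> List[str]:
        # Illustrative helper: expand each base URL into its three split files.
        urls = self.split_url()
        bases = [urls] if isinstance(urls, str) else urls
        return [f"{base}.{split}" for base in bases for split in ("train", "dev", "test")]


class ToyCorpus(HunerLikeBase):
    def to_internal(self, data_folder: str) -> Dict[str, str]:
        return {"doc1": "toy text"}

    @staticmethod
    def split_url() -> str:
        return "https://example.org/splits/toy"  # placeholder URL, not a real split file


toy_urls = ToyCorpus().split_file_urls()
```

A subclass that omits either method cannot be instantiated, which is how the "has to implement" requirement is enforced in this sketch.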

For further information see:
abstract to_internal(data_folder)
Return type:

InternalBioNerDataset

abstract static split_url()
Return type:

Union[str, List[str]]

get_corpus_sentence_splitter()

Return the pre-defined sentence splitter if defined, otherwise return None.

Return type:

Optional[SentenceSplitter]

__init__(base_path=None, in_memory=True, sentence_splitter=None)

Initialize the HUNER corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Custom implementation of SentenceSplitter which segments the text into sentences and tokens (default SciSpacySentenceSplitter)

get_subset(dataset, split, split_dir)

class flair.datasets.biomedical.BIO_INFER(base_path=None, in_memory=True)

Bases: ColumnCorpus

Original BioInfer corpus.

For further information see Pyysalo et al.: BioInfer: a corpus for information extraction in the biomedical domain https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-50

__init__(base_path=None, in_memory=True)

Initialize the BioInfer corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

classmethod download_dataset(data_dir)
Return type:

Path

classmethod parse_dataset(original_file)

class flair.datasets.biomedical.HUNER_GENE_BIO_INFER(*args, **kwargs)

Bases: HunerDataset

HUNER version of the BioInfer corpus containing only gene/protein annotations.

static split_url()
Return type:

str

to_internal(data_dir)
Return type:

InternalBioNerDataset

get_entity_type_mapping()
Return type:

Optional[Dict]

class flair.datasets.biomedical.JNLPBA(base_path=None, in_memory=True)

Bases: ColumnCorpus

Original corpus of the JNLPBA shared task.

For further information see Kim et al.: Introduction to the Bio-Entity Recognition Task at JNLPBA https://www.aclweb.org/anthology/W04-1213.pdf

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

__init__(base_path=None, in_memory=True)

Initialize the JNLPBA corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

class flair.datasets.biomedical.HunerJNLPBA

Bases: object

classmethod download_and_prepare_train(data_folder, sentence_tag)
Return type:

InternalBioNerDataset

classmethod download_and_prepare_test(data_folder, sentence_tag)
Return type:

InternalBioNerDataset

classmethod read_file(input_iob_file, sentence_tag)
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.HUNER_JNLPBA(entity_type_mapping, *args, **kwargs)

Bases: HunerDataset

HUNER version of the JNLPBA corpus.

static split_url()
Return type:

str

get_corpus_sentence_splitter()

Return the pre-defined sentence splitter if defined, otherwise return None.

Return type:

SentenceSplitter

to_internal(data_dir)
Return type:

InternalBioNerDataset

get_entity_type_mapping()
Return type:

Optional[Dict]

class flair.datasets.biomedical.HUNER_GENE_JNLPBA(*args, **kwargs)

Bases: HUNER_JNLPBA

HUNER version of the JNLPBA corpus containing gene annotations.

class flair.datasets.biomedical.HUNER_CELL_LINE_JNLPBA(*args, **kwargs)

Bases: HUNER_JNLPBA

HUNER version of the JNLPBA corpus containing cell line annotations.

class flair.datasets.biomedical.HUNER_ALL_JNLPBA(*args, **kwargs)

Bases: HUNER_JNLPBA

HUNER version of the JNLPBA corpus containing gene and cell line annotations.

class flair.datasets.biomedical.CELL_FINDER(base_path=None, in_memory=True, sentence_splitter=None)

Bases: ColumnCorpus

Original CellFinder corpus containing cell line, species and gene annotations.

For further information see Neves et al.: Annotating and evaluating text for stem cell research https://pdfs.semanticscholar.org/38e3/75aeeeb1937d03c3c80128a70d8e7a74441f.pdf

__init__(base_path=None, in_memory=True, sentence_splitter=None)

Initialize the CellFinder corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Custom implementation of SentenceSplitter which segments the text into sentences and tokens.

classmethod download_and_prepare(data_folder)
Return type:

InternalBioNerDataset

classmethod read_folder(data_folder)
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.HUNER_CELL_LINE_CELL_FINDER(*args, **kwargs)

Bases: HunerDataset

HUNER version of the CellFinder corpus containing only cell line annotations.

static split_url()
Return type:

str

to_internal(data_dir)
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.HUNER_SPECIES_CELL_FINDER(*args, **kwargs)

Bases: HunerDataset

HUNER version of the CellFinder corpus containing only species annotations.

static split_url()
Return type:

str

to_internal(data_dir)
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.HUNER_GENE_CELL_FINDER(*args, **kwargs)

Bases: HunerDataset

HUNER version of the CellFinder corpus containing only gene annotations.

static split_url()
Return type:

str

to_internal(data_dir)
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.HUNER_ALL_CELL_FINDER(*args, **kwargs)

Bases: HunerDataset

HUNER version of the CellFinder corpus containing cell line, species and gene annotations.

static split_url()
Return type:

List[str]

to_internal(data_dir)
Return type:

InternalBioNerDataset

get_entity_type_mapping()
Return type:

Optional[Dict]

class flair.datasets.biomedical.MIRNA(base_path=None, in_memory=True, sentence_splitter=None)

Bases: ColumnCorpus

Original miRNA corpus.

For further information see Bagewadi et al.: Detecting miRNA Mentions and Relations in Biomedical Literature https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4602280/

__init__(base_path=None, in_memory=True, sentence_splitter=None)

Initialize the miRNA corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • tokenizer – Callable that segments a sentence into words, defaults to scispacy

  • sentence_splitter (Optional[SentenceSplitter]) – Callable that segments a document into sentences, defaults to scispacy

classmethod download_and_prepare_train(data_folder, sentence_separator)

classmethod download_and_prepare_test(data_folder, sentence_separator)

classmethod parse_file(input_file, split, sentence_separator)
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.HunerMiRNAHelper

Bases: object

static get_mirna_subset(dataset, split_url, split_dir)

class flair.datasets.biomedical.HUNER_MIRNA(entity_type_mapping, *args, **kwargs)

Bases: HunerDataset

HUNER version of the miRNA corpus.

static split_url()
Return type:

str

get_subset(dataset, split, split_dir)

get_corpus_sentence_splitter()

Return the pre-defined sentence splitter if defined, otherwise return None.

to_internal(data_dir)
Return type:

InternalBioNerDataset

get_entity_type_mapping()
Return type:

Optional[Dict]

class flair.datasets.biomedical.HUNER_GENE_MIRNA(*args, **kwargs)

Bases: HUNER_MIRNA

HUNER version of the miRNA corpus containing protein / gene annotations.

class flair.datasets.biomedical.HUNER_SPECIES_MIRNA(*args, **kwargs)

Bases: HUNER_MIRNA

HUNER version of the miRNA corpus containing species annotations.

class flair.datasets.biomedical.HUNER_DISEASE_MIRNA(*args, **kwargs)

Bases: HUNER_MIRNA

HUNER version of the miRNA corpus containing disease annotations.

class flair.datasets.biomedical.HUNER_ALL_MIRNA(*args, **kwargs)

Bases: HUNER_MIRNA

HUNER version of the miRNA corpus containing gene, species and disease annotations.

class flair.datasets.biomedical.KaewphanCorpusHelper

Bases: object

Helper class for the corpora from Kaewphan et al., i.e. CLL and Gellus.

static download_cll_dataset(data_folder)

static prepare_and_save_dataset(nersuite_folder, output_file)

static download_gellus_dataset(data_folder)

static read_dataset(nersuite_folder, sentence_separator)
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.CLL(base_path=None, in_memory=True)

Bases: ColumnCorpus

Original CLL corpus containing cell line annotations.

For further information, see Kaewphan et al.: Cell line name recognition in support of the identification of synthetic lethality in cancer from text https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4708107/

__init__(base_path=None, in_memory=True)

Initialize the CLL corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training

class flair.datasets.biomedical.HUNER_CELL_LINE_CLL(*args, **kwargs)

Bases: HunerDataset

HUNER version of the CLL corpus containing cell line annotations.

static split_url()
Return type:

str

get_corpus_sentence_splitter()

Return the pre-defined sentence splitter if defined, otherwise return None.

Return type:

SentenceSplitter

to_internal(data_dir)
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.GELLUS(base_path=None, in_memory=True)

Bases: ColumnCorpus

Original Gellus corpus containing cell line annotations.

For further information, see Kaewphan et al.: Cell line name recognition in support of the identification of synthetic lethality in cancer from text https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4708107/

__init__(base_path=None, in_memory=True)

Initialize the GELLUS corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training

class flair.datasets.biomedical.HUNER_CELL_LINE_GELLUS(*args, **kwargs)

Bases: HunerDataset

HUNER version of the Gellus corpus containing cell line annotations.

static split_url()
Return type:

str

get_corpus_sentence_splitter()

Return the pre-defined sentence splitter if defined, otherwise return None.

Return type:

SentenceSplitter

to_internal(data_dir)
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.LOCTEXT(base_path=None, in_memory=True, sentence_splitter=None)

Bases: ColumnCorpus

Original LOCTEXT corpus containing species annotations.

For further information see Cejuela et al.: LocText: relation extraction of protein localizations to assist database curation https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2021-9

__init__(base_path=None, in_memory=True, sentence_splitter=None)

Initialize the LOCTEXT corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Custom implementation of SentenceSplitter that segments a document into sentences and tokens (default SciSpacySentenceSplitter)

static download_dataset(data_dir)

static parse_dataset(data_dir)
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.HUNER_LOCTEXT(entity_type_mapping, *args, **kwargs)

Bases: HunerDataset

HUNER version of the Loctext corpus.

static split_url()
Return type:

str

to_internal(data_dir)
Return type:

InternalBioNerDataset

get_entity_type_mapping()
Return type:

Optional[Dict]

class flair.datasets.biomedical.HUNER_SPECIES_LOCTEXT(*args, **kwargs)

Bases: HUNER_LOCTEXT

HUNER version of the Loctext corpus containing species annotations.

class flair.datasets.biomedical.HUNER_GENE_LOCTEXT(*args, **kwargs)

Bases: HUNER_LOCTEXT

HUNER version of the Loctext corpus containing protein annotations.

class flair.datasets.biomedical.HUNER_ALL_LOCTEXT(*args, **kwargs)

Bases: HUNER_LOCTEXT

HUNER version of the Loctext corpus containing species and protein annotations.

class flair.datasets.biomedical.CHEMDNER(base_path=None, in_memory=True, sentence_splitter=None)

Bases: ColumnCorpus

Original corpus of the CHEMDNER shared task.

For further information see Krallinger et al.: The CHEMDNER corpus of chemicals and drugs and its annotation principles https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-7-S1-S2

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

__init__(base_path=None, in_memory=True, sentence_splitter=None)

Initialize the CHEMDNER corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Custom implementation of SentenceSplitter which segments documents into sentences and tokens

static download_dataset(data_dir)

class flair.datasets.biomedical.HUNER_CHEMICAL_CHEMDNER(*args, **kwargs)

Bases: HunerDataset

HUNER version of the CHEMDNER corpus containing chemical annotations.

static split_url()
Return type:

str

to_internal(data_dir)
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.IEPA(base_path=None, in_memory=True)

Bases: ColumnCorpus

IEPA corpus as provided by http://corpora.informatik.hu-berlin.de/.

For further information see Ding, Berleant, Nettleton, Wurtele: Mining MEDLINE: abstracts, sentences, or phrases? https://www.ncbi.nlm.nih.gov/pubmed/11928487

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

__init__(base_path=None, in_memory=True)

Initialize the IEPA corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

static download_dataset(data_dir)

classmethod parse_dataset(original_file)

class flair.datasets.biomedical.HUNER_GENE_IEPA(*args, **kwargs)

Bases: HunerDataset

HUNER version of the IEPA corpus containing gene annotations.

static split_url()
Return type:

str

to_internal(data_dir)
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.LINNEAUS(base_path=None, in_memory=True, tokenizer=None)

Bases: ColumnCorpus

Original LINNEAUS corpus containing species annotations.

For further information see Gerner et al.: LINNAEUS: a species name identification system for biomedical literature https://www.ncbi.nlm.nih.gov/pubmed/20149233

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

__init__(base_path=None, in_memory=True, tokenizer=None)

Initialize the LINNEAUS corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • tokenizer (Optional[Tokenizer]) – Custom implementation of Tokenizer which segments sentence into tokens (default SciSpacyTokenizer)

static download_and_parse_dataset(data_dir)

class flair.datasets.biomedical.HUNER_SPECIES_LINNEAUS(*args, **kwargs)

Bases: HunerDataset

HUNER version of the LINNEAUS corpus containing species annotations.

static split_url()
Return type:

str

to_internal(data_dir)
Return type:

InternalBioNerDataset

get_entity_type_mapping()
Return type:

Optional[Dict]

class flair.datasets.biomedical.CDR(base_path=None, in_memory=True, sentence_splitter=None)

Bases: ColumnCorpus

CDR corpus as provided by JHnlp/BioCreative-V-CDR-Corpus.

For further information see Li et al.: BioCreative V CDR task corpus: a resource for chemical disease relation extraction https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4860626/

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

__init__(base_path=None, in_memory=True, sentence_splitter=None)

Initialize the CDR corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)

static download_dataset(data_dir)

class flair.datasets.biomedical.HUNER_DISEASE_CDR(*args, **kwargs)

Bases: HunerDataset

HUNER version of the CDR corpus containing disease annotations.

static split_url()
Return type:

str

to_internal(data_dir)
Return type:

InternalBioNerDataset

get_entity_type_mapping()
Return type:

Optional[Dict]

class flair.datasets.biomedical.HUNER_CHEMICAL_CDR(*args, **kwargs)

Bases: HunerDataset

HUNER version of the CDR corpus containing chemical annotations.

static split_url()
Return type:

str

to_internal(data_dir)
Return type:

InternalBioNerDataset

get_entity_type_mapping()
Return type:

Optional[Dict]

class flair.datasets.biomedical.HUNER_ALL_CDR(*args, **kwargs)

Bases: HunerDataset

HUNER version of the CDR corpus containing disease and chemical annotations.

static split_url()
Return type:

List[str]

to_internal(data_dir)
Return type:

InternalBioNerDataset

get_entity_type_mapping()
Return type:

Optional[Dict]

class flair.datasets.biomedical.VARIOME(base_path=None, in_memory=True, sentence_splitter=None)

Bases: ColumnCorpus

Variome corpus as provided by http://corpora.informatik.hu-berlin.de/corpora/brat2bioc/hvp_bioc.xml.zip.

For further information see Verspoor et al.: Annotating the biomedical literature for the human variome https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3676157/

__init__(base_path=None, in_memory=True, sentence_splitter=None)

Initialize the Variome corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)

static download_dataset(data_dir)

static parse_corpus(corpus_xml)
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.HUNER_GENE_VARIOME(*args, **kwargs)

Bases: HunerDataset

HUNER version of the Variome corpus containing gene annotations.

static split_url()
Return type:

str

to_internal(data_dir)
Return type:

InternalBioNerDataset

get_entity_type_mapping()
Return type:

Optional[Dict]

class flair.datasets.biomedical.HUNER_DISEASE_VARIOME(*args, **kwargs)

Bases: HunerDataset

HUNER version of the Variome corpus containing disease annotations.

static split_url()
Return type:

str

to_internal(data_dir)
Return type:

InternalBioNerDataset

get_entity_type_mapping()
Return type:

Optional[Dict]

class flair.datasets.biomedical.HUNER_SPECIES_VARIOME(*args, **kwargs)

Bases: HunerDataset

HUNER version of the Variome corpus containing species annotations.

static split_url()
Return type:

str

to_internal(data_dir)
Return type:

InternalBioNerDataset

get_entity_type_mapping()
Return type:

Optional[Dict]

class flair.datasets.biomedical.HUNER_ALL_VARIOME(*args, **kwargs)

Bases: HunerDataset

HUNER version of the Variome corpus containing gene, disease and species annotations.

static split_url()
Return type:

List[str]

to_internal(data_dir)
Return type:

InternalBioNerDataset

get_entity_type_mapping()
Return type:

Optional[Dict]

class flair.datasets.biomedical.NCBI_DISEASE(base_path=None, in_memory=True, sentence_splitter=None)

Bases: ColumnCorpus

Original NCBI disease corpus containing disease annotations.

For further information see Dogan et al.: NCBI disease corpus: a resource for disease name recognition and concept normalization https://www.ncbi.nlm.nih.gov/pubmed/24393765

__init__(base_path=None, in_memory=True, sentence_splitter=None)

Initialize the NCBI disease corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)

classmethod download_corpus(data_dir)
Return type:

Path

static patch_training_file(orig_train_file, patched_file)

static parse_input_file(input_file)

class flair.datasets.biomedical.HUNER_DISEASE_NCBI(*args, **kwargs)

Bases: HunerDataset

HUNER version of the NCBI corpus containing disease annotations.

static split_url()
Return type:

str

to_internal(data_dir)
Return type:

InternalBioNerDataset

get_entity_type_mapping()
Return type:

Optional[Dict]

class flair.datasets.biomedical.ScaiCorpus(base_path=None, in_memory=True, sentence_splitter=None)

Bases: ColumnCorpus

Base class to support the SCAI chemicals and disease corpora.

__init__(base_path=None, in_memory=True, sentence_splitter=None)

Initialize the SCAI corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)

download_corpus(data_folder)
Return type:

Path

static parse_input_file(input_file)

class flair.datasets.biomedical.SCAI_CHEMICALS(*args, **kwargs)

Bases: ScaiCorpus

Original SCAI chemicals corpus containing chemical annotations.

For further information see Kolářik et al.: Chemical Names: Terminological Resources and Corpora Annotation https://pub.uni-bielefeld.de/record/2603498

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

download_corpus(data_dir)
Return type:

Path

static perform_corpus_download(data_dir)
Return type:

Path

class flair.datasets.biomedical.SCAI_DISEASE(*args, **kwargs)

Bases: ScaiCorpus

Original SCAI disease corpus containing disease annotations.

For further information see Gurulingappa et al.: An Empirical Evaluation of Resources for the Identification of Diseases and Adverse Effects in Biomedical Literature https://pub.uni-bielefeld.de/record/2603398

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

download_corpus(data_dir)
Return type:

Path

static perform_corpus_download(data_dir)
Return type:

Path

class flair.datasets.biomedical.HUNER_CHEMICAL_SCAI(*args, **kwargs)

Bases: HunerDataset

HUNER version of the SCAI chemicals corpus containing chemical annotations.

static split_url()
Return type:

str

to_internal(data_dir)
Return type:

InternalBioNerDataset

get_entity_type_mapping()
Return type:

Optional[Dict]

class flair.datasets.biomedical.HUNER_DISEASE_SCAI(*args, **kwargs)

Bases: HunerDataset

HUNER version of the SCAI disease corpus containing disease annotations.

static split_url()
Return type:

str

to_internal(data_dir)
Return type:

InternalBioNerDataset

get_entity_type_mapping()
Return type:

Optional[Dict]

class flair.datasets.biomedical.HUNER_ALL_SCAI(*args, **kwargs)

Bases: HunerDataset

HUNER version of the SCAI corpora containing chemical and disease annotations.

static split_url()
Return type:

List[str]

to_internal(data_dir)
Return type:

InternalBioNerDataset

get_entity_type_mapping()
Return type:

Optional[Dict]

class flair.datasets.biomedical.OSIRIS(base_path=None, in_memory=True, sentence_splitter=None, load_original_unfixed_annotation=False)

Bases: ColumnCorpus

Original OSIRIS corpus containing variation and gene annotations.

For further information see Furlong et al.: Osiris v1.2: a named entity recognition system for sequence variants of genes in biomedical literature https://www.ncbi.nlm.nih.gov/pubmed/18251998

Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

__init__(base_path=None, in_memory=True, sentence_splitter=None, load_original_unfixed_annotation=False)

Initialize the OSIRIS corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)

  • load_original_unfixed_annotation – The original annotation of Osiris erroneously annotates two sentences as a protein. Set to True if you don’t want the fixed version.

classmethod download_dataset(data_dir)
Return type:

Path

classmethod parse_dataset(corpus_folder, fix_annotation=True)

class flair.datasets.biomedical.HUNER_GENE_OSIRIS(*args, **kwargs)

Bases: HunerDataset

HUNER version of the OSIRIS corpus containing (only) gene annotations.

static split_url()
Return type:

str

to_internal(data_dir)
Return type:

InternalBioNerDataset

get_entity_type_mapping()
Return type:

Optional[Dict]

class flair.datasets.biomedical.S800(base_path=None, in_memory=True, sentence_splitter=None)

Bases: ColumnCorpus

S800 corpus.

For further information see Pafilis et al.: The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0065390.

__init__(base_path=None, in_memory=True, sentence_splitter=None)

Initialize the S800 corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)

static download_dataset(data_dir)

static parse_dataset(data_dir)
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.HUNER_SPECIES_S800(*args, **kwargs)

Bases: HunerDataset

HUNER version of the S800 corpus containing species annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[Dict]

class flair.datasets.biomedical.GPRO(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original GPRO corpus containing gene annotations.

For further information see: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/gpro-detailed-task-description/

__init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Initialize the GPRO corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)

classmethod download_train_corpus(data_dir)View on GitHub#
Return type:

Path

classmethod download_dev_corpus(data_dir)View on GitHub#
Return type:

Path

static parse_input_file(text_file, ann_file)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.HUNER_GENE_GPRO(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the GPRO corpus containing gene annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[Dict]

class flair.datasets.biomedical.DECA(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original DECA corpus containing gene annotations.

For further information see Wang et al.: Disambiguating the species of biomedical named entities using natural language parsers https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2828111/

__init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Initialize the DECA corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)

classmethod download_corpus(data_dir)View on GitHub#
Return type:

Path

static parse_corpus(text_dir, gold_file)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.HUNER_GENE_DECA(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the DECA corpus containing gene annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[Dict]

class flair.datasets.biomedical.FSU(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

Original FSU corpus containing protein and derived annotations.

For further information see Hahn et al.: A proposal for a configurable silver standard https://www.aclweb.org/anthology/W10-1838/

__init__(base_path=None, in_memory=True)View on GitHub#

Initialize the FSU corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

classmethod download_corpus(data_dir)View on GitHub#
Return type:

Path

static parse_corpus(corpus_dir, sentence_separator)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.HUNER_GENE_FSU(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the FSU corpus containing (only) gene annotations.

static split_url()View on GitHub#
Return type:

str

get_corpus_sentence_splitter()View on GitHub#

Return the pre-defined sentence splitter if defined, otherwise return None.

Return type:

SentenceSplitter

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[Dict]

class flair.datasets.biomedical.CRAFT(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original CRAFT corpus (version 2.0) containing all but the coreference and sections/typography annotations.

For further information see Bada et al.: Concept annotation in the craft corpus https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-161

__init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Initialize the CRAFT corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)

classmethod download_corpus(data_dir)View on GitHub#
Return type:

Path

static parse_corpus(corpus_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.BIOSEMANTICS(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original Biosemantics corpus.

For further information see Akhondi et al.: Annotated chemical patent corpus: a gold standard for text mining https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4182036/

__init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Initialize the Biosemantics corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)

static download_dataset(data_dir)View on GitHub#
Return type:

Path

static parse_dataset(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.BC2GM(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original BioCreative-II-GM corpus containing gene annotations.

For further information see Smith et al.: Overview of BioCreative II gene mention recognition https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2559986/

Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

__init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Initialize the BioCreative-II-GM corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)

static download_dataset(data_dir)View on GitHub#
Return type:

Path

classmethod parse_train_dataset(data_folder)View on GitHub#
Return type:

InternalBioNerDataset

classmethod parse_test_dataset(data_folder)View on GitHub#
Return type:

InternalBioNerDataset

static parse_dataset(text_file, ann_file)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.HUNER_GENE_BC2GM(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the BioCreative-II-GM corpus containing gene annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.CEMP(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Original CEMP corpus containing chemical annotations.

For further information see: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/cemp-detailed-task-description/

__init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Initialize the CEMP corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)

classmethod download_train_corpus(data_dir)View on GitHub#
Return type:

Path

classmethod download_dev_corpus(data_dir)View on GitHub#
Return type:

Path

static parse_input_file(text_file, ann_file)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.HUNER_CHEMICAL_CEMP(*args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the CEMP corpus containing chemical annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[Dict]

class flair.datasets.biomedical.CHEBI(base_path=None, in_memory=True, sentence_splitter=None, annotator=0)View on GitHub#

Bases: ColumnCorpus

Original CHEBI corpus containing all annotations.

For further information see Shardlow et al.: A New Corpus to Support Text Mining for the Curation of Metabolites in the ChEBI Database http://www.lrec-conf.org/proceedings/lrec2018/pdf/229.pdf

Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

__init__(base_path=None, in_memory=True, sentence_splitter=None, annotator=0)View on GitHub#

Initialize the CHEBI corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)

  • annotator (int) – The abstracts have been annotated by two annotators, which can be selected by choosing annotator 1 or 2. If annotator is 0, the union of both annotations is used.
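The annotator selection can be sketched as follows. This is a minimal illustration under stated assumptions, not flair's implementation: the function name select_annotations and the representation of annotations as sets of (start, end, entity_type) tuples are hypothetical.

```python
from typing import Set, Tuple

# Hypothetical representation: one annotation is (start, end, entity_type).
Annotation = Tuple[int, int, str]


def select_annotations(
    first: Set[Annotation], second: Set[Annotation], annotator: int = 0
) -> Set[Annotation]:
    """Pick annotator 1 or 2, or the union of both when annotator is 0."""
    if annotator == 0:
        return first | second
    if annotator == 1:
        return set(first)
    if annotator == 2:
        return set(second)
    raise ValueError("annotator must be 0, 1 or 2")


first = {(0, 7, "Chemical"), (12, 20, "Protein")}
second = {(0, 7, "Chemical"), (25, 31, "Species")}
merged = select_annotations(first, second, annotator=0)
print(sorted(merged))
```

With annotator=0 an annotation made by both annotators appears only once in the result, because the union of two sets removes duplicates.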

static download_dataset(data_dir)View on GitHub#
Return type:

Path

static parse_dataset(data_dir, annotator)View on GitHub#
Return type:

InternalBioNerDataset

static get_entities(f)View on GitHub#
class flair.datasets.biomedical.HUNER_CHEBI(entity_type_mapping, *args, **kwargs)View on GitHub#

Bases: HunerDataset

HUNER version of the CHEBI corpus.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir, annotator=0)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[Dict]

class flair.datasets.biomedical.HUNER_CHEMICAL_CHEBI(*args, **kwargs)View on GitHub#

Bases: HUNER_CHEBI

HUNER version of the CHEBI corpus containing chemical annotations.

class flair.datasets.biomedical.HUNER_GENE_CHEBI(*args, **kwargs)View on GitHub#

Bases: HUNER_CHEBI

HUNER version of the CHEBI corpus containing gene annotations.

class flair.datasets.biomedical.HUNER_SPECIES_CHEBI(*args, **kwargs)View on GitHub#

Bases: HUNER_CHEBI

HUNER version of the CHEBI corpus containing species annotations.

class flair.datasets.biomedical.HUNER_ALL_CHEBI(*args, **kwargs)View on GitHub#

Bases: HUNER_CHEBI

HUNER version of the CHEBI corpus containing chemical, gene and species annotations.

class flair.datasets.biomedical.BioNLPCorpus(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Base class for corpora from BioNLP event extraction shared tasks.

For further information see: http://2013.bionlp-st.org/Intro

__init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Initialize the BioNLP Corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)

abstract static download_corpus(data_folder)View on GitHub#
Return type:

Tuple[Path, Path, Path]

static parse_input_files(input_folder)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.BIONLP2013_PC(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: BioNLPCorpus

Corpus of the BioNLP’2013 Pathway Curation shared task.

For further information see Ohta et al.: Overview of the Pathway Curation (PC) task of BioNLP Shared Task 2013 https://www.aclweb.org/anthology/W13-2009/

Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

static download_corpus(download_folder)View on GitHub#
Return type:

Tuple[Path, Path, Path]

class flair.datasets.biomedical.BIONLP2013_CG(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: BioNLPCorpus

Corpus of the BioNLP’2013 Cancer Genetics shared task.

For further information see Pyysalo, Ohta and Ananiadou: Overview of the Cancer Genetics (CG) task of BioNLP Shared Task 2013 https://www.aclweb.org/anthology/W13-2008/

Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

static download_corpus(download_folder)View on GitHub#
Return type:

Tuple[Path, Path, Path]

class flair.datasets.biomedical.ANAT_EM(base_path=None, in_memory=True, tokenizer=None)View on GitHub#

Bases: ColumnCorpus

Corpus for anatomical named entity mention recognition.

For further information see Pyysalo and Ananiadou: Anatomical entity mention recognition at literature scale https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3957068/ http://nactem.ac.uk/anatomytagger/#AnatEM

Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

__init__(base_path=None, in_memory=True, tokenizer=None)View on GitHub#

Initialize the anatomical named entity mention recognition Corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • tokenizer – Implementation of Tokenizer which segments sentences into tokens (default SciSpacyTokenizer)

abstract static download_corpus(data_folder)View on GitHub#
static parse_input_files(input_dir, sentence_separator)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.BioBertHelper(data_folder, column_format, train_file=None, test_file=None, dev_file=None, autofind_splits=True, name=None, comment_symbol='# ', **corpusargs)View on GitHub#

Bases: ColumnCorpus

Helper class to convert corpora and the respective train, dev and test splits used by BioBERT.

For further details see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

static download_corpora(download_dir)View on GitHub#
static convert_and_write(download_folder, data_folder, tag_type)View on GitHub#
class flair.datasets.biomedical.BIOBERT_CHEMICAL_BC4CHEMD(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

BC4CHEMD corpus with chemical annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.biomedical.BIOBERT_GENE_BC2GM(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

BC2GM corpus with gene annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.biomedical.BIOBERT_GENE_JNLPBA(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

JNLPBA corpus with gene annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.biomedical.BIOBERT_CHEMICAL_BC5CDR(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

BC5CDR corpus with chemical annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.biomedical.BIOBERT_DISEASE_BC5CDR(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

BC5CDR corpus with disease annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.biomedical.BIOBERT_DISEASE_NCBI(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

NCBI disease corpus as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.biomedical.BIOBERT_SPECIES_LINNAEUS(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

LINNAEUS corpus with species annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.biomedical.BIOBERT_SPECIES_S800(base_path=None, in_memory=True)View on GitHub#

Bases: ColumnCorpus

S800 corpus with species annotations as used in the evaluation of BioBERT.

For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert

class flair.datasets.biomedical.CRAFT_V4(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Version 4.0.1 of the CRAFT corpus containing all but the co-reference and structural annotations.

For further information see: UCDenver-ccp/CRAFT

__init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Initializes version 4.0.1 of the CRAFT corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)

filter_entities(corpus)View on GitHub#
Return type:

InternalBioNerDataset

classmethod download_corpus(data_dir)View on GitHub#
Return type:

Path

static prepare_splits(data_dir, corpus)View on GitHub#
Return type:

Tuple[InternalBioNerDataset, InternalBioNerDataset, InternalBioNerDataset]

static parse_corpus(corpus_dir)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.HUNER_CRAFT_V4(entity_type_mapping, *args, **kwargs)View on GitHub#

Bases: HunerDataset

Base class for the HUNER versions of the CRAFT corpus; the retained annotations are determined by entity_type_mapping.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[Dict]

class flair.datasets.biomedical.HUNER_CHEMICAL_CRAFT_V4(*args, **kwargs)View on GitHub#

Bases: HUNER_CRAFT_V4

HUNER version of the CRAFT corpus containing (only) chemical annotations.

class flair.datasets.biomedical.HUNER_GENE_CRAFT_V4(*args, **kwargs)View on GitHub#

Bases: HUNER_CRAFT_V4

HUNER version of the CRAFT corpus containing (only) gene annotations.

class flair.datasets.biomedical.HUNER_SPECIES_CRAFT_V4(*args, **kwargs)View on GitHub#

Bases: HUNER_CRAFT_V4

HUNER version of the CRAFT corpus containing (only) species annotations.

class flair.datasets.biomedical.HUNER_ALL_CRAFT_V4(*args, **kwargs)View on GitHub#

Bases: HUNER_CRAFT_V4

HUNER version of the CRAFT corpus containing chemical, gene and species annotations.

class flair.datasets.biomedical.HUNER_BIONLP2013_CG(entity_type_mapping, *args, **kwargs)View on GitHub#

Bases: HunerDataset

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[Dict]

class flair.datasets.biomedical.HUNER_CHEMICAL_BIONLP2013_CG(*args, **kwargs)View on GitHub#

Bases: HUNER_BIONLP2013_CG

class flair.datasets.biomedical.HUNER_DISEASE_BIONLP2013_CG(*args, **kwargs)View on GitHub#

Bases: HUNER_BIONLP2013_CG

class flair.datasets.biomedical.HUNER_GENE_BIONLP2013_CG(*args, **kwargs)View on GitHub#

Bases: HUNER_BIONLP2013_CG

class flair.datasets.biomedical.HUNER_SPECIES_BIONLP2013_CG(*args, **kwargs)View on GitHub#

Bases: HUNER_BIONLP2013_CG

class flair.datasets.biomedical.HUNER_ALL_BIONLP2013_CG(*args, **kwargs)View on GitHub#

Bases: HUNER_BIONLP2013_CG

class flair.datasets.biomedical.AZDZ(base_path=None, in_memory=True, tokenizer=None)View on GitHub#

Bases: ColumnCorpus

Arizona Disease Corpus from the Biomedical Informatics Lab at Arizona State University.

For further information see: http://diego.asu.edu/index.php

__init__(base_path=None, in_memory=True, tokenizer=None)View on GitHub#

Initializes the Arizona Disease Corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • tokenizer (Optional[Tokenizer]) – Implementation of Tokenizer which segments sentences into tokens (default SciSpacyTokenizer)

classmethod download_corpus(data_dir)View on GitHub#
Return type:

Path

static parse_corpus(input_file)View on GitHub#
Return type:

InternalBioNerDataset

class flair.datasets.biomedical.PDR(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Bases: ColumnCorpus

Corpus of plant-disease relations.

For further information see Kim et al.: A corpus of plant-disease relations in the biomedical domain https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0221582 http://gcancer.org/pdr/

Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)

__init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#

Initialize the plant-disease relations Corpus.

Parameters:
  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)

classmethod download_corpus(data_dir)View on GitHub#
Return type:

Path

class flair.datasets.biomedical.HUNER_DISEASE_PDR(*args, **kwargs)View on GitHub#

Bases: HunerDataset

PDR Dataset with only Disease annotations.

static split_url()View on GitHub#
Return type:

str

to_internal(data_dir)View on GitHub#
Return type:

InternalBioNerDataset

get_entity_type_mapping()View on GitHub#
Return type:

Optional[Dict]

class flair.datasets.biomedical.HunerMultiCorpus(entity_type, sentence_splitter=None)View on GitHub#

Bases: MultiCorpus

Base class to build the union of all HUNER data sets considering a particular entity type.

class flair.datasets.biomedical.HUNER_CELL_LINE(sentence_splitter=None)View on GitHub#

Bases: HunerMultiCorpus

Union of all HUNER cell line data sets.

class flair.datasets.biomedical.HUNER_CHEMICAL(sentence_splitter=None)View on GitHub#

Bases: HunerMultiCorpus

Union of all HUNER chemical data sets.

class flair.datasets.biomedical.HUNER_DISEASE(sentence_splitter=None)View on GitHub#

Bases: HunerMultiCorpus

Union of all HUNER disease data sets.

class flair.datasets.biomedical.HUNER_GENE(sentence_splitter=None)View on GitHub#

Bases: HunerMultiCorpus

Union of all HUNER gene data sets.

class flair.datasets.biomedical.HUNER_SPECIES(sentence_splitter=None)View on GitHub#

Bases: HunerMultiCorpus

Union of all HUNER species data sets.

class flair.datasets.biomedical.BIGBIO_NER_CORPUS(dataset_name, base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#

Bases: ColumnCorpus

This class implements an adapter to data sets implemented in the BigBio framework.

see bigscience-workshop/biomedical

The BigBio framework harmonizes over 120 biomedical data sets and provides a uniform programming API to access them. This adapter gives access to all named entity recognition data sets that use the bigbio_kb schema.

__init__(dataset_name, base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#

Initialize the BigBio Corpus.

Parameters:
  • dataset_name (str) – Name of the dataset in the Hugging Face Hub (e.g. nlmchem or bigbio/nlmchem)

  • base_path (Union[str, Path, None]) – Path to the corpus on your machine

  • in_memory (bool) – If True, keeps dataset in memory giving speedups in training.

  • sentence_splitter (Optional[SentenceSplitter]) – Custom implementation of SentenceSplitter which segments the text into sentences and tokens (default SciSpacySentenceSplitter)

  • train_split_name (Optional[str]) – Name of the training split in bigbio, usually train (default: None)

  • dev_split_name (Optional[str]) – Name of the development split in bigbio, usually validation (default: None)

  • test_split_name (Optional[str]) – Name of the test split in bigbio, usually test (default: None)

get_entity_type_mapping()View on GitHub#

Return the mapping of entity type given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]
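The effect of such a mapping can be sketched in a few lines. This is an illustration only, not flair's code: the function name map_entity_types, the tuple representation of entities and the type names in the example mapping are all hypothetical, and treating a missing mapping as "keep everything" is an assumption of this sketch.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical entity representation: (start, end, entity_type).
Span = Tuple[int, int, str]


def map_entity_types(
    entities: List[Span], mapping: Optional[Dict[str, str]]
) -> List[Span]:
    """Rename entity types to canonical ones; drop types missing from the mapping."""
    if mapping is None:
        # Assumption for this sketch: no mapping means keep all entities unchanged.
        return list(entities)
    return [
        (start, end, mapping[entity_type])
        for start, end, entity_type in entities
        if entity_type in mapping
    ]


mapping = {"GENE-Y": "Gene", "GENE-N": "Gene"}  # hypothetical type names
entities = [(0, 4, "GENE-Y"), (10, 17, "CHEMICAL"), (20, 25, "GENE-N")]
mapped = map_entity_types(entities, mapping)
print(mapped)
```

In this example the "CHEMICAL" entity is dropped because its type has no entry in the mapping, while both gene types collapse into the single canonical type "Gene".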

build_corpus_directory_name(dataset_name)View on GitHub#

Builds the directory name for the given data set.

Return type:

str

to_internal_dataset(dataset, split)View on GitHub#

Converts a dataset given in Hugging Face datasets format to our internal corpus representation.

Return type:

InternalBioNerDataset

bin_search_passage(passages, low, high, entity)View on GitHub#

Helper method to find the passage that contains a given entity mention (incl. offset).

The implementation uses binary search to find the passage in the ordered sequence passages.
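A minimal sketch of this kind of lookup is shown below. It is not the library's implementation: the function name find_passage, the representation of passages as sorted, non-overlapping (start, end) character offsets with an exclusive end, and the iterative form without low and high arguments are assumptions made for this example.

```python
from typing import List, Optional, Tuple


def find_passage(passages: List[Tuple[int, int]], entity_start: int) -> Optional[int]:
    """Return the index of the passage whose span contains entity_start."""
    low, high = 0, len(passages) - 1
    while low <= high:
        mid = (low + high) // 2
        start, end = passages[mid]
        if entity_start < start:
            high = mid - 1  # entity lies before this passage
        elif entity_start >= end:
            low = mid + 1  # entity lies after this passage
        else:
            return mid  # start <= entity_start < end
    return None  # offset falls into a gap between passages


passages = [(0, 50), (51, 120), (121, 300)]
idx = find_passage(passages, 75)
# Offset of the entity relative to the start of its passage.
relative_start = 75 - passages[idx][0]
print(idx, relative_start)
```

Because the passages are ordered by offset, each comparison halves the search range, so the lookup takes logarithmic time in the number of passages instead of a linear scan per entity.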

class flair.datasets.biomedical.HUNER_GENE_NLM_GENE(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()View on GitHub#

Return the mapping of entity type given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)View on GitHub#

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_GENE_DRUGPROT(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()View on GitHub#

Return the mapping of entity type given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)View on GitHub#

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_CHEMICAL_DRUGPROT(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()View on GitHub#

Return the mapping of entity type given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)View on GitHub#

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_ALL_DRUGPROT(*args, **kwargs)View on GitHub#

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()View on GitHub#

Return the mapping of entity type given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)View on GitHub#

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_GENE_BIORED(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()View on GitHub#

Return the mapping of entity type given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)View on GitHub#

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_CHEMICAL_BIORED(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()View on GitHub#

Return the mapping of entity type given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)View on GitHub#

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_DISEASE_BIORED(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()View on GitHub#

Return the mapping of entity type given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)View on GitHub#

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_SPECIES_BIORED(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()View on GitHub#

Return the mapping of entity type given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)View on GitHub#

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_CELL_LINE_BIORED(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()View on GitHub#

Return the mapping of entity type given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)View on GitHub#

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_ALL_BIORED(*args, **kwargs)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_GENE_CPI(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_CHEMICAL_CPI(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_ALL_CPI(*args, **kwargs)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_GENE_BIONLP_ST_2013_PC(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_CHEMICAL_BIONLP_ST_2013_PC(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_ALL_BIONLP_ST_2013_PC(*args, **kwargs)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_GENE_BIONLP_ST_2013_GE(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_GENE_BIONLP_ST_2011_GE(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_GENE_BIONLP_ST_2011_ID(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_CHEMICAL_BIONLP_ST_2011_ID(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_SPECIES_BIONLP_ST_2011_ID(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_ALL_BIONLP_ST_2011_ID(*args, **kwargs)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_GENE_BIONLP_ST_2011_REL(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_GENE_BIONLP_ST_2011_EPI(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_SPECIES_BIONLP_ST_2019_BB(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_GENE_BIOID(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_CHEMICAL_BIOID(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_SPECIES_BIOID(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_CELL_LINE_BIOID(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_ALL_BIOID(*args, **kwargs)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_GENE_GNORMPLUS(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_GENE_PROGENE(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_CHEMICAL_NLM_CHEM(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_GENE_SETH_CORPUS(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str

class flair.datasets.biomedical.HUNER_GENE_TMVAR_V3(*args, **kwargs)

Bases: BIGBIO_NER_CORPUS

get_entity_type_mapping()

Returns the mapping from entity types given in the dataset to canonical types.

Note: if an entity type is not present in the map, it is discarded.

Return type:

Optional[Dict]

build_corpus_directory_name(dataset_name)

Builds the directory name for the given data set.

Return type:

str