flair.datasets.biomedical#
- class flair.datasets.biomedical.Entity(char_span, entity_type)View on GitHub#
Bases:
object
Internal class to represent entities while converting biomedical NER corpora to a standardized format.
Each entity consists of the char span it addresses in the original text as well as the type of entity (e.g. Chemical, Gene, and so on).
- is_before(other_entity)View on GitHub#
Checks whether this entity is located before the given one.
- Parameters:
other_entity – Entity to check
- Return type:
bool
- contains(other_entity)View on GitHub#
Checks whether the given entity is fully contained in this entity.
- Parameters:
other_entity – Entity to check
- Return type:
bool
- overlaps(other_entity)View on GitHub#
Checks whether this and the given entity overlap.
- Parameters:
other_entity – Entity to check
- Return type:
bool
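The three span checks above can be illustrated with a small stand-in class. This is a minimal sketch, not the flair implementation: `SpanEntity` is a hypothetical name, and it assumes the char span is a `(start, end)` pair with an exclusive end.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SpanEntity:
    """Hypothetical stand-in for Entity; not the flair class itself."""

    char_span: Tuple[int, int]  # assumed (start, end) with end exclusive
    entity_type: str

    def is_before(self, other: "SpanEntity") -> bool:
        # This entity ends at or before the point where the other starts.
        return self.char_span[1] <= other.char_span[0]

    def contains(self, other: "SpanEntity") -> bool:
        # The other span lies fully inside this span.
        return (
            self.char_span[0] <= other.char_span[0]
            and other.char_span[1] <= self.char_span[1]
        )

    def overlaps(self, other: "SpanEntity") -> bool:
        # Two half-open spans share at least one character.
        return (
            self.char_span[0] < other.char_span[1]
            and other.char_span[0] < self.char_span[1]
        )


gene = SpanEntity((0, 5), "Gene")
nested = SpanEntity((2, 4), "Chemical")
later = SpanEntity((10, 15), "Gene")
```

With these values, `gene.contains(nested)` and `gene.is_before(later)` hold, while `gene.overlaps(later)` does not.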
- class flair.datasets.biomedical.InternalBioNerDataset(documents, entities_per_document)View on GitHub#
Bases:
object
Internal class to represent a corpus and its entities.
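A rough picture of such a container, assuming `documents` maps document IDs to raw text and `entities_per_document` maps the same IDs to their annotations. The names `ToyBioNerDataset` and `merge_toy_datasets` are hypothetical, and annotations are simplified to `(start, end, type)` tuples for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# (start, end, entity_type) with end exclusive; a simplification for illustration.
Annotation = Tuple[int, int, str]


@dataclass
class ToyBioNerDataset:
    """Hypothetical stand-in for InternalBioNerDataset."""

    documents: Dict[str, str] = field(default_factory=dict)
    entities_per_document: Dict[str, List[Annotation]] = field(default_factory=dict)


def merge_toy_datasets(data_sets: List[ToyBioNerDataset]) -> ToyBioNerDataset:
    """Combine several datasets, assuming document IDs do not collide."""
    merged = ToyBioNerDataset()
    for ds in data_sets:
        merged.documents.update(ds.documents)
        merged.entities_per_document.update(ds.entities_per_document)
    return merged


train = ToyBioNerDataset({"doc1": "BRCA1 is a gene."}, {"doc1": [(0, 5, "Gene")]})
test = ToyBioNerDataset({"doc2": "Aspirin helps."}, {"doc2": [(0, 7, "Chemical")]})
merged = merge_toy_datasets([train, test])
```

Keeping text and annotations in two dictionaries keyed by the same ID makes it cheap to merge splits or filter annotations without touching the text.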
- class flair.datasets.biomedical.DpEntry(position_end, entity_count, entity_lengths_sum, last_entity)View on GitHub#
Bases:
tuple
-
position_end:
int
# Alias for field number 0
-
entity_count:
int
# Alias for field number 1
-
entity_lengths_sum:
int
# Alias for field number 2
-
last_entity:
# Alias for field number 3
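Because DpEntry derives from tuple, its fields can be read by name or by position. The sketch below mirrors the field order from the class signature; `ToyDpEntry` is a hypothetical name, and a plain string stands in for the entity stored in the last field.

```python
from typing import NamedTuple, Optional


class ToyDpEntry(NamedTuple):
    """Hypothetical mirror of the DpEntry field layout."""

    position_end: int
    entity_count: int
    entity_lengths_sum: int
    last_entity: Optional[str]  # a plain string stands in for an Entity here


entry = ToyDpEntry(position_end=12, entity_count=2, entity_lengths_sum=9, last_entity="Gene")

# Named access and positional access refer to the same values.
by_name = entry.entity_count
by_index = entry[1]

# Tuples compare element-wise, so entries can be ranked without extra code.
better = max(entry, ToyDpEntry(12, 3, 9, "Gene"))
```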
- flair.datasets.biomedical.merge_datasets(data_sets)View on GitHub#
- flair.datasets.biomedical.filter_and_map_entities(dataset, entity_type_to_canonical)View on GitHub#
- Return type:
- flair.datasets.biomedical.filter_nested_entities(dataset)View on GitHub#
- Return type:
None
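The listing gives no description for the two filter helpers, so the following is only a guess at the general idea based on their names: map entity types to canonical labels while dropping unmapped ones, then remove entities nested inside longer ones. All function names are hypothetical, and the actual flair functions may select entities differently.

```python
from typing import Dict, List, Tuple

Annotation = Tuple[int, int, str]  # (start, end, entity_type), end exclusive


def map_and_filter(
    entities: List[Annotation], type_to_canonical: Dict[str, str]
) -> List[Annotation]:
    """Keep only entities whose type is in the mapping and rename the type."""
    return [
        (start, end, type_to_canonical[etype])
        for start, end, etype in entities
        if etype in type_to_canonical
    ]


def drop_nested(entities: List[Annotation]) -> List[Annotation]:
    """Remove every entity that lies fully inside a different, longer entity."""
    kept = []
    for start, end, etype in entities:
        is_nested = any(
            o_start <= start and end <= o_end and (o_end - o_start) > (end - start)
            for o_start, o_end, _ in entities
        )
        if not is_nested:
            kept.append((start, end, etype))
    return kept


raw = [(0, 12, "protein"), (0, 5, "protein"), (20, 27, "drug"), (30, 35, "organism")]
canonical = map_and_filter(raw, {"protein": "Gene", "drug": "Chemical"})
flat = drop_nested(canonical)
```

Here the `organism` annotation is dropped because it has no canonical mapping, and the shorter `protein` span is dropped because it sits inside the longer one.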
- flair.datasets.biomedical.bioc_to_internal(bioc_file)View on GitHub#
Helper function to parse corpora that are given in BioC format.
- flair.datasets.biomedical.brat_to_internal(corpus_dir, ann_file_suffixes=None)View on GitHub#
Helper function to parse corpora that are annotated using BRAT.
- Return type:
- class flair.datasets.biomedical.CoNLLWriter(sentence_splitter)View on GitHub#
Bases:
object
Utility class for writing InternalBioNerDataset to CoNLL files.
- __init__(sentence_splitter)View on GitHub#
Initialize CoNLLWriter.
- Parameters:
sentence_splitter (SentenceSplitter) – Sentence splitter which segments the text into sentences and tokens.
- process_dataset(datasets, out_dir)View on GitHub#
- write_to_conll(dataset, output_file)View on GitHub#
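To give a feel for what writing character-level annotations to a CoNLL-style column file involves, here is a minimal sketch. It assumes whitespace tokenization and a B-/I-/O tagging scheme, whereas the real writer relies on the configured sentence splitter; `to_conll_lines` is a hypothetical helper, not part of flair.

```python
import re
from typing import List, Tuple

Annotation = Tuple[int, int, str]  # (start, end, entity_type), end exclusive


def to_conll_lines(text: str, entities: List[Annotation]) -> List[str]:
    """Emit one 'token<space>tag' line per whitespace-delimited token."""
    lines = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, etype in entities:
            if start <= match.start() and match.end() <= end:
                # First token of an entity gets B-, later tokens get I-.
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{etype}"
                break
        lines.append(f"{match.group()} {tag}")
    return lines


text = "Mutations in breast cancer patients"
lines = to_conll_lines(text, [(13, 26, "Disease")])
```

For this input, the tokens `breast` and `cancer` receive `B-Disease` and `I-Disease`, and every other token receives `O`.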
- class flair.datasets.biomedical.HunerDataset(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus, ABC
Base class for HUNER datasets.
- Every subclass has to implement the following methods:
“to_internal”, which reads the complete data set (incl. train, dev, test) and returns the corpus as InternalBioNerDataset
“split_url”, which returns the base url (i.e. without ‘.train’, ‘.dev’, ‘.test’) to the HUNER split files
- For further information see:
Weber et al.: ‘HUNER: improving biomedical NER with pretraining’ https://academic.oup.com/bioinformatics/article-abstract/36/1/295/5523847?redirectedFrom=fulltext
HUNER GitHub repository: hu-ner/huner
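The subclass contract described above (one method that reads the data, one static method that returns the base URL of the split files) can be sketched with a plain abstract base class. Everything here is hypothetical: the class names, the `split_file_urls` helper, the placeholder URL, and the dictionary used as a stand-in for the internal dataset format.

```python
from abc import ABC, abstractmethod
from typing import Dict


class ToyHunerBase(ABC):
    """Hypothetical base class mirroring the two required hooks."""

    @abstractmethod
    def to_internal(self, data_folder: str) -> Dict[str, str]:
        """Read the complete data set and return it in an internal format."""

    @staticmethod
    @abstractmethod
    def split_url() -> str:
        """Return the base URL of the split files, without a split suffix."""

    def split_file_urls(self) -> Dict[str, str]:
        # Derive one URL per split by appending the suffix to the base URL.
        base = self.split_url()
        return {split: f"{base}.{split}" for split in ("train", "dev", "test")}


class ToyGeneCorpus(ToyHunerBase):
    def to_internal(self, data_folder: str) -> Dict[str, str]:
        return {"doc1": "BRCA1 is a gene."}  # placeholder content

    @staticmethod
    def split_url() -> str:
        return "https://example.org/splits/toy_gene"  # placeholder URL


urls = ToyGeneCorpus().split_file_urls()
```

Declaring both hooks abstract means a subclass that forgets either one cannot be instantiated, which surfaces the mistake early.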
- abstract to_internal(data_folder)View on GitHub#
- Return type:
- abstract static split_url()View on GitHub#
- Return type:
str
- get_corpus_sentence_splitter()View on GitHub#
Return the pre-defined sentence splitter if defined, otherwise return None.
- Return type:
Optional[SentenceSplitter]
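One way a caller might combine a corpus-defined splitter with a user-supplied one is a simple fallback chain. The precedence shown here (corpus-defined first, then user-supplied, then a default) is an assumption for illustration, and all names are hypothetical; plain callables stand in for SentenceSplitter objects.

```python
from typing import Callable, List, Optional

Splitter = Callable[[str], List[str]]


def naive_splitter(text: str) -> List[str]:
    """Fallback that splits on periods; a placeholder, not a real splitter."""
    return [part.strip() + "." for part in text.split(".") if part.strip()]


def line_splitter(text: str) -> List[str]:
    """Alternative splitter that treats each line as one sentence."""
    return text.splitlines()


def resolve_splitter(
    corpus_splitter: Optional[Splitter], user_splitter: Optional[Splitter]
) -> Splitter:
    # Assumed precedence: corpus-defined splitter, then user-supplied, then fallback.
    if corpus_splitter is not None:
        return corpus_splitter
    if user_splitter is not None:
        return user_splitter
    return naive_splitter


chosen = resolve_splitter(None, None)
sentences = chosen("BRCA1 is a gene. It is well studied.")
```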
- __init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Initialize the HUNER corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Custom implementation of SentenceSplitter which segments the text into sentences and tokens (default SciSpacySentenceSplitter)
- get_subset(dataset, split, split_dir)View on GitHub#
- class flair.datasets.biomedical.BIO_INFER(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
Original BioInfer corpus.
- For further information see Pyysalo et al.:
BioInfer: a corpus for information extraction in the biomedical domain https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-50
- __init__(base_path=None, in_memory=True)View on GitHub#
Initialize the BioInfer corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
- classmethod download_dataset(data_dir)View on GitHub#
- Return type:
Path
- classmethod parse_dataset(original_file)View on GitHub#
- class flair.datasets.biomedical.HUNER_GENE_BIO_INFER(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the BioInfer corpus containing only gene/protein annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.JNLPBA(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
Original corpus of the JNLPBA shared task.
For further information see Kim et al.: Introduction to the Bio-Entity Recognition Task at JNLPBA https://www.aclweb.org/anthology/W04-1213.pdf
Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- __init__(base_path=None, in_memory=True)View on GitHub#
Initialize the JNLPBA corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
- class flair.datasets.biomedical.HunerJNLPBAView on GitHub#
Bases:
object
- classmethod download_and_prepare_train(data_folder, sentence_tag)View on GitHub#
- Return type:
- classmethod download_and_prepare_test(data_folder, sentence_tag)View on GitHub#
- Return type:
- classmethod read_file(input_iob_file, sentence_tag)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_GENE_JNLPBA(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the JNLPBA corpus containing gene annotations.
- static split_url()View on GitHub#
- Return type:
str
- get_corpus_sentence_splitter()View on GitHub#
Return the pre-defined sentence splitter if defined, otherwise return None.
- Return type:
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_CELL_LINE_JNLPBA(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the JNLPBA corpus containing cell line annotations.
- static split_url()View on GitHub#
- Return type:
str
- get_corpus_sentence_splitter()View on GitHub#
Return the pre-defined sentence splitter if defined, otherwise return None.
- Return type:
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.CELL_FINDER(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Original CellFinder corpus containing cell line, species and gene annotations.
For further information see Neves et al.: Annotating and evaluating text for stem cell research https://pdfs.semanticscholar.org/38e3/75aeeeb1937d03c3c80128a70d8e7a74441f.pdf
- __init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Initialize the CellFinder corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Custom implementation of SentenceSplitter which segments the text into sentences and tokens.
- classmethod download_and_prepare(data_folder)View on GitHub#
- Return type:
- classmethod read_folder(data_folder)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_CELL_LINE_CELL_FINDER(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CellFinder corpus containing only cell line annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_SPECIES_CELL_FINDER(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CellFinder corpus containing only species annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_GENE_CELL_FINDER(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CellFinder corpus containing only gene annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.MIRNA(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Original miRNA corpus.
For further information see Bagewadi et al.: Detecting miRNA Mentions and Relations in Biomedical Literature https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4602280/
Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- __init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Initialize the miRNA corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
tokenizer – Callable that segments a sentence into words, defaults to scispacy
sentence_splitter (Optional[SentenceSplitter]) – Callable that segments a document into sentences, defaults to scispacy
- classmethod download_and_prepare_train(data_folder, sentence_separator)View on GitHub#
- classmethod download_and_prepare_test(data_folder, sentence_separator)View on GitHub#
- classmethod parse_file(input_file, split, sentence_separator)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HunerMiRNAHelperView on GitHub#
Bases:
object
- static get_mirna_subset(dataset, split_url, split_dir)View on GitHub#
- class flair.datasets.biomedical.HUNER_GENE_MIRNA(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the miRNA corpus containing protein / gene annotations.
- static split_url()View on GitHub#
- Return type:
str
- get_subset(dataset, split, split_dir)View on GitHub#
- get_corpus_sentence_splitter()View on GitHub#
Return the pre-defined sentence splitter if defined, otherwise return None.
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_SPECIES_MIRNA(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the miRNA corpus containing species annotations.
- static split_url()View on GitHub#
- Return type:
str
- get_subset(dataset, split, split_dir)View on GitHub#
- get_corpus_sentence_splitter()View on GitHub#
Return the pre-defined sentence splitter if defined, otherwise return None.
- Return type:
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_DISEASE_MIRNA(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the miRNA corpus containing disease annotations.
- static split_url()View on GitHub#
- Return type:
str
- get_subset(dataset, split, split_dir)View on GitHub#
- get_corpus_sentence_splitter()View on GitHub#
Return the pre-defined sentence splitter if defined, otherwise return None.
- Return type:
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.KaewphanCorpusHelperView on GitHub#
Bases:
object
Helper class for the corpora from Kaewphan et al., i.e. CLL and Gellus.
- static download_cll_dataset(data_folder)View on GitHub#
- static prepare_and_save_dataset(nersuite_folder, output_file)View on GitHub#
- static download_gellus_dataset(data_folder)View on GitHub#
- static read_dataset(nersuite_folder, sentence_separator)View on GitHub#
- Return type:
- class flair.datasets.biomedical.CLL(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
Original CLL corpus containing cell line annotations.
For further information, see Kaewphan et al.: Cell line name recognition in support of the identification of synthetic lethality in cancer from text https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4708107/
- __init__(base_path=None, in_memory=True)View on GitHub#
Initialize the CLL corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training
- class flair.datasets.biomedical.HUNER_CELL_LINE_CLL(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CLL corpus containing cell line annotations.
- static split_url()View on GitHub#
- Return type:
str
- get_corpus_sentence_splitter()View on GitHub#
Return the pre-defined sentence splitter if defined, otherwise return None.
- Return type:
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.GELLUS(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
Original Gellus corpus containing cell line annotations.
For further information, see Kaewphan et al.: Cell line name recognition in support of the identification of synthetic lethality in cancer from text https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4708107/
- __init__(base_path=None, in_memory=True)View on GitHub#
Initialize the GELLUS corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training
- class flair.datasets.biomedical.HUNER_CELL_LINE_GELLUS(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the Gellus corpus containing cell line annotations.
- static split_url()View on GitHub#
- Return type:
str
- get_corpus_sentence_splitter()View on GitHub#
Return the pre-defined sentence splitter if defined, otherwise return None.
- Return type:
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.LOCTEXT(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Original LOCTEXT corpus containing species annotations.
- For further information see Cejuela et al.:
LocText: relation extraction of protein localizations to assist database curation https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2021-9
- __init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Initialize the LOCTEXT corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Custom implementation of SentenceSplitter that segments a document into sentences and tokens (default SciSpacySentenceSplitter)
- static download_dataset(data_dir)View on GitHub#
- static parse_dataset(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_SPECIES_LOCTEXT(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the Loctext corpus containing species annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_GENE_LOCTEXT(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the Loctext corpus containing protein annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.CHEMDNER(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Original corpus of the CHEMDNER shared task.
For further information see Krallinger et al.: The CHEMDNER corpus of chemicals and drugs and its annotation principles https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-7-S1-S2
Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- __init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Initialize the CHEMDNER corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Custom implementation of SentenceSplitter which segments documents into sentences and tokens
- static download_dataset(data_dir)View on GitHub#
- class flair.datasets.biomedical.HUNER_CHEMICAL_CHEMDNER(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CHEMDNER corpus containing chemical annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.IEPA(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
IEPA corpus as provided by http://corpora.informatik.hu-berlin.de/.
For further information see Ding, Berleant, Nettleton, Wurtele: Mining MEDLINE: abstracts, sentences, or phrases? https://www.ncbi.nlm.nih.gov/pubmed/11928487
Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- __init__(base_path=None, in_memory=True)View on GitHub#
Initialize the IEPA corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
- static download_dataset(data_dir)View on GitHub#
- classmethod parse_dataset(original_file)View on GitHub#
- class flair.datasets.biomedical.HUNER_GENE_IEPA(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the IEPA corpus containing gene annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.LINNEAUS(base_path=None, in_memory=True, tokenizer=None)View on GitHub#
Bases:
ColumnCorpus
Original LINNEAUS corpus containing species annotations.
- For further information see Gerner et al.:
LINNAEUS: a species name identification system for biomedical literature https://www.ncbi.nlm.nih.gov/pubmed/20149233
Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- __init__(base_path=None, in_memory=True, tokenizer=None)View on GitHub#
Initialize the LINNEAUS corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
tokenizer (Optional[Tokenizer]) – Custom implementation of Tokenizer which segments a sentence into tokens (default SciSpacyTokenizer)
- static download_and_parse_dataset(data_dir)View on GitHub#
- class flair.datasets.biomedical.HUNER_SPECIES_LINNEAUS(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the LINNEAUS corpus containing species annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.CDR(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
CDR corpus as provided by JHnlp/BioCreative-V-CDR-Corpus.
For further information see Li et al.: BioCreative V CDR task corpus: a resource for chemical disease relation extraction https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4860626/
Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- __init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Initialize the CDR corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)
- static download_dataset(data_dir)View on GitHub#
- class flair.datasets.biomedical.HUNER_DISEASE_CDR(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CDR corpus containing disease annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_CHEMICAL_CDR(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CDR corpus containing chemical annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.VARIOME(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Variome corpus as provided by http://corpora.informatik.hu-berlin.de/corpora/brat2bioc/hvp_bioc.xml.zip.
For further information see Verspoor et al.: Annotating the biomedical literature for the human variome https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3676157/
Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- __init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Initialize the Variome corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)
- static download_dataset(data_dir)View on GitHub#
- static parse_corpus(corpus_xml)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_GENE_VARIOME(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the Variome corpus containing gene annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_DISEASE_VARIOME(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the Variome corpus containing disease annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_SPECIES_VARIOME(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the Variome corpus containing species annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.NCBI_DISEASE(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Original NCBI disease corpus containing disease annotations.
For further information see Dogan et al.: NCBI disease corpus: a resource for disease name recognition and concept normalization https://www.ncbi.nlm.nih.gov/pubmed/24393765
Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- __init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Initialize the NCBI disease corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)
- classmethod download_corpus(data_dir)View on GitHub#
- Return type:
Path
- static patch_training_file(orig_train_file, patched_file)View on GitHub#
- static parse_input_file(input_file)View on GitHub#
- class flair.datasets.biomedical.HUNER_DISEASE_NCBI(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the NCBI corpus containing disease annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.ScaiCorpus(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Base class to support the SCAI chemicals and disease corpora.
- __init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Initialize the SCAI corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)
- download_corpus(data_folder)View on GitHub#
- Return type:
Path
- static parse_input_file(input_file)View on GitHub#
- class flair.datasets.biomedical.SCAI_CHEMICALS(*args, **kwargs)View on GitHub#
Bases:
ScaiCorpus
Original SCAI chemicals corpus containing chemical annotations.
For further information see Kolářik et al.: Chemical Names: Terminological Resources and Corpora Annotation https://pub.uni-bielefeld.de/record/2603498
Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- download_corpus(data_dir)View on GitHub#
- Return type:
Path
- static perform_corpus_download(data_dir)View on GitHub#
- Return type:
Path
- class flair.datasets.biomedical.SCAI_DISEASE(*args, **kwargs)View on GitHub#
Bases:
ScaiCorpus
Original SCAI disease corpus containing disease annotations.
For further information see Gurulingappa et al.: An Empirical Evaluation of Resources for the Identification of Diseases and Adverse Effects in Biomedical Literature https://pub.uni-bielefeld.de/record/2603398
Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- download_corpus(data_dir)View on GitHub#
- Return type:
Path
- static perform_corpus_download(data_dir)View on GitHub#
- Return type:
Path
- class flair.datasets.biomedical.HUNER_CHEMICAL_SCAI(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the SCAI chemicals corpus containing chemical annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_DISEASE_SCAI(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the SCAI disease corpus containing disease annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.OSIRIS(base_path=None, in_memory=True, sentence_splitter=None, load_original_unfixed_annotation=False)View on GitHub#
Bases:
ColumnCorpus
Original OSIRIS corpus containing variation and gene annotations.
For further information see Furlong et al.: Osiris v1.2: a named entity recognition system for sequence variants of genes in biomedical literature https://www.ncbi.nlm.nih.gov/pubmed/18251998
Deprecated since version 0.13.0: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- __init__(base_path=None, in_memory=True, sentence_splitter=None, load_original_unfixed_annotation=False)View on GitHub#
Initialize the OSIRIS corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)
load_original_unfixed_annotation – The original annotation of Osiris erroneously annotates two sentences as a protein. Set to True if you don’t want the fixed version.
- classmethod download_dataset(data_dir)View on GitHub#
- Return type:
Path
- classmethod parse_dataset(corpus_folder, fix_annotation=True)View on GitHub#
- class flair.datasets.biomedical.HUNER_GENE_OSIRIS(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the OSIRIS corpus containing (only) gene annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.S800(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
S800 corpus.
For further information see Pafilis et al.: The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0065390.
- __init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Initialize the S800 corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)
- static download_dataset(data_dir)View on GitHub#
- static parse_dataset(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_SPECIES_S800(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the S800 corpus containing species annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.GPRO(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Original GPRO corpus containing gene annotations.
For further information see: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/gpro-detailed-task-description/
- __init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Initialize the GPRO corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)
- classmethod download_train_corpus(data_dir)View on GitHub#
- Return type:
Path
- classmethod download_dev_corpus(data_dir)View on GitHub#
- Return type:
Path
- static parse_input_file(text_file, ann_file)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_GENE_GPRO(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the GPRO corpus containing gene annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.DECA(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Original DECA corpus containing gene annotations.
For further information see Wang et al.: Disambiguating the species of biomedical named entities using natural language parsers https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2828111/
- __init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Initialize the DECA corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)
- classmethod download_corpus(data_dir)View on GitHub#
- Return type:
Path
- static parse_corpus(text_dir, gold_file)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_GENE_DECA(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the DECA corpus containing gene annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.FSU(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
Original FSU corpus containing protein and derived annotations.
For further information see Hahn et al.: A proposal for a configurable silver standard https://www.aclweb.org/anthology/W10-1838/
- __init__(base_path=None, in_memory=True)View on GitHub#
Initialize the FSU corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
- classmethod download_corpus(data_dir)View on GitHub#
- Return type:
Path
- static parse_corpus(corpus_dir, sentence_separator)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_GENE_FSU(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the FSU corpus containing (only) gene annotations.
- static split_url()View on GitHub#
- Return type:
str
- get_corpus_sentence_splitter()View on GitHub#
Return the pre-defined sentence splitter if defined, otherwise return None.
- Return type:
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.CRAFT(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Original CRAFT corpus (version 2.0) containing all but the coreference and sections/typography annotations.
For further information see Bada et al.: Concept annotation in the craft corpus https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-161
- __init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Initialize the CRAFT corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)
- classmethod download_corpus(data_dir)View on GitHub#
- Return type:
Path
- static parse_corpus(corpus_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.BIOSEMANTICS(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Original Biosemantics corpus.
For further information see Akhondi et al.: Annotated chemical patent corpus: a gold standard for text mining https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4182036/
- __init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Initialize the Biosemantics corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)
- static download_dataset(data_dir)View on GitHub#
- Return type:
Path
- static parse_dataset(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.BC2GM(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Original BioCreative-II-GM corpus containing gene annotations.
For further information see Smith et al.: Overview of BioCreative II gene mention recognition https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2559986/
Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- __init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Initialize the BioCreative-II-GM corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)
- static download_dataset(data_dir)View on GitHub#
- Return type:
Path
- classmethod parse_train_dataset(data_folder)View on GitHub#
- Return type:
- classmethod parse_test_dataset(data_folder)View on GitHub#
- Return type:
- static parse_dataset(text_file, ann_file)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_GENE_BC2GM(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the BioCreative-II-GM corpus containing gene annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.CEMP(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Original CEMP corpus containing chemical annotations.
For further information see: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/cemp-detailed-task-description/
- __init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Initialize the CEMP corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)
- classmethod download_train_corpus(data_dir)View on GitHub#
- Return type:
Path
- classmethod download_dev_corpus(data_dir)View on GitHub#
- Return type:
Path
- static parse_input_file(text_file, ann_file)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_CHEMICAL_CEMP(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CEMP corpus containing chemical annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.CHEBI(base_path=None, in_memory=True, sentence_splitter=None, annotator=0)View on GitHub#
Bases:
ColumnCorpus
Original CHEBI corpus containing all annotations.
For further information see Shardlow et al.: A New Corpus to Support Text Mining for the Curation of Metabolites in the ChEBI Database http://www.lrec-conf.org/proceedings/lrec2018/pdf/229.pdf
Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- __init__(base_path=None, in_memory=True, sentence_splitter=None, annotator=0)View on GitHub#
Initialize the CHEBI corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)
annotator (int) – The abstracts have been annotated by two annotators, which can be selected by choosing annotator 1 or 2. If annotator is 0, the union of both annotations is used.
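The annotator parameter chooses between the two annotation layers or their union. A minimal sketch of that selection logic, assuming annotations are plain (start, end, entity_type) tuples, might look as follows. The tuple layout and the name select_annotations are illustrative assumptions, not flair's internal code.

```python
from typing import List, Tuple

# Assumed layout for illustration: (start, end, entity_type)
Annotation = Tuple[int, int, str]


def select_annotations(
    first: List[Annotation], second: List[Annotation], annotator: int = 0
) -> List[Annotation]:
    """Pick annotator 1, annotator 2, or the union of both (annotator == 0)."""
    if annotator == 1:
        chosen = set(first)
    elif annotator == 2:
        chosen = set(second)
    elif annotator == 0:
        # Identical annotations from both annotators are kept only once.
        chosen = set(first) | set(second)
    else:
        raise ValueError("annotator must be 0, 1 or 2")
    return sorted(chosen)


annotator_1 = [(0, 5, "Chemical"), (10, 14, "Species")]
annotator_2 = [(0, 5, "Chemical"), (20, 27, "Protein")]
union = select_annotations(annotator_1, annotator_2, annotator=0)
```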
- static download_dataset(data_dir)View on GitHub#
- Return type:
Path
- static parse_dataset(data_dir, annotator)View on GitHub#
- Return type:
- static get_entities(f)View on GitHub#
- class flair.datasets.biomedical.HUNER_CHEMICAL_CHEBI(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CHEBI corpus containing chemical annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir, annotator=0)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_GENE_CHEBI(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CHEBI corpus containing gene annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir, annotator=0)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_SPECIES_CHEBI(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CHEBI corpus containing species annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir, annotator=0)View on GitHub#
- Return type:
- class flair.datasets.biomedical.BioNLPCorpus(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Base class for corpora from BioNLP event extraction shared tasks.
For further information see: http://2013.bionlp-st.org/Intro
- __init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Initialize the BioNLP Corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)
- abstract static download_corpus(data_folder)View on GitHub#
- Return type:
Tuple[Path, Path, Path]
- static parse_input_files(input_folder)View on GitHub#
- Return type:
- class flair.datasets.biomedical.BIONLP2013_PC(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
BioNLPCorpus
Corpus of the BioNLP’2013 Pathway Curation shared task.
For further information see Ohta et al.: Overview of the Pathway Curation (PC) task of BioNLP Shared Task 2013 https://www.aclweb.org/anthology/W13-2009/
Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- static download_corpus(download_folder)View on GitHub#
- Return type:
Tuple[Path, Path, Path]
- class flair.datasets.biomedical.BIONLP2013_CG(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
BioNLPCorpus
Corpus of the BioNLP’2013 Cancer Genetics shared task.
For further information see Pyysalo, Ohta and Ananiadou: Overview of the Cancer Genetics (CG) task of BioNLP Shared Task 2013 https://www.aclweb.org/anthology/W13-2008/
Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- static download_corpus(download_folder)View on GitHub#
- Return type:
Tuple[Path, Path, Path]
- class flair.datasets.biomedical.ANAT_EM(base_path=None, in_memory=True, tokenizer=None)View on GitHub#
Bases:
ColumnCorpus
Corpus for anatomical named entity mention recognition.
For further information see Pyysalo and Ananiadou: Anatomical entity mention recognition at literature scale https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3957068/ http://nactem.ac.uk/anatomytagger/#AnatEM
Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- __init__(base_path=None, in_memory=True, tokenizer=None)View on GitHub#
Initialize the anatomical named entity mention recognition Corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
tokenizer – Implementation of Tokenizer which segments sentences into tokens (default SciSpacyTokenizer)
- abstract static download_corpus(data_folder)View on GitHub#
- static parse_input_files(input_dir, sentence_separator)View on GitHub#
- Return type:
- class flair.datasets.biomedical.BioBertHelper(data_folder, column_format, train_file=None, test_file=None, dev_file=None, autofind_splits=True, name=None, comment_symbol='# ', **corpusargs)View on GitHub#
Bases:
ColumnCorpus
Helper class to convert corpora and the respective train, dev and test splits used by BioBERT.
For further details see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert
- static download_corpora(download_dir)View on GitHub#
- static convert_and_write(download_folder, data_folder, tag_type)View on GitHub#
- class flair.datasets.biomedical.BIOBERT_CHEMICAL_BC4CHEMD(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
BC4CHEMD corpus with chemical annotations as used in the evaluation of BioBERT.
For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert
- class flair.datasets.biomedical.BIOBERT_GENE_BC2GM(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
BC2GM corpus with gene annotations as used in the evaluation of BioBERT.
For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert
- class flair.datasets.biomedical.BIOBERT_GENE_JNLPBA(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
JNLPBA corpus with gene annotations as used in the evaluation of BioBERT.
For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert
- class flair.datasets.biomedical.BIOBERT_CHEMICAL_BC5CDR(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
BC5CDR corpus with chemical annotations as used in the evaluation of BioBERT.
For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert
- class flair.datasets.biomedical.BIOBERT_DISEASE_BC5CDR(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
BC5CDR corpus with disease annotations as used in the evaluation of BioBERT.
For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert
- class flair.datasets.biomedical.BIOBERT_DISEASE_NCBI(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
NCBI disease corpus as used in the evaluation of BioBERT.
For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert
- class flair.datasets.biomedical.BIOBERT_SPECIES_LINNAEUS(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
Linnaeus corpus with species annotations as used in the evaluation of BioBERT.
For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert
- class flair.datasets.biomedical.BIOBERT_SPECIES_S800(base_path=None, in_memory=True)View on GitHub#
Bases:
ColumnCorpus
S800 corpus with species annotations as used in the evaluation of BioBERT.
For further details regarding BioBERT and its evaluation, see Lee et al.: https://academic.oup.com/bioinformatics/article/36/4/1234/5566506 dmis-lab/biobert
- class flair.datasets.biomedical.CRAFT_V4(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Version 4.0.1 of the CRAFT corpus containing all but the co-reference and structural annotations.
For further information see: UCDenver-ccp/CRAFT
- __init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Initializes version 4.0.1 of the CRAFT corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)
- filter_entities(corpus)View on GitHub#
- Return type:
- classmethod download_corpus(data_dir)View on GitHub#
- Return type:
Path
- static prepare_splits(data_dir, corpus)View on GitHub#
- Return type:
Tuple[InternalBioNerDataset, InternalBioNerDataset, InternalBioNerDataset]
- static parse_corpus(corpus_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_CHEMICAL_CRAFT_V4(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CRAFT corpus containing (only) chemical annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_GENE_CRAFT_V4(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CRAFT corpus containing (only) gene annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_SPECIES_CRAFT_V4(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
HUNER version of the CRAFT corpus containing (only) species annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_CHEMICAL_BIONLP2013_CG(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_DISEASE_BIONLP2013_CG(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_GENE_BIONLP2013_CG(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HUNER_SPECIES_BIONLP2013_CG(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.AZDZ(base_path=None, in_memory=True, tokenizer=None)View on GitHub#
Bases:
ColumnCorpus
Arizona Disease Corpus from the Biomedical Informatics Lab at Arizona State University.
For further information see: http://diego.asu.edu/index.php
- __init__(base_path=None, in_memory=True, tokenizer=None)View on GitHub#
Initializes the Arizona Disease Corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
tokenizer (Optional[Tokenizer]) – Implementation of Tokenizer which segments sentences into tokens (default SciSpacyTokenizer)
- classmethod download_corpus(data_dir)View on GitHub#
- Return type:
Path
- static parse_corpus(input_file)View on GitHub#
- Return type:
- class flair.datasets.biomedical.PDR(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Bases:
ColumnCorpus
Corpus of plant-disease relations.
For further information see Kim et al.: A corpus of plant-disease relations in the biomedical domain https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0221582 http://gcancer.org/pdr/
Deprecated since version 0.13: Please use data set implementation from BigBio instead (see BIGBIO_NER_CORPUS)
- __init__(base_path=None, in_memory=True, sentence_splitter=None)View on GitHub#
Initialize the plant-disease relations Corpus.
- Parameters:
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Implementation of SentenceSplitter which segments documents into sentences and tokens (default SciSpacySentenceSplitter)
- classmethod download_corpus(data_dir)View on GitHub#
- Return type:
Path
- class flair.datasets.biomedical.HUNER_DISEASE_PDR(*args, **kwargs)View on GitHub#
Bases:
HunerDataset
PDR Dataset with only Disease annotations.
- static split_url()View on GitHub#
- Return type:
str
- to_internal(data_dir)View on GitHub#
- Return type:
- class flair.datasets.biomedical.HunerMultiCorpus(entity_type, sentence_splitter=None)View on GitHub#
Bases:
MultiCorpus
Base class to build the union of all HUNER data sets considering a particular entity type.
- class flair.datasets.biomedical.HUNER_CELL_LINE(sentence_splitter=None)View on GitHub#
Bases:
HunerMultiCorpus
Union of all HUNER cell line data sets.
- class flair.datasets.biomedical.HUNER_CHEMICAL(sentence_splitter=None)View on GitHub#
Bases:
HunerMultiCorpus
Union of all HUNER chemical data sets.
- class flair.datasets.biomedical.HUNER_DISEASE(sentence_splitter=None)View on GitHub#
Bases:
HunerMultiCorpus
Union of all HUNER disease data sets.
- class flair.datasets.biomedical.HUNER_GENE(sentence_splitter=None)View on GitHub#
Bases:
HunerMultiCorpus
Union of all HUNER gene data sets.
- class flair.datasets.biomedical.HUNER_SPECIES(sentence_splitter=None)View on GitHub#
Bases:
HunerMultiCorpus
Union of all HUNER species data sets.
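The union classes above group the per-corpus HUNER data sets by entity type, and the class names on this page follow a HUNER_<ENTITY_TYPE>_<CORPUS> pattern. The sketch below shows how such a grouping could be derived from names alone. The selection rule and the helper select_huner_corpora are assumptions made for illustration and may differ from how HunerMultiCorpus actually collects its members.

```python
from typing import List

# Class names taken from this page; the list is deliberately incomplete.
HUNER_CLASS_NAMES = [
    "HUNER_GENE_BC2GM",
    "HUNER_GENE_DECA",
    "HUNER_CHEMICAL_CEMP",
    "HUNER_CHEMICAL_CHEBI",
    "HUNER_SPECIES_CHEBI",
    "HUNER_DISEASE_PDR",
    "HUNER_CELL_LINE_BIORED",
    "HUNER_GENE",  # the union class itself must not be selected
]


def select_huner_corpora(entity_type: str, class_names: List[str]) -> List[str]:
    """Return the names of all per-corpus classes for one entity type."""
    # The trailing underscore excludes the union class (e.g. HUNER_GENE).
    prefix = f"HUNER_{entity_type.upper()}_"
    return sorted(name for name in class_names if name.startswith(prefix))


gene_corpora = select_huner_corpora("gene", HUNER_CLASS_NAMES)
cell_line_corpora = select_huner_corpora("cell_line", HUNER_CLASS_NAMES)
```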
- class flair.datasets.biomedical.BIGBIO_NER_CORPUS(dataset_name, base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#
Bases:
ColumnCorpus
This class implements an adapter to data sets implemented in the BigBio framework.
See bigscience-workshop/biomedical for details.
The BigBio framework harmonizes over 120 biomedical data sets and provides a uniform programming API to access them. This adapter gives access to all named entity recognition data sets via the bigbio_kb schema.
- __init__(dataset_name, base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#
Initialize the BigBio Corpus.
- Parameters:
dataset_name (str) – Name of the dataset in the Hugging Face Hub (e.g. nlmchem or bigbio/nlmchem)
base_path (Union[str, Path, None]) – Path to the corpus on your machine
in_memory (bool) – If True, keeps dataset in memory giving speedups in training.
sentence_splitter (Optional[SentenceSplitter]) – Custom implementation of SentenceSplitter which segments the text into sentences and tokens (default SciSpacySentenceSplitter)
train_split_name (Optional[str]) – Name of the training split in bigbio, usually train (default: None)
dev_split_name (Optional[str]) – Name of the development split in bigbio, usually validation (default: None)
test_split_name (Optional[str]) – Name of the test split in bigbio, usually test (default: None)
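Since dataset_name may be given with or without the bigbio/ prefix, both forms have to be mapped to the same data set. One way this could be done is sketched below. The helper normalize_dataset_name and its return layout are hypothetical and are not taken from the flair source.

```python
from typing import Tuple


def normalize_dataset_name(dataset_name: str, namespace: str = "bigbio") -> Tuple[str, str]:
    """Return (short_name, hub_name) for either accepted spelling.

    "nlmchem" and "bigbio/nlmchem" both yield ("nlmchem", "bigbio/nlmchem").
    """
    prefix = namespace + "/"
    if dataset_name.startswith(prefix):
        short_name = dataset_name[len(prefix):]
    else:
        short_name = dataset_name
    return short_name, prefix + short_name


short_name, hub_name = normalize_dataset_name("nlmchem")
```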
- get_entity_type_mapping()View on GitHub#
Return the mapping of entity type given in the dataset to canonical types.
Note: if an entity type is not present in the map, it is discarded.
- Return type:
Optional[Dict]
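The mapping returned by get_entity_type_mapping() both renames entity types and filters out unmapped ones. A minimal sketch of that behaviour, assuming entities are (start, end, entity_type) tuples, is given below. The tuple layout, the example type names, and the treatment of a None mapping as "keep everything" are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Assumed layout for illustration: (start, end, entity_type)
Entity = Tuple[int, int, str]


def map_entity_types(entities: List[Entity], mapping: Optional[Dict[str, str]]) -> List[Entity]:
    """Rename entity types to canonical ones and drop types missing from the map."""
    if mapping is None:
        # Assumption: without a mapping, entities are passed through unchanged.
        return list(entities)
    return [(start, end, mapping[etype]) for start, end, etype in entities if etype in mapping]


# Hypothetical source type names mapped onto one canonical type.
mapping = {"GENE-Y": "Gene", "GENE-N": "Gene"}
entities = [(0, 5, "GENE-Y"), (10, 17, "CHEMICAL"), (20, 24, "GENE-N")]
mapped = map_entity_types(entities, mapping)
```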
- build_corpus_directory_name(dataset_name)View on GitHub#
Builds the directory name for the given data set.
- Return type:
str
- to_internal_dataset(dataset, split)View on GitHub#
Converts a dataset given in Hugging Face datasets format to our internal corpus representation.
- Return type:
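As a rough illustration of such a conversion, the sketch below turns simplified records into a documents dictionary and a per-document entity list, mirroring the (documents, entities_per_document) constructor arguments of InternalBioNerDataset. The record layout used here (document_id, text, entities with type and offsets) is a simplifying assumption and not the exact bigbio_kb schema, and the NamedTuple classes are local stand-ins rather than flair's own classes.

```python
from typing import Dict, List, NamedTuple, Tuple


class Entity(NamedTuple):
    # Local stand-in with the same fields as flair's Entity
    char_span: Tuple[int, int]
    entity_type: str


class InternalDataset(NamedTuple):
    # Local stand-in with the same fields as InternalBioNerDataset
    documents: Dict[str, str]
    entities_per_document: Dict[str, List[Entity]]


def records_to_internal(records: List[dict]) -> InternalDataset:
    """Convert simplified records into documents plus character-span entities."""
    documents: Dict[str, str] = {}
    entities_per_document: Dict[str, List[Entity]] = {}
    for record in records:
        doc_id = record["document_id"]
        documents[doc_id] = record["text"]
        # An entity with several offset pairs yields one Entity per pair.
        entities_per_document[doc_id] = [
            Entity((start, end), entity["type"])
            for entity in record["entities"]
            for start, end in entity["offsets"]
        ]
    return InternalDataset(documents, entities_per_document)


records = [
    {
        "document_id": "d1",
        "text": "Aspirin inhibits COX-1.",
        "entities": [
            {"type": "Chemical", "offsets": [[0, 7]]},
            {"type": "Gene", "offsets": [[17, 22]]},
        ],
    }
]
internal = records_to_internal(records)
```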
- bin_search_passage(passages, low, high, entity)View on GitHub#
Helper method to find the passage that contains a given entity mention, based on the mention's (inclusive) offsets.
The implementation uses binary search to find the passage in the ordered sequence passages.
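The binary search over ordered passages can be sketched as follows. This is an illustrative version only: it assumes passages and entities are (start, end) tuples with an exclusive end offset and non-overlapping, sorted passages, which may differ from the offset convention and data structures used by bin_search_passage itself.

```python
from typing import List, Optional, Tuple

# Assumed layout for illustration: (start, end) with end exclusive
Span = Tuple[int, int]


def find_passage(passages: List[Span], low: int, high: int, entity: Span) -> Optional[int]:
    """Return the index of the passage that fully contains the entity, else None.

    `passages` must be sorted by start offset and must not overlap.
    """
    while low <= high:
        mid = (low + high) // 2
        start, end = passages[mid]
        if entity[1] <= start:
            # Entity ends before this passage starts: search the left half.
            high = mid - 1
        elif entity[0] >= end:
            # Entity starts after this passage ends: search the right half.
            low = mid + 1
        elif start <= entity[0] and entity[1] <= end:
            return mid
        else:
            # Entity straddles a passage boundary.
            return None
    return None


passages = [(0, 50), (51, 120), (121, 200)]
index = find_passage(passages, 0, len(passages) - 1, (130, 140))
```

Because each step halves the search range, the lookup takes O(log n) comparisons per entity instead of scanning every passage.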
- class flair.datasets.biomedical.HUNER_GENE_NLM_GENE(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()View on GitHub#
Return the mapping of entity type given in the dataset to canonical types.
Note: if an entity type is not present in the map, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)View on GitHub#
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_GENE_DRUGPROT(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()View on GitHub#
Return the mapping of entity type given in the dataset to canonical types.
Note: if an entity type is not present in the map, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)View on GitHub#
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_CHEMICAL_DRUGPROT(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()View on GitHub#
Return the mapping of entity type given in the dataset to canonical types.
Note: if an entity type is not present in the map, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)View on GitHub#
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_GENE_BIORED(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()View on GitHub#
Return the mapping of entity type given in the dataset to canonical types.
Note: if an entity type is not present in the map, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)View on GitHub#
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_CHEMICAL_BIORED(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()View on GitHub#
Return the mapping of entity type given in the dataset to canonical types.
Note: if an entity type is not present in the map, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)View on GitHub#
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_DISEASE_BIORED(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()View on GitHub#
Return the mapping of entity type given in the dataset to canonical types.
Note: if an entity type is not present in the map, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)View on GitHub#
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_SPECIES_BIORED(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()View on GitHub#
Return the mapping of entity type given in the dataset to canonical types.
Note: if an entity type is not present in the map, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)View on GitHub#
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_CELL_LINE_BIORED(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()View on GitHub#
Return the mapping of entity type given in the dataset to canonical types.
Note: if an entity type is not present in the map, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)View on GitHub#
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_GENE_CPI(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()View on GitHub#
Return the mapping of entity type given in the dataset to canonical types.
Note: if an entity type is not present in the map, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)View on GitHub#
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_CHEMICAL_CPI(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()View on GitHub#
Return the mapping of entity type given in the dataset to canonical types.
Note: if an entity type is not present in the map, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)View on GitHub#
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_GENE_BIONLP_ST_2013_PC(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()View on GitHub#
Return the mapping of entity type given in the dataset to canonical types.
Note: if an entity type is not present in the map, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)View on GitHub#
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_CHEMICAL_BIONLP_ST_2013_PC(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()View on GitHub#
Return the mapping of entity type given in the dataset to canonical types.
Note: if an entity type is not present in the map, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)View on GitHub#
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_GENE_BIONLP_ST_2013_GE(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()View on GitHub#
Return the mapping of entity type given in the dataset to canonical types.
Note: if an entity type is not present in the map, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)View on GitHub#
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_GENE_BIONLP_ST_2011_GE(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()View on GitHub#
Return the mapping of entity type given in the dataset to canonical types.
Note: if an entity type is not present in the map, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)View on GitHub#
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_GENE_BIONLP_ST_2011_ID(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()View on GitHub#
Return the mapping of entity type given in the dataset to canonical types.
Note: if an entity type is not present in the map, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)View on GitHub#
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_CHEMICAL_BIONLP_ST_2011_ID(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()View on GitHub#
Return the mapping of entity type given in the dataset to canonical types.
Note: if an entity type is not present in the map, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)View on GitHub#
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_SPECIES_BIONLP_ST_2011_ID(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)View on GitHub#
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()View on GitHub#
Return the mapping of entity type given in the dataset to canonical types.
Note: if an entity type is not present in the map, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)View on GitHub#
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_GENE_BIONLP_ST_2011_REL(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()
Return the mapping from the entity types given in the dataset to canonical types.
Note that if an entity type is not present in the mapping, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_GENE_BIONLP_ST_2011_EPI(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()
Return the mapping from the entity types given in the dataset to canonical types.
Note that if an entity type is not present in the mapping, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_SPECIES_BIONLP_ST_2019_BB(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()
Return the mapping from the entity types given in the dataset to canonical types.
Note that if an entity type is not present in the mapping, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_GENE_BIOID(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()
Return the mapping from the entity types given in the dataset to canonical types.
Note that if an entity type is not present in the mapping, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_CHEMICAL_BIOID(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()
Return the mapping from the entity types given in the dataset to canonical types.
Note that if an entity type is not present in the mapping, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_SPECIES_BIOID(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()
Return the mapping from the entity types given in the dataset to canonical types.
Note that if an entity type is not present in the mapping, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_CELL_LINE_BIOID(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()
Return the mapping from the entity types given in the dataset to canonical types.
Note that if an entity type is not present in the mapping, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_GENE_GNORMPLUS(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()
Return the mapping from the entity types given in the dataset to canonical types.
Note that if an entity type is not present in the mapping, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_GENE_PROGENE(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()
Return the mapping from the entity types given in the dataset to canonical types.
Note that if an entity type is not present in the mapping, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_CHEMICAL_NLM_CHEM(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()
Return the mapping from the entity types given in the dataset to canonical types.
Note that if an entity type is not present in the mapping, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_GENE_SETH_CORPUS(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()
Return the mapping from the entity types given in the dataset to canonical types.
Note that if an entity type is not present in the mapping, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_GENE_TMVAR_V3(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()
Return the mapping from the entity types given in the dataset to canonical types.
Note that if an entity type is not present in the mapping, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_SPECIES_TMVAR_V3(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()
Return the mapping from the entity types given in the dataset to canonical types.
Note that if an entity type is not present in the mapping, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)
Builds the directory name for the given data set.
- Return type:
str
- class flair.datasets.biomedical.HUNER_CELL_LINE_TMVAR_V3(base_path=None, in_memory=True, sentence_splitter=None, train_split_name=None, dev_split_name=None, test_split_name=None)
Bases:
BIGBIO_NER_CORPUS
- get_entity_type_mapping()
Return the mapping from the entity types given in the dataset to canonical types.
Note that if an entity type is not present in the mapping, it is discarded.
- Return type:
Optional[Dict]
- build_corpus_directory_name(dataset_name)
Builds the directory name for the given data set.
- Return type:
str