flair.datasets.entity_linking.BIGBIO_EL_NCBI_DISEASE
- class flair.datasets.entity_linking.BIGBIO_EL_NCBI_DISEASE(base_path=None, label_type='el-diseases', **kwargs)
Bases: BigBioEntityLinkingCorpus
This class implements the adapter for the NCBI Disease corpus.
See:
- Reference: https://www.sciencedirect.com/science/article/pii/S1532046413001974
- Link: https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/
- __init__(base_path=None, label_type='el-diseases', **kwargs)
Initializes a Corpus, potentially sampling missing dev/test splits from train.
You can define the train, dev and test split by passing the corresponding Dataset object to the constructor. At least one split should be defined. If the option sample_missing_splits is set to True, missing splits will be randomly sampled from the train split. In most cases, you will not use the constructor yourself. Rather, you will create a corpus using one of our helper methods that read common NLP filetypes. For instance, you can use flair.datasets.sequence_labeling.ColumnCorpus to read CoNLL-formatted files directly into a Corpus.
- Parameters:
train (Optional[Dataset[T_co]], optional) – Training data. Defaults to None.
dev (Optional[Dataset[T_co]], optional) – Development data. Defaults to None.
test (Optional[Dataset[T_co]], optional) – Testing data. Defaults to None.
name (str, optional) – Corpus name. Defaults to “corpus”.
sample_missing_splits (Union[bool, str], optional) – Policy for handling missing splits. True (default): sample dev(10%)/test(10%) from train. False: keep None. “only_dev”: sample only dev. “only_test”: sample only test.
random_seed (Optional[int], optional) – Seed for reproducible sampling. Defaults to None.
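A minimal usage sketch for constructing the corpus; it assumes that sample_missing_splits and random_seed are forwarded to the base Corpus constructor through **kwargs:

```python
from flair.datasets.entity_linking import BIGBIO_EL_NCBI_DISEASE

# Load the NCBI Disease corpus with the default 'el-diseases' label type.
# sample_missing_splits and random_seed are options of the base Corpus
# constructor; passing them here assumes they are forwarded via **kwargs.
corpus = BIGBIO_EL_NCBI_DISEASE(
    sample_missing_splits=True,  # sample dev/test from train if a split is missing
    random_seed=42,              # make the sampling reproducible
)

print(corpus)  # summarizes the number of train/dev/test sentences
```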
Methods
- __init__([base_path, label_type]) – Initializes a Corpus, potentially sampling missing dev/test splits from train.
- add_label_noise(label_type, labels[, ...]) – Adds artificial label noise to a specified split (in-place).
- downsample([percentage, downsample_train, ...]) – Randomly downsample the corpus to the given percentage (by removing data points).
- filter_empty_sentences() – A method that filters all sentences consisting of 0 tokens.
- filter_long_sentences(max_charlength) – A method that filters all sentences for which the plain text is longer than a specified number of characters.
- get_all_sentences() – Returns all sentences (spanning all three splits) in the Corpus.
- get_label_distribution() – Counts occurrences of each label in the corpus and returns them as a dictionary object.
- make_label_dictionary(label_type[, ...]) – Creates a Dictionary for a specific label type from the corpus.
- make_tag_dictionary(tag_type) – DEPRECATED: Creates tag dictionary ensuring 'O', '<START>', '<STOP>'.
- make_vocab_dictionary([max_tokens, min_freq]) – Creates a Dictionary of all tokens contained in the corpus.
- obtain_statistics([label_type, pretty_print]) – Print statistics about the corpus, including the length of the sentences and the labels in the corpus.
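A short sketch of common follow-up calls on the corpus object from the sketch above, using the default 'el-diseases' label type:

```python
# Print corpus statistics (sentence lengths, label counts) for the label type.
print(corpus.obtain_statistics(label_type="el-diseases"))

# Build a Dictionary of all entity-linking labels observed in the corpus.
label_dict = corpus.make_label_dictionary(label_type="el-diseases")
print(label_dict)

# Randomly downsample the corpus (removes data points in place) for quick experiments.
corpus.downsample(percentage=0.1)
```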
Attributes
- dev – The dev split as a torch.utils.data.Dataset object.
- test – The test split as a torch.utils.data.Dataset object.
- train – The training split as a torch.utils.data.Dataset object.
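The splits are standard torch.utils.data.Dataset objects holding flair Sentence objects; a brief sketch of inspecting them, continuing from the sketches above:

```python
# Each split is a torch.utils.data.Dataset of flair Sentence objects.
print(len(corpus.train), len(corpus.dev), len(corpus.test))

# Inspect the entity-linking annotations on the first training sentence.
sentence = corpus.train[0]
print(sentence)
for label in sentence.get_labels("el-diseases"):
    print(label)
```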