flair.data.Corpus#
- class flair.data.Corpus(train=None, dev=None, test=None, name='corpus', sample_missing_splits=True, random_seed=None)#
Bases: Generic[T_co]
The main container for holding train, dev, and test datasets for a task.
A corpus consists of three splits: a train split used for training, a dev split used for model selection or early stopping, and a test split used for testing. All three splits are optional, so it is possible to create a corpus using only one or two splits. If the option sample_missing_splits is set to True, missing splits will be randomly sampled from the training split. The class also provides methods for sampling, filtering, and creating dictionaries.
- Generics:
T_co: The covariant type of DataPoint in the datasets (e.g., Sentence).
- train#
Training data split.
- Type:
Optional[Dataset[T_co]]
- dev#
Development (validation) data split.
- Type:
Optional[Dataset[T_co]]
- test#
Testing data split.
- Type:
Optional[Dataset[T_co]]
- name#
Name of the corpus.
- Type:
str
- __init__(train=None, dev=None, test=None, name='corpus', sample_missing_splits=True, random_seed=None)#
Initializes a Corpus, potentially sampling missing dev/test splits from train.
You can define the train, dev, and test splits by passing the corresponding Dataset objects to the constructor. At least one split should be defined. If the option sample_missing_splits is set to True, missing splits will be randomly sampled from the train split. In most cases, you will not use the constructor yourself. Rather, you will create a corpus using one of our helper methods that read common NLP filetypes. For instance, you can use flair.datasets.sequence_labeling.ColumnCorpus to read CoNLL-formatted files directly into a Corpus.
- Parameters:
train (Optional[Dataset[T_co]], optional) – Training data. Defaults to None.
dev (Optional[Dataset[T_co]], optional) – Development data. Defaults to None.
test (Optional[Dataset[T_co]], optional) – Testing data. Defaults to None.
name (str, optional) – Corpus name. Defaults to “corpus”.
sample_missing_splits (Union[bool, str], optional) – Policy for handling missing splits. True (default): sample dev(10%)/test(10%) from train. False: keep None. “only_dev”: sample only dev. “only_test”: sample only test.
random_seed (Optional[int], optional) – Seed for reproducible sampling. Defaults to None.
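For illustration, a minimal sketch of both construction routes. The data folder, file names, and column layout in the first route are assumptions, and FlairDatapointDataset is the list-wrapping dataset shipped with recent Flair versions:

```python
from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus, FlairDatapointDataset

# Route 1: read CoNLL-formatted files from a (hypothetical) data folder.
columns = {0: "text", 1: "ner"}
corpus = ColumnCorpus(
    "resources/my_ner_data", columns,
    train_file="train.txt", dev_file="dev.txt", test_file="test.txt",
)

# Route 2: build a Corpus directly from Sentence objects. Only a train
# split is passed, so with sample_missing_splits=True (the default), dev
# and test are sampled from it (10% each).
sentences = [Sentence(f"This is sentence number {i} .") for i in range(20)]
toy_corpus = Corpus(train=FlairDatapointDataset(sentences), name="toy", random_seed=42)
print(toy_corpus)  # prints the size of each split
```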
Methods
__init__([train, dev, test, name, ...]) – Initializes a Corpus, potentially sampling missing dev/test splits from train.
add_label_noise(label_type, labels[, ...]) – Adds artificial label noise to a specified split (in-place).
downsample([percentage, downsample_train, ...]) – Randomly downsample the corpus to the given percentage (by removing data points).
filter_empty_sentences() – A method that filters all sentences consisting of 0 tokens.
filter_long_sentences(max_charlength) – A method that filters all sentences for which the plain text is longer than a specified number of characters.
get_all_sentences() – Returns all sentences (spanning all three splits) in the Corpus.
get_label_distribution() – Counts occurrences of each label in the corpus and returns them as a dictionary object.
make_label_dictionary(label_type[, ...]) – Creates a Dictionary for a specific label type from the corpus.
make_tag_dictionary(tag_type) – DEPRECATED: Creates tag dictionary ensuring 'O', '<START>', '<STOP>'.
make_vocab_dictionary([max_tokens, min_freq]) – Creates a Dictionary of all tokens contained in the corpus.
obtain_statistics([label_type, pretty_print]) – Print statistics about the corpus, including the length of the sentences and the labels in the corpus.
Attributes
dev – The dev split as a torch.utils.data.Dataset object.
test – The test split as a torch.utils.data.Dataset object.
train – The training split as a torch.utils.data.Dataset object.
- property train: Dataset[T_co] | None#
The training split as a torch.utils.data.Dataset object.
- property dev: Dataset[T_co] | None#
The dev split as a torch.utils.data.Dataset object.
- property test: Dataset[T_co] | None#
The test split as a torch.utils.data.Dataset object.
- downsample(percentage=0.1, downsample_train=True, downsample_dev=True, downsample_test=True, random_seed=None)#
Randomly downsample the corpus to the given percentage (by removing data points).
This method is an in-place operation, meaning that the Corpus object itself is modified by removing data points. It additionally returns a pointer to itself for use in method chaining.
- Parameters:
percentage (float) – A float value between 0.0 and 1.0 that indicates to which percentage the corpus should be downsampled. Default value is 0.1, meaning it gets downsampled to 10%.
downsample_train (bool) – Whether or not to include the training split in downsampling. Default is True.
downsample_dev (bool) – Whether or not to include the dev split in downsampling. Default is True.
downsample_test (bool) – Whether or not to include the test split in downsampling. Default is True.
random_seed (Optional[int]) – An optional random seed to make downsampling reproducible.
- Returns:
Returns self for chaining.
- Return type:
Corpus
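As a sketch of typical usage (UD_ENGLISH stands in for any corpus loader):

```python
from flair.datasets import UD_ENGLISH

# Shrink every split to 10%, reproducibly; downsample() returns self,
# so it chains directly after the loader.
corpus = UD_ENGLISH().downsample(0.1, random_seed=42)

# Shrink only the training split further, leaving dev and test untouched.
corpus.downsample(0.5, downsample_dev=False, downsample_test=False)
print(corpus)
```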
- filter_empty_sentences()#
A method that filters all sentences consisting of 0 tokens.
This is an in-place operation that directly modifies the Corpus object itself by removing these sentences.
- filter_long_sentences(max_charlength)#
A method that filters all sentences for which the plain text is longer than a specified number of characters.
This is an in-place operation that directly modifies the Corpus object itself by removing these sentences.
- Parameters:
max_charlength (int) – Maximum allowed character length.
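Both filters can be applied back to back after loading; a brief sketch, again using UD_ENGLISH as a stand-in loader:

```python
from flair.datasets import UD_ENGLISH

corpus = UD_ENGLISH()
corpus.filter_empty_sentences()                   # drop sentences with 0 tokens
corpus.filter_long_sentences(max_charlength=512)  # drop overly long sentences
print(corpus)  # split sizes reflect the removals
```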
- make_vocab_dictionary(max_tokens=-1, min_freq=1)#
Creates a Dictionary of all tokens contained in the corpus.
By defining max_tokens you can set the maximum number of tokens that should be contained in the dictionary. If there are more than max_tokens tokens in the corpus, the most frequent tokens are added first. If min_freq is set to a value greater than 1, only tokens occurring at least min_freq times are considered for addition to the dictionary.
- Parameters:
max_tokens (int) – The maximum number of tokens that should be added to the dictionary (a value of -1 means that there is no maximum).
min_freq (int) – A token needs to occur at least min_freq times to be added to the dictionary (a value of -1 means that there is no limitation).
- Returns:
Vocabulary Dictionary mapping tokens to IDs (includes <unk>).
- Return type:
Dictionary
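A small usage sketch; the cutoff values are arbitrary choices, not defaults:

```python
from flair.datasets import UD_ENGLISH

corpus = UD_ENGLISH()
# Keep at most 10,000 distinct tokens, each seen at least twice; all
# other tokens fall back to the <unk> entry.
vocab = corpus.make_vocab_dictionary(max_tokens=10000, min_freq=2)
print(len(vocab))                     # dictionary size, including <unk>
print(vocab.get_idx_for_item("the"))  # ID of a frequent token
```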
- obtain_statistics(label_type=None, pretty_print=True)#
Print statistics about the corpus, including the length of the sentences and the labels in the corpus.
- Parameters:
label_type (Optional[str]) – Optionally set this value to obtain statistics only for one specific type of label (such as "ner" or "pos"). If not set, statistics for all labels will be returned.
pretty_print (bool) – If set to True, returns a pretty-printed JSON string (indented for readability). If not, a plain dictionary is returned. Default: True.
- Return type:
Union[dict, str]
- Returns:
If pretty_print is True, a pretty-printed string in JSON format. Otherwise, a dictionary holding the statistics.
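For example, assuming a Universal Dependencies corpus that carries a "upos" annotation layer:

```python
from flair.datasets import UD_ENGLISH

corpus = UD_ENGLISH().downsample(0.1)

# Pretty-printed JSON string for a single label layer.
print(corpus.obtain_statistics(label_type="upos"))

# The same information as a plain dict for programmatic use.
stats = corpus.obtain_statistics(label_type="upos", pretty_print=False)
```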
- make_label_dictionary(label_type, min_count=1, add_unk=True, add_dev_test=False)#
Creates a Dictionary for a specific label type from the corpus.
- Parameters:
label_type (str) – The name of the label type for which the dictionary should be created. Some corpora have multiple layers of annotation, such as "pos" and "ner". In this case, you should choose the label type you are interested in.
min_count (int) – Optionally set this to exclude rare labels from the dictionary (i.e., labels seen fewer than the provided integer value).
add_unk (bool) – Optionally set this to True to include an "UNK" value in the dictionary. In most cases, this is not needed since the label dictionary is well-defined, but some use cases might have open classes and require this.
add_dev_test (bool) – Optionally set this to True to construct the label dictionary not only from the train split, but also from dev and test. This is only necessary if some labels never appear in train but do appear in one of the other splits.
- Returns:
Dictionary mapping label values to IDs.
- Return type:
Dictionary
- Raises:
ValueError – If label_type is not found.
AssertionError – If no data splits are available to scan.
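A usage sketch, again assuming a corpus with a "upos" layer; the min_count value is an arbitrary choice:

```python
from flair.datasets import UD_ENGLISH

corpus = UD_ENGLISH()
# Build the label inventory for the "upos" layer; exclude labels seen
# fewer than 5 times, and skip UNK since the tag set is closed.
upos_dictionary = corpus.make_label_dictionary(label_type="upos", min_count=5, add_unk=False)
print(upos_dictionary)
```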
- add_label_noise(label_type, labels, noise_share=0.2, split='train', noise_transition_matrix=None)#
Adds artificial label noise to a specified split (in-place).
Stores original labels under {label_type}_clean.
- Parameters:
label_type (str) – Target label type.
labels (list[str]) – List of all possible valid labels for the type.
noise_share (float, optional) – Target proportion for uniform noise (0.0-1.0). Ignored if noise_transition_matrix is given. Defaults to 0.2.
split (str, optional) – Split to modify (‘train’, ‘dev’, ‘test’). Defaults to “train”.
noise_transition_matrix (Optional[dict[str, list[float]]], optional) – Matrix for class-dependent noise. Defaults to None (use uniform noise).
- Return type:
None
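A sketch of injecting uniform noise into NER labels; the CoNLL-03 tag inventory below is an assumption, and the CONLL_03 loader requires the dataset files to be available locally:

```python
from flair.datasets import CONLL_03

corpus = CONLL_03()
corpus.add_label_noise(
    label_type="ner",
    labels=["PER", "LOC", "ORG", "MISC"],  # assumed tag inventory
    noise_share=0.2,  # flip roughly 20% of the labels uniformly at random
    split="train",    # only the training split is corrupted
)
# The unmodified labels remain accessible under "ner_clean".
```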
- get_label_distribution()#
Counts occurrences of each label in the corpus and returns them as a dictionary object.
This allows you to get an idea of which label appears how often in the Corpus.
- Returns:
Dictionary with labels as keys and their occurrences as values.
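A brief sketch that prints the labels sorted by frequency:

```python
from flair.datasets import UD_ENGLISH

corpus = UD_ENGLISH()
distribution = corpus.get_label_distribution()
for label, count in sorted(distribution.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {count}")
```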
- get_all_sentences()#
Returns all sentences (spanning all three splits) in the Corpus.
- Return type:
ConcatDataset
- Returns:
A torch.utils.data.Dataset object that includes all sentences of this corpus.
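Since the result is a ConcatDataset, it supports len() and indexing; a minimal sketch:

```python
from flair.datasets import UD_ENGLISH

corpus = UD_ENGLISH()
everything = corpus.get_all_sentences()
print(len(everything))  # total number of sentences across all splits
print(everything[0])    # first sentence of the concatenated splits
```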
- make_tag_dictionary(tag_type)#
DEPRECATED: Creates a tag dictionary ensuring 'O', '<START>', and '<STOP>' entries.
- Return type:
Dictionary
Deprecated since version 0.8: Use ‘make_label_dictionary(add_unk=False)’ instead.