flair.data.Corpus

class flair.data.Corpus(train=None, dev=None, test=None, name='corpus', sample_missing_splits=True, random_seed=None)

Bases: Generic[T_co]

The main object in Flair for holding a dataset used for training and testing.

A corpus consists of three splits: a train split used for training, a dev split used for model selection and/or early stopping, and a test split used for testing. All three splits are optional, so it is possible to create a corpus using only one or two of them. If the option sample_missing_splits is set to True, missing splits will be randomly sampled from the training split.

__init__(train=None, dev=None, test=None, name='corpus', sample_missing_splits=True, random_seed=None)

Constructor method to initialize a Corpus. You can define the train, dev and test split by passing the corresponding Dataset object to the constructor. At least one split should be defined. If the option sample_missing_splits is set to True, missing splits will be randomly sampled from the train split.

In most cases, you will not use the constructor yourself. Rather, you will create a corpus using one of our helper methods that read common NLP filetypes. For instance, you can use flair.datasets.sequence_labeling.ColumnCorpus to read CoNLL-formatted files directly into a Corpus.

Parameters:
  • train (Optional[Dataset[TypeVar(T_co, covariant=True)]]) – The split you use for model training.

  • dev (Optional[Dataset[TypeVar(T_co, covariant=True)]]) – A holdout split typically used for model selection or early stopping.

  • test (Optional[Dataset[TypeVar(T_co, covariant=True)]]) – The final test data to compute the score of the model.

  • name (str) – A name that identifies the corpus.

  • sample_missing_splits (Union[bool, str]) – If set to True, missing splits are sampled from train. If set to False, missing splits are not sampled and left empty. Default: True.

  • random_seed (Optional[int]) – Set a random seed to control the sampling of missing splits.
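To make the sample_missing_splits behavior concrete, here is a minimal plain-Python sketch of how missing splits can be carved out of the train data. This is an illustration only, not Flair's actual implementation: plain lists stand in for Dataset objects, and the 10% holdout proportions are an assumption.

```python
import random

def sample_missing_splits(train, dev=None, test=None, seed=None):
    """Carve missing dev/test splits out of the train data (sketch).

    Plain lists stand in for Dataset objects; the ~10% holdout size
    is an assumption for illustration, not Flair's exact behavior.
    """
    rng = random.Random(seed)
    train = list(train)
    rng.shuffle(train)
    if test is None:
        cut = max(1, len(train) // 10)  # hold out ~10% as test
        test, train = train[:cut], train[cut:]
    if dev is None:
        cut = max(1, len(train) // 10)  # hold out ~10% of the rest as dev
        dev, train = train[:cut], train[cut:]
    return train, dev, test
```

Passing a seed makes the sampling reproducible, mirroring the random_seed parameter above.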

Methods

__init__([train, dev, test, name, ...])

Constructor method to initialize a Corpus.

add_label_noise(label_type, labels[, ...])

Generates a uniform label noise distribution in the chosen dataset split.

downsample([percentage, downsample_train, ...])

Randomly downsample the corpus to the given percentage (by removing data points).

filter_empty_sentences()

Removes all sentences that consist of 0 tokens.

filter_long_sentences(max_charlength)

Removes all sentences whose plain text is longer than a specified number of characters.

get_all_sentences()

Returns all sentences (spanning all three splits) in the Corpus.

get_label_distribution()

Counts occurrences of each label in the corpus and returns them as a dictionary object.

make_label_dictionary(label_type[, ...])

Creates a dictionary of all labels assigned to the sentences in the corpus.

make_tag_dictionary(tag_type)

Create a tag dictionary of a given label type.

make_vocab_dictionary([max_tokens, min_freq])

Creates a Dictionary of all tokens contained in the corpus.

obtain_statistics([label_type, pretty_print])

Print statistics about the corpus, including the length of the sentences and the labels in the corpus.

Attributes

dev

The dev split as a torch.utils.data.Dataset object.

test

The test split as a torch.utils.data.Dataset object.

train

The training split as a torch.utils.data.Dataset object.

property train: Dataset[T_co] | None

The training split as a torch.utils.data.Dataset object.

property dev: Dataset[T_co] | None

The dev split as a torch.utils.data.Dataset object.

property test: Dataset[T_co] | None

The test split as a torch.utils.data.Dataset object.

downsample(percentage=0.1, downsample_train=True, downsample_dev=True, downsample_test=True, random_seed=None)

Randomly downsample the corpus to the given percentage (by removing data points).

This is an in-place operation: the Corpus object itself is modified by removing data points. The method additionally returns the Corpus itself for use in method chaining.

Parameters:
  • percentage (float) – A float value between 0.0 and 1.0 indicating the percentage to which the corpus should be downsampled. The default is 0.1, meaning the corpus is downsampled to 10% of its original size.

  • downsample_train (bool) – Whether or not to include the training split in downsampling. Default is True.

  • downsample_dev (bool) – Whether or not to include the dev split in downsampling. Default is True.

  • downsample_test (bool) – Whether or not to include the test split in downsampling. Default is True.

  • random_seed (Optional[int]) – An optional random seed to make downsampling reproducible.

Return type:

Corpus

Returns:

The Corpus itself, for optional use in method chaining.
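The per-split sampling can be sketched in plain Python. This is an illustration, not Flair's implementation; a plain list stands in for a Dataset object.

```python
import random

def downsample(data, percentage=0.1, random_seed=None):
    # Keep a random `percentage` of the data points, sampled without
    # replacement, as the Corpus method does for each selected split.
    rng = random.Random(random_seed)
    k = round(len(data) * percentage)
    return rng.sample(data, k)
```

With the same random_seed, repeated calls select the same data points, which is what makes downsampling reproducible.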

filter_empty_sentences()

Removes all sentences that consist of 0 tokens.

This is an in-place operation that directly modifies the Corpus object itself by removing these sentences.

filter_long_sentences(max_charlength)

Removes all sentences whose plain text is longer than a specified number of characters.

This is an in-place operation that directly modifies the Corpus object itself by removing these sentences.

Parameters:

max_charlength (int) – The maximum permissible character length of a sentence.
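The filtering criterion amounts to a simple length check on each sentence's plain text. A minimal sketch, with plain strings standing in for flair Sentence objects:

```python
def filter_long_sentences(sentences, max_charlength):
    # Keep only sentences whose plain text fits the character budget;
    # everything longer is dropped (the real method mutates the Corpus
    # in place rather than returning a new list).
    return [s for s in sentences if len(s) <= max_charlength]
```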

make_vocab_dictionary(max_tokens=-1, min_freq=1)

Creates a Dictionary of all tokens contained in the corpus.

By defining max_tokens you can cap the number of tokens in the dictionary. If the corpus contains more than max_tokens distinct tokens, the most frequent ones are added first. If min_freq is set to a value greater than 1, only tokens occurring at least min_freq times are added to the dictionary.

Parameters:
  • max_tokens (int) – The maximum number of tokens that should be added to the dictionary (providing a value of “-1” means that there is no maximum in this regard).

  • min_freq (int) – A token needs to occur at least min_freq times to be added to the dictionary (providing a value of “-1” means that there is no limitation in this regard).

Return type:

Dictionary

Returns:

A Dictionary of all unique tokens in the corpus.
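The interaction of max_tokens and min_freq can be sketched with a frequency count over raw token strings. This is an illustration of the selection logic, not Flair's Dictionary class; it returns a plain list of kept tokens.

```python
from collections import Counter

def make_vocab_dictionary(tokens, max_tokens=-1, min_freq=1):
    # Count token frequencies, drop tokens rarer than min_freq, then
    # keep at most max_tokens of the most frequent ones (-1 = no cap).
    counts = Counter(tokens)
    items = [(tok, c) for tok, c in counts.most_common() if c >= min_freq]
    if max_tokens != -1:
        items = items[:max_tokens]
    return [tok for tok, _ in items]
```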

obtain_statistics(label_type=None, pretty_print=True)

Print statistics about the corpus, including the length of the sentences and the labels in the corpus.

Parameters:
  • label_type (Optional[str]) – Optionally set this value to obtain statistics only for one specific type of label (such as “ner” or “pos”). If not set, statistics for all labels will be returned.

  • pretty_print (bool) – If set to True, returns pretty-printed JSON (indented for readability). If not, the JSON is returned as a single line. Default: True.

Return type:

Union[dict, str]

Returns:

If pretty_print is True, returns a pretty-printed string in JSON format. Otherwise, returns a dictionary holding the statistics.
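The pretty_print return convention can be sketched as follows; the statistics dictionary here is a hypothetical stand-in for what the method collects.

```python
import json

def format_statistics(stats, pretty_print=True):
    # Mirrors the return convention: an indented JSON string when
    # pretty_print is True, the raw dictionary otherwise.
    return json.dumps(stats, indent=4) if pretty_print else stats
```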

make_label_dictionary(label_type, min_count=-1, add_unk=False, add_dev_test=False)

Creates a dictionary of all labels assigned to the sentences in the corpus.

Parameters:
  • label_type (str) – The name of the label type for which the dictionary should be created. Some corpora have multiple layers of annotation, such as “pos” and “ner”. In this case, you should choose the label type you are interested in.

  • min_count (int) – Optionally set this to exclude rare labels from the dictionary (i.e., labels seen fewer than the provided integer value).

  • add_unk (bool) – Optionally set this to True to include a “UNK” value in the dictionary. In most cases, this is not needed since the label dictionary is well-defined, but some use cases might have open classes and require this.

  • add_dev_test (bool) – Optionally set this to True to construct the label dictionary not only from the train split, but also from dev and test. This is only necessary if some labels never appear in train but do appear in one of the other splits.

Return type:

Dictionary

Returns:

A Dictionary of all unique labels in the corpus.
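The min_count and add_unk logic can be sketched over a flat list of label strings. This is an illustration, not Flair's Dictionary class; the exact "&lt;unk&gt;" item string is an assumption, and a plain list is returned instead of a Dictionary.

```python
from collections import Counter

def make_label_dictionary(sentence_labels, min_count=-1, add_unk=False):
    # Count labels of one label_type across the corpus; exclude labels
    # seen fewer than min_count times (-1 = keep all); optionally add
    # an unknown-label item up front (exact item string is assumed).
    counts = Counter(sentence_labels)
    items = [lbl for lbl, c in counts.most_common()
             if min_count == -1 or c >= min_count]
    return (["<unk>"] if add_unk else []) + items
```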

add_label_noise(label_type, labels, noise_share=0.2, split='train', noise_transition_matrix=None)

Generates a uniform label noise distribution in the chosen dataset split.

Parameters:
  • label_type (str) – The type of labels for which the noise should be simulated.

  • labels (list[str]) – A list of unique labels of said type (retrievable from the label dictionary).

  • noise_share (float) – The desired share of noisy labels in the chosen split.

  • split (str) – The dataset split in which the noise is to be simulated.

  • noise_transition_matrix (Optional[dict[str, list[float]]]) – Pre-defined probabilities for label flipping based on the initial label value (relevant for class-dependent label noise simulation).
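Uniform label noise means each label is flipped with probability noise_share, and the replacement is drawn uniformly from the other classes. A minimal sketch of that behavior, with a flat list of label strings standing in for the split (an illustration, not Flair's implementation):

```python
import random

def add_label_noise(labels_per_item, label_set, noise_share=0.2, seed=None):
    # Flip roughly `noise_share` of the labels to a different label
    # drawn uniformly from the remaining classes (uniform noise).
    rng = random.Random(seed)
    noisy = []
    for lbl in labels_per_item:
        if rng.random() < noise_share:
            noisy.append(rng.choice([c for c in label_set if c != lbl]))
        else:
            noisy.append(lbl)
    return noisy
```

A class-dependent variant would instead draw the replacement according to the row of a noise transition matrix for the original label.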

get_label_distribution()

Counts occurrences of each label in the corpus and returns them as a dictionary object.

This gives you an idea of how often each label appears in the Corpus.

Returns:

Dictionary with labels as keys and their occurrences as values.
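The counting itself is a straightforward frequency tally. A sketch with lists of label strings standing in for flair Sentence objects:

```python
from collections import Counter

def get_label_distribution(sentences):
    # Tally every label across all sentences; each "sentence" here is
    # a list of label strings rather than a flair Sentence object.
    counts = Counter()
    for labels in sentences:
        counts.update(labels)
    return dict(counts)
```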

get_all_sentences()

Returns all sentences (spanning all three splits) in the Corpus.

Return type:

ConcatDataset

Returns:

A torch.utils.data.Dataset object that includes all sentences of this corpus.

make_tag_dictionary(tag_type)

Create a tag dictionary of a given label type.

Parameters:

tag_type (str) – The label type for which to gather the tag labels.

Return type:

Dictionary

Returns:

A Dictionary containing the labeled tags, including “O”, “<START>”, and “<STOP>”.

Deprecated since version 0.8: Use ‘make_label_dictionary’ instead.