flair.data#

Functions

get_spans_from_bio(bioes_tags[, bioes_scores])

randomly_split_into_two_datasets(dataset, ...) – Shuffles a dataset and splits it into two subsets.

Classes

BoundingBox(left, top, right, bottom)

ConcatFlairDataset(datasets, ids) – Dataset as a concatenation of multiple datasets.

Corpus([train, dev, test, name, ...]) – The main object in Flair for holding a dataset used for training and testing.

DataPair(first, second)

DataPoint() – The parent class of all data points in Flair.

DataTriple(first, second, third)

Dictionary([add_unk]) – Holds a dictionary that maps strings to IDs, used to generate one-hot encodings of strings.

EntityCandidate(concept_id, concept_name, ...) – A Concept as part of a knowledge base or ontology.

FlairDataset()

Image([data, imageURL])

Label(data_point, value[, score]) – Represents a label.

MultiCorpus(corpora[, task_ids, name])

Relation(first, second)

Sentence(text[, use_tokenizer, ...]) – A central object in Flair that represents either a single sentence or a whole text.

Span(tokens) – One textual span consisting of Tokens.

Token(text[, head_id, whitespace_after, ...]) – One word in a tokenized sentence.

class flair.data.BoundingBox(left, top, right, bottom)#

Bases: NamedTuple

left: str#

Alias for field number 0

top: int#

Alias for field number 1

right: int#

Alias for field number 2

bottom: int#

Alias for field number 3

class flair.data.Dictionary(add_unk=True)#

Bases: object

This class holds a dictionary that maps strings to IDs, used to generate one-hot encodings of strings.

remove_item(item)#
add_item(item)#

Adds a string to the dictionary. If the string is already contained, its existing ID is returned; otherwise it is assigned a new ID.

Parameters:

item (str) – a string for which to assign an id.

Return type:

int

Returns:

ID of string

get_idx_for_item(item)#

Returns the ID of the given string, or 0 if the string is not in the dictionary.

Parameters:

item (str) – string for which ID is requested

Return type:

int

Returns:

ID of the string, or 0 if not found

get_idx_for_items(items)#

Returns the ID for each item in the list of strings, with 0 for items not found.

Parameters:

items (list[str]) – List of strings for which IDs are requested

Return type:

list[int]

Returns:

List of IDs for the given strings

get_items()#
Return type:

list[str]

get_item_for_index(idx)#
Return type:

str

has_item(item)#
Return type:

bool

set_start_stop_tags()#
Return type:

None

is_span_prediction_problem()#
Return type:

bool

start_stop_tags_are_set()#
Return type:

bool

save(savefile)#
classmethod load_from_file(filename)#
Return type:

Dictionary

classmethod load(name)#
Return type:

Dictionary
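The add_item / get_idx_for_item contract described above can be sketched with a minimal self-contained stand-in. This is only an illustration of the documented behavior, not the flair implementation; StringDictionary is a hypothetical name:

```python
# Minimal sketch of the described string-to-ID behavior.
# StringDictionary is a hypothetical stand-in, not flair.data.Dictionary.
class StringDictionary:
    def __init__(self, add_unk=True):
        self.item2idx = {}
        self.idx2item = []
        if add_unk:
            self.add_item("<unk>")  # index 0 is reserved for unknowns

    def add_item(self, item):
        # If the string is already contained, return its existing ID;
        # otherwise assign it the next free ID.
        if item not in self.item2idx:
            self.item2idx[item] = len(self.idx2item)
            self.idx2item.append(item)
        return self.item2idx[item]

    def get_idx_for_item(self, item):
        # Unknown strings fall back to ID 0.
        return self.item2idx.get(item, 0)

vocab = StringDictionary()
print(vocab.add_item("hello"))           # 1
print(vocab.add_item("hello"))           # 1 (already present)
print(vocab.get_idx_for_item("unseen"))  # 0
```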

class flair.data.Label(data_point, value, score=1.0, **metadata)#

Bases: object

This class represents a label.

Each label has a value and optionally a confidence score. The score needs to be between 0.0 and 1.0. Default value for the score is 1.0.

set_value(value, score=1.0)#
property value: str#
property score: float#
to_dict()#
property shortstring#
property metadata_str: str#
property labeled_identifier#
property unlabeled_identifier#
class flair.data.DataPoint#

Bases: object

This is the parent class of all data points in Flair.

Examples of data points are Token, Sentence, Image, etc. Each DataPoint must be embeddable (hence the abstract property embedding() and the methods to() and clear_embeddings()). Each DataPoint may also have Labels in several layers of annotation (hence the functions add_label() and get_labels() and the labels property).

abstract property embedding: Tensor#
set_embedding(name, vector)#
get_embedding(names=None)#
Return type:

Tensor

get_each_embedding(embedding_names=None)#
Return type:

list[Tensor]

to(device, pin_memory=False)#
Return type:

None

clear_embeddings(embedding_names=None)#
Return type:

None

has_label(type)#
Return type:

bool

add_metadata(key, value)#
Return type:

None

get_metadata(key)#
Return type:

Any

has_metadata(key)#
Return type:

bool

add_label(typename, value, score=1.0, **metadata)#

Adds a label to the DataPoint by internally creating a Label object.

Parameters:
  • typename (str) – A string that identifies the layer of annotation, such as “ner” for named entity labels or “sentiment” for sentiment labels

  • value (str) – A string that sets the value of the label.

  • score (float) – Optional value setting the confidence level of the label (between 0 and 1). If not set, a default confidence of 1 is used.

  • **metadata – Additional metadata information.

Return type:

DataPoint

Returns:

A pointer to itself (DataPoint object, now with an added label).
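The contract above (each label lives in a named annotation layer, and add_label() returns the data point itself for chaining) can be sketched with hypothetical stand-in classes; this is an illustration, not the flair implementation:

```python
from dataclasses import dataclass

# Hypothetical stand-ins illustrating the add_label()/get_labels() contract.
@dataclass
class LabelSketch:
    value: str
    score: float = 1.0

class DataPointSketch:
    def __init__(self):
        self.annotation_layers = {}

    def add_label(self, typename, value, score=1.0):
        # Labels are grouped by annotation layer ("ner", "sentiment", ...).
        self.annotation_layers.setdefault(typename, []).append(
            LabelSketch(value, score)
        )
        return self  # a pointer to itself, enabling method chaining

    def get_labels(self, typename=None):
        return self.annotation_layers.get(typename, [])

dp = DataPointSketch().add_label("sentiment", "POSITIVE", 0.9).add_label("ner", "PER")
print([lab.value for lab in dp.get_labels("sentiment")])  # ['POSITIVE']
```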

set_label(typename, value, score=1.0, **metadata)#
remove_labels(typename)#
Return type:

None

get_label(label_type=None, zero_tag_value='O')#
Return type:

Label

get_labels(typename=None)#

Returns all labels of this datapoint belonging to a specific annotation layer.

For instance, if a data point has been labeled with “sentiment”-labels, you can call this function as get_labels(“sentiment”) to return a list of all sentiment labels.

Parameters:

typename (Optional[str]) – The string identifier of the annotation layer, like “sentiment” or “ner”.

Return type:

list[Label]

Returns:

A list of Label objects belonging to this annotation layer for this data point.

property labels: list[Label]#
abstract property unlabeled_identifier#
abstract property start_position: int#
abstract property end_position: int#
abstract property text#
property tag#
property score#
class flair.data.EntityCandidate(concept_id, concept_name, database_name, additional_ids=None, synonyms=None, description=None)#

Bases: object

A Concept as part of a knowledge base or ontology.

to_dict()#
Return type:

dict[str, Any]

class flair.data.Token(text, head_id=None, whitespace_after=1, start_position=0, sentence=None)#

Bases: _PartOfSentence

This class represents one word in a tokenized sentence.

Each token may have any number of tags. It may also point to its head in a dependency tree.

property idx: int#
property text: str#
property unlabeled_identifier: str#
add_tags_proba_dist(tag_type, tags)#
Return type:

None

get_tags_proba_dist(tag_type)#
Return type:

list[Label]

get_head()#
property start_position: int#
property end_position: int#
property embedding#
add_label(typename, value, score=1.0, **metadata)#

Adds a label to the DataPoint by internally creating a Label object.

Parameters:
  • typename (str) – A string that identifies the layer of annotation, such as “ner” for named entity labels or “sentiment” for sentiment labels

  • value (str) – A string that sets the value of the label.

  • score (float) – Optional value setting the confidence level of the label (between 0 and 1). If not set, a default confidence of 1 is used.

  • **metadata – Additional metadata information.

Returns:

A pointer to itself (DataPoint object, now with an added label).

set_label(typename, value, score=1.0, **metadata)#
to_dict(tag_type=None)#
Return type:

dict[str, Any]

class flair.data.Span(tokens)#

Bases: _PartOfSentence

This class represents one textual span consisting of Tokens.

property start_position: int#
property end_position: int#
property text: str#
property unlabeled_identifier: str#
property embedding#
to_dict(tag_type=None)#
class flair.data.Relation(first, second)#

Bases: _PartOfSentence

property tag#
property text#
property unlabeled_identifier: str#
property start_position: int#
property end_position: int#
property embedding#
to_dict(tag_type=None)#
class flair.data.Sentence(text, use_tokenizer=True, language_code=None, start_position=0)#

Bases: DataPoint

A Sentence is a central object in Flair that represents either a single sentence or a whole text.

Internally, it consists of a list of Token objects that represent each word in the text. Additionally, this object stores all metadata related to a text such as labels, language code, etc.

property unlabeled_identifier#
get_relations(label_type=None)#
Return type:

list[Relation]

get_spans(label_type=None)#
Return type:

list[Span]

get_token(token_id)#
Return type:

Optional[Token]

property embedding#
to(device, pin_memory=False)#
clear_embeddings(embedding_names=None)#
left_context(context_length, respect_document_boundaries=True)#
Return type:

list[Token]

right_context(context_length, respect_document_boundaries=True)#
Return type:

list[Token]

to_tagged_string(main_label=None)#
Return type:

str

property text: str#
to_tokenized_string()#
Return type:

str

to_plain_string()#
Return type:

str
Return type:

str

infer_space_after()#

Heuristics for inferring whitespace_after values for tokenized text.

This is useful for some older NLP tasks (such as CoNLL-03 and CoNLL-2000) that provide only tokenized data with no information about the original whitespace.

to_original_text()#
Return type:

str

to_dict(tag_type=None)#
Return type:

dict[str, Any]

get_span(start, stop)#
Return type:

Span

property start_position: int#
property end_position: int#
get_language_code()#
Return type:

str

next_sentence()#

Get the next sentence in the document.

This only works if context is set through the dataloader or elsewhere.

Returns:

The next Sentence in the document if set, otherwise None.

previous_sentence()#

Get the previous sentence in the document.

This only works if context is set through the dataloader or elsewhere.

Returns:

The previous Sentence in the document if set, otherwise None.

is_context_set()#

Determines whether this sentence has a context of sentences before or after set.

Return type:

bool

Returns:

True if context is set (for instance through the dataloader or elsewhere), else False.

copy_context_from_sentence(sentence)#
Return type:

None

classmethod set_context_for_sentences(sentences)#
Return type:

None

get_labels(label_type=None)#

Returns all labels of this datapoint belonging to a specific annotation layer.

For instance, if a data point has been labeled with “sentiment”-labels, you can call this function as get_labels(“sentiment”) to return a list of all sentiment labels.

Parameters:

label_type (Optional[str]) – The string identifier of the annotation layer, like “sentiment” or “ner”.

Returns:

A list of Label objects belonging to this annotation layer for this data point.

remove_labels(typename)#
class flair.data.DataPair(first, second)#

Bases: DataPoint, Generic[DT, DT2]

to(device, pin_memory=False)#
clear_embeddings(embedding_names=None)#
property embedding#
property unlabeled_identifier#
property start_position: int#
property end_position: int#
property text#
class flair.data.DataTriple(first, second, third)#

Bases: DataPoint, Generic[DT, DT2, DT3]

to(device, pin_memory=False)#
clear_embeddings(embedding_names=None)#
property embedding#
property unlabeled_identifier#
property start_position: int#
property end_position: int#
property text#
class flair.data.Image(data=None, imageURL=None)#

Bases: DataPoint

property embedding#
property start_position: int#
property end_position: int#
property text: str#
property unlabeled_identifier: str#
class flair.data.Corpus(train=None, dev=None, test=None, name='corpus', sample_missing_splits=True, random_seed=None)#

Bases: Generic[T_co]

The main object in Flair for holding a dataset used for training and testing.

A corpus consists of three splits: a train split used for training, a dev split used for model selection and/or early stopping, and a test split used for testing. All three splits are optional, so it is possible to create a corpus using only one or two splits. If the option sample_missing_splits is set to True, missing splits will be randomly sampled from the training split.

property train: Dataset[T_co] | None#

The training split as a torch.utils.data.Dataset object.

property dev: Dataset[T_co] | None#

The dev split as a torch.utils.data.Dataset object.

property test: Dataset[T_co] | None#

The test split as a torch.utils.data.Dataset object.

downsample(percentage=0.1, downsample_train=True, downsample_dev=True, downsample_test=True, random_seed=None)#

Randomly downsample the corpus to the given percentage (by removing data points).

This method is an in-place operation, meaning that the Corpus object itself is modified by removing data points. It additionally returns a pointer to itself for use in method chaining.

Parameters:
  • percentage (float) – A float value between 0. and 1. that indicates to which percentage the corpus should be downsampled. Default value is 0.1, meaning it gets downsampled to 10%.

  • downsample_train (bool) – Whether or not to include the training split in downsampling. Default is True.

  • downsample_dev (bool) – Whether or not to include the dev split in downsampling. Default is True.

  • downsample_test (bool) – Whether or not to include the test split in downsampling. Default is True.

  • random_seed (Optional[int]) – An optional random seed to make downsampling reproducible.

Return type:

Corpus

Returns:

A pointer to itself for optional use in method chaining.
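As a rough illustration of what seeded downsampling does, here is a self-contained sketch over a plain list; the real method operates in place on the corpus splits, and downsample_sketch is a hypothetical helper:

```python
import random

def downsample_sketch(data_points, percentage=0.1, random_seed=None):
    # Keep a random `percentage` of the data points; a fixed seed makes
    # the selection reproducible.
    rng = random.Random(random_seed)
    keep = int(len(data_points) * percentage)
    return rng.sample(data_points, keep)

subset = downsample_sketch(list(range(100)), percentage=0.1, random_seed=42)
print(len(subset))  # 10
```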

filter_empty_sentences()#

A method that removes all sentences consisting of zero tokens.

This is an in-place operation that directly modifies the Corpus object itself by removing these sentences.

filter_long_sentences(max_charlength)#

A method that removes all sentences whose plain text is longer than a specified number of characters.

This is an in-place operation that directly modifies the Corpus object itself by removing these sentences.

Parameters:

max_charlength (int) – The maximum permissible character length of a sentence.

make_vocab_dictionary(max_tokens=-1, min_freq=1)#

Creates a Dictionary of all tokens contained in the corpus.

By defining max_tokens you can set the maximum number of tokens that should be contained in the dictionary. If there are more than max_tokens tokens in the corpus, the most frequent tokens are added first. If min_freq is set to a value greater than 1, only tokens occurring at least min_freq times are considered for addition to the dictionary.

Parameters:
  • max_tokens (int) – The maximum number of tokens that should be added to the dictionary (providing a value of “-1” means that there is no maximum in this regard).

  • min_freq (int) – A token needs to occur at least min_freq times to be added to the dictionary (providing a value of “-1” means that there is no limitation in this regard).

Return type:

Dictionary

Returns:

A Dictionary of all unique tokens in the corpus.
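The max_tokens / min_freq selection logic can be sketched as follows, using a stdlib Counter; this illustrates the described selection rules, not the flair code itself:

```python
from collections import Counter

def select_vocab(tokens, max_tokens=-1, min_freq=1):
    # Most frequent tokens are considered first; tokens below min_freq
    # are dropped, and max_tokens caps the vocabulary size (-1 = no cap).
    counts = Counter(tokens)
    vocab = [tok for tok, cnt in counts.most_common() if cnt >= min_freq]
    return vocab if max_tokens == -1 else vocab[:max_tokens]

tokens = ["the", "the", "the", "cat", "cat", "sat"]
print(select_vocab(tokens, min_freq=2))  # ['the', 'cat']
```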

obtain_statistics(label_type=None, pretty_print=True)#

Print statistics about the corpus, including the length of the sentences and the labels in the corpus.

Parameters:
  • label_type (Optional[str]) – Optionally set this value to obtain statistics only for one specific type of label (such as “ner” or “pos”). If not set, statistics for all labels will be returned.

  • pretty_print (bool) – If set to True, returns pretty JSON (indented for readability). If not, the JSON is returned as a single line. Default: True.

Return type:

Union[dict, str]

Returns:

If pretty_print is True, returns a pretty-printed string in JSON format. Otherwise, returns a dictionary holding the JSON data.

make_label_dictionary(label_type, min_count=-1, add_unk=False, add_dev_test=False)#

Creates a dictionary of all labels assigned to the sentences in the corpus.

Parameters:
  • label_type (str) – The name of the label type for which the dictionary should be created. Some corpora have multiple layers of annotation, such as “pos” and “ner”. In this case, you should choose the label type you are interested in.

  • min_count (int) – Optionally set this to exclude rare labels from the dictionary (i.e., labels seen fewer than the provided integer value).

  • add_unk (bool) – Optionally set this to True to include a “UNK” value in the dictionary. In most cases, this is not needed since the label dictionary is well-defined, but some use cases might have open classes and require this.

  • add_dev_test (bool) – Optionally set this to True to construct the label dictionary not only from the train split, but also from dev and test. This is only necessary if some labels never appear in train but do appear in one of the other splits.

Return type:

Dictionary

Returns:

A Dictionary of all unique labels in the corpus.

add_label_noise(label_type, labels, noise_share=0.2, split='train', noise_transition_matrix=None)#

Simulates a uniform label noise distribution in the chosen dataset split.

Parameters:
  • label_type (str) – the type of labels for which the noise should be simulated.

  • labels (list[str]) – an array with unique labels of said type (retrievable from label dictionary).

  • noise_share (float) – the desired share of noise in the chosen split.

  • split (str) – in which dataset split the noise is to be simulated.

  • noise_transition_matrix (Optional[dict[str, list[float]]]) – provides pre-defined probabilities for label flipping based on the initial label value (relevant for class-dependent label noise simulation).
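Uniform label noise of the kind described here can be sketched over a plain list of labels. This is a hypothetical helper for illustration; the real method mutates the chosen corpus split:

```python
import random

def add_uniform_label_noise(labels, label_set, noise_share=0.2, random_seed=None):
    # Flip a random `noise_share` of the labels to a different label
    # drawn uniformly from the remaining classes.
    rng = random.Random(random_seed)
    noisy = list(labels)
    n_flip = int(len(noisy) * noise_share)
    for i in rng.sample(range(len(noisy)), n_flip):
        alternatives = [lab for lab in label_set if lab != noisy[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy

clean = ["POS"] * 50 + ["NEG"] * 50
noisy = add_uniform_label_noise(clean, ["POS", "NEG"], noise_share=0.2, random_seed=1)
print(sum(a != b for a, b in zip(clean, noisy)))  # 20
```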

get_label_distribution()#

Counts occurrences of each label in the corpus and returns them as a dictionary object.

This allows you to get an idea of which label appears how often in the Corpus.

Returns:

Dictionary with labels as keys and their occurrences as values.

get_all_sentences()#

Returns all sentences (spanning all three splits) in the Corpus.

Return type:

ConcatDataset

Returns:

A torch.utils.data.Dataset object that includes all sentences of this corpus.

make_tag_dictionary(tag_type)#

Create a tag dictionary of a given label type.

Parameters:

tag_type (str) – the label type for which to gather the tag labels

Return type:

Dictionary

Returns:

A Dictionary containing the labeled tags, including “O” and “<START>” and “<STOP>”

Deprecated since version 0.8: Use ‘make_label_dictionary’ instead.

class flair.data.MultiCorpus(corpora, task_ids=None, name='multicorpus', **corpusargs)#

Bases: Corpus

class flair.data.FlairDataset#

Bases: Dataset

abstract is_in_memory()#
Return type:

bool

class flair.data.ConcatFlairDataset(datasets, ids)#

Bases: Dataset

Dataset as a concatenation of multiple datasets.

This class is useful to assemble different existing datasets.

Parameters:

datasets (sequence) – List of datasets to be concatenated

static cumsum(sequence)#
datasets: list[Dataset]#
cumulative_sizes: list[int]#
property cummulative_sizes: list[int]#
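The cumulative sizes are what make it possible to route a global index to the right sub-dataset. A sketch of that lookup using only the stdlib (illustrative, not the actual implementation):

```python
import bisect
from itertools import accumulate

# Sizes of three concatenated datasets and their cumulative boundaries.
lengths = [3, 5, 2]
cumulative = list(accumulate(lengths))  # [3, 8, 10]

def locate(global_idx):
    # bisect_right finds the first boundary strictly greater than the index,
    # i.e. the sub-dataset the index falls into; subtracting the previous
    # boundary yields the local offset within that sub-dataset.
    dataset_idx = bisect.bisect_right(cumulative, global_idx)
    offset = global_idx - (cumulative[dataset_idx - 1] if dataset_idx else 0)
    return dataset_idx, offset

print(locate(0))  # (0, 0)
print(locate(4))  # (1, 1)
print(locate(9))  # (2, 1)
```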
flair.data.randomly_split_into_two_datasets(dataset, length_of_first, random_seed=None)#

Shuffles a dataset and splits it into two subsets.

The length of the first subset is specified; the remaining samples go into the second subset.

Return type:

tuple[Subset, Subset]
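A self-contained sketch of the shuffle-then-cut behavior over a list of indices (the real function returns torch Subset objects; randomly_split_sketch is a hypothetical helper):

```python
import random

def randomly_split_sketch(items, length_of_first, random_seed=None):
    # Shuffle a copy, then cut: the first subset gets the requested length
    # and the remaining samples form the second subset.
    shuffled = list(items)
    random.Random(random_seed).shuffle(shuffled)
    return shuffled[:length_of_first], shuffled[length_of_first:]

first, second = randomly_split_sketch(range(10), 3, random_seed=0)
print(len(first), len(second))  # 3 7
```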

flair.data.get_spans_from_bio(bioes_tags, bioes_scores=None)#
Return type:

list[tuple[list[int], float, str]]
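To illustrate the kind of decoding this function performs, here is a simplified sketch that recovers token-index spans from plain BIO tags. It is a stand-in for intuition only: the real function also handles BIOES tags and per-tag scores.

```python
def spans_from_bio_sketch(tags):
    # Collect (token_indices, label) pairs for each B-/I- run.
    spans, current, label = [], [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if current:
                spans.append((current, label))
            current, label = [i], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(i)
        else:
            if current:
                spans.append((current, label))
            current, label = [], None
    if current:
        spans.append((current, label))
    return spans

print(spans_from_bio_sketch(["B-PER", "I-PER", "O", "B-LOC"]))
# [([0, 1], 'PER'), ([3], 'LOC')]
```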