flair.data#

Module Attributes

TextPair

Type alias for a DataPair consisting of two Sentences.

TextTriple

Type alias for a DataTriple consisting of three Sentences.

Functions

get_spans_from_bio(bioes_tags[, bioes_scores])

Decodes a sequence of BIOES/BIO tags into labeled spans with scores.

randomly_split_into_two_datasets(dataset, ...)

Shuffles and splits a dataset into two Subsets.

Classes

BoundingBox(left, top, right, bottom)

Represents a bounding box with left, top, right, and bottom coordinates.

ConcatFlairDataset(datasets, ids)

Concatenates multiple datasets, adding a multitask_id label to each sentence.

Corpus([train, dev, test, name, ...])

The main container for holding train, dev, and test datasets for a task.

DataPair(first, second)

Represents a pair of DataPoints, often used for sentence-pair tasks.

DataPoint()

Abstract base class for all data points in Flair (e.g., Token, Sentence, Image).

DataTriple(first, second, third)

Represents a triplet of DataPoints.

Dictionary([add_unk])

This class holds a dictionary that maps strings to unique integer IDs.

EntityCandidate(concept_id, concept_name, ...)

Represents a potential candidate entity from a knowledge base for entity linking.

FlairDataset()

Abstract base class for Flair datasets, adding an in-memory check.

Image([data, imageURL])

Represents an image as a data point, holding image data or a URL.

Label(data_point, value[, score])

Represents a label assigned to a DataPoint (e.g., Token, Span, Sentence).

MultiCorpus(corpora[, task_ids, name])

A Corpus composed of multiple individual Corpus objects, often for multi-task learning.

Relation(first, second)

Represents a directed relationship between two Spans in the same Sentence.

Sentence(text[, use_tokenizer, ...])

A central data structure representing a sentence or text passage as Tokens.

Span(tokens)

Represents a contiguous sequence of Tokens within a Sentence.

Token(text[, head_id, whitespace_after, ...])

Represents a single token (word, punctuation) within a Sentence.

class flair.data.BoundingBox(left: str, top: int, right: int, bottom: int)View on GitHub#

Bases: NamedTuple

Represents a bounding box with left, top, right, and bottom coordinates.

left: str#

Alias for field number 0

top: int#

Alias for field number 1

right: int#

Alias for field number 2

bottom: int#

Alias for field number 3

class flair.data.Dictionary(add_unk=True)View on GitHub#

Bases: object

This class holds a dictionary that maps strings to unique integer IDs.

Used throughout Flair for representing words, tags, characters, etc. Handles unknown items (<unk>) and flags for multi-label or span tasks. Items are stored internally as bytes for efficiency.
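
Example (a minimal sketch; the item strings are illustrative):

from flair.data import Dictionary

vocab = Dictionary(add_unk=True)          # ID 0 is reserved for <unk>
idx = vocab.add_item("hello")             # adds the item and returns its ID
assert idx == vocab.get_idx_for_item("hello")
print(vocab.get_item_for_index(idx))      # -> "hello"
print(vocab.get_idx_for_item("unseen"))   # -> 0 (<unk>), since add_unk=True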

remove_item(item)View on GitHub#

Removes an item from the dictionary.

Note: This operation might be slow for large dictionaries as it involves list removal. It currently doesn’t re-index subsequent items.

Parameters:

item (str) – The string item to remove.

add_item(item)View on GitHub#

Adds a string item to the dictionary.

If the item exists, returns its ID. Otherwise, adds it and returns the new ID.

Parameters:

item (str) – The string item to add.

Returns:

The integer ID of the item.

Return type:

int

get_idx_for_item(item)View on GitHub#

Retrieves the integer ID for a given string item.

Parameters:

item (str) – The string item.

Returns:

The integer ID. Returns 0 if item is not found and add_unk is True.

Return type:

int

Raises:

IndexError – If the item is not found and add_unk is False.

get_idx_for_items(items)View on GitHub#

Retrieves the integer IDs for a list of string items (non-caching variant).

Return type:

list[int]

get_items()View on GitHub#

Returns a list of all items in the dictionary in order of their IDs.

Return type:

list[str]

get_item_for_index(idx)View on GitHub#

Retrieves the string item corresponding to a given integer ID.

Parameters:

idx (int) – The integer ID.

Returns:

The string item.

Return type:

str

Raises:

IndexError – If the index is out of bounds.

has_item(item)View on GitHub#

Checks if a given string item exists in the dictionary.

Return type:

bool

set_start_stop_tags()View on GitHub#

Adds special <START> and <STOP> tags to the dictionary (often used for CRFs).

Return type:

None

is_span_prediction_problem()View on GitHub#

Checks if the dictionary likely represents BIOES/BIO span labels.

Returns True if the span_labels flag is set or any item starts with ‘B-’, ‘I-’, or ‘S-’.

Returns:

True if likely span labels, False otherwise.

Return type:

bool

start_stop_tags_are_set()View on GitHub#

Checks if <START> and <STOP> tags have been added.

Return type:

bool

save(savefile)View on GitHub#

Saves the dictionary mapping to a file using pickle.

Parameters:

savefile (PathLike) – The path to the output file.

classmethod load_from_file(filename)View on GitHub#

Loads a Dictionary previously saved using the .save() method.

Parameters:

filename (Union[str, Path]) – Path to the saved dictionary file.

Returns:

The loaded Dictionary object.

Return type:

Dictionary

classmethod load(name)View on GitHub#

Loads a pre-built character dictionary or a dictionary from a file path.

Parameters:

name (str) – The name of the pre-built dictionary (e.g., ‘chars’) or a path to a dictionary file.

Returns:

The loaded Dictionary object.

Return type:

Dictionary

Raises:

ValueError – If the name is not recognized or the path is invalid.

class flair.data.Label(data_point, value, score=1.0, **metadata)View on GitHub#

Bases: object

Represents a label assigned to a DataPoint (e.g., Token, Span, Sentence).

data_point#

The data point this label is attached to.

Type:

DataPoint

value#

The string value of the label (e.g., “PERSON”, “POSITIVE”).

Type:

str

score#

The confidence score of the label (0.0 to 1.0).

Type:

float

metadata#

A dictionary for storing arbitrary additional metadata.

Type:

dict

typename#

The name of the annotation layer (set via DataPoint.add_label).

Type:

Optional[str]

set_value(value, score=1.0)View on GitHub#

Updates the value and score of the label.

property value: str#

The string value of the label.

property score: float#

The confidence score of the label (between 0.0 and 1.0).

to_dict()View on GitHub#
property shortstring#
property metadata_str: str#
property labeled_identifier#
property unlabeled_identifier#
property typename: str | None#

The name of the annotation layer this label belongs to (e.g., “ner”).
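
Example (a minimal sketch; Labels are usually created via DataPoint.add_label rather than constructed directly, and the label type and value here are illustrative):

from flair.data import Sentence

sentence = Sentence("I love Berlin .")
sentence.add_label("sentiment", "POSITIVE", score=0.98)

label = sentence.get_label("sentiment")
print(label.value, label.score, label.typename)   # -> POSITIVE 0.98 sentiment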

class flair.data.DataPointView on GitHub#

Bases: ABC

Abstract base class for all data points in Flair (e.g., Token, Sentence, Image).

Defines core functionalities like holding embeddings, managing labels across different annotation layers, and providing basic positional/textual info.

abstract property embedding: Tensor#

Provides the primary embedding representation of the data point.

set_embedding(name, vector)View on GitHub#

Stores an embedding tensor under a given name.

Parameters:
  • name (str) – The name to identify this embedding (e.g., “word”, “flair”).

  • vector (torch.Tensor) – The embedding tensor.

get_embedding(names=None)View on GitHub#

Retrieves embeddings, concatenating if multiple names are given or if names is None.

Parameters:

names (Optional[list[str]], optional) – Specific embedding names to retrieve. If None, concatenates all stored embeddings sorted by name. Defaults to None.

Returns:

A single tensor representing the requested embedding(s).

Returns an empty tensor if no relevant embeddings are found.

Return type:

torch.Tensor

get_each_embedding(embedding_names=None)View on GitHub#

Retrieves a list of individual embedding tensors.

Parameters:

embedding_names (Optional[list[str]], optional) – If provided, filters by these names. Otherwise, returns all stored embeddings. Defaults to None.

Returns:

List of embedding tensors, sorted by name.

Return type:

list[torch.Tensor]
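
Example (a minimal sketch; the embedding names "demo" and "other" are arbitrary):

import torch
from flair.data import Sentence

point = Sentence("hello world")
point.set_embedding("demo", torch.zeros(4))
point.set_embedding("other", torch.ones(2))

print(point.get_embedding().shape)        # concatenation of both, expected torch.Size([6])
print(len(point.get_each_embedding()))    # -> 2 individual tensors
point.clear_embeddings()                  # frees the stored tensors again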

to(device, pin_memory=False)View on GitHub#

Moves all stored embedding tensors to the specified device.

Parameters:
  • device (Union[str, torch.device]) – Target device (e.g., ‘cpu’, ‘cuda:0’).

  • pin_memory (bool, optional) – If True and moving to CUDA, attempts to pin memory. Defaults to False.

Return type:

None

clear_embeddings(embedding_names=None)View on GitHub#

Removes stored embeddings to free memory.

Parameters:

embedding_names (Optional[list[str]], optional) – Specific names to remove. If None, removes all embeddings. Defaults to None.

Return type:

None

has_label(typename)View on GitHub#

Checks if the data point has at least one label for the given annotation type.

Return type:

bool

add_metadata(key, value)View on GitHub#

Adds a key-value pair to the data point’s metadata.

Return type:

None

get_metadata(key)View on GitHub#

Retrieves metadata associated with the given key.

Parameters:

key (str) – The metadata key.

Returns:

The metadata value.

Return type:

Any

Raises:

KeyError – If the key is not found.

has_metadata(key)View on GitHub#

Checks if the data point has metadata for the given key.

Return type:

bool

add_label(typename, value, score=1.0, **metadata)View on GitHub#

Adds a new label to a specific annotation layer.

Parameters:
  • typename (str) – Name of the annotation layer (e.g., “ner”, “sentiment”).

  • value (str) – String value of the label (e.g., “PERSON”, “POSITIVE”).

  • score (float, optional) – Confidence score (0.0-1.0). Defaults to 1.0.

  • **metadata – Additional keyword arguments stored as metadata on the Label.

Returns:

Returns self for chaining.

Return type:

DataPoint

set_label(typename, value, score=1.0, **metadata)View on GitHub#

Sets the label(s) for an annotation layer, overwriting any existing ones.

Parameters:
  • typename (str) – The name of the annotation layer.

  • value (str) – The string value of the new label.

  • score (float, optional) – Confidence score (0.0-1.0). Defaults to 1.0.

  • **metadata – Additional keyword arguments for the new Label’s metadata.

Returns:

Returns self for chaining.

Return type:

DataPoint

remove_labels(typename)View on GitHub#

Removes all labels associated with a specific annotation layer.

Parameters:

typename (str) – The name of the annotation layer to clear.

Return type:

None

get_label(label_type=None, zero_tag_value='O')View on GitHub#

Retrieves the primary label for a given type, or a default ‘O’ label.

Parameters:
  • label_type (Optional[str], optional) – The annotation layer name. Defaults to None (uses first overall label).

  • zero_tag_value (str, optional) – Value for the default label if none found. Defaults to “O”.

Returns:

The primary label, or a default label with score 0.0.

Return type:

Label

get_labels(typename=None)View on GitHub#

Retrieves all labels for a specific annotation layer.

Parameters:

typename (Optional[str], optional) – The layer name. If None, returns all labels from all layers. Defaults to None.

Returns:

List of Label objects, or empty list if none found.

Return type:

list[Label]

property labels: list[Label]#

Returns a list of all labels from all annotation layers.

abstract property unlabeled_identifier: str#

A string identifier for the data point itself, without label info.

abstract property start_position: int#

The starting character offset within the original text.

abstract property end_position: int#

The ending character offset (exclusive) within the original text.

abstract property text: str#

The textual representation of this data point.

property tag: str#

Shortcut property for the value of the first label added.

property score: float#

Shortcut property for the score of the first label added.

class flair.data.EntityCandidate(concept_id, concept_name, database_name, additional_ids=None, synonyms=None, description=None)View on GitHub#

Bases: object

Represents a potential candidate entity from a knowledge base for entity linking.

to_dict()View on GitHub#
Return type:

dict[str, Any]

class flair.data.Token(text, head_id=None, whitespace_after=1, start_position=0, sentence=None)View on GitHub#

Bases: _PartOfSentence

Represents a single token (word, punctuation) within a Sentence.

form#

The textual content of the token.

Type:

str

idx#

The 1-based index within the sentence (-1 if not attached).

Type:

int

head_id#

1-based index of the dependency head.

Type:

Optional[int]

whitespace_after#

Number of spaces following this token.

Type:

int

start_position#

Character offset where this token begins.

Type:

int

tags_proba_dist#

Stores full probability distributions over tags.

Type:

dict[str, list[Label]]
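
Example (a minimal sketch; Tokens are normally obtained from a tokenized Sentence rather than created directly):

from flair.data import Sentence

sentence = Sentence("The grass is green .")
token = sentence[1]                       # indexing a Sentence yields a Token
print(token.text)                         # -> grass
print(token.idx)                          # -> 2 (indices are 1-based)
print(token.start_position, token.end_position)   # character offsets in the sentence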

property idx: int#

The 1-based index within the sentence (-1 if not attached).

property text: str#

The text content of the token.

property unlabeled_identifier: str#

String identifier of the form ‘Token[<idx>]: “<text>”’.

add_tags_proba_dist(tag_type, tags)View on GitHub#

Stores a list of Labels representing a probability distribution for a tag type.

Parameters:
  • tag_type (str) – The annotation layer name (e.g., “pos”).

  • tags (list[Label]) – List of Labels, each with a tag value and probability score.

Return type:

None

get_tags_proba_dist(tag_type)View on GitHub#

Retrieves the stored probability distribution for a given tag type.

Parameters:

tag_type (str) – The annotation layer name.

Returns:

List of Labels representing the distribution, or an empty list if none stored.

Return type:

list[Label]

get_head()View on GitHub#

Returns the head Token in the dependency parse, if available.

Return type:

Optional[Token]

property start_position: int#

Character offset where the token begins in the Sentence text.

property end_position: int#

Character offset where the token ends (exclusive).

property embedding: Tensor#

Returns the concatenated embeddings stored for this token.

add_label(typename, value, score=1.0, **metadata)View on GitHub#

Adds a label, propagating it to the parent Sentence’s layer.

set_label(typename, value, score=1.0, **metadata)View on GitHub#

Sets a label (overwriting), propagating the change to the parent Sentence.

to_dict(tag_type=None)View on GitHub#
Return type:

dict[str, Any]

class flair.data.Span(tokens)View on GitHub#

Bases: _PartOfSentence

Represents a contiguous sequence of Tokens within a Sentence.

Used for entities, phrases, etc. Implements caching via __new__ within Sentence.
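
Example (a minimal sketch; the label value is illustrative):

from flair.data import Sentence

sentence = Sentence("George Washington went to Washington .")
span = sentence.get_span(0, 2)            # tokens 0..1 -> "George Washington"
span.add_label("ner", "PER")

for entity in sentence.get_spans("ner"):
    print(entity.text, entity.get_label("ner").value)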

tokens#

The list of tokens constituting the span.

Type:

list[Token]

property start_position: int#

Character offset where the span begins (start of the first token).

property end_position: int#

Character offset where the span ends (end of the last token, exclusive).

property text: str#

The combined text of tokens in the span, respecting whitespace offsets.

property unlabeled_identifier: str#

String identifier of the form ‘Span[<start_idx>:<end_idx>]: “<text_preview>”’.

property embedding: Tensor#

Returns embeddings stored directly on the Span object (if any).

to_dict(tag_type=None)View on GitHub#
class flair.data.Relation(first, second)View on GitHub#

Bases: _PartOfSentence

Represents a directed relationship between two Spans in the same Sentence.

Used for Relation Extraction. Caching via __new__ ensures uniqueness.

first#

The head Span of the relation.

Type:

Span

second#

The tail Span of the relation.

Type:

Span
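
Example (a minimal sketch; the spans and the label value are illustrative):

from flair.data import Relation, Sentence

sentence = Sentence("Google was founded by Larry Page .")
org = sentence.get_span(0, 1)             # "Google"
per = sentence.get_span(4, 6)             # "Larry Page"

relation = Relation(org, per)
relation.add_label("relation", "founded_by")
print(relation.unlabeled_identifier)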

property tag#

Shortcut property for the value of the first label added.

property text: str#

A simple textual representation of the form ‘<head_text_preview> -> <tail_text_preview>’.

property unlabeled_identifier: str#

String identifier including span indices and text previews.

property start_position: int#

Character offset of the earliest start position of the two spans.

property end_position: int#

Character offset of the latest end position of the two spans.

property embedding: Tensor#

Placeholder for relation embedding (usually computed on the fly).

to_dict(tag_type=None)View on GitHub#
class flair.data.Sentence(text, use_tokenizer=True, language_code=None, start_position=0)View on GitHub#

Bases: DataPoint

A central data structure representing a sentence or text passage as Tokens.

Holds text, tokens, labels (sentence/token/span/relation levels), embeddings, and document context information.

tokens#

List of tokens (lazy tokenization if initialized with str).

Type:

list[Token]

text#

Original, untokenized text.

Type:

str

language_code#

ISO 639-1 language code.

Type:

Optional[str]

start_position#

Character offset in a larger document.

Type:

int
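
Example (a minimal sketch; the label type and value are illustrative):

from flair.data import Sentence

sentence = Sentence("The grass is green .")

for token in sentence:                    # a Sentence is iterable over its Tokens
    print(token.idx, token.text)

sentence.add_label("topic", "nature")     # attach a sentence-level label
print(sentence.get_label("topic").value)  # -> nature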

property tokens: list[Token]#

The list of Token objects (triggers tokenization if needed).

property unlabeled_identifier#

A string identifier for the data point itself, without label info.

property text: str#

Returns the original text of this sentence. Does not trigger tokenization.

to_original_text()View on GitHub#

Returns the original text of this sentence.

Return type:

str

to_tagged_string(main_label=None)View on GitHub#
Return type:

str

get_relations(label_type=None)View on GitHub#

Retrieves all Relation objects associated with this sentence.

Return type:

list[Relation]

get_spans(label_type=None)View on GitHub#
Return type:

list[Span]

get_token(token_id)View on GitHub#
Return type:

Optional[Token]

property embedding#

Provides the primary embedding representation of the data point.

to(device, pin_memory=False)View on GitHub#

Moves all stored embedding tensors to the specified device.

Parameters:
  • device (Union[str, torch.device]) – Target device (e.g., ‘cpu’, ‘cuda:0’).

  • pin_memory (bool, optional) – If True and moving to CUDA, attempts to pin memory. Defaults to False.

clear_embeddings(embedding_names=None)View on GitHub#

Removes stored embeddings to free memory.

Parameters:

embedding_names (Optional[list[str]], optional) – Specific names to remove. If None, removes all embeddings. Defaults to None.

left_context(context_length, respect_document_boundaries=True)View on GitHub#
Return type:

list[Token]

right_context(context_length, respect_document_boundaries=True)View on GitHub#
Return type:

list[Token]

to_tokenized_string()View on GitHub#
Return type:

str

to_plain_string()View on GitHub#
Return type:

str

infer_space_after()View on GitHub#

Heuristics to infer whitespace_after values for pre-tokenized text.

This is useful for some older NLP tasks (such as CoNLL-03 and CoNLL-2000) that provide only tokenized data with no information about the original whitespace.

to_dict(tag_type=None)View on GitHub#
Return type:

dict[str, Any]

get_span(start, stop)View on GitHub#
Return type:

Span

property start_position: int#

The starting character offset within the original text.

property end_position: int#

The ending character offset (exclusive) within the original text.

get_language_code()View on GitHub#
Return type:

str

next_sentence()View on GitHub#

Get the next sentence in the document.

This only works if context is set through the dataloader or elsewhere.

Returns:

The next Sentence in the document if set, otherwise None.

previous_sentence()View on GitHub#

Get the previous sentence in the document.

This only works if context is set through the dataloader or elsewhere.

Returns:

The previous Sentence in the document if set, otherwise None.

is_context_set()View on GitHub#

Determines whether this sentence has a context of preceding or following sentences set.

Returns:

True if context is set (for instance by the dataloader), else False.

Return type:

bool

copy_context_from_sentence(sentence)View on GitHub#
Return type:

None

classmethod set_context_for_sentences(sentences)View on GitHub#
Return type:

None
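
Example (a minimal sketch of setting document context manually):

from flair.data import Sentence

sentences = [Sentence("First sentence ."), Sentence("Second sentence .")]
Sentence.set_context_for_sentences(sentences)

assert sentences[0].next_sentence() is sentences[1]
assert sentences[1].previous_sentence() is sentences[0]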

get_labels(label_type=None)View on GitHub#

Retrieves all labels for a specific annotation layer.

Parameters:

label_type (Optional[str], optional) – The layer name. If None, returns all labels from all layers. Defaults to None.

Returns:

List of Label objects, or empty list if none found.

Return type:

list[Label]

remove_labels(typename)View on GitHub#

Removes all labels associated with a specific annotation layer.

Parameters:

typename (str) – The name of the annotation layer to clear.

truncate(max_tokens)View on GitHub#

Truncates the sentence to max_tokens, cleaning up associated annotations.

Return type:

None

retokenize(tokenizer)View on GitHub#

Retokenizes the sentence using the provided tokenizer while preserving span labels.

Parameters:

tokenizer – The tokenizer to use for retokenization

Example:

# Imports (added for completeness; StaccatoTokenizer lives in flair.tokenization)
from flair.data import Sentence
from flair.tokenization import StaccatoTokenizer

# Create a sentence with default tokenization
sentence = Sentence("01-03-2025 New York")

# Add span labels
sentence.get_span(1, 3).add_label('ner', "LOC")
sentence.get_span(0, 1).add_label('ner', "DATE")

# Retokenize with a different tokenizer while preserving labels
sentence.retokenize(StaccatoTokenizer())

class flair.data.DataPair(first, second)View on GitHub#

Bases: DataPoint, Generic[DT, DT2]

Represents a pair of DataPoints, often used for sentence-pair tasks.

to(device, pin_memory=False)View on GitHub#

Moves all stored embedding tensors to the specified device.

Parameters:
  • device (Union[str, torch.device]) – Target device (e.g., ‘cpu’, ‘cuda:0’).

  • pin_memory (bool, optional) – If True and moving to CUDA, attempts to pin memory. Defaults to False.

clear_embeddings(embedding_names=None)View on GitHub#

Removes stored embeddings to free memory.

Parameters:

embedding_names (Optional[list[str]], optional) – Specific names to remove. If None, removes all embeddings. Defaults to None.

property embedding#

Provides the primary embedding representation of the data point.

property unlabeled_identifier#

A string identifier for the data point itself, without label info.

property start_position: int#

The starting character offset within the original text.

property end_position: int#

The ending character offset (exclusive) within the original text.

property text#

The textual representation of this data point.

flair.data.TextPair#

Type alias for a DataPair consisting of two Sentences.

alias of DataPair[Sentence, Sentence]
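
Example (a minimal sketch; the sentences and the label are illustrative):

from flair.data import DataPair, Sentence

pair = DataPair(Sentence("A man is eating ."), Sentence("Someone is eating food ."))
pair.add_label("entailment", "ENTAILS")
print(pair.first.text, "|", pair.second.text)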

class flair.data.DataTriple(first, second, third)View on GitHub#

Bases: DataPoint, Generic[DT, DT2, DT3]

Represents a triplet of DataPoints.

to(device, pin_memory=False)View on GitHub#

Moves all stored embedding tensors to the specified device.

Parameters:
  • device (Union[str, torch.device]) – Target device (e.g., ‘cpu’, ‘cuda:0’).

  • pin_memory (bool, optional) – If True and moving to CUDA, attempts to pin memory. Defaults to False.

clear_embeddings(embedding_names=None)View on GitHub#

Removes stored embeddings to free memory.

Parameters:

embedding_names (Optional[list[str]], optional) – Specific names to remove. If None, removes all embeddings. Defaults to None.

property embedding#

Provides the primary embedding representation of the data point.

property unlabeled_identifier#

A string identifier for the data point itself, without label info.

property start_position: int#

The starting character offset within the original text.

property end_position: int#

The ending character offset (exclusive) within the original text.

property text#

The textual representation of this data point.

flair.data.TextTriple#

Type alias for a DataTriple consisting of three Sentences.

alias of DataTriple[Sentence, Sentence, Sentence]

class flair.data.Image(data=None, imageURL=None)View on GitHub#

Bases: DataPoint

Represents an image as a data point, holding image data or a URL.

property embedding: Tensor#

Returns the concatenated embeddings stored for this image.

property start_position: int#

The starting character offset within the original text.

property end_position: int#

The ending character offset (exclusive) within the original text.

property text: str#

The textual representation of this data point.

property unlabeled_identifier: str#

A string identifier for the data point itself, without label info.

class flair.data.Corpus(train=None, dev=None, test=None, name='corpus', sample_missing_splits=True, random_seed=None)View on GitHub#

Bases: Generic[T_co]

The main container for holding train, dev, and test datasets for a task.

A corpus consists of three splits: a train split used for training, a dev split used for model selection or early stopping, and a test split used for final evaluation. All three splits are optional, so it is possible to create a corpus using only one or two splits. If the option sample_missing_splits is set to True, missing splits are randomly sampled from the training split. The class also provides methods for sampling, filtering, and creating dictionaries.

Generics:

T_co: The covariant type of DataPoint in the datasets (e.g., Sentence).
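
Example (a brief sketch; UD_ENGLISH is just one of the loaders in flair.datasets, all of which return a Corpus):

from flair.datasets import UD_ENGLISH

corpus = UD_ENGLISH()
print(corpus)                             # summary of train/dev/test sizes

corpus = corpus.downsample(0.1)           # in-place; returns self for chaining
print(len(corpus.train))                  # roughly 10% of the original train split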

train#

Training data split.

Type:

Optional[Dataset[T_co]]

dev#

Development (validation) data split.

Type:

Optional[Dataset[T_co]]

test#

Testing data split.

Type:

Optional[Dataset[T_co]]

name#

Name of the corpus.

Type:

str

property train: Dataset[T_co] | None#

The training split as a torch.utils.data.Dataset object.

property dev: Dataset[T_co] | None#

The dev split as a torch.utils.data.Dataset object.

property test: Dataset[T_co] | None#

The test split as a torch.utils.data.Dataset object.

downsample(percentage=0.1, downsample_train=True, downsample_dev=True, downsample_test=True, random_seed=None)View on GitHub#

Randomly downsample the corpus to the given percentage (by removing data points).

This method is an in-place operation, meaning that the Corpus object itself is modified by removing data points. It additionally returns a pointer to itself for use in method chaining.

Parameters:
  • percentage (float) – A float value between 0. and 1. that indicates to which percentage the corpus should be downsampled. Default value is 0.1, meaning it gets downsampled to 10%.

  • downsample_train (bool) – Whether or not to include the training split in downsampling. Default is True.

  • downsample_dev (bool) – Whether or not to include the dev split in downsampling. Default is True.

  • downsample_test (bool) – Whether or not to include the test split in downsampling. Default is True.

  • random_seed (Optional[int]) – An optional random seed to make downsampling reproducible.

Returns:

Returns self for chaining.

Return type:

Corpus

filter_empty_sentences()View on GitHub#

A method that filters all sentences consisting of 0 tokens.

This is an in-place operation that directly modifies the Corpus object itself by removing these sentences.

filter_long_sentences(max_charlength)View on GitHub#

A method that filters all sentences for which the plain text is longer than a specified number of characters.

This is an in-place operation that directly modifies the Corpus object itself by removing these sentences.

Parameters:

max_charlength (int) – Maximum allowed character length.

make_vocab_dictionary(max_tokens=-1, min_freq=1)View on GitHub#

Creates a Dictionary of all tokens contained in the corpus.

By defining max_tokens you can set the maximum number of tokens that should be contained in the dictionary. If there are more than max_tokens tokens in the corpus, the most frequent tokens are added first. If min_freq is set to a value greater than 1 only tokens occurring more than min_freq times are considered to be added to the dictionary.

Parameters:
  • max_tokens (int) – The maximum number of tokens that should be added to the dictionary (providing a value of “-1” means that there is no maximum in this regard).

  • min_freq (int) – A token needs to occur at least min_freq times to be added to the dictionary (providing a value of “-1” means that there is no limitation in this regard).

Returns:

Vocabulary Dictionary mapping tokens to IDs (includes <unk>).

Return type:

Dictionary

obtain_statistics(label_type=None, pretty_print=True)View on GitHub#

Print statistics about the corpus, including the length of the sentences and the labels in the corpus.

Parameters:
  • label_type (Optional[str]) – Optionally set this value to obtain statistics only for one specific type of label (such as “ner” or “pos”). If not set, statistics for all labels will be returned.

  • pretty_print (bool) – If set to True, returns pretty-printed JSON (indented for readability). If not, the JSON is returned as a single line. Default: True.

Return type:

Union[dict, str]

Returns:

If pretty_print is True, returns a pretty-printed JSON string. Otherwise, returns a dictionary holding the statistics.

make_label_dictionary(label_type, min_count=1, add_unk=True, add_dev_test=False)View on GitHub#

Creates a Dictionary for a specific label type from the corpus.

Parameters:
  • label_type (str) – The name of the label type for which the dictionary should be created. Some corpora have multiple layers of annotation, such as “pos” and “ner”. In this case, you should choose the label type you are interested in.

  • min_count (int) – Optionally set this to exclude rare labels from the dictionary (i.e., labels seen fewer than the provided integer value).

  • add_unk (bool) – Optionally set this to True to include a “UNK” value in the dictionary. In most cases, this is not needed since the label dictionary is well-defined, but some use cases might have open classes and require this.

  • add_dev_test (bool) – Optionally set this to True to construct the label dictionary not only from the train split, but also from dev and test. This is only necessary if some labels never appear in train but do appear in one of the other splits.

Returns:

Dictionary mapping label values to IDs.

Return type:

Dictionary

Raises:
  • ValueError – If label_type is not found.

  • AssertionError – If no data splits are available to scan.
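
Example (a brief sketch; assumes corpus is a Corpus whose sentences carry an "ner" annotation layer):

label_dict = corpus.make_label_dictionary(label_type="ner", add_unk=False)
print(label_dict.get_items())             # label values depend on the data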

add_label_noise(label_type, labels, noise_share=0.2, split='train', noise_transition_matrix=None)View on GitHub#

Adds artificial label noise to a specified split (in-place).

Stores original labels under {label_type}_clean.

Parameters:
  • label_type (str) – Target label type.

  • labels (list[str]) – List of all possible valid labels for the type.

  • noise_share (float, optional) – Target proportion for uniform noise (0.0-1.0). Ignored if matrix is given. Defaults to 0.2.

  • split (str, optional) – Split to modify (‘train’, ‘dev’, ‘test’). Defaults to “train”.

  • noise_transition_matrix (Optional[dict[str, list[float]]], optional) – Matrix for class-dependent noise. Defaults to None (use uniform noise).

Return type:

None

get_label_distribution()View on GitHub#

Counts occurrences of each label in the corpus and returns them as a dictionary object.

This allows you to get an idea of which label appears how often in the Corpus.

Returns:

Dictionary with labels as keys and their occurrences as values.

get_all_sentences()View on GitHub#

Returns all sentences (spanning all three splits) in the Corpus.

Return type:

ConcatDataset

Returns:

A torch.utils.data.Dataset object that includes all sentences of this corpus.

make_tag_dictionary(tag_type)View on GitHub#

DEPRECATED: Creates a tag dictionary, ensuring ‘O’, ‘<START>’, and ‘<STOP>’ are included.

Return type:

Dictionary

Deprecated since version 0.8: Use ‘make_label_dictionary(add_unk=False)’ instead.

class flair.data.MultiCorpus(corpora, task_ids=None, name='multicorpus', **corpusargs)View on GitHub#

Bases: Corpus

A Corpus composed of multiple individual Corpus objects, often for multi-task learning.

class flair.data.FlairDatasetView on GitHub#

Bases: Dataset

Abstract base class for Flair datasets, adding an in-memory check.

abstract is_in_memory()View on GitHub#

Returns True if the entire dataset is currently loaded in memory, False otherwise.

Return type:

bool

class flair.data.ConcatFlairDataset(datasets, ids)View on GitHub#

Bases: Dataset

Concatenates multiple datasets, adding a multitask_id label to each sentence.

Parameters:
  • datasets (Iterable[Dataset]) – List of datasets to concatenate.

  • ids (Iterable[str]) – List of task IDs corresponding to each dataset.

static cumsum(sequence)View on GitHub#
datasets: list[Dataset]#
cumulative_sizes: list[int]#
property cummulative_sizes: list[int]#
flair.data.randomly_split_into_two_datasets(dataset, length_of_first, random_seed=None)View on GitHub#

Shuffles and splits a dataset into two Subsets.

Parameters:
  • dataset (Dataset) – Input dataset.

  • length_of_first (int) – Desired number of samples in the first subset.

  • random_seed (Optional[int], optional) – Seed for reproducible shuffle. Defaults to None.

Returns:

The two dataset subsets.

Return type:

tuple[Subset, Subset]

Raises:

ValueError – If length_of_first is invalid.
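
Example (a brief sketch; works with any map-style dataset, a small TensorDataset is used here only for illustration):

import torch
from torch.utils.data import TensorDataset
from flair.data import randomly_split_into_two_datasets

dataset = TensorDataset(torch.arange(10))
first, second = randomly_split_into_two_datasets(dataset, length_of_first=8, random_seed=42)
print(len(first), len(second))            # -> 8 2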

flair.data.get_spans_from_bio(bioes_tags, bioes_scores=None)View on GitHub#

Decodes a sequence of BIOES/BIO tags into labeled spans with scores.

Parameters:
  • bioes_tags (list[str]) – List of predicted tags (e.g., “B-PER”, “I-PER”).

  • bioes_scores (Optional[list[float]], optional) – Confidence scores for each tag. Defaults to 1.0 if None.

Returns:

List of found spans as tuples of (token_indices, avg_score, label_type).

Return type:

list[tuple[list[int], float, str]]
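
Example (a brief decoding sketch; scores default to 1.0 when none are passed):

from flair.data import get_spans_from_bio

tags = ["B-PER", "I-PER", "O", "S-LOC"]
for token_indices, score, label in get_spans_from_bio(tags):
    print(token_indices, score, label)
# expected output:
# [0, 1] 1.0 PER
# [3] 1.0 LOC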