flair.data

class flair.data.BoundingBox(left, top, right, bottom)

Bases: tuple

left: str

Alias for field number 0

top: int

Alias for field number 1

right: int

Alias for field number 2

bottom: int

Alias for field number 3
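The "alias for field number N" wording is how named tuples describe their fields. The sketch below is an illustrative stand-in built with the standard library, not Flair's own class; it copies the field names and annotations listed above and shows that each field can be read by name or by position. The coordinate values are made up for the example.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    """Illustrative stand-in that mirrors the documented fields."""

    left: str  # annotated as str in the reference above
    top: int
    right: int
    bottom: int


# NamedTuple annotations are not enforced at runtime, so integer
# coordinates are accepted for every field, including `left`.
box = BoundingBox(left=10, top=20, right=110, bottom=60)

# Each field is an alias for a tuple position.
by_name = (box.left, box.top, box.right, box.bottom)
by_index = (box[0], box[1], box[2], box[3])

# Because the object is a tuple, it can be unpacked directly.
left, top, right, bottom = box
width, height = right - left, bottom - top
```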

class flair.data.Dictionary(add_unk=True)

Bases: object

This class holds a dictionary that maps strings to IDs, used to generate one-hot encodings of strings.

remove_item(item)
add_item(item)

Adds a string to the dictionary. If the string is already in the dictionary, its existing ID is returned; otherwise it is assigned a new ID.

Parameters:

item (str) – a string for which to assign an id.

Return type:

int

Returns: ID of string

get_idx_for_item(item)

Returns the ID of the string, or 0 if the string is not in the dictionary.

Parameters:

item (str) – string for which ID is requested

Return type:

int

Returns: ID of the string, or 0 if it is not found

get_idx_for_items(items)

Returns the ID of each string in the list, using 0 for any string that is not found.

Parameters:

items (List[str]) – List of strings for which IDs are requested

Return type:

List[int]

Returns: List of IDs of the strings
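The add and lookup behaviour described above can be sketched with a plain mapping. TinyDictionary is a hypothetical stand-in, not Flair's implementation; in particular, reserving ID 0 for an `<unk>` entry is an assumption made here so that 0 can double as the "not found" value.

```python
from typing import Dict, List


class TinyDictionary:
    """Hypothetical string-to-ID mapping; not Flair's code."""

    def __init__(self, add_unk: bool = True) -> None:
        self.item2idx: Dict[str, int] = {}
        self.idx2item: List[str] = []
        if add_unk:
            # Assumption: the unknown token occupies ID 0.
            self.add_item("<unk>")

    def add_item(self, item: str) -> int:
        # Return the existing ID, or assign the next free one.
        if item not in self.item2idx:
            self.item2idx[item] = len(self.idx2item)
            self.idx2item.append(item)
        return self.item2idx[item]

    def get_idx_for_item(self, item: str) -> int:
        # Unknown strings fall back to 0.
        return self.item2idx.get(item, 0)

    def get_idx_for_items(self, items: List[str]) -> List[int]:
        return [self.get_idx_for_item(item) for item in items]


d = TinyDictionary()
first = d.add_item("PER")
again = d.add_item("PER")  # same string, same ID
ids = d.get_idx_for_items(["PER", "LOC"])  # "LOC" was never added
```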

get_items()
Return type:

List[str]

get_item_for_index(idx)
set_start_stop_tags()
is_span_prediction_problem()
Return type:

bool

start_stop_tags_are_set()
Return type:

bool

save(savefile)
classmethod load_from_file(filename)
classmethod load(name)

class flair.data.Label(data_point, value, score=1.0, **metadata)

Bases: object

This class represents a label.

Each label has a value and optionally a confidence score. The score needs to be between 0.0 and 1.0. Default value for the score is 1.0.

set_value(value, score=1.0)
property value: str
property score: float
to_dict()
property shortstring
property metadata_str: str
property labeled_identifier
property unlabeled_identifier

class flair.data.DataPoint

Bases: object

This is the parent class of all data points in Flair.

Examples of data points are Token, Sentence, Image, etc. Each DataPoint must be embeddable (hence the abstract property embedding and the methods to() and clear_embeddings()). Each DataPoint may also have Labels in several layers of annotation (hence the methods add_label() and get_labels() and the property labels).

abstract property embedding
set_embedding(name, vector)
get_embedding(names=None)
Return type:

Tensor

get_each_embedding(embedding_names=None)
Return type:

List[Tensor]

to(device, pin_memory=False)
clear_embeddings(embedding_names=None)
has_label(type)
Return type:

bool

add_metadata(key, value)
Return type:

None

get_metadata(key)
Return type:

Any

has_metadata(key)
Return type:

bool

add_label(typename, value, score=1.0, **metadata)
set_label(typename, value, score=1.0, **metadata)
remove_labels(typename)
get_label(label_type=None, zero_tag_value='O')
get_labels(typename=None)
property labels: List[Label]
abstract property unlabeled_identifier
abstract property start_position: int
abstract property end_position: int
abstract property text
property tag
property score

class flair.data.EntityCandidate(concept_id, concept_name, database_name, additional_ids=None, synonyms=None, description=None)

Bases: object

A Concept as part of a knowledgebase or ontology.

__init__(concept_id, concept_name, database_name, additional_ids=None, synonyms=None, description=None)

A Concept as part of a knowledgebase or ontology.

Parameters:
  • concept_id (str) – Identifier of the concept from the knowledgebase / ontology

  • concept_name (str) – (Canonical) name of the concept from the knowledgebase / ontology

  • additional_ids (Optional[List[str]]) – List of additional identifiers for the concept / entity in the KB / ontology

  • database_name (str) – Name of the knowledgebase / ontology

  • synonyms (Optional[List[str]]) – A list of synonyms for this entry

  • description (Optional[str]) – A description of the concept

to_dict()
Return type:

Dict[str, Any]

class flair.data.Token(text, head_id=None, whitespace_after=1, start_position=0, sentence=None)

Bases: _PartOfSentence

This class represents one word in a tokenized sentence.

Each token may have any number of tags. It may also point to its head in a dependency tree.

property idx: int
property text: str
property unlabeled_identifier: str
add_tags_proba_dist(tag_type, tags)
get_tags_proba_dist(tag_type)
Return type:

List[Label]

get_head()
property start_position: int
property end_position: int
property embedding
add_label(typename, value, score=1.0, **metadata)
set_label(typename, value, score=1.0, **metadata)
to_dict(tag_type=None)

class flair.data.Span(tokens)

Bases: _PartOfSentence

This class represents one textual span consisting of Tokens.

property start_position: int
property end_position: int
property text: str
property unlabeled_identifier: str
property embedding
to_dict(tag_type=None)

class flair.data.Relation(first, second)

Bases: _PartOfSentence

property tag
property text
property unlabeled_identifier: str
property start_position: int
property end_position: int
property embedding
to_dict(tag_type=None)

class flair.data.Sentence(text, use_tokenizer=True, language_code=None, start_position=0)

Bases: DataPoint

A Sentence is a list of tokens and is used to represent a sentence or text fragment.

__init__(text, use_tokenizer=True, language_code=None, start_position=0)

Class to hold all metadata related to a text.

Metadata can be tokens, labels, predictions, language code, etc.

Parameters:
  • text (Union[str, List[str], List[Token]]) – original string (sentence), or a pretokenized list of tokens.

  • use_tokenizer (Union[bool, Tokenizer]) – Specify a custom tokenizer to split the text into tokens. The default is flair.tokenization.SegTokTokenizer. If use_tokenizer is set to False, flair.tokenization.SpaceTokenizer is used instead. The tokenizer is ignored if text is already a pretokenized list of tokens.

  • language_code (Optional[str]) – Language of the sentence. If not provided, langdetect will be called when the language_code is accessed for the first time.

  • start_position (int) – Start char offset of the sentence in the superordinate document.

property unlabeled_identifier
get_relations(label_type=None)
Return type:

List[Relation]

get_spans(label_type=None)
Return type:

List[Span]

get_token(token_id)
Return type:

Optional[Token]

property embedding
to(device, pin_memory=False)
clear_embeddings(embedding_names=None)
left_context(context_length, respect_document_boundaries=True)
Return type:

List[Token]

right_context(context_length, respect_document_boundaries=True)
Return type:

List[Token]

to_tagged_string(main_label=None)
Return type:

str

property text
to_tokenized_string()
Return type:

str

to_plain_string()
infer_space_after()

Heuristics in case you wish to infer whitespace_after values for tokenized text.

This is useful for some older NLP tasks (such as CoNLL-03 and CoNLL-2000) that provide only tokenized data with no information about the original whitespace.

to_original_text()
Return type:

str

to_dict(tag_type=None)
get_span(start, stop)
property start_position: int
property end_position: int
get_language_code()
Return type:

str

next_sentence()

Get the next sentence in the document.

This only works if context is set through the dataloader or elsewhere.

Returns: next Sentence in document if set, otherwise None

previous_sentence()

Get the previous sentence in the document.

This only works if context is set through the dataloader or elsewhere.

Returns: previous Sentence in document if set, otherwise None

is_context_set()

Determines whether this sentence has a context of preceding or following sentences set.

Return type:

bool

Returns: True if context is set (for instance in the dataloader or elsewhere), else False

copy_context_from_sentence(sentence)
Return type:

None

classmethod set_context_for_sentences(sentences)
Return type:

None

get_labels(label_type=None)
remove_labels(typename)

class flair.data.DataPair(first, second)

Bases: DataPoint, Generic[DT, DT2]

to(device, pin_memory=False)
clear_embeddings(embedding_names=None)
property embedding
property unlabeled_identifier
property start_position: int
property end_position: int
property text

class flair.data.Image(data=None, imageURL=None)

Bases: DataPoint

property embedding
property start_position: int
property end_position: int
property text: str
property unlabeled_identifier: str

class flair.data.Corpus(train=None, dev=None, test=None, name='corpus', sample_missing_splits=True)

Bases: Generic[T_co]

property train: Dataset[T_co] | None
property dev: Dataset[T_co] | None
property test: Dataset[T_co] | None
downsample(percentage=0.1, downsample_train=True, downsample_dev=True, downsample_test=True)
filter_empty_sentences()
filter_long_sentences(max_charlength)
make_vocab_dictionary(max_tokens=-1, min_freq=1)

Creates a dictionary of all tokens contained in the corpus.

By defining max_tokens you can set the maximum number of tokens that the dictionary should contain. If the corpus contains more than max_tokens distinct tokens, the most frequent ones are added first. If min_freq is set to a value greater than 1, only tokens occurring at least min_freq times are considered for the dictionary.

Parameters:
  • max_tokens – the maximum number of tokens that should be added to the dictionary (-1 = take all tokens)

  • min_freq – a token needs to occur at least min_freq times to be added to the dictionary (-1 = there is no limitation)

Return type:

Dictionary

Returns: dictionary of tokens
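The interaction of max_tokens and min_freq can be sketched with collections.Counter. The helper most_common_tokens is hypothetical and only approximates the selection rule described above, reading min_freq as "at least min_freq occurrences" as the parameter description states; it is not Flair's implementation and returns a plain list rather than a Dictionary.

```python
from collections import Counter
from typing import Iterable, List


def most_common_tokens(
    tokens: Iterable[str], max_tokens: int = -1, min_freq: int = 1
) -> List[str]:
    """Hypothetical helper: select vocabulary entries by frequency."""
    vocab: List[str] = []
    # most_common() yields tokens from most to least frequent.
    for token, freq in Counter(tokens).most_common():
        if min_freq != -1 and freq < min_freq:
            break  # every remaining token is at most this frequent
        if max_tokens != -1 and len(vocab) == max_tokens:
            break  # the vocabulary is full
        vocab.append(token)
    return vocab


corpus_tokens = ["the", "cat", "the", "dog", "the", "cat", "bird"]
frequent = most_common_tokens(corpus_tokens, min_freq=2)
top_one = most_common_tokens(corpus_tokens, max_tokens=1)
everything = most_common_tokens(corpus_tokens)
```

With these toy tokens, min_freq=2 keeps only "the" and "cat", and max_tokens=1 keeps only "the".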

obtain_statistics(label_type=None, pretty_print=True)

Print statistics about the class distribution and sentence sizes.

Only labels of sentences are taken into account.

Return type:

Union[dict, str]

make_label_dictionary(label_type, min_count=-1, add_unk=False, add_dev_test=False)

Creates a dictionary of all labels assigned to the sentences in the corpus.

Return type:

Dictionary

Returns:

dictionary of labels

add_label_noise(label_type, labels, noise_share=0.2, split='train', noise_transition_matrix=None)

Generates a uniform label noise distribution in the chosen dataset split.

Parameters:
  • label_type (str) – the type of labels for which the noise should be simulated.

  • labels (List[str]) – an array with unique labels of said type (retrievable from label dictionary).

  • noise_share (float) – the desired share of noise in the chosen split (the train split by default).

  • split (str) – in which dataset split the noise is to be simulated.

  • noise_transition_matrix (Optional[Dict[str, List[float]]]) – provides pre-defined probabilities for label flipping based on the initial label value (relevant for class-dependent label noise simulation).
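One common reading of uniform label noise is sketched below: a noise_share fraction of the data points is selected at random, and each selected label is replaced by a different label drawn uniformly from the remaining ones. The function add_uniform_label_noise, its seed parameter, and the rounding of the corrupted count are assumptions for illustration; Flair's own procedure may differ, and the class-dependent case driven by noise_transition_matrix is not covered.

```python
import random
from typing import List


def add_uniform_label_noise(
    values: List[str], labels: List[str], noise_share: float = 0.2, seed: int = 0
) -> List[str]:
    """Hypothetical sketch: flip a share of labels to a different label."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    noisy = list(values)
    n_corrupt = round(noise_share * len(values))
    # Pick distinct positions to corrupt.
    for i in rng.sample(range(len(values)), n_corrupt):
        # Draw uniformly from every label except the current one.
        alternatives = [label for label in labels if label != values[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy


clean = ["POS", "NEG", "NEU", "POS", "NEG", "NEU", "POS", "NEG", "NEU", "POS"]
noisy = add_uniform_label_noise(clean, labels=["POS", "NEG", "NEU"], noise_share=0.2)
changed = sum(a != b for a, b in zip(clean, noisy))
```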

get_label_distribution()
get_all_sentences()
Return type:

ConcatDataset

make_tag_dictionary(tag_type)

Create a tag dictionary of a given label type.

Parameters:

tag_type (str) – the label type from which to gather the tag labels

Return type:

Dictionary

Returns: A Dictionary containing the labeled tags, including “O” and “<START>” and “<STOP>”

Deprecated since version 0.8: Use ‘make_label_dictionary’ instead.

class flair.data.MultiCorpus(corpora, task_ids=None, name='multicorpus', **corpusargs)

Bases: Corpus

class flair.data.FlairDataset(*args, **kwds)

Bases: Dataset

abstract is_in_memory()
Return type:

bool

class flair.data.ConcatFlairDataset(datasets, ids)

Bases: Dataset

Dataset as a concatenation of multiple datasets.

This class is useful to assemble different existing datasets.

Parameters:

datasets (sequence) – List of datasets to be concatenated

static cumsum(sequence)
datasets: List[Dataset]
cumulative_sizes: List[int]
property cummulative_sizes
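A concatenated dataset typically uses cumulative sizes to map a global index to one of its member datasets. The sketch below illustrates that idea with the standard library; for simplicity this cumsum takes a list of sizes rather than datasets, and locate is a hypothetical helper that is not part of the documented API.

```python
import bisect
from typing import List, Sequence, Tuple


def cumsum(sizes: Sequence[int]) -> List[int]:
    """Running totals of the member dataset sizes."""
    total, result = 0, []
    for size in sizes:
        total += size
        result.append(total)
    return result


def locate(cumulative_sizes: List[int], idx: int) -> Tuple[int, int]:
    """Hypothetical helper: map a global index to (dataset index, local index)."""
    # The first running total greater than idx marks the owning dataset.
    dataset_idx = bisect.bisect_right(cumulative_sizes, idx)
    if dataset_idx == 0:
        return 0, idx
    return dataset_idx, idx - cumulative_sizes[dataset_idx - 1]


cumulative = cumsum([3, 2, 4])  # three datasets with 3, 2 and 4 items
position = locate(cumulative, 4)  # global item 4 lives in the second dataset
```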
flair.data.iob2(tags)

Converts the tags to the IOB2 format.

Check that tags have a valid IOB format. Tags in IOB1 format are converted to IOB2.
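In IOB1, a span may open with an I- tag, and B- is only needed to separate adjacent spans of the same type; in IOB2, every span opens with B-. A minimal sketch of the conversion on plain strings follows. The function to_iob2, its in-place behaviour, and its boolean return value are assumptions for illustration; the documented function may operate on different objects and report validity differently.

```python
from typing import List


def to_iob2(tags: List[str]) -> bool:
    """Convert tags in place to IOB2; return False if a tag is malformed."""
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        prefix, _, label = tag.partition("-")
        if prefix not in ("B", "I") or not label:
            return False  # not a valid IOB tag
        if prefix == "B":
            continue
        # An I- tag at the sentence start, after "O", or after a different
        # label opens a span in IOB1 and must become B- in IOB2.
        previous = tags[i - 1] if i > 0 else "O"
        if previous == "O" or previous[2:] != label:
            tags[i] = "B-" + label
    return True


tags = ["I-PER", "I-PER", "O", "I-LOC", "I-ORG"]
valid = to_iob2(tags)
```

After the call, the list reads B-PER, I-PER, O, B-LOC, B-ORG.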

flair.data.randomly_split_into_two_datasets(dataset, length_of_first)
flair.data.get_spans_from_bio(bioes_tags, bioes_scores=None)
Return type:

List[Tuple[List[int], float, str]]
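The return type suggests one tuple per span: the token indices, a score, and the span label. The sketch below produces that shape from B-/I-/O tags only. The function spans_from_bio is hypothetical, averaging token scores into the span score is an assumption, and S-/E- tags from the BIOES scheme are not handled.

```python
from typing import List, Optional, Tuple


def spans_from_bio(
    tags: List[str], scores: Optional[List[float]] = None
) -> List[Tuple[List[int], float, str]]:
    """Hypothetical sketch: group B-/I- tags into (indices, score, label)."""
    if scores is None:
        scores = [1.0] * len(tags)
    spans: List[Tuple[List[int], float, str]] = []
    start: Optional[int] = None  # index where the open span began
    label = ""
    # A trailing "O" sentinel closes a span that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        if tag.startswith("I-") and start is not None and tag[2:] == label:
            continue  # the open span keeps growing
        if start is not None:
            indices = list(range(start, i))
            mean = sum(scores[j] for j in indices) / len(indices)
            spans.append((indices, mean, label))
            start = None
        if tag != "O":
            # B- tags, and stray I- tags, open a new span.
            start, label = i, tag[2:]
    return spans


spans = spans_from_bio(["B-PER", "I-PER", "O", "B-LOC"], [0.5, 1.0, 1.0, 0.75])
```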