flair.data
- class flair.data.BoundingBox(left, top, right, bottom)
Bases: tuple
- left: str # Alias for field number 0
- top: int # Alias for field number 1
- right: int # Alias for field number 2
- bottom: int # Alias for field number 3
- class flair.data.Dictionary(add_unk=True)
Bases: object
This class holds a dictionary that maps strings to IDs, used to generate one-hot encodings of strings.
- remove_item(item)
- add_item(item)
Add a string to the dictionary. If the string is already in the dictionary, its existing ID is returned; otherwise it is assigned a new ID.
- Parameters:
  item (str) – a string for which to assign an ID.
- Return type: int
- Returns: ID of the string
- get_idx_for_item(item)
Returns the ID of the string if it is in the dictionary, otherwise 0.
- Parameters:
  item (str) – string for which the ID is requested
- Return type: int
- Returns: ID of the string, otherwise 0
- get_idx_for_items(items)
Returns the ID for each item in the list of strings, or 0 for items not found.
- Parameters:
  items (List[str]) – list of strings for which IDs are requested
- Return type: List[int]
- Returns: list of IDs of the strings
- get_items()
- Return type: List[str]
- get_item_for_index(idx)
- set_start_stop_tags()
- is_span_prediction_problem()
- Return type: bool
- start_stop_tags_are_set()
- Return type: bool
- save(savefile)
- classmethod load_from_file(filename)
- classmethod load(name)
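The add_item/get_idx_for_item contract above can be sketched with a minimal re-implementation. This is illustrative only, not flair's actual code; it assumes that with add_unk=True index 0 is reserved for an unknown-token entry, which is why lookups of unseen strings return 0:

```python
class MiniDictionary:
    """Minimal sketch of a string-to-ID mapping (illustrative, not flair's code)."""

    def __init__(self, add_unk=True):
        self.item2idx = {}
        self.idx2item = []
        if add_unk:
            self.add_item("<unk>")  # assumption: index 0 holds the unknown token

    def add_item(self, item):
        # If already present, return the existing ID; otherwise assign a new one.
        if item not in self.item2idx:
            self.item2idx[item] = len(self.idx2item)
            self.idx2item.append(item)
        return self.item2idx[item]

    def get_idx_for_item(self, item):
        # Unseen strings map to 0 (the <unk> slot).
        return self.item2idx.get(item, 0)

    def get_idx_for_items(self, items):
        return [self.get_idx_for_item(i) for i in items]

    def get_items(self):
        return list(self.idx2item)

d = MiniDictionary()
d.add_item("hello")  # assigned ID 1 (0 is <unk>)
d.add_item("world")  # assigned ID 2
print(d.get_idx_for_items(["hello", "unseen"]))  # [1, 0]
```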
- class flair.data.Label(data_point, value, score=1.0)
Bases: object
This class represents a label.
Each label has a value and optionally a confidence score. The score must be between 0.0 and 1.0; the default is 1.0.
- set_value(value, score=1.0)
- property value: str
- property score: float
- to_dict()
- property shortstring
- property labeled_identifier
- property unlabeled_identifier
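The value/score pairing can be sketched as follows. This is a hypothetical simplification: flair's Label also keeps a reference to its data point, and whether the real class rejects out-of-range scores is an assumption made here for illustration:

```python
class MiniLabel:
    """Sketch of a value/confidence pair (illustrative, not flair's Label)."""

    def __init__(self, value, score=1.0):
        self.set_value(value, score)

    def set_value(self, value, score=1.0):
        # Assumption: enforce the documented 0.0-1.0 score range by raising.
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
        self.value = value
        self.score = score

    def to_dict(self):
        return {"value": self.value, "confidence": self.score}

lbl = MiniLabel("POSITIVE", 0.92)
print(lbl.to_dict())  # {'value': 'POSITIVE', 'confidence': 0.92}
```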
- class flair.data.DataPoint
Bases: object
This is the parent class of all data points in Flair.
Examples of data points are Token, Sentence, Image, etc. Each DataPoint must be embeddable (hence the abstract property embedding() and the methods to() and clear_embeddings()). Each DataPoint may also carry Labels in several layers of annotation (hence the functions add_label(), get_labels() and the property 'label').
- abstract property embedding
- set_embedding(name, vector)
- get_embedding(names=None)
- Return type: Tensor
- get_each_embedding(embedding_names=None)
- Return type: List[Tensor]
- to(device, pin_memory=False)
- clear_embeddings(embedding_names=None)
- has_label(type)
- Return type: bool
- add_metadata(key, value)
- Return type: None
- get_metadata(key)
- Return type: Any
- has_metadata(key)
- Return type: bool
- add_label(typename, value, score=1.0)
- set_label(typename, value, score=1.0)
- remove_labels(typename)
- get_label(label_type=None, zero_tag_value='O')
- get_labels(typename=None)
- abstract property unlabeled_identifier
- abstract property start_position: int
- abstract property end_position: int
- abstract property text
- property tag
- property score
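The "layers of annotation" idea above can be sketched with a minimal label store. This is an illustrative approximation, not flair's implementation; it assumes add_label() appends to a layer while set_label() replaces that layer's labels, which matches the documented method pair:

```python
from collections import defaultdict

class MiniLabeledPoint:
    """Sketch of multi-layer annotation (illustrative): labels grouped by typename."""

    def __init__(self):
        self.annotation_layers = defaultdict(list)

    def add_label(self, typename, value, score=1.0):
        # Append to the layer, keeping existing labels of the same type.
        self.annotation_layers[typename].append((value, score))

    def set_label(self, typename, value, score=1.0):
        # Replace all labels of this type.
        self.annotation_layers[typename] = [(value, score)]

    def get_labels(self, typename=None):
        if typename is None:
            return [l for layer in self.annotation_layers.values() for l in layer]
        return self.annotation_layers.get(typename, [])

    def has_label(self, typename):
        return bool(self.annotation_layers.get(typename))

p = MiniLabeledPoint()
p.add_label("ner", "PER", 0.9)
p.add_label("ner", "LOC", 0.4)
p.set_label("sentiment", "POSITIVE")
print(len(p.get_labels("ner")))  # 2
```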
- class flair.data.Token(text, head_id=None, whitespace_after=1, start_position=0, sentence=None)
Bases: _PartOfSentence
This class represents one word in a tokenized sentence.
Each token may have any number of tags. It may also point to its head in a dependency tree.
- property idx: int
- property text: str
- property unlabeled_identifier: str
- add_tags_proba_dist(tag_type, tags)
- get_tags_proba_dist(tag_type)
- Return type: List[Label]
- get_head()
- property start_position: int
- property end_position: int
- property embedding
- add_label(typename, value, score=1.0)
- set_label(typename, value, score=1.0)
- to_dict(tag_type=None)
- class flair.data.Span(tokens)
Bases: _PartOfSentence
This class represents one textual span consisting of Tokens.
- property start_position: int
- property end_position: int
- property text: str
- property unlabeled_identifier: str
- property embedding
- to_dict(tag_type=None)
- class flair.data.Relation(first, second)
Bases: _PartOfSentence
- property tag
- property text
- property unlabeled_identifier: str
- property start_position: int
- property end_position: int
- property embedding
- to_dict(tag_type=None)
- class flair.data.Sentence(text, use_tokenizer=True, language_code=None, start_position=0)
Bases: DataPoint
A Sentence is a list of tokens and is used to represent a sentence or text fragment.
- __init__(text, use_tokenizer=True, language_code=None, start_position=0)
Class to hold all metadata related to a text.
Metadata can be tokens, labels, predictions, language code, etc.
- Parameters:
  text (Union[str, List[str], List[Token]]) – original string (sentence), or a pre-tokenized list of tokens.
  use_tokenizer (Union[bool, Tokenizer]) – a custom tokenizer used to split the text into tokens. The default is flair.tokenization.SegTokTokenizer. If use_tokenizer is set to False, flair.tokenization.SpaceTokenizer is used instead. The tokenizer is ignored if text is already a list of tokens.
  language_code (Optional[str]) – language of the sentence. If not provided, langdetect is called when language_code is accessed for the first time.
  start_position (int) – start character offset of the sentence in the superordinate document.
- property unlabeled_identifier
- get_relations(label_type=None)
- Return type: List[Relation]
- get_spans(label_type=None)
- Return type: List[Span]
- get_token(token_id)
- Return type: Optional[Token]
- property embedding
- to(device, pin_memory=False)
- clear_embeddings(embedding_names=None)
- left_context(context_length, respect_document_boundaries=True)
- Return type: List[Token]
- right_context(context_length, respect_document_boundaries=True)
- Return type: List[Token]
- to_tagged_string(main_label=None)
- Return type: str
- property text
- to_tokenized_string()
- Return type: str
- to_plain_string()
- infer_space_after()
Heuristics in case you wish to infer whitespace_after values for tokenized text.
This is useful for some older NLP tasks (such as CoNLL-03 and CoNLL-2000) that provide only tokenized data with no information about the original whitespace.
- to_original_text()
- Return type: str
- to_dict(tag_type=None)
- get_span(start, stop)
- property start_position: int
- property end_position: int
- get_language_code()
- Return type: str
- next_sentence()
Get the next sentence in the document.
This only works if context is set through the dataloader or elsewhere.
- Returns: next Sentence in the document if set, otherwise None
- previous_sentence()
Get the previous sentence in the document.
This only works if context is set through the dataloader or elsewhere.
- Returns: previous Sentence in the document if set, otherwise None
- is_context_set()
Determines whether this sentence has a context of sentences before or after it set (for instance through the dataloader or elsewhere).
- Return type: bool
- Returns: True if context is set, else False
- copy_context_from_sentence(sentence)
- Return type: None
- classmethod set_context_for_sentences(sentences)
- Return type: None
- get_labels(label_type=None)
- remove_labels(typename)
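The offset bookkeeping that Sentence performs for its tokens (start_position, end_position per token) can be approximated with a whitespace tokenizer. This is an illustrative sketch of the use_tokenizer=False behaviour described above, not flair's SpaceTokenizer itself:

```python
def space_tokenize(text, start_position=0):
    """Sketch: split on whitespace and record character offsets for each token
    (illustrative approximation of use_tokenizer=False, not flair's code)."""
    tokens = []
    offset = 0
    for word in text.split():
        begin = text.index(word, offset)  # first occurrence at or after `offset`
        tokens.append((word, start_position + begin, start_position + begin + len(word)))
        offset = begin + len(word)
    return tokens

print(space_tokenize("The grass is green ."))
# [('The', 0, 3), ('grass', 4, 9), ('is', 10, 12), ('green', 13, 18), ('.', 19, 20)]
```

Passing a non-zero start_position shifts all offsets, mirroring the documented "start char offset of the sentence in the superordinate document".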
- class flair.data.DataPair(first, second)
Bases: DataPoint, Generic[DT, DT2]
- to(device, pin_memory=False)
- clear_embeddings(embedding_names=None)
- property embedding
- property unlabeled_identifier
- property start_position: int
- property end_position: int
- property text
- class flair.data.Image(data=None, imageURL=None)
Bases: DataPoint
- property embedding
- property start_position: int
- property end_position: int
- property text: str
- property unlabeled_identifier: str
- class flair.data.Corpus(train=None, dev=None, test=None, name='corpus', sample_missing_splits=True)
Bases: Generic[T_co]
- property train: Dataset[T_co] | None
- property dev: Dataset[T_co] | None
- property test: Dataset[T_co] | None
- downsample(percentage=0.1, downsample_train=True, downsample_dev=True, downsample_test=True)
- filter_empty_sentences()
- filter_long_sentences(max_charlength)
- make_vocab_dictionary(max_tokens=-1, min_freq=1)
Creates a dictionary of all tokens contained in the corpus.
By defining max_tokens you can set the maximum number of tokens that should be contained in the dictionary. If there are more than max_tokens tokens in the corpus, the most frequent tokens are added first. If min_freq is set to a value greater than 1, only tokens occurring at least min_freq times are considered for addition to the dictionary.
- Parameters:
  max_tokens – the maximum number of tokens that should be added to the dictionary (-1 = take all tokens)
  min_freq – a token needs to occur at least min_freq times to be added to the dictionary (-1 = no limitation)
- Returns: dictionary of tokens
- obtain_statistics(label_type=None, pretty_print=True)
Print statistics about the class distribution and sentence sizes.
Only the labels of sentences are taken into account.
- Return type: Union[dict, str]
- make_label_dictionary(label_type, min_count=-1, add_unk=False, add_dev_test=False)
Creates a dictionary of all labels assigned to the sentences in the corpus.
- Returns: dictionary of labels
- add_label_noise(label_type, labels, noise_share=0.2, split='train', noise_transition_matrix=None)
Generates uniform label noise distribution in the chosen dataset split.
- Parameters:
  label_type (str) – the type of labels for which the noise should be simulated.
  labels (List[str]) – an array with unique labels of said type (retrievable from the label dictionary).
  noise_share (float) – the desired share of noise in the chosen split.
  split (str) – the dataset split in which the noise is to be simulated.
  noise_transition_matrix (Optional[Dict[str, List[float]]]) – pre-defined probabilities for label flipping based on the initial label value (relevant for class-dependent label noise simulation).
- get_label_distribution()
- get_all_sentences()
- Return type: ConcatDataset
- make_tag_dictionary(tag_type)
Create a tag dictionary of a given label type.
- Parameters:
  tag_type (str) – the label type for which to gather the tag labels
- Returns: a Dictionary containing the labeled tags, including "O", "<START>" and "<STOP>"
Deprecated since version 0.8: use make_label_dictionary() instead.
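The make_vocab_dictionary rule described above (most frequent tokens first, capped at max_tokens, filtered by min_freq) can be sketched as a plain frequency count. This is an illustrative re-statement of the documented behaviour, not flair's code; whether the cap counts the reserved <unk> slot is an assumption here:

```python
from collections import Counter

def make_vocab(sentences, max_tokens=-1, min_freq=1):
    """Sketch of vocabulary building (illustrative, not flair's implementation)."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = ["<unk>"]  # assumption: index 0 is reserved for unknown tokens
    for token, freq in counts.most_common():  # most frequent tokens first
        if min_freq > 1 and freq < min_freq:
            continue  # token does not occur often enough
        if max_tokens != -1 and len(vocab) >= max_tokens:
            break  # dictionary is full
        vocab.append(token)
    return vocab

sents = [["the", "cat"], ["the", "dog"], ["a", "cat"]]
print(make_vocab(sents, min_freq=2))  # ['<unk>', 'the', 'cat']
```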
- class flair.data.MultiCorpus(corpora, task_ids=None, name='multicorpus', **corpusargs)
Bases: Corpus
- class flair.data.FlairDataset(*args, **kwds)
Bases: Dataset
- abstract is_in_memory()
- Return type: bool
- class flair.data.ConcatFlairDataset(datasets, ids)
Bases: Dataset
Dataset as a concatenation of multiple datasets.
This class is useful for assembling different existing datasets.
- Parameters:
  datasets (sequence) – list of datasets to be concatenated
- static cumsum(sequence)
- datasets: List[Dataset]
- cumulative_sizes: List[int]
- property cummulative_sizes
- flair.data.iob2(tags)
Converts the tags to the IOB2 format.
Checks that the tags have a valid IOB format. Tags in IOB1 format are converted to IOB2.
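The IOB1-to-IOB2 conversion can be sketched as follows. This is an illustrative re-implementation, not flair's exact code: in IOB1 a chunk may open with an I- tag, while IOB2 requires every chunk to open with B-, so an I-X becomes B-X whenever it does not continue a preceding chunk of the same type:

```python
def to_iob2(tags):
    """Sketch of IOB1 -> IOB2 conversion (illustrative, not flair's code)."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O" or tag.startswith("B-"):
            out.append(tag)  # already valid IOB2
            continue
        if not tag.startswith("I-"):
            raise ValueError(f"invalid IOB tag: {tag}")
        prev = tags[i - 1] if i > 0 else "O"
        # Continue the chunk only if the previous tag carries the same type.
        if prev != "O" and prev[2:] == tag[2:]:
            out.append(tag)
        else:
            out.append("B-" + tag[2:])  # chunk start: rewrite I-X as B-X
    return out

print(to_iob2(["I-PER", "I-PER", "O", "I-LOC"]))
# ['B-PER', 'I-PER', 'O', 'B-LOC']
```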
- flair.data.randomly_split_into_two_datasets(dataset, length_of_first)
- flair.data.get_spans_from_bio(bioes_tags, bioes_scores=None)
- Return type: List[Tuple[List[int], float, str]]
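The return type above (token indices, a span-level score, and a label per span) can be sketched for plain BIO tags. This is an illustrative simplification, not flair's implementation: flair's version also handles BIOES S-/E- tags, and averaging the per-token scores into the span score is an assumption made here:

```python
def spans_from_bio(bio_tags, scores=None):
    """Sketch: extract (token_indices, mean_score, label) triples from BIO tags
    (illustrative only; stray I- tags outside a chunk are dropped)."""
    if scores is None:
        scores = [1.0] * len(bio_tags)
    spans, current, label = [], [], None

    def flush():
        if current:
            span_scores = [scores[i] for i in current]
            spans.append((list(current), sum(span_scores) / len(span_scores), label))

    for i, tag in enumerate(bio_tags):
        if tag.startswith("B-"):
            flush()                       # close any open chunk
            current, label = [i], tag[2:]  # open a new one
        elif tag.startswith("I-") and current and label == tag[2:]:
            current.append(i)             # continue the open chunk
        else:
            flush()                       # "O" or a non-matching I- ends the chunk
            current, label = [], None
    flush()
    return spans

print(spans_from_bio(["B-PER", "I-PER", "O", "B-LOC"], [0.5, 0.75, 1.0, 0.25]))
# [([0, 1], 0.625, 'PER'), ([3], 0.25, 'LOC')]
```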