flair.data#
Functions

- get_spans_from_bio()
- randomly_split_into_two_datasets(): Shuffles a dataset and splits it into two subsets.

Classes

- BoundingBox
- ConcatFlairDataset: Dataset as a concatenation of multiple datasets.
- Corpus: The main object in Flair for holding a dataset used for training and testing.
- DataPair
- DataPoint: This is the parent class of all data points in Flair.
- DataTriple
- Dictionary: This class holds a dictionary that maps strings to IDs, used to generate one-hot encodings of strings.
- EntityCandidate: A concept as part of a knowledge base or ontology.
- FlairDataset
- Image
- Label: This class represents a label.
- MultiCorpus
- Relation
- Sentence: A Sentence is a central object in Flair that represents either a single sentence or a whole text.
- Span: This class represents one textual span consisting of Tokens.
- Token: This class represents one word in a tokenized sentence.
- class flair.data.BoundingBox(left, top, right, bottom)#
Bases: NamedTuple
- left: str # Alias for field number 0
- top: int # Alias for field number 1
- right: int # Alias for field number 2
- bottom: int # Alias for field number 3
- class flair.data.Dictionary(add_unk=True)#
Bases: object
This class holds a dictionary that maps strings to IDs, used to generate one-hot encodings of strings.
- remove_item(item)#
- add_item(item)#
Adds a string to the dictionary. If the string is already in the dictionary, this returns its existing ID; if not, it is assigned a new ID.
- Parameters: item (str) – a string for which to assign an ID.
- Return type: int
- Returns: ID of the string
- get_idx_for_item(item)#
Returns the ID of the string, otherwise 0.
- Parameters: item (str) – string for which the ID is requested
- Return type: int
- Returns: ID of the string, otherwise 0
- get_idx_for_items(items)#
Returns the ID for each item in the list of strings, or 0 if an item is not found.
- Parameters: items (list[str]) – list of strings for which IDs are requested
- Return type: list[int]
- Returns: list of IDs of the strings
- get_items()#
- Return type: list[str]
- get_item_for_index(idx)#
- Return type: str
- has_item(item)#
- Return type: bool
- set_start_stop_tags()#
- Return type: None
- is_span_prediction_problem()#
- Return type: bool
- start_stop_tags_are_set()#
- Return type: bool
- save(savefile)#
- classmethod load_from_file(filename)#
- Return type: Dictionary
- classmethod load(name)#
- Return type: Dictionary
- class flair.data.Label(data_point, value, score=1.0, **metadata)#
Bases: object
This class represents a label.
Each label has a value and, optionally, a confidence score. The score must be between 0.0 and 1.0; the default value is 1.0.
- set_value(value, score=1.0)#
- property value: str#
- property score: float#
- to_dict()#
- property shortstring#
- property metadata_str: str#
- property labeled_identifier#
- property unlabeled_identifier#
- class flair.data.DataPoint#
Bases: object
This is the parent class of all data points in Flair.
Examples of data points are Token, Sentence, Image, etc. Each DataPoint must be embeddable (hence the abstract property embedding() and the methods to() and clear_embeddings()). Each DataPoint may also carry Labels in several layers of annotation (hence the functions add_label(), get_labels() and the property label).
- abstract property embedding: Tensor#
- set_embedding(name, vector)#
- get_embedding(names=None)#
- Return type: Tensor
- get_each_embedding(embedding_names=None)#
- Return type: list[Tensor]
- to(device, pin_memory=False)#
- Return type: None
- clear_embeddings(embedding_names=None)#
- Return type: None
- has_label(type)#
- Return type: bool
- add_metadata(key, value)#
- Return type: None
- get_metadata(key)#
- Return type: Any
- has_metadata(key)#
- Return type: bool
- add_label(typename, value, score=1.0, **metadata)#
Adds a label to the DataPoint by internally creating a Label object.
- Parameters:
  - typename (str) – A string that identifies the layer of annotation, such as "ner" for named entity labels or "sentiment" for sentiment labels.
  - value (str) – A string that sets the value of the label.
  - score (float) – Optional value setting the confidence level of the label (between 0 and 1). If not set, a default confidence of 1 is used.
  - **metadata – Additional metadata information.
- Return type: DataPoint
- Returns: A pointer to itself (the DataPoint object, now with an added label).
- set_label(typename, value, score=1.0, **metadata)#
- remove_labels(typename)#
- Return type: None
- get_label(label_type=None, zero_tag_value='O')#
- Return type: Label
- get_labels(typename=None)#
Returns all labels of this data point belonging to a specific annotation layer.
For instance, if a data point has been labeled with "sentiment" labels, you can call get_labels("sentiment") to return a list of all sentiment labels.
- abstract property unlabeled_identifier#
- abstract property start_position: int#
- abstract property end_position: int#
- abstract property text#
- property tag#
- property score#
- class flair.data.EntityCandidate(concept_id, concept_name, database_name, additional_ids=None, synonyms=None, description=None)#
Bases: object
A concept as part of a knowledge base or ontology.
- to_dict()#
- Return type: dict[str, Any]
- class flair.data.Token(text, head_id=None, whitespace_after=1, start_position=0, sentence=None)#
Bases: _PartOfSentence
This class represents one word in a tokenized sentence.
Each token may have any number of tags. It may also point to its head in a dependency tree.
- property idx: int#
- property text: str#
- property unlabeled_identifier: str#
- add_tags_proba_dist(tag_type, tags)#
- Return type: None
- get_tags_proba_dist(tag_type)#
- Return type: list[Label]
- get_head()#
- property start_position: int#
- property end_position: int#
- property embedding#
- add_label(typename, value, score=1.0, **metadata)#
Adds a label to the DataPoint by internally creating a Label object.
- Parameters:
  - typename (str) – A string that identifies the layer of annotation, such as "ner" for named entity labels or "sentiment" for sentiment labels.
  - value (str) – A string that sets the value of the label.
  - score (float) – Optional value setting the confidence level of the label (between 0 and 1). If not set, a default confidence of 1 is used.
  - **metadata – Additional metadata information.
- Returns: A pointer to itself (the DataPoint object, now with an added label).
- set_label(typename, value, score=1.0, **metadata)#
- to_dict(tag_type=None)#
- Return type: dict[str, Any]
- class flair.data.Span(tokens)#
Bases: _PartOfSentence
This class represents one textual span consisting of Tokens.
- property start_position: int#
- property end_position: int#
- property text: str#
- property unlabeled_identifier: str#
- property embedding#
- to_dict(tag_type=None)#
- class flair.data.Relation(first, second)#
Bases: _PartOfSentence
- property tag#
- property text#
- property unlabeled_identifier: str#
- property start_position: int#
- property end_position: int#
- property embedding#
- to_dict(tag_type=None)#
- class flair.data.Sentence(text, use_tokenizer=True, language_code=None, start_position=0)#
Bases: DataPoint
A Sentence is a central object in Flair that represents either a single sentence or a whole text.
Internally, it consists of a list of Token objects that represent each word in the text. Additionally, this object stores all metadata related to a text, such as labels, language code, etc.
- property unlabeled_identifier#
- get_relations(label_type=None)#
- Return type: list[Relation]
- get_spans(label_type=None)#
- Return type: list[Span]
- get_token(token_id)#
- Return type: Optional[Token]
- property embedding#
- to(device, pin_memory=False)#
- clear_embeddings(embedding_names=None)#
- left_context(context_length, respect_document_boundaries=True)#
- Return type: list[Token]
- right_context(context_length, respect_document_boundaries=True)#
- Return type: list[Token]
- to_tagged_string(main_label=None)#
- Return type: str
- property text: str#
- to_tokenized_string()#
- Return type: str
- to_plain_string()#
- Return type: str
- infer_space_after()#
Heuristic to infer whitespace_after values for tokenized text.
This is useful for some older NLP tasks (such as CoNLL-03 and CoNLL-2000) that provide only tokenized data with no information about the original whitespace.
- to_original_text()#
- Return type: str
- to_dict(tag_type=None)#
- Return type: dict[str, Any]
- get_span(start, stop)#
- Return type: Span
- property start_position: int#
- property end_position: int#
- get_language_code()#
- Return type: str
- next_sentence()#
Get the next sentence in the document.
This only works if context is set through the dataloader or elsewhere.
- Returns: the next Sentence in the document if set, otherwise None.
- previous_sentence()#
Get the previous sentence in the document.
This only works if context is set through the dataloader or elsewhere.
- Returns: the previous Sentence in the document if set, otherwise None.
- is_context_set()#
Determines whether this sentence has a context of sentences before or after set (for instance, by the dataloader).
- Return type: bool
- Returns: True if context is set, else False.
- copy_context_from_sentence(sentence)#
- Return type: None
- classmethod set_context_for_sentences(sentences)#
- Return type: None
- get_labels(label_type=None)#
Returns all labels of this data point belonging to a specific annotation layer.
For instance, if a data point has been labeled with "sentiment" labels, you can call get_labels("sentiment") to return a list of all sentiment labels.
- Parameters: label_type – The string identifier of the annotation layer, like "sentiment" or "ner".
- Returns: A list of Label objects belonging to this annotation layer for this data point.
- remove_labels(typename)#
- class flair.data.DataPair(first, second)#
Bases: DataPoint, Generic[DT, DT2]
- to(device, pin_memory=False)#
- clear_embeddings(embedding_names=None)#
- property embedding#
- property unlabeled_identifier#
- property start_position: int#
- property end_position: int#
- property text#
- class flair.data.DataTriple(first, second, third)#
Bases: DataPoint, Generic[DT, DT2, DT3]
- to(device, pin_memory=False)#
- clear_embeddings(embedding_names=None)#
- property embedding#
- property unlabeled_identifier#
- property start_position: int#
- property end_position: int#
- property text#
- class flair.data.Image(data=None, imageURL=None)#
Bases: DataPoint
- property embedding#
- property start_position: int#
- property end_position: int#
- property text: str#
- property unlabeled_identifier: str#
- class flair.data.Corpus(train=None, dev=None, test=None, name='corpus', sample_missing_splits=True, random_seed=None)#
Bases: Generic[T_co]
The main object in Flair for holding a dataset used for training and testing.
A corpus consists of three splits: a train split used for training, a dev split used for model selection and/or early stopping, and a test split used for testing. All three splits are optional, so it is possible to create a corpus using only one or two splits. If the option sample_missing_splits is set to True, missing splits are randomly sampled from the training split.
- property train: Dataset[T_co] | None#
The training split as a torch.utils.data.Dataset object.
- property dev: Dataset[T_co] | None#
The dev split as a torch.utils.data.Dataset object.
- property test: Dataset[T_co] | None#
The test split as a torch.utils.data.Dataset object.
- downsample(percentage=0.1, downsample_train=True, downsample_dev=True, downsample_test=True, random_seed=None)#
Randomly downsample the corpus to the given percentage (by removing data points).
This method is an in-place operation, meaning that the Corpus object itself is modified by removing data points. It additionally returns a pointer to itself for use in method chaining.
- Parameters:
  - percentage (float) – A float value between 0.0 and 1.0 that indicates to which percentage the corpus should be downsampled. The default value is 0.1, meaning the corpus is downsampled to 10%.
  - downsample_train (bool) – Whether to include the training split in downsampling. Default is True.
  - downsample_dev (bool) – Whether to include the dev split in downsampling. Default is True.
  - downsample_test (bool) – Whether to include the test split in downsampling. Default is True.
  - random_seed (Optional[int]) – An optional random seed to make downsampling reproducible.
- Return type: Corpus
- Returns: A pointer to itself for optional use in method chaining.
- filter_empty_sentences()#
A method that filters out all sentences consisting of 0 tokens.
This is an in-place operation that directly modifies the Corpus object by removing these sentences.
- filter_long_sentences(max_charlength)#
A method that filters out all sentences whose plain text is longer than a specified number of characters.
This is an in-place operation that directly modifies the Corpus object by removing these sentences.
- Parameters: max_charlength (int) – The maximum permissible character length of a sentence.
- make_vocab_dictionary(max_tokens=-1, min_freq=1)#
Creates a Dictionary of all tokens contained in the corpus.
By defining max_tokens, you can set the maximum number of tokens that should be contained in the dictionary. If there are more than max_tokens tokens in the corpus, the most frequent tokens are added first. If min_freq is set to a value greater than 1, only tokens occurring at least min_freq times are considered for addition to the dictionary.
- Parameters:
  - max_tokens (int) – The maximum number of tokens that should be added to the dictionary (a value of -1 means there is no maximum).
  - min_freq (int) – A token needs to occur at least min_freq times to be added to the dictionary (a value of -1 means there is no limitation).
- Return type: Dictionary
- Returns: A Dictionary of all unique tokens in the corpus.
- obtain_statistics(label_type=None, pretty_print=True)#
Print statistics about the corpus, including the length of the sentences and the labels in the corpus.
- Parameters:
  - label_type (Optional[str]) – Optionally set this value to obtain statistics only for one specific type of label (such as "ner" or "pos"). If not set, statistics for all labels are returned.
  - pretty_print (bool) – If set to True, returns pretty json (indented for readability). If not, the json is returned as a single line. Default: True.
- Return type: Union[dict, str]
- Returns: If pretty_print is True, a pretty-printed string in json format; otherwise, a dictionary holding the json.
- make_label_dictionary(label_type, min_count=-1, add_unk=False, add_dev_test=False)#
Creates a dictionary of all labels assigned to the sentences in the corpus.
- Parameters:
  - label_type (str) – The name of the label type for which the dictionary should be created. Some corpora have multiple layers of annotation, such as "pos" and "ner". In this case, you should choose the label type you are interested in.
  - min_count (int) – Optionally set this to exclude rare labels from the dictionary (i.e., labels seen fewer than the provided integer value).
  - add_unk (bool) – Optionally set this to True to include an "UNK" value in the dictionary. In most cases, this is not needed since the label dictionary is well-defined, but some use cases might have open classes and require this.
  - add_dev_test (bool) – Optionally set this to True to construct the label dictionary not only from the train split, but also from dev and test. This is only necessary if some labels never appear in train but do appear in one of the other splits.
- Return type: Dictionary
- Returns: A Dictionary of all unique labels in the corpus.
- add_label_noise(label_type, labels, noise_share=0.2, split='train', noise_transition_matrix=None)#
Generates a uniform label noise distribution in the chosen dataset split.
- Parameters:
  - label_type (str) – the type of labels for which the noise should be simulated.
  - labels (list[str]) – an array with unique labels of said type (retrievable from the label dictionary).
  - noise_share (float) – the desired share of noise in the train split.
  - split (str) – the dataset split in which the noise is to be simulated.
  - noise_transition_matrix (Optional[dict[str, list[float]]]) – provides pre-defined probabilities for label flipping based on the initial label value (relevant for class-dependent label noise simulation).
- get_label_distribution()#
Counts the occurrences of each label in the corpus and returns them as a dictionary object.
This allows you to get an idea of how often each label appears in the Corpus.
- Returns: Dictionary with labels as keys and their occurrences as values.
- get_all_sentences()#
Returns all sentences (spanning all three splits) in the Corpus.
- Return type: ConcatDataset
- Returns: A torch.utils.data.Dataset object that includes all sentences of this corpus.
- make_tag_dictionary(tag_type)#
Create a tag dictionary of a given label type.
- Parameters: tag_type (str) – the label type for which to gather the tag labels.
- Return type: Dictionary
- Returns: A Dictionary containing the labeled tags, including "O", "<START>" and "<STOP>".
Deprecated since version 0.8: Use make_label_dictionary() instead.
- class flair.data.MultiCorpus(corpora, task_ids=None, name='multicorpus', **corpusargs)#
Bases: Corpus
- class flair.data.FlairDataset#
Bases: Dataset
- abstract is_in_memory()#
- Return type: bool
- class flair.data.ConcatFlairDataset(datasets, ids)#
Bases: Dataset
Dataset as a concatenation of multiple datasets.
This class is useful for assembling different existing datasets.
- Parameters: datasets (sequence) – List of datasets to be concatenated.
- static cumsum(sequence)#
- datasets: list[Dataset]#
- cumulative_sizes: list[int]#
- property cummulative_sizes: list[int]#
- flair.data.randomly_split_into_two_datasets(dataset, length_of_first, random_seed=None)#
Shuffles a dataset and splits it into two subsets.
The length of the first subset is specified; the remaining samples go into the second subset.
- Return type: tuple[Subset, Subset]
- flair.data.get_spans_from_bio(bioes_tags, bioes_scores=None)#
- Return type: list[tuple[list[int], float, str]]