flair.data.Sentence#
- class flair.data.Sentence(text, use_tokenizer=True, language_code=None, start_position=0)View on GitHub#
Bases: DataPoint
A central data structure representing a sentence or text passage as Tokens.
Holds text, tokens, labels (sentence/token/span/relation levels), embeddings, and document context information.
- text#
Original, untokenized text.
- Type:
str
- language_code#
ISO 639-1 language code.
- Type:
Optional[str]
- start_position#
Character offset in a larger document.
- Type:
int
- __init__(text, use_tokenizer=True, language_code=None, start_position=0)View on GitHub#
Initializes a Sentence.
- Parameters:
  - text (Union[str, list[str], list[Token]]) – Either pass the text as a string, or provide an already tokenized text as either a list of strings or a list of Token objects.
  - use_tokenizer (Union[bool, Tokenizer]) – You can optionally specify a custom tokenizer to split the text into tokens. By default we use flair.tokenization.SegtokTokenizer. If use_tokenizer is set to False, flair.tokenization.SpaceTokenizer will be used instead. The tokenizer is ignored if text is already pretokenized.
  - language_code (Optional[str]) – Language of the sentence. If not provided, langdetect will be called when language_code is accessed for the first time.
  - start_position (int) – Start character offset of the sentence in the superordinate document.
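The use_tokenizer dispatch described above (callable, True, or False) can be sketched in plain Python. This is a minimal, stdlib-only illustration; the tokenizer functions are crude stand-ins for flair.tokenization.SegtokTokenizer and SpaceTokenizer, not their actual implementations.

```python
from typing import Callable, Union

def _default_tokenize(text: str) -> list[str]:
    # crude punctuation splitting, stand-in for SegtokTokenizer
    for p in ".,!?":
        text = text.replace(p, f" {p}")
    return text.split()

def _space_tokenize(text: str) -> list[str]:
    # stand-in for SpaceTokenizer
    return text.split(" ")

def resolve_tokenizer(use_tokenizer: Union[bool, Callable[[str], list[str]]]):
    """Mirror the documented dispatch: a callable is used as-is,
    True selects the default tokenizer, False the space tokenizer."""
    if callable(use_tokenizer):
        return use_tokenizer
    return _default_tokenize if use_tokenizer else _space_tokenize

tokens = resolve_tokenizer(True)("The grass is green.")
# punctuation is split into its own token: ['The', 'grass', 'is', 'green', '.']
```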
Methods
- __init__(text[, use_tokenizer, ...]) – Initializes a Sentence.
- add_label(typename, value[, score]) – Adds a new label to a specific annotation layer.
- add_metadata(key, value) – Adds a key-value pair to the data point's metadata.
- clear_embeddings([embedding_names]) – Removes stored embeddings to free memory.
- copy_context_from_sentence(sentence)
- from_dict(sentence_dict) – Creates a Sentence from a dictionary.
- get_each_embedding([embedding_names]) – Retrieves a list of individual embedding tensors.
- get_embedding([names]) – Retrieves embeddings, concatenating if multiple names are given or if names is None.
- get_label([label_type, zero_tag_value]) – Retrieves the primary label for a given type, or a default 'O' label.
- get_labels([label_type]) – Retrieves all labels for a specific annotation layer.
- get_metadata(key) – Retrieves metadata associated with the given key.
- get_relations([label_type]) – Retrieves all Relation objects associated with this sentence.
- get_span(start, stop)
- get_spans([label_type])
- get_token(token_id)
- has_label(typename) – Checks if the data point has at least one label for the given annotation type.
- has_metadata(key) – Checks if the data point has metadata for the given key.
- infer_space_after() – Heuristics in case you wish to infer whitespace_after values for tokenized text.
- is_context_set() – Determines if this sentence has a context of sentences before or after set.
- left_context(context_length[, ...])
- next_sentence() – Get the next sentence in the document.
- previous_sentence() – Get the previous sentence in the document.
- remove_labels(typename) – Removes all labels associated with a specific annotation layer.
- retokenize(new_tokenizer) – Eagerly retokenizes the sentence using the provided tokenizer.
- right_context(context_length[, ...])
- set_context_for_sentences(sentences)
- set_embedding(name, vector) – Stores an embedding tensor under a given name.
- set_label(typename, value[, score]) – Sets the label(s) for an annotation layer, overwriting any existing ones.
- to(device[, pin_memory]) – Moves all stored embedding tensors to the specified device.
- to_dict() – Creates a dictionary representation of the Sentence.
- to_original_text() – Returns the original text of this sentence.
- to_plain_string()
- to_tagged_string([main_label])
- to_tokenized_string()
- truncate(max_tokens) – Truncates the sentence to max_tokens, cleaning up associated annotations.
Attributes
- embedding – Provides the primary embedding representation of the data point.
- end_position – The ending character offset (exclusive) within the original text.
- labels – Returns a list of all labels from all annotation layers.
- score – Shortcut property for the score of the first label added.
- start_position – The starting character offset within the original text.
- tag – Shortcut property for the value of the first label added.
- text – Returns the original text of this sentence.
- tokenizer – Gets the tokenizer currently intended for this sentence.
- tokens – The list of Token objects.
- unlabeled_identifier – A string identifier for the data point itself, without label info.
- property tokens: list[Token]#
The list of Token objects. Triggers tokenization if needed (lazy evaluation).
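The lazy evaluation mentioned above can be illustrated with a minimal, stdlib-only sketch (class and attribute names here are illustrative, not flair's internals): the text is stored at construction and tokenization runs only on first access to tokens.

```python
class LazySentence:
    """Illustrative sketch of a lazily tokenized sentence."""

    def __init__(self, text: str):
        self._text = text
        self._tokens: list[str] | None = None  # not tokenized yet

    @property
    def tokens(self) -> list[str]:
        if self._tokens is None:       # first access triggers tokenization
            self._tokens = self._text.split()
        return self._tokens

    @property
    def text(self) -> str:
        return self._text              # never triggers tokenization
```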
- property unlabeled_identifier#
A string identifier for the data point itself, without label info.
- property text: str#
Returns the original text of this sentence. Does not trigger tokenization.
- to_original_text()View on GitHub#
Returns the original text of this sentence.
- Return type:
str
- to_tagged_string(main_label=None)View on GitHub#
- Return type:
str
- get_relations(label_type=None)View on GitHub#
Retrieves all Relation objects associated with this sentence.
- Return type:
list[Relation]
- get_spans(label_type=None)View on GitHub#
- Return type:
list[Span]
- get_token(token_id)View on GitHub#
- Return type:
Optional[Token]
- property embedding#
Provides the primary embedding representation of the data point.
- to(device, pin_memory=False)View on GitHub#
Moves all stored embedding tensors to the specified device.
- Parameters:
device (Union[str, torch.device]) – Target device (e.g., ‘cpu’, ‘cuda:0’).
pin_memory (bool, optional) – If True and moving to CUDA, attempts to pin memory. Defaults to False.
- clear_embeddings(embedding_names=None)View on GitHub#
Removes stored embeddings to free memory.
- Parameters:
embedding_names (Optional[list[str]], optional) – Specific names to remove. If None, removes all embeddings. Defaults to None.
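The set_embedding/get_embedding/clear_embeddings semantics can be sketched with a small, stdlib-only store (plain float lists stand in for torch tensors; this is an illustration, not flair's implementation).

```python
class EmbeddingStore:
    """Sketch of named embedding storage with selective clearing."""

    def __init__(self):
        self._embeddings: dict[str, list[float]] = {}

    def set_embedding(self, name: str, vector: list[float]) -> None:
        self._embeddings[name] = vector

    def get_embedding(self, names=None) -> list[float]:
        # concatenate all stored vectors, or only the named ones
        keys = sorted(self._embeddings) if names is None else names
        out: list[float] = []
        for key in keys:
            out.extend(self._embeddings[key])
        return out

    def clear_embeddings(self, embedding_names=None) -> None:
        if embedding_names is None:
            self._embeddings.clear()          # remove everything
        else:
            for name in embedding_names:      # remove only the named ones
                self._embeddings.pop(name, None)
```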
- left_context(context_length, respect_document_boundaries=True)View on GitHub#
- Return type:
list[Token]
- right_context(context_length, respect_document_boundaries=True)View on GitHub#
- Return type:
list[Token]
- to_tokenized_string()View on GitHub#
- Return type:
str
- to_plain_string()View on GitHub#
- Return type:
str
- infer_space_after()View on GitHub#
Heuristics in case you wish to infer whitespace_after values for tokenized text.
This is useful for some older NLP tasks (such as CoNLL-03 and CoNLL-2000) that provide only tokenized data with no information about the original whitespacing.
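One such heuristic can be sketched in a few lines: attach common punctuation to the preceding token and opening brackets to the following one. This is an illustrative, stdlib-only approximation, not flair's actual rule set.

```python
# Tokens that typically attach to the preceding / following token.
NO_SPACE_BEFORE = {".", ",", "!", "?", ":", ";", ")"}
NO_SPACE_AFTER = {"("}

def infer_whitespace_after(tokens: list[str]) -> list[bool]:
    """For each token, guess whether a space followed it in the original text."""
    flags = []
    for i, tok in enumerate(tokens):
        last = i + 1 == len(tokens)
        if last or tokens[i + 1] in NO_SPACE_BEFORE or tok in NO_SPACE_AFTER:
            flags.append(False)
        else:
            flags.append(True)
    return flags
```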
- to_dict()View on GitHub#
Creates a dictionary representation of the Sentence. This dictionary can be used to recreate the sentence with from_dict().
- Return type:
dict[str, Any]
- Returns:
A dictionary containing the sentence's data and annotations.
- classmethod from_dict(sentence_dict)View on GitHub#
Creates a Sentence from a dictionary.
- Parameters:
sentence_dict (dict[str, Any]) – A dictionary in the format produced by to_dict().
- Return type:
Sentence
- Returns:
The reconstructed Sentence object.
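The to_dict/from_dict round trip can be illustrated with a toy class. The schema and names below are made up for the example and are not flair's actual serialization format.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class MiniSentence:
    """Toy stand-in demonstrating a lossless to_dict/from_dict round trip."""
    text: str
    labels: dict = field(default_factory=dict)

    def to_dict(self) -> dict[str, Any]:
        return {"text": self.text, "labels": dict(self.labels)}

    @classmethod
    def from_dict(cls, sentence_dict: dict[str, Any]) -> "MiniSentence":
        return cls(text=sentence_dict["text"],
                   labels=dict(sentence_dict.get("labels", {})))
```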
- get_span(start, stop)View on GitHub#
- Return type:
Span
- property start_position: int#
The starting character offset within the original text.
- property end_position: int#
The ending character offset (exclusive) within the original text.
- get_language_code()View on GitHub#
- Return type:
str
- next_sentence()View on GitHub#
Get the next sentence in the document.
This only works if context is set, e.g. through a dataloader.
- Returns:
The next Sentence in the document if set, otherwise None.
- previous_sentence()View on GitHub#
Get the previous sentence in the document.
This only works if context is set, e.g. through a dataloader.
- Returns:
The previous Sentence in the document if set, otherwise None.
- is_context_set()View on GitHub#
Determines if this sentence has a context of sentences before or after set.
- Return type:
bool
- Returns:
True if context is set (for instance by a dataloader), else False.
- copy_context_from_sentence(sentence)View on GitHub#
- Return type:
None
- classmethod set_context_for_sentences(sentences)View on GitHub#
- Return type:
None
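How the context methods fit together (set_context_for_sentences, next_sentence, previous_sentence, is_context_set) can be sketched with a minimal linked structure. Names and attributes here are illustrative, not flair's internals.

```python
class CtxSentence:
    """Sketch of document context: neighbouring sentences are linked."""

    def __init__(self, text: str):
        self.text = text
        self._previous = None
        self._next = None

    @classmethod
    def set_context_for_sentences(cls, sentences) -> None:
        # link each sentence to its neighbour in document order
        for left, right in zip(sentences, sentences[1:]):
            left._next = right
            right._previous = left

    def next_sentence(self):
        return self._next            # None if no context set

    def previous_sentence(self):
        return self._previous        # None if no context set

    def is_context_set(self) -> bool:
        return self._previous is not None or self._next is not None
```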
- get_labels(label_type=None)View on GitHub#
Retrieves all labels for a specific annotation layer.
- Parameters:
label_type (Optional[str], optional) – The layer name. If None, returns all labels from all layers. Defaults to None.
- Returns:
List of Label objects, or empty list if none found.
- Return type:
list[Label]
- remove_labels(typename)View on GitHub#
Removes all labels associated with a specific annotation layer.
- Parameters:
typename (str) – The name of the annotation layer to clear.
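The layer-based behaviour of add_label/get_labels/remove_labels can be sketched with a dict of per-typename lists. This is a stdlib-only illustration of the documented semantics, not flair's Label machinery.

```python
from collections import defaultdict

class LabeledPoint:
    """Sketch of annotation layers keyed by typename."""

    def __init__(self):
        self.annotation_layers = defaultdict(list)

    def add_label(self, typename: str, value: str, score: float = 1.0):
        self.annotation_layers[typename].append((value, score))

    def get_labels(self, label_type=None):
        if label_type is None:   # all labels from all layers
            return [lab for layer in self.annotation_layers.values()
                    for lab in layer]
        return list(self.annotation_layers.get(label_type, []))

    def remove_labels(self, typename: str):
        # clears one layer; other layers are untouched
        self.annotation_layers.pop(typename, None)
```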
- truncate(max_tokens)View on GitHub#
Truncates the sentence to max_tokens, cleaning up associated annotations.
- Return type:
None
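The annotation cleanup that truncation implies can be sketched as follows: keep the first max_tokens tokens and drop span annotations that no longer fit. Spans are modelled as (start, end) index pairs with end exclusive; this is an illustration, not flair's implementation.

```python
def truncate_annotated(tokens: list[str],
                       spans: list[tuple[int, int]],
                       max_tokens: int):
    """Truncate tokens and discard spans that extend past the cut."""
    kept = tokens[:max_tokens]
    surviving = [(start, end) for (start, end) in spans if end <= max_tokens]
    return kept, surviving
```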
- retokenize(new_tokenizer)View on GitHub#
Eagerly retokenizes the sentence using the provided tokenizer. This attempts to preserve span, relation, and sentence labels. Token-level labels are generally discarded as their basis (the tokens themselves) changes.
- Parameters:
new_tokenizer (Tokenizer) – The tokenizer to use for retokenization.
- Return type:
None