flair.data.Sentence#
- class flair.data.Sentence(text, use_tokenizer=True, language_code=None, start_position=0)View on GitHub#
Bases: DataPoint
A central data structure representing a sentence or text passage as Tokens.
Holds text, tokens, labels (sentence/token/span/relation levels), embeddings, and document context information.
- text#
Original, untokenized text.
- Type:
str
- language_code#
ISO 639-1 language code.
- Type:
Optional[str]
- start_position#
Character offset in a larger document.
- Type:
int
- __init__(text, use_tokenizer=True, language_code=None, start_position=0)View on GitHub#
Initializes a Sentence.
- Parameters:
  - text (Union[str, list[str], list[Token]]) – Either pass the text as a string, or provide already tokenized text as either a list of strings or a list of Token objects.
  - use_tokenizer (Union[bool, Tokenizer]) – Optionally specify a custom tokenizer to split the text into tokens. By default, flair.tokenization.SegtokTokenizer is used. If use_tokenizer is set to False, flair.tokenization.SpaceTokenizer is used instead. The tokenizer is ignored if text refers to pretokenized tokens.
  - language_code (Optional[str]) – Language of the sentence. If not provided, langdetect is called when language_code is accessed for the first time.
  - start_position (int) – Start character offset of the sentence in the superordinate document.
Methods
- __init__(text[, use_tokenizer, ...]) – Initializes a Sentence.
- add_label(typename, value[, score]) – Adds a new label to a specific annotation layer.
- add_metadata(key, value) – Adds a key-value pair to the data point's metadata.
- clear_embeddings([embedding_names]) – Removes stored embeddings to free memory.
- copy_context_from_sentence(sentence)
- get_each_embedding([embedding_names]) – Retrieves a list of individual embedding tensors.
- get_embedding([names]) – Retrieves embeddings, concatenating if multiple names are given or if names is None.
- get_label([label_type, zero_tag_value]) – Retrieves the primary label for a given type, or a default 'O' label.
- get_labels([label_type]) – Retrieves all labels for a specific annotation layer.
- get_metadata(key) – Retrieves metadata associated with the given key.
- get_relations([label_type]) – Retrieves all Relation objects associated with this sentence.
- get_span(start, stop)
- get_spans([label_type])
- get_token(token_id)
- has_label(typename) – Checks if the data point has at least one label for the given annotation type.
- has_metadata(key) – Checks if the data point has metadata for the given key.
- infer_space_after() – Heuristics in case you wish to infer whitespace_after values for tokenized text.
- is_context_set() – Determines if this sentence has a context of sentences before or after set.
- left_context(context_length[, ...])
- next_sentence() – Get the next sentence in the document.
- previous_sentence() – Get the previous sentence in the document.
- remove_labels(typename) – Removes all labels associated with a specific annotation layer.
- retokenize(tokenizer) – Retokenizes the sentence using the provided tokenizer while preserving span labels.
- right_context(context_length[, ...])
- set_context_for_sentences(sentences)
- set_embedding(name, vector) – Stores an embedding tensor under a given name.
- set_label(typename, value[, score]) – Sets the label(s) for an annotation layer, overwriting any existing ones.
- to(device[, pin_memory]) – Moves all stored embedding tensors to the specified device.
- to_dict([tag_type])
- to_original_text() – Returns the original text of this sentence.
- to_plain_string()
- to_tagged_string([main_label])
- to_tokenized_string()
- truncate(max_tokens) – Truncates the sentence to max_tokens, cleaning up associated annotations.
Attributes
- embedding – Provides the primary embedding representation of the data point.
- end_position – The ending character offset (exclusive) within the original text.
- labels – Returns a list of all labels from all annotation layers.
- score – Shortcut property for the score of the first label added.
- start_position – The starting character offset within the original text.
- tag – Shortcut property for the value of the first label added.
- text – Returns the original text of this sentence.
- tokens – The list of Token objects (triggers tokenization if needed).
- unlabeled_identifier – A string identifier for the data point itself, without label info.
- property unlabeled_identifier#
A string identifier for the data point itself, without label info.
- property text: str#
Returns the original text of this sentence. Does not trigger tokenization.
- to_original_text()View on GitHub#
Returns the original text of this sentence.
- Return type:
str
- to_tagged_string(main_label=None)View on GitHub#
- Return type:
str
- get_relations(label_type=None)View on GitHub#
Retrieves all Relation objects associated with this sentence.
- Return type:
list[Relation]
- get_spans(label_type=None)View on GitHub#
- Return type:
list[Span]
- get_token(token_id)View on GitHub#
- Return type:
Optional[Token]
- property embedding#
Provides the primary embedding representation of the data point.
- to(device, pin_memory=False)View on GitHub#
Moves all stored embedding tensors to the specified device.
- Parameters:
device (Union[str, torch.device]) – Target device (e.g., ‘cpu’, ‘cuda:0’).
pin_memory (bool, optional) – If True and moving to CUDA, attempts to pin memory. Defaults to False.
- clear_embeddings(embedding_names=None)View on GitHub#
Removes stored embeddings to free memory.
- Parameters:
embedding_names (Optional[list[str]], optional) – Specific names to remove. If None, removes all embeddings. Defaults to None.
- left_context(context_length, respect_document_boundaries=True)View on GitHub#
- Return type:
list[Token]
- right_context(context_length, respect_document_boundaries=True)View on GitHub#
- Return type:
list[Token]
- to_tokenized_string()View on GitHub#
- Return type:
str
- to_plain_string()View on GitHub#
- Return type:
str
- infer_space_after()View on GitHub#
Heuristics in case you wish to infer whitespace_after values for tokenized text.
This is useful for some older NLP tasks (such as CoNLL-03 and CoNLL-2000) that provide only tokenized data with no information about the original whitespacing.
- to_dict(tag_type=None)View on GitHub#
- Return type:
dict[str, Any]
- get_span(start, stop)View on GitHub#
- Return type:
Span
- property start_position: int#
The starting character offset within the original text.
- property end_position: int#
The ending character offset (exclusive) within the original text.
- get_language_code()View on GitHub#
- Return type:
str
- next_sentence()View on GitHub#
Get the next sentence in the document.
This only works if context is set through a dataloader or elsewhere. Returns the next Sentence in the document if set, otherwise None.
- previous_sentence()View on GitHub#
Get the previous sentence in the document.
This only works if context is set through a dataloader or elsewhere. Returns the previous Sentence in the document if set, otherwise None.
- is_context_set()View on GitHub#
Determines if this sentence has a context of sentences before or after set.
Returns True if context is set (for instance through a dataloader or elsewhere), otherwise False.
- Return type:
bool
- copy_context_from_sentence(sentence)View on GitHub#
- Return type:
None
- classmethod set_context_for_sentences(sentences)View on GitHub#
- Return type:
None
- get_labels(label_type=None)View on GitHub#
Retrieves all labels for a specific annotation layer.
- Parameters:
label_type (Optional[str], optional) – The layer name. If None, returns all labels from all layers. Defaults to None.
- Returns:
List of Label objects, or empty list if none found.
- Return type:
list[Label]
- remove_labels(typename)View on GitHub#
Removes all labels associated with a specific annotation layer.
- Parameters:
typename (str) – The name of the annotation layer to clear.
- truncate(max_tokens)View on GitHub#
Truncates the sentence to max_tokens, cleaning up associated annotations.
- Return type:
None
- retokenize(tokenizer)View on GitHub#
Retokenizes the sentence using the provided tokenizer while preserving span labels.
- Parameters:
tokenizer – The tokenizer to use for retokenization
Example:
    # Create a sentence with default tokenization
    sentence = Sentence("01-03-2025 New York")

    # Add span labels
    sentence.get_span(1, 3).add_label('ner', "LOC")
    sentence.get_span(0, 1).add_label('ner', "DATE")

    # Retokenize with a different tokenizer while preserving labels
    sentence.retokenize(StaccatoTokenizer())