flair.data.Sentence#
- class flair.data.Sentence(text, use_tokenizer=True, language_code=None, start_position=0)View on GitHub#
Bases: DataPoint
A central data structure representing a sentence or text passage as Tokens.
Holds text, tokens, labels (sentence/token/span/relation levels), embeddings, and document context information.
- text#
Original, untokenized text.
- Type:
str
- language_code#
ISO 639-1 language code.
- Type:
Optional[str]
- start_position#
Character offset in a larger document.
- Type:
int
- __init__(text, use_tokenizer=True, language_code=None, start_position=0)View on GitHub#
Initializes a Sentence.
- Parameters:
  - text (Union[str, list[str], list[Token]]) – Either pass the text as a string, or provide already tokenized text as either a list of strings or a list of Token objects.
  - use_tokenizer (Union[bool, Tokenizer]) – Optionally specify a custom tokenizer to split the text into tokens. By default, flair.tokenization.SegtokTokenizer is used. If use_tokenizer is set to False, flair.tokenization.SpaceTokenizer is used instead. The tokenizer is ignored if text refers to pretokenized tokens.
  - language_code (Optional[str]) – Language of the sentence. If not provided, langdetect is called when language_code is accessed for the first time.
  - start_position (int) – Start character offset of the sentence in the superordinate document.
Methods
- __init__(text[, use_tokenizer, ...]) – Initializes a Sentence.
- add_label(typename, value[, score]) – Adds a new label to a specific annotation layer.
- add_metadata(key, value) – Adds a key-value pair to the data point's metadata.
- clear_embeddings([embedding_names]) – Removes stored embeddings to free memory.
- copy_context_from_sentence(sentence)
- get_each_embedding([embedding_names]) – Retrieves a list of individual embedding tensors.
- get_embedding([names]) – Retrieves embeddings, concatenating if multiple names are given or if names is None.
- get_label([label_type, zero_tag_value]) – Retrieves the primary label for a given type, or a default 'O' label.
- get_labels([label_type]) – Retrieves all labels for a specific annotation layer.
- get_metadata(key) – Retrieves metadata associated with the given key.
- get_relations([label_type]) – Retrieves all Relation objects associated with this sentence.
- get_span(start, stop)
- get_spans([label_type])
- get_token(token_id)
- has_label(typename) – Checks if the data point has at least one label for the given annotation type.
- has_metadata(key) – Checks if the data point has metadata for the given key.
- infer_space_after() – Heuristics in case you wish to infer whitespace_after values for tokenized text.
- is_context_set() – Determines if this sentence has a context of sentences before or after set.
- left_context(context_length[, ...])
- next_sentence() – Get the next sentence in the document.
- previous_sentence() – Get the previous sentence in the document.
- remove_labels(typename) – Removes all labels associated with a specific annotation layer.
- retokenize(tokenizer) – Retokenizes the sentence using the provided tokenizer while preserving span labels.
- right_context(context_length[, ...])
- set_context_for_sentences(sentences)
- set_embedding(name, vector) – Stores an embedding tensor under a given name.
- set_label(typename, value[, score]) – Sets the label(s) for an annotation layer, overwriting any existing ones.
- to(device[, pin_memory]) – Moves all stored embedding tensors to the specified device.
- to_dict([tag_type])
- to_original_text() – Returns the original text of this sentence.
- to_plain_string()
- to_tagged_string([main_label])
- to_tokenized_string()
- truncate(max_tokens) – Truncates the sentence to max_tokens, cleaning up associated annotations.
Attributes
- embedding – Provides the primary embedding representation of the data point.
- end_position – The ending character offset (exclusive) within the original text.
- labels – Returns a list of all labels from all annotation layers.
- score – Shortcut property for the score of the first label added.
- start_position – The starting character offset within the original text.
- tag – Shortcut property for the value of the first label added.
- text – Returns the original text of this sentence.
- tokens – The list of Token objects (triggers tokenization if needed).
- unlabeled_identifier – A string identifier for the data point itself, without label info.
- property unlabeled_identifier#
A string identifier for the data point itself, without label info.
- property text: str#
Returns the original text of this sentence. Does not trigger tokenization.
- to_original_text()View on GitHub#
Returns the original text of this sentence.
- Return type:
str
- to_tagged_string(main_label=None)View on GitHub#
- Return type:
str
- get_relations(label_type=None)View on GitHub#
Retrieves all Relation objects associated with this sentence.
- Return type:
list[Relation]
- get_spans(label_type=None)View on GitHub#
- Return type:
list[Span]
- get_token(token_id)View on GitHub#
- Return type:
Optional[Token]
- property embedding#
Provides the primary embedding representation of the data point.
- to(device, pin_memory=False)View on GitHub#
Moves all stored embedding tensors to the specified device.
- Parameters:
device (Union[str, torch.device]) – Target device (e.g., ‘cpu’, ‘cuda:0’).
pin_memory (bool, optional) – If True and moving to CUDA, attempts to pin memory. Defaults to False.
- clear_embeddings(embedding_names=None)View on GitHub#
Removes stored embeddings to free memory.
- Parameters:
embedding_names (Optional[list[str]], optional) – Specific names to remove. If None, removes all embeddings. Defaults to None.
- left_context(context_length, respect_document_boundaries=True)View on GitHub#
- Return type:
list[Token]
- right_context(context_length, respect_document_boundaries=True)View on GitHub#
- Return type:
list[Token]
- to_tokenized_string()View on GitHub#
- Return type:
str
- to_plain_string()View on GitHub#
- Return type:
str
- infer_space_after()View on GitHub#
Heuristics in case you wish to infer whitespace_after values for tokenized text.
This is useful for some older NLP tasks (such as CoNLL-03 and CoNLL-2000) that provide only tokenized data with no information about the original whitespacing.
- to_dict(tag_type=None)View on GitHub#
- Return type:
dict[str, Any]
- get_span(start, stop)View on GitHub#
- Return type:
Span
- property start_position: int#
The starting character offset within the original text.
- property end_position: int#
The ending character offset (exclusive) within the original text.
- get_language_code()View on GitHub#
- Return type:
str
- next_sentence()View on GitHub#
Get the next sentence in the document.
This only works if context is set through a dataloader or elsewhere. Returns the next Sentence in the document if set, otherwise None.
- previous_sentence()View on GitHub#
Get the previous sentence in the document.
This only works if context is set through a dataloader or elsewhere. Returns the previous Sentence in the document if set, otherwise None.
- is_context_set()View on GitHub#
Determines if this sentence has a context of sentences before or after set.
Returns True if context is set (for instance through a dataloader or elsewhere), otherwise False.
- Return type:
bool
- copy_context_from_sentence(sentence)View on GitHub#
- Return type:
None
- classmethod set_context_for_sentences(sentences)View on GitHub#
- Return type:
None
- get_labels(label_type=None)View on GitHub#
Retrieves all labels for a specific annotation layer.
- Parameters:
label_type (Optional[str], optional) – The layer name. If None, returns all labels from all layers. Defaults to None.
- Returns:
List of Label objects, or empty list if none found.
- Return type:
list[Label]
- remove_labels(typename)View on GitHub#
Removes all labels associated with a specific annotation layer.
- Parameters:
typename (str) – The name of the annotation layer to clear.
- truncate(max_tokens)View on GitHub#
Truncates the sentence to max_tokens, cleaning up associated annotations.
- Return type:
None
- retokenize(tokenizer)View on GitHub#
Retokenizes the sentence using the provided tokenizer while preserving span labels.
- Parameters:
tokenizer – The tokenizer to use for retokenization
Example:
    # Create a sentence with default tokenization
    sentence = Sentence("01-03-2025 New York")

    # Add span labels
    sentence.get_span(1, 3).add_label('ner', "LOC")
    sentence.get_span(0, 1).add_label('ner', "DATE")

    # Retokenize with a different tokenizer while preserving labels
    sentence.retokenize(StaccatoTokenizer())