flair.data.Sentence#

class flair.data.Sentence(text, use_tokenizer=True, language_code=None, start_position=0)#

Bases: DataPoint

A central data structure representing a sentence or text passage as Tokens.

Holds text, tokens, labels (sentence/token/span/relation levels), embeddings, and document context information.

tokens#

List of tokens (lazy tokenization if initialized with str).

Type:

list[Token]

text#

Original, untokenized text.

Type:

str

language_code#

ISO 639-1 language code.

Type:

Optional[str]

start_position#

Character offset in a larger document.

Type:

int

__init__(text, use_tokenizer=True, language_code=None, start_position=0)#

Initializes a Sentence.

Parameters:
  • text (Union[str, list[str], list[Token]]) – Either pass the text as a string, or provide an already tokenized text as either a list of strings or a list of Token objects.

  • use_tokenizer (Union[bool, Tokenizer]) – You can optionally specify a custom tokenizer to split the text into tokens. By default, flair.tokenization.SegtokTokenizer is used. If use_tokenizer is set to False, flair.tokenization.SpaceTokenizer is used instead. The tokenizer is ignored if text is already pretokenized.

  • language_code (Optional[str]) – Language of the sentence. If not provided, langdetect will be called when the language_code is accessed for the first time.

  • start_position (int) – Start char offset of the sentence in the superordinate document.
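
Example (a minimal sketch of the construction options described above):

from flair.data import Sentence
from flair.tokenization import SpaceTokenizer

# Default: the string is split lazily with SegtokTokenizer
sentence = Sentence("The grass is green.")

# use_tokenizer=False falls back to simple whitespace splitting
simple = Sentence("The grass is green.", use_tokenizer=False)

# A Tokenizer instance can be passed explicitly
spaced = Sentence("The grass is green.", use_tokenizer=SpaceTokenizer())

# Pretokenized input: the tokenizer argument is ignored
pretok = Sentence(["The", "grass", "is", "green", "."])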

Methods

__init__(text[, use_tokenizer, ...])

Initializes a Sentence.

add_label(typename, value[, score])

Adds a new label to a specific annotation layer.

add_metadata(key, value)

Adds a key-value pair to the data point's metadata.

clear_embeddings([embedding_names])

Removes stored embeddings to free memory.

copy_context_from_sentence(sentence)

Copies document context (previous/next sentence) from another sentence.

get_each_embedding([embedding_names])

Retrieves a list of individual embedding tensors.

get_embedding([names])

Retrieves embeddings, concatenating if multiple names are given or if names is None.

get_label([label_type, zero_tag_value])

Retrieves the primary label for a given type, or a default 'O' label.

get_labels([label_type])

Retrieves all labels for a specific annotation layer.

get_language_code()

Returns the language code, running language detection if none was set.

get_metadata(key)

Retrieves metadata associated with the given key.

get_relations([label_type])

Retrieves all Relation objects associated with this sentence.

get_span(start, stop)

Retrieves a Span over the token range [start, stop).

get_spans([label_type])

Retrieves all Span objects for a specific annotation layer.

get_token(token_id)

Retrieves the Token with the given id, or None if not found.

has_label(typename)

Checks if the data point has at least one label for the given annotation type.

has_metadata(key)

Checks if the data point has metadata for the given key.

infer_space_after()

Heuristically infers whitespace_after values for pretokenized text.

is_context_set()

Determines if this sentence has a context of sentences before or after set.

left_context(context_length[, ...])

Returns up to context_length tokens of left document context.

next_sentence()

Get the next sentence in the document.

previous_sentence()

Get the previous sentence in the document.

remove_labels(typename)

Removes all labels associated with a specific annotation layer.

retokenize(tokenizer)

Retokenizes the sentence using the provided tokenizer while preserving span labels.

right_context(context_length[, ...])

Returns up to context_length tokens of right document context.

set_context_for_sentences(sentences)

Chains the given sentences so that each has its neighbors as context.

set_embedding(name, vector)

Stores an embedding tensor under a given name.

set_label(typename, value[, score])

Sets the label(s) for an annotation layer, overwriting any existing ones.

to(device[, pin_memory])

Moves all stored embedding tensors to the specified device.

to_dict([tag_type])

Returns a dictionary representation of the sentence and its labels.

to_original_text()

Returns the original text of this sentence.

to_plain_string()

Returns the sentence text, reconstructed from tokens and their whitespace information.

to_tagged_string([main_label])

Returns a string representation of the sentence together with its labels.

to_tokenized_string()

Returns the sentence text with a single space between tokens.

truncate(max_tokens)

Truncates the sentence to max_tokens, cleaning up associated annotations.

Attributes

embedding

Provides the primary embedding representation of the data point.

end_position

The ending character offset (exclusive) within the original text.

labels

Returns a list of all labels from all annotation layers.

score

Shortcut property for the score of the first label added.

start_position

The starting character offset within the original text.

tag

Shortcut property for the value of the first label added.

text

Returns the original text of this sentence.

tokens

The list of Token objects (triggers tokenization if needed).

unlabeled_identifier

A string identifier for the data point itself, without label info.

property tokens: list[Token]#

The list of Token objects (triggers tokenization if needed).

property unlabeled_identifier#

A string identifier for the data point itself, without label info.

property text: str#

Returns the original text of this sentence. Does not trigger tokenization.
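
Example (a minimal sketch of the lazy tokenization behavior):

from flair.data import Sentence

sentence = Sentence("The grass is green.")
print(sentence.text)         # reads the raw string; no tokenization happens
print(len(sentence.tokens))  # accessing .tokens tokenizes on demand (5 with the default tokenizer)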

to_original_text()#

Returns the original text of this sentence.

Return type:

str

to_tagged_string(main_label=None)#

Returns a string representation of the sentence together with its labels.

Return type:

str

get_relations(label_type=None)#

Retrieves all Relation objects associated with this sentence.

Return type:

list[Relation]

get_spans(label_type=None)#

Retrieves all Span objects for a specific annotation layer.

Return type:

list[Span]
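
Example (a minimal sketch of span annotation and retrieval; the exact to_tagged_string formatting may vary between versions):

from flair.data import Sentence

sentence = Sentence("George Washington went to Washington.")

# Label the first two tokens as a PER span (the stop index is exclusive)
sentence.get_span(0, 2).add_label("ner", "PER")

for span in sentence.get_spans("ner"):
    print(span.text, span.get_label("ner").value)  # George Washington PER

print(sentence.to_tagged_string("ner"))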

get_token(token_id)#

Retrieves the Token with the given id, or None if not found.

Return type:

Optional[Token]

property embedding#

Provides the primary embedding representation of the data point.

to(device, pin_memory=False)#

Moves all stored embedding tensors to the specified device.

Parameters:
  • device (Union[str, torch.device]) – Target device (e.g., ‘cpu’, ‘cuda:0’).

  • pin_memory (bool, optional) – If True and moving to CUDA, attempts to pin memory. Defaults to False.

clear_embeddings(embedding_names=None)#

Removes stored embeddings to free memory.

Parameters:

embedding_names (Optional[list[str]], optional) – Specific names to remove. If None, removes all embeddings. Defaults to None.
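
Example (a minimal sketch of the embedding lifecycle; the tensor here is random rather than produced by an embedding model):

import torch
from flair.data import Sentence

sentence = Sentence("The grass is green.")

# Store an arbitrary tensor as a named embedding
sentence.set_embedding("my-embedding", torch.rand(128))

# Move all stored embedding tensors to a device
sentence.to("cpu")

# Read the embedding back, then free the memory
vector = sentence.get_embedding()
sentence.clear_embeddings()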

left_context(context_length, respect_document_boundaries=True)#

Returns up to context_length tokens of left document context.

Return type:

list[Token]

right_context(context_length, respect_document_boundaries=True)#

Returns up to context_length tokens of right document context.

Return type:

list[Token]

to_tokenized_string()#

Returns the sentence text with a single space between tokens.

Return type:

str

to_plain_string()#

Returns the sentence text, reconstructed from tokens and their whitespace information.

Return type:

str

infer_space_after()#

Heuristically infers whitespace_after values for pretokenized text.

This is useful for older NLP tasks (such as CoNLL-03 and CoNLL-2000) that provide only tokenized data with no information about the original whitespace.
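
Example (a minimal sketch; the output comment assumes the heuristic removes the space before the final period):

from flair.data import Sentence

# Pretokenized CoNLL-style input carries no whitespace information
sentence = Sentence(["I", "love", "Berlin", "."])
sentence.infer_space_after()
print(sentence.to_plain_string())  # I love Berlin.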

to_dict(tag_type=None)#

Returns a dictionary representation of the sentence and its labels.

Return type:

dict[str, Any]

get_span(start, stop)#

Retrieves a Span over the token range [start, stop).

Return type:

Span

property start_position: int#

The starting character offset within the original text.

property end_position: int#

The ending character offset (exclusive) within the original text.

get_language_code()#

Returns the language code, running language detection if none was set.

Return type:

str
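
Example (a minimal sketch of the offset and language accessors; language detection requires the langdetect package, and the detected code is illustrative):

from flair.data import Sentence

# A sentence located at character offset 13 of a larger document
sentence = Sentence("Wie geht es dir?", start_position=13)

print(sentence.start_position)       # 13
print(sentence.end_position)         # 29, i.e. start_position + len(text)
print(sentence.get_language_code())  # detected on first access, e.g. 'de'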

next_sentence()#

Get the next sentence in the document.

This only works if context is set, for instance through a dataloader.

Returns:

The next Sentence in the document if set, otherwise None.

previous_sentence()#

Get the previous sentence in the document.

This only works if context is set, for instance through a dataloader.

Returns:

The previous Sentence in the document if set, otherwise None.

is_context_set()#

Determines if this sentence has a context of sentences before or after set.

Context may be set, for instance, by a dataloader.

Returns:

True if context is set, else False.

Return type:

bool

copy_context_from_sentence(sentence)#

Copies document context (previous/next sentence) from another sentence.
Return type:

None

classmethod set_context_for_sentences(sentences)#

Chains the given sentences so that each has its neighbors as context.
Return type:

None
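
Example (a minimal sketch of wiring up document context with these methods):

from flair.data import Sentence

sentences = [Sentence("I love Berlin."), Sentence("It is a great city.")]

# Chain the sentences so that each knows its neighbors
Sentence.set_context_for_sentences(sentences)

print(sentences[0].is_context_set())          # True
print(sentences[0].next_sentence().text)      # It is a great city.
print(sentences[1].previous_sentence().text)  # I love Berlin.
print([t.text for t in sentences[1].left_context(2)])  # ['Berlin', '.']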

get_labels(label_type=None)#

Retrieves all labels for a specific annotation layer.

Parameters:

label_type (Optional[str], optional) – The layer name. If None, returns all labels from all layers. Defaults to None.

Returns:

List of Label objects, or empty list if none found.

Return type:

list[Label]

remove_labels(typename)#

Removes all labels associated with a specific annotation layer.

Parameters:

typename (str) – The name of the annotation layer to clear.
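
Example (a minimal sketch of the label accessors on this page):

from flair.data import Sentence

sentence = Sentence("I love Berlin.")
sentence.add_label("sentiment", "POSITIVE", score=0.95)

for label in sentence.get_labels("sentiment"):
    print(label.value, label.score)  # POSITIVE 0.95

# Remove the whole annotation layer again
sentence.remove_labels("sentiment")
print(sentence.get_labels("sentiment"))  # []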

truncate(max_tokens)#

Truncates the sentence to max_tokens, cleaning up associated annotations.

Return type:

None
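
Example (a minimal sketch; whitespace tokenization is used so the token count is explicit):

from flair.data import Sentence

sentence = Sentence("The grass is green .", use_tokenizer=False)
sentence.truncate(3)
print([token.text for token in sentence.tokens])  # ['The', 'grass', 'is']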

retokenize(tokenizer)#

Retokenizes the sentence using the provided tokenizer while preserving span labels.

Parameters:

tokenizer – The tokenizer to use for retokenization.

Example:

from flair.data import Sentence
from flair.tokenization import StaccatoTokenizer

# Create a sentence with default tokenization
sentence = Sentence("01-03-2025 New York")

# Add span labels
sentence.get_span(1, 3).add_label('ner', "LOC")
sentence.get_span(0, 1).add_label('ner', "DATE")

# Retokenize with a different tokenizer while preserving labels
sentence.retokenize(StaccatoTokenizer())