flair.embeddings.transformer#

flair.embeddings.transformer.pad_sequence_embeddings(all_hidden_states)View on GitHub#
Return type:

Tensor

flair.embeddings.transformer.truncate_hidden_states(hidden_states, input_ids)View on GitHub#
Return type:

Tensor

flair.embeddings.transformer.combine_strided_tensors(hidden_states, overflow_to_sample_mapping, half_stride, max_length, default_value)View on GitHub#
Return type:

Tensor

flair.embeddings.transformer.fill_masked_elements(all_token_embeddings, sentence_hidden_states, mask, word_ids, lengths)View on GitHub#
flair.embeddings.transformer.insert_missing_embeddings(token_embeddings, word_id, length)View on GitHub#
Return type:

Tensor

flair.embeddings.transformer.fill_mean_token_embeddings(all_token_embeddings, sentence_hidden_states, word_ids, token_lengths)View on GitHub#
flair.embeddings.transformer.document_mean_pooling(sentence_hidden_states, sentence_lengths)View on GitHub#
flair.embeddings.transformer.document_max_pooling(sentence_hidden_states, sentence_lengths)View on GitHub#
flair.embeddings.transformer.remove_special_markup(text)View on GitHub#
class flair.embeddings.transformer.TransformerBaseEmbeddings(name, tokenizer, embedding_length, context_length, context_dropout, respect_document_boundaries, stride, allow_long_sentences, fine_tune, truncate, use_lang_emb, cls_pooling, is_document_embedding=False, is_token_embedding=False, force_device=None, force_max_length=False, feature_extractor=None, needs_manual_ocr=None, use_context_separator=True)View on GitHub#

Bases: Embeddings[Sentence]

Base class for all TransformerEmbeddings.

This base class handles the tokenizer and the input preparation but does not implement the model itself. Subclasses extend it to implement the model in PyTorch, TorchScript (JIT), or ONNX.

name: str#
to_args()View on GitHub#
classmethod from_params(params)View on GitHub#
to_params()View on GitHub#
classmethod create_from_state(**state)View on GitHub#
property embedding_length: int#

Returns the length of the embedding vector.

property embedding_type: str#
prepare_tensors(sentences, device=None)View on GitHub#
embeddings_name: str#
training: bool#
class flair.embeddings.transformer.TransformerOnnxEmbeddings(onnx_model, providers=[], session_options=None, **kwargs)View on GitHub#

Bases: TransformerBaseEmbeddings

to_params()View on GitHub#
classmethod from_params(params)View on GitHub#
Return type:

TransformerOnnxEmbeddings

create_session()View on GitHub#
remove_session()View on GitHub#
optimize_model(optimize_model_path, use_external_data_format=False, **kwargs)View on GitHub#

Wrapper for onnxruntime.transformers.optimizer.optimize_model.
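
A minimal optimization sketch, assuming onnxruntime (including its transformers optimizer) is installed; the file names, example sentences, and the model_type keyword (forwarded via **kwargs) are illustrative, not prescribed by the API:

    from flair.data import Sentence
    from flair.embeddings.transformer import TransformerEmbeddings, TransformerOnnxEmbeddings

    # export a regular TransformerEmbeddings instance to ONNX first
    base = TransformerEmbeddings(model="bert-base-uncased")
    example_sentences = [
        Sentence("A short sentence ."),
        Sentence("A noticeably longer sentence gives the tracer more variation ."),
    ]
    onnx_embeddings = TransformerOnnxEmbeddings.export_from_embedding(
        "bert-embeddings.onnx", base, example_sentences
    )

    # extra keyword arguments (here: model_type) are forwarded to
    # onnxruntime.transformers.optimizer.optimize_model
    onnx_embeddings.optimize_model("bert-embeddings-optimized.onnx", model_type="bert")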

quantize_model(quantize_model_path, use_external_data_format=False, **kwargs)View on GitHub#
classmethod collect_dynamic_axes(embedding, tensors)View on GitHub#
classmethod export_from_embedding(path, embedding, example_sentences, opset_version=14, providers=None, session_options=None)View on GitHub#
embeddings_name: str = 'TransformerOnnxEmbeddings'#
name: str#
training: bool#
tokenizer: PreTrainedTokenizer#
class flair.embeddings.transformer.TransformerJitEmbeddings(jit_model, param_names, **kwargs)View on GitHub#

Bases: TransformerBaseEmbeddings

to_params()View on GitHub#
classmethod from_params(params)View on GitHub#
Return type:

Embeddings

classmethod create_from_embedding(module, embedding, param_names)View on GitHub#
classmethod parameter_to_list(embedding, wrapper, sentences)View on GitHub#
Return type:

Tuple[List[str], List[Tensor]]

embeddings_name: str = 'TransformerJitEmbeddings'#
name: str#
training: bool#
tokenizer: PreTrainedTokenizer#
class flair.embeddings.transformer.TransformerJitWordEmbeddings(**kwargs)View on GitHub#

Bases: TokenEmbeddings, TransformerJitEmbeddings

embeddings_name: str = 'TransformerJitWordEmbeddings'#
name: str#
training: bool#
class flair.embeddings.transformer.TransformerJitDocumentEmbeddings(**kwargs)View on GitHub#

Bases: DocumentEmbeddings, TransformerJitEmbeddings

embeddings_name: str = 'TransformerJitDocumentEmbeddings'#
name: str#
training: bool#
class flair.embeddings.transformer.TransformerOnnxWordEmbeddings(**kwargs)View on GitHub#

Bases: TokenEmbeddings, TransformerOnnxEmbeddings

embeddings_name: str = 'TransformerOnnxWordEmbeddings'#
name: str#
training: bool#
class flair.embeddings.transformer.TransformerOnnxDocumentEmbeddings(**kwargs)View on GitHub#

Bases: DocumentEmbeddings, TransformerOnnxEmbeddings

embeddings_name: str = 'TransformerOnnxDocumentEmbeddings'#
name: str#
training: bool#
class flair.embeddings.transformer.TransformerEmbeddings(model='bert-base-uncased', fine_tune=True, layers='-1', layer_mean=True, subtoken_pooling='first', cls_pooling='cls', is_token_embedding=True, is_document_embedding=True, allow_long_sentences=False, use_context=False, respect_document_boundaries=True, context_dropout=0.5, saved_config=None, tokenizer_data=None, feature_extractor_data=None, name=None, force_max_length=False, needs_manual_ocr=None, use_context_separator=True, transformers_tokenizer_kwargs={}, transformers_config_kwargs={}, transformers_model_kwargs={}, peft_config=None, peft_gradient_checkpointing_kwargs={}, **kwargs)View on GitHub#

Bases: TransformerBaseEmbeddings

onnx_clsView on GitHub#

alias of TransformerOnnxEmbeddings

__init__(model='bert-base-uncased', fine_tune=True, layers='-1', layer_mean=True, subtoken_pooling='first', cls_pooling='cls', is_token_embedding=True, is_document_embedding=True, allow_long_sentences=False, use_context=False, respect_document_boundaries=True, context_dropout=0.5, saved_config=None, tokenizer_data=None, feature_extractor_data=None, name=None, force_max_length=False, needs_manual_ocr=None, use_context_separator=True, transformers_tokenizer_kwargs={}, transformers_config_kwargs={}, transformers_model_kwargs={}, peft_config=None, peft_gradient_checkpointing_kwargs={}, **kwargs)View on GitHub#

Instantiate transformers embeddings.

Allows using a transformers model as TokenEmbeddings, as DocumentEmbeddings, or both (see the usage sketch after the parameter list).

Parameters:
  • model (str) – name of the transformer model (see the Hugging Face Hub for options)

  • fine_tune (bool) – If True, the weights of the transformers embedding will be updated during training.

  • layers (str) – Specify which layers should be extracted for the embeddings. Expects either “all” to extract all layers or a comma separated list of indices (e.g. “-1,-2,-3,-4” for the last 4 layers)

  • layer_mean (bool) – If True, the extracted layers will be averaged. Otherwise, they will be concatenated.

  • subtoken_pooling (Literal['first', 'last', 'first_last', 'mean']) – Specify how multiple sub-tokens will be aggregated for a token-embedding.

  • cls_pooling (Literal['cls', 'max', 'mean']) – Specify how the document-embeddings will be extracted.

  • is_token_embedding (bool) – If True, these embeddings can be used as token embeddings.

  • is_document_embedding (bool) – If True, these embeddings can be used as document embeddings.

  • allow_long_sentences (bool) – If True, sentences that exceed the model’s maximum length are split into overlapping (strided) chunks and recombined afterwards.

  • use_context (Union[bool, int]) – If True, the previous and next sentences are used as additional context when predicting multiple sentences at once. An integer specifies the size of the context window in tokens.

  • respect_document_boundaries (bool) – If True, the context calculation will stop if a sentence represents a context boundary.

  • context_dropout (float) – Fraction (between 0 and 1) specifying how often the context is dropped (not used) during training.

  • saved_config (Optional[PretrainedConfig]) – Pretrained config used when loading embeddings. Always use None.

  • tokenizer_data (Optional[BytesIO]) – Tokenizer data used when loading embeddings. Always use None.

  • feature_extractor_data (Optional[BytesIO]) – Feature extractor data used when loading embeddings. Always use None.

  • name (Optional[str]) – The name for the embeddings. By default, the name of the underlying transformers model is used.

  • force_max_length (bool) – If True, the tokenizer will always pad the sequences to maximum length.

  • needs_manual_ocr (Optional[bool]) – If True, bounding boxes will be calculated manually. This is used for models like layoutlm where the tokenizer doesn’t compute the bounding boxes itself.

  • use_context_separator (bool) – If True, the embedding will hold an additional token to allow the model to distinguish between context and prediction.

  • transformers_tokenizer_kwargs (Dict[str, Any]) – Further values forwarded to the initialization of the transformers tokenizer

  • transformers_config_kwargs (Dict[str, Any]) – Further values forwarded to the initialization of the transformers config

  • transformers_model_kwargs (Dict[str, Any]) – Further values forwarded to the initialization of the transformers model

  • peft_config – If set, the model will be trained using adapters (and optionally QLoRA). Must be of type PeftConfig or a subtype.

  • peft_gradient_checkpointing_kwargs (Optional[Dict[str, Any]]) – Further values used when preparing the model for kbit training. Only used if peft_config is set.

  • **kwargs – Further values forwarded to the transformers config
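
A minimal usage sketch; the model name and parameter values below are illustrative, and all other arguments keep their documented defaults:

    from flair.data import Sentence
    from flair.embeddings.transformer import TransformerEmbeddings

    # illustrative choices; any Hugging Face model name can be used
    embeddings = TransformerEmbeddings(
        model="bert-base-uncased",
        layers="-1",                # only the last layer
        subtoken_pooling="first",   # token embedding = embedding of the first sub-token
        fine_tune=False,            # keep the transformer weights frozen
    )

    sentence = Sentence("Berlin is a city in Germany .")
    embeddings.embed(sentence)

    # token-level embeddings (is_token_embedding=True by default)
    for token in sentence:
        print(token.text, token.get_embedding().shape)

    # document-level embedding (is_document_embedding=True by default)
    print(sentence.get_embedding().shape)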

embeddings_name: str = 'TransformerEmbeddings'#
property embedding_length: int#

Returns the length of the embedding vector.

property embedding_type: str#
classmethod from_params(params)View on GitHub#
to_params()View on GitHub#
forward(input_ids, sub_token_lengths=None, token_lengths=None, attention_mask=None, overflow_to_sample_mapping=None, word_ids=None, langs=None, bbox=None, pixel_values=None)View on GitHub#

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

export_onnx(path, example_sentences, **kwargs)View on GitHub#

Export TransformerEmbeddings to ONNX format (see the sketch below).

Parameters:
  • path (Union[str, Path]) – the path to save the embeddings. Note that the weights are stored in an external file, so it matters whether the path is absolute or relative.

  • example_sentences (List[Sentence]) – a list of sentences that will be used for tracing. It is recommended to take 2-4 sentences with some variation.

  • **kwargs – the parameters passed to TransformerOnnxEmbeddings.export_from_embedding()

Return type:

TransformerOnnxEmbeddings
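
A minimal export sketch, assuming onnxruntime is installed; file names and example sentences are illustrative:

    from flair.data import Sentence
    from flair.embeddings.transformer import TransformerEmbeddings

    embeddings = TransformerEmbeddings(model="bert-base-uncased")

    # 2-4 sentences with some variation in length are recommended for tracing
    example_sentences = [
        Sentence("This is a short sentence ."),
        Sentence("This considerably longer sentence gives the tracer more variation in length ."),
    ]

    # writes the ONNX graph (plus an external weights file) next to the given path
    # and returns a ready-to-use TransformerOnnxEmbeddings instance
    onnx_embeddings = embeddings.export_onnx("bert-embeddings.onnx", example_sentences)

    onnx_embeddings.embed(Sentence("Berlin is a city ."))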