flair.datasets.ocr

class flair.datasets.ocr.OcrJsonDataset(path_to_split_directory, label_type='ner', in_memory=True, encoding='utf-8', load_images=False, normalize_coords_to_thousands=True, label_name_map=None)

Bases: FlairDataset

__init__(path_to_split_directory, label_type='ner', in_memory=True, encoding='utf-8', load_images=False, normalize_coords_to_thousands=True, label_name_map=None)

Instantiates a Dataset from an OCR-JSON format.

The folder contains an “images” folder and a “tagged” folder, which hold .jpg and .json files, respectively, with matching file names. Each JSON file contains three fields, “words”, “bbox” and “labels”, which are lists of equal length: “words” is a list of strings containing the OCR texts, “bbox” is a list of int tuples containing left, top, right and bottom, and “labels” is a BIO tagging of the sentences.

Parameters:
  • path_to_split_directory (Union[str, Path]) – base folder with the task data

  • label_type (str) – the label type under which the OCR labels are added

  • encoding (str) – the encoding used to load the .json files

  • normalize_coords_to_thousands (bool) – if True, the coordinates are scaled to the range 0 to 1000

  • load_images (bool) – if True, the Pillow images are added as metadata

  • in_memory (bool) – if True, the dataset is kept in memory as Sentence objects; otherwise the data is read from disk

  • label_name_map (Optional[dict[str, str]]) – optional mapping of tag names to a different schema

Returns:

a Dataset with Sentences that contain OCR information
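The layout described above can be illustrated with a minimal sketch that uses only the standard library. It writes one annotation file into a “tagged” folder and checks that the three lists line up. The helper `write_sample`, the file name `receipt_0001` and the tag names are hypothetical and serve only as illustration; the matching .jpg image is not created here.

```python
import json
import tempfile
from pathlib import Path


def write_sample(split_dir: Path, name: str, words, bbox, labels) -> Path:
    """Write one annotation file to <split_dir>/tagged/<name>.json (hypothetical helper)."""
    # The three lists must line up token by token.
    if not (len(words) == len(bbox) == len(labels)):
        raise ValueError("words, bbox and labels must have equal length")
    tagged_dir = split_dir / "tagged"
    tagged_dir.mkdir(parents=True, exist_ok=True)
    # The matching image would be stored as <split_dir>/images/<name>.jpg.
    (split_dir / "images").mkdir(parents=True, exist_ok=True)
    target = tagged_dir / f"{name}.json"
    payload = {"words": words, "bbox": bbox, "labels": labels}
    target.write_text(json.dumps(payload), encoding="utf-8")
    return target


split_dir = Path(tempfile.mkdtemp()) / "train"
sample_path = write_sample(
    split_dir,
    "receipt_0001",  # hypothetical file name
    words=["ACME", "Ltd", "Total", "12.50"],
    # One box per word: left, top, right, bottom.
    bbox=[[10, 10, 60, 30], [65, 10, 100, 30], [10, 200, 60, 220], [70, 200, 120, 220]],
    labels=["B-COMPANY", "I-COMPANY", "O", "B-TOTAL"],  # hypothetical tag names
)
sample = json.loads(sample_path.read_text(encoding="utf-8"))
```

Once a matching image exists under “images”, a folder prepared this way is the kind of directory that `path_to_split_directory` is expected to point to.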

is_in_memory()
Return type:

bool
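The effect of the `normalize_coords_to_thousands` option of OcrJsonDataset can be pictured with a small sketch. It assumes that normalization divides each coordinate by the image width or height and scales the result to 0 to 1000 with integer rounding down. The exact rounding used by the library is not stated here, and the function `normalize_bbox` is a hypothetical illustration, not part of flair.

```python
def normalize_bbox(bbox, width, height):
    """Scale a (left, top, right, bottom) box to the range 0..1000 (assumed scheme)."""
    left, top, right, bottom = bbox
    return (
        1000 * left // width,    # horizontal values are scaled by the image width
        1000 * top // height,    # vertical values are scaled by the image height
        1000 * right // width,
        1000 * bottom // height,
    )


# A box on a hypothetical 200 x 400 pixel image.
normalized = normalize_bbox((50, 100, 150, 200), width=200, height=400)
```

Under this assumption, boxes from images of different sizes end up on one common scale.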

class flair.datasets.ocr.OcrCorpus(train_path=None, dev_path=None, test_path=None, encoding='utf-8', label_type='ner', in_memory=True, load_images=False, normalize_coords_to_thousands=True, label_name_map=None, **corpusargs)

Bases: Corpus

__init__(train_path=None, dev_path=None, test_path=None, encoding='utf-8', label_type='ner', in_memory=True, load_images=False, normalize_coords_to_thousands=True, label_name_map=None, **corpusargs)

Instantiates a Corpus from an OCR-JSON format.

Parameters:
  • train_path (Optional[Path]) – the folder for the training data

  • dev_path (Optional[Path]) – the folder for the dev data

  • test_path (Optional[Path]) – the folder for the test data

  • label_type (str) – the label type under which the OCR labels are added

  • encoding (str) – the encoding used to load the .json files

  • load_images (bool) – if True, the Pillow images are added as metadata

  • normalize_coords_to_thousands (bool) – if True, the coordinates are scaled to the range 0 to 1000

  • in_memory (bool) – if True, the dataset is kept in memory as Sentence objects; otherwise the data is read from disk

  • label_name_map (Optional[dict[str, str]]) – optional mapping of tag names to a different schema

Returns:

a Corpus with Sentences that contain OCR information
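A possible way to call the constructor documented above is sketched below. The keyword names come from the signature. The root folder `data/receipts`, its `train`/`dev`/`test` subfolder names and the helpers `corpus_kwargs` and `load_corpus` are assumptions made for illustration. The flair import is placed inside the function so that the sketch can be read and checked without flair installed.

```python
from pathlib import Path


def corpus_kwargs(root: Path) -> dict:
    """Collect OcrCorpus keyword arguments for a root with one folder per split."""
    return {
        "train_path": root / "train",  # assumed folder names
        "dev_path": root / "dev",
        "test_path": root / "test",
        "label_type": "ner",
        "in_memory": True,
        "load_images": False,
        "normalize_coords_to_thousands": True,
    }


def load_corpus(root: Path):
    """Build the corpus; needs flair and real data under root."""
    # Imported lazily so that defining this function does not require flair.
    from flair.datasets.ocr import OcrCorpus

    return OcrCorpus(**corpus_kwargs(root))


kwargs = corpus_kwargs(Path("data/receipts"))  # hypothetical root folder
```

Calling `load_corpus(Path("data/receipts"))` would then build the corpus, provided each split folder follows the “images” and “tagged” layout expected by OcrJsonDataset.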

class flair.datasets.ocr.SROIE(base_path=None, encoding='utf-8', label_type='ner', in_memory=True, load_images=False, normalize_coords_to_thousands=True, label_name_map=None, **corpusargs)

Bases: OcrCorpus

__init__(base_path=None, encoding='utf-8', label_type='ner', in_memory=True, load_images=False, normalize_coords_to_thousands=True, label_name_map=None, **corpusargs)

Instantiates the SROIE corpus with perfect OCR boxes.

Parameters:
  • base_path (Union[str, Path, None]) – the path to store the dataset or load it from

  • label_type (str) – the label type under which the OCR labels are added

  • encoding (str) – the encoding used to load the .json files

  • load_images (bool) – if True, the Pillow images are added as metadata

  • normalize_coords_to_thousands (bool) – if True, the coordinates are scaled to the range 0 to 1000

  • in_memory (bool) – if True, the dataset is kept in memory as Sentence objects; otherwise the data is read from disk

  • label_name_map (Optional[dict[str, str]]) – optional mapping of tag names to a different schema

Returns:

a Corpus with Sentences that contain OCR information
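The idea behind `label_name_map` can be sketched as a plain dictionary lookup over BIO tags. The function `remap_tag` and the tag names below are hypothetical. The sketch assumes that the map is keyed by the bare tag name without the `B-`/`I-` prefix and that unmapped names pass through unchanged; neither detail is confirmed by this page.

```python
def remap_tag(tag, label_name_map=None):
    """Rename the entity part of a BIO tag, keeping the B-/I- prefix (assumed behaviour)."""
    if label_name_map is None or tag == "O":
        return tag
    prefix, sep, name = tag.partition("-")
    if sep and prefix in {"B", "I"}:
        # Names missing from the map are left unchanged.
        return f"{prefix}-{label_name_map.get(name, name)}"
    # Tags without a BIO prefix are looked up directly.
    return label_name_map.get(tag, tag)


name_map = {"COMPANY": "ORG"}  # hypothetical mapping
remapped = [remap_tag(t, name_map) for t in ["B-COMPANY", "I-COMPANY", "O", "B-TOTAL"]]
```

A dictionary such as `name_map` is the kind of value that would be passed as `label_name_map` when instantiating the corpus.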