flair.datasets.ocr.OcrJsonDataset

class flair.datasets.ocr.OcrJsonDataset(path_to_split_directory, label_type='ner', in_memory=True, encoding='utf-8', load_images=False, normalize_coords_to_thousands=True, label_name_map=None)

Bases: FlairDataset

__init__(path_to_split_directory, label_type='ner', in_memory=True, encoding='utf-8', load_images=False, normalize_coords_to_thousands=True, label_name_map=None)

Instantiates a dataset from files in OCR-JSON format.

The split directory contains an “images” folder and a “tagged” folder, which hold .jpg and .json files, respectively, with matching file names. Each JSON file contains three fields, “words”, “bbox” and “labels”, which are lists of equal length:

“words” is a list of strings containing the OCR texts.

“bbox” is a list of int tuples, each containing left, top, right, bottom.

“labels” is a BIO tagging of the sentence.

Parameters:

path_to_split_directory (Union[str, Path]) – base folder with the task data

label_type (str) – the label type under which the OCR labels are added

encoding (str) – the encoding used to load the .json files

normalize_coords_to_thousands (bool) – if True, the coordinates are normalized to the range 0 to 1000

load_images (bool) – if True, the Pillow images are added as metadata

in_memory (bool) – if True, the dataset is kept in memory as Sentence objects; otherwise the data is read from disk

label_name_map (Optional[dict[str, str]]) – optional mapping of tag names to a different schema

Returns:

a Dataset with Sentences that contain OCR information
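The on-disk layout described above can be illustrated with a minimal sketch that uses only the standard library. The file name `sample_0`, the tag `INVOICE_ID`, the image size, and the helper functions are hypothetical and not part of flair. The `scale_box` helper shows one plausible way to map pixel coordinates into the 0 to 1000 range; it is an assumption, not necessarily how `normalize_coords_to_thousands` is implemented.

```python
import json
import tempfile
from pathlib import Path


def write_sample_split(split_dir: Path) -> Path:
    """Create the "images"/"tagged" layout with one hypothetical annotation file."""
    (split_dir / "images").mkdir(parents=True, exist_ok=True)
    (split_dir / "tagged").mkdir(parents=True, exist_ok=True)
    annotation = {
        "words": ["Invoice", "No.", "12345"],
        # One box per word: left, top, right, bottom (pixel coordinates).
        "bbox": [[10, 20, 110, 40], [120, 20, 160, 40], [170, 20, 250, 40]],
        # BIO tags, one per word; "INVOICE_ID" is a made-up tag name.
        "labels": ["O", "O", "B-INVOICE_ID"],
    }
    json_path = split_dir / "tagged" / "sample_0.json"
    json_path.write_text(json.dumps(annotation), encoding="utf-8")
    # The matching image would be images/sample_0.jpg; it is not created here.
    return json_path


def check_annotation(json_path: Path, encoding: str = "utf-8") -> dict:
    """Load an annotation file and verify the three lists have equal length."""
    data = json.loads(json_path.read_text(encoding=encoding))
    lengths = {len(data[key]) for key in ("words", "bbox", "labels")}
    if len(lengths) != 1:
        raise ValueError(f"{json_path.name}: words, bbox and labels differ in length")
    return data


def scale_box(box, width, height):
    """Assumed normalization: scale pixel coordinates into the 0..1000 range."""
    left, top, right, bottom = box
    return [
        int(1000 * left / width),
        int(1000 * top / height),
        int(1000 * right / width),
        int(1000 * bottom / height),
    ]


with tempfile.TemporaryDirectory() as tmp:
    sample = check_annotation(write_sample_split(Path(tmp) / "train"))
    # Assume a 500 x 200 pixel image for illustration.
    scaled = [scale_box(box, width=500, height=200) for box in sample["bbox"]]
```

Once the matching .jpg files are present in “images”, a directory built this way would be the value passed as `path_to_split_directory`.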

Methods

__init__(path_to_split_directory[, ...])

Instantiates a dataset from files in OCR-JSON format.

is_in_memory()

is_in_memory()
Return type:

bool