flair.datasets.ocr.OcrJsonDataset
- class flair.datasets.ocr.OcrJsonDataset(path_to_split_directory, label_type='ner', in_memory=True, encoding='utf-8', load_images=False, normalize_coords_to_thousands=True, label_name_map=None)
Bases: FlairDataset
- __init__(path_to_split_directory, label_type='ner', in_memory=True, encoding='utf-8', load_images=False, normalize_coords_to_thousands=True, label_name_map=None)
Instantiates a Dataset from the OCR-JSON format.
The folder contains an “images” folder and a “tagged” folder, which hold .jpg and .json files respectively, with matching file names. Each JSON file contains three fields, “words”, “bbox” and “labels”, which are lists of equal length.
“words” is a list of strings containing the OCR texts.
“bbox” is a list of int tuples containing left, top, right, bottom.
“labels” is a BIO tagging of the sentences.
- Parameters:
path_to_split_directory (Union[str, Path]): base folder with the task data
label_type (str): the label_type to add the OCR labels to
encoding (str): the encoding used to load the .json files
normalize_coords_to_thousands (bool): if True, the coordinates are scaled to range from 0 to 1000
load_images (bool): if True, the Pillow images are added as metadata
in_memory (bool): if True, the dataset is kept in memory as Sentence objects; otherwise it is read from disk
label_name_map (Optional[dict[str, str]]): optionally maps tag names to a different schema
- Returns:
a Dataset with Sentences that contain OCR information
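The folder layout and JSON fields described above can be illustrated with a minimal sketch. It uses only the standard library to write one annotation file and check that the three lists have equal length. The file name `page_0001.json`, the sample words, the label `B-ID`, and the helper names `write_sample_split`, `validate_annotation` and `load_split` are hypothetical and chosen for illustration only. The final loading step assumes flair is installed and that a real `.jpg` with the same file name has been placed in the “images” folder; it is not executed here.

```python
import json
import tempfile
from pathlib import Path


def write_sample_split(split_dir: Path) -> Path:
    """Create the 'images' and 'tagged' folders and one sample annotation file."""
    (split_dir / "images").mkdir(parents=True, exist_ok=True)
    (split_dir / "tagged").mkdir(parents=True, exist_ok=True)

    # Hypothetical sample data. JSON has no tuple type, so each box is
    # stored as a 4-element array: left, top, right, bottom.
    annotation = {
        "words": ["Invoice", "No.", "12345"],
        "bbox": [[10, 20, 110, 40], [120, 20, 160, 40], [170, 20, 250, 40]],
        "labels": ["O", "O", "B-ID"],
    }
    json_path = split_dir / "tagged" / "page_0001.json"
    json_path.write_text(json.dumps(annotation), encoding="utf-8")
    return json_path


def validate_annotation(json_path: Path) -> dict:
    """Check that 'words', 'bbox' and 'labels' are lists of equal length."""
    data = json.loads(json_path.read_text(encoding="utf-8"))
    n = len(data["words"])
    if len(data["bbox"]) != n or len(data["labels"]) != n:
        raise ValueError(f"{json_path.name}: field lengths differ")
    for box in data["bbox"]:
        if len(box) != 4:
            raise ValueError(f"{json_path.name}: each bbox needs 4 values")
    return data


def load_split(split_dir: Path):
    """Load the split with flair (assumes flair and matching .jpg files exist)."""
    from flair.datasets.ocr import OcrJsonDataset

    return OcrJsonDataset(split_dir, label_type="ner", in_memory=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = write_sample_split(Path(tmp) / "train")
        data = validate_annotation(path)
        print(len(data["words"]), "annotated words in", path.name)
```

Validating the annotation files before constructing the dataset is a design choice of this sketch, not something the class is documented to require; it simply surfaces length mismatches early.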
Methods
__init__(path_to_split_directory[, ...]): Instantiates a Dataset from the OCR-JSON format.
- is_in_memory()
- Return type:
bool