flair.datasets.ocr.OcrJsonDataset
- class flair.datasets.ocr.OcrJsonDataset(path_to_split_directory, label_type='ner', in_memory=True, encoding='utf-8', load_images=False, normalize_coords_to_thousands=True, label_name_map=None)
Bases:
FlairDataset
- __init__(path_to_split_directory, label_type='ner', in_memory=True, encoding='utf-8', load_images=False, normalize_coords_to_thousands=True, label_name_map=None)
Instantiates a Dataset from an OCR-JSON format.
The split directory contains an “images” folder and a “tagged” folder, holding .jpg and .json files respectively, with matching file names. Each JSON file has three fields, “words”, “bbox” and “labels”, which are lists of equal length:
- “words” is a list of strings containing the OCR texts.
- “bbox” is a list of integer tuples, each giving left, top, right, bottom.
- “labels” is a BIO tagging of the sentence.
- Parameters:
- path_to_split_directory (Union[str, Path]): base folder with the task data
- label_type (str): the label_type to add the OCR labels to
- encoding (str): the encoding used to load the .json files
- normalize_coords_to_thousands (bool): if True, the coordinates are scaled to the range 0 to 1000
- load_images (bool): if True, the Pillow images are added as metadata
- in_memory (bool): if True, the dataset is kept in memory as Sentence objects; otherwise the data is read from disk
- label_name_map (Optional[dict[str, str]]): optional mapping of tag names to a different schema
- Returns:
a Dataset with Sentences that contain OCR information
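The file layout described above can be illustrated with a minimal sketch that uses only the standard library. The helper `validate_record`, the file name `invoice_0001`, the label names and the bounding-box values are all hypothetical and chosen for illustration. Only the three field names and the “images”/“tagged” folder names come from the description.

```python
import json
import tempfile
from pathlib import Path


def validate_record(record: dict) -> None:
    """Check the three fields described above (hypothetical helper, not part of flair)."""
    words, boxes, labels = record["words"], record["bbox"], record["labels"]
    # The three lists must line up one-to-one.
    if not (len(words) == len(boxes) == len(labels)):
        raise ValueError("words, bbox and labels must have equal length")
    for box in boxes:
        # Each box is (left, top, right, bottom), all integers.
        if len(box) != 4 or not all(isinstance(v, int) for v in box):
            raise ValueError(f"expected four integers per box, got {box!r}")


# Illustrative content only: three OCR tokens with BIO labels.
record = {
    "words": ["Invoice", "No.", "12345"],
    "bbox": [[10, 20, 110, 45], [120, 20, 160, 45], [170, 20, 260, 45]],
    "labels": ["O", "O", "B-INVOICE_NUMBER"],
}
validate_record(record)

# Build the split directory with its two sub-folders.
split_dir = Path(tempfile.mkdtemp()) / "train"
(split_dir / "images").mkdir(parents=True)
(split_dir / "tagged").mkdir()

# The annotation goes into tagged/; a matching images/invoice_0001.jpg
# would have to be supplied separately.
tagged_file = split_dir / "tagged" / "invoice_0001.json"
tagged_file.write_text(json.dumps(record), encoding="utf-8")
```

With flair installed and the matching .jpg in place, a directory like this could then be passed as `OcrJsonDataset(split_dir)`. That call is not exercised in the sketch.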
Methods
__init__(path_to_split_directory[, ...]): Instantiates a Dataset from an OCR-JSON format.
- is_in_memory()
- Return type:
bool