flair.datasets.ocr
- class flair.datasets.ocr.OcrJsonDataset(path_to_split_directory, label_type='ner', in_memory=True, encoding='utf-8', load_images=False, normalize_coords_to_thousands=True, label_name_map=None)
Bases:
FlairDataset
- __init__(path_to_split_directory, label_type='ner', in_memory=True, encoding='utf-8', load_images=False, normalize_coords_to_thousands=True, label_name_map=None)
Instantiates a Dataset from an OCR-JSON format.
The folder is structured with an “images” folder and a “tagged” folder. These folders contain .jpg and .json files respectively, with matching file names. Each JSON file contains three fields, “words”, “bbox” and “labels”, which are lists of equal length. “words” is a list of strings containing the OCR texts, “bbox” is a list of int tuples containing left, top, right, bottom, and “labels” is a BIO tagging of the sentences.
- Parameters:
path_to_split_directory (Union[str, Path]) – base folder with the task data
label_type (str) – the label_type to add the OCR labels to
encoding (str) – the encoding to load the .json files with
normalize_coords_to_thousands (bool) – if True, the coordinates are scaled to the range 0 to 1000
load_images (bool) – if True, the Pillow images are added as metadata
in_memory (bool) – if True, the dataset is kept in memory as Sentence objects; otherwise it is read from disk
label_name_map (Optional[dict[str, str]]) – optionally maps tag names to a different schema
- Returns:
a Dataset with Sentences that contain OCR information
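The folder and JSON layout described above can be sketched with the standard library alone. This is a minimal illustration, not Flair's own code: the helper `write_example`, the file stem `receipt_0001`, and the label names are hypothetical, and the matching .jpg in “images” is not created here.

```python
import json
import tempfile
from pathlib import Path


def write_example(split_dir: Path, stem: str, words, bbox, labels) -> Path:
    """Write tagged/<stem>.json with the three equal-length lists."""
    if not (len(words) == len(bbox) == len(labels)):
        raise ValueError("words, bbox and labels must have equal length")
    for box in bbox:
        if len(box) != 4:
            raise ValueError("each bbox needs left, top, right, bottom")

    # A matching images/<stem>.jpg would be stored here (not created in this sketch).
    (split_dir / "images").mkdir(parents=True, exist_ok=True)
    tagged = split_dir / "tagged"
    tagged.mkdir(parents=True, exist_ok=True)

    record = {
        "words": list(words),
        "bbox": [list(box) for box in bbox],  # left, top, right, bottom
        "labels": list(labels),               # BIO tags, one per word
    }
    target = tagged / f"{stem}.json"
    target.write_text(json.dumps(record, ensure_ascii=False), encoding="utf-8")
    return target


# Hypothetical split folder and example content, for illustration only.
split_dir = Path(tempfile.mkdtemp()) / "train"
path = write_example(
    split_dir,
    "receipt_0001",
    words=["ACME", "STORE", "TOTAL", "9.50"],
    bbox=[(10, 12, 80, 30), (85, 12, 160, 30), (10, 200, 70, 220), (300, 200, 350, 220)],
    labels=["B-company", "I-company", "O", "B-total"],
)
loaded = json.loads(path.read_text(encoding="utf-8"))
print(loaded["words"], loaded["labels"])
```

A folder prepared this way, once the matching .jpg files are added, is what `path_to_split_directory` is expected to point at.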
- is_in_memory()
- Return type:
bool
- class flair.datasets.ocr.OcrCorpus(train_path=None, dev_path=None, test_path=None, encoding='utf-8', label_type='ner', in_memory=True, load_images=False, normalize_coords_to_thousands=True, label_name_map=None, **corpusargs)
Bases:
Corpus
- __init__(train_path=None, dev_path=None, test_path=None, encoding='utf-8', label_type='ner', in_memory=True, load_images=False, normalize_coords_to_thousands=True, label_name_map=None, **corpusargs)
Instantiates a Corpus from an OCR-JSON format.
- Parameters:
train_path (Optional[Path]) – the folder for the training data
dev_path (Optional[Path]) – the folder for the dev data
test_path (Optional[Path]) – the folder for the test data
path_to_split_directory – base folder with the task data
label_type (str) – the label_type to add the OCR labels to
encoding (str) – the encoding to load the .json files with
load_images (bool) – if True, the Pillow images are added as metadata
normalize_coords_to_thousands (bool) – if True, the coordinates are scaled to the range 0 to 1000
in_memory (bool) – if True, the dataset is kept in memory as Sentence objects; otherwise it is read from disk
label_name_map (Optional[dict[str, str]]) – optionally maps tag names to a different schema
- Returns:
a Corpus with Sentences that contain OCR information
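As a rough usage sketch, the three split paths can be collected from one root folder and passed to `OcrCorpus`. The helper names `split_paths` and `load_ocr_corpus` and the sub-folder names `train`, `dev` and `test` are assumptions made for this example; only the `OcrCorpus` arguments come from the signature above. The Flair import sits inside the function so the path logic can be tried without Flair installed.

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional


def split_paths(root: Path) -> Dict[str, Optional[Path]]:
    """Map each OcrCorpus path argument to root/<split>, or None if that folder is missing."""
    result: Dict[str, Optional[Path]] = {}
    for split in ("train", "dev", "test"):  # hypothetical folder names
        folder = root / split
        result[f"{split}_path"] = folder if folder.is_dir() else None
    return result


def load_ocr_corpus(root):
    """Build an OcrCorpus from whichever split folders exist under root (requires flair)."""
    from flair.datasets.ocr import OcrCorpus

    return OcrCorpus(
        **split_paths(Path(root)),
        label_type="ner",
        in_memory=True,
        load_images=False,
        normalize_coords_to_thousands=True,
    )


# Demonstrate the path mapping on a temporary root that only has a train split.
demo_root = Path(tempfile.mkdtemp())
(demo_root / "train").mkdir()
demo_paths = split_paths(demo_root)
print(demo_paths)
```

Calling `load_ocr_corpus(some_root)` would then hand the resolved paths to Flair; each split folder is expected to follow the “images” and “tagged” layout.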
- class flair.datasets.ocr.SROIE(base_path=None, encoding='utf-8', label_type='ner', in_memory=True, load_images=False, normalize_coords_to_thousands=True, label_name_map=None, **corpusargs)
Bases:
OcrCorpus
- __init__(base_path=None, encoding='utf-8', label_type='ner', in_memory=True, load_images=False, normalize_coords_to_thousands=True, label_name_map=None, **corpusargs)
Instantiates the SROIE corpus with perfect OCR boxes.
- Parameters:
base_path (Union[str, Path, None]) – the path to store the dataset or load it from
label_type (str) – the label_type to add the OCR labels to
encoding (str) – the encoding to load the .json files with
load_images (bool) – if True, the Pillow images are added as metadata
normalize_coords_to_thousands (bool) – if True, the coordinates are scaled to the range 0 to 1000
in_memory (bool) – if True, the dataset is kept in memory as Sentence objects; otherwise it is read from disk
label_name_map (Optional[dict[str, str]]) – optionally maps tag names to a different schema
- Returns:
a Corpus with Sentences that contain OCR information
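A minimal sketch of loading SROIE and checking the split sizes might look as follows. The helpers `load_sroie` and `split_sizes` are hypothetical names introduced for this example. The sketch assumes the returned corpus exposes `train`, `dev` and `test` splits, as Flair corpora do. The snippet itself only runs `split_sizes` on a stand-in object, so nothing is downloaded.

```python
from types import SimpleNamespace


def load_sroie(base_path=None):
    """Instantiate the SROIE corpus (requires flair; may store data under base_path)."""
    from flair.datasets.ocr import SROIE

    return SROIE(base_path=base_path, label_type="ner", load_images=False)


def split_sizes(corpus) -> dict:
    """Return the number of sentences per split, counting a missing split as 0."""
    sizes = {}
    for name in ("train", "dev", "test"):
        split = getattr(corpus, name, None)
        sizes[name] = len(split) if split is not None else 0
    return sizes


# Stand-in object with the same split attributes, used so the snippet runs offline.
fake_corpus = SimpleNamespace(train=["s1", "s2"], dev=["s3"], test=None)
sizes = split_sizes(fake_corpus)
print(sizes)
```

With Flair installed, `split_sizes(load_sroie())` would report the sizes of the real corpus instead.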