The dataset, part of the IMPACT Centre of Competence in Digitisation (digitisation.eu), contains more than half a million representative text-based images compiled by a number of major European libraries. Covering texts from as early as 1500, and containing material from newspapers, books, pamphlets and typewritten notes, the dataset is an invaluable resource for future research into imaging technology, OCR and language enrichment.
A carefully selected subset of these images has been reproduced with accompanying "ground truth". In digital imaging and OCR, ground truth is the objective verification of the properties of a digital image, used to test the accuracy of automated image analysis processes. (The ground truth of an image's text content, for instance, is the complete and accurate record of every character and word in the image.) This ground truth data will allow researchers and developers to evaluate and demonstrate the results of their tools and processes.
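As an illustration of how ground truth can be used for evaluation, the following minimal sketch compares an OCR output string against its ground truth transcription and reports a character error rate (edit distance divided by ground truth length). This is one common metric, not necessarily the one used with this dataset; the function names and the sample strings are hypothetical and chosen only for the example.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings (hypothetical helper).

    Counts the minimum number of single-character insertions,
    deletions and substitutions needed to turn one string into the other.
    """
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # reference character missing from hypothesis
                current[j - 1] + 1,      # extra character in hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Edit distance normalised by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)


# Illustrative strings only, not taken from the dataset.
truth = "ground truth"
ocr = "gronnd trvth"  # two misrecognised characters
print(f"CER: {character_error_rate(truth, ocr):.3f}")
```

A lower value means the OCR output is closer to the ground truth. Word-level accuracy can be measured the same way by running the comparison over lists of words instead of characters.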
A survey of OCR evaluation tools and metrics. In The 6th International Workshop on Historical Document Imaging and Processing (HIP '21). Association for Computing Machinery, New York, NY, USA, 13–18.