Layout Analysis and Text Recognition are of fundamental importance among Document Image Analysis steps and have been (and continue to be) relatively well researched. Historical documents are of particular interest as they pose a number of challenges and, at the same time, represent a very large proportion of printed documents in existence. With the increasing number of digitisation projects initiated by libraries world-wide, the problem of analysis and recognition of these documents is very topical.
Historical books represent a large proportion of libraries’ holdings and continue to be the focus of large-scale digitisation projects. A number of distortions frequently manifest themselves in scans of historical books, hindering layout analysis and text recognition. The motivation of the competition is to evaluate existing approaches using a realistic dataset and an objective performance analysis system.
HBR2013 follows the successful running of all previous ICDAR Page Segmentation competitions (2001, 2003, 2005, 2007, 2009 and 2011). The proposed competition will expand the scope to historical books with distortions (the historical documents in the ICDAR2011 competition dataset were largely distortion-free, so that the segmentation step could be evaluated on its own). Furthermore, the breadth of the competition will increase to cover recognition as well.
Dataset and evaluation methodology
The dataset used in this competition will be a subset of the dataset recently created by the IMPACT project, representing key holdings of major European libraries. It is realistic in that it covers a wide range of books in different languages (the recognition aspect of the competition will focus on English, French, German and Spanish), with a variety of layouts and conditions typical of historical books that are likely to be of broad interest for digitisation. All material has been ground-truthed using Aletheia and is available in the PAGE format. The dataset will be made publicly available after the competition.
The competition will use the evaluation approach successfully employed in the ICDAR2009 Page Segmentation Competition and further updated for the first competition on general historical documents at ICDAR2011. It has recently been extended to perform text-based evaluation (e.g. for OCR) as well. As a whole, it takes into account a wide range of situations and provides considerable detail on the performance of different methods. Each type of error is weighted according to the types of region involved and the situation in which it occurs.
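To illustrate the idea of weighting errors by error type and region type, here is a minimal Python sketch. It is not the competition's evaluation system: the error kinds, the weight values, the default weight and the normalisation by total page area are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutError:
    kind: str         # hypothetical error kind, e.g. "merge", "split", "miss"
    region_type: str  # hypothetical region type, e.g. "text", "image"
    area: float       # affected area in pixels


# Hypothetical weights per (error kind, region type); a real evaluation
# profile would define its own values for each scenario.
WEIGHTS = {
    ("merge", "text"): 1.0,
    ("split", "text"): 0.5,
    ("miss", "text"): 1.0,
    ("miss", "image"): 0.25,
}
DEFAULT_WEIGHT = 0.5  # assumed fallback for combinations not listed above


def weighted_error(errors, total_area, weights=WEIGHTS, default=DEFAULT_WEIGHT):
    """Return the weighted erroneous area as a fraction of total_area, capped at 1."""
    if total_area <= 0:
        raise ValueError("total_area must be positive")
    penalty = sum(
        weights.get((e.kind, e.region_type), default) * e.area for e in errors
    )
    return min(1.0, penalty / total_area)


errors = [
    LayoutError("merge", "text", 200.0),
    LayoutError("miss", "image", 400.0),
]
score = weighted_error(errors, total_area=1000.0)
```

In this sketch, a missed image region costs less per pixel than merged text regions, which is one way a text-focused scenario might express its priorities.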
Participating systems will be evaluated at different stages (e.g. segmentation, classification, recognition) according to how far their methods are applicable within the analysis and recognition workflow; not all participating systems have to be end-to-end applications.
Participants will be provided with a number of tools developed by PRImA that can be used in order to prepare and optimise their method(s) for submission (as well as to examine the example set in detail). They will also be supported in implementing the required output format by means of a PAGE exporter class and additional information about the underlying XML Schema.
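As a rough idea of what producing PAGE-style output might involve, the following Python sketch writes text regions with polygon outlines and recognised text using only the standard library. It is not the PRImA exporter class mentioned above. The function name build_page_xml is hypothetical, the namespace version (2013-07-15) is an assumption about which schema version applies, and the output omits the metadata block and is not validated against the schema.

```python
import xml.etree.ElementTree as ET

# Assumed schema version; check the XML Schema supplied to participants.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"


def build_page_xml(image_filename, width, height, regions):
    """Build a simplified PAGE-style document.

    regions is a list of (polygon, text) pairs, where polygon is a list
    of (x, y) integer tuples outlining one text region.
    """
    ET.register_namespace("", PAGE_NS)  # serialise without a prefix
    root = ET.Element(f"{{{PAGE_NS}}}PcGts")
    page = ET.SubElement(
        root,
        f"{{{PAGE_NS}}}Page",
        {
            "imageFilename": image_filename,
            "imageWidth": str(width),
            "imageHeight": str(height),
        },
    )
    for index, (polygon, text) in enumerate(regions):
        region = ET.SubElement(page, f"{{{PAGE_NS}}}TextRegion", {"id": f"r{index}"})
        points = " ".join(f"{x},{y}" for x, y in polygon)
        ET.SubElement(region, f"{{{PAGE_NS}}}Coords", {"points": points})
        equiv = ET.SubElement(region, f"{{{PAGE_NS}}}TextEquiv")
        ET.SubElement(equiv, f"{{{PAGE_NS}}}Unicode").text = text
    return ET.tostring(root, encoding="unicode")


xml_text = build_page_xml(
    "page_0001.tif", 100, 200, [([(0, 0), (10, 0), (10, 10)], "abc")]
)
```

A system that only performs segmentation could pass an empty string as the text of each region; how such partial results should actually be encoded is defined by the schema documentation, not by this sketch.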