
HBR2015 ICDAR Competition on Historical Books Recognition

Overview

The competition presents challenges for page segmentation, region classification, and text recognition in an end-to-end scenario. The dataset contains scanned pages from a wide range of historical books with a variety of layouts and conditions. Participants will be provided with know-how and tools that aid the development or extension of their page analysis systems.

Background

Layout Analysis and Text Recognition are of fundamental importance among Document Image Analysis steps and have been (and continue to be) relatively well researched. Historical documents are of particular interest as they pose a number of challenges and, at the same time, represent a very large proportion of printed documents in existence. With the increasing number of digitisation projects initiated by libraries world-wide, the problem of analysis and recognition of these documents is very topical.

Historical books are of significant interest to Digital Humanities researchers and represent a large proportion of libraries’ holdings; they therefore continue to be the focus of large-scale digitisation projects. A number of artefacts frequently appear in scans of historical books, hindering layout analysis and text recognition. The motivation of the competition is to evaluate existing approaches using a realistic dataset and an objective performance analysis system.

The ICDAR2015 Competition on Historical Books Recognition follows the successful running of all previous ICDAR Page Segmentation competitions (2001, 2003, 2005, 2007, 2009, 2011 and 2013). The proposed competition will build upon the challenges from 2013, adding a new class of problems. As in 2013, the competition will cover layout analysis as well as text recognition, in an end-to-end scenario.

Dataset and evaluation methodology

The dataset to be used in this competition will be a subset of the dataset created by the major EU-funded IMPACT project, representing key holdings of major European libraries. It is realistic in that it represents a wide range of books in different languages (the recognition aspect of the competition will focus on English), with a variety of layouts and conditions typical of historical books that are likely to be of broad interest for digitisation. All material has been ground-truthed using Aletheia and is available in the PAGE format. The dataset will be made publicly available after the competition.
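As an illustration of how ground truth in an XML format such as PAGE might be read, the following is a minimal sketch in Python. The sample document, the namespace URI, the file name, the region ids and the function names `local` and `read_regions` are assumptions made for this example; the element and attribute names (`TextRegion`, `Coords`, `points`, `TextEquiv`, `Unicode`) should be checked against the PAGE XML Schema before relying on them. The parser ignores namespaces so that it does not depend on a particular schema version.

```python
import xml.etree.ElementTree as ET

# Hypothetical PAGE-style sample; names and namespace are assumptions
# for illustration and should be verified against the actual schema.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_0001.tif" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1" type="heading">
      <Coords points="100,100 900,100 900,200 100,200"/>
      <TextEquiv><Unicode>CHAPTER I.</Unicode></TextEquiv>
    </TextRegion>
    <ImageRegion id="r2">
      <Coords points="100,300 900,300 900,1200 100,1200"/>
    </ImageRegion>
  </Page>
</PcGts>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def read_regions(xml_text):
    """Return one dict per element whose local name ends in 'Region'."""
    root = ET.fromstring(xml_text)
    regions = []
    for elem in root.iter():
        kind = local(elem.tag)
        if not kind.endswith("Region"):
            continue
        points, text = [], None
        for child in elem:
            name = local(child.tag)
            if name == "Coords":
                # "x,y x,y ..." becomes a list of (x, y) integer tuples.
                points = [
                    tuple(int(v) for v in pair.split(","))
                    for pair in child.get("points", "").split()
                ]
            elif name == "TextEquiv":
                for sub in child:
                    if local(sub.tag) == "Unicode":
                        text = sub.text
        regions.append({
            "id": elem.get("id"),
            "kind": kind,
            "type": elem.get("type"),
            "points": points,
            "text": text,
        })
    return regions


if __name__ == "__main__":
    for region in read_regions(SAMPLE):
        print(region["id"], region["kind"], region["type"], region["text"])
```

Matching on the local element name keeps the sketch short; a production reader would more likely validate against the schema and handle nested regions and text lines explicitly.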

The competition will use the comprehensive evaluation approach successfully employed in recent ICDAR competitions. It has recently been extended to perform text-based evaluation (e.g. for OCR) as well. As a whole, it takes into account a wide range of situations and provides considerable detail on the performance of different methods. Each type of error is weighted according to the type of regions involved and the situation in which they are found. The evaluation tools used are freely available from the PRImA website.
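The idea of weighting each error by its kind and by the type of region involved can be sketched as follows. This is not the PRImA evaluation tool's actual formula: the error kinds, the weight values, the area-based normalisation and all names (`LayoutError`, `ERROR_WEIGHTS`, `REGION_WEIGHTS`, `weighted_error_rate`) are hypothetical and chosen only to make the principle concrete.

```python
from dataclasses import dataclass


@dataclass
class LayoutError:
    kind: str          # hypothetical kinds: "merge", "split", "miss", ...
    region_type: str   # hypothetical types: "text", "image", ...
    area: int          # affected area in pixels


# Hypothetical weights; a real evaluation profile would define its own.
ERROR_WEIGHTS = {"merge": 1.0, "split": 0.5, "miss": 1.0, "misclassification": 0.5}
REGION_WEIGHTS = {"text": 1.0, "image": 0.5}


def weighted_error_rate(errors, total_area):
    """Weighted error area divided by total region area, capped at 1.0."""
    if total_area <= 0:
        raise ValueError("total_area must be positive")
    weighted = sum(
        ERROR_WEIGHTS.get(e.kind, 1.0)
        * REGION_WEIGHTS.get(e.region_type, 1.0)
        * e.area
        for e in errors
    )
    return min(1.0, weighted / total_area)


if __name__ == "__main__":
    errors = [
        LayoutError("merge", "text", 1000),
        LayoutError("split", "image", 2000),
    ]
    print(weighted_error_rate(errors, total_area=10000))
```

Changing the two weight tables is what would tailor such a measure to a particular use scenario, for example penalising merged text regions more heavily than split image regions.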

Participating systems will be evaluated in different stages (i.e. segmentation, classification, recognition) according to how far their methods are applicable within the analysis and recognition workflow – not all participating systems have to be end-to-end applications. The organisers will offer assistance to participants on how to integrate an open-source OCR module into their workflow.
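For the recognition stage, one widely used text-based measure is character accuracy derived from the edit distance between ground-truth text and OCR output. The sketch below illustrates that general technique only; it is an assumption that a measure of this kind is relevant here, and the function names `edit_distance` and `character_accuracy` are made up for the example rather than taken from the competition's evaluation tools.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth, ocr_output):
    """1 - (edit distance / ground-truth length), clamped to [0, 1]."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(ground_truth, ocr_output)
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    print(character_accuracy("Historical", "Hist0rical"))
```

Because the measure compares plain strings, it can be applied to a system that only performs recognition on given regions as well as to a full end-to-end pipeline.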

Additional information

Participants will be provided with a number of tools developed by PRImA that can be used in order to prepare and optimise their method(s) for submission (as well as to examine the example set in detail). They will also be supported in implementing the required output format by means of PAGE exporter modules (C++ or Java) and additional information about the underlying XML Schema.

Acknowledgment

This work has been supported in part through the EU 7th Framework Programme grant SUCCEED (Ref. 600555).

SUCCEED Project