The competition presents challenges for page segmentation, region classification, and text recognition in an end-to-end scenario. The dataset contains scanned pages from contemporary magazines and technical articles. Participants will be provided with know-how and tools that aid the development or extension of their page analysis systems.
|Dataset||Pages of magazines and technical articles|
|Main challenge||Page segmentation and region classification|
|Bonus challenge||Text recognition (OCR)|
A few examples can be found in this document.
There's also a short YouTube video.
Layout Analysis and Text Recognition are of fundamental importance among Document Image Analysis steps and have been (and continue to be) relatively well researched. While the recognition of contemporary documents with simple layout can be considered solved, complex layouts, such as found in magazines and some technical articles, still pose a significant challenge. With the increasing number of digitisation projects initiated by libraries world-wide, the problem of analysis and recognition of these documents is very topical.
Although this competition focuses on scanned pages of printed documents, the outcome is also applicable to the recognition of digital documents (e.g. PDF) with complex layouts. The page scans, comprising the dataset, are of good quality and have little to none distortions, noise and other artefacts. The motivation of the competition is to evaluate existing approaches using a realistic dataset and an objective performance analysis system.
The ICDAR2017 Recognition of Documents with Complex Layouts Competition follows the successful running of all previous ICDAR Page Segmentation and Layout Analysis competitions (2001, 2003, 2005, 2007, 2009, 2011, 2013, and 2015). The proposed competition will build upon the challenges of the previous editions, extending the dataset and covering text recognition as well as layout analysis, in an end-to-end workflow scenario.
The dataset to be used in this competition will be a subset of the publicly available PRImA Layout Analysis Dataset (to be further extended for this competition). It contains realistic documents with a wide variety of layouts, reflecting the various challenges in layout analysis. Particular emphasis is placed on contemporary magazines and technical/scientific publications which are likely to be the focus of digitisation efforts. All material has been ground-truthed using Aletheia and is available in the PAGE format.
The competition will use the comprehensive evaluation approach successfully employed in recent ICDAR competitions. It has been recently extended to perform text-based evaluation (e.g. for OCR) as well. As a whole, it takes into account a wide range of situations and provides considerable details on the performance of different methods. Each type of error is weighted according to the type of regions involved and the situation they are found. The evaluation tools used are freely available from the PRImA website.
Participating systems will be evaluated in different stages (i.e. segmentation, classification, recognition) according to how far their methods are applicable within the analysis and recognition workflow – not all participating systems have to be end-to-end applications. The organisers will offer assistance to participants on how to integrate an open-source OCR module into their workflow.
Participants will be provided with a number of tools developed by PRImA that can be used in order to prepare and optimise their method(s) for submission (as well as to examine the example set in detail). They will also be supported in implementing the required output format by means of PAGE exporter modules (C++ or Java) and additional information about the underlying XML Schema.