A Realistic Dataset for Performance Evaluation of Document Layout Analysis

A. Antonacopoulos, D. Bridson, C. Papadopoulos, S. Pletschacher

Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR2009), Barcelona, Spain, July 2009, pp. 296-300

Abstract

There is a significant need for a realistic dataset on which to evaluate layout analysis methods and examine their performance in detail. This paper presents a new dataset (and the methodology used to create it) based on a wide range of contemporary documents. Strong emphasis is placed on comprehensive and detailed representation of both complex and simple layouts, and on colour originals. In-depth information is recorded both at the page and region level. Ground truth is efficiently created using a new semi-automated tool and stored in a new comprehensive XML representation, the PAGE format. The dataset can be browsed and searched via a web-based front end to the underlying database and suitable subsets (relevant to specific evaluation goals) can be selected and downloaded.

Citation

A. Antonacopoulos, D. Bridson, C. Papadopoulos, S. Pletschacher, "A Realistic Dataset for Performance Evaluation of Document Layout Analysis", Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR2009), Barcelona, Spain, July 2009, pp. 296-300

DOI

10.1109/ICDAR.2009.271

Related Projects

IMPACT - Improving Access to Text