University of Salford
PRImA - Pattern Recognition & Image Analysis Group
     
PRImA Layout Analysis Dataset :: Content

Content

The dataset includes a wide variety of different document types, reflecting the various challenges in layout analysis. Particular emphasis is placed on:

  • Magazine scans from a variety of mainstream news, business and technology publications, which contain a mixture of simple and complex layouts (e.g. non-Manhattan layouts and varying font sizes).

  • Technical articles from a variety of disciplines, including papers in journals and conference proceedings, with both simple and complex layouts.

Volume

Currently, there are 305 ground-truthed images in the final release, and more are being added.

Structure

In addition to the downloadable collection of images and ground truth, the dataset contains searchable document-level metadata, which can be accessed through the web front-end. Subsets of the dataset that fulfil user-selected criteria can be created and downloaded.

Ground Truth Format

Ground truth is described and stored according to the PAGE (Page Analysis and Ground-Truth Elements) framework. PAGE was developed by PRImA, drawing on long experience in creating, managing and using datasets, including the large and significant historical document datasets of the EU-funded IMPACT project.

More details on the PAGE format can be found in the following paper:

S. Pletschacher and A. Antonacopoulos, "The PAGE (Page Analysis and Ground-Truth Elements) Format Framework", Proceedings of the 20th International Conference on Pattern Recognition (ICPR2010), Istanbul, Turkey, August 23-26, 2010, IEEE-CS Press, pp. 257-260.

The format can represent several different region types, which may be subject to different processing in recognition systems. The most important region types are text, image, line drawing, graphic, table, chart, separator, maths, noise and frame. In the ground truth, the highest-level textual regions correspond to paragraphs. This is a conscious choice: unlike a column of text, for instance, a paragraph is a complete logical entity. The format also allows text regions to be subdivided further into text lines, words and glyphs, so that segmentation methods can be evaluated at those lower levels as well.
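The subdivision of text regions described above can be pictured as a simple nested data structure. The following is a minimal sketch only: the class names (`TextRegion`, `TextLine`, `Word`, `Glyph`) and the helper `paragraph_from_string` are illustrative assumptions and are not taken from the PAGE schema itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Glyph:
    text: str  # a single character shape


@dataclass
class Word:
    glyphs: List[Glyph] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "".join(g.text for g in self.glyphs)


@dataclass
class TextLine:
    words: List[Word] = field(default_factory=list)

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)


@dataclass
class TextRegion:
    # Highest-level textual unit; corresponds to a paragraph in the ground truth.
    lines: List[TextLine] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(line.text for line in self.lines)


def paragraph_from_string(s: str) -> TextRegion:
    """Hypothetical helper: build the four-level hierarchy from plain text,
    creating one text line per input line and one glyph per character."""
    lines = []
    for raw_line in s.splitlines():
        words = [Word([Glyph(ch) for ch in token]) for token in raw_line.split()]
        lines.append(TextLine(words))
    return TextRegion(lines)


region = paragraph_from_string("Layout analysis\nof documents")
print(len(region.lines))                     # number of text lines
print(region.lines[0].words[1].text)         # second word of the first line
print(len(region.lines[0].words[0].glyphs))  # glyph count of the first word
```

Keeping each level as its own type means a segmentation result can be compared against the ground truth at whichever level (region, line, word or glyph) an evaluation targets.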

The outline of each region is described by a closely fitting polygon. Such a representation enables a very accurate and efficient geometric description, especially for complex-shaped regions. A range of metadata is recorded for each type of region. For example, text regions hold information about language, font, reading direction, text colour, background colour and logical label (e.g. heading, paragraph, caption, footer), among others.
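To illustrate why a closely fitting polygon describes a complex-shaped region more accurately than a bounding box, the sketch below compares the two areas for an L-shaped region. The `Region` class, its field names and the metadata keys are assumptions made for this example, not part of the PAGE format definition; the area is computed with the standard shoelace formula.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass
class Region:
    region_type: str      # e.g. "text", "image", "table"
    outline: List[Point]  # polygon vertices, listed in order around the region
    metadata: Dict[str, str] = field(default_factory=dict)


def polygon_area(points: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def bounding_box_area(points: List[Point]) -> float:
    """Area of the axis-aligned rectangle enclosing all vertices."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


# An L-shaped text region, such as text wrapping around an inset picture.
l_shape = Region(
    region_type="text",
    outline=[(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)],
    metadata={"language": "English", "logical_label": "paragraph"},
)

print(polygon_area(l_shape.outline))       # area covered by the region itself
print(bounding_box_area(l_shape.outline))  # area of the enclosing rectangle
```

For this shape the bounding box covers more area than the region does, and the surplus is exactly the part of the page that belongs to some other region.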

Moreover, the format offers sophisticated means of expressing reading order and more complex relations between regions. For example, groups may contain ordered or unordered elements and may themselves be nested. This is important for documents with intricate logical structures, such as newspapers.
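Nested ordered and unordered groups can be modelled as a small tree. The sketch below is one possible in-memory representation, under the assumption that a group simply holds region references or further groups; the names `RegionRef`, `Group` and `flatten` are hypothetical and do not mirror the PAGE schema.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class RegionRef:
    region_id: str  # identifier of a region on the page


@dataclass
class Group:
    ordered: bool  # True: members are read in sequence; False: no defined order
    members: List[Union["Group", RegionRef]] = field(default_factory=list)


def flatten(node: Union[Group, RegionRef]) -> List[str]:
    """Return region ids in one traversal of the tree.

    Inside an unordered group the stored sequence is kept, but it carries
    no meaning; any permutation of those members would be equally valid.
    """
    if isinstance(node, RegionRef):
        return [node.region_id]
    ids: List[str] = []
    for member in node.members:
        ids.extend(flatten(member))
    return ids


# A newspaper-like page: a headline, followed by two independent articles.
article_a = Group(ordered=True, members=[RegionRef("a1"), RegionRef("a2")])
article_b = Group(ordered=True, members=[RegionRef("b1")])
page = Group(
    ordered=True,
    members=[
        RegionRef("headline"),
        Group(ordered=False, members=[article_a, article_b]),
    ],
)

print(flatten(page))
```

Here each article has an internal reading order, while the unordered group records that neither article has to be read before the other.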


Maintained by: Christos Papadopoulos, Apostolos Antonacopoulos - © 2009-2017