
DMAS2019 Resources

Overview

Provided to participants:

  • Example images with OCR results and ground truth (in PAGE XML format)
  • Software libraries and tools to create/view PAGE
  • Support

To be delivered by participants:

  • Page classification results and article segmentation + classification results in valid PAGE XML (see examples below)
  • Access to the candidate methods (executable, Docker, web service or similar)
  • A short description of the method (250 words)

Ground Truth Format

The ground truth for each image is provided in the PAGE (Page Analysis and Ground truth Elements) format. For a description of the parts of the XML file structure that are relevant to this competition, see the section "Page Analysis and Recognition Results" below.

PAGE has been developed on the basis of long experience in creating, managing and using datasets, including the PRImA Layout Analysis Dataset and the large historical document dataset of the EU-funded IMPACT project.

More details on the PAGE format can be found in the following paper:

S. Pletschacher, A. Antonacopoulos, "The PAGE (Page Analysis and Ground-Truth Elements) Format Framework", Proceedings of the 20th International Conference on Pattern Recognition (ICPR2010), Istanbul, Turkey, August 23-26, 2010, IEEE-CS Press, pp. 257-260.

And in the actual XML Schema:

http://schema.primaresearch.org/PAGE/gts/pagecontent/2018-07-15/pagecontent.xsd

The format provides for the representation of several different page types and region/block types.

For each region there is a description of its outline in the form of a closely fitting polygon. Such a representation enables a very accurate and efficient geometric description, especially for complex-shaped regions. Text regions may also contain Unicode text content.
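
As a minimal sketch of how such an outline can be handled in code, the Python snippet below parses a polygon written as a space-separated list of `x,y` pairs and computes its bounding box. It assumes the outline is stored in that form (as in the `points` attribute of `Coords` in recent PAGE schema versions; please check this against the schema). The function names `parse_points` and `bounding_box` are illustrative and not part of any PAGE tool.

```python
def parse_points(points: str) -> list:
    """Parse 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    polygon = []
    for pair in points.split():
        x, y = pair.split(",")
        polygon.append((int(x), int(y)))
    return polygon


def bounding_box(polygon: list) -> tuple:
    """Return (left, top, right, bottom) of a polygon given as (x, y) tuples."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical outline of a rectangular region
outline = parse_points("10,20 110,20 110,80 10,80")
print(bounding_box(outline))
```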

A simple example XML file is described in this document.

Submission requirements

Authors of methods should submit the following by e-mail to the organisers:

  1. Page analysis and recognition results in PAGE format (see below)
  2. Access to the executables/systems of the candidate methods
  3. A short description (250 words) of the methods (principles of operation and steps). Cite and attach any relevant publications, if available.

Page Analysis and Recognition Results

The results must be stored in the PAGE format (same format as the ground-truth provided). Evaluation will be based on page classification, article grouping, and article classification.

Open source tools for exporting in the PAGE format are available from the PRImA Tools website.

Alternatively you can produce PAGE files using your own XML library, following the PAGE Schema.
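
For participants taking this route, the following is a rough sketch using Python's standard `xml.etree.ElementTree` module. It writes only the parts evaluated in this competition (the page type and the typed reading order groups) and deliberately omits the `Metadata` element and the region elements themselves, so its output is not schema-valid on its own. The function name `build_page`, the top-level group id `ro_top`, and the choice to nest all article groups under a single top-level group are assumptions made for illustration; the provided example files and the schema remain the authority.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the schema URL given above
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2018-07-15"


def build_page(image_filename, width, height, page_type, groups):
    """Build a skeletal PAGE document as a string.

    groups: list of (group_id, group_type, [region_id, ...]) tuples.
    """
    ET.register_namespace("", PAGE_NS)
    q = lambda name: "{%s}%s" % (PAGE_NS, name)  # qualify a tag name

    root = ET.Element(q("PcGts"))
    page = ET.SubElement(root, q("Page"), {
        "imageFilename": image_filename,
        "imageWidth": str(width),
        "imageHeight": str(height),
        "type": page_type,
    })
    reading_order = ET.SubElement(page, q("ReadingOrder"))
    # Assumption: one top-level group that holds one group per article
    top = ET.SubElement(reading_order, q("UnorderedGroup"), {"id": "ro_top"})
    for group_id, group_type, region_ids in groups:
        group = ET.SubElement(top, q("UnorderedGroup"),
                              {"id": group_id, "type": group_type})
        for region_id in region_ids:
            ET.SubElement(group, q("RegionRef"), {"regionRef": region_id})
    return ET.tostring(root, encoding="unicode")


# Hypothetical page with one article and one advertisement
print(build_page("00001.tif", 2000, 3000, "content",
                 [("g1", "article", ["r1", "r2"]), ("g2", "div", ["r3"])]))
```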

Aletheia, a PAGE viewer and editor, is also available for download, so you can preview your results and check the validity of the XML files you produce.

Filenames of submitted PAGE files should match the name of the original image.
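
One possible reading of this rule, sketched in Python: keep the image's base name and replace only the extension. The `.xml` extension and the helper name `page_filename_for` are assumptions; if in doubt, follow the naming used in the provided example files.

```python
from pathlib import Path


def page_filename_for(image_path: str) -> str:
    """Return the PAGE filename for an image: same base name, '.xml' extension."""
    return Path(image_path).with_suffix(".xml").name


# Hypothetical image path
print(page_filename_for("scans/00001234.tif"))
```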

Page Classification

  • <Page ... type="...">
  • One of: front-cover, table-of-contents, content, index
    • front-cover (for cover pages; but only if there are no articles, advertisements etc.)
    • table-of-contents (for table of contents pages with only headings and page numbers)
    • content (for common content pages)
    • index (for index / register pages (whole page); a list or table with alphabetised text and page numbers)
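
Before submitting, the page type can be checked with a few lines of Python. This is a sketch only: it reads the `type` attribute of the first `Page` element and compares it with the four values listed above. Tags are matched by local name so that the check does not depend on the schema version in the namespace; the helper names are illustrative.

```python
import xml.etree.ElementTree as ET

# The four page types accepted in this competition (from the list above)
PAGE_TYPES = {"front-cover", "table-of-contents", "content", "index"}


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_page_type(xml_text: str) -> str:
    """Return the Page type, raising ValueError if it is missing or not allowed."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if local_name(element.tag) == "Page":
            page_type = element.get("type")
            if page_type not in PAGE_TYPES:
                raise ValueError("unexpected page type: %r" % page_type)
            return page_type
    raise ValueError("no Page element found")
```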

Page type in Aletheia

Article Grouping / Segmentation

  • Reading order groups, each referencing all regions belonging to an article, advertisement, or illustration with caption
  • <ReadingOrder> <UnorderedGroup id="..."> <RegionRef regionRef="..."/> ...
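
To inspect the grouping in a produced file, the groups can be read back into a simple mapping. The Python sketch below collects, for every `UnorderedGroup` that directly contains `RegionRef` elements, the list of referenced region ids. It assumes that each article-level group references its regions as direct children; the function name `article_groups` is illustrative.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def article_groups(xml_text: str) -> dict:
    """Map group id -> list of region ids referenced directly by that group."""
    root = ET.fromstring(xml_text)
    groups = {}
    for element in root.iter():
        if local_name(element.tag) != "UnorderedGroup":
            continue
        refs = [child.get("regionRef") for child in element
                if local_name(child.tag) == "RegionRef"]
        if refs:  # skip pure container groups without direct region references
            groups[element.get("id")] = refs
    return groups
```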

Article Classification

  • <UnorderedGroup ... type="...">
  • One of:
    • article (for standard articles; contains texts and may also include images; can continue in the next column)
    • figure (for illustrations with caption; not part of an article, but can have text in form of a caption or heading)
    • list (for table of contents / index / register (part of page))
    • div (for advertisements; advertisements often have logos and graphical elements; a group of advertisements may be clustered in one segment when they are printed within one column and have the same width)
    • other (for colophons; Colophons describe who worked on a magazine and who is responsible for its contents)
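
A quick sanity check for the group types might look like the Python sketch below. It reports every `UnorderedGroup` that directly references regions but whose `type` is missing or not one of the five values listed above. Treating only groups with direct `RegionRef` children as article-level groups is an assumption, and the function name `invalid_group_types` is illustrative.

```python
import xml.etree.ElementTree as ET

# The five group types accepted in this competition (from the list above)
GROUP_TYPES = {"article", "figure", "list", "div", "other"}


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def invalid_group_types(xml_text: str) -> list:
    """Return (group id, type) pairs for groups with a missing or unknown type."""
    root = ET.fromstring(xml_text)
    problems = []
    for element in root.iter():
        if local_name(element.tag) != "UnorderedGroup":
            continue
        has_refs = any(local_name(child.tag) == "RegionRef" for child in element)
        if has_refs and element.get("type") not in GROUP_TYPES:
            problems.append((element.get("id"), element.get("type")))
    return problems
```

An empty result means every article-level group carries one of the expected types.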

Articles in Aletheia

Useful Hints

  • Page Type
    • Cover (front-cover)
      • Only mark as a cover when, apart from the title information, there are no advertisements, articles, index, illustrations with captions or a colophon
      • It can only be the first page of an issue
      • It is always a full page
    • Table of contents
      • The table of contents is often recognised by it being a list or table with text and page numbers
      • Tables of contents often have the title: ‘Inhoud’, ‘Inhoudsopgave’, ‘In dit nummer’, ‘In deze aflevering’.
      • Only mark when the full page is a table of contents
    • Register/index
      • The register/index is recognised by it being a list or table with alphabetised text and page numbers
      • An index can cover multiple pages and usually spans one or more published years of a magazine (as opposed to a table of contents, which only spans one issue)
      • Registers/indexes often have the title: ‘Register’, ‘Index’, ‘Meerjarenregister’, ‘Alfabetische index’ or ‘Alphabetische index’.
      • Only mark when the full page is a register/index
    • Content
      • Any page that contains (multiple) articles in all variations and is not a title page, a full page of register/index or a full page of table of contents.
  • Article / Content Type - For all pages marked ‘content’, segment the separate articles and classify as follows:
    • Article
      • An article contains texts and may also include images
      • Articles can continue in the next column or the next page (subsequent pages may not be included in the data)
      • Tables could be included in an article
      • Page numbers, running titles, footnotes and text in the margins are not included in the articles and should be ignored
      • The title information of a page can take up a large part of the first page or the front cover. This information is not marked as an article
    • Illustration with caption (figure)
      • Illustrations with a caption (photos, drawings, cartoons, maps, etc.) are coded as separate articles, even if they belong to an article.
      • When the image is larger than the text related to the image it is an illustration with caption. When the text is larger than the image, it is an article
    • Advertisement (div)
      • Advertisements are texts that offer goods or services
      • They are sometimes indicated with the heading ‘advertentie’, ‘advertenties’ or ‘advertentiën’.
      • Advertisements often have logos and graphical elements
      • A group of advertisements may be clustered in one segment when they are printed within one column and have the same width
    • Colophon (other)
      • Colophons describe who worked on a magazine and who is responsible for its contents
      • The colophon is often marked with ‘colofon’, ‘redactie’, ‘hoofdredacteur’, ‘hoofdredactrice’, ‘redactieraad’, ‘verantwoordelijk redacteur’
    • Table of contents (list)
      • When a table of contents occupies only part of a page, it is segmented like an article and classified as a list
      • The table of contents is often recognised by it being a list or table with text and page numbers
      • Tables of contents often have the title: ‘Inhoud’, ‘Inhoudsopgave’, ‘In dit nummer’, ‘In deze aflevering’
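
The size rule given under "Illustration with caption (figure)" can be approximated in code. The Python sketch below interprets "larger" as polygon area, computed with the shoelace formula over region outlines given as lists of (x, y) points. That interpretation, the handling of ties (treated here as an article) and the function names are all assumptions; the hints do not define how size should be measured.

```python
def polygon_area(polygon: list) -> float:
    """Area of a simple polygon given as (x, y) tuples (shoelace formula)."""
    doubled = 0
    count = len(polygon)
    for i in range(count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % count]  # wrap around to close the polygon
        doubled += x1 * y2 - x2 * y1
    return abs(doubled) / 2


def figure_or_article(image_polygons: list, text_polygons: list) -> str:
    """Return 'figure' if the images cover more area than the related text."""
    image_area = sum(polygon_area(p) for p in image_polygons)
    text_area = sum(polygon_area(p) for p in text_polygons)
    # Assumption: equal areas fall back to 'article'
    return "figure" if image_area > text_area else "article"


# Hypothetical photo (100 x 100) with a short caption (100 x 20)
photo = [(0, 0), (100, 0), (100, 100), (0, 100)]
caption = [(0, 105), (100, 105), (100, 125), (0, 125)]
print(figure_or_article([photo], [caption]))
```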

Example dataset

The example is available to participants after registration.

The following are examples of representative images from the variety of situations existing within the evaluation dataset.

Evaluation dataset

The evaluation set is now available (emailed on 15/04/2019; please contact us if you have not received it).