Document Representation Refinement for Precise Region Description

C. Clausner, S. Pletschacher, A. Antonacopoulos

Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage (DATeCH2014), Madrid, Spain, May 2014, pp. 9-13

Abstract

Precise description of layout entities (content regions on a page) is crucial for all but the most trivial document analysis and recognition applications. The output of layout analysis methods and state-of-the-art OCR systems varies significantly, from bounding boxes (e.g. Tesseract) to stacks of text line rectangles (e.g. ABBYY FineReader). There is a clear need for a consistent and accurate representation of regions (e.g. text paragraphs, graphics entities etc.) for further processing, correction and performance evaluation (comparison of segmentation results with ground truth regions). This paper describes a method for refinement of document representations by fitting polygons around lower-level layout objects (such as text lines, words and glyphs) in a systematic way that reconstructs region outlines and preserves the fine details of complex layouts. Experimental results on a standard dataset demonstrate the validity and usefulness of the proposed approach.
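To illustrate the general idea of deriving a region outline from lower-level objects, the following is a minimal sketch in Python. It is not the algorithm from the paper: it assumes the text lines of one region are given as axis-aligned rectangles stacked vertically, and it traces a rectilinear polygon around them. The names `outline_from_lines` and `polygon_area` are hypothetical helpers introduced only for this example.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom), image coordinates
Point = Tuple[int, int]


def outline_from_lines(lines: List[Rect]) -> List[Point]:
    """Trace a rectilinear polygon around vertically stacked line boxes.

    Simplifying assumption (not from the paper): each rectangle is one
    text line, and the lines do not sit side by side.
    """
    if not lines:
        return []
    rows = sorted(lines, key=lambda r: r[1])  # order lines top to bottom

    # Split the gap between consecutive lines at its midpoint so the
    # outline has no holes between rows.
    cuts = [rows[0][1]]
    for upper, lower in zip(rows, rows[1:]):
        cuts.append((upper[3] + lower[1]) // 2)
    cuts.append(rows[-1][3])

    right_side: List[Point] = []
    left_side: List[Point] = []
    for i, (left, _top, right, _bottom) in enumerate(rows):
        right_side += [(right, cuts[i]), (right, cuts[i + 1])]
        left_side += [(left, cuts[i]), (left, cuts[i + 1])]

    # Walk down the right edge, then back up the left edge.
    polygon: List[Point] = []
    for point in right_side + left_side[::-1]:
        if not polygon or polygon[-1] != point:  # drop repeated vertices
            polygon.append(point)
    return polygon


def polygon_area(points: List[Point]) -> float:
    """Shoelace formula; returns the absolute area of a simple polygon."""
    total = 0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


if __name__ == "__main__":
    # A paragraph whose last line is shorter than the first.
    paragraph = [(10, 0, 100, 10), (10, 12, 60, 22)]
    outline = outline_from_lines(paragraph)
    print(outline)
    print(polygon_area(outline))
```

For a paragraph with a short last line, the traced polygon covers less area than the enclosing bounding box, which is the kind of detail a single rectangle per region cannot capture.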

Citation

C. Clausner, S. Pletschacher, A. Antonacopoulos, "Document Representation Refinement for Precise Region Description", Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage (DATeCH2014), Madrid, Spain, May 2014, pp. 9-13

DOI

10.1145/2595188.2595198

Full Paper

Download PDF

Related Projects

Europeana Newspapers, SUCCEED