Aletheia is an advanced system for accurate yet cost-effective analysis, recognition and annotation of scanned documents. It supports the user with a number of automated and semi-automated tools that were developed and fine-tuned based on feedback from major libraries across Europe and from their digitisation service providers, who use it in a production environment.
A web-based version of the Aletheia Document Analysis System, supporting a selected subset of features.
This tool is part of a framework for evaluating the performance of layout analysis methods. It combines efficiency and accuracy by using a special interval-based geometric representation of regions. A wide range of sophisticated evaluation measures provides deep insight into the analysed systems, going far beyond simple benchmarking. Support for user-defined profiles allows tuning for any kind of evaluation scenario related to real-world applications.
The Text Evaluation tool implements the word and character accuracy measures developed at the University of Nevada, Las Vegas (UNLV; dissertation by S. V. Rice). It is complemented by a bag-of-words method that is independent of the reading order.
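To make the difference between the order-sensitive and the order-independent measures concrete, here is a minimal sketch. It assumes the common edit-distance formulation of accuracy, (n - errors) / n with n the ground-truth length, and a simple multiset comparison for the bag-of-words case. The function names are hypothetical, and this is not the tool's actual implementation, which may differ in detail (for example in how words are tokenised and aligned).

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(ground_truth, result):
    """(n - errors) / n, where n is the ground-truth length.

    Works on strings (character accuracy) or on lists of words
    (an order-sensitive word accuracy). Assumed formulation.
    """
    n = len(ground_truth)
    if n == 0:
        return 1.0 if len(result) == 0 else 0.0
    return (n - edit_distance(ground_truth, result)) / n


def bag_of_words_accuracy(ground_truth, result):
    """Fraction of ground-truth words found in the result, ignoring order."""
    gt = Counter(ground_truth.split())
    ocr = Counter(result.split())
    total = sum(gt.values())
    if total == 0:
        return 1.0
    matched = sum((gt & ocr).values())  # multiset intersection
    return matched / total
```

With the ground truth "a b c" and the result "b a c", the order-sensitive word accuracy from `accuracy("a b c".split(), "b a c".split())` drops to one third because two words are out of place, while `bag_of_words_accuracy("a b c", "b a c")` stays at 1.0. This is why a bag-of-words measure is useful when the reading order of the OCR result cannot be trusted.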
Platform-independent libraries for Java and C++ for creating valid layout descriptions in PAGE XML format. The libraries can be easily integrated into other software projects, such as page segmentation methods for ICDAR competitions.
This tool converts page layout files to the latest PAGE XML format.
PAGE Metadata Scanner is a Java command line tool that scans a single PAGE XML file (document page layout and text content) and outputs its properties/statistics as comma-separated values.
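The kind of scan described here can be illustrated with a short Python sketch that reads a PAGE XML document and emits a few properties as one CSV row. The sample document, the helper names and the chosen columns are assumptions for illustration only; the actual tool is written in Java and its output columns may differ.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical minimal PAGE XML document used only for this sketch.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page1.tif" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1"><Coords points="10,10 500,10 500,200 10,200"/></TextRegion>
    <TextRegion id="r2"><Coords points="10,300 500,300 500,600 10,600"/></TextRegion>
    <ImageRegion id="r3"><Coords points="600,10 900,10 900,400 600,400"/></ImageRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the namespace so the scan works across schema versions."""
    return tag.rsplit("}", 1)[-1]


def scan_page(xml_text):
    """Collect a few page properties and counts of top-level regions."""
    root = ET.fromstring(xml_text)
    page = next(el for el in root.iter() if local_name(el.tag) == "Page")
    # Only direct children of Page are counted; nested regions are ignored.
    regions = [local_name(el.tag) for el in page
               if local_name(el.tag).endswith("Region")]
    return {
        "imageFilename": page.get("imageFilename", ""),
        "imageWidth": page.get("imageWidth", ""),
        "imageHeight": page.get("imageHeight", ""),
        "regionCount": len(regions),
        "textRegionCount": regions.count("TextRegion"),
    }


def to_csv(row):
    """Serialise one result row as a CSV header line plus a value line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(row), lineterminator="\n")
    writer.writeheader()
    writer.writerow(row)
    return buf.getvalue()
```

Calling `to_csv(scan_page(SAMPLE))` yields a header line followed by one line of values for the page. Matching on local element names rather than a fixed namespace is a design choice that keeps the sketch usable with different PAGE schema versions.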
Tesseract to PAGE is a Windows command line tool to analyse a document image using the open source OCR engine Tesseract and export the results to PAGE (Page Analysis and Ground truth Elements) XML format.
The PAGE Extractor/Exporter is a Windows command line tool to extract document snippets (image / layout description) for layout elements of documents in PAGE XML format. Furthermore, the text content of layout regions can be serialised according to the reading order and exported into a text file.
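Serialising region text according to the reading order can be sketched in a few lines of Python. The sketch assumes a flat `OrderedGroup` of `RegionRefIndexed` entries and region-level `TextEquiv` elements; the sample document and the function name are hypothetical, and nested or unordered groups, which PAGE XML also allows, are not handled.

```python
import xml.etree.ElementTree as ET

# Namespace of one PAGE schema version; other versions use a different date.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical sample: the reading order lists r2 (heading) before r1 (body).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page1.tif" imageWidth="2000" imageHeight="3000">
    <ReadingOrder>
      <OrderedGroup id="g1">
        <RegionRefIndexed index="0" regionRef="r2"/>
        <RegionRefIndexed index="1" regionRef="r1"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1">
      <Coords points="10,300 500,300 500,600 10,600"/>
      <TextEquiv><Unicode>Body text.</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2">
      <Coords points="10,10 500,10 500,200 10,200"/>
      <TextEquiv><Unicode>Heading</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""


def text_in_reading_order(xml_text):
    """Return region texts joined by newlines, following the reading order."""
    root = ET.fromstring(xml_text)
    page = root.find("pc:Page", NS)
    # Region-level text only: TextEquiv as a direct child of TextRegion.
    texts = {
        region.get("id"): region.findtext(
            "pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
        for region in page.findall("pc:TextRegion", NS)
    }
    refs = page.findall(
        "pc:ReadingOrder/pc:OrderedGroup/pc:RegionRefIndexed", NS)
    refs.sort(key=lambda ref: int(ref.get("index")))
    return "\n".join(texts[ref.get("regionRef")]
                     for ref in refs if ref.get("regionRef") in texts)
```

Although `r1` appears first in the file, `text_in_reading_order(SAMPLE)` places the heading before the body text because the order is taken from the `index` attributes, not from document order.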
The Page Viewer tool is a simple viewer for the page layout and text content of segmentation ground truth and of results from page recognition/OCR systems. The natively supported file format is PAGE XML; ALTO XML, FineReader XML, and hOCR files can be opened as well.
A web-based page layout editor created for the eMOP Project, a crowdsourcing initiative funded by the Mellon Foundation.