2013/064 - Using Ancestral Layout Models for Document Digitization
DATeCH conference on Digital Access to Textual Cultural Heritage, Madrid, Spain, 19th - 20th May 2014.
In this article, we show how concepts from traditional layout practices (ruling, grid) can improve document digitization. We first present these basic layout methods, some in use since Antiquity, and explain how their key concepts can be ‘translated’ for use in today’s document digitization. In particular, we show that the traditional concept of type area is a key notion for modeling document layout, and we detail an algorithm to compute it. We then illustrate this work with several practical uses and evaluations, ranging from OCR improvement to high-level logical segmentation.