NAVER LABS Europe

ICDAR-OST is held in conjunction with IAPR from 10th to 12th November 2017. This two-day event aims to promote open tools, open software and open services in the field of Document Image Analysis research.

Jean-Luc Meunier and Hervé Déjean, Transkribus Python Toolkit [PDF]

Jean-Luc Meunier, PyStruct Extension for Typed CRF Graphs [PDF]

This paper introduces an extension to PyStruct, an open source Python library for structured machine learning based on general Conditional Random Field (CRF) models. We have extended it to support multi-type CRFs, which jointly classify objects of different natures, and to support logical constraints at prediction time. Our motivation for extending the library is to address Document Understanding (DU) tasks by collectively classifying the objects present on one or several pages, jointly considering textual objects and objects of other natures, such as tables, images, or any other object that can be recognized. The logical constraints are meant to reflect prior knowledge about the DU task. We focus here on the improvements made over PyStruct and also give practical information. We also present a reproducible experiment. This new library is publicly available on GitHub under the Simplified BSD License.
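
To make the two ideas in the abstract concrete, here is a minimal sketch in plain Python. It does not use PyStruct or the extension, and it is not the paper's method: all names, scores and the constraint are hypothetical, and inference is done by brute-force enumeration, which only works on toy graphs. It shows nodes of two types with separate label sets being labeled jointly, and a logical constraint ("at most one heading") applied only at prediction time.

```python
from itertools import product

# Hypothetical label sets: each node type has its own labels.
LABELS = {
    "text": ["heading", "paragraph"],
    "image": ["figure", "logo"],
}

# Toy page: (node name, node type, unary score per label). All scores are made up.
nodes = [
    ("n0", "text", {"heading": 3.0, "paragraph": 1.0}),
    ("n1", "text", {"heading": 4.0, "paragraph": 1.0}),
    ("n2", "image", {"figure": 2.0, "logo": 1.0}),
]

# Pairwise scores per edge, keyed by (label of first node, label of second node).
# Label pairs that are not listed contribute 0.
edges = {
    (0, 1): {("heading", "paragraph"): 1.0},   # text-text edge
    (1, 2): {("paragraph", "figure"): 1.0},    # text-image edge
}


def score(assignment):
    """Total score of a joint labeling: unary terms plus pairwise terms."""
    total = sum(unary[label] for (_, _, unary), label in zip(nodes, assignment))
    total += sum(
        table.get((assignment[i], assignment[j]), 0.0)
        for (i, j), table in edges.items()
    )
    return total


def at_most_one_heading(assignment):
    """Hypothetical prior knowledge: a page has at most one heading."""
    return sum(label == "heading" for label in assignment) <= 1


def map_inference(constraint=None):
    """Brute-force search for the best labeling, optionally under a constraint."""
    candidates = product(*(LABELS[node_type] for _, node_type, _ in nodes))
    feasible = (a for a in candidates if constraint is None or constraint(a))
    return max(feasible, key=score)


print(map_inference())                     # unconstrained prediction
print(map_inference(at_most_one_heading))  # prediction under the constraint
```

In this toy setup the constraint changes the label of n1 rather than n0, even though n1 has the higher unary score for "heading", because the pairwise terms favor a heading followed by a paragraph next to a figure. A real system would replace the enumeration with a proper CRF inference method.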

Related reading: the blog post on Document Analysis and Layout Using Sequential Pattern Mining Techniques