Transkribus Python Toolkit
Jean-Luc Meunier, Hervé Déjean
ICDAR-OST, Kyoto, Japan, 10 - 12 November 2017
This paper introduces an extension to PyStruct, an open source Python library for structured machine learning based on general Conditional Random Field (CRF) models. We have extended it to support multi-type CRFs, which jointly classify objects of different natures, and to support logical constraints at prediction time. Our motivation is to address Document Understanding (DU) tasks by collectively classifying the objects present on one or several pages, jointly considering textual objects and objects of other natures, such as tables, images, or any other object that can be recognized. The logical constraints are meant to reflect prior knowledge about the DU task. We focus here on the improvements made over PyStruct and also give practical information. We also present a reproducible experiment. The new library is publicly available on GitHub under the Simplified BSD License.
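To make the two ideas above concrete, the following is a minimal, self-contained sketch of joint labelling over a graph with two node types and one logical constraint applied at prediction time. It does not use PyStruct or the extension's API: the node types, label sets, scores, the "at most one title" constraint, and the function names are all hypothetical, and inference is done by brute-force enumeration rather than by the library's inference machinery.

```python
from itertools import product

# Hypothetical label sets: each node type has its own labels.
LABELS = {"text": ["title", "body"], "image": ["figure", "logo"]}

# Hypothetical nodes: (type, unary score per label).
NODES = [
    ("text", {"title": 2.0, "body": 1.0}),
    ("text", {"title": 3.5, "body": 1.5}),
    ("image", {"figure": 0.5, "logo": 0.4}),
]

# Edges between node indices.
EDGES = [(0, 1), (1, 2)]

# Hypothetical pairwise scores, one table per pair of node types.
# Missing entries count as 0.0.
PAIRWISE = {
    ("text", "text"): {("title", "body"): 0.5},
    ("text", "image"): {("body", "figure"): 1.0},
}


def score(labelling):
    """Sum of unary and pairwise scores for a full labelling."""
    total = sum(NODES[i][1][lab] for i, lab in enumerate(labelling))
    for a, b in EDGES:
        table = PAIRWISE.get((NODES[a][0], NODES[b][0]), {})
        total += table.get((labelling[a], labelling[b]), 0.0)
    return total


def at_most_one_title(labelling):
    """Logical constraint: no more than one node is labelled 'title'."""
    return sum(lab == "title" for lab in labelling) <= 1


def map_labelling(constraint=None):
    """Brute-force MAP: best-scoring labelling that satisfies the constraint."""
    candidates = product(*(LABELS[node_type] for node_type, _ in NODES))
    if constraint is not None:
        candidates = (c for c in candidates if constraint(c))
    return max(candidates, key=score)


unconstrained = map_labelling()
constrained = map_labelling(at_most_one_title)
print(unconstrained)  # ('title', 'title', 'figure')
print(constrained)    # ('title', 'body', 'figure')
```

In this toy example the constraint changes the prediction: without it, both text nodes are labelled as titles, while with it the second text node falls back to "body". Enumeration is only practical for very small graphs and is used here purely for illustration.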