Uncovering non-obvious properties of text
Speaker: Cyril Goutte
Text categorization is a mature technology that usually reaches a high level of performance, close to human baselines, when predicting document topics and other content-based categories. In this talk I will discuss problems where we try to predict less obvious properties of text, properties that humans typically struggle or fail to judge accurately. I will cover three such problems: predicting whether a text is a translation or an original, predicting the native language of an L2 English writer, and predicting local variants of languages. I will present results benchmarked in recent international evaluations.