Sami Virpioja, Mari-Sanna Paukkeri, Abhishek Tripathi, Tiina Lindh-Knuutila, Krista Lagus
Natural Language Engineering Journal, Editor: Cambridge University Press, 2011.
Vector space models are used in language processing applications for calculating semantic
similarities of words or documents. The vector spaces are generated with feature extraction
methods for text data. However, evaluation of the feature extraction methods may be
difficult. Indirect evaluation in an application is often time-consuming and the results
may not generalize to other applications, whereas direct evaluations that measure the
amount of captured semantic information usually require human evaluators or annotated
data sets. We propose a novel direct evaluation method based on canonical correlation
analysis (CCA), a classical method for finding linear relationships between two data
sets. In our setting, the two sets are parallel text documents in two languages. A good
feature extraction method should provide representations that reflect the semantic content
of the documents. Assuming that the underlying semantic content is independent of the
language, we can study which feature extraction methods capture it best by measuring
the dependence between the representations of the same documents in two languages.
In the case of CCA, the applied measure of dependence is correlation. The evaluation
method is based on unsupervised learning, is language and domain independent, and
does not require any resources other than a parallel corpus. We demonstrate the evaluation
method on a sentence-aligned parallel corpus. The method is validated by
showing that the results obtained with bag-of-words representations are intuitive and agree
well with previous findings. Moreover, we compare the proposed
evaluation method with indirect evaluation in simple sentence matching tasks
and with a quantitative manual evaluation of word translations. The results of the proposed
method correlate well with the results of the indirect and manual evaluations.
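
The core idea, measuring dependence between two representations of the same documents with CCA, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration rather than the procedure used in the paper: the function name `canonical_correlations`, the ridge term `reg`, the use of the summed correlations as a score, and the synthetic matrices `X_en` and `Y_fi` that stand in for feature vectors of aligned sentences in two languages.

```python
import numpy as np

def canonical_correlations(X, Y, reg=1e-3):
    """Canonical correlations between row-aligned data matrices X and Y.

    Rows are aligned documents; columns are features. `reg` is a small
    ridge term (an assumption of this sketch) that keeps the covariance
    matrices invertible.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)
    # Whiten both views with Cholesky factors: M = Lx^{-1} Cxy Ly^{-T}.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy)
    M = np.linalg.solve(Ly, M.T).T
    # The singular values of M are the canonical correlations.
    return np.linalg.svd(M, compute_uv=False)

# Synthetic stand-in for a parallel corpus: a shared latent "semantic"
# signal Z is observed through two different linear maps plus noise.
rng = np.random.default_rng(0)
n_docs, n_latent = 500, 3
Z = rng.normal(size=(n_docs, n_latent))
X_en = Z @ rng.normal(size=(n_latent, 6)) + 0.1 * rng.normal(size=(n_docs, 6))
Y_fi = Z @ rng.normal(size=(n_latent, 5)) + 0.1 * rng.normal(size=(n_docs, 5))

# A representation that ignores the shared content, for comparison.
Y_unrelated = rng.normal(size=(n_docs, 5))

score_shared = canonical_correlations(X_en, Y_fi).sum()
score_unrelated = canonical_correlations(X_en, Y_unrelated).sum()
print(f"shared content:    {score_shared:.3f}")
print(f"unrelated content: {score_unrelated:.3f}")
```

In this sketch, a representation pair that preserves the shared content receives a higher summed correlation than a pair that does not, which is the kind of comparison the evaluation method relies on. On real data the correlations would normally be estimated on held-out documents, since CCA can overfit when the number of features is large relative to the number of documents.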