Greg Grefenstette
Workshop on Acquisition of Lexical Knowledge from Text (ACL SIGLEX), Columbus, Ohio. Corpus Processing for Lexical Acquisition, Eds:
Bran Boguraev and James Pustejovsky, MIT Press, 1996 ISBN:
As large on-line corpora become more prevalent, a number of attempts have been made to automatically
extract thesaurus-like relations directly from text using knowledge-poor methods. In the absence of any
specific application, comparing the results of these attempts is difficult. Here we propose an evaluation
method using gold standards, i.e., pre-existing hand-compiled resources, as a means of comparing extraction
techniques. Using this evaluation method, we compare two semantic extraction techniques which
produce similar word lists, one using syntactic context of words, and the other using windows of
heuristically tagged words. The two techniques are very similar except that in one case selective natural
language processing, a partial syntactic analysis, is performed. On a 4-megabyte corpus, syntactic
contexts produce significantly better results against the gold standards for the most characteristic words
in the corpus, while windows produce better results for rare words.
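
As an illustration only, the window-based variant and the gold-standard check might be sketched as follows. The function names, the window size, the unweighted Jaccard overlap, and the dictionary layout of the gold standard are all assumptions made for this sketch; the abstract does not specify the similarity measure or the scoring details used in the paper.

```python
from collections import Counter, defaultdict

def window_contexts(tokens, size=2):
    """Map each word to a Counter of the words seen within `size` positions of it.

    `size` is an assumed parameter; the abstract does not give the window width.
    """
    contexts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - size), min(len(tokens), i + size + 1)
        for j in range(lo, hi):
            if j != i:
                contexts[word][tokens[j]] += 1
    return contexts

def jaccard(a, b):
    """Unweighted Jaccard overlap of two context sets (an assumed stand-in measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def nearest(word, contexts):
    """Return the other word whose contexts overlap most with those of `word`."""
    others = [w for w in contexts if w != word]
    return max(others, key=lambda w: jaccard(contexts[word], contexts[w]), default=None)

def gold_hits(words, contexts, gold):
    """Count words whose nearest neighbour appears with them in the gold standard.

    `gold` is a hypothetical dict mapping a word to the set of words that a
    hand-compiled resource lists as related to it.
    """
    return sum(1 for w in words if nearest(w, contexts) in gold.get(w, set()))
```

Under these assumptions, a syntactic variant would differ only in how the context counters are built (from parsed relations rather than neighbouring tokens), so both techniques could be scored by the same `gold_hits` call and compared on frequent and rare words separately.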
Report number: