Viet Ha-Thuc, Nicola Cancedda
ACL/HLT 2011, June 19-24, 2011, Portland, Oregon, USA.
Language modeling is a key component in most statistical machine
translation systems, where it plays a crucial role in promoting output
fluency. Since they rely on word surface forms only, mainstream language
models are unable to benefit from available linguistic knowledge sources.
Moreover, they tend to suffer from poor estimates for rare features. In
this disclosure we propose an approach to overcome these two limitations.
For the first one, we use factored features that can flexibly capture
linguistic regularities. To overcome the second, we adopt
confidence-weighted learning, a form of discriminative online learning
that can better take advantage of a heavy tail of rare features. Finally,
we extend the confidence-weighted learning to deal with noise in training
data, the most common case with language modeling.