Accurate Word Segmentation using Transliteration and Language Model Projection

Authors: Masato Hagiwara and Satoshi Sekine


Transliterated compound nouns that are not separated by whitespace pose difficulties for word segmentation (WS). Offline approaches have been proposed to split them using word statistics, but they rely on a static lexicon, which limits their use. We propose an online approach that integrates a source-language LM and/or back-transliteration and an English LM. Experiments on Japanese and Chinese WS show that the proposed models achieve significant improvements over the state of the art, reducing errors by 16% in Japanese.
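
As a rough illustration of how an English LM can guide the splitting of a compound, the sketch below assumes the compound has already been back-transliterated into an unsegmented English string and picks the split with the highest unigram log-probability by dynamic programming. This is a toy under stated assumptions, not the proposed model: the counts in `TOY_COUNTS`, the unknown-word penalty, and the function names `make_logprob` and `best_split` are all hypothetical.

```python
import math

# Hypothetical toy English unigram counts (illustrative only, not from the paper).
TOY_COUNTS = {
    "bracket": 50,
    "bracketing": 20,
    "indicator": 30,
    "in": 500,
    "king": 40,
}


def make_logprob(counts, unknown_penalty=-20.0):
    """Return a unigram log-probability function built from raw counts.

    Unseen strings get a penalty proportional to their length (an assumption
    made here so that long unknown chunks are strongly discouraged).
    """
    total = sum(counts.values())

    def logprob(word):
        if word in counts:
            return math.log(counts[word] / total)
        return unknown_penalty * len(word)

    return logprob


def best_split(text, logprob, max_len=12):
    """Find the highest-scoring segmentation of `text` under a unigram LM."""
    # best[i] holds (score, segmentation) for the prefix text[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            prev_score, prev_seg = best[start]
            if prev_seg is None:
                continue
            score = prev_score + logprob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, prev_seg + [text[start:end]])
    return best[len(text)][1]


if __name__ == "__main__":
    lp = make_logprob(TOY_COUNTS)
    print(best_split("bracketingindicator", lp))
```

With these toy counts, the split into two known words outscores any split that leaves unknown fragments, which is the intuition behind using an English LM as evidence for a word boundary.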

