PUBLICATIONS

Latent Semantic Transliteration using Dirichlet Mixture

Authors: Masato Hagiwara and Satoshi Sekine

Jul 2012

Proc. of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), pp. 30-37, 2012.

ABSTRACT

Transliteration has usually been handled by spelling-based supervised models. However, a single model cannot deal with a mixture of words with different origins, such as “get” in “piaget” and “target”. Li et al. (2007) propose a class transliteration method, which explicitly models the source language origins and switches between them to address this issue. In contrast to their model, which requires a training corpus explicitly tagged with language origins, Hagiwara and Sekine (2011) proposed the latent class transliteration model, which models language origins as latent classes and trains the transliteration table via the EM algorithm. However, this model, which can be formulated as a unigram mixture, is prone to overfitting since it is based on maximum likelihood estimation. We propose a novel latent semantic transliteration model based on a Dirichlet mixture, where a Dirichlet mixture prior is introduced to mitigate the overfitting problem. We show that the proposed method considerably outperforms the conventional transliteration models.
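
To make the unigram-mixture formulation mentioned in the abstract concrete, the following is a minimal sketch of EM for a mixture of unigrams, not the authors' implementation. It assumes each word is already segmented into a list of unit tokens (a hypothetical stand-in for transliteration units), and the names `em_unigram_mixture` and `pseudo_count` are illustrative. With `pseudo_count=0.0` the M-step is plain maximum likelihood, the setting the abstract describes as prone to overfitting. A positive `pseudo_count` is only simple additive smoothing, shown for contrast; it is not the Dirichlet mixture prior proposed in the paper.

```python
import math
import random


def em_unigram_mixture(items, n_classes, n_iter=50, pseudo_count=0.0, seed=0):
    """Fit a mixture of unigrams by EM (illustrative sketch).

    items: list of lists of unit tokens (hypothetical transliteration units).
    pseudo_count: additive smoothing in the M-step; 0.0 is maximum likelihood.
    Returns (class_weights, unit_probs_per_class, responsibilities).
    """
    rng = random.Random(seed)
    vocab = sorted({u for item in items for u in item})

    # Start from a random soft assignment of every item to the latent classes.
    resp = []
    for _ in items:
        row = [rng.random() + 1e-3 for _ in range(n_classes)]
        total = sum(row)
        resp.append([r / total for r in row])

    weights, probs = [], []
    for _ in range(n_iter):
        # M-step: re-estimate class weights and per-class unit distributions.
        weights = [sum(r[z] for r in resp) / len(items) for z in range(n_classes)]
        probs = []
        for z in range(n_classes):
            counts = {u: pseudo_count for u in vocab}
            for item, r in zip(items, resp):
                for u in item:
                    counts[u] += r[z]
            total = sum(counts.values())
            if total == 0:
                # A class that lost all its mass falls back to a uniform distribution.
                probs.append({u: 1.0 / len(vocab) for u in vocab})
            else:
                probs.append({u: c / total for u, c in counts.items()})

        # E-step: posterior over classes for each item, computed in log space.
        new_resp = []
        for item in items:
            logp = []
            for z in range(n_classes):
                lp = math.log(weights[z]) if weights[z] > 0 else -math.inf
                for u in item:
                    p = probs[z][u]
                    lp += math.log(p) if p > 0 else -math.inf
                logp.append(lp)
            m = max(logp)
            exps = [math.exp(l - m) for l in logp]
            s = sum(exps)
            new_resp.append([e / s for e in exps])
        resp = new_resp

    return weights, probs, resp
```

Under maximum likelihood, any unit a class never saw in training gets probability zero in that class, which is one way the overfitting shows up on small transliteration tables.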

Paper Link

Research Areas : #Language Program

Tags : #Natural Language Processing
