Latent Semantic Transliteration using Dirichlet Mixture
Transliteration has usually been recognized by spelling-based supervised models. However, a single model cannot deal with a mixture of words with different origins, such as "get" in "piaget" and "target". Li et al. (2007) propose a class transliteration method, which explicitly models the source language origins and switches between them to address this issue. In contrast to their model, which requires a training corpus explicitly tagged with language origins, Hagiwara and Sekine (2011) have proposed the latent class transliteration model, which models language origins as latent classes and trains the transliteration table via the EM algorithm. However, this model, which can be formulated as a unigram mixture, is prone to overfitting since it is based on maximum likelihood estimation. We propose a novel latent semantic transliteration model based on Dirichlet mixture, where a Dirichlet mixture prior is introduced to mitigate the overfitting problem. We have shown that the proposed method considerably outperforms the conventional transliteration models.
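To illustrate the core idea, the following is a minimal sketch (not the paper's actual implementation) of EM for a unigram mixture over symbol sequences, where the M-step uses Dirichlet pseudocounts (MAP estimation) instead of plain maximum likelihood; the function names, the fixed symmetric prior `alpha`, and the toy data are all assumptions for illustration:

```python
import math
import random

def em_unigram_mixture(docs, vocab, K=2, alpha=1.1, iters=50, seed=0):
    """EM for a K-component unigram (multinomial) mixture.

    docs  : list of symbol sequences (e.g., transliteration units)
    alpha : symmetric Dirichlet hyperparameter; alpha > 1 adds
            (alpha - 1) pseudocounts in the M-step, which keeps every
            symbol probability strictly positive and mitigates the
            overfitting of pure maximum likelihood estimation.
    """
    rng = random.Random(seed)
    V = len(vocab)
    idx = {w: i for i, w in enumerate(vocab)}
    # Random initialization of mixture weights and per-class unigram tables.
    pi = [1.0 / K] * K
    theta = []
    for _ in range(K):
        row = [rng.random() + 0.5 for _ in range(V)]
        s = sum(row)
        theta.append([t / s for t in row])
    for _ in range(iters):
        # E-step: posterior responsibility of each class for each sequence.
        resp = []
        for d in docs:
            logp = [math.log(pi[k])
                    + sum(math.log(theta[k][idx[w]]) for w in d)
                    for k in range(K)]
            m = max(logp)                       # log-sum-exp for stability
            p = [math.exp(l - m) for l in logp]
            s = sum(p)
            resp.append([x / s for x in p])
        # M-step: MAP update with Dirichlet pseudocounts (alpha - 1).
        pi = [sum(r[k] for r in resp) / len(docs) for k in range(K)]
        for k in range(K):
            counts = [alpha - 1.0] * V
            for d, r in zip(docs, resp):
                for w in d:
                    counts[idx[w]] += r[k]
            s = sum(counts)
            theta[k] = [c / s for c in counts]
    return pi, theta
```

With `alpha = 1` this reduces to the maximum likelihood unigram mixture; setting `alpha > 1` is the simplest way to see how a Dirichlet prior smooths the per-class tables so that rare symbols never receive zero probability.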