RIT Singapore Releases New Corpus and Sheds Light on the Future of Automatic Post-Editing
RIT Singapore has built and released publicly available datasets of post-edited neural machine translation (NMT) output this week. The release coincides with their upcoming EMNLP 2020 paper, “Can Automatic Post-Editing Improve NMT?”, which compiles a large corpus of human post-edits of English-to-German NMT output and uses it to show empirically the conditions under which automatic post-editing (APE) can improve NMT.
There are two datasets, as follows:
1. SubEdits (English-German, 160k triplets): A human-annotated post-editing dataset of neural machine translation outputs, compiled from in-house NMT outputs and human post-edits of subtitles from Rakuten Viki. Details about dataset collection and preprocessing can be found in the paper.