RIT Singapore Releases New Corpus and Sheds Light on the Future of Automatic Post-Editing

2020.10.01

RIT Singapore has built and released publicly available post-editing datasets for neural machine translation (NMT) this week. The release coincides with their upcoming EMNLP 2020 paper, “Can Automatic Post-Editing Improve NMT?”, which compiles a large corpus of human post-edits of English-to-German NMT output and uses it to show empirically the conditions under which automatic post-editing (APE) can improve NMT.

There are two datasets, as follows:

1. SubEdits (English-German, 160k triplets): A human-annotated post-editing dataset of neural machine translation outputs, compiled from in-house NMT outputs and human post-edits of subtitles from Rakuten Viki. Details about dataset collection and preprocessing can be found in the paper.

2. SubEscape (English-German, 5.6M triplets): An artificial post-editing dataset created by translating the OpenSubtitles2016 corpus (Lison and Tiedemann, 2016), collected from www.opensubtitles.org/, with the same NMT system used for SubEdits. The reference translations serve as synthetic post-edits, following the procedure used to compile eSCAPE (Negri et al., 2018).
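
To make the triplet format concrete, here is a minimal Python sketch of how a synthetic triplet of the SubEscape kind could be assembled: each source sentence is machine-translated, and the existing reference translation is stored in the post-edit slot. The names `Triplet`, `build_synthetic_triplets`, and `toy_translate` are hypothetical and are not taken from the released datasets or the paper's code; the toy lookup table merely stands in for a real NMT system.

```python
from typing import Callable, Iterable, List, NamedTuple


class Triplet(NamedTuple):
    """One post-editing example (hypothetical layout, for illustration only)."""
    src: str  # English source sentence
    mt: str   # machine translation of the source
    pe: str   # post-edit; for synthetic data, the reference translation


def build_synthetic_triplets(
    sources: Iterable[str],
    references: Iterable[str],
    translate: Callable[[str], str],
) -> List[Triplet]:
    """Pair each source with its MT output and use the reference as the post-edit."""
    triplets = []
    for src, ref in zip(sources, references):
        triplets.append(Triplet(src=src, mt=translate(src), pe=ref))
    return triplets


def toy_translate(sentence: str) -> str:
    """Stand-in for an NMT system: a tiny lookup table (an assumption, not a real model)."""
    lookup = {"Thank you.": "Danke.", "Good morning.": "Guten Tag."}
    return lookup.get(sentence, sentence)


if __name__ == "__main__":
    examples = build_synthetic_triplets(
        ["Thank you.", "Good morning."],
        ["Danke schön.", "Guten Morgen."],
        toy_translate,
    )
    for triplet in examples:
        print(triplet)
```

In a human-annotated dataset such as SubEdits, the `pe` field would instead hold an actual human correction of the `mt` field rather than an independently produced reference.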

Please see the GitHub repository to learn more about using the datasets for research purposes, as well as details on citation and licensing.