Regression Loss in Transformer-based Supervised Neural Machine Translation
Keywords: supervised neural machine translation (SNMT), Transformer, attention mechanism, semantic regression loss, evaluation metric
Transformer-based models have achieved human-level performance in supervised neural machine translation (SNMT), far surpassing models based on recurrent neural networks (RNNs) or convolutional neural networks (CNNs). The original Transformer-based model is trained through maximum likelihood estimation (MLE), which regards the machine translation task as a multilabel classification problem and takes the sum of the cross entropy losses of all the target tokens as the loss function. However, this model assumes that token generation is partially independent, and ignores the fact that tokens are components of a sequence. To solve the problem, this paper proposes a semantic regression loss for Transformer training, which treats the generated sequence as a whole. Upon finding that the semantic difference is proportional to the candidate-reference distance, the authors treated machine translation as a multi-task problem and took the linear combination of the cross entropy loss and the semantic regression loss as the overall loss function. The semantic regression loss was shown to significantly enhance SNMT performance, at the cost of a slight reduction in convergence speed.
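The overall loss described above, a linear combination of a token-level cross entropy term and a sequence-level regression term, can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the semantic regression loss, so the squared distance between mean-pooled candidate and reference embeddings used here, the weight `alpha`, and all function names are hypothetical.

```python
import math


def cross_entropy_sum(probs, target_ids):
    """Sum of per-token cross entropy: -sum_t log p_t(y_t).

    probs[t] is the predicted distribution over the vocabulary at step t,
    target_ids[t] is the index of the reference token at step t.
    """
    return -sum(math.log(p[y]) for p, y in zip(probs, target_ids))


def mean_vector(vectors):
    """Mean-pool a list of equal-length vectors into one sentence vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def semantic_regression_loss(cand_vecs, ref_vecs):
    """ASSUMPTION: squared Euclidean distance between the mean-pooled
    candidate and reference embeddings. The paper's actual definition
    is not given in the abstract."""
    c = mean_vector(cand_vecs)
    r = mean_vector(ref_vecs)
    return sum((a - b) ** 2 for a, b in zip(c, r))


def combined_loss(probs, target_ids, cand_vecs, ref_vecs, alpha=0.5):
    """Linear combination of the two terms; alpha is a hypothetical weight."""
    ce = cross_entropy_sum(probs, target_ids)
    reg = semantic_regression_loss(cand_vecs, ref_vecs)
    return ce + alpha * reg


if __name__ == "__main__":
    # Toy example: 2 target tokens, vocabulary of 3, 2-dimensional embeddings.
    probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
    target_ids = [0, 1]
    cand_vecs = [[1.0, 0.0], [0.0, 1.0]]
    ref_vecs = [[1.0, 0.0], [1.0, 0.0]]
    print(combined_loss(probs, target_ids, cand_vecs, ref_vecs))
```

The cross entropy term scores each target token separately, while the regression term compares the two sequences as wholes, which is the distinction the abstract draws. In actual training, the candidate representation would have to be differentiable with respect to the model parameters; the sketch shows only the arithmetic of the combined objective.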
ONLINE OPEN ACCESS: Access to the full text of each article and each issue is free under the Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0).
You are free to:
-Share: copy and redistribute the material in any medium or format;
-Adapt: remix, transform, and build upon the material.
The licensor cannot revoke these freedoms as long as you follow the license terms.
DISCLAIMER: The author(s) of each article appearing in International Journal of Computers Communications & Control is/are solely responsible for the content thereof; the publication of an article shall not constitute or be deemed to constitute any representation by the Editors or Agora University Press that the data presented therein are original, correct or sufficient to support the conclusions reached or that the experiment design or methodology is adequate.