Regression Loss in Transformer-based Supervised Neural Machine Translation

  • Dongxing Li
  • Zuying Luo School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China


The Transformer-based model has achieved human-level performance in supervised neural machine translation (SNMT), far surpassing models based on recurrent neural networks (RNNs) or convolutional neural networks (CNNs). The original Transformer is trained through maximum likelihood estimation (MLE), which regards machine translation as a multilabel classification problem and takes the sum of the cross-entropy losses of all target tokens as the loss function. However, this objective assumes that token generation is partially independent, ignoring the fact that the tokens are components of a single sequence. To solve the problem, this paper proposes a semantic regression loss for Transformer training that treats the generated sequence as a whole. Upon finding that the semantic difference is proportional to the candidate-reference distance, the authors cast machine translation as a multi-task problem and took a linear combination of the cross-entropy loss and the semantic regression loss as the overall loss function. The semantic regression loss was shown to significantly enhance SNMT performance, at the cost of a slight reduction in convergence speed.
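The multi-task objective described above can be illustrated with a minimal NumPy sketch. The abstract specifies only that the overall loss is a linear combination of the token-level cross-entropy loss and a semantic regression loss over the whole sequence; the choices below (mean pooling to obtain sentence embeddings, squared Euclidean distance as the candidate-reference distance, and the weight `lam`) are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def cross_entropy_loss(logits, targets):
    """Sum of per-token cross-entropy losses (the standard MLE objective).

    logits: (seq_len, vocab_size) array of unnormalized scores.
    targets: (seq_len,) array of gold token ids.
    """
    # numerically stable softmax over the vocabulary axis
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return float(-np.log(probs[np.arange(len(targets)), targets]).sum())

def semantic_regression_loss(cand_emb, ref_emb):
    """Squared Euclidean distance between mean-pooled sentence embeddings.

    cand_emb, ref_emb: (seq_len, dim) token-embedding matrices for the
    candidate translation and the reference. Pooling and metric are
    illustrative assumptions.
    """
    c = cand_emb.mean(axis=0)
    r = ref_emb.mean(axis=0)
    return float(((c - r) ** 2).sum())

def combined_loss(logits, targets, cand_emb, ref_emb, lam=0.5):
    """Linear combination of the two losses (lam is a hypothetical weight)."""
    return cross_entropy_loss(logits, targets) + lam * semantic_regression_loss(cand_emb, ref_emb)
```

When the candidate and reference embeddings coincide, the regression term vanishes and the objective reduces to plain MLE; the weight `lam` trades off token-level accuracy against sequence-level semantic similarity.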


How to Cite
LI, Dongxing; LUO, Zuying. Regression Loss in Transformer-based Supervised Neural Machine Translation. International Journal of Computers Communications & Control, vol. 16, no. 4, April 2021. ISSN 1841-9844.