Quality is taken to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is". This is the central idea behind BLEU. For this reason, a number of different metrics have been proposed for tasks such as machine translation and summarization. BLEU is a string-matching algorithm that gives MT researchers and developers a basic measure of output quality. Although it is widely understood that BLEU has many limitations, it has probably been the most widely used MT quality metric among researchers and developers over the last 15 years.
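To make the string-matching idea concrete, here is a minimal sketch of sentence-level BLEU in the spirit of Papineni et al. (2002): clipped n-gram precisions combined by a geometric mean and multiplied by a brevity penalty. The function names `ngrams` and `sentence_bleu`, the whitespace tokenization, and the absence of smoothing are assumptions made for illustration. Established toolkits differ in tokenization and smoothing, so their scores will not necessarily match this sketch.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU for whitespace-tokenized strings (illustrative)."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing: any zero precision makes the score 0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((len(r) for r in refs),
                  key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    # Geometric mean of the n-gram precisions with uniform weights.
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(sentence_bleu("the cat sat on the mat", [reference]))
    print(sentence_bleu("the cat is on the mat", [reference], max_n=2))
    print(sentence_bleu("the cat is on the mat", [reference]))
```

The last call illustrates one of the limitations mentioned above: a candidate that differs from the reference by a single word shares no 4-gram with it, so unsmoothed sentence-level BLEU drops to zero. This is one reason BLEU is usually reported at the corpus level or with smoothing.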

We utilize ExactMatch (EM), BLEU (Papineni et al., 2002) and SARI (Xu et al., 2016) scores as evaluation metrics for fluency.
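Of these metrics, ExactMatch is the simplest to state. The sketch below assumes EM is the fraction of outputs that equal their reference string after trimming surrounding whitespace. The function name `exact_match` and the trimming step are illustrative assumptions; the exact normalization used in the evaluation (casing, punctuation, tokenization) is not specified here.

```python
def exact_match(predictions, references):
    """Fraction of predictions identical to their reference (illustrative definition)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    # Assumption: only leading/trailing whitespace is ignored when comparing.
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(predictions)


if __name__ == "__main__":
    preds = ["the cat sat on the mat", "a dog barked"]
    golds = ["the cat sat on the mat", "the dog barked"]
    print(exact_match(preds, golds))
```

Unlike BLEU, this gives no partial credit: an output that differs from its reference by one token counts the same as an entirely wrong one.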