
What is ROUGE in NLP



What does ROUGE mean?

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation.

“It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans” - Chin-Yew Lin

  • The measures count the number of overlapping units, such as n-grams, word sequences, and word pairs, between the computer-generated summary to be evaluated and the ideal summaries created by humans.

Four different ROUGE measures

  • ROUGE-N,
  • ROUGE-L,
  • ROUGE-W,
  • ROUGE-S.


Traditionally, evaluation of summarization involves human judgments of different quality metrics, for example, coherence, conciseness, grammaticality, readability, and content (Mani, 2001).

ROUGE-N is a recall-related measure because the denominator of its equation is the total number of n-grams occurring on the reference summary side.
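For reference, the ROUGE-N equation this paragraph refers to can be written as below, following the definition in Lin's ROUGE paper. Here $n$ is the n-gram length, $Count_{match}(gram_n)$ is the maximum number of n-grams co-occurring in the candidate summary and the reference summaries, and $Count(gram_n)$ is the number of n-grams in the references:

$$
\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{gram_n \in S} Count(gram_n)}
$$

The denominator depends only on the references, which is what makes the measure recall-oriented.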

A closely related measure, BLEU, used in automatic evaluation of machine translation, is a precision-based measure.

Note that the number of n-grams in the denominator of the ROUGE-N formula increases as we add more references. This is intuitive and reasonable because there might exist multiple good summaries.

Every time we add a reference into the pool, we expand the space of alternative summaries.

By controlling what types of references we add to the reference pool, we can design evaluations that focus on different aspects of summarization.

Also note that the numerator sums over all reference summaries.

This effectively gives more weight to matching n-grams that occur in multiple references. Therefore, a candidate summary that contains words shared by more references is favored by the ROUGE-N measure.
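A minimal sketch of this pooled form of ROUGE-N may help make the numerator and denominator concrete. It assumes lowercased whitespace tokenization and clips each n-gram match by its count in the candidate; the function names `ngrams` and `rouge_n` are illustrative and are not taken from the ROUGE package.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, references, n=1):
    """Pooled ROUGE-N recall of one candidate against a list of references."""
    cand_counts = ngrams(candidate.lower().split(), n)
    matched = 0
    total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        # Numerator: sum matches over every reference, so an n-gram shared by
        # several references is credited once per reference (assumed clipping:
        # at most as many matches as the candidate contains).
        matched += sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
        # Denominator: total number of n-grams on the reference side.
        total += sum(ref_counts.values())
    return matched / total if total else 0.0


candidate = "the cat sat on the mat"
references = ["the cat was on the mat", "a cat sat on a mat"]
print(rouge_n(candidate, references, n=1))  # 9 matches out of 12 reference unigrams
```

Words such as "cat", "on", and "mat" appear in both references, so they contribute to the numerator twice, which is the weighting effect described above.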

Multiple references

When multiple references are used, we compute the pairwise summary-level ROUGE-N between a candidate summary $s$ and every reference $r_i$ in the reference set, and take the maximum of these pairwise scores as the final score.

$\text{ROUGE-N}_{multi} = \max_i \text{ROUGE-N}(r_i, s)$
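The pairwise scheme can be sketched as follows. This is an illustrative implementation under simple assumptions (lowercased whitespace tokens, matches clipped by the candidate's counts); `rouge_n_pair` and `rouge_n_multi` are hypothetical helper names, not functions from the ROUGE package.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count the n-grams of a whitespace-tokenized, lowercased string."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_pair(reference, candidate, n=1):
    """ROUGE-N recall of a candidate against a single reference."""
    ref = ngram_counts(reference, n)
    cand = ngram_counts(candidate, n)
    total = sum(ref.values())
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / total if total else 0.0


def rouge_n_multi(references, candidate, n=1):
    """Best pairwise ROUGE-N score over all references (0.0 if there are none)."""
    return max((rouge_n_pair(r, candidate, n) for r in references), default=0.0)


refs = ["the cat was on the mat", "a cat sat on a mat"]
print(rouge_n_multi(refs, "the cat sat on the mat"))  # best of 5/6 and 4/6
```

Unlike the pooled formula, this variant rewards the candidate for being close to any one reference rather than to all of them at once.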

ROUGE-L: Longest Common Subsequence

By only awarding credit to in-sequence unigram matches, ROUGE-L also captures sentence level structure in a natural way.
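As a rough illustration of the in-sequence matching idea, here is a sketch of sentence-level ROUGE-L built on a standard dynamic-programming LCS. It assumes whitespace tokenization and an LCS-based recall, precision, and F-measure weighted by a `beta` parameter; the function names are made up for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match on equal tokens, otherwise carry the best so far.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate, beta=1.0):
    """Return (recall, precision, f_score) based on the LCS of the two sentences."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(ref, cand)
    recall = lcs / len(ref)
    precision = lcs / len(cand)
    if recall == 0.0 and precision == 0.0:
        return 0.0, 0.0, 0.0
    f_score = (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)
    return recall, precision, f_score


reference = "police killed the gunman"
print(rouge_l(reference, "police kill the gunman"))  # LCS: police, the, gunman
print(rouge_l(reference, "the gunman kill police"))  # LCS: the, gunman
```

Both candidates share three unigrams with the reference, but only the first keeps them in the reference order, so it gets the higher score.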

Note: the latter part still needs to be expanded upon properly.

Summary-Level LCS


One advantage of skip-bigram over BLEU is that it does not require consecutive matches but is still sensitive to word order.

Comparing skip-bigram with LCS, skip-bigram counts all in-order matching word pairs while LCS only counts one longest common subsequence.
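The difference is easy to see in a small sketch. The code below counts all in-order word pairs with `itertools.combinations` and computes a skip-bigram recall with no limit on the skip distance. The tokenization, the clipping of repeated pairs, and the function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def skip_bigrams(tokens):
    """Count every in-order pair of tokens, allowing arbitrary gaps."""
    return Counter(combinations(tokens, 2))


def skip_bigram_recall(reference, candidate):
    """Fraction of the reference's skip-bigrams that also occur in the candidate."""
    ref = skip_bigrams(reference.split())
    cand = skip_bigrams(candidate.split())
    total = sum(ref.values())
    matched = sum(min(count, cand[pair]) for pair, count in ref.items())
    return matched / total if total else 0.0


reference = "police killed the gunman"
for candidate in ("police kill the gunman",
                  "the gunman kill police",
                  "the gunman police killed"):
    print(candidate, "->", skip_bigram_recall(reference, candidate))
```

The last two candidates both have an LCS of length 2 with the reference, yet the third matches two skip-bigrams ("the gunman" and "police killed") while the second matches only one, so skip-bigram can separate cases that a single LCS cannot.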

Evaluation of ROUGE

ROUGE is evaluated by computing the correlation between ROUGE-assigned summary scores and human-assigned summary scores. The intuition is that a good evaluation measure should assign a good score to a good summary and a bad score to a bad summary. The ground truth is based on human-assigned scores.
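To make the idea of correlating metric scores with human scores concrete, here is a small sketch using Pearson's correlation coefficient, which is one common choice and is assumed here for illustration. The score lists are invented toy numbers, not results from any evaluation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical scores for five summaries (made-up values for illustration).
rouge_scores = [0.21, 0.35, 0.40, 0.52, 0.60]
human_scores = [1.0, 2.0, 2.5, 4.0, 4.5]
print(pearson(rouge_scores, human_scores))
```

A value close to 1 would mean the metric ranks and spaces the summaries much as the human judges do.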

TODO: add the Table 2 figure and explain it.

The best values in each column are marked with dark (green) color and statistically equivalent values to the best values are marked with gray.

We found that correlations were not affected by stemming or removal of stopwords in this data set. ROUGE-2 performed better among the ROUGE-N variants; ROUGE-L, ROUGE-W, and ROUGE-S all performed well; and using multiple references improved performance, though not by much.

The results indicated that using multiple references improved correlation and exclusion of stopwords usually improved performance.


We introduced ROUGE, an automatic evaluation package for summarization, and conducted comprehensive evaluations of the automatic measures included in the ROUGE package using three years of DUC data.

To check the significance of the results, we estimated confidence intervals of correlations using bootstrap resampling.
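A bootstrap confidence interval for a correlation can be sketched as follows. This uses a simple percentile bootstrap over (metric score, human score) pairs, which is an assumption for illustration and not necessarily the exact procedure used in the paper. The data and function names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def bootstrap_ci(xs, ys, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the Pearson correlation."""
    if len(set(xs)) < 2 or len(set(ys)) < 2:
        raise ValueError("correlation is undefined for constant data")
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    while len(stats) < n_resamples:
        # Resample (x, y) pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_x = [xs[i] for i in idx]
        sample_y = [ys[i] for i in idx]
        if len(set(sample_x)) < 2 or len(set(sample_y)) < 2:
            continue  # skip degenerate resamples with zero variance
        stats.append(pearson(sample_x, sample_y))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up scores for illustration only.
metric = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
human = [1.0, 2.0, 2.0, 4.0, 5.0, 5.0]
print(bootstrap_ci(metric, human))
```

If the intervals of two measures overlap heavily, their correlations with human judgments would usually be treated as statistically equivalent.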

  • We found that (1) ROUGE-2, ROUGE-L, ROUGE-W, and ROUGE-S worked well in single document summarization tasks
  • (2) ROUGE-1, ROUGE-L, ROUGE-W, ROUGE-SU4, and ROUGE-SU9 performed great in evaluating very short summaries (or headline-like summaries)
  • (3) correlations in the high 90% range were hard to achieve for multi-document summarization tasks, but ROUGE-1, ROUGE-2, ROUGE-S4, ROUGE-S9, ROUGE-SU4, and ROUGE-SU9 worked reasonably well when stopwords were excluded from matching,
  • (4) exclusion of stopwords usually improved correlation, and
  • (5) correlations to human judgments were increased by using multiple references.

ROUGE-L, ROUGE-W, and ROUGE-S were also shown to be very effective in automatic evaluation of machine translation.

In summary, we showed that the ROUGE package could be used effectively in automatic evaluation of summaries.

Open pointers

How to achieve high correlation with human judgments in multi-document summarization tasks, as ROUGE already does in single-document summarization tasks, is still an open research topic.