This post is a brief summary of a paper I read out of study and curiosity: Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators (Dubois et al., arXiv 2024).

In this paper, the authors operationalize the question “what would the AlpacaEval metric be if the outputs of all models had the same length as those of the baseline?” as a simple regression-based estimator.

In other words, automated evaluation measures such as AlpacaEval produce their quality estimates through a combination of direct effects, which measure the quality of the model response, and indirect effects, which are mediated by spurious variables such as the length of the outputs.

Dubois et al., arXiv 2024

Length control is done via regression. The regression accounts for the following three factors:

  • Model identity
  • Length of output
  • Instruction difficulty

Dubois et al., arXiv 2024
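To make the idea concrete, here is a minimal sketch of regression-based length control for a single model against a baseline. It is not the authors' implementation. The function name `lc_win_rate`, the `tanh` squashing of the standardized length difference, the plain gradient-descent fit, and the per-instruction `difficulty` scores passed in as given are all assumptions made for illustration. The sketch fits a logistic regression with one term per factor in the list above, then predicts the win rate again with the length term set to zero, which answers the "same length as the baseline" question.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lc_win_rate(pref, len_model, len_base, difficulty, steps=5000, lr=0.5):
    """Return (raw, length-controlled) win rates in percent.

    pref       : judge preference for the model over the baseline, in [0, 1]
    len_model  : output lengths of the evaluated model, one per instruction
    len_base   : output lengths of the baseline, one per instruction
    difficulty : hypothetical per-instruction difficulty scores (assumed given)
    """
    pref = np.asarray(pref, dtype=float)
    diff = np.asarray(len_model, dtype=float) - np.asarray(len_base, dtype=float)

    # Length term: standardized length difference, squashed into (-1, 1).
    length_feat = np.tanh(diff / (diff.std() + 1e-8))

    # Columns: model identity (intercept), length, instruction difficulty.
    X = np.column_stack(
        [np.ones_like(length_feat), length_feat, np.asarray(difficulty, dtype=float)]
    )

    # Fit the logistic regression with plain gradient descent on cross-entropy.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - pref) / len(pref)
        w -= lr * grad

    # Counterfactual prediction: zero length difference for every instruction.
    X_same_len = X.copy()
    X_same_len[:, 1] = 0.0
    return 100.0 * pref.mean(), 100.0 * sigmoid(X_same_len @ w).mean()


# Synthetic example: the judge's preference depends only on length,
# and the evaluated model tends to write longer outputs than the baseline.
rng = np.random.default_rng(0)
n = 2000
len_base = rng.integers(200, 800, size=n)
len_model = len_base + rng.normal(200, 300, size=n)
difficulty = rng.normal(0, 1, size=n)
diff = len_model - len_base
pref = sigmoid(2.0 * np.tanh(diff / diff.std()))

raw, lc = lc_win_rate(pref, len_model, len_base, difficulty)
print(f"raw win rate: {raw:.1f}, length-controlled: {lc:.1f}")
```

In this synthetic setup the model has no real quality advantage, so the raw win rate is inflated by length alone, while the length-controlled estimate should fall back toward 50.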

For detailed experiments and explanations, refer to the paper, Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators (Dubois et al., arXiv 2024).

Reference