This post is a brief summary of a paper I read out of study and curiosity: DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models (Chuang et al., arXiv 2023).

The paper proposes a simple decoding strategy for reducing hallucinations in pretrained LLMs, as follows:

(Figure from Chuang et al., arXiv 2023)

Their method requires neither conditioning on retrieved external knowledge nor additional fine-tuning.

The following figure illustrates their contrastive decoding method.

Their method obtains the next-token distribution by contrasting the logits obtained from projecting later layers versus earlier layers to the vocabulary space, exploiting the observation that factual knowledge in an LLM tends to be localized to particular transformer layers.

(Figure from Chuang et al., arXiv 2023)
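To make the idea concrete, here is a minimal sketch of this layer-contrasting step using a Hugging Face causal LM. It assumes a fixed, hypothetical `early_layer` index and an `alpha` of 0.1 for the plausibility cutoff; the actual paper selects the premature layer dynamically (via Jensen-Shannon divergence over a candidate set), so treat this as an illustration rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dola_next_token_logits(model, input_ids, early_layer=16, alpha=0.1):
    """Sketch: contrast a later (mature) layer against an earlier (premature) layer.

    `early_layer` and `alpha` are illustrative assumptions; DoLa picks the
    premature layer dynamically rather than fixing it in advance.
    """
    # Run the model once and keep the hidden states of every layer.
    outputs = model(input_ids, output_hidden_states=True)
    hidden_states = outputs.hidden_states  # (embeddings, layer 1, ..., layer N)

    # Project both the final layer and the earlier layer to the vocabulary
    # space with the same output head (early exit).
    lm_head = model.get_output_embeddings()
    mature_logits = lm_head(hidden_states[-1][:, -1, :])
    premature_logits = lm_head(hidden_states[early_layer][:, -1, :])

    mature_logprobs = F.log_softmax(mature_logits, dim=-1)
    premature_logprobs = F.log_softmax(premature_logits, dim=-1)

    # Contrast the two distributions: tokens whose probability grows between
    # the early layer and the final layer get boosted.
    contrast = mature_logprobs - premature_logprobs

    # Plausibility constraint: only keep tokens that are reasonably likely
    # under the mature distribution, to avoid promoting implausible tokens.
    threshold = mature_logprobs.max(dim=-1, keepdim=True).values + torch.log(torch.tensor(alpha))
    contrast = contrast.masked_fill(mature_logprobs < threshold, float("-inf"))
    return contrast
```

The returned scores can then be fed into greedy or sampled decoding in place of the usual final-layer logits.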

For the detailed empirical analysis and experiments, refer to the paper, DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models (Chuang et al., arXiv 2023).