This is a brief summary note, for my own reference, of the paper The Curious Case of Neural Text Degeneration (Holtzman et al., ICLR 2020).

Despite considerable advances in neural language modeling, which have led to high-quality models for a range of language understanding tasks, generating text from these models remains problematic.

The authors point out problems with current text generation that relies on maximization-based decoding.

Text generated with maximization-based decoding methods such as beam search tends to be bland and incoherent, and gets stuck in repetitive loops.

So they propose a new decoding strategy, which they call Nucleus Sampling.

Let’s first go over some basic concepts of text generation.

A number of recent works have alluded to the disadvantages of maximization-based generation, which tends to produce output with high grammaticality but low diversity.

Generative Adversarial Networks have been a prominent research direction, but they fall behind language models when generation quality and diversity are considered jointly.

Meanwhile, another line of work tweaks the beam search scoring function to encourage task-specific diversity.

They also split the field of generation into two settings: open-ended and directed generation.

Many text generation tasks are defined through (input, output) pairs, such that the output is a constrained transformation of the input.

Example applications include machine translation, data-to-text generation, and summarization; the authors refer to these as directed generation.

The other setting, open-ended generation, includes conditional story generation and contextual text continuation. They place goal-oriented dialog somewhere between open-ended and directed generation.

From here on, let’s look at the decoding strategies they describe.

  1. Maximization-based decoding

The most commonly used decoding objective, in particular for directed generation, is maximization-based decoding.
Assuming that the model assigns higher probability to higher quality text, these decoding strategies search for the continuation with the highest likelihood.
Since finding the optimum argmax sequence from recurrent neural language models or Transformers is not tractable,
common practice is to use beam search.
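
As a rough illustration, here is a minimal beam-search sketch over a toy next-token model. The function `toy_next_token_probs` is a hypothetical stand-in for a real neural LM (its probabilities are made up); only the search procedure itself is the point.

```python
import math

def toy_next_token_probs(prefix):
    # Hypothetical 4-token vocabulary with fixed probabilities;
    # a real language model would condition on `prefix`.
    return {"the": 0.5, "cat": 0.3, "sat": 0.15, "<eos>": 0.05}

def beam_search(beam_width=2, max_len=5):
    # Each hypothesis is a (log-probability, token list) pair.
    beams = [(0.0, [])]
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq and seq[-1] == "<eos>":
                candidates.append((logp, seq))  # keep finished hypotheses as-is
                continue
            for tok, p in toy_next_token_probs(seq).items():
                candidates.append((logp + math.log(p), seq + [tok]))
        # Prune to the `beam_width` highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams

print(beam_search())
```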

  2. Nucleus sampling

The authors propose a new stochastic decoding method: Nucleus Sampling. The key idea is to use the shape of the probability distribution to determine the set of tokens to be sampled from.

Specifically, in order to sample the next token from a truncated neural language model distribution, Nucleus Sampling truncates the distribution at a threshold value \(p\) and uses the remaining tokens as the prediction zone.

Given a distribution \(P(x_{i}|x_{1:i-1})\), they define the top-\(p\) vocabulary \(V^{(p)} \subset V\) as the smallest set to be sampled from.

\(V^{(p)}\) is the smallest set, determined by the threshold value \(p\), such that:

\[\sum_{x \in V^{(p)}} P(x_{i}|x_{1:i-1}) \geq p\]

In order to sample from the prediction zone defined by the threshold \(p\), they rescale the truncated distribution into \(P'(x_{i}|x_{1:i-1})\) as follows:

First of all, let \(p' = \sum_{x \in V^{(p)}} P(x_{i}|x_{1:i-1})\).

\[P^{'}(x_{i}|x_{1:i-1}) = \begin{cases} \frac{P^{'}(x_{i}|x_{1:i-1})}{p^{'}}, & \text{if x } \in V^{(p)} \\ 0, & \text{otherwise} \end{cases}\]
  3. Top-k sampling

Nucleus Sampling and top-k both sample from truncated Neural LM distributions, differing only in the strategy of where to truncate.
Choosing where to truncate can be interpreted as determining the generative model’s trustworthy prediction zone.

Top-k sampling is similar to Nucleus Sampling, but when truncating the neural language model distribution for the next token, it keeps only the k most probable candidates and samples from them according to their re-scaled relative probabilities.
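
For comparison with the nucleus sketch above, here is a minimal top-k version over the same made-up distribution; the only difference is that the cut-off is a fixed count k rather than a probability mass p.

```python
import random

def top_k_sample(probs, k=3):
    # Keep the k most probable tokens, regardless of their total mass.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    kept_mass = sum(prob for _, prob in ranked)
    tokens = [tok for tok, _ in ranked]
    weights = [prob / kept_mass for _, prob in ranked]  # re-scale to sum to 1
    return random.choices(tokens, weights=weights, k=1)[0]

probs = {"the": 0.45, "a": 0.30, "cat": 0.15, "dog": 0.07, "zebra": 0.03}
print(top_k_sample(probs, k=3))
```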

  4. Sampling with temperature

Another common approach to sampling-based generation is to shape a probability distribution through temperature.

With temperature \(t\), the softmax is re-estimated as:

\[\mathrm{softmax}(x) = \frac{\exp(x/t)}{\sum_{x'} \exp(x'/t)}\]
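
A small sketch of the temperature-scaled softmax on made-up logits; lower \(t\) sharpens the distribution toward the argmax, higher \(t\) flattens it toward uniform.

```python
import math

def softmax_with_temperature(logits, t=1.0):
    scaled = [x / t for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, t=1.0))  # roughly [0.66, 0.24, 0.10]
print(softmax_with_temperature(logits, t=0.5))  # sharper: more mass on the top logit
```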

When generating a target sentence (for example, in machine translation), a basic form of beam search is used to find the output that maximizes the conditional probability given by the model.

It generally finds higher-probability sequences than greedy search.

If you want to know more about what beam search is, see the following (e.g., a YouTube lecture).

Reference