This is a brief summary note, for my own reference, of the paper The Curious Case of Neural Text Degeneration (Holtzman et al., ICLR 2020).

Despite considerable advances in neural language modeling, which have led to high-quality models for a range of language understanding tasks, generating text from these models remains problematic.

The authors point out problems with current text generation that relies on maximization-based decoding.

Text generated with maximization-based decoding methods such as beam search tends to be bland and incoherent, and gets stuck in repetitive loops.

So they propose a new decoding strategy, which they call Nucleus Sampling.

Let’s first go over some basic concepts of text generation.

A number of recent works have alluded to the disadvantages of maximization-based generation, which tends to produce output with high grammaticality but low diversity.

Generative Adversarial Networks have been a prominent research direction, but they fall behind language models when generation quality and diversity are considered jointly.

Meanwhile, another line of work tweaks the beam search scoring function to encourage task-specific diversity.

They also split the field of generation into two settings: open-ended and directed generation.

Many text generation tasks are defined through (input, output) pairs, such that the output is a constrained transformation of the input.

Example applications include machine translation, data-to-text generation, and summarization; the authors refer to these as directed generation.

The other setting, open-ended generation, includes conditional story generation and contextual text continuation. They place goal-oriented dialog somewhere between open-ended and directed generation.

From here on, let’s look at the decoding strategies they describe.

  1. Maximization-based decoding

The most commonly used decoding objective, in particular for directed generation, is maximization-based decoding.
Assuming that the model assigns higher probability to higher quality text, these decoding strategies search for the continuation with the highest likelihood.
Since finding the optimum argmax sequence from recurrent neural language models or Transformers is not tractable,
common practice is to use beam search.
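
As a rough illustration, here is a minimal beam-search sketch over a toy next-token model. The function `toy_next_token_probs` is a hypothetical stand-in for a real neural LM (its probabilities are made up); only the search procedure itself is the point.

```python
import math

def toy_next_token_probs(prefix):
    # Hypothetical 4-token vocabulary with fixed probabilities;
    # a real language model would condition on `prefix`.
    return {"the": 0.5, "cat": 0.3, "sat": 0.15, "<eos>": 0.05}

def beam_search(beam_width=2, max_len=5):
    # Each hypothesis is a (log-probability, token list) pair.
    beams = [(0.0, [])]
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq and seq[-1] == "<eos>":
                candidates.append((logp, seq))  # keep finished hypotheses as-is
                continue
            for tok, p in toy_next_token_probs(seq).items():
                candidates.append((logp + math.log(p), seq + [tok]))
        # Prune to the `beam_width` highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams

print(beam_search())
```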

  2. Nucleus sampling

The authors propose a new stochastic decoding method: Nucleus Sampling. The key idea is to use the shape of the probability distribution to determine the set of tokens to be sampled from.

Specifically, in order to sample the next token from a truncated neural language model distribution, Nucleus Sampling truncates the distribution at a threshold value \(p\) and uses the remaining tokens as the prediction zone.

Given a distribution \(P(x_{i}|x_{1:i-1})\), they define the top-\(p\) vocabulary \(V^{(p)} \subset V\) as the smallest set to be sampled from.

\(V^{(p)}\) is the smallest set, determined by the threshold value \(p\), such that:

\[\sum_{x \in V^{(p)}} P(x_{i}|x_{1:i-1}) \geq p\]

In order to sample from the prediction zone defined by the threshold \(p\), they rescale the truncated distribution into \(P'(x_{i}|x_{1:i-1})\) as follows:

First of all, let \(p' = \sum_{x \in V^{(p)}} P(x_{i}|x_{1:i-1})\).

\[P^{'}(x_{i}|x_{1:i-1}) = \begin{cases} \frac{P^{'}(x_{i}|x_{1:i-1})}{p^{'}}, & \text{if x } \in V^{(p)} \\ 0, & \text{otherwise} \end{cases}\]
  3. Top-k sampling

Nucleus Sampling and top-k both sample from truncated Neural LM distributions, differing only in the strategy of where to truncate.
Choosing where to truncate can be interpreted as determining the generative model’s trustworthy prediction zone.

Top-k sampling is similar to Nucleus Sampling, but when truncating the neural language model distribution for the next token, it keeps only the k most probable candidates and samples from them according to their re-scaled relative probabilities.
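
For comparison with the nucleus sketch above, here is a minimal top-k version over the same made-up distribution; the only difference is that the cut-off is a fixed count k rather than a probability mass p.

```python
import random

def top_k_sample(probs, k=3):
    # Keep the k most probable tokens, regardless of their total mass.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    kept_mass = sum(prob for _, prob in ranked)
    tokens = [tok for tok, _ in ranked]
    weights = [prob / kept_mass for _, prob in ranked]  # re-scale to sum to 1
    return random.choices(tokens, weights=weights, k=1)[0]

probs = {"the": 0.45, "a": 0.30, "cat": 0.15, "dog": 0.07, "zebra": 0.03}
print(top_k_sample(probs, k=3))
```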

  4. Sampling with temperature

Another common approach to sampling-based generation is to shape a probability distribution through temperature.

With temperature \(t\), the softmax is re-estimated as:

\[\mathrm{softmax}(x) = \frac{\exp(x/t)}{\sum_{x'} \exp(x'/t)}\]
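
A small sketch of the temperature-scaled softmax on made-up logits; lower \(t\) sharpens the distribution toward the argmax, higher \(t\) flattens it toward uniform.

```python
import math

def softmax_with_temperature(logits, t=1.0):
    scaled = [x / t for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, t=1.0))  # roughly [0.66, 0.24, 0.10]
print(softmax_with_temperature(logits, t=0.5))  # sharper: more mass on the top logit
```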

When generating a target sentence (for example, in machine translation), a basic form of beam search is used to find the output that maximizes the conditional probability given by the model.

It generally finds higher-probability sequences than greedy search.

If you want to know more about what beam search is, see the following (e.g., a YouTube lecture).

Reference