This is a brief summary of the paper Learning Distributed Representations of Sentences from Unlabelled Data (Hill et al., NAACL 2016), which I read and studied, written to help me organise the material.

They propose two approaches for representing a sentence as a fixed-length vector.

  • Sequential Denoising Autoencoders:

In a Denoising Autoencoder (DAE), high-dimensional input data is corrupted according to some noise function, and the model is trained to recover the original data from the corrupted version.

The original DAEs were feedforward nets applied to (image) data of fixed size. Here, they adapt the approach to variable-length sentences by means of a noise function \(N(S|p_o, p_x)\), determined by free parameters \(p_o, p_x \in [0, 1]\). First, for each word \(w\) in \(S\), \(N\) deletes \(w\) with (independent) probability \(p_o\).
Then, for each non-overlapping bigram \(w_i w_{i+1}\) in \(S\), \(N\) swaps \(w_i\) and \(w_{i+1}\) with probability \(p_x\).
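
As a concrete illustration of the noise function, here is a minimal Python sketch (function and parameter names are my own, not from the paper; the swap step is applied to the words remaining after deletion, which is one reasonable reading of the order of operations):

```python
import random

def noise(sentence, p_o, p_x, rng=random):
    """Corrupt a tokenised sentence S: delete each word with probability p_o,
    then swap each non-overlapping bigram with probability p_x."""
    # Step 1: delete each word independently with probability p_o.
    kept = [w for w in sentence if rng.random() >= p_o]
    # Step 2: walk over non-overlapping bigrams and swap each with probability p_x.
    out = list(kept)
    for i in range(0, len(out) - 1, 2):
        if rng.random() < p_x:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

# Example: noise("the cat sat on the mat".split(), p_o=0.1, p_x=0.1)
```
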
They then train the same LSTM-based encoder-decoder architecture as used in NMT, but with a denoising objective: predict (as target) the original source sentence \(S\) given a corrupted version \(N(S|p_o, p_x)\) (as source).
The trained model can then encode novel word sequences into distributed representations.
They call this model the Sequential Denoising Autoencoder (SDAE). Note that, unlike SkipThought, SDAEs can be trained on sets of sentences in arbitrary order.
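
To make the training setup concrete, below is a minimal PyTorch sketch of an LSTM encoder-decoder trained with this denoising objective. It is only an illustration of the idea under my own assumptions: the architecture details, vocabulary handling, and hyperparameters of the paper's actual SDAE differ, and all names here are illustrative.

```python
import torch
import torch.nn as nn

class SDAE(nn.Module):
    """Minimal sequential denoising autoencoder: encode a corrupted sentence,
    decode the original sentence from the encoder's final state."""

    def __init__(self, vocab_size, emb_dim=300, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def encode(self, corrupted_ids):
        # The encoder's final hidden/cell state serves as the sentence representation.
        _, state = self.encoder(self.embed(corrupted_ids))
        return state

    def forward(self, corrupted_ids, original_ids):
        # Teacher forcing: feed the (shifted) original sentence to the decoder,
        # conditioned on the representation of the corrupted sentence.
        state = self.encode(corrupted_ids)
        dec_out, _ = self.decoder(self.embed(original_ids[:, :-1]), state)
        return self.out(dec_out)

# Usage sketch (assumes original_ids starts with a BOS token):
# model = SDAE(vocab_size=50_000)
# logits = model(corrupted_ids, original_ids)              # (batch, len-1, vocab)
# loss = nn.CrossEntropyLoss()(logits.reshape(-1, logits.size(-1)),
#                              original_ids[:, 1:].reshape(-1))
# loss.backward()
```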

  • FastSent:

FastSent is a simple additive (log-bilinear) sentence model designed to exploit the same signal as SkipThought, namely the content of adjacent sentences, but at much lower computational expense.
Given a bag-of-words (BOW) representation of some sentence in context, the model simply predicts the adjacent sentences (also represented as BOW).

FastSent:

  • More formally, FastSent learns a source \(u_w\) and target \(v_w\) embedding for each word in the model vocabulary. For a training example \(S_{i-1}, S_i, S_{i+1}\) of consecutive sentences, \(S_i\) is represented as the sum of its source embeddings \(s_i = \sum_{w \in S_i} u_w\). The cost of the example is then simply:
\[\sum_{w \in S_{i-1} \cup S_{i+1}} \phi(s_i, v_w)\]
where \(\phi(v_1, v_2)\) is the softmax function.

They also experiment with a variant (+AE) in which the encoded (source) representation must predict its own words as targets, in addition to those of adjacent sentences (see the code sketch after the formulas below).

FastSent+AE:

\[\sum_{w \in S_{i-1} \cup S_i \cup S_{i+1}} \phi(s_i, v_w)\]
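
As a concrete (and simplified) illustration, here is a minimal NumPy sketch of these costs for a single training example. The paper defines \(\phi\) as the softmax function; the sketch computes the corresponding negative log-likelihood over the target words, which is the standard way to train such a model. All function and variable names are my own.

```python
import numpy as np

def fastsent_cost(U, V, middle_ids, context_ids, include_middle=False):
    """Negative log-likelihood cost for one training example (S_{i-1}, S_i, S_{i+1}).

    U              : (vocab, dim) array of source embeddings u_w
    V              : (vocab, dim) array of target embeddings v_w
    middle_ids     : word indices of the middle sentence S_i
    context_ids    : word indices of S_{i-1} and S_{i+1} combined
    include_middle : if True, also predict the words of S_i (the +AE variant)
    """
    # BOW representation of S_i: sum of the source embeddings of its words.
    s_i = U[middle_ids].sum(axis=0)
    # Score every word in the vocabulary against the sentence vector.
    scores = V @ s_i
    # Log-softmax over the vocabulary (numerically stable).
    m = scores.max()
    log_probs = scores - (m + np.log(np.exp(scores - m).sum()))
    targets = (np.concatenate([context_ids, middle_ids])
               if include_middle else np.asarray(context_ids))
    return -log_probs[targets].sum()
```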

Reference

Hill, F., Cho, K., and Korhonen, A. (2016). Learning Distributed Representations of Sentences from Unlabelled Data. In Proceedings of NAACL-HLT 2016.