This is a brief summary of the paper Learning Distributed Representations of Sentences from Unlabelled Data (Hill et al., NAACL 2016), which I read and studied, written to help me organise the material.

They propose two approaches for representing a sentence as a fixed-length vector.

  • Sequential Denoising Autoencoders:

In a Denoising Autoencoder (DAE), high-dimensional input data is corrupted according to some noise function, and the model is trained to recover the original data from the corrupted version.

The original DAEs were feedforward nets applied to (image) data of fixed size. Here, they adapt the approach to variable-length sentences by means of a noise function \(N(S|p_o, p_x)\), determined by free parameters \(p_o, p_x \in [0, 1]\). First, for each word \(w\) in \(S\), \(N\) deletes \(w\) with (independent) probability \(p_o\).
Then, for each non-overlapping bigram \(w_i w_{i+1}\) in \(S\), \(N\) swaps \(w_i\) and \(w_{i+1}\) with probability \(p_x\).
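
As a concrete illustration of the noise function, here is a minimal Python sketch (function and parameter names are my own, not from the paper; the swap step is applied to the words remaining after deletion, which is one reasonable reading of the order of operations):

```python
import random

def noise(sentence, p_o, p_x, rng=random):
    """Corrupt a tokenised sentence S: delete each word with probability p_o,
    then swap each non-overlapping bigram with probability p_x."""
    # Step 1: delete each word independently with probability p_o.
    kept = [w for w in sentence if rng.random() >= p_o]
    # Step 2: walk over non-overlapping bigrams and swap each with probability p_x.
    out = list(kept)
    for i in range(0, len(out) - 1, 2):
        if rng.random() < p_x:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

# Example: noise("the cat sat on the mat".split(), p_o=0.1, p_x=0.1)
```
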
They then train the same LSTM-based encoder-decoder architecture as used in NMT, but with a denoising objective: predict (as target) the original source sentence \(S\) given a corrupted version \(N(S|p_o, p_x)\) (as source).
The trained model can then encode novel word sequences into distributed representations.
They call this model the Sequential Denoising Autoencoder (SDAE). Note that, unlike SkipThought, SDAEs can be trained on sets of sentences in arbitrary order.
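
To make the training setup concrete, below is a minimal PyTorch sketch of an LSTM encoder-decoder trained with this denoising objective. It is only an illustration of the idea under my own assumptions: the architecture details, vocabulary handling, and hyperparameters of the paper's actual SDAE differ, and all names here are illustrative.

```python
import torch
import torch.nn as nn

class SDAE(nn.Module):
    """Minimal sequential denoising autoencoder: encode a corrupted sentence,
    decode the original sentence from the encoder's final state."""

    def __init__(self, vocab_size, emb_dim=300, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def encode(self, corrupted_ids):
        # The encoder's final hidden/cell state serves as the sentence representation.
        _, state = self.encoder(self.embed(corrupted_ids))
        return state

    def forward(self, corrupted_ids, original_ids):
        # Teacher forcing: feed the (shifted) original sentence to the decoder,
        # conditioned on the representation of the corrupted sentence.
        state = self.encode(corrupted_ids)
        dec_out, _ = self.decoder(self.embed(original_ids[:, :-1]), state)
        return self.out(dec_out)

# Usage sketch (assumes original_ids starts with a BOS token):
# model = SDAE(vocab_size=50_000)
# logits = model(corrupted_ids, original_ids)              # (batch, len-1, vocab)
# loss = nn.CrossEntropyLoss()(logits.reshape(-1, logits.size(-1)),
#                              original_ids[:, 1:].reshape(-1))
# loss.backward()
```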

  • FastSent:

FastSent is a simple additive (log-bilinear) sentence model designed to exploit the same signal as SkipThought, namely the content of adjacent sentences, but at much lower computational expense.
Given a bag-of-words (BOW) representation of some sentence in context, the model simply predicts the adjacent sentences (also represented as BOW).

FastSent:

  • More formally, FastSent learns a source \(u_w\) and target \(v_w\) embedding for each word in the model vocabulary. For a training example \(S_{i-1}, S_i, S_{i+1}\) of consecutive sentences, \(S_i\) is represented as the sum of its source embeddings \(s_i = \sum_{w \in S_i} u_w\). The cost of the example is then simply:
\[\sum_{w \in S_{i-1} \cup S_{i+1}} \phi(s_i, v_w)\]
where \(\phi(v_1, v_2)\) is the softmax function.

They also experiment with a variant (+AE) in which the encoded (source) representation must predict its own words as targets, in addition to those of adjacent sentences (see the code sketch after the formulas below).

FastSent+AE:

\[\sum_{w \in S_{i-1} \cup S_i \cup S_{i+1}} \phi(s_i, v_w)\]
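
As a concrete (and simplified) illustration, here is a minimal NumPy sketch of these costs for a single training example. The paper defines \(\phi\) as the softmax function; the sketch computes the corresponding negative log-likelihood over the target words, which is the standard way to train such a model. All function and variable names are my own.

```python
import numpy as np

def fastsent_cost(U, V, middle_ids, context_ids, include_middle=False):
    """Negative log-likelihood cost for one training example (S_{i-1}, S_i, S_{i+1}).

    U              : (vocab, dim) array of source embeddings u_w
    V              : (vocab, dim) array of target embeddings v_w
    middle_ids     : word indices of the middle sentence S_i
    context_ids    : word indices of S_{i-1} and S_{i+1} combined
    include_middle : if True, also predict the words of S_i (the +AE variant)
    """
    # BOW representation of S_i: sum of the source embeddings of its words.
    s_i = U[middle_ids].sum(axis=0)
    # Score every word in the vocabulary against the sentence vector.
    scores = V @ s_i
    # Log-softmax over the vocabulary (numerically stable).
    m = scores.max()
    log_probs = scores - (m + np.log(np.exp(scores - m).sum()))
    targets = (np.concatenate([context_ids, middle_ids])
               if include_middle else np.asarray(context_ids))
    return -log_probs[targets].sum()
```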

Reference

Hill, F., Cho, K., and Korhonen, A. (2016). Learning Distributed Representations of Sentences from Unlabelled Data. In Proceedings of NAACL-HLT 2016.