This is a brief study note summarizing the paper Skip-Thought Vectors (Kiros et al., NIPS 2015).

The paper is about how to represent a sentence as a fixed-size vector using a sequence-to-sequence model built from GRUs.


Their model takes a tuple of contiguous sentences $(s_{i-1}, s_i, s_{i+1})$. Let $w_i^t$ denote the $t$-th word of sentence $s_i$ and let $x_i^t$ denote its word embedding.
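
To make the notation concrete, here is a minimal sketch (mine, not from the paper) of a sentence tuple stored as lists of word ids and the embedding lookup that produces $x_i^t$. The toy vocabulary, sentences, and embedding size are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and word-embedding matrix (both invented for illustration).
vocab = {"<eos>": 0, "it": 1, "rained": 2, "i": 3, "went": 4, "home": 5, "then": 6, "slept": 7}
emb_dim = 8
E = rng.normal(scale=0.1, size=(len(vocab), emb_dim))

# A tuple of contiguous sentences (s_{i-1}, s_i, s_{i+1}) as word ids.
s_prev = [vocab[w] for w in ["it", "rained", "<eos>"]]
s_curr = [vocab[w] for w in ["i", "went", "home", "<eos>"]]
s_next = [vocab[w] for w in ["then", "i", "slept", "<eos>"]]

# x_i^t for t = 1..N: one embedding row per word of s_i, shape (N, emb_dim).
x_curr = E[s_curr]
```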

From here, I describe their model in three parts: the encoder, decoder, and objective function.

Encoder. Let $w_i^1, \dots, w_i^N$ be the words in sentence $s_i$, where $N$ is the number of words in the sentence. At each time step, the encoder produces a hidden state $h_i^t$ which can be interpreted as the representation of the sequence $w_i^1, \dots, w_i^t$. The hidden state $h_i^N$ thus represents the full sentence.

To encode a sentence, they iterate the following sequence of equations (dropping the subscript i):

$$r^t = \sigma(W_r x^t + U_r h^{t-1}) \tag{1}$$
$$z^t = \sigma(W_z x^t + U_z h^{t-1}) \tag{2}$$
$$\bar{h}^t = \tanh(W x^t + U (r^t \odot h^{t-1})) \tag{3}$$
$$h^t = (1 - z^t) \odot h^{t-1} + z^t \odot \bar{h}^t \tag{4}$$

where $\bar{h}^t$ is the proposed state update at time $t$, $z^t$ is the update gate, $r^t$ is the reset gate, and $\odot$ denotes a component-wise product. Both gates take values between zero and one.
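
As a concrete reading of eqs. (1)–(4), here is a minimal NumPy sketch of one encoder step and the loop that produces the sentence vector $h^N$. The parameter shapes, the initialization, and the helper names (`gru_encoder_step`, `gru_encode`) are my own assumptions, not details from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_encoder_params(emb_dim, hidden_dim, rng):
    # One (hidden_dim x emb_dim) matrix per W_* and one (hidden_dim x hidden_dim) per U_*.
    return {
        "Wr": rng.normal(scale=0.1, size=(hidden_dim, emb_dim)),
        "Ur": rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
        "Wz": rng.normal(scale=0.1, size=(hidden_dim, emb_dim)),
        "Uz": rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
        "W":  rng.normal(scale=0.1, size=(hidden_dim, emb_dim)),
        "U":  rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
    }

def gru_encoder_step(x_t, h_prev, p):
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)           # reset gate, eq. (1)
    z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)           # update gate, eq. (2)
    h_bar = np.tanh(p["W"] @ x_t + p["U"] @ (r_t * h_prev))   # proposed state, eq. (3)
    return (1.0 - z_t) * h_prev + z_t * h_bar                 # new hidden state, eq. (4)

def gru_encode(xs, p, hidden_dim):
    """Run the encoder over embeddings x^1..x^N; the final state h^N is the sentence vector."""
    h = np.zeros(hidden_dim)
    for x_t in xs:
        h = gru_encoder_step(x_t, h, p)
    return h
```

With the toy embeddings above, `h_i = gru_encode(x_curr, init_encoder_params(8, 16, rng), 16)` would give a 16-dimensional sentence vector; the paper trains far larger encoders (e.g. 2400-dimensional states for the uni-skip model).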

Decoder. The decoder is a neural language model which conditions on the encoder output $h_i$. The computation is similar to that of the encoder, except that matrices $C_z$, $C_r$, and $C$ are introduced to bias the update gate, reset gate, and hidden state computation by the sentence vector. One decoder is used for the next sentence $s_{i+1}$ while a second decoder is used for the previous sentence $s_{i-1}$. Separate parameters are used for each decoder, with the exception of the vocabulary matrix $V$, which is the weight matrix connecting the decoder's hidden state to a distribution over words.

They describe the decoder for the next sentence $s_{i+1}$, although an analogous computation is used for the previous sentence $s_{i-1}$. Let $h_{i+1}^t$ denote the hidden state of the decoder at time $t$. Decoding involves iterating through the following sequence of equations (dropping the subscript $i+1$):

$$r^t = \sigma(W_r^d x^{t-1} + U_r^d h^{t-1} + C_r h_i) \tag{5}$$
$$z^t = \sigma(W_z^d x^{t-1} + U_z^d h^{t-1} + C_z h_i) \tag{6}$$
$$\bar{h}^t = \tanh(W^d x^{t-1} + U^d (r^t \odot h^{t-1}) + C h_i) \tag{7}$$
$$h_{i+1}^t = (1 - z^t) \odot h^{t-1} + z^t \odot \bar{h}^t \tag{8}$$
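
A matching sketch of one decoder step, eqs. (5)–(8): the same gated update as the encoder, but biased by the sentence vector $h_i$ through $C_r$, $C_z$, and $C$, and driven by the embedding of the previous target word. The dictionary keys and function names are mine, and the snippet reuses `numpy` and `sigmoid` from the encoder sketch above.

```python
def init_decoder_params(emb_dim, hidden_dim, rng):
    # W^d_*: (hidden, emb); U^d_* and C_*: (hidden, hidden), since h_i has hidden_dim entries.
    def m(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))
    return {"Wdr": m(hidden_dim, emb_dim), "Udr": m(hidden_dim, hidden_dim), "Cr": m(hidden_dim, hidden_dim),
            "Wdz": m(hidden_dim, emb_dim), "Udz": m(hidden_dim, hidden_dim), "Cz": m(hidden_dim, hidden_dim),
            "Wd":  m(hidden_dim, emb_dim), "Ud":  m(hidden_dim, hidden_dim), "C":   m(hidden_dim, hidden_dim)}

def gru_decoder_step(x_prev, h_prev, h_i, p):
    """One decoder step; x_prev is the embedding of the previous target word (x^{t-1})."""
    r_t = sigmoid(p["Wdr"] @ x_prev + p["Udr"] @ h_prev + p["Cr"] @ h_i)          # eq. (5)
    z_t = sigmoid(p["Wdz"] @ x_prev + p["Udz"] @ h_prev + p["Cz"] @ h_i)          # eq. (6)
    h_bar = np.tanh(p["Wd"] @ x_prev + p["Ud"] @ (r_t * h_prev) + p["C"] @ h_i)   # eq. (7)
    return (1.0 - z_t) * h_prev + z_t * h_bar                                     # eq. (8)
```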

Given $h_{i+1}^t$, the probability of word $w_{i+1}^t$ given the previous $t-1$ words and the encoder vector is

$$P(w_{i+1}^t \mid w_{i+1}^{<t}, h_i) \propto \exp\big(v_{w_{i+1}^t} h_{i+1}^t\big) \tag{9}$$

where $v_{w_{i+1}^t}$ denotes the row of $V$ corresponding to the word $w_{i+1}^t$. An analogous computation is performed for the previous sentence $s_{i-1}$.
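
In other words, the unnormalized score of each candidate word is the dot product between its row of the shared vocabulary matrix $V$ and the current decoder state, and a softmax over all scores gives the distribution. A minimal sketch, assuming $V$ is stored as a `(vocab_size, hidden_dim)` array:

```python
def word_distribution(V, h_t):
    """P(w | w^{<t}, h_i) over the whole vocabulary, eq. (9)."""
    scores = V @ h_t               # one score v_w . h_t per word
    scores -= scores.max()         # subtract the max for numerical stability
    p = np.exp(scores)
    return p / p.sum()
```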

Objective. Given a tuple $(s_{i-1}, s_i, s_{i+1})$, the objective optimized is the sum of the log-probabilities for the forward and backward sentences conditioned on the encoder representation:

$$\sum_t \log P(w_{i+1}^t \mid w_{i+1}^{<t}, h_i) + \sum_t \log P(w_{i-1}^t \mid w_{i-1}^{<t}, h_i) \tag{10}$$

The total objective is the above summed over all such training tuples.
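
Putting the pieces together, here is a sketch of the per-tuple objective in eq. (10): each decoder is unrolled with teacher forcing over its target sentence and the log-probabilities of the correct words are summed. It takes the sentence vector `h_i` produced by `gru_encode` and reuses `gru_decoder_step` and `word_distribution` from the sketches above; feeding a zero embedding before the first word is my simplification, not a detail from the paper.

```python
def decoder_log_prob(target_ids, E, V, h_i, p, hidden_dim):
    """Sum of log P(w^t | w^{<t}, h_i) over one target sentence (teacher forcing)."""
    h = np.zeros(hidden_dim)
    x = np.zeros(E.shape[1])          # no previous word before t = 1 (simplification)
    log_p = 0.0
    for w_id in target_ids:
        h = gru_decoder_step(x, h, h_i, p)
        log_p += np.log(word_distribution(V, h)[w_id])
        x = E[w_id]                   # the true word becomes the next input
    return log_p

def tuple_objective(s_prev_ids, s_next_ids, E, V, h_i, fwd_p, bwd_p, hidden_dim):
    """Eq. (10): next-sentence (forward) plus previous-sentence (backward) log-likelihood."""
    return (decoder_log_prob(s_next_ids, E, V, h_i, fwd_p, hidden_dim)
            + decoder_log_prob(s_prev_ids, E, V, h_i, bwd_p, hidden_dim))
```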

Reference

Kiros et al., "Skip-Thought Vectors," NIPS 2015.