This is a brief study note summarizing the paper Skip-Thought Vectors (Kiros et al., NIPS 2015).
The paper addresses how to represent a sentence as a fixed-size vector using a sequence-to-sequence model built from GRUs.
Their model uses a sentence tuple $(s_{i-1}, s_i, s_{i+1})$. Let $w_i^t$ denote the $t$-th word of sentence $s_i$ and let $x_i^t$ denote its word embedding.
From here, I describe their model in three parts: the encoder, decoder, and objective function.
Encoder. Let $w_i^1, \dots, w_i^N$ be the words in sentence $s_i$, where $N$ is the number of words in the sentence. At each time step, the encoder produces a hidden state $h_i^t$ which can be interpreted as the representation of the sequence $w_i^1, \dots, w_i^t$. The hidden state $h_i^N$ thus represents the full sentence.
To encode a sentence, they iterate the following sequence of equations (dropping the subscript i):
$$
\begin{aligned}
r^t &= \sigma(W_r x^t + U_r h^{t-1}) && (1) \\
z^t &= \sigma(W_z x^t + U_z h^{t-1}) && (2) \\
\bar{h}^t &= \tanh\!\left(W x^t + U (r^t \odot h^{t-1})\right) && (3) \\
h^t &= (1 - z^t) \odot h^{t-1} + z^t \odot \bar{h}^t && (4)
\end{aligned}
$$

where $\bar{h}^t$ is the proposed state update at time $t$, $z^t$ is the update gate, $r^t$ is the reset gate and $\odot$ denotes a component-wise product. Both gates take values between zero and one.
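To make the recurrence concrete, here is a minimal NumPy sketch of one encoder step for equations (1)-(4). The sizes, random initialization and function names are my own toy choices for illustration, not the authors' implementation (which trains much larger models, e.g. 620-d word embeddings and a 2400-d sentence vector for uni-skip).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy sizes chosen for readability only.
d_x, d_h = 8, 16
rng = np.random.default_rng(0)

# Encoder GRU parameters: W_* act on the word embedding x^t, U_* on the previous state h^{t-1}.
W_r, U_r = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
W_z, U_z = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))
W,   U   = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))

def gru_encoder_step(x_t, h_prev):
    """One step of equations (1)-(4)."""
    r = sigmoid(W_r @ x_t + U_r @ h_prev)        # reset gate,   eq. (1)
    z = sigmoid(W_z @ x_t + U_z @ h_prev)        # update gate,  eq. (2)
    h_bar = np.tanh(W @ x_t + U @ (r * h_prev))  # proposed state, eq. (3)
    return (1.0 - z) * h_prev + z * h_bar        # new hidden state, eq. (4)

# Encode a "sentence" of random word embeddings; the final state is the sentence vector.
sentence = [rng.normal(size=d_x) for _ in range(5)]
h = np.zeros(d_h)
for x_t in sentence:
    h = gru_encoder_step(x_t, h)
skip_thought_vector = h  # h^N: the representation of the whole sentence
```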
Decoder. The decoder is a neural language model which conditions on the encoder output $h_i$. The computation is similar to that of the encoder, except that matrices $C_z$, $C_r$ and $C$ are introduced to bias the update gate, reset gate and hidden state computation by the sentence vector. One decoder is used for the next sentence $s_{i+1}$ while a second decoder is used for the previous sentence $s_{i-1}$. Separate parameters are used for each decoder, with the exception of the vocabulary matrix $V$, the weight matrix connecting the decoder's hidden state to a distribution over words.
They describe the decoder for the next sentence $s_{i+1}$, although an analogous computation is used for the previous sentence $s_{i-1}$. Let $h_{i+1}^t$ denote the hidden state of the decoder at time $t$. Decoding involves iterating through the following sequence of equations (dropping the subscript $i+1$):

$$
\begin{aligned}
r^t &= \sigma(W_r^d x^{t-1} + U_r^d h^{t-1} + C_r h_i) && (5) \\
z^t &= \sigma(W_z^d x^{t-1} + U_z^d h^{t-1} + C_z h_i) && (6) \\
\bar{h}^t &= \tanh\!\left(W^d x^{t-1} + U^d (r^t \odot h^{t-1}) + C h_i\right) && (7) \\
h_{i+1}^t &= (1 - z^t) \odot h^{t-1} + z^t \odot \bar{h}^t && (8)
\end{aligned}
$$

Given $h_{i+1}^t$, the probability of word $w_{i+1}^t$ given the previous $t-1$ words and the encoder vector is

$$
P(w_{i+1}^t \mid w_{i+1}^{<t}, h_i) \propto \exp\!\left(v_{w_{i+1}^t} h_{i+1}^t\right) \qquad (9)
$$

where $v_{w_{i+1}^t}$ denotes the row of $V$ corresponding to the word $w_{i+1}^t$. An analogous computation is performed for the previous sentence $s_{i-1}$.
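As a similar toy sketch (my own names and sizes, not the released code), one conditioned decoder step for equations (5)-(8) and the word distribution of equation (9) could look like this; the matrices $C_r$, $C_z$ and $C$ inject the encoder vector $h_i$ into every gate:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy dimensions and a placeholder encoder vector h_i (all names and sizes are assumptions).
d_x, d_h, vocab_size = 8, 16, 50
rng = np.random.default_rng(1)
h_i = rng.normal(size=d_h)  # stand-in for the encoder output (the skip-thought vector)

# Decoder parameters: W^d_*, U^d_* as in a plain GRU, plus C_r, C_z, C, which bias the
# reset gate, update gate and proposed state by the sentence vector h_i.
Wd_r, Ud_r, C_r = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_h))
Wd_z, Ud_z, C_z = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_h))
Wd,   Ud,   C   = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_h))
V = rng.normal(size=(vocab_size, d_h))  # vocabulary matrix shared by both decoders

def decoder_step(x_prev, h_prev):
    """One step of equations (5)-(8), conditioned on h_i at every gate."""
    r = sigmoid(Wd_r @ x_prev + Ud_r @ h_prev + C_r @ h_i)      # eq. (5)
    z = sigmoid(Wd_z @ x_prev + Ud_z @ h_prev + C_z @ h_i)      # eq. (6)
    h_bar = np.tanh(Wd @ x_prev + Ud @ (r * h_prev) + C @ h_i)  # eq. (7)
    return (1.0 - z) * h_prev + z * h_bar                       # eq. (8)

def word_probs(h_t):
    """Equation (9): softmax over the vocabulary; logits are rows of V dotted with h^t."""
    logits = V @ h_t
    e = np.exp(logits - logits.max())
    return e / e.sum()

# One decoding step: embedding of the previous word, zero initial decoder state.
h_t = decoder_step(rng.normal(size=d_x), np.zeros(d_h))
p = word_probs(h_t)  # distribution over the next word of s_{i+1}
```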
Objective. Given a tuple $(s_{i-1}, s_i, s_{i+1})$, the objective optimized is the sum of the log-probabilities for the forward and backward sentences conditioned on the encoder representation:

$$
\sum_t \log P(w_{i+1}^t \mid w_{i+1}^{<t}, h_i) + \sum_t \log P(w_{i-1}^t \mid w_{i-1}^{<t}, h_i) \qquad (10)
$$

The total objective is the above summed over all such training tuples.
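A small self-contained illustration of how equation (10) would be evaluated for a single training tuple; the decoder outputs are replaced by random placeholder distributions, so everything here is hypothetical scaffolding rather than the actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, T = 50, 6  # toy vocabulary size and sentence length (my choices)

def placeholder_decoder_probs(T, vocab_size):
    """Stand-in for the decoder softmax outputs: one distribution over words per time step."""
    logits = rng.normal(size=(T, vocab_size))
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

next_probs = placeholder_decoder_probs(T, vocab_size)  # P(w_{i+1}^t | w_{i+1}^{<t}, h_i)
prev_probs = placeholder_decoder_probs(T, vocab_size)  # P(w_{i-1}^t | w_{i-1}^{<t}, h_i)
next_words = rng.integers(vocab_size, size=T)          # observed words of s_{i+1}
prev_words = rng.integers(vocab_size, size=T)          # observed words of s_{i-1}

# Equation (10): sum of log-probabilities of the forward and backward sentences.
objective = (np.log(next_probs[np.arange(T), next_words]).sum()
             + np.log(prev_probs[np.arange(T), prev_words]).sum())
# Training maximizes this quantity summed over all (s_{i-1}, s_i, s_{i+1}) tuples.
```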
- The evaluation tasks:
  - Semantic relatedness on SemEval 2014 Task 1 (Marelli et al., 2014).
  - Paraphrase detection on the Microsoft Research Paraphrase Corpus (Dolan et al., 2004).
  - Image-sentence ranking on the Microsoft COCO dataset (Lin et al., 2014), covering two tasks: image annotation and image search.
  - Classification benchmarks on 5 datasets: movie review sentiment (MR), customer product reviews (CR), subjectivity/objectivity classification (SUBJ), opinion polarity (MPQA) and question-type classification (TREC).
Reference
- Paper: Skip-Thought Vectors (Kiros et al., NIPS 2015)