I think this paper, Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al., NIPS 2013), is the best one for understanding why adding two word vectors works so well for meaningfully inferring the relation between two words.
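As a quick illustration, here is a minimal sketch of that additive behaviour using gensim and a pretrained Google News model. The library, model name, and query words are my choices for illustration, not something taken from the paper's own code.

```python
# Minimal sketch of additive composition with gensim (assumed installed) and the
# pretrained Google News vectors; these tools are illustrative choices, not the
# paper's original code. The model download is large (~1.6 GB) on first use.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

# Adding two vectors: vec("Germany") + vec("capital") tends to land near "Berlin".
print(wv.most_similar(positive=["Germany", "capital"], topn=3))

# The classic offset analogy: vec("king") - vec("man") + vec("woman") is close to vec("queen").
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```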
It is also good for understanding why we have to build phrases out of words. Let's think about the reason.
“Boston Globe” is a newspaper, so its meaning is not a natural combination of the meanings of “Boston” and “Globe”.
These two points are good to keep in mind when we build word embeddings.
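To make phrase building concrete: the paper scores candidate bigrams with score(wi, wj) = (count(wi wj) − δ) / (count(wi) × count(wj)) and merges high-scoring pairs into a single token such as "boston_globe". Below is a small sketch of that score; the toy corpus, the δ value, and any threshold you would pick are my own illustrative assumptions, not values from the paper.

```python
# Toy sketch of the paper's phrase score:
#   score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj))
# The corpus and delta below are made up purely for illustration.
from collections import Counter

corpus = [
    "the boston globe reported the story",
    "she reads the boston globe every morning",
    "the boston globe won an award",
    "boston is a city on the east coast",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

delta = 1  # discounting constant: keeps bigrams seen only once from scoring high

def score(w1, w2):
    return (bigrams[(w1, w2)] - delta) / (unigrams[w1] * unigrams[w2])

print(score("boston", "globe"))  # highest: the two words almost always co-occur
print(score("the", "boston"))    # lower: "the" appears in many other contexts too
print(score("the", "story"))     # 0 after discounting: seen together only once
```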
How well the learned vectors capture the meanings of words depends on the loss function.
I recommend reading this paper.
This paper presents several extensions that improve both the quality of the vectors and the training speed. By subsampling the frequent words, we obtain a significant speedup and also learn more regular word representations.
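The subsampling rule in the paper discards each occurrence of a word w with probability P(w) = 1 − sqrt(t / f(w)), where f(w) is the word's relative frequency and t is a small threshold (around 1e-5 in the paper). Here is a tiny sketch with made-up frequencies to show the effect:

```python
# Sketch of the paper's subsampling rule: drop each occurrence of word w with
# probability P(w) = 1 - sqrt(t / f(w)). The frequencies below are invented
# purely to illustrate how the rule behaves.
import math

t = 1e-5  # threshold; the paper suggests values around 1e-5

def discard_prob(freq):
    """Probability of discarding one occurrence of a word with relative frequency `freq`."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

# Very frequent words are dropped aggressively; rare words are always kept.
for word, freq in [("the", 5e-2), ("learning", 1e-4), ("compositionality", 2e-6)]:
    print(f"{word:17s} f={freq:.0e}  P(discard)={discard_prob(freq):.3f}")
```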
This paper also introduces another loss function, called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada".
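For one (input word, context word) pair with k sampled noise words, the negative-sampling objective maximizes log sigmoid(v'_context · v_input) plus the sum over the k negatives of log sigmoid(−v'_negative · v_input). Below is a minimal numpy sketch of that loss; the embedding tables are random toy values, and for brevity the negatives are drawn uniformly instead of from the unigram distribution raised to the 3/4 power that the paper recommends.

```python
# Minimal numpy sketch of the negative-sampling loss for one training pair.
# The embedding tables are random toy values; the paper draws negatives from the
# unigram distribution raised to the 3/4 power, but this sketch samples uniformly.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k = 1000, 50, 5

W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # input ("center") vectors
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # output ("context") vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center_id, context_id, negative_ids):
    v_in = W_in[center_id]
    # Positive term: raise the score of the true context word.
    loss = -np.log(sigmoid(W_out[context_id] @ v_in))
    # Negative terms: lower the scores of the k sampled noise words.
    for neg_id in negative_ids:
        loss -= np.log(sigmoid(-W_out[neg_id] @ v_in))
    return loss

negatives = rng.integers(0, vocab_size, size=k)  # uniform toy noise sampling
print(neg_sampling_loss(center_id=3, context_id=42, negative_ids=negatives))
```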
If you want to download the data set, visit here.
The paper: Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al., NIPS 2013)
Reference
- Paper
- How to use HTML for alerts
- word2vec tutorial site