I think this paper, Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al., NIPS 2013), is the best one to read to understand why adding two word vectors can meaningfully represent the combined meaning of two words. The paper calls this additive compositionality: for example, vec(“Russian”) + vec(“river”) lands close to vec(“Volga River”).
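As a quick taste of that additive behaviour, here is a minimal sketch using gensim’s pretrained Google News vectors. The gensim calls are real, but the download is large (over a gigabyte), and the exact neighbours you get may differ from the examples in the paper.

```python
# A minimal sketch of additive compositionality with pretrained
# word2vec vectors (Google News, loaded via gensim's downloader).
import gensim.downloader as api

kv = api.load("word2vec-google-news-300")  # returns KeyedVectors

# vec("Russian") + vec("river") should land near Russian river
# names; the paper reports "Volga River" as a nearest neighbour.
for word, score in kv.most_similar(positive=["Russian", "river"], topn=5):
    print(f"{word}\t{score:.3f}")
```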

It is also a good paper for understanding why we have to build phrases out of words before training. Let’s think about the reason.

“Boston Globe” is a newspaper, so its meaning is not a natural combination of the meanings of “Boston” and “Globe”. A pair like this deserves its own token, and therefore its own vector.

These two observations, additive compositionality and non-compositional phrases, are worth keeping in mind whenever we build word embeddings. For the phrase side, the paper finds pairs like “Boston Globe” automatically by scoring bigrams, as sketched below.
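The scoring rule itself is simple: score(w1, w2) = (count(w1 w2) - δ) / (count(w1) × count(w2)), where the discount δ keeps very rare pairs from scoring high. Bigrams above a threshold are merged into single tokens such as Boston_Globe. Here is a minimal sketch of that rule; the function name and the default values are my own choices rather than anything fixed by the paper.

```python
# A minimal sketch of the paper's bigram-scoring rule for phrase
# detection. Helper names and defaults are my own choices; only the
# score formula itself comes from the paper.
from collections import Counter

def find_phrases(sentences, delta=5, threshold=100.0):
    """Return the set of bigrams that should be merged into phrases."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())
    merged = set()
    for (w1, w2), n in bigrams.items():
        # score(w1, w2) = (count(w1 w2) - delta) / (count(w1) * count(w2)),
        # scaled by the corpus size so the threshold is corpus-independent
        score = (n - delta) * total / (unigrams[w1] * unigrams[w2])
        if score > threshold:
            merged.add((w1, w2))
    return merged

# With enough news text, ("Boston", "Globe") scores high and becomes
# the single token "Boston_Globe".
```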

How well the learned vectors capture word meanings also depends on the training objective. The paper replaces the expensive full softmax with negative sampling, which only asks the model to tell an observed (center, context) pair apart from a few randomly drawn noise words.
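Concretely, for one (center, context) pair the negative-sampling objective maximizes log σ(u_context · v_center) plus the sum over k noise words of log σ(-u_noise · v_center), with the noise words drawn from the unigram distribution raised to the 3/4 power. Below is a minimal numpy sketch of that loss; the array names and shapes are my own choices.

```python
# A minimal numpy sketch of the negative-sampling loss for a single
# (center, context) pair. Array names and shapes are my own choices.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_center, u_context, u_noise):
    """v_center:  (d,)   input vector of the center word
    u_context: (d,)   output vector of the observed context word
    u_noise:   (k, d) output vectors of k sampled noise words"""
    pos = np.log(sigmoid(u_context @ v_center))       # pull the real pair together
    neg = np.log(sigmoid(-u_noise @ v_center)).sum()  # push noise words away
    return -(pos + neg)  # minimize the negative log-likelihood
```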

I recommend reading this paper.

Reference

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems 26 (NIPS 2013).