Overall, this paper, Efficient Estimation of Word Representations in Vector Space (Mikolov et al., arXiv 2013), compares the computational cost of several model architectures and builds on an extension of the NNLM that splits training into two steps: first the word vectors are trained with a simple model, and then the trained vectors are used to train the NNLM.

There are several well-known approaches to estimating continuous representations of words, including Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).

The authors argue that neural networks perform better than LSA at preserving linear regularities among words, and that LDA becomes computationally very expensive on large data sets.

In this paper, the computational complexity of training is defined as being proportional to

O = E × T × Q

where E is the number of training epochs, T is the number of words in the training set, and Q is defined separately for each model architecture.
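
To make this concrete, here is a minimal Python sketch that plugs the per-architecture Q formulas from the paper into this definition. The vocabulary size, vector dimensionality, and context sizes below are illustrative values I picked, not the paper's exact configuration.

```python
# A minimal sketch comparing the per-word complexity Q for each architecture,
# using the formulas given in the paper. The parameter values are illustrative
# assumptions, not the paper's exact setup.
import math

V = 1_000_000   # vocabulary size
D = 500         # word vector dimensionality
H = 500         # hidden layer size (NNLM only)
N = 10          # context words fed to the NNLM / CBOW
C = 10          # maximum context distance for Skip-gram

# NNLM: projection + hidden layer + full softmax over the vocabulary
q_nnlm = N * D + N * D * H + H * V

# NNLM with hierarchical softmax: the output term shrinks to H * log2(V)
q_nnlm_hs = N * D + N * D * H + H * math.log2(V)

# CBOW with hierarchical softmax: no hidden layer at all
q_cbow = N * D + D * math.log2(V)

# Skip-gram with hierarchical softmax: C output predictions per word
q_skipgram = C * (D + D * math.log2(V))

for name, q in [("NNLM", q_nnlm), ("NNLM + HS", q_nnlm_hs),
                ("CBOW", q_cbow), ("Skip-gram", q_skipgram)]:
    print(f"{name:>12}: Q ~ {q:,.0f}")
```

With numbers in this range, the full-softmax term H × V dwarfs everything else, which is why the output layer is the first target for optimization.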

They point out that the computational cost depends heavily on the number of output units and hidden layer units.

Of the two, the output layer is the more expensive, followed by the hidden layer.

So they use hierarchical softmax with a Huffman tree to reduce the cost of the output layer, but a bottleneck remains in the hidden layer.
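
As an illustration of the Huffman-tree idea (not the authors' actual C implementation), here is a small sketch that builds a Huffman tree over made-up word counts and prints the resulting binary codes. Frequent words get short codes, so the expected number of output-node evaluations per word drops to roughly log2(V) or less.

```python
# A minimal sketch of building a Huffman tree over word frequencies, as used
# for the hierarchical softmax output layer. The toy vocabulary and counts
# below are made up for illustration.
import heapq
import itertools

counts = {"the": 500, "of": 300, "cat": 40, "sat": 30, "zygote": 1}

# Each heap entry: (frequency, tie-breaker, node). Leaf nodes carry a word.
counter = itertools.count()
heap = [(freq, next(counter), word) for word, freq in counts.items()]
heapq.heapify(heap)

# Repeatedly merge the two least frequent nodes until one root remains.
while len(heap) > 1:
    f1, _, left = heapq.heappop(heap)
    f2, _, right = heapq.heappop(heap)
    heapq.heappush(heap, (f1 + f2, next(counter), (left, right)))

root = heap[0][2]

# Assign binary codes by walking the tree; frequent words end up with
# shorter codes, i.e. shorter paths through the output tree.
def codes(node, prefix=""):
    if isinstance(node, str):
        return {node: prefix or "0"}
    left, right = node
    out = codes(left, prefix + "0")
    out.update(codes(right, prefix + "1"))
    return out

for word, code in sorted(codes(root).items(), key=lambda kv: len(kv[1])):
    print(f"{word:>8}: {code}")
```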

Finally, they remove the hidden layer entirely, so the models depend heavily on the efficiency of the softmax normalization.

The resulting models are called CBOW (Continuous Bag-of-Words) and Skip-gram.

[Figure: CBOW and Skip-gram model architectures (Mikolov et al., arXiv 2013)]
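
To make "no hidden layer" concrete, here is a minimal sketch of the CBOW forward pass: the context vectors are simply averaged (the projection layer) and scored against output vectors. The tiny vocabulary, random weights, and full softmax are simplifications for illustration; the actual models use hierarchical softmax and are trained on huge corpora.

```python
# A minimal sketch of the CBOW idea: average the context word vectors and
# predict the center word. Vocabulary, dimensions, and weights are toy values.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
V, D = len(vocab), 8

W_in = rng.normal(scale=0.1, size=(V, D))    # input (word) vectors
W_out = rng.normal(scale=0.1, size=(V, D))   # output vectors

def cbow_predict(context_words):
    # No hidden layer: just average the context vectors (projection layer)
    idx = [vocab.index(w) for w in context_words]
    h = W_in[idx].mean(axis=0)
    # Score every vocabulary word and normalize with a (full) softmax
    scores = W_out @ h
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return dict(zip(vocab, probs))

print(cbow_predict(["the", "sat", "on", "mat"]))  # context around "cat"
```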

Let’s look at an example they used for testing syntactic and semantic questions.

[Table: examples of semantic and syntactic questions in the test set (Mikolov et al., arXiv 2013)]
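
These questions are answered with simple vector arithmetic: for "man : king :: woman : ?", take vec("king") − vec("man") + vec("woman") and look for the nearest word vector. As a sketch of how to try this today, the snippet below assumes the gensim library and its downloadable pretrained word2vec vectors, which are not part of the paper itself.

```python
# A minimal sketch of the analogy test described in the paper, using gensim's
# pretrained word2vec vectors (a large download, not the paper's own code).
import gensim.downloader as api

model = api.load("word2vec-google-news-300")  # pretrained vectors (~1.6 GB)

# Semantic question: "man : king :: woman : ?"
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Syntactic question: "big : biggest :: small : ?"
print(model.most_similar(positive=["biggest", "small"], negative=["big"], topn=1))
```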

Reference

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.