This is a brief summary, written for my own study and organization, of the paper From Word Embeddings to Document Distances (Kusner et al., ICML 2015), which I read and studied.

Word Mover's Distance (WMD) uses word embeddings to measure the distance between two documents, so it can compare them even when they share no words at all. The underlying assumption is that similar words have similar vectors, so distances between word vectors reflect semantic similarity.

Figure captured from the Kusner et al. publication.
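As a quick illustration (a minimal sketch of my own, not code from the paper): gensim's KeyedVectors exposes a wmdistance method, so any pre-trained embedding model can be used to compute WMD between two token lists directly. The model name below is just one possible choice, and the sentence pair is the paper's example after the preprocessing described next.

```python
# Minimal sketch using gensim (assumed setup; the model name is illustrative).
# Note: wmdistance needs the POT package (pyemd in older gensim versions).
import gensim.downloader as api

model = api.load("word2vec-google-news-300")  # any pre-trained KeyedVectors works

doc1 = "obama speaks media illinois".split()
doc2 = "president greets press chicago".split()

# The documents share no words, yet the distance is small because each word
# in doc1 has a nearby counterpart in doc2 (obama~president, media~press, ...).
print(model.wmdistance(doc1, doc2))
```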

First of all, lower-casing and removing stop words are essential preprocessing steps: they reduce the size of the problem and keep uninformative words from distorting the distance. Applied to the paper's example pair, this yields:

Sentence 1: obama speaks media illinois
Sentence 2: president greets press chicago
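A tiny sketch of that preprocessing step (the stop-word set here is a made-up minimal one; in practice a standard list such as NLTK's would be used):

```python
# Minimal preprocessing sketch: lowercase and drop stop words.
# The stop-word set below is a tiny illustrative list, not a standard one.
STOP_WORDS = {"to", "the", "in", "a", "an", "of", "and"}

def preprocess(sentence):
    """Lowercase, split on whitespace, and remove stop words."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

print(preprocess("Obama speaks to the media in Illinois"))
# ['obama', 'speaks', 'media', 'illinois']
print(preprocess("The President greets the press in Chicago"))
# ['president', 'greets', 'press', 'chicago']
```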

Next, retrieve a vector for each remaining word from a pre-trained word embedding model; GloVe, word2vec, fastText, or custom vectors all work. Each document is then represented as a normalized bag-of-words (nBOW) vector, where a word's weight is its count divided by the total number of words in the document. The assumption is that a word appearing more often is more important to that document.
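A small sketch of the nBOW weighting (my own illustration of the definition, not code from the paper):

```python
from collections import Counter

def nbow(tokens):
    """Normalized bag-of-words: weight of a word = its count / total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(nbow(["obama", "speaks", "media", "illinois"]))
# {'obama': 0.25, 'speaks': 0.25, 'media': 0.25, 'illinois': 0.25}
print(nbow(["press", "press", "chicago"]))
# {'press': 0.666..., 'chicago': 0.333...}
```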

Figure captured from the Kusner et al. publication.
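To make the "distance" concrete: WMD is the minimum cumulative cost of moving all of the nBOW weight of one document onto the other, where moving weight from word i to word j costs the Euclidean distance between their embeddings. Below is a from-scratch sketch of that linear program using scipy. The 2-D toy vectors are made up purely for illustration (real usage would load pre-trained vectors), and gensim's KeyedVectors.wmdistance offers an off-the-shelf implementation of the same computation.

```python
import numpy as np
from scipy.optimize import linprog

def wmd(doc1, doc2, vectors):
    """Word Mover's Distance: minimize sum_ij T_ij * cost(i, j) subject to
    the nBOW weights of each document acting as flow constraints."""
    words1, words2 = sorted(set(doc1)), sorted(set(doc2))
    # nBOW weights: word count divided by document length
    d1 = np.array([doc1.count(w) / len(doc1) for w in words1])
    d2 = np.array([doc2.count(w) / len(doc2) for w in words2])
    # cost of moving weight from word i to word j = Euclidean embedding distance
    C = np.array([[np.linalg.norm(vectors[w1] - vectors[w2]) for w2 in words2]
                  for w1 in words1])
    n, m = len(words1), len(words2)
    # equality constraints: total flow out of word i is d1[i],
    # total flow into word j is d2[j]
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0      # row sums of the flow matrix T
    for j in range(m):
        A_eq[n + j, j::m] = 1.0               # column sums of T
    b_eq = np.concatenate([d1, d2])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Toy 2-D vectors, made up only for illustration; real usage would load
# pre-trained GloVe/word2vec/fastText vectors instead.
toy_vectors = {
    "obama":    np.array([1.00, 0.90]), "president": np.array([0.95, 1.00]),
    "speaks":   np.array([0.20, 1.50]), "greets":    np.array([0.25, 1.40]),
    "media":    np.array([1.40, 0.10]), "press":     np.array([1.35, 0.15]),
    "illinois": np.array([0.00, 0.40]), "chicago":   np.array([0.05, 0.45]),
}

doc1 = "obama speaks media illinois".split()
doc2 = "president greets press chicago".split()
print(f"WMD(doc1, doc2) = {wmd(doc1, doc2, toy_vectors):.3f}")
```

With real embeddings, the optimal flow moves obama onto president, speaks onto greets, media onto press, and illinois onto chicago, which is the matching illustrated in the paper's figure.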

Strengths of WMD:

  • Hyperparameter-free
  • Straightforward to understand and use, and highly interpretable
  • Leads to unprecedentedly low k-nearest-neighbor document classification error rates, as reported in the paper

Reference

Kusner, M. J., Sun, Y., Kolkin, N. I., & Weinberger, K. Q. (2015). From Word Embeddings to Document Distances. Proceedings of the 32nd International Conference on Machine Learning (ICML).