This paper, Enriching Word Vectors with Subword Information (Bojanowski et al., arXiv 2017), argues that using character-level n-grams makes it possible to infer vectors for out-of-vocabulary words.

For context: word2vec treats each word in the corpus as an atomic entity and learns one vector per word.

But this paper treats each word as composed of character n-grams.

So the vector for a word is the sum of the vectors of these character n-grams.

Let’s look at an example:

the word “apple” is represented as the sum of the vectors of these n-grams:

“<ap”, “app”, “ppl”, “ple”, “le>”, “<app”, “appl”, “pple”, “ple>”, “<appl”, “apple”, “pple>”, “<apple”, “apple>”

plus the special sequence “<apple>” for the whole word.

“<” and “>” are boundary symbols that mark the beginning and end of the word, so prefixes and suffixes can be distinguished from n-grams occurring inside a word.

In the example above, the hyperparameters set the smallest n-gram size to 3 and the largest to 6.
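
To make the extraction step concrete, here is a minimal Python sketch; the function name char_ngrams is my own, not the paper’s, and the defaults mirror the minn = 3, maxn = 6 setting above:

```python
def char_ngrams(word, minn=3, maxn=6):
    """Extract character n-grams from a word wrapped in boundary symbols."""
    wrapped = "<" + word + ">"
    ngrams = set()
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.add(wrapped[i:i + n])
    # fastText also keeps the whole wrapped word as a special sequence
    ngrams.add(wrapped)
    return ngrams

print(sorted(char_ngrams("apple")))
```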

Because rare and unseen words still share n-grams with more frequent words, they argue that on a small corpus, vectors built from character-level n-grams are better than vectors that treat each word as an atomic entity.

They represent a word by the sum of the vector representations of its n-grams.
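
And a toy sketch of that composition step, assuming a hypothetical in-memory table ngram_vectors filled with random vectors standing in for trained ones (the real fastText implementation hashes n-grams into a fixed number of buckets rather than storing each one explicitly):

```python
import numpy as np

def char_ngrams(word, minn=3, maxn=6):
    wrapped = "<" + word + ">"
    grams = {wrapped}  # keep the whole word as a special sequence
    for n in range(minn, maxn + 1):
        grams.update(wrapped[i:i + n] for i in range(len(wrapped) - n + 1))
    return grams

dim = 100
rng = np.random.default_rng(0)

# Hypothetical toy embedding table; in training these vectors would be
# learned, not random.
ngram_vectors = {}

def word_vector(word):
    """A word vector is the sum of its character n-gram vectors."""
    total = np.zeros(dim)
    for g in char_ngrams(word):
        if g not in ngram_vectors:
            ngram_vectors[g] = rng.normal(scale=0.1, size=dim)
        total += ngram_vectors[g]
    return total

# Any string can be embedded this way, including words never seen in
# training, because they still decompose into (mostly shared) n-grams.
print(word_vector("apple").shape)       # (100,)
print(word_vector("applesauce").shape)  # (100,)
```

This is what enables the out-of-vocabulary inference mentioned at the top: an unseen word gets a vector from the n-grams it shares with words that were in the training corpus.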

Reference

Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. arXiv:1607.04606.