The paper Enriching Word Vectors with Subword Information (Bojanowski et al., arXiv 2017) argues that using character-level n-grams makes it possible to infer better vectors for out-of-vocabulary words.
That is, word2vec treats each word in the corpus as an atomic entity and generates a vector for each word.
This paper, in contrast, treats each word as composed of character n-grams.
So the vector for a word is the sum of the vectors of these character n-grams.
Let's look at an example:
the vector for the word apple is the sum of the vectors of n-grams such as
“<ap”, “app”, “appl”, “apple”, “apple>”, “ppl”, “pple”, “pple>”, “ple”, “ple>”, “le>”
“<” and “>” are special symbols that mark the beginning and end of a word, so prefixes and suffixes can be told apart from other character sequences.
In this example, the hyperparameters are a minimum n-gram size of 3 and a maximum of 6.
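To make the decomposition concrete, here is a minimal Python sketch of this kind of n-gram extraction (the function name `char_ngrams` is my own, not from the paper or the fastText code; note that a full enumeration also yields n-grams starting with “<”, such as “<app”, which the illustrative list above leaves out):

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, with '<' and '>' marking its boundaries."""
    token = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(token) - n + 1):
            ngrams.append(token[i:i + n])
    return ngrams

print(char_ngrams("apple"))
# ['<ap', 'app', 'ppl', 'ple', 'le>', '<app', 'appl', 'pple', 'ple>',
#  '<appl', 'apple', 'pple>', '<apple', 'apple>']
```

The paper also keeps the full word itself (here “<apple>”) as one extra unit, so an in-vocabulary word still gets a vector of its own alongside its n-gram vectors.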
That is also why they argue that on a small corpus, vectors built from character-level n-grams are better than vectors that treat each word as an atomic entity.
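As a rough sketch of why this helps with rare or out-of-vocabulary words, the snippet below composes a word vector by summing the vectors of its character n-grams. Here `ngram_vectors` is a hypothetical lookup table standing in for the model's learned n-gram embeddings (the real fastText implementation hashes n-grams into a fixed number of buckets), and it reuses `char_ngrams` from the sketch above.

```python
import numpy as np

def word_vector(word, ngram_vectors, dim=100, min_n=3, max_n=6):
    """Compose a vector for `word` by summing the vectors of its character n-grams.

    Because n-grams are shared across words, this still yields a usable vector
    for a word that never appeared in the training corpus.
    """
    vec = np.zeros(dim)
    for g in char_ngrams(word, min_n, max_n):
        if g in ngram_vectors:  # skip n-grams we have no vector for
            vec += ngram_vectors[g]
    return vec
```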
The paper: Enriching Word Vectors with Subword Information (Bojanowski et al., arXiv 2017)