This is a brief summary, written for my own study, of Exploiting Similarities among Languages for Machine Translation (Mikolov et al., arXiv 2013).

The following figure shows that the same concepts have similar geometric arrangements across languages.

The reason is that all common languages share concepts that are grounded in the real world (such as the fact that a cat is an animal smaller than a dog), so there is often a strong similarity between the vector spaces.

In other words, the two languages (English and Spanish) exhibit similar geometric arrangements in their vector spaces.

Figure 1 (Mikolov et al., 2013, arXiv)

As you can see in the figure above, the similarity of geometric arrangements between vector spaces is the key reason why their method, which linearly projects from the source language to the target language, works well.

When they visualized the word vectors using PCA, they observed that the vector representations of similar words in different languages were related by a linear transformation. For instance, Figure 1 above shows that the word vectors for the English numbers one to five and the corresponding Spanish words uno to cinco have similar geometric arrangements. The relationship between the vector spaces that represent these two languages can thus plausibly be captured by a linear mapping (namely, a rotation and scaling).
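As a rough illustration of that visualization step (not the authors' code), here is a minimal sketch that projects word vectors to 2D with PCA; the random embeddings below are placeholders for vectors from a trained monolingual word2vec model:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder embeddings; in the paper these come from word2vec
# models trained separately on each monolingual corpus.
en_words = ["one", "two", "three", "four", "five"]
en_vectors = np.random.rand(len(en_words), 300)

# Project to 2D for plotting, one PCA per language, as in Figure 1.
en_2d = PCA(n_components=2).fit_transform(en_vectors)

for word, (x, y) in zip(en_words, en_2d):
    print(f"{word}: ({x:+.3f}, {y:+.3f})")
```

Doing the same for the Spanish vectors and plotting both panels side by side reproduces the kind of comparison shown in Figure 1.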

So they argued that if they know the translations of, say, one and four from English to Spanish, they can learn a transformation matrix that translates even the other numbers into Spanish. Concretely, given a seed dictionary of word pairs with source vectors x_i and target vectors z_i, they learn the matrix W that minimizes the sum of ||W x_i - z_i||^2 over the dictionary.
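Here is a minimal sketch of that step, assuming NumPy arrays X and Z that stack the seed-dictionary vectors row by row (these names are mine, not the paper's). The paper optimizes W with stochastic gradient descent; a plain least-squares solve reaches the same minimizer in closed form:

```python
import numpy as np

def learn_translation_matrix(X, Z):
    """Learn W minimizing sum_i ||W x_i - z_i||^2.

    X: (n, d_src) source-language vectors for the seed dictionary
       (e.g. "one", "four", ...).
    Z: (n, d_tgt) target-language vectors for their translations
       (e.g. "uno", "cuatro", ...).
    """
    # Solve X @ A ~= Z for A, column by column; W is its transpose.
    A, *_ = np.linalg.lstsq(X, Z, rcond=None)
    return A.T  # shape (d_tgt, d_src), so z ~= W @ x
```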

At test time, they translated any word seen in the monolingual corpora by projecting its vector representation from the source language space into the target language space with W. Once they obtained the projected vector, they output the word whose vector is most similar to it (by cosine similarity) as the translation.
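A small sketch of that lookup, reusing the W learned above; tgt_vocab and tgt_vectors are hypothetical names for the target-language vocabulary list and its stacked embedding matrix:

```python
import numpy as np

def translate(x_src, W, tgt_vocab, tgt_vectors, k=1):
    """Project a source word vector with W and return the k nearest
    target-language words by cosine similarity."""
    z = W @ x_src  # projection into the target space
    # Cosine similarity of z against every target-language vector.
    norms = np.linalg.norm(tgt_vectors, axis=1) * np.linalg.norm(z)
    sims = (tgt_vectors @ z) / np.maximum(norms, 1e-12)
    best = np.argsort(-sims)[:k]
    return [tgt_vocab[i] for i in best]
```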

Background on word2vec

They noted the following key point about the two word2vec architectures, CBOW and Skip-gram:

In practice, Skip-gram gives better word representations when the monolingual data is small. CBOW however is faster and more suitable for larger datasets.
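As a concrete illustration (not from the paper), the gensim library exposes both architectures through the sg flag of its Word2Vec class, so the trade-off above is a one-parameter choice; the toy corpus here is just a placeholder:

```python
from gensim.models import Word2Vec

# Toy corpus; in practice this would be a large monolingual corpus.
sentences = [["one", "two", "three"], ["four", "five", "one"]]

# sg=1 selects Skip-gram (better on small monolingual data);
# sg=0 selects CBOW (faster, suited to larger datasets).
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram.wv["one"].shape)  # (50,)
```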

References

Mikolov, T., Le, Q. V., & Sutskever, I. (2013). Exploiting Similarities among Languages for Machine Translation. arXiv:1309.4168.