How To Use Google’s Word2Vec C Source File

You can download the Word2Vec C source files from https://code.google.com/archive/p/word2vec/source/default/source.

Google’s code archive page for Word2Vec is: https://code.google.com/archive/p/word2vec/

How to use it

Download

You cannot export the source files to your own GitHub repository, so you have to download them directly.

Unzip

After downloading the archive, uncompress it:

# hyunyoung2 @ hyunyoung2-desktop in ~/my-jupyter/word2vec-of-google/test [17:26:31] 
$ ls
source-archive.zip

# hyunyoung2 @ hyunyoung2-desktop in ~/my-jupyter/word2vec-of-google/test [17:27:41] 
$ unzip source-archive.zip 
Archive:  source-archive.zip
   creating: word2vec/
  .......
  inflating: word2vec/trunk/distance.c  
  inflating: word2vec/trunk/word2vec.c  
  inflating: word2vec/trunk/questions-words.txt  
  inflating: word2vec/trunk/LICENSE  

After unzipping source-archive.zip, a new directory named word2vec is created. Enter it and you will see a single directory, trunk; enter that as well, and inside you will finally find Google’s Word2Vec C source files.

The following command-line session shows these steps:

# hyunyoung2 @ hyunyoung2-desktop in ~/my-jupyter/word2vec-of-google/test [17:28:03] 
$ ls
source-archive.zip  word2vec

# hyunyoung2 @ hyunyoung2-desktop in ~/my-jupyter/word2vec-of-google/test [17:28:49] 
$ cd word2vec 

# hyunyoung2 @ hyunyoung2-desktop in ~/my-jupyter/word2vec-of-google/test/word2vec [17:29:04] 
$ ls
trunk

# hyunyoung2 @ hyunyoung2-desktop in ~/my-jupyter/word2vec-of-google/test/word2vec [17:29:05] 
$ cd trunk 

# hyunyoung2 @ hyunyoung2-desktop in ~/my-jupyter/word2vec-of-google/test/word2vec/trunk [17:29:10] 
$ ls
compute-accuracy.c       demo-train-big-model-v1.sh  makefile               word2vec.c
demo-analogy.sh          demo-word-accuracy.sh       questions-phrases.txt  word-analogy.c
demo-classes.sh          demo-word.sh                questions-words.txt
demo-phrase-accuracy.sh  distance.c                  README.txt
demo-phrases.sh          LICENSE                     word2phrase.c

Make and an example with demo-analogy.sh

To compile the Word2Vec source, run make:

# hyunyoung2 @ hyunyoung2-desktop in ~/my-jupyter/word2vec-of-google/test/word2vec/trunk [17:33:25] 
$ make
gcc word2vec.c -o word2vec -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
gcc word2phrase.c -o word2phrase -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
gcc distance.c -o distance -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
.......                                                                            
chmod +x *.sh


# hyunyoung2 @ hyunyoung2-desktop in ~/my-jupyter/word2vec-of-google/test/word2vec/trunk [17:38:36] 
$ ls
compute-accuracy         demo-train-big-model-v1.sh  makefile               word2vec
compute-accuracy.c       demo-word-accuracy.sh       questions-phrases.txt  word2vec.c
demo-analogy.sh          demo-word.sh                questions-words.txt    word-analogy
demo-classes.sh          distance                    README.txt             word-analogy.c
demo-phrase-accuracy.sh  distance.c                  word2phrase
demo-phrases.sh          LICENSE                     word2phrase.c

As you can see above, after the make command finishes, the chmod +x *.sh step makes all the shell scripts executable.

Now let’s walk through one of the shell scripts.

The demo script we will look at is demo-analogy.sh, which finds word analogies such as king - man + woman ≈ queen:

$ vim demo-analogy.sh

make
if [ ! -e text8 ]; then
  wget http://mattmahoney.net/dc/text8.zip -O text8.gz
  gzip -d text8.gz -f
fi
echo ---------------------------------------------------------------------------------------------------
echo Note that for the word analogy to perform well, the model should be trained on much larger data set
echo Example input: paris france berlin
echo ---------------------------------------------------------------------------------------------------
time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
./word-analogy vectors.bin            

As you can see in the snippet above, the script gives you a hint of how to run word2vec: the line starting with time ./word2vec:

time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15

Keep in mind that -size is the size of your word embedding, i.e. the word vector dimension (here 200). Also note: -cbow 1 selects the CBOW architecture (0 would select skip-gram), -window sets the context window size, -negative the number of negative samples, -hs enables hierarchical softmax, -binary 1 writes the output in binary format, and -iter the number of training passes over the data.

Let’s execute demo-analogy.sh:

$ ./demo-analogy.sh

While training runs, you can check the state of the threads with a tool such as top (in this case 20 threads were used). When training finishes, list the directory again:

$ ls
compute-accuracy         demo-train-big-model-v1.sh  makefile               word2phrase
compute-accuracy.c       demo-word-accuracy.sh       questions-phrases.txt  word2phrase.c
demo-analogy.sh          demo-word.sh                questions-words.txt    word2vec
demo-classes.sh          distance                    README.txt             word2vec.c
demo-phrase-accuracy.sh  distance.c                  text8                  word-analogy
demo-phrases.sh          LICENSE                     vectors.bin            word-analogy.c

As you can see above, two new files have appeared: text8 (the input data file) and vectors.bin, the file that stores the trained vector values.

Let’s look at the vector file:

$ vim vectors.bin

As you can see, the file looks broken in a text editor. That is because word2vec was run in binary mode (-binary 1).

Help message

You don’t need to analyze the shell scripts, though: the word2vec executable itself shows you how to use it when you type ./word2vec with no arguments:

$ ./word2vec

Briefly, the usage message explains options such as:

  • -train data.txt : the text file you want to train word embeddings on

  • -binary : whether the output file is written in binary (1) or text (0) format

Now let’s produce a text-format vector file, vec.txt, using Google’s word2vec:

# hyunyoung2 @ hyunyoung2-desktop in ~/before-ubutu/my-jupyter/word2vec-of-google/test/word2vec/trunk [19:16:18] C:1
$ ./word2vec -train text8 -output vec.txt -size 200 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 0 -cbow 1 -iter 3
Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000005  Progress: 100.03%  Words/thread/sec: 298.74k  %  

$ vim vec.txt

Opening vec.txt, the first line shows the total number of word vectors and the dimensionality of each word, and the following lines give the vector values of each word, like this:

71291 200

The following line is the vector for </s>:

</s> 0.002001 0.002210 -0.001915 -0.001639 0.000683 0.001511 0.000470 …….

  • 71291 : the total number of word vectors (the vocabulary size)

  • 200 : the dimensionality of each word vector

  • 0.002001 0.002210 ……. : the vector values of the word

Reference