To use word2vec, I installed the gensim module for Python 3.5.2 (Anaconda 4.2.0).
> pip install gensim
The install went through without a hitch, so I immediately checked whether word2vec could be imported, but it crashed with the following MKL error...
> python
>>> from gensim.models import word2vec
Intel MKL FATAL ERROR: Cannot load libmkl_mc3.so or libmkl_def.so.
Updating MKL with conda, as described in the page below, seems to have fixed it:
http://d.hatena.ne.jp/m_matsunag/20160415/1460687923
> conda update mkl
> python
>>> from gensim.models import word2vec
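With the import now working, it is worth confirming the environment while we are at it. numpy itself provides numpy.__config__.show(), which prints the BLAS/LAPACK build it was linked against, so on an Anaconda install the MKL libraries should be listed (the exact output varies by environment, so it is omitted here):
> python
>>> import gensim
>>> gensim.__version__        # whatever version pip installed; varies over time
>>> import numpy
>>> numpy.__config__.show()   # on an MKL-linked build, mkl-related entries should appear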
Next, following the site below, I'll actually train a word2vec model on some text data.
https://m0t0k1ch1st0ry.com/blog/2016/08/28/word2vec/
First, as preparation, install nkf, a command-line tool for converting between character encodings.
> sudo apt-get install nkf
This time I'll use the livedoor news corpus, downloaded from:
https://www.rondhuit.com/download/livedoor-news-data.tar.gz
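Incidentally, the download and extraction can also be done from Python alone. This is just a minimal sketch using only the standard library (the manual commands below achieve the same layout):

# -*- coding: utf-8 -*-
# fetch_corpus.py: hypothetical helper; the shell commands below do the same thing
import tarfile
import urllib.request

url = 'https://www.rondhuit.com/download/livedoor-news-data.tar.gz'
urllib.request.urlretrieve(url, 'livedoor-news-data.tar.gz')

# the archive contains loose *.xml files, so extract them into their own directory
with tarfile.open('livedoor-news-data.tar.gz') as tf:
    tf.extractall('livedoor-news-data')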
Extract the downloaded archive and check its character encoding.
> tar -xvf livedoor-news-data.tar.gz
> mkdir livedoor-news-data
> mv *.xml livedoor-news-data
> nkf -g livedoor-news-data/topic-news.xml
UTF-8
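As an aside, the same encoding check can be done from Python instead of nkf. A minimal sketch, assuming the third-party chardet package (pip install chardet) is available:

# -*- coding: utf-8 -*-
# check_encoding.py: a chardet-based stand-in for nkf -g (hypothetical helper)
import chardet

with open('livedoor-news-data/topic-news.xml', 'rb') as f:
    raw = f.read()

# chardet.detect() returns a dict with 'encoding' and 'confidence' keys
print(chardet.detect(raw)['encoding'])  # should report utf-8, matching nkf above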
Then, referring to the site below, write two scripts: wakati.py, which converts the text into wakati-gaki (space-separated words), and train.py, which trains the model.
https://radimrehurek.com/gensim/models/word2vec.html
> cat wakati.py
# -*- coding: utf-8 -*-
import MeCab
import sys

# -F: output the base form (%f[6]) of each morpheme, prefixed with a space
# -U: for unknown words, fall back to the surface form (%m)
# -E: end each sentence with a newline
tagger = MeCab.Tagger('-F\s%f[6] -U\s%m -E\\n')

fi = open(sys.argv[1], 'r')
fo = open(sys.argv[2], 'w')

line = fi.readline()
while line:
    result = tagger.parse(line)
    fo.write(result[1:])  # skip first \s
    line = fi.readline()

fi.close()
fo.close()

> cat train.py
# -*- coding: utf-8 -*-
from gensim.models import word2vec
import logging
import sys

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# one sentence per line, words separated by spaces
sentences = word2vec.LineSentence(sys.argv[1])

# skip-gram (sg=1), 100-dimensional vectors, keep every word (min_count=1),
# 10-word window, hierarchical softmax (hs=1) instead of negative sampling
model = word2vec.Word2Vec(sentences,
                          sg=1,
                          size=100,
                          min_count=1,
                          window=10,
                          hs=1,
                          negative=0)
model.save(sys.argv[2])
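Before running wakati.py on the whole corpus, it helps to see what the somewhat cryptic MeCab format string actually produces: with the IPA dictionary, %f[6] is the seventh feature field, i.e. the base form of each morpheme, \s is a space, and -U/-E handle unknown words and line endings. A quick check in the REPL (the sample sentence and its output depend on the installed dictionary, so treat them as illustrative):

> python
>>> import MeCab
>>> tagger = MeCab.Tagger('-F\s%f[6] -U\s%m -E\\n')
>>> print(tagger.parse('私はペンを持っている')[1:])  # [1:] drops the leading \s, as in wakati.py
私 は ペン を 持つ て いる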
Now convert the corpus into the space-separated form and train word2vec on it.
> python wakati.py livedoor-news-data/topic-news.xml livedoor-news-data/topic-news_wakati.xml
> python train.py livedoor-news-data/topic-news_wakati.xml livedoor-news-data/topic-news.model
2017-10-12 17:28:35,893 : INFO : collecting all words and their counts
2017-10-12 17:28:35,893 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-10-12 17:28:35,973 : INFO : PROGRESS: at sentence #10000, processed 271445 words, keeping 15523 word types
2017-10-12 17:28:36,031 : INFO : collected 21663 word types from a corpus of 470676 raw words and 16734 sentences
2017-10-12 17:28:36,031 : INFO : Loading a fresh vocabulary
2017-10-12 17:28:36,067 : INFO : min_count=1 retains 21663 unique words (100% of original 21663, drops 0)
2017-10-12 17:28:36,067 : INFO : min_count=1 leaves 470676 word corpus (100% of original 470676, drops 0)
2017-10-12 17:28:36,109 : INFO : deleting the raw counts dictionary of 21663 items
2017-10-12 17:28:36,110 : INFO : sample=0.001 downsamples 43 most-common words
2017-10-12 17:28:36,110 : INFO : downsampling leaves estimated 262394 word corpus (55.7% of prior 470676)
2017-10-12 17:28:36,110 : INFO : estimated required memory for 21663 words and 100 dimensions: 32494500 bytes
2017-10-12 17:28:36,127 : INFO : constructing a huffman tree from 21663 words
2017-10-12 17:28:36,611 : INFO : built huffman tree with maximum node depth 19
2017-10-12 17:28:36,619 : INFO : resetting layer weights
2017-10-12 17:28:36,803 : INFO : training model with 3 workers on 21663 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=0 window=10
2017-10-12 17:28:37,807 : INFO : PROGRESS: at 22.87% examples, 291736 words/s, in_qsize 5, out_qsize 0
2017-10-12 17:28:38,842 : INFO : PROGRESS: at 45.94% examples, 290614 words/s, in_qsize 6, out_qsize 0
2017-10-12 17:28:39,876 : INFO : PROGRESS: at 68.84% examples, 290533 words/s, in_qsize 5, out_qsize 0
2017-10-12 17:28:40,885 : INFO : PROGRESS: at 91.54% examples, 292501 words/s, in_qsize 5, out_qsize 0
2017-10-12 17:28:41,244 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-10-12 17:28:41,267 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-10-12 17:28:41,273 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-10-12 17:28:41,273 : INFO : training on 2353380 raw words (1311998 effective words) took 4.5s, 293579 effective words/s
2017-10-12 17:28:41,273 : INFO : saving Word2Vec object under livedoor-news-data/topic-news.model, separately None
2017-10-12 17:28:41,273 : INFO : not storing attribute syn0norm
2017-10-12 17:28:41,274 : INFO : not storing attribute cum_table
2017-10-12 17:28:41,782 : INFO : saved livedoor-news-data/topic-news.model
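According to the log, min_count=1 dropped nothing and 21663 word types were kept, so the saved model should expose exactly that vocabulary. A quick sanity check, assuming the gensim API of this era, where the vocabulary lives in model.wv.vocab (newer gensim renamed it to model.wv.key_to_index):

> python
>>> from gensim.models import word2vec
>>> model = word2vec.Word2Vec.load('livedoor-news-data/topic-news.model')
>>> len(model.wv.vocab)   # matches the 21663 word types reported in the log
21663
>>> model.vector_size     # 100 dimensions, as set by size=100 in train.py
100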
Finally, load the model and, referring to the site below, try a few queries.
https://radimrehurek.com/gensim/models/word2vec.html
> python
>>> from gensim.models import word2vec
>>> model = word2vec.Word2Vec.load('livedoor-news-data/topic-news.model')
>>> model.wv['AKB48']
array([ -5.62809408e-02,   3.99775743e-01,  -9.42421407e-02,
        -5.23237109e-01,   4.44891416e-02,   5.43846749e-02,
        ...,
         1.55055657e-01,   2.84974463e-02,   2.10736513e-01,
         1.71845078e-01], dtype=float32)
>>> model.wv['AKB48'].shape
(100,)
>>> model.wv.most_similar(positive=['AKB48'], topn=10)
[('指原莉乃', 0.7347375750541687),
 ('前田敦子', 0.6920689344406128),
 ('篠田麻里子', 0.6650659441947937),
 ('柏木由紀', 0.6423534154891968),
 ('AKB', 0.6356086134910583),
 ('ぷっちょ', 0.6345366835594177),
 ('板野友美', 0.633526086807251),
 ('元カレ', 0.6171307563781738),
 ('大島麻衣', 0.6131355166435242),
 ('高橋みなみ', 0.609819233417511)]
>>> model.wv.similarity('AKB48', 'モー娘')
0.31667615140081529
>>> model.wv.similarity('AKB48', 'SMAP')
0.48196754241974527
>>>
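most_similar also accepts a negative word list, so king - man + woman style analogy queries work on this model as well, and doesnt_match() picks the odd one out of a word list. Whether the answers are meaningful is another matter on a corpus this small, so the queries below only illustrate the API and no output is shown:

>>> model.wv.most_similar(positive=['AKB48', 'SMAP'], negative=['前田敦子'], topn=5)
>>> model.wv.doesnt_match(['AKB48', 'SMAP', 'ぷっちょ'])  # returns the word least similar to the rest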