Doc2Vec on Wikipedia articles

This notebook replicates Document Embedding with Paragraph Vectors (http://arxiv.org/abs/1507.07998). The paper reports only DBOW results on Wikipedia data, so here we run the experiment with both DBOW and DM.

Basic Setup

First, let's import the modules and classes we need.

In [1]:
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from pprint import pprint
import multiprocessing

Preparing the corpus

First, download the dump of all Wikipedia articles from https://dumps.wikimedia.org/enwiki/ (you want the file enwiki-latest-pages-articles.xml.bz2, or enwiki-YYYYMMDD-pages-articles.xml.bz2 for date-specific dumps).

Second, wrap the dump in a WikiCorpus, which constructs a corpus from a Wikipedia (or other MediaWiki-based) database dump.

For more details on WikiCorpus, see Corpus from a Wikipedia dump.

In [2]:
wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2")
#wiki = WikiCorpus("enwiki-YYYYMMDD-pages-articles.xml.bz2")
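
Parsing the full dump is slow, so it can be reassuring to peek at the first parsed article before going further. This is an optional sketch; it relies on get_texts() yielding (tokens, (page_id, title)) tuples once metadata is enabled, which is also how the wrapper class below uses it.

# optional sanity check: inspect the first parsed article
wiki.metadata = True                       # ask get_texts() to also yield (page_id, title)
for content, (page_id, title) in wiki.get_texts():
    print(title, len(content))             # article title and its token count
    break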

Define a TaggedWikiDocument class to convert the WikiCorpus into a form suitable for Doc2Vec.

In [3]:
class TaggedWikiDocument(object):
    def __init__(self, wiki):
        self.wiki = wiki
        self.wiki.metadata = True  # make get_texts() also yield (page_id, title)
    def __iter__(self):
        for content, (page_id, title) in self.wiki.get_texts():
            # tokens arrive as bytes in this gensim version; tag each article with its title
            yield TaggedDocument([c.decode("utf-8") for c in content], [title])
In [4]:
documents = TaggedWikiDocument(wiki)
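
As an optional spot-check (a sketch; pulling even one item re-parses the start of the dump, so it is slow), we can look at the first TaggedDocument to confirm each article is paired with its title as the tag:

# optional: peek at the first TaggedDocument (decoded tokens plus title tag)
first = next(iter(documents))
print(first.tags, first.words[:10])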

Preprocessing

To match the vocabulary size used in the original paper, we first look for a suitable min_count parameter.

In [5]:
pre = Doc2Vec(min_count=0)   # throwaway model, used only for vocabulary statistics
pre.scan_vocab(documents)    # one pass over the corpus to collect raw word counts
In [6]:
for num in range(0, 20):
    # scale_vocab() reports estimated memory; with hs=1 each surviving word is costed at ~700 bytes,
    # so dividing the 'vocab' estimate by 700 recovers the vocabulary size
    print('min_count: {}, size of vocab: '.format(num), pre.scale_vocab(min_count=num, dry_run=True)['memory']['vocab']/700)
min_count: 0, size of vocab:  8545782.0
min_count: 1, size of vocab:  8545782.0
min_count: 2, size of vocab:  4227783.0
min_count: 3, size of vocab:  3008772.0
min_count: 4, size of vocab:  2439367.0
min_count: 5, size of vocab:  2090709.0
min_count: 6, size of vocab:  1856609.0
min_count: 7, size of vocab:  1681670.0
min_count: 8, size of vocab:  1546914.0
min_count: 9, size of vocab:  1437367.0
min_count: 10, size of vocab:  1346177.0
min_count: 11, size of vocab:  1267916.0
min_count: 12, size of vocab:  1201186.0
min_count: 13, size of vocab:  1142377.0
min_count: 14, size of vocab:  1090673.0
min_count: 15, size of vocab:  1043973.0
min_count: 16, size of vocab:  1002395.0
min_count: 17, size of vocab:  964684.0
min_count: 18, size of vocab:  930382.0
min_count: 19, size of vocab:  898725.0

In the original paper, the vocabulary size is 915,715 words. We get a similar vocabulary size with min_count = 19 (898,725 words).
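
Instead of reading the table by eye, the same choice can be made programmatically. The sketch below reuses the scale_vocab(dry_run=True) estimate and the ~700-bytes-per-word heuristic from the loop above, and picks the smallest min_count whose estimated vocabulary does not exceed the paper's 915,715 words:

# pick the smallest min_count whose estimated vocabulary fits within the paper's 915,715 words
target = 915715
sizes = {num: pre.scale_vocab(min_count=num, dry_run=True)['memory']['vocab'] / 700
         for num in range(0, 20)}
best = min(num for num, size in sizes.items() if size <= target)
print(best, sizes[best])   # 19, 898725.0 as in the table above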

Training the Doc2Vec Model

To train Doc2Vec with both methods, DBOW and DM, we define a list of models.

In [7]:
cores = multiprocessing.cpu_count()

models = [
    # PV-DBOW 
    Doc2Vec(dm=0, dbow_words=1, size=200, window=8, min_count=19, iter=10, workers=cores),
    # PV-DM w/average
    Doc2Vec(dm=1, dm_mean=1, size=200, window=8, min_count=19, iter=10, workers=cores),
]
In [8]:
models[0].build_vocab(documents)
print(str(models[0]))
models[1].reset_from(models[0])
print(str(models[1]))
Doc2Vec(dbow+w,d200,hs,w8,mc19,t8)
Doc2Vec(dm/m,d200,hs,w8,mc19,t8)

Now we're ready to train Doc2Vec on the English Wikipedia.

In [9]:
for model in models:
    %time model.train(documents)
CPU times: user 5d 18h 24min 30s, sys: 26min 6s, total: 5d 18h 50min 36s
Wall time: 1d 2h 58min 58s
CPU times: user 1d 1h 28min 2s, sys: 33min 15s, total: 1d 2h 1min 18s
Wall time: 15h 27min 18s
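
Training takes days, so it is worth persisting the trained models before experimenting with them. Below is a minimal sketch using gensim's standard save()/load(); the file names are just illustrative.

# save the trained models so the multi-day training need not be repeated
for name, model in zip(['dbow', 'dm'], models):
    model.save('doc2vec_enwiki_{}.model'.format(name))
# later, e.g.: model_dbow = Doc2Vec.load('doc2vec_enwiki_dbow.model')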

Similarity interface

Now let's test both models. The DBOW model gives results similar to the original paper. First, we compute cosine similarities to the "Machine learning" paragraph vector. Word vectors and document vectors are stored separately, so document vectors have to be pulled out of the model's .docvecs attribute.
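
As a small illustration of where the two kinds of vectors live: an article's paragraph vector is looked up by title in model.docvecs, while a word vector is looked up by its (lowercased) token on the model itself. Both are 200-dimensional with the settings above.

# paragraph vector (indexed by article title) vs. word vector (lowercased token)
doc_vec = models[0].docvecs["Machine learning"]
word_vec = models[0]["learning"]
print(doc_vec.shape, word_vec.shape)   # (200,) (200,)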

In [10]:
for model in models:
    print(str(model))
    pprint(model.docvecs.most_similar(positive=["Machine learning"], topn=20))
Doc2Vec(dbow+w,d200,hs,w8,mc19,t8)
[('Theoretical computer science', 0.7256590127944946),
 ('Artificial neural network', 0.7162272930145264),
 ('Pattern recognition', 0.6948175430297852),
 ('Data mining', 0.6938608884811401),
 ('Bayesian network', 0.6938260197639465),
 ('Support vector machine', 0.6706081628799438),
 ('Glossary of artificial intelligence', 0.670173704624176),
 ('Computational learning theory', 0.6648679971694946),
 ('Outline of computer science', 0.6638073921203613),
 ('List of important publications in computer science', 0.663051187992096),
 ('Mathematical optimization', 0.655048131942749),
 ('Theory of computation', 0.6508707404136658),
 ('Word-sense disambiguation', 0.6505812406539917),
 ('Reinforcement learning', 0.6480429172515869),
 ("Solomonoff's theory of inductive inference", 0.6459559202194214),
 ('Computational intelligence', 0.6458009481430054),
 ('Information visualization', 0.6437181234359741),
 ('Algorithmic composition', 0.643247127532959),
 ('Ray Solomonoff', 0.6425477862358093),
 ('Kriging', 0.6425424814224243)]
Doc2Vec(dm/m,d200,hs,w8,mc19,t8)
[('Artificial neural network', 0.640324592590332),
 ('Pattern recognition', 0.6244156360626221),
 ('Data stream mining', 0.6140210032463074),
 ('Theoretical computer science', 0.5964258909225464),
 ('Outline of computer science', 0.5862746834754944),
 ('Supervised learning', 0.5847170352935791),
 ('Data mining', 0.5817658305168152),
 ('Decision tree learning', 0.5785809755325317),
 ('Bayesian network', 0.5768401622772217),
 ('Computational intelligence', 0.5717238187789917),
 ('Theory of computation', 0.5703311562538147),
 ('Bayesian programming', 0.5693561434745789),
 ('Reinforcement learning', 0.564978837966919),
 ('Helmholtz machine', 0.564972460269928),
 ('Inductive logic programming', 0.5631471276283264),
 ('Algorithmic learning theory', 0.563083291053772),
 ('Semi-supervised learning', 0.5628935694694519),
 ('Early stopping', 0.5597405433654785),
 ('Decision tree', 0.5596889853477478),
 ('Artificial intelligence', 0.5569720268249512)]

The DBOW model interprets 'Machine learning' as part of the broader computer-science field, while the DM model relates it more to data-science topics such as data mining and supervised learning.

Second, we compute cosine similarities to the "Lady Gaga" paragraph vector.

In [11]:
for model in models:
    print(str(model))
    pprint(model.docvecs.most_similar(positive=["Lady Gaga"], topn=10))
Doc2Vec(dbow+w,d200,hs,w8,mc19,t8)
[('Katy Perry', 0.7374469637870789),
 ('Adam Lambert', 0.6972734928131104),
 ('Miley Cyrus', 0.6212848424911499),
 ('List of awards and nominations received by Lady Gaga', 0.6138384938240051),
 ('Nicole Scherzinger', 0.6092700958251953),
 ('Christina Aguilera', 0.6062655448913574),
 ('Nicki Minaj', 0.6019431948661804),
 ('Taylor Swift', 0.5973174571990967),
 ('The Pussycat Dolls', 0.5888757705688477),
 ('Beyoncé', 0.5844652652740479)]
Doc2Vec(dm/m,d200,hs,w8,mc19,t8)
[('ArtRave: The Artpop Ball', 0.5719832181930542),
 ('Artpop', 0.5651129484176636),
 ('Katy Perry', 0.5571318864822388),
 ('The Fame', 0.5388195514678955),
 ('The Fame Monster', 0.5380634069442749),
 ('G.U.Y.', 0.5365751385688782),
 ('Beautiful, Dirty, Rich', 0.5329179763793945),
 ('Applause (Lady Gaga song)', 0.5328119993209839),
 ('The Monster Ball Tour', 0.5299569368362427),
 ('Lindsey Stirling', 0.5281971096992493)]

The DBOW model surfaces similar singers in the U.S., while the DM model mostly returns Lady Gaga's own songs and albums as the documents most similar to "Lady Gaga".

Third, we compute cosine similarities to the vector "Lady Gaga" - "American" + "Japanese", mixing a document vector with word vectors. "American" and "Japanese" are word vectors, not paragraph vectors, and WikiCorpus has already lowercased all word tokens.
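
As a quick check of the lowercasing (a sketch assuming the older gensim attribute layout used elsewhere in this notebook, where the word vocabulary is exposed as model.vocab):

# only the lowercase form of a word is in the vocabulary
print("american" in models[0].vocab, "American" in models[0].vocab)   # expected: True False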

In [12]:
for model in models:
    print(str(model))
    vec = [model.docvecs["Lady Gaga"] - model["american"] + model["japanese"]]
    pprint([m for m in model.docvecs.most_similar(vec, topn=11) if m[0] != "Lady Gaga"])
Doc2Vec(dbow+w,d200,hs,w8,mc19,t8)
[('Game (Perfume album)', 0.5571034550666809),
 ('Katy Perry', 0.5537782311439514),
 ('Taboo (Kumi Koda song)', 0.5304880142211914),
 ('Kylie Minogue', 0.5234110355377197),
 ('Ayumi Hamasaki', 0.5110630989074707),
 ("Girls' Generation", 0.4996713399887085),
 ('Britney Spears', 0.49094343185424805),
 ('Koda Kumi', 0.48719698190689087),
 ('Perfume (Japanese band)', 0.48536181449890137),
 ('Kara (South Korean band)', 0.48507893085479736)]
Doc2Vec(dm/m,d200,hs,w8,mc19,t8)
[('Artpop', 0.47699037194252014),
 ('Jessie J', 0.4439432621002197),
 ('Haus of Gaga', 0.43463900685310364),
 ('The Fame', 0.4278091788291931),
 ('List of awards and nominations received by Lady Gaga', 0.4268512427806854),
 ('Applause (Lady Gaga song)', 0.41563737392425537),
 ('New Cutie Honey', 0.4152414798736572),
 ('M.I.A. (rapper)', 0.4091864228248596),
 ('Mama Do (Uh Oh, Uh Oh)', 0.4044945538043976),
 ('The Fame Monster', 0.40421998500823975)]

As a result, the DBOW model finds artists similar to Lady Gaga in Japan, such as 'Perfume', a well-known Japanese pop group. The DM model, on the other hand, includes no Japanese artists in its top 10 similar documents; its results are almost the same as those obtained without the vector arithmetic.

These results suggest that the DBOW mode employed in the original paper is much better suited to computing similarities that mix document vectors and word vectors (here DBOW was trained with dbow_words=1, so its word vectors were learned alongside the document vectors).