News classification with topic models in gensim

News article classification is a task which is performed on a huge scale by news agencies all over the world. We will be looking into how topic modeling can be used to accurately classify news articles into different categories such as sports, technology, politics etc.

Our aim in this tutorial is to come up with some topic model which can come up with topics that can easily be interpreted by us. Such a topic model can be used to discover hidden structure in the corpus and can also be used to determine the membership of a news article into one of the topics.

For this tutorial, we will be using the Lee corpus which is a shortened version of the Lee Background Corpus. The shortened version consists of 300 documents selected from the Australian Broadcasting Corporation's news mail service. It consists of texts of headline stories from around the year 2000-2001.

Accompanying slides can be found here.

Requirements

In this tutorial we look at how different topic models can be easily created using gensim. Following are the dependencies for this tutorial:

- Gensim Version >=0.13.1 would be preferred since we will be using topic coherence metrics extensively here.
- matplotlib
- Patterns library; Gensim uses this for lemmatization. ONLY FOR PYTHON 2.5+ - no support for Python 3 yet.
- nltk.stopwords
- pyLDAVis

We will be playing around with 4 different topic models here:

- LSI (Latent Semantic Indexing)
- HDP (Hierarchical Dirichlet Process)
- LDA (Latent Dirichlet Allocation)
- LDA (tweaked with topic coherence to find optimal number of topics) and
- LDA as LSI with the help of topic coherence metrics

First we'll fit those topic models on our existing data then we'll compare each against the other and see how they rank in terms of human interpretability.

All can be found in gensim and can be easily used in a plug-and-play fashion. We will tinker with the LDA model using the newly added topic coherence metrics in gensim based on this paper by Roeder et al and see how the resulting topic model compares with the exsisting ones.

In [ ]:
import os
import re
import operator
import matplotlib.pyplot as plt
import warnings
import gensim
import numpy as np
warnings.filterwarnings('ignore')  # Let's not pay heed to them right now

from gensim.models import CoherenceModel, LdaModel, LsiModel, HdpModel
from gensim.models.wrappers import LdaMallet
from gensim.corpora import Dictionary
from pprint import pprint

%matplotlib inline
In [2]:
test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data'])
lee_train_file = test_data_dir + os.sep + 'lee_background.cor'

Analysing our corpus.

- The first document talks about a bushfire that had occured in New South Wales.
- The second talks about conflict between India and Pakistan in Kashmir.
- The third talks about road accidents in the New South Wales area.
- The fourth one talks about Argentina's economic and political crisis during that time.
- The last one talks about the use of drugs by midwives in a Sydney hospital.

Our final topic model should be giving us keywords which we can easily interpret and make a small summary out of. Without this the topic model cannot be of much practical use.

In [3]:
with open(lee_train_file) as f:
    for n, l in enumerate(f):
        if n < 5:
            print([l])
['Hundreds of people have been forced to vacate their homes in the Southern Highlands of New South Wales as strong winds today pushed a huge bushfire towards the town of Hill Top. A new blaze near Goulburn, south-west of Sydney, has forced the closure of the Hume Highway. At about 4:00pm AEDT, a marked deterioration in the weather as a storm cell moved east across the Blue Mountains forced authorities to make a decision to evacuate people from homes in outlying streets at Hill Top in the New South Wales southern highlands. An estimated 500 residents have left their homes for nearby Mittagong. The New South Wales Rural Fire Service says the weather conditions which caused the fire to burn in a finger formation have now eased and about 60 fire units in and around Hill Top are optimistic of defending all properties. As more than 100 blazes burn on New Year\'s Eve in New South Wales, fire crews have been called to new fire at Gunning, south of Goulburn. While few details are available at this stage, fire authorities says it has closed the Hume Highway in both directions. Meanwhile, a new fire in Sydney\'s west is no longer threatening properties in the Cranebrook area. Rain has fallen in some parts of the Illawarra, Sydney, the Hunter Valley and the north coast. But the Bureau of Meteorology\'s Claire Richards says the rain has done little to ease any of the hundred fires still burning across the state. "The falls have been quite isolated in those areas and generally the falls have been less than about five millimetres," she said. "In some places really not significant at all, less than a millimetre, so there hasn\'t been much relief as far as rain is concerned. "In fact, they\'ve probably hampered the efforts of the firefighters more because of the wind gusts that are associated with those thunderstorms." \n']
["Indian security forces have shot dead eight suspected militants in a night-long encounter in southern Kashmir. The shootout took place at Dora village some 50 kilometers south of the Kashmiri summer capital Srinagar. The deaths came as Pakistani police arrested more than two dozen militants from extremist groups accused of staging an attack on India's parliament. India has accused Pakistan-based Lashkar-e-Taiba and Jaish-e-Mohammad of carrying out the attack on December 13 at the behest of Pakistani military intelligence. Military tensions have soared since the raid, with both sides massing troops along their border and trading tit-for-tat diplomatic sanctions. Yesterday, Pakistan announced it had arrested Lashkar-e-Taiba chief Hafiz Mohammed Saeed. Police in Karachi say it is likely more raids will be launched against the two groups as well as other militant organisations accused of targetting India. Military tensions between India and Pakistan have escalated to a level not seen since their 1971 war. \n"]
['The national road toll for the Christmas-New Year holiday period stands at 45, eight fewer than for the same time last year. 20 people have died on New South Wales roads, with eight fatalities in both Queensland and Victoria. Western Australia, the Northern Territory and South Australia have each recorded three deaths, while the ACT and Tasmania remain fatality free. \n']
["Argentina's political and economic crisis has deepened with the resignation of its interim President who took office just a week ago. Aldolfo Rodregiuez Saa told a stunned nation that he could not rescue Argentina because key fellow Peronists would not support his default on massive foreign debt repayment or his plan for a new currency. It was only a week ago that he was promising a million new jobs to end four years of recession, days after his predecessor resigned following a series of failed rescue packages. After announcing that the senate leader, Ramon Puerta, would assume the presidency until congress appoints a new caretaker president, the government said he too had quit and another senior lawmaker would act in the role. Fresh elections are not scheduled until March leaving whoever assumes the presidency with the daunting task of tackling Argentina's worst crisis in 12 years, but this time, isolated by international lending agencies. \n"]
['Six midwives have been suspended at Wollongong Hospital, south of Sydney, for inappropriate use of nitrous oxide during work hours, on some occasions while women were in labour. The Illawarra Area Health Service says that following an investigation of unprofessional conduct, a further four midwives have been relocated to other areas within the hospital. The service\'s chief executive officer, Tony Sherbon, says no one was put at risk, because other staff not involved in the use of nitrous oxide were able to take over caring for women in labour. "Well we\'re very concerned and the body of midwives to the hospital - there are over 70 midwives that work in our service - are very annoyed and angry at the inappropriate behaviour of these very senior people who should know better," he said. "And that\'s why we\'ve take the action of suspending them and we\'ll consider further action next week." \n']
In [4]:
def build_texts(fname):
    """
    Function to build tokenized texts from file
    
    Parameters:
    ----------
    fname: File to be read
    
    Returns:
    -------
    yields preprocessed line
    """
    with open(fname) as f:
        for line in f:
            yield gensim.utils.simple_preprocess(line, deacc=True, min_len=3)
In [5]:
train_texts = list(build_texts(lee_train_file))
In [6]:
len(train_texts)
Out[6]:
300

Preprocessing our data. Remember: Garbage In Garbage Out

                                    "NLP is 80% preprocessing."
                                                            -Lev Konstantinovskiy

This is the single most important step in setting up a good topic modeling system. If the preprocessing is not good, the algorithm can't do much since we would be feeding it a lot of noise. In this tutorial, we will be filtering out the noise using the following steps in this order for each line:

  1. Stopword removal using NLTK's english stopwords dataset.
  2. Bigram collocation detection (frequently co-occuring tokens) using gensim's Phrases. This is our first attempt to find some hidden structure in the corpus. You can even try trigram collocation detection.
  3. Lemmatization (using gensim's lemmatize) to only keep the nouns. Lemmatization is generally better than stemming in the case of topic modeling since the words after lemmatization still remain understable. However, generally stemming might be preferred if the data is being fed into a vectorizer and isn't intended to be viewed.
In [7]:
bigram = gensim.models.Phrases(train_texts)  # for bigram collocation detection
In [8]:
bigram[['new', 'york', 'example']]
Out[8]:
[u'new_york', u'example']
In [9]:
from gensim.utils import lemmatize
from nltk.corpus import stopwords
In [10]:
stops = set(stopwords.words('english'))  # nltk stopwords list
In [11]:
def process_texts(texts):
    """
    Function to process texts. Following are the steps we take:
    
    1. Stopword Removal.
    2. Collocation detection.
    3. Lemmatization (not stem since stemming can reduce the interpretability).
    
    Parameters:
    ----------
    texts: Tokenized texts.
    
    Returns:
    -------
    texts: Pre-processed tokenized texts.
    """
    texts = [[word for word in line if word not in stops] for line in texts]
    texts = [bigram[line] for line in texts]
    texts = [[word.split('/')[0] for word in lemmatize(' '.join(line), allowed_tags=re.compile('(NN)'), min_length=3)] for line in texts]
    return texts
In [12]:
train_texts = process_texts(train_texts)
train_texts[5:6]
Out[12]:
[['afghani',
  'asylum_seeker',
  'australia',
  'return',
  'home',
  'environment',
  'government',
  'application',
  'kabul',
  'foreign_affair',
  'downer',
  'process',
  'threat',
  'person',
  'asylum',
  'afghan',
  'australia',
  'matter',
  'britain',
  'country',
  'europe',
  'taliban',
  'power',
  'afghanistan',
  'taliban',
  'airlift',
  'detainee',
  'christmas',
  'island',
  'island',
  'nauru',
  'total',
  'person',
  'island',
  'operation',
  'aircraft',
  'airlift',
  'today',
  'asylum_seeker',
  'claim',
  'visa',
  'department',
  'immigration',
  'detainee',
  'christmas',
  'island',
  'spokesman',
  'decision']]

Finalising our dictionary and corpus

In [13]:
dictionary = Dictionary(train_texts)
corpus = [dictionary.doc2bow(text) for text in train_texts]

Topic modeling with LSI

This is a useful topic modeling algorithm in that it can rank topics by itself. Thus it outputs topics in a ranked order. However it does require a num_topics parameter (set to 200 by default) to determine the number of latent dimensions after the SVD.

In [14]:
lsimodel = LsiModel(corpus=corpus, num_topics=10, id2word=dictionary)
In [15]:
lsimodel.show_topics(num_topics=5)  # Showing only the top 5 topics
Out[15]:
[(0,
  u'-0.241*"person" + -0.202*"australia" + -0.201*"government" + -0.193*"afghanistan" + -0.182*"day" + -0.174*"attack" + -0.156*"force" + -0.155*"area" + -0.154*"man" + -0.147*"security"'),
 (1,
  u'0.524*"fire" + 0.274*"sydney" + 0.269*"area" + 0.219*"firefighter" + 0.180*"wale" + 0.163*"wind" + -0.139*"israel" + -0.138*"attack" + 0.136*"line" + 0.126*"today"'),
 (2,
  u'-0.333*"australia" + 0.320*"israel" + 0.243*"palestinian" + -0.205*"afghanistan" + 0.204*"fire" + 0.177*"attack" + 0.174*"sharon" + 0.128*"yasser_arafat" + -0.122*"company" + 0.119*"office"'),
 (3,
  u'0.353*"afghanistan" + -0.301*"australia" + 0.236*"pakistan" + 0.221*"force" + 0.153*"afghan" + -0.152*"test" + -0.150*"company" + 0.146*"area" + -0.132*"union" + 0.114*"tora_bora"'),
 (4,
  u'-0.331*"union" + -0.327*"company" + 0.197*"test" + 0.193*"australia" + -0.190*"worker" + 0.189*"day" + -0.169*"qanta" + -0.150*"pakistan" + 0.136*"wicket" + -0.130*"commission"')]
In [93]:
lsitopics = lsimodel.show_topics(formatted=False)

Topic modeling with HDP

An HDP model is fully unsupervised. It can also determine the ideal number of topics it needs through posterior inference.

In [17]:
hdpmodel = HdpModel(corpus=corpus, id2word=dictionary)
In [18]:
hdpmodel.show_topics()
Out[18]:
[u'topic 0: 0.004*collapse + 0.004*afghanistan + 0.004*troop + 0.003*force + 0.003*government + 0.002*benefit + 0.002*operation + 0.002*taliban + 0.002*time + 0.002*today + 0.002*ypre + 0.002*tourism + 0.002*person + 0.002*help + 0.002*wayne + 0.002*fire + 0.002*peru + 0.002*day + 0.002*united_state + 0.002*hih',
 u'topic 1: 0.003*group + 0.003*government + 0.002*target + 0.002*palestinian + 0.002*end + 0.002*terrorism + 0.002*cease + 0.002*memorandum + 0.002*radio + 0.002*call + 0.002*official + 0.002*path + 0.002*security + 0.002*wayne + 0.002*attack + 0.002*human_right + 0.001*four + 0.001*gunman + 0.001*sharon + 0.001*subsidiary',
 u'topic 2: 0.003*rafter + 0.003*double + 0.003*team + 0.002*reality + 0.002*manager + 0.002*cup + 0.002*australia + 0.002*abc + 0.002*nomination + 0.002*user + 0.002*freeman + 0.002*herberton + 0.002*lung + 0.002*believe + 0.002*injury + 0.002*steve_waugh + 0.002*fact + 0.002*statement + 0.002*mouth + 0.002*alejandro',
 u'topic 3: 0.003*india + 0.003*sector + 0.002*anthony + 0.002*interview + 0.002*suicide_bomber + 0.002*union + 0.002*marconi + 0.002*imprisonment + 0.002*document + 0.002*mood + 0.002*remember + 0.002*repair + 0.002*vicki + 0.001*training + 0.001*dressing + 0.001*government + 0.001*indian + 0.001*law + 0.001*convention + 0.001*pair',
 u'topic 4: 0.003*airport + 0.003*commission + 0.002*marathon + 0.002*tonne + 0.002*citizen + 0.002*dickie + 0.002*arrest + 0.002*taliban + 0.002*opposition + 0.002*agha + 0.002*pitch + 0.002*tune + 0.002*regulation + 0.002*monday + 0.002*chile + 0.002*night + 0.002*foreign_affair + 0.002*charge + 0.002*county + 0.002*signature',
 u'topic 5: 0.005*company + 0.002*share + 0.002*version + 0.002*entitlement + 0.002*staff + 0.002*value + 0.002*tanzim + 0.002*bay + 0.002*beaumont + 0.002*cent + 0.002*world + 0.002*hass + 0.002*broker + 0.002*line + 0.002*tie + 0.002*plane + 0.002*flare + 0.001*creditor + 0.001*pay + 0.001*administrator',
 u'topic 6: 0.002*hiv + 0.002*aids + 0.002*margin + 0.002*worker + 0.002*horror + 0.002*claire + 0.002*nation + 0.002*person + 0.002*battleground + 0.002*christmas + 0.002*quarters + 0.002*day + 0.002*underdog + 0.002*festival + 0.002*devaluation + 0.002*immunity + 0.001*quirindi + 0.001*auditor + 0.001*europe + 0.001*board',
 u'topic 7: 0.002*david + 0.002*victim + 0.002*navy + 0.002*promise + 0.002*symbol + 0.002*site + 0.002*agenda + 0.002*endeavour + 0.002*hamas + 0.002*installation + 0.002*bulli + 0.002*quarrel + 0.002*israeli + 0.002*leaf + 0.002*space + 0.002*sharon + 0.002*spa + 0.002*dispute + 0.002*council + 0.002*tit',
 u'topic 8: 0.005*storm + 0.004*tree + 0.002*roger + 0.002*aedt + 0.002*minister + 0.002*service + 0.002*sydney + 0.002*electricity + 0.002*power + 0.002*split + 0.002*impact + 0.002*australia + 0.002*area + 0.002*quirindi + 0.002*expansion + 0.002*hornsby + 0.002*standing + 0.002*judgment + 0.002*search + 0.002*thank',
 u'topic 9: 0.003*australia + 0.003*economy + 0.002*ward + 0.002*game + 0.002*brought + 0.002*johnston + 0.002*supporter + 0.002*recession + 0.002*stray + 0.002*boat_people + 0.002*ritual + 0.002*thousand + 0.001*police + 0.001*box + 0.001*britain + 0.001*year + 0.001*thing + 0.001*kill + 0.001*tour + 0.001*junction',
 u'topic 10: 0.003*match + 0.003*crowd + 0.002*team + 0.002*rafter + 0.002*scrapping + 0.002*decision + 0.002*guarantee + 0.002*masood + 0.002*tennis + 0.002*forestry + 0.002*world + 0.002*france + 0.002*member + 0.002*career + 0.002*australia + 0.002*single + 0.002*rubber + 0.002*road + 0.002*tower + 0.002*attack',
 u'topic 11: 0.002*cycle + 0.002*communication + 0.002*spend + 0.002*airline + 0.002*flight + 0.002*amendment + 0.002*swift + 0.002*morning + 0.002*ansett + 0.002*mark + 0.002*platform + 0.002*administrator + 0.002*screen + 0.002*launceston + 0.002*airplane + 0.002*alarming + 0.002*worker + 0.001*tent + 0.001*severance + 0.001*wilton',
 u'topic 12: 0.003*summit + 0.003*indonesia + 0.002*john + 0.002*pitwater + 0.002*president + 0.002*week + 0.002*howard + 0.002*issue + 0.002*baptist + 0.002*city + 0.002*model + 0.002*mile + 0.002*talk + 0.002*australia + 0.002*network + 0.002*head + 0.002*passage + 0.002*quinlan + 0.002*start + 0.002*match',
 u'topic 13: 0.002*sorrow + 0.002*australia + 0.002*israelis + 0.002*middle_east + 0.002*deck + 0.002*sydney + 0.002*variety + 0.002*zimbabwean + 0.002*general + 0.002*calculation + 0.002*instrument + 0.002*piece + 0.002*treatment + 0.002*truce + 0.002*wicket + 0.002*submission + 0.002*line + 0.002*december + 0.002*showing + 0.001*father',
 u'topic 14: 0.002*game + 0.002*giuliani + 0.002*care + 0.002*java + 0.002*mystery + 0.002*session + 0.002*seeker + 0.002*distance + 0.002*tennessee + 0.002*transmission + 0.002*hamid + 0.002*cabinet + 0.002*day + 0.002*regret + 0.002*australia + 0.002*lifestyle + 0.002*afghanistan + 0.002*preview + 0.002*test + 0.002*hit',
 u'topic 15: 0.003*president + 0.002*rabbani + 0.002*maxi + 0.002*penalty + 0.002*show + 0.002*sibling + 0.002*adjournment + 0.002*new_delhi + 0.002*permission + 0.002*jackie + 0.002*arrest + 0.002*motive + 0.002*outcome + 0.002*shift + 0.002*spy + 0.002*beech + 0.002*beset + 0.002*need + 0.002*personnel + 0.002*mitchell',
 u'topic 16: 0.002*today + 0.002*matter + 0.002*work + 0.002*debate + 0.002*agreement + 0.002*mastermind + 0.002*member + 0.002*downer + 0.002*intercept + 0.002*bedside + 0.002*felix + 0.002*assembly + 0.002*afghan + 0.002*saudi + 0.002*burn + 0.002*franc + 0.002*modification + 0.002*spelt + 0.002*declared + 0.002*resist',
 u'topic 17: 0.002*margaret + 0.002*government + 0.002*disruption + 0.002*hingis + 0.002*section + 0.002*security + 0.002*corps + 0.002*pakistan + 0.002*front + 0.002*insurance + 0.002*maintenance + 0.002*order + 0.002*plume + 0.002*amendment + 0.002*demand + 0.001*hawke + 0.001*coal + 0.001*discontent + 0.001*modification + 0.001*distress',
 u'topic 18: 0.002*speaker + 0.002*love + 0.002*safety + 0.002*chaman + 0.002*coastguard + 0.002*salfit + 0.002*soccer + 0.002*payment + 0.002*complexity + 0.002*personnel + 0.002*flood + 0.002*employment + 0.002*morrow + 0.002*community + 0.002*darren + 0.002*context + 0.001*tunnel + 0.001*negotiation + 0.001*friendship + 0.001*sutherland',
 u'topic 19: 0.003*brain + 0.003*team + 0.003*olympic + 0.002*cell + 0.002*embryo + 0.002*suburb + 0.002*speaking + 0.002*macfarlane + 0.002*sheet + 0.002*overtime + 0.002*man + 0.002*finding + 0.002*canyon + 0.002*research + 0.002*manhattan + 0.002*brutality + 0.002*spot + 0.002*backdrop + 0.001*pervez + 0.001*sector']
In [94]:
hdptopics = hdpmodel.show_topics(formatted=False)

Topic modeling using LDA

This is one the most popular topic modeling algorithms today. It is a generative model in that it assumes each document is a mixture of topics and in turn, each topic is a mixture of words. To understand it better you can watch this lecture by David Blei. Let's choose 10 topics to initialize this.

In [20]:
ldamodel = LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)

pyLDAvis is a great way to visualize an LDA model. To summarize in short, the area of the circles represent the prevelance of the topic. The length of the bars on the right represent the membership of a term in a particular topic. pyLDAvis is based on this paper.

In [21]:
import pyLDAvis.gensim
In [22]:
pyLDAvis.enable_notebook()
In [88]:
pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
Out[88]:
In [95]:
ldatopics = ldamodel.show_topics(formatted=False)

Finding out the optimal number of topics

Introduction to topic coherence: Topic coherence in essence measures the human interpretability of a topic model. Traditionally perplexity has been used to evaluate topic models however this does not correlate with human annotations at times. Topic coherence is another way to evaluate topic models with a much higher guarantee on human interpretability. Thus this can be used to compare different topic models among many other use-cases. Here's a short blog I wrote explaining topic coherence: What is topic coherence?

In [25]:
def evaluate_graph(dictionary, corpus, texts, limit):
    """
    Function to display num_topics - LDA graph using c_v coherence
    
    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    limit : topic limit
    
    Returns:
    -------
    lm_list : List of LDA topic models
    c_v : Coherence values corresponding to the LDA model with respective number of topics
    """
    c_v = []
    lm_list = []
    for num_topics in range(1, limit):
        lm = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)
        lm_list.append(lm)
        cm = CoherenceModel(model=lm, texts=texts, dictionary=dictionary, coherence='c_v')
        c_v.append(cm.get_coherence())
        
    # Show graph
    x = range(1, limit)
    plt.plot(x, c_v)
    plt.xlabel("num_topics")
    plt.ylabel("Coherence score")
    plt.legend(("c_v"), loc='best')
    plt.show()
    
    return lm_list, c_v
In [26]:
%%time
lmlist, c_v = evaluate_graph(dictionary=dictionary, corpus=corpus, texts=train_texts, limit=10)
CPU times: user 22.8 s, sys: 536 ms, total: 23.4 s
Wall time: 22.9 s
In [97]:
pyLDAvis.gensim.prepare(lmlist[2], corpus, dictionary)
Out[97]:
In [96]:
lmtopics = lmlist[5].show_topics(formatted=False)

LDA as LSI

One of the problem with LDA is that if we train it on a large number of topics, the topics get "lost" among the numbers. Let us see if we can dig out the best topics from the best LDA model we can produce. The function below can be used to control the quality of the LDA model we produce.

In [76]:
def ret_top_model():
    """
    Since LDAmodel is a probabilistic model, it comes up different topics each time we run it. To control the
    quality of the topic model we produce, we can see what the interpretability of the best topic is and keep
    evaluating the topic model until this threshold is crossed. 
    
    Returns:
    -------
    lm: Final evaluated topic model
    top_topics: ranked topics in decreasing order. List of tuples
    """
    top_topics = [(0, 0)]
    while top_topics[0][1] < 0.97:
        lm = LdaModel(corpus=corpus, id2word=dictionary)
        coherence_values = {}
        for n, topic in lm.show_topics(num_topics=-1, formatted=False):
            topic = [word for word, _ in topic]
            cm = CoherenceModel(topics=[topic], texts=train_texts, dictionary=dictionary, window_size=10)
            coherence_values[n] = cm.get_coherence()
        top_topics = sorted(coherence_values.items(), key=operator.itemgetter(1), reverse=True)
    return lm, top_topics
In [70]:
lm, top_topics = ret_top_model()
In [79]:
print(top_topics[:5])
[(91, 0.99286550077029223), (42, 0.96031455145699274), (54, 0.87011963575683104), (2, 0.84575428129030361), (10, 0.83238343784453017)]

Inference

We can clearly see below that the first topic is about cinema, second is about email malware, third is about the land which was given back to the Larrakia aboriginal community of Australia in 2000. Then there's one about Australian cricket. LDA as LSI has worked wonderfully in finding out the best topics from within LDA.

In [78]:
pprint([lm.show_topic(topicid) for topicid, c_v in top_topics[:10]])
[[(u'actor', 0.034688196735986693),
  (u'picture', 0.023163878883499418),
  (u'award', 0.023163878883499418),
  (u'comedy', 0.023163878883499418),
  (u'globe', 0.023163878883499418),
  (u'nomination', 0.023163878883499418),
  (u'actress', 0.023163878883499418),
  (u'film', 0.023163878883499418),
  (u'drama', 0.011639561031012149),
  (u'winner', 0.011639561031012149)],
 [(u'virus', 0.064292949289013482),
  (u'user', 0.048074573973209883),
  (u'computer', 0.040350900997751814),
  (u'company', 0.028173623478117912),
  (u'email', 0.022580226976870982),
  (u'worm', 0.020928236506996975),
  (u'attachment', 0.014534311779706417),
  (u'outlook', 0.01260706654637953),
  (u'software', 0.011909411409069969),
  (u'list', 0.0088116041533348403)],
 [(u'claim', 0.0096511365969504694),
  (u'agreement', 0.0082836950379963047),
  (u'hectare', 0.0077564979304569235),
  (u'larrakia', 0.0065928813973845394),
  (u'rosebury', 0.006086042494624749),
  (u'term', 0.004880655853124416),
  (u'region', 0.004786636929111303),
  (u'title', 0.0045026307214029735),
  (u'palmerston', 0.0043726827115423677),
  (u'developer', 0.0040102561358092521)],
 [(u'government', 0.046880132726190141),
  (u'razor', 0.035772624674521684),
  (u'gang', 0.034958865711441162),
  (u'minister', 0.023615858300345904),
  (u'interest', 0.023531518290467797),
  (u'taxpayer', 0.023484887279677492),
  (u'nelson', 0.023408331025582648),
  (u'spending', 0.023363131530296326),
  (u'program', 0.022809499664362586),
  (u'colleague', 0.012039863390851384)],
 [(u'australia', 0.019022701887671096),
  (u'outlook', 0.012806577991883974),
  (u'price', 0.012017645637892888),
  (u'growth', 0.011021360611214826),
  (u'world', 0.010586500333515535),
  (u'imf', 0.0074848683800558145),
  (u'half', 0.0073080219523406773),
  (u'release', 0.0073069514968024446),
  (u'oil', 0.0071307771829650724),
  (u'weakening', 0.0067585126681211785)],
 [(u'role', 0.036823234375415084),
  (u'heart', 0.018676496748175567),
  (u'mcreddie', 0.018520830095514161),
  (u'sir', 0.018430691138823303),
  (u'actor', 0.018423768093119148),
  (u'attack', 0.018421603513127272),
  (u'minister', 0.018330977218667187),
  (u'cancer', 0.018246768643902407),
  (u'servant', 0.018246520413261125),
  (u'friend', 0.018230140539399531)],
 [(u'australia', 0.038230610979973961),
  (u'test', 0.03039802044037989),
  (u'day', 0.026478028361575149),
  (u'adam', 0.023237227270639361),
  (u'wicket', 0.018060239149805601),
  (u'match', 0.015652900511647725),
  (u'gilchrist', 0.015206348827236857),
  (u'steve_waugh', 0.01496754571623464),
  (u'south_africa', 0.013902623982144873),
  (u'selector', 0.012332915474867073)],
 [(u'product', 0.067729999063555119),
  (u'food', 0.033921347284742248),
  (u'consumer', 0.033921347284742241),
  (u'company', 0.033921347284742241),
  (u'hooke', 0.022651796691804622),
  (u'law', 0.022651796691804622),
  (u'grocery', 0.022651796691804622),
  (u'technology', 0.022651796691804622),
  (u'sultan', 0.014079780537934588),
  (u'stage', 0.013736597864617922)],
 [(u'credit', 0.020223411999648302),
  (u'way', 0.017706515460000523),
  (u'bank', 0.017459639386736926),
  (u'card', 0.016308335204832106),
  (u'consumer', 0.014565787979687885),
  (u'reserve_bank', 0.014365008462949415),
  (u'association', 0.011448453247788988),
  (u'rate', 0.010363334709658676),
  (u'movement', 0.010204675471073506),
  (u'inquiry', 0.0093452022355641085)],
 [(u'fire', 0.045611922604745642),
  (u'area', 0.021994719721821848),
  (u'firefighter', 0.018748173264525044),
  (u'sydney', 0.016599279291396325),
  (u'wind', 0.014270025525472343),
  (u'property', 0.0098028785236429564),
  (u'hour', 0.0097079779464512347),
  (u'today', 0.0093953004964965076),
  (u'year', 0.0089216764257795157),
  (u'state', 0.0086116373269496185)]]
In [98]:
lda_lsi_topics = [[word for word, prob in lm.show_topic(topicid)] for topicid, c_v in top_topics]

Evaluating all the topic models

Any topic model which can come up with topic terms can be plugged into the coherence pipeline. You can even plug in an NMF topic model created with scikit-learn.

In [99]:
lsitopics = [[word for word, prob in topic] for topicid, topic in lsitopics]

hdptopics = [[word for word, prob in topic] for topicid, topic in hdptopics]

ldatopics = [[word for word, prob in topic] for topicid, topic in ldatopics]

lmtopics = [[word for word, prob in topic] for topicid, topic in lmtopics]
In [100]:
lsi_coherence = CoherenceModel(topics=lsitopics[:10], texts=train_texts, dictionary=dictionary, window_size=10).get_coherence()

hdp_coherence = CoherenceModel(topics=hdptopics[:10], texts=train_texts, dictionary=dictionary, window_size=10).get_coherence()

lda_coherence = CoherenceModel(topics=ldatopics, texts=train_texts, dictionary=dictionary, window_size=10).get_coherence()

lm_coherence = CoherenceModel(topics=lmtopics, texts=train_texts, dictionary=dictionary, window_size=10).get_coherence()

lda_lsi_coherence = CoherenceModel(topics=lda_lsi_topics[:10], texts=train_texts, dictionary=dictionary, window_size=10).get_coherence()
In [101]:
def evaluate_bar_graph(coherences, indices):
    """
    Function to plot bar graph.
    
    coherences: list of coherence values
    indices: Indices to be used to mark bars. Length of this and coherences should be equal.
    """
    assert len(coherences) == len(indices)
    n = len(coherences)
    x = np.arange(n)
    plt.bar(x, coherences, width=0.2, tick_label=indices, align='center')
    plt.xlabel('Models')
    plt.ylabel('Coherence Value')
In [102]:
evaluate_bar_graph([lsi_coherence, hdp_coherence, lda_coherence, lm_coherence, lda_lsi_coherence],
                   ['LSI', 'HDP', 'LDA', 'LDA_Mod', 'LDA_LSI'])

Customizing the topic coherence measure

Till now we only used the c_v coherence measure. There are others such as u_mass, c_uci, c_npmi. All of these calculate coherence in a different way. c_v is found to be most in line with human ratings but can be much slower than u_mass since it uses a sliding window over the texts.

Making your own coherence measure

Let's modify c_uci to use s_one_pre instead of s_one_one segmentation

In [47]:
from gensim.topic_coherence import (segmentation, probability_estimation,
                                    direct_confirmation_measure, indirect_confirmation_measure,
                                    aggregation)
from gensim.matutils import argsort
from collections import namedtuple
In [48]:
make_pipeline = namedtuple('Coherence_Measure', 'seg, prob, conf, aggr')
In [49]:
measure = make_pipeline(segmentation.s_one_one,
                        probability_estimation.p_boolean_sliding_window,
                        direct_confirmation_measure.log_ratio_measure,
                        aggregation.arithmetic_mean)

To get topics out of the topic model:

In [50]:
topics = []
for topic in lm.state.get_lambda():
    bestn = argsort(topic, topn=10, reverse=True)
topics.append(bestn)

Step 1: Segmentation

In [51]:
# Perform segmentation
segmented_topics = measure.seg(topics)

Step 2: Probability estimation

In [52]:
# Since this is a window-based coherence measure we will perform window based prob estimation
per_topic_postings, num_windows = measure.prob(texts=train_texts, segmented_topics=segmented_topics,
                                               dictionary=dictionary, window_size=2)

Step 3: Confirmation Measure

In [53]:
confirmed_measures = measure.conf(segmented_topics, per_topic_postings, num_windows, normalize=False)

Step 4: Aggregation

In [54]:
print(measure.aggr(confirmed_measures))
-11.2873225334

How this topic model can be used further

The best topic model here can be used as a standalone for news article classification. However a topic model can also be used as a dimensionality reduction algorithm to feed into a classifier. A good topic model should be able to extract the signal from the noise efficiently, hence improving the performance of the classifier.