Getting started with Python NLTK

December 23, 2013

I want to find the names of bands and people in some text, then automatically tag articles that feature the same person and/or band. I'm investigating NLTK, the Natural Language Toolkit for Python, to do this.


First install nltk into your virtualenv:

$ pip install nltk

We also need some corpora (data files containing text). These can be fetched with the built-in downloader:

(venv)drumcoder@drumcoder:~/dev$ python
Python 2.7.3 (default, Aug  1 2012, 05:14:39) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()
showing info

Click Download when the graphical installer appears.
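The graphical installer is awkward on a machine without a display. As a minimal sketch, assuming only the names corpus is needed for the example below, the same downloader can be called with a package identifier. The helper name ensure_names_corpus is my own and is not part of NLTK.

```python
def ensure_names_corpus():
    """Fetch the 'names' corpus unless it is already installed.

    Hypothetical helper, not part of NLTK. Returns True when the corpus
    is available afterwards.
    """
    # Imported inside the function so that defining the helper does not
    # require nltk; calling it does.
    import nltk
    try:
        # nltk.data.find raises LookupError when the resource is missing.
        nltk.data.find('corpora/names')
        return True
    except LookupError:
        # No GUI is opened when a package identifier is passed.
        return nltk.download('names')
```

Calling ensure_names_corpus() once before importing nltk.corpus.names should be enough for the tagger below. It needs network access the first time it runs.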

Example Code

This simple example (taken from the Python Text Processing book listed below) looks for people's names in a list of words passed to it:

from nltk.tag import SequentialBackoffTagger
from nltk.corpus import names

class NamesTagger(SequentialBackoffTagger):
    def __init__(self, *args, **kwargs):
        SequentialBackoffTagger.__init__(self, *args, **kwargs)
        self.name_set = set([n.lower() for n in names.words()])

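    def choose_tag(self, tokens, index, history):
        word = tokens[index]
        # Tag the word as a proper noun if it appears in the names corpus.
        if word.lower() in self.name_set:
            return 'NNP'
        # Otherwise return None so that a backoff tagger, if any, can try.
        return None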

nt = NamesTagger()
print(nt.tag(['Bob', 'Fred', 'Edith', 'This', 'is', 'not', 'a', 'name']))

and it results in the following output:

$ python
[('Bob', 'NNP'), ('Fred', 'NNP'), ('Edith', 'NNP'), ('This', None), ('is', None), ('not', None), ('a', None), ('name', None)]
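To get from this output towards the original goal of linking articles that mention the same person, one option is to collect the words tagged 'NNP' for each article and build an index from name to article. This is only a sketch: the function names, the article ids and the sample data are all made up for illustration, and the tagged lists are written by hand in the same shape that nt.tag() returns.

```python
from collections import defaultdict

def names_in(tagged_tokens):
    """Return the lower-cased words that were tagged 'NNP'."""
    return set(word.lower() for word, tag in tagged_tokens if tag == 'NNP')

def group_by_name(tagged_articles):
    """Map each name to the sorted list of article ids that mention it."""
    index = defaultdict(set)
    for article_id, tagged_tokens in tagged_articles.items():
        for name in names_in(tagged_tokens):
            index[name].add(article_id)
    return dict((name, sorted(ids)) for name, ids in index.items())

# Hypothetical tagged output for two articles (hand-written sample data).
tagged_articles = {
    'article-1': [('Bob', 'NNP'), ('plays', None), ('cornet', None)],
    'article-2': [('Edith', 'NNP'), ('and', None), ('Bob', 'NNP')],
}

index = group_by_name(tagged_articles)
print(index['bob'])
```

Here 'bob' maps to both articles and 'edith' only to the second, so any name with more than one article id is a candidate for an automatic tag. Band names made of several words would need more than a single-word lookup.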


