Getting started with Python NLTK
December 23, 2013
I want to find the names of bands and people in some text, and automatically tag articles that feature the same person and/or band. I'm investigating the Python NLTK (Natural Language Toolkit) to do this.
Installation
First, install nltk into your virtualenv:
$ pip install nltk
We need some corpora (data files containing text). These can be fetched with the built-in downloader:
(venv)drumcoder@drumcoder:~/dev$ python
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()
showing info http://nltk.github.com/nltk_data/
Click Download when the graphical installer appears.
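If the graphical installer isn't an option (on a headless server, say), a minimal sketch along these lines should fetch just the names corpus used in the example further down. The helper name ensure_names_corpus is hypothetical; nltk.data.find and nltk.download are NLTK calls, and the download itself needs network access.

```python
def ensure_names_corpus():
    """Download the NLTK 'names' corpus unless it is already installed.

    ensure_names_corpus is a hypothetical helper name, not part of NLTK.
    """
    import nltk  # imported here so merely defining the helper does not require nltk

    try:
        # Raises LookupError when the corpus is not on NLTK's data path
        nltk.data.find('corpora/names')
    except LookupError:
        # Fetches only the 'names' package, without opening the GUI
        nltk.download('names')
```

Call ensure_names_corpus() once before creating the tagger.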
Example Code
This simple example (taken from the Python Text Processing book listed below) looks for people's names in the word list passed to it:
from nltk.tag import SequentialBackoffTagger
from nltk.corpus import names

class NamesTagger(SequentialBackoffTagger):
    def __init__(self, *args, **kwargs):
        SequentialBackoffTagger.__init__(self, *args, **kwargs)
        # Lower-case every name in the corpus so lookups ignore case
        self.name_set = set([n.lower() for n in names.words()])

    def choose_tag(self, tokens, index, history):
        word = tokens[index]
        if word.lower() in self.name_set:
            return 'NNP'
        else:
            # None means "no tag from this tagger"
            return None

nt = NamesTagger()
print(nt.tag(['Bob', 'Fred', 'Edith', 'This', 'is', 'not', 'a', 'name']))
and it results in the following output:
$ python test.py
[('Bob', 'NNP'), ('Fred', 'NNP'), ('Edith', 'NNP'), ('This', None), ('is', None), ('not', None), ('a', None), ('name', None)]
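To get from this output towards the original goal of linking articles that mention the same person, one option is to collect the words tagged NNP per article and build an index from name to article. The following is only a sketch under assumptions: the function names, the article ids and the sample tagged data are all hypothetical, and it needs nothing from NLTK beyond tagged lists in the shape shown above.

```python
from collections import defaultdict

def names_in(tagged_tokens):
    """Return the lower-cased words tagged 'NNP' in one tagged token list."""
    return {word.lower() for word, tag in tagged_tokens if tag == 'NNP'}

def articles_by_name(tagged_articles):
    """Map each name to the sorted list of article ids that mention it.

    tagged_articles: dict of article id -> list of (word, tag) tuples.
    """
    index = defaultdict(set)
    for article_id, tagged_tokens in tagged_articles.items():
        for name in names_in(tagged_tokens):
            index[name].add(article_id)
    return {name: sorted(ids) for name, ids in index.items()}

# Hypothetical tagged output for two articles, in the same shape as the output above
tagged_articles = {
    'article-1': [('Bob', 'NNP'), ('plays', None), ('cornet', None)],
    'article-2': [('Edith', 'NNP'), ('and', None), ('Bob', 'NNP')],
}

index = articles_by_name(tagged_articles)
# Keep only the names that appear in more than one article
shared = {name: ids for name, ids in index.items() if len(ids) > 1}
print(shared)  # {'bob': ['article-1', 'article-2']}
```

Names are lower-cased here so that "Bob" and "bob" count as the same person, matching the case-insensitive lookup in the tagger.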