Finding proper nouns with NLTK

December 23, 2013

I want to parse news stories and tag them automatically. To do this I'm using the NLTK library for Python.

The first step is to find the proper nouns that occur in the text: names like 'Scotland', as opposed to common nouns like 'book'. NLTK's part-of-speech tagger marks proper nouns with tags beginning NNP, so I can pick them out by tag.

import nltk
from bs4 import BeautifulSoup

def findtags(tag_prefix, tagged_text):
    """
    Find tokens matching the specified tag_prefix
    Returns: dict mapping each matching tag to its five most common words
    """
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                  if tag.startswith(tag_prefix))
    # most_common() orders by frequency; keys() is no longer sorted that way in NLTK 3
    return dict((tag, [word for (word, count) in cfd[tag].most_common(5)])
                for tag in cfd.conditions())

# read file into lHtml
with open('text.txt', 'r') as lFile:
    lHtml = lFile.read()

# clean up html into raw text
# (nltk.clean_html was removed in NLTK 3, so use BeautifulSoup instead)
lRaw = BeautifulSoup(lHtml, 'html.parser').get_text()

# Tokenize the raw text
lTokens = nltk.word_tokenize(lRaw)

# Tag the tokens with their type - ie are they nouns or not
lTokens = nltk.pos_tag(lTokens)

# find all the proper nouns and print them out
lTagDict = findtags('NNP', lTokens)
for tag in sorted(lTagDict):
    print(tag, lTagDict[tag])

For my short example text, this outputs:

NNP ['Chris', 'Drighlington', 'Lisa', 'Accepted', 'Band']
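
To make it clearer what findtags is doing, here is a rough standard-library equivalent run on hand-tagged tokens. It is only a sketch: find_tags_plain, its limit parameter and the sample sentence are hypothetical, and no tagger is involved.

```python
from collections import Counter, defaultdict

def find_tags_plain(tag_prefix, tagged_text, limit=5):
    """
    Group words by tag for tags starting with tag_prefix and
    return the most frequent words for each tag
    """
    counts = defaultdict(Counter)
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            counts[tag][word] += 1
    return {tag: [word for word, _ in counter.most_common(limit)]
            for tag, counter in counts.items()}

# Hand-tagged sample tokens (hypothetical, not real tagger output)
sample = [('Lisa', 'NNP'), ('read', 'VBD'), ('a', 'DT'), ('book', 'NN'),
          ('in', 'IN'), ('Scotland', 'NNP'), ('.', '.'),
          ('Lisa', 'NNP'), ('left', 'VBD'), ('.', '.')]
print(find_tags_plain('NNP', sample))
```

The common noun 'book' is tagged NN, so it is left out, while 'Lisa' and 'Scotland' are grouped under NNP.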

Accepted

Accepted isn't a proper noun, so why is it in this list? It turns out that my source text contains an h3 containing Accepted. Once the HTML is stripped, the heading runs straight into the next sentence, which confuses the tagger.

This doesn't much matter to me - I'm going to be matching up against a known list of keywords, so anything not in this list will be ignored.
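
A minimal sketch of what that matching could look like, assuming the keywords are held in a plain Python set. The names KNOWN_KEYWORDS and match_keywords, and the keyword values themselves, are hypothetical.

```python
# Hypothetical keyword list; a real one would be loaded from wherever tags are stored
KNOWN_KEYWORDS = {'Chris', 'Drighlington', 'Lisa'}

def match_keywords(candidates, keywords):
    """
    Keep only the candidates that appear in the known keyword set,
    preserving their original order
    """
    return [word for word in candidates if word in keywords]

candidates = ['Chris', 'Drighlington', 'Lisa', 'Accepted', 'Band']
print(match_keywords(candidates, KNOWN_KEYWORDS))
```

With this keyword set, the stray 'Accepted' simply drops out because it isn't a known keyword.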

Noun Phrases

I'm actually interested in noun phrases, which here means two NNPs following each other. We can find these by using a RegexpParser, given the tagged lTokens from the code above:

def ExtractPhrases(myTree, phrase):
    """
    Extract phrases from a parsed (chunked) tree
    Phrase = tag for the string phrase (sub-tree) to extract
    Returns: List of deep copies;  Recursive
    """
    myPhrases = []
    # Tree.label() replaces the .node attribute used in NLTK 2
    if myTree.label() == phrase:
        myPhrases.append(myTree.copy(True))
    for child in myTree:
        if isinstance(child, nltk.Tree):
            myPhrases.extend(ExtractPhrases(child, phrase))
    return myPhrases

# GRAMMAR is the chunk grammar string; its definition isn't shown in this post
cp = nltk.RegexpParser(GRAMMAR)
lTree = cp.parse(lTokens)
lNouns = ExtractPhrases(lTree, "NP")
for noun in lNouns:
    print(noun)

This outputs the following noun pairs for my (different) document:

(NP Cornwall/NNP Youth/NNP)
(NP Brass/NNP Band/NNP)
(NP Richard/NNP Evans/NNP)
(NP Truro/NNP High/NNP)
(NP Girls./NNP Flying/NNP)
(NP Owen/NNP Farr/NNP)
(NP Gala/NNP Concert/NNP)
(NP St./NNP Michael/NNP)
(NP Christopher/NNP Bond/NNP)
(NP Royal/NNP Welsh/NNP)
(NP masterclass./NNP Working/NNP)
(NP Truro/NNP College/NNP)
(NP Alan/NNP Pope/NNP)
(NP Aaron/NNP Harvey/NNP)
(NP Jeremy/NNP Willcock/NNP)

Using these pairs I'll get better matching than if I just look for the presence of two proper nouns.
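
For illustration, pairing up two consecutive NNPs can also be sketched in plain Python, without the chunker. This is not the RegexpParser approach above: the function name find_nnp_pairs and the sample tokens are hypothetical, and it assumes the grammar is something along the lines of NP: {<NNP><NNP>}, which is a guess since GRAMMAR isn't shown.

```python
def find_nnp_pairs(tagged_tokens):
    """
    Return non-overlapping pairs of adjacent NNP-tagged words,
    scanning left to right
    """
    pairs = []
    i = 0
    while i < len(tagged_tokens) - 1:
        (word, tag), (next_word, next_tag) = tagged_tokens[i], tagged_tokens[i + 1]
        if tag == 'NNP' and next_tag == 'NNP':
            pairs.append((word, next_word))
            i += 2  # skip past the pair so tokens are not reused
        else:
            i += 1
    return pairs

# Hand-tagged sample tokens (hypothetical, not real tagger output)
sample = [('Richard', 'NNP'), ('Evans', 'NNP'), ('conducted', 'VBD'),
          ('the', 'DT'), ('Brass', 'NNP'), ('Band', 'NNP'), ('in', 'IN'),
          ('Truro', 'NNP')]
print(find_nnp_pairs(sample))
```

A lone proper noun such as 'Truro' at the end of the sample is ignored, because it has no NNP neighbour to pair with.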

References