Finding proper nouns with NLTK
December 23, 2013
I want to parse news stories and tag them automatically. To do this I'm using the NLTK library for Python.
The first step is to find the proper nouns that occur in the text: names like 'Scotland', as opposed to common nouns like 'book'. NLTK's part-of-speech tagger labels proper nouns with tags starting with NNP, so I can pull them out like this:
    import nltk

    def findtags(tag_prefix, tagged_text):
        """
        Find tokens matching the specified tag_prefix
        """
        cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                       if tag.startswith(tag_prefix))
        return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())

    # read file into lHtml
    lFile = open('text.txt', 'r')
    lHtml = lFile.read()

    # clean up html into raw text
    lRaw = nltk.clean_html(lHtml)

    # Tokenize the raw text
    lTokens = nltk.word_tokenize(lRaw)

    # Tag the tokens with their type - ie are they nouns or not
    lTokens = nltk.pos_tag(lTokens)

    # find all the proper nouns and print them out
    lTagDict = findtags('NNP', lTokens)
    for tag in sorted(lTagDict):
        print tag, lTagDict[tag]
For my short example text, this outputs:
NNP ['Chris', 'Drighlington', 'Lisa', 'Accepted', 'Band']
'Accepted' isn't a proper noun - why's it in this list? It turns out that my source text contains an h3 heading containing 'Accepted'. Once the HTML is stripped, the heading text is run on to the start of the next sentence, and that confuses the tagger.
This doesn't much matter to me - I'm going to be matching up against a known list of keywords, so anything not in this list will be ignored.
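As a rough illustration of that filtering step, here is a minimal Python 3 sketch using only the standard library. It assumes the text has already been tagged into (word, tag) pairs, which is the shape nltk.pos_tag returns. The sample tokens, the KNOWN_KEYWORDS set and the two helper names are all hypothetical, made up for this example.

```python
from collections import Counter

# Hypothetical keyword list; in practice this would be your own tag vocabulary.
KNOWN_KEYWORDS = {"Chris", "Lisa", "Drighlington"}


def proper_nouns(tagged_text, tag_prefix="NNP"):
    """Count the words whose tag starts with tag_prefix (e.g. NNP, NNPS)."""
    return Counter(word for word, tag in tagged_text if tag.startswith(tag_prefix))


def match_keywords(counts, keywords):
    """Keep only the words that appear in the known keyword list."""
    return sorted(word for word in counts if word in keywords)


# Hand-written sample in the shape pos_tag produces (not real tagger output).
sample = [
    ("Accepted", "NNP"), ("Chris", "NNP"), ("and", "CC"), ("Lisa", "NNP"),
    ("joined", "VBD"), ("the", "DT"), ("band", "NN"), ("in", "IN"),
    ("Drighlington", "NNP"), (".", "."),
]

counts = proper_nouns(sample)
matched = match_keywords(counts, KNOWN_KEYWORDS)
print(matched)
```

In this sample the mis-tagged 'Accepted' is counted as a proper noun but then dropped, simply because it isn't in the keyword set.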
Noun Phrases
I'm actually interested in noun phrases - here, pairs of consecutive NNPs. We can find these by running a RegexpParser over lTokens from the code above:
    def ExtractPhrases( myTree, phrase):
        """
        Extract phrases from a parsed (chunked) tree
        Phrase = tag for the string phrase (sub-tree) to extract
        Returns: List of deep copies; Recursive
        """
        myPhrases = []
        if (myTree.node == phrase):
            myPhrases.append( myTree.copy(True) )
        for child in myTree:
            if (type(child) is nltk.Tree):
                list_of_phrases = ExtractPhrases(child, phrase)
                if (len(list_of_phrases) > 0):
                    myPhrases.extend(list_of_phrases)
        return myPhrases

    cp = nltk.RegexpParser(GRAMMAR)
    lTree = cp.parse(lTokens)
    lNouns = ExtractPhrases(lTree, "NP")
    for noun in lNouns:
        print noun
This outputs the following noun pairs for my (different) document:
(NP Cornwall/NNP Youth/NNP)
(NP Brass/NNP Band/NNP)
(NP Richard/NNP Evans/NNP)
(NP Truro/NNP High/NNP)
(NP Girls./NNP Flying/NNP)
(NP Owen/NNP Farr/NNP)
(NP Gala/NNP Concert/NNP)
(NP St./NNP Michael/NNP)
(NP Christopher/NNP Bond/NNP)
(NP Royal/NNP Welsh/NNP)
(NP masterclass./NNP Working/NNP)
(NP Truro/NNP College/NNP)
(NP Alan/NNP Pope/NNP)
(NP Aaron/NNP Harvey/NNP)
(NP Jeremy/NNP Willcock/NNP)
Using these pairs I'll get better matching than if I just check that the two proper nouns both appear somewhere in the text.
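The RegexpParser snippet above doesn't show how GRAMMAR is defined. A chunk rule such as NP: {<NNP><NNP>} would produce output of this shape, but treat that as an assumption rather than the exact grammar used. To make the pairing rule itself concrete, here is a small Python 3 sketch that does roughly the same left-to-right, non-overlapping pairing without NLTK. The function name and the sample tokens are hypothetical.

```python
def nnp_pairs(tagged_text):
    """Return non-overlapping pairs of adjacent tokens that are both tagged NNP."""
    pairs = []
    i = 0
    while i < len(tagged_text) - 1:
        (word1, tag1), (word2, tag2) = tagged_text[i], tagged_text[i + 1]
        if tag1 == "NNP" and tag2 == "NNP":
            pairs.append((word1, word2))
            i += 2  # both tokens are consumed, so pairs never overlap
        else:
            i += 1
    return pairs


# Hand-written sample in the shape pos_tag produces (not real tagger output).
sample = [
    ("Cornwall", "NNP"), ("Youth", "NNP"), ("Brass", "NNP"), ("Band", "NNP"),
    ("conducted", "VBN"), ("by", "IN"), ("Richard", "NNP"), ("Evans", "NNP"),
    ("in", "IN"), ("Truro", "NNP"), (".", "."),
]

for pair in nnp_pairs(sample):
    print(" ".join(pair))
```

A run of four NNPs such as 'Cornwall Youth Brass Band' comes out as two separate pairs, which matches the shape of the output above, while a lone NNP like 'Truro' in this sample is left unpaired.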