Fetching HTML via Beautiful Soup

January 5, 2014

I needed to pull down news stories from a web site that didn't have an RSS feed, so I used Beautiful Soup from Python to parse the HTML and find the relevant section of each page.

In the following code, lNewsIndexUrl contains the full URL of a news index page (for example http://host.com/path.html). We fetch the HTML from this page and pass it to BeautifulSoup:

import urllib2
from bs4 import BeautifulSoup
lIndexHtml = urllib2.urlopen(lNewsIndexUrl).read()
lIndexSoup = BeautifulSoup(lIndexHtml, 'html.parser')  # name the parser explicitly
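
The fetch step can also be written so that it runs on Python 3, where urllib2 no longer exists. The following is only a sketch: the fetch_html name and the 10-second timeout are assumptions made for illustration and are not part of the original script.

```python
from contextlib import closing

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2, as used in this post


def fetch_html(url, timeout=10):
    """Return the raw bytes served at `url`.

    `timeout` is in seconds; without it a stalled server can block forever.
    """
    # closing() makes sure the connection is released even if read() fails.
    with closing(urlopen(url, timeout=timeout)) as response:
        return response.read()
```

The returned bytes can be passed straight to BeautifulSoup, which works out the character encoding itself.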

We can now find the HTML container for the news headlines (each linked to its own news page) by its unique HTML id, and then find all the links inside that container:

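lNewsIndex = lIndexSoup.find(id='IdFromTheHtml')
if lNewsIndex is not None:  # find() returns None when no element has that id
    for lLink in lNewsIndex.find_all('a'):
        # get_text() also copes with empty links and text nested in child tags
        lLinkText = lLink.get_text(' ', strip=True)
        lHref = lLink.get('href')  # None if the link has no href attribute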

At this point lLinkText contains the title text from the link, and lHref contains the actual link destination.
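
The link extraction can be tried without touching the network by running it against a small inline page. This is a sketch under stated assumptions: the sample markup, the extract_links helper and its parameters are all hypothetical. It also resolves each href against the index URL with urljoin from the standard library, since links on index pages are often relative.

```python
from bs4 import BeautifulSoup

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

# Hypothetical markup standing in for the real news index page.
SAMPLE_INDEX_HTML = """
<div id="IdFromTheHtml">
  <a href="/news/1.html">First <b>story</b></a>
  <a href="news/2.html">Second story</a>
  <a name="top">Anchor without an href</a>
</div>
"""


def extract_links(html, container_id, base_url):
    """Return (title, absolute URL) pairs for the links inside the container."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=container_id)
    if container is None:
        # No element with that id: report no links rather than raising.
        return []
    return [
        (link.get_text(" ", strip=True), urljoin(base_url, link["href"]))
        # href=True skips <a> tags that have no href attribute at all.
        for link in container.find_all("a", href=True)
    ]


links = extract_links(SAMPLE_INDEX_HTML, "IdFromTheHtml", "http://host.com/path.html")
for title, url in links:
    print("%s -> %s" % (title, url))
```

Both sample links resolve to absolute URLs under http://host.com/news/, and the anchor without an href is skipped.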

We can now go on to open that link (prefixing it with http://host.com/ if it is relative, so that lNewsItemUrl holds the full URL) and fetch the HTML there too:

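lItemHtml = urllib2.urlopen(lNewsItemUrl).read()
lItemSoup = BeautifulSoup(lItemHtml, 'html.parser')
lItemText = lItemSoup.find(id='IdOfNewsItemText')
lItemContent = u""  # stays empty if the page has no element with that id
if lItemText:
    # Join the text fragments with spaces so adjacent words do not run together
    lItemContent = u" ".join(lItemText.stripped_strings)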

lItemContent now contains the text of the entire news item, with the HTML markup removed.
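
Because stripped_strings yields each text fragment with its surrounding whitespace trimmed, the separator used when combining the fragments decides whether neighbouring paragraphs run together. A minimal sketch, assuming a hypothetical sample page and a hypothetical extract_text helper:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real news item page.
SAMPLE_ITEM_HTML = """
<div id="IdOfNewsItemText">
  <h1>Headline</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""


def extract_text(html, container_id, separator=u" "):
    """Return the container's text with tags removed, or u"" if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id=container_id)
    if container is None:
        return u""
    # stripped_strings skips whitespace-only fragments and trims the rest.
    return separator.join(container.stripped_strings)


spaced = extract_text(SAMPLE_ITEM_HTML, "IdOfNewsItemText")
unspaced = extract_text(SAMPLE_ITEM_HTML, "IdOfNewsItemText", separator=u"")
print(spaced)    # Headline First paragraph. Second paragraph.
print(unspaced)  # HeadlineFirst paragraph.Second paragraph.
```

Passing u"\n" as the separator instead keeps one fragment per line, which may suit plain-text output better.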