PhantomJS Site Scrape
November 23, 2012
PhantomJS is a standalone, headless, WebKit-based browser that runs from the command line. It runs scripts written in JavaScript, and those scripts can in turn run code in the context of a remote web page.
This script goes to the Rothwell Temperance Band website at http://rtb.org.uk/ and finds all the H3 tags that are inside #contenthome. It gets the contents of the H3 tags, and prints them out as JSON.
var page = new WebPage();
var output = { errors: [], results: [] };

page.open('http://rtb.org.uk/', function (status) {
    if (status !== 'success') {
        output.errors.push('Unable to access network');
    } else {
        // This function runs inside the page; only its return value comes back
        var headlines = page.evaluate(function () {
            try {
                var headlines = document.querySelectorAll('#contenthome h3');
                var lReturn = [];
                for (var i = 0; i < headlines.length; ++i) {
                    var item = headlines[i];
                    lReturn.push(item.innerHTML);
                }
                return lReturn;
            } catch (e) {
                return [];
            }
        });
        if (headlines.length === 0) {
            output.errors.push('No headlines found');
        } else {
            for (var i = 0; i < headlines.length; ++i) {
                var item = headlines[i];
                output.results.push(item);
            }
        }
    }
    // Print the output even when the page failed to load, so errors are reported
    console.log(JSON.stringify(output, null, ' '));
    phantom.exit();
});
page.open opens the web page, and then runs the callback function passed to it. The callback checks whether the page was downloaded successfully, and records an error message in the output if it was not.
page.evaluate then runs a short script on the page, which returns an array containing the contents (innerHTML) of the nodes that match #contenthome h3.
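To try the extraction logic without PhantomJS or a network connection, the loop can be pulled out into a function that takes a document-like object. This is only a sketch: collectContents and fakeDocument are hypothetical names that are not part of the original script, and fakeDocument is a hand-written stand-in for a real DOM.

```javascript
// Hypothetical helper (not part of the original script): collects the
// innerHTML of every node in `doc` that matches `selector`.
function collectContents(doc, selector) {
    var nodes = doc.querySelectorAll(selector);
    var contents = [];
    for (var i = 0; i < nodes.length; ++i) {
        contents.push(nodes[i].innerHTML);
    }
    return contents;
}

// Stand-in for a real document: it only knows about one selector.
var fakeDocument = {
    querySelectorAll: function (selector) {
        if (selector === '#contenthome h3') {
            return [
                { innerHTML: 'First headline' },
                { innerHTML: 'Second headline' }
            ];
        }
        return [];
    }
};

console.log(collectContents(fakeDocument, '#contenthome h3'));
```

In a real PhantomJS script the same loop has to live inside the function passed to page.evaluate, because that function runs in the page and cannot see helpers defined in the outer script.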
Next, we add the text into the output structure and print out that structure as JSON.
Example output is:
{
 "errors": [],
 "results": [
  "2nd November 2012 - Joint Concert with Brighouse & Rastrick",
  "22nd October 2012 - Can We Buy Some Luck Please?",
  "17th October 2012 - Rehearsal for Friends and Supporters"
 ]
}
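The step that builds the output structure can also be looked at on its own. The sketch below assumes a hypothetical buildOutput function (not in the original script) that takes the status string from page.open and the array returned by page.evaluate, and produces the same errors/results shape, using the same error messages as the script.

```javascript
// Hypothetical helper: turns a load status and a list of headlines into
// the { errors, results } structure used by the script.
function buildOutput(status, headlines) {
    var output = { errors: [], results: [] };
    if (status !== 'success') {
        output.errors.push('Unable to access network');
    } else if (headlines.length === 0) {
        output.errors.push('No headlines found');
    } else {
        for (var i = 0; i < headlines.length; ++i) {
            output.results.push(headlines[i]);
        }
    }
    return output;
}

// A successful load with one headline, then a failed load.
console.log(JSON.stringify(buildOutput('success', ['A headline']), null, ' '));
console.log(JSON.stringify(buildOutput('fail', []), null, ' '));
```

Keeping this step in a plain function means the error handling can be checked in any JavaScript runtime, since it does not touch the page or phantom objects.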