PhantomJS Site Scrape

November 23, 2012

PhantomJS is a standalone, headless, WebKit-based browser that runs from the command line. It executes scripts written in JavaScript, and those scripts can also run code in the context of a remote web page.

This script goes to the Rothwell Temperance Band website and finds all the H3 tags inside #contenthome. It collects the contents of those H3 tags and prints them out as JSON.

var page = new WebPage();
var output = { errors: [], results: [] };

// Open the page (the URL goes in the first argument) and run the callback.
page.open('', function (status) {
    if (status !== 'success') {
        output.errors.push('Unable to access network');
    } else {
        // This function runs inside the page and returns plain data.
        var headlines = page.evaluate(function(){
            try {
                var headlines = document.querySelectorAll('#contenthome h3');
                var lReturn = [];
                for (var i = 0; i < headlines.length; ++i) {
                    var item = headlines[i];
                    lReturn.push(item.innerHTML);
                }
                return lReturn;
            } catch (e) {
                return [];
            }
        });
        // console.log(headlines.length);
        if (headlines.length === 0) {
            output.errors.push('No headlines found');
        } else {
            for (var i = 0; i < headlines.length; ++i) {
                var item = headlines[i];
                output.results.push(item);
            }
        }
    }
    console.log(JSON.stringify(output, null, '    '));
    // Without this call PhantomJS keeps running after the script finishes.
    phantom.exit();
});

page.open opens the web page and then runs the callback function passed to it. The callback checks whether the page was downloaded successfully and, if not, records an appropriate error message.

page.evaluate then runs a short script on the page, which returns an array containing the text of the nodes that match #contenthome h3.
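
The function given to page.evaluate is sandboxed inside the page, so it can only use what the page provides, and whatever it returns has to be plain data that can be serialised as JSON. As a rough sketch of that extraction step on its own, the snippet below pulls the loop into a standalone function that can be tried without PhantomJS. The names collectHeadlines and fakeDocument are hypothetical and exist only for illustration. Reading innerHTML is also an assumption, chosen because it would keep HTML entities such as &amp;amp; in the result.

```javascript
// Hypothetical helper: gathers the markup of every node matched by a selector.
// `doc` is anything with a querySelectorAll method, such as the page's document.
function collectHeadlines(doc, selector) {
    try {
        var nodes = doc.querySelectorAll(selector);
        var found = [];
        for (var i = 0; i < nodes.length; ++i) {
            // Assumption: innerHTML is used, so entities stay encoded.
            found.push(nodes[i].innerHTML);
        }
        return found;
    } catch (e) {
        // Any error while querying is treated as "nothing found".
        return [];
    }
}

// Stand-in for a real page document, for illustration only.
var fakeDocument = {
    querySelectorAll: function (selector) {
        return selector === '#contenthome h3'
            ? [{ innerHTML: 'First headline' }, { innerHTML: 'Second headline' }]
            : [];
    }
};

var collected = collectHeadlines(fakeDocument, '#contenthome h3');
console.log(collected);
```

Inside PhantomJS the same loop would sit directly in the function passed to page.evaluate, with the page's own document in place of the stub.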

Next, we add the text into the output structure and print out that structure as JSON.
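
The reporting step can be sketched separately from the browser as well. The helper below is a hypothetical buildOutput function (not part of the original script) that takes the load status and the array of headlines, fills in the same errors and results structure, and leaves the printing to JSON.stringify. The third argument to JSON.stringify is the indent string, here four spaces.

```javascript
// Hypothetical helper: builds the { errors, results } structure from the
// page load status and the headlines collected from the page.
function buildOutput(status, headlines) {
    var output = { errors: [], results: [] };
    if (status !== 'success') {
        output.errors.push('Unable to access network');
    } else if (headlines.length === 0) {
        output.errors.push('No headlines found');
    } else {
        for (var i = 0; i < headlines.length; ++i) {
            output.results.push(headlines[i]);
        }
    }
    return output;
}

// Pretty-print with a four-space indent, as the script does.
var report = JSON.stringify(buildOutput('success', ['Headline one']), null, '    ');
console.log(report);
```

Keeping errors and results in one object means whatever consumes the output only has to parse a single JSON document and check whether errors is empty.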

Example output is:

{
    "errors": [],
    "results": [
        "2nd November 2012 - Joint Concert with Brighouse &amp; Rastrick",
        "22nd October 2012 - Can We Buy Some Luck Please?",
        "17th October 2012 - Rehearsal for Friends and Supporters"
    ]
}