Using Inspect / JavaScript to scrape data from visualisations online

My last post talked about making over this visualisation from The Guardian:

2016-11-13_12-55-29

What I haven’t explained is how I found the data. That is what I intend to outline in this post. These skills are very useful if you need to find the data behind visualisations / tables found online so that you can re-visualise it.

The first step in trying to download the data behind any online visualisation is to check how it is made. It may simply be a static graphic, in which case extraction may be hard unless it is a chart you can unplot using WebPlotDigitizer. Interactive visualisations, however, are typically built with JavaScript, unless they use a bespoke product such as Tableau.

Assuming it is interactive, you can start exploring by right-clicking on the visualisation and choosing Inspect (in Chrome; other browsers have similar developer tools).

2016-11-13_19-26-35

I was greeted with this view:

2016-11-13_19-28-09.png

I don’t know much about coding but this looks like the view is being built from a series of paths. I wonder how it might be doing this? We can find out by digging deeper; let’s visit the Sources tab:

2016-11-13_19-31-30

Our job on this tab is to look for anything unusual outside the typical JavaScript libraries (you learn to recognise these by being curious and looking at lots of sites). The first file, gay-rights-united-states, looks suspect but, as can be seen from the image above, it is empty.

Scrolling down (see below), we find an embedded file / folder (flat.html), and inside it something new: all.js and main.js…

2016-11-13_19-34-05

Investigating all.js reveals nothing much, but main.js shows us something very interesting on line 8. JACKPOT! A Google Sheet containing the full dataset.

2016-11-13_19-38-25
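Reading through a script by eye worked here, but the same hunt can be roughly automated. The sketch below scans a piece of JavaScript source text for URLs that look like data sources. The function name findDataUrls, the URL patterns and the sample source text are all my own assumptions for illustration; they are not taken from the Guardian's code.

```javascript
// Scan JavaScript source text for URLs that look like data sources.
// The patterns are illustrative guesses, not an exhaustive list.
function findDataUrls(sourceText) {
  const urlPattern = /https?:\/\/[^\s'"<>)]+/g;
  const looksLikeData = /docs\.google\.com\/spreadsheets|\.csv\b|\.json\b|\.tsv\b/i;
  const urls = sourceText.match(urlPattern) || [];
  // Keep each URL once, and only if it matches a data-like pattern.
  return [...new Set(urls)].filter((url) => looksLikeData.test(url));
}

// Hypothetical stand-in for the contents of a script such as main.js.
const exampleSource = `
  var lib = 'https://example.com/libs/jquery.min.js';
  var key = 'https://docs.google.com/spreadsheets/d/EXAMPLE_KEY/pubhtml';
`;

console.log(findDataUrls(exampleSource));
```

With the made-up sample above, only the spreadsheet URL is returned; the library URL is ignored.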

And we can start vizzing! (btw I transposed this for my visualisation to get a column per right).
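For anyone who would rather transpose in code than in a spreadsheet or data-prep tool, here is a minimal sketch. It assumes the table is held as an array of rows with one row per right; the layout and the sample values are made up and are not the real sheet.

```javascript
// Transpose a rectangular table held as an array of rows.
function transpose(rows) {
  if (rows.length === 0) return [];
  return rows[0].map((_, col) => rows.map((row) => row[col]));
}

// Made-up sample: one row per right, one column per state.
const byRight = [
  ['right', 'State A', 'State B'],
  ['marriage', '', 'x'],
  ['adoption', 'x', ''],
];

// After transposing there is one row per state and one column per right.
console.log(transpose(byRight));
```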

Advanced Interrogation using Javascript

Now, part way through my visualisation, I realised I needed to show the text items the Guardian had on their site, but these weren’t included in the dataset.

2016-11-13_19-41-27

I decided to check the JavaScript code to see where this text was created and whether I could decipher it. Looking through main.js I found this snippet:

function populateHoverBox (type, position){

    var overviewObj = {
        'state' : stateData[position].state
    }
    .....
    if(stateData[position]['marriage'] != ''){
        overviewObj.marriage = 'key-marriage'
        overviewObj.marriagetext = 'Allows same-sex marriage.'
    } else if(stateData[position]['union'] != '' && stateData[position]['marriageban'] != ''){
        overviewObj.marriage = 'key-marriage-ban'
        overviewObj.marriagetext = 'Allows civil unions; does not allow same-sex marriage.'
    } else if(stateData[position]['union'] != '' ){
        overviewObj.marriage = 'key-union'
        overviewObj.marriagetext = 'Allows civil unions.'
    } else if(stateData[position]['dpartnership'] != '' && stateData[position]['marriageban'] != ''){
        overviewObj.marriage = 'key-marriage-ban'
        overviewObj.marriagetext = 'Allows domestic partnerships; does not allow same-sex marriage.'
    } else if(stateData[position]['dpartnership'] != ''){
        overviewObj.marriage = 'key-union'
        overviewObj.marriagetext = 'Allows domestic partnerships.'
    } else if (stateData[position]['marriageban'] != ''){
        overviewObj.marriage = 'key-ban'
        overviewObj.marriagetext = 'Same-sex marriage is illegal or banned.'
    } else {
        overviewObj.marriagetext = 'No action taken.'
        overviewObj.marriage = 'key-none'
    }

…and it continued for another 100-odd lines of code. This wasn’t going to be as easy as I hoped. Any other options? Well, what if I could extract the contents of overviewObj? Could I write this out to a file?

I tried a “Watch” using the developer tools but the variable went out of scope each time I hovered, so that wouldn’t be useful. Instead I would try saving flat.html locally and outputting a file with the contents to my local drive…

As I say, I’m no coder (but perhaps more comfortable than some), so I googled (and googled) and eventually stumbled on this post:

http://stackoverflow.com/questions/16376161/javascript-set-file-in-download

I therefore added the function to my local main.js and added a line in the populateHoverBox function….okay so maybe I can code a tiny bit….

var str = JSON.stringify(overviewObj);
 
download(str, stateData[position].state + '.txt', 'text/plain');

In theory this should serialise overviewObj to a string (according to Google!) and then download the resulting data to a file called <State>.txt.
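For readers who want to see roughly what such a download helper looks like, here is a minimal sketch. It is not the exact function from the Stack Overflow answer: it is my own version, which assumes the same argument order as the call above (text, file name, MIME type) and works by clicking a temporary link whose href is a data: URI. The extra doc parameter is purely an assumption of mine so the function can be tried outside a browser.

```javascript
// Save text as a file by clicking a temporary link that points at a data: URI.
// `doc` defaults to the page's document; it is only a parameter so the
// function can be exercised outside a browser with a stand-in object.
function download(text, fileName, mimeType, doc = document) {
  const link = doc.createElement('a');
  link.href = 'data:' + mimeType + ';charset=utf-8,' + encodeURIComponent(text);
  link.download = fileName; // suggested name for the saved file
  doc.body.appendChild(link);
  link.click();
  doc.body.removeChild(link);
  return link.href;
}
```

Because the call sits inside populateHoverBox, every hover triggers a download, so hovering over the same state twice saves a second copy of its file.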

Now for the test…..

downloadingfiles

BOOM, BOOM and BOOM again!

Each file is a JSON file

2016-11-13_20-07-21

Now to copy the files out from the downloads folder, remove any duplicates, and combine using Alteryx.

2016-11-13_20-04-59

As you can see, using a wildcard input for the resulting JSON files followed by a transpose was simple.
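I did this step in Alteryx, but the same idea can be sketched in JavaScript for anyone without it. The snippet below assumes each file's text has already been read into an array of strings, and that each record carries a state property as overviewObj does in the snippet above. The function name and the sample records are made up for illustration.

```javascript
// Parse each file's text and keep one record per state.
function combineStateFiles(fileContents) {
  const byState = new Map();
  for (const text of fileContents) {
    const record = JSON.parse(text);
    // A repeated state overwrites the earlier copy, which removes duplicates.
    byState.set(record.state, record);
  }
  return [...byState.values()];
}

// Made-up file contents shaped like the overviewObj records.
const files = [
  '{"state":"State A","marriage":"key-marriage","marriagetext":"Allows same-sex marriage."}',
  '{"state":"State B","marriage":"key-none","marriagetext":"No action taken."}',
  '{"state":"State A","marriage":"key-marriage","marriagetext":"Allows same-sex marriage."}',
];

console.log(combineStateFiles(files).length); // 2
```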

2016-11-13_20-08-31

Finally to combine with the google sheet (called “Extract” below) and the hexmap data (Sheet 1) in Tableau…..

2016-11-13_20-09-41

Not the most straightforward data extract I’ve done, but I thought it was worth blogging about so others could see that extracting data from visualisations online is possible.

You can see the resulting visualisation in my previous post.

Conclusion

No one taught me this method, and I have never been taught how to code. The techniques described here are simply the result of continuous curiosity and exploration of how interactive tables and visualisations are built.

I have used similar techniques in other places to extract data from visualisations, but no two methods are the same, nor can a generic tutorial be written. Simply have curiosity and patience and explore everything.

 

Titanic Analysis – Little Effort

Following on from my last post I wanted to pick up on some of the analytical capabilities in Alteryx through their existing set of tools which use open-source R. These tools have been progressively added over the last year and, I’ll be honest, while I’ve used them I haven’t properly taken the time to explore them in anger. So the perfect subject for a blog post because while I use them I can write about my experiences at the same time.

I want to write this in a way that people downloading the Project Edition of Alteryx, i.e. the free version, can play along once they’ve done the tutorials that are now bundled with the software. To explore the tools, and be able to write about the full results, I want to use an open data set, so I decided to go to Kaggle.com, which describes itself as:

… the world’s largest community of data scientists. They compete with each other to solve complex data science problems, and the top competitors get invited to consult on interesting projects from some of the world’s biggest companies through Kaggle Connect.

Okay, so it’s an interesting idea and the perfect setting to pick up some data and indulge my competitive side. So, having registered for Kaggle.com I went straight to their Titanic competition which described itself as “an ideal starting place for people who may not have a lot of experience in data science and machine learning”. In simple terms the premise of the “competition” is to use a set of training data to build a model predicting the survivors of the Titanic sinking based on age, sex, class, fare, etc. You then run the model against a second set of data and see how your model performs vs other people’s.

So my first step was to download the csv file of “training” data – the data which tells you who survived for building the model – and start analysing it using Alteryx. This proved a relatively simple exercise because I could use just a few tools: an Input, Auto Field (to change the data from the default csv “string” type into a numeric format where applicable) and a Field Summary to report on the data (connected in that order with a few browse tools). Like so:

Alteryx Data Flow Field Summary

So as you can see this gives me a useful report to start analysing the data. Next step in my approach was to look at the variables from the report and in the browse tool and see which might make good ones to try and model.
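To give a feel for what these two tools are doing, here is a rough sketch of the idea in JavaScript: work out whether a column of csv strings can be treated as numeric, then report a few summary figures for it. This is only an illustration of the concept, not how Alteryx implements Auto Field or Field Summary, and the function names and sample values are made up.

```javascript
// Treat a column as numeric only if every non-empty value parses as a number.
function inferColumnType(values) {
  const nonEmpty = values.filter((v) => v.trim() !== '');
  if (nonEmpty.length === 0) return 'string';
  return nonEmpty.every((v) => Number.isFinite(Number(v))) ? 'number' : 'string';
}

// Report the inferred type, the number of missing values and the distinct count.
function summariseColumn(name, values) {
  const nonEmpty = values.filter((v) => v.trim() !== '');
  return {
    name,
    type: inferColumnType(values),
    missing: values.length - nonEmpty.length,
    distinct: new Set(nonEmpty).size,
  };
}

// Made-up sample columns, as they would arrive from a csv (everything is a string).
console.log(summariseColumn('Age', ['22', '38', '', '35']));
console.log(summariseColumn('Sex', ['male', 'female', 'female', 'male']));
```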

For my first run through I’ve ignored some of the more advanced analysis tools in Alteryx and gone with gut instinct (mainly for time reasons as it’s getting late). I decided that Name and Ticket Number were probably not good variables to rely on for modelling purposes and so made a decision to build a model based primarily on age, sex and class.

I also decided to use a Boosted Model, mainly because I know from experience that we don’t want to overfit the model. (Overfitting is when we predict based on particular nuances of this single dataset: that might give a good model for the training data, but when applied to the held-back data the model would not perform as well if those nuances didn’t extend across the full dataset.) The Boosted Model is new in v8.6 and I know it is a machine learning algorithm which deliberately tries to avoid overfitting.

So I quickly built up the model, which was surprisingly easy and took only a few tools. The key things to realise are firstly that the Boosted Model needs a string predictor variable (easy to create using a quick formula) and secondly that the model itself can be passed from tool to tool. So to score out a model you simply connect the top output (containing the R model) from the Boosted Model tool into the Score tool, then connect your data to score into the second connection on the Score tool. This was how I then scored out the final model against the Test data file:

Alteryx Data Flow Model

The final thing I needed to do was output my results with only my predicted survival variable (zero or one) and the passenger_id from the Test data set. Alteryx makes things like this so easy with just one tool, a Select tool (probably one of the most used tools).

Kaggle lets you upload your results and get a score based on the percentage you predicted correctly from the Test data set (50% is also held back to give a final private score that can be used to judge the final entries). My best model tonight scored 0.71292 which wasn’t too bad for a first attempt.
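The score is simply the share of passengers predicted correctly. As a small sketch of that calculation (the function name and the sample predictions are made up, and Kaggle does the real scoring on its side):

```javascript
// Fraction of passengers whose predicted outcome matches the actual outcome.
function accuracy(predicted, actual) {
  if (predicted.length !== actual.length || predicted.length === 0) {
    throw new Error('predicted and actual must be non-empty and the same length');
  }
  let correct = 0;
  for (let i = 0; i < predicted.length; i++) {
    if (predicted[i] === actual[i]) correct++;
  }
  return correct / predicted.length;
}

// Made-up example: three of four predictions are right.
console.log(accuracy([1, 0, 0, 1], [1, 0, 1, 1])); // 0.75
```

On that scale, a score of 0.71292 means roughly 71% of the scored passengers were predicted correctly.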

You can download my module here, along with the training and test data: https://www.dropbox.com/s/ilw0w612g8w8ijj/FirstGo.zip

So am I a data scientist now? No, certainly not (not yet anyway!). For starters, I can’t explain why the Boosted Model works better with the fields from the csv file left as strings; the results are much worse when I convert the appropriate fields to numbers. Maybe someone can help me with that one.

What I can say is that I couldn’t have sat here for an hour or so and programmed that model in R from scratch. The Kaggle website includes some tutorials for the Titanic data; while there isn’t one for SAS or R, there is one for Python. Looking at the lines of code required for even simple things, like importing a csv file, reminds me what a pleasure it is to use Alteryx, something it’s easy to take for granted once you’ve been using the tool a while.

If anyone wants to work/play with me to perfect my model in Alteryx (via a Kaggle “team”), or wants to compete against me, drop a comment on this blog or connect with me on Twitter: @ChrisLuv and I’ll be happy to.