Titanic Analysis – Little Effort


Following on from my last post, I wanted to pick up on some of the analytical capabilities in Alteryx through its existing set of tools built on open-source R. These tools have been added progressively over the last year and, I’ll be honest, while I’ve used them I haven’t properly taken the time to explore them in anger. That makes them the perfect subject for a blog post: I can use them and write about my experiences at the same time.

I want to write this in a way that lets people downloading the Project Edition of Alteryx (i.e. the free version) play along once they’ve done the tutorials that are now bundled with the software. To explore the tools, and be able to write about the full results, I wanted an open data set, so I decided to go to Kaggle.com, which describes itself as

… the world’s largest community of data scientists. They compete with each other to solve complex data science problems, and the top competitors get invited to consult on interesting projects from some of the world’s biggest companies through Kaggle Connect.

Okay, so it’s an interesting idea and the perfect setting to pick up some data and indulge my competitive side. So, having registered for Kaggle.com I went straight to their Titanic competition which described itself as “an ideal starting place for people who may not have a lot of experience in data science and machine learning”. In simple terms the premise of the “competition” is to use a set of training data to build a model predicting the survivors of the Titanic sinking based on age, sex, class, fare, etc. You then run the model against a second set of data and see how your model performs vs other people’s.

So my first step was to download the csv file of “training” data – the data that tells you who survived, used for building the model – and start analysing it in Alteryx. This proved a relatively simple exercise because I could use just a few tools: an Input, an Auto Field (to change the data from the default csv “string” type into a numeric format where applicable) and a Field Summary to report on the data, connected in that order with a few Browse tools. Like so:

[Image: Alteryx workflow – Input, Auto Field and Field Summary]

As you can see, this gives me a useful report to start analysing the data. The next step in my approach was to look at the variables in the report and in the Browse tool and see which might make good candidates to model with.
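For anyone playing along outside Alteryx, here is a rough stand-in for that Input, Auto Field and Field Summary flow, sketched in Python with pandas. The file name train.csv is the Kaggle download; everything else is just my assumption about how you might mimic the tools:

```python
# A rough stand-in for Input -> Auto Field -> Field Summary, using pandas.
# Assumes the Kaggle training file has been saved locally as "train.csv".
import pandas as pd

train = pd.read_csv("train.csv")        # Input tool: read the raw csv
train = train.convert_dtypes()          # Auto Field: infer a sensible type for each column
print(train.dtypes)                     # which type each field ended up as
print(train.describe(include="all"))    # Field Summary: per-field stats, counts and missing values
```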

For my first run through I’ve ignored some of the more advanced analysis tools in Alteryx and gone with gut instinct (mainly for time reasons, as it’s getting late). I decided that Name and Ticket Number were probably not good variables to rely on for modelling purposes, so I chose to build a model based primarily on age, sex and class.

I also decided to use a Boosted Model, mainly because, from experience, I know we don’t want to overfit the model. (Over-fitting is when the model picks up on particular nuances of this single dataset: it might score well on the training data, but when applied to the held-back data it would not perform as well if those nuances don’t extend across the full dataset.) The Boosted Model is new in v8.6, and I know it is a machine learning algorithm which deliberately tries to avoid overfitting.
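The Alteryx Boosted Model tool wraps an R boosting implementation, so I can’t show that code directly, but as a hedged illustration of the same idea here is a sketch using scikit-learn’s GradientBoostingClassifier as a stand-in, where a small learning rate and shallow trees are the usual levers for keeping overfitting in check (the feature list is just my guess at a sensible starting set):

```python
# Not the Alteryx tool itself: a stand-in boosted classifier to illustrate the idea.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd

train = pd.read_csv("train.csv")
X = pd.get_dummies(train[["Pclass", "Sex", "Age"]]).fillna(-1)  # simple encoding of the chosen fields
y = train["Survived"]

# Many shallow trees with a small learning rate is the usual guard against overfitting.
model = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05, max_depth=3)
print(cross_val_score(model, X, y, cv=5).mean())  # quick hold-out style check on the training data
```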

So I quickly built up the model, which was surprisingly easy and took only a few tools. The key things to realise are, firstly, that the Boosted Model needs the variable you are predicting to be a string (easy to create using a quick formula) and, secondly, that the model itself can be passed from tool to tool. To score a model you simply connect the top output (containing the R model) from the Boosted Model tool into the Score tool, then connect the data you want to score into the second connection on the Score tool. This was how I then scored the final model against the Test data file:

[Image: Alteryx workflow – Boosted Model and Score tools]
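Outside Alteryx, the same build-then-score step might look roughly like this, again using scikit-learn as a stand-in for the Boosted Model and Score tools (the column names are the ones in the Kaggle files; the feature list is my assumption):

```python
# Fit on the training file, then score the held-back test file,
# mirroring the Boosted Model output feeding into the Score tool.
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

features = ["Pclass", "Sex", "Age"]
X_train = pd.get_dummies(train[features]).fillna(-1)
X_test = pd.get_dummies(test[features]).fillna(-1).reindex(columns=X_train.columns, fill_value=0)

# The Alteryx tool wants the target as a string, hence the astype(str) here.
model = GradientBoostingClassifier().fit(X_train, train["Survived"].astype(str))
test["Survived"] = model.predict(X_test).astype(int)  # predictions back to 0/1
```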

The final thing I needed to do was output my results with only my predicted survival variable (zero or one) and the passenger id from the Test data set. Alteryx makes things like this so easy with just one tool, a Select tool (probably one of the most used tools).
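The equivalent of that Select tool step, carrying on from the sketch above, is just picking two columns and writing them out (PassengerId and Survived are the column names Kaggle expects in the submission file):

```python
# Select-tool equivalent: keep only the id and the predicted survival flag,
# then write the two-column submission file.
submission = test[["PassengerId", "Survived"]]
submission.to_csv("submission.csv", index=False)
```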

Kaggle lets you upload your results and gives you a score based on the percentage you predicted correctly from the Test data set (50% of the test set is also held back to give a final private score that is used to judge the final entries). My best model tonight scored 0.71292, which wasn’t too bad for a first attempt.

You can download my module here, along with the training and test data: https://www.dropbox.com/s/ilw0w612g8w8ijj/FirstGo.zip

So am I a data scientist now? No, certainly not (not yet anyway!). For starters, I can’t explain why the Boosted Model works better with the fields from the csv file left as strings; the results are much worse when I convert the appropriate fields to numbers. Maybe someone can help me with that one.
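I don’t have an answer, but one way to poke at the question outside Alteryx would be to compare the same stand-in boosted model with a field like passenger class treated as a category (one column per class) versus a plain number, for example:

```python
# Sketch of a strings-vs-numbers comparison: Pclass as a number vs. as a category.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd

train = pd.read_csv("train.csv")
y = train["Survived"]
fields = ["Pclass", "Sex", "Age"]

as_number = pd.get_dummies(train[fields]).fillna(-1)                          # Pclass left numeric
as_string = pd.get_dummies(train[fields].astype({"Pclass": str})).fillna(-1)  # Pclass one column per class

for name, X in [("numeric Pclass", as_number), ("string Pclass", as_string)]:
    score = cross_val_score(GradientBoostingClassifier(), X, y, cv=5).mean()
    print(name, round(score, 4))
```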

What I can say is that I wouldn’t have been able to sit here and program that model from scratch in R in an hour or so. The Kaggle website includes some tutorials for the Titanic data; while they don’t include one for SAS or R, they do have one for Python, and looking at the lines of code required for even simple things, like importing a csv file, reminds me what a pleasure it is to use Alteryx – something that’s easy to take for granted once you’ve been using the tool for a while.

If anyone wants to work/play with me to perfect my model in Alteryx (via a Kaggle “team”), or wants to compete against me, drop a comment on this blog or connect with me on Twitter (@ChrisLuv) and I’ll be happy to get involved.
