October 23, 2018
Last year, 39.2 million people from abroad visited London, making it one of the most popular destinations in the world. The UK Office for National Statistics releases quarterly international visitor data, which includes countries of origin, duration, purpose of stay and even spend. We used Mind Foundry to visualise this data and build a machine learning model which can predict international visitor spend.
Curious to find out which features are most relevant?
Loading the data
The data can be freely downloaded from the London Datastore, a free and open data-sharing portal that hosts over 700 London-related datasets. The data set contains over 56 thousand rows and 11 columns. Mind Foundry automatically scans the data set, detecting column types and levels. When we preview the data in Mind Foundry, we are advised to drop the “area” column, which is constant.
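For readers who want to see what a constant-column check amounts to, here is a minimal sketch in plain Python. It is not how Mind Foundry implements the check; the sample rows and column names below are made up for illustration and are not the real Datastore schema.

```python
import csv
import io

# Hypothetical miniature extract; column names and values are illustrative only.
SAMPLE_CSV = """year,market,area,nights,spend
2016,France,LONDON,3,12.5
2016,USA,LONDON,7,48.0
2017,Germany,LONDON,2,9.1
"""

def constant_columns(rows):
    """Return the names of columns that hold a single distinct value."""
    distinct = {}
    for row in rows:
        for name, value in row.items():
            distinct.setdefault(name, set()).add(value)
    return [name for name, values in distinct.items() if len(values) == 1]

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
to_drop = constant_columns(rows)

# Drop the constant columns, since they carry no information for a model.
cleaned = [{k: v for k, v in row.items() if k not in to_drop} for row in rows]
print(to_drop)
```

A column with only one distinct value cannot help separate high spenders from low spenders, which is why dropping it is safe.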
After applying the advice we can then visualise automatically generated histograms which can help us understand whether there is any “structure” in the data set. We can easily get insights from these histograms such as the largest community of visitors (French) or the most popular stay duration.
We can also see that the data is unbalanced, with the vast majority of visitors spending less than 25 (scaled £). By using the percentage scale feature we can negate the effect of this and analyse distributions across other variables for individual spending amounts.
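The idea behind a percentage scale can be sketched as follows: instead of plotting raw counts, each spend band is normalised so that its categories sum to 100%. This is our own illustration, not Mind Foundry's code; the function name, the `spend_band` and `purpose` fields and the records are all hypothetical.

```python
from collections import Counter, defaultdict

def percentage_scale(records, bin_key, category_key):
    """For each bin, return each category's share of that bin in percent."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[bin_key]][record[category_key]] += 1
    return {
        bin_label: {cat: 100.0 * n / sum(cats.values()) for cat, n in cats.items()}
        for bin_label, cats in counts.items()
    }

# Made-up records: the low spend band dominates the raw counts.
records = (
    [{"spend_band": "0-25", "purpose": "Holiday"}] * 6
    + [{"spend_band": "0-25", "purpose": "Business"}] * 2
    + [{"spend_band": "25-50", "purpose": "Holiday"}] * 1
    + [{"spend_band": "25-50", "purpose": "Business"}] * 1
)
shares = percentage_scale(records, "spend_band", "purpose")
print(shares)
```

On raw counts the "25-50" band would be barely visible next to the "0-25" band. After normalising, the mix of purposes within each band can be compared directly.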
We want to predict a visitor's spend, which is a regression task. To do so, we need to specify the target column as well as the model training framework. Mind Foundry will automatically withhold a 10% balanced sample of the data set for model validation purposes. During training, it will generate scores from a 10-fold cross-validation, where each fold is uniformly balanced by class.
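To make the validation scheme concrete, here is a rough sketch of a 10% hold-out followed by 10-fold cross-validation on the remaining rows. It is a simplification: it shuffles row indices at random and does not reproduce the balancing that Mind Foundry applies. The function names and the row count are our own assumptions.

```python
import random

def holdout_split(n_rows, holdout_fraction=0.1, seed=0):
    """Shuffle row indices and set aside a fraction as an unseen hold-out set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_holdout = round(n_rows * holdout_fraction)
    return indices[n_holdout:], indices[:n_holdout]

def k_fold(indices, k=10):
    """Yield (train, validation) index lists, one pair per fold."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# Illustrative row count only; the real data set has over 56 thousand rows.
train_idx, holdout_idx = holdout_split(1000)
splits = list(k_fold(train_idx, k=10))
print(len(train_idx), len(holdout_idx), len(splits))
```

Each candidate model is trained ten times, each time validated on a different tenth of the training rows. The hold-out rows are never touched until the final check.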
When we are happy with the training setup, we just need to launch the task. Mind Foundry will then use its internal Bayesian optimizer, Mind Foundry Optimize, to efficiently navigate the space of potential pipelines and configurations (data preparation steps, models and parameter values). The user can access the full history of tested pipelines and view the performance metrics of each. The best pipeline is provided with full transparency.
Mind Foundry also provides explanations of each technical term. Users can also follow specific Mind Foundry Machine Learning courses if they wish to learn more about data science.
Mind Foundry provides model interpretability by highlighting the relative influence of each feature on its predictive power.
In our case, the most important “spend” features are the number of visits, nights (the longer you stay, the more you’ll spend) and the year. However, it's interesting to note that certain nationalities (e.g. US, Saudi Arabia) have a strong impact on the spend.
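We do not know exactly how Mind Foundry computes relative feature influence. One common, model-agnostic way to get a similar ranking is permutation importance: shuffle one feature at a time and measure how much the error grows. The sketch below uses a toy stand-in model and made-up rows, so every name in it is hypothetical.

```python
import random

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5

def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Return the increase in RMSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = rmse(targets, [predict(r) for r in rows])
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild each row with only feature j replaced by its shuffled value.
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(rmse(targets, [predict(s) for s in shuffled]) - baseline)
    return importances

# Toy stand-in model: spend depends on nights (feature 0) and ignores feature 1.
def toy_model(row):
    return 5.0 * row[0]

rows = [(n, n % 3) for n in range(1, 21)]
targets = [toy_model(r) for r in rows]
scores = permutation_importance(toy_model, rows, targets, n_features=2)
print(scores)
```

A feature the model relies on produces a large error increase when shuffled, while a feature the model ignores produces none.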
When you are happy with the “best” found model, Mind Foundry then tests it on the 10% of unseen data and provides model health advice. In our case, our model health is good because we are able to predict the visitor spend with an RMSE of 4.199, which is consistent with the cross-validated tests during the optimization.
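For reference, RMSE (root mean squared error) is the square root of the average squared difference between predicted and actual values, expressed in the same units as the target. Here is a minimal sketch; the spend values are made up and have nothing to do with the result quoted above.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted) ** 2))."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up spend values, in the same scaled units as the target column.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
error = rmse(actual, predicted)
print(error)
```

If the RMSE on the held-out 10% is close to the cross-validated RMSE seen during training, the model is probably not over-fitted to the training rows.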
The model can then be used to make predictions within Mind Foundry or can be put into production as a web service.
Our visitor-spend prediction model could potentially allow London-based hotels, restaurants, shops and businesses to optimize their offerings for their international customers. Understanding the factors that drive spending is crucial for making sure businesses target the right segments in order to capture that spend and increase their revenue.
Overall, it took me less than 10 minutes to load and explore the data and build a model that is accurate on unseen test data (without writing a single line of code). Moreover, I didn’t have to worry about over-fitting, as all of this was taken care of automatically by Mind Foundry!