Hyper-parameter tuning can be observed in many everyday processes, but few people are aware of its existence. This blog aims to change that, and demonstrate its true value in data science.
What is a hyper-parameter?
Machine learning, in its simplest form, is about selecting an algorithm for a particular problem and training it on example data, which results in a model that represents the solution. There is an art to selecting the right type of model for your particular problem (which we won’t go into here).
Let’s assume that you’ve chosen your model. Now there are two types of parameters we can talk about:
One is about how to optimise your model for learning with the training data - these are the ‘external’ hyperparameters of the model. You specify these hyperparameters.
The second type covers the variables that configure the model - these are the ‘internal’ parameters of the model. You generally don’t specify these parameters; they are learnt during the training process.
Take a music ensemble for example. You might specify how many instruments you want in the group and you might specify what piece they are going to play. These would be the hyperparameters. But you won’t control how fast they might play a piece, or how loudly or how quietly. These are the variables that would come out of the rehearsals that your newly formed ensemble has, and would be the parameters.
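For readers who prefer code to concerts, here is a minimal sketch of the same distinction. It fits a straight line by gradient descent; the data, the function name `train` and the chosen values are all made up for illustration. The learning rate and the number of epochs are hyperparameters (you set them), while `w` and `b` are parameters (the training loop learns them).

```python
def train(xs, ys, learning_rate, n_epochs):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # parameters: learnt from the data, never set by hand
    n = len(xs)
    for _ in range(n_epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy training data generated from y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

# Hyperparameters: chosen by you before training starts.
w, b = train(xs, ys, learning_rate=0.05, n_epochs=2000)
print(f"learnt parameters: w={w:.3f}, b={b:.3f}")
```

Change the learning rate or the number of epochs and the rehearsal goes differently, but you never write down `w` or `b` yourself.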
Hyper-parameter tuning explained
The good news is that you don’t have to do hyperparameter tuning for every machine learning model. You can often set your hyperparameters based on best practices, or based on previous uses of the model. For example, if you want to play a Mozart String Quartet, you will likely choose two violins, a viola and a cello.
However, if you want to “fine-tune” your model, then you can delve into the hyperparameter settings that are going to optimise the training process and result in the most accurate predictive models.
Hyper-parameter tuning might not be critical for the majority of use cases, but it can significantly affect the efficiency of your processes and the quality of the results you are trying to obtain.
Applying hyper-parameter tuning in data science
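Most machine learning models have hyper-parameters. For example, even a simple random forest algorithm has quite a few, such as the number of trees in the forest, the maximum depth of each tree and the number of features considered at each split.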
Once a user has specified the values of its hyper-parameters, the algorithm will be trained on the underlying data set. The choice of the hyper-parameters will affect the duration of the training and the accuracy of the predictions.
Unfortunately, choosing hyper-parameters isn’t straightforward. The number of possible configurations multiplies with every hyper-parameter you add, and the time needed to evaluate each configuration depends on the complexity of the model and the size of the underlying training data.
"Hyper-parameter tuning relies more on experimental results than theory, and thus the best method to determine the optimal settings is to try many different combinations [and] evaluate the performance of each model."
With so much uncertainty, it's understandable that most data scientists forgo hyper-parameter tuning and stick with the default values provided by the model.
This works in most cases, but to pick up on my introductory analogy, you might struggle to play a symphony with eight musicians. Knowing how best to calibrate your model to get the optimum performance can be really important. If you’re trying to sell out a concert hall, then make sure you’ve got enough violins!
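To put a rough number on how quickly the configurations multiply, here is a small sketch. The hyper-parameter names follow scikit-learn's `RandomForestClassifier`; the candidate values and the 30-second training time are assumptions picked purely for illustration.

```python
from itertools import product

# Candidate values per hyper-parameter (illustrative, not recommendations).
param_grid = {
    "n_estimators": [50, 100, 200, 500],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
}

# Every combination of one value per hyper-parameter is one configuration.
configs = list(product(*param_grid.values()))
n_configs = len(configs)

seconds_per_fit = 30  # assumed training time for a single configuration
total_hours = n_configs * seconds_per_fit / 3600

print(f"{n_configs} configurations, about {total_hours:.1f} hours to try them all")
```

Five hyper-parameters with only a handful of values each already give 288 configurations; adding a sixth with four values would take that to 1,152.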
Grid-search and random-search
After sticking with the defaults, the second most popular approach is to find the optimal configuration through grid-search: evaluating every single configuration, which will eventually lead to an answer.
This would be like playing the symphony a thousand times, changing the allocation of each type of instrument every time. Simply impractical!
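In code, grid-search is little more than a loop over every combination. The sketch below uses a made-up `validation_score` function as a stand-in for training a model and scoring it on held-out data; in real use, each call would be a full (and slow) training run.

```python
from itertools import product

def validation_score(n_trees, max_depth):
    """Hypothetical stand-in for an expensive train-and-validate run."""
    return 1.0 - ((n_trees - 200) / 500) ** 2 - ((max_depth - 10) / 20) ** 2

# The grid: every value listed here is combined with every other value.
grid = {
    "n_trees": [50, 100, 200, 500],
    "max_depth": [5, 10, 20],
}

results = []
for n_trees, max_depth in product(grid["n_trees"], grid["max_depth"]):
    score = validation_score(n_trees, max_depth)
    results.append(((n_trees, max_depth), score))

best_config, best_score = max(results, key=lambda item: item[1])
print(f"evaluated {len(results)} configurations; best: {best_config}")
```

Note that the search can only ever return a point that was on the grid to begin with, and that the cost is fixed by the size of the grid rather than by how promising the results look along the way.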
The third most popular approach is to find the optimal configuration through random-search, but this is only marginally better than grid-search. I’ll spare you an analogy for this one, as hopefully by now you see my point.
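For comparison, a minimal random-search sketch looks like this. Again, `validation_score` is a made-up stand-in for a real training run, and the ranges and the budget of 12 trials are assumptions for illustration.

```python
import random

def validation_score(n_trees, max_depth):
    """Hypothetical stand-in for an expensive train-and-validate run."""
    return 1.0 - ((n_trees - 200) / 500) ** 2 - ((max_depth - 10) / 20) ** 2

rng = random.Random(0)  # fixed seed so the run is repeatable
n_trials = 12           # the budget: how many configurations we can afford

trials = []
for _ in range(n_trials):
    # Draw each hyper-parameter independently from its allowed range.
    config = (rng.randint(10, 500), rng.randint(2, 30))
    trials.append((config, validation_score(*config)))

best_config, best_score = max(trials, key=lambda item: item[1])
print(f"best of {n_trials} random trials: {best_config} (score {best_score:.3f})")
```

The budget is now something you choose up front, but each draw still ignores everything learnt from the previous ones.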
So how can hyper-parameter tuning be done more efficiently? Our Twitter followers seem to know the answer.
Bayesian optimisation addresses the pitfalls of grid and random search by incorporating a 'belief' of what the solution space looks like, and 'learning' from the configurations it evaluates. This belief can be specified by a domain expert but can also be flat at the beginning.
To learn more, you can download this report (with fewer analogies) explaining the intuitions behind Bayesian Optimisation.
To be honest, when you set out to perform a symphony, the composer will likely give you his or her best idea for the orchestration. Or if not, you can search previous recordings. Or ask a conductor friend. There are lots of ways of setting your initial ‘belief’, often called a ‘prior’, and working from there.
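To make the idea of a 'belief' that 'learns' a little more concrete, here is a heavily simplified sketch of Bayesian optimisation over a single hyper-parameter. Everything in it is an assumption chosen for illustration: the belief is a zero-mean Gaussian process (a flat prior) with a squared-exponential kernel, the next configuration is picked by an upper-confidence-bound rule, and `validation_score` is a made-up stand-in for a real training run. Production libraries are considerably more sophisticated.

```python
import math

def rbf(a, b, length_scale=1.5):
    """Squared-exponential kernel: nearby settings are assumed to score similarly."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def posterior(xs, ys, x_new, jitter=1e-6):
    """Posterior mean and standard deviation of the belief at x_new."""
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * w for k, w in zip(k_star, solve(K, ys)))
    var = 1.0 - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, math.sqrt(max(var, 0.0))

def validation_score(x):
    """Hypothetical stand-in for an expensive training run; peaks at x = 7."""
    return math.exp(-((x - 7.0) ** 2) / 8.0)

candidates = [i / 10 for i in range(101)]  # settings 0.0, 0.1, ..., 10.0
xs = [0.0, 10.0]                           # two initial evaluations
ys = [validation_score(x) for x in xs]

def ucb(c):
    """Optimistic estimate: expected score plus a bonus for uncertainty."""
    mean, std = posterior(xs, ys, c)
    return mean + 2.0 * std

for _ in range(8):
    untried = [c for c in candidates if c not in xs]
    x_next = max(untried, key=ucb)         # most promising untried setting
    xs.append(x_next)
    ys.append(validation_score(x_next))    # evaluate it and update the belief

best_score, best_x = max(zip(ys, xs))
print(f"best setting after {len(xs)} evaluations: x={best_x} (score {best_score:.3f})")
```

Unlike grid or random search, each new configuration is chosen using everything observed so far: the search favours settings that either look good already or have not been explored. A domain expert's prior could replace the flat starting belief, for example by seeding `xs` and `ys` with configurations already known to work well.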
Extensions to Bayesian Optimisation
If you work closely with an orchestra, you can try multiple ways of performing a given piece. The same can be achieved in Bayesian Optimisation through intelligent batching.
As you work with different groups of musicians and different pieces of music, you will gain expertise in producing the best performance of a given piece. You should also realise quickly if you’ve not included enough players, or not provided enough sheet music - and be able to correct that quickly!
Well, this is possible in certain implementations of Bayesian Optimisation and we discuss all of that in the report below.