Charles Brecque

October 9, 2018

Even with massive computational resources, it can take hours, days or even weeks to train machine learning models on large data sets. This is expensive and a burden on your productivity. But what if we utilise active learning?

In most cases, you don't actually need all the available data to train your models. In this article, we compare data subsetting strategies and the impact they have on the performance of machine learning models (namely, training time and accuracy). We will then put them into practice by training an SVM classifier on subsets of the MNIST data set.

## Building subsets with active learning

Active learning is a special case of machine learning, in which a learning algorithm is able to interactively query the user to obtain the desired outputs at new data points.

We'll use active learning to build subsets of our data from the original training set.

The subsets are built by an Active Learner, which uses a query strategy to decide which training points are most valuable for maximising the accuracy of our model.
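As a concrete illustration, here is a minimal pool-based active-learning loop (a sketch, not Mind Foundry's implementation), using scikit-learn's small `digits` data set as a stand-in for MNIST: the learner repeatedly fits the model, queries the pool points it is least certain about, and moves them into the training subset.

```python
# Minimal pool-based active-learning loop: fit, query the most
# uncertain pool points, add them to the labelled subset, repeat.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(0)

# Start from a small random seed set; the rest is the unlabelled pool.
labelled = list(rng.choice(len(X), size=50, replace=False))
pool = [i for i in range(len(X)) if i not in labelled]

clf = SVC(probability=True)
for _ in range(5):                       # 5 query rounds of 20 points each
    clf.fit(X[labelled], y[labelled])
    proba = clf.predict_proba(X[pool])
    uncertainty = 1 - proba.max(axis=1)  # uncertainty sampling
    query = np.argsort(uncertainty)[-20:]
    for q in sorted(query, reverse=True):
        labelled.append(pool.pop(q))

print(f"subset size after querying: {len(labelled)}")
```

In a real active-learning setting the queried labels would come from a human annotator; here the labels are already known, so querying simply selects which points enter the training subset.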

We're going to consider four different strategies for building these subsets of data:

- **Random sampling**: the data points are sampled at random
- **Uncertainty sampling**: we select the points whose class we are most uncertain about
- **Entropy sampling**: we choose the points whose class probabilities have the largest entropy
- **Margin sampling**: we choose the points for which the difference between the most and second most likely classes is the smallest

The probabilities in these strategies are associated with the predictions of the SVM classifier.
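All four strategies can be expressed as scoring functions over the classifier's predicted class probabilities (a higher score means the point is queried first). A minimal sketch:

```python
# The four query strategies as scores over predicted class probabilities.
# `proba` is an (n_samples, n_classes) array, e.g. from SVC.predict_proba.
import numpy as np

def uncertainty(proba):
    # 1 minus the probability of the most likely class
    return 1.0 - proba.max(axis=1)

def entropy(proba):
    # Shannon entropy of the predicted class distribution
    return -(proba * np.log(proba + 1e-12)).sum(axis=1)

def margin(proba):
    # Negative gap between the top two classes (small gap -> high score)
    part = np.partition(proba, -2, axis=1)
    return -(part[:, -1] - part[:, -2])

def random_score(proba, rng=np.random.default_rng(0)):
    # Baseline: ignore the probabilities entirely
    return rng.random(len(proba))

p = np.array([[0.5, 0.3, 0.2],
              [0.9, 0.05, 0.05]])
print(uncertainty(p))  # first row scores higher: its prediction is less certain
```

Under all three informative strategies, the first row (a hesitant 0.5/0.3/0.2 prediction) scores higher than the second (a confident 0.9 prediction), so it would be queried first.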

For this study, we are going to build subsets of 5,000 (8% of the data); 10,000 (17% of the data) and 15,000 (25% of the data) points from the original training set of 60,000 points.

## Active learning can train machine learning models with a fraction of the data and time

To measure the performance of our training on the subsets, we will measure the **training accuracy** and **training time**, expressed as ratios relative to training on the full data set:

- accuracy ratio = accuracy when trained on the subset / accuracy when trained on the full training set
- time ratio = time to train on the subset / time to train on the full training set
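These ratios are straightforward to compute; a hedged sketch, again using scikit-learn's small `digits` data set as a stand-in for MNIST:

```python
# Compare accuracy and training-time ratios for subset vs full training.
import time
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def train_and_time(idx):
    clf = SVC()
    t0 = time.perf_counter()
    clf.fit(X_tr[idx], y_tr[idx])
    return clf.score(X_te, y_te), time.perf_counter() - t0

full_acc, full_time = train_and_time(np.arange(len(X_tr)))
sub_idx = np.random.default_rng(0).choice(len(X_tr), len(X_tr) // 4,
                                          replace=False)
sub_acc, sub_time = train_and_time(sub_idx)

print(f"accuracy ratio: {sub_acc / full_acc:.3f}")
print(f"time ratio:     {sub_time / full_time:.3f}")
```

On a data set this small the timing ratio is noisy; the effect the article describes only becomes pronounced at MNIST scale, where SVM training cost grows sharply with the number of points.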

We can calculate the same ratios for the test data sets. The results are summarised in the following graphs. The three data points for each strategy correspond to the size of the subset (5,000, 10,000 and 15,000).

As we can see, with the uncertainty sampling strategy we can achieve over 99% of the performance with a subset of 15,000 points, in only 35% of the time it took us to train the SVM on the full dataset.

This clearly shows that we can achieve comparable results to using the full data set but with only 25% of the data and in 35% of the time.

Random sampling is the fastest of all strategies, but also the worst in terms of accuracy ratios.

Working on subsets of data is therefore a reasonable approach for significantly reducing training time and computation without compromising accuracy.

Subsetting the data works well on most classification data sets, but extending it to time series data, and to other families of machine learning models, requires additional work.

## How much of the data do we need?

Now that we have proven the value and feasibility of training machine learning models on subsets of data, how can we know what the optimal subset size should be?

One approach, called FABOLAS [Klein et al.], can recommend the size of the subset you should use. It does this by learning a relationship between a contextual variable (the size of the data set used) and the reliability of the final score achieved. This means that by training the model on a subset, it can extrapolate the performance of the model on the full data set.
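FABOLAS itself models validation performance jointly over hyperparameters and data set size with a Gaussian process; as a much simpler illustration of the underlying idea (and emphatically not the FABOLAS implementation), we can fit a power-law learning curve to scores measured on small subsets and extrapolate to the full data set. The subset scores below are hypothetical numbers chosen for the example.

```python
# Simplified sketch of learning-curve extrapolation: fit accuracy as a
# power law of subset size, then evaluate the fit at the full data size.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # error decays as a power of the training-set size; a is the asymptote
    return a - b * n ** (-c)

subset_sizes = np.array([500, 1000, 2000, 4000])
scores = np.array([0.85, 0.89, 0.92, 0.94])   # hypothetical accuracies

params, _ = curve_fit(power_law, subset_sizes, scores,
                      p0=[0.99, 1.0, 0.5],
                      bounds=([0, 0, 0], [1.0, 10.0, 2.0]))

predicted = power_law(60000, *params)
print(f"extrapolated accuracy at 60,000 points: {predicted:.3f}")
```

Bounding the asymptote `a` at 1.0 keeps the extrapolated accuracy physically meaningful; FABOLAS uses a far more principled probabilistic model, but the intuition is the same.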

At Mind Foundry, we are striving for optimal and efficient machine learning through active learning and Bayesian Optimization. If you have any questions or would like to try our products, feel free to email me!