If one excludes the data ingestion process (cleaning, massaging, ingesting, formatting, …), which can account for up to 70% of all hands-on time, building and testing models is the most important, and perhaps the most rewarding, task that a data scientist can undertake. In supervised learning (one of the three broad categories of machine learning models), a model is “just” a very complex mapping between an input dataset and an output variable. This is true for both inference and prediction tasks. Provided with enough examples of a (statistically) representative input and the known associated output variables (e.g. labels, variable values), a machine-learnt model can be treated as a black box that can be used in a second stage to infer the output from a similar input.
Overfitting nine data points
For simplicity, I’m presenting a sample of nine data points extracted from a linear relation to which I added some random noise. Our model maps the only output dimension (the vertical axis value) against a single input dimension (the horizontal axis value). Ideally, our model will learn the underlying relation between input and output and be able to infer the output from new (unseen) inputs.
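A minimal sketch of how such a sample could be generated, using only the Python standard library. The slope, intercept, noise level and seed are assumptions chosen for illustration; they are not the values behind the plots in this post.

```python
import random

def make_noisy_line(n_points=9, slope=2.0, intercept=1.0, noise_sd=1.0, seed=0):
    """Return (xs, ys): points on y = slope * x + intercept plus Gaussian noise."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    xs = [float(i) for i in range(n_points)]
    ys = [slope * x + intercept + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

xs, ys = make_noisy_line()
for x, y in zip(xs, ys):
    print(f"x={x:.0f}  y={y:.2f}")
```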
If asked to extract the underlying relation from this plot alone, a human will easily draw a line close to the one (before noise-addition) that was used to create the points (see picture below).
In fact, our brains are quite good at identifying patterns in noisy data (a skill developed over millions of years of evolution that can sometimes cause us problems, e.g. apophenia).
Unfortunately, in the case of machine-learning algorithms we need to specify the kind of relation (i.e. kernel) we are looking for (e.g. linear, quadratic). If we adopt a conservative approach and just set the order of the relation to some large number, we introduce the risk that the model will overfit, since the algorithm will start almost-perfectly matching the data points (which include statistical noise as well as the real signal). In fact, if the algorithm has too much freedom in this sense, it’s typical to obtain a solution like this:
From a naive point of view, the red line passes perfectly through the data points, so it’s the best fit possible. However, by doing so the model is also fitting the noise associated with each value. If we extract new outputs from the same relation we used to create the initial ones, we see that in most cases they are not well fitted by the overfitting red line, especially at the edges of the horizontal range we fit with it. By contrast, the best possible model would be able to generalise its predictions to data points unseen during training.
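The effect can be reproduced with a short, dependency-free sketch. It compares a least-squares straight line with a degree-8 polynomial forced through all nine points (built here with Lagrange interpolation, which is my choice of method, not necessarily how the plot above was produced). The underlying line `y = 2x + 1`, the noise level and the seed are assumptions.

```python
import random

def true_line(x):
    return 2.0 * x + 1.0  # assumed underlying relation

def sample(xs, rng, noise_sd=1.0):
    return [true_line(x) + rng.gauss(0.0, noise_sd) for x in xs]

def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b; returns a predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b

def fit_interpolant(xs, ys):
    """Lagrange polynomial of degree len(xs) - 1 passing through every point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
train_x = [float(i) for i in range(9)]
train_y = sample(train_x, rng)
# New points from the same relation, between and just beyond the training inputs
new_x = [i + 0.5 for i in range(-1, 9)]
new_y = sample(new_x, rng)

for name, model in [("straight line", fit_line(train_x, train_y)),
                    ("degree-8 polynomial", fit_interpolant(train_x, train_y))]:
    print(f"{name}: training MSE={mse(model, train_x, train_y):.3f}  "
          f"new-data MSE={mse(model, new_x, new_y):.3f}")
```

The polynomial has as many free coefficients as there are data points, so its training error is zero by construction. The interesting number is the error on the new points, particularly the two that sit just outside the training range.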
Causes of overfitting
In general, an overfitted model is an overly-complex (i.e. too many parameters, dimensions or degrees-of-freedom) model (compared to the available volume of training data) that works astonishingly well in training but not in testing (i.e. it doesn’t generalise to cases that weren’t included in the training set).
In fact, without a sufficiently large training dataset, only a very basic model (i.e. with a low number of free parameters) can generalise well. If the volume of data is not large enough and the model has a large number of free parameters, this is the perfect recipe for overfitting.
Enlarging the dataset and reducing the number of parameters alone is not enough. If there is a large number of dimensions/columns in the training dataset, a model with a large number of free parameters can learn to recognise the single elements of the dataset and map them to the associated output, one by one. As a consequence, the model won’t learn the underlying structure and will overfit. Adding new parameters is a temptation for each data scientist with active control on his/her model input, but sometimes the risks exceed the benefits.
Symptoms of overfitting
The easiest way to identify that something is wrong during the model-training step is to check how well the model is learning during this process (some models, e.g. neural networks, report their accuracy during training). If this is close to 100%, this is likely to be one of those “too-good-to-be-true” cases. But to get a more solid assessment of how badly your model is overfitting, the best approach is the divide ’n’ conquer approach. In practice:
- Split your dataset into training and testing (independent and ideally randomly sampled) sets, with the former being at least 80% of the total dataset.
- Train on the training dataset
- Extract predictions/classifications from the testing dataset with the newly-trained model
- Hope for the best
- Compare the results with the ground truth or known labels of the testing dataset
Since the model hasn’t used the testing dataset for training, the accuracy on this latter set should be similar to that obtained during training if no overfitting occurs. On the contrary, if the accuracy on the testing dataset is significantly lower (similar to that obtainable from a random classification), overfitting is definitely present and a cure is necessary!
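The procedure above can be sketched in a few lines. To make the symptom obvious, this example uses a deliberately extreme setup that I am assuming for illustration: the labels are coin flips (there is nothing to learn) and the model is a 1-nearest-neighbour classifier, which simply memorises the training set.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def nearest_neighbour_predict(train, x):
    """Return the label of the training point closest to x."""
    _, label = min(train, key=lambda row: abs(row[0] - x))
    return label

def accuracy(train, rows):
    """Fraction of rows whose label matches the 1-nearest-neighbour prediction."""
    hits = sum(nearest_neighbour_predict(train, x) == y for x, y in rows)
    return hits / len(rows)

rng = random.Random(1)
# Hypothetical dataset: one numeric feature, labels drawn at random
data = [(rng.random(), rng.randint(0, 1)) for _ in range(200)]
train, test = train_test_split(data)

print(f"training accuracy: {accuracy(train, train):.2f}")
print(f"testing accuracy:  {accuracy(train, test):.2f}")
```

A perfect training score next to a testing score near chance level is exactly the gap the comparison step is meant to expose.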
Let’s say that our model is presenting good results on the training set and poor results on the testing set. We can safely assume that, if our testing dataset is representative of the whole distribution, overfitting is a serious condition that will affect our model at least once in the real world (i.e. with data for which we are unlikely to have labels/ground truth to compare against). What follows is a list of potential remedies and palliative cures, with different degrees of effectiveness.
If a large number of dimensions/columns is a problem, reducing it sounds like the obvious first choice. However, blindly removing columns from our dataset (or features from our analysis) is probably not the smartest choice, given the risk that we also throw away potentially important information along with those columns. Instead, we might want to first assess which features are the most important or most used by our model to make a prediction and which ones cause just a minor reduction of the final accuracy if removed (i.e. smart reduction instead of dumb removal).
To know how many dimensions or features to remove from the dataset, it’s important to understand that (rule-of-thumb alert!) the number of data points in the training set should be more than 5PD, where P is the number of free parameters in the model and D is the number of columns/values.
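As a worked example of the arithmetic, here is a small sketch of the rule of thumb. The function names are hypothetical helpers, and the factor of 5 is taken directly from the rule as stated above.

```python
def min_training_points(n_params, n_columns, factor=5):
    """Threshold that the training-set size should exceed: factor * P * D."""
    return factor * n_params * n_columns

def max_columns(n_points, n_params, factor=5):
    """Largest D for which n_points > factor * n_params * D still holds."""
    return (n_points - 1) // (factor * n_params)

# A model with P = 10 free parameters and D = 4 columns needs more than 200 points
print(min_training_points(10, 4))
# With 1000 training points and P = 10, at most 19 columns satisfy the rule
print(max_columns(1000, 10))
```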
My (personal) favourite approach in this sense is to instead build a suite of “simple” models, each using a limited subset of the input features/columns, and rank them according to their performance on the testing set. Then, I ensemble the best ones and use their prediction combined (if it’s a multi-class classification task, a voting system is the simplest approach). A sort of “wisdom of the crowd” approach.
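A rough sketch of this idea, under assumptions of my own: each “simple” model is a one-feature threshold classifier (a decision stump), the dataset is synthetic with three informative columns and three pure-noise columns, and the best three models are combined by majority vote.

```python
import random
from collections import Counter

def make_stump(train, j):
    """Fit a one-feature threshold classifier using only feature column j."""
    means = {}
    for label in (0, 1):
        values = [x[j] for x, y in train if y == label]
        means[label] = sum(values) / len(values)
    threshold = (means[0] + means[1]) / 2
    high = 1 if means[1] >= means[0] else 0  # class predicted above the threshold
    return lambda x: high if x[j] > threshold else 1 - high

def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

def majority_vote(models):
    """Combine classifiers by returning the most common prediction."""
    return lambda x: Counter(m(x) for m in models).most_common(1)[0][0]

def make_rows(n, rng, n_features=6, n_informative=3):
    # Hypothetical data: the first columns track the label, the rest are noise
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [(y if j < n_informative else 0) + rng.gauss(0.0, 1.0)
             for j in range(n_features)]
        rows.append((x, y))
    return rows

rng = random.Random(0)
train, test = make_rows(200, rng), make_rows(100, rng)

stumps = [make_stump(train, j) for j in range(6)]
ranked = sorted(stumps, key=lambda m: accuracy(m, test), reverse=True)
ensemble = majority_vote(ranked[:3])  # an odd count avoids tied votes

print("best single model:", accuracy(ranked[0], test))
print("ensemble of top 3:", accuracy(ensemble, test))
```

One caveat worth keeping in mind: ranking the models on the testing set and then quoting the ensemble's accuracy on that same set gives an optimistic number. Holding back a third, untouched set for the final check is a cleaner variant.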
Increase data volume
This is the dumbest suggestion I can give to anyone faced with an overfitting problem, but it’s the most effective one: if you can, get more data! Following the rule of thumb presented above, if I have access to (or control over) the data-munging factory, I’ll ramp up production until I reach the magic threshold (and beyond). Unfortunately, this is a utopian scenario in most real-world data science cases. But here is where data augmentation can come to save us.
Data augmentation is a swift way to artificially accumulate or generate more data from a limited dataset in order to increase the generalisation power of a model (and, thus, reduce overfitting). As a data-generation method, it is mainly used in image classification/segmentation, since visual inputs are much easier to augment without blending the information in them.
For example, if we want to train a model to map images of cats and dogs into the “this is a cat” and “this is a dog” categories (a pretty abused example in the Machine Learning community), but we have just a few hundred images of these two universally adored categories of pets, the complexity of the task would prevent the use of deep convolutional neural networks (where the potential number of free parameters can reach the millions, e.g. GoogLeNet, and each pixel is potentially a different feature to consider). But from a single image we can produce an almost infinite number of new images with the same information by simply applying basic geometric transformations. Following the classic example, a picture of my cat is still a picture of my cat even if I rotate it by a few degrees, but my model does not realise it, since on a pixel-by-pixel level the two images are different.
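A minimal sketch of geometric augmentation, treating an image as a plain list of pixel rows. To stay dependency-free it only uses mirror flips and 90-degree rotations; rotating by a few degrees, as in the cat example, needs pixel interpolation and is normally left to an image library.

```python
def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def rotate90(img):
    """Rotate the image clockwise by 90 degrees."""
    return [list(col) for col in zip(*img[::-1])]

def augment(img):
    """Return the 8 flip/rotation variants of a 2-D image (a list of rows)."""
    variants = []
    current = [list(row) for row in img]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants

cat = [[1, 2],
       [3, 4]]  # hypothetical 2x2 "image"; real images are just larger grids
for variant in augment(cat):
    print(variant)
```

Every variant carries the same label as the original, so one labelled picture becomes eight training examples that differ pixel by pixel.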
Applying augmentation to other domains is still something largely unexplored. If we extend the concept of “adding noise” to produce new data with the same underlying information, we have to move very carefully between the “too little noise” and “too much noise” extremes, with the former not preventing overfitting and the latter completely washing out the underlying information that the model is required to learn.
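To make the trade-off concrete, here is a small sketch of noise-based augmentation for a numeric series. The function name `jitter`, the example series and the three noise levels are assumptions for illustration; the sensible range for the noise depends entirely on the data.

```python
import random
import statistics

def jitter(series, noise_sd, n_copies=1, seed=0):
    """Return n_copies noisy versions of series (Gaussian noise, given sd)."""
    rng = random.Random(seed)
    return [[v + rng.gauss(0.0, noise_sd) for v in series] for _ in range(n_copies)]

series = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]  # hypothetical signal
signal_sd = statistics.pstdev(series)

# Express the noise as a fraction of the signal's own spread
for fraction in (0.01, 0.2, 5.0):
    copy = jitter(series, fraction * signal_sd)[0]
    print(f"noise = {fraction:>4} x signal sd:", [round(v, 2) for v in copy])
```

At the smallest fraction the copies are nearly identical to the original, so they add little; at the largest one the shape of the signal is no longer recognisable.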
Long story short: data augmentation is great for image-based learning, not so much (yet) for other cases (e.g. time-series, text, …).
I already touched on this point, but it’s worth referring back to it for completeness. If the model is not self-learning its own number of free parameters/degrees of freedom (in which case, read the next paragraph), fine-tuning is required by the data scientist in order to achieve (or get close to) the perfect balance between model complexity and a lack of overfitting. This can be done iteratively by adding degrees of freedom and assessing how the accuracy changes, not only in the training phase but also on the testing set.
If the model is self-adapting (i.e. it sets its own degrees of freedom), it will generally tend to increase them in order to minimise the distance between predictions and ground truth on the training set, which can cause (you probably already guessed) massive overfitting. Regularisation is pretty much adding a “penalty” to such catastrophic behaviour that is proportional to the number of degrees of freedom used.
In the meme above there is an example where regularisation would have helped a Doge-model solve a question that a primary school student wouldn’t need more than a few seconds to solve.
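A sketch of such an iterative sweep, under assumptions of my own: the model is a piecewise-constant fit on the interval [0, 1), where the number of bins plays the role of the degrees of freedom, and the data is a noisy straight line. I also print the error on a held-out set alongside the training error, since the training error on its own can only go down as bins are added.

```python
import random

def bin_index(x, n_bins):
    return min(int(x * n_bins), n_bins - 1)

def fit_bins(points, n_bins):
    """Piecewise-constant model on [0, 1): one free parameter (a mean) per bin."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in points:
        sums[bin_index(x, n_bins)] += y
        counts[bin_index(x, n_bins)] += 1
    overall = sum(y for _, y in points) / len(points)
    # Bins without training points fall back to the overall mean
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[bin_index(x, n_bins)]

def mse(model, points):
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

def sweep(train, held_out, candidates=(1, 2, 4, 8, 16, 32)):
    """Return [(n_bins, training error, held-out error)] for each candidate."""
    rows = []
    for k in candidates:
        model = fit_bins(train, k)
        rows.append((k, mse(model, train), mse(model, held_out)))
    return rows

rng = random.Random(0)

def draw(n):
    # Hypothetical data: a straight line plus Gaussian noise, x in [0, 1)
    return [(x, 2.0 * x + rng.gauss(0.0, 0.3)) for x in (rng.random() for _ in range(n))]

train, held_out = draw(30), draw(30)
results = sweep(train, held_out)
for k, train_err, held_err in results:
    print(f"{k:>2} bins  training MSE={train_err:.3f}  held-out MSE={held_err:.3f}")
print("bins with lowest held-out error:", min(results, key=lambda r: r[2])[0])
```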
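A toy sketch of a penalty proportional to the degrees of freedom. The model is an assumed piecewise-constant fit on [0, 1) with one parameter per bin, and `strength` is a hypothetical knob for the size of the penalty. Real libraries more commonly penalise the magnitude of the coefficients (L1 or L2 regularisation) rather than counting parameters, so treat this as an illustration of the idea only.

```python
def bin_index(x, n_bins):
    return min(int(x * n_bins), n_bins - 1)

def fit_bins(points, n_bins):
    """Piecewise-constant model on [0, 1): one free parameter (a mean) per bin."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in points:
        sums[bin_index(x, n_bins)] += y
        counts[bin_index(x, n_bins)] += 1
    overall = sum(y for _, y in points) / len(points)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[bin_index(x, n_bins)]

def mse(model, points):
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

def penalised_score(training_error, n_dof, strength):
    """Training error plus a penalty proportional to the degrees of freedom."""
    return training_error + strength * n_dof

def choose_n_bins(train, candidates, strength):
    """Pick the bin count with the lowest penalised training score."""
    scores = {k: penalised_score(mse(fit_bins(train, k), train), k, strength)
              for k in candidates}
    return min(scores, key=scores.get)

# Hypothetical step-like data: two bins are enough to describe it exactly
train = [(0.1, 0.0), (0.3, 0.0), (0.6, 1.0), (0.8, 1.0)]
for strength in (0.01, 1.0):
    print(f"strength={strength}: chosen bins =", choose_n_bins(train, (1, 2, 4), strength))
```

With a small penalty the model keeps the two bins it actually needs and gains nothing from a third or fourth; with a very large penalty it is pushed back to a single bin, which is the underfitting end of the same trade-off.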
Overfitting is a very bad thing in data science. It is so bad that it can really affect (and destroy) human lives (e.g. the Fukushima disaster might have been prevented with a non-overfitted statistical distribution of the high-magnitude earthquakes in the area, as suggested by Nate Silver in his best-selling book “The Signal and the Noise”, see here). So, for the sake of your data science career, and of the humans that will eventually be affected by your classifications, be aware of the importance of removing overfitting from your models!
Header image courtesy of W_Minshull.
Thanks to Shannon Pace and Reuben Wilson for reviewing this post and providing suggestions.