Blog: AI Series: A walk into the theory of learning.
The goal of machine learning is to build the simplest approximating model able to explain past data and future instances provided by the data generating process.
The statistical learning theory, one of the most beautifully developed branches of artificial intelligence, provides us with the theoretical basis for many of today’s machine learning algorithms. It also attempts a philosophical approach trying to identify the fundamental element which underpins the process of drawing valid conclusions from empirical data.
In machine learning as a matter of fact, the target function which maps the data from an input domain to the corresponding data in the output domain, is unknown. If it was known, we would not need any learning at all and we would just need to implement it. The basic premise of learning is to use a set of observation to uncover an underlying process. In machine learning the objective is to find a function which approximate the target function learning it from finite sets of sample data.
Looking at the problem from a supervised learning perspective, our challenge involves learning from a training set of labelled data. Every point in the training set is an input-output pair, where the input maps to a known output. The learning problem consists of finding the best algorithm able to capture the unknown governing rules that are implicitly described by the distribution of the sample data pairs and build a hypothetical function that, approximating the target function, can be used to predict output from future, unknown input. The performances of the learned model, or generalization performances, are measured in terms of percentage of accuracy of its prediction on independent test data.
The learning theory and all its many subbranches and sub-theories, provide us with all the necessary tools for building models with the highest performances.
In order to select the best model and then assess its generalization performances, as a best practice and in a data-rich context, the sample data set available for training the algorithm is usually split into 3 randomly selected groups:
A training set, used by the algorithm to fit the models. By leveraging the algorithm’s tuning parameters (hyper parameters) we can develop different models of varying complexity and precision.
A validation set, used for estimating the different models’ performances, rank them against each other and select, among them, the model with the lowest estimated prediction error.
A test set for assessing the generalization performances (and error) of the fully specified, selected model.
The prediction error, or generalization error, often guides the model selection process as modelers explore different model types and the wide choices of parameters they bring along. But estimating the prediction error is not trivial as it is the result of 3 sub-classes of errors: bias, variance and noise.
Also known as the ‘irreducible error’, noise is the only type of error that cannot be reduced by our choice of the model as it depends solely from the data we have available for the training.
If the data in the training data set are generated by an inherently stochastic (random) process, a misframed problem, or the feature set is either wrong or incomplete we can spend all our life trying to improve our model prediction performances, but we’ll never get any better than our messy data will allow us to.
That is why data scientists spend around 19% of their time looking for good data and additional 60% on cleaning and organizing the data they have collected: machine-learning models predict exactly what they have been trained to predict: in the best case, their forecasts can only be as good as the data used for their training. Garbage in, garbage out. Or better, noisy data in …noisy, error-prone, predictions out.
But noise is not just…useless noise. It can actually act as an alert, indicating a wrong or incomplete choice of features or a bad training data set.
Let’s suppose that we are trying to learn a function capable of predicting the weight of people based on a training data set that contains their age. Well… yes, age is probably one of the elements (feature) to factor in when predicting weight but there are many others that characterize weight, including height, gender, geo, and so on. By considering only age you will very likely end up with a weak predictor and a big generalization error because for every age value in input you will have many different weights values to select or average from.
Therefore noise, for a given set of features, might not be really noise in the true distribution. It is possible that we simply didn’t select enough characterizing features from the data set, or have the wrong ones, to be able to model the true distributions.
Noise is also very important when setting the right expectation on the maximum achievable performance (or the smallest achievable generalization error) from a given data set.
For example, let’s say that you’re doing financial forecasting, trying to detect whether the market’s going up or down. The level of noise in this type of data is so high that you will be very happy to get it right 53% of the time, consistently. So, while 47% of error might seems a very high error rate, it is probably the best performance you can get in this context. But consider this: as long as the generalization performance it’s higher than 50% and it’s consistent, you are in business, pretty much like the many hedge funds which make lots of money with these predictions!
As we have seen, while noise can provide us with some important quality-related indicators, its presence can really hurt our model generalization performances. So, while training an algorithm we need to make sure it collects as little as possible data from the distribution inherent noise, which does not carry any useful reusable information or patterns that characterize the distribution of which we are trying to understand the governing function.
How? well…making sure that the algorithm does not learn…too much!
If an algorithm maps very precisely all the data points of a given training distribution, it will for sure produce a very little error on that specific training data (training error). However, while learning from the valid data points, it will also pick up the random variations in the data produced by its noise, at the much higher cost of a high error on the test data (test error) and consequently a poor performance due to a high generalization error.
In other words, more precision translates into collecting more noise which can ‘confuse’ the algorithm with apparent relationships in the training data that do not hold in general being based on a fallacious combinations of good data and noisy one.
The consequence for an algorithm that models the training data too rigorously absorbing too much of its noise, is that it will produce drastically different models depending on the training set and show high variations in its performances over different test data.
Variance measure exactly that: an algorithm’s sensitivity to specific sets of training data. High variance indicates that the algorithm fits too well the data and it is probably over-complicated for the data distribution is trying to model and therefore it is said to overfit the data.
That is why we need to look for the “simplest approximating model”!
On the other hand, we cannot either select a model that is just too simplistic and not expressive enough to capture and learn the complexity of an event data distribution.
Imagine using a linear regression to map a training data set that has a non-linear pattern: no matter how hard you will try and how many observations you will be able to collect, a linear regression is just a line and simply too rigid to model the curves of a non-linear data set. Bias, at the opposite corner from variance, essentially measure this inability of a machine learning algorithm to fit, or represent well enough, the distribution of the data set used to train the system. Bias, in other words, give a dimension to the simplifying assumptions made by a model to make the target function easier to learn, but at the cost of not finding the best possible function. When this is the case, the algorithm will have a high bias value and is said to underfit the data.
While overfitting is characterized by variance, underfitting is characterized by bias.
Well, if I was able to transfer at least part of the main concepts, you will have probably figured it out by now that we are in front of an interesting dilemma: on one hand, if the algorithm fits the training data tightly, or overfits the training data, it will show low bias but high variance. Which indicates not-too-good generalization performances.
On the other hand, when an algorithm is too simplistic to fit the training data well, or underfits the data, it will show low variance but higher bias values. Probably good generalization performances but maybe not the best algorithm we could choose.
In statistic, this is a very well-known dilemma and it’s called the bias-variance trade-off.
Ability to find an algorithm that balance well the values of bias and variance, will lead us exactly to what we are looking at: the simplest approximating model with the best generalization performances (or with the smallest generalization error).
The bias-variance trade-off concept is captured in machine learning also by the estimation-approximation trade-off. Bias measure a model’s poor approximation of the target function. To improve the performance, we’ll probably need to select a different algorithm or family of algorithms, that offer a larger hypothesis space which, covering a bigger area, will more likely get closer or, approximate better, the space covered by the target function we are aiming at. But let’s remember that the target function we are trying to get closer at is derived only from a finite set of sample data. Not from the real, future, full distribution. Sample data is all we have for learning, and a limited set of data can only represent an estimation of the real function that describes the entire phenomenon. So, similarly to the bias-variance trade-off case, if we approximate too well the function that describes the sample distribution, producing a low approximation error and a low bias, the risk is that when we’ll then use the newly built function to predict or classify future unseen data coming from the real distribution, our function will continue predicting too closely on the basis of the learned pattern from the sample data, and will result too rigid and specific to capture well the general progression of the real distribution. In these cases, every different training set will generate very specific models that greatly varies from each other. That is why in statistic the unity of measure to express this specific aspect is called variance and in a more machine learning oriented ‘dialect’ the same aspect is referred to as estimation error. High variance, big estimation errors.
Since, as we have just seen, complexity of our models affects their performance, we need to find a way to define complexity in a quantitative way and among others, the Vapnik-Chervonenkis dimension, is a widely used approach to find the right balance between bias and variance as it quantifies precision capturing very well this notion of complexity for a model.
Without getting into too much details, VC dimension relates to (but not necessarily equals to) the number of parameters that every model has which, in turn, is connected to the number of data points that the model can handle. The main idea is that the higher is the number of data points that the model want to approximate, the higher is the number of parameters the model needs to have to map them, which add complexity and makes the model very specific to that data set. And being specific does not go along well with being able to generalize well. VC dimension helps to capture the best balance between the two incompatible yet inseparable elements of the equation.
[For those of you who want to get deeper into VC dimension consider that: VC dimension measure the number of effective parameters or binary degrees of freedom of an algorithm which in turn express the number of data points that the particular algorithm can shatter. A set of points is said to be shattered by a class of functions if, no matter how we assign a binary label to each point, a member of the class can perfectly separate them.]
While measuring an algorithm complexity, VC dimension can also help us to estimate the prediction error, providing us with a probabilistic assessment on whether an algorithm can learn and generalize given the sample data set: a low VC dimension compared to the number of available training data will suggest that the test error will not be far from the training error.
This is an amazing result: we can actually make a strong statement about what our test performance will be on data we haven’t seen, using only a property of the model!
But surprises from the learning theory do not end here. It offers even more fundamental concepts and essential guidance into the fascinating journey of making learning possible.
And while driving the data scientists in their quest to the right alchemy to transform undisciplined data into valuable predictions, it also suggests lots of good common sense on how we, humans, should build more reliable hypothesis about the facts of life that happen around us: with a little bit more of realistic data and a little bit less of noisy opinions.
Originally published on LinkedIn.