Blog: Three Simple Theories to Help Us Understand Overfitting and Underfitting in Machine Learning Models
The two worst things that could happen to a machine learning model is either to build useless knowledge or to learn nothing relevant from a training dataset. In machine learning theory, these two phenomenon’s are described using the terms overfitting and underfitting respectively and they constitute two of the biggest challenges in modern deep learning solutions. I often like to compare deep learning overfitting to human hallucinations as the former occurs when algorithms start inferring non-existing patterns in datasets. Underfitting is closer to a learning disorder that prevents people from acquiring the relevant knowledge to perform a given task. Despite its importance, there is no easy solution to overfitting and deep learning application often need to use techniques very specific to individual algorithms in order to avoid overfitting behaviors. This problem get even more scarier if you consider that humans are also incredibly prompt to overfitting which translates into subjective evaluations of the machine learning models. Just think about how many stereotypes you used in the last week. Yeah, I know….Today, I would like to present three different theories that are helpful to better reason through overfitting and underfitting conditions in machine learning models.
Unquestionably, our hallucinations or illusions of validity are present somewhere in the datasets used in the training of deep learning algorithms which creates an even more chaotic picture. Intuitively, we think about data when working on deep learning algorithms but there is also another equally important and often forgotten element of deep learning models: knowledge. In the context of deep learning algorithms, data is often represented as persisted records in one or more databases while knowledge is typically represented as logic rules that can be validated in the data. The role of deep learning models is to infer rules that can be applied to new datasets in the same domain. Unfortunately for deep learning agents, powerful computation capabilities are not a direct answer to knowledge building and overfitting occursF
Challenges such as overfitting and underfitting are related to the capacity of a machine learning model to build relevant knowledge based on an initial set of training examples. Conceptually, underfitting is associated withe the inability of a machine learning algorithm to infer valid knowledge from the initial training data. Contrary to that, overfitting is associated with model that create hypothesis that are way too generic or abstract to result practical. Putting it in simpler terms, underfitting models are sort of dumb while overfitting models tend to hallucinate(imagine things that don’t exist ) :).
Model Capacity: The Main Element to Quantify Overfitting and Underfitting in Machine Learning Models
Let’s try to formulate a simple methodology to understand overfitting and underfitting in the context of machine learning algorithms.
A typical machine learning scenario starts with an initial data set that we use to train and test the performance of an algorithm. The statistical wisdom suggests that we use 80% of the dataset to train the model while maintaining the remaining 20% to test it. During the training phase, out model will produce certain deviation from the training data which we is often referred to the Training Error. Similarly, the deviation produced during the test phase is referred to as Test Error. From that perspective, the performance of a machine learning model can be judged on its ability to accomplish two fundamental things:
1 — Reduce the Training Error
2 — Reduce the gap between the Training and Test Errors
Those two simple rules can help us understand the concepts of overfitting and underfitting. Basically, underfitting occurs a model fails at rule #1 and is not able to obtain a sufficiently low error from the training set. Overfitting then happens when a model fails at rule #2 and the gap between the test and training errors is too large. You see? two simple rules to helps us quantify the levels of overfitting and underfitting in machine learning algorithms.
Another super important concept that tremendously helps machine learning practitioners deal with underfitting and overfitting is the notion of Capacity. Conceptually, Capacity represents the number of functions that a machine learning model can select as a possible solution. for instance, la linear regression model can have all degree 1 polynomials of the form y = w*x + b as a Capacity (meaning all the potential solutions).
Capacity is an incredibly relevant concept machine learning models. Technically, a machine learning algorithms performs best when it has a Capacity that is proportional to the complexity of its task and the input of the training data set. Machine learning models with low Capacity are impractical when comes to solve complex tasks and tend to underfit. Along the same lines, models with higher Capacity than needed are prompt to overfit. From that perspective, Capacity represents a measure by which we can estimate the propensity of the model to underfit or overfit.
Three Theories to Understand Overfitting and Underfitting in Machine Learning Models
The principle of Occam’s Razor is what happens when philosophers get involved in machine learning :) The origins of the this ancient philosophical theory dates back to somewhere between 1287 and 1347 associating it with philosophers like Ptolemy. In essence, the Occam’s Razor theory states that if we have competing hypothesis that explain known observations we should choose the simplest one. From Sherlock Holmes to Monk, Occam’s Razor has been omnipresent in world class’s detectives that often follow the simplest and most logical hypothesis to uncover complex mysteries.
The Occam’s Razor is a wise philosophical principle to follow in our daily lives but its application in machine learning results controversial at best. Simpler hypothesis are certainly preferred from a computational standpoint in a world in which algorithms are notorious for being resource expensive. Additionally, simpler hypothesis are computationally easier to generalize. However, the challenge with ultra-simple hypothesis is that they often result too abstract to model complex scenarios. As a result, a model with a large enough training set and a decent size number of dimensions should select a complex enough hypothesis that can produce a low training error. Otherwise it will be prompt to underfit.
The VC Dimension
The Occam’s Razor is a nice principle of parsimony but those abstract ideals don’t directly translate into machine learning models that live in the universe of numbers. That challenge was addressed by the founders to statistical theory Vapnik and Chervonekis(VC) who came out with a model to quantify the Capacity of a statistic algorithm. Known as the VC Dimension, this techniques is based on determining the largest number m from which exists a training set of m different x points that the target machine learning function can label arbitrarily.
The VC Dimension is one of the cornerstones of statistical learning and has been used as the basics of many interesting theories. For instance, the VC Dimension helps explain that the gap between the generalization error and the training error in a machine learning model decreases as the size of the training set increases but the same gap increases as the Capacity of the model grows. In other words, models with large training sets are more likely to pick the approximately correct hypothesis but if there are too many potential hypothesis then we are likely to end up with the wrong one.
The No Free Lunch Theorem
I would like to end this article with one of my favorite principles iof machine learning relevant to the the overfitting-underfitting problem. The No Free Lunch Theorem states that, averaged over all possible data-generating distributions, every classification algorithm has approximately the same error rate when classifying previously unobserved points. I like to think about the No Free Lunch Theorem as the mathematical counter-theory to the limitation of machine learning algorithms that force us to generalize semi-absolute knowledge using a finite training set. In logic, for instance, inferring universal rules from a finite set of examples is considered “illogical”. For machine learning practitioners, the No Free Lunch Theorem is another way to say that no algorithm is better than others given enough observations. In other words,thee role of a machine learning model is not to find a universal learning function but rather the hypothesis that better fits the target scenario.
Overfitting and underfitting remain two of the most serious challenges in machine learning applications. Theories like the VC Dimension, Occam’s Razor and the No Free Lunch Theorem provide a strong theoretical foundation to analyze the root of overfitting and underfitting conditions in machine learning solutions. Understanding and quantifying the capacity of a machine learning model remains the fundamental step to understands its propensity to overfit or underfit.