Blog: Epoch, Iterations & Batch Size
Difference and Essence
Some terms in Machine Learning (ML) are quite easy to misunderstand or mix up.
But why?
My observation is that most ML books and tutorials do not dedicate as much time to explaining them as they do to the major topics, hence the confusion for most people.
In this short article I will take time to briefly explain the main difference between an Epoch and an Iteration in ML model training. This article assumes the reader has some previous knowledge or experience in training artificial neural networks, and it is not specific to any particular ML framework. Before I explain, let’s start with a very relatable example. Hopefully it helps connect the dots!
Assume there is a nice music track that comes with lyrics and is made of six (6) verses that you want to learn. After playing the full music track (Epoch) and going through each of the 6 verses (Iterations) for the first time, there is no guarantee you will be able to sing this song on your own without referring to the lyrics or replaying the song. Thus, you may need to replay the song a sufficient number of times (Multiple Epochs) until you are confident enough to sing it on your own, or even recite any verse from any part of the song (High learning accuracy).
Time to bring this down to training an artificial neural network. Before training a network we have a training dataset, a model whose parameters we want to fit to that data, and a cost function that measures how far the model's predictions are from the actual outputs. Depending on how complex the relationship in our input data is, we may choose a simpler or a more flexible model (for example, a lower or higher order polynomial). Then we move on to minimize our cost function using Gradient Descent. This is an iterative optimization algorithm with a learning rate alpha, which controls the size of each parameter update. We keep reducing the cost after each training step until we reach the lowest cost possible (ideally the global minimum of the cost function, although in practice training often settles in a good local minimum).
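For readers who like to see it written down, the usual form of a single Gradient Descent step is shown below. The symbols are my own notation and do not come from any particular framework: theta stands for the model parameters, J for the cost function, and alpha for the learning rate.

```latex
\theta \leftarrow \theta - \alpha \, \nabla_{\theta} J(\theta)
```

Each Iteration applies this step once, with the gradient computed on the current batch only.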
At or near the minimum of the cost function we can be confident that the learning algorithm fits the training data well. Whether it is good enough for making predictions on test or other unseen data still has to be checked on data that was held out from training.
If our training dataset has 1000 records, we could decide to split it into 10 batches (100 records per batch, i.e. a Batch size of 100). Thus, 10 steps will be required to complete one learning cycle. If we instead decide to split the 1000 records into 100 batches (10 records per batch, i.e. a Batch size of 10), we would need 100 steps per learning cycle.
Splitting the data into batches is computationally efficient, especially when dealing with massive datasets that may not fit in memory all at once.
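The arithmetic above can be sketched in a few lines of Python. The function name `iterations_per_epoch` is just something I made up for illustration. I round up so that a final, smaller batch still counts as one step when the dataset does not divide evenly, which is a common convention but not the only one.

```python
import math

def iterations_per_epoch(num_records: int, batch_size: int) -> int:
    """Number of batches (Iterations) needed to see every record once."""
    # Round up so a smaller leftover batch still counts as one Iteration.
    return math.ceil(num_records / batch_size)

print(iterations_per_epoch(1000, 100))  # 10
print(iterations_per_epoch(1000, 10))   # 100
```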
The 10 or 100 steps are Iterations. By the end of the 10th or 100th step, we would have completed one Epoch, which is a complete learning cycle over the whole training dataset. In each Iteration, the learning algorithm compares its predictions with the actual outputs for that batch and adjusts its parameters in order to make better predictions on the next batch. By the end of an Epoch, every record has contributed to those adjustments once. There is no guarantee that the cost will reach its minimum by the end of the first Epoch. Because of this, more than a few Epochs are often needed to achieve a high model accuracy. There is no set number of Epochs for optimizing a particular learning algorithm.
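To make the relationship between Epochs, Iterations and Batch size concrete, here is a minimal sketch in plain Python. It is not taken from any framework: the `train` function, the toy one-parameter model y = w * x, the synthetic data and the learning rate of 0.5 are all assumptions I picked so that the example runs on its own.

```python
import random

def train(xs, ys, batch_size, epochs, learning_rate=0.5, seed=0):
    """Mini-batch gradient descent for a toy model y = w * x."""
    rng = random.Random(seed)
    w = 0.0                      # the single parameter we are learning
    iterations = 0
    indices = list(range(len(xs)))
    for epoch in range(epochs):              # one Epoch = one full pass
        rng.shuffle(indices)                 # new batch order each Epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error on this batch w.r.t. w.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad        # one Iteration = one update
            iterations += 1
    return w, iterations

# Synthetic data: 1000 records generated from y = 3 * x.
xs = [i / 1000 for i in range(1000)]
ys = [3.0 * x for x in xs]

w, iterations = train(xs, ys, batch_size=100, epochs=5)
print(iterations)  # 50 (10 Iterations per Epoch, 5 Epochs)
```

Notice that the parameter is updated inside the inner loop, once per batch. The outer loop only decides how many times the whole dataset is replayed.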
Just one Epoch can result in underfitting. However, too many Epochs after the cost has stopped improving can cause the learning model to overfit the training data. Ideally, the right number of Epochs is the one that results in the highest accuracy of the learning model on data it was not trained on (validation data).
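One common way to pick the number of Epochs is to watch the loss on a validation set and stop once it has not improved for a while (often called early stopping). The sketch below is only an illustration: the function name, the `patience` parameter and the loss values are hypothetical and not from a real training run.

```python
def best_epoch_with_patience(val_losses, patience=2):
    """Return the Epoch (1-based) with the lowest validation loss,
    stopping once the loss has not improved for `patience` Epochs."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            waited = 0           # improvement: reset the counter
        else:
            waited += 1          # no improvement this Epoch
            if waited >= patience:
                break            # stop before overfitting gets worse
    return best_epoch

# Hypothetical validation losses, one per Epoch.
losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
print(best_epoch_with_patience(losses))  # 4
```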
While the concept of an Epoch remains fundamental to the optimization of a learning algorithm, its specific application in different settings, such as artificial neural networks or reinforcement learning, can differ in terms of how the model is modified to perform better after each cycle.
To end with, an Epoch is one complete cycle through the entire training data by a neural model. The training data can be split into batches for more efficient computation. Given 1000 records, the data can be split into 10 batches. This creates 10 Iterations per Epoch. Each batch will contain 100 records. Thus, the Batch size for each Iteration will be 100.
Open to your questions, suggestions, feedback, etc…
“If you can’t explain it simply, you don’t understand it well enough.”
— Albert Einstein