Blog: Shallow Neural Networks
When we hear the name Neural Network, we tend to imagine many, many hidden layers, but there is a type of neural network with only a few. Shallow neural networks consist of only 1 or 2 hidden layers. Understanding a shallow neural network gives us insight into what exactly is going on inside a deep neural network. In this post, let us see what a shallow neural network is and how it works in a mathematical context. The figure below shows a shallow neural network with 1 hidden layer, 1 input layer and 1 output layer.
The neuron is the atomic unit of a neural network. Given an input, it produces an output, which is passed as input to the subsequent layer. A neuron can be thought of as a combination of 2 parts:
- The first part computes the output Z, using the inputs and the weights.
- The second part performs the activation on Z to give out the final output A of the neuron.
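The 2 parts of a single neuron can be sketched in NumPy as follows (the input, weights and bias values here are made up purely for illustration):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical input with 3 features, plus one neuron's weights and bias
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.5

z = np.dot(w, x) + b   # part 1: linear combination of inputs and weights
a = sigmoid(z)         # part 2: activation applied to Z
```

Here `z` is the intermediate output Z and `a` is the final output A of the neuron.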
The Hidden Layer
The hidden layer comprises several neurons, each of which performs the above 2 calculations. The 4 neurons present in the hidden layer of our shallow neural network compute the following:

Z[1]1 = W[1]1ᵀ X + b[1]1 ,  A[1]1 = σ(Z[1]1)
Z[1]2 = W[1]2ᵀ X + b[1]2 ,  A[1]2 = σ(Z[1]2)
Z[1]3 = W[1]3ᵀ X + b[1]3 ,  A[1]3 = σ(Z[1]3)
Z[1]4 = W[1]4ᵀ X + b[1]4 ,  A[1]4 = σ(Z[1]4)
In the above equations,
- The superscript number [i] denotes the layer number and the subscript number j denotes the neuron number in a particular layer.
- X is the input vector consisting of 3 features.
- W[i]j is the weight associated with neuron j present in the layer i.
- b[i]j is the bias associated with neuron j present in the layer i.
- Z[i]j is the intermediate output associated with neuron j present in the layer i.
- A[i]j is the final output associated with neuron j present in the layer i.
- σ is the sigmoid activation function. Mathematically it is defined as: σ(z) = 1 / (1 + e⁻ᶻ)
As we can see, the above 4 equations are repetitive. Therefore we vectorize them as:

Z[1] = W[1] X + b[1]
A[1] = σ(Z[1])

where the rows of the matrix W[1] are the vectors W[1]jᵀ, and b[1] stacks the biases b[1]j.
- The first equation computes all the intermediate outputs Z in a single matrix multiplication.
- The second equation computes all the activations A in a single matrix multiplication.
The Shallow Neural Network
A neural network is built using various hidden layers. Now that we know the computations that occur in a particular layer, let us understand how the whole neural network computes the output for a given input X. These are also called the forward-propagation equations:

Z[1] = W[1] X + b[1]
A[1] = σ(Z[1])
Z[2] = W[2] A[1] + b[2]
A[2] = σ(Z[2])
- The first equation calculates the intermediate output Z of the first hidden layer.
- The second equation calculates the final output A of the first hidden layer.
- The third equation calculates the intermediate output Z of the output layer.
- The fourth equation calculates the final output A of the output layer which is also the final output of the whole neural network.
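The 4 forward-propagation steps above can be sketched in NumPy like this (the shapes match our network: 3 input features, 4 hidden neurons, 1 output neuron; the random inputs and small initial weights are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    Z1 = W1 @ X + b1    # intermediate output of the hidden layer
    A1 = sigmoid(Z1)    # final output of the hidden layer
    Z2 = W2 @ A1 + b2   # intermediate output of the output layer
    A2 = sigmoid(Z2)    # final output of the whole network
    return A2

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))           # 3 features, 5 examples
W1 = rng.standard_normal((4, 3)) * 0.01   # hidden layer: 4 neurons
b1 = np.zeros((4, 1))
W2 = rng.standard_normal((1, 4)) * 0.01   # output layer: 1 neuron
b2 = np.zeros((1, 1))

A2 = forward(X, W1, b1, W2, b2)
print(A2.shape)  # prints (1, 5): one prediction per example
```

Note that the columns of X are individual training examples, so one pass computes predictions for all of them at once.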
We know that a neural network is basically a set of mathematical equations and weights. To make the network robust, so that it performs well in different scenarios, we leverage activation functions. These activation functions introduce non-linearity into the network. Let us try to understand why activation functions are crucial for any neural network with the help of our shallow neural network.
Without the activation functions, our shallow neural network can be represented as:

Z[1] = W[1] X + b[1] — (1)
A[2] = Z[2] = W[2] Z[1] + b[2] — (2)

If we substitute the value of Z[1] from equation 1 into equation 2, we get:

A[2] = W[2] (W[1] X + b[1]) + b[2] = (W[2] W[1]) X + (W[2] b[1] + b[2])

As you can see, the output is just a linear combination of a new weight matrix W′ = W[2] W[1], the input X and a new bias b′ = W[2] b[1] + b[2], which means the neurons and weights in the hidden layer add nothing: the whole network collapses into a single linear layer. Therefore, to introduce non-linearity into the network, we use activation functions.
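We can verify this collapse numerically: a two-layer network without activations produces exactly the same output as a single linear layer with W′ = W[2]W[1] and b′ = W[2]b[1] + b[2] (the random values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 4))                  # 3 features, 4 examples
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

# Two layers with no activation function...
out_two_layer = W2 @ (W1 @ X + b1) + b2

# ...equal one linear layer with a combined weight matrix and bias
W_new = W2 @ W1
b_new = W2 @ b1 + b2
out_one_layer = W_new @ X + b_new

assert np.allclose(out_two_layer, out_one_layer)
```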
There are many activation functions that can be used. These include Sigmoid, Tanh, ReLU, Leaky ReLU and many others. It is not mandatory to use a particular activation function for all layers. You can select an activation function for a particular layer and a different activation for another layer and so on. You can read more about these activation functions in this post.
We know that the Weight Matrix W of a neural network is randomly initialized. One may wonder why we can't initialize W with 0's or some other fixed value. Let us understand this with the help of our Shallow Neural Network.
Let W[1], the weight matrix of layer 1, and W[2], the weight matrix of layer 2, be initialized with 0 or any other fixed value. If all weights within a layer are the same, the activations of the neurons in the hidden layer would be the same, and so would the derivatives of those activations. Therefore, every neuron in that hidden layer would modify its weights in exactly the same fashion, i.e. there would be no benefit to having more than 1 neuron in a particular hidden layer. But we don't want this. Instead, we want each neuron in the hidden layer to be unique, with different weights, working as a different function. Therefore, we initialize the weights randomly.
A widely used method of initialization is Xavier initialization. Mathematically it is defined as:

W[l] ~ N(0, 1/n[l−1])

It states that the weight matrix W of a particular layer l is picked randomly from a normal distribution with mean μ = 0 and variance σ² = 1/n[l−1], the multiplicative inverse of the number of neurons in layer l−1. The bias b of all layers is initialized with 0.
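This scheme can be sketched in NumPy as follows (the helper name `xavier_init` is illustrative; the layer sizes 4 and 3 match our shallow network):

```python
import numpy as np

def xavier_init(n_out, n_in, rng):
    # Weights drawn from N(0, 1/n_in); biases start at 0
    W = rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)
    b = np.zeros((n_out, 1))
    return W, b

rng = np.random.default_rng(42)
W1, b1 = xavier_init(4, 3, rng)   # hidden layer: 4 neurons, 3 inputs each
W2, b2 = xavier_init(1, 4, rng)   # output layer: 1 neuron, 4 inputs
```

Scaling the variance by 1/n[l−1] keeps the magnitude of each Z roughly constant across layers, which helps prevent the sigmoid activations from saturating early in training.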
We know that the weights of a neural network are initialized randomly. In order to use the neural network for correct predictions, we need to update these weights. The method through which we update these weights is known as Gradient Descent. Let us understand this using a computation graph.
In the above figure, forward propagation (indicated by black lines) is used to compute the output for a given input X. Backward propagation (indicated by red lines) is used to update the weight matrices W[1], W[2] and biases b[1], b[2]. This is done by calculating the derivatives of the inputs at each step in the computation graph. The loss L is mathematically defined as:

L = −(Y log A[2] + (1 − Y) log(1 − A[2]))
Using the above equation for the loss L, with the sigmoid function as the activation for both the hidden and output layers, and with the help of the chain rule of derivatives, we compute the following (m is the number of training examples):

dZ[2] = A[2] − Y
dW[2] = (1/m) dZ[2] A[1]ᵀ
db[2] = (1/m) Σ dZ[2]
dZ[1] = W[2]ᵀ dZ[2] * σ′(Z[1])
dW[1] = (1/m) dZ[1] Xᵀ
db[1] = (1/m) Σ dZ[1]
The above equations might look confusing, but they work out perfectly for gradient descent. In the equation for dZ[1], * represents element-wise multiplication and σ′ represents the derivative of the sigmoid function.
I would urge readers who know calculus to derive the above equations themselves to get a better understanding of how gradient descent works.
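Putting forward propagation, backward propagation and the weight update together, one full step of gradient descent can be sketched like this (the toy dataset and learning rate are made up; `train_step` is an illustrative name, and σ′(Z[1]) is computed as A[1] * (1 − A[1])):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(X, Y, params, lr=0.5):
    W1, b1, W2, b2 = params
    m = X.shape[1]
    # Forward propagation
    Z1 = W1 @ X + b1
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)
    # Backward propagation (chain rule, sigmoid + log loss)
    dZ2 = A2 - Y
    dW2 = (dZ2 @ A1.T) / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)   # sigma'(Z1) = A1 * (1 - A1)
    dW1 = (dZ1 @ X.T) / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    # Gradient descent update
    new_params = (W1 - lr * dW1, b1 - lr * db1,
                  W2 - lr * dW2, b2 - lr * db2)
    loss = -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
    return new_params, loss

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 200))
Y = (X.sum(axis=0, keepdims=True) > 0).astype(float)  # toy labels

params = (rng.standard_normal((4, 3)) * 0.01, np.zeros((4, 1)),
          rng.standard_normal((1, 4)) * 0.01, np.zeros((1, 1)))

losses = []
for _ in range(500):
    params, loss = train_step(X, Y, params)
    losses.append(loss)
# the loss should decrease as training proceeds
```

Each iteration performs the four forward-propagation equations, the six gradient equations from the previous section, and then moves every weight a small step against its gradient.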
Thus, in this story, we studied how a Shallow Neural Network works in a mathematical context. Although I have tried to explain everything in as much detail as possible, if you feel that something is missing, do check my previous posts or post your query in the comments section below.
I would like to thank the readers for reading the story. If you have any questions or doubts, feel free to ask them in the comments section below. I’ll be more than happy to answer them and help you out. If you like the story, please follow me to get regular updates when I publish a new story. I welcome any suggestions that will improve my stories.