
# Implementing Different Activation Functions and Weight Initialization Methods Using Python

In this post, we will discuss how to implement different combinations of non-linear activation functions and weight initialization methods in Python. We will also analyze how the choice of activation function and weight initialization method affects the accuracy and the rate at which the loss decreases in a deep neural network, using a non-linearly separable toy data set. This is a follow-up to my previous post on activation functions and weight initialization methods.

Note: This article assumes that the reader has a basic understanding of neural networks, weights, biases, and backpropagation. If you want to learn the basics of the feed-forward neural network, check out my previous article (link at the end of this article).

### Activation Functions Overview

The activation function is the non-linear function that we apply over the input data coming into a particular neuron; the output of the function is sent as input to the neurons in the next layer.

Without non-linear activation functions, even a very deep neural network can only learn ‘y’ as a linear transformation of ‘x’; it can represent only linear relations between ‘x’ and ‘y’. In other words, we would be constrained to learning linear decision boundaries and could not learn arbitrary non-linear ones. This is why we need non-linear activation functions: to learn the complex non-linear relationship between the input and the output.

Some of the commonly used activation functions:

• Logistic
• Tanh
• ReLU
• Leaky ReLU
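As a quick reference, the four functions can be written as plain `numpy` expressions (a minimal sketch; the class-based versions appear later in the post):

```python
import numpy as np

def sigmoid(x):
    # squashes the input to the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # squashes the input to the range (-1, 1)
    return np.tanh(x)

def relu(x):
    # passes positive values through, zeros out negatives
    return np.maximum(0, x)

def leaky_relu(x, slope=0.1):
    # like ReLU, but lets a small slope through for negative inputs
    return np.maximum(slope * x, x)
```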

### Weight Initialization Overview

When we train deep neural networks, weights and biases are usually initialized with random values. In the process of initializing weights to random values, we might encounter problems like vanishing or exploding gradients, and as a result the network can take a long time to converge (if it converges at all). The most commonly used weight initialization methods are:

• Xavier Initialization
• He Initialization
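In `numpy`, the two schemes differ mainly in the scale of the random draws. A minimal sketch (the function names and the fan-in convention here are my own choices):

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    # Xavier/Glorot: variance scaled by the number of inputs to the layer;
    # commonly paired with sigmoid/tanh activations
    return np.random.randn(fan_in, fan_out) * np.sqrt(1.0 / fan_in)

def he_init(fan_in, fan_out):
    # He: variance scaled by 2 / fan_in;
    # the usual choice for ReLU-family activations
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
```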

To understand the intuition behind the most commonly used activation functions and weight initialization methods, kindly refer to my previous post on activation functions and weight initialization methods.

### Let’s Code

In the coding section, we will be covering the following topics.

1. Generate data that is not linearly separable
2. Write a feedforward network class
3. Setup code for plotting
4. Analyze sigmoid activation
5. Analyze tanh activation
6. Analyze ReLU activation
7. Analyze Leaky ReLU activation

#### Overview

In this section, we will compare the accuracy of a simple feedforward neural network by trying out various combinations of activation functions and weight initialization methods.

The way we do that is: first, we generate non-linearly separable data with two classes and write a simple feedforward neural network that supports all the activation functions and weight initialization methods. Then we compare the different scenarios using loss plots.


#### Import Libraries

Before we start with our analysis of the feedforward network, we first need to import the required libraries. We import `numpy` to evaluate the matrix multiplications and dot products in the neural network, `matplotlib` to visualize the data, and from the `sklearn` package, functions to generate the data and evaluate the network performance. To display/render HTML content in-line in a Jupyter notebook, we import `HTML`.

In line 19, we create a custom color map from a list of colors using the `from_list()` method of `LinearSegmentedColormap`.
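A minimal version of these imports and the custom color map might look like this (the specific color choices here are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# build a custom colormap from a list of colors (red -> yellow -> green)
my_cmap = matplotlib.colors.LinearSegmentedColormap.from_list(
    "", ["red", "yellow", "green"])
```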

### Generate Dummy Data

Remember that we are using a feedforward neural network because we want to handle non-linearly separable data. In this section, we will see how to randomly generate such data.

To generate data randomly, we will use `make_blobs`, which generates blobs of points with a Gaussian distribution. I have generated 1000 data points in 2D space with four blobs (`centers=4`), framed as a multi-class classification problem. Each data point has two input features and a class label of 0, 1, 2, or 3. Note that the `make_blobs()` function generates linearly separable data, but we need non-linearly separable data for binary classification.

```python
labels_orig = labels
labels = np.mod(labels_orig, 2)
```

One way to convert the 4 classes into a binary classification is to take the remainder of the class labels when divided by 2, giving new labels of 0 and 1.

From the plot, we can see that the centers of the blobs are merged such that we now have a binary classification problem where the decision boundary is not linear. Once the data is ready, I use the `train_test_split` function to split it into `training` and `validation` sets in a 90:10 ratio.
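Putting the data-generation steps together (the seed values here are my own choices, not necessarily the ones used in the original notebook):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# 1000 points in 2-D, drawn from four Gaussian blobs
data, labels_orig = make_blobs(n_samples=1000, centers=4,
                               n_features=2, random_state=0)

# fold the four classes into two: 0/2 -> 0, 1/3 -> 1
labels = np.mod(labels_orig, 2)

# 90:10 train/validation split, stratified on the binary labels
X_train, X_val, Y_train, Y_val = train_test_split(
    data, labels, stratify=labels, random_state=0, test_size=0.1)
```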

### Feedforward Network

In this section, we will write a generic class that can generate a neural network, taking the number of hidden layers and the number of neurons in each hidden layer as input parameters.

The network has six neurons in total — two in the first hidden layer and four in the output layer. For each of these neurons, pre-activation is represented by ‘a’ and post-activation is represented by ‘h’. In the network, we have a total of 18 parameters — 12 weight parameters and 6 bias terms.

We will write our neural network in a class called `FFNetwork`. The class has 8 functions; we will go over them one by one.

```python
def __init__(self, init_method = 'random', activation_function = 'sigmoid', leaky_slope = 0.1):
    ...
```

The `__init__` function initializes all the parameters of the network, including weights and biases. The function accepts a few arguments:

• `init_method`: Initialization method to be used for initializing all the parameters of the network. Supports — “random”, “zeros”, “He” and “Xavier”.
• `activation_function`: Activation function to be used for learning non-linear decision boundary. Supports — “sigmoid”, “tanh”, “relu” and “leaky_relu”.
• `leaky_slope`: Negative slope of Leaky ReLU. Default value set to 0.1.

In Lines 5–10, we set the network configuration and the activation function to be used in the network.

`self.layer_sizes = [2, 2, 4]`

`layer_sizes` indicates that the network has two inputs, two neurons in the first hidden layer, and four neurons in the second hidden layer, which is also the final layer in this case. After that, we have a series of "if-else" weight initialization statements; each of these initializes only the weights based on the chosen method, while the biases are always initialized to the value one. The initialized values of weights and biases are stored in a dictionary `self.params`.

```python
def forward_activation(self, X):
    if self.activation_function == "sigmoid":
        return 1.0/(1.0 + np.exp(-X))
    elif self.activation_function == "tanh":
        return np.tanh(X)
    elif self.activation_function == "relu":
        return np.maximum(0, X)
    elif self.activation_function == "leaky_relu":
        return np.maximum(self.leaky_slope*X, X)
```

Next, we have the `forward_activation` function, which takes the input ‘X’ as an argument and computes the post-activation value depending on the choice of activation function.

```python
def grad_activation(self, X):
    ...
```

The function `grad_activation` also takes the input ‘X’ as an argument, computes the derivative of the activation function at the given input, and returns it.
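A sketch of what that derivative computation looks like for each activation choice (written here as a standalone function of the pre-activation input; the class version may store intermediate values differently):

```python
import numpy as np

def grad_activation(X, activation_function="sigmoid", leaky_slope=0.1):
    if activation_function == "sigmoid":
        s = 1.0 / (1.0 + np.exp(-X))
        return s * (1 - s)              # sigmoid'(x) = s(x) * (1 - s(x))
    elif activation_function == "tanh":
        return 1 - np.tanh(X) ** 2      # tanh'(x) = 1 - tanh(x)^2
    elif activation_function == "relu":
        return 1.0 * (X > 0)            # 1 for positive inputs, 0 otherwise
    elif activation_function == "leaky_relu":
        return np.where(X > 0, 1.0, leaky_slope)
```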

```python
def forward_pass(self, X, params = None):
    ...

def grad(self, X, Y, params = None):
    ...
```

After that, we have two functions, `forward_pass` and `grad`. The `forward_pass` function characterizes the forward pass, which involves two steps:

1. Pre-activation: computes the dot product between the input ‘x’ and the weights ‘w’ and adds the bias ‘b’.
2. Post-activation: takes the pre-activation output and applies the activation function on top of it.
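For a single layer, the two steps above can be sketched as follows (sigmoid shown for illustration; the toy weights and shapes are my own):

```python
import numpy as np

def layer_forward(X, W, b):
    # step 1: pre-activation -- dot product of inputs and weights, plus bias
    a = np.matmul(X, W) + b
    # step 2: post-activation -- apply the activation function (sigmoid here)
    h = 1.0 / (1.0 + np.exp(-a))
    return a, h

X = np.array([[1.0, 2.0]])   # one input with two features
W = np.zeros((2, 2))         # 2-in, 2-out hidden layer
b = np.zeros((1, 2))
a, h = layer_forward(X, W, b)
```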

The `grad` function characterizes the gradient computation for each of the parameters present in the network and stores the results in a list called `gradients`. Don’t worry too much about how we arrive at the gradients, because we will be using PyTorch to do the heavy lifting; but if you are interested in deriving them, go through my previous article.

```python
def fit(self, X, Y, epochs=1, algo="GD", display_loss=False,
        eta=1, mini_batch_size=100, eps=1e-8,
        beta=0.9, beta1=0.9, beta2=0.9, gamma=0.9):
```

Next, we define the `fit` method, which takes the input ‘X’ and ‘Y’ as mandatory arguments along with a few optional arguments required for implementing the different variants of the gradient descent algorithm. Kindly refer to my previous post for a detailed explanation of how to implement these algorithms.
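For the plain "GD" variant, the core of `fit` boils down to a loop of parameter updates. A self-contained sketch of the idea on a toy quadratic loss (the function and parameter names are illustrative, not the class's actual internals):

```python
import numpy as np

def gd_fit(grad_fn, params, epochs=100, eta=0.1):
    # vanilla gradient descent: move each parameter against its gradient
    for _ in range(epochs):
        grads = grad_fn(params)
        params = params - eta * grads
    return params

# toy example: minimise L(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w = gd_fit(lambda p: 2 * (p - 3.0), np.array([0.0]), epochs=200, eta=0.1)
```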

`def predict(self, X):`

Now we define our `predict` function, which takes the input `X` as an argument and expects it to be a `numpy` array. In the predict function, we compute the forward pass of each input with the trained model and return a numpy array containing the predicted value for each input data point.

### Setting Up for Plotting & Train Neural Network

In this section, we define a function to evaluate the performance of the neural network and create plots to visualize the working of the update rule. This kind of setup helps us run different experiments with different activation functions and weight initialization methods, and plot the update rule for different variants of gradient descent.

First, we instantiate the feedforward network class and then call the `fit` method on the training data with 10 epochs and the learning rate set to 1 (these values are arbitrary, not the optimal values for this data; you can play around with them to find the best number of epochs and learning rate).

Then we call the `post_process` function to compute the training and validation accuracy of the neural network (Lines 2–11). We also draw a scatter plot of the input points, with point sizes based on the predicted values of the neural network.

The size of each point in the plot is given by a formula,

`s=15*(np.abs(Y_pred_binarised_train-Y_train)+.2)`

The formula takes the absolute difference between the predicted value and the actual value.

All the small points in the plot indicate that the model predicts those observations correctly, while the large points indicate that those observations are misclassified.
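In `numpy`, the size computation is a single vectorized expression (toy label values below are my own, for illustration):

```python
import numpy as np

Y_train = np.array([0, 1, 1, 0])                  # true labels (toy values)
Y_pred_binarised_train = np.array([0, 1, 0, 0])   # predictions (toy values)

# correct predictions get the minimum size 15 * 0.2 = 3;
# misclassified points get 15 * 1.2 = 18, so they stand out
s = 15 * (np.abs(Y_pred_binarised_train - Y_train) + .2)
```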

In Lines 20–29, we plot the update each parameter receives from the network via backpropagation. Our network has 18 parameters in total, so we iterate 18 times; each time, we find the update a parameter gets and plot it in a subplot. For example, the update for a weight W at the iᵗʰ epoch is Wᵢ₊₁ − Wᵢ.
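If the weight values are recorded after every epoch, the per-epoch update is just the difference between consecutive snapshots (a sketch; the toy history values are my own):

```python
import numpy as np

# toy history: value of one weight recorded at the end of each epoch
weight_history = np.array([0.5, 0.8, 0.9, 0.95])

# update at epoch i is W[i+1] - W[i]
updates = np.diff(weight_history)
```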

### Analyze sigmoid activation

To analyze the effect of sigmoid activation function on the neural network, we will set the activation function as ‘sigmoid’ and execute the neural network class.

The loss of the network is falling even though we have run it for only a few iterations. Using the `post_process` function, we plot the 18 subplots; axis labels are omitted because they are not required here. The 18 plots, one per parameter, are arranged in row-major order and show the updates each parameter receives. The first 12 plots show the updates received by the weights, and the last 6 the updates received by the bias terms.

In any of the subplots, a curve that stays close to the middle indicates that the particular parameter is not getting any updates. Instead of executing each weight initialization manually, we will write a `for` loop to run all the possible weight initialization combinations.

```python
for init_method in ['zeros', 'random', 'xavier', 'he']:
    for activation_function in ['sigmoid']:
        print(init_method, activation_function)
        model = FFNetwork(init_method=init_method, activation_function=activation_function)
        model.fit(X_train, y_OH_train, epochs=50, eta=1, algo="GD", display_loss=True)
        post_process(plot_scale=0.05)
        print('\n--\n')
```

In the above code, I just added two ‘for’ loops: one for the weight initialization methods and another for the activation function. Once you execute the code, you will see that the neural network tries all the possible weight initialization methods while keeping the activation function (sigmoid) constant.