Implementing different Activation Functions and Weight Initialization Methods Using Python

Non-Linearly Separable Data

In this post, we will discuss how to implement different combinations of non-linear activation functions and weight initialization methods in Python. We will also analyze how the choice of activation function and weight initialization method affects the accuracy and the rate at which the loss decreases in a deep neural network, using a non-linearly separable toy data set. This is a follow-up to my previous post on activation functions and weight initialization methods.

Note: This article assumes that the reader has a basic understanding of Neural Network, weights, biases, and backpropagation. If you want to learn the basics of the feed-forward neural network, check out my previous article (Link at the end of this article).

Citation Note: The content and the structure of this article is based on the deep learning lectures from One-Fourth Labs — PadhAI.

Activation Functions Overview

The activation function is the non-linear function that we apply over the input data coming to a particular neuron and the output from the function will be sent to the neurons present in the next layer as input.

Even a very deep neural network without non-linear activation functions will only learn ‘y’ as a linear transformation of ‘x’. It can only represent linear relations between ‘x’ and ‘y’. In other words, we will be constrained to learning linear decision boundaries and can’t learn arbitrary non-linear decision boundaries. This is why we need non-linear activation functions: to learn the complex non-linear relationship between the input and the output.
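To make this concrete, here is a minimal sketch (the layer sizes and random values are arbitrary assumptions, not taken from the network used later) showing that three stacked layers without any activation function compute exactly the same mapping as one linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three "layers" with no activation function: just weights and biases.
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 3)), rng.normal(size=3)
W3, b3 = rng.normal(size=(3, 1)), rng.normal(size=1)

def deep_linear(x):
    """Apply the three layers one after another, without any non-linearity."""
    return ((x @ W1 + b1) @ W2 + b2) @ W3 + b3

# The same mapping written as a single linear layer y = x @ W + b.
W = W1 @ W2 @ W3
b = (b1 @ W2 + b2) @ W3 + b3

x = rng.normal(size=(5, 2))
collapsed_matches = np.allclose(deep_linear(x), x @ W + b)
print(collapsed_matches)  # True
```

However many such layers we stack, the product of the weight matrices is still a single matrix, so depth alone adds no expressive power.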

Some of the commonly used activation functions are:

  • Logistic
  • Tanh
  • ReLU
  • Leaky ReLU

Weight Initialization Overview

When we train deep neural networks, weights and biases are usually initialized with random values. Depending on how these random values are chosen, we might encounter problems like vanishing or exploding gradients. As a result, the network would take a lot of time to converge (if it converges at all). The most commonly used weight initialization methods are:

  • Xavier Initialization
  • He Initialization
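As a rough reminder, one common way of stating the two schemes is given below. This is an assumption on my part about the variant in use, since the formulas are not spelled out here. Each weight of a layer with n_in inputs is drawn from a zero-mean Gaussian with variance:

```latex
\text{Xavier: } \operatorname{Var}(w) = \frac{1}{n_{\text{in}}},
\qquad
\text{He: } \operatorname{Var}(w) = \frac{2}{n_{\text{in}}}
```

Other variants exist; for example, the original Xavier (Glorot) formulation uses 2 / (n_in + n_out) so that both the forward and the backward signal are taken into account.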

To understand the intuition behind the most commonly used activation functions and weight initialization methods, kindly refer to my previous post on activation functions and weight initialization methods.

Let’s Code

In the coding section, we will be covering the following topics.

  1. Generate data that is not linearly separable
  2. Write a feedforward network class
  3. Setup code for plotting
  4. Analyze sigmoid activation
  5. Analyze tanh activation
  6. Analyze ReLU activation
  7. Analyze Leaky ReLU activation


In this section, we will compare the accuracy of a simple feedforward neural network by trying out various combinations of activation functions and weight initialization methods.

To do that, we will first generate non-linearly separable data with two classes and write a simple feedforward neural network that supports all the activation functions and weight initialization methods. Then we will compare the different scenarios using loss plots.

Import Libraries

Before we start with our analysis of the feedforward network, we need to import the required libraries. We import numpy to evaluate the matrix multiplications and dot products between vectors in the neural network, matplotlib to visualize the data, and from the sklearn package we import functions to generate data and evaluate the network performance. To display/render HTML content inline in a Jupyter notebook, we import HTML.

In line 19, we are creating a custom color map from a list of colors by using the from_list() method of LinearSegmentedColormap.
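Since the import cell itself is not reproduced here, the following is a sketch of what it might look like. The exact set of sklearn helpers, the colormap name and the three colors are assumptions made for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# In a Jupyter notebook you would additionally use:
#   from IPython.display import HTML

# Custom colour map built from a list of colours.
# The name "class_cmap" and the colours are illustrative assumptions.
my_cmap = LinearSegmentedColormap.from_list("class_cmap", ["red", "yellow", "green"])
```

The resulting my_cmap object can then be passed as the cmap argument of plt.scatter to color points by class.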

Generate Dummy Data

Remember that we are using feedforward neural networks because we wanted to deal with non-linearly separable data. In this section, we will see how to randomly generate non-linearly separable data.

To generate data randomly, we will use make_blobs to generate blobs of points with a Gaussian distribution. I have generated 1000 data points in 2D space around four blob centers (centers=4) as a multi-class classification problem. Each data point has two inputs and a class label of 0, 1, 2 or 3. Note that the make_blobs() function generates linearly separable data, but we need non-linearly separable data for binary classification.

Multi-Class Linearly Separable Data
labels_orig = labels
labels = np.mod(labels_orig, 2)

One way to convert the 4 classes to binary classification is to take the remainder of these 4 classes when they are divided by 2 so that I can get the new labels as 0 and 1.

From the plot, we can see that the centers of the blobs are merged such that we now have a binary classification problem where the decision boundary is not linear. Once the data was ready, I used the train_test_split function to split it for training and validation in the ratio of 90:10.
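The data preparation steps described in this section can be sketched as follows. The random_state values and the use of stratify are assumptions added for reproducibility and balanced splits; they are not stated in the text:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# 1000 points in 2D around four Gaussian blob centres.
# random_state=0 is an assumption made only for reproducibility.
data, labels = make_blobs(n_samples=1000, centers=4, n_features=2, random_state=0)

# Merge the four classes into two by taking the remainder modulo 2.
labels_orig = labels
labels = np.mod(labels_orig, 2)

# 90:10 split; stratify (an assumption) keeps the class balance similar in both parts.
X_train, X_val, Y_train, Y_val = train_test_split(
    data, labels, test_size=0.1, stratify=labels, random_state=0
)

print(X_train.shape, X_val.shape)  # (900, 2) (100, 2)
```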

Feedforward Network

In this section, we will write a generic class that can generate a neural network by taking the number of hidden layers and the number of neurons in each hidden layer as input parameters.

Simple Neural Network

The network has six neurons in total — two in the first hidden layer and four in the output layer. For each of these neurons, pre-activation is represented by ‘a’ and post-activation is represented by ‘h’. In the network, we have a total of 18 parameters — 12 weight parameters and 6 bias terms.
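The parameter count can be checked with a few lines of arithmetic. This sketch assumes the layer sizes [2, 2, 4] (two inputs, two hidden neurons, four output neurons) described for this network:

```python
layer_sizes = [2, 2, 4]  # inputs, first hidden layer, output layer

# One weight per connection between consecutive layers, one bias per neuron.
n_weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
n_biases = sum(layer_sizes[1:])

print(n_weights, n_biases, n_weights + n_biases)  # 12 6 18
```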

We will write our neural network in a class called FFNetwork.

In the class FFNetwork we have 8 functions; we will go over these functions one by one.

def __init__(self, init_method = 'random', activation_function = 'sigmoid', leaky_slope = 0.1):

The __init__ function initializes all the parameters of the network, including weights and biases. The function accepts a few arguments:

  • init_method: Initialization method to be used for initializing all the parameters of the network. Supports “random”, “zeros”, “he” and “xavier”.
  • activation_function: Activation function to be used for learning a non-linear decision boundary. Supports “sigmoid”, “tanh”, “relu” and “leaky_relu”.
  • leaky_slope: Negative slope of Leaky ReLU. The default value is 0.1.

In Line 5–10, we are setting the network configuration and the activation function to be used in the network.

self.layer_sizes = [2, 2, 4]

layer_sizes represents that the network has two inputs, two neurons in the first hidden layer and four neurons in the second layer, which is also the final (output) layer in this case. After that, we have a series of “if-else” weight initialization statements. In each of them we initialize only the weights based on the chosen method; the biases are always initialized to one. The initialized values of weights and biases are stored in a dictionary, self.params.
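The initialization logic is not listed in full, so here is a minimal standalone sketch of it. The helper name init_params, the key names W1/B1 and the fan-in scaling factors for Xavier and He are assumptions chosen for illustration, not necessarily what the class does internally:

```python
import numpy as np

def init_params(layer_sizes, init_method="random", seed=0):
    """Return a dict of weights W1..Wn and biases B1..Bn (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    params = {}
    for i in range(1, len(layer_sizes)):
        n_in, n_out = layer_sizes[i - 1], layer_sizes[i]
        if init_method == "zeros":
            W = np.zeros((n_in, n_out))
        elif init_method == "random":
            W = rng.standard_normal((n_in, n_out))
        elif init_method == "xavier":
            # Assumed variant: scale by sqrt(1 / fan_in).
            W = rng.standard_normal((n_in, n_out)) * np.sqrt(1.0 / n_in)
        elif init_method == "he":
            # Assumed variant: scale by sqrt(2 / fan_in).
            W = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)
        else:
            raise ValueError(f"unknown init_method: {init_method}")
        params[f"W{i}"] = W
        # Biases are set to one regardless of the method, as described above.
        params[f"B{i}"] = np.ones((1, n_out))
    return params

params = init_params([2, 2, 4], init_method="he")
print({name: value.shape for name, value in params.items()})
```

Keeping the method as a string argument makes it easy to loop over all initialization schemes later without touching the class.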

def forward_activation(self, X):
    if self.activation_function == "sigmoid":
        return 1.0/(1.0 + np.exp(-X))
    elif self.activation_function == "tanh":
        return np.tanh(X)
    elif self.activation_function == "relu":
        return np.maximum(0,X)
    elif self.activation_function == "leaky_relu":
        return np.maximum(self.leaky_slope*X,X)

Next, we have the forward_activation function, which takes input ‘X’ as an argument and computes the post-activation value of the input depending on the choice of the activation function.

def grad_activation(self, X):

The function grad_activation also takes input ‘X’ as an argument and computes the derivative of the activation function at given input and returns it.
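The body of grad_activation is not shown, so the following is a standalone sketch of the four derivatives rather than the class method itself. The function name activation_derivative is hypothetical, it takes the pre-activation value as input, and the convention of returning 0 for ReLU at exactly x = 0 is a choice made here:

```python
import numpy as np

def activation_derivative(x, activation_function="sigmoid", leaky_slope=0.1):
    """Derivative of the activation with respect to its input x (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    if activation_function == "sigmoid":
        s = 1.0 / (1.0 + np.exp(-x))
        return s * (1.0 - s)
    if activation_function == "tanh":
        return 1.0 - np.tanh(x) ** 2
    if activation_function == "relu":
        return (x > 0).astype(float)
    if activation_function == "leaky_relu":
        return np.where(x > 0, 1.0, leaky_slope)
    raise ValueError(f"unknown activation_function: {activation_function}")

x = np.array([-1.0, 2.0])
print(activation_derivative(x, "relu"))        # [0. 1.]
print(activation_derivative(x, "leaky_relu"))  # [0.1 1. ]
```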

def forward_pass(self, X, params = None):
def grad(self, X, Y, params = None):

After that, we have two functions, forward_pass and grad. forward_pass characterizes the forward pass, which involves two steps:

  1. Pre-activation: computes the dot product between the input x and the weights w and adds the bias b.
  2. Post-activation: takes the output of the pre-activation and applies the activation function on top of it.
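The two steps can be sketched as a standalone function. This is a simplified illustration, not the class method: it applies the same activation at every layer (the real output layer may well use something different), and the W1/B1 key names are assumptions:

```python
import numpy as np

def forward_pass(X, params, n_layers, activation=np.tanh):
    """Per layer: pre-activation a = h_prev @ W + b, then post-activation h = g(a)."""
    h = X
    cache = {"H0": X}
    for i in range(1, n_layers + 1):
        a = h @ params[f"W{i}"] + params[f"B{i}"]  # pre-activation
        h = activation(a)                          # post-activation
        cache[f"A{i}"], cache[f"H{i}"] = a, h
    return h, cache

# Tiny example: zero weights and biases of one, for a [2, 2, 4] network.
params = {
    "W1": np.zeros((2, 2)), "B1": np.ones((1, 2)),
    "W2": np.zeros((2, 4)), "B2": np.ones((1, 4)),
}
X = np.array([[0.5, -1.0], [2.0, 0.3], [1.0, 1.0]])
out, cache = forward_pass(X, params, n_layers=2)
print(out.shape)  # (3, 4)
```

Caching every a and h is useful because the backward pass needs both when computing gradients.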

The grad function characterizes the gradient computation for each of the parameters present in the network and stores the results in a list called gradients. Don’t worry too much about how we arrived at the gradients, because we will be using PyTorch to do the heavy lifting; if you are interested in the derivation, go through my previous article.

def fit(self, X, Y, epochs=1, algo="GD", display_loss=False,
        eta=1, mini_batch_size=100, eps=1e-8,
        beta=0.9, beta1=0.9, beta2=0.9, gamma=0.9):

Next, we define the fit method, which takes inputs ‘X’ and ‘Y’ as mandatory arguments and a few optional arguments required for implementing the different variants of the gradient descent algorithm. Refer to my previous post for a detailed explanation of how to implement the algorithms.

def predict(self, X):

Now we define our predict function, which takes the inputs X as an argument and expects a numpy array. In the predict function, we compute the forward pass of each input with the trained model and return a numpy array containing the predicted value for each input.

Setting Up for Plotting & Train Neural Network

In this section, we define a function to evaluate the performance of the neural network and create plots to visualize the working of the update rule. This kind of setup helps us run different experiments with different activation functions and weight initialization methods, and plot the update rule for different variants of gradient descent.

First, we instantiate the feedforward network class and then call the fit method on the training data with 10 epochs and learning rate set to 1 (These values are arbitrary not the optimal values for this data, you can play around these values and find the best number of epochs and the learning rate).

Then we will call post_process function to compute the training and validation accuracy of the neural network (Line 2–11). We are also plotting the scatter plot for the input points with different sizes based on the predicted value of the neural network.

Scatter Plot

The size of each point in the plot is computed from the absolute difference between the predicted value and the actual value.

All the small points in the plot indicate that the model is predicting those observations correctly and large points indicate that those observations are incorrectly classified.
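As an illustration of this idea, point sizes could be derived as below. The constants base and scale are hypothetical values picked for this sketch, since the exact formula is not reproduced here:

```python
import numpy as np

Y_true = np.array([0, 1, 1, 0])
Y_pred = np.array([0, 1, 0, 1])  # the last two points are misclassified

# base and scale are hypothetical constants chosen for illustration only.
base, scale = 0.2, 15
sizes = scale * (np.abs(Y_pred - Y_true) + base)
print(sizes)  # [ 3.  3. 18. 18.]
```

The sizes array can then be passed as the s argument of matplotlib's scatter function, so that misclassified points stand out. The small base term keeps correctly classified points visible instead of shrinking them to size zero.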

In lines 20–29, we plot the updates each parameter receives from the network through backpropagation. Our network has 18 parameters in total, so we iterate 18 times; each time we compute the update that parameter gets and plot it using a subplot. For example, the update for weight W at the iᵗʰ epoch = Wᵢ₊₁ − Wᵢ.
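Computing the per-epoch update of a single parameter from its recorded history can be sketched like this (the weight values are made-up numbers used only to show the calculation):

```python
import numpy as np

# Value of one weight recorded after each of five epochs (made-up numbers).
weight_history = np.array([0.50, 0.42, 0.37, 0.35, 0.34])

# The update at epoch i is W_{i+1} - W_i.
updates = np.diff(weight_history)
print(updates)
```

A curve of updates that shrinks toward zero, as in this example, suggests the parameter is settling; a curve that is flat at zero from the start means the parameter never received any gradient.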

Analyze sigmoid activation

To analyze the effect of sigmoid activation function on the neural network, we will set the activation function as ‘sigmoid’ and execute the neural network class.

The loss of the network is falling even though we have run it for very few iterations. By using the post_process function, we are able to plot the 18 subplots; we have not provided any axis labels because they are not required. The 18 plots for the 18 parameters are plotted in row-major order and show the updates each parameter receives. The first 12 plots show the updates received by the weights and the last 6 show the updates received by the bias terms in the network.

In any of the subplots, if the curve stays close to the middle (zero), it indicates that the particular parameter is not getting any updates. Instead of executing each weight initialization manually, we will write a for loop to execute all possible weight initialization combinations.

for init_method in ['zeros', 'random', 'xavier', 'he']:
    for activation_function in ['sigmoid']:
        print(init_method, activation_function)
        model = FFNetwork(init_method=init_method, activation_function=activation_function)
        model.fit(X_train, y_OH_train, epochs=50, eta=1, algo="GD", display_loss=True)

In the above code, I just added two ‘for’ loops: one for the weight initialization method and another for the activation function. Once you execute the above code, you will see that the neural network tries all the weight initialization methods while keeping the activation function (sigmoid) constant.

If you observe the output of the zero weight initialization method with sigmoid, you can see that the symmetry problem occurs in the sigmoid neurons. Once we initialize the weights to zero, the weights of the neurons in a layer remain equal to one another in all subsequent iterations (they move away from zero, but they stay equal), so this symmetry never breaks during training. This phenomenon is known as the symmetry breaking problem. Because of it, we get a very low accuracy of 54%.
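The effect can be reproduced on a much smaller network than the one in this post. The sketch below is an independent toy example: the 2-2-1 architecture, the XOR-like labels, the cross-entropy loss and all variable names are assumptions. It trains with zero initialization and then checks that the two hidden neurons still have identical weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)  # XOR-like labels

# Zero initialisation: every weight and bias starts at exactly 0.
W1, b1 = np.zeros((2, 2)), np.zeros((1, 2))
W2, b2 = np.zeros((2, 1)), np.zeros((1, 1))

eta = 1.0
for _ in range(100):
    # Forward pass.
    h1 = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h1 @ W2 + b2)
    # Backward pass for cross-entropy loss with a sigmoid output.
    d_a2 = (y_hat - y) / len(X)
    d_W2 = h1.T @ d_a2
    d_b2 = d_a2.sum(axis=0, keepdims=True)
    d_a1 = (d_a2 @ W2.T) * h1 * (1.0 - h1)
    d_W1 = X.T @ d_a1
    d_b1 = d_a1.sum(axis=0, keepdims=True)
    # Vanilla gradient descent update.
    W1 -= eta * d_W1
    b1 -= eta * d_b1
    W2 -= eta * d_W2
    b2 -= eta * d_b2

# Compare the two hidden neurons: columns of W1 and rows of W2.
hidden_neurons_identical = np.allclose(W1[:, 0], W1[:, 1]) and np.allclose(W2[0], W2[1])
print(hidden_neurons_identical)  # True
```

Because both hidden neurons start identical and receive identical gradients at every step, the layer behaves like a single neuron no matter how long we train.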

In the random initialization, we can see that the problem of symmetry breaking doesn’t occur. It means that all the weights and biases take different values during the training. By using the Xavier initialization, we get the highest accuracy across the different weight initialization methods. Xavier is the recommended weight initialization method for the sigmoid and tanh activation functions.

Analyzing Tanh Activation

We will use the same code for executing the tanh activation function with different combinations of weight initialization methods by including the keyword ‘tanh’ in the second ‘for’ loop.

for activation_function in ['tanh']:

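In the zero initialization with tanh activation, from the weight update subplots, we can see that the network is hardly learning anything. In all the plots the curve stays close to zero, indicating that the parameters are not getting updates from the optimization algorithm. The reason is that tanh(0) = 0: when the pre-activations are zero, every hidden neuron outputs zero, so the gradient with respect to the next layer’s weights (which is proportional to those outputs) is zero, and the gradient flowing back through the zero weights is zero as well. Note that the derivative of tanh at x = 0 is 1, not 0, so the problem comes from the zero activations and zero weights rather than from a vanishing derivative.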
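A single backward pass on a toy network makes the zero-update behaviour visible. In this sketch the 2-2-1 architecture, the sigmoid output, the cross-entropy loss, the random data and the choice to start the biases at zero as well are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = rng.integers(0, 2, size=(50, 1)).astype(float)

# All weights and biases start at zero (an assumption of this sketch).
W1, b1 = np.zeros((2, 2)), np.zeros((1, 2))
W2, b2 = np.zeros((2, 1)), np.zeros((1, 1))

# Forward pass: tanh hidden layer, sigmoid output.
h1 = np.tanh(X @ W1 + b1)                # tanh(0) = 0 for every hidden neuron
y_hat = 1.0 / (1.0 + np.exp(-(h1 @ W2 + b2)))

# Backward pass for cross-entropy loss.
d_a2 = (y_hat - y) / len(X)
d_W2 = h1.T @ d_a2                       # zero, because h1 is all zeros
d_a1 = (d_a2 @ W2.T) * (1.0 - h1 ** 2)   # zero, because W2 is all zeros
d_W1 = X.T @ d_a1

print(np.abs(d_W1).max(), np.abs(d_W2).max())  # 0.0 0.0
```

The derivative factor 1 - h1**2 equals 1 here, so the weight gradients vanish only because the hidden outputs and the outgoing weights are zero.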

When we do Xavier initialization with tanh, we are able to get higher performance from the neural network. Just by changing the method of weight initialization we are able to get higher accuracy (86.6%).

Analyzing ReLU Activation

We will use the same code for executing the ReLU activation function with different combinations of weight initialization methods by including the keyword ‘relu’ in the second ‘for’ loop.

for activation_function in ['relu']:

Similar to tanh with zero weight initialization, we observed that setting the weights to zero doesn’t work with ReLU because the value of ReLU at zero is zero itself. As a result, gradients won’t be propagated back through the network and the network won’t learn anything. So it’s not a good idea to set the weights to zero with either tanh or ReLU.

The recommended initialization method for ReLU is He-initialization, by using He-initialization we are able to get the highest accuracy.

Analyzing Leaky ReLU Activation

We will use the same code for executing the Leaky ReLU activation function with different combinations of weight initialization methods by including the keyword ‘leaky_relu’ in the second ‘for’ loop.

for activation_function in ['leaky_relu']:

Similar to ReLU with zero weight initialization, we observed that setting the weights to zero doesn’t work with Leaky ReLU because the value of Leaky ReLU at zero is zero itself. As a result, gradients won’t be propagated back through the network and the network won’t learn anything.

Coming to random initialization, we can see that the network achieves very good accuracy but there are a lot of oscillations in the update subplots. The large oscillations might be occurring due to a large learning rate. By using He initialization, we get the highest accuracy of 92% on the test data. To avoid the large oscillations, we should set a smaller learning rate in any method of weight initialization. The recommended initialization method for Leaky ReLU is He-initialization.

There you have it, we have successfully analyzed the different combinations of weight initialization methods and activation functions.

What’s Next?


In this article, we used the make_blobs function to generate toy data, and we saw that make_blobs generates linearly separable data. If you want to generate more complex non-linearly separable data to train your feedforward neural network, you can use the make_moons function from the sklearn package.
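A minimal sketch of generating such data with make_moons is shown below. The noise level, the random_state values and the 90:10 split are assumptions chosen to mirror the setup used earlier:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# noise and random_state are illustrative assumptions.
X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)

# Same 90:10 train/validation split as used with the blob data.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=0)
print(X_train.shape, X_val.shape)  # (900, 2) (100, 2)
```

make_moons already returns binary labels, so no modulo trick is needed to obtain a two-class problem.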

You can also try changing the learning algorithm (we have been using vanilla gradient descent) to a different variant of gradient descent such as Adam or NAG, and study the impact of the learning algorithm on network performance. Using our feedforward neural network class, you can create a much deeper network with more neurons in each layer ([2,2,2,4]: two neurons each in the first 3 hidden layers and 4 neurons in the output layer) and play with the learning rate and the number of epochs to check under which parameters the neural network arrives at the best possible decision boundary.

The entire code discussed in the article is present in this GitHub repository. Feel free to fork it or download it. The best part is that you can run the code directly in Google Colab, so you don’t need to worry about installing the packages.


In this post, we briefly looked at an overview of weight initialization methods and activation functions. Then we saw how to build a generic, simple neural network class that supports different variants of gradient descent, weight initialization, and activation functions. After that, we analyzed each of the activation functions with different weight initialization methods.

Author Bio

Niranjan Kumar is Retail Risk Analyst at HSBC Analytics division. He is passionate about Deep learning and Artificial Intelligence. He is one of the top writers at Medium in Artificial Intelligence. You can find all of Niranjan’s blog here. You can connect with Niranjan on LinkedIn, Twitter and GitHub to stay up to date with his latest blog posts.

I am looking for opportunities either full-time or freelance projects, in the field of Machine Learning and Deep Learning. If there are any relevant opportunities, feel free to drop me a message on LinkedIn or you can reach me through email as well. I would love to discuss.

Source: Artificial Intelligence on Medium
