## Blog: Implementing different Activation Functions and Weight Initialization Methods Using Python

In this post, we will discuss how to implement different combinations of non-linear activation functions and weight initialization methods in Python. We will also analyze how the choice of activation function and weight initialization method affects the accuracy and the rate at which loss decreases in a deep neural network, using a non-linearly separable toy data set. This is a follow-up to my previous post on activation functions and weight initialization methods.

Note: This article assumes that the reader has a basic understanding of neural networks, weights, biases, and backpropagation. If you want to learn the basics of feed-forward neural networks, check out my previous article (link at the end of this article).

Citation Note: The content and the structure of this article is based on the deep learning lectures from One-Fourth Labs —PadhAI.

### Activation Functions Overview

The activation function is the non-linear function that we apply over the input data coming to a particular neuron and the output from the function will be sent to the neurons present in the next layer as input.

Even if we use very very deep neural networks without the non-linear activation function, we will just learn the ‘**y**’ as a linear transformation of ‘**x**’. It can only represent linear relations between ‘**x**’ and ‘**y**’. In other words, we will be constrained to learning linear decision boundaries and we can’t learn any arbitrary non-linear decision boundaries. This is why we need activation functions — non-linear activation function to learn the complex non-linear relationship between input and the output.
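To make this concrete, here is a tiny numpy sketch (all names illustrative) showing that stacking linear layers without activations collapses into a single linear map:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 2))   # "layer 1" weights
W2 = rng.standard_normal((1, 3))   # "layer 2" weights
x = rng.standard_normal(2)

deep = W2 @ (W1 @ x)               # two linear layers, no activation...
collapsed = (W2 @ W1) @ x          # ...equal one single linear layer

print(np.allclose(deep, collapsed))  # True
```

No matter how many such layers we stack, the composition is still one matrix, hence only linear decision boundaries.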

Some of the commonly used activation functions:

- Logistic
- Tanh
- ReLU
- Leaky ReLU

### Weight Initialization Overview

When we train deep neural networks, weights and biases are usually initialized with random values. In the process, we might run into problems such as vanishing or exploding gradients, and as a result the network can take a long time to converge (if it converges at all). The most commonly used weight initialization methods are:

- Xavier Initialization
- He Initialization

To understand the intuition behind the most commonly used activation functions and weight initialization methods, kindly refer to my previous post on activation functions and weight initialization methods.

**Deep Learning Best Practices: Activation Functions & Weight Initialization Methods — Part 1**

*Best Activation functions & Weight Initialization Methods for better accuracy* (medium.com)

### Let’s Code

In the coding section, we will be covering the following topics.

- Generate data that is not linearly separable
- Write a feedforward network class
- Setup code for plotting
- Analyze sigmoid activation
- Analyze tanh activation
- Analyze ReLU activation
- Analyze Leaky ReLU activation

#### Overview

In this section, we will compare the accuracy of a simple feedforward neural network by trying out various combinations of activation functions and weight initialization methods.

The way we do that is: first we generate non-linearly separable data with two classes, then write a simple feedforward neural network that supports all the activation functions and weight initialization methods. Finally, we compare the different scenarios using loss plots.

If you want to skip the theory part and get into the code right away, head over to my GitHub repository:

**Niranjankumar-c/DeepLearning-PadhAI**

*All the code files related to the deep learning course from PadhAI – Niranjankumar-c/DeepLearning-PadhAI* (github.com)

#### Import Libraries

First, we import the libraries required for the implementation: `numpy` for the network computations, `matplotlib` for plotting, and `sklearn` for data generation and splitting.

In line 19, we are creating a custom color map from a list of colors by using the `from_list()` method of `LinearSegmentedColormap`.
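A minimal sketch of that call (the post's exact color list is not shown, so the colors here are illustrative):

```python
from matplotlib.colors import LinearSegmentedColormap

# Build a colormap that interpolates through the listed colors
my_cmap = LinearSegmentedColormap.from_list("my_cmap", ["red", "yellow", "green"])

print(my_cmap.N)     # 256 levels by default
print(my_cmap(0.0))  # RGBA at the low end: (1.0, 0.0, 0.0, 1.0)
```

The resulting `my_cmap` can be passed to any matplotlib plotting function via the `cmap` argument.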

### Generate Dummy Data

Remember that we are using feedforward neural networks because we wanted to deal with non-linearly separable data. In this section, we will see how to randomly generate non-linearly separable data.

To generate data randomly we will use `make_blobs` to generate blobs of points with a Gaussian distribution. I have generated 1000 data points in 2D space with four blobs (`centers=4`) as a multi-class classification prediction problem. Each data point has two inputs and a class label of 0, 1, 2 or 3. Note that the `make_blobs()` function generates linearly separable data, but we need non-linearly separable data for binary classification.

```python
labels_orig = labels
labels = np.mod(labels_orig, 2)
```

One way to convert the four classes into a binary classification problem is to take the remainder when the labels are divided by 2, which gives new labels of 0 and 1.
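The data generation and label conversion can be reproduced end to end as follows (`n_samples=1000`, `n_features=2` and `centers=4` come from the post; `random_state` is my own choice for reproducibility):

```python
import numpy as np
from sklearn.datasets import make_blobs

# 1000 points in 2-D, four Gaussian blobs -> labels 0, 1, 2, 3
data, labels_orig = make_blobs(n_samples=1000, centers=4,
                               n_features=2, random_state=0)

# Remainder mod 2 collapses the four classes into two
labels = np.mod(labels_orig, 2)
print(np.unique(labels))  # [0 1]
```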

From the plot, we can see that the centers of the blobs are merged such that we now have a binary classification problem where the decision boundary is not linear. Once the data is ready, I have used the `train_test_split` function to split it into `training` and `validation` sets in the ratio of 90:10.
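A minimal sketch of that 90:10 split (variable names and the `stratify`/`random_state` arguments are my own; the post's exact call may differ):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

data, labels = make_blobs(n_samples=1000, centers=4, random_state=0)
labels = np.mod(labels, 2)

# 90:10 split; stratify keeps the two classes balanced in both sets
X_train, X_val, Y_train, Y_val = train_test_split(
    data, labels, test_size=0.1, stratify=labels, random_state=0)
print(X_train.shape, X_val.shape)  # (900, 2) (100, 2)
```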

### Feedforward Network

In this section, we will write a generic class where it can generate a neural network, by taking the number of hidden layers and the number of neurons in each hidden layer as input parameters.

The network has six neurons in total — two in the first hidden layer and four in the output layer. For each of these neurons, pre-activation is represented by ‘a’ and post-activation is represented by ‘h’. In the network, we have a total of 18 parameters — 12 weight parameters and 6 bias terms.

We will write our neural network in a class called **FFNetwork**. The class has 8 functions; we will go over them one by one.

```python
def __init__(self, init_method = 'random', activation_function = 'sigmoid', leaky_slope = 0.1):
    ......
```

The `__init__` function initializes all the parameters of the network, including weights and biases. It accepts a few arguments:

- `init_method`: Initialization method to be used for initializing all the parameters of the network. Supports “random”, “zeros”, “He” and “Xavier”.
- `activation_function`: Activation function to be used for learning the non-linear decision boundary. Supports “sigmoid”, “tanh”, “relu” and “leaky_relu”.
- `leaky_slope`: Negative slope of Leaky ReLU. Default value is 0.1.

In lines 5–10, we set the network configuration and the activation function to be used in the network.

```python
self.layer_sizes = [2, 2, 4]
```

`layer_sizes` indicates that the network has two inputs, two neurons in the first hidden layer, and four neurons in the second hidden layer, which is also the final layer in this case. After that, we have a bunch of “if-else” weight initialization statements; each statement initializes only the weights based on the method of choice, while the biases are always initialized to the value one. The initialized values of weights and biases are stored in a dictionary `self.params`.
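As a hedged sketch of those if-else statements, here is a hypothetical stand-alone helper (not the post's exact code) that builds such a parameter dictionary:

```python
import numpy as np

def init_params(layer_sizes, method="xavier", seed=0):
    # layer_sizes, e.g. [2, 2, 4]: inputs, hidden neurons, output neurons
    rng = np.random.default_rng(seed)
    params = {}
    for i in range(len(layer_sizes) - 1):
        fan_in, fan_out = layer_sizes[i], layer_sizes[i + 1]
        if method == "zeros":
            W = np.zeros((fan_in, fan_out))
        elif method == "random":
            W = rng.standard_normal((fan_in, fan_out))
        elif method == "xavier":
            # Xavier: variance scaled by 1 / fan_in
            W = rng.standard_normal((fan_in, fan_out)) * np.sqrt(1.0 / fan_in)
        elif method == "he":
            # He: variance scaled by 2 / fan_in, suited to ReLU-family activations
            W = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)
        else:
            raise ValueError(method)
        params["W" + str(i + 1)] = W
        # Biases are initialized to one, as described above
        params["B" + str(i + 1)] = np.ones((1, fan_out))
    return params

params = init_params([2, 2, 4], method="he")
print(sorted(params))  # ['B1', 'B2', 'W1', 'W2']
```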

```python
def forward_activation(self, X):
    if self.activation_function == "sigmoid":
        return 1.0/(1.0 + np.exp(-X))
    elif self.activation_function == "tanh":
        return np.tanh(X)
    elif self.activation_function == "relu":
        return np.maximum(0, X)
    elif self.activation_function == "leaky_relu":
        return np.maximum(self.leaky_slope*X, X)
```

Next, we have the `forward_activation` function, which takes input ‘X’ as an argument and computes the post-activation value depending on the choice of activation function.

```python
def grad_activation(self, X):
    ......
```

The function `grad_activation` also takes input ‘X’ as an argument, computes the derivative of the activation function at that input, and returns it.
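A hedged sketch of what that body typically looks like (assuming, as is common in from-scratch implementations, that for sigmoid/tanh it is called on post-activation values):

```python
import numpy as np

def grad_activation(X, activation_function="sigmoid", leaky_slope=0.1):
    # For sigmoid/tanh, X is assumed to be the *post-activation* value,
    # so the derivative is written directly in terms of it.
    if activation_function == "sigmoid":
        return X * (1 - X)              # s'(a) = s(a) * (1 - s(a))
    elif activation_function == "tanh":
        return 1 - np.square(X)         # tanh'(a) = 1 - tanh(a)^2
    elif activation_function == "relu":
        return 1.0 * (X > 0)            # 1 where X > 0, else 0
    elif activation_function == "leaky_relu":
        return np.where(X > 0, 1.0, leaky_slope)
    raise ValueError(activation_function)

print(grad_activation(np.array([0.5]), "sigmoid"))  # [0.25]
```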

```python
def forward_pass(self, X, params = None):
    .......

def grad(self, X, Y, params = None):
    .......
```

After that, we have two functions: `forward_pass`, which characterizes the forward pass, and `grad`, which computes the gradients. The forward pass involves two steps:

- Pre-activation — computes the dot product between the input **x** & weights **w** and adds bias **b**.
- Post-activation — takes the pre-activation output and applies the activation function on top of it.

The `grad` function characterizes the gradient computation for each of the parameters present in the network and stores them in a list called `gradients`. Don’t worry too much about how we arrived at the gradients, because we will be using PyTorch to do the heavy lifting; if you are interested in learning more, go through my previous article.

**Building a Feedforward Neural Network from Scratch in Python**

*Build your first generic feed forward neural network without any framework* (hackernoon.com)

The `fit` function trains the network; it supports multiple variants of gradient descent through the `algo` parameter and the associated hyper-parameters:

```python
def fit(self, X, Y, epochs=1, algo= "GD", display_loss=False,
        eta=1, mini_batch_size=100, eps=1e-8,
        beta=0.9, beta1=0.9, beta2=0.9, gamma=0.9):
```

**Implementing different variants of Gradient Descent Optimization Algorithm in Python using Numpy**

*Learn how tensorflow or pytorch implement optimization algorithms but using numpy and create beautiful animations using…* (hackernoon.com)

Finally, the `predict` function takes the input `X` and returns the predictions of the network.

```python
def predict(self, X):
    ......
```

### Setting Up for Plotting & Train Neural Network

The size of each point in the plot is given by the formula:

```python
s = 15*(np.abs(Y_pred_binarised_train - Y_train) + .2)
```

The formula takes the absolute difference between the predicted value and the actual value:

- If the ground truth is equal to the predicted value, then size = 3.
- If the ground truth is not equal to the predicted value, then size = 18.
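A quick check of those two cases on a toy label vector (the arrays here are made up for illustration):

```python
import numpy as np

Y_train = np.array([0, 1, 1, 0])                  # ground truth
Y_pred_binarised_train = np.array([0, 1, 0, 0])   # one wrong prediction

s = 15 * (np.abs(Y_pred_binarised_train - Y_train) + .2)
print(s)  # [ 3.  3. 18.  3.]
```

Misclassified points get size 18 and stand out clearly against the correctly classified points of size 3.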

### Analyze sigmoid activation

```python
for init_method in ['zeros', 'random', 'xavier', 'he']:
    for activation_function in ['sigmoid']:
        print(init_method, activation_function)
        model = FFNetwork(init_method=init_method, activation_function=activation_function)
        model.fit(X_train, y_OH_train, epochs=50, eta=1, algo="GD", display_loss=True)
        post_process(plot_scale=0.05)
        print('\n--\n')
```

### Analyzing Leaky ReLU Activation

To analyze Leaky ReLU, we run the same loop as for sigmoid, changing only the inner loop:

```python
for activation_function in ['leaky_relu']:
```

### What’s Next?

All the code discussed in this article is available in the *Niranjankumar-c/DeepLearning-PadhAI* repository on GitHub (linked above). Feel free to clone it and experiment with other combinations of activation functions and weight initialization methods.

### Conclusion

Niranjan Kumar is a Retail Risk Analyst in the HSBC Analytics division. He is passionate about deep learning and artificial intelligence, and is one of the top writers on Medium in Artificial Intelligence. You can find all of Niranjan’s blog posts here, and connect with him on LinkedIn, Twitter and GitHub to stay up to date with his latest posts.

**I am looking for opportunities, either full-time or freelance projects, in the field of Machine Learning and Deep Learning. If there are any relevant opportunities, feel free to drop me a message on LinkedIn, or you can reach me through email as well. I would love to discuss.**

*Source: Artificial Intelligence on Medium*