### Blog: Implementing different variants of Gradient Descent Optimization Algorithm in Python using Numpy

#### In-Depth Analysis

## Learn how tensorflow or pytorch implement optimization algorithms by using numpy and create beautiful animations using matplotlib

In this post, we will discuss how to implement different variants of gradient descent optimization technique and also visualize the working of the update rule for these variants using matplotlib. This is a follow-up to my previous post on optimization algorithms.

Citation Note: The content and the structure of this article is based on the deep learning lectures from One-Fourth Labs — PadhAI.

Gradient Descent is one of the most commonly used optimization techniques to optimize neural networks. Gradient descent algorithm updates the parameters by moving in the direction opposite to the gradient of the objective function with respect to the network parameters.

**Demystifying Different Variants of Gradient Descent Optimization Algorithm**

*Learn different improvements made to gradient descent and compare their update rule using 2D Contour plots.*hackernoon.com

### Implementation In Python Using Numpy

In the coding part, we will be covering the following topics.

**Sigmoid Neuron Class****Overall setup — What is the data, model, task****Plotting functions — 3D & contour plots****Individual algorithms and how they perform**

Individual Algorithms include,

**Batch Gradient Descent****Momentum GD****Nesterov Accelerated GD****Mini Batch and Stochastic GD****AdaGrad GD****RMSProp GD****Adam GD**

If you want to skip the theory part and get into the code right away,

**Niranjankumar-c/GradientDescent_Implementation**

*Implement different variants of gradient descent in python using numpy – Niranjankumar-c/GradientDescent_Implementation*github.com

### Sigmoid Neuron Implementation

#### Sigmoid Neuron Recap

#### Learning Algorithm

#### Sigmoid Neuron Class

#constructor

def __init__(self, w_init, b_init, algo):

self.w = w_init

self.b = b_init

self.w_h = []

self.b_h = []

self.e_h = []

self.algo = algo

`w_init,b_init`

— These parameters take the initial values for the parameters ‘**w**’ and ‘**b**’ instead of setting parameters randomly, we are setting it to specific values. This allows us to understand how an algorithm performs by visualizing for different initial points. Some algorithms get stuck in local minima at some parameters.`algo`

— It tells which variant of gradient descent algorithm to use for finding the optimal parameters.

def sigmoid(self, x, w=None, b=None):

if w is None:

w = self.w

if b is None:

b = self.b

return 1. / (1. + np.exp(-(w*x + b)))

def error(self, X, Y, w=None, b=None):

if w is None:

w = self.w

if b is None:

b = self.b

err = 0

for x, y in zip(X, Y):

err += 0.5 * (self.sigmoid(x, w, b) - y) ** 2

return err

def grad_w(self, x, y, w=None, b=None):

.....

def grad_b(self, x, y, w=None, b=None):

.....

def fit(self, X, Y, epochs=100, eta=0.01, gamma=0.9, mini_batch_size=100, eps=1e-8,beta=0.9, beta1=0.9, beta2=0.9):

self.w_h = []

.......

def append_log(self):

self.w_h.append(self.w)

self.b_h.append(self.b)

self.e_h.append(self.error(self.X, self.Y))

### Setting Up for Plotting

### 3D & 2D Plotting Setup

Similar to the 3D plot, we can create a function to plot 2D contour plots.

### Algorithm Implementation

### Vanilla Gradient Descent

Parameter update rule will be given by,

for i in range(epochs):

dw, db = 0, 0

for x, y in zip(X, Y):

dw += self.grad_w(x, y)

db += self.grad_b(x, y)

self.w -= eta * dw / X.shape[0]

self.b -= eta * db / X.shape[0]

self.append_log()

To execute the gradient descent algorithm change the configuration settings as shown below.

### Momentum-based Gradient Descent

The code for the Momentum GD is given below,

v_w, v_b = 0, 0

for i in range(epochs):

dw, db = 0, 0

for x, y in zip(X, Y):

dw += self.grad_w(x, y)

db += self.grad_b(x, y)

v_w = gamma * v_w + eta * dw

v_b = gamma * v_b + eta * db

self.w = self.w - v_w

self.b = self.b - v_b

self.append_log()

X = np.asarray([0.5, 2.5])

Y = np.asarray([0.2, 0.9])

algo = 'Momentum'

w_init = -2

b_init = -2

w_min = -7

w_max = 5

b_min = -7

b_max = 5

epochs = 1000

mini_batch_size = 6

gamma = 0.9

eta = 1

animation_frames = 20

plot_2d = True

plot_3d = True

### Nesterov Accelerated Gradient Descent

The code for the Momentum GD is given below,

v_w, v_b = 0, 0

for i in range(epochs):

dw, db = 0, 0

v_w = gamma * v_w

v_b = gamma * v_b

for x, y in zip(X, Y):

dw += self.grad_w(x, y, self.w - v_w, self.b - v_b)

db += self.grad_b(x, y, self.w - v_w, self.b - v_b)

v_w = v_w + eta * dw

v_b = v_b + eta * db

self.w = self.w - v_w

self.b = self.b - v_b

self.append_log()

v_w = gamma * v_w

v_b = gamma * v_b

for x, y in zip(X, Y):

dw += self.grad_w(x, y, self.w - v_w, self.b - v_b)

db += self.grad_b(x, y, self.w - v_w, self.b - v_b)

v_w = v_w + eta * dw

v_b = v_b + eta * db

### Mini-Batch and Stochastic Gradient Descent

The code for the Mini-Batch GD is given below,

for i in range(epochs):

dw, db = 0, 0

points_seen = 0

for x, y in zip(X, Y):

dw += self.grad_w(x, y)

db += self.grad_b(x, y)

points_seen += 1

if points_seen % mini_batch_size == 0:

self.w -= eta * dw / mini_batch_size

self.b -= eta * db / mini_batch_size

self.append_log()

dw, db = 0, 0

### AdaGrad Gradient Descent

The code for Adagrad is given below,

v_w, v_b = 0, 0

for i in range(epochs):

dw, db = 0, 0

for x, y in zip(X, Y):

dw += self.grad_w(x, y)

db += self.grad_b(x, y)

v_w += dw**2

v_b += db**2

self.w -= (eta / np.sqrt(v_w) + eps) * dw

self.b -= (eta / np.sqrt(v_b) + eps) * db

self.append_log()

### RMSProp Gradient Descent

The code for RMSProp is given below,

v_w, v_b = 0, 0

for i in range(epochs):

dw, db = 0, 0

for x, y in zip(X, Y):

dw += self.grad_w(x, y)

db += self.grad_b(x, y)

v_w = beta * v_w + (1 - beta) * dw**2

v_b = beta * v_b + (1 - beta) * db**2

self.w -= (eta / np.sqrt(v_w) + eps) * dw

self.b -= (eta / np.sqrt(v_b) + eps) * db

self.append_log()

### Adam Gradient Descent

The code for Adam GD is given below,

v_w, v_b = 0, 0

m_w, m_b = 0, 0

num_updates = 0

for i in range(epochs):

dw, db = 0, 0

for x, y in zip(X, Y):

dw = self.grad_w(x, y)

db = self.grad_b(x, y)

num_updates += 1

m_w = beta1 * m_w + (1-beta1) * dw

m_b = beta1 * m_b + (1-beta1) * db

v_w = beta2 * v_w + (1-beta2) * dw**2

v_b = beta2 * v_b + (1-beta2) * db**2

m_w_c = m_w / (1 - np.power(beta1, num_updates))

m_b_c = m_b / (1 - np.power(beta1, num_updates))

v_w_c = v_w / (1 - np.power(beta2, num_updates))

v_b_c = v_b / (1 - np.power(beta2, num_updates))

self.w -= (eta / np.sqrt(v_w_c) + eps) * m_w_c

self.b -= (eta / np.sqrt(v_b_c) + eps) * m_b_c

self.append_log()

To execute the Adam gradient descent algorithm change the configuration settings as shown below.

X = np.asarray([3.5, 0.35, 3.2, -2.0, 1.5, -0.5])

Y = np.asarray([0.5, 0.50, 0.5, 0.5, 0.1, 0.3])

algo = 'Adam'

w_init = -6

b_init = 4.0

w_min = -7

w_max = 5

b_min = -7

b_max = 5

epochs = 200

gamma = 0.9

eta = 0.5

eps = 1e-8

animation_frames = 20

plot_2d = True

plot_3d = False

### What’s Next?

**Niranjankumar-c/GradientDescent_Implementation**

*Implement different variants of gradient descent in python using numpy – Niranjankumar-c/GradientDescent_Implementation*github.com

### Conclusion

**Demystifying Different Variants of Gradient Descent Optimization Algorithm**

*Learn different improvements made to gradient descent and compare their update rule using 2D Contour plots.*hackernoon.com

**Building a Feedforward Neural Network from Scratch in Python**

*Build your first generic feed forward neural network without any framework*hackernoon.com

In my next post, we will discuss different activation functions such as logistic, ReLU, LeakyReLU, etc… and some best initialization techniques like Xavier and He initialization. So make sure you follow me on medium to get notified as soon as it drops.

Until then Peace :)

NK.

Niranjan Kumar is an intern at HSBC Analytics division. He is passionate about deep learning and AI. He is one of the top writers at Medium in Artificial Intelligence. Connect with me on LinkedIn or follow me on twitter for updates about upcoming articles on deep learning and Artificial Intelligence.

## Leave a Reply