Blog: Code a Deep Neural Net From Scratch in Python
Get a feel for what deep learning frameworks like PyTorch and TensorFlow really do! This blog post isn’t about the what and why; it’s about the how!
Have you watched Iron Man? Well, I haven’t watched the whole Marvel series yet, but I noticed one thing about Iron Man’s Arc Reactor: at first it had a palladium core, which was poisoning him… Right?
So what he did next was create a new element that didn’t poison him, and that made all the difference!
I just mean to say that in order to create something new, we should know how to build things from scratch. And for that, we need to know what’s going on under the hood ;)
Deep Learning is such an exciting field. So, I decided to code a neural net from scratch and write the blog post that you’re reading right now. At the end, I’ll add my favorite resources on Deep Learning.
Okay, So Let’s Start
We’ve built neural nets with high-level frameworks like Keras, PyTorch, and TensorFlow, but knowing how things work under the hood gives us wings. It will help us develop more awesome networks and hacks in the future.
Before starting with the code, I’ll share some basics. If you just want the code, get it here.
I’ll be going step by step. You can’t build a great building on a weak foundation.
I understand that not everyone wants to read the whole blog, which is why it’s split into sub-parts. Want to read about a specific thing? Jump to that part from the Table of Contents.
Table of Contents
The magic of deep learning literally comprises two steps:
- Forward Propagation
- Backward Propagation
In layman’s terms, we first propagate forward and compute how unhappy we are with the results (the unhappiness is the loss). We then move backward and update the weights in order to become happy!
Initializing the Net
For example, in the image on the left, we have 4 neurons in the first hidden layer.
On the right, we have a weight matrix. Every neuron is connected to all of the available inputs, and each connection gets its own weight (random at the start). Those weights are what the matrix shows.
Before constructing anything, its blueprint (the architecture) is prepared first.
Neural Network Architecture includes the following:
Input Dimension, Output Dimension, Activation Functions for all the layers.
The above architecture has 3 things to ponder:
- Input Layer, Hidden Layer(s), Output Layer.
The circles in each layer from hidden layer 1 onwards are neurons stacked together. You’ll find this arrangement everywhere because it’s what lets us make use of vectorization.
Okay, so we’ll need this architecture before constructing our layers. For that we’ll use a list holding the details of each layer. For example, see the code below!
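Here’s a minimal sketch of what such a list could look like. The key names (`input_dim`, `output_dim`, `activation`) and the layer sizes are my assumptions for illustration, so treat it as one possible blueprint rather than the only way to write it.

```python
# Hypothetical blueprint: one dict per layer.
# Key names and layer sizes are illustrative assumptions.
architecture = [
    {"input_dim": 2, "output_dim": 4, "activation": "relu"},     # hidden layer 1
    {"input_dim": 4, "output_dim": 4, "activation": "relu"},     # hidden layer 2
    {"input_dim": 4, "output_dim": 1, "activation": "sigmoid"},  # output layer
]

# The output size of one layer must match the input size of the next.
for prev_layer, next_layer in zip(architecture, architecture[1:]):
    assert prev_layer["output_dim"] == next_layer["input_dim"]
```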
Input Dim, Output Dim, and Activation Function | Know these terms?
The architecture above describes the neural net you’ve seen in the image, and you’ll find all of these terms in every neural net. There are many ways to define the structure of your net; I’m just showing you one of them.
What the above gist is saying is:
The output of one layer is the input to the next. Check the dimensions again :)
- Input Dimension: The dimension of the vector fed into a layer. For the first layer, this is the dimension of the input vector to our model.
- Output Dimension: The dimension of the vector a layer produces. For the final layer, this is the output of the network, i.e. the predicted class.
- Activation Function: This introduces non-linearities, which let our net learn complex patterns. There are many activation functions out there, but the most commonly used ones are ReLU, Sigmoid, and Tanh.
So, now we’ve got the architecture, but what’s next?
Let me show you a snippet that might make you want to ask more questions.
Here’s what’s going on in the above snippet.
Well, we are initializing the weights w and biases b for every layer in the architecture and saving them so that we can access them later on.
We’ll need these values during the forward pass and the backward pass, and also when updating the parameters.
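A rough sketch of this initialization step could look like the following. The function name `init_params`, the 0.1 scale, and the layer dict keys are my assumptions; the `'W' + str(layer_id)` and `'b' + str(layer_id)` keys follow the naming used in the update code later in this post.

```python
import numpy as np

def init_params(architecture, seed=0):
    """Naive random initialization: small Gaussian weights and biases."""
    rng = np.random.default_rng(seed)
    params = {}
    for layer_id, layer in enumerate(architecture, 1):
        n_in, n_out = layer["input_dim"], layer["output_dim"]
        # One row per neuron in this layer, one column per incoming value.
        params["W" + str(layer_id)] = rng.standard_normal((n_out, n_in)) * 0.1
        params["b" + str(layer_id)] = rng.standard_normal((n_out, 1)) * 0.1
    return params

# Illustrative architecture (sizes are assumptions).
architecture = [
    {"input_dim": 2, "output_dim": 4, "activation": "relu"},
    {"input_dim": 4, "output_dim": 1, "activation": "sigmoid"},
]
params = init_params(architecture)
for name, value in params.items():
    print(name, value.shape)
```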
But Why Random Weights?
# Weight Initialization
Well, Here’s a deal
Yes, we’ve initialized random weights because this breaks the symmetry, so no two neurons perform the same computation. This helps our network learn properly and yields greater accuracy, although accuracy is not always the main concern.
There are many ways of initializing weights, and the one you’re seeing above is a naive way. Why?
In deep networks, the gradients can die out during backpropagation if we initialize the weights like above, and hence it is naive!
I’m just letting you think and research!
There are several methods of initializing weights that help our model learn faster and more efficiently:
- Xavier initialization
- He-et-al Initialization
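To give a feel for the two schemes above, here is a small sketch using their normal-distribution variants: Xavier scales the weights by sqrt(2 / (n_in + n_out)) and He by sqrt(2 / n_in). The function names and layer sizes are hypothetical, picked just for this example.

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    """Xavier (Glorot) normal init, usually paired with tanh or sigmoid layers."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.standard_normal((n_out, n_in)) * std

def he_init(n_in, n_out, rng):
    """He normal init, usually paired with ReLU layers."""
    std = np.sqrt(2.0 / n_in)
    return rng.standard_normal((n_out, n_in)) * std

rng = np.random.default_rng(0)
W_xavier = xavier_init(256, 128, rng)
W_he = he_init(256, 128, rng)
print(W_xavier.std(), W_he.std())
```

The point of both is to keep the scale of the signal roughly constant from layer to layer, so the gradients are less likely to vanish or blow up.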
Okay, we’ve initialized the weights. Before proceeding further, let’s walk through some basic terminology.
Activation Function, The Nonlinearities!
Activation functions are very important for a neural network to learn and understand complex patterns. There are many types of activation functions available; by far the most commonly used one is ReLU.
I’ll add a link to my favorite article on activation functions in the resources section.
To keep it simple, the neural network that we’re going to build will use ReLU in the hidden layers and a sigmoid at the end. Since we’re solving a binary classification problem, a sigmoid is enough.
Let’s have a sneak-peek at sigmoid and ReLU.
- Sigmoid is nothing but 1/(1+np.exp(-z))
- ReLU is just max(0, z)
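In NumPy, the two functions above can be sketched like this (the function names are my own choice). Note that `np.maximum` is used instead of Python’s `max` so that it works element-wise on arrays.

```python
import numpy as np

def sigmoid(z):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Keeps positive values and replaces negative ones with 0."""
    return np.maximum(0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))
print(relu(z))
```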
Neurons, The Nodes!
These circles are nodes. Each node performs two operations: a weighted sum of its inputs, followed by an activation. Let’s look inside a node!
- w for weight matrix
- X is input vector (input to node)
- Sigma is summation
- f is the Activation Function
Think of it like this,
What we’re trying to build at each node is something like a switch (like a neuron…) that turns on and off, depending on whether or not it should let the input signal pass through to affect the ultimate decisions of the network.
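Here’s a tiny sketch of a single node doing exactly that, with made-up numbers (the inputs, weights, and bias are assumptions for illustration). With ReLU as the activation, this particular node stays "off" because its weighted sum is negative.

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])   # input vector to the node
w = np.array([0.2, 0.4, -0.1])   # one weight per input
b = 0.1                          # bias

z = np.dot(w, x) + b             # weighted sum: 0.1 - 0.4 - 0.2 + 0.1 = -0.4
a = max(0.0, z)                  # ReLU: a negative sum gives 0, so the switch is off
print(z, a)
```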
From what you’ve learned above…
For each node of a single layer, input from each node of the previous layer is recombined with input from every other node. That is, the inputs are mixed in different proportions, according to their weights, which are different leading into each node of the subsequent layer.
Once they are summed up, they are passed through the nonlinear activation function. The whole network works in a similar way and produces the result for our problem, which is completely random at the start.
The above process continues until we reach the end of the network, i.e. the output layer.
Now, here’s code for
The Forward propagation
For simplicity, I’m showing snippets that are the building blocks of our neural network. The whole code can be found here.
We’ll be caching values that we’ll need at the time of backpropagation. If you don’t get why we’re doing this, just wait; you’ll find out pretty soon.
Since forward propagation is similar in every layer, there’s a helper function that just performs
z = w.x + b followed by
a = g(z) #g is activation function
So, there’s one function that loops over all the layers, and one function that does the forward propagation for a single layer. Note that we’re caching values, which will be used later.
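The two functions described above could be sketched roughly as follows. The function names, the cache keys (`a0`, `z1`, ...), and the architecture sizes are my assumptions, so read it as one way to arrange the forward pass rather than the exact code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

ACTIVATIONS = {"relu": relu, "sigmoid": sigmoid}

def single_layer_forward(a_prev, w, b, activation):
    """z = w.x + b followed by a = g(z) for one layer."""
    z = np.dot(w, a_prev) + b
    return ACTIVATIONS[activation](z), z

def forward(x, params, architecture):
    """Loop over all layers, caching what backpropagation will need."""
    cache = {}
    a_curr = x
    for layer_id, layer in enumerate(architecture, 1):
        a_prev = a_curr
        a_curr, z_curr = single_layer_forward(
            a_prev,
            params["W" + str(layer_id)],
            params["b" + str(layer_id)],
            layer["activation"],
        )
        cache["a" + str(layer_id - 1)] = a_prev  # input to this layer
        cache["z" + str(layer_id)] = z_curr      # pre-activation of this layer
    return a_curr, cache

# Illustrative setup: 2 features, 5 examples stored as columns.
architecture = [
    {"input_dim": 2, "output_dim": 4, "activation": "relu"},
    {"input_dim": 4, "output_dim": 4, "activation": "relu"},
    {"input_dim": 4, "output_dim": 1, "activation": "sigmoid"},
]
rng = np.random.default_rng(0)
params = {}
for layer_id, layer in enumerate(architecture, 1):
    shape = (layer["output_dim"], layer["input_dim"])
    params["W" + str(layer_id)] = rng.standard_normal(shape) * 0.1
    params["b" + str(layer_id)] = np.zeros((layer["output_dim"], 1))

x = rng.standard_normal((2, 5))
y_hat, cache = forward(x, params, architecture)
print(y_hat.shape)
```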
That’s all for the forward pass. Next is the Loss Function!
So, what is a loss function? I’m just giving you an idea here.
A loss function is something that lets us see how good or bad our model is, i.e. how good or bad our model’s parameters are.
There are many kinds of loss functions, and we choose the one that suits the nature of the problem.
Want to know more about loss functions and their types? I’ve added a link in the resources, so check it out.
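As one concrete example, a common choice for binary classification with a sigmoid output is binary cross-entropy. I’m assuming that loss here purely for illustration; the function name and the sample numbers are made up.

```python
import numpy as np

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Mean binary cross-entropy between predictions y_hat and labels y."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

y = np.array([[1, 0, 1, 1]])
good_predictions = np.array([[0.9, 0.1, 0.8, 0.7]])
bad_predictions = np.array([[0.2, 0.9, 0.3, 0.1]])

# Good predictions make us less "unhappy" than bad ones.
print(binary_cross_entropy(good_predictions, y))
print(binary_cross_entropy(bad_predictions, y))
```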
Next is Optimization Algorithms
Optimization algorithms require a whole new blog post. If you want to dive deep into them, check out the resources at the end; I’ve added a good blog post on this topic. It’ll be a good refresher.
We’ll be using the evergreen Gradient Descent in our neural net. I’ll add code for other algorithms from scratch too, in another blog post focusing just on optimization algorithms.
Here comes my favorite thing…
Let me tell you, backpropagation is a little bit tricky but the fun lies in the process of understanding it. I’m adding some resources from where I learned about it.
Backpropagation is beautiful!
Let’s take an example, say a NAND gate. It outputs False (0) only when both of its inputs are True (1); for every other input, it outputs True (1).
So, if we build an ANN to have this functionality and provide 0, 0 as inputs, we expect the output to be 1. But say our network outputs 0 (False).
We’ll backtrack and adjust the weights of all the connections so that the network is now more likely to output the desired result. We do this until our loss is minimized.
So, backpropagation is basically modifying the weights between neurons so that the next time we get this input, the network is more likely to give the correct result.
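To make the NAND example concrete, here’s a minimal sketch with a single sigmoid neuron trained by gradient descent. The learning rate, the number of iterations, and the binary cross-entropy loss are my assumptions. Zero initialization is fine here only because there is a single neuron, so there’s no symmetry to break.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# NAND truth table: each column is one example (x1, x2).
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)
y = np.array([[1, 1, 1, 0]], dtype=float)

w = np.zeros((1, 2))
b = np.zeros((1, 1))
learning_rate = 1.0
m = X.shape[1]

for _ in range(5000):
    a = sigmoid(np.dot(w, X) + b)                # forward pass
    dz = a - y                                   # dL/dz for sigmoid + cross-entropy
    dw = np.dot(dz, X.T) / m                     # gradient w.r.t. the weights
    db = np.sum(dz, axis=1, keepdims=True) / m   # gradient w.r.t. the bias
    w -= learning_rate * dw                      # adjust weights against the gradient
    b -= learning_rate * db

predictions = (sigmoid(np.dot(w, X) + b) > 0.5).astype(int)
print(predictions)
```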
Enough talking, here’s the backpropagation code snippet
We calculate the gradients using what we call the chain rule.
Mind the notation: all of these gradients are taken with respect to the loss function L.
dz = dL/dz; da = dL/da; dw = dL/dw; db = dL/db. You’ll also find layer_id in the code snippet, which literally is the id that identifies the parameters of that particular layer.
Here’s the code for backpropagation
Don’t walk away if you didn’t like it. Let me walk you through each line; I know there’s a lot going on.
Like forward propagation, there are two functions. One loops over all the layers from the back (hence the name backpropagation), and for each layer’s parameters we calculate the gradients, which we’ll use to update the parameters.
The actual code starts at line number 17:
We’re backpropagating in the reverse direction of our ANN.
Line wise breakdown
At line number 20, we’re calculating dL/da, i.e. the gradient of the loss with respect to the final activation at the output layer.
Line 21: We’re looping in the reverse direction, hence the name backpropagation.
Lines 26–27: We’re using the values we cached in the forward pass.
Line 32: Calling our backward function.
Lines 1–7: backward_activation_function, which basically multiplies da by da/dz (the derivative of the activation function) and hence returns dz.
Now you need to know one thing:
dz = dL/dz = dL/da * da/dz = da * g'(z), and
dw = dL/dw = dL/dz * dz/dw = dz * dz/dw. Also,
dz/dw = d(np.dot(w, a_prev) + b)/dw = a_prev, which gives
dw = dz * a_prev, which is what line number 11 represents.
Similarly, at line 12, db is calculated.
We’re also finding da_prev, which will become da_curr for the next backward iteration.
In this way, we’ve stored the gradients for every layer, very intelligently. Now we just need to update the parameters!
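To tie the breakdown together, here’s a rough sketch of how the two backward functions could be arranged. It is not the exact gist: the helper names `sigmoid_backward` and `relu_backward`, the cache keys, and the binary cross-entropy loss are my assumptions, while `da_curr`, `da_prev`, `a_prev`, `dw`, `db`, and the `'dw' + str(layer_id)` keys follow the naming used in this post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_backward(da, z):
    s = sigmoid(z)
    return da * s * (1 - s)        # dz = da * g'(z)

def relu_backward(da, z):
    dz = np.array(da, copy=True)
    dz[z <= 0] = 0                 # g'(z) is 0 where z <= 0, otherwise 1
    return dz

def single_layer_backward(da_curr, w_curr, z_curr, a_prev, activation):
    m = a_prev.shape[1]            # number of examples (columns)
    backward_fn = relu_backward if activation == "relu" else sigmoid_backward
    dz = backward_fn(da_curr, z_curr)
    dw = np.dot(dz, a_prev.T) / m
    db = np.sum(dz, axis=1, keepdims=True) / m
    da_prev = np.dot(w_curr.T, dz)
    return da_prev, dw, db

def backward(y_hat, y, params, cache, architecture):
    grads = {}
    # dL/da at the output layer, assuming binary cross-entropy loss.
    da_prev = -(y / y_hat - (1 - y) / (1 - y_hat))
    for layer_id, layer in reversed(list(enumerate(architecture, 1))):
        da_curr = da_prev
        da_prev, dw, db = single_layer_backward(
            da_curr,
            params["W" + str(layer_id)],
            cache["z" + str(layer_id)],
            cache["a" + str(layer_id - 1)],
            layer["activation"],
        )
        grads["dw" + str(layer_id)] = dw
        grads["db" + str(layer_id)] = db
    return grads

# Illustrative two-layer net: 3 features, 5 examples stored as columns.
architecture = [
    {"input_dim": 3, "output_dim": 4, "activation": "relu"},
    {"input_dim": 4, "output_dim": 1, "activation": "sigmoid"},
]
rng = np.random.default_rng(0)
params = {
    "W1": rng.standard_normal((4, 3)) * 0.1, "b1": np.zeros((4, 1)),
    "W2": rng.standard_normal((1, 4)) * 0.1, "b2": np.zeros((1, 1)),
}
x = rng.standard_normal((3, 5))
y = np.array([[1, 0, 1, 1, 0]], dtype=float)

# Manual forward pass, caching what the backward pass needs.
z1 = np.dot(params["W1"], x) + params["b1"]
a1 = np.maximum(0, z1)
z2 = np.dot(params["W2"], a1) + params["b2"]
y_hat = sigmoid(z2)
cache = {"a0": x, "z1": z1, "a1": a1, "z2": z2}

grads = backward(y_hat, y, params, cache, architecture)
for name, value in grads.items():
    print(name, value.shape)
```

Each gradient has the same shape as the parameter it belongs to, which is exactly what the update step needs.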
Just 1 more paragraph, I promise :)
Updating the Parameters #Optimizer.step()
We’ve accumulated the gradients and we’ve got the parameters saved, so what are we waiting for? Let’s update the parameters. This is the simplest thing here 😜
Note that we want to minimize the loss, so we update the weights in the direction opposite to the gradients.
Notice the minus sign in the update code.
    def update(self, grads, learning_rate):
        for layer_id, layer in enumerate(self.architecture, 1):
            self.params['W'+str(layer_id)] -= learning_rate * grads['dw'+str(layer_id)]
            self.params['b'+str(layer_id)] -= learning_rate * grads['db'+str(layer_id)]
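If the minus sign still feels abstract, here’s a toy sketch of the same update rule on a one-parameter "loss" f(w) = (w - 3)^2. The function, the starting point, and the learning rate are made up for illustration. Stepping against the gradient walks w towards the minimum at 3.

```python
w = 0.0
learning_rate = 0.1

for step in range(100):
    grad = 2 * (w - 3)          # df/dw for f(w) = (w - 3)^2
    w -= learning_rate * grad   # same minus sign as in the update above

print(w)
```

Flip the minus to a plus and w runs away from 3 instead, which is why the sign matters.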
First of all, congratulations on coming this far! We’ve made our neural net from scratch, and to see that it works, I quickly trained a classifier on sklearn’s breast cancer dataset. Although there wasn’t much data, the results were not so bad. In this case, though, I’m not treating accuracy as my metric!
Our code is not as optimized as these deep learning frameworks are. Also, I’ve fit the neural net on the dataset directly without any preprocessing, and we’ve only coded up Gradient Descent. I expect the results would be better, and reached in fewer epochs, with optimization algorithms like Adam or SGD.
The same can be coded up in PyTorch, Keras, or TensorFlow very easily. I’ve included the code for a PyTorch implementation in the same notebook. It is runnable on Google Colab, so play with the code and we can talk about which architecture works well. 😏
As promised, here are some links that you’ll find most useful:
- For backpropagation, I’d suggest you go through this video from the famous cs231n course
- Activation Functions and its types
- Loss Functions in Machine Learning
- Types of Optimization Functions
- Neural Network and Deep Learning Book
This exercise of coding things from scratch has been a great investment of my time, and I hope it’ll be useful for you as well! I’d be really glad to see your own neural net from scratch implementation.
Thank you for reading this far. I’m learning just as you are, so if you’ve got any suggestions, email me or let me know in the comments. I’d be really grateful.
Follow me on Twitter, where I post about Deep Learning and my experiments with it 👌