### Blog: Logistic Regression with Python

Logistic regression was once the most popular machine learning algorithm, but the advent of more accurate algorithms for classification such as support vector machines, random forest, and neural networks has induced some machine learning engineers to view logistic regression as obsolete. Though it may have been overshadowed by more advanced methods, its simplicity makes it the ideal algorithm to use as an introduction to the study of machine learning.

Like most machine learning algorithms, logistic regression creates a boundary edge between binary labels. The purpose of a training process is to place this edge in such a way that most of the labels are divided so as to maximize the accuracy of predictions. The training process requires correct model architecture and fine-tuned hyperparameters, whereas data play the most significant role in determining the prediction accuracy.

*[Related Article: Handling Missing Data in Python/Pandas]*

In a nutshell, the idea behind the process of training logistic regression is to maximize the likelihood of the hypothesis that the data are split by sigmoid. To do this, we should find optimal coefficients for the sigmoid function (x)=11+e–x. These coefficients are iteratively approximated with minimizing the loss function of logistic regression using gradient descent. As soon as losses reach the minimum, or come very close, we can use our model for prediction.

To create a logistic regression with Python from scratch we should import numpy and matplotlib libraries.

import numpy as np

import matplotlib.pyplot as plt

We also have to input the dataset. Here we use the classic scikit-learn example of classifying breast cancer, which is often used for the “hello-world” machine learning examples. This dataset consists of 569 training samples and includes 30 features for binary classification.

from sklearn.datasets import load_breast_cancer

from sklearn.model_selection import train_test_split

The features are tumor properties, such as radius, texture, smoothness, and the expected output is prediction whether the tumor is malignant or benign. These data are completely prepared for training, so we can omit the preprocessing step here. Parse the data and extract either features and labels:

data = load_breast_cancer()[‘data’]

target = load_breast_cancer()[‘target’]

Then split all the data with labels into training and test samples. The size of the test sample should be approximately 20–25% of the whole data. We use this sample to validate training results and to be sure that there is no overfitting. Also, we shuffle the data we are splitting, to be sure that training and test data are of the same distribution.

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, shuffle=True)

Now let’s define the sigmoid function. Since the output can be rounded by sigmoid function to 0 or 1 because some input values are too small or too large, we later get the logarithm of zero which equals to the negative infinitely large number and cannot be used for further computations. That’s why we manually bound our sigmoid function:

def sigmoid(x):

return np.maximum(np.minimum(1 / (1 + np.exp(-x)), 0.9999), 0.0001)

As there was mentioned above, we need to define the cost function for the logistic regression. The most common approach is to iterate over training examples to apply sigmoid to them, then iterate one more time to count the sum of losses. However, we use numpy to apply sigmoid to the whole array and count losses of all the array with just a few lines of code:

def cost_function(x, y, theta):

t = x.dot(theta)

return — np.sum(y * np.log(sigmoid(t)) + (1 — y) * np.log(1 — sigmoid(t))) / x.shape[0]

The idea of cost function is that we count the sum of the metric distances between our hypothesis and real labels on the training data. The more optimized our parameters are, the less is the distance between the true value and hypothesis. But how can we minimize this distance?

The approach to solving most of the optimization problems is plain differentiation. We are going to use it here too. As soon as we have 30 variables, we should have found partial derivatives. To simplify the differentiation in this case we will assume that all these variables are independent. We also represent the gradient of cost function through itself. Here is what we get:

def gradient_cost_function(x, y, theta):

t = x.dot(theta)

return x.T.dot(y — sigmoid(t)) / x.shape[0]

The next step is called a stochastic gradient descent. This is the main part of the training process because at this step we update model weights. Here we use the hyperparameter called learning rate, that sets the intensity of the training process. Learning rate is very important hyperparameter and has to be selected very carefully. Using too small learning rate, you can move too slow and never reach the minimum of the loss function, while using too large a learning rate may cause you to miss the minimum entirely.

def update_theta(x, y, theta, learning_rate):

return theta + learning_rate * gradient_cost_function(x, y, theta)

Now let’s put everything together to see the whole picture. We initiate as a zero-vector of size 30. In terms of logistic regression, we can also use an array of ones or random values between 0 and 1. Then we iterate over the manually set number of iterations to minimize losses to maximize the likelihood of our predictions. If the difference between the two last values of the cost function is smaller than some threshold value, we break the training:

def train(x, y, learning_rate, iterations=500, threshold=0.0005):

theta = np.zeros(x.shape[1])

costs = []

print(‘Start training’)

for i in range(iterations):

theta = update_theta(x, y, theta, learning_rate)

cost = cost_function(x, y, theta)

print(f’[Training step #{i}] — Cost function: {cost:.4f}’)

costs.append({‘cost’: cost, ‘weights’: theta})

if i > 15 and abs(costs[-2][‘cost’] — costs[-1][‘cost’]) < threshold:

break

return theta, costs

Theta, costs = train(X_train, y_train, learning_rate=0.0001)

Finally having trained the model, we can make predictions, based on our model:

def predict(x, theta):

return (sigmoid(x.dot(theta)) >= 0.5).astype(int)

Let’s compare, how predicted data are different than real:

def get_accuracy(x, y, theta):

y_pred = predict(x, theta)

return (y_pred == y).sum() / y.shape[0]

print(f’Accuracy on the training set: {get_accuracy(X_train, y_train, theta)}’)

print(f’Accuracy on the test set: {get_accuracy(X_test, y_test, theta)}’)

We can also make some visualizations with pyplot to watch, how accurate our predictions are:

plt.figure(figsize=(15,10))

plt.title(‘Model accuracy depending on the training step’)

plt.plot(np.arange(0, len(costs)),

[get_accuracy(X_train, y_train, c[‘weights’]) for c in costs],

alpha=0.7,

label=’Train’, color=’r’)

plt.plot(np.arange(0, len(costs)),

[get_accuracy(X_test, y_test, c[‘weights’]) for c in costs],

alpha=0.7,

label=’Test’, color=’b’)

plt.xlabel(‘Number of iterations’)

plt.ylabel(‘Accuracy, %’)

plt.legend(loc=’best’)

plt.grid(True)

plt.xticks(np.arange(0, len(costs)+1, 40))

plt.yticks(np.arange(0.5, 1, 0.1))

plt.show()

*[Related Article: Tips for Linear Regression Diagnostics]*

The training accuracy between the two neighboring iterations is changing rapidly. This is the result of using stochastic gradient descent and bounding values inside the sigmoid function. Nevertheless, we can see that the training was successful and the accuracy on the training set is approximately 92%, while on the training set it is almost 88%, which is a pretty good result for such a tiny algorithm.

*Read more data science articles on **OpenDataScience.com**, including tutorials and guides from beginner to advanced levels! **Subscribe to our weekly newsletter here** and receive the latest news every Thursday.*

## Leave a Reply