Blog: Build your own Diabetes Prediction Portal using Flask
In this tutorial, we’ll see how to build a Diabetes Prediction Portal using Flask, scikit-learn (sklearn), NumPy, and pandas. It was my weekend project. We will train a K-Nearest Neighbors (KNN) classifier and use it to predict a positive or negative result. The model is trained on the Pima Indians Diabetes dataset. The inputs are BMI (Body Mass Index), BP (Blood Pressure), glucose level, and insulin level; based on these features, the model predicts whether a person has diabetes or not.
The code for this tutorial can be found in my GitHub repository.
Requirements: PyCharm, Python 2.7, scikit-learn (sklearn), pandas, NumPy
To install Flask, scikit-learn, pandas, and NumPy, use these commands:
pip install flask
pip install scikit-learn
pip install pandas
pip install numpy
Training and saving the model
Before we build the Diabetes Prediction Portal in Flask, we have to train a model and save it as a pickle file, so that we can use the trained model in our portal for prediction.
You can find the Pima Indians Diabetes dataset used for training on Kaggle. The dataset has 768 rows and 9 columns.
Download link: https://www.kaggle.com/uciml/pima-indians-diabetes-database
I used a Jupyter Notebook for training the model. Open Jupyter Notebook and follow the steps below to create the model.
Step 1: Importing the modules required
import numpy as np
import pandas as pd
from sklearn.model_selection import RepeatedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
Step 2: Loading the dataset
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
pima = pd.read_csv("diabetes.csv", header=None, names=col_names)
pima #print dataset
The dataset is in CSV format. CSV stands for “comma-separated values”, which means every value in the file is separated by a comma. The read_csv() function of pandas reads CSV files and returns a DataFrame. col_names contains the column names that we want to give to the dataset; we pass names=col_names as a keyword argument to read_csv() so that it assigns these names while loading the dataset. Because we also pass header=None, the file’s own header row is read as a data row, so the loaded dataset has 769 rows and 9 columns.
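To make the 769-row count concrete, here is a small sketch of how header=None changes what read_csv() returns. The three-line CSV text and the shortened col_names list are made up for illustration; they only stand in for diabetes.csv.

```python
import io

import pandas as pd

# Hypothetical miniature CSV standing in for diabetes.csv:
# one header row followed by two data rows.
csv_text = "Glucose,BMI,Outcome\n148,33.6,1\n85,26.6,0\n"
col_names = ["glucose", "bmi", "label"]

# header=None: the file's header line is treated as data and becomes row 0.
with_header_row = pd.read_csv(io.StringIO(csv_text), header=None, names=col_names)

# header=0: the file's header line is replaced by col_names, leaving only data rows.
data_only = pd.read_csv(io.StringIO(csv_text), header=0, names=col_names)

print(with_header_row.shape)  # (3, 3)
print(data_only.shape)        # (2, 3)
```

With header=None every column is also loaded as text, because the header strings share a column with the numbers. That is why the tutorial drops the first row and converts the rest to float before computing correlations.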
Step 3.1: Selecting the features to train model
# change datatype to float (np.float was only an alias for the built-in float and is removed in NumPy 1.24+)
data = pima.tail(768).astype(float)
# correlation of every column with every other column
data.corr()
After Step 2 we need to select the features on which we want to train our model. To do this, we find the correlation of every column with every other column. The values must first be converted to floats, because the column names in the first row caused everything to be loaded as text. The tail() function of pandas selects rows from the end of a DataFrame; here we select the last 768 of the 769 rows, because the first row contains the column names. astype() changes the type, in this case to float. corr() computes the correlation matrix.
After finding the correlations, we select at least four features that have a high correlation with the label column. High correlation means a value close to 1. We avoid features whose correlation with the label is negative or close to zero. You can select more than four features, depending on the correlations.
Here I selected four features: bmi, bp, glucose, and insulin.
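The selection rule described above can be sketched as a small helper that ranks columns by their correlation with the label. The data here is synthetic (not the Pima dataset), and the helper name top_features and the column names strong, weak, and noise are invented for this example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data, NOT the Pima dataset: "strong" follows the label
# closely, "weak" only slightly, and "noise" not at all.
rng = np.random.RandomState(0)
n = 200
label = rng.randint(0, 2, size=n)
frame = pd.DataFrame({
    "strong": label * 2.0 + rng.normal(0, 0.5, n),
    "weak": label * 0.2 + rng.normal(0, 1.0, n),
    "noise": rng.normal(0, 1.0, n),
    "label": label,
})


def top_features(df, target, k):
    """Return the k columns with the highest correlation to the target column."""
    corr_with_target = df.corr()[target].drop(target)
    return corr_with_target.sort_values(ascending=False).head(k).index.tolist()


print(top_features(frame, "label", 2))
```

Sorting in descending order puts features with values near 1 first and pushes negative or near-zero correlations to the end, which matches the rule of thumb used in this step.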
Step 3.2: Split features and target variables
#split dataset in features and target variable
feature_cols = [ 'bmi', 'bp','glucose','insulin']
X = pima[feature_cols][1:] # Features
y = pima.label[1:]# Target variable
Features are the observed factors used to determine whether a person has diabetes or not. The label is the target variable, with a value of 0 or 1 for each set of features: 0 indicates a non-diabetic patient and 1 indicates a diabetic patient.
Step 4: Apply KFold for splitting data into train and test
skf = RepeatedKFold(n_splits=100,n_repeats=7,random_state=10)
for train_index, test_index in skf.split(X,y):
    X_train,X_test = X.iloc[train_index],X.iloc[test_index]
    y_train,y_test = y.iloc[train_index],y.iloc[test_index]
print(X_test.shape,y_test.shape) # (7,4) (7,)
print(X_train.shape,y_train.shape) # (761,4) (761,)
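Using RepeatedKFold, we split the data into X_train and y_train for training the model and X_test and y_test for testing the trained model. The loop overwrites these variables on every iteration, so after it finishes they hold the last split: 761 rows for training and 7 rows for testing.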
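To see where the 7 and 761 come from, the sketch below counts the test-set size of every split that RepeatedKFold produces with the same parameters. X_demo is a placeholder array with the same shape as the feature matrix; the real data is not needed to inspect the split sizes.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Placeholder with the same shape as the feature matrix (768 rows, 4 features).
X_demo = np.zeros((768, 4))
rkf = RepeatedKFold(n_splits=100, n_repeats=7, random_state=10)

# Record how many rows land in the test set of each split.
test_sizes = [len(test_index) for _, test_index in rkf.split(X_demo)]

print(len(test_sizes))                   # 700 splits: 100 folds x 7 repeats
print(min(test_sizes), max(test_sizes))  # 7 8
print(test_sizes[-1])                    # 7, the split left over after the loop
```

Since 768 does not divide evenly by 100, some folds hold 8 test rows and the rest hold 7. The last fold of each repeat is one of the 7-row folds, which is why the shapes printed after the loop are (7, 4) and (761, 4).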
Step 5: Model Development and Prediction
# instantiate the model (using the default parameters)
kn = KNeighborsClassifier()
# fit the model with data
kn.fit(X_train, y_train)
The fit() function trains the model with the given data.
y_pred = kn.predict(X_test)
y_pred # print y_pred
predict() is used for making predictions. After training the model with kn.fit(), we pass X_test to kn.predict() to predict the target variable, i.e. whether each person has diabetes or not. If y_pred and y_test are the same, the model has 100% test accuracy. 100% accuracy is possible here because we trained on 761 of the 768 rows and tested on only 7, so this score is not a reliable estimate of how the model performs on new data. If we change the parameters of RepeatedKFold, the train and test sizes change too. To get high accuracy, we must train the model with the best features and more data.
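Because a single 7-row test set says little, one option is to average the accuracy over every split instead of looking only at the last one. The following is a minimal sketch of that idea using scikit-learn's cross_val_score; it is not part of the original notebook, and it runs on synthetic data (X_demo, y_demo) with a smaller number of splits so that it finishes quickly.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, NOT the Pima dataset: the label depends on the
# first two of four features.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Smaller split counts than in the tutorial, chosen only to keep the demo fast.
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=10)

# One accuracy score per split; the classifier is refit for each one.
scores = cross_val_score(KNeighborsClassifier(), X_demo, y_demo, cv=cv, scoring="accuracy")

print(len(scores))    # 15 scores: 5 folds x 3 repeats
print(scores.mean())  # average accuracy across all splits
```

The mean over all splits is usually a steadier number than the score of any single small fold, so it gives a fairer picture when comparing feature sets or values of n_neighbors.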
Step 6: Saving and loading the model
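# sklearn.externals.joblib was removed in scikit-learn 0.23; on newer versions use "import joblib" instead
from sklearn.externals import joblib
#save model as diabetes.pkl
joblib.dump(kn, 'diabetes.pkl')
#loading the model
model1 = joblib.load('diabetes.pkl')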
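As a quick check that saving and loading preserves the model, the sketch below trains a tiny classifier, writes it to a temporary file, and compares predictions before and after. It assumes the standalone joblib package (the import used with newer scikit-learn releases); the toy data and the file name demo_model.pkl are made up for this example.

```python
import os
import tempfile

import joblib  # standalone package, assumed to be installed alongside scikit-learn
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy one-feature data: small values are class 0, large values are class 1.
X_demo = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
clf = KNeighborsClassifier(n_neighbors=3).fit(X_demo, y_demo)

# Save to a throwaway directory, then load the model back.
model_file = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
joblib.dump(clf, model_file)
restored = joblib.load(model_file)

# The restored model should predict exactly what the original does.
same = np.array_equal(clf.predict(X_demo), restored.predict(X_demo))
print(same)  # True
```

A pickled model is generally only safe to load with the same scikit-learn version that saved it, so it is worth keeping the training and portal environments in sync.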
Developing the Portal in Flask
After training and saving the model, we have to build the Diabetes Prediction Portal using Flask. Open PyCharm and create a new Flask project named “Portal”, or any name you like.
Create a new folder named json inside the static folder; it will store the user input as a JSON file that is passed to our model for prediction. Then create another folder named model inside static and place the trained model (diabetes.pkl) inside it.
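If you prefer to create the folders from a script rather than by hand, the layout can be sketched as follows. The demo builds it in a temporary directory so that nothing in a real project is touched; the templates folder is the one Flask looks in for HTML files by default.

```python
import os
import tempfile

# Throwaway location for the demo; in a real project this is your project folder.
project_root = os.path.join(tempfile.mkdtemp(), "Portal")

# static/json holds the saved form input, static/model holds diabetes.pkl,
# and templates holds the HTML pages.
for folder in ("static/json", "static/model", "templates"):
    os.makedirs(os.path.join(project_root, *folder.split("/")))

# Collect every created directory, relative to the project root.
layout = sorted(
    os.path.relpath(os.path.join(dirpath, name), project_root).replace(os.sep, "/")
    for dirpath, dirnames, _ in os.walk(project_root)
    for name in dirnames
)
print(layout)  # ['static', 'static/json', 'static/model', 'templates']
```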
Now, create index.html inside the templates folder; it will contain our portal. Write your own CSS stylesheet to give your portal a unique look and feel.
<title>Welcome to Diabetes Prediction</title>
<center><h1>Welcome to Diabetes Prediction Portal</h1></center><br>
<form action="/predict" method="post">
<label>Enter BMI(Body Mass Index):</label><input type="text" name="bmi" placeholder="BMI(Body Mass Index)" required>
<label>Enter BP(Blood Pressure):</label><input type="text" name="bp" placeholder="BP(Blood Pressure)" required>
<label>Enter Glucose Level:</label><input type="text" name="glucose" placeholder="Glucose" required>
<label>Enter Insulin Level:</label><input type="text" name="insulin" placeholder="Insulin Level" required><br><br>
<center><input type="submit" value="Submit" id="btn"></center>
</form>
<main><p><b>Note: </b><i>Body Mass Index (BMI) is a measure of body fat based on a person’s height and weight. The formula is BMI = kg/m<sup>2</sup>, where kg is a person’s weight in kilograms and m<sup>2</sup> is their height in metres squared.
A BMI of 25.0 or more is overweight, while the healthy range is 18.5 to 24.9. BMI applies to most adults 18-65 years.</i></p></main><br>
<p><b>Important Notice: </b><i>This portal is for educational purposes only.</i></p>
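Users need to know their BMI before filling in the form, so here is a small sketch of the formula from the note on the page. The function names bmi and bmi_category are invented for this example, and the category labels simply restate the ranges quoted in the note.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / float(height_m ** 2)


def bmi_category(value):
    """Map a BMI value to the ranges quoted in the note on the page."""
    if value < 18.5:
        return "below healthy range"
    if value < 25.0:
        return "healthy range"
    return "overweight"


# A person weighing 70 kg who is 1.75 m tall.
example = bmi(70, 1.75)
print(round(example, 1))      # 22.9
print(bmi_category(example))  # healthy range
```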
Now, in app.py
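# import required packages
import json
import os
from flask import Flask, request, render_template
# sklearn.externals.joblib was removed in scikit-learn 0.23; on newer versions use "import joblib"
from sklearn.externals import joblib
import pandas as pd
# on pandas 1.0 and later this function is also available as pd.json_normalize
from pandas.io.json import json_normalize
app = Flask(__name__)
# Assumption: the home page simply renders index.html
@app.route('/')
def index():
    return render_template('index.html')
# the form in index.html posts to /predict
@app.route('/predict', methods=['POST'])
def predict():
    formvalues = request.form.to_dict()
    json_path = os.path.join(os.getcwd(), 'static', 'json', 'file.json')
    with open(json_path, 'w') as f:
        json.dump(formvalues, f)  # save form input as a json file
    with open(json_path, 'r') as f:
        values = json.load(f)  # load json
    df = pd.DataFrame(json_normalize(values))  # convert json to a dataframe
    # keep the column order used in training and convert the text input to numbers
    df = df[['bmi', 'bp', 'glucose', 'insulin']].astype(float)
    model_path = os.path.join(os.getcwd(), 'static', 'model', 'diabetes.pkl')  # get model path
    model = joblib.load(model_path)  # load
    result = model.predict(df)  # predict
    msg = "Unsuccess"
    # Assumption: the original handling of result is not shown; label 1 means diabetic, 0 non-diabetic
    if len(result) == 1:
        msg = "Diabetic" if int(result[0]) == 1 else "Non-diabetic"
    return msg
if __name__ == '__main__':
    app.run()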
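A route like /predict can be exercised without a browser by using Flask's built-in test client. The sketch below is self-contained and does not load diabetes.pkl: StubModel is a hypothetical stand-in for the trained classifier, and the names app_demo, predict_demo, and FIELDS are invented for this example.

```python
from flask import Flask, request

app_demo = Flask(__name__)


class StubModel(object):
    """Hypothetical stand-in for the pickled classifier: flags glucose above 140."""

    def predict(self, rows):
        # rows is a list of [bmi, bp, glucose, insulin] lists.
        return [1 if row[2] > 140 else 0 for row in rows]


# Same field names and order as the form in index.html.
FIELDS = ["bmi", "bp", "glucose", "insulin"]


@app_demo.route("/predict", methods=["POST"])
def predict_demo():
    # Form values arrive as text, so convert each one to a number first.
    row = [float(request.form[name]) for name in FIELDS]
    result = StubModel().predict([row])
    return "positive" if result[0] == 1 else "negative"


# The test client sends the same kind of POST request the HTML form would.
client = app_demo.test_client()
response = client.post(
    "/predict",
    data={"bmi": "30", "bp": "80", "glucose": "150", "insulin": "90"},
)
print(response.status_code)             # 200
print(response.get_data(as_text=True))  # positive
```

Swapping StubModel for the model loaded with joblib gives a quick smoke test of the real route, as long as the model file is present at the expected path.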