Ensemble Learning In Artificial Intelligence

Ensemble Learning refers to the process of building multiple models and then combining them in a way that can produce better results than individual models.

These individual models can be classifiers, regressors, or anything else that models data in some way. Ensemble learning is used extensively across multiple fields including data classification, predictive modelling, anomaly detection, and so on.

Why do we need ensemble learning in the first place? In order to understand this, let’s take a real-life example. You want to buy a new TV, but you don’t know what the latest models are. Your goal is to get the best value for your money, but you don’t have enough knowledge on this topic to make an informed decision. When you have to make a decision about something like this, you go around and try to get the opinions of multiple experts in the domain. This will help you make the best decision. More often than not, instead of just relying on a single opinion, you tend to make a final decision by combining the individual decisions of those experts. The reason we do that is that we want to minimize the possibility of a wrong or suboptimal decision.

Building learning models with Ensemble Learning

When we select a model, the most commonly used procedure is to choose the one with the smallest error on the training dataset. The problem with this approach is that it will not always work. The model might get biased or overfit the training data. Even when we compute the model using cross-validation, it can perform poorly on unknown data.

One of the main reasons ensemble learning is so effective is that it reduces the overall risk of making a poor model selection. Because it combines several models that were trained in different ways, it tends to generalize better to unknown data. For this to work, the individual models need to exhibit some diversity, which allows them to capture different nuances in the data and makes the overall model more accurate.

This diversity is achieved by using different training parameters for each individual model, which allows the individual models to generate different decision boundaries for the training data. Each model then uses different rules to make an inference, which is a powerful way of validating the final result: if the models agree on a prediction, we can be more confident that it is correct.
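To make this concrete, here is a minimal sketch (not from the original post) of a hard-voting ensemble built from trees trained with different parameters. It assumes synthetic data from make_classification and a recent scikit-learn in which train_test_split lives in sklearn.model_selection; the listings later in this post use the older sklearn.cross_validation module:

# Minimal sketch: combine diverse trees by majority (hard) voting
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

# Diversity comes from the different training parameters of each model
models = [('shallow', DecisionTreeClassifier(max_depth=2, random_state=0)),
          ('medium', DecisionTreeClassifier(max_depth=5, random_state=1)),
          ('deep', DecisionTreeClassifier(max_depth=None, random_state=2))]

ensemble = VotingClassifier(estimators=models, voting='hard')
ensemble.fit(X_train, y_train)

for name, model in models:
    print(name, model.fit(X_train, y_train).score(X_test, y_test))
print('majority vote', ensemble.score(X_test, y_test))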

What are Decision Trees?

A Decision Tree is a structure that allows us to split the dataset into branches and then make simple decisions at each level. This will allow us to arrive at the final decision by walking down the tree. Decision Trees are produced by training algorithms, which identify how we can split the data in the best possible way.

Any decision process starts at the root node at the top of the tree. Each node in the tree is basically a decision rule. Algorithms construct these rules based on the relationship between the input data and the target labels in the training data. The values in the input data are utilized to estimate the value for the output.
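As a small illustration (not part of the original code, and requiring scikit-learn 0.21 or later for export_text), you can print the decision rules a trained tree has learned, one rule per internal node:

# Dump the if/else rules learned by a small tree on the Iris dataset
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))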

Now that we understand the basic concept of Decision Trees, the next thing is to understand how the trees are automatically constructed. We need algorithms that can construct the optimal tree based on our data. In order to understand it, we need to understand the concept of entropy. In this context, entropy refers to information entropy and not thermodynamic entropy. Entropy is basically a measure of uncertainty. One of the main goals of a decision tree is to reduce uncertainty as we move from the root node towards the leaf nodes. When we see an unknown data point, we are completely uncertain about the output. By the time we reach the leaf node, we are certain about the output. This means that we need to construct the decision tree in a way that will reduce the uncertainty at each level. This implies that we need to reduce the entropy as we progress down the tree.
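The following toy snippet (not from the original post) shows how entropy and the information gain of a candidate split can be computed with plain NumPy, using a small set of hypothetical labels:

import numpy as np

def entropy(labels):
    """Information entropy, in bits, of a 1D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

# Hypothetical labels at a node, and the two groups after a candidate split
y_parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_left, y_right = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])

gain = entropy(y_parent) - (
        len(y_left) / len(y_parent) * entropy(y_left)
        + len(y_right) / len(y_parent) * entropy(y_right))
print('Parent entropy:', entropy(y_parent))           # 1.0 bit
print('Information gain of the split:', round(gain, 3))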

Building a Decision Tree Classifier

Let’s see how to build a classifier using Decision Trees in Python. Create a new Python file and import the following packages:

import numpy as np 
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
from sklearn import cross_validation
from sklearn.tree import DecisionTreeClassifier

from utilities import visualize_classifier

We will be using the data in the data_decision_trees.txt file that’s provided to you. In this file, each line contains comma-separated values. The first two values correspond to the input data and the last value corresponds to the target label. Let’s load the data from that file:

# Load input data 
input_file = 'data_decision_trees.txt'
data = np.loadtxt(input_file, delimiter=',')
X, y = data[:, :-1], data[:, -1]

Separate the input data into two separate classes based on the labels:

# Separate input data into two classes based on labels 
class_0 = np.array(X[y==0])
class_1 = np.array(X[y==1])

Let’s visualize the input data using a scatter plot:

# Visualize input data 
plt.figure()
plt.scatter(class_0[:, 0], class_0[:, 1], s=75, facecolors='black',
edgecolors='black', linewidth=1, marker='x')
plt.scatter(class_1[:, 0], class_1[:, 1], s=75, facecolors='white',
edgecolors='black', linewidth=1, marker='o')
plt.title('Input data')

We need to split the data into training and testing datasets:

# Split data into training and testing datasets 
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
X, y, test_size=0.25, random_state=5)

Create, build, and visualize a decision tree classifier based on the training dataset. The random_state parameter refers to the seed used by the random number generator required for the initialization of the decision tree classification algorithm. The max_depth parameter refers to the maximum depth of the tree that we want to construct:

# Decision Trees classifier 
params = {'random_state': 0, 'max_depth': 4}
classifier = DecisionTreeClassifier(**params)
classifier.fit(X_train, y_train)
visualize_classifier(classifier, X_train, y_train, 'Training dataset')

Compute the output of the classifier on the test dataset and visualize it:

y_test_pred = classifier.predict(X_test) 
visualize_classifier(classifier, X_test, y_test, 'Test dataset')

Evaluate the performance of the classifier by printing the classification report:

# Evaluate classifier performance 
class_names = ['Class-0', 'Class-1']
print("\n" + "#"*40)
print("\nClassifier performance on training dataset\n")
print(classification_report(y_train, classifier.predict(X_train), target_names=class_names))
print("#"*40 + "\n")

print("#"*40)
print("\nClassifier performance on test dataset\n")
print(classification_report(y_test, y_test_pred, target_names=class_names))
print("#"*40 + "\n")

plt.show()

The full code is given in the decision_trees.py file. If you run the code, you will see a few figures. The first screenshot is the visualization of input data:

The second screenshot shows the classifier boundaries on the test dataset:

You will see the following printed on your Terminal:

The performance of a classifier is characterized by precision, recall, and the F1 score. Precision is the fraction of the items labelled as belonging to a class that actually belong to it, while recall is the fraction of the items that truly belong to a class that the classifier managed to retrieve. A good classifier has both high precision and high recall, but there is usually a trade-off between the two. This is why we use the F1 score, the harmonic mean of precision and recall, which provides a good balance between the two values.
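As a quick sanity check (hypothetical labels, not from the original listing), the following snippet computes precision, recall, and the F1 score for a small two-class example:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]   # hypothetical predictions

# For class 1: 4 true positives, 1 false positive, 2 false negatives
p = precision_score(y_true, y_pred)   # 4 / (4 + 1) = 0.8
r = recall_score(y_true, y_pred)      # 4 / (4 + 2) = 0.667
f1 = f1_score(y_true, y_pred)         # 2 * p * r / (p + r) = 0.727
print(p, r, f1)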

What are Random Forests and Extremely Random Forests?

A Random Forest is a particular instance of ensemble learning where individual models are constructed using Decision Trees. This ensemble of Decision Trees is then used to predict the output value. We use a random subset of training data to construct each Decision Tree. This will ensure diversity among various decision trees. In the first section, we discussed that one of the most important things in ensemble learning is to ensure that there’s diversity among individual models.

One of the best things about Random Forests is that they are quite resistant to overfitting. As we know, overfitting is a problem that we encounter frequently in machine learning. By constructing a diverse set of Decision Trees using various random subsets of the training data, we ensure that the model does not overfit. During the construction of a tree, the nodes are split successively and the best thresholds are chosen to reduce the entropy at each level. This split does not consider all the features in the input dataset; instead, it chooses the best split among a random subset of the features under consideration. Adding this randomness tends to increase the bias of the random forest, but the variance decreases because of averaging, so we end up with a robust model.

Extremely Random Forests take randomness to the next level. Along with taking a random subset of features, the thresholds are chosen at random too. These randomly generated thresholds are chosen as the splitting rules, which reduce the variance of the model even further. Hence the decision boundaries obtained using Extremely Random Forests tend to be smoother than the ones obtained using Random Forests.

Building Random Forest and Extremely Random Forest classifiers

Let’s see how to build a classifier based on Random Forests and Extremely Random Forests. The way to construct both classifiers is very similar, so we will use an input flag to specify which classifier needs to be built.

Create a new Python file and import the following packages:

import argparse 

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

from utilities import visualize_classifier

Define an argument parser for Python so that we can take the classifier type as an input parameter. Depending on this parameter, we can construct a Random Forest classifier or an Extremely Random Forest classifier:

# Argument parser 
def build_arg_parser():
    parser = argparse.ArgumentParser(description='Classify data using \
            Ensemble Learning techniques')
    parser.add_argument('--classifier-type', dest='classifier_type',
            required=True, choices=['rf', 'erf'],
            help="Type of classifier to use; can be either 'rf' or 'erf'")
    return parser

Define the main function and parse the input arguments:

if __name__ == '__main__':
    # Parse the input arguments
    args = build_arg_parser().parse_args()
    classifier_type = args.classifier_type

We will be using the data from the data_random_forests.txt file that is provided to you. Each line in this file contains comma-separated values. The first two values correspond to the input data and the last value corresponds to the target label. We have three distinct classes in this dataset. Let’s load the data from that file:

# Load input data 
input_file = 'data_random_forests.txt'
data = np.loadtxt(input_file, delimiter=',')
X, y = data[:, :-1], data[:, -1]

Separate the input data into three classes:

# Separate input data into three classes based on labels 
class_0 = np.array(X[y==0])
class_1 = np.array(X[y==1])
class_2 = np.array(X[y==2])

Let’s visualize the input data:

# Visualize input data 
plt.figure()
plt.scatter(class_0[:, 0], class_0[:, 1], s=75, facecolors='white',
edgecolors='black', linewidth=1, marker='s')
plt.scatter(class_1[:, 0], class_1[:, 1], s=75, facecolors='white',
edgecolors='black', linewidth=1, marker='o')
plt.scatter(class_2[:, 0], class_2[:, 1], s=75, facecolors='white',
edgecolors='black', linewidth=1, marker='^')
plt.title('Input data')

Split the data into training and testing datasets:

# Split data into training and testing datasets 
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
X, y, test_size=0.25, random_state=5)

Define the parameters to be used when we construct the classifier. The n_estimators parameter refers to the number of trees that will be constructed. The max_depth parameter refers to the maximum number of levels in each tree. The random_state parameter refers to the seed value of the random number generator needed to initialize the random forest classifier algorithm:

# Ensemble Learning classifier 
params = {'n_estimators': 100, 'max_depth': 4, 'random_state': 0}

Depending on the input parameter, we either construct a random forest classifier or an extremely random forest classifier:

if classifier_type == 'rf':
    classifier = RandomForestClassifier(**params)
else:
    classifier = ExtraTreesClassifier(**params)

Train and visualize the classifier:

classifier.fit(X_train, y_train) 
visualize_classifier(classifier, X_train, y_train, 'Training dataset')

Compute the output based on the test dataset and visualize it:

y_test_pred = classifier.predict(X_test) 
visualize_classifier(classifier, X_test, y_test, 'Test dataset')

Evaluate the performance of the classifier by printing the classification report:

# Evaluate classifier performance 
class_names = ['Class-0', 'Class-1', 'Class-2']
print("\n" + "#"*40)
print("\nClassifier performance on training dataset\n")
print(classification_report(y_train, classifier.predict(X_train), target_names=class_names))
print("#"*40 + "\n")

print("#"*40)
print("\nClassifier performance on test dataset\n")
print(classification_report(y_test, y_test_pred, target_names=class_names))
print("#"*40 + "\n")

The full code is given in the random_forests.py file. Let’s run the code with the Random Forest classifier by using the rf flag in the input argument. Run the following command on your Terminal:

$ python3 random_forests.py --classifier-type rf

You will see a few figures pop up. The first screenshot is the input data:

In the preceding screenshot, the three classes are being represented by squares, circles, and triangles. We see that there is a lot of overlap between classes, but that should be fine for now. The second screenshot shows the classifier boundaries:

Now let’s run the code with the Extremely Random Forest classifier by using the erf flag in the input argument. Run the following command on your Terminal:

$ python3 random_forests.py --classifier-type erf

You will see a few figures pop up. We already know what the input data looks like. The second screenshot shows the classifier boundaries:

If you compare the preceding screenshot with the boundaries obtained from the Random Forest classifier, you will see that these boundaries are smoother. The reason is that Extremely Random Forests have more freedom during the training process to come up with good Decision Trees, so they usually produce better boundaries.

Estimating the confidence measure of the predictions

The classifier estimates a probability for each class, and these probabilities can be used as confidence measures for its predictions. Estimating confidence values is an important task in machine learning. In the same Python file, add the following lines to define an array of test data points:

# Compute confidence 
test_datapoints = np.array([[5, 5], [3, 6], [6, 4], [7, 2], [4, 4], [5, 2]])

The classifier object has an inbuilt method to compute the confidence measure. Let’s classify each point and compute the confidence values:

print("\nConfidence measure:") 
for datapoint in test_datapoints:
    probabilities = classifier.predict_proba([datapoint])[0]
    predicted_class = 'Class-' + str(np.argmax(probabilities))
    print('\nDatapoint:', datapoint)
    print('Predicted class:', predicted_class)
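If you also want to inspect the raw class probabilities behind these confidence values, a small optional addition (not in the original listing) prints them for each test data point:

# Optional: print the raw class probabilities returned by predict_proba
for datapoint in test_datapoints:
    probabilities = classifier.predict_proba([datapoint])[0]
    print('Datapoint:', datapoint, '--> probabilities:', np.round(probabilities, 3))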

Visualize the test data points based on classifier boundaries:

# Visualize the datapoints 
visualize_classifier(classifier, test_datapoints,
[0]*len(test_datapoints),
'Test datapoints')

plt.show()

If you run the code with the rf flag, you will get the following output:

You will get the following output on your Terminal:

For each data point, it computes the probability of that point belonging to our three classes. We pick the one with the highest confidence. If you run the code with the erf flag, you will get the following output:

You will get the following output on your Terminal:

As we can see, the outputs are consistent with our observations.

Dealing with class imbalance

A classifier is only as good as the data that’s used for training. One of the most common problems we face in the real world is the quality of data. For a classifier to perform well, it needs to see an equal number of points for each class. But when we collect data in the real world, it’s not always possible to ensure that each class has the exact same number of data points. If one class has 10 times the number of data points of the other class, then the classifier tends to get biased towards the first class. Hence we need to make sure that we account for this imbalance algorithmically. Let’s see how to do that.
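To see what accounting for the imbalance means in practice, here is a tiny illustration (synthetic labels, not from the original data) of the weighting rule used by scikit-learn’s balanced mode, which weights each class inversely to its frequency:

import numpy as np

# 'balanced' class weights: n_samples / (n_classes * count_of_that_class)
y = np.array([0] * 900 + [1] * 100)           # heavily imbalanced labels
classes, counts = np.unique(y, return_counts=True)
weights = len(y) / (len(classes) * counts)
print(dict(zip(classes, weights)))            # roughly {0: 0.56, 1: 5.0}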

Create a new Python file and import the following packages:

import sys 

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier
from sklearn import cross_validation
from sklearn.metrics import classification_report

from utilities import visualize_classifier

We will use the data in the file data_imbalance.txt for our analysis. Each line in this file contains comma-separated values; the first two values correspond to the input data and the last value corresponds to the target label. We have two classes in this dataset. Let’s load the data from that file:

# Load input data 
input_file = 'data_imbalance.txt'
data = np.loadtxt(input_file, delimiter=',')
X, y = data[:, :-1], data[:, -1]

Separate the input data into two classes:

# Separate input data into two classes based on labels 
class_0 = np.array(X[y==0])
class_1 = np.array(X[y==1])

Visualize the input data using a scatter plot:

# Visualize input data 
plt.figure()
plt.scatter(class_0[:, 0], class_0[:, 1], s=75, facecolors='black',
edgecolors='black', linewidth=1, marker='x')
plt.scatter(class_1[:, 0], class_1[:, 1], s=75, facecolors='white',
edgecolors='black', linewidth=1, marker='o')
plt.title('Input data')

Split the data into training and testing datasets:

# Split data into training and testing datasets 
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
X, y, test_size=0.25, random_state=5)

Next, we define the parameters for the Extremely Random Forest classifier. Note that there is an input parameter called balance that controls whether or not we want to algorithmically account for class imbalance. If so, then we need to add another parameter called class_weight that tells the classifier to balance the weights so that they are inversely proportional to the number of data points in each class:

# Extremely Random Forests classifier 
params = {'n_estimators': 100, 'max_depth': 4, 'random_state': 0}
if len(sys.argv) > 1:
    if sys.argv[1] == 'balance':
        params = {'n_estimators': 100, 'max_depth': 4, 'random_state': 0,
                'class_weight': 'balanced'}
    else:
        raise TypeError("Invalid input argument; should be 'balance'")

Build, train, and visualize the classifier using training data:

classifier = ExtraTreesClassifier(**params) 
classifier.fit(X_train, y_train)
visualize_classifier(classifier, X_train, y_train, 'Training dataset')

Predict the output for test dataset and visualize the output:

y_test_pred = classifier.predict(X_test) 
visualize_classifier(classifier, X_test, y_test, 'Test dataset')

Compute the performance of the classifier and print the classification report:

# Evaluate classifier performance 
class_names = ['Class-0', 'Class-1']
print("\n" + "#"*40)
print("\nClassifier performance on training dataset\n")
print(classification_report(y_train, classifier.predict(X_train), target_names=class_names))
print("#"*40 + "\n")

print("#"*40)
print("\nClassifier performance on test dataset\n")
print(classification_report(y_test, y_test_pred, target_names=class_names))
print("#"*40 + "\n")

plt.show()

The full code is given in the file class_imbalance.py. If you run the code, you will see a few screenshots. The first screenshot shows the input data:

The second screenshot shows the classifier boundary for the test dataset:

The preceding screenshot indicates that the boundary was not able to capture the actual boundary between the two classes. The black patch near the top represents the boundary. You should see the following output on your Terminal:

You see a warning because the values are 0 in the first row, which leads to a divide-by-zero problem when the F1 score is computed. Run the code on the terminal with the -W ignore flag so that you do not see the divide-by-zero warning:

$ python3 -W ignore class_imbalance.py

Now if you want to account for class imbalance, run it with the balance flag:

$ python3 class_imbalance.py balance

The classifier output looks like this:

You should see the following output on your Terminal:

By accounting for the class imbalance, we were able to classify the data points in Class-0 with non-zero accuracy.

Finding optimal training parameters using a grid search

When you are working with classifiers, you do not always know what the best parameters are. You cannot brute-force it by checking for all possible combinations manually. This is where grid search becomes useful. Grid search allows us to specify a range of values and the classifier will automatically run various configurations to figure out the best combination of parameters. Let’s see how to do it.

Create a new Python file and import the following packages:

import numpy as np 
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
from sklearn import cross_validation, grid_search
from sklearn.ensemble import ExtraTreesClassifier

from utilities import visualize_classifier

We will use the data available in data_random_forests.txt for analysis:

# Load input data 
input_file = 'data_random_forests.txt'
data = np.loadtxt(input_file, delimiter=',')
X, y = data[:, :-1], data[:, -1]

Separate the data into three classes:

# Separate input data into three classes based on labels 
class_0 = np.array(X[y==0])
class_1 = np.array(X[y==1])
class_2 = np.array(X[y==2])

Split the data into training and testing datasets:

# Split the data into training and testing datasets 
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
X, y, test_size=0.25, random_state=5)

Specify the grid of parameters that you want the classifier to test. Usually, we keep one parameter constant and vary the other, and then repeat the process the other way around to figure out the best combination. In this case, we want to find the best values for n_estimators and max_depth. Let’s specify the parameter grid:

# Define the parameter grid 
parameter_grid = [ {'n_estimators': [100], 'max_depth': [2, 4, 7, 12, 16]},
{'max_depth': [4], 'n_estimators': [25, 50, 100, 250]}
]

Let’s define the metrics that the classifier should use to find the best combination of parameters:

metrics = ['precision_weighted', 'recall_weighted']

For each metric, we need to run the grid search, where we train the classifier for a particular combination of parameters:

for metric in metrics:
    print("\n##### Searching optimal parameters for", metric)

    classifier = grid_search.GridSearchCV(
            ExtraTreesClassifier(random_state=0),
            parameter_grid, cv=5, scoring=metric)
    classifier.fit(X_train, y_train)

Print the score for each parameter combination:

print("\nGrid scores for the parameter grid:") 
for params, avg_score, _ in classifier.grid_scores_:
print(params, '-->', round(avg_score, 3))

print("\nBest parameters:", classifier.best_params_)

Print the performance report:

    y_pred = classifier.predict(X_test)
    print("\nPerformance report:\n")
    print(classification_report(y_test, y_pred))

The full code is given in the file run_grid_search.py. If you run the code, you will get this output on the Terminal for the precision metric:

Based on the combinations in the grid search, it will print out the best combination for the precision metric. If we want to know the best combination for recall, we need to check the following output on the Terminal:

It is a different combination for recall, which makes sense because precision and recall are different metrics that demand different parameter combinations.

Computing relative feature importance

When we are working with a dataset that contains N-dimensional data points, we have to understand that not all features are equally important. Some are more discriminative than others. If we have this information, we can use it to reduce the dimensionality. This is very useful in reducing the complexity and increasing the speed of the algorithm. Sometimes, a few features are completely redundant, so they can be easily removed from the dataset.
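As a minimal, self-contained sketch (synthetic data and an arbitrary choice of k, not from the original code), this is how the feature_importances_ attribute of a fitted ensemble can be used to keep only the most informative columns:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic regression data where only a few features carry signal
X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
        random_state=0)
model = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, y)

k = 3                                               # keep the top-3 features
top_k = np.argsort(model.feature_importances_)[::-1][:k]
X_reduced = X[:, top_k]                             # reduced-dimensionality data
print('Selected feature indices:', top_k)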

We will be using the AdaBoost regressor to compute feature importance. AdaBoost, short for Adaptive Boosting, is an algorithm that’s frequently used in conjunction with other machine learning algorithms to improve their performance. In AdaBoost, the training data points are drawn from a distribution to train the current classifier. This distribution is updated iteratively so that subsequent classifiers focus on the more difficult data points, that is, the ones that were misclassified. Data points that were previously misclassified are therefore more likely to appear in the next sample dataset used for training. These classifiers are then cascaded and the final decision is taken through weighted majority voting.
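The reweighting loop described above can be sketched in a few lines. This toy version (binary labels in {-1, +1} and decision stumps as the weak learners) is a simplified illustration of the idea, not scikit-learn’s implementation:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):
    # y is expected to contain labels -1 and +1
    n = len(y)
    weights = np.full(n, 1.0 / n)          # start with a uniform distribution
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        err = np.sum(weights[pred != y]) / np.sum(weights)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # learner's vote
        # Misclassified points get larger weights in the next round
        weights *= np.exp(-alpha * y * pred)
        weights /= weights.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Weighted majority vote over the cascaded weak learners
    votes = sum(alpha * stump.predict(X) for stump, alpha in zip(stumps, alphas))
    return np.sign(votes)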

Create a new Python file and import the following packages:

import numpy as np 
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn import datasets
from sklearn.metrics import mean_squared_error, explained_variance_score
from sklearn import cross_validation
from sklearn.utils import shuffle

We will use the inbuilt housing dataset available in scikit-learn:

# Load housing data 
housing_data = datasets.load_boston()

Shuffle the data so that we don’t bias our analysis:

# Shuffle the data 
X, y = shuffle(housing_data.data, housing_data.target, random_state=7)

Split the dataset into training and testing:

# Split data into training and testing datasets 
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
X, y, test_size=0.2, random_state=7)

Define and train an AdaBoost regressor using the Decision Tree regressor as the individual model:

# AdaBoost Regressor model 
regressor = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
n_estimators=400, random_state=7)
regressor.fit(X_train, y_train)

Estimate the performance of the regressor:

# Evaluate performance of AdaBoost regressor 
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
evs = explained_variance_score(y_test, y_pred )
print("\nADABOOST REGRESSOR")
print("Mean squared error =", round(mse, 2))
print("Explained variance score =", round(evs, 2))

This regressor exposes the relative feature importances through an inbuilt attribute:

# Extract feature importances 
feature_importances = regressor.feature_importances_
feature_names = housing_data.feature_names

Normalize the values of the relative feature importance:

# Normalize the importance values 
feature_importances = 100.0 * (feature_importances / max(feature_importances))

Sort them so that they can be plotted:

# Sort the values and flip them 
index_sorted = np.flipud(np.argsort(feature_importances))

Arrange the ticks on the X-axis for the bar graph:

# Arrange the X ticks 
pos = np.arange(index_sorted.shape[0]) + 0.5

Plot the bar graph:

# Plot the bar graph 
plt.figure()
plt.bar(pos, feature_importances[index_sorted], align='center')
plt.xticks(pos, feature_names[index_sorted])
plt.ylabel('Relative Importance')
plt.title('Feature importance using AdaBoost regressor')
plt.show()

The full code is given in the file feature_importance.py. If you run the code, you should see the following output:

According to this analysis, the feature LSTAT is the most important feature in that dataset.

Predicting traffic using Extremely Random Forest regressor

Let’s apply the concepts we learned in the previous sections to a real-world problem. We will be using the dataset available at https://archive.ics.uci.edu/ml/datasets/Dodgers+Loop+Sensor. This dataset counts the number of vehicles passing by on the road during baseball games played at the Los Angeles Dodgers stadium. In order to make the data readily available for analysis, we need to pre-process it. The pre-processed data is in the file traffic_data.txt. In this file, each line contains comma-separated strings. Let’s take the first line as an example:

Tuesday,00:00,San Francisco,no,3

With reference to the preceding line, it is formatted as follows:

Day of the week, time of the day, opponent team, a binary value indicating whether or not a baseball game is currently going on (yes/no), number of vehicles passing by.

Our goal is to predict the number of vehicles going by using the given information. Since the output variable is continuous-valued, we need to build a regressor that can predict the output. We will be using Extremely Random Forests to build this regressor. Let’s go ahead and see how to do that.

Create a new Python file and import the following packages:

import numpy as np 
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, mean_absolute_error
from sklearn import cross_validation, preprocessing
from sklearn.ensemble import ExtraTreesRegressor

Load the data in the file traffic_data.txt:

# Load input data 
input_file = 'traffic_data.txt'
data = []
with open(input_file, 'r') as f:
    for line in f.readlines():
        items = line[:-1].split(',')
        data.append(items)

data = np.array(data)

We need to encode the non-numerical features in the data. We also need to ensure that we don’t encode numerical features. Each feature that needs to be encoded needs to have a separate label encoder. We need to keep track of these encoders because we will need them when we want to compute the output for an unknown data point. Let’s create those label encoders:

# Convert string data to numerical data 
label_encoder = []
X_encoded = np.empty(data.shape)
for i, item in enumerate(data[0]):
    if item.isdigit():
        X_encoded[:, i] = data[:, i]
    else:
        label_encoder.append(preprocessing.LabelEncoder())
        X_encoded[:, i] = label_encoder[-1].fit_transform(data[:, i])

X = X_encoded[:, :-1].astype(int)
y = X_encoded[:, -1].astype(int)

Split the data into training and testing datasets:

# Split data into training and testing datasets 
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
X, y, test_size=0.25, random_state=5)

Train an Extremely Random Forest regressor:

# Extremely Random Forests regressor 
params = {'n_estimators': 100, 'max_depth': 4, 'random_state': 0}
regressor = ExtraTreesRegressor(**params)
regressor.fit(X_train, y_train)

Compute the performance of the regressor on testing data:

# Compute the regressor performance on test data 
y_pred = regressor.predict(X_test)
print("Mean absolute error:", round(mean_absolute_error(y_test, y_pred), 2))

Let’s see how to compute the output for an unknown data point. We will be using those label encoders to convert non-numerical features into numerical values:

# Testing encoding on single data instance 
test_datapoint = ['Saturday', '10:20', 'Atlanta', 'no']
test_datapoint_encoded = [-1] * len(test_datapoint)
count = 0
for i, item in enumerate(test_datapoint):
    if item.isdigit():
        test_datapoint_encoded[i] = int(test_datapoint[i])
    else:
        test_datapoint_encoded[i] = int(label_encoder[count].transform([test_datapoint[i]])[0])
        count = count + 1

test_datapoint_encoded = np.array(test_datapoint_encoded)

Predict the output:

# Predict the output for the test datapoint 
print("Predicted traffic:", int(regressor.predict([test_datapoint_encoded])[0]))

The full code is given in the file traffic_prediction.py. If you run the code, you will get 26 as the output, which is pretty close to the actual value. You can confirm this from the data file.

Source: Artificial Intelligence on Medium
