### Blog: Detecting Patterns with Unsupervised Learning

Unsupervised learning refers to the process of building machine learning models without using labelled training data. Unsupervised learning finds applications in diverse fields of study, including market segmentation, stock markets, natural language processing, computer vision, and so on.

In the previous articles, we were dealing with data that had labels associated with it. When we have labelled training data, the algorithms learn to classify data based on those labels. In the real world, we might not always have access to labelled data. Sometimes, we just have a lot of data and we need to categorize it in some way. This is where unsupervised learning comes into the picture. Unsupervised learning algorithms attempt to build learning models that can find subgroups within the given dataset using some similarity metric.

Let’s see how we formulate the learning problem in unsupervised learning. When we have a dataset without any labels, we assume that the data is generated because of latent variables that govern the distribution in some way. The process of learning can then proceed in a hierarchical manner, starting from the individual data points. We can build deeper levels of representation for the data.

#### Clustering data with K-Means algorithm

Clustering is one of the most popular unsupervised learning techniques. This technique is used to analyze data and find clusters within that data. To find these clusters, we use a similarity measure, such as Euclidean distance, to identify the subgroups. This similarity measure can estimate the tightness of a cluster. We can say that clustering is the process of organizing our data into subgroups whose elements are similar to each other.

Our goal is to identify the intrinsic properties of data points that make them belong to the same subgroup. There is no universal similarity metric that works for all the cases. It depends on the problem at hand. For example, we might be interested in finding the representative data point for each subgroup or we might be interested in finding the outliers in our data. Depending on the situation, we will end up choosing the appropriate metric.
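
To make the point about metrics concrete, here is a small sketch that compares three common measures on the same pair of points. The function names and the sample vectors are our own, chosen only for illustration:

```python
import numpy as np

def euclidean(a, b):
    """Straight-line distance between two points."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def manhattan(a, b):
    """Sum of the absolute coordinate differences."""
    return float(np.sum(np.abs(a - b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(euclidean(a, b), manhattan(a, b), cosine_similarity(a, b))
```

The same two points are about 1.41 apart under Euclidean distance and 2 apart under Manhattan distance, while their cosine similarity is 0, so the choice of metric changes what "similar" means.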

The K-Means algorithm is a well-known algorithm for clustering data. In order to use this algorithm, we need to assume that the number of clusters is known beforehand. We then segment the data into K subgroups using various data attributes. We start by fixing the number of clusters and grouping our data based on that. The central idea here is that we need to update the locations of these K centroids with each iteration. We continue iterating until we have placed the centroids at their optimal locations.

We can see that the initial placement of centroids plays an important role in the algorithm. These centroids should be placed in a clever manner because this directly impacts the results. A good strategy is to place them as far away from each other as possible. The basic K-Means algorithm places these centroids randomly, whereas `K-Means++` chooses these points algorithmically from the input list of data points. It tries to place the initial centroids far from each other so that the algorithm converges quickly. We then go through our training dataset and assign each data point to the closest centroid.
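
The seeding idea can be sketched in a few lines of NumPy. This is a simplified illustration rather than the scikit-learn implementation; the function name `kmeans_pp_init` and the toy data are assumptions made for this example. Each new centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far, which favours points that are far from the existing centroids:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids from the rows of X, K-Means++ style."""
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = np.min(np.sum(diffs ** 2, axis=2), axis=1)
        # Sample the next centroid with probability proportional to d2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
init_centroids = kmeans_pp_init(X_demo, k=2)
print(init_centroids)
```

A point that has already been chosen has distance zero to itself, so it cannot be picked twice.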

Once we go through the entire dataset, we say that the first iteration is over. We have grouped the points based on the initialized centroids. We now need to recalculate the location of the centroids based on the new clusters that we obtain at the end of the first iteration. Once we obtain the new set of K centroids, we repeat the process again, where we iterate through the dataset and assign each point to the closest centroid.

As we keep repeating these steps, the centroids keep moving to their equilibrium position. After a certain number of iterations, the centroids do not change their locations anymore. This means that we have arrived at the final locations of the centroids. These K centroids are the final K Means that will be used for inference.
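
The assign-and-update loop described above can be written out directly. The following is a minimal sketch, assuming Euclidean distance and a hand-picked starting guess; `kmeans_lloyd` and the toy data are names we made up for this example:

```python
import numpy as np

def kmeans_lloyd(X, centroids, max_iter=100):
    """Alternate assignment and centroid-update steps until nothing moves."""
    centroids = centroids.astype(float).copy()
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            if np.any(labels == j):  # leave empty clusters where they are
                new_centroids[j] = X[labels == j].mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
start = np.array([[0.0, 0.0], [2.0, 2.0]])  # deliberately poor starting guess
final_centroids, final_labels = kmeans_lloyd(X_demo, start)
print(final_centroids, final_labels)
```

Even from the poor starting guess, the centroids settle at the mean of each pair of points after a couple of iterations.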

Let’s apply K-Means clustering on two-dimensional data to see how it works. We will be using the data in the `data_clustering.txt` file provided to you. Each line contains two comma-separated numbers.

Create a new Python file and import the following packages:

import numpy as np

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

from sklearn import metrics

Load the input data from the file:

# Load input data

X = np.loadtxt('data_clustering.txt', delimiter=',')

We need to define the number of clusters before we can apply the K-Means algorithm:

num_clusters = 5

Visualize the input data to see what the spread looks like:

# Plot input data

plt.figure()

plt.scatter(X[:,0], X[:,1], marker='o', facecolors='none',

edgecolors='black', s=80)

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1

y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

plt.title('Input data')

plt.xlim(x_min, x_max)

plt.ylim(y_min, y_max)

plt.xticks(())

plt.yticks(())

We can visually see that there are five groups within this data. Create the `KMeans` object using the initialization parameters. The `init` parameter specifies the method used to select the initial centres of the clusters. Instead of selecting them randomly, we use `k-means++` to select these centres in a smarter way, which helps the algorithm converge quickly. The `n_clusters` parameter refers to the number of clusters. The `n_init` parameter refers to the number of times the algorithm is run with different initial centroids before the best outcome is kept:

# Create KMeans object

kmeans = KMeans(init='k-means++', n_clusters=num_clusters, n_init=10)

Train the K-Means model with the input data:

# Train the KMeans clustering model

kmeans.fit(X)

To visualize the boundaries, we need to create a grid of points and evaluate the model on all those points. Let’s define the step size of this grid:

# Step size of the mesh

step_size = 0.01

We define the grid of points and ensure that we are covering all the values in our input data:

# Define the grid of points to plot the boundaries

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1

y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

x_vals, y_vals = np.meshgrid(np.arange(x_min, x_max, step_size),

np.arange(y_min, y_max, step_size))

Predict the outputs for all the points on the grid using the trained K-Means model:

# Predict output labels for all the points on the grid

output = kmeans.predict(np.c_[x_vals.ravel(), y_vals.ravel()])
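
To see what the grid evaluation does, here is a small stand-alone sketch on a coarse grid. The two centroids are hypothetical stand-ins for a trained model's cluster centres, and the nearest-centroid rule mimics what a K-Means prediction does for each grid point:

```python
import numpy as np

# Hypothetical centroids standing in for a trained model's cluster centers
centers = np.array([[0.0, 0.0], [4.0, 4.0]])

# Coarse 5 x 5 grid over the square [0, 4] x [0, 4]
x_vals, y_vals = np.meshgrid(np.arange(0, 5, 1.0), np.arange(0, 5, 1.0))

# Flatten both coordinate arrays and pair them up: one row per grid point
grid_points = np.c_[x_vals.ravel(), y_vals.ravel()]

# Label every grid point with the index of its nearest centroid
dists = np.linalg.norm(grid_points[:, None, :] - centers[None, :, :], axis=2)
grid_labels = np.argmin(dists, axis=1).reshape(x_vals.shape)
print(grid_labels)
```

Reshaping the labels back to the shape of the grid is what lets us draw them as a coloured image.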

Plot all output values and colour each region:

# Plot different regions and color them

output = output.reshape(x_vals.shape)

plt.figure()

plt.clf()

plt.imshow(output, interpolation='nearest',

extent=(x_vals.min(), x_vals.max(),

y_vals.min(), y_vals.max()),

cmap=plt.cm.Paired,

aspect='auto',

origin='lower')

Overlay input data points on top of these coloured regions:

# Overlay input points

plt.scatter(X[:,0], X[:,1], marker='o', facecolors='none',

edgecolors='black', s=80)

Plot the centres of the clusters obtained using the K-Means algorithm:

# Plot the centers of clusters

cluster_centers = kmeans.cluster_centers_

plt.scatter(cluster_centers[:,0], cluster_centers[:,1],

marker='o', s=210, linewidths=4, color='black',

zorder=12, facecolors='black')

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1

y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

plt.title('Boundaries of clusters')

plt.xlim(x_min, x_max)

plt.ylim(y_min, y_max)

plt.xticks(())

plt.yticks(())

plt.show()

The full code is given in the `kmeans.py` file. If you run the code, you will see two figures. The first shows the input data, and the second shows the boundaries obtained using K-Means. The black filled circle at the centre of each cluster represents the centroid of that cluster.

#### Estimating the number of clusters with Mean Shift algorithm

Mean Shift is a powerful algorithm used in unsupervised learning. It is a non-parametric algorithm used frequently for clustering. It is non-parametric because it does not make any assumptions about the underlying distributions. This is in contrast to parametric techniques, where we assume that the underlying data follows a standard probability distribution. Mean Shift finds a lot of applications in fields like object tracking and real-time data analysis.

In the Mean Shift algorithm, we consider the whole feature space as a probability density function. We start with the training dataset and assume that it has been sampled from a probability density function. In this framework, the clusters correspond to the local maxima of the underlying distribution. If there are K clusters, then there are K peaks in the underlying data distribution and Mean Shift will identify those peaks.

The goal of Mean Shift is to identify the location of centroids. For each data point in the training dataset, it defines a window around it. It then computes the centroid for this window and updates the location to this new centroid. It then repeats the process for this new location by defining a window around it. As we keep doing this, we move closer to the peak of the cluster. Each data point will move towards the cluster it belongs to. The movement is towards a region of higher density.

We keep shifting the centroids, also called means, towards the peaks of each cluster. Since we keep shifting the means, it is called Mean Shift! We keep doing this until the algorithm converges, at which stage the centroids don’t move anymore.
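
Here is a minimal sketch of that shifting procedure for a single starting point. It assumes a flat (uniform) window of fixed radius, which is a simplification; `shift_to_mode` and the toy data are our own names for this illustration:

```python
import numpy as np

def shift_to_mode(point, X, bandwidth, max_iter=100, tol=1e-6):
    """Move `point` uphill using a flat window of radius `bandwidth`."""
    point = np.asarray(point, dtype=float)
    for _ in range(max_iter):
        # Points that fall inside the window around the current location
        in_window = np.linalg.norm(X - point, axis=1) <= bandwidth
        # The centroid of the window becomes the new location
        new_point = X[in_window].mean(axis=0)
        if np.linalg.norm(new_point - point) < tol:
            break
        point = new_point
    return point

X_demo = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2],
                   [5.0, 5.0], [5.2, 5.0], [5.0, 5.2], [5.2, 5.2]])
mode_a = shift_to_mode(X_demo[0], X_demo, bandwidth=1.0)
mode_b = shift_to_mode(X_demo[4], X_demo, bandwidth=1.0)
print(mode_a, mode_b)
```

Points that start in the same dense region end up at the same location, and that shared location is what identifies a cluster.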

Let’s see how to use `MeanShift` to estimate the optimal number of clusters in the given dataset. We will be using data in the `data_clustering.txt` file for analysis. It is the same file we used in the K-Means section.

Create a new Python file and import the following packages:

import numpy as np

import matplotlib.pyplot as plt

from sklearn.cluster import MeanShift, estimate_bandwidth

from itertools import cycle

Load input data:

# Load data from input file

X = np.loadtxt('data_clustering.txt', delimiter=',')

Estimate the bandwidth of the input data. Bandwidth is a parameter of the underlying kernel density estimation process used in Mean Shift algorithm. The bandwidth affects the overall convergence rate of the algorithm and the number of clusters that we will end up with in the end. Hence this is a crucial parameter. If the bandwidth is small, it might result in too many clusters, whereas if the value is large, then it will merge distinct clusters.

The `quantile` parameter impacts how the bandwidth is estimated. A higher value for quantile will increase the estimated bandwidth, resulting in fewer clusters:

# Estimate the bandwidth of X

bandwidth_X = estimate_bandwidth(X, quantile=0.1, n_samples=len(X))
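
To build some intuition for why a higher quantile gives a larger bandwidth, here is a rough NumPy sketch that is loosely modelled on a nearest-neighbour estimate. It is not meant as a drop-in replacement for `estimate_bandwidth`; the function `rough_bandwidth` and the random data are assumptions for this example:

```python
import numpy as np

def rough_bandwidth(X, quantile):
    """Average distance from each point to its k-th nearest neighbour,
    counting the point itself, where k is `quantile` times the number
    of points (at least 1)."""
    n = len(X)
    k = max(1, int(n * quantile))
    # Pairwise Euclidean distances, sorted along each row
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dists.sort(axis=1)
    return float(dists[:, k - 1].mean())

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
bw_small = rough_bandwidth(X_demo, quantile=0.1)
bw_large = rough_bandwidth(X_demo, quantile=0.8)
print(bw_small, bw_large)
```

A larger quantile looks further out into each point's neighbourhood, so the averaged distance, and with it the window used by Mean Shift, grows.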

Let’s train the Mean Shift clustering model using the estimated bandwidth:

# Cluster data with MeanShift

meanshift_model = MeanShift(bandwidth=bandwidth_X, bin_seeding=True)

meanshift_model.fit(X)

Extract the centres of all the clusters:

# Extract the centers of clusters

cluster_centers = meanshift_model.cluster_centers_

print('\nCenters of clusters:\n', cluster_centers)

Extract the number of clusters:

# Estimate the number of clusters

labels = meanshift_model.labels_

num_clusters = len(np.unique(labels))

print("\nNumber of clusters in input data =", num_clusters)

Visualize the data points:

# Plot the points and cluster centers

plt.figure()

markers = 'o*xvs'

for i, marker in zip(range(num_clusters), markers):

    # Plot points that belong to the current cluster

    plt.scatter(X[labels==i, 0], X[labels==i, 1], marker=marker, color='black')

Plot the centre of the current cluster:

    # Plot the cluster center

    cluster_center = cluster_centers[i]

    plt.plot(cluster_center[0], cluster_center[1], marker='o',

            markerfacecolor='black', markeredgecolor='black',

            markersize=15)

plt.title('Clusters')

plt.show()

The full code is given in the `mean_shift.py` file. If you run the code, you will see a figure showing the clusters and their centres. The Terminal output lists the centres of the clusters and the number of clusters found.

#### Estimating the quality of clustering with silhouette scores

If the data is naturally organized into a number of distinct clusters, then it is easy to visually examine it and draw some inferences. But this is rarely the case in the real world. The data in the real world is huge and messy. So we need a way to quantify the quality of the clustering.

Silhouette refers to a method used to check the consistency of clusters in our data. It gives an estimate of how well each data point fits with its cluster. The silhouette score is a metric that measures how similar a data point is to its own cluster, as compared to other clusters. The silhouette score works with any similarity metric.

For each data point, the silhouette score is computed using the following formula:

silhouette score = (p - q) / max(p, q)

Here, *p* is the mean distance to the points in the nearest cluster that the data point is not a part of, and *q* is the mean intra-cluster distance to all the points in its own cluster.

The silhouette score lies between *-1* and *1*. A score closer to *1* indicates that the data point is very similar to other data points in its cluster, whereas a score closer to *-1* indicates that the data point is not similar to the data points in its cluster. One way to think about it is that if too many points have negative silhouette scores, then we may have too few or too many clusters in our data. We then need to run the clustering algorithm again to find the optimal number of clusters.
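
The formula is easy to check by hand on a tiny dataset. The sketch below computes the score for every point using Euclidean distance; `silhouette_values` and the four sample points are our own choices for this illustration, and a point that is alone in its cluster is given a score of 0 by convention:

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouette values using Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    values = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        own[i] = False                 # exclude the point itself
        if not own.any():              # singleton cluster: keep the score at 0
            continue
        q = dists[i, own].mean()       # mean intra-cluster distance
        # p: smallest mean distance to the points of any other cluster
        p = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        values[i] = (p - q) / max(p, q)
    return values

X_demo = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_demo = np.array([0, 0, 1, 1])
scores_demo = silhouette_values(X_demo, labels_demo)
print(scores_demo, scores_demo.mean())
```

For the point at 0, q is 1 and p is 10.5, which gives (10.5 - 1) / 10.5, or roughly 0.90. All four points score close to 1 because the two groups are tight and far apart.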

Let’s see how to estimate the clustering performance using silhouette scores. Create a new Python file and import the following packages:

import numpy as np

import matplotlib.pyplot as plt

from sklearn import metrics

from sklearn.cluster import KMeans

We will be using the data in the `data_quality.txt` file provided to you. Each line contains two comma-separated numbers:

# Load data from input file

X = np.loadtxt('data_quality.txt', delimiter=',')

Initialize the variables. The `values` array contains the candidate numbers of clusters that we want to iterate over to find the optimal one:

# Initialize variables

scores = []

values = np.arange(2, 10)

Iterate through all the values and build a K-Means model during each iteration:

# Iterate through the defined range

for num_clusters in values:

    # Train the KMeans clustering model

    kmeans = KMeans(init='k-means++', n_clusters=num_clusters, n_init=10)

    kmeans.fit(X)

Estimate the silhouette score for the current clustering model using the Euclidean distance metric:

    score = metrics.silhouette_score(X, kmeans.labels_,

                metric='euclidean', sample_size=len(X))

Print the silhouette score for the current value:

    print("\nNumber of clusters =", num_clusters)

    print("Silhouette score =", score)

    scores.append(score)

Visualize the silhouette scores for various values:

# Plot silhouette scores

plt.figure()

plt.bar(values, scores, width=0.7, color='black', align='center')

plt.title('Silhouette score vs number of clusters')

Extract the best score and the corresponding value for the number of clusters:

# Extract best score and optimal number of clusters

num_clusters = np.argmax(scores) + values[0]

print('\nOptimal number of clusters =', num_clusters)
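
Adding `values[0]` to the index returned by `np.argmax` works here because the candidate values are consecutive integers. A slightly more general way to express the same lookup is to index into `values` directly, as in this sketch (the scores below are made up purely for illustration):

```python
import numpy as np

values = np.arange(2, 10)              # candidate cluster counts: 2, 3, ..., 9
# Made-up scores purely for illustration, one per candidate
scores = [0.41, 0.52, 0.48, 0.55, 0.61, 0.50, 0.47, 0.44]

best_index = int(np.argmax(scores))    # position of the highest score
best_k = int(values[best_index])       # look the cluster count up directly
print(best_k)
```

Indexing keeps working even if the candidates are not consecutive, for example if we only tried even numbers of clusters.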

Visualize input data:

# Plot data

plt.figure()

plt.scatter(X[:,0], X[:,1], color='black', s=80, marker='o', facecolors='none')

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1

y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

plt.title('Input data')

plt.xlim(x_min, x_max)

plt.ylim(y_min, y_max)

plt.xticks(())

plt.yticks(())

plt.show()

The full code is given in the file `clustering_quality.py`. If you run the code, you will see two figures. The first shows the input data, where we can see that there are six clusters. The second shows the scores for the various values of the number of clusters. We can verify that the silhouette score peaks at the value of *6*, which is consistent with our data. The Terminal output lists the silhouette score for each candidate number of clusters, followed by the optimal number of clusters.

#### What are Gaussian Mixture Models?

Before we discuss Gaussian Mixture Models (GMMs), let’s understand what Mixture Models are. A Mixture Model is a type of probability density model where we assume that the data is governed by a number of component distributions. If these distributions are Gaussian, then the model becomes a Gaussian Mixture Model. These component distributions are combined in order to provide a multi-modal density function, which becomes a mixture model.
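
As a small illustration of how components combine into a multi-modal density, the sketch below builds a one-dimensional mixture of two Gaussians. The weights, means, and standard deviations are arbitrary values picked for this example:

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def mixture_pdf(x, weights, means, stds):
    """Weighted sum of the Gaussian component densities."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))

# Illustrative parameters: two components whose weights sum to 1
weights, means, stds = [0.3, 0.7], [-2.0, 3.0], [0.5, 1.0]

x = np.linspace(-10.0, 10.0, 20001)
density = mixture_pdf(x, weights, means, stds)
# Riemann-sum approximation of the integral of the density
area = float(np.sum(density) * (x[1] - x[0]))
print(area)
```

Because the weights sum to 1, the combined curve is still a valid density (its area is approximately 1), but it now has two peaks instead of one.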

Let’s look at an example to understand how Mixture Models work. We want to model the shopping habits of all the people in South America. One way to do it would be to model the whole continent and fit everything into a single model. But we know that people in different countries shop differently. We need to understand how people in individual countries shop and how they behave.

If we want to get a good representative model, we need to account for all the variations within the continent. In this case, we can use mixture models to model the shopping habits of individual countries and then combine all of them into a Mixture Model. This way, we are not missing the nuances of the underlying behaviour of individual countries. By not enforcing a single model on all the countries, we are able to extract a more accurate model.

An interesting thing to note is that mixture models are semi-parametric, which means that they are partially dependent on a set of predefined functions. They are able to provide greater precision and flexibility in modelling the underlying distributions of our data. They can smooth the gaps that result from having sparse data.

If we define the function, then the mixture model goes from being semi-parametric to parametric. Hence a GMM is a parametric model represented as a weighted summation of component Gaussian functions. We assume that the data is being generated by a set of Gaussian models that are combined in some way. GMMs are very powerful and are used across many fields. The parameters of the GMM are estimated from training data using algorithms like Expectation-Maximization (EM) or Maximum A-Posteriori (MAP) estimation. Some of the popular applications of GMM include image database retrieval, modelling stock market fluctuations, biometric verification, and so on.
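
To give a feel for how Expectation-Maximization estimates these parameters, here is a minimal sketch for a one-dimensional mixture with two components. The synthetic data, the starting values, and the function name `em_step` are all assumptions for this illustration; real implementations add safeguards that are left out here:

```python
import numpy as np

def em_step(x, weights, means, stds):
    """One Expectation-Maximization update for a 1-D Gaussian mixture."""
    # E-step: responsibility of each component for each sample
    dens = np.array([w * np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
                     for w, m, s in zip(weights, means, stds)])
    resp = dens / dens.sum(axis=0)
    # M-step: re-estimate the parameters from the weighted samples
    nk = resp.sum(axis=1)
    new_weights = nk / len(x)
    new_means = (resp * x).sum(axis=1) / nk
    new_stds = np.sqrt((resp * (x - new_means[:, None]) ** 2).sum(axis=1) / nk)
    return new_weights, new_means, new_stds

# Synthetic data drawn from two Gaussians
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

# Rough starting guess, refined by repeated EM updates
weights = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])
stds = np.array([1.0, 1.0])
for _ in range(50):
    weights, means, stds = em_step(x, weights, means, stds)
print(weights, means, stds)
```

Each pass first asks how strongly every component explains every sample, and then updates the weights, means, and spreads to match those soft assignments.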

#### Building a classifier based on Gaussian Mixture Models

Let’s build a classifier based on a Gaussian Mixture Model. Create a new Python file and import the following packages:

import numpy as np

import matplotlib.pyplot as plt

from matplotlib import patches

from sklearn import datasets

from sklearn.mixture import GMM

from sklearn.cross_validation import StratifiedKFold

Let’s use the iris dataset available in scikit-learn for analysis:

# Load the iris dataset

iris = datasets.load_iris()

Split the dataset into training and testing using an 80/20 split. The `n_folds` parameter specifies the number of subsets you’ll obtain. We are using a value of 5, which means the dataset will be split into five parts. We will use four parts for training and one part for testing, which gives a split of 80/20:

# Split dataset into training and testing (80/20 split)

indices = StratifiedKFold(iris.target, n_folds=5)

Extract the training data:

# Take the first fold

train_index, test_index = next(iter(indices))

# Extract training data and labels

X_train = iris.data[train_index]

y_train = iris.target[train_index]

# Extract testing data and labels

X_test = iris.data[test_index]

y_test = iris.target[test_index]

Extract the number of classes in the training data:

# Extract the number of classes

num_classes = len(np.unique(y_train))

Build a GMM-based classifier using the relevant parameters. The `n_components` parameter specifies the number of components in the underlying distribution. In this case, it will be the number of distinct classes in our data. We need to specify the type of covariance to use. In this case, we will be using full covariance. The `init_params` parameter controls which parameters are initialized before training begins. We have used `wc`, which means the weights and covariances will be initialized; the means are left out because we set them ourselves in the next step. The `n_iter` parameter refers to the number of Expectation-Maximization iterations that will be performed during training:

# Build GMM

classifier = GMM(n_components=num_classes, covariance_type='full',

init_params='wc', n_iter=20)

Initialize the means of the classifier:

# Initialize the GMM means

classifier.means_ = np.array([X_train[y_train == i].mean(axis=0)

for i in range(num_classes)])

Train the Gaussian mixture model classifier using the training data:

# Train the GMM classifier

classifier.fit(X_train)

Visualize the boundaries of the classifier. We will extract the eigenvalues and eigenvectors to estimate how to draw the elliptical boundaries around the clusters. If you need a quick refresher on eigenvalues and eigenvectors, please refer to http://bit.ly/2YeNeQV. Let’s go ahead and plot:

# Draw boundaries

plt.figure()

colors = 'bgr'

for i, color in enumerate(colors):

    # Extract eigenvalues and eigenvectors

    eigenvalues, eigenvectors = np.linalg.eigh(

            classifier._get_covars()[i][:2, :2])

Normalize the first eigenvector:

    # Normalize the first eigenvector

    norm_vec = eigenvectors[0] / np.linalg.norm(eigenvectors[0])

The ellipses need to be rotated to accurately show the distribution. Estimate the angle:

    # Extract the angle of tilt

    angle = np.arctan2(norm_vec[1], norm_vec[0])

    angle = 180 * angle / np.pi

Magnify the ellipses for visualization. The eigenvalues control the size of the ellipses:

    # Scaling factor to magnify the ellipses

    # (arbitrary value chosen to suit our needs)

    scaling_factor = 8

    eigenvalues *= scaling_factor

Draw the ellipses:

    # Draw the ellipse

    ellipse = patches.Ellipse(classifier.means_[i, :2],

            eigenvalues[0], eigenvalues[1], 180 + angle,

            color=color)

    axis_handle = plt.subplot(1, 1, 1)

    ellipse.set_clip_box(axis_handle.bbox)

    ellipse.set_alpha(0.6)

    axis_handle.add_artist(ellipse)

Overlay input data on the figure:

# Plot the data

colors = 'bgr'

for i, color in enumerate(colors):

    cur_data = iris.data[iris.target == i]

    plt.scatter(cur_data[:,0], cur_data[:,1], marker='o',

            facecolors='none', edgecolors='black', s=40,

            label=iris.target_names[i])

Overlay test data on this figure:

    test_data = X_test[y_test == i]

    plt.scatter(test_data[:,0], test_data[:,1], marker='s',

            facecolors='black', edgecolors='black', s=40,

            label=iris.target_names[i])

Compute the predicted output for training and testing data:

# Compute predictions for training and testing data

y_train_pred = classifier.predict(X_train)

accuracy_training = np.mean(y_train_pred.ravel() == y_train.ravel()) * 100

print('Accuracy on training data =', accuracy_training)

y_test_pred = classifier.predict(X_test)

accuracy_testing = np.mean(y_test_pred.ravel() == y_test.ravel()) * 100

print('Accuracy on testing data =', accuracy_testing)

plt.title('GMM classifier')

plt.xticks(())

plt.yticks(())

plt.show()

The full code is given in the file `gmm_classifier.py`. If you run the code, you will see the output figure. The input data consists of three distributions. The three ellipses of various sizes and angles represent the underlying distributions in the input data. You will see the following printed on your Terminal:

Accuracy on training data = 87.5

Accuracy on testing data = 86.6666666667
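
The listing above relies on `sklearn.mixture.GMM` and `sklearn.cross_validation`, which recent scikit-learn releases no longer include. If you are on a newer version, the following sketch shows one way the same steps could be written with `GaussianMixture` and `sklearn.model_selection`. Treat it as an adaptation under our own assumptions rather than a line-for-line port: the class means are passed through `means_init`, the covariances are read from `covariances_`, and the accuracies you get may differ from the numbers quoted above:

```python
import numpy as np
from sklearn import datasets
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import StratifiedKFold

iris = datasets.load_iris()

# Five stratified folds; the first one gives an 80/20 train/test split
skf = StratifiedKFold(n_splits=5)
train_index, test_index = next(iter(skf.split(iris.data, iris.target)))
X_train, y_train = iris.data[train_index], iris.target[train_index]
X_test, y_test = iris.data[test_index], iris.target[test_index]

num_classes = len(np.unique(y_train))

# Seed the mean of each component with the mean of one class
class_means = np.array([X_train[y_train == i].mean(axis=0)
                        for i in range(num_classes)])
classifier = GaussianMixture(n_components=num_classes, covariance_type='full',
                             means_init=class_means, max_iter=20,
                             random_state=0)
classifier.fit(X_train)

# Fitted covariance matrices, one 4 x 4 matrix per component
covariances = classifier.covariances_

accuracy_testing = np.mean(classifier.predict(X_test) == y_test) * 100
print('Accuracy on testing data =', accuracy_testing)
```

Because each component starts at a class mean, component `i` tends to line up with class `i`, which is what allows the predicted component index to be compared with the true label.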

#### Finding subgroups in the stock market using Affinity Propagation model

Affinity Propagation is a clustering algorithm that doesn’t require us to specify the number of clusters beforehand. Because of its generic nature and simplicity of implementation, it has found a lot of applications across many fields. It finds out representatives of clusters, called exemplars, using a technique called message passing. We start by specifying the measures of similarity that we want it to consider. It simultaneously considers all training data points as potential exemplars. It then passes messages between the data points until it finds a set of exemplars.

The message passing happens in two alternate steps, called responsibility and availability. Responsibility refers to the message sent from members of the cluster to candidate exemplars, indicating how well suited the data point would be as a member of this exemplar’s cluster. Availability refers to the message sent from candidate exemplars to potential members of the cluster, indicating how well suited it would be as an exemplar. It keeps doing this until the algorithm converges on an optimal set of exemplars.

There is also a parameter called preference that controls the number of exemplars that will be found. If you choose a high value, then it will cause the algorithm to find too many clusters. If you choose a low value, then it will lead to a small number of clusters. A good value to choose would be the median similarity between the points.
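
The two kinds of messages can be sketched with NumPy as follows. This is a bare-bones illustration of the damped update rules, not the scikit-learn implementation: it runs for a fixed number of iterations, has no convergence check, and uses negative squared distance as the similarity. The function name, the toy data, and the damping value are assumptions made for this example:

```python
import numpy as np

def affinity_propagation_sketch(S, damping=0.5, n_iter=200):
    """Run damped responsibility/availability updates on similarities S.

    The diagonal of S holds the preferences. Returns, for each point,
    the index of the exemplar it ends up choosing.
    """
    n = len(S)
    R = np.zeros((n, n))   # responsibilities r(i, k)
    A = np.zeros((n, n))   # availabilities a(i, k)
    rows = np.arange(n)
    for _ in range(n_iter):
        # r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        AS = A + S
        best = np.argmax(AS, axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1 - damping) * R_new

        # a(i,k) = min(0, r(k,k) + sum of positive r(i',k), i' not in {i,k})
        # a(k,k) = sum of positive r(i',k) over i' != k
        Rp = np.maximum(R, 0)
        Rp[rows, rows] = R[rows, rows]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[rows, rows].copy()
        A_new = np.minimum(A_new, 0)
        A_new[rows, rows] = diag
        A = damping * A + (1 - damping) * A_new
    # Each point picks the candidate with the largest combined evidence
    return np.argmax(A + R, axis=1)

x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
S = -(x[:, None] - x[None, :]) ** 2             # negative squared distance
off_diagonal = S[~np.eye(len(x), dtype=bool)]
np.fill_diagonal(S, np.median(off_diagonal))    # preference = median similarity
exemplar_of = affinity_propagation_sketch(S)
print(exemplar_of)
```

Points that report the same exemplar index belong to the same cluster. Raising the values on the diagonal of `S` makes more points willing to act as exemplars, which is how the preference controls the number of clusters.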

Let’s use the Affinity Propagation model to find subgroups in the stock market. We will be using the stock quote variation between opening and closing as the governing feature. Create a new Python file and import the following packages:

import datetime

import json

import numpy as np

import matplotlib.pyplot as plt

from sklearn import covariance, cluster

from matplotlib.finance import quotes_historical_yahoo_ochl as quotes_yahoo

We will be using the stock market data available in `matplotlib`. The company symbols are mapped to their full names in the file `company_symbol_mapping.json`:

# Input file containing company symbols

input_file = 'company_symbol_mapping.json'

Load the company symbol map from the file:

# Load the company symbol map

with open(input_file, 'r') as f:

    company_symbols_map = json.loads(f.read())

symbols, names = np.array(list(company_symbols_map.items())).T

Load the stock quotes from matplotlib:

# Load the historical stock quotes

start_date = datetime.datetime(2003, 7, 3)

end_date = datetime.datetime(2007, 5, 4)

quotes = [quotes_yahoo(symbol, start_date, end_date, asobject=True)

for symbol in symbols]

Compute the difference between the opening and closing quotes:

# Extract opening and closing quotes

opening_quotes = np.array([quote.open for quote in quotes]).astype(float)

closing_quotes = np.array([quote.close for quote in quotes]).astype(float)

# Compute differences between opening and closing quotes

quotes_diff = closing_quotes - opening_quotes

Normalize the data:

# Normalize the data

X = quotes_diff.copy().T

X /= X.std(axis=0)
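
The effect of this normalization is easy to verify on a small array. In the sketch below, the numbers are made up: each row is a day and each column is a company, with the second company moving on a much larger scale than the first:

```python
import numpy as np

# Made-up daily variations: rows are days, columns are companies
X_demo = np.array([[ 0.5, 10.0],
                   [-0.2, -4.0],
                   [ 0.1,  6.0],
                   [-0.4, -8.0]])

# Divide each column by its own standard deviation
X_scaled = X_demo / X_demo.std(axis=0)
print(X_scaled.std(axis=0))
```

After scaling, every column has a standard deviation of 1, so stocks with large price swings no longer dominate the comparison.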

Create a graph model:

# Create a graph model

edge_model = covariance.GraphLassoCV()

Train the model:

# Train the model

with np.errstate(invalid='ignore'):

    edge_model.fit(X)

Build the affinity propagation clustering model using the edge model we just trained:

# Build clustering model using Affinity Propagation model

_, labels = cluster.affinity_propagation(edge_model.covariance_)

num_labels = labels.max()

Print the output:

# Print the results of clustering

for i in range(num_labels + 1):

    print("Cluster", i+1, "==>", ', '.join(names[labels == i]))

The full code is given in the file `stocks.py`. If you run the code, the Terminal output lists the companies that belong to each cluster. This output represents the various subgroups in the stock market during that time period. Please note that the clusters might appear in a different order when you run the code.

#### Segmenting the market based on shopping patterns

Let’s see how to apply unsupervised learning techniques to segment the market based on customer shopping habits. You have been provided with a file named `sales.csv`. This file contains the sales details of a variety of tops from a number of retail clothing stores. Our goal is to identify the patterns and segment the market based on the number of units sold in these stores.

Create a new Python file and import the following packages:

import csv

import numpy as np

import matplotlib.pyplot as plt

from sklearn.cluster import MeanShift, estimate_bandwidth

Load the data from the input file. Since it’s a CSV file, we can use the CSV reader in Python to read the data from this file and convert it into a `NumPy` array:

# Load data from input file

input_file = 'sales.csv'

file_reader = csv.reader(open(input_file, 'r'), delimiter=',')

X = []

for count, row in enumerate(file_reader):

    if not count:

        names = row[1:]

        continue

    X.append([float(x) for x in row[1:]])

# Convert to numpy array

X = np.array(X)
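
If you want to try the parsing logic without the data file, the sketch below runs the same idea on an in-memory CSV. The store identifiers, column names, and numbers are made up for this example and are not taken from `sales.csv`:

```python
import csv
import io
import numpy as np

# In-memory stand-in for a sales file; names and numbers are made up
sample_csv = io.StringIO(
    "Store,Tshirt,Tank top,Halter top\n"
    "A,30,27,2\n"
    "B,11,8,40\n"
)

reader = csv.reader(sample_csv, delimiter=',')
header = next(reader)        # first row holds the column names
names = header[1:]           # drop the store identifier column
X = np.array([[float(value) for value in row[1:]] for row in reader])
print(names)
print(X)
```

Reading the header with `next` before looping is an alternative to checking the row counter inside the loop.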

Let’s estimate the bandwidth of the input data:

# Estimating the bandwidth of input data

bandwidth = estimate_bandwidth(X, quantile=0.8, n_samples=len(X))

Train a mean shift model based on the estimated bandwidth:

# Compute clustering with MeanShift

meanshift_model = MeanShift(bandwidth=bandwidth, bin_seeding=True)

meanshift_model.fit(X)

Extract the labels and the centres of each cluster:

labels = meanshift_model.labels_

cluster_centers = meanshift_model.cluster_centers_

num_clusters = len(np.unique(labels))

Print the number of clusters and the cluster centres:

print("\nNumber of clusters in input data =", num_clusters)

print("\nCenters of clusters:")

print('\t'.join([name[:3] for name in names]))

for cluster_center in cluster_centers:

    print('\t'.join([str(int(x)) for x in cluster_center]))

We are dealing with six-dimensional data. In order to visualize the data, let’s take the two-dimensional data formed by the second and third dimensions:

# Extract two features for visualization

cluster_centers_2d = cluster_centers[:, 1:3]

Plot the centres of clusters:

# Plot the cluster centers

plt.figure()

plt.scatter(cluster_centers_2d[:,0], cluster_centers_2d[:,1],

s=120, edgecolors='black', facecolors='none')

offset = 0.25

plt.xlim(cluster_centers_2d[:,0].min() - offset * np.ptp(cluster_centers_2d[:,0]),

cluster_centers_2d[:,0].max() + offset * np.ptp(cluster_centers_2d[:,0]))

plt.ylim(cluster_centers_2d[:,1].min() - offset * np.ptp(cluster_centers_2d[:,1]),

cluster_centers_2d[:,1].max() + offset * np.ptp(cluster_centers_2d[:,1]))

plt.title('Centers of 2D clusters')

plt.show()

The full code is given in the file `market_segmentation.py`. If you run the code, you will see a figure showing the centres of the clusters in two dimensions. The Terminal output lists the number of clusters and the centres of the clusters.

Until then see you next time!
