ProjectBlog: Unsupervised Learning Project: Creating Customer Segments

Blog: Unsupervised Learning Project: Creating Customer Segments

Learn how to develop and end-to-end Clustering and Dimensionality Reduction Project!


Throughout this project, we are going to analyze the spending behaviours of several customers in some product categories. The main goals of the project are:

  • Grouping customers in clusters of similar spending characteristics.
  • Describing the variations within the different clusters, in order to find the best delivery structure for each group.

To carry out this project we will use the dataset that can be found in the following UCI Machine Learning Repository.

You can find the complete project, documentation and dataset on my GitHub page:

We will focus our analysis on the six product categories recorded for the customers, excluding the ‘Channel’ and ‘Region’ fields.

# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames

# Import supplementary visualizations code
import visuals as vs

# Pretty display for notebooks
%matplotlib inline

# Load the wholesale customers dataset
data = pd.read_csv("customers.csv")
data.drop(['Region', 'Channel'], axis = 1, inplace = True)
print("Wholesale customers dataset has {} samples with {} features each.".format(*data.shape))
print("Dataset could not be loaded. Is the dataset missing?")

Data Exploration

Now we will explore the dataset through visualizations and code in order to understand the relationships between features. In addition, we will calculate a statistical description of the dataset and consider the overall relevance of each feature.

# Display a description of the dataset
# Display the head of the dataset

Selecting Samples

In order to understand better our dataset and how the data will be transformed through the analysis, we will select a few sample points and explore them in detail.

# Select three indices to sample from the dataset
indices = [85,181,338]

# Create a DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print("Chosen samples of wholesale customers dataset:")


Let us consider now the total purchase cost of each product category and the statistical description of the dataset above for our sample customers. If we would have to predict what kind of establishment (customer) each of the three samples chosen represent:

Considering the mean values:

  • Fresh: 12000.2977
  • Milk: 5796.2
  • Grocery: 3071.9
  • Detergents_paper: 2881.4
  • Delicatessen: 1524.8

We can make the following predictions:

1) Index 85: Retailer:

– Largest spending on detergents and paper and groceries of the entire dataset, which usually are products for houses.

– Higher than average spending on milk.

– Lower than average spending on frozen products.

2) Index 181: Large market

– High spending on almost every product category.

– Highest spending on fresh products of the entire dataset. Likely to be a large market.

– Low spending on detergents.

3) Index 338: Restaurant

– The amount of every product is significantly lower than the previous two customers considered.

– The spending on fresh products is the lowest of the entire dataset.

– The spending on milk and detergent and papers is in the bottom quartile.

  • It may be a small and cheap restaurant which needs groceries and frozen food to serve the meals.

Feature Relevance

We will now analyze the relevance of the features for understanding the purchasing behaviours of the customers. In other words, to determine if a customer that purchase some amount of one category of products will necessarily purchase some proportional amount of another category of products.

We will study this by training a supervised regression learner on a subset of the data with one feature removed, and then score how well that model can predict the removed feature.

# Display the head of the dataset
# Make a copy of the DataFrame, using the 'drop' function to drop the given feature
new_data = data.drop('Grocery', axis=1)

# Split the data into training and testing sets(0.25) using the given feature as the target
# Set a random state.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(new_data, data.Grocery, test_size=0.25, random_state=42)

# Create a decision tree regressor and fit it to the training set
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor()
regressor =, y_train)
prediction = regressor.predict(X_test)

# Report the score of the prediction using the testing set
from sklearn.metrics import r2_score
score = r2_score(y_test, prediction)
print("Prediction score is: {}".format(score))
  • We tried to predict the Grocery feature.
  • The reported prediction score was 67.25%.
  • As we obtained a high score, it as indicator of a very good fit. So this feature is easy to predict considering the rest of spendign habits and, therefore, not very necessary for identifying customers’ spending habits.

Visualizing Feature Distributions

In order to understand better our dataset, we will display a scatter matrix of every product feature.

The product features that show a correlation in the scatter matrix will be relevant to predict others.

# Produce a scatter matrix for each pair of features in the data
pd.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');
# Display a correlation matrix
import seaborn as sns


Using the scatter matrix and the correlation matrix as a references, we can infer the following:

  • Data is not normally distributed, it is positively skewed and it resemeble the log-normal distribution.
  • In most plots, most data points lie near the origin which shows little correlation between them.
  • From the scatter plots and the heatmap of correlation, we can see that there is a strong correlation between the ‘Grocery’ and ‘Detergent_paper’ features. The features ‘Grocery’ and ‘Milk’ also show a good degree of correlation.
  • This correlation confirms my guess about the relevance of the ‘Grocery’ feature, which can be accurately predicted with the ‘Detergent_paper’ feature. And, therefore, is not an absolutely necessary feature in the dataset.

Data Preprocessing

This step is crucial to make sure that the results obtained are significant, meaningful and they are optimized. We will preprocess data by scaling it and detecting potential outliers.

Feature Scaling

Usually, when data is not normally distributed, especially if the mean and median vary significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling — particularly for financial data.

One way to achieve this scaling is by using a Box-Cox test, which calculates the best power transformation of the data that reduces skewness. A simpler approach which can work in most cases would be applying the natural logarithm.

# Scale the data using the natural logarithm 
log_data = np.log(data)
# Scale the sample data using the natural logarithm 
log_samples = np.log(samples)
# Produce a scatter matrix for each pair of newly-transformed features 
pd.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');


After applying a natural logarithm scaling to the data, the distribution of each feature appear much more normal. For any pairs of features we have identified earlier as being correlated, we observe here that correlation is still present (and whether it is now stronger or weaker than before).

Displaying the actual data:

# Display the log-transformed sample data

Outlier Detection

Detecting outliers in the data is extremely important in the data preprocessing step of any analysis. The presence of outliers can often skew results which take into consideration these data points.

Here, we will use Tukey’s Method for identfying outliers: An outlier step is calculated as 1.5 times the interquartile range (IQR). A data point with a feature that is beyond an outlier step outside of the IQR for that feature is considered abnormal.

outliers = []

# For each feature find the data points with extreme high or low values
for feature in log_data.keys():

# Calculate Q1 (25th percentile of the data) for the given feature
Q1 = np.percentile(log_data[feature],25)

# Calculate Q3 (75th percentile of the data) for the given feature
Q3 = np.percentile(log_data[feature],75)

# Use the interquartile range to calculate an outlier step (1.5 times the interquartile range)
step = 1.5 * (Q3-Q1)

# Display the outliers
print("Data points considered outliers for the feature '{}':".format(feature))
display(log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))])
lista = log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))].index.tolist()
# Detecting outliers that appear in more than one product
seen = {}
dupes = []

for lista in outliers:
for index in lista:
if index not in seen:
seen[index] = 1
if seen[index] == 1:
seen[index] += 1
dupes = sorted(dupes)
# Removing outliers  
good_data = log_data.drop(dupes, axis=0).reset_index(drop=True)


  • Datapoints considered outliers that are present in more than one feature are: 65, 66, 75, 128, 154.
  • K-Means is heavily influenced by the presence of outliers as they increase significantly the loss function that the algorithm tries to minimize. This loss function is the squared sum of the distances of each datapoint to the centroid, so, if the outlier is far enough, the centroid will be incorrectly situated. Because of this, the outliers shoul be removed.

Feature Transformation

Now we will use Principal Component Analysis (PCA) to extract conclusions about the hidden structure of the dataset. PCA is used to calculate those dimensions that maximize variance, so we will find the combination of features that describe best each customer.

Principal Component Analysis (PCA)

Once the data has been scaled to a normal distribution and the necessary outliers have been removed, we can apply PCA to the good_data to discover which dimensions about the data best maximize the variance of features involved.

In addition to finding these dimensions, PCA will also report the explained variance ratio of each dimension — how much variance within the data is explained by that dimension alone.

# Get the shape of the log_samples
# Apply PCA by fitting the good data with the same number of dimensions as features
from sklearn.decomposition import PCA
pca = PCA(n_components=good_data.shape[1])
pca =

# Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)

# Generate PCA results plot
pca_results = vs.pca_results(good_data, pca)


  • The variance explained by the first two Principal Components is the 70.68% of the total.
  • The variance explained by the first three Principal Components is the 93.11% of the total.

Dimensions discussion

  • Dimension 1: This dimension represents well, in terms of negative variance, the following features: Detergent_Paper, Milk and groceries. Mostly utilities for everyday consuming.
  • Dimension 2: This dimension represents well, in terms of negative variance, the following features: Fresh, Frozen and Delicatessen. Mostly food consuming.
  • Dimension 3: This dimension represents well, in terms of positive variance, the Delicatessen features, and in terms of negative variance de Fresh feature. Food to be consumed on the day.
  • Dimension 4: This dimension represents well, in terms of positive variance, the Frozen feature, and in terms of negative variance, the Delicatessen Feature. Food that can be storaged.
# Display sample log-data after having a PCA transformation applied
display(pd.DataFrame(np.round(pca_samples, 4), columns = pca_results.index.values))

Dimensionality Reduction

When using principal component analysis, one of the main goals is to reduce the dimensionality of the data.

Dimensionality reduction comes at a cost: Fewer dimensions used implies less of the total variance in the data is being explained. Because of this, the cumulative explained variance ratio is extremely important for knowing how many dimensions are necessary for the problem. Additionally, if a signifiant amount of variance is explained by only two or three dimensions, the reduced data can be visualized afterwards.

# Apply PCA by fitting the good data with only two dimensions
pca = PCA(n_components=2).fit(good_data)

# Transform the good data using the PCA fit above
reduced_data = pca.transform(good_data)

# Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)

# Create a DataFrame for the reduced data
reduced_data = pd.DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])

The cell below show how the log-transformed sample data has changed after having a PCA transformation applied to it using only two dimensions. Observe how the values for the first two dimensions remains unchanged when compared to a PCA transformation in six dimensions.

# Display sample log-data after applying PCA transformation in two dimensions
display(pd.DataFrame(np.round(pca_samples, 4), columns = ['Dimension 1', 'Dimension 2']))

Visualizing a Biplot

A biplot is a scatterplot where each data point is represented by its scores along the principal components. The axes are the principal components (in this case Dimension 1 and Dimension 2).

The biplot shows the projection of the original features along the components. A biplot can help us interpret the reduced dimensions of the data, and discover relationships between the principal components and original features.

# Create a biplot
vs.biplot(good_data, reduced_data, pca)

Once we have the original feature projections (in red), it is easier to interpret the relative position of each data point in the scatterplot.

For instance, a point the lower right corner of the figure will likely correspond to a customer that spends a lot on 'Milk', 'Grocery'and 'Detergents_Paper', but not so much on the other product categories.


In this section, we will choose to use either a K-Means clustering algorithm or a Gaussian Mixture Model (GMM) clustering algorithm to identify the various customer segments hidden in the data.

We will then recover specific data points from the clusters to understand their significance by transforming them back into their original dimension and scale.

K-Means vs GMM

1) The main advantages of using K-Means as a cluster algorithm are:

– It is easy to implement.

– With large number of variables, if (K is small), it may be computationally faster than hierarchichal clustering.

– Consistent and scale-invariant.

– It is guaranteed to converge.

2) The main advantages of using Gaussian Mixture Models as a cluster algorithm are:

– It is much more flexible in terms of cluster covariance. Which means that each cluster can have unconstrained covariance structure. In other words, whereas K-means assumes that every cluster have spherical estructure, GMM allows eliptical.

– Points can belong to different clusters, with different level of memebership. This level of membership is the probability of each point to belong to each cluster.

3) Chosen algorithm:

  • The chosen algorithm is Gaussian Mixture Model. Because data is not splitted in clear and different clusters, so we do not know how many clusters there are.

Creating Clusters

When the number of clusters is not known a priori, there is no guarantee that a given number of clusters best segments the data, since it is unclear what structure exists in the data.

However, we can quantify the “goodness” of a clustering by calculating each data point’s silhouette coefficient. The silhouette coefficient for a data point measures how similar it is to its assigned cluster from -1 (dissimilar) to 1 (similar). Calculating the mean silhouette coefficient provides for a simple scoring method of a given clustering.

# Import the necessary libraries
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

scores = {}
for i in range(2,7):

print('Number of clusters: ' + str(i))

# Apply your clustering algorithm of choice to the reduced data
clusterer = GaussianMixture(random_state=42, n_components=i)

# Predict the cluster for each data point
preds = clusterer.predict(reduced_data)

# Find the cluster centers
centers = clusterer.means_
print('Cluster Center: ' + str(centers))

# Predict the cluster for each transformed sample data point
sample_preds = clusterer.predict(pca_samples)
print('Sample predictions: ' + str(sample_preds))

# Calculate the mean silhouette coefficient for the number of clusters chosen
score = silhouette_score(reduced_data, preds)
scores[i] = score
print('Silhouette score is: ' + str(score), '\n')

print('Scores: ' + str(scores))

The number of cluster with the best Silhouette Score is 2, with a score of 0.42.

Cluster Visualization

Once we’ve chosen the optimal number of clusters for the clustering algorithm using the scoring metric above, we can now visualize the results in the code block below.

# Apply your clustering algorithm of choice to the reduced data 
clusterer = GaussianMixture(random_state=42, n_components=2)

# Predict the cluster for each data point
preds = clusterer.predict(reduced_data)

# Find the cluster centers
centers = clusterer.means_
print('Cluster Center: ' + str(centers))

# Predict the cluster for each transformed sample data point
sample_preds = clusterer.predict(pca_samples)
print('Sample predictions: ' + str(sample_preds))

# Calculate the mean silhouette coefficient for the number of clusters chosen
score = silhouette_score(reduced_data, preds)
scores[i] = score
print('Silhouette score is: ' + str(score), '\n')
# Display the results of the clustering from implementation
vs.cluster_results(reduced_data, preds, centers, pca_samples)

Data Recovery

Each cluster present in the visualization above has a central point. These centers (or means) are not specifically data points from the data, but rather the averages of all the data points predicted in the respective clusters.

For the problem of creating customer segments, a cluster’s center point corresponds to the average customer of that segment. Since the data is currently reduced in dimension and scaled by a logarithm, we can recover the representative customer spending from these data points by applying the inverse transformations.

# Inverse transform the centers
log_centers = pca.inverse_transform(centers)

# Exponentiate the centers
true_centers = np.exp(log_centers)

# Display the true centers
segments = ['Segment {}'.format(i) for i in range(0,len(centers))]
true_centers = pd.DataFrame(np.round(true_centers), columns = data.keys())
true_centers.index = segments
  • Segment 0 may represent a a fresh food market as every feature except Frozen and Fresh are below the median.
  • Segment 1 may represent a supermarket as every feature except fresh and frozen are above the median.

The code below shows to which ckustr each sample point is predicted ot belong.

# Display the predictions
for i, pred in enumerate(sample_preds):
print("Sample point", i, "predicted to be in Cluster", pred)


  • Sample point 0 → Supermarket and the original guess was a retailer. This difference may be explained because of the size of the cluster (which is pretty big)
  • Sample point 1 → Supermarket and the originak guess was the same.
  • Sample point 2 → Fresh food market and the original guess was a restaurant which is reasonable considering the amount of the spending of the features.


How can the wholesale distributor label the new customers using only their estimated product spending and the customer segment data?

A supervised learning algorithm could be used with the estimated product spending as attributes and the customer segment as the target variable, making it a classification problem (we would have 2 possible labels). As there is not a clear mathematical relationship between the customer segment and the product spending KNN could be a good algorithm to work with.

Visualizing Underlying Distributions

At the beginning of this project, it was discussed that the 'Channel' and 'Region' features would be excluded from the dataset so that the customer product categories were emphasized in the analysis. By reintroducing the 'Channel' feature to the dataset, an interesting structure emerges when considering the same PCA dimensionality reduction applied earlier to the original dataset.

The code block below shows how each data point is labeled either 'HoReCa' (Hotel/Restaurant/Cafe) or 'Retail' the reduced space.

# Display the clustering results based on 'Channel' data vs.channel_results(reduced_data, preds, pca_samples)

We can observe that the cluster algorithm does a pretty good job of clustering the data to the underlying distribution as the cluster 0 can be associated perfectly with a retailer and the cluster 1 to the Ho/Re/Ca.

Source: Artificial Intelligence on Medium

Leave a Reply

Your email address will not be published. Required fields are marked *

Back To Top

Display your work in a bold & confident manner. Sometimes it’s easy for your creativity to stand out from the crowd.