Creating Customer Segmentation

In this post, we will identify customers segments using data collected from customers of a wholesale distributor in Lisbon (Portugal). The dataset includes the various customers annual spending amounts (reported in monetary units) of diverse product categories for internal structure. The project includes several steps: explore data (determine if any product categories are highly correlated), scale each product category, identify and remove outliers, dimension reduction using PCA, implement a clustering algorithm to segment the customer data and finally compare segmentation. The dataset for this project can be found on the UCI Machine Learning Repository. For the purposes of this project, the features 'Channel' and 'Region' will be excluded in the analysis — with focus instead on the six product categories recorded for customers.

Let’s load the dataset along with a few of the necessary Python libraries.

Wholesale customers dataset has 440 samples with 6 features each.

Data Exploration

In this section, we begin exploring the data through visualizations to understand how each feature is related to the others. We will observe a statistical description of the dataset, consider the relevance of each feature, and select a few sample data points from the dataset which you will track through the course of this project.

Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
count 440.000000 440.000000 440.000000 440.000000 440.000000 440.000000
mean 12000.297727 5796.265909 7951.277273 3071.931818 2881.493182 1524.870455
std 12647.328865 7380.377175 9503.162829 4854.673333 4767.854448 2820.105937
min 3.000000 55.000000 3.000000 25.000000 3.000000 3.000000
25% 3127.750000 1533.000000 2153.000000 742.250000 256.750000 408.250000
50% 8504.000000 3627.000000 4755.500000 1526.000000 816.500000 965.500000
75% 16933.750000 7190.250000 10655.750000 3554.250000 3922.000000 1820.250000
max 112151.000000 73498.000000 92780.000000 60869.000000 40827.000000 47943.000000

Product Categories - Here are possible products that may be purchased within the 6 product categories:

  1. Fresh: fruits, vegetables, fish, meat
  2. Milk: dairy products i.e Milk, cheese, cream
  3. Grocery: flour, rice, cereals, beverages, can food
  4. Frozen: frozen vegetables, processed food
  5. Detergents_paper: household products, Detergent, hand/floor soap, napkins
  6. Delicatessen: fine chocolate, french cheese
    The integer represents the annual spending of a customer in monetary unit (mu) for a particular product category.

Implementation: Selecting Samples

To get a better understanding of the customers and how their data will transform through the analysis, it would be best to select a few sample data points and explore them in more detail.

Chosen samples of wholesale customers dataset:

Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 12126 3199 6975 480 3140 545
1 4155 367 1390 2306 86 130
2 17360 6200 9694 1293 3620 1721
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 126.0 -2597.0 -976.0 -2592.0 259.0 -980.0
1 -7845.0 -5429.0 -6561.0 -766.0 -2795.0 -1395.0
2 5360.0 404.0 1743.0 -1779.0 739.0 196.0
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 3622.0 -428.0 2219.0 -1046.0 2324.0 -421.0
1 -4349.0 -3260.0 -3366.0 780.0 -730.0 -836.0
2 8856.0 2573.0 4938.0 -233.0 2804.0 755.0

[+] customer0’s spending on Milk, Frozen and Delicatessen is lower than the median, but it spends more on Fresh, Grocery and Detergents.
[+] customer1’s spending is lower than the median and the mean for all categories except for ‘Frozen’, for which speding is higher than median but lower than the mean.
[+] customer2 spending is higher than the mean and median spending for all categories except for ‘Frozen’.

The table below details the total spending of each sample customer and the percentage of spending per product categories (‘%cost_’)

Customer total cost %cost_Fresh %cost_Milk %cost_Grocery %cost_Frozen %cost_Detergents_Paper %cost_Delicatessen
0 26465 45.8 12.1 26.4 1.8 11.9 2.1
1 8434 49.3 4.4 16.5 27.3 1.0 1.5
2 39888 43.5 15.5 24.3 3.2 9.1 4.3

[+] All the 3 customers purchase mainly ‘Fresh’ products, which accounts for about 50% of their total spending.
[+] Customer0 and Customer2 are likely to be operating in the same business segment, with similar purchase pattern. Indeed, their 2nd largest expenditure is for ‘Grocery’ products followed by ‘Fresh’ and ‘Frozen’ products. Customer2 out-spends Customer0 by about 13000mu: that infers that Customer2 may be a larger business and may need more volume. Customer0 and Customer2 may be in the Retail segment, where food related items (Fresh, Milk, Grocery) are typically available, as well as Household products.
[+] For Customer1, the 2nd largest spending is for ‘Frozen’ products followed by ‘Grocery’ products. Customer1 might be in the Restaurant segment, where the highest percentage of expenditures are for food related items, and the smallest percentage of expenditure for ‘Detergents_Paper’ products. Note also that Customer1 has the smallest total expenditure compared to the 2 other customers, about 3 to 5 times lower than Customer0 and Customer2 respectively. It is very likely that a Restaurant needs less volume than retailers.

As a side note, it might be useful to look at the volume of the purchase for the different product categories to get more insight on the customer business segment. Indeed, ‘Detergents_paper’ products are typically more expensive than ‘Milk’ products for example. Therefore, a larger percentage of expenditure on ‘Detergents_Paper’ does not necessarily mean that the customer purchase a higher volume of ‘Detergents_Paper’ products.

Implementation: Feature Relevance

One interesting thought to consider is if one (or more) of the six product categories is actually relevant for understanding customer purchasing. That is to say, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products? We can make this determination quite easily by training a supervised regression learner on a subset of the data with one feature removed, and then score how well that model can predict the removed feature.

[+] Deleted feature: Fresh

Feature importance
Milk 0.226778
Grocery 0.082826
Frozen 0.305044
Detergents Paper 0.139340
Delicatessen 0.246012

[+] Deleted feature: Milk

Feature importance
Fresh 0.138434
Grocery 0.219012
Frozen 0.031806
Detergents Paper 0.465780
Delicatessen 0.144969

[+] Deleted feature: Grocery

Feature importance
Fresh 0.020446
Milk 0.045690<
Frozen 0.016514
Detergents Paper 0.890934
Delicatessen 0.026416

[+] Deleted feature: Frozen

Feature importance
Fresh .098081
Milk 0.067566
Grocery 0.064459
Detergents Paper 0.127549
Delicatessen 0.642346

[+]Deleted feature: Detergents_Paper

Feature importance
Fresh 0.040617
Milk 0.013689
Grocery 0.900611
Frozen 0.021572
Delicatessen 0.023511

[+] Deleted feature: Delicatessen

Feature importance
Fresh 0.148177
Milk 0.147138
Grocery 0.073654
Frozen 0.487107
Detergents_Paper 0.143924

The code above is generalized to study how the score varies when trying to predict any of the 6 attributes. In order to visualize the consistency of the test score values for each case, we run the Decision Tree Regressor multiple times (nbr_fit_trials=100) on the training set and calculate the score of the model on the test set.

The score is defined by:

where is the true value of , is the predicted value, and is the mean of the true value. can be positive with best value of 1. It can also be negative if the model fails to fit the data, i.e if is large.

We obtain a negative score when attempting to predict the values of the feature ‘Milk’ using the 5 other features. Although the score values are scattered for the 100 fit trials , it is consistently negative (the red bar in the plot shows the mean value of the score). Therefore, ‘Milk’ cannot be predicted using the 5 other features and is therefore necessary for identifying customers’ spending habits. Similarly, the values of the features ‘Frozen’ and ‘Delicatessen’ cannot be predicted.

The best average score is obtained when predicting ‘Grocery’ : . This means that, if given the 5 other features, the values of ‘Grocery’ can be determined with relatively good accuracy. Therefore, the feature ‘Grocery’ might not be necessary for identifying patterns in customers’ spending. To a lesser extent, the values of ‘Detergent_papers’ can also be predicted from the remaining features: in this case is close to 50%.

The parameter feature_importance also provides some additional information on the possible correlation between the features, in particular for the 2 cases where is positive.

The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. >It is also known as the Gini importance. Ref. sklearn documentation

When using ‘Grocery’ as the label, the most important feature is ‘Detergent_papers’, with a Gini importance of 0.89, and inversely when using ‘Detergent_papers’ as the label, the feature ‘Grocery’ becomes the most important with a Gini importance of 0.90. The result indicates that there is some correlation between the attribute ‘Grocery’ and ‘Detergent_papers’, although we cannot deduce whether it is a positive or a negative correlation.

Visualize Feature Distributions

To get a better understanding of the dataset, we can construct a scatter matrix of each of the six product features present in the data.

[+] Most features are uncorrelated, except for ‘Grocery’ and ‘Detergents_paper’ which are positively correlated. ‘Grocery’ spending clearly increases with ‘Detergent_paper’ spending. Thus, the values of the feature ‘Grocery’ can be predicted given the values of ‘Detergents_paper’, and vice-versa. There is also evidence of some correlation between other pairs of features: ‘Detergent_papers’ versus ‘Milk’, and ‘Grocery’ versus ‘Milk’, which are all positive correlation. All the 6 attributes show a lognormal distribution (the data is not normally distributed). The distributions are skewed to the right with a long tail: the bulk of the customers have relatively small spending but a few customers have spending much larger.

Data Preprocessing

In this section, we will preprocess the data to create a better representation of customers by performing a scaling on the data and detecting (and optionally removing) outliers. If data is not normally distributed, especially if the mean and median vary significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling — particularly for financial data. One way to achieve this scaling is by using a Box-Cox test, which calculates the best power transformation of the data that reduces skewness. A simpler approach which can work in most cases would be applying the natural logarithm.

The Log transformation is usually performed on skewed data to make it approximately normal, and that make the transformed data more appropriate to be used with models that assume a normal distribution, like Gaussian Mixture Model. After the transformation of the data, we see a clearer correlation between ‘Detergent_papers’ and ‘Grocery’, and also between ‘Milk’ and ‘Detergent_papers’, and ‘Grocery’ and ‘Milk’.

Run the code below to see how the sample data has changed after having the natural logarithm applied to it.

Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 9.403107 8.070594 8.850088 6.173786 8.051978 6.300786
1 8.332068 5.905362 7.237059 7.743270 4.454347 4.867534
2 9.761924 8.732305 9.179262 7.164720 8.194229 7.450661

Implementation: Outlier Detection

Detecting outliers in the data is extremely important in the data preprocessing step of any analysis. The presence of outliers can often skew results which take into consideration these data points. There are many “rules of thumb” for what constitutes an outlier in a dataset. Here, we will use Tukey’s Method for identfying outliers: An outlier step is calculated as 1.5 times the interquartile range (IQR). A data point with a feature that is beyond an outlier step outside of the IQR for that feature is considered abnormal.

[+] Data points considered outliers for the feature ‘Fresh’:

Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
65 4.442651 9.950323 10.732651 3.583519 10.095388 7.260523
66 2.197225 7.335634 8.911530 5.164786 8.151333 3.295837
81 5.389072 9.163249 9.575192 5.645447 8.964184 5.049856
95 1.098612 7.979339 8.740657 6.086775 5.407172 6.563856
96 3.135494 7.869402 9.001839 4.976734 8.262043 5.379897
128 4.941642 9.087834 8.248791 4.955827 6.967909 1.098612
171 5.298317 10.160530 9.894245 6.478510 9.079434 8.740337
193 5.192957 8.156223 9.917982 6.865891 8.633731 6.501290
218 2.890372 8.923191 9.629380 7.158514 8.475746 8.759669
304 5.081404 8.917311 10.117510 6.424869 9.374413 7.787382
305 5.493061 9.468001 9.088399 6.683361 8.271037 5.351858
338 1.098612 5.808142 8.856661 9.655090 2.708050 6.309918
353 4.762174 8.742574 9.961898 5.429346 9.069007 7.013016
355 5.247024 6.588926 7.606885 5.501258 5.214936 4.844187
357 3.610918 7.150701 10.011086 4.919981 8.816853 4.700480
412 4.574711 8.190077 9.425452 4.584967 7.996317 4.127134

[+]Data points considered outliers for the feature ‘Milk’:

Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
86 10.039983 11.205013 10.377047 6.894670 9.906981 6.805723
98 6.220590 4.718499 6.656727 6.796824 4.025352 4.882802
154 6.432940 4.007333 4.919981 4.317488 1.945910 2.079442
356 10.029503 4.897840 5.384495 8.057377 2.197225 6.306275

[+] Data points considered outliers for the feature ‘Grocery’:

Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
75 9.923192 7.036148 1.098612 8.390949 1.098612 6.882437
154 6.432940 4.007333 4.919981 4.317488 1.945910 2.079442

[+] Data points considered outliers for the feature ‘Frozen’:

Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
38 8.431853 9.663261 9.723703 3.496508 8.847360 6.070738
57 8.597297 9.203618 9.257892 3.637586 8.932213 7.156177
65 4.442651 9.950323 10.732651 3.583519 10.095388 7.260523
145 10.000569 9.034080 10.457143 3.737670 9.440738 8.396155
175 7.759187 8.967632 9.382106 3.951244 8.341887 7.436617
264 6.978214 9.177714 9.645041 4.110874 8.696176 7.142827
325 10.395650 9.728181 9.519735 11.016479 7.148346 8.632128
420 8.402007 8.569026 9.490015 3.218876 8.827321 7.239215
429 9.060331 7.467371 8.183118 3.850148 4.430817 7.824446
439 7.932721 7.437206 7.828038 4.174387 6.167516 3.951244

[+] Data points considered outliers for the feature ‘Detergents_Paper’:

Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
75 9.923192 7.036148 1.098612 8.390949 1.098612 6.882437
161 9.428190 6.291569 5.645447 6.995766 1.098612 7.711101

[+] Data points considered outliers for the feature ‘Delicatessen’:

Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
66 2.197225 7.335634 8.911530 5.164786 8.151333 3.295837
109 7.248504 9.724899 10.274568 6.511745 6.728629 1.098612
128 4.941642 9.087834 8.248791 4.955827 6.967909 1.098612
137 8.034955 8.997147 9.021840 6.493754 6.580639 3.583519
142 10.519646 8.875147 9.018332 8.004700 2.995732 1.098612
154 6.432940 4.007333 4.919981 4.317488 1.945910 2.079442
183 10.514529 10.690808 9.911952 10.505999 5.476464 10.777768
184 5.789960 6.822197 8.457443 4.304065 5.811141 2.397895
187 7.798933 8.987447 9.192075 8.743372 8.148735 1.098612
203 6.368187 6.529419 7.703459 6.150603 6.860664 2.890372
233 6.871091 8.513988 8.106515 6.842683 6.013715 1.945910
285 10.602965 6.461468 8.188689 6.948897 6.077642 2.890372
289 10.663966 5.655992 6.154858 7.235619 3.465736 3.091042
343 7.431892 8.848509 10.177932 7.283448 9.646593 3.610918

[+] Data Point at index=128 is an outlier for 2 features
[+] Data Point at index=65 is an outlier for 2 features
[+] Data Point at index=66 is an outlier for 2 features
[+] Data Point at index=75 is an outlier for 2 features
[+] Data Point at index=154 is an outlier for 3 features

Are there any data points considered outliers for more than one feature based on the definition above?

[+] Data points that are outliers for at least 2 features:

Fresh Milk Grocery Frozen Detergents_Paper Delicatessen sum
128 140 8847 3823 142 1062 3 14017
65 85 20959 45828 36 24231 1423 92562
66 9 1534 7417 175 3468 27 12630
75 20398 1137 3 4407 3 975 26923
154 622 55 137 75 7 8 904

[+] Customers with total spending < 4000 mu

Fresh Milk Grocery Frozen Detergents_Paper Delicatessen sum
97 403 254 610 774 54 63 2158
98 503 112 778 895 56 132 2476
131 2101 589 314 346 70 310 3730
154 622 55 137 75 7 8 904
275 680 1610 223 862 96 379 3850
355 190 727 2012 245 184 127 3485

There are 5 data points that are outliers for more than 1 feature: index=[ 128, 65, 66, 75, 154]. I decided to include only 1 customer in the list of outliers : 154. First, 154 is an outlier for 3 features. Furthermore if we look at the total spending of this customer, it is the lowest within the full dataset (see table above). This customer does not contribute much to the bottom line of the wholesale distributor, and incorporating that data point might skewed the analysis. Other data points like 65 are outliers for 2 features. However, the total spending of customer 65 is among the highest (sum=92562) and it might be more important for the company to use a segmentation model that includes this valuable customer.

Feature Transformation

In this section we will use principal component analysis (PCA) to draw conclusions about the underlying structure of the wholesale customer data. Since using PCA on a dataset calculates the dimensions which best maximize variance, we will find which compound combinations of features best describe customers.

[+] How much variance in the data is explained in total by the first and second principal component?
0.7142 (71.42%) variance is explained by the 1st and 2nd PCA Dimension. Note that the 1st component is mainly driven by ‘Detergent_papers’ feature and to a lesser extent by the ‘Milk’ and ‘Grocery’ features. The 2nd PCA dimension is dominated by the triplet: ‘Fresh’, ‘Frozen’ and ‘Delicatessen’ features. The 4 first PCA Dimensions have total explained variance of more than 0.9300. Therefore, using the first 4 components is enough to capture most of the variance of the original data, i.e the dataset can be compressed from 6 dimensions (features) to 4 dimensions without loosing much information. The dimensions of the PCA represent different spending patterns, and are defined by a weighted sum of the original features. For example:

PCA_Dimension1 = 0.78 * ‘Detergents_Paper’ + 0.5 * ‘Grocery’ + 0.4 * ‘Milk’ - 0.2 * ‘Frozen’ - 0.2 * ‘Fresh’ + 0.2 * ‘Delicatessen’.

PCA_Dimension1 represents a spending pattern with higher spending on ‘Detergents_Paper’, ‘Milk’, ‘Grocery’ and ‘Delicatessen’ and decreasing spending on ‘Frozen’ and ‘Fresh’ products. PCA_Dimension1 is mostly correlated ( positively) to ‘Detergents_Paper’.

PCA_Dimension2: all product categories show positive weights but of different amplitudes, and the highest weight being for ‘Fresh’ and lowest rate for ‘Detergents_paper’.

PCA_Dimension3: is mainly correlated to ‘Delicatessen’ (negative correlation) and ‘Fresh’ (positive correlation). It reflects the case where customers who buy more ‘Fresh’ products, would consequently spend less on ‘Delicatessen’ and ‘Frozen’ products.

PCA_Dimension4: is mainly correlated to ‘Frozen’ (positive correlation) and negatively correlated to ‘Delicatessen’ and ‘Fresh’.

Dimension 1 Dimension 2 Dimension 3 Dimension 4 Dimension 5 Dimension 6
0 1.1367 -0.1357 1.2913 -0.6126 0.4993 0.0965
1 -3.3629 -1.7125 0.3371 0.7769 0.3800 0.6531
2 1.5088 1.3272 0.5133 -0.4019 0.2411 0.0286

#Dimensionality Reduction When using principal component analysis, one of the main goals is to reduce the dimensionality of the data — in effect, reducing the complexity of the problem. Dimensionality reduction comes at a cost: Fewer dimensions used implies less of the total variance in the data is being explained. Because of this, the cumulative explained variance ratio is extremely important for knowing how many dimensions are necessary for the problem. Additionally, if a signifiant amount of variance is explained by only two or three dimensions, the reduced data can be visualized afterwards.

Dimension 1 Dimension 2
0 1.1367 -0.1357
1 -3.3629 -1.7125
2 1.5088 1.3272

Above, the scatter matrix is plotted for Dimension1 and Dimension2. Note that Dimension1 shows a bimodal distribution. This is not unexpected: indeed, Dimension1 gets contributions mainly from ‘Milk’, ‘Grocery’ and ‘Detergents_Paper’ products (the highest weights). And we have seen earlier in this report, that the 2 features ‘Grocery and ‘Detergents_Paper’ are positively correlated. In contrast, Dimension2 shows a unimodal distribution: in this case, the main contributing features are ‘Fresh’, ‘Frozen’, and ‘Delicatessen’. Those features are uncorrelated.


In this section, we will choose to use either a K-Means clustering algorithm or a Gaussian Mixture Model clustering algorithm to identify the various customer segments hidden in the data.

[+] K-means clustering is one of the simplest algorithm for unsupervised learning. It is fast and scale well with large dataset. Another advantage of the K-means clustering algorithm is that it makes no assumption on the distribution of the observations, and performed a hard assignment of the points to a cluster.
[+] The Gaussian Mixture Model (GMM) compute to what degree (probability) a data point is assigned to a cluster: GMM offers the possibility to relax the assignment threshold depending on the data.

We will use the GMM algorithm for the following reasons:

  1. with the Log transformation, the data is closer to a normal distribution: we can see uni-modal and bi-modal distribution of some features (ref. to the scatter matrix of the Log transformed features)
  2. our dataset is relatively small, so compute time is not an issue.

#Creating Clusters Depending on the problem, the number of clusters that we expect to be in the data may already be known. When the number of clusters is not known a priori, there is no guarantee that a given number of clusters best segments the data, since it is unclear what structure exists in the data — if any. However, we can quantify the “goodness” of a clustering by calculating each data point’s silhouette coefficient. The silhouette coefficient for a data point measures how similar it is to its assigned cluster from -1 (dissimilar) to 1 (similar).

We report the silhouette score for several cluster numbers we tried.

Nbr of clusters silhouette score
2 0.4
3 0.39
4 0.3
5 0.28
6 0.28
7 0.32
8 0.3
9 0.3


[+] The best score is obtained when using 2 clusters. However, there is not much difference between the silhouette score when using 2 clusters or 3 clusters. Beyond 3 clusters, the silouette score drops significantly.

Cluster Visualization

Each cluster present in the visualization above has a central point. These centers (or means) are not specifically data points from the data, but rather the averages of all the data points predicted in the respective clusters. For the problem of creating customer segments, a cluster’s center point corresponds to the average customer of that segment. Since the data is currently reduced in dimension and scaled by a logarithm, we can recover the representative customer spending from these data points by applying the inverse transformations.

Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
Segment 0 3920.0 5954.0 9243.0 989.0 2806.0 861.0
Segment 1 8846.0 2214.0 2777.0 2042.0 375.0 744.0

[+] What set of establishments could each of the customer segments represent?

Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
Segment 0 -4584.0 2327.0 4487.0 -537.0 1990.0 -105.0
Segment 1 342.0 -1413.0 -1979.0 516.0 -441.0 -222.0

The segment0 spends much less on ‘Fresh’, ‘Frozen’ and ‘Delicatessen’ products compared to the median and the mean customer. In contrast spending on ‘Grocery’, ‘Milk’ is higher. This customer segment could represent Grocery stores. The segment1 spends less on ‘Milk’ ‘Grocery’ and ‘Detergents_paper’ products than the median: this segment might represent Restaurant.

Another way to look at the data is to compute the normalized spending for each segment i.e: \begin{align} \%cost = \frac{\text{feature spending}}{\text{total spending}} \end{align}

total cost %cost_Fresh %cost_Milk %cost_Grocery %cost_Frozen %cost_Detergents_Paper %cost_Delicatessen
segment 0 23773 16.5 25.0 38.9 4.2 11.8 3.6
segment 1 16998 52.0 13 16.3 12.0 2.2 4.4

[+] segment_0: ‘Grocery’ represents the largest spending followed by ‘Milk’, ‘Fresh’ and ‘Detergents_paper’. This could represent the purchasing pattern of a Retailer.
[+] segment_1: ‘Fresh’ represents the largest spending by far, followed prractically evenly by ‘Milk’, ‘Grocery’ and ‘Frozen’. This could represent the purchasing pattern of a Restaurant.

[+] For each sample point, which customer segment best represents it?

Customer total cost %cost_Fresh %cost_Milk %cost_Grocery %cost_Frozen %cost_Detergents_Paper %cost_Delicatessen
0 26465 45.8 12.1 26.4 1.8 11.9 2.1
1 8434 49.3 4.4 16.5 27.3 1.0 1.5
2 39888 43.5 15.5 24.3 3.2 9.1 4.3

[+] sample_0 and sample_2 are best represented by segment_0 as their purchasing pattern better match segment_0. Although those two samples show higher %cost of ‘Fresh’ products, they also show lowest %cost for ‘Delicatessen’ and ‘Frozen’ similarly to segment_0.
[+] sample_2 purchasing pattern matches better segment_1 where the lowest %cost are for ‘Delicatessen’ and ‘Detergents_paper’.

The model predictions for each sample confirm our conclusion above. Note that in a earlier section of the assignment, we argued that sample_0 and sample_2 are likely in the same core business because of their similar purchasing pattern, and it is also confirmed by the model.

#Conclusion [+] Companies will often run A/B tests when making small changes to their products or services to determine whether making that change will affect its customers positively or negatively. If the wholesale distributor is considering changing its delivery service for example from currently 5 days a week to 3 days a week, is that going to affect the business?

A higher frequency of delivery is likely to positively affect the customers that purchase mainly ‘Fresh’ and ‘Milk’ products. It might have no effect on other customers, although a higher frequency of truck delivery could be a disturbance (more traffic, etc..) and impact the customer’ bottom line. In order to determine which customers would respond positively to new delivery service, one could split the customers in 2 similar sets (setA and setB): i.e customers belonging to segment_0 would be equaly split between setA and setB, and same thing for customers of segment_1. We would then expose only one set (for example setA) to the new delivery service, and keep the service unchanged for the 2nd set of customers setB. We can then compare setA and setB to determine what group of customers that the new service affects the most.

[+]Additional structure is derived from originally unlabeled data when using clustering techniques. Since each customer has a customer segment it best identifies with (depending on the clustering algorithm applied), we can consider ‘customer segment’ as an engineered feature for the data. If the wholesale distributor recently acquired ten new customers and each provided estimates for anticipated annual spending of each product category. Knowing these estimates, the wholesale distributor wants to classify each new customer to a customer segment to determine the most appropriate delivery service. How can the wholesale distributor label the new customers using only their estimated product spending and the customer segment data?

The target variable would be the customer_segment. We could train a Decision Tree Classifier (or any other classification algorithm) on existing customers using the customer_segment as the label. The wholesale distributor can then ‘run’ the new customers through the classifier which would then outputs the segment that the new customers belong to.

Let ‘s Visualize Underlying Distributions:

At the beginning of this project, it was discussed that the 'Channel' and 'Region' features would be excluded from the dataset so that the customer product categories were emphasized in the analysis. By reintroducing the 'Channel' feature to the dataset, an interesting structure emerges when considering the same PCA dimensionality reduction applied earlier to the original dataset. We can see how each data point is labeled either 'HoReCa' (Hotel/Restaurant/Cafe) or 'Retail' the reduced space.

The clustering model with 2 clusters reproduces fairly well the distribution of Hotel/Restaurant/Cafe and Retailer customer. Though, there are a few data points (green) that are embedded in a large population of red data points. Our GMM model shows a clearer/sharper boundary between the 2 segments. The customers with Dimension2 in the range [-2, 2] and Dimension1 < -1 are likely to have a very high probability to belong to segment1, but there would still be a negligible probability that the customer belongs to segment0.