1. Market Basket Analysis
Market Basket Analysis is one of the applications of Association Rule Learning, and it is widely used in retail to discover associations among products. A market basket is the list of products that a customer purchases in a single visit to the store.
2. Association Rule Learning
Association Rule Learning is a widely used data mining approach for finding relationships among products in a large transactional database. The resulting rules help retailers improve their marketing strategies by showing which products customers frequently purchase together.
The Apriori algorithm is used here to perform market basket analysis, and its input is customers’ purchase data. Three measures are central to this analysis: support, confidence, and lift.
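To make the idea behind Apriori more concrete, here is a minimal pure-Python sketch of its level-wise search on a handful of toy baskets. The basket contents, the function names, and the 0.6 threshold are assumptions chosen for illustration; this is not how mlxtend implements it internally, only the general principle that an itemset can be frequent only if all of its subsets are frequent.

```python
from itertools import combinations

# Toy baskets (hypothetical data, not taken from the article's dataset).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter", "jam"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def frequent_itemsets(baskets, min_support):
    """Level-wise search: build larger candidates only from frequent itemsets."""
    items = sorted({item for b in baskets for item in b})
    current = [frozenset([i]) for i in items]
    result = {}
    k = 1
    while current:
        # Keep only the candidates that reach the support threshold.
        scored = {c: support(c, baskets) for c in current}
        frequent = {c: s for c, s in scored.items() if s >= min_support}
        result.update(frequent)
        # Join frequent k-itemsets into (k+1)-item candidates, then drop any
        # candidate with an infrequent k-item subset (the Apriori property).
        k += 1
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        current = [
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        ]
    return result

found = frequent_itemsets(baskets, min_support=0.6)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

With this toy data, the three single items and the three pairs made from bread, milk, and butter pass the threshold, while the three-item set does not, so the search stops there.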
Support
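Support is the probability that products X and Y are purchased together in a transaction. It can be calculated as follows:
$$\mathrm{support}(X \Rightarrow Y) = \frac{\text{number of transactions containing both } X \text{ and } Y}{\text{total number of transactions}}$$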
Confidence
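Confidence is the number of transactions containing both X and Y divided by the number of transactions containing X. In other words, it estimates how often Y is purchased when X is purchased. Writing support(X) for the fraction of transactions that contain X, the equation is as follows:
$$\mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \Rightarrow Y)}{\mathrm{support}(X)}$$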
Lift
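Lift measures how much more likely product Y is to be purchased when product X is purchased, compared with how often Y is purchased overall. A lift of 1 means the two are independent, and a value above 1 means they occur together more often than expected. Writing support(Y) for the fraction of transactions that contain Y, the lift value can be computed as follows:
$$\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{confidence}(X \Rightarrow Y)}{\mathrm{support}(Y)}$$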
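As a quick illustration of the three measures, the sketch below computes them by hand for the rule X → Y on five toy baskets. The baskets and function names are hypothetical, and the formulas are the standard definitions (confidence divides by the support of the antecedent X, lift divides confidence by the support of the consequent Y).

```python
# Toy baskets (hypothetical data for illustration only).
baskets = [
    {"X", "Y"},
    {"X", "Y", "Z"},
    {"X"},
    {"Y"},
    {"Z"},
]

def support(items, baskets):
    """Fraction of baskets containing all of `items`."""
    return sum(set(items) <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """confidence(antecedent -> consequent) / support(consequent)."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

s = support({"X", "Y"}, baskets)
c = confidence({"X"}, {"Y"}, baskets)
l = lift({"X"}, {"Y"}, baskets)
print(round(s, 3), round(c, 3), round(l, 3))
```

Here X and Y appear together in 2 of 5 baskets (support 0.4), X appears in 3 baskets (confidence 2/3), and Y appears in 3 baskets (lift about 1.11), so Y is slightly more likely in baskets that already contain X.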
Data preparation
The dataset used for this analysis is available for download from my GitHub repository. I prepared the basket from only three months of transaction data. Market basket analysis (MBA) requires only the transaction data of each customer's purchases as input. There is no limit on the number of transactions, so the data can be prepared according to your needs.
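To show the shape of input the analysis expects, here is a small pure-Python sketch that turns line items into one 0/1 row per invoice. The field names Invoice, Product, and Quantity follow the columns used later in this article; the invoice numbers and product names are made up for illustration, and the article itself does this step with pandas.

```python
from collections import defaultdict

# Hypothetical line items: (Invoice, Product, Quantity).
line_items = [
    ("1001", "TEA SET", 2),
    ("1001", "MUG", 1),
    ("1002", "MUG", 4),
    ("1002", "TEA SET", -4),  # e.g. a return, so the net quantity is not positive
    ("1003", "CANDLE", 1),
    ("1003", "MUG", 1),
    ("1003", "MUG", 2),
]

# Sum the quantity of each product within each invoice.
totals = defaultdict(int)
for invoice, product, qty in line_items:
    totals[(invoice, product)] += qty

invoices = sorted({inv for inv, _ in totals})
products = sorted({prod for _, prod in totals})

# One row per invoice, one 0/1 flag per product (1 if the net quantity is >= 1).
basket_rows = {
    inv: [1 if totals.get((inv, prod), 0) >= 1 else 0 for prod in products]
    for inv in invoices
}

print(products)
for inv, row in basket_rows.items():
    print(inv, row)
```

Each row answers only "was this product in the basket?", which is all the frequent-itemset search needs; the actual quantities are discarded.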
Data analysis tools
- Google Colaboratory
- Python
Required libraries
- Pandas
- NumPy
- mlxtend (for apriori and association_rules)
- Seaborn
- Matplotlib
First, import the required libraries:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
After importing the libraries, load the dataset as follows:
df = pd.read_csv('online_retail.csv')
df.head()
Output: [table showing the first five rows of the dataset]
Then, prepare the data to perform the basket analysis:
df['Date'] = pd.to_datetime(df['Date'])
print('Time period start: {}\nTime period end: {}'.format(df.Date.min(), df.Date.max()))
# Define the time window used to prepare the basket
start_date = '2010-12-01'
end_date = '2011-02-28'
mask = (df['Date'] > start_date) & (df['Date'] <= end_date)
transaction = df.loc[mask]
len(transaction)
The selected period contains a total of 100324 rows. Note that len(transaction) counts rows, and a single invoice can span several rows (one per product). After that, prepare the basket as follows:
basket = (transaction.groupby(['Invoice', 'Product'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('Invoice'))

# Convert the quantities to one-hot encoded values (0 or 1)
def encode_units(x):
    return 1 if x >= 1 else 0

# On pandas 2.1+, DataFrame.map is the preferred name for applymap
basket_sets = basket.applymap(encode_units)
#Build up frequent items with apriori model
frequent_itemsets = apriori(basket_sets, min_support=0.02, use_colnames=True)
#apply the association_rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.tail()
Output: [table showing the last five rows of the rules dataframe]
Note: In a rule X → Y, the antecedent (X) is the item or itemset on the "if" side of the rule, and the consequent (Y) is the item or itemset found in combination with the antecedent.
As you can see above, the resulting dataframe contains several columns. However, I’d like to highlight some of the metrics for one of the rules:
antecedents = ‘HEART OF WICKER SMALL’, ‘HEART OF WICKER LARGE’
consequents = ‘WHITE HANGING HEART T-LIGHT HOLDER’
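support = 0.024224
Support is the fraction of baskets (invoices) that contain all of these products, so about 2.4% of the invoices include [‘HEART OF WICKER SMALL’, ‘HEART OF WICKER LARGE’] together with [‘WHITE HANGING HEART T-LIGHT HOLDER’].
To convert this to a count of invoices, multiply the support by the number of invoices, len(basket_sets), rather than by the 100324 rows in the filtered data, because one invoice can span several rows.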
confidence = 0.456410
This means that about 46% of the invoices containing [‘HEART OF WICKER SMALL’, ‘HEART OF WICKER LARGE’] also contained [‘WHITE HANGING HEART T-LIGHT HOLDER’].
lift = 3.194002
The lift value is greater than 1, so these products are purchased together more often than would be expected if they were bought independently (here, about 3.2 times as often). The higher the lift value, the stronger the association between the products.
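As a sanity check, the three reported values can be tied together with the standard definitions: dividing support by confidence gives the support of the antecedent, and dividing confidence by lift gives the support of the consequent. The sketch below does this arithmetic; the variable names are my own, and the two derived values are estimates implied by the reported metrics rather than figures read directly from the output.

```python
# Values reported above for the example rule.
support_xy = 0.024224   # fraction of invoices with antecedents and consequent together
confidence = 0.456410
lift = 3.194002

# confidence = support(X and Y) / support(X)  =>  support(X) = support(X and Y) / confidence
antecedent_support = support_xy / confidence

# lift = confidence / support(Y)  =>  support(Y) = confidence / lift
consequent_support = confidence / lift

print(f"implied antecedent support: {antecedent_support:.4f}")
print(f"implied consequent support: {consequent_support:.4f}")
```

The result says that roughly 5.3% of invoices contain both wicker hearts and roughly 14.3% contain the T-light holder. These can be compared with the antecedent support and consequent support columns that mlxtend includes in the rules dataframe.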
Visualization
sns.set(style = "whitegrid")
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(projection = '3d')
x = rules['support']
y = rules['confidence']
z = rules['lift']
ax.set_xlabel("Support")
ax.set_ylabel("Confidence")
ax.set_zlabel("Lift")
ax.scatter(x, y, z)
ax.set_title("3D distribution of association rules")
plt.show()
Output: [3D scatter plot of the association rules, with support, confidence, and lift on the axes]
Thanks for reading!