![](https://crypto4nerd.com/wp-content/uploads/2023/01/06VLYMFhwe0hrn2b7.jpeg)
In today’s highly competitive market, businesses are constantly looking for ways to improve their marketing strategies and increase their revenue.
One way to do this is by using data science techniques to predict customer behavior and tailor marketing campaigns accordingly.
In this report, I present a study that aimed to predict whether users of the Starbucks app will view offers that are sent to them. The study employed three machine learning techniques to make these predictions: a Random Forest Classifier, Logistic Regression and K-Nearest Neighbors.
The data cleaning process, as well as all other analyses and predictions made, can be found on my GitHub page here.
In an ideal world, every offer that Starbucks sends out to its users would be viewed and then completed.
In reality, these offers are sent out to a large number of users who never actually view them, and many of these users complete the offer without even knowing they have done so.
Providing discounts to these unaware users results in a loss of revenue for Starbucks.
I have created a model that will predict whether or not a user will view the selected offer. This is based on their demographic information as well as the offer information itself.
The model itself will be a binary classification estimator: it will simply try to predict whether an offer will be viewed (1) or not viewed (0) by the user. This was trialed using a Random Forest Classifier, Logistic Regression and K-Nearest Neighbors to see which model provided the best prediction, as measured by the recall score.
By utilizing a model like this, Starbucks can reduce the number of generous discounts being handed out to all users and therefore limit this loss of revenue.
The primary metric that will steer this project will be the recall score.
The reason I have chosen this metric is that we have a strong desire to reduce/minimise the number of False Negatives we classify (the model predicts the user will not view the offer when in reality they would have, meaning we write off a user who would actually have engaged = Bad).
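As a quick illustration of what that means in code (a minimal sketch with made-up labels, not the project's data), recall can be computed with scikit-learn:

```python
from sklearn.metrics import recall_score

# Made-up labels: 1 = offer viewed, 0 = not viewed
y_true = [1, 1, 1, 0, 1, 0, 1, 1]  # what actually happened
y_pred = [1, 0, 1, 0, 1, 0, 1, 1]  # what the model predicted

# recall = TP / (TP + FN); the single missed "viewed" offer drags the score down
print(recall_score(y_true, y_pred))  # 0.833...
```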
The first step in my analysis was to clean the data for the users and offers of the Starbucks app. Below are the three datasets I inherited: portfolio, profile and transcript.
portfolio.json
- id (string) — offer id
- offer_type (string) — type of offer ie BOGO (Buy One Get One Free), discount, informational
- difficulty (int) — minimum required spend to complete an offer
- reward (int) — reward given for completing an offer
- duration (int) — time for offer to be open, in days
- channels (list of strings)
profile.json
- age (int) — age of the customer
- became_member_on (int) — date when customer created an app account
- gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
- id (str) — customer id
- income (float) — customer’s income
n.b. A large number of users have ages that would put them among the top five oldest people ever recorded (> 117 years old). This is an obvious error that aligns with the Null/NaN values we see in other columns. We'll deal with this later.
transcript.json
- event (str) — record description (ie transaction, offer received, offer viewed, etc.)
- person (str) — customer id
- time (int) — time in hours since start of test. The data begins at time t=0
- value — (dict of strings) — either an offer id or transaction amount depending on the record
The final dataset I used dummied a lot of the columns using the get_dummies() function in Pandas, as well as removing NaN rows (e.g. rows where the user's age exceeded 117 also had NaN values and were therefore removed).
Below is the head of the final dataset used.
The only newly defined column here is the ‘transactions’ column, which captures the total value of the transactions the user has made.
The rest of the dataset is free from NaN values, although this did result in a loss of around 13% of the users and their interactions. This is a reasonable initial compromise to make given the large size of the dataset; around 15,000 users remained after cleaning.
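A minimal sketch of that cleaning step, assuming the merged dataframe is called `df`, the raw event log is `transcript` with an `amount` column already extracted from its `value` dict, and the column names match those listed above (the exact code lives in the GitHub repo):

```python
import pandas as pd

# Rows with an impossible age (> 117) also carry NaN demographics, so drop both
df = df[df['age'] <= 117].dropna(subset=['gender', 'income'])

# One-hot encode the categorical columns (gender, offer type, etc.)
df = pd.get_dummies(df, columns=['gender', 'offer_type'])

# Hypothetical 'transactions' feature: total transaction value per user
transactions = (transcript[transcript['event'] == 'transaction']
                .groupby('person')['amount'].sum()
                .rename('transactions'))
df = df.merge(transactions, left_on='id', right_index=True, how='left')
```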
By looking at the correlation between variables in the final dataset, there doesn’t seem to be any strong correlation between the users and the types of offers they receive. This is shown by the sea of red in the top-right and lower-left portions of the graph below.
For that reason, I don’t believe these offers are being tailored to each demographic of users.
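For reference, a heatmap like the one described can be produced with a couple of lines (a sketch assuming the cleaned dataframe is `df`):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations across the cleaned dataset (numeric columns only)
corr = df.corr(numeric_only=True)

plt.figure(figsize=(12, 10))
sns.heatmap(corr, center=0)
plt.title('Correlation between user demographics and offer features')
plt.show()
```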
Just by visualizing the data, we can see that our target variable of ‘offer viewed’ has different correlation strengths with many of our predictor variables.
The image below shows the ranked correlation between the ‘offer viewed’ column and the other variables.
This shows users tend to view offers sent through social media, which they typically browse on their mobile (fairly intuitive, given phones now seem to be an extension of our bodies).
They are also more responsive to the more generous types of offers such as “Buy One Get One Free” and other high value rewards.
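That ranking can be pulled straight out of the correlation matrix; a sketch, assuming the cleaned dataframe `df` contains a target column named 'offer viewed':

```python
# Correlation of every feature with the target, ranked strongest to weakest
target_corr = (df.corr(numeric_only=True)['offer viewed']
               .drop('offer viewed')
               .sort_values(ascending=False))

print(target_corr.head(10))  # channels such as social/mobile and BOGO offers rank highly
```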
Now that the data is clean and visualized, I set about creating a model to predict whether a user will view an offer when it is sent to them.
As this is a classification problem, I chose the Random Forest Classifier, Logistic Regression and K-Nearest Neighbors as the classifiers to trial. I was able to fine-tune the hyper-parameters using StratifiedKFold and GridSearchCV within a Pipeline.
Pipelines in data science are used to simplify the process of building and evaluating machine learning models by allowing multiple steps of the model building process to be chained together and treated as a single unit. They are particularly useful for reducing the amount of code required to implement a complex machine learning workflow, and for ensuring that all of the steps in the process are executed in the correct order.
The StratifiedKFold method ensured that each fold of the training data contained a representative proportion of the different class labels. This is particularly useful when dealing with imbalanced datasets, where one class has significantly more samples than the other(s). Here, significantly more users view the offers than do not, so even a naive model that always predicts “viewed” would be right around 75% of the time.
GridSearchCV is a method for fine-tuning machine learning models by searching for the best combination of hyper-parameters. It does this by training the model on different combinations of hyper-parameters from a predefined grid and evaluating the performance of each combination. The combination of hyper-parameters that results in the best performance is then chosen as the optimal set.
By using StratifiedKFold in conjunction with GridSearchCV and Pipeline, we can fine-tune a machine learning model by systematically searching for the best combination of hyper-parameters that results in the best performance on a representative sample of the data.
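To illustrate, here is a minimal version of that tuning setup for the Random Forest; the scaler, parameter grid and split variables are assumptions for the sketch rather than the exact values used in the project:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Chain preprocessing and the classifier so they are tuned as a single unit
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42)),
])

# Hypothetical hyper-parameter grid; the project's actual grid may differ
param_grid = {
    'clf__n_estimators': [100, 200],
    'clf__max_depth': [None, 10, 20],
}

# StratifiedKFold keeps the viewed/not-viewed ratio consistent in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Score on recall, since that is the metric steering the project
grid = GridSearchCV(pipeline, param_grid, scoring='recall', cv=cv, n_jobs=-1)
grid.fit(X_train, y_train)  # X_train / y_train assumed from an earlier train/test split
print(grid.best_params_, grid.best_score_)
```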
I utilized this methodology for each of the classifiers previously mentioned.
Below are the results of the hyper-parameter fine-tuning.
All classifiers do well in identifying whether a user will view an offer upon receiving it. The best Recall score comes from the Random Forest Classifier with a score of 99.72%.
It is also important to note how quickly each model produces its predictions of the ‘offer viewed’ variable. In this test, the Logistic Regression classifier is 30x faster than its closest competitor, which we may want to consider depending on the urgency of the specific analysis we are doing.
If I were to suggest a classifier from these results, I would choose the Random Forest Classifier, provided speed of prediction is not essential. I believe that is the case for this project, as there is no urgency to hand out offers at a fast rate.
However, this project does have room for improvement and additional analyses could be added on to the back of this to increase revenue further.
For example, by bolting on a prediction of how any given offer will affect the user’s transactions, we could then ask whether an offer provides an incentive for the user to spend more than usual, or whether offers only prompt users to take advantage of that single discounted product.
To do this, I would group the data into the total each user spent before and after the offer was viewed. The ‘after’ total would then become the variable to predict, and the model used here would be a regression model.
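A rough sketch of how that before/after target could be assembled (the column names are hypothetical and, for simplicity, it assumes one viewed offer per user):

```python
# Assumed inputs: `transcript` event log with 'event', 'person', 'time' and an
# 'amount' column for transactions
viewed = (transcript[transcript['event'] == 'offer viewed'][['person', 'time']]
          .rename(columns={'time': 'viewed_time'}))

tx = transcript[transcript['event'] == 'transaction'].merge(viewed, on='person', how='inner')

# Total spend before vs. after the offer was viewed, per user
spend = (tx.assign(period=(tx['time'] >= tx['viewed_time']).map({True: 'after', False: 'before'}))
         .groupby(['person', 'period'])['amount'].sum()
         .unstack(fill_value=0))

# 'after' becomes the regression target; 'before' joins the demographic features
print(spend.head())
```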
With this improvement in mind, each step along the workflow may then need to be time bound. In this case, I would choose the Logistic Regression model as this provides a good balance of a high recall score and fast predicting performance.
In this post, I showed how three different machine learning models were used to predict whether Starbucks users would view offers.
The models used were Random Forest Classifier, Logistic Regression, and k-Nearest Neighbors.
The dataset used for the analysis included information on customer demographics, transaction history, and offer details.
The results of the analysis showed that the Random Forest Classifier performed the best, with a recall score of 99.7%, followed by k-Nearest Neighbors with a recall score of 98.3% and Logistic Regression with a recall score of 98.2%. It was concluded that the Random Forest Classifier is the most suitable model for this particular task due to its ability to handle a large number of input variables and its higher recall score.
However, an improvement could be made by adding further steps to this project to predict the effect the selected offer may have on additional revenue from the user. If we continued with this, Logistic Regression would provide a faster initial prediction whilst maintaining a comparable recall score.