Starbucks Capstone Project Blog
- Project Overview
- Data Preparation
- Data Exploration
- P-testing Offer Effectiveness
- Offer Classification
- Offer Decision Making
- Conclusion
Project Overview
As part of my Udacity Capstone project, I have to write a blog post detailing my work on the problem. This is that post.
The aim of the project is to develop a strategy for giving offers (like discounts or bogoffs) to users of the Starbucks rewards app.
The main way this will be done is by building a classification model to determine the likelihood of a user viewing or completing an offer. Ideally we would also try to estimate how a user’s spend would be impacted by an offer, or combination of offers; unfortunately this does not appear to be a tractable machine learning problem, at least with the data available.
Treating this as a classification problem is a very natural approach. It also offers a lot of flexibility and can easily be combined with subsequent work on other aspects of the problem.
The data set is imbalanced, particularly the viewed/not viewed classes, so the performance of the model will be measured by its f1-scores on each class. Since the model will be used to output estimates of probabilities, it is important that it performs well on both positive and negative classes.
The strategy itself will need real world testing to evaluate. There is some discussion of this in the conclusion.
The code and datasets for this blog are available here.
Data Preparation
Offers
The offers that Starbucks sent out are as follows. The reward is the value of the free product for a bogoff and the value of the discount otherwise. Similarly, the difficulty is the spend required to complete the offer.
The data is contained in the file portfolio.json which has the following schema.
- id (String) - offer id
- offer_type (String) - type of offer ie BOGO, discount, informational
- difficulty (Int) - minimum required spend to complete an offer
- reward (Int) - reward given for completing an offer
- duration (Int) - time for offer to be open, in days
- channels (List of Strings) - channels the offer was sent through (email, mobile, social, web)
Dummy variables were used for the channels and the offers were given a simplified name, resulting in the following portfolio.
Offer ID | Reward | Difficulty | Duration | Offer Type | Email | Mobile | Social | Web | Name
---|---|---|---|---|---|---|---|---|---
9b98b8c7a33c4b65b9aebfe6a799e6d9 | 5 | 5 | 7 | bogo | 1.0 | 1.0 | 0.0 | 1.0 | bogo,5,5,7 |
f19421c1d4aa40978ebb69ca19b0e20d | 5 | 5 | 5 | bogo | 1.0 | 1.0 | 1.0 | 1.0 | bogo,5,5,5 |
ae264e3637204a6fb9bb56bc8210ddfd | 10 | 10 | 7 | bogo | 1.0 | 1.0 | 1.0 | 0.0 | bogo,10,10,7 |
4d5c57ea9a6940dd891ad53e9dbe8da0 | 10 | 10 | 5 | bogo | 1.0 | 1.0 | 1.0 | 1.0 | bogo,10,10,5 |
fafdcd668e3743c1bb461111dcafc2a4 | 2 | 10 | 10 | discount | 1.0 | 1.0 | 1.0 | 1.0 | discount,2,10,10 |
2906b810c7d4411798c6938adc9daaa5 | 2 | 10 | 7 | discount | 1.0 | 1.0 | 0.0 | 1.0 | discount,2,10,7 |
2298d6c36e964ae4a3e7e9706d1fb8c2 | 3 | 7 | 7 | discount | 1.0 | 1.0 | 1.0 | 1.0 | discount,3,7,7 |
0b1e1539f2cc45b7b9fa7c272da2e1d7 | 5 | 20 | 10 | discount | 1.0 | 0.0 | 0.0 | 1.0 | discount,5,20,10 |
3f207df678b143eea3cee63160fa8bed | 0 | 0 | 4 | informational | 1.0 | 1.0 | 0.0 | 1.0 | informational,0,0,4 |
5a8bc65990b245e5a138643cd4eb9837 | 0 | 0 | 3 | informational | 1.0 | 1.0 | 1.0 | 0.0 | informational,0,0,3 |
Events and Offers
All the events - offers received, viewed, or completed, as well as transactions - are recorded in a single file, transcript.json. The columns available were:
- event (String) - record description (ie transaction, offer received, offer viewed, etc.)
- person (String) - customer id
- time (Int) - time in hours since start of test. The data begins at time t=0
- value - (Dict of Strings) - either an offer id or transaction amount depending on the record
The rows of the frame look like the following:
Person | Event | Value | Time |
---|---|---|---|
78afa995795e4d85b5d9ceeca43f5fef | Offer Received | {‘offer id’: ‘9b98b8c7a33c4b65b9aebfe6a799e6d9’} | 0 |
389bc3fa690240e798340f5a15918d5c | Offer Viewed | {‘offer id’: ‘f19421c1d4aa40978ebb69ca19b0e20d’} | 0 |
9fa9ae8f57894cc9a3b8a9bbe0fc1b2 | Offer Completed | {‘offer_id’: ‘2906b810c7d4411798c6938adc9daaa5’} | 0 |
02c083884c7d45b39cc68e1314fec56c | Transaction | {‘amount’: 0.8300000000000001} | 0 |
Most of the pre-processing for the model was needed here.
The events needed to be assembled into frames by event type and then aggregated further. Transactional information like mean spend, total spend, and number of transactions was gathered and merged with the user data. It was also often necessary to pull this data for specific time periods, for example getting the total spend during the period that a given offer was active.
Some care was needed to match up offers received by a user to subsequent events like viewings or completions. A user might receive the same offer twice across the month, so it was necessary to ensure that these events occurred during the duration of the offer. The informational offers did not have a completion condition, so the offer was said to be completed if a transaction was made within the duration specified.
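A minimal sketch of this windowed matching, assuming the transcript has been split into per-event frames (the frame and column names here are hypothetical):

```python
def events_in_offer_window(events, person, offer_id, received_at, duration_days):
    """Events (views or completions) tied to one received-offer instance:
    same person and offer, occurring between receipt and expiry.
    Transcript times are in hours, offer durations in days."""
    end = received_at + 24 * duration_days
    mask = (
        (events['person'] == person)
        & (events['offer_id'] == offer_id)
        & events['time'].between(received_at, end)
    )
    return events.loc[mask]


def spend_in_window(transactions, person, start, end):
    """Total spend for one user over a window, e.g. while an offer was live."""
    mask = (
        (transactions['person'] == person)
        & transactions['time'].between(start, end)
    )
    return transactions.loc[mask, 'amount'].sum()
```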
It was important to keep the data on offer viewings because offers are redeemed automatically by the app. In particular, someone might complete an offer without ever knowing they had been offered it. This is something we would like to avoid, since they would have made the purchase anyway. It is not totally clear from the information provided exactly how this happens. For example, if you complete a bogoff offer, are you given a voucher for a free item next time? If not, it is hard to see how you could unknowingly redeem the offer.
If that is the case, then there does not appear to be any transactional data relating to the redemption of such vouchers; this is information it would be useful to have when estimating the cost of providing an offer. In fact the only data on cost is the reward value of the offer, which is not ideal: it clearly does not cost Starbucks 10 dollars to provide a free item with a 10 dollar menu price.
User Profiles
The user profiles are contained in the file profile.json, with the following entries:
- age (Int) - age of the customer
- became_member_on (Int) - date when customer created an app account
- gender (String) - gender of the customer (some entries contain ‘O’ rather than M or F)
- id (String) - customer id
- income (Float) - customer’s income
The became_member_on column was replaced by an account age field (in years). The gender labels were replaced with dummy Booleans. The transactional data aggregated from the previous file were also merged in.
Unlike the other data sets, some cleaning was needed here. There was a large peak of users with age 118; this must be a placeholder for no entry.
If we set these ages to NaN, we see that they correspond to users for whom the gender and income are also missing. There are 2,175 of these people, for whom we have essentially no user data, so we simply drop them from the DataFrame (a sketch of this cleaning follows the table).
Column | Missing Values
---|---
Gender | 2175 |
Age | 2175 |
Became Member On | 0 |
Income | 2175 |
Account Age | 0 |
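A sketch of this preparation step (assuming line-delimited JSON; the reference date used for the account age is an assumption):

```python
import numpy as np
import pandas as pd

profile = pd.read_json('profile.json', orient='records', lines=True)

# age 118 is a placeholder for a missing entry
profile['age'] = profile['age'].replace(118, np.nan)
print(profile.isna().sum())  # gender, age and income are missing together

# drop the 2175 users with essentially no demographic data
profile = profile.dropna(subset=['age'])

# convert the membership date (an int like 20170712) to an account age in years
joined = pd.to_datetime(profile['became_member_on'], format='%Y%m%d')
profile['account_age'] = (pd.Timestamp('2018-08-01') - joined).dt.days / 365.25
```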
Data Exploration
If we do a bar plot of a user’s spend in the training period, we see that the modal spend is only a few dollars - likely the cost of a single coffee. It is not uncommon, however, for users to spend in the 10-20 dollar range, and higher spends become increasingly uncommon.
A plot of income versus spend reveals some clear clustering.
There’s a very noticeable group of people with very high spend, especially compared to income. Although we are unlikely to be able to use raw spend and transaction data to predict offer uptake (since this data is partially a function of offers completed), these kinds of groupings might be useful both for the model itself as well as development of heuristics.
If someone were to go to Starbucks every day and pay the mean transaction spend each day, they would pay about $400 over the month. Most of the people in the high-spend cluster spent more than this. We’ll call this group Big Spenders.
To identify this group more methodically we can use K-means clustering. The data points we use for the clustering are
- Income
- Mean Spend
- Number of Transactions
- Spend/Income
We see from an elbow plot that the optimum number of clusters is four.
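A minimal sketch of this step (the `users` frame and its column names are hypothetical):

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ['income', 'mean_spend', 'n_transactions', 'spend_over_income']
X = StandardScaler().fit_transform(users[features])

# elbow plot: inertia against k flattens out at k = 4
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
    for k in range(1, 10)
]

users['cluster'] = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
```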
Once we do the clustering we see that cluster 3 corresponds to the expected Big Spender cluster.
These clusters all have quite different spending patterns, exactly as we wanted. The big_spender cluster in particular is very different, having a totally different mean-spend distribution and a distinct transaction-count distribution as well.
Offer Completion
We can see that the impact of rewards is not uniform in how they affect spend. Users claiming a reward tend to spend significantly more than those not completing an offer, for whom the average spend is about 10 dollars and the modal spend about $2.50.
Reward ($) | Mean Spend ($) |
---|---|
0 | 10.6 |
2 | 20.0 |
3 | 17.5 |
5 | 21.2 |
10 | 23.2 |
We can also ask what proportion of offers are completed.
Name | Viewed | Completed |
---|---|---|
bogo,10,10,5 | 0.956432 | 0.478789 |
bogo,10,10,7 | 0.884469 | 0.511061 |
bogo,5,5,5 | 0.953036 | 0.600229 |
bogo,5,5,7 | 0.519257 | 0.570684 |
discount,2,10,10 | 0.967377 | 0.694608 |
discount,2,10,7 | 0.521993 | 0.529399 |
discount,3,7,7 | 0.957282 | 0.684844 |
discount,5,20,10 | 0.332224 | 0.425150 |
informational,0,0,3 | 0.814573 | 0.112565 |
informational,0,0,4 | 0.477578 | 0.129372 |
All | 0.736746 | 0.472950 |
We can also think of this in terms of total rewards earned versus rewards offered. In both cases we see that about 50% of offers are completed/rewards are earned.
The poor fit towards the end is likely due, at least in part, to fewer people being offered such large max rewards.
P-testing Offer Effectiveness
We can also ask whether these offers have a meaningful or (statistically) significant impact on user spending. To this end we employ a t-test between offer recipients during the offer period and those who received no offer in the same period. The table of results is as follows. The $p_x$ columns contain $-\log(p)$, where $p$ is the p-value output of the t-test; a score above 3 can be considered significant.
Offer | p_age | p_income | p_total-cost | total-cost | p_total | offered_total | null_total | p_mean | offered_mean | null_mean | p_count | offered_count | null_count
---|---|---|---|---|---|---|---|---|---|---|---|---|---
discount,2,10,10 | 0.373081 | 0.513743 | 251.415865 | 30.652512 | 283.408413 | 46.875874 | 14.223363 | 36.634555 | 13.352046 | 8.858298 | inf | 3.280471 | 1.043826 |
discount,3,7,7 | 0.472075 | 0.418224 | 187.476382 | 18.569828 | 249.576284 | 33.467657 | 11.897829 | 32.359681 | 11.840403 | 8.185221 | inf | 2.492615 | 0.867023 |
bogo,5,5,5 | 0.627717 | 0.307679 | 92.664772 | 12.317879 | 177.398431 | 26.940110 | 9.622231 | 52.153454 | 11.919873 | 6.976287 | inf | 1.866667 | 0.705882 |
discount,2,10,7 | 0.888283 | 0.627830 | 80.655369 | 12.179083 | 107.848849 | 26.076912 | 11.897829 | 21.634059 | 11.161493 | 8.185221 | 377.954748 | 1.737208 | 0.867023 |
discount,5,20,10 | 0.367586 | 0.735100 | 53.657970 | 11.965843 | 104.198406 | 31.189206 | 14.223363 | 19.425179 | 12.090161 | 8.858298 | 332.601467 | 1.965403 | 1.043826 |
bogo,10,10,7 | 0.949229 | 0.115336 | 48.955730 | 9.915389 | 184.972483 | 31.813218 | 11.897829 | 29.090939 | 11.862773 | 8.185221 | inf | 2.413184 | 0.867023 |
bogo,5,5,7 | 0.725452 | 0.584020 | 40.097766 | 7.913657 | 101.293686 | 24.811486 | 11.897829 | 14.263930 | 10.484330 | 8.185221 | 372.677502 | 1.742637 | 0.867023 |
informational,0,0,3 | 0.952612 | 0.920038 | 108.007086 | 6.960021 | 108.007086 | 13.918947 | 6.958925 | 49.624650 | 8.354929 | 5.289228 | 420.722524 | 1.116625 | 0.522302 |
bogo,10,10,5 | 0.178025 | 0.665663 | 34.850332 | 6.925740 | 189.016236 | 26.547971 | 9.622231 | 59.501045 | 11.383595 | 6.976287 | inf | 1.933272 | 0.705882 |
informational,0,0,4 | 0.489737 | 0.747444 | 59.748068 | 5.720949 | 59.748068 | 13.865758 | 8.144809 | 25.925570 | 8.166071 | 6.139471 | 220.613258 | 1.012556 | 0.597345 |
The key figures here are that age and income seem similarly distributed across null and test groups, and that the $p_{total-cost}$ value is always large. This means that we can reject the hypothesis that the total spend of the people to whom the offer was made was lower than the total spend of those to whom it wasn’t, plus the monetary value of the offer itself. That is, the impact of the offer was both significant and meaningful. Although these tests were aggregated into one number for each offer, the information used was only from the periods in which the offer was active.
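For reference, the comparison behind the $p_{total-cost}$ column can be sketched as a one-sided Welch test (the function and its inputs are hypothetical; the original may differ in detail):

```python
import numpy as np
from scipy import stats

def neg_log_p_total_cost(offered_totals, null_totals, offer_value):
    """-log(p) for the hypothesis that offered users' total spend exceeds
    that of un-offered users even after adding the offer's monetary value."""
    t, p = stats.ttest_ind(
        offered_totals,
        np.asarray(null_totals) + offer_value,
        equal_var=False,         # Welch's t-test
        alternative='greater',   # one-sided: offered > null + cost
    )
    return -np.log(p)
```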
Noticeably there was an increase in both mean spend and number of transactions among offered groups.
Offer | Increase in Spend / Duration | Reward | Difficulty | Duration | Offer Type | Email | Mobile | Social | Web
---|---|---|---|---|---|---|---|---|---
discount,2,10,10 | 3.065251 | 2 | 10 | 10 | discount | 1.0 | 1.0 | 1.0 | 1.0 |
discount,3,7,7 | 2.652833 | 3 | 7 | 7 | discount | 1.0 | 1.0 | 1.0 | 1.0 |
bogo,5,5,5 | 2.463576 | 5 | 5 | 5 | bogo | 1.0 | 1.0 | 1.0 | 1.0 |
informational,0,0,3 | 2.320007 | 0 | 0 | 3 | informational | 1.0 | 1.0 | 1.0 | 0.0 |
discount,2,10,7 | 1.739869 | 2 | 10 | 7 | discount | 1.0 | 1.0 | 0.0 | 1.0 |
informational,0,0,4 | 1.430237 | 0 | 0 | 4 | informational | 1.0 | 1.0 | 0.0 | 1.0 |
bogo,10,10,7 | 1.416484 | 10 | 10 | 7 | bogo | 1.0 | 1.0 | 1.0 | 0.0 |
bogo,10,10,5 | 1.385148 | 10 | 10 | 5 | bogo | 1.0 | 1.0 | 1.0 | 1.0 |
discount,5,20,10 | 1.196584 | 5 | 20 | 10 | discount | 1.0 | 0.0 | 0.0 | 1.0 |
bogo,5,5,7 | 1.130522 | 5 | 5 | 7 | bogo | 1.0 | 1.0 | 0.0 | 1.0 |
While all the offers are deemed effective by our p-tests, there do seem to be differences in performance. Roughly speaking, it looks like the 2,10,10 discount performs best, but comparing the offers like this is on shaky statistical footing. There may be other factors impacting performance; e.g. people might be more likely to buy coffee on a Monday, so offers whose windows include one (or more) Mondays might get a boost.
Offer Classification
Random Forest Model - Offer by offer prediction
We will use a random forest classifier to predict whether a user will view or complete a given offer. Initially a model is fitted for each offer individually so only information about the user is relevant.
The data points used for the predictions are:
- Age (Int)
- Income (Int)
- Account age (Int)
- Big Spender Cluster (Bool)
- Cluster 2 (Bool)
- Male (Bool)
- Female (Bool)
- Mean Spend (Float)
The Cluster 0 and Cluster 1 data points are not used because they don’t improve the predictive power of the model.
In addition to the base classifier, SMOTE is used to resample the data due to the imbalanced nature of the data set. Further, the parameters are tuned with a grid search carried out independently for each offer. The parameter grid is:
import numpy as np

param_grid = {
    'criterion': ['gini', 'entropy'],
    # floats are interpreted by sklearn as fractions of the training samples
    'min_samples_split': np.logspace(-5, -1, 5),
    'min_samples_leaf': np.logspace(-6, -2, 5),
}
Other than these parameters, the default sklearn options were used, largely due to computational limits. With a different model for each offer, running extensive grid searches is very time consuming, especially on basic hardware.
The problem is a multi-output classification one, and therefore we can either run the grid search for a single multi-output classifier, or run the search on each output separately. The second option was chosen as it improved performance, and with only two outputs it isn’t too difficult to do. The grids evaluated parameters by their f1-score across three cross-validation folds.
The best parameters are then used to refit a model on the entire training set.
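A minimal sketch of this per-offer setup, using an imblearn pipeline so that SMOTE is applied inside each cross-validation fold (the train split names are hypothetical):

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('model', RandomForestClassifier(random_state=42)),
])

param_grid = {
    'model__criterion': ['gini', 'entropy'],
    'model__min_samples_split': np.logspace(-5, -1, 5),
    'model__min_samples_leaf': np.logspace(-6, -2, 5),
}

# one search per output (viewed, completed), scored by f1 over three folds
search = GridSearchCV(pipeline, param_grid, scoring='f1', cv=3)
search.fit(X_train, y_viewed_train)  # and again with y_completed_train
```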
We collect the results for each offer below.
Type | Offer | Metric | Not Viewed | Viewed | Not Completed | Completed |
---|---|---|---|---|---|---|
bogo | bogo,10,10,5 | f1-score | 0.470588 | 0.976853 | 0.799180 | 0.843949 |
precision | 0.758621 | 0.960902 | 0.899654 | 0.776557 | ||
recall | 0.341085 | 0.993343 | 0.718894 | 0.924150 | ||
support | 129.000000 | 2103.000000 | 1085.000000 | 1147.000000 | ||
bogo,10,10,7 | f1-score | 0.651357 | 0.957582 | 0.793616 | 0.855714 | |
precision | 0.829787 | 0.933168 | 0.850236 | 0.817647 | ||
recall | 0.536082 | 0.983307 | 0.744066 | 0.897498 | ||
support | 291.000000 | 1917.000000 | 969.000000 | 1239.000000 | ||
bogo,5,5,5 | f1-score | 0.375000 | 0.978962 | 0.661710 | 0.805556 | |
precision | 0.642857 | 0.965422 | 0.651220 | 0.813084 | ||
recall | 0.264706 | 0.992888 | 0.672544 | 0.798165 | ||
support | 102.000000 | 2109.000000 | 794.000000 | 1417.000000 | ||
bogo,5,5,7 | f1-score | 0.675980 | 0.690486 | 0.683483 | 0.816823 | |
precision | 0.674460 | 0.691976 | 0.655530 | 0.837491 | ||
recall | 0.677507 | 0.689003 | 0.713927 | 0.797151 | ||
support | 1107.000000 | 1164.000000 | 797.000000 | 1474.000000 | ||
discount | discount,2,10,10 | f1-score | 0.503704 | 0.984566 | 0.704268 | 0.877370 |
precision | 0.809524 | 0.973133 | 0.689552 | 0.885204 | ||
recall | 0.365591 | 0.996270 | 0.719626 | 0.869674 | ||
support | 93.000000 | 2145.000000 | 642.000000 | 1596.000000 | ||
discount,2,10,7 | f1-score | 0.691812 | 0.691812 | 0.683951 | 0.812317 | |
precision | 0.685506 | 0.698236 | 0.708440 | 0.795977 | ||
recall | 0.698236 | 0.685506 | 0.661098 | 0.829341 | ||
support | 1077.000000 | 1097.000000 | 838.000000 | 1336.000000 | ||
discount,3,7,7 | f1-score | 0.488550 | 0.984693 | 0.597179 | 0.840965 | |
precision | 0.761905 | 0.974231 | 0.532123 | 0.883615 | ||
recall | 0.359551 | 0.995381 | 0.680357 | 0.802243 | ||
support | 89.000000 | 2165.000000 | 560.000000 | 1694.000000 | ||
discount,5,20,10 | f1-score | 0.882175 | 0.758763 | 0.733211 | 0.743363 | |
precision | 0.885445 | 0.753070 | 0.850587 | 0.656250 | ||
recall | 0.878930 | 0.764543 | 0.644301 | 0.857143 | ||
support | 1495.000000 | 722.000000 | 1237.000000 | 980.000000 | ||
informational | informational,0,0,3 | f1-score | 0.693997 | 0.944073 | 0.933436 | 0.510397 |
precision | 0.858696 | 0.912099 | 0.887152 | 0.828221 | ||
recall | 0.582310 | 0.978369 | 0.984816 | 0.368852 | ||
support | 407.000000 | 1803.000000 | 1844.000000 | 366.000000 | ||
informational,0,0,4 | f1-score | 0.709934 | 0.691404 | 0.943843 | 0.472813 | |
precision | 0.715939 | 0.685289 | 0.900962 | 0.854701 | ||
recall | 0.704028 | 0.697630 | 0.991010 | 0.326797 | ||
support | 1142.000000 | 1055.000000 | 1891.000000 | 306.000000 |
Recall on classes with low support is predictably poor despite the resampling. This is a limitation of the data; there’s only so much we can do to improve the performance here. That being said, we can also try building a model to classify multiple offers at once. In particular, we can start to use information about the offer itself, such as whether it was sent by email. We can do this either for all offers at once or by offer type, as sketched below.
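One way to set this up (a sketch; the frame and column names are hypothetical) is to stack each user’s features alongside the offer’s own attributes, so a single model sees both:

```python
# each row is one (user, offer received) pair with user and offer features
data = (offers_received
        .merge(users, on='person')
        .merge(portfolio, on='offer_id'))

user_cols = ['age', 'income', 'account_age', 'big_spender',
             'male', 'female', 'mean_spend']
offer_cols = ['reward', 'difficulty', 'duration',
              'email', 'mobile', 'social', 'web']

X = data[user_cols + offer_cols]
Y = data[['viewed', 'completed']]
```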
Random Forest Model - Alternative prediction methods
Once again a grid search was run on each output before a model was refit on the entire training set. This time the parameter grid was slightly expanded since there are fewer models to build.
import numpy as np

param_grid = {
    'criterion': ['gini', 'entropy'],
    'min_samples_split': np.logspace(-5, -1, 5),
    'min_samples_leaf': np.logspace(-6, -2, 5),
}
Because the Completed class is less imbalanced for some of these aggregated sets, SMOTE sampling was less clearly needed. The performance differences were very small, however, and it was much easier to pick a single option. With a bit of work the resampling could have been included as a parameter in the grid search, which would have been a good way to eke out a bit of extra performance if one of these methods were to be chosen, as sketched below.
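For reference, with the pipeline above the resampling step could itself be searched over; sklearn-style pipelines allow a step to be swapped for 'passthrough' via the parameter grid (a sketch, assuming this also holds for the imblearn pipeline used here):

```python
param_grid = {
    # toggle resampling on and off as part of the search
    'smote': [SMOTE(random_state=42), 'passthrough'],
    'model__criterion': ['gini', 'entropy'],
    'model__min_samples_split': np.logspace(-5, -1, 5),
    'model__min_samples_leaf': np.logspace(-6, -2, 5),
}
```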
To compare these models we consider their combined performance by offer type as well as their overall performance. A breakdown by offer is also possible but it is difficult to read such a large table and meaningfully interpret the results.
Type | Metric | Method | Not Viewed | Viewed | Not Completed | Completed |
---|---|---|---|---|---|---|
bogo | f1-score | All at once | 0.521869 | 0.915683 | 0.730182 | 0.830887 |
Offer by offer | 0.645097 | 0.927438 | 0.738933 | 0.829567 | ||
Type by type | 0.567692 | 0.879088 | 0.731456 | 0.792449 | ||
precision | All at once | 0.667304 | 0.881793 | 0.777090 | 0.800597 | |
Offer by offer | 0.697857 | 0.913321 | 0.765207 | 0.811379 | ||
Type by type | 0.487450 | 0.921516 | 0.688195 | 0.832916 | ||
recall | All at once | 0.428484 | 0.952283 | 0.688615 | 0.863559 | |
Offer by offer | 0.599754 | 0.941999 | 0.714403 | 0.848588 | ||
Type by type | 0.679558 | 0.840395 | 0.780521 | 0.755732 | ||
support | NaN | 1629.000000 | 7293.000000 | 3645.000000 | 5277.000000 | |
discount | f1-score | All at once | 0.728989 | 0.881592 | 0.664318 | 0.818087 |
Offer by offer | 0.786861 | 0.906455 | 0.687559 | 0.824842 | ||
Type by type | 0.721246 | 0.856631 | 0.681159 | 0.804071 | ||
precision | All at once | 0.743580 | 0.874098 | 0.699023 | 0.796653 | |
Offer by offer | 0.800000 | 0.899968 | 0.706602 | 0.812565 | ||
Type by type | 0.663415 | 0.896841 | 0.660832 | 0.819563 | ||
recall | All at once | 0.714960 | 0.889215 | 0.632896 | 0.840706 | |
Offer by offer | 0.774147 | 0.913036 | 0.669515 | 0.837496 | ||
Type by type | 0.790123 | 0.819873 | 0.702777 | 0.789154 | ||
support | NaN | 2754.000000 | 6129.000000 | 3277.000000 | 5606.000000 | |
informational | f1-score | All at once | 0.606533 | 0.826797 | 0.917465 | 0.000000 |
Offer by offer | 0.706242 | 0.852370 | 0.938692 | 0.493697 | ||
Type by type | 0.612913 | 0.775327 | 0.901182 | 0.086047 | ||
precision | All at once | 0.713537 | 0.775598 | 0.847515 | 0.000000 | |
Offer by offer | 0.744103 | 0.831117 | 0.894112 | 0.839286 | ||
Type by type | 0.587678 | 0.795145 | 0.849490 | 0.196809 | ||
recall | All at once | 0.527437 | 0.885234 | 1.000000 | 0.000000 | |
Offer by offer | 0.672046 | 0.874738 | 0.987952 | 0.349702 | ||
Type by type | 0.640413 | 0.756473 | 0.959572 | 0.055060 | ||
support | NaN | 1549.000000 | 2858.000000 | 3735.000000 | 672.000000 |
Metric | Method | Not Viewed | Viewed | Not Completed | Completed |
---|---|---|---|---|---|
f1-score | All at once | 0.646922 | 0.886994 | 0.782560 | 0.800414 |
Offer by offer | 0.728326 | 0.906261 | 0.797295 | 0.813419 | |
Type by type | 0.649153 | 0.852142 | 0.776350 | 0.770578 | |
precision | All at once | 0.719983 | 0.859092 | 0.784515 | 0.798587 |
Offer by offer | 0.759517 | 0.893599 | 0.798119 | 0.812646 | |
Type by type | 0.590551 | 0.889334 | 0.737625 | 0.814440 | |
recall | All at once | 0.587323 | 0.916769 | 0.780614 | 0.802250 |
Offer by offer | 0.699595 | 0.919287 | 0.796472 | 0.814193 | |
Type by type | 0.720668 | 0.817936 | 0.819368 | 0.731199 | |
support | NaN | 5932.000000 | 16280.000000 | 10657.000000 | 11555.000000 |
The Offer by Offer method is the clear winner here. It is the best in almost every case and never the worst. All at Once often offers similar performance but performs really poorly in some areas; in particular, predicting completion of informational offers and recall of Not Viewed are very poor. Type by Type does occasionally do better than Offer by Offer but is also often much worse; overall it is the poorer choice.
That being said, the aggregated methods offer a degree of flexibility that the offer-by-offer approach does not: with a separate model per offer, there is no scope to estimate performance on untested offer combinations. A Random Forest model may not be the best choice for that kind of prediction, however.
Alternative Model Testing
Just as a sanity check, we can see how other types of model perform, to confirm that the choice of a random forest was sensible. For speed and ease of comparison we use the models to predict against all the offers at once, skip the grid search, and add a standard scaler to the pipeline. Only a quick reference was sought, so all models were run with their default setups; as usual this also kept computational time down. The models tested are as follows, with the comparison loop sketched after the table.
Models | Initial |
---|---|
Logistic Regression | LR |
Decision Tree Classifier | DTC |
Linear Discriminant Analysis | LDA |
Random Forest Classifier | RFC |
K-Neighbours Classifier | KNC |
Gaussian Naive Bayes | GNB |
SGDClassifier | SGDC |
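The comparison loop itself might look like the following (a sketch; the train/test splits and target columns are hypothetical names):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import classification_report
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

models = {
    'LR': LogisticRegression(max_iter=1000),
    'DTC': DecisionTreeClassifier(),
    'LDA': LinearDiscriminantAnalysis(),
    'RFC': RandomForestClassifier(),
    'KNC': KNeighborsClassifier(),
    'GNB': GaussianNB(),
    'SGDC': SGDClassifier(),
}

for name, model in models.items():
    clf = Pipeline([
        ('scale', StandardScaler()),
        ('model', MultiOutputClassifier(model)),  # two outputs: viewed, completed
    ])
    clf.fit(X_train, Y_train)
    pred = clf.predict(X_test)
    print(name)
    print(classification_report(Y_test['viewed'], pred[:, 0]))
    print(classification_report(Y_test['completed'], pred[:, 1]))
```

The per-class scores are collected below.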
Metric | Model | Not Viewed | Viewed | Not Completed | Completed |
---|---|---|---|---|---|
f1-score | RFC | 0.642572 | 0.878630 | 0.763706 | 0.771635 |
KNC | 0.619437 | 0.872887 | 0.766370 | 0.773149 | |
LR | 0.615140 | 0.868381 | 0.767676 | 0.770751 | |
SGDC | 0.583110 | 0.853203 | 0.766043 | 0.760177 | |
LDA | 0.638180 | 0.850885 | 0.740597 | 0.752376 | |
DTC | 0.585024 | 0.844201 | 0.720325 | 0.708758 | |
GNB | 0.591651 | 0.841299 | 0.710868 | 0.747883 | |
precision | LDA | 0.588227 | 0.881745 | 0.727783 | 0.765238 |
RFC | 0.678927 | 0.862939 | 0.745974 | 0.789779 | |
GNB | 0.565759 | 0.856534 | 0.732862 | 0.728811 | |
LR | 0.646131 | 0.854366 | 0.742439 | 0.797501 | |
KNC | 0.663649 | 0.853886 | 0.746883 | 0.793245 | |
DTC | 0.572651 | 0.851105 | 0.679883 | 0.755560 | |
SGDC | 0.598368 | 0.845611 | 0.728064 | 0.803121 | |
recall | RFC | 0.609912 | 0.894902 | 0.782303 | 0.754305 |
KNC | 0.580748 | 0.892752 | 0.786901 | 0.754046 | |
LR | 0.586986 | 0.882862 | 0.794689 | 0.745738 | |
SGDC | 0.568611 | 0.860934 | 0.808201 | 0.721592 | |
DTC | 0.597943 | 0.837408 | 0.765882 | 0.667417 | |
GNB | 0.620027 | 0.826597 | 0.690157 | 0.767979 | |
LDA | 0.697404 | 0.822113 | 0.753871 | 0.739939 |
As we can see, the Random Forest Classifier is the best performing, but both the K-Neighbours Classifier and Logistic Regression offer similar performance, especially the former. Logistic Regression in particular might be used to build a model better suited to predicting the performance of offers which have not been tested here.
Offer Decision Making
There are many ways we could make the decision on which offers to suggest. We could, for example, send any offer we think will be accepted. Here we will use a rough estimate of the gain from sending an offer, calculated from the estimated increase in spend, the cost of the reward, and the predicted probabilities of viewing and completion.
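One way to write such an estimate (an illustrative reconstruction; the exact weighting is an assumption, and viewing and completion are treated as independent) is

$$\text{gain} \approx P(\text{completed}) \cdot \Delta\text{spend} \;-\; P(\text{not viewed}) \cdot P(\text{completed}) \cdot \text{reward},$$

where $\Delta\text{spend}$ is the estimated increase in spend from the p-testing section and the second term approximates the expected cost of unknowing redemptions.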
Ideally we would replace $P(\text{not viewed})$ with $P(\text{not viewed} \mid \text{completed})$, but since estimation of whether an offer would be viewed was already fairly poor, I worried such an estimate would be very inaccurate.
Similarly, the reward is used as an estimate of the cost of providing the offer, but it is not a perfect representation. It likely over-estimates the gain of discounts versus buy-one-get-one-free offers, as the marginal cost of providing a second coffee is likely much lower than the menu price.
We use this to pick the top three offers, with no more than two of each type, for each user.
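A sketch of this selection rule (the helper and its inputs are hypothetical):

```python
def top_offers(gains, n=3, max_per_type=2):
    """Pick the n highest-gain offers for one user, with at most
    max_per_type offers of any one type (bogo/discount/informational).

    gains: list of (offer_name, offer_type, estimated_gain) tuples.
    """
    picked, per_type = [], {}
    for name, otype, _ in sorted(gains, key=lambda g: g[2], reverse=True):
        if per_type.get(otype, 0) < max_per_type:
            picked.append(name)
            per_type[otype] = per_type.get(otype, 0) + 1
        if len(picked) == n:
            break
    return picked
```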
The first few entries of the resulting DataFrame look like this.
Person | 1st | 2nd | 3rd |
---|---|---|---|
ae6f43089b674728a50b8727252d3305 | discount,2,10,10 | bogo,10,10,5 | bogo,10,10,7 |
ad1f0a409ae642bc9a43f31f56c130fc | discount,2,10,10 | discount,5,20,10 | bogo,10,10,5 |
dce784e26f294101999d000fad9089bb | discount,2,10,10 | bogo,10,10,5 | discount,3,7,7 |
4d0ebb94a5a94fe6afd9350c7b1477e4 | discount,2,10,10 | discount,5,20,10 | bogo,10,10,5 |
7d7f271652d244f78b97c73cd734c553 | discount,5,20,10 | discount,2,10,10 | bogo,10,10,7 |
The discount,2,10,10 offer appears a lot in these rows. In fact we can count the number of times each offer is recommended and see that this pattern holds in the rest of the DataFrame.
Offer | Count
---|---
bogo,10,10,7 | 6367
bogo,10,10,5 | 7518
bogo,5,5,7 | 0
bogo,5,5,5 | 1608
discount,5,20,10 | 8221
discount,2,10,10 | 14607
discount,2,10,7 | 499
discount,3,7,7 | 5655
informational,0,0,4 | 0
informational,0,0,3 | 0
Conclusion
Far and away the most suggested offer is the 2,10,10 discount. This is also the best performing offer according to the tentative analysis in the p-testing section, which is reassuring. The decision-making algorithm is a little crude. It could be improved with access to more data, like the true cost of providing offer rewards. Given a larger data set we might also be able to estimate the conditional probabilities needed to give a more accurate estimate of offer gain.
Ultimately it isn’t really possible to give a good measure of how effective this strategy is without real-world testing. An experiment could be run where different groups are given offers according to different strategies, e.g. the method above, random allocation, everyone gets the same offer, etc. Performance could then be compared across groups and the best strategy found.
As well as running a one-off experiment, this could be done as a continuous process. Since the rewards app collects all the relevant data, models could be constantly updated and different A/B tests run each month with tweaked offer strategies in order to keep improving performance.
We can of course evaluate the model performance according to the f1-scores we highlighted as a metric at the beginning. It’s difficult to say whether a model is good or bad in a vacuum, based only on its scores, but the overall performance was respectable. The grid search implementation, the SMOTE sampling, and the decision to build a separate model for each offer all improved the scores over the base Random Forest implementation. In particular this increased the Not Viewed f1-score (which was the worst) from about 0.64 to 0.73, driven by better scores in both recall and precision.
There were also noticeable, if less dramatic, improvements in the other f1-scores.
It is also clear, both from the poor performance of the model itself and from the predictions made, that informational offers have not been handled well. They are very different in nature to the other offers, and there is room for more exploration of when and how to use these kinds of offers.
As mentioned at the beginning, what is really missing here is a good estimate of the impact on a user’s spend of receiving an offer. It is clear from the p-testing section that there is a benefit to sending out an offer beyond just that realised by an additional purchase to redeem the offer. For example the 2,10,10 discount led to an increase in mean number of transactions from 1 to 3, and the total spend rose by over $30 as a result.
There is also scope for development of additional heuristics. Partly these depend on company goals. For example, we might want to avoid sending offers to big spenders on the basis that we think they are unlikely to be swayed by them. Conversely, we might want to send offers to regular users, even if they would make the purchases anyway, in order to increase brand loyalty and retain their custom.
Some key areas for future work are as follows:
- Better identify underlying relationships in the data
- Who are social media offers most effective on?
- Do any users react negatively to receiving offers?
- Investigate the use of ensemble models to improve performance
- Train alternative models like Logistic Regression and K-Neighbours and combine the outputs
- Train models in different ways like Type by Type or All at Once and combine