Project Overview

As part of my Udacity Capstone project, I have to write a blog post detailing my work on the problem. This is that post.

The aim of the project is to develop a strategy for giving offers (like discounts or bogoffs, i.e. buy one get one free) to users of the Starbucks rewards app.

The main way this will be done is by building a classification model to determine the likelihood of a user viewing or completing an offer. Ideally we would also try to estimate how a user’s spend would be impacted by an offer, or combination of offers, but unfortunately this does not appear to be a tractable machine learning problem, at least with the data available.

Treating this as a classification problem is a very natural approach. It also offers a lot of flexibility and can easily be combined with subsequent work on other aspects of the problem.

The data set is imbalanced, particularly the viewed/not viewed classes, so the performance of the model will be measured by its f1-scores on each class. Since the model will be used to output probability estimates, it is important that it performs well on both positive and negative classes.

The strategy itself will need real world testing to evaluate. There is some discussion of this in the conclusion.

The code and datasets for this blog are available here.

Data Preparation

Offers

The offers that Starbucks sent out are as follows. The reward is the value of the product for a bogoff and the value of the discount otherwise. Similarly, the difficulty is the spend required to complete the offer.

The data is contained in the file portfolio.json which has the following schema.

  • id (String) - offer id
  • offer_type (String) - type of offer, i.e. BOGO, discount, or informational
  • difficulty (Int) - minimum required spend to complete an offer
  • reward (Int) - reward given for completing an offer
  • duration (Int) - time for offer to be open, in days
  • channels (List of Strings) - channels the offer was sent through (email, mobile, social, web)

Dummy variables were used for the channels and the offers were given a simplified name, resulting in the following portfolio.

Offer ID Reward Difficulty Duration Offer Type Email Mobile Social Web Name
9b98b8c7a33c4b65b9aebfe6a799e6d9 5 5 7 bogo 1.0 1.0 0.0 1.0 bogo,5,5,7
f19421c1d4aa40978ebb69ca19b0e20d 5 5 5 bogo 1.0 1.0 1.0 1.0 bogo,5,5,5
ae264e3637204a6fb9bb56bc8210ddfd 10 10 7 bogo 1.0 1.0 1.0 0.0 bogo,10,10,7
4d5c57ea9a6940dd891ad53e9dbe8da0 10 10 5 bogo 1.0 1.0 1.0 1.0 bogo,10,10,5
fafdcd668e3743c1bb461111dcafc2a4 2 10 10 discount 1.0 1.0 1.0 1.0 discount,2,10,10
2906b810c7d4411798c6938adc9daaa5 2 10 7 discount 1.0 1.0 0.0 1.0 discount,2,10,7
2298d6c36e964ae4a3e7e9706d1fb8c2 3 7 7 discount 1.0 1.0 1.0 1.0 discount,3,7,7
0b1e1539f2cc45b7b9fa7c272da2e1d7 5 20 10 discount 1.0 0.0 0.0 1.0 discount,5,20,10
3f207df678b143eea3cee63160fa8bed 0 0 4 informational 1.0 1.0 0.0 1.0 informational,0,0,4
5a8bc65990b245e5a138643cd4eb9837 0 0 3 informational 1.0 1.0 1.0 0.0 informational,0,0,3
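As a rough sketch (not the exact project code), the preparation above could be done with pandas as follows; the read options and column names are assumptions based on the schema.

import pandas as pd

# load the portfolio (assuming line-delimited JSON)
portfolio = pd.read_json('portfolio.json', orient='records', lines=True)

# expand the channels list into email/mobile/social/web dummy columns
channel_dummies = pd.get_dummies(portfolio['channels'].explode()).groupby(level=0).max()
portfolio = portfolio.join(channel_dummies)

# simplified name: offer_type,reward,difficulty,duration
portfolio['name'] = (portfolio['offer_type'] + ','
                     + portfolio['reward'].astype(str) + ','
                     + portfolio['difficulty'].astype(str) + ','
                     + portfolio['duration'].astype(str))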

Events and Offers

All the events - offers received, viewed, or completed, as well as transactions - are recorded in a single file, transcript.json. The columns available were:

  • event (String) - record description (ie transaction, offer received, offer viewed, etc.)
  • person (String) - customer id
  • time (Int) - time in hours since start of test. The data begins at time t=0
  • value - (Dict of Strings) - either an offer id or transaction amount depending on the record

The rows of the frame look like the following:

Person Event Value Time
78afa995795e4d85b5d9ceeca43f5fef Offer Received {‘offer id’: ‘9b98b8c7a33c4b65b9aebfe6a799e6d9’} 0
389bc3fa690240e798340f5a15918d5c Offer Viewed {‘offer id’: ‘f19421c1d4aa40978ebb69ca19b0e20d’} 0
9fa9ae8f57894cc9a3b8a9bbe0fc1b2 Offer Completed {‘offer_id’: ‘2906b810c7d4411798c6938adc9daaa5’} 0
02c083884c7d45b39cc68e1314fec56c Transaction {‘amount’: 0.8300000000000001} 0

Most of the pre-processing for the model was needed here.

These needed to be assembled into frames for each event and were then aggregated further. Transactional information like mean spend, total spend and number of transactions was gathered and merged with the user data. It was also often necessary to pull this data for specific time periods, for example getting the total spend during the period that a given offer was active.
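For illustration, a minimal sketch of this aggregation, assuming the transcript has been loaded into a DataFrame called transcript (the event labels are assumed to match the raw file, e.g. 'offer received'):

# split the transcript into one frame per event type
transactions = transcript[transcript['event'] == 'transaction'].copy()
received = transcript[transcript['event'] == 'offer received'].copy()
viewed = transcript[transcript['event'] == 'offer viewed'].copy()
completed = transcript[transcript['event'] == 'offer completed'].copy()

# pull the transaction amount out of the value dict
transactions['amount'] = transactions['value'].apply(lambda v: v['amount'])

# per-user spending summary, later merged onto the user profiles
spend_summary = transactions.groupby('person')['amount'].agg(
    total_spend='sum', mean_spend='mean', n_transactions='count')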

Some care was needed to match up offers received by a user to subsequent events like viewings or completions. A user might receive the same offer twice across the month, so it was necessary to ensure that these events occurred during the duration of the offer. The informational offers did not have a completion condition, so the offer was said to be completed if a transaction was made within the duration specified.
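A hedged sketch of that matching logic for a single offer-received record, assuming the offer id has already been pulled out of the value dict into an offer_id column and the duration has been converted from days to hours:

def offer_outcomes(received_row, viewed, completed, duration_hours):
    # window in which subsequent events count towards this offer
    start = received_row['time']
    end = start + duration_hours
    person, offer = received_row['person'], received_row['offer_id']

    def in_window(events):
        mask = ((events['person'] == person) & (events['offer_id'] == offer)
                & (events['time'] >= start) & (events['time'] <= end))
        return bool(mask.any())

    return in_window(viewed), in_window(completed)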

It was important to keep the data on offer viewings because offers are redeemed automatically by the app. In particular, someone might complete an offer without ever knowing they had received it; this is something we would like to avoid, since they would have made the purchase anyway. It is not totally clear from the information provided exactly how this happens. For example, if you complete a bogoff offer, are you given a voucher for a free item next time? If not, it is hard to see how you could unknowingly redeem the offer.

If that is the case, then there does not appear to be any transactional data relating to the redemption of such vouchers; this is information it would be useful to have when estimating the cost of providing an offer. In fact the only data on cost is the reward value of the offer, which is not ideal: it clearly does not cost Starbucks 10 dollars to provide a free item with a menu price of 10 dollars.

User Profiles

The user profiles are contained in the file profile.json, which has the following entries.

  • age (Int) - age of the customer
  • became_member_on (Int) - date when customer created an app account
  • gender (String) - gender of the customer (some entries contain ‘O’ rather than M or F)
  • id (String) - customer id
  • income (Float) - customer’s income

The became_member_on column was replaced by an account age field (in years). The gender labels were replaced with dummy Booleans. The transactional data aggregated from the previous file were also merged in.
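A minimal sketch of those steps, assuming the profiles sit in a DataFrame called profile and taking the latest sign-up date as the reference point for account age (the true snapshot date is an assumption):

import pandas as pd

profile['became_member_on'] = pd.to_datetime(profile['became_member_on'].astype(str), format='%Y%m%d')
reference_date = profile['became_member_on'].max()
profile['account_age'] = (reference_date - profile['became_member_on']).dt.days / 365.25

# gender dummies ('O' entries end up as neither male nor female)
profile['male'] = profile['gender'] == 'M'
profile['female'] = profile['gender'] == 'F'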

Unlike the other data sets, some cleaning was needed here. There was a large peak of users with age 118, which must be a placeholder for a missing entry.

If we set these ages to NaN, we see that they correspond to users for whom the gender and income are also missing. There are 2,175 such people, for whom we have basically no user data. We’ll just drop them from the DataFrame.

Column Missing Count
Gender 2175
Age 2175
Became Member On 0
Income 2175
Account Age 0
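The cleaning step itself is short; a sketch, continuing with the profile DataFrame from above:

import numpy as np

# age 118 is the placeholder for a missing entry
profile.loc[profile['age'] == 118, 'age'] = np.nan
print(profile.isna().sum())   # gender, age and income each show 2175 missing values

# these rows carry essentially no user data, so drop them
profile = profile.dropna(subset=['age', 'gender', 'income'])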

Data Exploration

If we do a bar plot of a user’s spend in the training period, we see that the modal spend is only a few dollars - likely the cost of a single coffee. It is not uncommon, however, for users to spend in the 10-20 dollar range, and higher spends become increasingly uncommon.

A plot of income versus spend reveals some clear clustering.

There’s a very noticeable group of people with very high spend, especially compared to income. Although we are unlikely to be able to use raw spend and transaction data to predict offer uptake (since this data is partially a function of offers completed), these kinds of groupings might be useful both for the model itself as well as development of heuristics.

If someone were to go to Starbucks every day and pay the mean transaction spend each day, they would pay about $400 over the month. Most of the people in the high spend cluster spent more than this. We’ll call this group Big Spenders.

To identify this group more methodically we can use K-means clustering. The data points we use for the clustering are

  • Income
  • Mean Spend
  • Number of Transactions
  • Spend/Income

We see from an elbow plot that the optimum number of clusters is four.
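A hedged sketch of the clustering, with assumed column names for the four features:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ['income', 'mean_spend', 'n_transactions', 'spend_over_income']
X = StandardScaler().fit_transform(users[features])

# elbow plot: inertia against the number of clusters
inertias = [KMeans(n_clusters=k, random_state=42).fit(X).inertia_ for k in range(1, 10)]
plt.plot(range(1, 10), inertias, marker='o')

# the elbow suggests four clusters, so refit with k=4
users['cluster'] = KMeans(n_clusters=4, random_state=42).fit_predict(X)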

Once we do the clustering we see that cluster 3 corresponds to the expected Big Spender cluster.

These clusters all have quite different spending patterns, exactly as we wanted. The big_spender cluster in particular is very different, with a totally different mean-spend distribution and a distinct distribution of transaction counts as well.

Offer Completion

We can see that the impact of rewards on spend is not uniform. Users claiming a reward tend to spend significantly more than those who are not completing an offer, for whom the average spend is about 10 dollars and the modal spend about $2.50.

Reward Mean Spend
0 10.6
2 20.0
3 17.5
5 21.2
10 23.2

We can also ask what proportion of offers are completed.

Name Viewed Completed
bogo,10,10,5 0.956432 0.478789
bogo,10,10,7 0.884469 0.511061
bogo,5,5,5 0.953036 0.600229
bogo,5,5,7 0.519257 0.570684
discount,2,10,10 0.967377 0.694608
discount,2,10,7 0.521993 0.529399
discount,3,7,7 0.957282 0.684844
discount,5,20,10 0.332224 0.425150
informational,0,0,3 0.814573 0.112565
informational,0,0,4 0.477578 0.129372
All 0.736746 0.472950

We can also think of this in terms of total rewards earned versus rewards offered. In both cases we see that about 50% of offers are completed/rewards are earned. Best fit equation: $y = 0.53x$.

The poor fit towards the end is likely due, at least in part, to fewer people being offered such large max rewards.
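For reference, a small sketch of how a through-origin best fit like $y = 0.53x$ can be computed, assuming hypothetical numpy arrays rewards_offered and rewards_earned with one entry per user:

import numpy as np

# least-squares slope with no intercept term
slope, *_ = np.linalg.lstsq(rewards_offered.reshape(-1, 1), rewards_earned, rcond=None)
print(slope)   # roughly 0.53 on this data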

P-testing Offer Effectiveness

We can also ask whether these offers have a meaningful or (statistically) significant impact on user spending. To this end we employ a t-test between offer recipients during the offer period and those who received no offer in the same period. The table of results is as follows. The p_x columns contain $-\log(p)$, where $p$ is the p-value output of the t-test; a score above 3 can be considered significant.

Offer p_age p_income p_total-cost total-cost p_total offered_total null_total p_mean offered_mean null_mean p_count offered_count null_count
discount,2,10,10 0.373081 0.513743 251.415865 30.652512 283.408413 46.875874 14.223363 36.634555 13.352046 8.858298 inf 3.280471 1.043826
discount,3,7,7 0.472075 0.418224 187.476382 18.569828 249.576284 33.467657 11.897829 32.359681 11.840403 8.185221 inf 2.492615 0.867023
bogo,5,5,5 0.627717 0.307679 92.664772 12.317879 177.398431 26.940110 9.622231 52.153454 11.919873 6.976287 inf 1.866667 0.705882
discount,2,10,7 0.888283 0.627830 80.655369 12.179083 107.848849 26.076912 11.897829 21.634059 11.161493 8.185221 377.954748 1.737208 0.867023
discount,5,20,10 0.367586 0.735100 53.657970 11.965843 104.198406 31.189206 14.223363 19.425179 12.090161 8.858298 332.601467 1.965403 1.043826
bogo,10,10,7 0.949229 0.115336 48.955730 9.915389 184.972483 31.813218 11.897829 29.090939 11.862773 8.185221 inf 2.413184 0.867023
bogo,5,5,7 0.725452 0.584020 40.097766 7.913657 101.293686 24.811486 11.897829 14.263930 10.484330 8.185221 372.677502 1.742637 0.867023
informational,0,0,3 0.952612 0.920038 108.007086 6.960021 108.007086 13.918947 6.958925 49.624650 8.354929 5.289228 420.722524 1.116625 0.522302
bogo,10,10,5 0.178025 0.665663 34.850332 6.925740 189.016236 26.547971 9.622231 59.501045 11.383595 6.976287 inf 1.933272 0.705882
informational,0,0,4 0.489737 0.747444 59.748068 5.720949 59.748068 13.865758 8.144809 25.925570 8.166071 6.139471 220.613258 1.012556 0.597345
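A hedged sketch of one such test (the exact variance assumptions used in the project are unknown; Welch’s t-test is shown here):

import numpy as np
from scipy import stats

def neg_log_p(offered_spend, null_spend, offer_cost=0.0):
    # per-user total spend during the offer window for each group; the offer's
    # monetary value is subtracted from the offered group for the total-cost test
    t, p = stats.ttest_ind(offered_spend - offer_cost, null_spend, equal_var=False)
    return -np.log(p)   # larger values mean stronger evidence against the null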

The key figures here are that age and income seem similarly distributed across the null and test groups, and that the $p_{total-cost}$ value is always large. This means we can reject the hypothesis that the total spend of people who received the offer was no higher than the total spend of those who did not, plus the monetary value of the offer itself. That is, the impact of the offer was both statistically significant and meaningful. Although these tests were aggregated into one number for each offer, the information used was only from the periods in which the offer was active.

Noticeably there was an increase in both mean spend and number of transactions among offered groups.

Offer Increase in Spend/Duration Reward Difficulty Duration Offer Type Email Mobile Social Web
discount,2,10,10 3.065251 2 10 10 discount 1.0 1.0 1.0 1.0
discount,3,7,7 2.652833 3 7 7 discount 1.0 1.0 1.0 1.0
bogo,5,5,5 2.463576 5 5 5 bogo 1.0 1.0 1.0 1.0
informational,0,0,3 2.320007 0 0 3 informational 1.0 1.0 1.0 0.0
discount,2,10,7 1.739869 2 10 7 discount 1.0 1.0 0.0 1.0
informational,0,0,4 1.430237 0 0 4 informational 1.0 1.0 0.0 1.0
bogo,10,10,7 1.416484 10 10 7 bogo 1.0 1.0 1.0 0.0
bogo,10,10,5 1.385148 10 10 5 bogo 1.0 1.0 1.0 1.0
discount,5,20,10 1.196584 5 20 10 discount 1.0 0.0 0.0 1.0
bogo,5,5,7 1.130522 5 5 7 bogo 1.0 1.0 0.0 1.0

While all the measures are deemed effective by our p-tests, there do seem to be differences in performance. Roughly speaking it looks like the 2,10,10 discount performs best, but comparing the offers like this is on shaky statistical footing. There may be other factors impacting performance; e.g. people might be more likely to buy coffee on a Monday, so offers whose windows include one or more Mondays might get a boost.

Offer Classification

Random Forest Model - Offer by offer prediction

We will use a random forest classifier to predict whether a user will view or complete a given offer. Initially a model is fitted for each offer individually so only information about the user is relevant.

The data points used for the predictions are:

  • Age (Int)
  • Income (Int)
  • Account age (Int)
  • Big Spender Cluster (Bool)
  • Cluster 2 (Bool)
  • Male (Bool)
  • Female (Bool)
  • Mean Spend (Float)

The Cluster 0 and Cluster 1 data points are not used because they don’t improve the predictive power of the model.

In addition to the base classifier, SMOTE is used to resample the data due to the imbalanced nature of the data set. Further, the parameters are tuned with a grid search carried out independently for each offer. The parameter grid is generated by:

import numpy as np

param_grid = {
    'criterion': ['gini', 'entropy'],
    # float values are interpreted by sklearn as fractions of the training set
    'min_samples_split': np.logspace(-5, -1, 5),
    'min_samples_leaf': np.logspace(-6, -2, 5),
}

Other than these parameters the default sklearn options were used. This was largely due to computational limits. With a different model for each offer running extensive grid searches is very time consuming, especially on basic hardware.

The problem is a multi-output classification one, and therefore we can choose to run the grid search for a single multi-output classifier, or run the search on each output separately. The second option was chosen as it improved performance, and with only two outputs it isn’t too difficult to do so. The grids evaluated parameters by their f1-score across three cross-validation folds.

The best parameters are then used to refit a model on the entire training set.
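Putting this together, a minimal sketch of the per-offer fitting procedure (an illustration under assumptions, not the exact project code; the step names and random seeds are my own choices):

import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'rfc__criterion': ['gini', 'entropy'],
    'rfc__min_samples_split': np.logspace(-5, -1, 5),
    'rfc__min_samples_leaf': np.logspace(-6, -2, 5),
}

def fit_offer_model(X, y):
    # SMOTE sits inside the pipeline so resampling is only applied to the
    # training folds during cross validation
    pipeline = Pipeline([
        ('smote', SMOTE(random_state=42)),
        ('rfc', RandomForestClassifier(random_state=42)),
    ])
    search = GridSearchCV(pipeline, param_grid, scoring='f1', cv=3, n_jobs=-1)
    search.fit(X, y)          # refits the best parameters on all of X, y
    return search.best_estimator_

# one model per target for each offer, e.g.
# viewed_model = fit_offer_model(X_offer, y_viewed)
# completed_model = fit_offer_model(X_offer, y_completed)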

We collect the results for each offer below.

Type Offer Metric Not Viewed Viewed Not Completed Completed
bogo bogo,10,10,5 f1-score 0.470588 0.976853 0.799180 0.843949
precision 0.758621 0.960902 0.899654 0.776557
recall 0.341085 0.993343 0.718894 0.924150
support 129.000000 2103.000000 1085.000000 1147.000000
bogo,10,10,7 f1-score 0.651357 0.957582 0.793616 0.855714
precision 0.829787 0.933168 0.850236 0.817647
recall 0.536082 0.983307 0.744066 0.897498
support 291.000000 1917.000000 969.000000 1239.000000
bogo,5,5,5 f1-score 0.375000 0.978962 0.661710 0.805556
precision 0.642857 0.965422 0.651220 0.813084
recall 0.264706 0.992888 0.672544 0.798165
support 102.000000 2109.000000 794.000000 1417.000000
bogo,5,5,7 f1-score 0.675980 0.690486 0.683483 0.816823
precision 0.674460 0.691976 0.655530 0.837491
recall 0.677507 0.689003 0.713927 0.797151
support 1107.000000 1164.000000 797.000000 1474.000000
discount discount,2,10,10 f1-score 0.503704 0.984566 0.704268 0.877370
precision 0.809524 0.973133 0.689552 0.885204
recall 0.365591 0.996270 0.719626 0.869674
support 93.000000 2145.000000 642.000000 1596.000000
discount,2,10,7 f1-score 0.691812 0.691812 0.683951 0.812317
precision 0.685506 0.698236 0.708440 0.795977
recall 0.698236 0.685506 0.661098 0.829341
support 1077.000000 1097.000000 838.000000 1336.000000
discount,3,7,7 f1-score 0.488550 0.984693 0.597179 0.840965
precision 0.761905 0.974231 0.532123 0.883615
recall 0.359551 0.995381 0.680357 0.802243
support 89.000000 2165.000000 560.000000 1694.000000
discount,5,20,10 f1-score 0.882175 0.758763 0.733211 0.743363
precision 0.885445 0.753070 0.850587 0.656250
recall 0.878930 0.764543 0.644301 0.857143
support 1495.000000 722.000000 1237.000000 980.000000
informational informational,0,0,3 f1-score 0.693997 0.944073 0.933436 0.510397
precision 0.858696 0.912099 0.887152 0.828221
recall 0.582310 0.978369 0.984816 0.368852
support 407.000000 1803.000000 1844.000000 366.000000
informational,0,0,4 f1-score 0.709934 0.691404 0.943843 0.472813
precision 0.715939 0.685289 0.900962 0.854701
recall 0.704028 0.697630 0.991010 0.326797
support 1142.000000 1055.000000 1891.000000 306.000000

Recall on classes with low support is predictably poor despite the resampling. This is a limitation of the data; there’s only so much we can do to improve the performance here. That being said, we can also try building a model that classifies multiple offers at once. In particular we can then start to use information about the offer itself, such as whether it was sent by email. We can do this either for all offers at once or by offer type.

Random Forest Model - Alternative prediction methods

Once again a grid search was run on each output before a model was refit on the entire training set. This time the parameter grid was slightly expanded since there are fewer models to build.

import numpy as np

param_grid = {
    'criterion': ['gini', 'entropy'],
    'min_samples_split': np.logspace(-5, -1, 5),
    'min_samples_leaf': np.logspace(-6, -2, 5),
}

The Completed class is less unbalanced for some of these aggregated sets, so SMOTE sampling was less clearly needed here. The performance differences were very small, however, and it was much easier to pick a single option. With a bit of work the choice could have been included as a parameter in the grid search, which would have been a good way to eke out a bit of extra performance if one of these methods were to be chosen.

To compare these models we consider their combined performance by offer type as well as their overall performance. A breakdown by offer is also possible but it is difficult to read such a large table and meaningfully interpret the results.

Type Metric Method Not Viewed Viewed Not Completed Completed
bogo f1-score All at once 0.521869 0.915683 0.730182 0.830887
Offer by offer 0.645097 0.927438 0.738933 0.829567
Type by type 0.567692 0.879088 0.731456 0.792449
precision All at once 0.667304 0.881793 0.777090 0.800597
Offer by offer 0.697857 0.913321 0.765207 0.811379
Type by type 0.487450 0.921516 0.688195 0.832916
recall All at once 0.428484 0.952283 0.688615 0.863559
Offer by offer 0.599754 0.941999 0.714403 0.848588
Type by type 0.679558 0.840395 0.780521 0.755732
support NaN 1629.000000 7293.000000 3645.000000 5277.000000
discount f1-score All at once 0.728989 0.881592 0.664318 0.818087
Offer by offer 0.786861 0.906455 0.687559 0.824842
Type by type 0.721246 0.856631 0.681159 0.804071
precision All at once 0.743580 0.874098 0.699023 0.796653
Offer by offer 0.800000 0.899968 0.706602 0.812565
Type by type 0.663415 0.896841 0.660832 0.819563
recall All at once 0.714960 0.889215 0.632896 0.840706
Offer by offer 0.774147 0.913036 0.669515 0.837496
Type by type 0.790123 0.819873 0.702777 0.789154
support NaN 2754.000000 6129.000000 3277.000000 5606.000000
informational f1-score All at once 0.606533 0.826797 0.917465 0.000000
Offer by offer 0.706242 0.852370 0.938692 0.493697
Type by type 0.612913 0.775327 0.901182 0.086047
precision All at once 0.713537 0.775598 0.847515 0.000000
Offer by offer 0.744103 0.831117 0.894112 0.839286
Type by type 0.587678 0.795145 0.849490 0.196809
recall All at once 0.527437 0.885234 1.000000 0.000000
Offer by offer 0.672046 0.874738 0.987952 0.349702
Type by type 0.640413 0.756473 0.959572 0.055060
support NaN 1549.000000 2858.000000 3735.000000 672.000000

All offers combined:

Metric Method Not Viewed Viewed Not Completed Completed
f1-score All at once 0.646922 0.886994 0.782560 0.800414
Offer by offer 0.728326 0.906261 0.797295 0.813419
Type by type 0.649153 0.852142 0.776350 0.770578
precision All at once 0.719983 0.859092 0.784515 0.798587
Offer by offer 0.759517 0.893599 0.798119 0.812646
Type by type 0.590551 0.889334 0.737625 0.814440
recall All at once 0.587323 0.916769 0.780614 0.802250
Offer by offer 0.699595 0.919287 0.796472 0.814193
Type by type 0.720668 0.817936 0.819368 0.731199
support NaN 5932.000000 16280.000000 10657.000000 11555.000000

The Offer by Offer method is the clear winner here. It is the best in almost every case and never the worst. All at Once often offers similar performance but performs really poorly in some areas; in particular, predicting completion of informational offers and recall on Not Viewed are very poor. Type by Type does occasionally do better than Offer by Offer but is also often much worse; overall it is the poorer choice.

That being said, the All at Once and Type by Type methods offer a degree of flexibility that the offer-by-offer approach does not: with a separate model per offer there is no scope to estimate performance on untested offer combinations. A Random Forest model may not be the best choice for this kind of prediction, however.

Alternative Model Testing

Just as a sanity check, we can see how other types of model perform, to confirm that the choice of a random forest was sensible. For speed and ease of comparison we use the models to predict against all the offers at once. We also don’t use a grid search, and we add a standard scaler to the pipeline. The models tested are as follows, with a rough sketch of the comparison loop after the list. Only a quick reference was sought, so all models were run with their default setups; as usual this was also done to keep computational time down.

Model Initials
Logistic Regression LR
Decision Tree Classifier DTC
Linear Discriminant Analysis LDA
Random Forest Classifier RFC
K-Neighbours Classifier KNC
Gaussian Naive Bayes GNB
SGDClassifier SGDC
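A rough sketch of the comparison loop, assuming feature and target frames (X_train, X_test, y_train, y_test) have already been built; the 'viewed' target is shown and the same loop is repeated for 'completed':

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score

models = {
    'LR': LogisticRegression(),
    'DTC': DecisionTreeClassifier(),
    'LDA': LinearDiscriminantAnalysis(),
    'RFC': RandomForestClassifier(),
    'KNC': KNeighborsClassifier(),
    'GNB': GaussianNB(),
    'SGDC': SGDClassifier(),
}

for initials, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)   # default settings throughout
    pipe.fit(X_train, y_train['viewed'])
    preds = pipe.predict(X_test)
    # per-class f1: [not viewed, viewed]
    print(initials, f1_score(y_test['viewed'], preds, average=None))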
Metric Model Not Viewed Viewed Not Completed Completed
f1-score RFC 0.642572 0.878630 0.763706 0.771635
KNC 0.619437 0.872887 0.766370 0.773149
LR 0.615140 0.868381 0.767676 0.770751
SGDC 0.583110 0.853203 0.766043 0.760177
LDA 0.638180 0.850885 0.740597 0.752376
DTC 0.585024 0.844201 0.720325 0.708758
GNB 0.591651 0.841299 0.710868 0.747883
precision LDA 0.588227 0.881745 0.727783 0.765238
RFC 0.678927 0.862939 0.745974 0.789779
GNB 0.565759 0.856534 0.732862 0.728811
LR 0.646131 0.854366 0.742439 0.797501
KNC 0.663649 0.853886 0.746883 0.793245
DTC 0.572651 0.851105 0.679883 0.755560
SGDC 0.598368 0.845611 0.728064 0.803121
recall RFC 0.609912 0.894902 0.782303 0.754305
KNC 0.580748 0.892752 0.786901 0.754046
LR 0.586986 0.882862 0.794689 0.745738
SGDC 0.568611 0.860934 0.808201 0.721592
DTC 0.597943 0.837408 0.765882 0.667417
GNB 0.620027 0.826597 0.690157 0.767979
LDA 0.697404 0.822113 0.753871 0.739939

As we can see, the Random Forest Classifier is the best performing, but both the K-Neighbours Classifier and the Logistic Regression classifier offer similar performance, especially the former. Logistic Regression in particular might be used to build a model more suited to predicting the performance of offers which have not been tested here.

Offer Decision Making

There are many ways we could make the decision on what offers to suggest. We could for example send any offer we think will be accepted. Here we will use a rough estimate of the gain from sending an offer, calculated as $P(\text{completed}) \times (\text{difficulty} - P(\text{not viewed}) \times \text{reward})$.

Ideally we would replace $P(\text{not viewed})$ with $P(\text{not viewed} \mid \text{completed})$ but since estimation of whether an offer would be viewed was already fairly poor, I worried such an estimate would be very inaccurate.

Similarly, reward is used as an estimate of the cost of providing the offer, but it is not a perfect proxy. It likely over-estimates the gain of discounts versus buy one get one free offers, as the marginal cost of providing a second coffee is likely much lower than the menu price.

We use this to pick the top three offers, with no more than two of each type, for each user.
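A hedged sketch of that selection rule, with the probabilities assumed to come from the classifiers’ predict_proba outputs:

def expected_gain(p_completed, p_not_viewed, difficulty, reward):
    # rough estimate of the gain from sending an offer to a given user
    return p_completed * (difficulty - p_not_viewed * reward)

def pick_offers(scored_offers, max_offers=3, max_per_type=2):
    # scored_offers: list of (offer_name, offer_type, gain) tuples for one user
    chosen, type_counts = [], {}
    for name, offer_type, gain in sorted(scored_offers, key=lambda t: -t[2]):
        if type_counts.get(offer_type, 0) < max_per_type:
            chosen.append(name)
            type_counts[offer_type] = type_counts.get(offer_type, 0) + 1
        if len(chosen) == max_offers:
            break
    return chosen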

The first few entries of the resulting DataFrame look like this.

Person 1st 2nd 3rd
ae6f43089b674728a50b8727252d3305 discount,2,10,10 bogo,10,10,5 bogo,10,10,7
ad1f0a409ae642bc9a43f31f56c130fc discount,2,10,10 discount,5,20,10 bogo,10,10,5
dce784e26f294101999d000fad9089bb discount,2,10,10 bogo,10,10,5 discount,3,7,7
4d0ebb94a5a94fe6afd9350c7b1477e4 discount,2,10,10 discount,5,20,10 bogo,10,10,5
7d7f271652d244f78b97c73cd734c553 discount,5,20,10 discount,2,10,10 bogo,10,10,7

The discount,2,10,10 offer appears a lot in these rows. In fact we can count the number of times each offer is recommended and see that this pattern holds in the rest of the DataFrame.

Offer Count
bogo,10,10,7 6367
bogo,10,10,5 7518
bogo,5,5,7 0
bogo,5,5,5 1608
discount,5,20,10 8221
discount,2,10,10 14607
discount,2,10,7 499
discount,3,7,7 5655
informational,0,0,4 0
informational,0,0,3 0

Conclusion

By far the most suggested offer is the 2,10,10 discount. This is also the best performing offer according to the tentative analysis in the p-testing section, which is reassuring. The decision making algorithm is a little crude. It could be improved with access to more data, like the cost of providing offer rewards. Given a larger data set we might also be able to estimate the conditional probabilities needed to give a more accurate estimate of offer gain.

Ultimately it isn’t really possible to give a good measure of how effective this strategy is. This would require real world testing. An experiment could be run where different groups are given offers according to different strategies e.g. the method above, random allocation, everyone gets the same, etc. Performance could then be compared across groups and the best strategy could be found.

As well as running a one-off experiment, this could be done as a continuous process. Since the rewards app collects all the relevant data, models could be constantly updated and different A/B tests run each month with tweaked offer strategies in order to keep improving performance.

We can of course evaluate the model performance according to the f1-score we highlighted as a metric at the beginning. It’s difficult to say whether a model is good or bad in a vacuum, based only on its scores, but the overall performance was respectable. The grid search implementation, the SMOTE sampling, and the decision to build a separate model for each offer all improved the scores over the base random forest implementation. In particular they increased the Not Viewed f1-score (which was the worst) from about 0.64 to 0.73, driven by better scores in both recall and precision.

There were also noticeable, if less dramatic, improvements in the other f1-scores.

It is also clear both from the poor performance of the model itself and the predictions made that informational offers have not been handled well. They are very different in nature to the other offers and there is room for more exploration of when and how to use these kinds of offers.

As mentioned at the beginning, what is really missing here is a good estimate of the impact that receiving an offer has on a user’s spend. It is clear from the p-testing section that there is a benefit to sending out an offer beyond just that realised by an additional purchase to redeem the offer. For example, the 2,10,10 discount led to an increase in the mean number of transactions from 1 to 3, and the total spend rose by over $30 as a result.

There is also scope for the development of additional heuristics. Partly these depend on company goals. For example we might want to avoid sending offers to big spenders, on the basis that we think they are unlikely to be swayed by them. Conversely, we might want to send offers to regular users, even if they would make the purchases anyway, in order to increase brand loyalty and retain their custom.

Some key areas for future work are as follows:

  • Better identify underlying relationships in the data
    • Who are social media offers most effective on?
    • Do any users react negatively to receiving offers?
  • Investigate the use of ensemble models to improve performance
    • Train alternative models like Logistic Regression and K-Neighbours and combine the output
    • Train models in different ways like Type by Type or All at Once and combine