In an earlier post we took a look at how we can measure and track customer behaviour with the descriptive branch of customer analytics. Now let’s turn our attention to predictive analytics, where we use that data to generate insights and infer possible future outcomes, e.g. how customers are likely to behave in future, based on what they did in the past.
Source data:
The data which could possibly be collected from a consumer covers a wide array of things. First, the ads or other marketing/promotional pieces they viewed (including where and when). Second, the promotional offers and other communications they received (including the media by which that communication was made, e.g. online, by email, by mail). Third, any actions the customer took – including the all-important purchase(s) made (and by which channel: web, mobile, phone, mail, store, event). Actions taken won’t just mean purchases (although we hope that is the end result) but may include returned purchases, contacts with customer services such as questions, comments or complaints, or activity online and offline (both positive and negative) such as reviewing, tweeting, sharing, recommending and so on.
Since so much of that action and interaction is now happening online, the amount of data we’re able to collect is growing seemingly exponentially. But even the more traditional clicks-and-mortar transactional data contains a wealth of information from which we can glean useful insights.
In addition, we live in a world moving so much faster than at any other time in human history. The competitive, technological and economic pressures facing all businesses are huge. We are less and less able to exist in an insular bubble, looking only inward at what we – and our immediate customers – are doing. We need to be willing to explore data from a wider variety of sources – and that may include knowledge and data about our competitors, global and local economic data/indicators, and so on.
Questions:
To use historical data to make statements about things which haven’t happened yet, we can use a variety of mathematical and statistical techniques or models to answer specific predictive questions, such as:
- Which product is this customer going to buy next?
- How many products are they going to buy/how much money are they going to spend?
- Which one of our customers is going to buy next?
- How many customers are likely to stay with us from a particular group or cohort recruited last period?
- Is this particular customer going to stay or leave (drop their subscription)?
Techniques:
The best technique to employ in each case depends on the predictive question we want answered. So, for example, if we want to make a prediction about some action occurring in a fixed period of time in future (‘How many products will this customer buy in the next month?’; ‘Will this customer drop their subscription in the next year?’), particularly when that period of time is in the near future (this year, or next year at the latest), we can use Regression Analysis.
Where our question moves beyond a single fixed period or starts to look further into the future (the period-after-next, three years from now), or when we want to ask questions about when something will happen (when will a customer drop their subscription?) or how long it will happen for (how long will this customer continue buying this product), or where we want to look at the whole lifetime of a particular customer (lifetime value, LTV), then we can use Data Set Predictions and their corresponding Probability Models.
Regression analysis:
We use regression analysis when we want to use historical data to predict what will happen in a future period – usually data from period 1 (or year 1) is used to determine the likely outcome in period 2 (or year 2). With it we can establish either:
- How many times an event will occur in a fixed period of time; or
- Whether or not an event will occur in a fixed period of time.
Simple linear regression:
Regression quantifies the relationship between two variables, such as product price and demand for that product (which will affect sales achieved). To do this, we can build a mathematical model which shows how the dependent variable (e.g. demand) is a function of (or how it is related to and caused by) the independent variable (e.g. price). The predictive model will allow us to establish what the dependent variable (demand) is likely to be at various values of the independent variable (price).
Mathematically this problem is solved as a simple linear additive model. The simplest way to solve it is, for a given data set (the predictive data set), to plot the available data points – for example (selling price, sales volume) – on a simple linear graph. The resultant graph shows the relationship between the dependent and independent variables for the known data points. A best-fit straight line may be added by eye to the plotted data points, giving an approximation for the relationship, and from this an estimated prediction can be made for data points outside of the original data set.
Alternatively, the model/equation can be solved mathematically to give an exact fit using the least squares method.
y = a.x + b
where:
- y is the dependent variable (e.g. sales volume)
- x is the independent variable (e.g. selling price)
- a is the coefficient of x – that is, the gradient or slope of the line being plotted (e.g. the price sensitivity in the specific market or for the specific product being analysed) (note it is possible for the coefficient to be a negative number, meaning the line slopes downwards from left to right, or – in real-world terms – sales demand is dropping as price is increasing)
- b is the y-intercept – that is, some constant through which the line would intercept the y axis if the independent variable were zero (e.g. the baseline market for the product if it were to be given away for free)
Solving the equation gives the regression line, together with a measure of fit, R^2 (R-squared). An important aspect of regression analysis is the R-squared factor, which tells you how good a fit the regression line is to the data – in effect a confidence level. A small R-squared (closer to 0) indicates the data points are highly scattered, meaning the straight line does not capture a significant number of the data points and so is less likely to give highly accurate predictive results. A large R-squared (closer to 1) shows that the data points were not very scattered and all or almost all fit the straight line much more closely, so there is high confidence the regression line will give accurate predictive results. In practice, the R-squared in a regression analysis should be at least 70% or 80% in order for us to have confidence in the predictive accuracy of the model.
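To make this concrete, here is a minimal sketch in Python (using numpy), fitting a straight line to a small, made-up set of (selling price, sales volume) observations and calculating R-squared by hand:

```python
import numpy as np

# Illustrative (selling price, sales volume) observations -- not real data
price = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])   # independent variable x
volume = np.array([980, 940, 870, 820, 760, 700])         # dependent variable y

# Least-squares fit of y = a*x + b
a, b = np.polyfit(price, volume, deg=1)

# R-squared: proportion of the variance in y explained by the fitted line
predicted = a * price + b
ss_res = np.sum((volume - predicted) ** 2)
ss_tot = np.sum((volume - volume.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope a = {a:.1f}, intercept b = {b:.1f}, R^2 = {r_squared:.3f}")

# Predict demand at a price outside the original observations
print(f"predicted volume at price 22: {a * 22 + b:.0f}")
```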
Regression is a simple but powerful tool used for a whole host of predictions – provided the source data is available. The example shown above looked at demand/sales as a function of price, but it could be extended to look at sales (dependent variable) as a function of any number of different relevant independent variables (e.g. number of mailings sent, number of advertisements seen, number of emails).
It should be clear from this why it is so vitally important that a company tracks all kinds of objective descriptive metrics, and in an easily-accessible format. This is even more true given the ease with which many such descriptive metrics can be collected these days. The key word here is objective: split testing should be carried out regularly to isolate and test each specific independent variable to assess its impact on the dependent variable of interest.
Optimisation:
The value of the prediction is not to read the future, but rather to optimise it; to improve our future results given what we have learnt by looking at what happened in the past. So for example, once we know how customers are predicted to behave in response to different product pricing levels (or any other relevant independent variable), we can calculate what-if scenarios to model the financial impact that would arise if we changed that independent variable.
Let’s say we propose changing price from P0 to P1 and our regression line (which has a high R-squared number so we know it is predictively accurate) indicates this will likely move sales in a given period from S0 to S1. We can also review what will happen to profits in that period if we change prices and get the predicted change in sales in response (assuming that all other variables, costs and so forth remain level). There will be a specific price Pn which will optimise our sales revenue to Sn, and similarly optimise our profit.
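Continuing the sketch above, a simple what-if loop can search candidate prices for the one that maximises predicted profit. The coefficients below are the approximate values from the fitted line in the earlier example, and the constant unit cost is assumed for illustration only:

```python
import numpy as np

# Approximate coefficients from the fitted line above (assumed for illustration)
a, b = -28.4, 1271.0          # predicted volume = a * price + b
unit_cost = 6.0               # assumed constant unit cost per item

candidate_prices = np.linspace(8.0, 32.0, 241)
predicted_volume = a * candidate_prices + b
predicted_profit = (candidate_prices - unit_cost) * predicted_volume

best = np.argmax(predicted_profit)
print(f"profit-maximising price Pn ~= {candidate_prices[best]:.2f}, "
      f"predicted volume Sn ~= {predicted_volume[best]:.0f}, "
      f"predicted profit ~= {predicted_profit[best]:.0f}")
```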
Multiple regression:
A more complex form of regression analysis looks at relationships between a dependent variable and more than one independent variable to determine how much each independent variable affects the overall total dependent variable.
A multiple regression equation takes the form:
y = a1.x1 + a2.x2 + … + an.xn + b
with any number of independent variables (x1, x2, … xn), and each independent variable having its own coefficient (a1, a2, … an).
Here we can build a predictive model to quantify the relationship between our variable of interest and any number of affecting factors. For example we can review sales as a function of price, discount given, number of times promoted (e.g. by mail, email or online) and age of customer (i.e. how many years they’ve traded with us).
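As a rough illustration, the same least-squares approach extends to several independent variables. The customer data below is invented and the choice of variables is only an example:

```python
import numpy as np

# Illustrative per-customer data (one row per customer) -- not real data
# Columns: price paid, discount given, promotions received, years as a customer
X = np.array([
    [20.0, 0.0, 2, 1],
    [18.0, 2.0, 4, 3],
    [22.0, 0.0, 1, 2],
    [19.0, 1.0, 5, 5],
    [21.0, 0.5, 3, 4],
    [17.0, 3.0, 6, 6],
    [23.0, 0.0, 2, 1],
    [18.5, 1.5, 4, 7],
])
y = np.array([210, 340, 180, 410, 300, 460, 170, 430])   # sales per customer

# Append a column of ones so the model y = a1*x1 + ... + an*xn + b includes the intercept b
X_with_intercept = np.column_stack([X, np.ones(len(X))])
coeffs, *_ = np.linalg.lstsq(X_with_intercept, y, rcond=None)

a = coeffs[:-1]   # one coefficient per independent variable
b = coeffs[-1]    # intercept
print("coefficients a1..a4:", np.round(a, 2), " intercept b:", round(b, 2))
```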
Key performance indicators (KPIs):
Businesses have naturally always had a desire to know how customers will behave in future, given how they behaved in the past. Businesses had previously been very constrained by the kinds of customer data and amount of data they had available to them – particularly data of sufficient granularity and objectivity. The response was the development of KPIs: a small number of key measures which could quickly and easily show a company which actions or behaviours or events in the past were likely to be most predictive of positive results (sales, profits, customer spend, etc.) in future.
Many stock-in-trade KPIs we know and use today were developed in the 1960s and 1970s using regression analysis. For a particular data set in the first period/year, it was possible to strip this down into key activities and, using regression analysis, establish which of these variables were the most accurate predictors of the outcomes attained in the subsequent period/year. By applying this process to many large data sets over wide time spans and across a wide range of different variables (products, promotions, customer types, prices, geographical regions, etc.) it was possible to home in on those few most significant elements, activities or variables which had the greatest impact on future outcomes. The resulting predictive variables were shown over time to be accurate and stable predictive indicators, so came into common usage.
Of course, there is no reason why a business with access to a large historical customer data set cannot apply the same process of exhaustive, exploratory regression analysis to its own data – particularly given that the cost of storing all that historical data is now so low, and computing power so cheap. By doing so, it can find its own unique key indicators and measures, and apply those metrics for predictive purposes to optimise its marketing efforts.
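A hedged sketch of that exploratory process: regress a period-2 outcome on each candidate period-1 variable in turn, and rank the candidates by how much variance each explains on its own. All variable names and numbers here are hypothetical, and the period-2 spend is simulated purely to make the example self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)
n_customers = 500

# Hypothetical period-1 activity variables per customer
period1 = {
    "months_since_last_purchase": rng.integers(0, 12, n_customers),
    "orders_last_year": rng.poisson(3, n_customers),
    "emails_opened": rng.poisson(8, n_customers),
    "complaints": rng.poisson(0.2, n_customers),
}
# Simulated period-2 spend, driven mainly by recency and frequency (for illustration)
period2_spend = (200 - 10 * period1["months_since_last_purchase"]
                 + 30 * period1["orders_last_year"]
                 + rng.normal(0, 40, n_customers))

# Rank each candidate variable by the share of period-2 variance it explains on its own
for name, x in period1.items():
    a, b = np.polyfit(x, period2_spend, 1)
    predicted = a * x + b
    r2 = 1 - np.sum((period2_spend - predicted) ** 2) / np.sum((period2_spend - period2_spend.mean()) ** 2)
    print(f"{name:28s} R^2 = {r2:.2f}")
```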
Recency-frequency-monetary value (RFM):
One such KPI in use in the direct response marketing industry developed using this method, is the RFM Model. RFM measures and scores each individual customer based on those key elements which were isolated and shown to be most predictive of future activity:
- recency: how long ago was the most recent purchase made (scored out of 10, with 10 highest; usually calculated as ’10 minus no. of months since last purchase’)
- frequency: how many purchases were made in the last year (scored out of 10, with 10 highest)
- monetary value: either (i) value of highest order, or (ii) moving average of all orders (scored out of 10, with 10 highest; benchmark set relevant to industry and company, often 10k)
Note that, within the three categories, recency is weighted most highly, followed by frequency, with monetary value being weighted the least.
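As a rough sketch, the scoring rules described above might be implemented like this; the clipping of each score to the 0–10 range and the monetary benchmark value are assumptions for illustration:

```python
def rfm_score(months_since_last_purchase, purchases_last_year, highest_order_value,
              monetary_benchmark=10_000):
    """Score a customer on the R, F, M elements described above (simplified sketch).

    Each element is clipped to the range 0-10, with 10 the best score.
    The monetary benchmark is assumed to be set per industry/company.
    """
    recency = max(0, min(10, 10 - months_since_last_purchase))
    frequency = max(0, min(10, purchases_last_year))
    monetary = max(0, min(10, round(10 * highest_order_value / monetary_benchmark)))
    return recency, frequency, monetary

# Example: last purchase 2 months ago, 6 purchases in the last year, best order 4,800
print(rfm_score(2, 6, 4_800))   # -> (8, 6, 5)
```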
The RFM model can be modified in a variety of ways relevant to the specific industry, or per the proprietary metrics of the individual business. One significant modification is to score each of the three elements, not only actual monetary sales, but also to include measurement of any economically-beneficial action (such as information requests, sales calls, website product views). This allows the company to score customers from hottest through coldest across a wider range of ‘customer engagement’ activities acting as an indicator of future sales value.
Multi-year regression analysis:
Regression techniques work well when using an actual historical data set (call this period 1) to predict what will happen in the subsequent period (period 2). They will not allow accurate prediction from period 1 directly out to period 3 and beyond, unless you have access to the independent variable data for the period preceding the period you wish to predict. So long as you do have access to the preceding period’s independent variable data, you can apply the same regression model and solve to obtain predictions about your dependent variable of interest in the subsequent period. But without these independent variables you cannot solve the equation for future periods.
Long-run predictive analysis:
There are several predictive models that can be used for long-run predictions, including Data Set Predictions and Probability Models. These techniques are useful where we want to ask when something will happen, or how long something will continue for, or other long-term questions such as calculating the lifetime value of a particular customer. While not as widespread and popular as regression analysis, they are very powerful predictive techniques that a business should be ready to employ.
Data set predictions and cohort analysis:
A cohort is a group of customers acquired either at the same time, say in a given month or year, or acquired as a result of a particular activity such as a specific marketing campaign, or acquired through a particular channel such as web or TV advertising. Analysing customer data by cohorts is a useful way of reviewing historical data to show patterns and trends which can then be used for predicting future behaviours.
One common example is to review a cohort of customers based on the year in which they first started trading with the business, to establish patterns in how many customers cease trading over time, or which customers are intermittent traders versus those who are regular traders, or to view patterns of increasing/decreasing spend over time, and so on. Other examples might be to view all customers who first joined a loyalty scheme or membership subscription program in a particular year, and review trends of when different customers dropped out of the program in the past in order to predict when a given customer will drop out of the program in future. The period under review can be shortened too: to review, for a given cohort of customers, which ones were active and engaged on the website on a day-to-day basis and which were not.
The table used in the analysis is constructed line by line, with each line representing one customer and each column representing a given time period (year, month, week, day, etc.). The data in the table shows whether or not – for each given customer (row) in each given time period (column) – the activity of interest (purchase, remaining in loyalty program, being engaged on the website) took place. If the customer undertook the activity in that period the entry will show a 1, and if not the entry will show a zero [Ref. 1].
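For example, a customer-by-year version of that table could be built from a raw transaction log along these lines (a sketch using pandas; the column names are hypothetical):

```python
import pandas as pd

# Illustrative transaction log: one row per (customer, purchase date) -- not real data
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "date": pd.to_datetime(["2019-03-01", "2020-07-15", "2022-01-10",
                            "2019-05-20", "2021-11-02", "2020-02-28"]),
})

# Customer x year table: 1 if the customer purchased at least once in that year, else 0
transactions["year"] = transactions["date"].dt.year
incidence = (transactions
             .assign(active=1)
             .pivot_table(index="customer_id", columns="year",
                          values="active", aggfunc="max", fill_value=0))
print(incidence)
```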
By reviewing the table of zeros and ones in historical actual time periods, predictions can be made about future time periods for groups of similar customers. For example, for a given customer who has remained in the loyalty program every year for the last ten years, although we can’t know or predict exactly what that one customer will do in the next five years, we can model and predict what the whole group of customers sharing the same characteristics as that one customer will – on average – do in the next five years. (Customers within such a group are treated as broadly alike, or homogeneous; it is the differences in behaviour between the groups across the customer base that statisticians refer to as heterogeneity.)
Likewise for that group of customers which joined the loyalty program in the first year then dropped out and did not re-enter, we can make a prediction about what that whole group of customers will do, on average, in the next five years. We may think that, because they dropped out of the scheme (or, perhaps in the case of customers’ purchasing habits, they didn’t trade with us for so many years), they no longer have a future value to the business. In each individual case this is likely to be true. But on average, and over the long run, even this group of non-active customers can have a value to the business. This is especially true given that they may make up a very large part of the customer database, often in the range of 30-50% of the whole.
We can also analyse and compare different sub-groups of customers within the overall cohort, and make comparative predictions about how each sub-group will, on average, behave in future years. We can establish which of two sub-groups will be more valuable to the business over the next five years, and by how much, by taking into account not just how many times in the last ten years they acted positively, but also how recently their most recent positive action occurred, with recency being weighted more highly than frequency.
The power of cohort analysis lies in its ability to pare back the noise of individual customers and their actions and bring forward the underlying patterns and trends and averages which are a truer expression of the innate propensity for different heterogeneous groups of customers to behave in different kinds of ways (drop out early, drop out recently, skip periods randomly, show up consistently period-on-period). By applying these average propensities across the whole data set, it’s possible to build a statistically valid predictive model of behaviours, so we can get to the required ‘when’ or ‘how much’ answers that we couldn’t get to with simple regression analysis.
Probability models:
As we saw, the power of the data set analysis is its ability to pull out the hidden underlying patterns and behaviours. It does this by creating averages for specific sets or groups with similar characteristics (that can therefore reasonably be expected to behave similarly). These averages can then be expected to apply to similar groups not just now or next year but over the long term as well. This is what allows us to extend our prediction timelines farther into the future than we could with regression analysis.
Because we structured our historical data into a simple binary table of results – either the customer dropped out of the loyalty program in the period (0) or they didn’t (1) – we can easily construct a simple probability model from that data with which to predict future outcomes, based on an assumption of each data point having a random chance of the event happening (1) or not (0). By using randomness we can strip out the noise caused by all the underlying complexities, all the myriad reasons, why the individuals themselves acted the way they did in each particular instance.
Now the likelihood that any given sub-group will return a zero or a one in a particular period (the average referred to above when talking about data set prediction) can be modelled as a probability, with that group’s probability indicating the propensity for individuals in the group to act a certain way when viewed as a whole and over time. (Probability may be expressed as either a percentage or a value between 0 and 1. A probability closer to 0 indicates a lower likelihood of the event occurring, closer to 1 indicates a higher likelihood. This shouldn’t be confused with the zeros and ones in the data table we constructed earlier!)
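A minimal sketch of that idea: treat each cell of a sub-group’s binary table as a random draw, estimate the group’s per-period propensity as the observed proportion of ones, and project it forward (illustrative numbers only):

```python
import numpy as np

# Binary activity table for one sub-group: rows = customers, columns = ten past periods
# Illustrative data only: 1 = acted in that period, 0 = did not
group = np.array([
    [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
    [1, 0, 1, 1, 1, 0, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 1, 1, 0, 1, 1],
])

# Treat each cell as a random draw: estimate the group's propensity to act in any one period
p = group.mean()
print(f"estimated per-period propensity p = {p:.2f}")

# Expected number of active periods per customer over the next five periods,
# assuming the propensity stays constant (no decay yet -- see entropy below)
print(f"expected active periods in next 5: {5 * p:.1f}")
```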
Entropy:
The probability model shows how groups of like individuals are expected to act over the long run. This looks like a great model, but it doesn’t account for any naturally-occurring changes in that group over time. We should remember that any system is likely to decay gradually over time, unless some other factor is introduced to prevent that decay. We can build entropy or natural decay into our probability model, and it would be prudent to do so, particularly if we wish to model future customer behaviour so as not to overstate our predictions.
There are many reasons why the number of customers can decline over time; they may have moved out of the area, switched to a competitor, or (in the worst case) they may have died. Or, they may simply no longer have need of the product we’re trying to sell them [*]. In every instance though, we can assume that they have gone and will not return.
We can account for this natural wastage, or falling away in the population, by introducing some population decay factor into our basic probability model. This can be refined further to account for the fact that – just as different groups have different propensities to take some particular action (drop out of a loyalty program, stop purchasing a product) – so different groups will have different propensities to decay or naturally decline.
A probability model containing these two aspects – (1) the propensity to take or not take some economically positive action (‘buying’); and (2) the propensity to drop out of the total population for good and never come back (‘dying’) – is sometimes referred to as the Buy ‘Til You Die (BTYD) model [Ref. 2].
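A heavily simplified sketch of that two-process idea (not the full models of [Ref. 2], which also allow the propensities to vary across customers):

```python
# Minimal discrete sketch of the 'Buy 'Til You Die' idea: in each future period an
# active customer buys with probability p, and at the end of each period drops out
# for good ('dies') with probability theta.

def expected_purchases(p, theta, n_periods):
    """Expected number of purchases over the next n_periods for one customer."""
    total = 0.0
    alive_probability = 1.0
    for _ in range(n_periods):
        total += alive_probability * p          # buys this period only if still 'alive'
        alive_probability *= (1 - theta)        # may drop out before the next period
    return total

print(expected_purchases(p=0.6, theta=0.1, n_periods=5))    # ~2.46
```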
Testing a probability model:
After constructing a probability model we need to test its accuracy before we can use it for predictive purposes. To do so, we can calculate what the model expected the data to have been (for periods where we have actual data available) and compare these expectations against the actual data. We can do this test by looking at averages for either recency or frequency of action, or for both recency and frequency together. Graphing actuals against expectations will show us how well the predictive model fits the actual data. If the fit is good, this gives us confidence in the accuracy of the model’s predictions, and we can confidently go ahead and use it to predict outcomes in future periods.
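In its simplest form that comparison might look like this (the counts below are invented for illustration):

```python
# Compare what the model expected against what actually happened in past periods.
# A close fit gives confidence in the model's predictions.
actual_active   = [300, 262, 231, 208, 188]   # customers active in each of 5 past periods
expected_active = [300, 270, 243, 219, 197]   # what the fitted probability model expected

for period, (act, exp) in enumerate(zip(actual_active, expected_active), start=1):
    diff_pct = 100 * (act - exp) / exp
    print(f"period {period}: actual {act}, expected {exp}, difference {diff_pct:+.1f}%")
```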
Applying a probability model:
Once we calculate the future probability for each grouping within our data set we might be surprised by some counter-intuitive results. That group of loyal, active and heavily-engaged customers which kept coming back year after year for the last ten years, would be expected to just keep on going year after year into the future. But as a whole that group will have a lower probability of continued action/engagement/purchasing than we would expect.
This is a danger. We would expect them to keep on buying and not to drop off or leave, so we might prioritise them, choosing to invest more heavily in promoting to them, cosseting them and cooing over them. Perhaps we’ll offer them some gold-card, premium level customer service and special loyalty discounts. But a higher-than-expected number of them still drop off anyway.
Conversely a group which appeared to have left years ago and not acted or traded with us since, could be expected to have a zero or near-zero probability of acting positively or trading with us again in the future. That seems to make sense. Yet their future probability of acting will be higher than we would reasonably expect. So another danger. We would choose not to contact that group again, and we would deprioritise them, refuse to invest in them, give up on them, stop wooing them, write them off as losers… and so it goes. Yet some of them will come back at a future point (unexpectedly so), confounding all our expectations of them.
So a key benefit of using such a rigorous probability model is to ensure we neither over-expect nor under-expect future outcomes from past customer groups, and as a result, we intelligently invest our marketing efforts and budgets accordingly.
Lifetime value (LTV):
The probability model we construct can capture all of that underlying complexity we mentioned, simplify and summarise it, and allow us to make inferences about what might happen at some future point in time. That’s powerful, not just to try to decide (by averaging out for all members of a particular customer group) what they’re likely to do in period 5 or 7 or 10, but also what they’re likely to do for the full extent of their existence as a customer. So the model will allow us to measure (in probability terms, or in average terms) what certain customer groups are likely to do for their full lifetime. From this we can calculate an average value (in present terms) of what we expect the monetary value of a given customer in a particular customer group to be.
This is what is known as Lifetime Value (LTV), and it’s such an important and valuable metric to digital/DR marketing businesses because, if we are going to go out into the marketplace and acquire new customers, there is inevitably going to be a cost to us of doing so. We clearly can’t acquire a customer at unlimited cost to the business, since this would be unsustainable. We need to establish a cap – how much we can afford to spend on acquiring that new customer – and make sure we don’t exceed it. To set the cap we need to know how much profit we are likely to make back from that customer over time (the lifetime value of that customer): keeping the cost to acquire below this amount ensures the new customer will be profitable. Probability models are very effective in helping to build up that lifetime value picture.
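A minimal sketch of such an LTV calculation, assuming a constant per-period margin and retention probability and discounting future periods back to today:

```python
def lifetime_value(margin_per_period, retention_probability, discount_rate, n_periods=30):
    """Expected present value of a customer: margin earned each period the customer
    is still active, discounted back to today. Simplified sketch -- retention and
    margin are assumed constant per period."""
    ltv = 0.0
    for t in range(n_periods):
        survival = retention_probability ** t          # probability still a customer at period t
        ltv += margin_per_period * survival / ((1 + discount_rate) ** t)
    return ltv

ltv = lifetime_value(margin_per_period=50.0, retention_probability=0.8, discount_rate=0.10)
print(f"expected lifetime value ~= {ltv:.0f}")   # cost to acquire should stay below this
```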
Modelling aids decision-making:
One closing point about applying predictive models is that all analysis and modelling exists to support effective management decision-making, in order to improve the future outcomes of the business. We can use regression lines or probabilities to steer us towards better investment decisions (who to promote to, how often, by which method, and so on).
But we will also need to pinpoint the critical metrics necessary to measure, monitor and evaluate in order to establish whether the management decisions we’ve taken and the changes of course we’ve undertaken (changed priorities, changed processes, etc.) are actually making a difference and moving the business in the right direction or not. So the models themselves should not be seen in isolation from the wider questions of metrics, decision-making and effective monitoring and control.
Here we begin to move in the direction of optimisation (of decisions, investments, marketing efforts), and the much more strategic focus that comes into play with the third branch of analytics: Prescriptive Analytics.
Footnotes:
* Many purchases are demographically linked, with a certain demographic group being a heavy buyer for that particular product or service, for example by age (young newly-weds, retirees) or by lifestyle (digital native vs. technophobic, local vs. globalist). If population levels and societal/lifestyle patterns always remained constant, there would be no reason to expect changes in demand for products over time. But we know that is not the case. An obvious example is the population boom known as the ‘baby-boomer’ generation, followed by a population crunch (generation X). As a population passes through various life-stages, they have need for different products and services; they also have differing financial pressures on them (as earning levels rise, peak out, then fall). But a following generational population may be smaller than the preceding generation, resulting in an outright natural shrinkage of the marketplace for a given product or service.
References:
[1] Fader, Peter S.; Hardie, Bruce G.S.; Shang, Jen (2010), “Customer-Base Analysis in a Discrete-Time Noncontractual Setting,” Marketing Science 29 (Nov-Dec): 1086-1108.
[2] Schmittlein, David; Morrison, Donald; Colombo, Richard (1987), “Counting Your Customers: Who Are They and What Will They Do Next?” Management Science 33 (1): 1–24.