Tuesday, February 2, 2010

Simpson's Paradox and Marketing

A reader asked the following question:

Hi Michael/Gordon,
In campaign measurements, it's possible to get a larger lift at the overall level compared to all the individual decile level lifts or vice versa, because of the differences in sample size across the deciles, and across Test & Control.
According to wikipedia, it's known as Simpson's paradox (or the Yule-Simpson effect) and is explained as an apparent paradox in which the successes in different groups seem to be reversed when the groups are combined.
In such scenarios, how do you calculate the overall lift? Which methods are commonly used in the industry?

Simpson's Paradox is an interesting phenomenon, where results about subgroups of a population do not generalize to the overall population. I think the simplest version that I've heard is an old joke . . . "I heard you moved from Minnesota to Iowa, raising the IQ of both states."

How could this happen? For the joke to work, the average IQ in Minnesota must be higher than the average IQ in Iowa. And, the person who moves must have an IQ between these two values. Voila, you can get the paradox that the averages in both states go up, although they are based on exactly the same population.

I didn't realize that this paradox has a name (or, if I did, then I had forgotten). Wikipedia has a very good article on Simpson's Paradox, which includes real world examples from baseball, medical studies, and an interesting discussion of a gender discrimination lawsuit at Berkeley. In the gender discrimination lawsuit, women were accepted at a much lower rate than men overall. However, department by department, women were typically accepted at a higher rate than men. The difference is that women applied to more competitive departments than men. These departments have lower rates of acceptance, lowering the overall rate for women.

Simpson's Paradox arises when we are taking weighted averages of evidence from different groups. Different weightings can produce very different, even counter-intuitive results. The results become much less paradoxical when we see the actual counts rather than just the percentages.

The specific question is how to relate this paradox to lift, and understanding marketing campaigns. Assume there is a marketing campaign, where one group receives a particular treatment and another group does not. The ratio of performance between these two groups is the lift of the marketing campaign.

To avoid Simpson's paradox, you need to ensure that the groups are as similar as possible, except for what's being tested. If the test is for the marketing message, there is no problem, both groups can be pulled from the same population. If, instead, the test is for the marketing group itself (say high value customers), then Simpson's Paradox is not an issue, since we care about how the group performs rather than how the entire population performs.

As a final comment, I could imagine finding marketing results where Simpson's Paradox has surfaced, because the original groups were not well chosen. Simpson's Paradox arises because the sizes of the test groups are not proportional to their sizes in the overall population. In this case, I would be tempted to weight the results from each group based on the expected size in the overall population to calculate the overall response and lift.

Labels: , ,

Monday, December 28, 2009

Differential Response or Uplift Modeling

Some time before the holidays, we received the following inquiry from a reader:

Dear Data Miners,

I’ve read interesting arguments for uplift modeling (also called incremental response modeling) [1], but I’m not sure how to implement it. I have responses from a direct mailing with a treatment group and a control group. Now what? Without data mining, I can calculate the uplift between the two groups but not for individual responses. With the data mining techniques I know, I can identify the ‘do not disturbs,’ but there’s more than avoiding mailing that group. How is uplift modeling implemented in general, and how could it be done in R or Weka?

[1] http://www.stochasticsolutions.com/pdf/CrossSell.pdf

I first heard the term "uplift modeling" from Nick Radcliffe, then of Quadstone. I think he may have invented it. In our book, Data Mining Techniques, we use the term "differential response analysis." It turns out that "differential response" has a very specific meaning in the child welfare world, so perhaps we'll switch to "incremental response" or "uplift" in the next edition. But whatever it is called, you can approach this problem in a cell-based fashion without any special tools. Cell-based approaches divide customers into cells or segments in such a way that all members of a cell are similar to one another along some set of dimensions considered to be important for the particular application. You can then measure whatever you wish to optimize (order size, response rate, . . .) by cell and, going forward, treat the cells where treatment has the greatest effect.

Here, the quantity  to measure is the difference in response rate or average order size between treated and untreated groups of otherwise similar customers. Within each cell, we need a randomly selected treatment group and a randomly selected control group; the incremental response or uplift is the difference in average order size (or whatever) between the two. Of course some cells will have higher or lower overall average order size, but that is not the focus of incremental response modeling. The question is not "What is the average order size of women between 40 and 50 who have made more than 2 previous purchases and live in a neighborhood where average household income is two standard deviations above the regional average?" It is "What is the change in order size for this group?"

Ideally, of course, you should design the segmentation and assignment of customers to treatment and control groups before the test, but the reader who submitted the question has already done the direct mailing and tallied the responses. Is it now too late to analyze incremental response?  That depends: If the control group is a true random control group and if it is large enough that it can be partitioned into segments that are still large enough to provide statistically significant differences in order size, it is not too late. You could, for instance, compare the incremental response of male and female responders.

A cell-based approach is only useful if the segment definitions are such that incremental response really does vary across cells. Dividing customers into male and female segments won't help if men and women are equally responsive to the treatment. This is the advantage of the special-purpose uplift modeling software developed by Quadstone (now Portrait Software). This tool builds a decision tree where the splitting criteria is maximizing the difference in incremental response. This automatically leads to segments (the leaves of the tree) characterized by either high or low uplift.  That is a really cool idea, but the lack of such a tool is not a reason to avoid incremental response analysis.

Labels: , , ,

Thursday, December 17, 2009

What do group members have in common?

We received the following question via email.


I have a data set which has both numeric and string attributes. It is a data set of our customers doing a particular activity (eg: customers getting one particular loan). We need to find out the pattern in the data or the set of attributes which are very common for all of them.

Classification/regression not possible , because there is only one class
Association rule cannot take my numeric value into consideration
clustering clusters similar people, but not common attributes.

 What is the best method to do this? Any suggestion is greatly appreciated.

The question "what do all the customers with a particular type of loan have in common"  sounds seductively reasonable. In fact, however, the question is not useful at all because the answer is "Almost everything."  The proper question is "What, if anything, do these customers have in common with one another, but not with other people?"  Because people are all pretty much the same, it is the tiny ways they differ that arouse interest and even passion.  Think of two groups of Irishmen, one Catholic and one Protestant. Or two groups of Indians, one Hindu and one Muslim. If you started with members of only one group and started listing things they had in common, you would be unlikely to come up with anything that didn't apply equally to the other group as well.

So, what you really have is a classification task after all.  Take the folks who have the loan in question and an equal numbers of otherwise similar customers who do not. Since you say you have a mix of numeric and string attributes, I would suggest using decision trees. These can split equally well on numeric values ( x>n ) or categorical variables ( model in ('A','B','C') ). If the attributes you have are, in fact, able to distinguish the two groups, you can use the rules that describe leaves that are high in holders of product A as "what holders of product A have in common" but that is really shorthand for "what differentiates holders of product A from the rest of the world."

Labels: , ,

Friday, November 6, 2009

Oversampling in General

Dear Data Miners,

I am trying to find out statistical reasons for balancing data sets when building models with binary targets, and nobody is able to intelligently describe why it is being done. In fact, there are mixed opinions on sampling when the response rate is low.

Based on literature and data mining professional opinions, here are few versions (assume that the response rate is 1%):

1) As long as the number of responders is approximately equal or greater than 10 times the number variables included, no additional sampling is needed.

2) Oversample or undersample (based on the total number of observations) at least until the response rate = 10%.

3) Oversample or undersample (based on the total number of observations) until the response rate = 50%.

4) Undersampling is useful only for cutting down on processing time; really no good reason to do it statistically as long as the number of observations for responders is "sufficient" (% does not matter).

Having an advanced degree in mathematics but not being a statistician, I would like to understand whether there really is any statistical benefit in doing that.

I appreciate your time answering this.


Your fellow data miner

Many years ago, I was doing a churn model for SK Telecom (in South Korea) using SAS Enterprise Miner. A friend of mine at SAS, Anne Milley, had suggested that having a 50% density for a binary response model would produce optimal models. Her reasoning was that with a 50% density of each target value, the contrast between the two values would be maximized, making it easier to pick out patterns in the data.

I spent some time testing decision trees with all sorts of different densities. To my surprise, the decision trees with more than 30% density performed better than trees with lower densities, regardless of the splitting criterion and other factors. This convinced me that 50% is not a bad idea.

There is a reason why decision trees perform better on balanced samples. The standard pruning algorithm for decision trees uses classification as the metric for choosing subtrees. That is, a leaf chooses its dominant class -- the one in excess of 50% for two classes. This works best when the classes are evenly distributed in the data. (Why data mining software implementing trees doesn't take the original density into account is beyond me.)

In addition, the splitting criteria may be more sensitive to deviations around 50% than around other values.

Standard statistical techniques are insensitive to the original density of the data. So, a logistic regression run on oversampled data should produce essentially the same model as on the original data. It turns out that the confidence intervals on the coefficients do vary, but the model remains basically the same.

Hmmm, as I think about it, I wonder if the oversampling rate would affect stepwise or forward selection of variables. I could imagine that, when testing each variable, the variance in results using a rare target would be larger than the variance using a balanced model set. This, in turn, might lead to a poorer choice of variables. But I don't know if this is the case.

For neural networks, the situation is more complicated. Oversampling does not necessarily improve the neural network -- there is no theoretical reason why. However, it does allow the network to run on a smaller set of data, which makes convergence faster. This, in turn, allows the modeler to experiment with different models. Faster convergence is a benefit in other ways.

Some other techniques such as k-means clustering and nearest neighbor approaches probably do benefit from oversampling. However, I have not investigated these situations in detail.

Because I am quite fond of decision trees, I prefer a simple rule, such as "oversample to 50%", since this works under the maximum number of circumstances.

In response to your specific questions, I don't think that 10% is a sufficient density. If you are going to oversample, you might as well go to 50% -- there is at least an elegant reason why (the contrast idea between the two response values). If you don't have enough data, then use weights instead of oversampling to get the same effect.

In the end, though, if you have the data and you have the software, try out different oversampling rates and see what produces the best models!

Labels: ,

Friday, October 16, 2009

SVM with redundant cases

We received the following question from a reader:

I just discovered this blog -- it looks great. I apologize if this question has been asked before -- I tried searching without hits.

I'm just starting with SVMs and have a huge amount of data, mostly in the negative training set (2e8 negative examples, 2e7 positive examples), with relatively few features (eg less than 200). So far I've only tried linear SVM (liblinear) due to the size, with middling success, and want to under-sample at least the negative set to try kernels.

A very basic question. The bulk of the data is quite simple and completely redundant -- meaning many examples of identical feature sets overlapping both positive and negative classes. What differs is the frequency in each class. I think I should be able to remove these redundant samples and simply tell the cost function the frequency of each sample in each class. This would reduce my data by several orders of magnitude.

I have been checking publications on imbalanced data but I haven't found this simple issue addressed. Is there a common technique?

Thanks for any insight. Will start on your archives.
There are really two parts to the question. The first part is a general question about using frequencies to reduce the number of records. This is a fine approach. You can list each distinct record only once along with its frequency. The frequency counts how many times a particular pattern of feature values (including the class assigned to the target) appears. The second part involves the effect on the SVM algorithm of having many cases with identical features but different assigned classes. That sounded problematic to me, but since I am not an expert on support vector machines, I forwarded your question to someone who is--Lutz Hamel, author of Knowledge Discovery with Support Vector Machines.

Here is his reply:

I have some fundamental questions about the appropriateness of SVM for this classification problem:

Identical observation feature vectors produce different classification outcomes. If this is truly meaningful then we are asking the SVM to construct a decision plane through a point with some of the examples in this point classified as positive and some as negative. This is not possible. This means one of two things: (a) we have a sampling problem where different observations are mapped onto the same feature vectors. (b) we have a representation problem where the feature vector is not powerful enough to distinguish observations that should be distinguished.

It seems to me that this is not a problem of a simple unbalanced dataset but a problem of encoding and perhaps coming up with derived features that would make this a problem suitable for decision plane based classification algorithms such as SVMs. (is assigning the majority label to points that carry multiple observations an option?)
SVM tries to find a hyperplane that separates your classes. When, (as is very common with things such as marketing response data, or default, or fraud, or pretty much any data I ever work with), there are many training cases where identical values of the predictors lead to different outcomes, support vector machines are probably not the best choice. One alternative you could consider is decision trees. So long as there is a statistically significant difference in the distribution of the target classes, a decision tree can make splits. Any frequently occuring pattern of features will form a leaf and, taking the frquencies into account, the proportion of each class in the leaf provides estimates of the probabilities for each class given that pattern.

Labels: , ,

Tuesday, September 15, 2009

Adjusting for Oversampling

We recently received two similar questions about oversampling . . .

If you don´t mind, I would like to ask you a Question regarding Oversampling as you wrote in your book (Mastering Data Mining...).

I can understand how you calculate predictive lift when using oversampling, though don´t know how to do it for the confusion matrix.

Would you mind telling me how do I compute then the confusion matrix for the actual population (not the oversampled set)?

Thanks in advance for your reply and help.



I have severely unbalanced training data (180K negative cases, 430 positive cases). Yeah...very unbalanced.

I fit a model in a software program that allows instance weights (weka). I give all the positive cases a weight of 1 and all the negative cases a weight of 0.0024. I fit a model (not a decision tree so running the data through a test set is not an option to recalibrate) - like a neural network. I output the probabilities and they are out of whack - good for predicting the class or ranking but not for comparing predicted probability against actual.

What can we do to fit a model like this but then output probabilities that are in line with the distribution? Is this new (wrong) probabilities just the price we have to pay for instance weights to (1) get a model to build (2) get reasonably good classification? Can I have my cake and eat it too (classification and probs that are close to actual)?

Many many thanks!

The problem in these cases is the same. The goal is to predict a class, usually a binary class, where one outcome is rarer than the other. To generate the best model, some method of oversampling is used so the model set has equal numbers of the two outcomes. There are two common ways of doing this. Diego is probably using all the rare outcomes and an equal-sized random sample of the common outcomes. This is most useful when there are a large number of cases, and reducing the number of rows makes the modeling tools run faster. Brian is using a method where weights are used for the same purpose. Rare cases are given a weight of 1 and common cases are given a weight less than 1, so that the sum of the weights of the two groups is equal.

Regardless of the technique (neural network, decision trees, logistic regression, neearest neighbor, and so on), the resulting probabilities are "directionally" correct. A group of rows with a larger probability are more likey to have the modeled outcome than a group with a lower probability. This is useful for some purposes, such as getting the top 10% with the highest scores. It is not useful for other purposes, where the actual probability is needed.

Some tools can back into the desired probabilities, and do correct calculations for lift and for the confusion matrix. I think SAS Enterprise Miner, for instance, uses prior probabilties for this purpose. I say "think" because I do not actually use this feature. When I need to do this calculation, I do it manually, because not all tools support it. And, even if they do, why bother learning how. I can easily do the necessary calculations in Excel.

The key idea here is simply counting. Assume that we start with data that is 10% rare and 90% common, and we oversample so it is 50%-50%. The relationship between the original data and the model set is:
  • rare outcomes: 10% --> 50%
  • common outcomes: 90% --> 50%
To put it differently, each rare outcome in the original data is worth 5 in the model set. Each common outcome is worth 5/9 in the model set. We can call these numbers the oversampling rates for each of the outcomes.

We now apply these mappings to the results. Let's answer Brian's question for a particular situation. Say we have the above data and a result has a modeled probability of 80%. What is the actual probability?

Well, 25% means that there is 0.25 rare outcomes for 0.75 common ones. Let's undo the mapping above:
  • 0.80 / 5 = 0.16
  • 0.20 / (5/9) = 0.36
So, the expected probability on the original data is 0.16/(0.16+0.36) = 30.8%. Notice that the probability has decreased, but it is still larger than the 10% in the original data. Also notice that the lift on the model set is 80%/50% = 1.6. The lift on the original data is 3.08 (30.8% / 10%). The expected probability goes down, and the lift goes up.

This calculation can also be used for the cross-correlation matrix (or confusion matrix). In this case, you just have to divide each cell by the appropriate overampling rate. So, if the confusion matrix said:
  • 10 rows in the model set are rare and classified as rare
  • 5 rows in the model set are rare and classified as common
  • 3 rows in the model set are common and classified as rare
  • 12 rows in the model set are common and classified as common
(I apologize for not including a table, but that is more trouble than it is worth in the blog.)

In the original data, this means:
  • 2=10/5 rows in the original data are rare and classified as rare
  • 1=5/5 rows in the original data are rare and classified as common
  • 5.4 = 3/(5/9) rows inthe original data are common and classified as rare
  • 21.6 = 12/(5/9) rows in the original data are common and classified as common
These calculations are quite simple, and it is easy to set up a spreadsheet to do them.

I should also mention that this method readily works for any number of classes. Having two classes is simply the most common case.

Labels: , ,

Monday, July 27, 2009

Time to Event Models, When the Event Is Not Churn

Dear Data Miners,

I am trying to build a churn model to predict WHEN customers will become paying members.

1. Person comes to our web site.
2. They register for free to use the site.
3. If the want to have more access to the site and use more features they pay us.

What are the issues I should consider when I decide to set a cut date. The first step towards censoring the data.

For a classic churn model , we want to know when someone will stop paying us and leave our phone company. We censor those that we don’t know their final status pass our censor point.

I want to know when they will pay us and censor those I don’t know if they will pay us in the future.

Is the cut date choice arbitrary or is there some sampling rule?

Thank you;


Your example is a time-to-event model that does not represent churn. There are many such examples in business (and this is something discussed in Data Analysis Using SQL and Excel in a bit of depth).

Think of your situation as two different time-to-event problems:

(1) A person visits the web site, what happens next? Does the person return to the web site or register? This is a time-to-event problem and analysis can provide information on customer registrations, particularly the lag between the initial visit and the registration.

(2) A person registers for free, how long until that person buys something? This can provide insight on paying visitors.

Once you have broken the problem into these pieces, imagining the customer signature is easier. For the first problem, the customer signature is a picture of customers when they initially visit (or for each pre-registration visit, for a time-to-next event problem). The "prediction" columns are the date of the registration (or for time-to-event, the date of the next visit and whether it involves a registration).

The second component is a picture of the customer when they first register, and the prediction columns are when (and whether) the customer every pays for anything. In this case, it is very important to treat this as a time-to-event problem, because older registrations have had more opportunity to pay for something and the analysis needs to take this into account.

As for the censor date, it is the most recent date of the data. So, if you have data through the end of yesterday, then that is the censor date. For instance, for the second component of the analysis, customers who registered before yesterday but never paid would have their outcomes censored (these customers have not paid yet but they may pay in the future).

Labels: , ,

Friday, June 26, 2009

When Customers Start and End

In texts on credit scoring, some effort almost always goes into defining what is to be considered as a "bad" credit. The Basel framework provides rather a precise definition of what is to be considered a default.

But I have rarely seen the same in predicting cross-sell, up-sell or churn. I do however, remember attending an SPSS conference where churn of pre-paid cards was discussed. Churn, in that case, was defined as a number of consecutive periods where the number of calls fell below a certain level.

In the past, I've used start and end dates of contracts, as well as a simple increase (or decrease) in the number of products that a customer has over time as indicators of what to target.

I'd be really interested in hearing how you define and extract targets, be it in telecom, banking, cards or any other business where you use prediction. For instance, how would you go looking for customers that have churned? Or for that matter, customers where up-sell has been successful?

This may be too simple a question, but if there are standard methods that you use, I'd be really interested in learning about them.


This is not a simple question at all. Or rather, the simplest questions are often the most illuminating.

The place where I see the biggest issues in defining starts and stops is in survival data mining (obligatory plug for my book Data Analysis Using SQL and Excel, which has two chapters on the subject). For the start date, I try to use (or approximate as closely as possible) the date when two things have occurred: the company has agreed to provide a product or service, and the customer has agreed to pay for it. In the case of post-pay telecoms, this would be the activation date -- and there are similar dates in many other industries, as varied as credit cards, cable subscriptions, and health insurance.

The activation date is often well-defined because the number of active customers gets reported through some system tied to the financial systems. Even so, there are anomalies. I recently completed a project at a large newspaper, and used their service start date as the activation date. Alas, at time, customers with start dates did not necessarily actually receive the paper on the date -- often because the newspaper delivery person could not find the address.

The stop date is even more fraught with complication, because there are a variety of different dates to choose from. For voluntary churn, there is the date the customer requests termination of the service. There is also the date when the service is actually turned off. Which to use? It depends on the application. To count active customers, we want the service cut-off date. To plan for customer retention efforts, we want to know when they call in.

Involuntary churn is also complicated, because there are a series of steps, often called the Dunning Process, which keeps track of customers who do not pay. At what point does a non-paying customer stop? When the service stops? When the bill is written off or settled? At some arbitrary point, such as 60 or 90 days of non-payment? To further confuse the situation, the business may change its rules over time. So, during some periods of time or for some customers, 60 days of non-payment results in service cutoff. For other periods or customers, 90 days might be the rule.

Often, I find multiple time-to-event problems in this scenario. How long does it take a non-paying customer to stop, if ever? How long after customers sign up do they begin?

In your particular case, the contract start date is probably a good place to start. However, the contract end date might or might not be appropriate, since this might not be updated to reflect when a customer actually stops.


Labels: , ,

Monday, June 8, 2009

Confidence in Logistic Regression Coefficients

I work in the marketing team of a telecom company and I recently encountered an annoying problem with an upsell model. Since the monthly sale rate is less than 1% of our customer base, I used oversampling as you mentioned in your book ‘Mastering data mining’ with data over the last 3 sales months so that I had a ratio of about 15% buyers and 85% non-buyers (sample size of about 20K). Using alpha=5%, I got parameter estimates which were from a business perspective entirely explicable. However, when I then re-estimated the model on the total customer base to obtain the ‘true’ parameter estimates which I will use for my monthly scoring two effects were suddenly insignificant at alpha=5%.

I never encountered this and was wondering what to do with these effects: should I kick them out of the model or not ? I decided to keep them in since they did have some business meaning and concluded that they must have become insignificant since it is only a micro-segment in your entire population.
To your opinion, did I interpret this correctly ? . . .
Many thanks in advance for your advice,

Michael responds:

Hi Wendy,

This question has come up on the blog before. The short answer is that with a logistic regression model trained at one concentration of responders, it is a bit tricky to adjust the model to reflect the actual probability of response on the true population. I suggest you look at some papers by Gary King on this topic.

Gordon responds:

Wendy, I am not sure that Prof. King deals directly with your issue, of changing confidence in the coefficients estimates. To be honest, I have never considered this issue. Since you bring it up, though, I am not surprised that it may happen.

My first comment is that the results seem usable, since they are explainable. Sometimes statistical modeling stumbles on relationships in the data that make sense, although they may not be fully statistically significant. Similarly, some relationships may be statistically significant, but have no meaning in the real world. So, use the variables!

Second, if I do a regresson on a set of data, and then duplicate the data (to make it twice as big) and run it again, I'll get the same estimates as on the orignal data. However, the confidence in the coefficients will increase. I suspect that something similar is happening on your data.

If you want to fix that particular problem, then use a tool (such as SAS Enterprise Miner and probably proc logistic) that supports a frequency option on each row. Set the frequency to one for the more common events and to an appropriate value less than one for more common events. I do this as a matter of habit, because it works best for decision trees. You have pointed out that the confidence in the coefficients is also affected by the frequencies, so this is a good habit with regressions as well.

Labels: , , ,

Saturday, April 25, 2009

When There Is Not Enough Data

I have a dataset where the target (continuous variable) variable that has to be estimated. However, in the given dataset, values for target are preset only for 2% while rest of 98% do not have values. The 98% are empty values. I need to score a dataset and give values for the target for all 2500 records. Can I use the 2% and replicate it several times and use that dataset to build a model? The ASE is too high if I use the 2% data alone. Any suggestions how to handle it, please?


The short answer to your question is "Yes, you can replicate the 2% and use it to build a model." BUT DO NOT DO THIS! Just because a tool or technique is possible to implement does not mean that it is a good idea. Replicating observations "confuses" models, often by making the model appear overconfident in its results.

Given the way that ASE (average squared error) is calculated, I don't think that replicating data is going to change the value. We can imagine adding a weight or frequency on each observation instead of replicating them. When the weights are all the same, they cancel out in the ASE formula.

What does change is confidence in the model. So, if you are doing a regression and looking at the regression coefficients, each has a confidence interval. By replicating the data, the resulting model would have smaller confidence intervals. However, these are false, because the replicated data has no more information than the original data.

The problem that you are facing is that the modeling technique you are using is simply not powerful enough to represent the 50 observations that you have. Perhaps a different modeling technique would work better, although you are working with a small amount of data. For instance, perhaps some sort of nearest neighbor approach would work well and be easy to implement.

You do not say why you are using ASE (average squared error) as the preferred measure of model fitness. I can speculate that you are trying to predict a number, perhaps using a regression. One challenge is that the numbers being predicted often fall into a particular range (such as positive numbers for dollar values or ranging between 0 and 1 for a percentage). However, regressions produce numbers that run the gamut of values. In this case, transforming the target variable can sometimes improve results.

In our class on data mining (Data Mining Techniques: Theory and Practice), Michael and I introduce the idea of oversamping rare data using weights in order to get a balanced model set. For instance, if you were predicting whether someone was in the 2% group, you might give each of them a weight of 49 and all the unknowns a weight of 1. The result would be a balanced model set. However, we strongly advise that the maximum weight be 1. So, the weights would be 1/49 for the common cases and 1 for the rare ones. For regressions, this is important because it prevents any coefficients from having too-narrow confidence intervals.

Labels: , , ,

Wednesday, January 14, 2009

Neural Network Training Methods

Scott asks . . .

Dear Ask a Data Miner,

I am using SPSS Clementine 12. The Neural Network node in Clementine allows users to choose from six different training methods for building neural network models:

• Quick. This method uses rules of thumb and characteristics of the data to choose an appropriate shape (topology) for the network.

• Dynamic. This method creates an initial topology but modifies the topology by adding and/or removing hidden units as training progresses.

• Multiple. This method creates several networks of different topologies (the exact number depends on the training data). These networks are then trained in a pseudo-parallel fashion. At the end of training, the model with the lowest RMS error is presented as the final model.

• Prune. This method starts with a large network and removes (prunes) the weakest units in the hidden and input layers as training proceeds. This method is usually slow, but it often yields better results than other methods.

• RBFN. The radial basis function network (RBFN) uses a technique similar to k-means clustering to partition the data based on values of the target field.

• Exhaustive prune. This method is related to the Prune method. It starts with a large network and prunes the weakest units in the hidden and input layers as training proceeds. With Exhaustive Prune, network training parameters are chosen to ensure a very thorough search of the space of possible models to find the best one. This method is usually the slowest, but it often yields the best results. Note that this method can take a long time to train, especially with large datasets.

Which is your preferred training method? How about for a lot of data - (a high number of cases AND a high number of input variables)? How about for a relatively small amount of data?


Our general attitude with respect to fancy algorithms is that they provide incremental value. However, focusing on data usually provides more scope for improving results. This is particularly true of neural networks, because stable neural networks should have few inputs.

Before addressing your question, there are a few things that you should keep in mind when using neural networks:

(1) Standardize all the inputs (that is, subtract the average and divide by the standard deviation). This puts all numeric inputs into a particular range.

(2) Avoid categorical inputs! These should be replaced by appropriate numeric descriptors. Neural network tools, such as Clementine, handle categorical inputs using something called n-1 coding, which converts one variable into many flag variables, which, in turn, multiplies the number of weights in the network that need to be optimized.

(3) Avoid variables that are highly collinear. These cause "multidimensional ridges" in the space of neural network weights, which can confuse the training algorithms.

To return to your question in more detail. Try out lots of the different approaches to determine which is best! There is no rule that says that you have to decide on one approach initially and stick with it. To test the approaches use a separate partition of the data to see which works best.

For instance, the Quick method is probably very useful in getting results back in a reasonable amount of time. Examine the topology, though, to see if it makes sense (no hidden units or too many hidden units). Most of the others are all about adding or removing units, which can be valuable. However, always test the methods on a test set that is not used for training. The topology of the network may depend on the training set, so that provides an opportunity for overfitting.

These methods are focusing more on the topology than on the input parameters. If the prune method really does remove inputs, then that would be powerful functionality. For the methods that are comparing results, ensure that the results are compared on a validation set, separate from the test set used to calculate the weights. It can be easy to overfit neural networks, particularly as the number of weights increases.

A comment about the radial basis function approach. Make sure that Clementine is using normalized radial basis functions. Standard neural networks use an s-shaped function that starts low and goes high (or vice versa), meaning that the area under the curve is unbounded. RBFs start low, go high, and then go low again, meaning that the area under the curve is finite. Normalizing the RBFs ensures that the basis functions do not get too small.

My personal favorite approach to neural networks these days is to use principal components as inputs into the network. To work effectively, this requires some background in principal components to choose the right number as inputs into the network.


Labels: , ,

Saturday, November 1, 2008

Should model scores be rescaled?

Here’s a quick question for your blog;

- background -

I work in a small team of data miners for a telecommunications company. We usually do ‘typical’ customer churn and mobile (cell-phone) related analysis using call detail records (CDR’s)

We often use neural nets to create a decimal range score between zero and one (0.0 – 1.0), where zero equals no churn and maximum 1.0 equals highest likelihood of churn. Another dept then simply sorts an output table in descending order and runs the marketing campaigns using the first 5% (or whatever mailing size they want) of ranked customers.

- problem -

We have differing preferences in the distribution of our prediction score for churn. Churn occurs infrequently, lets say 2% (it is voluntary churn of good fare paying customers) per month. So 98% of customers have a score of 0.0 and 2% have a score of 1.0.

When I build my predictive model I try to mimic this distribution. My view that is most of the churn prediction scores would be skewed toward 0.1 or 0.2, say 95% of all predicted customers, and from 0.3 to 1.0 of the churn score would apply to maybe 5% of the customer base.

Some of my colleagues re-scale the prediction score so that there are an equal number of customers spread throughout.

- question -

What are your views/preferences on this?

I see no reason to rescale the scores. Of course, if the only use of the scores is to mail the top 5% of the list it makes no difference since the transformation preserves the ordering, but for other applications you want the score to be an estimate of the actual probability of cancellation.

In general, scores that represent the probability of an event are more useful than scores which only order a list in descending order by probability of the event. For example, in a campaign response model, you can multiply the probability that a particular prospect will respond by the value of that response to get an expected value of making the offer. If the expected value is greater than the cost, the offer should not be made. Gordon and I discuss this and related issues in our book Mastering Data Mining.

This issue often comes up when stratified sampling is used to create a balanced model set of 50% responders and 50% non-responders. For some modeling techniques--notably, decision trees--a balanced model set will produce more and better rules. However, the proportion of responders at each leaf is no longer an estimate of the actual probability of response. The solution is simple: simply apply the model to a test set that has the correct distribution of responders to get correct estimates of the response probability.


Labels: ,

Tuesday, October 28, 2008

Random Samples in SQL


How would recommend getting a random sample from a table in SQL? Thank you!


This is a good question. Unfortunately, there is not a good answer, because the concept of a random sample does not really exist in relational algebra (which SQL -- to a greater or lesser extent -- is based on). There are, however, ways of to arrive at the solution. This discussion is based partly on the Appendix in Data Analysis Using SQL and Excel.

The basic idea is assume that there is a function that returns a random number, say uniformly between 0 and 1. If such a function exists, the SQL code for a random sample might look like:

....SELECT *
....FROM table t
....WHERE rand() <>

The function rand() does actually exist in many databases, such as IBM UDB, Microsoft SQL, and Mysql.

Does this really work for these databases? That depends on whether rand() is a deterministic or non-deterministic function. A deterministic function is essentially evaluated once, when the query is parsed. If this is the case, then all rows would have the same value, and the query would not return a 10% random sample. It would return either 0 rows or all of them.

Fortunately, for these databases, the designers were smart and rand() is non-deterministic, so the above code works as written.

Oracle has a totally different approach. It supports the SAMPLE clause. Using it, the above query would be written as:

....SELECT *
....FROM table t
....SAMPLE (10)

Another approach in Oracle is to use a pseudo-random number generator and ROWNUM. This approach works in any database that has something similar to ROWNUM.

If you happen to be using SAS proc SQL, then you can do something similar to the first example. The only difference is that the function is RAND('UNIFORM') rather than just RAND().

Labels: , ,

Thursday, October 2, 2008

Decision Trees and Clustering


I started to write my master thesis and i chose a data mining topic.What I have to do is to analyze the bookings of an airline company and to observe for which markets,time periods and clients the bookings can be trusted and for which not.(The bookings can anytime be canceled or modified ).

I decided to use the decision trees as a classification method but I somehow wonder if clustering would have been more appropriate in this situation.

Thanks and best regards,

When choosing between decision trees and clustering, remember that decision trees are themselves a clustering method. The leaves of a decision tree contain clusters of records that are similar to one another and dissimilar from records in other leaves. The difference between the clusters found with a decision tree and the clusters found using other methods such as K-means, agglomerative algorithms, or self-organizing maps is that decision trees are directed while the other techniques I mentioned are undirected. Decision trees are appropriate when there is a target variable for which all records in a cluster should have a similar value. Records in a cluster will also be similar in other ways since they are all described by the same set of rules, but the target variable drives the process. People often use undirected clustering techniques when a directed technique would be more appropriate. In your case, I think you made the correct choice because you can easily come up with a target variable such as the percentage cancelations, alterations and no-shows in a market.

You can make a model set that has one row per market. One column, the target, will be the percentage of reservations that get changed or cancelled. The other columns will contain everything you know about the market--number of flights, number of connections, ratio of business to leasure travelers, number of carriers, ratio of transit passengers to origin or destination passengers, percentage of same day bookings, same week bookings, same month bookings, and whatever else comes to mind. A decision tree will produce some leaves with trustworthy bookings and some with untrustworthy bookings and the paths from the root to these leaves will be descriptions of the clusters.


Tuesday, July 29, 2008

Nested Subqueries in SQL

A recent question:

You used a lot of multi-layer subqueries. Equivalently, I think we can create intermediate tables or views and query them. It's easier for me to build, to follow, and to debugg especially from dataflow diagram. But I do believe the two approaches will result in different time and space required. Could you elaborate on the difference?

[Note: This question is about the the book Data Analysis Using SQL and Excel.]
This is a good question. Ironically, I received the email while I was sitting in an office in Kuala Lumpur, Malaysia writing very complicated nested queries running a very remote database (the database is actually in San Diego).

In this case, the need for complicated nested subqueries was obvious: I (and the group I was working with) only have read access into the database. And for good reason. Although the database contains key analytic information, it is an operational database. In such cases, analysts often have only read access. (By the way, this brings up a question for Oracle: Why can't a user with read-only access explain a query?)

This very immediate experience provides the first answer to the question. In some circumstances, it is not possible or desireable to write to the database.

However, that is only the first reason. There are other reasons.

One of the jobs of databases is planning the most efficient execution plan. For instance, a query might join several tables together. If we do these joins with intermediate tables, say, two at a time, then we impose an ordering on them. However, one of the most important parts of a SQL optimizer is the part that chooses the ordering of joins. Some ways of doing joins are more efficient than other ways.

So, a second reason is that explicitly storing results in intermediate tables might prevent the SQL optimizer from choosing the most efficient query plan.

The third reason also has to do with the database engine. Database engines manage storage. A complex query plan may produce several intermediate results in temporary tables. The writer of the query does not need to name these tables, keep track of the storage, or remember to delete them. The query engine does this automatically.

Doing analysis in a single query (versus in a script file) saves time and effort in storage management.

Storing data in intermediate tables may also impose constraints on the intermediate tables. In particular, the intermediate tables may not be parallel or the table space used for storage may be inefficient in some other way. Using subqueries eliminates any dependency on possible inefficient temporary user storage.

Another reason has to do with columns in intermediate results that are not used. Eliminating columns in query processing can be significant for improving efficiency of queries. When storing results in an intermediate table, all the columns are stored. When using subqueries, the query plan should include only columns actually needed.

Subqueries allow the query plan to eliminate unused columns from intermediate tables.

A final reason for me is really a personal preference. The task of maintaining separate queries and tables, especially using naming conventions is cumbersome. If I store results in a temporary table, I want the name to mean something. If I'm using a subquery, then the name is less important. Or, if I change the name or type of a column, then I find it easier to make the change in one place rather than distributed through a script file. Using a subquery reduces my need to think of new names. Admittedly, this reason is really personal preference.

In general, writing complicated subqueries actually allows the query optimizer to do what it does best -- determine the most efficient execution plan for running the query. Although there are some exceptions, the general rule is quite simple: try to focus on the questions being answered and not how the results are combined. SQL may not be a perfect language for processing, but it does allow us to do very complex data manipulations with a minimum of programming concerns.

Labels: , ,

Thursday, June 5, 2008

Qualifications for Studying Data Mining

A recent question . . .

I am hoping to begin my masters degree in Data Mining. I have come from a Software Development primary degree. I am a bit worried over the math involved in Data Mining.Could you tell me, do I need to have a strong mathematical aptitude to produce a good Thesis on Data Mining?

First, I think a software development background is a good foundation for data mining. Data mining is as much about data (and hence computers and databases) as it is about analysis (and hence statistics, probability, and math).

Michael and I are not academics so we cannot speak to the thesis requirements for a particular data mining program. Both of us majored in mathematics (many years ago) and then worked as software engineers. We do have some knowledge of both fields, and the combination provided a good foundation for our data mining work.

To be successful in data mining, you do need some familiarity with math, particularly applied math -- things like practical applications of probability, algebra, the ability to solve word problems, and the ability to use spreadsheets. Unlike theoretical statistics, the purpose of data mining is not to generate rigorous proofs of various theorems; the purpose is to find useful patterns in data, to validate hypotheses, to set up marketing tests. We need to know when patterns are unexpected, and when patterns are expected.

This is a good place to add a plug for my book Data Analysis Using SQL and Excel, which has two or three chapters devoted to practical statistics in the context of data analysis.

In short, if you are math-phobic, then you might want to reconsider data mining. If your challenges in math are solving complex integrals, then you don't have much to worry about.


Labels: , ,

Thursday, May 1, 2008

Statistical Test for Measuring ROI on Direct Mail Test

If I want to test the effect of return of investment on a mail/ no mail sample, however, I cannot use a parametric test since the distribution of dollar amounts do not follow a normal distribution. What non-parametric test could I use that would give me something similar to a hypothesis test of two samples?

Recently, we received an email with the question above. Since it was addressed to bloggers@data-miners.com, it seems quite reasonable to answer it here.

First, I need to note that Michael and I are not statisticians. We don't even play one on TV (hmm, that's an interesting idea). However, we have gleaned some knowledge of statistics over the years, much from friends and colleagues who are respected statisticians.

Second, the question I am going to answer is the following: Assume that we do a test, with a test group and a control group. What we want to measure is whether the average dollars per customer is significantly different for the test group as compared to the control group. The challenge is that the dollar amounts themselve do not follow a known distribution, or the distribution is known not to be a normal distribution. For instance, we might only have two products, one that costs $10 and one that costs $100.

The reason that I'm restating the problem is because a term such as ROI (return on investment) gets thrown around a lot. In some cases, it could mean the current value of discounted future cash flows. Here, though, I think it simply means the dollar amount that customers spend (or invest, or donate, or whatever depending on the particular business).

The overall approach is that we want to measure the average and standard error for each of the groups. Then, we'll apply a simple "standard error" of the difference to see if the difference is consistently positive or negative. This is a very typical use of a z-score. And, it is a topic that I discuss in more detail in Chapter 3 of my book "Data Analysis Using SQL and Excel". In fact, the example here is slightly modified from the example in the book.

A good place to start is the Central Limit Theorem. This is a fundamental theorem for statistics. Assume that I have a population of things -- such as customers who are going to spend money in response to a marketing campaign. Assume that I take a sample of these customers and measure an average over the sample. Well, as I take more an more samples, the distribution of the averages follows a normal distribution regardless of the original distribution of values. (This is a slight oversimplification of the Central Limit Theorem, but it captures the important ideas.)

In addition, I can measure the relationship between the characteristics of the overall population and the characteristics of the sample:

(1) The average of the sample is as good an approximation as any of the average of the overall population.

(2) The standard error on the average of the sample is the standard deviation of the overall population divided by the square root of the size of the sample. Alternatively, we can phrase this in terms of variance: the variance of the sample average is the variance of the population average divided by the size of the sample.

Well, we are close. We know the average of each sample, because we can measure the average. If we knew the standard deviation of the overall population, then we could get the standard error for each group. Then, we'd know the standard error and we would be done. Well, it turns out that:

(3) The standard deviation of the sample is as good an approximation as any for the standard deviation of the population. This is convenient!

Let's assume that we have the following scenario.

Our test group has 17,839 customers, and the overall average purchase is $85.48. The control group has 53,537 customers, and the average purchase is $70.14. Is this statistically different?

We need some additional information, namely the standard deviation for each group. For the test group, the standard deviation is $197.23. For the control group, it is $196.67.

The standard error for the two groups is then $197.23/sqrt(17,839) and $196.67/sqrt(53,537), which comes to $1.48 and $0.85, respectively.

So, now the question is: is the difference of the means ($85.48 - $70.14 = $15.34) significantly different from zero. We need another formula from statistics to calculate the standard error of the difference. This formula says that the standard error is the square root of the sums of the squares of standard errors. So the value is $1.71 = sqrt(0.85^2 + 1.48^2).

And we have arrived at a place where we can use the z-score. The difference of $15.34 is about 9 standard deviations from 0 (that is, 9*1.71 is about 15.34). It is highly, highly, highly unlikely that the difference includes 0, so we can say that the test group is significantly better than the control group.

In short, we can apply the concepts of normal distributions, even to calculations on dollar amounts. We do need to be careful and pay attention to what we are doing, but the Central Limit Theorem makes this possible. If you are interested in this subject, I do strongly recommend Data Analysis Using SQL and Excel, particularly Chapter 3.


Labels: , , ,

Saturday, April 12, 2008

Using validation data in Enterprise Miner

Dear Sir/Madam,

I am a lecturer at De Montfort University in the UK and teach modules on
Data Mining at final year BSc and MSc level. For both of these we use the
Berry & Linoff Data Mining book. I have a couple of questions regarding SAS that I've been unable to find the answer to and I wondered if you could point in the direction of a source of info where I could find the answers. They are to do with partitioning data in SAS EM and how the different data sets are used. In the Help from SAS EM I see that it says the validation set is used in regression "to choose a final subset of predictors from all the subsets computed during stepwise regression" - so is the validation set not used in regression otherwise (e.g. in forward deletion and backward deletion)?

Also I'm not sure where we see evidence of the test set being used in any of the models I've developed (NNs, Decision Trees, Regression). I presume the lift charts are based on the actual model (resulting from the training and validation data sets) though I noticed if I only had a training and a validation data set (i.e. no test set) the lift chart gave a worse model.

I hope you don't mind me asking these questions - My various books and the help don't seem to explain fully but I know it must be documented somewhere.

best wishes, Jenny Carter

Dr. Jenny Carter
Dept. of Computing
De Montfort University
The Gateway

Hi Jenny,

I'd like to take this opportunity to go beyond your actual question about SAS Enterprise Miner to make a general comment on the use of validation sets for variable selection in regression models and to guard against overfitting in decision tree and neural network models.

Historically, statistics grew up in a world of small datasets. As a result, many statistical tools reuse the same data to fit candidate models as to evaluate and select them. In a data mining context, we assume that there is plenty of data so there is no need to reuse the training data. The problem with using the training data to evaluate a model is that overfitting may go undetected. The best model is not the one that best describes the training data; it is the one that best generalizes to new datasets. That is what the validation is for. The details of how Enterprise Miner accomplishes this vary with the type of model. In no case does the test set get used for either fitting the model or selecting from among candidate models. Its purpose is to allow you to see how your model will do on data that was not involved in the model building or selection process.

Regression Models

When you use any of the model selection methods (Forward, Stepwise, Backward), you also get to select a method for evaluating the candidate models formed from different combinations of explanatory variables. Most of the choices make no use of the validation data. Akaike's Information Criterion and Schwarz's Bayesian Criterion both add a penalty term for the number of effects in the model to a function of the error sum of squares. This penalty term is meant to compensate for the fact that additional model complexity appears to lower the error on the training data even when the model is not actually improving. When you choose Validation Error as the selection criterion, you get the model that minimizes error on the validation set. That is our recommended setting. You must also take care to set Use Selection Default to No in the Model Selection portion of the property sheet of Enterprise Miner will ignore the rest of your settings.

When a training set, validation set, and test set are all present, Enterprise Miner will report statistics such as the root mean squared error for all three sets. The error on the test set, which is not used to fit models nor to select candidate models, is the best predictor of performance on unseen data.

Decision Trees

With decision trees, the validation set is used to select a subtree of the tree grown using the training set. This process is called "pruning." Pruning helps prevent overfitting. Some splits which have a sufficiently high worth (chai-square value) on the training data to enter the initial tree, fail to improve the error rate of the tree when applied to the validation data. This is especially likely to happen when small leaf sizes are allowed. By default, if a validation set is present, Enterprise Miner will use it for subtree selection.

Neural Networks

Training a neural network is an iterative process. Each training iteration adjusts the weights associated with each network connection. As training proceeds, the network becomes better and better at "predicting" the training data. By the time training stops, the model is almost certainly overfit. Each set of weights is a candidate model. The selected model is the one that minimizes error on the validation set. In the chart shown below, after 20 iterations of training the error on the training set is still declining, but the best model was reached after on 3 training iterations.


Saturday, November 24, 2007

Constructing a Model Set for Reccuring Events

In the previous post, I answered a question about how to set up a model set for binary churn. It is fairly common for data miners to find ways to express almost any problem as a binary outcome since binary outcome problems are easily approached with familiar tools such as logistic regression or decision trees. The context for the questions suggests an alternate approach, however. The event of interest was the purchase of refill pages for a calendar/planner. This is an example of a recurring event. Other examples include:
  • Visits to a web page.
  • Purchases of additional minutes for a pre-paid phone plan.
  • Subscription renewals.
  • Repeat purchases.
  • Pregnancies.
  • Incarcerations.
  • Posts to a blog.
All of these are examples of counting processes. A counting process is one where each time an event occurs it increments a total count. The event frequency is governed by an intensity function which is a function of time and other covariates, much like the hazard function in survival analysis for non-recurring events. The intensity function can be estimated empirically, or it may be fit by a parametric or semi-parametric model using, for example, the SAS PHREG procedure. Either way, the data must first be transformed from the way it was probably recorded--dated transactions--to a form suitable for the required calculations.

These are customers making multiple purchases during an observation window. Each time a customer makes a purchase, a transaction record is created. When we add this data to a table in the counting process style, each customer contributes several rows. There is a row for the time from time 0, which may be the time of the initial purchase, to the second purchase, a row for the time to each subsequent purchase, and a row for the time between the final observed purchase and the end of the observation period.

Depending on the style of analysis used, each event may be seen as starting a new time 0 with the number of previous events as a covariate, or each event may be modeled separately with a customer only becoming part of the at-risk pool for event n after experiencing event n-1.
Either way, it is important to include the final censored time period. This period does not correspond to any transaction, but customers are "at risk" for another purchase during that period.

My approach to creating the table is to first create the table without the censored observations, which is reasonably straightforward. Each of these rows contains a flag indicating it is a complete, uncensored observation. Next I create just the censored observations by creating an observation going from the latest observed purchase to the end of the observation period (in this case, 22May2006). The censored rows can then be appended to the uncensored rows. These could, of course, be turned into subqueries in order to avoid creating the temporary tables.

This fully expanded version of the data is what is referred to as the counting process style of input. In a realistic situation where there might be millions of customers, it makes more sense to group by tenure so that there is one row showing how many customers made a purchase with that tenure and how many customers experienced the tenure and so could have made a purchase. This is the data needed to estimate the intensity function.
In Gordon Linoff's book, Data Analysis Using SQL and Excel, he provides sample code for making a related, but different table using the data available on the book's companion page. I reproduce it here for reference.

The code uses the DATEDIFF function to subtract a household's first order date from all its other order dates to put things on the tenure timeline. It then counts the number of second (or third, or fourth, . . .) purchases that happen at each tenure. This query does not track the population at risk so it is not the actual intensity function, but it never the less gives a nice visual image of the way intensity peaks at yearly intervals as many customers make regular annual purchases, just as the purchasers of calendars in the previous posting did.


Thursday, November 1, 2007

Constructing a model set for binary outcome churn

Yesterday I received the following question from a reader who is trying to build a churn model for a business where refill purchases are expected to occur annually. The post raises several questions including how to define churn when it happens passively, how to prepare data for a binary outcome churn model, and whether it might be more appropriate to model refills as a repeating event. Although this question happens to be about annual planning book refills, the situation is similar with prepaid phone cards, transit passes, toner cartridges, etc. I will address the issue of modeling repeating events in a follow-up post, but first I will answer the question that was actually asked.

I need advise. I hope you do not mind me asking questions.

Our Churn variable definition is if customer did not purchased in 13 months then we consider this customer has churned.

In this situation, if I want to build a model to see who is likely to leave, my churn variable will take values …

Churn = 1 (when last purchased date > 13 month)
else Churn = 0

After building a model, my Scoring data (To figure out who is likely to leave) should be…….

1. Customers who purchased within 13 months to see who are likely to leave or

2. Entire database or maybe 4 year buyers (customers whose last purchase date is within 4 years)?? Or

3. Use Modeling file which I have used create churn model as Scoring file?

Please let me know.


With Best Regards,

First some context. I know from her email address (which I have removed to protect her from spam) that Nilima works for a company that sells planners and pocket calendars. The planners have an outer cover that lasts for years. When you order a planner, it comes with a year's worth of pages. As part of the order you specify what month to start with. A year later, you should need a refill. The product is not useful without its refill pages, so if 13 months go by without an order, it is likely that the customer has been lost. (Perhaps he or she now synchronizes a PDA with Outlook, or uses Google Apps, or is now enjoying a schedule-free retirement.)

As an aside, a purely time-since-last-purchase based definition of churn would not work if the product in question were wall calendars that only cover a particular year. In that case, the definition of churn might be "hasn't made a purchase by the end of January" without regard to when the previous purchase was made. There is undoubtedly also a fair amount of seasonality in the purchase of planners--the beginning of the calendar year and the beginning of the academic year seem like likely times to make an initial purchase--but that's OK. The business problem is to identify customers likely to not refill on their anniversary. For this purpose, it is not important that some months have more of these anniversaries than others.

The Data
The questioner is not a client of ours and I have never seen her data. I will assume that she has several years of history and that there is data for every customer who ever made a purchase during that time. I will further assume that all purchases are captured and that she can reliably link a purchase to a purchaser so repeat purchases are recognized as such. The business goal is to score all active, at-risk customers with a churn probability (or, equivalently and more cheerfully, with a refill probability). Presumably, customers with a high enough churn score will be given some extra incentive to refill.

It sounds as though Nilima has already taken the first step which is to summarize the purchase transactions to create a customer signature with one row per customer and columns describing them. Possible fields include

Fields derived from purchase data
  • number of past refills
  • months since last refill
  • months since first purchase
  • original product purchased
  • number of contacts since last refill
Fields captured at registration time
  • Age at time of first purchase
  • Sex
  • Country
  • Postal code or Zip code
Fields derived from the above and in combination with census data
  • Age at scoring time
  • Zip median income
  • Zip population density
  • Zip percent foreign born
Fields that could be purchased from a data vendor
  • Estimated household income
  • Estimated number of children
  • Estimated number of cars
  • Cluster assignment (e.g. "urban achievers", "bible devotion")
Rolling Back the Clock
Building a predictive model requires data from two distinct time periods. All data is from the past. To build a predictive model, you find patterns in data from the distant past that explain results in the more recent past. The result is a model that can be applied today to predict things that will happen in the future.

In the current case, you could take a snapshot of what all active customers looked like 14 months ago as your data from the distant past. In this data set, all of the tenure fields and count fields are reset to what they looked like way back when. Some customers now considered lapsed were still active. Some customers who have now made 4 refills had only made three. Customers who are now 65 were only 63, and so forth. Your data from the recent past would then be a single flag indicating whether the customer made a refill within 13 months of his or her previous refill or initial purchase. Note that because the churn definition is in terms of months since last purchase, the calendar date when a customer becomes lapsed must be calculated separately for each customer.

SAS PROC SQL Code Example
As I said, I have not seen the data that prompted Nilima's question. I do have some similar data that I can share with readers, however. Gordon Linoff and I teach a 2-day class on Applying Survival Analysis for Business Time-to-Event Problems. For that class we use a customer signature with a row for each subscriber, past and present, of a mobile phone company. You can get the data by registering on our web site.

The focus of the class is on calculating hazard probabilities for each tenure and using them to create survival curves that can be used to predict a subscriber's remaining lifetime and create subscriber level forecasts. If we wanted to use that data for a binary outcome churn model, we would have to roll back time as described above. The following SAS code creates a dataset of customers who were active 100 days before the extract or cutoff date. Time is rolled back so that subscribers appear as they did at the observation date. In particular, the subscriber's tenure and age are defined as of the observation date.

The code does a few other interesting things that may be worth noting. In the mobile telephony industry, handset is a known driver of churn. Subscribers know that they can get a new, cooler phone by signing up with a competitor as a new subscriber. Subscribers with uncool phones are most at risk, but which phones are the least cool is constantly changing over time. Therefore, rather than trying to incorporate the handset model into the model, we incorporate the churn rate associated with each model in the 100 days before the observation date by counting the number of people who stopped with each model and dividing by the number of people carrying each model.

Another big factor in churn is whether subscribers are on or off contract. Subscribers on contract must pay a fee to cancel their subscriptions. This code calculates two flags--one indicating whether the subscriber is off-contract as of the observation date and another indicating whether the subscriber is scheduled to go off contract (and so become more likely to churn) before the cutoff date.

The code creates 3 future variables, any of which could be the target for a binary outcome churn model. FutureCHURN is true for anyone who stopped for any reason between the observation date and the cutoff date. FutureVOLUNTARY is true for anyone who stopped voluntarily and FutureINVOLUNTARY is true for anyone who stopped involuntarily.

SQL code