Thursday, June 5, 2008

Qualifications for Studying Data Mining

A recent question . . .

I am hoping to begin my masters degree in Data Mining. I have come from a Software Development primary degree. I am a bit worried over the math involved in Data Mining.Could you tell me, do I need to have a strong mathematical aptitude to produce a good Thesis on Data Mining?

First, I think a software development background is a good foundation for data mining. Data mining is as much about data (and hence computers and databases) as it is about analysis (and hence statistics, probability, and math).

Michael and I are not academics so we cannot speak to the thesis requirements for a particular data mining program. Both of us majored in mathematics (many years ago) and then worked as software engineers. We do have some knowledge of both fields, and the combination provided a good foundation for our data mining work.

To be successful in data mining, you do need some familiarity with math, particularly applied math -- things like practical applications of probability, algebra, the ability to solve word problems, and the ability to use spreadsheets. Unlike theoretical statistics, the purpose of data mining is not to generate rigorous proofs of various theorems; the purpose is to find useful patterns in data, to validate hypotheses, to set up marketing tests. We need to know when patterns are unexpected, and when patterns are expected.

This is a good place to add a plug for my book Data Analysis Using SQL and Excel, which has two or three chapters devoted to practical statistics in the context of data analysis.

In short, if you are math-phobic, then you might want to reconsider data mining. If your challenges in math are solving complex integrals, then you don't have much to worry about.

--gordon

Labels: , ,

Thursday, May 1, 2008

Statistical Test for Measuring ROI on Direct Mail Test

If I want to test the effect of return of investment on a mail/ no mail sample, however, I cannot use a parametric test since the distribution of dollar amounts do not follow a normal distribution. What non-parametric test could I use that would give me something similar to a hypothesis test of two samples?

Recently, we received an email with the question above. Since it was addressed to bloggers@data-miners.com, it seems quite reasonable to answer it here.

First, I need to note that Michael and I are not statisticians. We don't even play one on TV (hmm, that's an interesting idea). However, we have gleaned some knowledge of statistics over the years, much from friends and colleagues who are respected statisticians.

Second, the question I am going to answer is the following: Assume that we do a test, with a test group and a control group. What we want to measure is whether the average dollars per customer is significantly different for the test group as compared to the control group. The challenge is that the dollar amounts themselve do not follow a known distribution, or the distribution is known not to be a normal distribution. For instance, we might only have two products, one that costs $10 and one that costs $100.

The reason that I'm restating the problem is because a term such as ROI (return on investment) gets thrown around a lot. In some cases, it could mean the current value of discounted future cash flows. Here, though, I think it simply means the dollar amount that customers spend (or invest, or donate, or whatever depending on the particular business).

The overall approach is that we want to measure the average and standard error for each of the groups. Then, we'll apply a simple "standard error" of the difference to see if the difference is consistently positive or negative. This is a very typical use of a z-score. And, it is a topic that I discuss in more detail in Chapter 3 of my book "Data Analysis Using SQL and Excel". In fact, the example here is slightly modified from the example in the book.

A good place to start is the Central Limit Theorem. This is a fundamental theorem for statistics. Assume that I have a population of things -- such as customers who are going to spend money in response to a marketing campaign. Assume that I take a sample of these customers and measure an average over the sample. Well, as I take more an more samples, the distribution of the averages follows a normal distribution regardless of the original distribution of values. (This is a slight oversimplification of the Central Limit Theorem, but it captures the important ideas.)

In addition, I can measure the relationship between the characteristics of the overall population and the characteristics of the sample:

(1) The average of the sample is as good an approximation as any of the average of the overall population.

(2) The standard error on the average of the sample is the standard deviation of the overall population divided by the square root of the size of the sample. Alternatively, we can phrase this in terms of variance: the variance of the sample average is the variance of the population average divided by the size of the sample.

Well, we are close. We know the average of each sample, because we can measure the average. If we knew the standard deviation of the overall population, then we could get the standard error for each group. Then, we'd know the standard error and we would be done. Well, it turns out that:

(3) The standard deviation of the sample is as good an approximation as any for the standard deviation of the population. This is convenient!

Let's assume that we have the following scenario.

Our test group has 17,839 customers, and the overall average purchase is $85.48. The control group has 53,537 customers, and the average purchase is $70.14. Is this statistically different?

We need some additional information, namely the standard deviation for each group. For the test group, the standard deviation is $197.23. For the control group, it is $196.67.

The standard error for the two groups is then $197.23/sqrt(17,839) and $196.67/sqrt(53,537), which comes to $1.48 and $0.85, respectively.

So, now the question is: is the difference of the means ($85.48 - $70.14 = $15.34) significantly different from zero. We need another formula from statistics to calculate the standard error of the difference. This formula says that the standard error is the square root of the sums of the squares of standard errors. So the value is $1.71 = sqrt(0.85^2 + 1.48^2).

And we have arrived at a place where we can use the z-score. The difference of $15.34 is about 9 standard deviations from 0 (that is, 9*1.71 is about 15.34). It is highly, highly, highly unlikely that the difference includes 0, so we can say that the test group is significantly better than the control group.

In short, we can apply the concepts of normal distributions, even to calculations on dollar amounts. We do need to be careful and pay attention to what we are doing, but the Central Limit Theorem makes this possible. If you are interested in this subject, I do strongly recommend Data Analysis Using SQL and Excel, particularly Chapter 3.

--gordon

Labels: , , ,

Saturday, April 12, 2008

Using validation data in Enterprise Miner

Dear Sir/Madam,

I am a lecturer at De Montfort University in the UK and teach modules on
Data Mining at final year BSc and MSc level. For both of these we use the
Berry & Linoff Data Mining book. I have a couple of questions regarding SAS that I've been unable to find the answer to and I wondered if you could point in the direction of a source of info where I could find the answers. They are to do with partitioning data in SAS EM and how the different data sets are used. In the Help from SAS EM I see that it says the validation set is used in regression "to choose a final subset of predictors from all the subsets computed during stepwise regression" - so is the validation set not used in regression otherwise (e.g. in forward deletion and backward deletion)?

Also I'm not sure where we see evidence of the test set being used in any of the models I've developed (NNs, Decision Trees, Regression). I presume the lift charts are based on the actual model (resulting from the training and validation data sets) though I noticed if I only had a training and a validation data set (i.e. no test set) the lift chart gave a worse model.

I hope you don't mind me asking these questions - My various books and the help don't seem to explain fully but I know it must be documented somewhere.

best wishes, Jenny Carter

Dr. Jenny Carter
Dept. of Computing
De Montfort University
The Gateway
Leicester

Hi Jenny,

I'd like to take this opportunity to go beyond your actual question about SAS Enterprise Miner to make a general comment on the use of validation sets for variable selection in regression models and to guard against overfitting in decision tree and neural network models.

Historically, statistics grew up in a world of small datasets. As a result, many statistical tools reuse the same data to fit candidate models as to evaluate and select them. In a data mining context, we assume that there is plenty of data so there is no need to reuse the training data. The problem with using the training data to evaluate a model is that overfitting may go undetected. The best model is not the one that best describes the training data; it is the one that best generalizes to new datasets. That is what the validation is for. The details of how Enterprise Miner accomplishes this vary with the type of model. In no case does the test set get used for either fitting the model or selecting from among candidate models. Its purpose is to allow you to see how your model will do on data that was not involved in the model building or selection process.

Regression Models

When you use any of the model selection methods (Forward, Stepwise, Backward), you also get to select a method for evaluating the candidate models formed from different combinations of explanatory variables. Most of the choices make no use of the validation data. Akaike's Information Criterion and Schwarz's Bayesian Criterion both add a penalty term for the number of effects in the model to a function of the error sum of squares. This penalty term is meant to compensate for the fact that additional model complexity appears to lower the error on the training data even when the model is not actually improving. When you choose Validation Error as the selection criterion, you get the model that minimizes error on the validation set. That is our recommended setting. You must also take care to set Use Selection Default to No in the Model Selection portion of the property sheet of Enterprise Miner will ignore the rest of your settings.



When a training set, validation set, and test set are all present, Enterprise Miner will report statistics such as the root mean squared error for all three sets. The error on the test set, which is not used to fit models nor to select candidate models, is the best predictor of performance on unseen data.

Decision Trees

With decision trees, the validation set is used to select a subtree of the tree grown using the training set. This process is called "pruning." Pruning helps prevent overfitting. Some splits which have a sufficiently high worth (chai-square value) on the training data to enter the initial tree, fail to improve the error rate of the tree when applied to the validation data. This is especially likely to happen when small leaf sizes are allowed. By default, if a validation set is present, Enterprise Miner will use it for subtree selection.

Neural Networks

Training a neural network is an iterative process. Each training iteration adjusts the weights associated with each network connection. As training proceeds, the network becomes better and better at "predicting" the training data. By the time training stops, the model is almost certainly overfit. Each set of weights is a candidate model. The selected model is the one that minimizes error on the validation set. In the chart shown below, after 20 iterations of training the error on the training set is still declining, but the best model was reached after on 3 training iterations.


Labels:

Saturday, November 24, 2007

Constructing a Model Set for Reccuring Events

In the previous post, I answered a question about how to set up a model set for binary churn. It is fairly common for data miners to find ways to express almost any problem as a binary outcome since binary outcome problems are easily approached with familiar tools such as logistic regression or decision trees. The context for the questions suggests an alternate approach, however. The event of interest was the purchase of refill pages for a calendar/planner. This is an example of a recurring event. Other examples include:
  • Visits to a web page.
  • Purchases of additional minutes for a pre-paid phone plan.
  • Subscription renewals.
  • Repeat purchases.
  • Pregnancies.
  • Incarcerations.
  • Posts to a blog.
All of these are examples of counting processes. A counting process is one where each time an event occurs it increments a total count. The event frequency is governed by an intensity function which is a function of time and other covariates, much like the hazard function in survival analysis for non-recurring events. The intensity function can be estimated empirically, or it may be fit by a parametric or semi-parametric model using, for example, the SAS PHREG procedure. Either way, the data must first be transformed from the way it was probably recorded--dated transactions--to a form suitable for the required calculations.


These are customers making multiple purchases during an observation window. Each time a customer makes a purchase, a transaction record is created. When we add this data to a table in the counting process style, each customer contributes several rows. There is a row for the time from time 0, which may be the time of the initial purchase, to the second purchase, a row for the time to each subsequent purchase, and a row for the time between the final observed purchase and the end of the observation period.


Depending on the style of analysis used, each event may be seen as starting a new time 0 with the number of previous events as a covariate, or each event may be modeled separately with a customer only becoming part of the at-risk pool for event n after experiencing event n-1.
Either way, it is important to include the final censored time period. This period does not correspond to any transaction, but customers are "at risk" for another purchase during that period.

My approach to creating the table is to first create the table without the censored observations, which is reasonably straightforward. Each of these rows contains a flag indicating it is a complete, uncensored observation. Next I create just the censored observations by creating an observation going from the latest observed purchase to the end of the observation period (in this case, 22May2006). The censored rows can then be appended to the uncensored rows. These could, of course, be turned into subqueries in order to avoid creating the temporary tables.


This fully expanded version of the data is what is referred to as the counting process style of input. In a realistic situation where there might be millions of customers, it makes more sense to group by tenure so that there is one row showing how many customers made a purchase with that tenure and how many customers experienced the tenure and so could have made a purchase. This is the data needed to estimate the intensity function.
In Gordon Linoff's book, Data Analysis Using SQL and Excel, he provides sample code for making a related, but different table using the data available on the book's companion page. I reproduce it here for reference.


The code uses the DATEDIFF function to subtract a household's first order date from all its other order dates to put things on the tenure timeline. It then counts the number of second (or third, or fourth, . . .) purchases that happen at each tenure. This query does not track the population at risk so it is not the actual intensity function, but it never the less gives a nice visual image of the way intensity peaks at yearly intervals as many customers make regular annual purchases, just as the purchasers of calendars in the previous posting did.

Labels:

Thursday, November 1, 2007

Constructing a model set for binary outcome churn

Yesterday I received the following question from a reader who is trying to build a churn model for a business where refill purchases are expected to occur annually. The post raises several questions including how to define churn when it happens passively, how to prepare data for a binary outcome churn model, and whether it might be more appropriate to model refills as a repeating event. Although this question happens to be about annual planning book refills, the situation is similar with prepaid phone cards, transit passes, toner cartridges, etc. I will address the issue of modeling repeating events in a follow-up post, but first I will answer the question that was actually asked.
Michael,

I need advise. I hope you do not mind me asking questions.

Our Churn variable definition is if customer did not purchased in 13 months then we consider this customer has churned.

In this situation, if I want to build a model to see who is likely to leave, my churn variable will take values …

Churn = 1 (when last purchased date > 13 month)
else Churn = 0

After building a model, my Scoring data (To figure out who is likely to leave) should be…….

1. Customers who purchased within 13 months to see who are likely to leave or

2. Entire database or maybe 4 year buyers (customers whose last purchase date is within 4 years)?? Or

3. Use Modeling file which I have used create churn model as Scoring file?

Please let me know.

Thanks.

With Best Regards,

Nilima
First some context. I know from her email address (which I have removed to protect her from spam) that Nilima works for a company that sells planners and pocket calendars. The planners have an outer cover that lasts for years. When you order a planner, it comes with a year's worth of pages. As part of the order you specify what month to start with. A year later, you should need a refill. The product is not useful without its refill pages, so if 13 months go by without an order, it is likely that the customer has been lost. (Perhaps he or she now synchronizes a PDA with Outlook, or uses Google Apps, or is now enjoying a schedule-free retirement.)

As an aside, a purely time-since-last-purchase based definition of churn would not work if the product in question were wall calendars that only cover a particular year. In that case, the definition of churn might be "hasn't made a purchase by the end of January" without regard to when the previous purchase was made. There is undoubtedly also a fair amount of seasonality in the purchase of planners--the beginning of the calendar year and the beginning of the academic year seem like likely times to make an initial purchase--but that's OK. The business problem is to identify customers likely to not refill on their anniversary. For this purpose, it is not important that some months have more of these anniversaries than others.

The Data
The questioner is not a client of ours and I have never seen her data. I will assume that she has several years of history and that there is data for every customer who ever made a purchase during that time. I will further assume that all purchases are captured and that she can reliably link a purchase to a purchaser so repeat purchases are recognized as such. The business goal is to score all active, at-risk customers with a churn probability (or, equivalently and more cheerfully, with a refill probability). Presumably, customers with a high enough churn score will be given some extra incentive to refill.

It sounds as though Nilima has already taken the first step which is to summarize the purchase transactions to create a customer signature with one row per customer and columns describing them. Possible fields include

Fields derived from purchase data
  • number of past refills
  • months since last refill
  • months since first purchase
  • original product purchased
  • number of contacts since last refill
Fields captured at registration time
  • Age at time of first purchase
  • Sex
  • Country
  • Postal code or Zip code
Fields derived from the above and in combination with census data
  • Age at scoring time
  • Zip median income
  • Zip population density
  • Zip percent foreign born
Fields that could be purchased from a data vendor
  • Estimated household income
  • Estimated number of children
  • Estimated number of cars
  • Cluster assignment (e.g. "urban achievers", "bible devotion")
Rolling Back the Clock
Building a predictive model requires data from two distinct time periods. All data is from the past. To build a predictive model, you find patterns in data from the distant past that explain results in the more recent past. The result is a model that can be applied today to predict things that will happen in the future.

In the current case, you could take a snapshot of what all active customers looked like 14 months ago as your data from the distant past. In this data set, all of the tenure fields and count fields are reset to what they looked like way back when. Some customers now considered lapsed were still active. Some customers who have now made 4 refills had only made three. Customers who are now 65 were only 63, and so forth. Your data from the recent past would then be a single flag indicating whether the customer made a refill within 13 months of his or her previous refill or initial purchase. Note that because the churn definition is in terms of months since last purchase, the calendar date when a customer becomes lapsed must be calculated separately for each customer.

SAS PROC SQL Code Example
As I said, I have not seen the data that prompted Nilima's question. I do have some similar data that I can share with readers, however. Gordon Linoff and I teach a 2-day class on Applying Survival Analysis for Business Time-to-Event Problems. For that class we use a customer signature with a row for each subscriber, past and present, of a mobile phone company. You can get the data by registering on our web site.

The focus of the class is on calculating hazard probabilities for each tenure and using them to create survival curves that can be used to predict a subscriber's remaining lifetime and create subscriber level forecasts. If we wanted to use that data for a binary outcome churn model, we would have to roll back time as described above. The following SAS code creates a dataset of customers who were active 100 days before the extract or cutoff date. Time is rolled back so that subscribers appear as they did at the observation date. In particular, the subscriber's tenure and age are defined as of the observation date.

The code does a few other interesting things that may be worth noting. In the mobile telephony industry, handset is a known driver of churn. Subscribers know that they can get a new, cooler phone by signing up with a competitor as a new subscriber. Subscribers with uncool phones are most at risk, but which phones are the least cool is constantly changing over time. Therefore, rather than trying to incorporate the handset model into the model, we incorporate the churn rate associated with each model in the 100 days before the observation date by counting the number of people who stopped with each model and dividing by the number of people carrying each model.

Another big factor in churn is whether subscribers are on or off contract. Subscribers on contract must pay a fee to cancel their subscriptions. This code calculates two flags--one indicating whether the subscriber is off-contract as of the observation date and another indicating whether the subscriber is scheduled to go off contract (and so become more likely to churn) before the cutoff date.

The code creates 3 future variables, any of which could be the target for a binary outcome churn model. FutureCHURN is true for anyone who stopped for any reason between the observation date and the cutoff date. FutureVOLUNTARY is true for anyone who stopped voluntarily and FutureINVOLUNTARY is true for anyone who stopped involuntarily.

SQL code

Labels: