Monday, December 28, 2009

Differential Response or Uplift Modeling

Some time before the holidays, we received the following inquiry from a reader:

Dear Data Miners,



I’ve read interesting arguments for uplift modeling (also called incremental response modeling) [1], but I’m not sure how to implement it. I have responses from a direct mailing with a treatment group and a control group. Now what? Without data mining, I can calculate the uplift between the two groups but not for individual responses. With the data mining techniques I know, I can identify the ‘do not disturbs,’ but there’s more than avoiding mailing that group. How is uplift modeling implemented in general, and how could it be done in R or Weka?



[1] http://www.stochasticsolutions.com/pdf/CrossSell.pdf

I first heard the term "uplift modeling" from Nick Radcliffe, then of Quadstone. I think he may have invented it. In our book, Data Mining Techniques, we use the term "differential response analysis." It turns out that "differential response" has a very specific meaning in the child welfare world, so perhaps we'll switch to "incremental response" or "uplift" in the next edition. But whatever it is called, you can approach this problem in a cell-based fashion without any special tools. Cell-based approaches divide customers into cells or segments in such a way that all members of a cell are similar to one another along some set of dimensions considered to be important for the particular application. You can then measure whatever you wish to optimize (order size, response rate, . . .) by cell and, going forward, treat the cells where treatment has the greatest effect.

Here, the quantity to measure is the difference in response rate or average order size between treated and untreated groups of otherwise similar customers. Within each cell, we need a randomly selected treatment group and a randomly selected control group; the incremental response or uplift is the difference in average order size (or whatever) between the two. Of course some cells will have higher or lower overall average order size, but that is not the focus of incremental response modeling. The question is not "What is the average order size of women between 40 and 50 who have made more than 2 previous purchases and live in a neighborhood where average household income is two standard deviations above the regional average?" It is "What is the change in order size for this group?"
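To make the cell-based calculation concrete, here is a minimal sketch in Python. It assumes each customer record carries a cell label, a treated/untreated flag, and an order amount; the record layout, the function name, and the sample numbers are all made up for illustration.

```python
from collections import defaultdict

def uplift_by_cell(records):
    """Return {cell: treated mean - control mean} for cells that have both groups."""
    # sums[cell][treated] = [total order amount, customer count]
    sums = defaultdict(lambda: {True: [0.0, 0], False: [0.0, 0]})
    for cell, treated, amount in records:
        bucket = sums[cell][bool(treated)]
        bucket[0] += amount
        bucket[1] += 1
    uplift = {}
    for cell, groups in sums.items():
        (t_sum, t_n), (c_sum, c_n) = groups[True], groups[False]
        if t_n and c_n:  # a cell needs both a treatment and a control group
            uplift[cell] = t_sum / t_n - c_sum / c_n
    return uplift

# Hypothetical records: (cell, treated?, order amount)
records = [
    ("women 40-50", True, 60.0), ("women 40-50", True, 40.0),
    ("women 40-50", False, 30.0), ("women 40-50", False, 30.0),
    ("men 40-50", True, 20.0), ("men 40-50", True, 20.0),
    ("men 40-50", False, 25.0), ("men 40-50", False, 15.0),
]
cell_uplift = uplift_by_cell(records)
# List cells from highest to lowest uplift
for cell, lift in sorted(cell_uplift.items(), key=lambda kv: -kv[1]):
    print(f"{cell}: uplift = {lift:.2f}")
```

With real data, each cell would need enough treated and control customers for the difference to be statistically meaningful, which this sketch does not check.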

Ideally, of course, you should design the segmentation and assignment of customers to treatment and control groups before the test, but the reader who submitted the question has already done the direct mailing and tallied the responses. Is it now too late to analyze incremental response? That depends: If the control group is a true random control group and if it is large enough that it can be partitioned into segments that are still large enough to provide statistically significant differences in order size, it is not too late. You could, for instance, compare the incremental response of male and female customers.

A cell-based approach is only useful if the segment definitions are such that incremental response really does vary across cells. Dividing customers into male and female segments won't help if men and women are equally responsive to the treatment. This is the advantage of the special-purpose uplift modeling software developed by Quadstone (now Portrait Software). This tool builds a decision tree in which the splitting criterion maximizes the difference in incremental response. This automatically leads to segments (the leaves of the tree) characterized by either high or low uplift. That is a really cool idea, but the lack of such a tool is not a reason to avoid incremental response analysis.


Thursday, September 10, 2009

TDWI Question: Consolidating SAS and SPSS Groups

Yesterday, I had the pleasure of being on a panel for a local TDWI event here in New York focused on advanced analytics (thank you Jon Deutsch). Mark Madsen of Third Nature gave an interesting, if rapid-fire, overview of data mining technologies. Of course, I was excited to see that Mark included Data Analysis Using SQL and Excel as one of the first steps in getting started in data mining -- even before meeting me. Besides myself, the panel included my dear friend Anne Milley from SAS, Ali Pasha from Teradata, and a gentleman from Information Builders whose name I missed.

I found one of the questions from the audience to be quite interesting. The person was from the IT department of a large media corporation. He has two analysis groups, one in Los Angeles that uses SPSS and the other in New York that uses SAS. His goal, of course, is to reduce costs. He prefers to have one vendor. And, undoubtedly, the groups are looking for servers to run their software.

This is a typical IT-type question, particularly in these days of reduced budgets. I am more used to encountering such problems in the charged atmosphere of a client. The more relaxed atmosphere of a TDWI meeting perhaps gives a different perspective.

The groups are doing the same thing from the perspective of an IT director. Diving in a bit further, the two groups do very different things -- at least from my perspective. Of course, both are using software running on computers to analyze data. The group in Los Angeles is using SPSS to analyze survey data. The group in New York is doing modeling using SAS. I should mention that I don't know anyone in the groups, and only have the cursory information provided at the TDWI conference.

Conflict Alert! Neither group wants to change and both are going to put up a big fight. SPSS has a stronghold in the market for analyzing survey data, with specialized routines and procedures to handle this data. (SAS probably has equivalent functionality, but many people who analyze survey data gravitate to SPSS.) Similarly, the SAS programmers in New York are not going to take kindly to switching to SPSS, even if it offers the same functionality.

Each group has the skills and software that they need. Each group has legacy code and methods that are likely tied to their tools. The company in question is not a 20-person start-up. It is a multinational corporation. Although the IT department might see standardizing on a tool as beneficial, in actual fact, the two groups are doing different things and the costs of switching are quite high -- and might involve losing skilled people.

This issue brings up the question of what we want to standardize on. The value of advanced analytics comes in two forms. The first is the creative process of identifying new and interesting phenomena. The second is the communication process of spreading the information where it is needed.

Although people may not think of nerds as being creative, really, we are. It is important to realize that imposing standards or limiting resources may limit creativity, and hence the quality of the results. This does not mean that cost control is unnecessary. Instead, it means that there are intangible costs that may not show up in a standard cost-benefit analysis.

On the other hand, communicating results through an organization is an area where standards are quite useful. Sometimes the results might be captured as a simple email going to the right person. Other times, the communication must go to a broader audience. Whether by setting up an internal Wiki, updating model scores in a database, or loading a BI tool, having standards is important in this case. Many people are going to be involved, and these people should not have to learn special tools for one-off analyses -- so, if you have standardized on a BI tool, make the resources available to put in new results. And, from the perspective of the analysts, having standard methods of communicating results simplifies the process of transforming smart analyses into business value.


Saturday, April 25, 2009

When There Is Not Enough Data

I have a dataset where the target variable (a continuous variable) has to be estimated. However, in the given dataset, values for the target are present for only 2%, while the remaining 98% do not have values. The 98% are empty values. I need to score a dataset and give values for the target for all 2500 records. Can I use the 2% and replicate it several times and use that dataset to build a model? The ASE is too high if I use the 2% data alone. Any suggestions how to handle it, please?
Thanks,
Sneha

Sneha,

The short answer to your question is "Yes, you can replicate the 2% and use it to build a model." BUT DO NOT DO THIS! Just because a tool or technique is possible to implement does not mean that it is a good idea. Replicating observations "confuses" models, often by making the model appear overconfident in its results.

Given the way that ASE (average squared error) is calculated, I don't think that replicating data is going to change the value. We can imagine adding a weight or frequency on each observation instead of replicating them. When the weights are all the same, they cancel out in the ASE formula.
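A small Python check of this cancellation, as a sketch only. The function name and the numbers are made up, and the predictions are treated as coming from some fixed, already-built model.

```python
def weighted_ase(actual, predicted, weights):
    """Weighted average squared error: sum(w * err^2) / sum(w)."""
    total = sum(w * (a - p) ** 2 for a, p, w in zip(actual, predicted, weights))
    return total / sum(weights)

actual = [10.0, 12.0, 9.0, 15.0]      # made-up target values
predicted = [11.0, 11.0, 10.0, 13.0]  # made-up predictions from a fixed model

ase_once = weighted_ase(actual, predicted, [1] * 4)
# A weight of 20 on every row is equivalent to 20 copies of each row
ase_replicated = weighted_ase(actual, predicted, [20] * 4)
print(ase_once, ase_replicated)
```

The common weight appears in both the numerator and the denominator, so the two values agree.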

What does change is confidence in the model. So, if you are doing a regression and looking at the regression coefficients, each has a confidence interval. By replicating the data, the resulting model would have smaller confidence intervals. However, these are false, because the replicated data has no more information than the original data.
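To see the false confidence in action, here is a minimal sketch using a one-variable least squares regression written in plain Python. The data points and the choice of 20 copies are made up for illustration; the standard error formula is the usual textbook one for a simple regression slope.

```python
import math

def slope_and_se(xs, ys):
    """Ordinary least squares slope for y = a + b*x and its standard error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals around the fitted line
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2) / sxx)
    return slope, se

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # made-up inputs
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]    # made-up targets
copies = 20
slope, se = slope_and_se(xs, ys)
slope_rep, se_rep = slope_and_se(xs * copies, ys * copies)  # replicate every row
print(f"original:   slope={slope:.3f}, se={se:.4f}")
print(f"replicated: slope={slope_rep:.3f}, se={se_rep:.4f}")
```

The slope is the same in both runs, but the standard error on the replicated data is several times smaller, even though no new information was added.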

The problem that you are facing is that the modeling technique you are using is simply not powerful enough to represent the 50 observations that you have. Perhaps a different modeling technique would work better, although you are working with a small amount of data. For instance, perhaps some sort of nearest neighbor approach would work well and be easy to implement.
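As a rough illustration of the nearest neighbor idea, here is a minimal Python sketch that estimates the target for an unlabeled record by averaging the targets of its closest labeled records. The feature values, the choice of k=3, and the use of plain Euclidean distance are all assumptions; in practice the inputs would usually need to be scaled first.

```python
import math

def knn_predict(train, query, k=3):
    """Average the targets of the k training rows closest to `query`.

    `train` is a list of (features, target) pairs; features are numeric tuples.
    """
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / len(nearest)

# Made-up labeled records: ((feature1, feature2), target)
labeled = [
    ((1.0, 1.0), 10.0), ((1.2, 0.9), 12.0), ((0.9, 1.1), 11.0),
    ((5.0, 5.0), 50.0), ((5.1, 4.8), 52.0), ((4.9, 5.2), 51.0),
]
estimate = knn_predict(labeled, (1.0, 1.05), k=3)
print(estimate)
```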

You do not say why you are using ASE (average squared error) as the preferred measure of model fitness. I can speculate that you are trying to predict a number, perhaps using a regression. One challenge is that the numbers being predicted often fall into a particular range (such as positive numbers for dollar values or ranging between 0 and 1 for a percentage). However, regressions produce numbers that run the gamut of values. In this case, transforming the target variable can sometimes improve results.

In our class on data mining (Data Mining Techniques: Theory and Practice), Michael and I introduce the idea of oversampling rare data using weights in order to get a balanced model set. For instance, if you were predicting whether someone was in the 2% group, you might give each of them a weight of 49 and all the unknowns a weight of 1. The result would be a balanced model set. However, we strongly advise that the maximum weight be 1. So, the weights would be 1/49 for the common cases and 1 for the rare ones. For regressions, this is important because it prevents any coefficients from having too-narrow confidence intervals.
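The weight arithmetic can be sketched in a few lines of Python. The function name is made up; the counts come from the question, where 2% of 2,500 records is 50 rare cases and the other 2,450 are common.

```python
def balancing_weights(n_rare, n_common):
    """Weights that balance two groups while keeping the maximum weight at 1."""
    if n_rare > n_common:
        raise ValueError("expected n_rare <= n_common")
    # Rare cases keep weight 1; common cases are scaled down to match them
    return 1.0, n_rare / n_common

rare_w, common_w = balancing_weights(n_rare=50, n_common=2450)
print(rare_w, common_w)
```

The total weight on each side is then the same (50 * 1 and 2450 * 1/49), so the model set is balanced without any observation counting for more than one.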






Thursday, June 5, 2008

Qualifications for Studying Data Mining

A recent question . . .

I am hoping to begin my masters degree in Data Mining. I have come from a Software Development primary degree. I am a bit worried over the math involved in Data Mining. Could you tell me, do I need to have a strong mathematical aptitude to produce a good Thesis on Data Mining?

First, I think a software development background is a good foundation for data mining. Data mining is as much about data (and hence computers and databases) as it is about analysis (and hence statistics, probability, and math).

Michael and I are not academics so we cannot speak to the thesis requirements for a particular data mining program. Both of us majored in mathematics (many years ago) and then worked as software engineers. We do have some knowledge of both fields, and the combination provided a good foundation for our data mining work.

To be successful in data mining, you do need some familiarity with math, particularly applied math -- things like practical applications of probability, algebra, the ability to solve word problems, and the ability to use spreadsheets. Unlike theoretical statistics, the purpose of data mining is not to generate rigorous proofs of various theorems; the purpose is to find useful patterns in data, to validate hypotheses, to set up marketing tests. We need to know when patterns are unexpected, and when patterns are expected.

This is a good place to add a plug for my book Data Analysis Using SQL and Excel, which has two or three chapters devoted to practical statistics in the context of data analysis.

In short, if you are math-phobic, then you might want to reconsider data mining. If your challenges in math are solving complex integrals, then you don't have much to worry about.

--gordon


Thursday, May 1, 2008

Statistical Test for Measuring ROI on Direct Mail Test

If I want to test the effect of return of investment on a mail/ no mail sample, however, I cannot use a parametric test since the distribution of dollar amounts do not follow a normal distribution. What non-parametric test could I use that would give me something similar to a hypothesis test of two samples?

Recently, we received an email with the question above. Since it was addressed to bloggers@data-miners.com, it seems quite reasonable to answer it here.

First, I need to note that Michael and I are not statisticians. We don't even play one on TV (hmm, that's an interesting idea). However, we have gleaned some knowledge of statistics over the years, much from friends and colleagues who are respected statisticians.

Second, the question I am going to answer is the following: Assume that we do a test, with a test group and a control group. What we want to measure is whether the average dollars per customer is significantly different for the test group as compared to the control group. The challenge is that the dollar amounts themselves do not follow a known distribution, or the distribution is known not to be a normal distribution. For instance, we might only have two products, one that costs $10 and one that costs $100.

The reason that I'm restating the problem is because a term such as ROI (return on investment) gets thrown around a lot. In some cases, it could mean the current value of discounted future cash flows. Here, though, I think it simply means the dollar amount that customers spend (or invest, or donate, or whatever depending on the particular business).

The overall approach is that we want to measure the average and standard error for each of the groups. Then, we'll apply a simple "standard error" of the difference to see if the difference is consistently positive or negative. This is a very typical use of a z-score. And, it is a topic that I discuss in more detail in Chapter 3 of my book "Data Analysis Using SQL and Excel". In fact, the example here is slightly modified from the example in the book.

A good place to start is the Central Limit Theorem. This is a fundamental theorem for statistics. Assume that I have a population of things -- such as customers who are going to spend money in response to a marketing campaign. Assume that I take a sample of these customers and measure an average over the sample. Well, as I take more and more samples, the distribution of the averages follows a normal distribution regardless of the original distribution of values. (This is a slight oversimplification of the Central Limit Theorem, but it captures the important ideas.)
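A quick simulation shows the idea, using the two-product example of $10 and $100 purchases. This is only a sketch: the 90/10 mix of purchases, the sample size of 500, and the 1,000 repetitions are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def draw_purchase():
    # Assumed mix: 90% of purchases are $10, 10% are $100
    return 100.0 if random.random() < 0.10 else 10.0

def sample_mean(size):
    """Average purchase over one random sample of customers."""
    return statistics.fmean(draw_purchase() for _ in range(size))

# Take many samples and look at how their averages are distributed
sample_means = [sample_mean(500) for _ in range(1000)]
print("mean of sample averages:", round(statistics.fmean(sample_means), 2))
print("std dev of sample averages:", round(statistics.stdev(sample_means), 2))
```

Individual purchases take only two values, yet the sample averages cluster around the population average of $19 (0.9 * 10 + 0.1 * 100), and a histogram of them would show the familiar bell shape.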

In addition, I can measure the relationship between the characteristics of the overall population and the characteristics of the sample:

(1) The average of the sample is as good an approximation as any of the average of the overall population.

(2) The standard error on the average of the sample is the standard deviation of the overall population divided by the square root of the size of the sample. Alternatively, we can phrase this in terms of variance: the variance of the sample average is the variance of the population divided by the size of the sample.

Well, we are close. We know the average of each sample, because we can measure the average. If we knew the standard deviation of the overall population, then we could get the standard error for each group. Then, we'd know the standard error and we would be done. Well, it turns out that:

(3) The standard deviation of the sample is as good an approximation as any for the standard deviation of the population. This is convenient!
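Points (1) through (3) together give a recipe that can be sketched in Python: estimate the average from the sample, estimate the standard deviation from the sample, and divide by the square root of the sample size. The function name and the purchase amounts below are made up for illustration.

```python
import math
import statistics

def mean_and_standard_error(values):
    """Sample mean and its standard error (sample std dev / sqrt(n))."""
    n = len(values)
    return statistics.fmean(values), statistics.stdev(values) / math.sqrt(n)

# Made-up sample of purchases from the $10 / $100 example
purchases = [10.0, 10.0, 100.0, 10.0, 10.0, 100.0, 10.0, 10.0, 10.0]
avg, se = mean_and_standard_error(purchases)
print(f"average = {avg:.2f}, standard error = {se:.2f}")
```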

Let's assume that we have the following scenario.

Our test group has 17,839 customers, and the overall average purchase is $85.48. The control group has 53,537 customers, and the average purchase is $70.14. Is this statistically different?

We need some additional information, namely the standard deviation for each group. For the test group, the standard deviation is $197.23. For the control group, it is $196.67.

The standard error for the two groups is then $197.23/sqrt(17,839) and $196.67/sqrt(53,537), which comes to $1.48 and $0.85, respectively.

So, now the question is: is the difference of the means ($85.48 - $70.14 = $15.34) significantly different from zero? We need another formula from statistics to calculate the standard error of the difference. This formula says that the standard error of the difference is the square root of the sum of the squares of the two standard errors. So the value is $1.71 = sqrt(0.85^2 + 1.48^2).

And we have arrived at a place where we can use the z-score. The difference of $15.34 is about 9 standard errors from 0 (that is, 9*1.71 is about 15.34). It is highly, highly, highly unlikely that the true difference is 0, so we can say that the test group is significantly better than the control group.
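The whole calculation fits in a few lines of Python. This is a sketch with a made-up function name; the p-value step uses the normal approximation and is an addition to the calculation in the text. Because the code carries full precision rather than rounding the two standard errors first, the standard error of the difference may differ from $1.71 in the last digit.

```python
import math

def z_test_difference(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """z-score and two-sided p-value for the difference of two sample means."""
    se_a = sd_a / math.sqrt(n_a)
    se_b = sd_b / math.sqrt(n_b)
    # Standard error of the difference: root of the sum of squared standard errors
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    z = (mean_a - mean_b) / se_diff
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
    return se_a, se_b, se_diff, z, p_value

se_test, se_control, se_diff, z, p_value = z_test_difference(
    85.48, 197.23, 17839,   # test group: mean, std dev, size
    70.14, 196.67, 53537,   # control group: mean, std dev, size
)
print(f"se_test={se_test:.2f} se_control={se_control:.2f} "
      f"se_diff={se_diff:.2f} z={z:.2f} p={p_value:.3g}")
```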

In short, we can apply the concepts of normal distributions, even to calculations on dollar amounts. We do need to be careful and pay attention to what we are doing, but the Central Limit Theorem makes this possible. If you are interested in this subject, I do strongly recommend Data Analysis Using SQL and Excel, particularly Chapter 3.

--gordon
