Monday, July 27, 2009

Time to Event Models, When the Event Is Not Churn

Dear Data Miners,

I am trying to build a churn model to predict WHEN customers will become paying members.
Process:

1. Person comes to our web site.
2. They register for free to use the site.
3. If they want to have more access to the site and use more features, they pay us.

What are the issues I should consider when I decide to set a cut date, the first step towards censoring the data?

For a classic churn model, we want to know when someone will stop paying us and leave our phone company. We censor those whose final status we do not know past our censor point.

I want to know when they will pay us, and censor those for whom I don't know whether they will pay us in the future.

Is the cut date choice arbitrary or is there some sampling rule?

Thank you,
Daryl

Daryl,

Your example is a time-to-event model that does not represent churn. There are many such examples in business (and this is something discussed in Data Analysis Using SQL and Excel in a bit of depth).

Think of your situation as two different time-to-event problems:

(1) A person visits the web site; what happens next? Does the person return to the web site or register? This is a time-to-event problem, and the analysis can provide information on customer registrations, particularly the lag between the initial visit and the registration.

(2) A person registers for free; how long until that person buys something? This can provide insight on paying visitors.

Once you have broken the problem into these pieces, imagining the customer signature is easier. For the first problem, the customer signature is a picture of customers when they initially visit (or for each pre-registration visit, for a time-to-next event problem). The "prediction" columns are the date of the registration (or for time-to-event, the date of the next visit and whether it involves a registration).

The second component is a picture of the customer when they first register, and the prediction columns are when (and whether) the customer ever pays for anything. In this case, it is very important to treat this as a time-to-event problem, because older registrations have had more opportunity to pay for something, and the analysis needs to take this into account.

As for the censor date, it is the most recent date of the data. So, if you have data through the end of yesterday, then that is the censor date. For instance, for the second component of the analysis, customers who registered before yesterday but never paid would have their outcomes censored (these customers have not paid yet but they may pay in the future).
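To make the censoring concrete, here is a minimal sketch in Python with pandas (not from the original exchange; the column names, dates, and data are invented for illustration). It builds the two prediction columns for the second component: the time from registration to first payment, and a flag indicating whether the payment was observed or the record is censored at the most recent data date.

    import pandas as pd

    # Hypothetical registration data; column names are illustrative only.
    df = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "registration_date": pd.to_datetime(["2009-03-01", "2009-05-15", "2009-07-01"]),
        "first_payment_date": pd.to_datetime(["2009-04-10", None, None]),
    })

    # Censor date = most recent date of the data (e.g., end of yesterday).
    censor_date = pd.Timestamp("2009-07-26")

    # Event flag: 1 if the customer has paid by the censor date, 0 if censored.
    df["paid"] = df["first_payment_date"].notna().astype(int)

    # Tenure: days from registration to payment, or to the censor date for
    # customers who have not paid yet.
    end_date = df["first_payment_date"].fillna(censor_date)
    df["days_to_payment"] = (end_date - df["registration_date"]).dt.days

    # These two columns (days_to_payment, paid) are exactly what a survival
    # routine, such as a Kaplan-Meier estimator, needs as input.
    print(df[["customer_id", "days_to_payment", "paid"]])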

Friday, June 26, 2009

When Customers Start and End

In texts on credit scoring, some effort almost always goes into defining what is to be considered a "bad" credit. The Basel framework provides a rather precise definition of what counts as a default.

But I have rarely seen the same in predicting cross-sell, up-sell, or churn. I do, however, remember attending an SPSS conference where churn of pre-paid cards was discussed. Churn, in that case, was defined as a number of consecutive periods in which the number of calls fell below a certain level.

In the past, I've used start and end dates of contracts, as well as a simple increase (or decrease) in the number of products that a customer has over time as indicators of what to target.

I'd be really interested in hearing how you define and extract targets, be it in telecom, banking, cards or any other business where you use prediction. For instance, how would you go looking for customers that have churned? Or for that matter, customers where up-sell has been successful?

This may be too simple a question, but if there are standard methods that you use, I'd be really interested in learning about them.
--Ola


Ola,

This is not a simple question at all. Or rather, the simplest questions are often the most illuminating.

The place where I see the biggest issues in defining starts and stops is in survival data mining (obligatory plug for my book Data Analysis Using SQL and Excel, which has two chapters on the subject). For the start date, I try to use (or approximate as closely as possible) the date when two things have occurred: the company has agreed to provide a product or service, and the customer has agreed to pay for it. In the case of post-pay telecoms, this would be the activation date -- and there are similar dates in many other industries, as varied as credit cards, cable subscriptions, and health insurance.

The activation date is often well-defined because the number of active customers gets reported through some system tied to the financial systems. Even so, there are anomalies. I recently completed a project at a large newspaper, and used their service start date as the activation date. Alas, at times, customers with start dates did not necessarily receive the paper on that date -- often because the newspaper delivery person could not find the address.

The stop date is even more fraught with complication, because there are a variety of different dates to choose from. For voluntary churn, there is the date the customer requests termination of the service. There is also the date when the service is actually turned off. Which to use? It depends on the application. To count active customers, we want the service cut-off date. To plan for customer retention efforts, we want to know when they call in.

Involuntary churn is also complicated, because there are a series of steps, often called the Dunning Process, which keeps track of customers who do not pay. At what point does a non-paying customer stop? When the service stops? When the bill is written off or settled? At some arbitrary point, such as 60 or 90 days of non-payment? To further confuse the situation, the business may change its rules over time. So, during some periods of time or for some customers, 60 days of non-payment results in service cutoff. For other periods or customers, 90 days might be the rule.
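As a hedged illustration of how slippery the involuntary stop date can be, here is a small pandas sketch (the accounts, segments, and cutoff rules below are invented). It flags non-paying customers using a non-payment cutoff that varies by segment, which is one way the 60-day versus 90-day rules show up in practice.

    import pandas as pd

    # Hypothetical billing snapshot; column names are illustrative only.
    accounts = pd.DataFrame({
        "account_id": [101, 102, 103],
        "segment": ["consumer", "consumer", "business"],
        "last_payment_date": pd.to_datetime(["2009-03-01", "2009-05-20", "2009-02-15"]),
    })
    as_of = pd.Timestamp("2009-06-26")

    # The business rule can differ by segment (or change over time):
    # 60 days of non-payment for some, 90 for others.
    cutoff_days = {"consumer": 60, "business": 90}

    accounts["days_nonpayment"] = (as_of - accounts["last_payment_date"]).dt.days
    accounts["involuntary_stop"] = accounts.apply(
        lambda row: row["days_nonpayment"] >= cutoff_days[row["segment"]], axis=1
    )
    print(accounts)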

Often, I find multiple time-to-event problems in this scenario. How long does it take a non-paying customer to stop, if ever? How long after customers sign up do they actually begin receiving the service?

In your particular case, the contract start date is probably a good place to start. However, the contract end date might or might not be appropriate, since this might not be updated to reflect when a customer actually stops.

--gordon

Monday, June 8, 2009

Confidence in Logistic Regression Coefficients

I work in the marketing team of a telecom company and I recently encountered an annoying problem with an upsell model. Since the monthly sale rate is less than 1% of our customer base, I used oversampling as you mentioned in your book ‘Mastering Data Mining’, with data over the last 3 sales months, so that I had a ratio of about 15% buyers and 85% non-buyers (sample size of about 20K). Using alpha=5%, I got parameter estimates which were, from a business perspective, entirely explicable. However, when I then re-estimated the model on the total customer base to obtain the ‘true’ parameter estimates which I will use for my monthly scoring, two effects were suddenly insignificant at alpha=5%.

I had never encountered this and was wondering what to do with these effects: should I kick them out of the model or not? I decided to keep them in, since they did have some business meaning, and concluded that they must have become insignificant because they apply only to a micro-segment of the entire population.
In your opinion, did I interpret this correctly? . . .
Many thanks in advance for your advice,
Wendy


Michael responds:

Hi Wendy,

This question has come up on the blog before. The short answer is that with a logistic regression model trained at one concentration of responders, it is a bit tricky to adjust the model to reflect the actual probability of response on the true population. I suggest you look at some papers by Gary King on this topic.


Gordon responds:

Wendy, I am not sure that Prof. King deals directly with your issue of changing confidence in the coefficient estimates. To be honest, I have never considered this issue. Since you bring it up, though, I am not surprised that it may happen.

My first comment is that the results seem usable, since they are explainable. Sometimes statistical modeling stumbles on relationships in the data that make sense, although they may not be fully statistically significant. Similarly, some relationships may be statistically significant, but have no meaning in the real world. So, use the variables!

Second, if I do a regression on a set of data, and then duplicate the data (to make it twice as big) and run it again, I'll get the same estimates as on the original data. However, the confidence in the coefficients will increase. I suspect that something similar is happening with your data.
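Here is a small simulation sketch in Python with statsmodels (not the tool or data from the question) that illustrates this point: duplicating the rows leaves the logistic regression coefficients unchanged but shrinks the reported standard errors by a factor of roughly the square root of two.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 5000
    x = rng.normal(size=n)
    p = 1 / (1 + np.exp(-(-2.0 + 0.5 * x)))      # true logistic relationship
    y = rng.binomial(1, p)

    X = sm.add_constant(pd.DataFrame({"x": x}))
    fit1 = sm.Logit(y, X).fit(disp=0)

    # Duplicate every row and refit.
    X2 = pd.concat([X, X], ignore_index=True)
    y2 = np.concatenate([y, y])
    fit2 = sm.Logit(y2, X2).fit(disp=0)

    print(fit1.params.values, fit1.bse.values)  # coefficients and standard errors
    print(fit2.params.values, fit2.bse.values)  # same coefficients, smaller standard errors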

If you want to fix that particular problem, then use a tool (such as SAS Enterprise Miner, and probably proc logistic) that supports a frequency option on each row. Set the frequency to one for the rarer events and to an appropriate value less than one for the more common events. I do this as a matter of habit, because it works best for decision trees. You have pointed out that the confidence in the coefficients is also affected by the frequencies, so this is a good habit with regressions as well.
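A rough Python analogue of that frequency option is sketched below, using the freq_weights argument of statsmodels' GLM (the data are simulated; fractional frequency weights are used here only to illustrate the downweighting idea, since frequency weights are ordinarily whole counts).

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical oversampled model set: roughly 15% buyers, 85% non-buyers.
    rng = np.random.default_rng(1)
    n = 20000
    x = rng.normal(size=n)
    y = rng.binomial(1, 1 / (1 + np.exp(-(-1.7 + 0.6 * x))))

    # Weight of 1 for the rarer class (buyers) and a weight below 1 for the
    # more common class, so the weighted class totals balance.
    w = np.where(y == 1, 1.0, y.mean() / (1 - y.mean()))

    X = sm.add_constant(pd.DataFrame({"x": x}))
    fit = sm.GLM(y, X, family=sm.families.Binomial(), freq_weights=w).fit()
    print(fit.summary())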


Sunday, May 10, 2009

Not Enough Data

An article in yesterday's New York Times reminded me of "bad" examples of data mining. By bad examples, I mean that spurious correlations are given credence -- enough credence to make it into a well-reputed national newspaper.

The article, entitled "Eat Quickly, for the Economy's State" is about a leisure time report from the OECD that shows a correlation between the following two variables:
  • Change in real GNP in 2008; and,
  • Amount of time people spend eating and drinking in a given day.
The study is based on surveys from 17 countries (for more information on the survey, you can check this out).

The highlight is a few charts showing that countries such as Mexico, Canada, and the United States have the lowest time spent eating (under 75 minutes per day), versus countries such as New Zealand, France, and Japan (over 110 minutes per day). The first group of countries has higher growth rates, both in 2008 and for the past few years.

My first problem with the analysis is one of granularity. Leisure time is measured per person, but GNP is measured over everyone. One big component of GNP growth is population growth, and different countries have very different patterns of population growth. The correct measure would be per capita GNP. Taking this into account would dampen the GNP growth figures for growing countries such as Mexico and the United States, and increase the GNP growth figures for slower-growing (or shrinking) countries such as Italy, Germany, and Japan.

Also, the countries where people eat more leisurely have other characteristics in common. In particular, they tend to have older populations and lower (or even negative) rates of population growth. One wonders if speed eating is a characteristic of younger people and leisurely eating is a characteristic of older people.

The biggest problem, though, is that this is, in all likelihood, a spurious correlation. One of the original definitions of data mining, which may still be used in the economics and political worlds, is a negative one: data mining is looking for data to support a conclusion. The OECD surveys were done in 17 different countries. The specific result in the NYT article is "Countries in which people eat and drink less than 100 minutes per day grow 0.9% faster -- on average -- than countries in which people eat and drink more than 100 minutes per day".

In other words, the 17 countries were divided into two groups, and the growth rates were then measured for each group. Let's look at this in more detail.

How many ways are there to divide 17 countries into 2 groups? The answer is 2^17 = 131,072 different ways (any particular country could be in either group). So, if we had 131,072 yes-or-no survey questions, then we would expect any combination to arise, including the combinations where all the high growth countries are in one group and all the low growth countries in the other. (I admit the exact figure differs a bit from 131,072, but that is unimportant to illustrate my point.)

The situation actually gets worse. The results are not yes-or-no; they are numeric measurements which are then used to split the countries into two groups. The splits could be at any value of the measure. So, any given measurement results in 17-1=16 different possible splits (the first group having the country with the lowest measurement, with the two lowest, and so on). Now we only need about 8,192 uncorrelated measurements to get all possibilities.

However, we do not need all possibilities. A glance at the NYT article shows that the country with the worst 2008 growth is Poland, yet it is in the fast-eating group. And Spain -- in the slow eating group -- is the third fastest growing economy (okay, its GNP actually shrank but less than most others). So, we only need an approximation of a split, where the two groups look different. And then, voila! we get a news article.

The problem is that the OECD was able to measure dozens or hundreds of different things in their survey. My guess is that measures such as "weekly hours of work in main job," "time spent retired," and "time spent sleeping" -- just a few of the many possibilities -- did not result in interesting splits. Eventually, though, a measure such as "time spent eating and drinking" results in a split where the different groups look "statistically significant" but they probably are not. If the measure is interesting enough, then it can become an article in the New York Times.

This is really a problem of multiple comparisons with statistical significance. The challenge is that a p-value of 0.01 means that something has only a 1% chance of happening at random. However, if we look at 100 different measures, then there is a really, really good chance that at least one of them will have a p-value of 0.01 or less. By the way, there is a statistical adjustment called the Bonferroni correction to take this into account (this, as well as others, is described in Wikipedia).
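A quick simulation makes the multiple-comparisons point concrete. In the sketch below (pure noise, nothing to do with the actual OECD data), 100 unrelated "measures" are used to split 17 countries into two groups; some splits look significant at the usual thresholds, and the Bonferroni correction wipes them out, as it should.

    import numpy as np
    from scipy.stats import ttest_ind
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(42)
    n_countries, n_measures = 17, 100

    growth = rng.normal(size=n_countries)                  # "GNP growth", pure noise
    measures = rng.normal(size=(n_measures, n_countries))  # 100 unrelated survey measures

    # For each measure, split the countries at the median and compare growth rates.
    pvals = []
    for m in measures:
        fast = growth[m < np.median(m)]
        slow = growth[m >= np.median(m)]
        pvals.append(ttest_ind(fast, slow).pvalue)
    pvals = np.array(pvals)

    print("smallest raw p-value:", pvals.min())
    print("measures 'significant' at 0.05:", (pvals < 0.05).sum())

    # Bonferroni correction: essentially nothing survives, as expected for noise.
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
    print("significant after Bonferroni:", reject.sum())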

Fortunately, neither the OECD nor the New York Times talks about this discovery as an example of data mining. It is just poor data analysis, but poor data analysis that can reinforce lessons in good data analysis. Lately, I have been noticing more examples of articles such as this, where researchers -- or perhaps just journalists -- extrapolate from very small samples to make unsupported conclusions. These are particularly grating when they appear in respected newspapers, magazines, and journals.

Data mining is not about finding spurious correlations and claiming some great discovery. It is about extracting valuable information from large quantities of data, information that is stable and useful. Smaller amounts of data often contain many correlations. Often, these correlations are going to be spurious. And without further testing, or at least a mechanism to explain the correlation, the results should not be mentioned at all.

Saturday, April 25, 2009

When There Is Not Enough Data

I have a dataset with a continuous target variable that has to be estimated. However, in the given dataset, values for the target are present for only 2% of the records, while the remaining 98% are empty. I need to score the dataset and give values for the target for all 2,500 records. Can I use the 2% and replicate it several times, and use that dataset to build a model? The ASE is too high if I use the 2% data alone. Any suggestions on how to handle this, please?
Thanks,
Sneha

Sneha,

The short answer to your question is "Yes, you can replicate the 2% and use it to build a model." BUT DO NOT DO THIS! Just because a tool or technique is possible to implement does not mean that it is a good idea. Replicating observations "confuses" models, often by making the model appear overconfident in its results.

Given the way that ASE (average squared error) is calculated, I don't think that replicating data is going to change the value. We can imagine adding a weight or frequency on each observation instead of replicating them. When the weights are all the same, they cancel out in the ASE formula.
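A tiny numeric sketch of that point (the numbers are invented): replicating every observation leaves the average squared error exactly unchanged, because the constant weight cancels out of the average.

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.normal(size=50)                        # the 50 known target values
    y_hat = y + rng.normal(scale=0.5, size=50)     # some model's predictions

    ase = np.mean((y - y_hat) ** 2)

    # Replicate every observation ten times: the ASE is identical.
    y_rep, y_hat_rep = np.tile(y, 10), np.tile(y_hat, 10)
    ase_rep = np.mean((y_rep - y_hat_rep) ** 2)

    print(ase, ase_rep)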

What does change is confidence in the model. So, if you are doing a regression and looking at the regression coefficients, each has a confidence interval. By replicating the data, the resulting model would have smaller confidence intervals. However, these are false, because the replicated data has no more information than the original data.

The problem that you are facing is that the modeling technique you are using is simply not powerful enough to represent the 50 observations that you have. Perhaps a different modeling technique would work better, although you are working with a small amount of data. For instance, perhaps some sort of nearest neighbor approach would work well and be easy to implement.
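For example, a nearest neighbor approach might look like the following sketch in Python with scikit-learn (the data here are simulated stand-ins for the 2,500 records and 50 known targets).

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    # Hypothetical data: 2,500 records with a few numeric inputs,
    # and a target known for only 50 of them.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2500, 4))
    y = np.full(2500, np.nan)
    labeled = rng.choice(2500, size=50, replace=False)
    y[labeled] = 2 * X[labeled, 0] + rng.normal(scale=0.1, size=50)

    known = ~np.isnan(y)
    model = KNeighborsRegressor(n_neighbors=5).fit(X[known], y[known])

    # Score every record, including the 98% with no known target value.
    scores = model.predict(X)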

You do not say why you are using ASE (average squared error) as the preferred measure of model fitness. I can speculate that you are trying to predict a number, perhaps using a regression. One challenge is that the numbers being predicted often fall into a particular range (such as positive numbers for dollar values or ranging between 0 and 1 for a percentage). However, regressions produce numbers that run the gamut of values. In this case, transforming the target variable can sometimes improve results.
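As a hedged example of such a transformation: for a strictly positive dollar-value target, one might fit the regression on the log scale and transform the predictions back, so the scores stay in the valid range (simulated data, scikit-learn).

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical positive dollar-value target.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 3))
    dollars = np.exp(1.0 + X[:, 0] + rng.normal(scale=0.2, size=50))

    # Fit on log(1 + target) so the regression works on an unbounded scale...
    model = LinearRegression().fit(X, np.log1p(dollars))

    # ...then transform predictions back to dollars, which remain positive.
    pred_dollars = np.expm1(model.predict(X))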

In our class on data mining (Data Mining Techniques: Theory and Practice), Michael and I introduce the idea of oversampling rare data using weights in order to get a balanced model set. For instance, if you were predicting whether someone was in the 2% group, you might give each of them a weight of 49 and all the unknowns a weight of 1. The result would be a balanced model set. However, we strongly advise that the maximum weight be 1. So, the weights would be 1/49 for the common cases and 1 for the rare ones. For regressions, this is important because it prevents any coefficients from having too-narrow confidence intervals.

Monday, April 13, 2009

Customer-Centric Forecasting White Paper Available

In our consulting practice, we work with many subscription-based businesses including newspapers, mobile phone companies, and software-as-a-service providers. All of these companies need to forecast future subscriber levels. With production support from SAS, I have recently written a white paper describing our approach to creating such forecasts. Very briefly, the central idea is that the subscriber population is a constantly changing mix of customer segments based on geography, acquisition channel, product mix, subscription type, payment type, demographic characteristics, and the like. Each of these segments has a different survival curve. Overall subscriber numbers come from aggregating planned additions and forecast losses at the segment level. Managers can simulate the effects of alternative acquisition strategies by changing assumptions about the characteristics of future subscribers and watching how the forecast changes. The paper is available on our web site. I will also be presenting a keynote talk on customer-centric forecasting on July 1st at the A2009 conference in Copenhagen.
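As a rough toy sketch of the aggregation idea (this is not the white paper's implementation; the segments, survival curves, and planned additions below are invented), each monthly cohort of planned additions is decayed along its segment's survival curve and the results are summed. The existing subscriber base would be handled the same way, starting each customer from his or her current tenure.

    import numpy as np

    # Toy monthly survival curves per segment (fraction still active after m months).
    survival = {
        "online_acquired": np.array([1.00, 0.90, 0.82, 0.76, 0.71, 0.67]),
        "direct_mail":     np.array([1.00, 0.95, 0.91, 0.88, 0.85, 0.83]),
    }

    # Planned monthly additions per segment over a six-month horizon.
    planned_adds = {
        "online_acquired": [500, 500, 600, 600, 700, 700],
        "direct_mail":     [200, 200, 200, 150, 150, 150],
    }

    horizon = 6
    forecast = np.zeros(horizon)
    for segment, adds in planned_adds.items():
        curve = survival[segment]
        for start_month, n_added in enumerate(adds):
            # Each cohort of additions decays along its segment's survival curve.
            for m in range(start_month, horizon):
                forecast[m] += n_added * curve[m - start_month]

    print(forecast)  # forecast subscribers from new additions, by month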

Friday, April 10, 2009

Rexer Analytics Data Mining Survey

Karl Rexer of Rexer Analytics asked us to alert our readers that their annual survey of data miners is ongoing and will be available for a few more days. Click on the title to be taken to the survey page.