Thursday, November 1, 2007

Constructing a model set for binary outcome churn

Yesterday I received the following question from a reader who is trying to build a churn model for a business where refill purchases are expected to occur annually. The post raises several questions including how to define churn when it happens passively, how to prepare data for a binary outcome churn model, and whether it might be more appropriate to model refills as a repeating event. Although this question happens to be about annual planning book refills, the situation is similar with prepaid phone cards, transit passes, toner cartridges, etc. I will address the issue of modeling repeating events in a follow-up post, but first I will answer the question that was actually asked.
Michael,

I need advise. I hope you do not mind me asking questions.

Our Churn variable definition is if customer did not purchased in 13 months then we consider this customer has churned.

In this situation, if I want to build a model to see who is likely to leave, my churn variable will take values …

Churn = 1 (when last purchased date > 13 month)
else Churn = 0

After building a model, my Scoring data (To figure out who is likely to leave) should be…….

1. Customers who purchased within 13 months to see who are likely to leave or

2. Entire database or maybe 4 year buyers (customers whose last purchase date is within 4 years)?? Or

3. Use Modeling file which I have used create churn model as Scoring file?

Please let me know.

Thanks.

With Best Regards,

Nilima
First some context. I know from her email address (which I have removed to protect her from spam) that Nilima works for a company that sells planners and pocket calendars. The planners have an outer cover that lasts for years. When you order a planner, it comes with a year's worth of pages. As part of the order you specify what month to start with. A year later, you should need a refill. The product is not useful without its refill pages, so if 13 months go by without an order, it is likely that the customer has been lost. (Perhaps he or she now synchronizes a PDA with Outlook, or uses Google Apps, or is now enjoying a schedule-free retirement.)

As an aside, a purely time-since-last-purchase based definition of churn would not work if the product in question were wall calendars that only cover a particular year. In that case, the definition of churn might be "hasn't made a purchase by the end of January" without regard to when the previous purchase was made. There is undoubtedly also a fair amount of seasonality in the purchase of planners--the beginning of the calendar year and the beginning of the academic year seem like likely times to make an initial purchase--but that's OK. The business problem is to identify customers likely to not refill on their anniversary. For this purpose, it is not important that some months have more of these anniversaries than others.

The Data
The questioner is not a client of ours and I have never seen her data. I will assume that she has several years of history and that there is data for every customer who ever made a purchase during that time. I will further assume that all purchases are captured and that she can reliably link a purchase to a purchaser so repeat purchases are recognized as such. The business goal is to score all active, at-risk customers with a churn probability (or, equivalently and more cheerfully, with a refill probability). Presumably, customers with a high enough churn score will be given some extra incentive to refill.

It sounds as though Nilima has already taken the first step which is to summarize the purchase transactions to create a customer signature with one row per customer and columns describing them. Possible fields include

Fields derived from purchase data
  • number of past refills
  • months since last refill
  • months since first purchase
  • original product purchased
  • number of contacts since last refill
Fields captured at registration time
  • Age at time of first purchase
  • Sex
  • Country
  • Postal code or Zip code
Fields derived from the above and in combination with census data
  • Age at scoring time
  • Zip median income
  • Zip population density
  • Zip percent foreign born
Fields that could be purchased from a data vendor
  • Estimated household income
  • Estimated number of children
  • Estimated number of cars
  • Cluster assignment (e.g. "urban achievers", "bible devotion")
Rolling Back the Clock
Building a predictive model requires data from two distinct time periods. All data is from the past. To build a predictive model, you find patterns in data from the distant past that explain results in the more recent past. The result is a model that can be applied today to predict things that will happen in the future.

In the current case, you could take a snapshot of what all active customers looked like 14 months ago as your data from the distant past. In this data set, all of the tenure fields and count fields are reset to what they looked like way back when. Some customers now considered lapsed were still active. Some customers who have now made 4 refills had only made three. Customers who are now 65 were only 63, and so forth. Your data from the recent past would then be a single flag indicating whether the customer made a refill within 13 months of his or her previous refill or initial purchase. Note that because the churn definition is in terms of months since last purchase, the calendar date when a customer becomes lapsed must be calculated separately for each customer.

SAS PROC SQL Code Example
As I said, I have not seen the data that prompted Nilima's question. I do have some similar data that I can share with readers, however. Gordon Linoff and I teach a 2-day class on Applying Survival Analysis for Business Time-to-Event Problems. For that class we use a customer signature with a row for each subscriber, past and present, of a mobile phone company. You can get the data by registering on our web site.

The focus of the class is on calculating hazard probabilities for each tenure and using them to create survival curves that can be used to predict a subscriber's remaining lifetime and create subscriber level forecasts. If we wanted to use that data for a binary outcome churn model, we would have to roll back time as described above. The following SAS code creates a dataset of customers who were active 100 days before the extract or cutoff date. Time is rolled back so that subscribers appear as they did at the observation date. In particular, the subscriber's tenure and age are defined as of the observation date.

The code does a few other interesting things that may be worth noting. In the mobile telephony industry, handset is a known driver of churn. Subscribers know that they can get a new, cooler phone by signing up with a competitor as a new subscriber. Subscribers with uncool phones are most at risk, but which phones are the least cool is constantly changing over time. Therefore, rather than trying to incorporate the handset model into the model, we incorporate the churn rate associated with each model in the 100 days before the observation date by counting the number of people who stopped with each model and dividing by the number of people carrying each model.

Another big factor in churn is whether subscribers are on or off contract. Subscribers on contract must pay a fee to cancel their subscriptions. This code calculates two flags--one indicating whether the subscriber is off-contract as of the observation date and another indicating whether the subscriber is scheduled to go off contract (and so become more likely to churn) before the cutoff date.

The code creates 3 future variables, any of which could be the target for a binary outcome churn model. FutureCHURN is true for anyone who stopped for any reason between the observation date and the cutoff date. FutureVOLUNTARY is true for anyone who stopped voluntarily and FutureINVOLUNTARY is true for anyone who stopped involuntarily.

SQL code

Labels:

0 Comments:

Post a Comment

Links to this post:

Create a Link

<< Home