Wednesday, May 26, 2010

Combining Empirical Hazards by the Naïve Bayesian Method

Occasionally one has an idea that seems so obvious and right that it must surely be standard practice and have a well-known name. A few months ago, I had such an idea while sitting in a client’s office in Ottawa. Last week, I wanted to include the idea in a proposal, so I tried to look it up and couldn’t find a reference to it anywhere. Before letting my ego run away with itself and calling it the naïve Berry model, I figured I would share it with our legions of erudite readers so someone can point me to a reference.

Some Context

My client sells fax-to-email and email-to-fax service on a subscription basis. I had done an analysis to quantify the effect of various factors such as industry code, acquisition channel, and type of phone number (local or long distance) on customer value. Since all customers pay the same monthly fee, the crucial factor is longevity. I had analyzed each covariate separately by calculating cancellation hazard probabilities for each stratum and generating survival curves. The area under the first year of each survival curve is the first-year truncated mean tenure. Multiplying the first-year mean tenure by the subscription price yields the average first-year revenue for a segment. This let me say how much more valuable a realtor is than a trucker, or a Google AdWords referral than an MSN referral.
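As a rough illustration of that calculation (not the code used in the analysis), here is a minimal Python sketch that turns daily hazard probabilities into a survival curve and a first-year truncated mean tenure. The flat hazard of 0.002, the monthly fee, and the function names are all made up for the example, and summing the survival values from day 1 through day 365 is just one convention for the discrete "area under the curve."

```python
def survival_curve(hazards):
    """S[0] = 1; S[t] = probability of still being active after t days."""
    surv = [1.0]
    for h in hazards:
        surv.append(surv[-1] * (1.0 - h))
    return surv


def truncated_mean_tenure(hazards, horizon=365):
    """Area under the survival curve over the first `horizon` days."""
    surv = survival_curve(hazards[:horizon])
    # Summing S[1..horizon] gives the expected number of days completed.
    return sum(surv[1:])


# Made-up inputs: a flat daily hazard and a hypothetical monthly fee.
daily_hazards = [0.002] * 365
monthly_fee = 10.00

tenure_days = truncated_mean_tenure(daily_hazards)
first_year_revenue = tenure_days / 365 * 12 * monthly_fee
print(f"truncated mean tenure: {tenure_days:.1f} days")
print(f"average first-year revenue: ${first_year_revenue:.2f}")
```

Running the same function on the hazards for two strata and taking the difference gives the kind of segment-to-segment comparison described above.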

For many purposes, the dollar value was not even important. We used the probability of surviving one year as a way of scoring particular segments. But how should the individual segment scores be combined to give an individual customer a score based on his being a trucker with an 800 number referred by MSN? Or a tax accountant with a local number referred by Google? The standard empirical hazards approach would be to segment the training data by all levels of all variables before estimating the hazards, but that was not practical since there were so many combinations that many would lack sufficient data to make confident hazard estimates. Luckily, there is a standard model for combining the contributions of several independent pieces of evidence—naïve Bayesian models. An excellent description of the relationship between probability, odds, and likelihood and how to use them to implement naïve Bayesian models, can be found in Chapter 10 of Gordon Linoff’s Data Analysis Using SQL and Excel.

Here are the relevant correspondences:

odds = p/(1-p)
p = 1 - (1/(1+odds))
likelihood = (odds|evidence)/overall odds
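
These correspondences can be written as three small Python helpers. This is only a sketch; the function names are mine, and the inputs are assumed to be probabilities strictly between 0 and 1.

```python
def odds_from_prob(p):
    """odds = p / (1 - p)"""
    return p / (1.0 - p)


def prob_from_odds(odds):
    """p = odds / (1 + odds), which is the same as 1 - 1/(1 + odds)."""
    return odds / (1.0 + odds)


def likelihood(p_given_evidence, p_overall):
    """Odds given the evidence, divided by the overall odds."""
    return odds_from_prob(p_given_evidence) / odds_from_prob(p_overall)


# Round trip: probability -> odds -> probability
print(prob_from_odds(odds_from_prob(0.54)))
```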

Statisticians switch from one representation to another as convenient. A familiar example is logistic regression. Since linear regression is inappropriate for modeling probabilities that range only from 0 to 1, they convert the probabilities to log(odds) that vary from negative infinity to positive infinity. Expressing the log odds as a linear regression equation and solving for p yields the logistic function.

Naïve Bayesian Models

The Naïve Bayesian model says that the odds of surviving one year given the evidence is the overall odds times the product of the likelihoods for each piece of evidence. For concreteness, let’s calculate a score for a general contractor (industry code 1521) with a local number who was referred by a banner ad.
The probability of surviving one year is 54%. Overall survival odds are therefore 0.54/(1-0.54) or 1.17.
One-year survival for industry code 1521 is 74%, considerably better than overall survival. The survival likelihood is defined as the survival odds, 0.74/(1-0.74) divided by the overall survival odds of 1.17. This works out to 2.43.

One-year survival for local phone numbers is 37%, considerably worse than overall survival. Local phone numbers have one-year survival odds of 0.59 and likelihood of 0.50.

Subscribers acquired through banner ads have one-year survival of 0.52, about the same as overall survival. This corresponds to odds of 1.09 and likelihood of 0.91.

Plugging these values into the naïve Bayesian model formula, we estimate one-year survival odds for this customer as 1.17*2.43*0.50*0.91=1.29. Solving 1.29=p/(1-p) for p yields a one-year survival estimate of 56%, a little bit better than overall survival. The positive evidence from the industry code slightly outweighs the negative evidence from the phone number type.

This example does not illustrate another great feature of naïve Bayesian models. If some evidence is missing—if the subscriber works in an industry for which we have no survival curve, for example—you can simply leave out the industry likelihood term.
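
The arithmetic for the general contractor, and the missing-evidence case, can be sketched in a few lines of Python. The function names are my own, and marking missing evidence with `None` is just one possible convention. The inputs are the rounded odds and likelihoods quoted in the text.

```python
from math import prod


def naive_bayes_odds(overall_odds, likelihoods):
    """Overall odds times the product of the available likelihoods.

    A likelihood of None means the evidence is missing, so its term
    is simply left out of the product.
    """
    return overall_odds * prod(l for l in likelihoods if l is not None)


def prob_from_odds(odds):
    return odds / (1.0 + odds)


# Industry 1521, local number, banner ad (rounded values from the text).
odds = naive_bayes_odds(1.17, [2.43, 0.50, 0.91])
print(round(odds, 2), round(prob_from_odds(odds), 2))   # 1.29 0.56

# Same subscriber, but with no survival curve for the industry.
odds_no_industry = naive_bayes_odds(1.17, [None, 0.50, 0.91])
print(round(prob_from_odds(odds_no_industry), 2))
```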

The Idea

If we are happy to use the naïve Bayesian model to estimate the probability of a subscriber lasting one year, why not do the same for daily hazard probabilities? This is something I’ve been wanting to do since the first time I ever used the empirical hazard estimation method. That first project was for a wireless phone company. There was plenty of data to calculate hazards stratified by market or rate plan or handset type or credit class or acquisition channel or age group or just about any other time-0 covariate of interest. But there wasn’t enough data to estimate hazards for every combination of the above. I knew about naïve Bayesian models back then; I’d used the Evidence Model in SGI’s Mineset many times. But I never made the connection—it’s hard to combine probabilities, but easy to combine likelihoods. There you have it: Freedom from the curse of dimensionality via the naïve assumption of independence. Estimate hazards for as many levels of as many covariates as you please and then combine them with the naïve Bayesian model. I tried it, and the results were pleasing.
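
Here is one way the idea might look in Python. It is a sketch of my reading of the method, not the code used for the results below: at each tenure, convert the overall hazard to odds, multiply by one likelihood per covariate level, and convert back to a probability. The hazard values are invented, and the overall hazard is assumed to be strictly between 0 and 1 at every tenure so that the odds are defined.

```python
def odds(p):
    return p / (1.0 - p)


def prob(o):
    return o / (1.0 + o)


def combine_hazards(overall, strata):
    """Naive Bayesian combination of hazard curves, one tenure at a time.

    overall: hazard probabilities by tenure for all customers
    strata:  one hazard curve per covariate level that applies to the
             customer (for example market, channel, rate plan)
    """
    combined = []
    for t, h_all in enumerate(overall):
        base = odds(h_all)          # assumes 0 < h_all < 1
        o = base
        for curve in strata:
            o *= odds(curve[t]) / base   # likelihood of this evidence at tenure t
        combined.append(prob(o))
    return combined


# Invented hazards for three tenures.
overall = [0.010, 0.008, 0.006]
market = [0.012, 0.008, 0.005]
channel = [0.020, 0.010, 0.006]
plan = [0.005, 0.006, 0.006]

print(combine_hazards(overall, [market, channel, plan]))
```

Because each stratum curve is estimated from all the customers at that level, none of the inputs has the long runs of zero hazard that appear when the data is cut by every combination at once.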

An Example

This example uses data from a mobile phone company. The dataset is available on our web site. There are three rate plans, Top, Middle, and Bottom. There are three markets, Gotham, Metropolis, and Smallville. There are four acquisition channels, Dealer, Store, Chain, and Mail. There is plenty of data to make highly confident hazard estimates for any of the above, but some combinations, such as Smallville-Mail-Top are fairly rare. For many tenures, no one with this combination cancels so there are long stretches of 0 hazard punctuated by spikes where one or two customers leave.


Here are the Smallville-Mail-Top hazards by the Naïve Berry method:



Isn’t that prettier? I think it makes for a prettier survival curve as well.


The naïve method preserves a feature of the original data—the sharp drop at the anniversary when many people coming off one-year contracts quit—that was lost in the sparse calculation.

Sunday, April 4, 2010

Data Mining Techniques now available in Korean

For any of our readers who have been wishing they could read our book Data Mining Techniques for Marketing, Sales, and Customer Relationship Management (2nd Edition) in Korean, now you can! We don't know why the cover pictures someone playing jacks, but then we don't really understand how our publisher chooses our U.S. cover pictures either.

This book was already available in Japanese, and, of course, English. Earlier editions are available in Traditional Chinese and French.

Sunday, March 14, 2010

Bitten by an Unfamiliar Form of Left Truncation

Alternate title: Data Mining Consultant with Egg on Face

Last week I made a client presentation. The project was complete. I was presenting the final results to the client.  The CEO was there. Also the CTO, the CFO, the VPs of Sales and Marketing, and the Marketing Analytics Manager. The client runs a subscription-based business and I had been analyzing their attrition patterns. Among my discoveries was that customers with "blue" subscriptions last longer than customers with "red" subscriptions. By taking the difference of the area under the two survival curves truncated at one year and multiplying by the subscription cost, I calculated the dollar value of the difference. I put forward some hypotheses about why the blue product was stickier and suggested a controlled experiment to determine whether having a blue subscription actually caused longer tenure or was merely correlated with it. Currently, subscribers simply pick blue or red at sign-up. There is no difference in price.  I proposed that half of new customers be given blue by default unless they asked for red and the other half be given red by default unless they asked for blue. We could then look for differences between the two randomly assigned groups.

All this seemed to go over pretty well.  There is only one problem.  The blue customers may not be better after all.  One of the attendees asked me whether the effect I was seeing could just be a result of the fact that blue subscriptions have been around longer than red ones so the oldest blue customers are older than the oldest red customers. I explained that this would not bias my findings because all my calculations were based on the tenure time line, not the calendar time line. We were comparing customers' first years without regard to when they happened. I explained that there would be a problem if the data set suffered from left truncation, but I had tested for that, and it was not a problem because we knew about starts and stops since the beginning of time.

Left truncation is something that creates a bias in many customer databases. What it means is that there is no record of customers who stopped before some particular date in the past--the left truncation date. The most likely reason is that the company has been in existence longer than its data warehouse. When the warehouse was created, all active customers were loaded in, but customers who had already left were not. Fine for most applications, but not for survival analysis. Think about customers who started before the warehouse was built. One (like many thousands of others) stops before the warehouse gets built with a short tenure of two months. Another, who started on the same day as the first, is still around to be loaded into the warehouse with a tenure of two years. Lots of short-tenure people are missing and long-tenure people are overrepresented. Average tenure is inflated and retention appears to be better than it really is.
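
A small simulation makes the bias concrete. Everything in this Python sketch is invented for illustration: the constant daily hazard, the warehouse load date, and the simplification that each customer's eventual tenure is known, which isolates the effect of truncation from censoring.

```python
import math
import random

rng = random.Random(42)          # fixed seed so the run is repeatable
DAILY_HAZARD = 0.005             # made-up constant cancellation probability
WAREHOUSE_DAY = 1000             # hypothetical day the warehouse was loaded


def random_tenure():
    """Draw a tenure in days from a geometric distribution."""
    u = 1.0 - rng.random()       # in (0, 1], so the log is defined
    return int(math.log(u) / math.log(1.0 - DAILY_HAZARD)) + 1


# Everyone here started before the warehouse existed.
customers = [(rng.randrange(WAREHOUSE_DAY), random_tenure())
             for _ in range(20000)]

true_mean = sum(t for _, t in customers) / len(customers)

# Only customers still active on the load date make it into the warehouse.
loaded = [(s, t) for s, t in customers if s + t > WAREHOUSE_DAY]
observed_mean = sum(t for _, t in loaded) / len(loaded)

print(f"true mean tenure:     {true_mean:.0f} days")
print(f"observed mean tenure: {observed_mean:.0f} days")
```

The customers who survive long enough to be loaded are disproportionately the long-tenure ones, so the observed mean comes out well above the true mean.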

My client's data did not have that problem. At least, not in the way I am used to looking for it. Instead, it had a large number of stopped customers for whom the subscription type had been forgotten. I (foolishly) just left these people out of my calculations. Here is the problem: Although the customer start and stop dates are remembered forever, certain details, including the subscription type, are purged after a certain amount of time. For all the people who started back when there were only blue subscriptions and had short or even average tenures, that time had already passed. The only ones for whom I could determine the subscription type were those who had unusually long tenures. Eliminating the subscribers for whom the subscription type had been forgotten had exactly the same effect as left truncation!

If this topic and things related to it sound interesting to you, it is not too late to sign up for a two-day class I will be teaching in New York later this week.  The class is called Survival Analysis for Business Time to Event Problems. It will be held at the offices of SAS Institute in Manhattan this Thursday and Friday, March 18-19.

Thursday, February 25, 2010

Agglomerative Variable Clustering

Lately, I've been thinking about the topic of reducing the number of variables, and how this is a lot like clustering variables (rather than clustering rows). This post is about a method that seems intuitive to me, although I haven't found any references to it. Perhaps a reader will point me to references and a formal name. This method uses Pearson correlation and principal components to agglomeratively cluster the variables.

Agglomerative clustering is the process of assigning records to clusters, starting with the records that are closest to each other. This process is repeated, until all records are placed into a single cluster. The advantage of agglomerative clustering is that it creates a structure for the records, and the user can see different numbers of clusters. Divisive clustering, such as implemented by SAS's varclus proc, produces something similar, but from the top-down.

Agglomerative variable clustering works the same way. Two variables are put into the same cluster, based on their proximity. The cluster then needs to be defined in some manner, by combining information in the cluster.

The natural measure for proximity is the square of the (Pearson) correlation between the variables. This is a value between 0 and 1, where 0 means the variables are totally uncorrelated and 1 means they are collinear. For those who are more graphically inclined, this statistic has an easy interpretation when there are two variables. It is the R-square value of the first principal component of the scatter plot.

Combining two variables into a cluster requires creating a single variable to represent the cluster. The natural variable for this is the first principal component.

My proposed clustering method repeatedly does the following:
  1. Finds the two variables with the highest correlation.
  2. Calculates the principal component for these variables and adds it into the data.
  3. Maintains the information that the two variables have been combined.
The attached SAS code (available at sas-var-hierarchical-clustering-v01.sas) does exactly this, although not in the most efficient and robust way. The bulk of the code is a macro, called buildcolumns, that appends the new cluster variables to the data set and maintains another table called columns which has the information about the rows. After I run this code, I can select different numbers of variables using the expression:

proc sql;
    select colname
    from columns
    where counter <= [some number];

These variables can then be used for predictive models or visualization purposes.

The inner loop of the code works by doing the following:
  1. Calling proc corr to calculate the correlation of all variables not already in a cluster.
  2. Transposing the correlations into a table with three columns, two for the variables and one for the correlation using proc transpose.
  3. Finding the pair of variables with the largest correlation.
  4. Calculating the first principal component for these variables.
  5. Appending this principal component to the data set.
  6. Updating the columns data set with information about the new cluster.
The data set referred to in the code comes from the companion site for Data Analysis Using SQL and Excel. The code will fail (by running an infinite loop) if any variables are missing or if two variables are perfectly correlated.
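
For readers without SAS, the steps above can be sketched in plain Python. This is not a translation of the attached macro. It assumes the variables are standardized first (as a correlation-based principal components analysis would do), in which case the first principal component of two variables is simply their scaled sum or difference. The function names and sample data are mine, and the sketch assumes no missing values and no constant columns.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev


def standardize(xs):
    """Rescale to mean 0 and (population) standard deviation 1."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


def pearson(zx, zy):
    """Correlation of two already-standardized columns."""
    return sum(a * b for a, b in zip(zx, zy)) / len(zx)


def first_pc(zx, zy, r):
    """First principal component of two standardized columns.

    Their correlation matrix is [[1, r], [r, 1]], so the leading
    eigenvector is (1, 1)/sqrt(2) when r >= 0 and (1, -1)/sqrt(2) otherwise.
    """
    sign = 1.0 if r >= 0 else -1.0
    return [(a + sign * b) / sqrt(2.0) for a, b in zip(zx, zy)]


def cluster_variables(data):
    """data maps column name -> list of numbers; returns the merge history."""
    active = {name: standardize(vals) for name, vals in data.items()}
    merges = []
    while len(active) > 1:
        # Step 1: the pair of active columns with the largest squared correlation.
        a, b = max(combinations(active, 2),
                   key=lambda pair: pearson(active[pair[0]], active[pair[1]]) ** 2)
        r = pearson(active[a], active[b])
        # Step 2: replace the pair with its first principal component.
        new_name = f"({a}+{b})"
        pc = first_pc(active.pop(a), active.pop(b), r)
        active[new_name] = standardize(pc)
        # Step 3: remember which columns were combined, and how tightly.
        merges.append((a, b, r * r, new_name))
    return merges


# Hypothetical data: "a" and "b" are nearly collinear, "c" is unrelated.
sample = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],
    "c": [5, 1, 4, 2, 6, 3],
}
for step in cluster_variables(sample):
    print(step)
```

Reading the merge history from the top gives the clusters in order of tightness, which plays the same role as the counter column in the columns table.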

Wednesday, February 10, 2010

Why there is always a J window open on my desktop

People often ask me what tools I use for data analysis. My usual answer is SQL and I explain that just as Willie Sutton robbed banks because "that's where the money is," I use SQL because that is where the data is. But sometimes, it gets so frustrating trying to figure out how to get SQL to do something as seemingly straightforward as a running total or running maximum, that I let the data escape from the confines of its relational tables and into J where it can be free. I assume that most readers have never heard of J, so I'll give you a little taste of it here. It's a bit like R only a lot more general and more powerful. It's even more like APL, of which it is a direct descendant, but those of us who remember APL are getting pretty old these days.

The question that sent me to J this time came from a client who had just started collecting sales data from a web site and wanted to know how long they would have to wait before being able to draw valid conclusions about whether spending differences between two groups who had received different marketing treatments were statistically significant. One thing I wanted to look at was how much various measures such as average order size and total revenue fluctuate from day to day and how many days it takes before the overall measures settle down near their long-term means. For example, I'd like to calculate the average order size with just one day's worth of purchases, then two days' worth, then three days' worth, and so on. This sort of operation, where a function is applied to successively longer and longer prefixes, is called a scan.
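
For readers more at home in Python than in an array language, the scan idea can be sketched with `itertools.accumulate` from the standard library. The ten values below are illustrative only, and the helper names are mine.

```python
from itertools import accumulate


def running_mean(values):
    """Mean of each successively longer prefix of values."""
    totals = accumulate(values)          # running total of the prefix
    return [total / n for n, total in enumerate(totals, start=1)]


def running_max(values):
    """Largest value seen so far at each position."""
    return list(accumulate(values, max))


values = [3, 0, 1, 4, 4, 1, 6, 4, 2, 8]   # illustrative data

print(list(accumulate(values)))   # [3, 3, 4, 8, 12, 13, 19, 23, 25, 33]
print(running_max(values))        # [3, 3, 3, 4, 4, 4, 6, 6, 6, 8]
print(running_mean(values))
```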

A warning: J looks really weird when you first see it. One reason is that many things that are treated as a single token are spelled with two characters. I remember when I first saw Dutch, there were all these impossible looking words with "ij" in them--ijs and rijs, for example. Well, it turns out that in Dutch "ij" is treated like a single letter that makes a sound a bit like the English "eye." So ijs is ice and rijs is rice and the Rijn is a famous big river. In J, the second character of these two-character symbols is usually a '.' or a ':'.

=: is assignment. <. is lesser of. >. is greater of. And so on. You should also know that anything following NB. on a line is comment text.

   x=: ? 100#10                        NB. One hundred random integers between 0 and 9

   +/ x                                      NB. Like putting a + between every pair of x--the sum of x.
424
   <. / x                                    NB. Smallest x
0
   >. / x                                    NB. Largest x
9
   mean x
4.24
   ~. x                                      NB. Nub of x. (Distinct elements.)
3 0 1 4 6 2 8 7 5 9
   # ~. x                                    NB. Number of distinct elements.
10
   x # /. x                                  NB. How many of each distinct element. ( /. is like SQL GROUP BY.)
6 10 15 13 15 9 9 12 6 5
   +/ \ x                                      NB. Running total of x.
3 3 4 8 12 13 19 23 25 33 41 48 54 56 61 67 69 72 73 74 75 . . .
   >./ \ x                                     NB. Running maximum of x.
3 3 3 4 4 4 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 . . .
   mean \ x                                  NB. Running mean of x.
3 1.5 1.33333 2 2.4 2.16667 2.71429 2.875 2.77778 3.3 3.72727 . . .
   plot mean \ x                            NB. Plot running mean of x.


   plot var \ x                               NB. Plot running variance of x.

 
 
J is available for free from J software. Other than as a fan, I have no relationship with that organization.

Creating DDL For An Entire Database In SQL Server 2008

Recently, I started a new project which has a database component. I looked around for some visual data modeling tools, and I settled on just using the diagrams capability of SQL Server. Since the client is using SQL Server, it was simple to download SQL Server Express and get started using their diagramming tool.

After creating a bunch of tables, I learned that SQL Server Database Diagrams do not produce the Data Definition Language (DDL) to create the database. Instead, the tables are created in sync with the diagram. Furthermore, SQL Server does not have a command that creates the DDL for an entire database. Right clicking on two dozen tables is cumbersome. But even worse, it would not provide complete DDL, since the table DDL does not include index definitions.

I have seen some debate on the web about the merits of graphical tools versus text DDL. Each has its advantages, and, personally, I believe that a decent database tool should allow users to switch between the two. The graphical environment lets me see the tables and their relationships. The text allows me to make global changes, such as:
  • Changing all the SMALLDATETIME data types to DATE when I go to a commercial version of SQL Server. The Express version does not support DATE, alas.
  • Adding auditing columns -- such as user, creation date, and update date -- to almost all tables.
  • Adding table-specific comments.
Doing these types of actions in a point-and-click environment is cumbersome, inefficient, and prone to error. At the same time, the GUI environment is great for designing the tables and visualizing their relationships.

So, I searched on the web for a DDL program that would allow me to create the DDL for an entire SQL Server database. Because I did not find any, I decided that I had to write something myself. The attached file, script-all-tables.sql, contains my script.

This script uses SQL to generate SQL code -- a trick that I talk about in my book Data Analysis Using SQL and Excel. The script generates code for the following:
  1. Dropping all tables in the database, if they exist.
  2. Creating new versions of the tables, taking into account primary keys, data types, and identity columns.
  3. Creating foreign key constraints on the table.
  4. Creating indexes on the table.
This is a very common subset of DDL used for databases. And, importantly, it seems to cover almost all that you can do using Database Diagrams. However, the list of what it is missing from fully re-creating any database is very, very long, ranging from user defined types, functions, and procedures, to the storage architecture, replication, and triggers.

The script uses the views in the sys schema rather than those in Information_Schema simply because I found it easier to find the information that I needed to put the SQL together.
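
To show the "SQL that writes SQL" trick in miniature, here is a self-contained Python sketch. It is not the attached script and it does not touch SQL Server: it uses SQLite (through Python's built-in `sqlite3` module) and its `sqlite_master` catalog as a stand-in for the sys views, and the two tables are hypothetical. It covers only the first step of the list above, generating the DROP statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
""")

# The output of this query is itself SQL: one DROP statement per user table.
generator = """
    SELECT 'DROP TABLE IF EXISTS "' || name || '";'
    FROM sqlite_master
    WHERE type = 'table' AND name NOT LIKE 'sqlite_%'
    ORDER BY name
"""
drop_statements = [row[0] for row in conn.execute(generator)]
for statement in drop_statements:
    print(statement)
```

The same pattern extends to CREATE TABLE, foreign key, and index statements by querying the catalog for columns, keys, and indexes and concatenating the pieces.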

Tuesday, February 2, 2010

Simpson's Paradox and Marketing

A reader asked the following question:

Hi Michael/Gordon,
In campaign measurements, it's possible to get a larger lift at the overall level compared to all the individual decile level lifts or vice versa, because of the differences in sample size across the deciles, and across Test & Control.
According to wikipedia, it's known as Simpson's paradox (or the Yule-Simpson effect) and is explained as an apparent paradox in which the successes in different groups seem to be reversed when the groups are combined.
In such scenarios, how do you calculate the overall lift? Which methods are commonly used in the industry?
Thanks,
Datalligence
http://datalligence.blogspot.com/

Simpson's Paradox is an interesting phenomenon, where results about subgroups of a population do not generalize to the overall population. I think the simplest version that I've heard is an old joke . . . "I heard you moved from Minnesota to Iowa, raising the IQ of both states."

How could this happen? For the joke to work, the average IQ in Minnesota must be higher than the average IQ in Iowa. And, the person who moves must have an IQ between these two values. Voila, you can get the paradox that the averages in both states go up, although they are based on exactly the same population.
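
A few made-up numbers show the mechanics. In this Python sketch the IQ values are invented, and the person who moves has an IQ below the Minnesota average but above the Iowa average.

```python
from statistics import mean

# Invented IQ scores for two tiny "states".
minnesota = [110, 105, 100]
iowa = [95, 90, 85]
mover = 100        # below Minnesota's mean (105), above Iowa's mean (90)

before = (mean(minnesota), mean(iowa))
overall_before = mean(minnesota + iowa)

minnesota.remove(mover)
iowa.append(mover)

after = (mean(minnesota), mean(iowa))
overall_after = mean(minnesota + iowa)

print("before:", before, "overall:", overall_before)
print("after: ", after, "overall:", overall_after)
```

Both state averages rise while the combined average stays exactly where it was.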

I didn't realize that this paradox has a name (or, if I did, then I had forgotten). Wikipedia has a very good article on Simpson's Paradox, which includes real world examples from baseball, medical studies, and an interesting discussion of a gender discrimination lawsuit at Berkeley. In the gender discrimination lawsuit, women were accepted at a much lower rate than men overall. However, department by department, women were typically accepted at a higher rate than men. The difference is that women applied to more competitive departments than men. These departments have lower rates of acceptance, lowering the overall rate for women.

Simpson's Paradox arises when we are taking weighted averages of evidence from different groups. Different weightings can produce very different, even counter-intuitive results. The results become much less paradoxical when we see the actual counts rather than just the percentages.

The specific question is how to relate this paradox to lift, and understanding marketing campaigns. Assume there is a marketing campaign, where one group receives a particular treatment and another group does not. The ratio of performance between these two groups is the lift of the marketing campaign.

To avoid Simpson's Paradox, you need to ensure that the groups are as similar as possible, except for what's being tested. If the test is for the marketing message, there is no problem: both groups can be pulled from the same population. If, instead, the test is for the marketing group itself (say high value customers), then Simpson's Paradox is not an issue, since we care about how the group performs rather than how the entire population performs.

As a final comment, I could imagine finding marketing results where Simpson's Paradox has surfaced, because the original groups were not well chosen. Simpson's Paradox arises because the sizes of the test groups are not proportional to their sizes in the overall population. In this case, I would be tempted to weight the results from each group based on the expected size in the overall population to calculate the overall response and lift.
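
Here is a sketch of that weighting in Python, using invented counts. The two segments, their population shares, and the response numbers are all hypothetical. They are chosen so that the test group beats the control group within each segment while the pooled numbers point the other way, because the test group is mostly low-value customers and the control group is mostly high-value customers.

```python
# segment: (population_share, (test_n, test_responders), (control_n, control_responders))
segments = {
    "high_value": (0.5, (100, 20), (900, 162)),
    "low_value": (0.5, (900, 45), (100, 4)),
}
TEST, CONTROL = 1, 2


def rate(n, responders):
    return responders / n


# Lift within each segment: test response rate over control response rate.
segment_lift = {name: rate(*seg[TEST]) / rate(*seg[CONTROL])
                for name, seg in segments.items()}


def pooled_rate(arm):
    """Response rate after lumping all segments together."""
    n = sum(seg[arm][0] for seg in segments.values())
    responders = sum(seg[arm][1] for seg in segments.values())
    return responders / n


def weighted_rate(arm):
    """Response rate with each segment weighted by its population share."""
    return sum(seg[0] * rate(*seg[arm]) for seg in segments.values())


pooled_lift = pooled_rate(TEST) / pooled_rate(CONTROL)
weighted_lift = weighted_rate(TEST) / weighted_rate(CONTROL)

print("lift by segment:", segment_lift)
print("pooled lift:    ", round(pooled_lift, 3))
print("weighted lift:  ", round(weighted_lift, 3))
```

With these numbers the pooled lift falls below 1 even though both segment lifts are above 1, and weighting by population share brings the overall lift back in line with the segments.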