Sunday, March 14, 2010

Bitten by an Unfamiliar Form of Left Truncation

Alternate title: Data Mining Consultant with Egg on Face

Last week I made a client presentation. The project was complete. I was presenting the final results to the client.  The CEO was there. Also the CTO, the CFO, the VPs of Sales and Marketing, and the Marketing Analytics Manager. The client runs a subscription-based business and I had been analyzing their attrition patterns. Among my discoveries was that customers with "blue" subscriptions last longer than customers with "red" subscriptions. By taking the difference of the area under the two survival curves truncated at one year and multiplying by the subscription cost, I calculated the dollar value of the difference. I put forward some hypotheses about why the blue product was stickier and suggested a controlled experiment to determine whether having a blue subscription actually caused longer tenure or was merely correlated with it. Currently, subscribers simply pick blue or red at sign-up. There is no difference in price.  I proposed that half of new customers be given blue by default unless they asked for red and the other half be given red by default unless they asked for blue. We could then look for differences between the two randomly assigned groups.

All this seemed to go over pretty well.  There is only one problem.  The blue customers may not be better after all.  One of the attendees asked me whether the effect I was seeing could just be a result of the fact that blue subscriptions have been around longer than red ones so the oldest blue customers are older than the oldest red customers. I explained that this would not bias my findings because all my calculations were based on the tenure time line, not the calendar time line. We were comparing customers' first years without regard to when they happened. I explained that there would be a problem if the data set suffered from left truncation, but I had tested for that, and it was not a problem because we knew about starts and stops since the beginning of time.

Left truncation is something that creates a bias in many customer databases.  What it means is that there is no record of customers who stopped before some particular date in the past--the left truncation date. The most likely reason is that the company has been in existence longer than its data warehouse. When the warehouse was created, all active customers were loaded in, but customers who had already left were not. Fine, for most applications, but not for survival analysis. Think about customers who started before the warehouse was built.  One (like many thousands of others) stops before the warehouse gets built with a short tenure of two months. Another, who started on the same day as the first, is still around two be loaded into the warehouse with a tenure of two years.  Lots of short-tenure people are missing and long-tenure people are over represented. Average tenure is inflated and retention appears to be better than it really is.

My client's data did not have that problem.  At least, not in the way I am used to looking for it.  Instead, it had a large number of stopped customers for whom the subscription type had been forgotten. I (foolishly) just left these people out of my calculations.  Here is the problem: Although the customer start and stop dates are remembered for ever, certain details, including the subscription type,  are purged after a certain amount of time. For all the people who started back when there were only blue subscriptions and had short or even average tenures, that time had already past. The only ones for whom I could determine the subscription type were those who had unusually long tenures.  Eliminating the subscribers for whom the subscription type had been forgotten had exactly the same effect as left truncation!

If this topic and things related to it sound interesting to you, it is not too late to sign up for a two-day class I will be teaching in New York later this week.  The class is called Survival Analysis for Business Time to Event Problems. It will be held at the offices of SAS Institute in Manhattan this Thursday and Friday, March 18-19.

Labels: ,

Wednesday, February 10, 2010

Why there is always a J window open on my desktop

People often ask me what tools I use for data analysis. My usual answer is SQL and I explain that just as Willie Sutton robbed banks because "that's where the money is," I use SQL because that is where the data is. But sometimes, it gets so frustrating trying to figure out how to get SQL to do something as seemingly straight forward as a running total or running maximum, that I let the data escape from the confines of its relational tables and into J where it can be free. I assume that most readers have never heard of J, so I'll give you a little taste of it here.  It's a bit like R only a lot more general and more powerful. It's even more like APL, of which it is a direct descendant, but those of us who remember APL are getting pretty old these days.

The question that sent me to J this time came from a client who had just started collection sales data from a web site and wanted to know how long they would have to wait before being able to make some statistically valid conclusions about whether spending differences between two groups who had received different marketing treatments were statistically significant. One thing I wanted to look at was how much various measures such as average order size and total revenue fluctuate from day to day and how many days does it take before the overall measures settle down near their long-term means. For example, I'd like to calculate the average order size with just one day's worth of purchases, then two day's worth, then three day's worth, and so on. This sort of operation, where a function is applied to successively longer and longer prefixes is called a scan.

A warning: J looks really weird when you first see it. One reason is that many things that are treated as a single token are spelled with two characters. I remember when I first saw Dutch, there were all these impossible looking words with "ij" in them--ijs and rijs, for example. Well, it turns out that in Dutch "ij" is treated like a single letter that makes a sound a bit like the English "eye." So ijs is ice and rijs is rice and the Rijn is a famous big river. In J, the second character of these two-character symbols is usually a '.' or a ':'.

=: is assignment. <. is lesser of. >. is greater of. And so on. You should also know that anything following NB. on a line is comment text.

   x=: ? 100#10                        NB. One hundred random integers between 0 and 9

   +/ x                                      NB. Like putting a + between every pair of x--the sum of x.
   <. / x                                    NB. Smallest x
   >. / x                                    NB. Largest x
   mean x
   ~. x                                      NB. Nub of x. (Distinct elements.)
3 0 1 4 6 2 8 7 5 9
   # ~. x                                    NB. Number of distinct elements.
    x # /. x                                  NB. How many of each distinct element. ( /. is like SQL GROUP BY.)
6 10 15 13 15 9 9 12 6 5
   +/ \ x                                      NB. Running total of x.
3 3 4 8 12 13 19 23 25 33 41 48 54 56 61 67 69 72 73 74 75 . . .
   >./ \ x                                     NB. Running maximum of x.
3 3 3 4 4 4 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 . . .
   mean \ x                                  NB. Running mean of x.
3 1.5 1.33333 2 2.4 2.16667 2.71429 2.875 2.77778 3.3 3.72727 . . .
   plot mean \ x                            NB. Plot running mean of x.

   plot var \ x                               NB. Plot running variance of x.

J is available for free from J software. Other than as a fan, I have no relationship with that organization.

Labels: ,

Tuesday, January 19, 2010

Oracle load scripts now avalable for Data Analysis Using SQL and Excel

Classes started this week for the spring semester at Boston College where I am teaching a class on marketing analytics to MBA students at the Carroll School of Management.  The class makes heavy use of Gordon's book, Data Analysis Using SQL and Excel and the data that accompanies it. Since the local database is Oracle, I have at long last added Oracle load scripts to the book's companion page.

Due to laziness, my method of creating the Oracle script was to use the existing MySQL script and edit bits that didn't work in Oracle.  As it happens, the MySQL scripts worked pretty much as-is to load the tab-delimited data into Oracle tables using Oracle's sqlldr utility. One case that did not work taught me something about the danger of mixing tab-delimited data with input formats in sqlldr.  Even though it has nothing to do with data mining, as a public service, that will be the topic of my next post.

Preview: Something that works perfectly well when your field delimiter is comma, fails mysteriously when it is tab.


Monday, December 28, 2009

Differential Response or Uplift Modeling

Some time before the holidays, we received the following inquiry from a reader:

Dear Data Miners,

I’ve read interesting arguments for uplift modeling (also called incremental response modeling) [1], but I’m not sure how to implement it. I have responses from a direct mailing with a treatment group and a control group. Now what? Without data mining, I can calculate the uplift between the two groups but not for individual responses. With the data mining techniques I know, I can identify the ‘do not disturbs,’ but there’s more than avoiding mailing that group. How is uplift modeling implemented in general, and how could it be done in R or Weka?


I first heard the term "uplift modeling" from Nick Radcliffe, then of Quadstone. I think he may have invented it. In our book, Data Mining Techniques, we use the term "differential response analysis." It turns out that "differential response" has a very specific meaning in the child welfare world, so perhaps we'll switch to "incremental response" or "uplift" in the next edition. But whatever it is called, you can approach this problem in a cell-based fashion without any special tools. Cell-based approaches divide customers into cells or segments in such a way that all members of a cell are similar to one another along some set of dimensions considered to be important for the particular application. You can then measure whatever you wish to optimize (order size, response rate, . . .) by cell and, going forward, treat the cells where treatment has the greatest effect.

Here, the quantity  to measure is the difference in response rate or average order size between treated and untreated groups of otherwise similar customers. Within each cell, we need a randomly selected treatment group and a randomly selected control group; the incremental response or uplift is the difference in average order size (or whatever) between the two. Of course some cells will have higher or lower overall average order size, but that is not the focus of incremental response modeling. The question is not "What is the average order size of women between 40 and 50 who have made more than 2 previous purchases and live in a neighborhood where average household income is two standard deviations above the regional average?" It is "What is the change in order size for this group?"

Ideally, of course, you should design the segmentation and assignment of customers to treatment and control groups before the test, but the reader who submitted the question has already done the direct mailing and tallied the responses. Is it now too late to analyze incremental response?  That depends: If the control group is a true random control group and if it is large enough that it can be partitioned into segments that are still large enough to provide statistically significant differences in order size, it is not too late. You could, for instance, compare the incremental response of male and female responders.

A cell-based approach is only useful if the segment definitions are such that incremental response really does vary across cells. Dividing customers into male and female segments won't help if men and women are equally responsive to the treatment. This is the advantage of the special-purpose uplift modeling software developed by Quadstone (now Portrait Software). This tool builds a decision tree where the splitting criteria is maximizing the difference in incremental response. This automatically leads to segments (the leaves of the tree) characterized by either high or low uplift.  That is a really cool idea, but the lack of such a tool is not a reason to avoid incremental response analysis.

Labels: , , ,

Tuesday, December 22, 2009

Interview with Eric Siegel

This is the first of what may become an occasional series of interviews with people in the data mining field. Eric Siegel is the organizer of the popular  Predictive Analytics World conference series. I asked him a little bit about himself and gave him a chance to plug his conference.  A propos, readers of this blog can get a 15% discount on a two-day conference pass by pasting the code DATAMINER010 into the Promotional Code box on the conference registration page.

Q: Not many kids (one of mine is perhaps the exception that proves the rule) have the thought "when I grow up, I want to be a data miner!"  How did you fall into this line of work?

To many laypeople, the word "data" sounds dry, arcane, meaningless - boring! And number-crunching on it doubly so. But this is actually the whole point. Data is the uninterpreted mass of things that've happened.  Extracting what's up, the means behind the madness, and in so doing modeling and learning about human behavior... well, I feel nothing in science or engineering is more interesting.
In my "previous life" as an academic researcher, I focused on core predictive modeling methods. The ability for a computer to automatically learn from experience (data really is recorded experience, after all), is the best thing since sliced bread. Ever since I realized, as I grew up from childhood, that space travel would in fact be a tremendous, grueling pain in the neck (not fun like "Star Wars"), nothing in science has ever seemed nearly as exciting.

In my current 9-year career as a commercial practitioner, I've found that indeed the ability to analytically "learn" and apply what's been learned turns out to provide plenty of business value, as I imagined back in the lab.  Research science is fun in that you have the luxury of abstraction and are often fairly removed from the need to prove near-term industrial applicability. Applied science is fun for the opposite reason: The tangle of challenges, although some less abstract and in that sense more mundane, are the only thing between you and getting the great ideas of the world to actually work, come to fruition, and deliver an irrefutable impact.

Q: Most conferences happen once a year.  Why does PAW come around so much more frequently?

In fact, many commercial conferences focused the industrial deployment of technology occur multiple times per year, in contrast to research conferences, which usually take place annually.  There's an increasing demand for a more frequent commercial event as predictive analytics continues to "cross chasms" towards more widescale penetration. There's just too much to cover - too many brand-name case studies and too many hot topics - to wait a year before each event.

Q: You use the phrase "predictive analytics" for what I've always called "data mining." Do the terms mean something different, or is it just that fashions change with the times?

"Data mining" is indeed often used synonymously with "predictive analytics", but not always. Data mining's definitions usually entail the discovery of non-trivial, useful patterns/knowledge/insights from data -- if you "dig" enough, you get a "nugget." This is a fairly abstract definition and therefore envelops a wide range of analytical techniques. On the other hand, predictive analytics is basically the commerical deployment of predictive modeling specifically (that is, in academic jargon, supervised learning, i.e., optimizing a statitistical model over labeled/historical cases). In business applications, this basically translates to a model that produces a score for each customer, prospect, or other unit of interest (business/outlet location, SKU, etc), which is roughly the working definition we posted on the Predictive Analytics World website. This would seem to potentially exclude related data mining methods such as forecasting, association mining and clustering (unsupervised learning), but, naturally, we include some sessions at the conference on these topics as well, such as your extremely-well-received session on forecasting October 2009 in DC.

Q: How do you split your time between conference organizing and analytical consulting work?  (That's my polite way of trying to rephrase a question I was once asked: "What's the split between spewing and doing?")

When one starts spewing a lot, there becomes much less time for doing. In the last 2 years, as my 2-day seminar on predictive analytics has become more frequent (both as public and customized on-site training sessions - see, and I helped launch Predictive Analytics World, my work in services has become less than half my time, and I now spend very little time doing hands-on, playing a more advisory and supervisory role for clients, alongside other senior consultants who do more hands-on for Prediction Impact services engagements.

Q: I can't help noticing that you have a Ph.D.  As someone without any advanced degrees, I'm pretty good at rationalizing away their importance, but I want to give you a chance to explain what competitive advantage it gives you.

The doctorate is a research-oriented degree, and the Ph.D. dissertation is in a sense a "hazing" process. However, it's become clear to me that the degree is very much net positive for my commercial career. People know it entails a certain degree of discipline and aptitude. And, even if I'm not conducting academic research most of the time, every time one applies analytics there there is an experimental component to the task. On the other hand, many of the best data miners - the "rock star" consultants such as yourself - did not need a doctorate program in order to become great at data mining.

Q: Moving away from the personal, how do you think the move of data and computing power into the cloud is going to change data mining?

I'd say there's a lot of potential in making parallelized deployment more readily available to any and all data miners.  But, of all the hot topics in analytics, I feel this is the one into which I have the least visibility. It does, after all, pertain more to infrastucture and support than to the content, meaning and insights gained from analysis.

But, turning to the relevant experts, be sure to check out Feb PAW's upcoming session, "In-database Vs. In-cloud Analytics: Implications for Deployment" - see

Q: Can you give examples of problems that once seemed like hot analytical challenges that have now become commoditized?

Great question. Hmm... common core analyical methods such as decision trees and logistic regression may be the only true commodities to date in our field. What do you think?

Q: There are some tasks that we used to get hired for 10 or 15 years ago that no one comes to us for these days. Direct mail response models is an example. I think people feel like they know how to do those themselves. Or maybe that is something the data vendors pretty much give away with the data.

Which of today's hot topics in data mining do you see as ripe for commiditization?

UPLIFT (incremental lift) modeling is branching out, with applications going beyond response and churn modeling (see

Expanding traditional data sets with SOCIAL DATA is continuing to gain traction across a growing range of verticals as analytics pracitioners find great value (read: tremendous increases in model lift) leveraging the simple fact that people behave similarly to those to whom they're socially connected. Just as the healthcare industry has discovered that quitting smoking is "contagious" and that the risk of obesity dramatically increases if you have an obese friend, telecommunications, online social networks and other industries find that "birds of a feather" churn and even commit fraud "together". Is this more because people influence one-another, or because they befriend others more like themselves?  Either way, social connections are hugely predictive of the customer behaviors that matter to business.

Q: There have been several articles in the popular press recently, like this one in the NY Times,  saying that statistics and data mining are the hottest fields a young person could enter right now.  Do you agree?

Well, for the subjective reasons in my answer to your first question above, I would heartily agree. If I recall, that NY Times article focused on the demand for data miners as the career's central appeal. Indeed, it is a very marketable skill these days, which certainly doesn't hurt.

Labels: , ,

Thursday, December 17, 2009

What do group members have in common?

We received the following question via email.


I have a data set which has both numeric and string attributes. It is a data set of our customers doing a particular activity (eg: customers getting one particular loan). We need to find out the pattern in the data or the set of attributes which are very common for all of them.

Classification/regression not possible , because there is only one class
Association rule cannot take my numeric value into consideration
clustering clusters similar people, but not common attributes.

 What is the best method to do this? Any suggestion is greatly appreciated.

The question "what do all the customers with a particular type of loan have in common"  sounds seductively reasonable. In fact, however, the question is not useful at all because the answer is "Almost everything."  The proper question is "What, if anything, do these customers have in common with one another, but not with other people?"  Because people are all pretty much the same, it is the tiny ways they differ that arouse interest and even passion.  Think of two groups of Irishmen, one Catholic and one Protestant. Or two groups of Indians, one Hindu and one Muslim. If you started with members of only one group and started listing things they had in common, you would be unlikely to come up with anything that didn't apply equally to the other group as well.

So, what you really have is a classification task after all.  Take the folks who have the loan in question and an equal numbers of otherwise similar customers who do not. Since you say you have a mix of numeric and string attributes, I would suggest using decision trees. These can split equally well on numeric values ( x>n ) or categorical variables ( model in ('A','B','C') ). If the attributes you have are, in fact, able to distinguish the two groups, you can use the rules that describe leaves that are high in holders of product A as "what holders of product A have in common" but that is really shorthand for "what differentiates holders of product A from the rest of the world."

Labels: , ,

Friday, October 23, 2009

PAW conference, privacy issues, déjà vu

I attended the Predictive Analytics World conference this week and I thought it was very succesful. Although the conference was fairly small, I heard several interesting presentations and ran into several interesting attendees. In other words, I think the density of interesting people on both sides of the podium was higher than at some larger conferences.

One of the high points for me was a panel discussion on consumer privacy issues. Normally, I find panel discussions a waste of time, but in this case the panel members had clearly given a lot of thought to the issues and had some interesting things to say.  The panel consisted of Stephen Baker, a long-time Business Week  writer and author of  The Numerati, (a book I haven't read, but which, I gather, suggests that people like me are using our data mining prowess to rule the world); Jules Polonetsky, currently of the Future of Privacy Forum, and previously Chief Privacy Officer and SVP for Consumer Advocacy at AOL, Chief Privacy Officer and Special Counsel at DoubleClick, New York City Consumer Affairs Commissioner in the Giuliani administration; and Mikael Hagström, Executive Vice President, EMEA and Asia Pacific for SAS. I was particularly taken by Jules's idea that companies that use personal information to provide services that would not otherwise be possible should agree on a universal symbol for "smart" kind of like the easily recognizable symbol for recycling. Instead of (well, I guess it would have to be in addition to) a privacy policy that no one reads and is all about how little they know about you and how little use they will make of it, the smart symbol on a web site would be a brag about how well the service provider can leverage your profile to improve your experience. Clicking on it would lead you to the details of what they now know about you, how they plan to use it, and what's in it for you. You would also be offered an opportunity to fill in more blanks and make corrections. Of course, every "smart" site would also have a "dumb" version for users who don't choose to opt in.

This morning, as I was telling Gordon about all this in a phone call, we started discussing some of our own feelings about privacy issues, many of which revolve around the power relationship between us as individuals and the organization wishing to make use of information about us. If the supermarket wants to use my loyalty card data to print coupons for me, I really don't mind. If an insurance company wants to use that same loyalty card data to deny me insurance because I buy too much meat and alchohol, I mind a lot. As I gave that example, I had an overwhelming feeling of déjà  vu. Or perhaps it was déjà lu? In fact, it was déjà écrit! I had posted a blog entry on this topic ten years ago, almost to the day. Only there weren't any blogs back then so attention-seeking consultants wrote columns in magazines instead. This one, which appeared in the October 26, 1999 issue of Intelligent Enterprise, said what I was planning to write today pretty well.

Labels: ,

Friday, October 16, 2009

SVM with redundant cases

We received the following question from a reader:

I just discovered this blog -- it looks great. I apologize if this question has been asked before -- I tried searching without hits.

I'm just starting with SVMs and have a huge amount of data, mostly in the negative training set (2e8 negative examples, 2e7 positive examples), with relatively few features (eg less than 200). So far I've only tried linear SVM (liblinear) due to the size, with middling success, and want to under-sample at least the negative set to try kernels.

A very basic question. The bulk of the data is quite simple and completely redundant -- meaning many examples of identical feature sets overlapping both positive and negative classes. What differs is the frequency in each class. I think I should be able to remove these redundant samples and simply tell the cost function the frequency of each sample in each class. This would reduce my data by several orders of magnitude.

I have been checking publications on imbalanced data but I haven't found this simple issue addressed. Is there a common technique?

Thanks for any insight. Will start on your archives.
There are really two parts to the question. The first part is a general question about using frequencies to reduce the number of records. This is a fine approach. You can list each distinct record only once along with its frequency. The frequency counts how many times a particular pattern of feature values (including the class assigned to the target) appears. The second part involves the effect on the SVM algorithm of having many cases with identical features but different assigned classes. That sounded problematic to me, but since I am not an expert on support vector machines, I forwarded your question to someone who is--Lutz Hamel, author of Knowledge Discovery with Support Vector Machines.

Here is his reply:

I have some fundamental questions about the appropriateness of SVM for this classification problem:

Identical observation feature vectors produce different classification outcomes. If this is truly meaningful then we are asking the SVM to construct a decision plane through a point with some of the examples in this point classified as positive and some as negative. This is not possible. This means one of two things: (a) we have a sampling problem where different observations are mapped onto the same feature vectors. (b) we have a representation problem where the feature vector is not powerful enough to distinguish observations that should be distinguished.

It seems to me that this is not a problem of a simple unbalanced dataset but a problem of encoding and perhaps coming up with derived features that would make this a problem suitable for decision plane based classification algorithms such as SVMs. (is assigning the majority label to points that carry multiple observations an option?)
SVM tries to find a hyperplane that separates your classes. When, (as is very common with things such as marketing response data, or default, or fraud, or pretty much any data I ever work with), there are many training cases where identical values of the predictors lead to different outcomes, support vector machines are probably not the best choice. One alternative you could consider is decision trees. So long as there is a statistically significant difference in the distribution of the target classes, a decision tree can make splits. Any frequently occuring pattern of features will form a leaf and, taking the frquencies into account, the proportion of each class in the leaf provides estimates of the probabilities for each class given that pattern.

Labels: , ,

Monday, June 8, 2009

Confidence in Logistic Regression Coefficients

I work in the marketing team of a telecom company and I recently encountered an annoying problem with an upsell model. Since the monthly sale rate is less than 1% of our customer base, I used oversampling as you mentioned in your book ‘Mastering data mining’ with data over the last 3 sales months so that I had a ratio of about 15% buyers and 85% non-buyers (sample size of about 20K). Using alpha=5%, I got parameter estimates which were from a business perspective entirely explicable. However, when I then re-estimated the model on the total customer base to obtain the ‘true’ parameter estimates which I will use for my monthly scoring two effects were suddenly insignificant at alpha=5%.

I never encountered this and was wondering what to do with these effects: should I kick them out of the model or not ? I decided to keep them in since they did have some business meaning and concluded that they must have become insignificant since it is only a micro-segment in your entire population.
To your opinion, did I interpret this correctly ? . . .
Many thanks in advance for your advice,

Michael responds:

Hi Wendy,

This question has come up on the blog before. The short answer is that with a logistic regression model trained at one concentration of responders, it is a bit tricky to adjust the model to reflect the actual probability of response on the true population. I suggest you look at some papers by Gary King on this topic.

Gordon responds:

Wendy, I am not sure that Prof. King deals directly with your issue, of changing confidence in the coefficients estimates. To be honest, I have never considered this issue. Since you bring it up, though, I am not surprised that it may happen.

My first comment is that the results seem usable, since they are explainable. Sometimes statistical modeling stumbles on relationships in the data that make sense, although they may not be fully statistically significant. Similarly, some relationships may be statistically significant, but have no meaning in the real world. So, use the variables!

Second, if I do a regresson on a set of data, and then duplicate the data (to make it twice as big) and run it again, I'll get the same estimates as on the orignal data. However, the confidence in the coefficients will increase. I suspect that something similar is happening on your data.

If you want to fix that particular problem, then use a tool (such as SAS Enterprise Miner and probably proc logistic) that supports a frequency option on each row. Set the frequency to one for the more common events and to an appropriate value less than one for more common events. I do this as a matter of habit, because it works best for decision trees. You have pointed out that the confidence in the coefficients is also affected by the frequencies, so this is a good habit with regressions as well.

Labels: , , ,

Monday, April 13, 2009

Customer-Centric Forecasting White Paper Available

In our consulting practice, we work with many subscription-based businesses including newspapers, mobile phone companies, and software-as-a-service providers. All of these companies need to forecast future subscriber levels. With production support from SAS, I have recently written a white paper describing our approach to creating such forecasts. Very briefly, the central idea is that the subscriber population is a constantly changing mix of customer segments based on geography, acquisition channel, product mix, subscription type, payment type, demographic characteristics, and the like. Each of these segments has a different survival curve. Overall subscriber numbers come from aggregating planned additions and forecast losses at the segment level. Managers can simulate the effects of alternative acquisition strategies by changing assumptions about the characteristics of future subscribers and watching how the forecast changes. The paper is available on our web site. I will also be presenting a keynote talk on customer-centric forecasting on July 1st at the A2009 conference in Copenhagen.

Labels: ,

Wednesday, April 8, 2009

MapReduce, Hadoop, Everything Old Is New Again

One of the pleasures of aging is watching younger generations discover pleasures one remembers discovering long ago--sex, porcini, the Beatles. Occasionally though, it is frustrating to see old ideas rediscovered as new ones. I am especially prone to that sort of frustration when the new idea is one I once championed unsuccessfully. Recently, I've been feeling as though I was always a Beatles fan but until recently all my friends preferred Herman's Hermits. Of course, I'm glad to see them coming around to my point of view, but still . . .

What brings these feelings up is all the excitement around MapReduce. It's nice to see a parallel programming paradigm that separates the description of the mapping from the description of the function to be applied, but at the same time, it seens a bit underwhelming. You see, I literally grew up with the parallel programming language APL. In the late 60's and early 70's my father worked at IBM's Yorktown Heights research center in the group that developped APL and I learned to program in that language at the age of 12. In 1982 I went to Analogic Corporation to work on an array processor implementation of APL. In 1986, while still at Analogic, I read Danny Hillis's book The Connection Machine and realized that he had designed the real APL Machine. I decided I wanted to work at the company that was building Danny's machine. I was hired by Guy Steele, who was then in charge of the software group at Thinking Machines. In the interview, all we talked about was APL. The more I learned about the Connection Machine's SIMD architecture, the more perfect a fit it seemed for APL or an APL-like language in which hypercubes of data may be partitioned into subcubes of any rank so that arbitrary functions can be applied to them. In APL and its descendents such as J, reduction is just one of rich family of ways that the results of applying a function to various data partitions can be glued together to form a result. I described this approach to parallel programming in a paper published in ACM SIGPLAN Notices in 1990, but as far as I know, no one ever read it. (You can, though. It is available here.) My dream of implementing APL on the Connection Machine gradually faded in the face of commercial reality. The early Connection Machine customers, having already been forced to learn Lisp, were not exactly clamouring for another esoteric language; they wanted Fortran. And Fortran is what I ended up working on. As you can tell, I still have regrets. If we'd implemented a true parallel APL back then, no one would have to invent MapReduce today.

Labels: , ,

Saturday, November 1, 2008

Should model scores be rescaled?

Here’s a quick question for your blog;

- background -

I work in a small team of data miners for a telecommunications company. We usually do ‘typical’ customer churn and mobile (cell-phone) related analysis using call detail records (CDR’s)

We often use neural nets to create a decimal range score between zero and one (0.0 – 1.0), where zero equals no churn and maximum 1.0 equals highest likelihood of churn. Another dept then simply sorts an output table in descending order and runs the marketing campaigns using the first 5% (or whatever mailing size they want) of ranked customers.

- problem -

We have differing preferences in the distribution of our prediction score for churn. Churn occurs infrequently, lets say 2% (it is voluntary churn of good fare paying customers) per month. So 98% of customers have a score of 0.0 and 2% have a score of 1.0.

When I build my predictive model I try to mimic this distribution. My view that is most of the churn prediction scores would be skewed toward 0.1 or 0.2, say 95% of all predicted customers, and from 0.3 to 1.0 of the churn score would apply to maybe 5% of the customer base.

Some of my colleagues re-scale the prediction score so that there are an equal number of customers spread throughout.

- question -

What are your views/preferences on this?

I see no reason to rescale the scores. Of course, if the only use of the scores is to mail the top 5% of the list it makes no difference since the transformation preserves the ordering, but for other applications you want the score to be an estimate of the actual probability of cancellation.

In general, scores that represent the probability of an event are more useful than scores which only order a list in descending order by probability of the event. For example, in a campaign response model, you can multiply the probability that a particular prospect will respond by the value of that response to get an expected value of making the offer. If the expected value is greater than the cost, the offer should not be made. Gordon and I discuss this and related issues in our book Mastering Data Mining.

This issue often comes up when stratified sampling is used to create a balanced model set of 50% responders and 50% non-responders. For some modeling techniques--notably, decision trees--a balanced model set will produce more and better rules. However, the proportion of responders at each leaf is no longer an estimate of the actual probability of response. The solution is simple: simply apply the model to a test set that has the correct distribution of responders to get correct estimates of the response probability.


Labels: ,