Friday, October 23, 2009

Counting Users From Unique Cookies

Counting people/unique visitors/users at web sites is a challenge, and it is something that I've been working on for the past couple of months for the web site of a large media company. The goal is to count the number of distinct users over the course of a month. Counting distinct cookies is easy; the challenge is turning cookies into human beings. The challenges include:

  • Cookie deletions. A user may manually delete their cookies one or more times during the month.

  • Disallowing first party cookies. A user may allow session cookies (while the browser is running), but not allow the cookies to be committed to disk.

  • Multiple browsers. A single user may use multiple browsers on the same machine during the month. This is particularly true when the user upgrades his or her browser.

  • Multiple machines. A single user may use multiple machines during the month.

And, I have to admit, the data that I'm using has one more problem, which is probably not widespread. The cookies are actually hashed into four bytes. This means that it is theoretically possible for two "real" cookies to have the same hash value. Not only is it theoretically possible, it actually happens (although not too frequently).
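For a sense of scale, the standard birthday approximation shows how quickly collisions become likely in a 32-bit (four-byte) hash space. This is my own illustrative sketch; the cookie counts below are made up and are not from the site in question:

    # Rough sketch: probability of at least one collision among n random
    # values in a 32-bit hash space, using the standard birthday approximation.
    import math

    HASH_SPACE = 2 ** 32   # four-byte hash

    def collision_probability(n, space=HASH_SPACE):
        return 1 - math.exp(-n * (n - 1) / (2 * space))

    for n in (10_000, 100_000, 1_000_000):
        print(f"{n:>9,} cookies -> P(at least one collision) = {collision_probability(n):.3f}")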

I came across a very good blog by Angie Brown that lays out the assumptions in making the calculation, including a spreadsheet for varying the assumptions. One particularly interesting factoid from the blog is that the number of cookies that appear only once during the month exceeds the number of unique visitors, even under quite reasonable assumptions. Where I am working, one camp believes that the number of unique visitors is approximated by the number of unique cookies.


A white paper by ComCast states that the average user has 2.5 unique cookies per month due to cookie deletion. The paper is here, and a PR note about it is here. This paper is widely cited, although it has some serious methodological problems because its data sources are limited to DoubleClick and Yahoo!.


In particular, Yahoo! is quite clear about its cookie expiration policies (two weeks for users clicking the "keep me logged in for 2 weeks" box and eight hours for Yahoo! mail). I do not believe that this policy has changed significantly in the last few years, although I am not 100% sure.


The white paper from ComCast does not mention these facts. Given Yahoo!'s expiration policy, most of the cookies a user accumulates are due to automatic expiration, not user behavior. How many distinct cookies does a user have, due only to the user's behavior?

If I make the following assumptions:

  • The Yahoo! users have an average of 2.5 cookies per month.

  • ComCast used the main Yahoo! cookies, and not the Yahoo! mail cookies.

  • All Yahoo! users use the site consistently throughout the month.

  • All Yahoo! users have the "keep me logged in for 2 weeks" box checked.

Then I can estimate the number of cookies per user per machine per month. The average user would have 31/14 = 2.2 cookies per month, strictly due to automatic expiration. This leaves 0.3 cookies per month due to manual deletion. Of course, the user starts with one cookie. So the average number of cookies per month per user per machine is 1.3.
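The same arithmetic as a quick sanity check (my own sketch; the 2.5 cookies per month and the two-week expiration are simply the assumptions listed above):

    # Rough estimate of behavior-driven cookies per user per machine per month,
    # under the assumptions listed above.
    DAYS_PER_MONTH = 31
    EXPIRATION_DAYS = 14              # "keep me logged in for 2 weeks"
    REPORTED_COOKIES_PER_MONTH = 2.5  # figure cited in the white paper

    # New cookies generated by automatic expiration during the month.
    auto_cookies = DAYS_PER_MONTH / EXPIRATION_DAYS             # about 2.2

    # Whatever is left over is attributed to manual deletion.
    manual_cookies = REPORTED_COOKIES_PER_MONTH - auto_cookies  # about 0.3

    # The user starts the month with one cookie, so cookies due only to
    # the user's own behavior:
    behavior_cookies = 1 + manual_cookies                       # about 1.3
    print(round(auto_cookies, 1), round(manual_cookies, 1), round(behavior_cookies, 1))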

By the way, I find this number much more reasonable. I also think it misses the larger source of overcounting -- users who use more than one machine. Unfortunately, there is no single approach for handling that. In the case that I'm working on, we have the advantage that a minority of users are registered, so we can use them as a sample.




PAW conference, privacy issues, déjà vu

I attended the Predictive Analytics World conference this week and I thought it was very successful. Although the conference was fairly small, I heard several interesting presentations and ran into a number of interesting attendees. In other words, I think the density of interesting people on both sides of the podium was higher than at some larger conferences.

One of the high points for me was a panel discussion on consumer privacy issues. Normally, I find panel discussions a waste of time, but in this case the panel members had clearly given a lot of thought to the issues and had some interesting things to say. The panel consisted of Stephen Baker, a long-time Business Week writer and author of The Numerati (a book I haven't read, but which, I gather, suggests that people like me are using our data mining prowess to rule the world); Jules Polonetsky, currently of the Future of Privacy Forum, and previously Chief Privacy Officer and SVP for Consumer Advocacy at AOL, Chief Privacy Officer and Special Counsel at DoubleClick, and New York City Consumer Affairs Commissioner in the Giuliani administration; and Mikael Hagström, Executive Vice President, EMEA and Asia Pacific for SAS.

I was particularly taken by Jules's idea that companies that use personal information to provide services that would not otherwise be possible should agree on a universal symbol for "smart," kind of like the easily recognizable symbol for recycling. Instead of (well, I guess it would have to be in addition to) a privacy policy that no one reads and that is all about how little they know about you and how little use they will make of it, the smart symbol on a web site would be a brag about how well the service provider can leverage your profile to improve your experience. Clicking on it would lead you to the details of what they now know about you, how they plan to use it, and what's in it for you. You would also be offered an opportunity to fill in more blanks and make corrections. Of course, every "smart" site would also have a "dumb" version for users who don't choose to opt in.

This morning, as I was telling Gordon about all this in a phone call, we started discussing some of our own feelings about privacy issues, many of which revolve around the power relationship between us as individuals and the organization wishing to make use of information about us. If the supermarket wants to use my loyalty card data to print coupons for me, I really don't mind. If an insurance company wants to use that same loyalty card data to deny me insurance because I buy too much meat and alcohol, I mind a lot. As I gave that example, I had an overwhelming feeling of déjà vu. Or perhaps it was déjà lu? In fact, it was déjà écrit! I had written about this topic ten years ago, almost to the day. Only there weren't any blogs back then, so attention-seeking consultants wrote columns in magazines instead. This one, which appeared in the October 26, 1999 issue of Intelligent Enterprise, said what I was planning to write today pretty well.

Saturday, October 17, 2009

See you at a conference soon

It appears to be conference season. I'll be taking part in several conferences over the next few weeks and would enjoy meeting any readers who will also be in attendance. These are all business-oriented conferences with a practical rather than academic focus.

First up is Predictive Analytics World in Alexandria, VA, October 20-21. I will be speaking on our experience using survival analysis to forecast future subscriber levels for a variety of subscription-based businesses. This will be my first time attending this conference, but the previous iteration in San Francisco got rave reviews.

Next on my calendar is M2009 in Las Vegas, NV, October 26-27. This conference is sponsored by SAS, but it is a general data mining conference, not a SAS users group or anything like that. I've been attending since the first one in 1998 (I think) and have watched it grow into the best-attended of the annual data mining conferences. Gordon Linoff and I will be debuting the latest version of our three-day class Data Mining Techniques: Theory and Practice as part of the post-conference training October 28-30. Too bad about the location, but I guess conference space is cheap in Las Vegas.

I only get to spend a partial weekend at home before leaving for TDWI World in Orlando, FL. This is not, strictly speaking, a data mining conference (the DW stands for "data warehouse"), but once people have all that data warehoused, they may want to use it for predictive modeling. I will be teaching a one-day class called Predictive Modeling for Non-Statisticians on Tuesday, November 3. That is election day where I live, so I had better get my absentee ballot soon.

The following week, it's off to San Jose, CA for the Business Analytics Summit November 12-13. I believe this is the first annual running of this conference. I'll be on a panel discussing data mining and business analytics. Perhaps I will discover what the phrase "business analytics" means! That would come in handy, since I will be teaching a class on Marketing Analytics at Boston College's Carroll School of Management next semester. My guess: it means the same as data mining, but some people don't like that word, so they've come up with a new one.

Friday, October 16, 2009

SVM with redundant cases

We received the following question from a reader:

I just discovered this blog -- it looks great. I apologize if this question has been asked before -- I tried searching without hits.

I'm just starting with SVMs and have a huge amount of data, mostly in the negative training set (2e8 negative examples, 2e7 positive examples), with relatively few features (eg less than 200). So far I've only tried linear SVM (liblinear) due to the size, with middling success, and want to under-sample at least the negative set to try kernels.

A very basic question. The bulk of the data is quite simple and completely redundant -- meaning many examples of identical feature sets overlapping both positive and negative classes. What differs is the frequency in each class. I think I should be able to remove these redundant samples and simply tell the cost function the frequency of each sample in each class. This would reduce my data by several orders of magnitude.

I have been checking publications on imbalanced data but I haven't found this simple issue addressed. Is there a common technique?

Thanks for any insight. Will start on your archives.

There are really two parts to the question. The first part is a general question about using frequencies to reduce the number of records. This is a fine approach. You can list each distinct record only once, along with its frequency. The frequency counts how many times a particular pattern of feature values (including the class assigned to the target) appears. The second part involves the effect on the SVM algorithm of having many cases with identical features but different assigned classes. That sounded problematic to me, but since I am not an expert on support vector machines, I forwarded your question to someone who is -- Lutz Hamel, author of Knowledge Discovery with Support Vector Machines.
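Before getting to his reply, here is a rough sketch of the first part -- collapsing identical rows into one row plus a frequency, and passing the frequency as an instance weight. This is my own illustration with invented file and column names, using pandas and scikit-learn rather than liblinear directly:

    # A minimal sketch: collapse duplicate rows into (distinct pattern, frequency)
    # pairs and pass the frequency as an instance weight.
    import pandas as pd
    from sklearn.svm import LinearSVC

    df = pd.read_csv("training_data.csv")            # hypothetical file
    feature_cols = [c for c in df.columns if c != "label"]

    # One row per distinct (features, label) combination, with its count.
    collapsed = (df.groupby(feature_cols + ["label"])
                   .size()
                   .reset_index(name="freq"))

    # The frequency plays the role of repeating the row that many times.
    model = LinearSVC()
    model.fit(collapsed[feature_cols], collapsed["label"],
              sample_weight=collapsed["freq"])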

Here is his reply:

I have some fundamental questions about the appropriateness of SVM for this classification problem:

Identical observation feature vectors produce different classification outcomes. If this is truly meaningful then we are asking the SVM to construct a decision plane through a point with some of the examples in this point classified as positive and some as negative. This is not possible. This means one of two things: (a) we have a sampling problem where different observations are mapped onto the same feature vectors. (b) we have a representation problem where the feature vector is not powerful enough to distinguish observations that should be distinguished.

It seems to me that this is not a problem of a simple unbalanced dataset but a problem of encoding and perhaps of coming up with derived features that would make this a problem suitable for decision-plane based classification algorithms such as SVMs. (Is assigning the majority label to points that carry multiple observations an option?)

SVM tries to find a hyperplane that separates your classes. When there are many training cases where identical values of the predictors lead to different outcomes (as is very common with marketing response data, default, fraud, or pretty much any data I ever work with), support vector machines are probably not the best choice. One alternative you could consider is decision trees. So long as there is a statistically significant difference in the distribution of the target classes, a decision tree can make splits. Any frequently occurring pattern of features will form a leaf and, taking the frequencies into account, the proportion of each class in the leaf provides estimates of the probabilities for each class given that pattern.
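To make the decision-tree alternative concrete, here is a hedged sketch along the same lines as the earlier one (again with invented file and column names); the frequencies go in as sample weights, and the leaf proportions come back out of predict_proba:

    # Sketch: a decision tree on the collapsed data, with the row frequency
    # passed as a sample weight. The weighted class proportions in each leaf
    # become the probability estimates for that feature pattern.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    df = pd.read_csv("training_data.csv")            # hypothetical file, as above
    feature_cols = [c for c in df.columns if c != "label"]
    collapsed = (df.groupby(feature_cols + ["label"])
                   .size()
                   .reset_index(name="freq"))

    # Require each leaf to hold at least 0.1% of the total weight, so leaf
    # proportions are based on a reasonable number of underlying cases.
    tree = DecisionTreeClassifier(min_weight_fraction_leaf=0.001)
    tree.fit(collapsed[feature_cols], collapsed["label"],
             sample_weight=collapsed["freq"])

    # For each distinct pattern, the estimated probability of each class.
    leaf_probs = tree.predict_proba(collapsed[feature_cols])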

Monday, September 21, 2009

Data Mining and Statistics

We recently received the following question from a reader. (Hi Brad!)

What is the difference between Data Mining and statistics and should I care?

The way I think about it, data mining is the process of using data to figure stuff out. Statistics is a collection of tools used for understanding data. We explicitly use statistical tools all the time to answer questions such as "is the observed change in conversion rate (or response, or order size, . . .) significant or might it be just due to chance?" We also use statistics implicitly when, for example, a chi-square test inside a decision tree algorithm decides which of several candidate splits will make it into a model. When I make a histogram showing the number of orders by order size bin, I am really exploring a distribution, although I may not choose to describe it that way. So, data miners use statistics, and much of what statisticians do might be called data mining.

There is, however, a cultural difference between people who call themselves statisticians and people who call themselves data miners. This difference has its origins in different expectations about data size. Statistics grew up in an era of small data and many statisticians still live in that world. There are strong practical and budgetary limits to how many patients you can recruit for a clinical trial, for instance. Statisticians have to extract every last drop of information from their small data sets, and so they have developed a lot of clever tools for doing that. Data miners tend to live in a big data world. With big data, we can often replace cleverness with more data. Gordon's most recent post on oversampling is an example. If you have sufficient data that you can throw away most of the common cases and still have enough data to work with, that really is easier than keeping track of lots of weights. Similarly, with enough data, it is much easier (and more accurate) to estimate the probability that a subscriber will cancel at a tenure of 100 days by counting the people who quit at that tenure and dividing by the much larger number of people who reached that tenure and so could have quit, than to make assumptions about the shape of the hazard function.
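As an illustration of the counting approach (my own sketch, with an invented file and column names), the empirical hazard at a given tenure is just one count divided by another:

    # Sketch: empirical hazard at tenure 100 by simple counting.
    # Assumes one row per subscriber with the tenure (in days) reached so far
    # and a 0/1 flag for whether the subscriber stopped at that tenure.
    import pandas as pd

    subscribers = pd.read_csv("subscribers.csv")     # hypothetical file
    t = 100

    # Everyone who reached tenure t was at risk of cancelling at t.
    at_risk = subscribers[subscribers["tenure"] >= t]

    # Those who actually stopped at exactly tenure t.
    stopped_at_t = at_risk[(at_risk["tenure"] == t) & (at_risk["stopped"] == 1)]

    hazard_at_t = len(stopped_at_t) / len(at_risk)
    print(f"empirical hazard at tenure {t}: {hazard_at_t:.4f}")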

Tuesday, September 15, 2009

Adjusting for Oversampling

We recently received two similar questions about oversampling . . .

If you don't mind, I would like to ask you a Question regarding Oversampling as you wrote in your book (Mastering Data Mining...).

I can understand how you calculate predictive lift when using oversampling, though I don't know how to do it for the confusion matrix.

Would you mind telling me how do I compute then the confusion matrix for the actual population (not the oversampled set)?

Thanks in advance for your reply and help.

Best,
Diego


Gentlemen-

I have severely unbalanced training data (180K negative cases, 430 positive cases). Yeah...very unbalanced.

I fit a model in a software program that allows instance weights (weka). I give all the positive cases a weight of 1 and all the negative cases a weight of 0.0024. I fit a model (not a decision tree so running the data through a test set is not an option to recalibrate) - like a neural network. I output the probabilities and they are out of whack - good for predicting the class or ranking but not for comparing predicted probability against actual.

What can we do to fit a model like this but then output probabilities that are in line with the distribution? Is this new (wrong) probabilities just the price we have to pay for instance weights to (1) get a model to build (2) get reasonably good classification? Can I have my cake and eat it too (classification and probs that are close to actual)?

Many many thanks!
Brian


The problem in these cases is the same. The goal is to predict a class, usually a binary class, where one outcome is rarer than the other. To generate the best model, some method of oversampling is used so the model set has equal numbers of the two outcomes. There are two common ways of doing this. Diego is probably using all the rare outcomes and an equal-sized random sample of the common outcomes. This is most useful when there are a large number of cases, and reducing the number of rows makes the modeling tools run faster. Brian is using a method where weights are used for the same purpose. Rare cases are given a weight of 1 and common cases are given a weight less than 1, so that the sum of the weights of the two groups is equal.
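As a small sketch of the two approaches (my own illustration, assuming a pandas DataFrame with a binary "outcome" column where 1 marks the rare class):

    # Sketch of the two oversampling approaches described above.
    import pandas as pd

    df = pd.read_csv("model_set.csv")                # hypothetical file
    rare = df[df["outcome"] == 1]
    common = df[df["outcome"] == 0]

    # Approach 1 (Diego): all the rare cases plus an equal-sized random
    # sample of the common cases.
    balanced = pd.concat([rare, common.sample(n=len(rare), random_state=1)])

    # Approach 2 (Brian): keep every row, but weight the common cases down
    # so the two groups have equal total weight.
    df["weight"] = 1.0
    df.loc[df["outcome"] == 0, "weight"] = len(rare) / len(common)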

Regardless of the technique (neural network, decision trees, logistic regression, nearest neighbor, and so on), the resulting probabilities are "directionally" correct. A group of rows with a larger probability is more likely to have the modeled outcome than a group with a lower probability. This is useful for some purposes, such as getting the top 10% with the highest scores. It is not useful for other purposes, where the actual probability is needed.

Some tools can back into the desired probabilities and do correct calculations for lift and for the confusion matrix. I think SAS Enterprise Miner, for instance, uses prior probabilities for this purpose. I say "think" because I do not actually use this feature. When I need to do this calculation, I do it manually, because not all tools support it. And even when they do, why bother learning how, when I can easily do the necessary calculations in Excel?

The key idea here is simply counting. Assume that we start with data that is 10% rare and 90% common, and we oversample so it is 50%-50%. The relationship between the original data and the model set is:
  • rare outcomes: 10% --> 50%
  • common outcomes: 90% --> 50%
To put it differently, each rare outcome in the original data is worth 5 in the model set. Each common outcome is worth 5/9 in the model set. We can call these numbers the oversampling rates for each of the outcomes.

We now apply these mappings to the results. Let's answer Brian's question for a particular situation. Say we have the above data and a result has a modeled probability of 80%. What is the actual probability?

Well, 80% means that there are 0.80 rare outcomes for every 0.20 common ones. Let's undo the mapping above:
  • 0.80 / 5 = 0.16
  • 0.20 / (5/9) = 0.36
So, the expected probability on the original data is 0.16/(0.16+0.36) = 30.8%. Notice that the probability has decreased, but it is still larger than the 10% in the original data. Also notice that the lift on the model set is 80%/50% = 1.6. The lift on the original data is 3.08 (30.8% / 10%). The expected probability goes down, and the lift goes up.
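The same arithmetic works for any model-set score, not just 80%. Here is a hedged sketch of the general calculation, using the 10% rare / 50-50 example from above:

    # Sketch: convert a probability estimated on the oversampled model set
    # back to the scale of the original data.
    def adjust_probability(p_model, rare_orig=0.10, rare_model=0.50):
        """Undo the oversampling by dividing each piece by its oversampling rate."""
        rare_rate = rare_model / rare_orig                # 0.50 / 0.10 = 5
        common_rate = (1 - rare_model) / (1 - rare_orig)  # 0.50 / 0.90 = 5/9
        rare_part = p_model / rare_rate
        common_part = (1 - p_model) / common_rate
        return rare_part / (rare_part + common_part)

    print(adjust_probability(0.80))   # about 0.308, matching the calculation above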

This calculation can also be used for the classification matrix (or confusion matrix). In this case, you just have to divide each cell by the appropriate oversampling rate. So, if the confusion matrix said:
  • 10 rows in the model set are rare and classified as rare
  • 5 rows in the model set are rare and classified as common
  • 3 rows in the model set are common and classified as rare
  • 12 rows in the model set are common and classified as common
(I apologize for not including a table, but that is more trouble than it is worth in the blog.)

In the original data, this means:
  • 2=10/5 rows in the original data are rare and classified as rare
  • 1=5/5 rows in the original data are rare and classified as common
  • 5.4 = 3/(5/9) rows in the original data are common and classified as rare
  • 21.6 = 12/(5/9) rows in the original data are common and classified as common
These calculations are quite simple, and it is easy to set up a spreadsheet to do them.
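Or, instead of a spreadsheet, a few lines of code do the same division. Here is a sketch using the counts above:

    # Sketch: adjust confusion-matrix cells from the model set back to the
    # original data by dividing each actual class by its oversampling rate.
    rare_rate, common_rate = 5.0, 5.0 / 9.0

    # Model-set counts: [classified as rare, classified as common].
    model_matrix = {
        "actual rare":   [10, 5],
        "actual common": [3, 12],
    }
    rates = {"actual rare": rare_rate, "actual common": common_rate}

    original_matrix = {actual: [count / rates[actual] for count in counts]
                       for actual, counts in model_matrix.items()}
    print(original_matrix)   # roughly {'actual rare': [2.0, 1.0], 'actual common': [5.4, 21.6]}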

I should also mention that this method readily works for any number of classes. Having two classes is simply the most common case.

Thursday, September 10, 2009

TDWI Question: Consolidating SAS and SPSS Groups

Yesterday, I had the pleasure of being on a panel for a local TDWI event here in New York focused on advanced analytics (thank you Jon Deutsch). Mark Madsen of Third Nature gave an interesting, if rapid-fire, overview of data mining technologies. Of course, I was excited to see that Mark included Data Analysis Using SQL and Excel as one of the first steps in getting started in data mining -- even before meeting me. Besides myself, the panel included my dear friend Anne Milley from SAS, Ali Pasha from Teradata, and a gentleman from Information Builders whose name I missed.

I found one of the questions from the audience to be quite interesting. The person was from the IT department of a large media corporation. He has two analysis groups, one in Los Angeles that uses SPSS and the other in New York that uses SAS. His goal, of course, is to reduce costs. He prefers to have one vendor. And, undoubtedly, the groups are looking for servers to run their software.

This is a typical IT-type question, particularly in these days of reduced budgets. I am more used to encountering such problems in the charged atmosphere of a client. The more relaxed atmosphere of a TDWI meeting perhaps gives a different perspective.

The groups are doing the same thing from the perspective of an IT director. Diving in a bit further, the two groups do very different things -- at least from my perspective. Of course, both are using software running on computers to analyze data. The group in Los Angeles is using SPSS to analyze survey data. The group in New York is doing modeling using SAS. I should mention that I don't know anyone in the groups, and only have the cursory information provided at the TDWI conference.

Conflict Alert! Neither group wants to change and both are going to put up a big fight. SPSS has a stronghold in the market for analyzing survey data, with specialized routines and procedures to handle this data. (SAS probably has equivalent functionality, but many people who analyze survey data gravitate to SPSS.) Similarly, the SAS programmers in New York are not going to take kindly to switching to SPSS, even if it offers the same functionality.

Each group has the skills and software that they need. Each group has legacy code and methods that are likely tied to their tools. The company in question is not a 20-person start-up. It is a multinational corporation. Although the IT department might see standardizing on a tool as beneficial, in actual fact the two groups are doing different things and the costs of switching are quite high -- and might involve losing skilled people.

This issue brings up the question of what we want to standardize on. The value of advanced analytics comes in two forms. The first is the creative process of identifying new and interesting phenomena. The second is the communication process of spreading the information to where it is needed.

Although people may not think of nerds as being creative, really, we are. It is important to realize that imposing standards or limiting resources may limit creativity, and hence the quality of the results. This does not mean that cost control is unnecessary. Instead, it means that there are intangible costs that may not show up in a standard cost-benefit analysis.

On the other hand, communicating results through an organization is an area where standards are quite useful. Sometimes the results might be captured as a simple email going to the right person. Other times, the communication must go to a broader audience. Whether by setting up an internal wiki, updating model scores in a database, or loading a BI tool, having standards is important in this case. Many people are going to be involved, and these people should not have to learn special tools for one-off analyses -- so, if you have standardized on a BI tool, make the resources available to put in new results. And, from the perspective of the analysts, having standard methods of communicating results simplifies the process of transforming smart analyses into business value.