<?xml version='1.0' encoding='UTF-8'?><rss xmlns:atom='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' version='2.0'><channel><atom:id>tag:blogger.com,1999:blog-3366935554564939610</atom:id><lastBuildDate>Thu, 03 Jul 2008 13:17:12 +0000</lastBuildDate><title>Data Miners Blog</title><description/><link>http://www.data-miners.com/blog/</link><managingEditor>noreply@blogger.com (Michael J. A. Berry)</managingEditor><generator>Blogger</generator><openSearch:totalResults>29</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-7368240624700360998</guid><pubDate>Fri, 06 Jun 2008 01:00:00 +0000</pubDate><atom:updated>2008-06-05T22:11:49.997-04:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>gordon</category><category domain='http://www.blogger.com/atom/ns#'>Ask a data miner</category><category domain='http://www.blogger.com/atom/ns#'>user question</category><title>Qualifications for Studying Data Mining</title><description>&lt;em&gt;A recent question . . . &lt;/em&gt;&lt;br /&gt;&lt;em&gt;&lt;/em&gt;&lt;br /&gt;&lt;em&gt;&lt;span style="color:#000099;"&gt;I am hoping to begin my masters degree in Data Mining. I have come from a Software Development primary degree. I am a bit worried over the math involved in Data Mining.Could you tell me, do I need to have a strong mathematical aptitude to produce a good Thesis on Data Mining?&lt;/span&gt;&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;First, I think a software development background is a good foundation for data mining.  Data mining is as much about data (and hence computers and databases) as it is about analysis (and hence statistics, probability, and math).&lt;br /&gt;&lt;br /&gt;Michael and I are not academics so we cannot speak to the thesis requirements for a particular data mining program.  
Both of us majored in mathematics (many years ago) and then worked as software engineers.  We do have some knowledge of both fields, and the combination provided a good foundation for our data mining work.&lt;br /&gt;&lt;br /&gt;To be successful in data mining, you do need some familiarity with math, particularly applied math -- things like practical applications of probability, algebra, the ability to solve word problems, and the ability to use spreadsheets.  Unlike theoretical statistics, the purpose of data mining is not to generate rigorous proofs of various theorems; the purpose is to find useful patterns in data, to validate hypotheses, to set up marketing tests.  We need to know when patterns are unexpected, and when patterns are expected.&lt;br /&gt;&lt;br /&gt;This is a good place to add a plug for my book &lt;a href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;&lt;em&gt;Data Analysis Using SQL and Excel&lt;/em&gt;&lt;/a&gt;, which has two or three chapters devoted to practical statistics in the context of data analysis.&lt;br /&gt;&lt;br /&gt;In short, if you are math-phobic, then you might want to reconsider data mining.  If your challenges in math are solving complex integrals, then you don't have much to worry about.&lt;br /&gt;&lt;br /&gt;--gordon</description><link>http://www.data-miners.com/blog/2008/06/qualifications-for-studying-data-mining.html</link><author>noreply@blogger.com (Gordon S. 
Linoff)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-2273449026313693722</guid><pubDate>Sat, 17 May 2008 14:20:00 +0000</pubDate><atom:updated>2008-05-17T18:35:04.153-04:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>statistics</category><category domain='http://www.blogger.com/atom/ns#'>gordon</category><title>The Agent Problem:  Sampling From A Finite Population</title><description>&lt;span style="font-style: italic;"&gt;A drawer is filled with socks and you remove eight of them randomly.  Four are black and four are white.  How confident are you in estimating the proportion of white and black socks in the drawer?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The standard statistical approach is to assume that the number of socks in the drawer is infinite, and to use the formula for the standard error of a proportion:  SQRT([proportion] * (1 - [proportion])/[number taken out]) or, more simply, SQRT(p*q/n), where q = 1 - p.  In this case, the standard error is SQRT(0.5*0.5/8) = 17.7%.&lt;br /&gt;&lt;br /&gt;However, this approach clearly does not work in all cases.  For instance, if there are exactly eight socks in the drawer, then the sample consists of all of them.  We are 100% sure that the proportion is exactly 50%.&lt;br /&gt;&lt;br /&gt;If there are ten socks in the drawer, then the proportion of black socks ranges from 4/10 to 6/10.  These extremes are within one standard error of the observed average.  Or to phrase it differently, any reasonable confidence interval (80%, 90%, 95%) contains all possible values.  The confidence interval is wider than what is possible.&lt;br /&gt;&lt;br /&gt;What does this have to do with business problems?  I encountered essentially the same situation when looking at the longitudinal behavior of patients visiting physicians.  I had a sample of patients who had visited the physicians and was measuring the use of a particular therapy for a particular diagnosis.  
Overall, about 20%-30% of all patients were in the longitudinal data.  And, I had pretty good estimates of the number of diagnoses for each physician.&lt;br /&gt;&lt;br /&gt;There are several reasons why this is important.  For the company that provides the therapy, knowing which physicians are using it is important.  In addition, if the company makes any marketing efforts, it would like to see how they perform.  So, the critical question is:  how well does the observed patient data characterize the physician behavior?&lt;br /&gt;&lt;br /&gt;This is very similar to the question posed earlier.  If the patient data contains eight new diagnoses and four start on the therapy of interest, how confident am I that the doctor is starting 50% of new patients on the therapy?&lt;br /&gt;&lt;br /&gt;If there are eight patients in total, then I am 100% confident, since all of them managed to be in my sample.  On the other hand, if the physician has 200 patients, then the statistical measures of standard error are more appropriate.&lt;br /&gt;&lt;br /&gt;The situation is exacerbated by another problem.  Although the longitudinal data contains 20%-30% of all patients, the distribution over the physicians is much wider.  Some physicians have 10% of their patients in the data and some have 50% or more.&lt;br /&gt;&lt;br /&gt;The solution is actually quite simple, but not normally taught in early statistics or business statistics courses.  There is something called the &lt;span style="font-style: italic;"&gt;finite population correction&lt;/span&gt; for exactly this situation.&lt;br /&gt;&lt;code&gt;&lt;br /&gt;[stderr-finite] = [stderr-infinite]*fpc&lt;br /&gt;fpc = SQRT(([population size] - [sample size])/([population size] - 1))&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;So, we simply adjust the standard error and continue with whatever analysis we are using.&lt;br /&gt;&lt;br /&gt;There is one caveat to this approach.  
When the observed proportion is 0% or 100%, the standard error will always be 0, even with the correction.  In this case, we need to have a better estimate.  In practice, I add or subtract 0.5 from the proportion to calculate the standard error.&lt;br /&gt;&lt;br /&gt;This problem is definitely not limited to physicians and medical therapies.  I think it becomes an issue in many circumstances where we want to project a global number onto smaller entities.&lt;br /&gt;&lt;br /&gt;For example, an insurance company may investigate cases for fraud.  Overall, they have a large number of cases, but only 5%-10% are in the investigation.  If they want to use this information to understand fraud at the agent level, then some agents will have 1% investigated and some 20%.  For many of these agents, the correction factor is needed to understand our confidence in their customers' behavior.&lt;br /&gt;&lt;br /&gt;The problem occurs because the assumption of an infinite population is reasonable over everyone.  However, when we break it into smaller groups (physicians or agents), then the assumption may no longer be valid.</description><link>http://www.data-miners.com/blog/2008/05/agent-problem-sampling-from-finite.html</link><author>noreply@blogger.com (Gordon S. Linoff)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-7003932318442616115</guid><pubDate>Sat, 03 May 2008 20:09:00 +0000</pubDate><atom:updated>2008-05-04T16:40:28.278-04:00</atom:updated><title>Adjusting for oversampling</title><description>A few days ago, a reader of this blog used the "ask a data miner" link on the right to mail us this question. 
(Or, these questions.)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style=";font-family:Times New Roman;font-size:100%;color:blue;"   &gt;&lt;span style="font-weight: bold;color:blue;" &gt;Question&lt;/span&gt;&lt;/span&gt;&lt;span style=";font-size:100%;color:blue;"  &gt;&lt;span style="color:blue;"&gt;:&lt;/span&gt;&lt;/span&gt;&lt;p&gt;&lt;/p&gt;  &lt;p&gt;&lt;span style=";font-family:Times New Roman;font-size:100%;color:blue;"   &gt;&lt;span style="color:blue;"&gt;When modeling rare events in marketing, it has been suggested by many to take a sample stratified by the dependent variable(s) in order to allow the modeling technique a better chance of detecting a difference (or differences in the case of k-level targets). The guidance for the proportion of the event in the sample seems to range between 15-50% for a binary outcome (no guidance have I seen for a k-level target). I am confused by this oversampling and have a couple questions I am hoping you can help with:&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p&gt;&lt;span style=";font-family:Times New Roman;font-size:100%;color:blue;"   &gt;&lt;span style="color:blue;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;ol style="margin-top: 0in;" start="1" type="1"&gt;&lt;li  style="color:blue;"&gt;&lt;span style=";font-family:Times New Roman;font-size:100%;color:blue;"   &gt;Is there      a formula for adjusting the predicted probability of the event (s) when      there has been oversampling on the dependent?&lt;/span&gt;&lt;/li&gt;&lt;li  style="color:blue;"&gt;&lt;span style=";font-family:Times New Roman;font-size:100%;color:blue;"   &gt;Is this      correction only needed when ascertaining the lift of the model or      comparing to other models which were trained on a dataset with the same      oversampling proportion OR would you need the adjustment to be done to the      predicted value when you score a new dataset – such as when you      train a model on a previous campaign and use the model to score the      
candidates for the upcoming campaign? &lt;/span&gt;&lt;/li&gt;&lt;li  style="color:blue;"&gt;&lt;span style=";font-family:Times New Roman;font-size:100%;color:blue;"   &gt;I use      logistic regression and decision trees for classification of categorical dependent      variables – does the adjustment from (question 1) apply to both of      these? Does the answer to (question 2) also apply to both of these techniques?&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;Especially with decision tree models, we suggest using stratified sampling to construct a model set with approximately equal numbers of each of the outcome classes. In the most common case, there are two classes and one, the one of interest, is much rarer than the other. People rarely respond to direct mail; loan recipients rarely default; health care providers rarely commit fraud. The difficulty with rare classes is that decision tree algorithms keep splitting the model set into smaller and smaller groups of records that are purer and purer. When one class is very rare, the data passes the purity test before any splits are made. The resulting model always predicts the common outcome. If only 1% of claims are fraudulent, a model that says no claims are fraudulent will be correct 99% of the time! It will also be useless. Creating a balanced model set where half the cases are fraud forces the algorithm to work harder to differentiate between the two classes.&lt;br /&gt;&lt;br /&gt;To answer the first question, yes, there is a formula for adjusting the predicted probability produced on the oversampled data to get the predicted probability on data with the true distribution of classes in the population. Suppose there is 1% fraud in the population and 50% fraud in the model set. Each example of fraud in the model set represents a single actual case of fraud in the population while each non-fraud case in the model set represents 99 non-fraud cases in the population. 
We say the &lt;span style="font-style: italic;"&gt;oversampling rate &lt;/span&gt;is 99. So, if a certain leaf in a decision tree built on the balanced data has 95 fraudulent cases and 5 non-fraudulent cases, the actual probability of fraud predicted by that leaf is 95/(95+5*99) or about 0.16 because each of the 5 non-fraudulent cases represents 99 such cases. We discuss this at length in Chapter 7 of our book, &lt;a href="http://www.amazon.com/exec/obidos/ASIN/0471331236/thedataminers"&gt;Mastering Data Mining&lt;/a&gt;. You can also arrive at this result by applying the model to the original, non-oversampled data and simply counting the number of records of each class found at each leaf. This is sometimes called &lt;span style="font-style: italic;"&gt;backfitting&lt;/span&gt; the model.&lt;br /&gt;&lt;br /&gt;To answer the second question, this calculation is only necessary if you are actually trying to estimate the probability of the classes. If all you want to do is generate scores that can be used to rank order a list or compare lift for several models all built from the oversampled data, there is no need to correct for oversampling because the order of the results will not change.&lt;br /&gt;&lt;br /&gt;Using oversampled data also changes the results of logistic regression models, but in a more complicated way. As it happens, this is a particular interest of Professor Gary King, who taught the only actual class in statistics that I have ever taken. He has written &lt;a href="http://gking.harvard.edu/projects/rareevents.shtml"&gt;several papers&lt;/a&gt; on the subject.</description><link>http://www.data-miners.com/blog/2008/05/adjusting-for-oversampling.html</link><author>noreply@blogger.com (Michael J. A. 
Berry)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-8769636272261992551</guid><pubDate>Thu, 01 May 2008 23:39:00 +0000</pubDate><atom:updated>2008-05-03T15:56:29.604-04:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>statistics</category><category domain='http://www.blogger.com/atom/ns#'>Ask a data miner</category><category domain='http://www.blogger.com/atom/ns#'>user question</category><category domain='http://www.blogger.com/atom/ns#'>marketing</category><title>Statistical Test for Measuring ROI on Direct Mail Test</title><description>&lt;em&gt;If I want to test the effect of return of investment on a mail/ no mail sample, however, I cannot use a parametric test since the distribution of dollar amounts do not follow a normal distribution. What non-parametric test could I use that would give me something similar to a hypothesis test of two samples?&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;Recently, we received an email with the question above. Since it was addressed to &lt;a href="mailto:bloggers@data-miners.com"&gt;bloggers@data-miners.com&lt;/a&gt;, it seems quite reasonable to answer it here.&lt;br /&gt;&lt;br /&gt;First, I need to note that Michael and I are not statisticians. We don't even play one on TV (hmm, that's an interesting idea). However, we have gleaned some knowledge of statistics over the years, much from friends and colleagues who are respected statisticians.&lt;br /&gt;&lt;br /&gt;Second, the question I am going to answer is the following: Assume that we do a test, with a test group and a control group. What we want to measure is whether the average dollars per customer is significantly different for the test group as compared to the control group. The challenge is that the dollar amounts themselves do not follow a known distribution, or the distribution is known not to be a normal distribution. 
For instance, we might only have two products, one that costs $10 and one that costs $100.&lt;br /&gt;&lt;br /&gt;The reason that I'm restating the problem is that a term such as ROI (return on investment) gets thrown around a lot. In some cases, it could mean the current value of discounted future cash flows. Here, though, I think it simply means the dollar amount that customers spend (or invest, or donate, or whatever depending on the particular business).&lt;br /&gt;&lt;br /&gt;The overall approach is that we want to measure the average and standard error for each of the groups. Then, we'll apply a simple "standard error" of the difference to see if the difference is consistently positive or negative. This is a very typical use of a z-score. And, it is a topic that I discuss in more detail in Chapter 3 of my book "&lt;a href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;". In fact, the example here is slightly modified from the example in the book.&lt;br /&gt;&lt;br /&gt;A good place to start is the Central Limit Theorem. This is a fundamental theorem for statistics. Assume that I have a population of things -- such as customers who are going to spend money in response to a marketing campaign. Assume that I take a sample of these customers and measure an average over the sample. Well, as I take more and more samples, the distribution of the averages follows a normal distribution &lt;em&gt;regardless of the original distribution of values&lt;/em&gt;. 
(This is a slight oversimplification of the Central Limit Theorem, but it captures the important ideas.)&lt;br /&gt;&lt;br /&gt;In addition, I can measure the relationship between the characteristics of the overall population and the characteristics of the sample:&lt;br /&gt;&lt;br /&gt;(1) The average of the sample is as good an approximation as any of the average of the overall population.&lt;br /&gt;&lt;br /&gt;(2) The standard error on the average of the sample is the standard deviation of the overall population divided by the square root of the size of the sample. Alternatively, we can phrase this in terms of variance: the variance of the sample average is the variance of the population divided by the size of the sample.&lt;br /&gt;&lt;br /&gt;Well, we are close. We know the average of each sample, because we can measure the average. If we knew the standard deviation of the overall population, then we could get the standard error for each group. Then, we'd know the standard error and we would be done. Well, it turns out that:&lt;br /&gt;&lt;br /&gt;(3) The standard deviation of the sample is as good an approximation as any for the standard deviation of the population. This is convenient!&lt;br /&gt;&lt;br /&gt;Let's assume that we have the following scenario.&lt;br /&gt;&lt;br /&gt;Our test group has 17,839 customers, and the overall average purchase is $85.48. The control group has 53,537 customers, and the average purchase is $70.14. Is this statistically different?&lt;br /&gt;&lt;br /&gt;We need some additional information, namely the standard deviation for each group. For the test group, the standard deviation is $197.23. 
For the control group, it is $196.67.&lt;br /&gt;&lt;br /&gt;The standard error for the two groups is then $197.23/sqrt(17,839) and $196.67/sqrt(53,537), which comes to $1.48 and $0.85, respectively.&lt;br /&gt;&lt;br /&gt;So, now the question is: is the difference of the means ($85.48 - $70.14 = $15.34) significantly different from zero? We need another formula from statistics to calculate the standard error of the difference. This formula says that the standard error of the difference is the square root of the sum of the squares of the two standard errors. So the value is $1.71 = sqrt(0.85^2 + 1.48^2).&lt;br /&gt;&lt;br /&gt;And we have arrived at a place where we can use the z-score. The difference of $15.34 is about 9 standard errors from 0 (that is, 9*1.71 is about 15.34). It is highly, highly, highly unlikely that the true difference is 0, so we can say that the test group is significantly better than the control group.&lt;br /&gt;&lt;br /&gt;In short, we can apply the concepts of normal distributions, even to calculations on dollar amounts. We do need to be careful and pay attention to what we are doing, but the Central Limit Theorem makes this possible. If you are interested in this subject, I do strongly recommend &lt;em&gt;Data Analysis Using SQL and Excel&lt;/em&gt;, particularly Chapter 3.&lt;br /&gt;&lt;br /&gt;--gordon</description><link>http://www.data-miners.com/blog/2008/05/statistical-test-for-measuring-roi-on.html</link><author>noreply@blogger.com (Gordon S. 
Linoff)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-3869525524498422861</guid><pubDate>Mon, 21 Apr 2008 19:06:00 +0000</pubDate><atom:updated>2008-04-21T16:05:50.872-04:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>gordon</category><category domain='http://www.blogger.com/atom/ns#'>SAS Code</category><title>Using SET with Unique to Join Tables in SAS Data Steps</title><description>Recently, I have had to write a bunch of SAS code for one of our clients. Although I strive to do as much as possible using &lt;span style="font-family:courier new;"&gt;proc sql&lt;/span&gt;, there are some things that just require a &lt;span style="font-family:courier new;"&gt;data&lt;/span&gt; step. Alas.&lt;br /&gt;&lt;br /&gt;When using the &lt;span style="font-family:courier new;"&gt;data&lt;/span&gt; step, I wish I were able to call a query directly:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;data whereever;&lt;br /&gt;&lt;span style="color:#ffffff;"&gt;....&lt;/span&gt;set (SELECT beautiful things using SQL syntax);&lt;br /&gt;&lt;span style="color:#ffffff;"&gt;....&lt;/span&gt;and so on with the SAS code&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;However, this is not possible.&lt;br /&gt;&lt;br /&gt;A SAS programmer might point out that there are two easy work-arounds. First, you can simply call the query and save it as a SAS data set. Alternatively, you can define a view and access the view from the data step.&lt;br /&gt;&lt;br /&gt;I do not like either of these solutions. One reason why I like SQL is that I can combine many different parts of a solution into a single SQL statement -- my SQL queries usually have lots of subqueries. Another reason I like SQL is it reduces the need for clutter -- intermediate files/tables/data sets -- which need to be named and tracked and managed and eventually deleted. 
I ran out of clever names for such things about fifteen years ago and much prefer having the database do the dirty work of tracking such things. Perhaps this is why I wrote &lt;a href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;a book on using SQL for data analysis&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;So, I give up on the SQL syntax, but I still want to be able to do similar processing. The data step does make it possible to do joins, using a syntax that is almost intuitive (at least for data step code). The advertised syntax looks like:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;proc sql;&lt;br /&gt;&lt;span style="color:#ffffff;"&gt;....&lt;/span&gt;create index lookupkey on lookup(lookupkey);&lt;br /&gt;&lt;br /&gt;data whereever;&lt;br /&gt;&lt;span style="color:#ffffff;"&gt;....&lt;/span&gt;set master;&lt;br /&gt;&lt;span style="color:#ffffff;"&gt;....&lt;/span&gt;set lookup (keep=lookupkey lookupvalue) key=lookupkey;&lt;br /&gt;&lt;span style="color:#ffffff;"&gt;....&lt;/span&gt;and so on with the SAS code&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;This example is highly misleading! (So look below for a better version.) But, before explaining the problems and the solution, let me explain how the code works.&lt;br /&gt;&lt;br /&gt;The first statement is a &lt;span style="font-family:courier new;"&gt;proc sql&lt;/span&gt; statement that builds an index on the lookup data set using the lookup key column. Real SAS programmers might prefer &lt;span style="font-family:courier new;"&gt;proc datasets&lt;/span&gt;, but I'm not a real SAS programmer. They do the same thing.&lt;br /&gt;&lt;br /&gt;The second statement is the data step. The key part of the data step is the second &lt;span style="font-family:courier new;"&gt;set&lt;/span&gt; statement which uses the &lt;span style="font-family:courier new;"&gt;key=&lt;/span&gt; keyword. This keyword says to look up the corresponding value in another data set and fetch the first row where the values match. 
The "key" itself is an index, which is why I created the index first.&lt;br /&gt;&lt;br /&gt;The &lt;span style="font-family:courier new;"&gt;keep&lt;/span&gt; part of the statement is just for efficiency's sake. This says to only keep the two variables that I want, the lookup key (which is needed for the index) and the lookup value. There may be another two hundred columns in the lookup table (er, data set), but these are the only ones that I want.&lt;br /&gt;&lt;br /&gt;This basic example is quite deceptive. Indexes in SAS are a lot like indexes in databases. They are both called indexes and both give fast access to rows in a table, based on values in one or more columns. Both can be created in SQL.&lt;br /&gt;&lt;br /&gt;However, they are not the same. The above syntax does work under some circumstances, such as when all the lookup keys are in the lookup table and when no two consecutive rows in the master table have the same key (or some strange condition like that). Most importantly, I've found that the syntax seems to work on small test data but not on larger sets. This is a most nefarious type of difference. And, there are no warnings or errors.&lt;br /&gt;&lt;br /&gt;The problem is that SAS indexes allow duplicates but treat indexes with duplicate keys differently from indexes with unique keys. Even worse, SAS determines this by how the index is created, not by the context. And for me (the database guy) the most frustrating thing is that the default is the strange behavior instead of the nice clean behavior I'm expecting. I freely admit a bias here.&lt;br /&gt;&lt;br /&gt;So we have to explicitly say that the index has no duplicates. In addition, SAS does not have reasonable behavior when there is no match. "Reasonable" behavior would be to set the lookup value to missing and to continue dutifully processing data. 
Instead, SAS generates an error and puts garbage in the lookup value.&lt;br /&gt;&lt;code&gt;&lt;br /&gt;proc sql;&lt;br /&gt;&lt;span style="color:#ffffff;"&gt;....&lt;/span&gt;create &lt;span style="color:#993399;"&gt;unique&lt;/span&gt; index lookupkey on lookup(lookupkey);&lt;br /&gt;&lt;br /&gt;data whereever;&lt;br /&gt;&lt;span style="color:#ffffff;"&gt;....&lt;/span&gt;set master;&lt;br /&gt;&lt;span style="color:#ffffff;"&gt;....&lt;/span&gt;set lookup (keep=lookupkey lookupvalue) key=lookupkey&lt;span style="color:#993399;"&gt;/unique&lt;/span&gt;;&lt;br /&gt;&lt;span style="color:#ffffff;"&gt;....&lt;/span&gt;&lt;span style="color:#993399;"&gt;if (_iorc_ = %sysrc(_dsenom)) then do;&lt;/span&gt;&lt;br /&gt;&lt;span style="color:#ffffff;"&gt;........&lt;/span&gt;&lt;span style="color:#993399;"&gt;_ERROR_ = 0;&lt;/span&gt;&lt;br /&gt;&lt;span style="color:#ffffff;"&gt;........&lt;/span&gt;&lt;span style="color:#993399;"&gt;lookupvalue = .;&lt;/span&gt;&lt;br /&gt;&lt;span style="color:#ffffff;"&gt;....&lt;/span&gt;&lt;span style="color:#993399;"&gt;end;&lt;/span&gt;&lt;br /&gt;&lt;span style="color:#ffffff;"&gt;....&lt;/span&gt;and so on with the SAS code&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;The important change is the presence of the &lt;span style="font-family:courier new;"&gt;unique&lt;/span&gt; signifier in both the &lt;span style="font-family:courier new;"&gt;create index&lt;/span&gt; statement and in the &lt;span style="font-family:courier new;"&gt;set&lt;/span&gt; statement. I have found that having it in one place is not sufficient, even when the index actually has no duplicates.&lt;br /&gt;&lt;br /&gt;The error handling also troubles me. Strange variables called "iorc" are bad enough, even without being preceded by an underscore.  Accessing global symbols such as _ERROR_ should be a sign that something extraordinary is going on. 
But nothing unusual is happening; the code is just taking into account the fact that the key is not in the lookup table.&lt;br /&gt;&lt;br /&gt;In the end, I can use the data step to mimic SQL joins, including left outer joins (by taking into account keys that have no match in the lookup table), by using appropriate indexes and keys.  Although I don't particularly like the syntax, I do find this capability very, very useful.  The data step I referred to at the beginning of this post has eleven such lookups, and many of the lookup tables have hundreds of thousands or millions of rows.&lt;br /&gt;&lt;br /&gt;--gordon</description><link>http://www.data-miners.com/blog/2008/04/using-set-with-unique-to-join-tables-in.html</link><author>noreply@blogger.com (Gordon S. Linoff)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-6007766890486144802</guid><pubDate>Sat, 12 Apr 2008 18:41:00 +0000</pubDate><atom:updated>2008-04-12T19:14:11.684-04:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>Ask a data miner</category><title>Using validation data in Enterprise Miner</title><description>&lt;blockquote&gt;Dear Sir/Madam,&lt;br /&gt;&lt;br /&gt;I am a lecturer at De Montfort University in the UK and teach modules on&lt;br /&gt;Data Mining at final year BSc and MSc level. For both of these we use the&lt;br /&gt;Berry &amp;amp; Linoff Data Mining book. I have a couple of questions regarding SAS that I've been unable to find the answer to and I wondered if you could point in the direction of a source of info where I could find the answers. They are to do with partitioning data in SAS EM and how the different data sets are used. In the Help from SAS EM I see that it says the validation set is used in regression  "to choose a final subset of predictors from all the subsets computed during stepwise regression"  - so is the validation set not used in regression otherwise (e.g. 
in forward deletion and backward deletion)?&lt;br /&gt;&lt;br /&gt;Also I'm not sure where we see evidence of the test set being used in any of the models I've developed (NNs, Decision Trees, Regression). I presume the lift charts are based on the actual model (resulting from the training and validation data sets) though I noticed if I only had a training and a validation data set (i.e. no test set) the lift chart gave a worse model.&lt;br /&gt;&lt;br /&gt;I hope you don't mind me asking these questions - My various books and the help don't seem to explain fully but I know it must be documented somewhere.&lt;br /&gt;&lt;br /&gt;best wishes, Jenny Carter&lt;br /&gt;&lt;br /&gt;Dr. Jenny Carter&lt;br /&gt;Dept. of Computing&lt;br /&gt;De Montfort University&lt;br /&gt;The Gateway&lt;br /&gt;Leicester&lt;/blockquote&gt;&lt;br /&gt;Hi Jenny,&lt;br /&gt;&lt;br /&gt;I'd like to take this opportunity to go beyond your actual question about SAS Enterprise Miner to make a general comment on the use of validation sets for variable selection in regression models and to guard against overfitting in decision tree and neural network models.&lt;br /&gt;&lt;br /&gt;Historically, statistics grew up in a world of small datasets. As a result, many statistical tools reuse the same data to fit candidate models as to evaluate and select them. In a data mining context, we assume that there is plenty of data so there is no need to reuse the training data. The problem with using the training data to evaluate a model is that overfitting may go undetected. The best model is not the one that best describes the training data; it is the one that best generalizes to new datasets. That is what the validation set is for. The details of how Enterprise Miner accomplishes this vary with the type of model. In no case does the test set get used for either fitting the model or selecting from among candidate models. 
Its purpose is to allow you to see how your model will do on data that was not involved in the model building or selection process.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Regression Models&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;When you use any of the model selection methods (Forward, Stepwise, Backward), you also get to select a method for evaluating the candidate models formed from different combinations of explanatory variables. Most of the choices make no use of the validation data. &lt;span style="font-style: italic;"&gt;Akaike's Information Criterion&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;Schwarz's Bayesian Criterion&lt;/span&gt; both add a penalty term for the number of effects in the model to a function of the error sum of squares. This penalty term is meant to compensate for the fact that additional model complexity appears to lower the error on the training data even when the model is not actually improving. When you choose &lt;span style="font-style: italic;"&gt;Validation Error&lt;/span&gt; as the selection criterion, you get the model that minimizes error on the validation set. That is our recommended setting. You must also take care to set Use Selection Default to &lt;span style="font-style: italic;"&gt;No&lt;/span&gt; in the Model Selection portion of the property sheet, or Enterprise Miner will ignore the rest of your settings.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.data-miners.com/blog/uploaded_images/modselect-762437.gif"&gt;&lt;img style="cursor: pointer;" src="http://www.data-miners.com/blog/uploaded_images/modselect-762434.gif" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;When a training set, validation set, and test set are all present, Enterprise Miner will report statistics such as the root mean squared error for all three sets. 
The error on the test set, which is used neither to fit models nor to select among candidate models, is the best predictor of performance on unseen data.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Decision Trees&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span&gt;&lt;br /&gt;With decision trees, the validation set is used to select a subtree of the tree grown using the training set. This process is called "pruning." Pruning helps prevent overfitting. Some splits that have a sufficiently high worth (chi-square value) on the training data to enter the initial tree fail to improve the error rate of the tree when applied to the validation data. This is especially likely to happen when small leaf sizes are allowed. By default, if a validation set is present, Enterprise Miner will use it for subtree selection.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Neural Networks&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Training a neural network is an iterative process. Each training iteration adjusts the weights associated with each network connection. As training proceeds, the network becomes better and better at "predicting" the training data. By the time training stops, the model is almost certainly overfit. Each set of weights is a candidate model. The selected model is the one that minimizes error on the validation set.  
In the chart shown below, after 20 iterations of training the error on the training set is still declining, but the best model was reached after only 3 training iterations.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.data-miners.com/blog/uploaded_images/nntrain-758363.gif"&gt;&lt;img style="cursor: pointer;" src="http://www.data-miners.com/blog/uploaded_images/nntrain-758361.gif" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;</description><link>http://www.data-miners.com/blog/2008/04/using-validation-data-in-enterprise.html</link><author>noreply@blogger.com (Michael J. A. Berry)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-448863936142429210</guid><pubDate>Tue, 08 Apr 2008 21:14:00 +0000</pubDate><atom:updated>2008-04-08T18:43:19.429-04:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>MapReduce</category><category domain='http://www.blogger.com/atom/ns#'>database</category><title>Databases, MapReduce, and Disks</title><description>I just came across an interesting blog posting by Tom White entitled &lt;a href="http://www.lexemetech.com/2008/03/disks-have-become-tapes.html"&gt;"Disks Have Become Tapes"&lt;/a&gt;. This is an interesting posting, but it makes the following claim:  relational databases are limited by the seek speed of disks whereas MapReduce-based methods take advantage of the streaming capabilities of disks.  Hence, MapReduce is better than an RDBMS for various types of processing.&lt;br /&gt;&lt;br /&gt;Once again, I read a comment in a blog that seems misguided and gives inaccurate information.  My guess is that people learn relational databases from the update/insert perspective and don't understand complex query processing.  Alas.  
I do recommend my book &lt;a href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;&lt;span style="font-style: italic;"&gt;Data Analysis Using SQL and Excel&lt;/span&gt;&lt;/a&gt; for such folks.  Relational databases can take advantage of high-throughput disks. &lt;br /&gt;&lt;br /&gt;Of course, the problem is not new.  Tom White quotes David DeWitt quoting Jim Gray saying "Disks are the new tapes" (&lt;a href="http://www.databasecolumn.com/2007/09/disk-trends.html"&gt;here&lt;/a&gt;).  And the numbers are impressive.  It takes longer to read a high capacity disk now than it did twenty years ago, because capacity has increased much faster than transfer rates.  As for random seeks on the disk, let's not go there.  Seek times have hardly improved at all over this time period.  Seeking on a disk is like going to Australia in a canoe -- the canoe works well enough to cross a river, so why not an ocean?  And, as we all know, RDBMSs use a lot of seeks for queries so they cannot take advantage of modern disks.  MapReduce to the rescue!&lt;br /&gt;&lt;br /&gt;Wait, is that common wisdom really true?&lt;br /&gt;&lt;br /&gt;It is true that for updating or fetching a single row, an RDBMS does use disk seeks to get there (especially if there is an index).  However, this is much faster than the alternative of streaming through the whole table -- even on a fancy MapReduce system with many cheap processors connected to zillions of inexpensive disks.&lt;br /&gt;&lt;br /&gt;On a complex query, the situation is a bit more favorable to the RDBMS for several reasons.  First, large analytic queries typically read entire tables (or partitions of tables).  They do not "take advantage" of indexing, since they read all rows using full table scans.&lt;br /&gt;&lt;br /&gt;However, database engines do not read rows.  They read pages.  Between the query processor and the data is the page manager.  Or, as T. S. 
Eliot wrote in his poem "The Hollow Men" [on an entirely different topic]:&lt;br /&gt;&lt;quote&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Between the idea&lt;/span&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;And the reality&lt;/span&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Between the motion&lt;/span&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;And the act&lt;/span&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Falls the shadow&lt;/span&gt;&lt;br /&gt;&lt;/quote&gt;&lt;br /&gt;In this case, the shadow is the page manager, a very important but often overlooked component of a database management system.&lt;br /&gt;&lt;br /&gt;Table scans read the pages assigned to a table.  So, query performance is based on a balance of disk performance (both throughput and latency) and page size. For a database used for analytics, use a big page size.  4k is way small . . . 128k or even 1Mbyte could be very reasonable (and I have seen systems with even larger page sizes).  Also, remember to stuff the pages full.  There is no reason to partially fill pages unless the table has updates (which is rare for most data warehouse tables).&lt;br /&gt;&lt;br /&gt;Databases do a lot of things to improve performance.  Probably the most important boost is accidental.  Large database tables are typically loaded in bulk, say once-per-day.  As a result, the pages are quite likely to be allocated sequentially.  Voila!  In such cases, the seek time from one page to the next is minimal.&lt;br /&gt;&lt;br /&gt;But, databases are smarter than that.  The second boost is pre-fetching pages that are likely to be needed.  Even a not-so-smart database engine can realize when it is doing a full table scan.  The page manager can seek to the next page at the same time that the processor is processing data in memory.  That is, the CPU is working, while the page manager spends its time waiting for new pages to load.  
Although the page manager is waiting, the CPU is quite busy processing other data, so there is no effective wait time.&lt;br /&gt;&lt;br /&gt;This overlap between CPU cycles and disk is very important for database performance on large queries.  And you can see it on a database machine.  In a well-balanced system, the CPUs are often quite busy on a large query and the disks are less busy.&lt;br /&gt;&lt;br /&gt;Modern RDBMS have a third capability with respect to complex queries.  Much of the work is likely to take place in temporary tables.  The page manager would often store these on sequential pages, and they would be optimized for sequential access.  In addition, temporary tables only store the columns that they need.&lt;br /&gt;&lt;br /&gt;In short, databases optimize their disk access in several ways.  They take advantage of high-throughput disks by:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;using large page sizes to reduce the impact of latency;&lt;/li&gt;&lt;li&gt;storing large databases on sequential pages;&lt;/li&gt;&lt;li&gt;prefetching pages while the processor works on data already in memory;&lt;/li&gt;&lt;li&gt;efficiently storing temporary tables.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;At least they are doing something!  By the way, the balance between latency and throughput goes back at least to the 1980s when I entered this business.  And I suspect that it is a much older concern.&lt;br /&gt;&lt;br /&gt;The advantage and disadvantage of the MapReduce approach is that it leaves such optimizations in the hands of the operating system and the programmer.  Fortunately, modern computer languages are smart with respect to sequential file I/O, so reading some records and then processing them would normally be optimized.&lt;br /&gt;&lt;br /&gt;Of course, a programmer can disrupt this by writing temporary or output files to the same disk system being used to read data.  Well, actually, disks are also getting smarter.  
With multiple platters and multiple read heads, modern disks can support multiple seeks to different areas.&lt;br /&gt;&lt;br /&gt;A bigger problem arises with complex algorithms.  MapReduce does not provide built-in support for joining large tables.  Nor even for joining smaller tables.  A nested loop join in MapReduce code could kill the performance of a query.  An RDBMS might implement the same join using hash tables that gracefully overflow memory, should that be necessary.  An exciting development in a programmer's life is when a hash table in memory gets too big and he or she learns about operating system page faults, a concern that the database engine takes care of by itself.&lt;br /&gt;&lt;br /&gt;As I've mentioned before, RDBMS versus MapReduce is almost a religious battle.  MapReduce has capabilities that RDBMSs do not have, and not only because programming languages are more expressive than SQL.  The paradigm is strong and capable for certain tasks.&lt;br /&gt;&lt;br /&gt;On the other hand, SQL is a comparatively easy language to learn (I mean compared to programming for MapReduce) and relational database engines often have decades of experience built into them, for partitioning data, choosing join and aggregation algorithms, building temporary tables, keeping processors busy and disks spinning, and so on.  In particular, RDBMSs do know a trick or two to optimize disk performance and take advantage of modern highish-latency higher-throughput disks.&lt;br /&gt;&lt;br /&gt;--gordon</description><link>http://www.data-miners.com/blog/2008/04/databases-mapreduce-and-disks.html</link><author>noreply@blogger.com (Gordon S. 
Linoff)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-7594100033038610950</guid><pubDate>Mon, 17 Mar 2008 01:19:00 +0000</pubDate><atom:updated>2008-03-16T22:26:06.769-04:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>Not Data Mining</category><title>Getting an iPhone</title><description>[This posting has nothing to do with data mining.]&lt;br /&gt;&lt;br /&gt;Last week, a friend gave me an iPhone for my birthday.  Before that, I had admired the iPhone at a distance as several of my friends and colleagues used theirs.  I should also admit that I'm something of a Luddite.  Technology for technology's sake does not appeal to me; it often just means additional work.  Having spent the weekend setting up and getting used to the phone, the fear is confirmed.  However, the end result is worth it.&lt;br /&gt;&lt;br /&gt;The first step in using the iPhone is getting service, which is as simple as downloading iTunes, hooking up the phone, and going through a few menus.  Of course, there are a few complications.  The most recent version of iTunes does not support the version of Windows I have on my laptop.  Remember the Luddite in me, causing me to be resistant to a much-needed laptop upgrade.&lt;br /&gt;&lt;br /&gt;That issue was easily resolved by moving to another computer.  The second fear was porting my number from T-Mobile to AT&amp;amp;T.  This turned out to be a non-issue.  Just click a box on one of the screens, put in my former number (and look up my account number) and the phone companies do the rest.&lt;br /&gt;&lt;br /&gt;So once you have an iPhone in place, then expect to spend several hours learning how to operate it.  After getting lost in the interface, perhaps somewhere in contacts, I painfully learned that there is only one way to get back to the home page.  I'm pretty sure I tried all other combinations by hitting options on the screen.  
However, there is actually a little button on the bottom of the screen -- a real button -- that brings back the home page.  Well, at least they got rid of all the keys with numbers on them.&lt;br /&gt;&lt;br /&gt;The next step is sync'ing the iPhone to your life.  This is simplest if your mail, calendar, and contacts are all handled in Outlook or Yahoo!.  Somehow, Apple is not compatible with Google.  Alas.  So, bringing in my contacts from Google meant:&lt;br /&gt;&lt;br /&gt;(1) Spending an hour or two cleaning up my contact list in Google, and adding telephone numbers from my old phone.  Since the iPhone has email capabilities, I really wanted to bring in email addresses as well as phone numbers.&lt;br /&gt;&lt;br /&gt;(2) Exporting the Google contacts into a text file.&lt;br /&gt;&lt;br /&gt;(3) Very importantly:  renaming the "Name" column in the first line to "First Name".  Google has only one name field, but Yahoo (and the iPhone) want two fields.&lt;br /&gt;&lt;br /&gt;(4) Uploading my contacts into my Yahoo account.&lt;br /&gt;&lt;br /&gt;(5) Sync'ing the iPhone up with my Yahoo account.&lt;br /&gt;&lt;br /&gt;Okay, I can accept that some global politics keeps the iPhone from talking directly to Google.  But, why do I need to connect to the computer to do the sync?  Why can't I do it over the web wirelessly?&lt;br /&gt;&lt;br /&gt;Okay, that's the contacts, and we'll see how it works.&lt;br /&gt;&lt;br /&gt;The calendar is more difficult.  For that, I just use Safari -- the iPhone browser -- to go to Google calendar.  This seems to work well enough.  However, even this can be complicated because I have two Google accounts -- one for email (glinoff@gmail.com) and one for all my Data Miners related stuff (gordon@data-miners.com).  The calendar is on the latter.  
I seem to have gotten a working version up in Safari, by going through the calendar page.&lt;br /&gt;&lt;br /&gt;Note that I did not use Google's suggestion of pasting in the URL for my private calendar.  I found that the functionality when I do this is not complete.  It is hard to add in events.&lt;br /&gt;&lt;br /&gt;And this brings up a subject about Safari.  First, it is incredible what it can do on a small portable device.  On the other hand, it is insane that I was unable to set up my AT&amp;amp;T account using Safari.  Each time I went through the same routine.  AT&amp;amp;T sent me a temporary password.  I went to the next screen, and filled in new passwords and answers to the security questions (somewhat painfully, one character at a time, but I was on a train at the time).  After finishing, I would go to a validation screen, the validation would fail, and I would go back to the first page.  The only thing that saved me was the train reaching Penn Station and the iPhone running out of battery power.&lt;br /&gt;&lt;br /&gt;Once I got home, I did the same thing on my computer.  And, it worked the first time.&lt;br /&gt;&lt;br /&gt;I also noticed that certain forms do not work perfectly in Safari, such as the prompts for Google calendar.  On the other hand, it was easy to go to web pages, add bookmarks, and put the pages on the home screen.&lt;br /&gt;&lt;br /&gt;Fortunately, the email does not actually go through the Safari interface.  This makes it easy to read email, because the application is customized.  However, Safari would have some advantages.  First, Safari rotates when the screen rotates, but the email doesn't (which is unfortunate because stubby fingers work better in horizontal mode).  Also, only the most recent 50 emails are downloaded, so searching through history is not feasible.  On the plus side, an email sent from the phone still shows up in Gmail.&lt;br /&gt;&lt;br /&gt;Perhaps the most impressive feature of the phone is the maps.  
There is a home key on the map which tells you where you are.  Very handy.  We were watching the movie "The Water Horse".  Within a minute, I could produce a map and satellite pictures of Loch Ness in Scotland, with all the zoom-in and zoom-out features.  Following close behind is the ability to surf the web.  And both of these are faster on a wide-area network, which I have.&lt;br /&gt;&lt;br /&gt;I still haven't used the music or video, so there is more to learn.  But the adventure seems worth it so far.</description><link>http://www.data-miners.com/blog/2008/03/getting-iphone.html</link><author>noreply@blogger.com (Gordon S. Linoff)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-8138545920274347513</guid><pubDate>Wed, 12 Mar 2008 19:36:00 +0000</pubDate><atom:updated>2008-03-17T11:25:35.873-04:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>In The News</category><title>Data Mining Brings Down Governor Spitzer</title><description>When New York Governor Eliot Spitzer resigned earlier today, the proximate cause was the revelation that he had spent thousands of dollars (maybe tens of thousands) on prostitutes. This hypocrisy on the part of the former NY Attorney General, who is married with three teenage daughters and has a long record of prosecuting the wrongdoings of others, made his continuation in office untenable.&lt;br /&gt;&lt;br /&gt;But how was he caught? The answer is that the complicated financial transactions he made in an attempt to disguise his spending on prostitutes were flagged by fraud detection software that banks now use routinely to detect money laundering and other financial crimes. 
In a &lt;a href="http://www.npr.org/templates/story/story.php?storyId=88132229"&gt;news report on NPR&lt;/a&gt; this morning, reporter Adam Davidson interviewed a representative from &lt;a href="http://www.actimize.com/"&gt;Actimize&lt;/a&gt;, an Israeli company that specializes in fraud detection and compliance software. The software scores every bank transaction with a number from 0 to 100 indicating the probability of fraud. The software takes into account attributes of the particular transaction, but also its relationship to other transactions (as when several small transactions with the same source and destination are used to disguise a large transaction), the relationship of account owners involved in the transaction, and attributes of the account owner such as credit score and, unfortunately for Governor Spitzer, whether or not the account owner is a "PEP" (politically exposed person). PEPs attract more scrutiny since they are often in a position to be bribed or engage in other corrupt practices.&lt;br /&gt;&lt;br /&gt;Banks are required to report SARs (Suspicious Activity Reports) to &lt;a href="http://www.fincen.gov/"&gt;FinCEN&lt;/a&gt;, the Treasury Department's financial crimes enforcement network. The reports--about a million of them in 2006--go into a database hosted at the IRS and teams of investigators around the country look into them. One such team, based in Long Island, looked into Spitzer's suspicious transactions and eventually discovered the connection to the prostitution ring.&lt;br /&gt;&lt;br /&gt;Ironically, one of the reasons there are so many more SARs filed each year now than there were before 2001 is that in 2001, then-New York Attorney General Eliot Spitzer aggressively pursued wrong-doing at financial institutions and said they had to be aware of criminal activity conducted through their accounts. 
Apparently, the software banks installed to find transactions that criminal organizations are trying to hide from the IRS is also capable of finding transactions that Johns are trying to hide from their wives.</description><link>http://www.data-miners.com/blog/2008/03/data-mining-brings-down-governor.html</link><author>noreply@blogger.com (Michael J. A. Berry)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-7545365293144787988</guid><pubDate>Sat, 09 Feb 2008 17:10:00 +0000</pubDate><atom:updated>2008-02-10T16:26:01.484-05:00</atom:updated><title>MapReduce and K-Means Clustering</title><description>Google offers slides and presentations on many research topics &lt;a href="http://code.google.com/edu/"&gt;online&lt;/a&gt; including distributed systems.  And one of &lt;a href="http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html"&gt;these presentations&lt;/a&gt; discusses MapReduce in the context of clustering algorithms.&lt;br /&gt;&lt;br /&gt;One of the claims made in this particular presentation is that "&lt;span style="font-style: italic;"&gt;it can be necessary to send tons of data to each Mapper Node.  Depending on your bandwidth and memory available, this could be impossible&lt;/span&gt;."  This claim is false, which in turn removes much of the motivation for the alternative algorithm, which is called "canopy clustering".&lt;br /&gt;&lt;span style="color: rgb(0, 102, 0);font-family:arial;font-size:130%;"  &gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The K-Means Clustering Algorithm&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There are many good introductions to k-means clustering available, including our book &lt;a style="font-style: italic; font-weight: bold;" href="http://www.amazon.com/exec/obidos/ASIN/0471470643/thedataminers"&gt;Data Mining Techniques for Marketing, Sales, and Customer Support&lt;/a&gt;.  
The Google presentation mentioned above provides a very brief introduction.&lt;br /&gt;&lt;br /&gt;Let's review the k-means clustering algorithm.  Given a data set where all the columns are numeric, the algorithm for k-means clustering is basically the following:&lt;br /&gt;&lt;br /&gt;(1) Start with k cluster centers (chosen randomly or according to some specific procedure).&lt;br /&gt;(2) Assign each row in the data to its nearest cluster center.&lt;br /&gt;(3) Re-calculate the cluster centers as the "average" of the rows in (2).&lt;br /&gt;(4) Repeat, until the cluster centers no longer change or some other stopping criterion has been met.&lt;br /&gt;&lt;br /&gt;In the end, the k-means algorithm "colors" all the rows in the data set, so similar rows have the same color.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;&lt;span style="color: rgb(0, 102, 0);font-family:arial;" &gt;K-Means in a Parallel World&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To run this algorithm, it seems, at first, as though all the rows assigned to each cluster in Step (2) need to be brought together to recalculate the cluster centers.&lt;br /&gt;&lt;br /&gt;However, this is not true.  K-Means clustering is an example of an embarrassingly parallel algorithm, meaning that it is very well suited to parallel implementations.  In fact, it is quite adaptable to both SQL and to MapReduce, with efficient algorithms. By "efficient", I mean that large amounts of data do not need to be sent around processors and that the processors have minimum amounts of communication.  It is true that the entire data set does need to be read by the processors for each iteration of the algorithm, but each row only needs to be read by one processor.&lt;br /&gt;&lt;br /&gt;A parallel version of the k-means algorithm was incorporated into the Darwin data mining package, developed by Thinking Machines Corporation in the early 1990s.  
I do not know if this was the first parallel implementation of the algorithm.  Darwin was later purchased by Oracle, and became the basis for Oracle Data Mining.&lt;br /&gt;&lt;br /&gt;How does the parallel version work?  The data can be partitioned among multiple processors (or streams or threads).  Each processor can read the previous iteration's cluster centers and assign the rows on the processor to clusters.  Each processor then calculates new centers for its portion of the data.  Each actual cluster center (for the data across all processors) is then the weighted average of the centers on each processor.&lt;br /&gt;&lt;br /&gt;In other words, the rows of data do not need to be combined globally.  They can be combined locally, with the reduced set of results combined across all processors.  In fact, MapReduce even contains a "combine" method for just this type of algorithm.&lt;br /&gt;&lt;br /&gt;All that remains is figuring out how to handle the cluster center information.  Let us postulate a shared file that has the centroids as calculated for each processor.  This file contains:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The iteration number.&lt;/li&gt;&lt;li&gt;The cluster id.&lt;/li&gt;&lt;li&gt;The cluster coordinates.&lt;/li&gt;&lt;li&gt;The number of rows assigned to the cluster.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;This is the centroid file.  An iteration through the algorithm is going to add another set of rows to this file.  This information is the only information that needs to be communicated globally.&lt;br /&gt;&lt;br /&gt;There are two ways to do this in the MapReduce framework.  The first uses map, combine, and reduce.  
The second only uses map and reduce.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;&lt;span style="color: rgb(0, 102, 0);font-family:arial;" &gt;K-Means Using Map, Combine, Reduce&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Before beginning, a file is created accessible to all processors that contains initial centers for all clusters.  This file contains the cluster centers for each iteration.&lt;br /&gt;&lt;br /&gt;The Map function reads this file to get the centers from the last finished iteration.  It then reads the input rows (the data) and calculates the distance to each center.  For each row, it produces an output pair with:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;key -- cluster id;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;value -- coordinates of row.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Now, this is a lot of data, so we use a Combine function to reduce the size before sending it to Reduce.  The Combine function calculates the average of the coordinates for each cluster id, along with the number of records.  This is simple, and it produces one record of output for each cluster:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;key is cluster&lt;/li&gt;&lt;li&gt;value is number of records and average values of the coordinates.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;The amount of data now is the number of clusters times the number of processors times the size of the information needed to define each cluster.  This is small relative to the data size.&lt;br /&gt;&lt;br /&gt;The Reduce function (and one of these is probably sufficient for this problem regardless of data size and the number of Maps) calculates the weighted average of its input.  
Its output should be written to a file, and contain:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;the iteration number;&lt;/li&gt;&lt;li&gt;the cluster id;&lt;/li&gt;&lt;li&gt;the cluster center coordinates;&lt;/li&gt;&lt;li&gt;the size of the cluster.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;The iteration process can then continue.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;&lt;span style="color: rgb(0, 102, 0);font-family:arial;" &gt;K-Means Using Just Map and Reduce&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Using just Map and Reduce, it is possible to do the same things.  In this case, the Map and Combine functions described above are combined into a single function.&lt;br /&gt;&lt;br /&gt;So, the Map function does the following:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Initializes itself with the cluster centers from the previous iteration;&lt;/li&gt;&lt;li&gt;Keeps information about each cluster in memory.  This information is the total number of records assigned to the cluster in the processor and the total of each coordinate.&lt;/li&gt;&lt;li&gt;For each record, it updates the information in memory.&lt;/li&gt;&lt;li&gt;It then outputs the key-value pairs for the Combine function described above.&lt;/li&gt;&lt;/ul&gt;The Reduce function is the same as above.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;&lt;span style="color: rgb(0, 102, 0);font-family:arial;" &gt;K-Means Using SQL&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Of course, one of my purposes in discussing MapReduce has been to understand whether and how it is more powerful than SQL.  For fifteen years, databases have been the only data-parallel application readily available.  
The parallelism is hidden underneath the SQL language, so many people using SQL do not fully appreciate the power they are using.&lt;br /&gt;&lt;br /&gt;An iteration of k-means looks like:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;SELECT @iteration+1, cluster_id,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.......&lt;/span&gt;AVERAGE(a.data) as center&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;FROM (SELECT d.data, cc.cluster_id,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.............&lt;/span&gt;ROW_NUMBER() OVER (PARTITION BY d.data&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;................................&lt;/span&gt;ORDER BY DISTANCE(d.data, cc.center)) as ranking&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM data d CROSS JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;(SELECT *&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM cluster_centers cc&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;WHERE iteration = @iteration) cc&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;) a&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;WHERE ranking = 1&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;GROUP BY cluster_id&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;This code assumes the existence of functions or code for the &lt;span style="font-family:courier 
new;"&gt;AVERAGE()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;DISTANCE() &lt;/span&gt;functions.  These are placeholders for the correct functions.  Also, it uses analytic functions.  (If you are not familiar with these, I recommend my book &lt;a style="font-style: italic; font-weight: bold;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;.)&lt;br /&gt;&lt;br /&gt;The efficiency of the SQL code is determined, to a large extent, by the analytic function that ranks all the cluster centers.  We hope that a powerful parallel engine will recognize that the data is all in one place, and hence that this function will be quite efficient.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;&lt;span style="color: rgb(0, 102, 0);font-family:arial;" &gt;A Final Note About K-Means Clustering&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The K-Means clustering algorithm does require reading through all the data for each iteration through the algorithm.  In general, it tends to converge rather quickly (tens of iterations), so this may not be an issue.  Also, the I/O for reading the data can all be local I/O, rather than sending large amounts of data through the network.&lt;br /&gt;&lt;br /&gt;For most purposes, if you are dealing with a really big dataset, you can sample it down to a fraction of its original size to get reasonable clusters.  If you are not satisfied with this method, then sample the data, find the centers of the clusters, and then use these to initialize the centers for the overall data.  This will probably reduce the number of iterations through the entire data to less than 10 (one pass for the sample, a handful for the final clustering).&lt;br /&gt;&lt;br /&gt;When running the algorithm on very large amounts of data, numeric overflow is a very real issue.  
This is another reason why clustering locally, taking averages, and then taking the weighted average globally is beneficial -- and why sampling is a good way to begin.&lt;br /&gt;&lt;br /&gt;Also, before clustering, it is a good idea to standardize numeric variables (subtract the average and divide by the standard deviation).&lt;br /&gt;&lt;br /&gt;--gordon&lt;br /&gt;Check out my latest book &lt;a style="font-style: italic; font-weight: bold;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;.</description><link>http://www.data-miners.com/blog/2008/02/mapreduce-and-k-means-clustering.html</link><author>noreply@blogger.com (Gordon S. Linoff)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-8182634975616321155</guid><pubDate>Wed, 06 Feb 2008 21:47:00 +0000</pubDate><atom:updated>2008-02-06T18:09:15.664-05:00</atom:updated><title>Using SQL to Emulate MapReduce Functionality</title><description>My previous blog entry explained that there are two ways that MapReduce functionality (&lt;a href="http://www.data-miners.com/blog/2008/01/mapreduce-and-sql-aggregations.html"&gt;here&lt;/a&gt;) is more powerful than SQL aggregations:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;MapReduce implements functions using a full-fledged programming language. This is more powerful than the functions permitted in SQL.&lt;/li&gt;&lt;li&gt;MapReduce allows one row to be part of more than one aggregation group.&lt;/li&gt;&lt;/ul&gt;In fact, SQL can emulate this functionality, bringing it much closer to MapReduce's capabilities. This post discusses how SQL can emulate this functionality and then discusses why this might not be a good idea. 
(This discussion has been inspired by the rather inflammatory and inaccurate post &lt;a href="http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html"&gt;here&lt;/a&gt;.)&lt;br /&gt;&lt;br /&gt;First, let me note that the first limitation is not serious, because I assume that SQL can be extended by adding new scalar and aggregation user defined functions. Although more cumbersome than built-in programming constructs, the ability to add user defined functions does make it possible to add in a wide breadth of functionality.&lt;br /&gt;&lt;br /&gt;The second strength can be emulated by assuming the existence of a table, which I'll call &lt;span style="font-family:courier new;"&gt;enumerate&lt;/span&gt;, that simply contains one column which contains numbers starting at 1.&lt;br /&gt;&lt;br /&gt;How does such a table help us? Consider a table that has a start date and a stop date for each customer. The SQL code to count up the starts and the stops might look like:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-family:courier new;"&gt;SELECT thedate, SUM(isstart) as numstarts, SUM(isstop) as numstops&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;FROM ((SELECT start_date as thedate, 1 as isstart, 0 as isstop&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color:#ffffff;"&gt;.......&lt;/span&gt;FROM customer c) union all&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color:#ffffff;"&gt;......&lt;/span&gt;(SELECT stop_date as thedate, 0 as isstart, 1 as isstop&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color:#ffffff;"&gt;.......&lt;/span&gt;FROM customer c)) a&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;GROUP BY thedate&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;ORDER BY 1&lt;/span&gt;&lt;br 
/&gt;&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;This is inelegant, particularly as the expressions for the tables get more complicated -- imagine what happens if the &lt;span style="font-family:courier new;"&gt;customer&lt;/span&gt; table is actually a complicated set of joins and aggregations. In addition, we can see how expressing the SQL suggests that two full passes are needed through the table. Yuck!&lt;br /&gt;&lt;br /&gt;Let's assume that we have the &lt;span style="font-family:courier new;"&gt;enumerate&lt;/span&gt; table. In this case, the same query could be expressed as:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;span style="font-family:courier new;font-size:85%;"&gt;SELECT (CASE WHEN e.i = 1 THEN start_date ELSE stop_date END) as thedate,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;font-size:85%;"&gt;&lt;span style="color:#ffffff;"&gt;.......&lt;/span&gt;SUM(CASE WHEN e.i = 1 THEN 1 ELSE 0 END) as numstarts,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;font-size:85%;"&gt;&lt;span style="color:#ffffff;"&gt;.......&lt;/span&gt;SUM(CASE WHEN e.i = 2 THEN 1 ELSE 0 END) as numstops&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;font-size:85%;"&gt;FROM customer c CROSS JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;font-size:85%;"&gt;&lt;span style="color:#ffffff;"&gt;.....&lt;/span&gt;(SELECT * FROM enumerate WHERE i &lt;= 2) e&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;font-size:85%;"&gt;GROUP BY (CASE WHEN e.i = 1 THEN start_date ELSE stop_date END) &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;font-size:85%;"&gt;ORDER BY 1&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;This query is setting up a counter that lets us, conceptually, loop through each row in the table. The counter is set to take on two values. On the first pass through the loop, the query uses the start date; on the second, it uses the stop date. 
The result is the same as for the previous query. The SQL, though, is different because it does not express two passes through the data.&lt;br /&gt;&lt;br /&gt;This example is simple. It is obvious how to extend it further, for instance, if there were more dates stored in each row. It should also be obvious how this can be expressed as map/reduce functions.&lt;br /&gt;&lt;br /&gt;One of the most common places where MapReduce is used is for parsing text strings. Say we have a list of product descriptions that are like:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;"green,big,square"&lt;/li&gt;&lt;li&gt;"red,small,square"&lt;/li&gt;&lt;li&gt;"grey"&lt;/li&gt;&lt;li&gt;"medium,smelly,cube,cotton"&lt;/li&gt;&lt;/ul&gt;The idea here is that the description strings consist of any number of comma-separated values. Now, let's say that our goal is to count the number of times that each keyword appears in a set of products. The first thought is that something like this really cannot be done in SQL. So, to help, let's assume that there are two helper functions:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;NumWords(&lt;em&gt;string&lt;/em&gt;, &lt;em&gt;sepchar&lt;/em&gt;)&lt;/span&gt;: This function takes a string and a separator character and returns the number of words in the string.&lt;/li&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;GetWord(&lt;em&gt;string&lt;/em&gt;, &lt;em&gt;sepchar&lt;/em&gt;, &lt;em&gt;i&lt;/em&gt;)&lt;/span&gt;: This function takes a string, a separator character, and a word number and returns that word from the string.&lt;/li&gt;&lt;/ul&gt;For instance, for the examples above, &lt;span style="font-family:courier new;"&gt;NumWords()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;GetWord()&lt;/span&gt; return the following using comma as a separator and when called with 1, 2, 3, and so on:&lt;br /&gt;&lt;br 
/&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;3 and "green", "big", and "square"&lt;/li&gt;&lt;li&gt;3 and "red", "small", "square"&lt;/li&gt;&lt;li&gt;1 and "grey"&lt;/li&gt;&lt;li&gt;4 and "medium", "smelly", "cube", "cotton"&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;These functions are not difficult to write in a procedural programming language.&lt;/p&gt;&lt;p&gt;With such functions, the SQL to count up the attributes in our products looks like:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;span style="font-family:courier new;font-size:85%;"&gt;SELECT GetWord(p.desc, ',', e.i) as attribute, COUNT(*)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;font-size:85%;"&gt;FROM product p JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;font-size:85%;"&gt;&lt;span style="color:#ffffff;"&gt;.....&lt;/span&gt;enumerate e&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;font-size:85%;"&gt;&lt;span style="color:#ffffff;"&gt;.....&lt;/span&gt;ON e.i &lt;= NumWords(p.desc, ',')&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;font-size:85%;"&gt;GROUP BY GetWord(p.desc, ',', e.i) &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;font-size:85%;"&gt;ORDER BY 2 DESC&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;The structure of this query is very similar to the previous query. The one difference is that each row has a different loop counter, because there are a different number of words in any given product description. Hence, the two tables are joined using a standard inner join operator, rather than a cross join.&lt;/p&gt;&lt;p&gt;An &lt;span style="font-family:courier new;"&gt;enumerate&lt;/span&gt; table, in conjunction with user defined functions, can give SQL much of the functionality of MapReduce.&lt;/p&gt;&lt;p&gt;One objection to the previous example is that such a database structure violates many rules of good database design.  Such a product description string is not normalized, for instance.  
And packing values in strings is not a good design practice.  My first reaction is that the real world is filled with similar examples, so regardless of what constitutes good design, we still need to deal with it.&lt;/p&gt;&lt;p&gt;A more honest answer is that the world is filled with strings that contain useful information -- description strings, URLs, and so on.  SQL cannot just ignore such data, or dismiss it as not being normalized.&lt;/p&gt;&lt;p&gt;There is a problem: performance. One way to do the join is to create an intermediate table that is the cross product (or a large subset of the cross product) of the two tables.  Of course, such an intermediate table is equivalent to reading the first table twice, so we have not gained anything.&lt;/p&gt;&lt;p&gt;This is likely to happen in the first case. Without more information, SQL is likely to do nested loop joins. If the customer table is the outer loop, then each &lt;span style="font-family:courier new;"&gt;customer&lt;/span&gt; row is read and duplicated, once with &lt;span style="font-family:courier new;"&gt;i=1&lt;/span&gt; and the second time with &lt;span style="font-family:courier new;"&gt;i=2&lt;/span&gt;. This is not actually so bad. The original row is actually read once and then processed multiple times in memory for each value of &lt;span style="font-family:courier new;"&gt;i&lt;/span&gt;.&lt;/p&gt;&lt;p&gt;Of course, there is no guarantee that the SQL engine won't put the &lt;span style="font-family:courier new;"&gt;enumerate&lt;/span&gt; table as the outer loop, which requires two full reads of the &lt;span style="font-family:courier new;"&gt;customer&lt;/span&gt; table.&lt;/p&gt;&lt;p&gt;The situation becomes worse if the data is partitioned in a parallel environment. 
This is important, because MapReduce's advantage is that it always runs in parallel.&lt;/p&gt;&lt;p&gt;The SQL engine is likely to run a nested loop join on a single processor, even if the &lt;span style="font-family:courier new;"&gt;customer&lt;/span&gt; table is partitioned (or if the database is configured to be "multithreaded" or "multiprocessor"). Only a very smart optimizer would figure out that the &lt;span style="font-family:courier new;"&gt;enumerate&lt;/span&gt; table could be duplicated and distributed so the nested loop join could run in parallel.&lt;/p&gt;&lt;p&gt;The optimization problem is even worse in the second case, because the number of rows needed from &lt;span style="font-family:courier new;"&gt;enumerate&lt;/span&gt; varies for different products. Of course, database directives were invented to tell SQL optimizers how to do joins.  I would prefer that the database have the &lt;span style="font-family:courier new;"&gt;enumerate&lt;/span&gt; table built-in, so the optimizer can take full advantage of it.&lt;/p&gt;Much of MapReduce's functionality comes from the ability to give a single row multiple aggregation keys, while running in parallel.  Even on large datasets, we can set up SQL to solve many problems that MapReduce does by combining user-defined functions, the &lt;span style="font-family:courier new;"&gt;enumerate&lt;/span&gt; table, and appropriate compiler directives so the large joins are done in parallel.&lt;br /&gt;&lt;br /&gt;--gordon</description><link>http://www.data-miners.com/blog/2008/02/using-sql-to-emulate-mapreduce.html</link><author>noreply@blogger.com (Gordon S. 
Linoff)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-7189532166548257236</guid><pubDate>Fri, 25 Jan 2008 16:51:00 +0000</pubDate><atom:updated>2008-01-25T12:52:40.001-05:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>MapReduce</category><category domain='http://www.blogger.com/atom/ns#'>database</category><title>MapReduce and SQL Aggregations</title><description>This is another post discussing the article on MapReduce written by Professors Michael Stonebraker and David DeWitt (available &lt;a href="http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html"&gt;here&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;One of the claims that they make is:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;To draw an analogy to SQL, map is like the &lt;/span&gt;&lt;i style="font-style: italic;"&gt;group-by&lt;/i&gt;&lt;span style="font-style: italic;"&gt; clause of an aggregate query. Reduce is analogous to the &lt;/span&gt;&lt;i style="font-style: italic;"&gt;aggregate&lt;/i&gt;&lt;span style="font-style: italic;"&gt; function (e.g., average) that is computed over all the rows with the same group-by attribute.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This claim is worth discussing in more detail because it is a very powerful and intuitive analogy.  And, it is simply incorrect.  MapReduce is much more powerful than SQL aggregations.&lt;br /&gt;&lt;br /&gt;Another reason why I find their claim interesting is because I use the same analogy to describe MapReduce to people familiar with databases.  Let me explain this in a bit more detail.  
Consider the following SQL query to count the number of customers who start in each month:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;SELECT MONTH(c.start_date), COUNT(*)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;FROM customer c&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;GROUP BY MONTH(c.start_date)&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Now, let's consider how MapReduce would "implement" this query.  Of course, MapReduce is a programming framework, not a query interpreter, but we can still use it to solve the same problem.  The MapReduce framework solves this problem in the following way:&lt;br /&gt;&lt;br /&gt;First, the map phase would read records from the &lt;span style="font-family:courier new;"&gt;customer&lt;/span&gt; table and produce an output record with two parts.  The first part is called the "key", which is populated with &lt;span style="font-family:courier new;"&gt;MONTH(c.start_date)&lt;/span&gt;.   The second is the "value", which can be arbitrarily complicated.  In this case, it is as simple as it gets.  The value part simply contains "1".&lt;br /&gt;&lt;br /&gt;The reduce phase then reads the key-value pairs, and aggregates them.  The MapReduce framework sorts the data so records with the same key always occur together.  This makes it easy for the reduce phase to add together all the "1"s to get the count for each key (which is the month number).  The result is a count for each key.&lt;br /&gt;&lt;br /&gt;I have intentionally oversimplified this process by describing it at a high level.  The first simplification is leaving out all the C++ or Java overhead for producing the programs (although there are attempts at interpreted languages to greatly simplify this process).  Another is not describing the parallel processing aspects.  And yet another oversimplification is leaving out the "combine" step.  
The above algorithm can be made more efficient by first "reducing" the values locally on each processor to get subtotals, and then "reducing" these again.  This post, however, is not about computational efficiency.&lt;br /&gt;&lt;br /&gt;The important thing to note is the following three correspondences between MapReduce and SQL aggregates.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;First, MapReduce uses a "key".  This key is the &lt;span style="font-family:courier new;"&gt;GROUP BY&lt;/span&gt; expression in a SQL aggregation statement.&lt;/li&gt;&lt;li&gt;Second, MapReduce has a "map" function.  This is the expression inside the parentheses.  This can be an arbitrary function or &lt;span style="font-family:courier new;"&gt;CASE&lt;/span&gt; statement in SQL.  In databases that support user defined functions, this can be arbitrarily complicated, as with the "map" function in MapReduce.&lt;/li&gt;&lt;li&gt;Third, MapReduce has a "reduce" function.  This is the aggregation function.  In SQL, this is traditionally one of a handful of functions (&lt;span style="font-family:courier new;"&gt;SUM()&lt;/span&gt;, &lt;span style="font-family:courier new;"&gt;AVG()&lt;/span&gt;, &lt;span style="font-family:courier new;"&gt;MIN()&lt;/span&gt;, &lt;span style="font-family:courier new;"&gt;MAX()&lt;/span&gt;).  However, some databases support user defined aggregation functions, which can be arbitrarily complicated.&lt;/li&gt;&lt;/ol&gt;So, it seems that SQL and MapReduce are equivalent, particularly in an environment where SQL supports user defined functions written in an external language (such as C++, Java, or C-Sharp).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Wrong!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The next example extends the previous one by asking how many customers start and how many stop in each month.  There are several ways of approaching this.  
The following shows one approach using a &lt;span style="font-family:courier new;"&gt;FULL OUTER JOIN&lt;/span&gt;:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;SELECT COALESCE(start.m, stop.m) as m, ISNULL(numstarts, 0), ISNULL(numstops, 0)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;FROM (SELECT MONTH(start_date) as m, COUNT(*) as numstarts&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM customer c&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;GROUP BY MONTH(start_date)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;) start FULL OUTER JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;(SELECT MONTH(stop_date) as m, COUNT(*) as numstops&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM customer c&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;GROUP BY MONTH(stop_date)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;) stop&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;ON start.m = stop.m&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Another formulation might use an aggregation and &lt;span style="font-family:courier new;"&gt;UNION&lt;/span&gt;:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;SELECT m, SUM(isstart), SUM(isstop)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;FROM 
((SELECT MONTH(start_date) as m, 1 as isstart, 0 as isstop&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.......&lt;/span&gt;FROM customer c)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;UNION ALL&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;........&lt;/span&gt;(SELECT MONTH(stop_date) as m, 0 as isstart, 1 as isstop&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;........&lt;/span&gt;FROM customer c)) a&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;GROUP BY m&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Now, in both of these, there are two pieces of processing, one for the starts and one for the stops.  In almost any database optimizer that I can think of, both these queries (and other, similar queries) require two passes through the data, one pass for the starts and one pass for the stops.  And, regardless of the optimizer, the SQL statements describe two passes through the data.&lt;br /&gt;&lt;br /&gt;The MapReduce framework has a more efficient, and perhaps, even more intuitive solution.  
The map phase can produce &lt;span style="font-style: italic;"&gt;two&lt;/span&gt; output keys for each record:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The first has a key that is &lt;span style="font-family: courier new;"&gt;MONTH(start_date)&lt;/span&gt; and the value is a structure containing &lt;span style="font-family: courier new;"&gt;isstart&lt;/span&gt; and &lt;span style="font-family: courier new;"&gt;isstop&lt;/span&gt; with values of 1 and 0 respectively.&lt;/li&gt;&lt;li&gt;The second has a key that is &lt;span style="font-family: courier new;"&gt;MONTH(stop_date)&lt;/span&gt; and the values are 0 and 1 respectively.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;What is important about this example is not the details, simply the fact that the processing is quite different.  The SQL methods describe two passes through the data.  The MapReduce method has only one pass through the data.  &lt;span style="font-style: italic; font-weight: bold;"&gt;In short, MapReduce can be more efficient than SQL aggregations.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;How much better becomes apparent when we look in more detail at what is happening.  When processing data, SQL limits us to one key per record for aggregation.  MapReduce does not have this limitation.  We can have as many keys as we like for each record.&lt;br /&gt;&lt;br /&gt;This difference is particularly important when analyzing complex data structures to extract features -- of which processing text from web pages is an obvious example.  To take another example, one way of doing content filtering of email for spam involves looking for suspicious words in the text and then building a scoring function based on those words (naive Bayesian models would be a typical approach).&lt;br /&gt;&lt;br /&gt;Attempting to do this in MapReduce is quite simple.  The map phase looks for each word and spits out a key value pair for that word (in this case, the key is actually the email id combined with the word).  The reduce phase counts them all up.  
Either the reduce phase or a separate program can then apply the particular scoring code.  Extending such a program to include more suspicious words is pretty simple.&lt;br /&gt;&lt;br /&gt;Attempting to do this in SQL . . . well, I wouldn't do it in the base language.  It would be an obfuscated mess of &lt;span style="font-family: courier new;"&gt;CASE&lt;/span&gt; statements or a non-equi &lt;span style="font-family: courier new;"&gt;JOIN&lt;/span&gt; on a separate word table.  The MapReduce approach is simpler and more elegant in this case.  Ironically, I usually describe the problem as "SQL is not good in handling text."  However, I think the real problem is that "SQL aggregations are limited to one key per record."&lt;br /&gt;&lt;br /&gt;SQL has many capabilities that MapReduce does not have.  Over time, MapReduce may incorporate many of the positive features of SQL for analytic capabilities (and hopefully not incorporate unneeded overhead for transactions and the like).  Today, SQL remains superior for most analytic needs -- and it can take advantage of parallelism without programming.  I hope SQL will also evolve, bringing together the enhanced functionality from other approaches.&lt;br /&gt;&lt;br /&gt;--gordon</description><link>http://www.data-miners.com/blog/2008/01/mapreduce-and-sql-aggregations.html</link><author>noreply@blogger.com (Gordon S. Linoff)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-6918067739103390734</guid><pubDate>Wed, 23 Jan 2008 15:55:00 +0000</pubDate><atom:updated>2008-01-23T14:38:14.569-05:00</atom:updated><title>Relational Databases for Analysis</title><description>Professors Michael Stonebraker and David DeWitt have written a very interesting piece on relational databases and MapReduce (available &lt;a href="http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html"&gt;here&lt;/a&gt;). 
For those who are not familiar with MapReduce, it is a computational framework developed by Google and Yahoo for processing large amounts of data in parallel.&lt;br /&gt;&lt;br /&gt;The response to this article has, for the most part, been to defend MapReduce, which I find interesting because MapReduce is primarily useful for analytic applications. Both technologies make it possible to run large analytic tasks in parallel (taking advantage of multiple processors and multiple disks), without learning the details of parallel hardware and software. This makes both of them powerful for analytic purposes.&lt;br /&gt;&lt;br /&gt;However, Professors Stonebraker and DeWitt make some points that are either wrong, or inconsequential with respect to using databases for complex queries and data warehousing.&lt;br /&gt;&lt;br /&gt;(1) They claim that MapReduce lacks support for updates and transactions, implying that these are important for data analysis.&lt;br /&gt;&lt;br /&gt;This is not true for complex analytic queries. Although updating data within a database is very important for transactional systems, it is not at all important for analytic purposes and data warehousing. In fact, updates imply certain database features that can be quite detrimental to performance.&lt;br /&gt;&lt;br /&gt;Updates imply row-level locking and logging. Both of these are activities that take up CPU and disk resources, but are not necessary for complex queries.&lt;br /&gt;&lt;br /&gt;Updates also tend to imply that database pages are only partially filled. This makes it possible to insert new data without splitting pages, which is useful in transactional systems. However, partially filled pages slow down queries that need to read large amounts of data.&lt;br /&gt;&lt;br /&gt;Updates also work against vertical partitioning (also called columnar databases), where different columns of data are stored on different pages. 
This makes working on wide tables quite feasible, and is one of the tricks used by newer database vendors such as Netezza.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;(2) They claim that MapReduce lacks indexing capabilities, implying that indexing is useful for data analysis.&lt;br /&gt;&lt;br /&gt;One of the shortcomings of the MapReduce framework in comparison to SQL is that MapReduce does not facilitate joins. However, the major use of indexes for complex queries is for looking up values in smaller reference tables, which can often be done in memory. We can assume that all large tables require full table scans.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;(3) MapReduce is incompatible with database tools, such as data mining tools.&lt;br /&gt;&lt;br /&gt;The article actually cites Oracle Data Mining (which grew out of the Darwin project developed by Thinking Machines when I was there) and IBM Intelligent Miner. This latter reference is particularly funny, because IBM has withdrawn this product from the market (see &lt;a href="http://www-306.ibm.com/software/data/iminer/"&gt;here&lt;/a&gt;). The article also fails to cite the most common of these tools, Microsoft SQL Server Data Mining, which is common because it is bundled with the database.&lt;br /&gt;&lt;br /&gt;However, data mining within databases is not a technology that has taken off. One reason is pricing. Additional applications on database platforms often increase the need for hardware -- and more hardware often implies larger database costs. In any case, networks are quite fast and tools can access data in databases without having to be physically colocated with them. Serious data mining practitioners are usually using other tools, such as SAS, SPSS, S-Plus, or R.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;By the way, I am not a convert to MapReduce (my most recent book is called &lt;strong&gt;&lt;em&gt;Data Analysis Using SQL and Excel&lt;/em&gt;&lt;/strong&gt;). 
Its major shortcoming is that it is a programming interface, and having to program detracts from solving business problems. SQL, for all its faults, is still much easier for most people to learn than Java or C++, and, if you do want to program, user-defined extensions can be quite beneficial. However, there are some tasks that I would not want to tackle in SQL, such as processing log files, and MapReduce is one scalable option for such processing.&lt;br /&gt;&lt;br /&gt;--gordon</description><link>http://www.data-miners.com/blog/2008/01/relational-databases-for-analysis.html</link><author>noreply@blogger.com (Gordon S. Linoff)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-7641648688277480603</guid><pubDate>Mon, 14 Jan 2008 15:42:00 +0000</pubDate><atom:updated>2008-01-14T10:51:25.921-05:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>In The News</category><title>Data Mining to Prevent Airline Crashes</title><description>It was refreshing to spot this article in the Washington Post that uses the phrase "data mining" in the same way we do rather than as a synonym for spying or otherwise violating our civil liberties.&lt;br /&gt;&lt;br /&gt;Airline crashes are extremely rare. Rare events pose a challenge in data mining. This article points out one solution which is to model a more common event which is sometimes a precursor to the very rare event of interest.&lt;br /&gt;&lt;br /&gt;(Click the title of this post to go to the Washington Post article.)</description><link>http://www.data-miners.com/blog/2008/01/data-mining-to-prevent-airline-crashes.html</link><author>noreply@blogger.com (Michael J. A. 
Berry)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-8926042345450063728</guid><pubDate>Sun, 25 Nov 2007 00:44:00 +0000</pubDate><atom:updated>2007-12-18T18:08:09.885-05:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>Ask a data miner</category><title>Constructing a Model Set for Recurring Events</title><description>In the &lt;a href="http://www.data-miners.com/blog/2007/11/constructing-model-set-for-binary.html"&gt;previous post&lt;/a&gt;, I answered a question about how to set up a model set for binary churn. It is fairly common for data miners to find ways to express almost any problem as a binary outcome since binary outcome problems are easily approached with familiar tools such as logistic regression or decision trees. The context of the question suggests an alternate approach, however. The event of interest was the purchase of refill pages for a calendar/planner. This is an example of a recurring event. Other examples include:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Visits to a web page.&lt;/li&gt;&lt;li&gt;Purchases of additional minutes for a pre-paid phone plan.&lt;/li&gt;&lt;li&gt;Subscription renewals.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Repeat purchases.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Pregnancies.&lt;/li&gt;&lt;li&gt;Incarcerations.&lt;/li&gt;&lt;li&gt;Posts to a blog.&lt;/li&gt;&lt;/ul&gt;All of these are examples of &lt;span style="font-style: italic;"&gt;counting processes&lt;/span&gt;. A counting process is one where each time an event occurs it increments a total count. The event frequency is governed by an &lt;span style="font-style: italic;"&gt;intensity function&lt;/span&gt;, which is a function of time and other covariates, much like the hazard function in survival analysis for non-recurring events. 
The intensity function can be estimated empirically, or it may be fit by a parametric or semi-parametric model using, for example, the SAS PHREG procedure.  Either way, the data must first be transformed from the way it was probably recorded--dated transactions--to a form suitable for the required calculations.&lt;br /&gt;&lt;p&gt;&lt;img src="http://www.data-miners.com/blog/uploaded_images/recur01-717942.gif" /&gt;&lt;/p&gt;&lt;br /&gt;These are customers making multiple purchases during an observation window. Each time a customer makes a purchase, a transaction record is created. When we add this data to a table in the counting process style, each customer contributes several rows. There is a row for the time from time 0, which may be the time of the initial purchase, to the second purchase, a row for the time to each subsequent purchase, and a row for the time between the final observed purchase and the end of the observation period.&lt;br /&gt;&lt;p&gt;&lt;img src="http://www.data-miners.com/blog/uploaded_images/recur02.gif" /&gt;&lt;/p&gt;&lt;br /&gt;Depending on the style of analysis used, each event may be seen as starting a new time 0 with the number of previous events as a covariate, or each event may be modeled separately with a customer only becoming part of the at-risk pool for event &lt;span style="font-style: italic;"&gt;n &lt;/span&gt;&lt;span&gt;after experiencing event &lt;span style="font-style: italic;"&gt;n&lt;/span&gt;-1.&lt;br /&gt;Either way, it is important to include the final censored time period. This period does not correspond to any transaction, but customers are "at risk" for another purchase during that period.&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;My approach to creating the table is to first create the table without the censored observations, which is reasonably straightforward. Each of these rows contains a flag indicating it is a complete, uncensored observation. 
Next I create just the censored observations by creating an observation going from the latest observed purchase to the end of the observation period (in this case, 22May2006). The censored rows can then be appended to the uncensored rows. These could, of course, be turned into subqueries in order to avoid creating the temporary tables.&lt;br /&gt;&lt;p&gt;&lt;img src="http://www.data-miners.com/blog/uploaded_images/cpcode.gif" /&gt;&lt;/p&gt;&lt;br /&gt;This fully expanded version of the data is what is referred to as the counting process style of input. In a realistic situation where there might be millions of customers, it makes more sense to group by tenure so that there is one row showing how many customers made a purchase with that tenure and how many customers experienced the tenure and so could have made a purchase. This is the data needed to estimate the intensity function.&lt;br /&gt;In Gordon Linoff's book, &lt;a href="http://www.data-miners.com/bookstore.htm"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;, he provides sample code for making a related, but different table using the data available on the &lt;a href="http://www.data-miners.com/sql_companion.htm"&gt;book's companion page&lt;/a&gt;. I reproduce it here for reference.&lt;br /&gt;&lt;p&gt;&lt;img src="http://www.data-miners.com/blog/uploaded_images/recurquery.gif" /&gt;&lt;/p&gt;&lt;br /&gt;The code uses the DATEDIFF function to subtract a household's first order date from all its other order dates to put things on the tenure timeline. It then counts the number of second (or third, or fourth, . . .) purchases that happen at each tenure. 
This query does not track the population at risk, so it is not the actual intensity function, but it nevertheless gives a nice visual image of the way intensity peaks at yearly intervals as many customers make regular annual purchases, just as the purchasers of calendars in the previous posting did.&lt;br /&gt;&lt;p&gt;&lt;img src="http://www.data-miners.com/blog/uploaded_images/intensity.gif" /&gt;&lt;/p&gt;</description><link>http://www.data-miners.com/blog/2007/11/constructing-model-set-for-reccuring.html</link><author>noreply@blogger.com (Michael J. A. Berry)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-6561802382141625640</guid><pubDate>Thu, 01 Nov 2007 16:24:00 +0000</pubDate><atom:updated>2007-11-02T11:51:33.133-04:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>Ask a data miner</category><title>Constructing a model set for binary outcome churn</title><description>Yesterday I received the following question from a reader who is trying to build a churn model for a business where refill purchases are expected to occur annually. The post raises several questions including how to define churn when it happens passively, how to prepare data for a binary outcome churn model, and whether it might be more appropriate to model refills as a repeating event. Although this question happens to be about annual planning book refills, the situation is similar with prepaid phone cards, transit passes, toner cartridges, etc. I will address the issue of modeling repeating events in a follow-up post, but first I will answer the question that was actually asked.&lt;br /&gt;&lt;blockquote style="color: rgb(0, 0, 102);"&gt;Michael,&lt;br /&gt;&lt;br /&gt;I need advise. 
I hope you do not mind me asking questions.&lt;br /&gt;&lt;br /&gt;Our Churn variable definition is if customer did not purchased in 13 months then we consider this customer has churned.&lt;br /&gt;&lt;br /&gt;In this situation, if I want to build a model to see who is likely to leave, my churn variable will take values …&lt;br /&gt;&lt;br /&gt;Churn  = 1 (when last purchased date &gt; 13 month)&lt;br /&gt;else  Churn = 0&lt;br /&gt;&lt;br /&gt;After building a model, my Scoring data (To figure out who is likely to leave) should be…….&lt;br /&gt;&lt;br /&gt;1. Customers who purchased within 13 months to see who are likely to leave   or&lt;br /&gt;&lt;br /&gt;2. Entire database or maybe 4 year buyers (customers whose last purchase date is within 4 years)??  Or&lt;br /&gt;&lt;br /&gt;3. Use Modeling file which I have used create churn model as Scoring file?&lt;br /&gt;&lt;br /&gt;Please let me know.&lt;br /&gt;&lt;br /&gt;Thanks.&lt;br /&gt;&lt;br /&gt;With Best Regards,&lt;br /&gt;&lt;br /&gt;Nilima&lt;/blockquote&gt;First some context.  I know from her email address (which I have removed to protect her from spam) that Nilima works for a company that sells planners and pocket calendars.  The planners have an outer cover that lasts for years. When you order a planner, it comes with a year's worth of pages. As part of the order you specify what month to start with. A year later, you should need a refill. The product is not useful without its refill pages, so if 13 months go by without an order, it is likely that the customer has been lost. (Perhaps he or she now synchronizes a PDA with Outlook, or uses Google Apps, or is now enjoying a schedule-free retirement.)&lt;br /&gt;&lt;br /&gt;As an aside, a purely time-since-last-purchase based definition of churn would not work if the product in question were wall calendars that only cover a particular year. 
In that case, the definition of churn might be "hasn't made a purchase by the end of January" without regard to when the previous purchase was made. There is undoubtedly also a fair amount of seasonality in the purchase of planners--the beginning of the calendar year and the beginning of the academic year seem like likely times to make an initial purchase--but that's OK. The business problem is to identify customers likely to not refill on their anniversary. For this purpose, it is not important that some months have more of these anniversaries than others.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The Data &lt;/span&gt;&lt;br /&gt;The questioner is not a client of ours and I have never seen her data. I will assume that she has several years of history and that there is data for every customer who ever made a purchase during that time. I will further assume that all purchases are captured and that she can reliably link a purchase to a purchaser so repeat purchases are recognized as such. The business goal is to score all active, at-risk customers with a churn probability (or, equivalently and more cheerfully, with a refill probability). Presumably, customers with a high enough churn score will be given some extra incentive to refill.&lt;br /&gt;&lt;br /&gt;It sounds as though Nilima has already taken the first step which is to summarize the purchase transactions to create a customer signature with one row per customer and columns describing them. 
Possible fields include&lt;br /&gt;&lt;br /&gt;Fields derived from purchase data&lt;br /&gt;&lt;ul&gt;&lt;li&gt;number of past refills&lt;/li&gt;&lt;li&gt;months since last refill&lt;br /&gt;&lt;/li&gt;&lt;li&gt;months since first purchase&lt;/li&gt;&lt;li&gt;original product purchased&lt;/li&gt;&lt;li&gt;number of contacts since last refill&lt;/li&gt;&lt;/ul&gt;Fields captured at registration time&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Age at time of first purchase&lt;/li&gt;&lt;li&gt;Sex&lt;/li&gt;&lt;li&gt;Country&lt;/li&gt;&lt;li&gt;Postal code or Zip code&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Fields derived from the above and in combination with census data&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Age at scoring time&lt;/li&gt;&lt;li&gt;Zip median income&lt;/li&gt;&lt;li&gt;Zip population density&lt;/li&gt;&lt;li&gt;Zip percent foreign born&lt;/li&gt;&lt;/ul&gt;Fields that could be purchased from a data vendor&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Estimated household income&lt;/li&gt;&lt;li&gt;Estimated number of children&lt;/li&gt;&lt;li&gt;Estimated number of cars&lt;/li&gt;&lt;li&gt;Cluster assignment (e.g. "urban achievers", "bible devotion")&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Rolling Back the Clock &lt;/span&gt;&lt;br /&gt;Building a predictive model requires data from two distinct time periods. All data is from the past. To build a predictive model, you find patterns in data from the distant past that explain results in the more recent past. The result is a model that can be applied today to predict things that will happen in the future.&lt;br /&gt;&lt;br /&gt;In the current case, you could take a snapshot of what all active customers looked like 14 months ago as your data from the distant past.  In this data set, all of the tenure fields and count fields are reset to what they looked like way back when. Some customers now considered lapsed were still active. Some customers who have now made 4 refills had only made three. 
Customers who are now 65 were only 63, and so forth. Your data from the recent past would then be a single flag indicating whether the customer made a refill within 13 months of his or her previous refill or initial purchase. Note that because the churn definition is in terms of months since last purchase, the calendar date when a customer becomes lapsed must be calculated separately for each customer.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;SAS PROC SQL Code Example &lt;/span&gt;&lt;br /&gt;As I said, I have not seen the data that prompted Nilima's question. I do have some similar data that I can share with readers, however.  Gordon Linoff and I teach a 2-day class on &lt;a href="http://www.data-miners.com/catalog.htm#sdums"&gt;Applying Survival Analysis for Business Time-to-Event Problems&lt;/a&gt;. For that class we use a customer signature with a row for each subscriber, past and present, of a mobile phone company. You can get the data by &lt;a href="http://www.data-miners.com/formpage.htm"&gt;registering on our web site&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The focus of the class is on calculating hazard probabilities for each tenure and using them to create survival curves that can be used to predict a subscriber's remaining lifetime and create subscriber level forecasts. If we wanted to use that data for a binary outcome churn model, we would have to roll back time as described above.  The following SAS code creates a dataset of customers who were active 100 days before the extract or cutoff date. Time is rolled back so that subscribers appear as they did at the observation date. In particular, the subscriber's tenure and age are defined as of the observation date.&lt;br /&gt;&lt;br /&gt;The code does a few other interesting things that may be worth noting.  In the mobile telephony industry, handset is a known driver of churn. Subscribers know that they can get a new, cooler phone by signing up with a competitor as a new subscriber. 
Subscribers with uncool phones are most at risk, but which phones are the least cool is constantly changing over time. Therefore, rather than trying to incorporate the handset model into the model, we incorporate the churn rate associated with each model in the 100 days before the observation date by counting the number of people who stopped with each model and dividing by the number of people carrying each model. &lt;br /&gt;&lt;br /&gt;Another big factor in churn is whether subscribers are on or off contract. Subscribers on contract must pay a fee to cancel their subscriptions. This code calculates two flags--one indicating whether the subscriber is off-contract as of the observation date and another indicating whether the subscriber is scheduled to go off contract (and so become more likely to churn) before the cutoff date.&lt;br /&gt;&lt;br /&gt;The code creates 3 future variables, any of which could be the target for a binary outcome churn model. FutureCHURN is true for anyone who stopped for any reason between the observation date and the cutoff date. FutureVOLUNTARY is true for anyone who stopped voluntarily and FutureINVOLUNTARY is true for anyone who stopped involuntarily.&lt;br /&gt;&lt;br /&gt;&lt;img src="http://www.data-miners.com/images/blogpics/churnquery.gif" alt="SQL code" /&gt;</description><link>http://www.data-miners.com/blog/2007/11/constructing-model-set-for-binary.html</link><author>noreply@blogger.com (Michael J. A. 
Berry)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-3242723853810315196</guid><pubDate>Thu, 27 Sep 2007 22:18:00 +0000</pubDate><atom:updated>2007-09-28T11:06:58.995-04:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>Netflix</category><title>Which movies did 305344 fail to rate?</title><description>&lt;span style="font-family: arial; color: rgb(51, 51, 255);"&gt;Originally posted to a previous version of this blog on 27 April 2007.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I expected that the 117 movies not rated by someone or something that seems to rate every movie would have few raters and an earliest rating date close to the cutoff date for the data. That would be consistent with a rating program of some sort that scores the entire database periodically. This did not prove to be the case. The list of movies customer 305344 failed to rate includes Doctor Zhivago, Citizen Kane and A Charlie Brown Christmas.&lt;br /&gt;&lt;br /&gt;Unlike most of the recent questions, this one cannot be looked up in the rater signature or the movie signature because this information has been summarized away. Instead I used a query on the original training data that has all the rating transactions. 
Later,  I looked up the earliest rating date for each movie not rated by the alpha movie  geek to test my hypothesis that they would be movies only recently made  available for rating.&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;select t.movid from&lt;br /&gt;(select  r.movid as movid, sum(custid=305344) as geek&lt;br /&gt;from netflix.train r&lt;br /&gt;group  by movid) t&lt;br /&gt;where t.geek = 0&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The most rated movies  not rated by the alpha rater geek&lt;/b&gt;&lt;br /&gt;&lt;table&gt;&lt;br /&gt;&lt;tbody&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;title&lt;/td&gt;&lt;td&gt;n_raters&lt;/td&gt;&lt;td&gt;earliest&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;Mystic River&lt;/td&gt;&lt;td&gt;143,682&lt;/td&gt;&lt;td&gt;2003-09-20&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt;&lt;td&gt;Collateral&lt;/td&gt;&lt;td&gt;132,237&lt;/td&gt;&lt;td&gt;2004-05-10&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt; &lt;td&gt;Sideways&lt;/td&gt;&lt;td&gt;117,270&lt;/td&gt;&lt;td&gt;2004-10-22&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt; &lt;td&gt;The Notebook&lt;/td&gt;&lt;td&gt;115,990&lt;/td&gt;&lt;td&gt;2004-05-19&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt; &lt;td&gt;Ray&lt;/td&gt;&lt;td&gt;108,606&lt;/td&gt;&lt;td&gt;2004-10-22&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt; &lt;td&gt;The Aviator&lt;/td&gt;&lt;td&gt;108,354&lt;/td&gt;&lt;td&gt;2004-11-30&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt; &lt;td&gt;Million Dollar Baby&lt;/td&gt;&lt;td&gt;102,861&lt;/td&gt;&lt;td&gt;2004-11-16&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt; &lt;td&gt;Hotel Rwanda&lt;/td&gt;&lt;td&gt;92,345&lt;/td&gt;&lt;td&gt;2004-12-09&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt; &lt;td&gt;The Hunt for Red October&lt;/td&gt;&lt;td&gt;83,249&lt;/td&gt;&lt;td&gt;1999-12-17&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt; &lt;td&gt;12 Monkeys&lt;/td&gt;&lt;td&gt;76,475&lt;/td&gt;&lt;td&gt;1999-12-30&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt; 
&lt;td&gt;Crash&lt;/td&gt;&lt;td&gt;65,074&lt;/td&gt;&lt;td&gt;2005-04-14&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt; &lt;td&gt;Citizen Kane&lt;/td&gt;&lt;td&gt;61,758&lt;/td&gt;&lt;td&gt;2001-03-17&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt; &lt;td&gt;The Saint&lt;/td&gt;&lt;td&gt;28,448&lt;/td&gt;&lt;td&gt;2000-01-05&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt; &lt;td&gt;Doctor Zhivago&lt;/td&gt;&lt;td&gt;17,785&lt;/td&gt;&lt;td&gt;2000-01-12&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt; &lt;td&gt;Hackers&lt;/td&gt;&lt;td&gt;17,452&lt;/td&gt;&lt;td&gt;2000-01-06&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt; &lt;td&gt;The Grapes of Wrath&lt;/td&gt;&lt;td&gt;16,392&lt;/td&gt;&lt;td&gt;2001-03-18&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt; &lt;td&gt;The Pledge&lt;/td&gt;&lt;td&gt;10,969&lt;/td&gt;&lt;td&gt;2001-01-21&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt; &lt;td&gt;A Charlie Brown Christmas&lt;/td&gt;&lt;td&gt;7,546&lt;/td&gt;&lt;td&gt;2000-08-03&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt; &lt;td&gt;The Tailor of Panama&lt;/td&gt;&lt;td&gt;7,421&lt;/td&gt;&lt;td&gt;2001-03-28&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;tr&gt; &lt;td&gt;The Best Years of Our Lives&lt;/td&gt;&lt;td&gt;7,031&lt;/td&gt;&lt;td&gt;2000-01-06&lt;/td&gt;&lt;/tr&gt;&lt;br /&gt;&lt;/tbody&gt;&lt;/table&gt;</description><link>http://www.data-miners.com/blog/2007/09/which-movies-die-305344-fail-to-rate.html</link><author>noreply@blogger.com (Michael J. A. 
Berry)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-8681328011151303036</guid><pubDate>Thu, 27 Sep 2007 22:15:00 +0000</pubDate><atom:updated>2007-09-27T18:17:30.224-04:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>Netflix</category><title>How weird is customer 305344, the rating champion?</title><description>&lt;p class="weblogbody"&gt;&lt;span style="font-family: arial; color: rgb(0, 0, 153);"&gt;Originally posted to a previous version of this blog on 27 April 2007.&lt;/span&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;img alt="305344" src="http://lewis.data-miners.com/blogpics/c305344.gif" align="left" /&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;This most prolific of all customers has rated 17,653 of  the 17,770 movies; all but 117 of them. As with the last super rater we  examined, his ratings are heavily skewed toward the negative end of the scale.  Of course, anyone forced to see every movie in the Netflix catalog would  probably hate most of them. . .</description><link>http://www.data-miners.com/blog/2007/09/how-weird-is-customer-305344-rating.html</link><author>noreply@blogger.com (Michael J. A. Berry)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-5256589522286245890</guid><pubDate>Thu, 27 Sep 2007 22:12:00 +0000</pubDate><atom:updated>2007-09-27T18:15:04.158-04:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>Netflix</category><title>How weird is customer 2439493?</title><description>&lt;p class="weblogbody"&gt;Remember the 5 raters who rated more than 10,000 movies  each? 
It is important to know whether their opinions are similar to those of less prolific raters because for rarely-rated movies, theirs may be the only opinion we have to work with.&lt;br /&gt;&lt;br /&gt;Here is the distribution of 2439493's ratings:&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;img alt="2439493" src="http://lewis.data-miners.com/blogpics/c2439493.gif" align="left" /&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Customer 2439493 rated 16,565 movies and hated 15,024 of them! If he is an actual human, why does he watch so many movies if he dislikes them so much? If he actually watched all the movies and if the average running time was 1.5 hours, then he spent 22,536 hours watching movies he hated. If he did it as a full-time job, eight hours a day with no days off, it would take him about 8 years.&lt;/p&gt;&lt;br /&gt;This customer did like 334 movies (about 2%) well enough to give them a five. Other than a fondness for Christmas- and Santa-themed movies, it is hard to discern what his favorites have in common. Here is a partial listing:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;i&gt;Rudolph the Red-Nosed Reindeer &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;Jingle All the Way &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;The Santa Clause 2 &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;The Santa Clause &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;Back to the Future Part III &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;Frosty's Winter Wonderland / Twas the Night Before Christmas &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;Sleeping Beauty: Special Edition &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;Jack Frost &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;The Princess Diaries (Fullscreen) &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;The Little Princess &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;A Christmas Carol &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;The Passion of the Christ &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;Ernest Saves Christmas &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;Bambi: Platinum Edition &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;New York Firefighters: The 
Brotherhood of 9/11 &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;Left Behind: World at War &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;The Year Without a Santa Claus &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;Miss Congeniality &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;National Lampoon's Christmas Vacation: Special Edition &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;Groundhog Day &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;Maid in Manhattan &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;Jesus of Nazareth &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;The Sound of Music &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;A Charlie Brown Christmas &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;Miracle on 34th Street &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;Mary Poppins &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;The Brave Little Toaster &lt;/i&gt;&lt;/li&gt;&lt;li&gt;&lt;i&gt;The Grinch&lt;/i&gt;&lt;/li&gt;&lt;/ul&gt;  This guy rated 16,565 movies and &lt;i&gt;Miss  Congeniality&lt;/i&gt; and &lt;i&gt;National Lampoon's Christmas Vacation &lt;/i&gt;both made the  top 2%! Representative? I hope not.</description><link>http://www.data-miners.com/blog/2007/09/how-weird-is-customer-2439493.html</link><author>noreply@blogger.com (Michael J. A. 
Berry)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-9110797711009643809</guid><pubDate>Thu, 27 Sep 2007 22:08:00 +0000</pubDate><atom:updated>2007-09-27T18:11:17.930-04:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>Netflix</category><title>How does my ratings distribution compare?</title><description>&lt;span style="font-family: arial; color: rgb(0, 0, 153);"&gt;Originally posted to a previous version of this blog on 18 April 2007.&lt;/span&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;img alt="mjab" src="http://lewis.data-miners.com/blogpics/mjabdist.gif" align="left" /&gt;&lt;--This is how &lt;i&gt;my&lt;/i&gt; scores are  distributed.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;img alt="population" src="http://lewis.data-miners.com/blogpics/popdist.gif" align="right" /&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;This is how scores are distributed in the  population.--&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;I have rated 110 movies--a few more than the  median. My mode is 3 whereas the population mode is 4.&lt;br /&gt;&lt;br /&gt;Thinking about my own  data is not just narcissism; I have found it useful when looking at supermarket  loyalty card data, telephone call data, and drug prescription data. I find it  gets me thinking in more interesting ways. &lt;br /&gt;&lt;br /&gt;For instance, I never give  any movies a 1, so how come other people do? Couldn't they have guessed they  wouldn't like that movie before seeing it? Personally, I never watch a movie  unless I expect to like it at least a little. Ah, but that wasn't always the  case! When my kids were young and living at home, I often suffered through  movies that were not to my liking. Come to think of it, it is not only parents  who sometimes watch things picked by other people. 
Roommates, spouses, and dates can all have terrible taste!</description><link>http://www.data-miners.com/blog/2007/09/how-does-my-ratings-distribution.html</link><author>noreply@blogger.com (Michael J. A. Berry)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-4355564554380549908</guid><pubDate>Thu, 27 Sep 2007 22:03:00 +0000</pubDate><atom:updated>2007-09-27T18:07:19.378-04:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>Netflix</category><title>Outliers for number of movies rated</title><description>&lt;p class="weblogbody"&gt;&lt;span style="color: rgb(0, 0, 153);font-family:arial;" &gt;Originally posted to a previous version of this blog on 18 April 2007.&lt;/span&gt;&lt;br /&gt;&lt;/p&gt;&lt;p class="weblogbody"&gt;My rater signature has a column for the number of movies each subscriber has rated. In the &lt;a href="http://en.wikipedia.org/wiki/J_%28programming_language%29"&gt;J&lt;/a&gt; code below, this column is called n_ratings.&lt;br /&gt;&lt;br /&gt;&lt;code&gt;+/ n_ratings &gt;/ 1000  5000 10000&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;In English, this compares each subscriber's number of ratings with each of the three values, creating a table with three columns. The table contains 1 where the number of ratings is greater and 0 where it is less than or equal to the corresponding value. These 1's and 0's are then summed down each column. The result vector is &lt;code&gt;13100 43 5&lt;/code&gt;.&lt;br /&gt;&lt;br /&gt;The 13,100 people who rated more than a thousand movies are presumably legitimate movie buffs who have seen, and have opinions on, a lot of movies. Rating 10,000 movies does not seem like the expected behavior of a single human. Could these be the collective opinions of an organization? Or automatic ratings generated by a computer program? I don't know. What I do know is that such outliers should be treated with care. One 
One  concern is that for movies that have been rated by very few subscribers, the  ratings will be dominated by these outliers.&lt;br /&gt;&lt;br /&gt;There has been some  discussion of this issue on the &lt;a href="http://www.netflixprize.com/community/viewtopic.php?id=141"&gt;Netflix Prize  Forum&lt;/a&gt;.&lt;br /&gt;&lt;/p&gt;</description><link>http://www.data-miners.com/blog/2007/09/outliers-for-number-of-movies-rated.html</link><author>noreply@blogger.com (Michael J. A. Berry)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-4556233776688919177</guid><pubDate>Thu, 27 Sep 2007 22:01:00 +0000</pubDate><atom:updated>2007-09-27T18:03:03.287-04:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>Netflix</category><title>How many movies does each rater rate?</title><description>&lt;span style="font-family: arial; color: rgb(0, 0, 153);"&gt;Originally posted to a previous version of this blog on 17 April 2007.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Part of my rater signature is how many movies the rater has rated. A few people  (are they really people? Perhaps they are programs or organizations?) have rated  nearly all the movies. Most people have rated very few.&lt;br /&gt;&lt;br /&gt;&lt;img alt="movies rated" src="http://lewis.data-miners.com/blogpics/ratingspr.gif" /&gt;</description><link>http://www.data-miners.com/blog/2007/09/how-many-movies-does-each-rater-rate.html</link><author>noreply@blogger.com (Michael J. A. 
Berry)</author></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-3366935554564939610.post-8768026599786922238</guid><pubDate>Thu, 27 Sep 2007 21:55:00 +0000</pubDate><atom:updated>2007-09-27T18:00:25.786-04:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>Netflix</category><title>How Many People Rate Each Movie?</title><description>&lt;span style="font-family: arial; color: rgb(51, 51, 255);"&gt;Originally posted to a previous version of this blog on 17 April 2007.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Part of my movie signature is a count of how many people have rated each movie. The number drops off very quickly after the most popular movies.&lt;br /&gt;&lt;br /&gt;&lt;img src="http://lewis.data-miners.com/blogpics/raterspm.gif" /&gt;&lt;br /&gt;&lt;br /&gt;The mean number of raters is 5,654.5. The median number of raters is 561. As mentioned in an earlier post, the smallest number of raters for any movie is 3.</description><link>http://www.data-miners.com/blog/2007/09/how-many-peo