<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-3366935554564939610</id><updated>2010-03-31T13:44:04.988-04:00</updated><title type='text'>Data Miners Blog</title><subtitle type='html'>A place to read about topics of interest to data miners, ask questions of the data mining experts at Data Miners, Inc., and discuss the books of Gordon Linoff and Michael Berry.</subtitle><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default?start-index=26&amp;max-results=25'/><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://www.data-miners.com/blog/atom.xml'/><author><name>Michael J. A. 
Berry</name><uri>http://www.blogger.com/profile/06077102677195066016</uri><email>noreply@blogger.com</email></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>92</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-7460335983315184258</id><published>2010-03-31T13:44:00.001-04:00</published><updated>2010-03-31T13:44:05.001-04:00</updated><title type='text'>This blog has moved</title><content type='html'>&lt;br /&gt;       This blog is now located at http://blog.data-miners.com/.&lt;br /&gt;       You will be automatically redirected in 30 seconds, or you may click &lt;a href='http://blog.data-miners.com/'&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;       For feed subscribers, please update your feed subscriptions to&lt;br /&gt;       http://blog.data-miners.com/feeds/posts/default.&lt;br /&gt;  &lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-7460335983315184258?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/7460335983315184258/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2010/03/this-blog-has-moved.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/7460335983315184258'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/7460335983315184258'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2010/03/this-blog-has-moved.html' title='This blog has moved'/><author><name>Michael J. A. 
Berry</name><uri>http://www.blogger.com/profile/06077102677195066016</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='14679622169454737233'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-4244233752559867739</id><published>2010-03-14T22:32:00.001-04:00</published><updated>2010-03-14T22:36:39.956-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Survival Analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='Michael'/><title type='text'>Bitten by an Unfamiliar Form of Left Truncation</title><content type='html'>Alternate title: &lt;b&gt;Data Mining Consultant with Egg on Face&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Last week I made a client presentation. The project was complete. I was presenting the final results to the client.&amp;nbsp; The CEO was there. Also the CTO, the CFO, the VPs of Sales and Marketing, and the Marketing Analytics Manager. The client runs a subscription-based business and I had been analyzing their attrition patterns. Among my discoveries was that customers with "blue" subscriptions last longer than customers with "red" subscriptions. By taking the difference of the area under the two survival curves truncated at one year and multiplying by the subscription cost, I calculated the dollar value of the difference. I put forward some hypotheses about why the blue product was stickier and suggested a controlled experiment to determine whether having a blue subscription actually caused longer tenure or was merely correlated with it. Currently, subscribers simply pick blue or red at sign-up. There is no difference in price.&amp;nbsp; I proposed that half of new customers be given blue by default unless they asked for red and the other half be given red by default unless they asked for blue. 
We could then look for differences between the two randomly assigned groups.&lt;br /&gt;&lt;br /&gt;All this seemed to go over pretty well.&amp;nbsp; There is only one problem.&amp;nbsp; The blue customers may not be better after all.&amp;nbsp; One of the attendees asked me whether the effect I was seeing could just be a result of the fact that blue subscriptions have been around longer than red ones so the oldest blue customers are older than the oldest red customers. I explained that this would not bias my findings because all my calculations were based on the tenure time line, not the calendar time line. We were comparing customers' first years without regard to when they happened. I explained that there &lt;i&gt;would&lt;/i&gt; be a problem if the data set suffered from left truncation, but I had tested for that, and it was not a problem because we knew about starts and stops since the beginning of time.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Left truncation&lt;/i&gt; is something that creates a bias in many customer databases.&amp;nbsp; What it means is that there is no record of customers who stopped before some particular date in the past--the left truncation date. The most likely reason is that the company has been in existence longer than its data warehouse. When the warehouse was created, all active customers were loaded in, but customers who had already left were not. That is fine for most applications, but not for survival analysis. Think about customers who started before the warehouse was built.&amp;nbsp; One (like many thousands of others) stops before the warehouse gets built with a short tenure of two months. Another, who started on the same day as the first, is still around to be loaded into the warehouse with a tenure of two years.&amp;nbsp; Lots of short-tenure people are missing and long-tenure people are overrepresented. 
Average tenure is inflated and retention appears to be better than it really is.&lt;br /&gt;&lt;br /&gt;My client's data did not have that problem.&amp;nbsp; At least, not in the way I am used to looking for it.&amp;nbsp; Instead, it had a large number of stopped customers for whom the subscription type had been forgotten. I (foolishly) just left these people out of my calculations.&amp;nbsp; Here is the problem: Although the customer start and stop dates are remembered forever, certain details, including the subscription type, are purged after a certain amount of time. For all the people who started back when there were only blue subscriptions and had short or even average tenures, that time had already passed. The only ones for whom I could determine the subscription type were those who had unusually long tenures.&amp;nbsp; Eliminating the subscribers for whom the subscription type had been forgotten had exactly the same effect as left truncation!&lt;br /&gt;&lt;br /&gt;If this topic and things related to it sound interesting to you, it is not too late to sign up for a two-day class I will be teaching in New York &lt;b&gt;later this week&lt;/b&gt;.&amp;nbsp; The class is called &lt;a href="http://www.data-miners.com/catalog.htm#sdums"&gt;Survival Analysis for Business Time to Event Problems&lt;/a&gt;. 
It will be held at the offices of SAS Institute in Manhattan this Thursday and Friday, March 18-19.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-4244233752559867739?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/4244233752559867739/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2010/03/bitten-by-unfamiliar-form-of-left.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/4244233752559867739'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/4244233752559867739'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2010/03/bitten-by-unfamiliar-form-of-left.html' title='Bitten by an Unfamiliar Form of Left Truncation'/><author><name>Michael J. A. 
Berry</name><uri>http://www.blogger.com/profile/06077102677195066016</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='14679622169454737233'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-2657292361243080006</id><published>2010-02-25T15:55:00.006-05:00</published><updated>2010-02-27T10:42:51.981-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='SAS Code'/><title type='text'>Agglomerative Variable Clustering</title><content type='html'>Lately, I've been thinking about the topic of reducing the number of variables, and how this is a lot like clustering variables (rather than clustering rows).  This post is about a method that seems intuitive to me, although I haven't found any references to it.  Perhaps a reader will point me to references and a formal name.  This method uses Pearson correlation and principal components to agglomeratively cluster the variables.&lt;br /&gt;&lt;br /&gt;Agglomerative clustering is the process of assigning records to clusters, starting with the records that are closest to each other.  This process is repeated until all records are placed into a single cluster.  The advantage of agglomerative clustering is that it creates a structure for the records, and the user can see different numbers of clusters.  Divisive clustering, such as that implemented by SAS's varclus proc, produces something similar, but from the top down.&lt;br /&gt;&lt;br /&gt;Agglomerative variable clustering works the same way.  Two variables are put into the same cluster, based on their proximity.  
The cluster then needs to be defined in some manner, by combining information in the cluster.&lt;br /&gt;&lt;br /&gt;The natural measure for proximity is the square of the (Pearson) correlation between the variables.  This is a value between 0 and 1 where 0 is totally uncorrelated and 1 means the values are colinear.  For those who are more graphically inclined, this statistic has an easy interpretation when there are two variables.  It is the R-square value of the first principal component of the scatter plot.&lt;br /&gt;&lt;br /&gt;Combining two variables into a cluster requires creating a single variable to represent the cluster.  The natural variable for this is the first principal component.&lt;br /&gt;&lt;br /&gt;My proposed clustering method repeatedly does the following:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Finds the two variables with the highest correlation.&lt;/li&gt;&lt;li&gt;Calculates the principal component for these variables and adds it into the data.&lt;/li&gt;&lt;li&gt;Maintains the information that the two variables have been combined.&lt;/li&gt;&lt;/ol&gt;The attached SAS code (available at &lt;a href="http://www.data-miners.com/blog/sas-var-hierarchical-clustering-v01.sas"&gt;sas-var-hierarchical-clustering-v01.sas&lt;/a&gt;) does exactly this, although not in the most efficient and robust way.  The bulk of the code is a macro, called &lt;span style="font-family:courier new;"&gt;buildcolumns&lt;/span&gt;, that appends the new cluster variables to the data set and maintains another table called &lt;span style="font-family:courier new;"&gt;columns&lt;/span&gt; which has the information about the rows.  
After I run this code, I can select different numbers of variables using the expression:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;proc sql;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;select colname&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;from columns&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;where counter &amp;lt;= [some number]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;These variables can then be used for predictive models or visualization purposes.&lt;br /&gt;&lt;br /&gt;The inner loop of the code works by doing the following:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Calling &lt;span style="font-family:courier new;"&gt;proc corr&lt;/span&gt; to calculate the correlation of all variables not already in a cluster.&lt;/li&gt;&lt;li&gt;Transposing the correlations, using &lt;span style="font-family:courier new;"&gt;proc transpose&lt;/span&gt;, into a table with three columns: two for the variables and one for the correlation.&lt;/li&gt;&lt;li&gt;Finding the pair of variables with the largest correlation.&lt;/li&gt;&lt;li&gt;Calculating the first principal component for these variables.&lt;/li&gt;&lt;li&gt;Appending this principal component to the data set.&lt;/li&gt;&lt;li&gt;Updating the &lt;span style="font-family:courier new;"&gt;columns&lt;/span&gt; data set with information about the new cluster.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;The data set referred to in the code comes from the companion site for &lt;a style="font-style: italic;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;.  
The code will fail (by running an infinite loop) if any variables are missing or if two variables are exactly correlated.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-2657292361243080006?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/2657292361243080006/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2010/02/hierarchical-variable-clustering.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/2657292361243080006'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/2657292361243080006'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2010/02/hierarchical-variable-clustering.html' title='Agglomerative Variable Clustering'/><author><name>Gordon S. Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-1790023569198993177</id><published>2010-02-10T18:52:00.003-05:00</published><updated>2010-02-11T12:43:40.810-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Michael'/><category scheme='http://www.blogger.com/atom/ns#' term='J'/><title type='text'>Why there is always a J window open on my desktop</title><content type='html'>People often ask me what tools I use for data analysis. 
My usual answer is SQL and I explain that just as Willie Sutton robbed banks because "that's where the money is," I use SQL because that is where the data is. But sometimes, it gets so frustrating trying to figure out how to get SQL to do something as seemingly straightforward as a running total or running maximum, that I let the data escape from the confines of its relational tables and into J where it can be free. I assume that most readers have never heard of J, so I'll give you a little taste of it here.&amp;nbsp; It's a bit like R only a lot more general and more powerful. It's even more like APL, of which it is a direct descendant, but those of us who remember APL are getting pretty old these days.&lt;br /&gt;&lt;br /&gt;The question that sent me to J this time came from a client who had just started collecting sales data from a web site and wanted to know how long they would have to wait before being able to draw valid conclusions about whether spending differences between two groups who had received different marketing treatments were statistically significant. One thing I wanted to look at was how much various measures such as average order size and total revenue fluctuate from day to day and how many days it takes before the overall measures settle down near their long-term means. For example, I'd like to calculate the average order size with just one day's worth of purchases, then two days' worth, then three days' worth, and so on. This sort of operation, where a function is applied to successively longer and longer prefixes, is called a scan.&lt;br /&gt;&lt;br /&gt;A warning: J looks really weird when you first see it. One reason is that many things that are treated as a single token are spelled with two characters. I remember when I first saw Dutch, there were all these impossible-looking words with "ij" in them--ijs and rijs, for example. 
Well, it turns out that in Dutch "ij" is treated like a single letter that makes a sound a bit like the English "eye." So ijs is ice and rijs is rice and the Rijn is a famous big river. In J, the second character of these two-character symbols is usually a '.' or a ':'.&lt;br /&gt;&lt;br /&gt;=: is assignment. &amp;lt;. is lesser of. &amp;gt;. is greater of. And so on. You should also know that anything following NB. on a line is comment text.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; x=: ? 100#10&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; NB. One hundred random integers between 0 and 9&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; +/ x&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; NB. Like putting a + between every pair of x--the sum of x.&lt;br /&gt;424&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;lt;. / x&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; NB. Smallest x&lt;br /&gt;0&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;gt;. 
/ x&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; NB. Largest x&lt;br /&gt;9&lt;br /&gt;&amp;nbsp;&amp;nbsp; mean x&lt;br /&gt;4.24&lt;br /&gt;&amp;nbsp;&amp;nbsp; ~. x&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; NB. Nub of x. (Distinct elements.)&lt;br /&gt;3 0 1 4 6 2 8 7 5 9&lt;br /&gt;&amp;nbsp;&amp;nbsp; # ~. x&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; NB. Number of distinct elements.&lt;br /&gt;10&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; x # /. x&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; NB. How many of each distinct element. ( /. 
is like SQL GROUP BY.)&lt;br /&gt;6 10 15 13 15 9 9 12 6 5&lt;br /&gt;&amp;nbsp;&amp;nbsp; +/ \ x&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; NB. Running total of x.&lt;br /&gt;3 3 4 8 12 13 19 23 25 33 41 48 54 56 61 67 69 72 73 74 75 . . .&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;gt;./ \ x&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; NB. Running maximum of x.&lt;br /&gt;3 3 3 4 4 4 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 . . .&lt;br /&gt;&amp;nbsp;&amp;nbsp; mean \ x&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; NB. Running mean of x.&lt;br /&gt;3 1.5 1.33333 2 2.4 2.16667 2.71429 2.875 2.77778 3.3 3.72727 . . .&lt;br /&gt;&amp;nbsp;&amp;nbsp; plot mean \ x&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; NB. 
Plot running mean of x.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://www.data-miners.com/blog/uploaded_images/plot1-712783.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://www.data-miners.com/blog/uploaded_images/plot1-712781.gif" width="320" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; plot var \ x&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; NB. Plot running variance of x.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://www.data-miners.com/blog/uploaded_images/plot2-778109.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://www.data-miners.com/blog/uploaded_images/plot2-778106.gif" width="320" /&gt;&lt;/a&gt;&amp;nbsp;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&amp;nbsp; &lt;/div&gt;J is available for free from &lt;a href="http://www.jsoftware.com/"&gt;J software&lt;/a&gt;. 
Other than as a fan, I have no relationship with that organization.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-1790023569198993177?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/1790023569198993177/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2010/02/why-there-is-always-j-window-open-on-my.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1790023569198993177'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1790023569198993177'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2010/02/why-there-is-always-j-window-open-on-my.html' title='Why there is always a J window open on my desktop'/><author><name>Michael J. A. Berry</name><uri>http://www.blogger.com/profile/06077102677195066016</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='14679622169454737233'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-1670521866370335345</id><published>2010-02-10T12:15:00.004-05:00</published><updated>2010-02-11T12:44:24.187-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='SQL'/><title type='text'>Creating DDL For An Entire Database In SQL Server 2008</title><content type='html'>Recently, I started a new project which has a database component.  I looked around for some visual data modeling tools, and I settled on just using the diagrams capability of SQL Server.  
Since the client is using SQL Server, it was simple to download SQL Server Express and get started using its diagramming tool.&lt;br /&gt;&lt;br /&gt;After creating a bunch of tables, I learned that SQL Server Database Diagrams do not produce the Data Definition Language (DDL) to create the database.  Instead, the tables are created in sync with the diagram.  Furthermore, SQL Server does not have a command that creates the DDL for an entire database.  Right-clicking on two dozen tables is cumbersome.  But even worse, it would not provide complete DDL, since the table DDL does not include index definitions.&lt;br /&gt;&lt;br /&gt;I have seen some debate on the web about the merits of graphical tools versus text DDL.  Each has its advantages, and, personally, I believe that a decent database tool should allow users to switch between the two.  The graphical environment lets me see the tables and their relationships.  The text allows me to make global changes, such as:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Changing all the SMALLDATETIME data types to DATE when I go to a commercial version of SQL Server.  The Express version does not support DATE, alas.&lt;/li&gt;&lt;li&gt;Adding auditing columns -- such as user, creation date, and update date -- to almost all tables.&lt;/li&gt;&lt;li&gt;Adding table-specific comments.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Doing these types of actions in a point-and-click environment is cumbersome, inefficient, and prone to error.  At the same time, the GUI environment is great for designing the tables and visualizing their relationships.&lt;br /&gt;&lt;br /&gt;So, I searched on the web for a DDL program that would allow me to create the DDL for an entire SQL Server database.  Because I did not find any, I decided that I had to write something myself.   
The attached file, &lt;a href="http://www.data-miners.com/blog/script-all-tables.sql"&gt;script-all-tables.sql&lt;/a&gt;, contains my script.&lt;br /&gt;&lt;br /&gt;This script uses SQL to generate SQL code -- a trick that I talk about in my book &lt;a href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers" style="font-style: italic; font-weight: bold;"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;.  The script generates code for the following:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Dropping all tables in the database, if they exist.&lt;/li&gt;&lt;li&gt;Creating new versions of the tables, taking into account primary keys, data types, and identity columns.&lt;/li&gt;&lt;li&gt;Creating foreign key constraints on the tables.&lt;/li&gt;&lt;li&gt;Creating indexes on the tables.&lt;/li&gt;&lt;/ol&gt;This is a very common subset of DDL used for databases.  And, importantly, it seems to cover almost all that you can do using Database Diagrams.  However, the list of what is missing for fully re-creating any database is very, very long, ranging from user-defined types, functions, and procedures, to the storage architecture, replication, and triggers.&lt;br /&gt;&lt;br /&gt;The script uses the views in the &lt;span style="font-style: italic;"&gt;sys&lt;/span&gt; schema rather than those in &lt;span style="font-style: italic;"&gt;Information_Schema&lt;/span&gt; simply because I found it easier to find the information that I needed to put the SQL together.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-1670521866370335345?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/1670521866370335345/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.data-miners.com/blog/2010/02/creating-ddl-for-entire-database-in-sql.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1670521866370335345'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1670521866370335345'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2010/02/creating-ddl-for-entire-database-in-sql.html' title='Creating DDL For An Entire Database In SQL Server 2008'/><author><name>Gordon S. Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-8780607061973592016</id><published>2010-02-02T13:16:00.003-05:00</published><updated>2010-02-02T13:47:31.220-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='Ask a data miner'/><title type='text'>Simpson's Paradox and Marketing</title><content type='html'>A reader asked the following question:&lt;br /&gt;&lt;br /&gt;&lt;div id=":14o" class="ii gt"&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;&lt;span style="font-family:trebuchet ms,sans-serif;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;div&gt;&lt;div style="font-style: italic;"&gt;&lt;span style="font-family:trebuchet ms,sans-serif;"&gt;Hi Michael/Gordon,&lt;/span&gt;&lt;/div&gt; &lt;div style="font-style: italic;"&gt; &lt;/div&gt; &lt;div style="font-style: italic;"&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;&lt;span style="font-family:trebuchet ms,sans-serif;"&gt;In campaign measurements, it's possible to get a larger lift at the overall 
level compared to all the individual decile level lifts or vice versa, because of the differences in sample size across the deciles, and across Test &amp;amp; Control. &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;  &lt;div style="font-style: italic;"&gt; &lt;/div&gt; &lt;div style="font-style: italic;"&gt;&lt;span style="font-family:trebuchet ms,sans-serif;"&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;According to wikipedia, it's known as Simpson's paradox (or the Yule-Simpson effect) and is explained as an apparent paradox&lt;/span&gt;&lt;/span&gt;&lt;span style="color: rgb(0, 0, 0);font-family:trebuchet ms,sans-serif;" &gt; in which the successes in different groups seem to be reversed when the groups are combined.&lt;/span&gt;&lt;/div&gt;  &lt;div style="font-style: italic;"&gt; &lt;/div&gt; &lt;div style="font-style: italic;"&gt;&lt;span style="font-family:Trebuchet MS;"&gt;In such scenarios, how do you calculate the overall lift? Which methods are commonly used in the industry?&lt;/span&gt;&lt;/div&gt; &lt;div style="font-style: italic;"&gt; &lt;/div&gt; &lt;div style="font-style: italic;"&gt;&lt;span style="font-family:Trebuchet MS;"&gt;Thanks,&lt;/span&gt;&lt;/div&gt; &lt;div style="font-style: italic;"&gt;&lt;span style="font-family:Trebuchet MS;"&gt;Datalligence&lt;/span&gt;&lt;/div&gt;&lt;a style="font-style: italic;" href="http://datalligence.blogspot.com/" target="_blank"&gt;&lt;span style="color: rgb(0, 0, 0);font-family:trebuchet ms,sans-serif;" &gt;http://datalligence.blogspot.&lt;wbr&gt;com/&lt;/span&gt;&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;Simpson's Paradox is an interesting phenomenon, where results about subgroups of a population do not generalize to the overall population.  I think the simplest version that I've heard is an old joke . . . "I heard you moved from Minnesota to Iowa, raising the IQ of both states."&lt;br /&gt;&lt;br /&gt;How could this happen? 
For the joke to work, the average IQ in Minnesota must be higher than the average IQ in Iowa.  And, the person who moves must have an IQ between these two values.  Removing that person raises the Minnesota average, and adding that person raises the Iowa average.  Voila, you get the paradox that the averages in both states go up, although they are based on exactly the same population.&lt;br /&gt;&lt;br /&gt;I didn't realize that this paradox has a name (or, if I did, then I had forgotten).  Wikipedia has a very good article on &lt;a href="http://en.wikipedia.org/wiki/Simpson%27s_paradox"&gt;Simpson's Paradox&lt;/a&gt;, which includes real-world examples from baseball, medical studies, and an interesting discussion of a gender discrimination lawsuit at Berkeley.  In the gender discrimination lawsuit, women were accepted at a much lower rate than men overall.  However, department by department, women were typically accepted at a higher rate than men.  The difference is that women applied to more competitive departments than men.  These departments have lower rates of acceptance, lowering the overall rate for women.&lt;br /&gt;&lt;br /&gt;Simpson's Paradox arises when we are taking weighted averages of evidence from different groups.  Different weightings can produce very different, even counter-intuitive results.  The results become much less paradoxical when we see the actual counts rather than just the percentages.&lt;br /&gt;&lt;br /&gt;The specific question is how to relate this paradox to lift and to understanding marketing campaigns.  Assume there is a marketing campaign, where one group receives a particular treatment and another group does not.  The ratio of performance between these two groups is the lift of the marketing campaign.&lt;br /&gt;&lt;br /&gt;To avoid Simpson's Paradox, you need to ensure that the groups are as similar as possible, except for what's being tested.  If the test is for the marketing message, there is no problem, because both groups can be pulled from the same population.  
If, instead, the test is for the marketing group itself (say high value customers), then Simpson's Paradox is not an issue, since we care about how the group performs rather than how the entire population performs.&lt;br /&gt;&lt;br /&gt;As a final comment, I could imagine finding marketing results where Simpson's Paradox has surfaced, because the original groups were not well chosen.  Simpson's Paradox arises because the sizes of the test groups are not proportional to their sizes in the overall population.   In this case, I would be tempted to weight the results from each group based on the expected size in the overall population to calculate the overall response and lift.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-8780607061973592016?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/8780607061973592016/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2010/02/simpsons-paradox-and-marketing.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/8780607061973592016'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/8780607061973592016'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2010/02/simpsons-paradox-and-marketing.html' title='Simpson&apos;s Paradox and Marketing'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-7145271994237916172</id><published>2010-01-19T11:36:00.000-05:00</published><updated>2010-01-19T11:36:54.862-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Michael'/><title type='text'>Oracle load scripts now available for Data Analysis Using SQL and Excel</title><content type='html'>Classes started this week for the spring semester at Boston College, where I am teaching a class on marketing analytics to MBA students at the Carroll School of Management.&amp;nbsp; The class makes heavy use of Gordon's book, &lt;a href="http://www.data-miners.com/bookstore.htm"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;, and the data that accompanies it. Since the local database is Oracle, I have at long last added Oracle load scripts to the book's &lt;a href="http://www.data-miners.com/sql_companion.htm"&gt;companion page&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;Due to laziness, my method of creating the Oracle script was to use the existing MySQL script and edit the bits that didn't work in Oracle.&amp;nbsp; As it happens, the MySQL scripts worked pretty much as-is to load the tab-delimited data into Oracle tables using Oracle's sqlldr utility. 
One case that did &lt;u&gt;not&lt;/u&gt; work taught me something about the danger of mixing tab-delimited data with input formats in sqlldr.&amp;nbsp; Even though it has nothing to do with data mining, as a public service, that will be the topic of my next post.&lt;br /&gt;&lt;br /&gt;Preview: Something that works perfectly well when your field delimiter is a comma fails mysteriously when it is a tab.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-7145271994237916172?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://www.data-miners.com/sql_companion.htm' title='Oracle load scripts now available for Data Analysis Using SQL and Excel'/><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/7145271994237916172/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2010/01/oracle-load-scripts-now-avalable-for.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/7145271994237916172'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/7145271994237916172'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2010/01/oracle-load-scripts-now-avalable-for.html' title='Oracle load scripts now available for Data Analysis Using SQL and Excel'/><author><name>Michael J. A. 
Berry</name><uri>http://www.blogger.com/profile/06077102677195066016</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='14679622169454737233'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-2526895588214669915</id><published>2010-01-09T17:31:00.006-05:00</published><updated>2010-01-09T18:35:30.488-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='Ab Initio'/><category scheme='http://www.blogger.com/atom/ns#' term='MapReduce'/><category scheme='http://www.blogger.com/atom/ns#' term='Hadoop'/><title type='text'>Hadoop and Parallel Dataflow Programming</title><content type='html'>Over the past three months, I have been teaching myself enough Hadoop to get comfortable with using the environment for analytic purposes.&lt;br /&gt;&lt;br /&gt;There has been a lot of commentary about Hadoop/MapReduce versus relational databases (such as the articles referenced in my previous &lt;a href="http://www.data-miners.com/blog/2010/01/mapreduce-versus-relational-databases.html"&gt;post&lt;/a&gt; on the subject).  I actually think this discussion is misplaced because comparing open-source software with commercial software aligns people on "religious" grounds.  Some people will like anything that is open-source.  Some people will attack anything that is open-source (especially people who work for commercial software vendors).  And, the merits of real differences get lost.  Both Hadoop and relational databases are powerful systems for analyzing data, and each has its own distinct set of advantages and disadvantages.&lt;br /&gt;&lt;br /&gt;Instead, I think that Hadoop should be compared to a parallel dataflow style of programming.  What is a dataflow style of programming?  
It is a style where we watch the data flow through different operations, forking and combining along the way, to achieve the desired goal.  Not only is a dataflow a good way to understand relational databases (which is why I introduce it in Chapter 1 of &lt;a style="font-weight: bold; font-style: italic;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;), but the underlying engines that run SQL queries are dataflow engines.&lt;br /&gt;&lt;br /&gt;Parallel dataflows extend dataflow processing to grid computing.  To my knowledge, the first commercial tool that implements parallel dataflows was developed by &lt;a href="www.init.com"&gt;Ab Initio&lt;/a&gt;.  This company was a spin-off from a bleeding-edge parallel supercomputer vendor called &lt;a href="http://en.wikipedia.org/wiki/Thinking_Machines"&gt;Thinking Machines&lt;/a&gt; that went bankrupt in 1994.  As a matter of full disclosure:  Ab Initio was actually formed from the group that I worked for at Thinking Machines.  Although they are very, very, very resistant to sharing information about their technology, I am rather familiar with it.  I believe that the only publicly available information about them (including screen shots) is published in our book &lt;a href="http://www.amazon.com/exec/obidos/ASIN/0471331236/thedataminers"&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Mastering Data Mining:  The Art and Science of Customer Relationship Management&lt;/span&gt;&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I am confident that Apache has at least one dataflow project, since when I google "dataflow apache" I get a pointer to the &lt;a href="http://freshmeat.net/projects/dapper"&gt;Dapper&lt;/a&gt; project.  My wish, however, is that Hadoop were the parallel dataflow project.&lt;br /&gt;&lt;br /&gt;Much of what Hadoop does goes unheralded by the typical MapReduce user.  
On a massively parallel system, Hadoop keeps track of the different parts of an HDFS file and, when the file is being used for processing, Hadoop does its darndest to keep the processing local to each file part being processed.  This is great, since data locality is key to achieving good performance.&lt;br /&gt;&lt;br /&gt;Hadoop also keeps track of which processors and disk systems are working.  When there is a failure, Hadoop tries again, insulating the user from sporadic hardware faults.&lt;br /&gt;&lt;br /&gt;Hadoop also does a pretty good job of shuffling data around between the map and reduce operations.  The shuffling method -- sort, send, and sort again -- may not be the most efficient but it is quite general.&lt;br /&gt;&lt;br /&gt;Alas, there are several things that Hadoop does not do, at least when accessed through the MapReduce interface.  Supporting these features would allow it to move beyond the MapReduce paradigm, giving it the power to support more general parallel dataflow constructs.&lt;br /&gt;&lt;br /&gt;The first thing that bothers me about Hadoop is that I cannot easily take a text file and just copy it with the Map/Reduce primitives.  Copying a file seems like something that should be easy.  The problem is that a key gets generated during the map processing.  The original data gets output with a key prepended, unless I do a lot of work to parse out the first field and use it as a key.&lt;br /&gt;&lt;br /&gt;Could the &lt;span style="font-family: courier new;"&gt;context.write()&lt;/span&gt; function be overloaded with a version that does not output a key?  Perhaps this would only be possible in the reduce phase, since I understand the importance of the key for going from map to reduce.&lt;br /&gt;&lt;br /&gt;A performance issue with Hadoop is the shuffle phase between the map and the reduce.  As I mentioned earlier, the sort-send-sort process is quite general.  Alas, though, it requires a lot of work.  
An alternative that often works well is simply hashing.  To maintain the semantics of map-reduce, I think this would be hash-send-combine or hash-send-sort.  The beauty of using hashing is that the data can be sent to its destination while the map is still processing it.  This allows concurrent use of the processing and network during this operation.&lt;br /&gt;&lt;br /&gt;And, speaking of performance, why does the key have to go before the data?  Why can't I just point to a sequence of bytes and use that for the key?  This would enable a programming style that doesn't spend so much time parsing keys and duplicating information between values and keys.&lt;br /&gt;&lt;br /&gt;Perhaps the most frustrating aspect of Hadoop is the MapReduce framework itself.  The current version allows processing like (M+)(R)(M*).  What this notation means is that the processing starts with one or more map jobs, goes to a reduce, and continues with zero or more map jobs.&lt;br /&gt;&lt;br /&gt;THIS IS NOT GENERAL ENOUGH!  I would like to have an arbitrary number of maps and reduces connected however I like.  So, one map could feed &lt;span style="font-style: italic;"&gt;two different reduces&lt;/span&gt;, each having different keys.  At the same time, one of the reduces could feed another reduce without having to go through an intermediate map phase.&lt;br /&gt;&lt;br /&gt;This would be a big step toward parallel dataflow programming, since Map and Reduce are two very powerful primitives for this purpose.&lt;br /&gt;&lt;br /&gt;There are some other primitives that might be useful.  One would be &lt;span style="font-style: italic;"&gt;broadcast&lt;/span&gt;.  This would take the output from one processing node during one phase and send it to all the other nodes (in the next phase).  Let's just say that using &lt;span style="font-style: italic;"&gt;broadcast&lt;/span&gt;, it would be much easier to send variables around for processing.  
No more defining weird variables using "set" in the main program, and then parsing them in &lt;span style="font-family: courier new;"&gt;setup()&lt;/span&gt; functions.  No more setting up temporary storage space, shared by all the processors.  No more using HDFS to store small serial files, local to only one node.  Just send data through a broadcast, and it goes everywhere.  (If the broadcast is running on more than one node, then the results would be concatenated together, everywhere.)&lt;br /&gt;&lt;br /&gt;And, if I had a broadcast, then my two-pass row number code (&lt;a href="http://www.data-miners.com/blog/2009/11/hadoop-and-mapreduce-parallel-program.html"&gt;here&lt;/a&gt;) would only require one pass.&lt;br /&gt;&lt;br /&gt;I think Hadoop already supports having multiple different input files into one reduce operator.  This is quite powerful, and a far superior way of handling join processing.&lt;br /&gt;&lt;br /&gt;It would also be nice to have a final sort operator.  In the real world, people often do want sorted results.&lt;br /&gt;&lt;br /&gt;In conclusion, parallel dataflows are a very powerful, expressive, and efficient way of implementing complex data processing tasks.  Relational databases use dataflow engines for their processing.  With non-procedural languages such as SQL, the power of dataflows is hidden from the user -- and some relatively simple dataflow constructs can be quite difficult to express in SQL.&lt;br /&gt;&lt;br /&gt;Hadoop is a powerful system that emulates parallel dataflow programming.  Any step in a dataflow can be implemented using a MapReduce pass -- but this requires reading, writing, sorting, and sending the data multiple times.  With a few more features, Hadoop could efficiently implement parallel dataflows.  
I feel this would be a big boost to both performance and utility, and it would leverage the power already provided by the Hadoop framework.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-2526895588214669915?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/2526895588214669915/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2010/01/hadoop-and-parallel-dataflow.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/2526895588214669915'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/2526895588214669915'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2010/01/hadoop-and-parallel-dataflow.html' title='Hadoop and Parallel Dataflow Programming'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-1189003498857208009</id><published>2010-01-05T14:59:00.005-05:00</published><updated>2010-01-05T15:53:11.806-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='MapReduce'/><category scheme='http://www.blogger.com/atom/ns#' term='database'/><title type='text'>MapReduce versus Relational Databases?</title><content type='html'>The current issue of &lt;span style="font-weight: bold; font-style: italic;"&gt;Communications of the ACM&lt;/span&gt; has articles on MapReduce and relational databases.  One, &lt;a style="font-style: italic;" href="http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext"&gt;MapReduce: A Flexible Data Processing Tool&lt;/a&gt;, is written by two Google Fellows and explains the utility of MapReduce -- appropriate authors, since Google invented the parallel MapReduce paradigm.&lt;br /&gt;&lt;br /&gt;The second article, &lt;a href="http://cacm.acm.org/magazines/2010/1/55743-mapreduce-and-parallel-dbmss-friends-or-foes/fulltext"&gt;&lt;span style="font-style: italic;"&gt;MapReduce and Parallel DBMSs:  Friends or Foes?&lt;/span&gt;&lt;/a&gt;, is written by a team of authors, with Michael Stonebraker listed as the first author.  I am uncomfortable with this article, because the article purports to show the superiority of a particular database system, Vertica, without mentioning -- anywhere -- that Michael Stonebraker is listed as the CTO and Co-Founder on Vertica's &lt;a href="http://www.vertica.com/leadership"&gt;web site&lt;/a&gt;.  
For this reason, I believe that this article should be subject to much more scrutiny.&lt;br /&gt;&lt;br /&gt;Before starting, let me state that I personally have no major relationships with any of the database vendors or with companies in the Hadoop/MapReduce space.  I am an advocate of using relational databases for data analysis and have written a book called &lt;a style="font-style: italic; font-weight: bold;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;.  And, over the past three months, I have been learning Hadoop and MapReduce, as attested to by numerous blog postings on the subject.  Perhaps because I am a graduate of MIT ('85), I am upset that Michael Stonebraker uses his MIT affiliation for this article, without mentioning his Vertica affiliation.&lt;br /&gt;&lt;br /&gt;The first thing I notice about the article is the number of references to Vertica.  In the main text, I count nine references to Vertica, as compared to thirteen mentions of other databases:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Aster (twice)&lt;/li&gt;&lt;li&gt;DataAllegro (once)&lt;/li&gt;&lt;li&gt;DB2 (twice)&lt;/li&gt;&lt;li&gt;Greenplum (twice)&lt;/li&gt;&lt;li&gt;Netezza (once)&lt;/li&gt;&lt;li&gt;ParAccel (once)&lt;/li&gt;&lt;li&gt;PostgreSQL (once)&lt;/li&gt;&lt;li&gt;SQL Server (once)&lt;/li&gt;&lt;li&gt;Teradata (once)&lt;/li&gt;&lt;/ul&gt;The paper describes a study which compares Vertica, another database, and Hadoop on various tasks.  The paper never explains how these databases were chosen for this purpose.  Configuration issues for the other database and Hadoop are mentioned.  Since no problems are mentioned for Vertica, one assumes that its configuration and installation are easy and smooth.  I have not (yet) read the paper cited, which describes the work in more detail.&lt;br /&gt;&lt;br /&gt;Also, the paper never describes costs for the different systems, which is a primary driver of MapReduce.  
The software is free and runs on cheap clusters of computers, rather than expensive servers and hardware.  For a given amount of money, MapReduce may provide a much faster solution, since it can support much larger hardware environments.&lt;br /&gt;&lt;br /&gt;The paper never describes issues in the loading of data.  I assume this is a significant cost for the databases.  Loading the data for Hadoop is much simpler . . . since it just reads text files, which is a common format.&lt;br /&gt;&lt;br /&gt;From what I can gather, the database systems were optimized specifically for the tasks at hand, although this is not explicitly mentioned anywhere.  For instance, the second task is a &lt;span style="font-family: courier new;"&gt;GROUP BY&lt;/span&gt;, and I suspect that the data is hash partitioned by the &lt;span style="font-family: courier new;"&gt;GROUP BY&lt;/span&gt; clause.&lt;br /&gt;&lt;br /&gt;There are a few statements that I basically disagree with.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;"Lastly, the reshuffle that occurs between the Map and Reduce tasks in MR is equivalent to a GROUP BY operation in SQL."&lt;/span&gt;  The issue here at first seems like a technicality.  In a relational database, an input row can only go into one group.  MR can output multiple records in the map stage, so a single row can go into multiple "groups".  This functionality is important for the word count example, which is the canonical MapReduce example.  I find it interesting that this example is not included in the benchmark.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;"Given this, parallel DBMSs provide the same computing model as MR, with the added benefit of using a declarative language (SQL)."&lt;/span&gt;  This is not true in several respects.  First, MapReduce does have associated projects for supporting declarative languages.  
Second, in order for SQL to support the level of functionality that the authors claim, they need to use user defined functions.  Is that syntax declarative?&lt;br /&gt;&lt;br /&gt;More important, though, is that the computing model really is not exactly the same.  Well, with SQL extensions such as &lt;span style="font-family: courier new;"&gt;GROUPING SET&lt;/span&gt;s and window functions, the functionality does come close.  But, consider the ways that you can add a row number to data (assuming that you have no row number function built-in) using MapReduce versus traditional SQL.  Using MapReduce you can follow the two-phase program that I described in an earlier &lt;a href="http://www.data-miners.com/blog/2009/11/hadoop-and-mapreduce-parallel-program.html"&gt;posting&lt;/a&gt;.  With traditional SQL, you have to do a non-equi-self join.  MapReduce has a much richer set of built-in functions and capabilities, simply because it uses Java, an established programming language with many libraries.&lt;br /&gt;&lt;br /&gt;On the other hand, MapReduce does not have a concept of "null" built-in (although users can define their own data types and semantics).  And, MapReduce handles non-equijoins poorly, because the key is used to direct both tables to the same node.  In effect, you have to limit the MapReduce job to one node.  SQL can still parallelize such queries.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;"[MapReduce] still requires user code to parse the value portion of the record if it contains multiple attributes."&lt;/span&gt;  Well, parse is the wrong term, since a &lt;span style="font-family: courier new;"&gt;Writable&lt;/span&gt; class supports binary representations of data types.  
I describe how to create such types &lt;a href="http://www.data-miners.com/blog/2009/12/hadoop-020-creating-types.html"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I don't actually feel qualified to comment on many of the operational aspects of optimizing Hadoop code.  I do note that the authors do not explain the main benefit of Vertica, which is the support of column partitioning.  Each column is stored separately, which makes it possible to apply very strong compression algorithms to the data.  In many cases, the Vertica data will fit in memory.  This is a huge performance boost (and one that another vendor, ParAccel, takes advantage of).&lt;br /&gt;&lt;br /&gt;In the end, the benchmark may be comparing the in-memory performance of a database to general performance for MapReduce.  The benchmark may not be including the ETL time for loading the data, partitioning data, and building indexes.  The benchmark may not have allocated optimal numbers of map and reduce jobs for the purpose.  And, it is possible that the benchmark is unbiased and relational databases really are better.&lt;br /&gt;&lt;br /&gt;A paper that leaves out the affiliations between its authors and the vendors used for a benchmark is only going to invite suspicion.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-1189003498857208009?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/1189003498857208009/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2010/01/mapreduce-versus-relational-databases.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1189003498857208009'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1189003498857208009'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2010/01/mapreduce-versus-relational-databases.html' title='MapReduce versus Relational Databases?'/><author><name>Gordon S. Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-5656931797147731472</id><published>2010-01-02T09:39:00.016-05:00</published><updated>2010-01-02T17:25:30.820-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='MapReduce'/><category scheme='http://www.blogger.com/atom/ns#' term='Hadoop'/><title type='text'>Hadoop and MapReduce:  Normalizing Data Structures</title><content type='html'>When I set out to learn Hadoop and Map/Reduce, I tackled several different problems.  The last of these problems is the challenge of normalizing data, a concept from the world of relational databases.  The earlier problems were adding &lt;a href="http://www.data-miners.com/blog/2009/11/hadoop-and-mapreduce-parallel-program.html"&gt;sequential row numbers&lt;/a&gt; and &lt;a href="http://www.data-miners.com/blog/2009/12/hadoop-and-mapreduce-characterizing.html"&gt;characterizing values in the data&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;This posting describes data normalization, explains how I accomplished it in Hadoop/MapReduce, and points out some tricks in the code.  I should emphasize here that the code is really "demonstration" code, meaning that I have not worked hard on being sure that it always works.  
My purpose is to demonstrate the idea of using Hadoop to do normalization, rather than producing 100% working code.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;font-family:arial;" &gt;What is Normalization and Why Do We Want To Do It?&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Data normalization is the process of extracting values from a single column and placing them in a reference table.  The data used by Hadoop is typically unnormalized, meaning that data used in processing is in a single record, so there is no need to join in reference tables.  In fact, doing a join is not obvious using the MapReduce primitives, although my understanding is that Hive and Pig -- two higher level languages based on MapReduce -- do incorporate this functionality.&lt;br /&gt;&lt;br /&gt;Why would we want to normalize data? (This is a good place to plug my book &lt;a style="font-weight: bold; font-style: italic;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;, which explains this concept in more detail in the first chapter.)   In the relational world, the reason is something called "relational integrity", meaning that any particular value is stored in one, and only one, place.  For instance, if the state of California were to change its name, we would not want to update every record from California.  Instead, we'd rather go to the reference table and just change the name to the new name, since the data field contains a state id rather than the state name itself.  Relational integrity is particularly important when data is being updated.&lt;br /&gt;&lt;br /&gt;Why would we want to normalize data used by Hadoop?  There are two reasons.  The first is that we may be using Hadoop processing to load a relational database -- one that is already designed with appropriate reference tables.  
This is entirely reasonable: relational databases are an attractive way to "publish" results from complex data processing, since they are better for creating end-user reports and building interactive GUIs.&lt;br /&gt;&lt;br /&gt;The second reason is performance.  Extracting long strings and putting them in a separate reference table can significantly reduce the storage requirements for the data files.  Most of the space taken up in typical log files, for instance, consists of long URIs (what I used to call URLs).  When processing the log files, we might want to extract some features from the URIs, but keeping the entire string just occupies a lot of space -- even in a compressed file.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;font-family:arial;" &gt;The Process of Normalizing Data&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Normalizing data starts with data structures.  The input records are assumed to be in a delimited format, with the column names in the first row (or provided separately, although I haven't tested that portion of the code yet).  In addition, there is a "master" id file that contains the following columns:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;id -- a unique id for every value by column.&lt;/li&gt;&lt;li&gt;column name -- the name of the column.&lt;/li&gt;&lt;li&gt;value -- the value in the column.&lt;/li&gt;&lt;li&gt;count -- the total number of times the value has so far occurred.&lt;/li&gt;&lt;/ul&gt;This is a rudimentary reference file.  I could imagine, for instance, having more information than just the count as summary information -- perhaps the first and last date when the value occurs.&lt;br /&gt;&lt;br /&gt;What happens when we normalize data?  Basically, we look through the data file to find new values in each column being normalized.  
We append these new values into the master id file, and then go back to the original data and replace the values with the ids.&lt;br /&gt;&lt;br /&gt;Hadoop is a good platform for this for several reasons.  First, because the data is often stored as text files, the values and the ids have the same type -- text strings.  This means that the file structures remain the same.  Second, Hadoop can process multiple columns at the same time.  Third, Hadoop can use inexpensive clusters and free software for this task, rather than relying on databases and tools, which are often more expensive.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;font-family:arial;" &gt;How To Normalize Data Using Hadoop/MapReduce&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The normalization process has six steps.  Most of these correspond to a single Map-Reduce pass.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;font-family:arial;" &gt;Step 1:  Extract the column value pairs from the original data.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This step explodes the data, by creating a new data set with multiple rows for each row in the original data.  Each output row contains a column, a value, and the number of times the value appears in the data.  Only columns being normalized are included in the output.&lt;br /&gt;&lt;br /&gt;This step also saves the column names for the data file in a temporary file.  I'll return to why this is needed in Step 6.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;font-family:arial;" &gt;Step 2:  Extract column-value Pairs Not In Master ID File&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This step compares the column-value pairs produced in the first step with those in the master id file.  This step is interesting, because it reads data from two different data source formats -- the master id file and the results from Step 1.  
Both sets of data files use the &lt;span style="font-family:courier new;"&gt;GenericRecord&lt;/span&gt; format.&lt;br /&gt;&lt;br /&gt;To identify the master file, the map function looks at the original data to see whether "/master" appears in the path.  Alternative methods would be to look at the &lt;span style="font-family:courier new;"&gt;GenericRecord&lt;/span&gt; that is created or to use &lt;span style="font-family:courier new;"&gt;MultipleInputs&lt;/span&gt; (which I didn't use because of a &lt;a href="http://www.facebook.com/note.php?note_id=77978247002&amp;amp;ref=mf&amp;amp;_fb_noscript=1"&gt;warning&lt;/a&gt; on Cloudera's web site).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;font-family:arial;" &gt;Step 3:  Calculate the Maximum ID for Each Column in the Master File&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This is a very simple Map-Reduce step that simply gets the maximum id for each column.  New ids that are assigned will be assigned one more than this value.&lt;br /&gt;&lt;br /&gt;This is an instance where I would very much like to have two different reduces following a map step.  If this were possible, then I could combine this step with step 2.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;font-family:arial;" &gt;Step 4:  Calculate a New ID for the Unmatched Values&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This is a two step process that follows the mechanism for adding row numbers discussed in one of my earlier &lt;a href="http://www.data-miners.com/blog/2009/11/hadoop-and-mapreduce-parallel-program.html"&gt;posts&lt;/a&gt;, with one small modification.  
The final result has the maximum id value from Step 3 added onto it, so the result is a new id rather than just a row number.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;font-family:arial;" &gt;Step 5:  Merge the New IDs with the Existing Master IDs&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;This step merges in the results from Step 4 with the existing master id file.  Currently, the results are placed into another directory.  Eventually, they could simply overwrite the master id file.&lt;br /&gt;&lt;br /&gt;Because of the structure of the Hadoop file system, the merge could be as simple as copying the file with the new ids into the appropriate master id data space.  However, this would result in an unbalanced master id file, which is probably not desirable for longer term processing.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;font-family:arial;" &gt;Step 6: Replace the Values in the Original Data with IDs&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;This final step replaces the values with ids -- the actual normalization step.  This is a two part process.  The map phase of the first part takes both the original data and the master id file.  All the column value pairs are exploded from the original data, as in Step 1, with the output consisting of:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;key:  &amp;lt;column name&amp;gt;:&amp;lt;column value&amp;gt;&lt;/li&gt;&lt;li&gt;value:  &amp;lt;"expect"|"nomaster"&amp;gt;, &amp;lt;partition rownumber&amp;gt;, &amp;lt;column number&amp;gt;&lt;/li&gt;&lt;/ul&gt;The first part ("expect" or "nomaster") is an indicator of whether this column should be normalized (that is, whether or not to expect a master id).  The second field identifies the original data record, which is uniquely identified by the partition id and row number within that partition.  
The third is the column number in the row.&lt;br /&gt;&lt;br /&gt;The master records are placed in the format:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;key:  &amp;lt;column name&amp;gt;:&amp;lt;column value&amp;gt;&lt;/li&gt;&lt;li&gt;value:  "master", &amp;lt;id&amp;gt;&lt;/li&gt;&lt;/ul&gt;The reduce then reads through all the records for a given column-value combination.  If one of them is a master, then it outputs the id for all records.  Otherwise, it outputs the original value.&lt;br /&gt;&lt;br /&gt;The last phase simply puts the records back together again, from their exploded form.  The one trick here is that the metadata is read from a local file.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;font-family:arial;" &gt;Tricks Used In This Code&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The code is available in these files:  &lt;a href="http://www.data-miners.com/blog/Normalize.java"&gt;Normalize.java&lt;/a&gt;, &lt;a href="http://www.data-miners.com/blog/GenericRecordInputFormat.java"&gt;GenericRecordInputFormat.java&lt;/a&gt;, &lt;a href="http://www.data-miners.com/blog/GenericRecord.java"&gt;GenericRecord.java&lt;/a&gt;, and &lt;a href="http://www.data-miners.com/blog/GenericRecordMetadata.java"&gt;GenericRecordMetadata.java&lt;/a&gt;.  This code uses several tricks along the way.&lt;br /&gt;&lt;br /&gt;One trick that I use in Step 4, for the phase 1 map, makes the code more efficient.  This phase of the computation extracts the maximum row number for each column.  Instead of passing all the row numbers to a combine or reduce function, it saves them in a local hash-map data structure.  I then use the &lt;span style="font-family: courier new;"&gt;cleanup()&lt;/span&gt; routine in the map function to output the maximum values.&lt;br /&gt;&lt;br /&gt;Often the master code needs to pass variables to the map/reduce jobs.  
The best way to accomplish this is by using the "set" mechanism in the &lt;span style="font-family:courier new;"&gt;Configuration&lt;/span&gt; object.  This allows variables to be assigned a string name.  The names of all the variables that I use are stored in constants that start with &lt;span style="font-family:courier new;"&gt;PARAMETER_&lt;/span&gt;, defined at the beginning of the &lt;span style="font-family:courier new;"&gt;Normalize&lt;/span&gt; class.&lt;br /&gt;&lt;br /&gt;In some cases, I need to pass arrays in, for instance, when passing in the list of columns that are to be normalized.  In this case, one variable gives the number of values ("normalize.usecolumns.numvals").  Then each value is stored in a variable such as "normalize.usecolumns.0" and "normalize.usecolumns.1" and so on.&lt;br /&gt;&lt;br /&gt;Some of the important processing actually takes place in the master loop, where results are gathered and then passed to subsequent steps using this environment mechanism.&lt;br /&gt;&lt;br /&gt;The idea behind the &lt;span style="font-family:courier new;"&gt;GenericRecord&lt;/span&gt; class is pretty powerful, with the column names at the top of the file.  &lt;span style="font-family:courier new;"&gt;GenericRecord&lt;/span&gt;s make it possible to read multiple types of input in the same map class, for instance, which is critical functionality for combining data from two different input streams.&lt;br /&gt;&lt;br /&gt;However, the Map-Reduce framework does not really recognize these column names as being different, once generic records are placed in a sequence file.  The metadata has to be passed somehow.&lt;br /&gt;&lt;br /&gt;When the code itself generates the metadata, this is simple enough.  A function is used to create the metadata, and this function is used in both the map and reduce phases.&lt;br /&gt;&lt;br /&gt;A bigger problem arises with the original data.  
In particular, Step 6 of the above framework re-creates the original records, but it has lost the column names, which poses a conundrum.  The solution is to save the original metadata in Step 1, which first reads the records.  This metadata is then passed into Step 6.&lt;br /&gt;&lt;br /&gt;In this code, this is handled by simply using a file.  The first map partition of Step 1 writes this file (this partition is used to guarantee that the file is written exactly once).  The last reduce in Step 6 then reads this file.&lt;br /&gt;&lt;br /&gt;This mechanism works, but is not actually the preferred mechanism, because all the reduce tasks in Step 6 are competing to read the same file -- a bottleneck.&lt;br /&gt;&lt;br /&gt;A better mechanism is for the master program to read the file and to place the contents in variables in the jar file passed to the map reduce tasks.  Although I do this for other variables, I don't bother to do this for the file.&lt;br /&gt;&lt;a href="http://www.data-miners.com/blog/Normalize.java"&gt;&lt;br /&gt;&lt;/a&gt;&lt;a href="http://www.data-miners.com/blog/GenericRecordMetadata.java"&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-5656931797147731472?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/5656931797147731472/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2010/01/hadoop-and-mapreduce-normalizing-data.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/5656931797147731472'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/5656931797147731472'/><link rel='alternate' type='text/html' 
href='http://www.data-miners.com/blog/2010/01/hadoop-and-mapreduce-normalizing-data.html' title='Hadoop and MapReduce:  Normalizing Data Structures'/><author><name>Gordon S. Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-8577585281575675706</id><published>2009-12-28T17:43:00.000-05:00</published><updated>2009-12-28T17:43:16.115-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Ask a data miner'/><category scheme='http://www.blogger.com/atom/ns#' term='user question'/><category scheme='http://www.blogger.com/atom/ns#' term='marketing'/><category scheme='http://www.blogger.com/atom/ns#' term='Michael'/><title type='text'>Differential Response or Uplift Modeling</title><content type='html'>Some time before the holidays, we received the following inquiry from a reader:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;Dear Data Miners,&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;I’ve read interesting arguments for uplift modeling (also called incremental response modeling) [1], but I’m not sure how to implement it.  I have responses from a direct mailing with a treatment group and a control group. Now what?    Without data mining, I can calculate the uplift between the two groups but not for individual responses.   With the data mining techniques I know, I can identify the ‘do not disturbs,’ but there’s more than avoiding mailing that group.  
How is uplift modeling implemented in general, and how could it be done in R or Weka?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;[1] http://www.stochasticsolutions.com/pdf/CrossSell.pdf&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;I first heard the term "uplift modeling" from &lt;a href="http://scientificmarketer.com/"&gt;Nick Radcliffe&lt;/a&gt;, then of Quadstone. I think he may have invented it.  In our book, &lt;a href="http://www.data-miners.com/bookstore.htm"&gt;Data Mining Techniques&lt;/a&gt;, we use the term "differential response analysis." It turns out that "differential response" has a very specific meaning in the child welfare world, so perhaps we'll switch to "incremental response" or "uplift" in the next edition. But whatever it is called, you can approach this problem in a cell-based fashion without any special tools. Cell-based approaches divide customers into cells or segments in such a way that all members of a cell are similar to one another along some set of dimensions considered to be important for the particular application. You can then measure whatever you wish to optimize (order size, response rate, . . .) by cell and, going forward, treat the cells where treatment has the greatest effect.&lt;br /&gt;&lt;br /&gt;Here, the quantity&amp;nbsp; to measure is the &lt;i&gt;difference &lt;/i&gt;in response rate or average order size between treated and untreated groups of &lt;i&gt;otherwise similar&lt;/i&gt; customers. Within each cell, we need a randomly selected treatment group and a randomly selected control group; the incremental response or uplift is the difference in average order size (or whatever) between the two. Of course some cells will have higher or lower overall average order size, but that is not the focus of incremental response modeling. 
The question is not "What is the average order size of women between 40 and 50 who have made more than 2 previous purchases and live in a neighborhood where average household income is two standard deviations above the regional average?" It is "What is the change in order size for this group?"&lt;br /&gt;&lt;br /&gt;Ideally, of course, you should design the segmentation and assignment of customers to treatment and control groups before the test, but the reader who submitted the question has already done the direct mailing and tallied the responses. Is it now too late to analyze incremental response?&amp;nbsp; That depends: If the control group is a true random control group and if it is large enough that it can be partitioned into segments that are still large enough to provide statistically significant differences in order size, it is not too late. You could, for instance, compare the incremental response of male and female responders.&lt;br /&gt;&lt;br /&gt;A cell-based approach is only useful if the segment definitions are such that incremental response really does vary across cells. Dividing customers into male and female segments won't help if men and women are equally responsive to the treatment. This is the advantage of the special-purpose uplift modeling software developed by Quadstone (now Portrait Software). This tool builds a decision tree where the splitting criterion is to maximize the difference in incremental response. 
This automatically leads to segments (the leaves of the tree) characterized by either high or low uplift.&amp;nbsp; That is a really cool idea, but the lack of such a tool is not a reason to avoid incremental response analysis.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-8577585281575675706?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/8577585281575675706/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/12/differential-response-or-uplift.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/8577585281575675706'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/8577585281575675706'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/12/differential-response-or-uplift.html' title='Differential Response or Uplift Modeling'/><author><name>Michael J. A. 
Berry</name><uri>http://www.blogger.com/profile/06077102677195066016</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='14679622169454737233'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-146522666229935563</id><published>2009-12-27T12:48:00.016-05:00</published><updated>2009-12-28T12:53:49.243-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='MapReduce'/><category scheme='http://www.blogger.com/atom/ns#' term='Hadoop'/><title type='text'>Hadoop and MapReduce:  Characterizing Data</title><content type='html'>This posting describes using Hadoop and MapReduce to characterize data -- that is, to summarize the values in various columns to learn about the values in each column.&lt;br /&gt;&lt;br /&gt;This post describes how to solve this problem using Hadoop.  It also explains why Hadoop is better for this particular problem than SQL.&lt;br /&gt;&lt;br /&gt;The code discussed in this post is available in these files:  &lt;a href="http://www.data-miners.com/blog/GenericRecordMetadata.java"&gt;GenericRecordMetadata.java&lt;/a&gt;, &lt;a href="http://www.data-miners.com/blog/GenericRecord.java"&gt;GenericRecord.java&lt;/a&gt;, &lt;a href="http://www.data-miners.com/blog/GenericRecordInputFormat.java"&gt;GenericRecordInputFormat.java&lt;/a&gt;, and &lt;a href="http://www.data-miners.com/blog/Characterize.java"&gt;Characterize.java&lt;/a&gt;.  
This work builds on the classes introduced in my previous post &lt;a style="font-style: italic;" href="http://www.data-miners.com/blog/2009/12/hadoop-and-mapreduce-method-for-reading.html"&gt;Hadoop and MapReduce:  Method for Reading and Writing General Record Structures&lt;/a&gt; (the versions here fix some bugs in the earlier versions).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 153, 0);font-size:130%;" &gt;&lt;span style="font-family:arial;"&gt;What Does This Code Do?&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The purpose of this code is to provide summaries for data in a data file.  Being Hadoop, the data is stored in a delimited text format, with one record per line, and the code uses &lt;span style="font-family:courier new;"&gt;GenericRecord&lt;/span&gt; to handle the specific data.  The generic record classes are things that I wrote to handle this situation; the Apache java libraries apparently have other approaches to solving this problem.&lt;br /&gt;&lt;br /&gt;The specific summaries for each column are:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Number of records.&lt;/li&gt;&lt;li&gt;Number of values.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Minimum and maximum values for string variables, along with the number of times the minimum and maximum values appear in the data.&lt;/li&gt;&lt;li&gt;Minimum and maximum lengths for string variables, along with the number of times these appear and an example of the value.&lt;/li&gt;&lt;li&gt;First, second, and third most common string values.&lt;/li&gt;&lt;li&gt;Number of times the column appears to be an integer.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Minimum and maximum values when treating the values as integers, along with the number of times that these appear.&lt;/li&gt;&lt;li&gt;Number of times the column appears to contain a real number.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Minimum and maximum values when treating the values as doubles, along with the number of times that these 
appear.&lt;/li&gt;&lt;li&gt;Count of negative, zero, and positive values.&lt;/li&gt;&lt;li&gt;Average value.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;These summaries are arbitrary.  The code should be readily extensible to other types and other summaries.&lt;br /&gt;&lt;br /&gt;My ultimate intention is to use this code to easily characterize input and result files that I create in the process of writing Hadoop code.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 153, 0);font-size:130%;" &gt;&lt;span style="font-family:arial;"&gt;Overview of the Code&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The characterization problem is solved in two steps, each handled by one pass of map reduce.  The first creates a histogram of all the values in all the columns, and the second summarizes that histogram.&lt;br /&gt;&lt;br /&gt;The histogram step takes files with the following format:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Key:  undetermined&lt;/li&gt;&lt;li&gt;Values:  text values separated by a delimiter (by default a tab)&lt;/li&gt;&lt;/ul&gt;(This is the &lt;span style="font-family:courier new;"&gt;GenericRecord&lt;/span&gt; format.)&lt;br /&gt;The Map phase produces a file of the format:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Key:  column name and column value, separated by a colon&lt;/li&gt;&lt;li&gt;Value:  "1"&lt;/li&gt;&lt;/ul&gt;Combine and Reduce then add up the "1"s, producing a file of the format:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Key:  column name&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Value:  column value and count, separated by a tab&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Using a tab as a separator is a convenience, because this is also the default separator for the key.&lt;br /&gt;&lt;br /&gt;The second phase of the Map/Reduce job takes the previous output and uses the reduce function to summarize all the different values in the histogram.  This code is quite specific to the particular summaries.  
The &lt;span style="font-family:courier new;"&gt;GenericRecord&lt;/span&gt; format is quite useful because I can simply add new summaries in the code, without worrying about the layout of the records.&lt;br /&gt;&lt;br /&gt;The code makes use of exception handling to deal with particular data types.  For instance, the following code block handles the integer summaries:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;             try {&lt;br /&gt;                 &lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;long intval = Long.parseLong(valstr);&lt;br /&gt;                 &lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;hasinteger = true;&lt;br /&gt;                 &lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;intvaluecount++;&lt;br /&gt;                 &lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;intrecordcount += Long.parseLong(val.get("count"));&lt;br /&gt;             }&lt;br /&gt;             catch (NumberFormatException e) {&lt;br /&gt;                 &lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;// the value is not an integer, so we don't have to do anything here&lt;br /&gt;             }&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;This block tries to convert the value to an integer (actually to a long).  When this works, then the code updates the various variables that characterize integer values.  When this fails, the code continues working.&lt;br /&gt;&lt;br /&gt;There is a similar block for real numbers, and I could imagine adding more such blocks for other formats, such as dates and times.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;font-size:130%;" &gt;&lt;span style="font-family:arial;"&gt;Why MapReduce Is Better Than SQL For This Task&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Characterizing data is the process of summarizing data along each column, to get an idea of what is in the data. 
Normally, I think about data processing in terms of SQL (after all, my most recent book is &lt;a style="font-style: italic; font-weight: bold;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;).  SQL, however, is particularly poor for this purpose.&lt;br /&gt;&lt;br /&gt;First, SQL has precious few functions for this task -- basically &lt;span style="font-family:courier new;"&gt;MIN()&lt;/span&gt;, &lt;span style="font-family:courier new;"&gt;MAX()&lt;/span&gt;, &lt;span style="font-family:courier new;"&gt;AVG()&lt;/span&gt; and judicious use of the &lt;span style="font-family:courier new;"&gt;CASE&lt;/span&gt; statement. Second, SQL generally has lousy support for string functions and inconsistent definitions for date and time functions across different databases.&lt;br /&gt;&lt;br /&gt;Worse, though, is that traditional SQL can only summarize one column at a time. The traditional SQL approach would be to summarize each column individually in a query and then connect them using &lt;span style="font-family:courier new;"&gt;UNION ALL&lt;/span&gt; statements.  The result is that the database has to do a full-table scan for each column.&lt;br /&gt;&lt;br /&gt;Although not supported in all databases, SQL syntax does now support the &lt;span style="font-family:courier new;"&gt;GROUPING SETS&lt;/span&gt; keyword which helps potentially alleviate this problem.  However, &lt;span style="font-family:courier new;"&gt;GROUPING SETS&lt;/span&gt; is messy, since the key columns each have to be in separate columns. That is, I want the results in the format "column name, column value". With &lt;span style="font-family:arial;"&gt;GROUPING SETS&lt;/span&gt;, I get "column1, column2 ... columnN", with NULLs for all unused columns, except for the one with a value.&lt;br /&gt;&lt;br /&gt;The final problem with SQL occurs when the data starts out in text files.  
Much of the problem of characterizing and understanding the data happens outside the database during the load process.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-146522666229935563?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/146522666229935563/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/12/hadoop-and-mapreduce-characterizing.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/146522666229935563'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/146522666229935563'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/12/hadoop-and-mapreduce-characterizing.html' title='Hadoop and MapReduce:  Characterizing Data'/><author><name>Gordon S. Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-1473308114457935703</id><published>2009-12-22T16:12:00.000-05:00</published><updated>2009-12-22T16:12:57.197-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Interview'/><category scheme='http://www.blogger.com/atom/ns#' term='Conferences'/><category scheme='http://www.blogger.com/atom/ns#' term='Michael'/><title type='text'>Interview with Eric Siegel</title><content type='html'>This is the first of what may become an occasional series of interviews with people in the data mining field. 
Eric Siegel is the organizer of the popular&amp;nbsp; Predictive Analytics World conference series. I asked him a little bit about himself and gave him a chance to plug his conference.&amp;nbsp; A propos, readers of this blog can get a 15% discount on a two-day conference pass by pasting the code DATAMINER010 into the Promotional Code box on the conference &lt;a href="https://www.eiseverywhere.com/ereg/newreg.php?eventid=7934&amp;amp;"&gt;registration page&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Q: Not many kids (one of mine is perhaps the exception that proves the rule) have the thought "when I grow up, I want to be a data miner!"&amp;nbsp; How did you fall into this line of work?&lt;br /&gt;&lt;br /&gt;&lt;span style="color: blue;"&gt;To many laypeople, the word "data" sounds dry, arcane, meaningless - boring! And number-crunching on it doubly so. But this is actually the whole point. Data is the uninterpreted mass of things that've happened.&amp;nbsp; Extracting what's up, the means behind the madness, and in so doing modeling and learning about human behavior... well, I feel nothing in science or engineering is more interesting. &lt;br /&gt;In my "previous life" as an academic researcher, I focused on core predictive modeling methods. The ability for a computer to automatically learn from experience (data really is recorded experience, after all), is the best thing since sliced bread. 
Ever since I realized, as I grew up from childhood, that space travel would in fact be a tremendous, grueling pain in the neck (not fun like "Star Wars"), nothing in science has ever seemed nearly as exciting.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: blue;"&gt;In my current 9-year career as a commercial practitioner, I've found that indeed the ability to analytically "learn" and apply what's been learned turns out to provide plenty of business value, as I imagined back in the lab.&amp;nbsp; Research science is fun in that you have the luxury of abstraction and are often fairly removed from the need to prove near-term industrial applicability. Applied science is fun for the opposite reason: The tangle of challenges, although some less abstract and in that sense more mundane, are the only thing between you and getting the great ideas of the world to actually work, come to fruition, and deliver an irrefutable impact.&lt;/span&gt;&lt;br /&gt;&lt;div class="im"&gt;&lt;br /&gt;&lt;br /&gt;Q: Most conferences happen once a year.&amp;nbsp; Why does PAW come around so much more frequently?&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="color: blue;"&gt;In fact, many &lt;i&gt;commercial &lt;/i&gt;conferences focused on the industrial deployment of technology occur multiple times per year, in contrast to research conferences, which usually take place annually.&amp;nbsp; There's an increasing demand for a more frequent commercial event as predictive analytics continues to "cross chasms" towards more widescale penetration. There's just too much to cover - too many brand-name case studies and too many hot topics - to wait a year before each event.&lt;br /&gt;&lt;/div&gt;&lt;div class="im"&gt;&lt;br /&gt;&lt;br /&gt;Q: You use the phrase "predictive analytics" for what I've always called "data mining." 
Do the terms mean something different, or is it just that fashions change with the times?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="color: blue;"&gt;"Data mining" is indeed often used synonymously with "predictive analytics", but not always. Data mining's definitions usually entail the discovery of non-trivial, useful patterns/knowledge/insights from data -- if you "dig" enough, you get a "nugget." This is a fairly abstract definition and therefore envelops a wide range of analytical techniques. On the other hand, predictive analytics is basically the commercial deployment of predictive modeling specifically (that is, in academic jargon, supervised learning, i.e., optimizing a statistical model over labeled/historical cases). In business applications, this basically translates to a model that produces a score for each customer, prospect, or other unit of interest (business/outlet location, SKU, etc), which is roughly the working definition we posted on the Predictive Analytics World website. This would seem to potentially exclude related data mining methods such as forecasting, association mining and clustering (unsupervised learning), but, naturally, we include some sessions at the conference on these topics as well, such as your extremely-well-received session on forecasting October 2009 in DC.&lt;br /&gt;&lt;/div&gt;&lt;div class="im"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Q: How do you split your time between conference organizing and analytical consulting work?&amp;nbsp; (That's my polite way of trying to rephrase a question I was once asked: "What's the split between spewing and doing?")&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="color: blue;"&gt;When one starts spewing a lot, there becomes much less time for doing. 
In the last 2 years, as my 2-day seminar on predictive analytics has become more frequent (both as public and customized on-site training sessions - see &lt;a href="http://www.businessprediction.com/" target="_blank"&gt;http://www.businessprediction.&lt;wbr&gt;&lt;/wbr&gt;com&lt;/a&gt;), and I helped launch Predictive Analytics World, my work in services has become less than half my time, and I now spend very little time doing hands-on, playing a more advisory and supervisory role for clients, alongside other senior consultants who do more hands-on for Prediction Impact services engagements.&lt;br /&gt;&lt;/div&gt;&lt;div class="im"&gt;&lt;br /&gt;&lt;br /&gt;Q: I can't help noticing that you have a Ph.D.&amp;nbsp; As someone without any advanced degrees, I'm pretty good at rationalizing away their importance, but I want to give you a chance to explain what competitive advantage it gives you.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="color: blue;"&gt;The doctorate is a research-oriented degree, and the Ph.D. dissertation is in a sense a "hazing" process. However, it's become clear to me that the degree is very much net positive for my commercial career. People know it entails a certain degree of discipline and aptitude. And, even if I'm not conducting academic research most of the time, every time one applies analytics there is an experimental component to the task. 
On the other hand, many of the best data miners - the "rock star" consultants such as yourself - did not need a doctorate program in order to become great at data mining.&lt;br /&gt;&lt;/div&gt;&lt;div class="im"&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Q: Moving away from the personal, how do you think the move of data and computing power into the cloud is going to change data mining?&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="color: blue;"&gt;I'd say there's a lot of potential in making parallelized deployment more readily available to any and all data miners.&amp;nbsp; But, of all the hot topics in analytics, I feel this is the one into which I have the least visibility. It does, after all, pertain more to infrastructure and support than to the content, meaning and insights gained from analysis.&lt;br /&gt;&lt;/div&gt;&lt;div style="color: blue;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="color: blue;"&gt;But, turning to the relevant experts, be sure to check out Feb PAW's upcoming session, "In-database Vs. In-cloud Analytics: Implications for Deployment" - see &lt;a href="http://www.predictiveanalyticsworld.com/sanfrancisco/2010/agenda.php#day2-7" target="_blank"&gt;http://www.&lt;wbr&gt;&lt;/wbr&gt;predictiveanalyticsworld.com/&lt;wbr&gt;&lt;/wbr&gt;sanfrancisco/2010/agenda.php#&lt;wbr&gt;&lt;/wbr&gt;day2-7&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="im"&gt;&lt;br /&gt;&lt;br /&gt;Q: Can you give examples of problems that once seemed like hot analytical challenges that have now become commoditized?&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="color: blue;"&gt;Great question. Hmm... common core analytical methods such as decision trees and logistic regression may be the only true commodities to date in our field. What do you think?&lt;br /&gt;&lt;/div&gt;&lt;div class="im"&gt;&lt;br /&gt;Q: There are some tasks that we used to get hired for 10 or 15 years ago that no one comes to us for these days. Direct mail response models is an example. 
I think people feel like they know how to do those themselves. Or maybe that is something the data vendors pretty much give away with the data.&lt;br /&gt;&lt;br /&gt;Which of today's hot topics in data mining do you see as ripe for commoditization?&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="color: blue;"&gt;UPLIFT (incremental lift) modeling is branching out, with applications going beyond response and churn modeling (see &lt;a href="http://www.predictiveanalyticsworld.com/sanfrancisco/2010/agenda.php#day2-2" target="_blank"&gt;http://www.&lt;wbr&gt;&lt;/wbr&gt;predictiveanalyticsworld.com/&lt;wbr&gt;&lt;/wbr&gt;sanfrancisco/2010/agenda.php#&lt;wbr&gt;&lt;/wbr&gt;day2-2&lt;/a&gt;).&lt;br /&gt;&lt;/div&gt;&lt;div style="color: blue;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="color: blue;"&gt;Expanding traditional data sets with SOCIAL DATA is continuing to gain traction across a growing range of verticals as analytics practitioners find great value (read: tremendous increases in model lift) leveraging the simple fact that people behave similarly to those to whom they're socially connected. Just as the healthcare industry has discovered that quitting smoking is "contagious" and that the risk of obesity dramatically increases if you have an obese friend, telecommunications, online social networks and other industries find that "birds of a feather" churn and even commit fraud "together". 
Is this more because people influence one-another, or because they befriend others more like themselves?&amp;nbsp; Either way, social connections are hugely predictive of the customer behaviors that matter to business.&lt;br /&gt;&lt;/div&gt;&lt;div style="color: blue;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="color: blue;"&gt;For PAW sessions on social data analysis, see &lt;a href="http://www.predictiveanalyticsworld.com/sanfrancisco/2010/agenda.php#day1-10" target="_blank"&gt;http://www.&lt;wbr&gt;&lt;/wbr&gt;predictiveanalyticsworld.com/&lt;wbr&gt;&lt;/wbr&gt;sanfrancisco/2010/agenda.php#&lt;wbr&gt;&lt;/wbr&gt;day1-10&lt;/a&gt; and &lt;a href="http://www.predictiveanalyticsworld.com/sanfrancisco/2010/agenda.php#day1-12" target="_blank"&gt;http://www.&lt;wbr&gt;&lt;/wbr&gt;predictiveanalyticsworld.com/&lt;wbr&gt;&lt;/wbr&gt;sanfrancisco/2010/agenda.php#&lt;wbr&gt;&lt;/wbr&gt;day1-12&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="im"&gt;&lt;div style="color: blue;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;Q: There have been several articles in the popular press recently, like&lt;a href="http://www.nytimes.com/2009/08/06/technology/06stats.html?_r=1&amp;amp;emc=eta1"&gt; this one in the NY Times&lt;/a&gt;,&amp;nbsp; saying that statistics and data mining are the hottest fields a young person could enter right now.&amp;nbsp; Do you agree?&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="color: blue;"&gt;Well, for the subjective reasons in my answer to your first question above, I would heartily agree. If I recall, that NY Times article focused on the demand for data miners as the career's central appeal. 
Indeed, it is a very marketable skill these days, which certainly doesn't hurt.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-1473308114457935703?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/1473308114457935703/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/12/interview-with-eric-siegel.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1473308114457935703'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1473308114457935703'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/12/interview-with-eric-siegel.html' title='Interview with Eric Siegel'/><author><name>Michael J. A. 
Berry</name><uri>http://www.blogger.com/profile/06077102677195066016</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='14679622169454737233'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-8360554308299539782</id><published>2009-12-18T18:24:00.011-05:00</published><updated>2009-12-19T17:34:11.929-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='MapReduce'/><category scheme='http://www.blogger.com/atom/ns#' term='Hadoop'/><title type='text'>Hadoop and MapReduce:  Method for Reading and Writing General Record Structures</title><content type='html'>I'm finally getting more comfortable with Hadoop and java, and I've decided to write a program that will characterize data in parallel files.&lt;br /&gt;&lt;br /&gt;To be honest, I find that I am spending a lot of time writing new &lt;span style="font-style: italic;font-family:courier new;" &gt;Writable&lt;/span&gt; and &lt;span style="font-style: italic;font-family:courier new;" &gt;InputFormat &lt;/span&gt;classes, every time I want to do something.  Every time I introduce a new data structure used by the Hadoop framework, I have to define two classes.  Yucch!&lt;br /&gt;&lt;br /&gt;So, I put together a simple class called &lt;span style="font-family:courier new;"&gt;GenericRecord&lt;/span&gt; that can store a set of column names (as strings) and a corresponding set of column values (as strings).  These are stored in delimited files, and the various classes understand how to parse these files.  In particular, the code can read any tab delimited file that has column names on the first row (and changing the delimiter should be easy).  
One nice aspect is the ability to use the &lt;span style="font-family:courier new;"&gt;GenericRecord&lt;/span&gt; as the output of a reduce function, which means that the number and names of the output can be specified in the code -- rather than in additional files with additional classes.&lt;br /&gt;&lt;br /&gt;I wouldn't be surprised if similar code already exists with more functionality than the code I have here.  This effort is also about my learning Hadoop.&lt;br /&gt;&lt;br /&gt;This posting provides the code and explains important features on how it works.  The code is available in these files &lt;a href="http://www.data-miners.com/blog/GenericRecord.java"&gt;GenericRecord.java&lt;/a&gt;, &lt;a href="http://www.data-miners.com/blog/GenericRecordMetadata.java"&gt;GenericRecordMetadata.java&lt;/a&gt;, &lt;a href="http://www.data-miners.com/blog/GenericRecordInputFormat.java"&gt;GenericRecordInputFormat.java&lt;/a&gt;, and &lt;a href="http://www.data-miners.com/blog/GenericRecordTester.java"&gt;GenericRecordTester.java&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 153, 0);font-size:130%;" &gt;&lt;span style="font-family:arial;"&gt;What This Code Does&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This code is analogous to the word count code, that must be familiar to anyone starting to learn MapReduce (since it seems to be the first example in all the documentation I've seen).  Instead of counting words, this code counts the occurrence of values in the columns.&lt;br /&gt;&lt;br /&gt;The code reads input files and produces output records with three columns:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A column name in the original data.&lt;/li&gt;&lt;li&gt;A value in the column.&lt;/li&gt;&lt;li&gt;The number of times the value appears.&lt;/li&gt;&lt;/ul&gt;Do note that for data with many unique values in many columns, the number of output records is likely to far exceed the number of input records.  
So, the output file can be bigger than the input file.&lt;br /&gt;&lt;br /&gt;The input records are assumed to be in a text file with one record per row.  The first row contains the names of the columns, delimited by a tab (although this could easily be changed to another delimiter).  The rest of the rows contain values.  Note that this assumes that the input files are all read from the beginning; that is, that a single input file is not split among multiple map tasks.&lt;br /&gt;&lt;br /&gt;One irony of this code and the Hadoop framework is that the input files do not have to be in the same format.  So, I could upload a bunch of different files, with different numbers of columns, and different column names, and run them all in parallel.  I would have to be careful that the column names are all different, for this to work well.&lt;br /&gt;&lt;br /&gt;Examples of such files are available on the companion page for my book &lt;a style="font-style: italic;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;.  
These are small files by the standards of Hadoop (measured in megabytes) but quite sufficient for testing and demonstrating code.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 153, 0);font-size:130%;" &gt;&lt;span style="font-family:arial;"&gt;Overview of Approach&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There are four classes defined for this code:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;GenericRecordMetadata&lt;/span&gt; stores the metadata (column names) for a record.&lt;/li&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;GenericRecord&lt;/span&gt; stores the values for a particular record.&lt;/li&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;GenericRecordInputFormat&lt;/span&gt; provides the interface for reading the data into Hadoop.&lt;/li&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;GenericRecordTester&lt;/span&gt; provides the functions for the MapReduce framework.&lt;/li&gt;&lt;/ul&gt;The metadata consists of the names of the columns, which can be accessed either by a column index or by a column name.  The metadata has functions to translate a column name into a column index.  Because it uses a &lt;span style="font-family:courier new;"&gt;HashMap&lt;/span&gt;, the functions should run quite fast, although they are not optimal in memory space.  This is okay, because the metadata is stored only once, rather than once per row.&lt;br /&gt;&lt;br /&gt;The generic record itself stores the data as an array of strings.  It also contains a pointer to the metadata object, in order to fetch the names.  The array of strings minimizes both memory overhead and time, but does require access using an integer.  The other two classes are needed for the Hadoop framework.&lt;br /&gt;&lt;br /&gt;One small challenge is getting this to work without repeating the metadata information for each row of data.  
This is handled by including the column names as the first row in any file created by the Hadoop framework, and not by putting the column names in the output for each row.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 153, 0);font-size:130%;" &gt;&lt;span style="font-family:arial;"&gt;Setting Up The Metadata When Reading&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The class &lt;span style="font-family:courier new;"&gt;GenericRecordInputFormat&lt;/span&gt; basically does all of its work in a private class called &lt;span style="font-family:courier new;"&gt;GenericRecordRecordReader&lt;/span&gt;.  This class has two important functions:  &lt;span style="font-family:courier new;"&gt;initialize()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;nextKeyValue()&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;The &lt;span style="font-family:courier new;"&gt;initialize()&lt;/span&gt; function sets up the metadata, either by reading environment variables in the context object or by parsing the first line of the input file (depending on whether or not the environment variable &lt;span style="font-family:courier new;"&gt;genericrecord.numcolumns&lt;/span&gt; is defined).  I haven't tested passing in the metadata using environment variables, because setting up the environment variables poses a challenge.  These variables have to be set in the master routine in the configuration before the map function is called.&lt;br /&gt;&lt;br /&gt;The &lt;span style="font-family:courier new;"&gt;nextKeyValue()&lt;/span&gt; function reads a line of the text file, parses it using the function &lt;span style="font-family:courier new;"&gt;split()&lt;/span&gt;,  and sets the values in the line.  
The verification on the number of items read matching the number of expected items is handled in the function &lt;span style="font-family:courier new;"&gt;lineValue.set()&lt;/span&gt;, which raises an exception (currently unhandled) when there is a mismatch.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0);font-size:130%;" &gt;&lt;span style="font-weight: bold;font-family:arial;" &gt;Setting Up The Metadata When Writing&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Perhaps more interesting is the ability to set up the metadata dynamically when writing.  This is handled mostly in the &lt;span style="font-family:courier new;"&gt;setup()&lt;/span&gt; function of the &lt;span style="font-family:courier new;"&gt;SplitReduce&lt;/span&gt; class, which sets up the metadata using various function calls.&lt;br /&gt;&lt;br /&gt;Writing the column names out at the beginning of the results file uses a couple of tricks.  First, this does not happen in the &lt;span style="font-family:courier new;"&gt;setup()&lt;/span&gt; function but rather in the &lt;span style="font-family:courier new;"&gt;reduce()&lt;/span&gt; function itself, for the simple reason that the latter handles &lt;span style="font-family:courier new;"&gt;IOException&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;The second trick is that the metadata is written out by putting it into the values of a &lt;span style="font-family:courier new;"&gt;GenericRecord&lt;/span&gt;.  This works because the values are all strings, and the record itself does not care if these are actually for the column names.&lt;br /&gt;&lt;br /&gt;The third trick is to be very careful with the function &lt;span style="font-family:courier new;"&gt;GenericRecord.toString()&lt;/span&gt;.  Each column is separated by a tab character, because the tab is used to separate the key from the value in the Hadoop framework.  
In the reduce output files, the key appears first (the name of the column in the original data), followed by a tab -- as put there by the Hadoop framework.  Then, &lt;span style="font-family:courier new;"&gt;toString()&lt;/span&gt; adds the values separated by tabs.  The result is a tab-delimited file that looks like column names and values, although the particular pieces are put there through different mechanisms.  I imagine that there is a way to tell Hadoop to use a different character to separate the key and value, but I haven't researched this point.&lt;br /&gt;&lt;br /&gt;The final trick is to be careful about the ordering of the columns.  The code iterates through the values of the &lt;span style="font-family:courier new;"&gt;GenericRecord&lt;/span&gt; table manually using an index rather than a &lt;span style="font-family:courier new;"&gt;for-in&lt;/span&gt; loop.  This is quite intentional, because it allows the code to control the order in which the columns appear -- which is presumably the original order in which they were defined.  Using the &lt;span style="font-family:courier new;"&gt;for-in&lt;/span&gt; is also perfectly valid, but the columns may appear in a different order (which is fine, because the column names also appear in the same order).&lt;br /&gt;&lt;br /&gt;The result of all this machinery is that the reduce function can now return values in a &lt;span style="font-family:courier new;"&gt;GenericRecord&lt;/span&gt;.  And, I can specify these in the reduce function itself, without having to mess around with other classes.  
This is likely to be a big benefit as I attempt to develop more code using Hadoop.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-8360554308299539782?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/8360554308299539782/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/12/hadoop-and-mapreduce-method-for-reading.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/8360554308299539782'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/8360554308299539782'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/12/hadoop-and-mapreduce-method-for-reading.html' title='Hadoop and MapReduce:  Method for Reading and Writing General Record Structures'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-3006310543781551357</id><published>2009-12-17T19:12:00.002-05:00</published><updated>2010-01-03T16:37:38.905-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Ask a data miner'/><category scheme='http://www.blogger.com/atom/ns#' term='Michael'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining'/><title type='text'>What do group members have in common?</title><content type='html'>We received the following question via email.&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;Hello,&lt;br /&gt;&lt;br /&gt;I have a data set which has both numeric and string attributes. It is a data set of our customers doing a particular activity (eg: customers getting one particular loan). We need to find out the pattern in the data or the set of attributes which are very common for all of them.&lt;br /&gt;&lt;br /&gt;Classification/regression not possible , because there is only one class&lt;br /&gt;Association rule cannot take my numeric value into consideration&lt;br /&gt;clustering clusters similar people, but not common attributes.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;What is the best method to do this? Any suggestion is greatly appreciated.&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;The question "what do all the customers with a particular type of loan have in common"&amp;nbsp; sounds seductively reasonable. 
In fact, however, the question is not useful at all because the answer is "Almost everything."&amp;nbsp; The proper question is "What, if anything, do these customers have in common with one another, &lt;u&gt;but not with other people&lt;/u&gt;?"&amp;nbsp; Because people are all pretty much the same, it is the tiny ways they differ that arouse interest and even passion.&amp;nbsp; Think of two groups of Irishmen, one Catholic and one Protestant. Or two groups of Indians, one Hindu and one Muslim. If you started with members of only one group and started listing things they had in common, you would be unlikely to come up with anything that didn't apply equally to the other group as well. &lt;br /&gt;&lt;br /&gt;So, what you really have is a classification task after all.&amp;nbsp; Take the folks who have the loan in question and an equal number of otherwise similar customers who do not. Since you say you have a mix of numeric and string attributes, I would suggest using decision trees. These can split equally well on numeric values ( x&amp;gt;n ) or categorical variables ( model in ('A','B','C') ). 
If the attributes you have are, in fact, able to distinguish the two groups, you can use the rules that describe leaves that are high in holders of product A as "what holders of product A have in common" but that is really shorthand for "what differentiates holders of product A from the rest of the world."&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-3006310543781551357?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/3006310543781551357/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/12/we-received-following-question-via.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/3006310543781551357'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/3006310543781551357'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/12/we-received-following-question-via.html' title='What do group members have in common?'/><author><name>Michael J. A. 
Berry</name><uri>http://www.blogger.com/profile/06077102677195066016</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='14679622169454737233'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-1767270456109534558</id><published>2009-12-15T15:07:00.007-05:00</published><updated>2009-12-15T16:21:23.995-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='MapReduce'/><category scheme='http://www.blogger.com/atom/ns#' term='Hadoop'/><title type='text'>Hadoop 0.20:  Creating Types</title><content type='html'>In various earlier posts, I wrote code to read and write zip code data (which happens to be part of the companion page to my book &lt;a style="font-style: italic;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;).  This provides sample data for use in my learning Hadoop and mapreduce.&lt;br /&gt;&lt;br /&gt;Originally, I wrote the code using Hadoop 0.18, because I was using the Yahoo virtual machine.  I have since switched to the Cloudera virtual machine, which runs the most recent version of Hadoop, V0.20.&lt;br /&gt;&lt;br /&gt;I thought switching my code would be easy.  The issue is less the difficulty of the switch than some nuances in Hadoop and java.  This post explains some of the differences between the two versions, when adding a new type into the system.  I explained my experience with the map, reduce, and job interface in another &lt;a href="http://www.data-miners.com/blog/2009/11/hadoop-and-mapreduce-switching-to-020.html"&gt;post&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The structure of the code is simple.  
I have a java file that implements a class called &lt;span style="font-family:courier new;"&gt;ZipCode&lt;/span&gt;, which contains the ZipCode interface with the Writable interface (which I include using &lt;span style="font-family:courier new;"&gt;import org.apache.hadoop.io.*&lt;/span&gt;).  Another class called &lt;span style="font-family:courier new;"&gt;ZipCodeInputFormat&lt;/span&gt; implements the read/writable version so &lt;span style="font-family:courier new;"&gt;ZipCode&lt;/span&gt; can be used as input and output in MapReduce functions.  The input format class uses another, private class called &lt;span style="font-family:courier new;"&gt;ZipCodeRecordReader&lt;/span&gt;, which does all the work. Because of the rules of java, these need to be in two different files, which have the same name as the class.  The files are available in &lt;a href="http://www.data-miners.com/blog/ZipCensus.java"&gt;ZipCensus.java&lt;/a&gt; and &lt;a href="http://www.data-miners.com/blog/ZipCensusInputFormat.java"&gt;ZipCensusInputFormat.java&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;These files now use the Apache mapreduce interface rather than the mapred interface, so I must import the right packages into the java code:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;import org.apache.hadoop.mapreduce.*;&lt;br /&gt;import org.apache.hadoop.mapreduce.lib.*;&lt;br /&gt;import org.apache.hadoop.mapreduce.lib.input.*;&lt;br /&gt;import org.apache.hadoop.mapreduce.InputSplit;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;And then I had a problem when defining the &lt;span style="font-family:courier new;"&gt;ZipCodeInputFormat&lt;/span&gt; class using the code:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;public class ZipCensusInputFormat extends FileInputFormat&lt;text,&gt; {&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;public RecordReader&lt;text,&gt; createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {&lt;br /&gt;&lt;span style="color: 
rgb(255, 255, 255);"&gt;........&lt;/span&gt;return new ZipCensusRecordReader();&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;}  // RecordReader&lt;text,&gt;&lt;br /&gt;}  // class ZipCensusInputFormat&lt;br /&gt;&lt;/text,&gt;&lt;/text,&gt;&lt;/text,&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;The specific error given by Eclipse/Ganymede is:  "&lt;span style="font-style: italic;"&gt;The type org.apache.commons.logging.Log cannot be resolved. It is indirectly referenced from required .class files.&lt;/span&gt;"  This is a bug in Eclipse/Ganymede, because the code compiles and runs using javac/jar.  At one point, I fixed this by including various Apache commons jars.  However, since I didn't need them when compiling manually, I removed them from the Eclipse project.&lt;br /&gt;&lt;br /&gt;The interface for the RecordReader class itself has changed.  The definition for the class now looks like:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;class ZipCensusRecordReader extends RecordReader&lt;text,&gt;&lt;/text,&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Previously, this used the syntax "&lt;span style="font-family:courier new;"&gt;implements&lt;/span&gt;" rather than "&lt;span style="font-family:courier new;"&gt;extends&lt;/span&gt;".  For those familiar with java, this is the difference between an interface and an abstract class, a nuance I don't yet fully appreciate.&lt;br /&gt;&lt;br /&gt;The new interface (no pun intended) includes two new functions, &lt;span style="font-family:courier new;"&gt;initialize()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;cleanup()&lt;/span&gt; .  I like this change, because it follows the same convention used for map and reduce classes.&lt;br /&gt;&lt;br /&gt;As a result, I changed the constructor to take no arguments.   
Its work has moved to &lt;span style="font-family:courier new;"&gt;initialize()&lt;/span&gt;, which takes two arguments of type &lt;span style="font-family:courier new;"&gt;InputSplit&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;TaskAttemptContext&lt;/span&gt;.  The purpose of this code is simply to skip the first line of the data file, which contains column names.&lt;br /&gt;&lt;br /&gt;The most important function for the class is now called &lt;span style="font-family:courier new;"&gt;nextKeyValue()&lt;/span&gt; rather than &lt;span style="font-family:courier new;"&gt;next()&lt;/span&gt;.   The new function takes no arguments, putting the results in private member variables accessed using  &lt;span style="font-family:courier new;"&gt;getCurrentKey()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;getCurrentValue()&lt;/span&gt;.  The function &lt;span style="font-family: courier new;"&gt;next()&lt;/span&gt; took two arguments, one for the key and one for the value, although the results could be accessed using the same two functions.&lt;br /&gt;&lt;br /&gt;Overall, the changes are simple modifications to the interface, but they can be tricky for the new user.  
I did not find a simple explanation for the changes anywhere on the web; perhaps this posting will help someone else.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-1767270456109534558?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/1767270456109534558/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/12/hadoop-020-creating-types.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1767270456109534558'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1767270456109534558'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/12/hadoop-020-creating-types.html' title='Hadoop 0.20:  Creating Types'/><author><name>Gordon S. Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-1333878193334947211</id><published>2009-12-05T12:43:00.009-05:00</published><updated>2009-12-05T14:17:12.712-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='MapReduce'/><category scheme='http://www.blogger.com/atom/ns#' term='Hadoop'/><title type='text'>Hadoop and MapReduce:  What Country is an IP Address in?</title><content type='html'>I have started using Hadoop to sessionize web log data.  
It has surprised me that there is not more written on this subject on the web, since I thought this was one of the more prevalent uses of Hadoop.  Because I'm doing this work for a client, using Amazon EC2, I do not have sample web log data files to share.&lt;br /&gt;&lt;br /&gt;One of the things that I want to do in the sessionization code is to include what country the user is in.  Typically, the only source of location information in such logs is the IP address used for connecting to the internet.  How can I look up the country the IP address is in?&lt;br /&gt;&lt;br /&gt;This posting describes three things:  the source of the IP geography information, new things that I'm learning about java, and how to do the lookup in Hadoop.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;font-size:130%;" &gt;&lt;span style="font-family:arial;"&gt;The Source of IP Geolocation Information&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.maxmind.com/"&gt;MaxMind&lt;/a&gt; is a company that specializes in geolocation data.  I have no connection to MaxMind, other than a recommendation to use their software from someone at the client where I have been doing this work.  There may be other companies with similar products.&lt;br /&gt;&lt;br /&gt;One way they make money is by offering a product called GeoIP Country, which has very, very accurate information about the country where an IP is located (they also offer more detailed geographies, such as regions, states, and cities, but country is sufficient for my purposes).  Their claim is that GeoIP Country is 99.8% accurate.&lt;br /&gt;&lt;br /&gt;Although GeoIP Country is quite reasonably priced, I am content to settle for the free version, called GeoLite Country, for which the claim is 99.5% accuracy.&lt;br /&gt;&lt;br /&gt;These products come in two parts.  
The first part is an interface, which is available for many languages, with the java version &lt;a href="http://geolite.maxmind.com/download/geoip/api/java/"&gt;here&lt;/a&gt;.  I assume the most recent version is the best, although I happen to be using an older version.&lt;br /&gt;&lt;br /&gt;Both the free and paid versions use the same interface, which is highly convenient, in case I want to switch between them.  The difference is the database, which is available from &lt;a href="http://www.maxmind.com/app/geolitecountry"&gt;this&lt;/a&gt; download page.  The paid version has more complete coverage and is updated more frequently.&lt;br /&gt;&lt;br /&gt;The interface consists of two important components:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Creating a &lt;span style="font-family:courier new;"&gt;LookupService&lt;/span&gt; object, which is instantiated with an argument that names the database file.&lt;/li&gt;&lt;li&gt;Using &lt;span style="font-family:courier new;"&gt;LookupService.getCountry()&lt;/span&gt; to do the lookup.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;The interface is simple enough; how do we get it to work in java, and in particular, in java for Hadoop?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;font-size:130%;" &gt;&lt;span style="font-family:arial;"&gt;New Things I've Learned About Java&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;As I mentioned a few weeks ago in my first &lt;a href="http://www.data-miners.com/blog/2009/11/getting-started-with-hadoop-and.html"&gt;post&lt;/a&gt; on learning Hadoop, I had never used java prior to this endeavor (although I am familiar with other object-oriented programming languages such as C++ and C#).  I have been learning java on an "as needed" basis, which is perhaps not the most efficient way overall but has been the fastest way to get started.&lt;br /&gt;&lt;br /&gt;When programming java, there are two steps.  
I am using the &lt;span style="font-family:courier new;"&gt;javac&lt;/span&gt; command to compile code into class files.  Then I'm using the &lt;span style="font-family:courier new;"&gt;jar&lt;/span&gt; command to create a jar file.  I have been considering this the equivalent of "compiling and linking code", which also takes two steps.&lt;br /&gt;&lt;br /&gt;However, the jar file is much more versatile than a regular executable image.  In particular, I can put &lt;span style="font-style: italic;"&gt;any&lt;/span&gt; files there.  These files are then available in my application, although java calls them "resources" instead of "files".  This will be very important in getting MaxMind's software to work with Hadoop.  I can include the IP database in my application jar file, which is pretty cool.&lt;br /&gt;&lt;br /&gt;There is a little complexity, though, which involves the paths where the files are located.  When using hadoop, I have been using import statements that name packages such as "&lt;span style="font-family:courier new;"&gt;org.apache.hadoop.mapreduce&lt;/span&gt;" without really understanding them.  Such a statement brings in classes associated with the mapreduce package, because three things have happened:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The original work (at apache) was done in a directory structure that included &lt;span style="font-family:courier new;"&gt;./org/apache/hadoop/mapreduce&lt;/span&gt;.&lt;/li&gt;&lt;li&gt;The jar file was created in that (higher-level) directory.  Note that this could be buried deep down in the directory hierarchy.  Everything is relative to the directory where the jar file is created.&lt;/li&gt;&lt;li&gt;I am including that jar file explicitly in my &lt;span style="font-family:courier new;"&gt;javac&lt;/span&gt; command, using the &lt;span style="font-family:courier new;"&gt;-cp&lt;/span&gt; argument, which specifies a class path.&lt;/li&gt;&lt;/ul&gt;All of this worked without my having to understand it, because I had some examples of working code.  
The MaxMind code then poses a new problem. This is the first time that I have to get someone else's code to work.  How do we do this?&lt;br /&gt;&lt;br /&gt;First, after you uncompress their java code, copy the &lt;span style="font-family:courier new;"&gt;com&lt;/span&gt; directory to the place where you create your java jar file.  Actually, you could just link the directories.  Or, if you know what you are doing, then you may have another solution.&lt;br /&gt;&lt;br /&gt;Next, for compiling the files, I modified the &lt;span style="font-family:courier new;"&gt;javac&lt;/span&gt; command line, so it read:  &lt;span style="font-family:courier new;"&gt; &lt;/span&gt;&lt;span class="il"  style="font-family:courier new;"&gt;javac&lt;/span&gt;&lt;span style="font-family:courier new;"&gt; -cp .:/opt/hadoop/hadoop-0.20.1-core.jar:com/maxmind/geoip [subdirectory]/*.java&lt;/span&gt;.  That is, I added the geoip directory to the class path, so java can find the class files.&lt;br /&gt;&lt;br /&gt;The class path can accept either a jar file or a directory.  When it is a jar file, &lt;span style="font-family:courier new;"&gt;javac&lt;/span&gt; looks for classes in the jar file.  When it is a directory, it looks for classes in the directory (but not in subdirectories).  That is simple enough.  I do have to admit, though, that it wasn't obvious when I started.  I don't think of jar files and directories as being equivalent.  But they are.&lt;br /&gt;&lt;br /&gt;Once the code compiles, just be sure to include the &lt;span style="font-family:courier new;"&gt;com/maxmind/geoip/*&lt;/span&gt; files in the jar command.  In addition, I also copied over the GeoLite Country database and included it in the jar file.  Do note that the path used to put things in the jar file makes a difference!  
So, "&lt;span style="font-family:courier new;"&gt;jar ~/maxmind/*.dat&lt;/span&gt;" behaves differently from "&lt;span style="font-family:courier new;"&gt;jar ./*.dat&lt;/span&gt;", when we want to use the data file.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;font-size:130%;" &gt;&lt;span style="font-family:arial;"&gt;Getting MaxMind to Work With Hadoop&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Things are a little complicated in the Hadoop world, because we need to pass in a database file to initialize the MaxMind classes.  My first attempt was to initialize the lookup service in the map class using code like:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;iplookup = new LookupService("~/maxmind/GeoIP.dat",&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.............................&lt;/span&gt;LookupService.GEOIP_MEMORY_CACHE |&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.............................&lt;/span&gt;LookupService.GEOIP_CHECK_CACHE);&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;This looked right to me and was similar to code that I found in various placed on the internet.&lt;br /&gt;&lt;br /&gt;Guess what?  It didn't work.  And it didn't work for a fundamentally important reason.  Map classes are run on the distributed nodes, and the distributed nodes do not have access to the local file system.  Duh, this is why the HDFS (hadoop distributed file system) was invented!&lt;br /&gt;&lt;br /&gt;But now, I have a problem.  There is a reasonably sized data file -- about 1 Mbyte.  Copying it to the HDFS does not really solve my problem, because it is not an "input" into the Map routine.  
I suppose I could copy it and then figure out how to open it as a sequence file, but that is not the route I took.&lt;br /&gt;&lt;br /&gt;Up to this point, I had found three ways to get information into the Map classes:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Compile it in using constants.&lt;/li&gt;&lt;li&gt;Pass small amounts of data in the Conf structure, using the various set and get functions.  I have examples of this in the row number code.&lt;/li&gt;&lt;li&gt;Use the distributed cache.  I haven't done this yet, because there is a warning about setting it up correctly using configuration xml files.  Wow, that is something that I can easily get wrong.  I'll learn this when I think it is absolutely necessary, knowing that it might take a few hours to get it right.&lt;/li&gt;&lt;/ol&gt;But now, I've discovered that java has an amazing fourth way:  I can pass files in through the jar file.  Remember, when we use Hadoop, we call a function "&lt;span style="font-family:courier new;"&gt;setJarByClass()&lt;/span&gt;".  Well, this function takes the class that is passed in and sends the entire jar file with the class to each of the distributed nodes (for both the Map and Reduce classes).  Now, if that jar file just happens to contain a data file with IP-address-to-country lookup data, then java has conspired to send my database file exactly where it is needed!&lt;br /&gt;&lt;br /&gt;Thank you java!  You solved this problem.  (Or, should I be thanking Hadoop?)&lt;br /&gt;&lt;br /&gt;The only question is how to get the file out of the jar file.  Well, the things in the jar file are called "resources".  Resources are accessed using uniform resource identifiers (URI).  And, the URI is conveniently built out of the file name.  Life is not so convenient that the URI &lt;span style="font-style: italic;"&gt;is&lt;/span&gt; the  file name.  But, it is close enough.  
The URI prepends the file name with something (say, "file:").&lt;br /&gt;&lt;br /&gt;So, to get the data file out of the jar file (which we put in using the jar command), we need to:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;figure out the name for the resource in the jar file;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;convert the resource name  to a file name; and then,&lt;br /&gt;&lt;/li&gt;&lt;li&gt;open this just as we would a regular file (by passing it into the constructor).&lt;/li&gt;&lt;/ul&gt;The code to do this is:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;import com.maxmind.geoip.*;&lt;br /&gt;...&lt;br /&gt;  if (iplookup == null) {&lt;br /&gt;      &lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;String filename = getClass().getResource("/GeoIP.dat").toExternalForm().substring(5);&lt;br /&gt;      &lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;iplookup = new LookupService(filename, LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE);&lt;br /&gt;  }&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;The import tells the java code where to find the &lt;span style="font-family:courier new;"&gt;LookupService&lt;/span&gt; class.  To make this work, we have to include the appropriate directory in the class path, as described earlier.&lt;br /&gt;&lt;br /&gt;The first statement creates the file name.  The resource name "/GeoIP.dat" says that the resource is a file, located in the directory where the jar file was created.   The rest of the statement converts this to a file name.  The function "&lt;span style="font-family: courier new;"&gt;toExternalForm()&lt;/span&gt;" creates a URI, which is the filename prepended with something.  The &lt;span style="font-family: courier new;"&gt;substring(5)&lt;/span&gt; removes the something (I didn't look, but for a resource sitting on the local disk it would be "file:", which is five characters long).  The original example code I found had &lt;span style="font-family: courier new;"&gt;substring(6)&lt;/span&gt;, which did not work for me on EC2. 
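&lt;br /&gt;&lt;br /&gt;As an aside on that prefix, here is a minimal sketch of the conversion from a URL string to a plain file name.  It assumes the resource URL uses the "file:" scheme, which is what a resource unpacked onto a local disk would normally report.  The class name &lt;span style="font-family:courier new;"&gt;ResourcePathDemo&lt;/span&gt;, the helper &lt;span style="font-family:courier new;"&gt;stripFileScheme()&lt;/span&gt;, and the path /tmp/GeoIP.dat are made up for illustration and are not part of the MaxMind or Hadoop code:

```java
import java.io.File;
import java.net.URL;

// Hypothetical demo class; the name and the helper below are not from the post.
public class ResourcePathDemo {

    // Remove a leading "file:" scheme from a URL string.
    // This has the same effect as substring(5), but it fails loudly
    // when the URL uses some other scheme (for example "jar:").
    public static String stripFileScheme(String externalForm) {
        String scheme = "file:";
        if (!externalForm.startsWith(scheme)) {
            throw new IllegalArgumentException("not a file: URL: " + externalForm);
        }
        return externalForm.substring(scheme.length());
    }

    public static void main(String[] args) throws Exception {
        // Assumed path, used only to build an example file: URL.
        URL url = new File("/tmp/GeoIP.dat").toURI().toURL();
        String external = url.toExternalForm();
        // Print the URL form, then the plain path with the scheme removed.
        System.out.println(external);
        System.out.println(stripFileScheme(external));
    }
}
```

Running the sketch prints the URL form followed by the path with the scheme removed.  Checking for the scheme by name, rather than counting characters, may explain why an example written with a different offset fails on another system.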
&lt;br /&gt;&lt;br /&gt;The second statement passes this into the lookup service constructor.&lt;br /&gt;&lt;br /&gt;Now the lookup service is available, and I can use it via this code:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;this.ipcountry = iplookup.getCountry(sale.ip).getCode();&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Voila!  From the IP address, I am able to use free code downloaded from the internet to lookup the IP address using the distributed power of Hadoop.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-1333878193334947211?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/1333878193334947211/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/12/hadoop-and-mapreduce-what-country-is-ip.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1333878193334947211'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1333878193334947211'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/12/hadoop-and-mapreduce-what-country-is-ip.html' title='Hadoop and MapReduce:  What Country is an IP Address in?'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-2512758806273235451</id><published>2009-11-29T21:49:00.006-05:00</published><updated>2009-12-05T12:42:56.296-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='MapReduce'/><category scheme='http://www.blogger.com/atom/ns#' term='Hadoop'/><title type='text'>Hadoop and MapReduce:  Switching to 0.20 and Cloudera</title><content type='html'>Recently, I decided to switch from Hadoop 0.18 to 0.20 for several reasons:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;I'm getting tired of using deprecated features -- it is time to learn the new interface.&lt;/li&gt;&lt;li&gt;I would like to use some new features, specifically MultipleInputFormats.&lt;/li&gt;&lt;li&gt;The Yahoo! Virtual Machine (which I recommended in my first &lt;a href="http://www.data-miners.com/blog/2009/11/hadoop-and-mapreduce-parallel-program.html"&gt;post&lt;/a&gt;) is not maintained, whereas the Cloudera training machine is.&lt;/li&gt;&lt;li&gt;And, for free software, I have so far found the Cloudera community support quite effective.&lt;/li&gt;&lt;/ol&gt;I chose the Cloudera Virtual Machine for a simple reason:  it was recommended by Jeff, who works there and describes himself as "a big fan of [my data mining] books".  I do not know if there are other VMs that are available, and  I am quite happy with my Cloudera experience so far.  
Their community support provided answers to key questions, even over the Thanksgiving long weekend.&lt;br /&gt;&lt;br /&gt;That said, there are a few downsides to the upgrade:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The virtual machine has the most recent version of Eclipse (called Ganymede), which does not work with Hadoop.&lt;/li&gt;&lt;li&gt;Hence, the virtual machine requires using command lines for compiling the java code.&lt;/li&gt;&lt;li&gt;I haven't managed to get the virtual machine to share disks with the host (instead, I send source files through gmail).&lt;/li&gt;&lt;/ul&gt;The rest of this post explains how I moved the code that assigns consecutive row numbers (from my previous &lt;a href="http://www.data-miners.com/blog/2009/11/hadoop-and-mapreduce-parallel-program.html"&gt;post&lt;/a&gt;) to Hadoop 0.20.  It starts with details about the new interface and then talks about updating to the Cloudera virtual machine.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;font-family:arial;font-size:130%;"  &gt;Changes from Hadoop 0.18 to 0.20&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The updated code with the Hadoop 0.20 API is in &lt;a href="http://www.data-miners.com/blog/RowNumberTwoPass-0.20.java"&gt;RowNumberTwoPass-0.20.java&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Perhaps the most noticeable change is the packages.  Before 0.20, Hadoop used classes in a package called "mapred".  Starting with 0.20, it uses classes in "mapreduce".  These have a different interface, although it is pretty easy to switch from one to the other.&lt;br /&gt;&lt;br /&gt;The reason for this change has to do with future development for Hadoop.  This change will make it possible to separate releases of HDFS (the distributed file system) and releases of MapReduce.  
The following are packages that contain the new interface:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;import org.apache.hadoop.mapreduce.*;&lt;br /&gt;import org.apache.hadoop.mapreduce.lib.map.*;&lt;br /&gt;import org.apache.hadoop.mapreduce.lib.reduce.*;&lt;br /&gt;import org.apache.hadoop.mapreduce.lib.input.*;&lt;br /&gt;import org.apache.hadoop.mapreduce.lib.output.*;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;In the code itself, there are both subtle and major differences.  I have noticed the following changes in the Map and Reduce classes:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The classes no longer need the "implements" syntax (they extend Mapper and Reducer instead).&lt;/li&gt;&lt;li&gt;The function called before the map/reduce is now called &lt;span style="font-family:courier new;"&gt;setup()&lt;/span&gt; rather than &lt;span style="font-family: courier new;"&gt;configure()&lt;/span&gt;.&lt;/li&gt;&lt;li&gt;The function called after the map/reduce is called &lt;span style="font-family:courier new;"&gt;cleanup()&lt;/span&gt;.&lt;/li&gt;&lt;li&gt;The functions all take an argument whose class is &lt;span style="font-family:courier new;"&gt;Context&lt;/span&gt;; this is used instead of &lt;span style="font-family:courier new;"&gt;Reporter&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;OutputCollector&lt;/span&gt;.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The map and reduce functions can also throw &lt;span style="font-family:courier new;"&gt;InterruptedException&lt;/span&gt;.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;The driver function has more changes, caused by the fact that &lt;span style="font-family:courier new;"&gt;JobConf&lt;/span&gt; is no longer part of the interface.  Instead, the work is set up using &lt;span style="font-family:courier new;"&gt;Job&lt;/span&gt;.  Variables and values are passed into the Map and Reduce classes through &lt;span style="font-family:courier new;"&gt;Configuration&lt;/span&gt; rather than &lt;span style="font-family:courier new;"&gt;JobConf&lt;/span&gt;.  
Also, the code for the Map and Reduce classes is added in using the call &lt;span style="font-family:courier new;"&gt;Job.setJarByClass()&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;There are a few other minor coding differences.  However, the code follows the same logic as in 0.18, and the code ran the first time after I made the changes.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;font-family:arial;font-size:130%;"  &gt;The Cloudera Virtual Machine&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;First, I should point out that I have no connection to &lt;a href="http://www.cloudera.com"&gt;Cloudera&lt;/a&gt;, which is a company that makes money (or intends to make money) by providing support and training for Hadoop.&lt;br /&gt;&lt;br /&gt;The Cloudera Virtual Machine is available &lt;a href="http://www.cloudera.com/hadoop-training-virtual-machine"&gt;here&lt;/a&gt;.  It requires running a VMWare virtual machine, which is available &lt;a href="http://downloads.vmware.com/d/info/desktop_downloads/vmware_player/3_0"&gt;here&lt;/a&gt;.  Between the two, these are about 1.5 Gbytes, so have a good internet connection when you want to download them.&lt;br /&gt;&lt;br /&gt;The machine looks different from the Yahoo! VM, because it runs X rather than just a terminal interface.  The desktop is pre-configured with a terminal, Eclipse, Firefox, and perhaps some other stuff.  When I start the VM, I open the terminal and run emacs in the background.  Emacs is a text editor that I know well from my days as a software programmer (more years ago than I care to admit).  To use the VM, I would suggest that you have some facility with either emacs or VI.&lt;br /&gt;&lt;br /&gt;The version of Hadoop is 0.20.1.  Note that as new versions are released, Cloudera will probably introduce new virtual machines.  Any work you do on this machine will be lost when you replace the VM with a newer version.  
As I said, I am sending source files back and forth via gmail.  Perhaps you can get the VM to share disks with the host machine.  The libraries for Hadoop are in &lt;span style="font-family: courier new;"&gt;/usr/lib/hadoop-0.20&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;Unfortunately, the version of Eclipse installed in the VM does not fully support Hadoop (if you want to see the bug reports, google something like "Hadoop Ganymede").   Fortunately, you can use Eclipse/Ganymede to write code, and it does full syntax checking.  However, you'll have to compile and run the code outside the Eclipse environment.  I believe this is a bug in this version of Eclipse, which will hopefully be fixed sometime in the near future.&lt;br /&gt;&lt;br /&gt;I suppose that I could download the working version of Eclipse (Europa, which is, I think, version 3.3).  But, that was too much of a bother.   Instead, I learned to use the command line interface for compiling and running code.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;font-family:arial;font-size:130%;"  &gt;Compiling and Running Programs&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To compile and run programs you will need to use command line commands.&lt;br /&gt;&lt;br /&gt;To build a new project, create a new java project in Eclipse.  The one thing it needs is a pointer to the Hadoop 0.20 libraries (actually "jars").   
To install a pointer to this library, do the following after creating the project:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Right click on the project name and choose "Properties".&lt;/li&gt;&lt;li&gt;Click on "Java Build Path" and go to the "Libraries" tab.&lt;/li&gt;&lt;li&gt;Click on "Add External JARs".&lt;/li&gt;&lt;li&gt;Navigate to &lt;span style="font-family: courier new;"&gt;/usr/lib/hadoop-0.20&lt;/span&gt; and choose &lt;span style="font-family: courier new;"&gt;hadoop-0.20.1+133-core.jar&lt;/span&gt;.&lt;/li&gt;&lt;li&gt;Click "OK" on the windows until you are out.&lt;/li&gt;&lt;/ul&gt;You'll see the new library in the project listing.&lt;br /&gt;&lt;br /&gt;Next, you should create a package and then source code in the package.&lt;br /&gt;&lt;br /&gt;After you have created the project, you can compile and run the code from a command line by doing the following.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Go to the project directory (~/workshop/&amp;lt;project&amp;gt;).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Issue the following command:  "&lt;span style="font-family:courier new;"&gt;javac -cp /usr/lib/hadoop-0.20/hadoop-0.20.1+133-core.jar -d bin/ src/*/*.java&lt;/span&gt;" [note:  there is a space after "bin/"].&lt;/li&gt;&lt;li&gt;Create the jar:  "&lt;span style="font-family:courier new;"&gt;cd bin; jar cvf ../&amp;lt;jar&amp;gt; */*; cd ..&lt;/span&gt;".  So, for the RowNumberTwoPass command, I use:  "&lt;span style="font-family:courier new;"&gt;cd bin; jar cvf ../RowNumberTwoPass.jar */*; cd ..&lt;/span&gt;".&lt;/li&gt;&lt;li&gt;Run the code using the command:  "&lt;span style="font-family:courier new;"&gt;hadoop jar RowNumberTwoPass.jar RowNumberTwoPass/rownumbertwopass&lt;/span&gt;".  The first argument after "hadoop jar" is the jar file with the code.  
The second is the class and package where the &lt;span style="font-family:courier new;"&gt;main()&lt;/span&gt; function is located.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Although this seems a little bit complicated, it is only cumbersome the first time you run it.  After that, you have the commands and running them again is simple.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-2512758806273235451?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/2512758806273235451/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/11/hadoop-and-mapreduce-switching-to-020.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/2512758806273235451'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/2512758806273235451'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/11/hadoop-and-mapreduce-switching-to-020.html' title='Hadoop and MapReduce:  Switching to 0.20 and Cloudera'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-8492395889301664500</id><published>2009-11-25T15:18:00.010-05:00</published><updated>2009-11-25T16:56:03.477-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='MapReduce'/><category scheme='http://www.blogger.com/atom/ns#' term='Hadoop'/><title type='text'>Hadoop and MapReduce:  A Parallel Program to Assign Row Numbers</title><content type='html'>This post discusses (and solves) the problem of assigning consecutive row numbers to data, with no holes.  Along the way, it also introduces some key aspects of the Hadoop framework:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Using the FileSystem package to access HDFS (a much better approach than in my previous posting).&lt;/li&gt;&lt;li&gt;Reading configuration parameters in the Map function.&lt;/li&gt;&lt;li&gt;Passing parameters from the main program to the Map and Reduce functions.&lt;/li&gt;&lt;li&gt;Writing out intermediate results from the Map function.&lt;/li&gt;&lt;/ul&gt;These are all important functionality for using the Hadoop framework.  In addition, I plan on using this technique for assigning unique ids to values in various columns.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;font-family:arial;font-size:130%;"  &gt;The "Typical" Approach&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The "typical" approach is to serialize the problem, by creating a Reducer function that adds the row number.  
By limiting the framework to only a single reducer (using setNumReduceTasks(1) in the JobConf class), this outputs the row number.&lt;br /&gt;&lt;br /&gt;There are several problems with this solution.  The biggest issue is, perhaps, aesthetic.  Shouldn't a parallel framework, such as Hadoop, be able to solve such a simple problem?  Enforced serialization is highly inefficient, since the value of Hadoop is in the parallel programming capabilities enabled when multiple copies of maps and reduces are running.&lt;br /&gt;&lt;br /&gt;Another issue is the output file.  Without some manual coding, the output is a single file, which may perhaps be local to a single cluster node (depending on how the file system is configured).  This can slow down subsequent map reduce tasks that use the file.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 153, 0);font-family:arial;font-size:130%;"  &gt;&lt;br /&gt;An Alternative Fully Parallel Approach&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There is a better way, a fully parallel approach that uses two passes through the Map-Reduce framework.  Actually, the full data is only passed once through the framework, so this is a much more efficient alternative to the first approach.&lt;br /&gt;&lt;br /&gt;Let me describe the approach using three passes through the data, since this makes for a simpler explanation (the actual implementation combines the first two steps).&lt;br /&gt;&lt;br /&gt;The first pass through the data consists of a Map phase that assigns a new key to each row and no Reduce phase.  The key consists of two parts:  the partition id and the row number within the partition.&lt;br /&gt;&lt;br /&gt;The second pass counts the number of rows in each partition, by extracting the maximum row number for each partition key.&lt;br /&gt;&lt;br /&gt;These counts are then combined to get cumulative sums of counts up to each partition.  
Although I could do this in the reduce step, I choose not to (which I'll explain below).  Instead, I do the work in the main program.&lt;br /&gt;&lt;br /&gt;The third pass adds the offset to the row number and outputs the results.  Note that the number of map tasks in the first pass can be different from the number in subsequent passes, since the code always uses the original partition number for its calculations.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 153, 0);font-family:arial;font-size:130%;"  &gt;More Detail on the Approach -- Pass 1&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The code is available in this file &lt;a href="http://www.data-miners.com/blog/RowNumberTwoPass.java"&gt;RowNumberTwoPass.java&lt;/a&gt;.  It contains one class with two Map phases and one Reduce phase.  This code assumes that the data is stored in a text file.  This assumption simplifies the code, because I do not have to introduce any auxiliary classes to read the data.  However, the same technique would work for any data format.&lt;br /&gt;&lt;br /&gt;The first map phase, &lt;span style="font-family:courier new;"&gt;NewKeyOutputMap&lt;/span&gt;,  does two things.  The simpler thing is to output the partition id and the row number within the partition for use in subsequent processing.  The second is to save a copy of the data, with this key, for the second pass.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-style: italic;font-family:arial;font-size:130%;"  &gt;Assigning the Partition ID&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;How does any Map function figure out its partition id?  
The partition id is stored in the job configuration, and is accessed using the code:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;partitionid = conf.getInt("mapred.task.partition", 0);&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;In the version of Hadoop that I'm using (0.18.3, through the Yahoo virtual machine), the job configuration is only visible to the &lt;span style="font-family:courier new;"&gt;configure()&lt;/span&gt; function.  This is an optional function that can be overridden when extending the &lt;span style="font-family:courier new;"&gt;MapReduceBase&lt;/span&gt; class.   It gets called once, before any rows are processed, to initialize the task.  The function takes one argument, the job configuration.  I just store the result in a static variable in the &lt;span style="font-family:courier new;"&gt;NewOutputKeyMap&lt;/span&gt; class.&lt;br /&gt;&lt;br /&gt;In more recent versions of Hadoop, the configuration is available in the context argument to the map function.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-style: italic;font-family:arial;font-size:130%;"  &gt;Using Sequence Files in the Map Phase&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The second task is to save the original rows with the new key values.  For this, I need a sequence file.  Or, more specifically, I need a different sequence file for each Map task.  Incorporating the partition id into the file name accomplishes this.&lt;br /&gt;&lt;br /&gt;Sequence files are data stores specific to the Hadoop framework, which contain key-value pairs.  At first, I found them a bit confusing:  Did the term "sequence file" refer to a collection of files available to all map tasks or to a single instance of one of these files?   In fact, the term refers to a single instance file.  
To continue processing, we will actually need a collection of sequence files, rather than a single "sequence file".&lt;br /&gt;&lt;br /&gt;Sequence files are almost as simple to use as any other files, as the following code in the  &lt;span style="font-family:courier new;"&gt;configure()&lt;/span&gt; function shows:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;FileSystem fs = FileSystem.get(conf);&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;sfw = SequenceFile.createWriter(fs, conf,&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;........&lt;/span&gt;new Path(saverecordsdir+"/"+String.format("records%05d", partitionid.get())),&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;........&lt;/span&gt;Text.class, Text.class);&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;The first statement simply retrieves the appropriate file system for creating the file.  The second statement uses the &lt;span style="font-family:courier new;"&gt;SequenceFile.createWriter()&lt;/span&gt; function to open the file and save the resulting writer in the &lt;span style="font-family:courier new;"&gt;sfw&lt;/span&gt; variable.  There are several versions of this function, with various additional options.  I've chosen the simplest version.  The specific file will go in the directory referred to by the variable &lt;span style="font-family:courier new;"&gt;saverecordsdir&lt;/span&gt;.   
This will contain a series of files with the names "records#####" where ##### is a five-digit, zero-padded number.&lt;br /&gt;&lt;br /&gt;This is all enclosed in try-catch logic to catch appropriate exceptions.&lt;br /&gt;&lt;br /&gt;Later in the code, the &lt;span style="font-family:courier new;"&gt;map()&lt;/span&gt; function writes to the sequence file using the logic:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;sfw.append(outkey, value);&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Very simple!&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-style: italic;font-family:arial;font-size:130%;"  &gt;Pass1:  Reduce and Combine Functions&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The purpose of the reduce function is to count the number of rows in each partition.  Instead of counting, the function actually takes the maximum row number within each partition.  By taking this approach, I can use the same function for both reducing and combining.&lt;br /&gt;&lt;br /&gt;For efficiency purposes, the combine phase is very important to this operation.  The way the problem is structured, the combine output should be a single record for each map instance -- and sending this data around for the reduce phase should incur very little overhead.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 153, 0);font-family:arial;font-size:130%;"  &gt;More Detail on the Approach -- Offset Calculation and Pass 2&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;At the end of the first phase, the summary key result files contain a single row for each partition, containing the number of rows in that partition.  
For instance, from my small test data, the data looks like:&lt;br /&gt;&lt;br /&gt;&lt;table style="border-collapse: collapse; width: 144pt;" border="0" cellpadding="0" cellspacing="0" width="192"&gt;&lt;col style="width: 48pt;" width="64" span="3"&gt;  &lt;tbody&gt;&lt;tr style="height: 15pt;" height="20"&gt;   &lt;td style="height: 15pt; width: 48pt;" width="64" height="20"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 48pt;" align="right" width="64"&gt;0&lt;/td&gt;   &lt;td style="width: 48pt;" align="right" width="64"&gt;2265&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 15pt;" height="20"&gt;   &lt;td style="height: 15pt;" height="20"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td align="right"&gt;1&lt;/td&gt;   &lt;td align="right"&gt;2236&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 15pt;" height="20"&gt;   &lt;td style="height: 15pt;" height="20"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td align="right"&gt;2&lt;/td&gt;   &lt;td align="right"&gt;3&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;The first column is the partition id, the second is the count.  The offset is the cumulative sum of previous values.  
So, I want this to be:&lt;br /&gt;&lt;br /&gt;&lt;table style="border-collapse: collapse; width: 192pt;" border="0" cellpadding="0" cellspacing="0" width="256"&gt;&lt;col style="width: 48pt;" width="64" span="4"&gt;  &lt;tbody&gt;&lt;tr style="height: 15pt;" height="20"&gt;   &lt;td style="height: 15pt; width: 48pt;" width="64" height="20"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 48pt;" align="right" width="64"&gt;0&lt;/td&gt;   &lt;td style="width: 48pt;" align="right" width="64"&gt;2265&lt;/td&gt;   &lt;td style="width: 48pt;" align="right" width="64"&gt;0&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 15pt;" height="20"&gt;   &lt;td style="height: 15pt;" height="20"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td align="right"&gt;1&lt;/td&gt;   &lt;td align="right"&gt;2236&lt;/td&gt;   &lt;td align="right"&gt;2265&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 15pt;" height="20"&gt;   &lt;td style="height: 15pt;" height="20"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td align="right"&gt;2&lt;/td&gt;   &lt;td align="right"&gt;3&lt;/td&gt;   &lt;td align="right"&gt;4501&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;To accomplish this, I read the data in the main loop, after running the first job.  
The following loop in &lt;span style="font-family: courier new;"&gt;main()&lt;/span&gt; gets the results, does the calculation, and saves the results as parameters in the job configuration:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;int numvals = 0;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;long cumsum = 0;&lt;br /&gt;&lt;/code&gt;&lt;code&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;FileStatus[] files = fs.globStatus(new Path(keysummaryoutput+ "/p*"));&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;for (FileStatus fstat : files) {&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;........&lt;/span&gt;FSDataInputStream fsdis = fs.open(fstat.getPath());&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;........&lt;/span&gt;String line = "";&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;........&lt;/span&gt;while ((line = fsdis.readLine()) != null) {&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;............&lt;/span&gt;finalconf.set(PARAMETER_cumsum_nthvalue + numvals++, line + "\t" + cumsum);&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;............&lt;/span&gt;String[] vals = line.split("\t");&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;............&lt;/span&gt;cumsum += Long.parseLong(vals[1]);&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;........&lt;/span&gt;}&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;}&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;finalconf.setInt(PARAMETER_cumsum_numvals, numvals);&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Perhaps the most interesting part of this code is the use of the function &lt;span style="font-family: courier new;"&gt;fs.globStatus()&lt;/span&gt; to get a list of HDFS files that match wildcards (in this case, anything that 
starts with "p" in the &lt;span style="font-family: courier new;"&gt;keysummaryoutput&lt;/span&gt; directory).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 153, 0);font-family:arial;font-size:130%;"  &gt;Conclusion&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Parallel Map-Reduce is a powerful programming paradigm that makes it possible to solve many different types of problems using parallel dataflow constructs.&lt;br /&gt;&lt;br /&gt;Some problems seem, at first sight, to be inherently serial.  Appending a sequential row number onto each row is one of those problems.  After all, don't you have to process the previous row to get the number for the next row?  And isn't this a hallmark of inherently serial problems?&lt;br /&gt;&lt;br /&gt;The answers to these questions are "no" and "not always".  The algorithm described here should scale to very large data sizes and very large machine sizes.  For large volumes of data, it is much, much more efficient than the serial version, since all processing is in parallel.  That is almost true.  The only "serial" part of the algorithm is the calculation of the offsets between the passes.  However, this is such a small amount of data, relative to the overall data, that its effect on overall efficiency is negligible.&lt;br /&gt;&lt;br /&gt;The offsets are passed into the second pass using the &lt;span style="font-family: courier new;"&gt;JobConf&lt;/span&gt; structure.  There are other ways of passing this data.  One method would be to use the distributed data cache.  However, I have not learned how to use this yet.&lt;br /&gt;&lt;br /&gt;Another distribution method would be to do the calculations in the first pass reduce phase (by using only one reducer in this phase).  The results would be in a file.  This file could then be read by subsequent map tasks to extract the offset data.  
However, such an approach introduces a lot of contention, because suddenly there will be a host of tasks all trying to open the same file -- contention that can slow processing considerably.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-8492395889301664500?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/8492395889301664500/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/11/hadoop-and-mapreduce-parallel-program.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/8492395889301664500'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/8492395889301664500'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/11/hadoop-and-mapreduce-parallel-program.html' title='Hadoop and MapReduce:  A Parallel Program to Assign Row Numbers'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-3718762123970345595</id><published>2009-11-21T14:52:00.004-05:00</published><updated>2009-11-23T18:24:24.797-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='MapReduce'/><category scheme='http://www.blogger.com/atom/ns#' term='Hadoop'/><title type='text'>Hadoop and MapReduce:  Controlling the Hadoop File System from the MapReduce Program</title><content type='html'>[This first comment explains that Hadoop really does have a supported interface to the hdfs file system, through the FileSystem package ("import org.apache.hadoop.fs.FileSystem").  Yeah!  I knew such an interface should exist -- and even stumbled across it myself after this post.  Unfortunately, there is not as simple an interface for the "cat" operation, but you can't have everything.]&lt;br /&gt;&lt;br /&gt;In my previous &lt;a href="http://www.data-miners.com/blog/2009/11/getting-started-with-hadoop-and.html"&gt;post&lt;/a&gt;, I explained some of the challenges in getting a Hadoop environment up and running.  Since then, I have succeeded in using Hadoop both on my home machine and on Amazon EC2.&lt;br /&gt;&lt;br /&gt;In my opinion, one of the major shortcomings of the programming framework is the lack of access to the HDFS file system from MapReduce programs.  More concretely, if you have attempted to run the WordCount program, you may have noticed that you can run it once without a problem.  The second time you get an error saying that the output files already exist.&lt;br /&gt;&lt;br /&gt;What do you do?  
You go over to the machine running HDFS -- which may or may not be your development machine -- and you delete the files using the "hadoop fs -rmr" command.  Can't java do this?&lt;br /&gt;&lt;br /&gt;You may also have noticed that you cannot see the output.  Files get created, somewhere.  What fun.  To see them, you need to use the "hadoop fs -cat" command.  Can't java do this?&lt;br /&gt;&lt;br /&gt;Why can't we create a simple WordCount program that can be run multiple times in a row, without error, and that prints out the results?  And, to further this question, I want to do all the work in java.  I don't want to work with an additional scripting language, since I already feel that I've downloaded way too many tools on my machine to get all this to work.&lt;br /&gt;&lt;br /&gt;By the way, I feel that both of these are very, very reasonable requests, and the hadoop framework should support them.  It does not.  For those who debate whether hadoop is better or worse than parallel databases, recognize that the master process in parallel databases typically supports functionality similar to what I'm asking for here.&lt;br /&gt;&lt;br /&gt;Why is this not easy?  Java, Hadoop, and the operating systems seem to conspire to prevent this.  But I like a challenge.  This posting, which will be rather long, is going to explain my solution.  Hey, I'll even include some code so other people don't have to suffer through the effort.&lt;br /&gt;&lt;br /&gt;I want to do this on the configuration I'm running from home.  This configuration consists of:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Windows Vista, running Eclipse&lt;/li&gt;&lt;li&gt;Ubuntu Linux virtual machine, courtesy of Yahoo!, running Hadoop 0.18&lt;/li&gt;&lt;/ul&gt;However, I also want the method to be general and work regardless of platform.  So, I want it to work if I write the code directly on my virtual machine, or if I write the code on Amazon EC2.  
Or, if I decide to use Karmasphere instead of Eclipse to write the code, or if I just write the code in a Java IDE.  In all honesty, I've only gotten the system to work on my particular configuration, but I think it would not be difficult to get it to work on Unix.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-family:arial;font-size:130%;"  &gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0);"&gt;Overview of Solution&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The overview of the solution is simple enough.  I am going to do the following:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Create a command file called "myhadoop.bat" that I can call from java.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Write a class in java that will run this bat file with the arguments to do what I want.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Boy, that's simple.  NOT!&lt;br /&gt;&lt;br /&gt;Here is a sample of the problems:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Java has to call the batch file without any path.  This is because Windows uses the backslash to separate directories whereas Unix uses forward slashes.  I lose platform independence if I use full paths.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The batch file has to connect to a remote machine.  Windows Vista does not have a command to do this.  Unix uses the command "rsh".&lt;/li&gt;&lt;li&gt;The java method for executing commands (Runtime.getRuntime().exec()) does not execute batch files easily.&lt;/li&gt;&lt;li&gt;The java method for executing commands hangs after a few lines are output.  
And, the lines could be in either the standard output stream (stdout) or the error output stream (stderr), and it is not obvious how to read both of them at the same time.&lt;/li&gt;&lt;/ul&gt;This post is going to resolve these problems, step by step.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-family:arial;font-size:130%;"  &gt;&lt;span style="color: rgb(0, 153, 0);"&gt;What You Need&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To get started, you need to do a few things to your computer so everything will work.&lt;br /&gt;&lt;br /&gt;First, install the program PuTTY (from &lt;a href="http://www.chiark.greenend.org.uk/%7Esgtatham/putty/download.html"&gt;here&lt;/a&gt;).  Actually, choose the option for "A Windows installer for everything except PuTTYtel".  You can accept all the defaults.  As far as I know, this runs on all versions of Windows.&lt;br /&gt;&lt;br /&gt;Next, you need to change the system path so it can find two things by default:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The PuTTY programs.&lt;/li&gt;&lt;li&gt;The batch file you are going to write.&lt;/li&gt;&lt;/ul&gt;The system path variable specifies where the operating system looks for executable files, when you have a command prompt, or when you execute a command from java.&lt;br /&gt;&lt;br /&gt;Decide on the directory where you want the batch file.  I chose "c:\users\gordon".&lt;br /&gt;&lt;br /&gt;To change the system path, go to the "My Computer" or "Computer" icon on your desktop and right click to get "Properties" and then choose "Advanced System Settings".  Click on the "Environment Variables" button.  Then scroll down to find "Path" in the variables.  Edit the "Path" variable.&lt;br /&gt;&lt;br /&gt;BE VERY CAREFUL NOT TO DELETE THE PREVIOUS VALUES IN THE PATH VARIABLE!!!  ONLY ADD ONTO THEM!!!&lt;br /&gt;&lt;br /&gt;At the end of the path variable, I appended the following (without the double quotes):  ";c:\Program Files (x86)\PuTTY\;c:\users\gordon".  
The part after the second semicolon should be where you want to put your batch file.  The first part is where the putty commands are located (which may vary on different versions of Windows).&lt;br /&gt;&lt;br /&gt;Then, I found that I had to reboot my machine in order for Eclipse to know about the new path.  I speculate that this is because there is a java program running somewhere that picks up the path when it starts, and this is where Eclipse gets the path.  If I'm correct, all that needs to be done is to restart that program.  Rebooting the machine was easier than tracking down a simpler solution.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-family:arial;font-size:130%;"  &gt;&lt;span style="color: rgb(0, 153, 0);"&gt;Test the Newly Installed Software&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The equivalent of rsh in this environment is called plink.  To see if things work, you need the following:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;IP address of the other machine.  On a Unix system, you can find this using "ifconfig" (the Windows equivalent is "ipconfig").  In my case, the IP address is 192.168.65.128.  This is the address of the virtual machine, but this should work even if you are connecting to a real machine.&lt;/li&gt;&lt;li&gt;The user name to login as.  In my case, this is "hadoop-user", which is provided by the virtual machine.&lt;/li&gt;&lt;li&gt;The password.  In my case, this is "hadoop".&lt;/li&gt;&lt;/ul&gt;Here is a test command to see if you get to the right machine:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;plink -ssh -pw hadoop hadoop-user@192.168.65.128 hostname&lt;/li&gt;&lt;/ul&gt;If this works by returning the name of the machine you are connecting to, then everything is working correctly.  In my case, it returns "hadoop-desk".&lt;br /&gt;&lt;br /&gt;Since we are going to be connecting to the hadoop file system, we might as well test that as well.  
I noticed that the expected command:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;plink -ssh -pw hadoop hadoop-user@192.168.65.128 hadoop fs -ls&lt;/li&gt;&lt;/ul&gt;does not work.  This is because the remote Unix shell is not initializing its environment properly, so it cannot find the command.  On the Yahoo! virtual machine, the initializations are in the ".profile" file.  So, the correct command is:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;plink -ssh -pw hadoop hadoop-user@192.168.65.128 source .profile; hadoop fs -ls&lt;/li&gt;&lt;/ul&gt;Voila!  That magically seems to work, indicating that we can, indeed, connect to another machine and run the hadoop commands.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-family:arial;font-size:130%;"  &gt;&lt;span style="color: rgb(0, 153, 0);"&gt;Write the Batch File&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I call the batch file "myhadoop.bat".  This file contains the following line:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;"c:\Program Files (x86)\PuTTY\plink.exe" -ssh -pw %3 %2@%1 source .profile; hadoop fs %4 %5 %6 %7 %8 %9&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;This file takes the following arguments in the following order:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;host ip address (or hostname, if it can be resolved)&lt;/li&gt;&lt;li&gt;user name&lt;/li&gt;&lt;li&gt;password&lt;/li&gt;&lt;li&gt;commands to be executed (in arguments %4 through %9)&lt;/li&gt;&lt;/ul&gt;Yes, the password is in clear text.  If this is a problem, learn about PuTTY ssh with security and encryption.&lt;br /&gt;&lt;br /&gt;You can test this batch file in the same way you tested plink.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-family:arial;font-size:130%;"  &gt;&lt;span style="color: rgb(0, 153, 0);"&gt;Write a Java Class to Run the Batch File&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This is more complicated than it should be for two reasons.  First, the available exec() command does not execute batch files.  
So, you need to use "cmd /q /c myhadoop.bat" to run it.  This invokes a command interpreter to run the command (the purpose of the "/c" option).  It also does not echo the commands being run, courtesy of the "/q" option.&lt;br /&gt;&lt;br /&gt;The more painful part is the issue with stdout and stderr.  Windows blocks a process when either of these buffers is full.  What that means is that your code hangs, without explanation, rhyme, or reason.  This problem, as well as others, is explained and solved in this excellent article, &lt;a style="font-style: italic;" href="http://www.javaworld.com/javaworld/jw-12-2000/jw-1229-traps.html?page=1"&gt;When Runtime.exec() won't&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The solution is to create separate threads to read each of the streams.  With the example from the article, this isn't so hard.  It is available in this file:  &lt;a href="http://www.data-miners.com/blog/HadoopFS.java"&gt;HadoopFS.java&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Let me explain a bit how this works.  The class HadoopFS has four fields:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;command is the command that is run.&lt;/li&gt;&lt;li&gt;exitvalue is the integer code returned by the running process.  Typically, processes return 0 when they are successful and an error code otherwise.&lt;/li&gt;&lt;li&gt;stdout is a list of strings containing the standard output.&lt;/li&gt;&lt;li&gt;stderr is a list of strings containing the standard error.&lt;/li&gt;&lt;/ul&gt;Constructing an object requires a string.  This is the part of the hadoop command that appears after the "fs".  So, for "hadoop fs -ls", this would be "-ls".  As you can see, this could be easily modified to run any command, either under Windows or on the remote box, but I'm limiting it to Hadoop fs commands.&lt;br /&gt;&lt;br /&gt;This file also contains a private class called threadStreamReader.  (Hmmm, I don't think I have the standard java capitalization down, since classes often start with capital letters.)  
This is quite similar to the StreamGobbler class in the above mentioned article.  The difference is that my class stores the strings in a data structure instead of writing them to the console.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-family:arial;font-size:130%;"  &gt;&lt;span style="color: rgb(0, 153, 0);"&gt;Using the HadoopFS Class&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;At the beginning of this posting, I said that I wanted to do two things:  (1) delete the output files before running the Hadoop job and (2) output the results.  The full example for the WordCount driver class is in this file:&lt;br /&gt;&lt;a href="http://www.data-miners.com/blog/WordCount.java"&gt;WordCount.java&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;To delete the output files, I use the following code before the job is run:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;HadoopFS hdfs_rmr = new HadoopFS("-rmr "+outputname);&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;hdfs_rmr.callCommand();&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;I've put the name of the  output files in the string outputname.&lt;br /&gt;&lt;br /&gt;To show the results, I use:&lt;br /&gt;&lt;code&gt;&lt;br /&gt; &lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;HadoopFS hdfs_cat = new HadoopFS("-cat "+outputname+"/*");&lt;br /&gt;  &lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;hdfs_cat.callCommand();&lt;br /&gt;  &lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;for (String line : hdfs_cat.stdout) {&lt;br /&gt;      &lt;span style="color: rgb(255, 255, 255);"&gt;........&lt;/span&gt;System.out.println(line);&lt;br /&gt;  &lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;}&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;These are pretty simple and readable.  
More importantly, they seem to work.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-family:arial;font-size:130%;"  &gt;&lt;span style="color: rgb(0, 153, 0);"&gt;&lt;br /&gt;Conclusion&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The hadoop framework does not allow us to do some rather simple things.  There are typically three computing environments when running parallel code -- the development environment, the master environment, and the grid environment.  The master environment controls the grid, but does not provide useful functionality for the development environment.  In particular, the master environment does not give the development environment critical access to the parallel distributed files.&lt;br /&gt;&lt;br /&gt;I want to develop my code strictly in java, so I need more control over the environment.  Fortunately, I can extend the environment to support the "hadoop fs" commands in the development environment.  I believe this code could easily be extended for the Unix world (by writing appropriate "cmd" and "myhadoop.bat" files).  This code would then be run in exactly the same way from the java MapReduce code.&lt;br /&gt;&lt;br /&gt;This mechanism is going to prove much more powerful than merely affecting the aesthetics of the WordCount program.  
In the next post, I will probably explain how to use this method to return arbitrary data structures between MapReduce runs.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-3718762123970345595?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/3718762123970345595/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/11/hadoop-and-mapreduce-controlling-hadoop.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/3718762123970345595'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/3718762123970345595'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/11/hadoop-and-mapreduce-controlling-hadoop.html' title='Hadoop and MapReduce:  Controlling the Hadoop File System from theMapReduce Program'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-1377158198637714684</id><published>2009-11-18T20:42:00.002-05:00</published><updated>2009-11-18T21:57:59.081-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='MapReduce'/><category scheme='http://www.blogger.com/atom/ns#' term='Hadoop'/><title type='text'>Getting Started with Hadoop and MapReduce</title><content type='html'>Over the past week, I have been teaching myself Hadoop -- a style of programming made popular by Google (who invented the parallel version of MapReduce), Yahoo! (who created the open source version called Hadoop), and Amazon (who provides cloud computing resources called EC2 for running the software).&lt;br /&gt;&lt;br /&gt;My purpose is simple:  one of our clients has a lot of web log data on Amazon S3 (which provides lots of cheap data storage).  This data can be analyzed on EC2, using Hadoop.  Of course, the people who put the data up there are busy running the web site.  They are not analysts/data miners.  Nor do they have much bandwidth to help a newbie get started.  So, this has been a do-it-yourself effort.&lt;br /&gt;&lt;br /&gt;I have some advice for anyone else who might be in a similar position.  Over the past few days, I have managed to turn my Windows Vista laptop into a little Hadoop machine, where I can write code in an IDE (that means "integrated development environment", which is what you program in), run it on a one-node Hadoop cluster (that is, my laptop), and actually get results.&lt;br /&gt;&lt;br /&gt;There are several ways to get started with Hadoop.  
Perhaps you work at a company that has Hadoop running on one or more clusters or has a cluster in the cloud.  You can just talk to people where you are and get started.&lt;br /&gt;&lt;br /&gt;You can also endeavor to install Hadoop yourself.   There is a good chance that I will attempt this one day.  However, it is supposed to be a rather complicated process.  And all the configuration and installation work is a long way from the data.&lt;br /&gt;&lt;br /&gt;My preferred method is to use a Hadoop virtual machine.  And I found a nice one through the &lt;a href="http://developer.yahoo.com/hadoop/tutorial/"&gt;Yahoo Hadoop Tutorial&lt;/a&gt; (and there are others . . . I would like to find one running the most recent version of Hadoop).&lt;br /&gt;&lt;br /&gt;I have some advice, corrections and expectations for anyone else who wants to try this.&lt;br /&gt;&lt;br /&gt;(1)  Programming languages.&lt;br /&gt;&lt;br /&gt;Hopefully you already know some programming languages, preferably of the object-oriented sort.  If you are a Java programmer, kudos to you, since Hadoop is written in Java.  However, I found that I can struggle through Java with my rusty knowledge of C++ and C#.  (This post is *not* about my complaints about Java.)&lt;br /&gt;&lt;br /&gt;For the purposes of this discussion, I do not consider SAS or SPSS to be worthy programming languages.  You should be familiar with ideas such as classes, constructors, static and instance variables, functions, and class inheritance.  You should also be willing to read through long run-time error messages, whose ultimate meaning is something like "I was expecting a constructor with no arguments" or "I received a Text type when I was expecting an IntWritable."&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;(2) Unix Familiarity&lt;br /&gt;&lt;br /&gt;In the same way that Hadoop was developed using Java, it was developed in a Unix environment.  
This means that you should be familiar with Unix shell commands (you know, "ls" versus "dir", "rm" versus "del", and so on).  There is a rumor that a version of Hadoop will run under Windows.  I am sure, though, that trying to install open source Unix-based software under Windows is going to be a major effort; so I chose the virtual machine route.&lt;br /&gt;&lt;br /&gt;By the way, if you want to get Unix shell commands on your Windows box, then use Cygwin!  It provides the common Unix commands.  Just download the most recent version from &lt;a href="http://www.cygwin.com/"&gt;www.cygwin.com&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;(3) Use Yahoo!'s Hadoop Tutorial (which is &lt;a href="http://developer.yahoo.com/hadoop/tutorial/"&gt;here&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;This has proven invaluable in my efforts, although I have a few corrections and clarifications, which I'll describe below.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;(4) Be Prepared to Load Software! (And have lots of disk space)&lt;br /&gt;&lt;br /&gt;I have loaded the following software packages on my computer, to make this work:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;VMware Player 3.0 (from &lt;a href="https://www.vmware.com/tryvmware/?p=player&amp;amp;lp=1"&gt;VMware&lt;/a&gt;).  This is free software that allows any computer to run a virtual desktop in another operating system.&lt;/li&gt;&lt;li&gt;Hadoop VM Appliance (from &lt;a href="http://developer.yahoo.com/hadoop/tutorial/module3.html"&gt;Yahoo&lt;/a&gt;).  This has the information for running a Hadoop virtual machine.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Eclipse SDK, version 3.3.2 (from the &lt;a href="http://archive.eclipse.org/eclipse/downloads/"&gt;Eclipse project archives&lt;/a&gt;).  Note that the Tutorial suggests version 3.3.1.  Version 3.3.2 works.  
Anything more recent is called "Ganymede", and it just doesn't work with Hadoop.&lt;/li&gt;&lt;li&gt;Java Development Kit from Sun.&lt;/li&gt;&lt;/ul&gt;These are all multi-hundred megabyte zipped files.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;(5) Getting VMWare to Share Folders between the host and virtual machine&lt;br /&gt;&lt;br /&gt;In order to share folders (the easiest way to pass data back and forth), you need to install VMWare Tools.  VMWare has instructions &lt;a href="http://www.vmware.com/support/gsx3/doc/tools_install_lin_gsx.html"&gt;here&lt;/a&gt;.   However, it took me  a while to figure out what to do.  The real steps are:&lt;br /&gt;&lt;br /&gt;(a) Under VM--&gt;Settings go to the Hardware tab.  Set the CD/DVD to be "connected" and use the ISO image file.  My path is "C:\Program Files (x86)\VMware\VMware Player\linux.iso", but I believe that it appeared automatically, after a pause.  This makes the virtual machine think it is reading from a CD when it is really reading from this file.&lt;br /&gt;&lt;br /&gt;(b) Run the commands to mount and extract files&lt;br /&gt;Login or su to root and then run the following:&lt;br /&gt;&lt;tt&gt;&lt;tt&gt;mount /cdrom&lt;br /&gt;cd /tmp&lt;br /&gt;tar zxf /cdrom/vmware-freebsd-tools.tar.gz&lt;br /&gt;umount /cdrom&lt;br /&gt;&lt;/tt&gt;&lt;/tt&gt;&lt;br /&gt;(c) Do the installation&lt;br /&gt;I accepted all the defaults when it asked a question.&lt;br /&gt;&lt;br /&gt;&lt;tt&gt;&lt;tt&gt;cd vmware-tools-distrib&lt;br /&gt;./vmware-install.pl&lt;br /&gt;vmware-config-tools.pl&lt;br /&gt;&lt;/tt&gt;&lt;/tt&gt;&lt;br /&gt;(d) Set up a shared folder&lt;br /&gt;&lt;br /&gt;Go to VM --&gt; Settings... and go to the Options tab.  Enable the shared folders.  
Choose an appropriate folder on the host machine (I chose "c:\temp") and give it a name on the virtual machine ("temp").&lt;br /&gt;&lt;br /&gt;(e) The folder is available on the virtual machine at /mnt/hgfs/temp.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;(6)  Finding the Hadoop Plug-In for Eclipse&lt;br /&gt;&lt;br /&gt;The Tutorial has the curious instructions in the &lt;a href="http://developer.yahoo.com/hadoop/tutorial/module3.html"&gt;third part&lt;/a&gt; "In the &lt;tt&gt;hadoop-0.18.0/contrib/eclipse-plugin&lt;/tt&gt; directory on this CD, you will find a file named &lt;tt&gt;hadoop-0.18.0-eclipse-plugin.jar&lt;/tt&gt;.   Copy this into the &lt;tt&gt;plugins/&lt;/tt&gt; subdirectory of wherever you unzipped Eclipse."  These are curious because there is no CD.&lt;br /&gt;&lt;br /&gt;You will find this directory on the Virtual Machine.  Go to it.  Copy the jar file to the shared folder.  Then go to the host machine and copy it to the described place.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;(7) The code examples in the Tutorial may not work&lt;br /&gt;&lt;br /&gt;I believe the following lines in the Tutorial's driver code are not correct:&lt;br /&gt;&lt;pre class="codeblue"&gt;   FileInputPath.addInputPath(conf, new Path("input"));&lt;br /&gt;  FileOutputPath.addOutputPath(conf, new Path("output"));&lt;/pre&gt;&lt;br /&gt;The following works:&lt;br /&gt;&lt;pre class="codeblue"&gt;   conf.setInputPath(new Path("input"));&lt;br /&gt;  conf.setOutputPath(new Path("output"));&lt;/pre&gt;&lt;br /&gt;(8) Thinking the Hadoop Way&lt;br /&gt;&lt;br /&gt;Hadoop is a different style of programming, because it is distributed.  Actually, there are three different "machines" that it uses.&lt;br /&gt;&lt;br /&gt;The first is your host machine, which is where you develop code.  Although you can develop software on the Hadoop machine, the tools are better on your desktop.&lt;br /&gt;&lt;br /&gt;The second is the Hadoop machine, which is where you can issue the commands related to Hadoop. 
 In particular, the command "hadoop" provides access to the parallel data.  This machine has a shared drive with the host machine, which you can use to move files back and forth.&lt;br /&gt;&lt;br /&gt;The data used by the programs, though, is in a different place, the Hadoop Distributed File System (HDFS).  To move data between HDFS and the virtual machine, use the "hadoop fs" command.  Using the shared folder you can move it to the local machine or anywhere.&lt;br /&gt;&lt;br /&gt;--gordon&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-1377158198637714684?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/1377158198637714684/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/11/getting-started-with-hadoop-and.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1377158198637714684'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1377158198637714684'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/11/getting-started-with-hadoop-and.html' title='Getting Started with Hadoop and MapReduce'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-7557699377639917151</id><published>2009-11-13T12:26:00.004-05:00</published><updated>2009-11-13T12:56:57.806-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='Assocation Rules'/><category scheme='http://www.blogger.com/atom/ns#' term='SQL'/><title type='text'>From Item Sets to Association Rules Using Chi-Square</title><content type='html'>In &lt;a style="font-style: italic;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;, I introduce the chi-square metric for evaluating association rules.  This posting extends that discussion, to explain how to use the chi-square metric for generating rules.&lt;br /&gt;&lt;br /&gt;An association rule is of the form:  left-hand-side --&gt; right-hand-side (or, alternatively, LHS --&gt; RHS).  The left-hand side consists of one or more items, and the right-hand side consists of a single item.  A typical example of an association rule is "graham crackers plus chocolate bars implies marshmallows", which many readers will recognize as the recipe for a childhood delight called s'mores.&lt;br /&gt;&lt;br /&gt;Association rules are not only useful for retail analysis.  They are also useful for web analysis, where we are trying to track the parts of a web page where people go.  I have also seen them used in financial services and direct marketing.&lt;br /&gt;&lt;br /&gt;The key to understanding how the chi-square metric fits in is to put the data into a contingency table.  
For this discussion, let's consider that we have a rule of the form LHS --&gt; RHS, where each side consists of one item.   In the following table, the numbers A, B, C, D represent counts:&lt;br /&gt;&lt;br /&gt; &lt;table style="border-collapse: collapse; width: 273pt;" border="0" cellpadding="0" cellspacing="0" width="364"&gt;&lt;col style="width: 48pt;" width="64"&gt;  &lt;col style="width: 77pt;" width="103"&gt;  &lt;col style="width: 77pt;" width="102"&gt;  &lt;col style="width: 71pt;" width="95"&gt;  &lt;tbody&gt;&lt;tr style="height: 15pt;" height="20"&gt;   &lt;td style="height: 15pt; width: 48pt;" width="64" height="20"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 77pt;" width="103"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 77pt;" width="102"&gt;RHS-present&lt;/td&gt;   &lt;td style="width: 71pt;" width="95"&gt;RHS-absent&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 15pt;" height="20"&gt;   &lt;td style="height: 15pt;" height="20"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;LHS-present&lt;/td&gt;   &lt;td&gt;A&lt;/td&gt;   &lt;td&gt;B&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 15pt;" height="20"&gt;   &lt;td style="height: 15pt;" height="20"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;LHS-absent&lt;/td&gt;   &lt;td&gt;C&lt;/td&gt;   &lt;td&gt;D&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;A is the count where the LHS and RHS items are both present.  B is the count where the LHS item is present but the RHS item is not, C is the count where only the RHS item is present, and D is the count where neither is present.  Different rules have different contingency tables.  We choose the best one using the chi-square metric (described in Chapter 3 of the above book).  This tells us how unusual these counts are.  In other words, the chi-square metric measures how unlikely it is that the observed counts are due to a random split of the data.&lt;br /&gt;&lt;br /&gt;Once we get a contingency table, though, we still do not have a rule.  
A contingency table really has four different rules:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;LHS --&gt; RHS&lt;/li&gt;&lt;li&gt;LHS --&gt; not RHS&lt;/li&gt;&lt;li&gt;not LHS --&gt; RHS&lt;/li&gt;&lt;li&gt;not LHS --&gt; not RHS&lt;/li&gt;&lt;/ul&gt;(Or, another way of saying this is that these rules all generate the same contingency table.) How can we choose which rule is the best one?&lt;br /&gt;&lt;br /&gt;In this case, we'll choose the rule based on how much better it does than just guessing.  This is called the lift or improvement for a rule.  So, the rule LHS --&gt; RHS is correct for A/(A+B) of the records:  the LHS is true for A+B records, and for A of these, the rule is true.&lt;br /&gt;&lt;br /&gt;Overall, simply guessing that RHS is true would be correct for (A+C)/(A+B+C+D) of the records.  The ratio of these is the lift for the rule.  A lift greater than 1 indicates that the rule does better than guessing; a lift less than 1 indicates that guessing is better.&lt;br /&gt;&lt;br /&gt;The following are the ratios for the four possible rules in the table, in the same order as the list above (rules that predict "not RHS" are compared against guessing "not RHS", which is correct for (B+D)/(A+B+C+D) of the records):&lt;br /&gt;&lt;ul&gt;&lt;li&gt;(A/(A+B))/((A+C)/(A+B+C+D))&lt;/li&gt;&lt;li&gt;(B/(A+B))/((B+D)/(A+B+C+D))&lt;/li&gt;&lt;li&gt;(C/(C+D))/((A+C)/(A+B+C+D))&lt;/li&gt;&lt;li&gt;(D/(C+D))/((B+D)/(A+B+C+D))&lt;/li&gt;&lt;/ul&gt;  When choosing among these, choose the one with the highest lift.&lt;br /&gt;&lt;br /&gt;The process for choosing rules is to choose the item sets based on the highest chi-square value, and then to choose the rule with the best lift.&lt;br /&gt;&lt;br /&gt;This works well for rules with a single item on each side.  What do we do for more complicated rules, particularly ones with more items in the left hand side?  One method would be to extend the chi-square test to multiple dimensions.  I am not a fan of the multidimensional chi-square test, as I've explained in another blog.&lt;br /&gt;&lt;br /&gt;In this case, we just consider rules with a single item on the RHS.  
So, if an item set has four items, a, b, c, and d, then we would consider only the rules:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;a+b+c --&gt; d&lt;/li&gt;&lt;li&gt;a+b+d --&gt; c&lt;/li&gt;&lt;li&gt;a+c+d --&gt; b&lt;/li&gt;&lt;li&gt;b+c+d --&gt; a&lt;/li&gt;&lt;/ul&gt;We are ignoring possibilities such as a+b--&gt;c+d.&lt;br /&gt;&lt;br /&gt;Each of these rules can now be evaluated using the chi-square metric, and then the best rule chosen using the lift of the rule.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-7557699377639917151?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/7557699377639917151/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/11/from-item-sets-to-association-rules.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/7557699377639917151'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/7557699377639917151'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/11/from-item-sets-to-association-rules.html' title='From Item Sets to Association Rules Using Chi-Square'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-3946394129837158097</id><published>2009-11-06T14:05:00.002-05:00</published><updated>2009-11-06T14:39:47.040-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='Ask a data miner'/><title type='text'>Oversampling in General</title><content type='html'>&lt;span style="font-style: italic;"&gt;Dear Data Miners,&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;I am trying to find out statistical reasons for balancing data sets when building models with binary targets, and nobody is able to intelligently describe why it is being done. In fact, there are mixed opinions on sampling when the response rate is low.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Based on literature and data mining professional opinions, here are few versions (assume that the response rate is 1%):&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;1) As long as the number of responders is approximately equal or greater than 10 times the number variables included, no additional sampling is needed.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;2) Oversample or undersample (based on the total number of observations) at least until the response rate = 10%.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;3) Oversample or undersample (based on the total number of observations) until the response rate = 50%.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;4) Undersampling is useful only for cutting down on processing time; 
really no good reason to do it statistically as long as the number of observations for responders is "sufficient" (% does not matter).&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Having an advanced degree in mathematics but not being a statistician, I would like to understand whether there really is any statistical benefit in doing that.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;I appreciate your time answering this.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Sincerely,&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Your fellow data miner &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Many years ago, I was doing a churn model for SK Telecom (in South Korea) using SAS Enterprise Miner.  A friend of mine at SAS, Anne Milley, had suggested that having a 50% density for a binary response model would produce optimal models.  Her reasoning was that with a 50% density of each target value, the contrast between the two values would be maximized, making it easier to pick out patterns in the data.&lt;br /&gt;&lt;br /&gt;I spent some time testing decision trees with all sorts of different densities.  To my surprise, the decision trees with more than 30% density performed better than trees with lower densities, regardless of the splitting criterion and other factors.  This convinced me that 50% is not a bad idea.&lt;br /&gt;&lt;br /&gt;There is a reason why decision trees perform better on balanced samples.  The standard pruning algorithm for decision trees uses &lt;span style="font-style: italic;"&gt;classification&lt;/span&gt; as the metric for choosing subtrees.  That is, a leaf chooses its dominant class -- the one in excess of 50% for two classes.  This works best when the classes are evenly distributed in the data.  
(Why data mining software implementing trees doesn't take the original density into account is beyond me.)&lt;br /&gt;&lt;br /&gt;In addition, the splitting criteria may be more sensitive to deviations around 50% than around other values.&lt;br /&gt;&lt;br /&gt;Standard statistical techniques are largely insensitive to the original density of the data.  So, a logistic regression run on oversampled data should produce essentially the same model as on the original data (the intercept shifts to reflect the new density, but the other coefficients should be about the same).  It turns out that the confidence intervals on the coefficients do vary, but the model remains basically the same.&lt;br /&gt;&lt;br /&gt;Hmmm, as I think about it, I wonder if the oversampling rate would affect stepwise or forward selection of variables.  I could imagine that, when testing each variable, the variance in results using a rare target would be larger than the variance using a balanced model set.  This, in turn, might lead to a poorer choice of variables.  But I don't know if this is the case.&lt;br /&gt;&lt;br /&gt;For neural networks, the situation is more complicated.  Oversampling does not necessarily improve the neural network -- there is no theoretical reason why it should.  However, it does allow the network to run on a smaller set of data, which makes convergence faster.  This, in turn, allows the modeler to experiment with different models.  Faster convergence is a benefit in other ways as well.&lt;br /&gt;&lt;br /&gt;Some other techniques such as k-means clustering and nearest neighbor approaches probably do benefit from oversampling.  However, I have not investigated these situations in detail.&lt;br /&gt;&lt;br /&gt;Because I am quite fond of decision trees, I prefer a simple rule, such as "oversample to 50%", since this works under the maximum number of circumstances.&lt;br /&gt;&lt;br /&gt;In response to your specific questions, I don't think that 10% is a sufficient density.  
If you are going to oversample, you might as well go to 50% -- there is at least an elegant reason why (the contrast idea between the two response values).  If you don't have enough data, then use weights instead of oversampling to get the same effect.&lt;br /&gt;&lt;br /&gt;In the end, though, if you have the data and you have the software, try out different oversampling rates and see what produces the best models!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-3946394129837158097?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/3946394129837158097/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/11/oversampling-in-general.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/3946394129837158097'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/3946394129837158097'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/11/oversampling-in-general.html' title='Oversampling in General'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-1846742851172318395</id><published>2009-11-04T11:20:00.007-05:00</published><updated>2009-11-04T14:32:11.394-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining'/><category scheme='http://www.blogger.com/atom/ns#' term='SQL'/><title type='text'>Scoring Association Rules</title><content type='html'>At M2009 (SAS's data mining conference), I was approached with the question of scoring association rules for customers.  This is not a topic I have thought about very much.  More typically, association rules are used qualitatively or to understand products.  I hadn't thought about assigning the "best" rule (or rules) back to customers.&lt;br /&gt;&lt;br /&gt;As a reminder, association rules provide information about items that are purchased at the same time.  For example, we might find that marshmallows and chocolate bars imply graham crackers.  The "marshmallows" and "chocolate bars" are the left hand side of the rule (LHS) and the graham crackers are the right hand side (RHS).  The presumption is that when a shopper's basket contains marshmallows and chocolate bars but no graham crackers, the graham crackers should be there.&lt;br /&gt;&lt;br /&gt;Most data mining software, such as SAS Enterprise Miner, SQL Server Data Mining, and SPSS Clementine, can be used to generate association rules. 
I prefer to calculate the rules myself using database technology, using code similar to that in &lt;a style="font-style: italic;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;However, data mining tools do not provide the ability to score association rules for individual customers.  Nor is this a topic that I discuss in my book.  My goal here is to discuss scoring rules in databases.  This is because scoring is computationally expensive.  Because databases can take advantage of indexing and parallel processing, they offer scope to make the scoring more efficient.&lt;br /&gt;&lt;br /&gt;Hmmm, what does scoring association rules even mean?  Scoring is the process of finding the best rule that a customer matches, either for a single RHS or for all possible RHSs.  In the former case, the result is one rule.  In the latter, it is an array of rules, one for each possible RHS.&lt;br /&gt;&lt;br /&gt;An association rule is traditionally defined by three metrics:  support, confidence, and lift (as well as a fourth, the chi-square metric, which I prefer).   For the purposes of this discussion, the best rule is the one with the highest confidence.&lt;br /&gt;&lt;br /&gt;The simplistic way of doing such scoring is by considering each rule for each customer, to determine which rules apply to each customer.   
From the set that do apply, do some work to find the best one.&lt;br /&gt;&lt;br /&gt;Imagine that we have a table, &lt;span style="font-style: italic;"&gt;rules&lt;/span&gt;, with the following columns:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The number of LHS items (we assume there is 1 RHS item);&lt;/li&gt;&lt;li&gt;The RHS item.&lt;/li&gt;&lt;li&gt;The LHS items, as a string:  "item1;item2;..."&lt;/li&gt;&lt;/ul&gt;There is another table, &lt;span style="font-style: italic;"&gt;custitem&lt;/span&gt;, containing each customer and each item as a separate row.&lt;br /&gt;&lt;br /&gt;The following query find all matching rules for each customer in the innermost subquery, by counting the number of items matched on the left hand side.  The outer query then finds the rule (for each RHS) that has the maximum confidence, using SQL window functions.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;SELECT cr.*&lt;br /&gt;FROM (SELECT customerid, r.rhs, r.ruleid,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;. 
....&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.......&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;(MAX(r.confidence) OVER (PARTITION BY customerid, rhs)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.......&lt;/span&gt;) as maxconfidence&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;FROM (SELECT&lt;/span&gt;&lt;span style="font-family:courier new;"&gt; ci.customerid, r.rhs, r.ruleid,&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;..&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;...........&lt;/span&gt;COUNT(*) as nummatches, &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;..&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;FROM custitem ci CROSS JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;..&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 
255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;rules r&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;WHERE CHARINDEX(ci.item||';', r.lhs) &gt; 0&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;..&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;GROUP BY ci.customerid, r.rhs, r.ruleid&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;...&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;...&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;HAVING COUNT(*) = MAX(r.numlhs)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;...&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 
255);"&gt;..&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;) matchrules JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;..&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;...&lt;span style="color: rgb(0, 0, 0);"&gt;rules r&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;...&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;..&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;ON matchrules.ruleid = r.ruleid&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;span style="color: rgb(0, 0, 0);"&gt;) cr&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;WHERE confidence = maxconfidence&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This query is expensive, as you might guess from the use of &lt;span style="font-family: courier new;"&gt;CROSS JOIN&lt;/span&gt;.  Its running time grows as the number of rules grows (and presumably the number of customers is larger still).&lt;br /&gt;&lt;br /&gt;It is possible to make it more efficient with tricks such as:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;If there are only a few items, then the LHS could be encoded using bits.  
This eliminates the need for string matching.&lt;/li&gt;&lt;li&gt;The rules can be pruned, so only the rules with the highest confidence are kept.&lt;/li&gt;&lt;/ul&gt;Although this cannot be done in SQL, the rules could also be ordered by confidence (for each RHS) from highest to lowest.  The first match would then stop the search.&lt;br /&gt;&lt;br /&gt;An alternative method requires storing the rules in two tables.  The first is &lt;span style="font-style: italic;"&gt;rules&lt;/span&gt;, containing descriptive information about each rule, such as:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;ruleid;&lt;/li&gt;&lt;li&gt;rhs; and,&lt;/li&gt;&lt;li&gt;numlhs.&lt;/li&gt;&lt;/ul&gt;The second is &lt;span style="font-style: italic;"&gt;ruleitem&lt;/span&gt;, which contains one row for each LHS item in each rule.  Incidentally, this is more in keeping with the spirit of normalization in relational databases.&lt;br /&gt;&lt;br /&gt;The subquery for the scoring now uses an ordinary join on the item rather than string matching.  This is useful, because it means that we can use database mechanisms -- such as indexing and table partitioning -- to speed it up.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;SELECT cr.*&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;FROM (SELECT customerid, r.rhs, r.ruleid, r.confidence,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.............&lt;/span&gt;(MAX(r.confidence) OVER (PARTITION BY customerid, r.rhs)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.............&lt;/span&gt;) as maxconfidence&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM (SELECT ci.customerid, r.rhs, r.ruleid,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;...................&lt;/span&gt;COUNT(*) 
as nummatches, MAX(numlhs) as numlhs&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;............&lt;/span&gt;FROM custitem ci JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.................&lt;/span&gt;ruleitem ri&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.................&lt;/span&gt;ON ci.item = ri.item JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.................&lt;/span&gt;rules r&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.................&lt;/span&gt;ON ri.ruleid = r.ruleid&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;............&lt;/span&gt;GROUP BY ci.customerid, r.rhs, r.ruleid&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;............&lt;/span&gt;HAVING COUNT(*) = MAX(r.numlhs)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;...........&lt;/span&gt;) matchrules JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;...........&lt;/span&gt;rules r&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;...........&lt;/span&gt;ON matchrules.ruleid = r.ruleid&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;) cr&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;WHERE confidence = maxconfidence&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Such an 
approach makes a difference only when you need to score many customers and you have many rules.  This same approach can be used for looking at a single product in the RHS or at several at one time.  Of course, this would require summarizing the multiple products at the customer level in order to append the desired information on the customer record.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-1846742851172318395?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/1846742851172318395/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/11/scoring-association-rules.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1846742851172318395'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1846742851172318395'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/11/scoring-association-rules.html' title='Scoring Association Rules'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-2347030710379465299</id><published>2009-10-23T16:23:00.006-04:00</published><updated>2009-10-26T06:51:13.144-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='Web analytics'/><title type='text'>Counting Users From Unique Cookies</title><content type='html'>&lt;div&gt;Counting people/unique visitors/users at web sites is a challenge, and is something that I've been working on for the past couple of months for the web site of a large media company.   The goal is to count the number of distinct users over the course of a month.  Counting distinct cookies is easy; the challenge is turning these into human beings.  These challenges include:&lt;/div&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Cookie deletions. A user may manually delete their cookies one or more times during the month.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Disallowing first party cookies. A user may allow session cookies (while the browser is running), but not allow the cookies to be committed to disk.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Multiple browsers. A single user may use multiple browsers on the same machine during the month.  This is particularly true when the user upgrades his or her browser.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Multiple machines. A single user may use multiple machines during the month.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;And, I have to admit, that the data that I'm using has one more problem, which is probably not widespread.  The cookies are actually hashed into four bytes.  This means that it is theoretically possible for two "real" cookies to have the same hash value.  
Not only is this theoretically possible, it actually happens (although not too frequently).&lt;/p&gt;&lt;div&gt;I came across a very good &lt;a href="http://showmeanalytics.com/2009/04/calculating-the-effects-of-cookie-deletion/"&gt;blog post&lt;/a&gt; by Angie Brown that lays out the assumptions in making the calculation, including a spreadsheet for varying the assumptions.  One particularly interesting factoid from the post is that the number of cookies that appear only once during the month exceeds the number of unique visitors, even under quite reasonable assumptions.  Where I am working, one camp believes that the number of unique visitors is approximated by the number of unique cookies.&lt;/div&gt;&lt;br /&gt;&lt;div&gt; &lt;/div&gt;&lt;br /&gt;&lt;div&gt;A white paper by comScore states that the average user has 2.5 unique cookies per month due to cookie deletion.  The paper is &lt;a href="http://www.scribd.com/doc/20518487/Cookie-Deletion-White-Paper"&gt;here&lt;/a&gt;, and a PR note about it is &lt;a href="http://www.comscore.com/Press_Events/Press_Releases/2007/04/comScore_Cookie_Deletion_Report"&gt;here&lt;/a&gt;.  This paper is widely cited, although it has some serious methodological problems because its data sources are limited to DoubleClick and Yahoo!.&lt;/div&gt;&lt;br /&gt;&lt;div&gt; &lt;/div&gt;&lt;br /&gt;&lt;div&gt;In particular, Yahoo! is quite clear about its cookie expiration policies (&lt;a href="http://www.blogger.com/help.yahoo.com/l/us/yahoo/edit/id_password/edit-54.html"&gt;two weeks&lt;/a&gt; for users clicking the "keep me logged in for 2 weeks" box and &lt;a href="http://help.yahoo.com/l/us/yahoo/mail/original/settings/settings-11.html"&gt;eight hours&lt;/a&gt; for Yahoo! mail).   
I do not believe that this policy has changed significantly in the last few years, although I am not 100% sure.&lt;/div&gt;&lt;br /&gt;&lt;div&gt; &lt;/div&gt;&lt;br /&gt;&lt;div&gt;The white paper from comScore does not mention these facts, yet they imply that most of the cookies that a user has are due to automatic expiration, not user behavior.  How many distinct cookies does a user have, due only to the user's behavior?&lt;br /&gt;&lt;br /&gt;If I make the following assumptions:&lt;/div&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;The Yahoo! users have an average of 2.5 cookies per month.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;comScore used the main Yahoo! cookies, and not the Yahoo! mail cookies.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;All Yahoo! users use the site consistently throughout the month.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;All Yahoo! users have the "keep me logged in for 2 weeks" box checked.&lt;/li&gt;&lt;/ul&gt;Then I can estimate the number of cookies per user per machine per month.  The average user would have 31/14 = 2.2 cookies per month, strictly due to the automatic deletion.  This leaves 2.5 - 2.2 = 0.3 cookies per month due to manual deletion.  Adding the one cookie that the user starts with, the average number of cookies per month per user per machine is 1.3.&lt;p&gt;By the way, I find this number much more reasonable.  I also think that it misses the larger source of overcounting -- users who use more than one machine.  Unfortunately, there is no single approach.  
In the case that I'm working on, we have the advantage that a minority of users are registered, so we can use them as a sample.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;div&gt; &lt;/div&gt;&lt;br /&gt;&lt;div&gt; &lt;/div&gt;&lt;br /&gt;&lt;div&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-2347030710379465299?l=www.data-miners.com%2Fblog' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/2347030710379465299/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/10/counting-users-from-unique-cookies.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/2347030710379465299'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/2347030710379465299'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/10/counting-users-from-unique-cookies.html' title='Counting Users From Unique Cookies'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total>3</thr:total></entry></feed>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
