<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss'><id>tag:blogger.com,1999:blog-3366935554564939610</id><updated>2009-06-29T14:55:16.490-04:00</updated><title type='text'>Data Miners Blog</title><subtitle type='html'>A place to read about topics of interest to data miners, ask questions of the data mining experts at Data Miners, Inc., and discuss the books of Gordon Linoff and Michael Berry.</subtitle><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default?start-index=26&amp;max-results=25'/><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://www.data-miners.com/blog/atom.xml'/><author><name>Michael J. A. 
Berry</name><uri>http://www.blogger.com/profile/06077102677195066016</uri><email>noreply@blogger.com</email></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>56</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-7678077843544148924</id><published>2009-06-26T18:47:00.006-04:00</published><updated>2009-06-28T11:10:39.782-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='Ask a data miner'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining'/><title type='text'>When Customers Start and End</title><content type='html'>&lt;span style="font-style: italic;"&gt;In texts on credit scoring, some effort almost always goes into defining what is to be considered as a "bad" credit. The Basel framework provides rather a precise definition of what is to be considered a default.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-style: italic;"&gt;But I have rarely seen the same in predicting cross-sell, up-sell or churn. I do however, remember attending an SPSS conference where churn of pre-paid cards was discussed. Churn, in that case, was defined as a number of consecutive periods where the number of calls fell below a certain level.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-style: italic;"&gt;In the past, I've used start and end dates of contracts, as well as a simple increase (or decrease) in the number of products that a customer has over time as indicators of what to target.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-style: italic;"&gt;I'd be really interested in hearing how you define and extract targets, be it in telecom, banking, cards or any other business where you use prediction. 
For instance, how would you go looking for customers that have churned? Or for that matter, customers where up-sell has been successful?&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-style: italic;"&gt;This may be too simple a question, but if there are standard methods that you use, I'd be really interested in learning about them.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;--Ola&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Ola,&lt;br /&gt;&lt;br /&gt;This is not a simple question at all.  Or rather, the simplest questions are often the most illuminating.&lt;br /&gt;&lt;br /&gt;The place where I see the biggest issues in defining starts and stops is in survival data mining (obligatory plug for my book &lt;a style="font-style: italic;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;, which has two chapters on the subject).  For the start date, I try to use (or approximate as closely as possible) the date when two things have occurred:  the company has agreed to provide a product or service, and the customer has agreed to pay for it.  In the case of post-pay telecoms, this would be the activation date -- and there are similar dates in many other industries, as varied as credit cards, cable subscriptions, and health insurance.&lt;br /&gt;&lt;br /&gt;The activation date is often well-defined because the number of active customers gets reported through some system tied to the financial systems.  Even so, there are anomalies.  I recently completed a project at a large newspaper, and used their service start date as the activation date.  Alas, at times, customers with start dates did not actually receive the paper on that date -- often because the newspaper delivery person could not find the address.&lt;br /&gt;&lt;br /&gt;The stop date is even more fraught with complication, because there are a variety of different dates to choose from.  
For voluntary churn, there is the date the customer requests termination of the service.  There is also the date when the service is actually turned off.  Which to use?  It depends on the application.  To count active customers, we want the service cut-off date.  To plan for customer retention efforts, we want to know when they call in.&lt;br /&gt;&lt;br /&gt;Involuntary churn is also complicated, because there are a series of steps, often called the Dunning Process, which keeps track of customers who do not pay.  At what point does a non-paying customer stop?  When the service stops?  When the bill is written off or settled?  At some arbitrary point, such as 60 or 90 days of non-payment?  To further confuse the situation, the business may change its rules over time.  So, during some periods of time or for some customers, 60 days of non-payment results in service cutoff.  For other periods or customers, 90 days might be the rule.&lt;br /&gt;&lt;br /&gt;Often, I find multiple time-to-event problems in this scenario.  How long does it take a non-paying customer to stop, if ever?  How long after customers sign up do they begin?&lt;br /&gt;&lt;br /&gt;In your particular case, the contract start date is probably a good place to start.  
However, the contract end date might or might not be appropriate, since this might not be updated to reflect when a customer actually stops.&lt;br /&gt;&lt;br /&gt;--gordon&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-7678077843544148924?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/7678077843544148924/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/06/when-customers-start-and-end.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/7678077843544148924'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/7678077843544148924'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/06/when-customers-start-and-end.html' title='When Customers Start and End'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-3969686024557949362</id><published>2009-06-08T17:30:00.002-04:00</published><updated>2009-06-08T17:41:32.784-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='Ask a data miner'/><category scheme='http://www.blogger.com/atom/ns#' term='Michael'/><title type='text'>Confidence in Logistic Regression Coefficients</title><content type='html'>&lt;span style="font-style: italic;"&gt;I work in the marketing team of a telecom company and I recently encountered an annoying problem with an upsell model. Since the monthly sale rate is less than 1% of our customer base, I used oversampling as you mentioned in your book ‘Mastering data mining’ with data over the last 3 sales months so that I had a ratio of about 15% buyers and 85% non-buyers (sample size of about 20K). Using alpha=5%, I got parameter estimates which were from a business perspective entirely explicable. However, when I then re-estimated the model on the total customer base to obtain the ‘true’ parameter estimates which I will use for my monthly scoring two effects were suddenly insignificant at alpha=5%.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;I never encountered this and was wondering what to do with these effects: should I kick them out of the model or not ?  I decided to keep them in since they did have some business meaning and concluded that they must have become insignificant since it is only a micro-segment in your entire population. 
&lt;/span&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;To your opinion, did I interpret this correctly ?  . . .&lt;/span&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Many thanks in advance for your advice,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Wendy&lt;/span&gt;&lt;br /&gt;&lt;div class="gA gt"&gt;&lt;div class="gB"&gt;&lt;table style="font-style: italic;" class="cf gz" cellpadding="0"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;div class="cKWzSc mD" idlink="" tabindex="0" role="button"&gt;&lt;img class="mL" src="http://mail.google.com/mail/images/cleardot.gif" alt="" /&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt;Michael responds:&lt;br /&gt;&lt;br /&gt;Hi Wendy,&lt;br /&gt;&lt;br /&gt;This question has come up on the blog &lt;a href="http://www.data-miners.com/blog/2008/05/adjusting-for-oversampling.html" target="_blank"&gt;before&lt;/a&gt;.  The short answer is that with a logistic regression model trained at one concentration of responders, it is a bit tricky to adjust the model to reflect the actual probability of response on the true population.  I suggest you look at &lt;a href="http://gking.harvard.edu/projects/rareevents.shtml" target="_blank"&gt;some papers by Gary King on this topic&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Gordon responds:&lt;br /&gt;&lt;br /&gt;Wendy, I am not sure that Prof. King deals directly with your issue, of changing confidence in the coefficients estimates.  To be honest, I have never considered this issue.  Since you bring it up, though, I am not surprised that it may happen.&lt;br /&gt;&lt;br /&gt;My first comment is that the results seem usable, since they are explainable.  Sometimes statistical modeling stumbles on relationships in the data that make sense, although they may not be fully statistically significant.  Similarly, some relationships may be statistically significant, but have no meaning in the real world.  
So, use the variables!&lt;br /&gt;&lt;br /&gt;Second, if I do a regression on a set of data, and then duplicate the data (to make it twice as big) and run it again, I'll get the same estimates as on the original data.  However, the confidence in the coefficients will increase.  I suspect that something similar is happening on your data.&lt;br /&gt;&lt;br /&gt;If you want to fix that particular problem, then use a tool (such as SAS Enterprise Miner and probably proc logistic) that supports a frequency option on each row.  Set the frequency to one for the rarer events and to an appropriate value less than one for the more common events.  I do this as a matter of habit, because it works best for decision trees.  You have pointed out that the confidence in the coefficients is also affected by the frequencies, so this is a good habit with regressions as well.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-3969686024557949362?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/3969686024557949362/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/06/confidence-in-logistic-regression.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/3969686024557949362'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/3969686024557949362'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/06/confidence-in-logistic-regression.html' title='Confidence in Logistic Regression Coefficients'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-1885196285450483213</id><published>2009-05-10T17:31:00.003-04:00</published><updated>2009-05-10T18:27:39.331-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='In The News'/><title type='text'>Not Enough Data</title><content type='html'>An &lt;a href="http://www.nytimes.com/2009/05/09/business/09charts.html"&gt;article&lt;/a&gt; in yesterday's New York Times reminded me of "bad" examples of data mining.  By bad examples, I mean that spurious correlations are given credence -- enough credence to make it into a well-reputed national newspaper.&lt;br /&gt;&lt;br /&gt;The article, entitled "Eat Quickly, for the Economy's State" is about a leisure time report from the OECD that shows a correlation between the following two variables:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Change in real GNP in 2008; and,&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Amount of time people spend eating and drinking in a given day.&lt;/li&gt;&lt;/ul&gt;The study is based on surveys from 17 countries (for more information on the survey, you can check &lt;a href="http://www.sourceoecd.org/pdf/societyataglance2009/812009011e-02.pdf"&gt;this&lt;/a&gt; out).&lt;br /&gt;&lt;br /&gt;The highlight is a few charts that show that countries such as Mexico, Canada, and the United States have the lowest time spent eating (under 75 minutes per day) versus countries such as New Zealand, France, and Japan (over 110 minutes per day).  
The first group of countries have higher growth rates, both in 2008 and for the past few years.&lt;br /&gt;&lt;br /&gt;My first problem with the analysis is one of granularity.  Leisure time is measured per person, but GNP is measured over everyone.  One big component of GNP growth is population growth, and different countries have very different patterns of population growth.  The correct measure would be per capita GNP.  Taking this into account would dampen the GNP growth figures for growing countries such as Mexico and the United States, and increase the GNP growth figures for slower-growing (or shrinking) countries such as Italy, Germany, and Japan.&lt;br /&gt;&lt;br /&gt;Also, the countries where people eat more leisurely have other characteristics in common.  In particular, they tend to have older populations and lower (or even negative) rates of population growth.  One wonders if speed eating is a characteristic of younger people and leisurely eating is a characteristic of older people.&lt;br /&gt;&lt;br /&gt;The biggest problem, though, is that this is, in all likelihood, a spurious correlation.  One of the original definitions of data mining, which may still be used in the economics and political world, is a negative one:  data mining is looking for data to support a conclusion.  The OECD surveys were done in 17 different countries.  The specific result in the NYT article is "Countries in which people eat and drink less than 100 minutes per day grow 0.9% faster -- on average -- than countries in which people eat and drink more than 100 minutes per day".&lt;br /&gt;&lt;br /&gt;In other words, the 17 countries were divided into two groups, and the growth rates were then measured for each group.  Let's look at this in more detail.&lt;br /&gt;&lt;br /&gt;How many ways are there to divide 17 countries into 2 groups?  The answer is 2^17 = 131,072 different ways (any particular country could be in either group).  
So, if we had 131,072 yes-or-no survey questions, then we would expect any combination to arise, including the combinations where all the high growth countries are in one group and all the low growth countries in the other.   (I admit the exact figure is a bit more than 131,072 but that is unimportant to illustrate my point.)&lt;br /&gt;&lt;br /&gt;The situation actually gets worse.  The results are not yes-or-no; they are numeric measurements which are then used to split the countries into two groups.  The splits could be at any value of the measure.  So, any given measurement results in 17-1=16 different possible splits (the first group having the country with the lowest measurement, with the two lowest, and so on).  Now we only need about 8,192 uncorrelated measurements to get all possibilities.&lt;br /&gt;&lt;br /&gt;However, we do not need all possibilities.  A glance at the NYT article shows that the country with the worst 2008 growth is Poland, yet it is in the fast-eating group.  And Spain -- in the slow eating group -- is the third fastest growing economy (okay, its GNP actually shrank but less than most others).  So, we only need an approximation of a split, where the two groups look different.  And then, voila! we get a news article.&lt;br /&gt;&lt;br /&gt;The problem is that the OECD was able to measure dozens or hundreds of different things in their survey.  My guess is that measures such as "weekly hours of work in main job," "time spent retired,"  and "time spent sleeping" -- just a few of the many possibilities -- did not result in interesting splits.  Eventually, though, a measure such as "time spent eating and drinking" results in a split where the different groups look "statistically significant" but they probably are not.  If the measure is interesting enough, then it can become an article in the New York Times.&lt;br /&gt;&lt;br /&gt;This is probably a problem with statistical significance.   
The challenge is that a p-value of 0.01 means that something has only a 1% chance of happening at random.  However, if we look at 100 different measures, then there is a really, really good chance (about 63%, if the measures are independent) that at least one of them will have a p-value of 0.01 or less.  By the way, there is a statistical adjustment called the Bonferroni correction to take this into account (this as well as others are described on &lt;a href="http://en.wikipedia.org/wiki/Multiple_comparisons"&gt;Wikipedia&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;Fortunately, neither the OECD nor the New York Times talk about this discovery as an example of data mining.  It is just poor data analysis, but poor data analysis that can reinforce lessons in good data analysis.  Lately, I have been noticing more examples of articles such as this, where researchers -- or perhaps just journalists -- extrapolate from very small samples to make unsupported conclusions.   These are particularly grating when they appear in respected newspapers, magazines, and journals.&lt;br /&gt;&lt;br /&gt;Data mining is not about finding spurious correlations and claiming some great discovery.  It is about extracting valuable information from large quantities of data, information that is stable and useful.  Smaller amounts of data often contain many correlations.  Often, these correlations are going to be spurious.  
And without further testing, or at least a mechanism to explain the correlation, the results should not be mentioned at all.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-1885196285450483213?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='related' href='http://www.nytimes.com/2009/05/09/business/09charts.html' title='Not Enough Data'/><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/1885196285450483213/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/05/not-enough-data.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1885196285450483213'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1885196285450483213'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/05/not-enough-data.html' title='Not Enough Data'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-5172790987962846050</id><published>2009-04-25T11:18:00.003-04:00</published><updated>2009-04-25T11:48:10.792-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='Ask a data miner'/><category scheme='http://www.blogger.com/atom/ns#' term='user question'/><title type='text'>When There Is Not Enough Data</title><content type='html'>&lt;span style="font-style: italic;"&gt; I have a dataset where the target (continuous variable) variable that has to be estimated. However, in the given dataset, values for target are preset only for 2% while rest of 98% do not have values. The 98% are empty values. I need to score a dataset and give values for the target for all 2500 records. Can I use the 2% and replicate it several times and use that dataset to build a model? The ASE is too high if I use the 2% data alone.  Any suggestions how to handle it, please?&lt;br /&gt;&lt;/span&gt;&lt;div style="font-style: italic;"&gt;Thanks,&lt;/div&gt; &lt;div style="font-style: italic;"&gt;Sneha&lt;/div&gt;&lt;br /&gt;Sneha,&lt;br /&gt;&lt;br /&gt;The short answer to your question is "Yes, you can replicate the 2% and use it to build a model."  BUT DO NOT DO THIS!  Just because a tool or technique is possible to implement does not mean that it is a good idea.  
Replicating observations "confuses" models, often by making the model appear overconfident in its results.&lt;br /&gt;&lt;br /&gt;Given the way that ASE (average squared error) is calculated, I don't think that replicating data is going to change the value.  We can imagine adding a weight or frequency on each observation instead of replicating them.  When the weights are all the same, they cancel out in the ASE formula.&lt;br /&gt;&lt;br /&gt;What does change is confidence in the model.  So, if you are doing a regression and looking at the regression coefficients, each has a confidence interval.  By replicating the data, the resulting model would have smaller confidence intervals.  However, these are false, because the replicated data has no more information than the original data.&lt;br /&gt;&lt;br /&gt;The problem that you are facing is that the modeling technique you are using is simply not powerful enough to represent the 50 observations that you have.   Perhaps a different modeling technique would work better, although you are working with a small amount of data.  For instance, perhaps some sort of nearest neighbor approach would work well and be easy to implement.&lt;br /&gt;&lt;span style="font-style: italic;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;div&gt;You do not say why you are using ASE (average squared error) as the preferred measure of model fitness.  I can speculate that you are trying to predict a number, perhaps using a regression.  One challenge is that the numbers being predicted often fall into a particular range (such as positive numbers for dollar values or ranging between 0 and 1 for a percentage).  However, regressions produce numbers that run the gamut of values.  
In this case, transforming the target variable can sometimes improve results.&lt;br /&gt;&lt;br /&gt;In our class on data mining (&lt;a style="font-weight: bold; font-style: italic;" href="http://www.sas.com/apps/wtraining2/coursedetails.jsp?course_code=bdmt53&amp;amp;ctry=us"&gt;Data Mining Techniques:  Theory and Practice&lt;/a&gt;), Michael and I introduce the idea of oversampling rare data using weights in order to get a balanced model set.  For instance, if you were predicting whether someone was in the 2% group, you might give each of them a weight of 49 and all the unknowns a weight of 1.  The result would be a balanced model set.  However, we strongly advise that the maximum weight be 1.  So, the weights would be 1/49 for the common cases and 1 for the rare ones.   For regressions, this is important because it prevents any coefficients from having too-narrow confidence intervals.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt; &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;span style="color:#888888;"&gt;&lt;br /&gt;       &lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-5172790987962846050?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/5172790987962846050/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/04/when-there-is-not-enough-data.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/5172790987962846050'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/5172790987962846050'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/04/when-there-is-not-enough-data.html' title='When 
There Is Not Enough Data'/><author><name>Gordon S. Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-3058657519913023422</id><published>2009-04-13T15:26:00.004-04:00</published><updated>2009-04-13T15:46:31.122-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='forecasting'/><category scheme='http://www.blogger.com/atom/ns#' term='Michael'/><title type='text'>Customer-Centric Forecasting White Paper Available</title><content type='html'>In our consulting practice, we work with many subscription-based businesses including newspapers, mobile phone companies, and software-as-a-service providers. All of these companies need to forecast future subscriber levels. With production support from SAS, I have recently written a white paper describing our approach to creating such forecasts. Very briefly, the central idea is that the subscriber population is a constantly changing mix of customer segments based on geography, acquisition channel, product mix, subscription type, payment type, demographic characteristics, and the like. Each of these segments has a different survival curve. Overall subscriber numbers come from aggregating planned additions and forecast losses at the segment level. Managers can simulate the effects of alternative acquisition strategies by changing assumptions about the characteristics of future subscribers and watching how the forecast changes. The paper is available on &lt;a href="http://www.data-miners.com"&gt;our web site&lt;/a&gt;. 
I will also be presenting a keynote talk on customer-centric forecasting on July 1st at the&lt;a href="http://www.sas.com/events/aconf/index.html"&gt; A2009 conference&lt;/a&gt; in Copenhagen.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-3058657519913023422?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='related' href='http://www.data-miners.com/fsa.htm' title='Customer-Centric Forecasting White Paper Available'/><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/3058657519913023422/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/04/customer-centric-forecasting-white.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/3058657519913023422'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/3058657519913023422'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/04/customer-centric-forecasting-white.html' title='Customer-Centric Forecasting White Paper Available'/><author><name>Michael J. A. 
Berry</name><uri>http://www.blogger.com/profile/06077102677195066016</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='14679622169454737233'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-5339788888139258162</id><published>2009-04-10T11:41:00.004-04:00</published><updated>2009-04-13T10:50:38.053-04:00</updated><title type='text'>Rexer Analytics Data Mining Survey</title><content type='html'>Karl Rexer of Rexer Analytics asked us to alert our readers that their annual survey of data miners is ongoing and will be available for a few more days. Click on the title to be taken to the survey page.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-5339788888139258162?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='related' href='http://www.rexeranalytics.com/Data-Miner-Survey-Intro2.html' title='Rexer Analytics Data Mining Survey'/><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/5339788888139258162/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/04/rexer-analytics-data-mining-survey.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/5339788888139258162'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/5339788888139258162'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/04/rexer-analytics-data-mining-survey.html' title='Rexer Analytics Data Mining Survey'/><author><name>Michael J. A. 
Berry</name><uri>http://www.blogger.com/profile/06077102677195066016</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='14679622169454737233'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-2736857315708988431</id><published>2009-04-08T14:09:00.004-04:00</published><updated>2009-04-08T16:34:10.290-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MapReduce'/><category scheme='http://www.blogger.com/atom/ns#' term='Michael'/><title type='text'>MapReduce, Hadoop, Everything Old Is New Again</title><content type='html'>One of the pleasures of aging is watching younger generations discover pleasures one remembers discovering long ago--sex, porcini, the Beatles. Occasionally, though, it is frustrating to see old ideas rediscovered as new ones. I am especially prone to that sort of frustration when the new idea is one I once championed unsuccessfully. Recently, I've been feeling as though I was always a Beatles fan but until recently all my friends preferred Herman's Hermits. Of course, I'm glad to see them coming around to my point of view, but still . . .&lt;br /&gt;&lt;br /&gt;What brings these feelings up is all the excitement around MapReduce. It's nice to see a parallel programming paradigm that separates the description of the mapping from the description of the function to be applied, but at the same time, it seems a bit underwhelming. You see, I literally grew up with the parallel programming language &lt;a href="http://en.wikipedia.org/wiki/APL_%28programming_language%29"&gt;APL&lt;/a&gt;. In the late 60's and early 70's my father worked at IBM's Yorktown Heights research center in the group that developed APL, and I learned to program in that language at the age of 12. 
In 1982 I went to Analogic Corporation to work on an array processor implementation of APL. In 1986, while still at Analogic, I read &lt;a href="http://en.wikipedia.org/wiki/Danny_Hillis"&gt;Danny Hillis&lt;/a&gt;'s book &lt;a href="http://en.wikipedia.org/wiki/Connection_Machine"&gt;The Connection Machine&lt;/a&gt; and realized that he had designed the real APL Machine. I decided I wanted to work at the company that was building Danny's machine. I was hired by Guy Steele, who was then in charge of the software group at Thinking Machines. In the interview, all we talked about was APL. The more I learned about the Connection Machine's SIMD architecture, the more perfect a fit it seemed for APL or an APL-like language in which hypercubes of data may be partitioned into subcubes of any rank so that arbitrary functions can be applied to them.  In APL and its descendants such as &lt;a href="http://en.wikipedia.org/wiki/J_%28programming_language%29"&gt;J&lt;/a&gt;, reduction is just one of a rich family of ways that the results of applying a function to various data partitions can be glued together to form a result. I described this approach to parallel programming in a paper published in ACM SIGPLAN Notices in 1990, but as far as I know, no one ever read it. (You can, though. It is available &lt;a href="http://www.data-miners.com/mjab/adverbial-programming.pdf"&gt;here&lt;/a&gt;.) My dream of implementing APL on the Connection Machine gradually faded in the face of commercial reality. The early Connection Machine customers, having already been forced to learn Lisp, were not exactly clamouring for another esoteric language; they wanted Fortran. And Fortran is what I ended up working on.  As you can tell, I still have regrets. 
If we'd implemented a true parallel APL back then, no one would have to invent MapReduce today.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-2736857315708988431?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/2736857315708988431/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/04/mapreduce-hadoop-everything-old-is-new.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/2736857315708988431'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/2736857315708988431'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/04/mapreduce-hadoop-everything-old-is-new.html' title='MapReduce, Hadoop, Everything Old Is New Again'/><author><name>Michael J. A. 
Berry</name><uri>http://www.blogger.com/profile/06077102677195066016</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='14679622169454737233'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-2569556451691856307</id><published>2009-01-18T17:15:00.017-05:00</published><updated>2009-01-21T18:48:49.375-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='Enterprise Miner'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Mining'/><category scheme='http://www.blogger.com/atom/ns#' term='Neural Networks'/><title type='text'>Thoughts on Understanding Neural Networks</title><content type='html'>Lately, I've been thinking quite a bit about neural networks.  In particular, I've been wondering whether it is actually possible to understand them.  As a note, this posting assumes that the reader has some understanding of neural networks.  Of course, we at Data Miners, heartily recommend our book &lt;a style="font-style: italic;" href="http://www.amazon.com/exec/obidos/ASIN/0471470643/thedataminers"&gt;Data Mining Techniques for Marketing, Sales, and Customer Relationship Management&lt;/a&gt; for introducing neural networks (as well as a plethora of other data mining algorithms).&lt;br /&gt;&lt;br /&gt;Let me start with a picture of a neural network.  
The following is a simple network that takes three inputs and has two nodes in the hidden layer:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.data-miners.com/blog/uploaded_images/nnpict-791240.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 218px;" src="http://www.data-miners.com/blog/uploaded_images/nnpict-791236.jpg" alt="" border="0" /&gt;&lt;/a&gt;Note that this structure of the network explains what is really happening.  The "input layer" (the first layer connected to the inputs) standardizes the inputs.  The "output layer" (connected to the output) is doing a regression or logistic regression, depending on whether the target is numeric or binary.  The hidden layers are actually doing a mathematical operation as well.  This could be the logistic function; more typically, though, it is the hyperbolic tangent.  All of the lines in the diagram have weights on them.  Setting these weights -- plus a few others not shown -- is the process of training the neural network.&lt;br /&gt;&lt;br /&gt;The topology of the neural network is specifically how SAS Enterprise Miner implements the network.  Other tools have similar capabilities.  Here, I am using SAS EM for three reasons.  First, because we teach a class using this tool, I have pre-built neural network diagrams.  Second, the neural network node allows me to score the hidden units.  And third, the graphics provide a data-colored scatter plot, which I use to describe what's happening.&lt;br /&gt;&lt;br /&gt;There are several ways to understand this neural network.  The most basic way is "it's a black box and we don't need to understand it."  In many respects, this is the standard data mining viewpoint.  Neural networks often work well.  
However, if you want a technique that lets you understand what it is doing, then choose another technique, such as regression or decision trees or nearest neighbor.&lt;br /&gt;&lt;br /&gt;A related viewpoint is to write down the equation for what the network is doing.  Then point out that this equation *is* the network.  The problem is not that the network cannot explain what it is doing.  The problem is that we human beings cannot understand what it is saying.&lt;br /&gt;&lt;br /&gt;I am going to propose two other ways of looking at the network.  One is geometrically.  The inputs are projected onto the outputs of the hidden layer.  The results of this projection are then combined to form the output.  The other method is, for lack of a better term, "clustering".  The hidden nodes actually identify patterns in the original data, and one hidden node usually dominates the output within a cluster.&lt;br /&gt;&lt;br /&gt;Let me start with the geometric interpretation.  For the network above, there are three dimensions of inputs and two hidden nodes.  So, three dimensions are projected down to two dimensions.&lt;br /&gt;&lt;br /&gt;I do need to emphasize that these projections are not linear projections.   This means that they are not described by simple matrices.  These are non-linear projections.  In particular, a given dimension could be stretched non-uniformly, which further complicates the situation.&lt;br /&gt;&lt;br /&gt;I chose two nodes in the hidden layer on purpose, simply because two dimensions are pretty easy to visualize.  Then I went and I tried it on a small neural network, using Enterprise Miner.    The next couple of pictures are scatter plots made with EM.  It has the nice feature that I can color the points based on data -- a feature sadly lacking from Excel.&lt;br /&gt;&lt;br /&gt;The following scatter plot shows the original data points (about 2,700 of them).  The positions are determined by the outputs of the hidden layers.  
The colors show the output of the network itself (blue being close to 0 and red being close to 1).  The network is predicting a value of 0 or 1 based on a balanced training set and three inputs.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.data-miners.com/blog/uploaded_images/h1h2-respondp-707452.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 268px;" src="http://www.data-miners.com/blog/uploaded_images/h1h2-respondp-707450.jpg" alt="" border="0" /&gt;&lt;/a&gt;Hmm, the overall output is pretty much related to the H1 output rather than the H2 output.  We see this because the color changes primarily as we move horizontally across the scatter plot and not vertically.  This is interesting.  It means that H2 is contributing little to the network prediction.  Under these particular circumstances, we can explain the output of the neural network by explaining what is happening at H1.  And what is happening at H1 is a lot like a logistic regression, where we can determine the weights of different variables going in.&lt;br /&gt;&lt;br /&gt;Note that this is an approximation, because H2 does make some contribution.  But it is a close approximation, because for almost all input data points, H1 is the dominant node.&lt;br /&gt;&lt;br /&gt;This pattern is a consequence of the distribution of the input data.  Note that H2 is always negative and close to -1, whereas H1 varies from -1 to 1 (as we would expect, given the transfer function).  This is because the inputs are always positive and in a particular range.  The inputs do not result in the full range of values for each hidden node.  This fact, in turn, provides a clue to what the neural network is doing.  Also, this is close to a degenerate case because one hidden unit is almost always ignored.  
It does illustrate that looking at the outputs of the hidden layers is useful.&lt;br /&gt;&lt;br /&gt;This suggests another approach.    Imagine the space of H1 and H2 values, and further that any combination of them might exist (do remember that because of the transfer function, the values actually are limited to the range -1 to 1).  Within this space, which node dominates the calculation of the output of the network?&lt;br /&gt;&lt;br /&gt;To answer this question, I had to come up with some reasonable way to compare the following values:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Network output:  &lt;span style="font-family:courier new;"&gt;exp(bias + a1*H1 + a2*H2)&lt;/span&gt;&lt;/li&gt;&lt;li&gt;H1 only:  &lt;span style="font-family:courier new;"&gt;exp(bias + a1*H1)&lt;/span&gt;&lt;/li&gt;&lt;li&gt;H2 only: &lt;span style="font-family:courier new;"&gt;exp(bias + a2*H2)&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Let me give an example with numbers.  For the network above, we have the following when H1 and H2 are both -1:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Network output:  0.9994&lt;/li&gt;&lt;li&gt;H1 only output:  0.9926&lt;/li&gt;&lt;li&gt;H2 only output:  0.9749&lt;/li&gt;&lt;/ul&gt;To calculate the contribution of H1, I use the ratio of the sums of the squares of the differences, as in the following example for H1:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;H1 contribution:  &lt;span style="font-family:courier new;"&gt;(0.9994 - 0.9926)^2 / ((0.9994 - 0.9926)^2 + (0.9994 - 0.9749)^2)&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;The following scatter plot shows the regions where H1 dominates the overall prediction of the network using this metric (red where H1 is dominant; blue where H2 is dominant):&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.data-miners.com/blog/uploaded_images/h1h2-distribution-700970.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 
263px;" src="http://www.data-miners.com/blog/uploaded_images/h1h2-distribution-700967.jpg" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;There are four regions in this scatter plot, defined essentially by the intersection of two lines.  In fact, each hidden node is going to add another line on this chart, generating more regions.  Within each region, one node is going to dominate.  The boundaries are fuzzy.  Sometimes this makes no difference, because the output on either side is the same; sometimes it does make a difference.&lt;br /&gt;&lt;br /&gt;Note that this scatter plot assumes that the inputs can generate all combinations of values from the hidden units.  However, in practice, this is not true, as shown on the previous scatter plot, which essentially covers only the lowest eighth of this one.&lt;br /&gt;&lt;br /&gt;With the contribution metric, we can then say that for different regions in the hidden unit space, different hidden units dominate the output.  This is essentially saying that in different areas, we only need one hidden unit to determine the outcome of the network.  Within each region, then, we can identify the variables used by the hidden units and say that they are determining the outcome of the network.&lt;br /&gt;&lt;br /&gt;This idea leads to a way to start to understand standard multilayer perceptron neural networks, at least in the space of the hidden units.  We can identify the regions where particular hidden units dominate the output of the network.  Within each region, we can identify which variables dominate the output of that hidden unit.  Perhaps this explains what is happening in the network, because the input ranges limit the outputs only to one region.&lt;br /&gt;&lt;br /&gt;More likely, we have to return to the original inputs to determine which hidden unit dominates for a given combination of inputs.  
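&lt;br /&gt;&lt;br /&gt;As a minimal sketch, the contribution ratio described earlier can be written in a few lines of Python. The function name contribution_h1 is made up for illustration, and the three numbers are the ones quoted above for H1 = H2 = -1. The code only evaluates the ratio as written; how the ratio maps to "dominant" is left to the definition in the text.&lt;br /&gt;&lt;br /&gt;
&lt;pre&gt;
```python
def contribution_h1(network, h1_only, h2_only):
    """Ratio from the post: squared gap between the full output and the
    H1-only output, divided by the sum of both squared gaps."""
    d1 = (network - h1_only) ** 2
    d2 = (network - h2_only) ** 2
    return d1 / (d1 + d2)


# Values quoted in the post for H1 = H2 = -1 (illustrative use only).
share = contribution_h1(0.9994, 0.9926, 0.9749)
print(round(share, 4))
```
&lt;/pre&gt;
Evaluating the same function over a grid of H1 and H2 values would be one way to reproduce a region plot like the one above, assuming the output-layer weights are available.&lt;br /&gt;&lt;br /&gt;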
I've only just started thinking about this idea, so perhaps I'll follow up in a later post.&lt;br /&gt;&lt;br /&gt;--gordon&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-2569556451691856307?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/2569556451691856307/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/01/thoughts-on-understanding-neural.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/2569556451691856307'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/2569556451691856307'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/01/thoughts-on-understanding-neural.html' title='Thoughts on Understanding Neural Networks'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-2034424201184246482</id><published>2009-01-14T11:42:00.003-05:00</published><updated>2009-01-14T12:01:27.958-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='Ask a data miner'/><category scheme='http://www.blogger.com/atom/ns#' term='Neural Networks'/><title type='text'>Neural Network Training Methods</title><content type='html'>&lt;span style="font-family: Arial,Helvetica,sans-serif; font-size: 10pt;"&gt;&lt;span style="font-style: italic;"&gt;Scott asks . . .&lt;br /&gt;&lt;br /&gt;Dear Ask a Data Miner,&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;I am using SPSS &lt;/span&gt;&lt;span style="font-style: italic;" class="nfakPe"&gt;Clementine&lt;/span&gt;&lt;span style="font-style: italic;"&gt; 12.  The &lt;/span&gt;&lt;span style="font-style: italic;" class="nfakPe"&gt;Neural&lt;/span&gt;&lt;span style="font-style: italic;"&gt; Network node in &lt;/span&gt;&lt;span style="font-style: italic;" class="nfakPe"&gt;Clementine&lt;/span&gt;&lt;span style="font-style: italic;"&gt; allows users to choose from six different training methods for building &lt;/span&gt;&lt;span style="font-style: italic;" class="nfakPe"&gt;neural&lt;/span&gt;&lt;span style="font-style: italic;"&gt; network models:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;• Quick. This method uses rules of thumb and characteristics of the data to choose an appropriate shape (topology) for the network. 
&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;• Dynamic. This method creates an initial topology but modifies the topology by adding and/or removing hidden units as training progresses.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;• Multiple. This method creates several &lt;/span&gt;&lt;span style="font-style: italic;" class="nfakPe"&gt;networks&lt;/span&gt;&lt;span style="font-style: italic;"&gt; of different topologies (the exact number depends on the training data). These &lt;/span&gt;&lt;span style="font-style: italic;" class="nfakPe"&gt;networks&lt;/span&gt;&lt;span style="font-style: italic;"&gt; are then trained in a pseudo-parallel fashion. At the end of training, the model with the lowest RMS error is presented as the final model.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;• Prune. This method starts with a large network and removes (prunes) the weakest units in the hidden and input layers as training proceeds. This method is usually slow, but it often yields better results than other methods.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;• RBFN. The radial basis function network (RBFN) uses a technique similar to k-means clustering to partition the data based on values of the target field.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;• Exhaustive prune. This method is related to the Prune method. It starts with a large network and prunes the weakest units in the hidden and input layers as training proceeds. With Exhaustive Prune, network training parameters are chosen to ensure a very thorough search of the space of possible models to find the best one. This method is usually the slowest, but it often yields the best results. 
Note that this method can take a long time to train, especially with large datasets.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: 14pt; font-style: italic;color:#4b0082;" &gt;Which is your preferred training method?&lt;/span&gt;&lt;span style="font-style: italic;"&gt;  How about for a lot of data - (a high number of cases AND a high number of input variables)?  How about for a relatively small amount of data?&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Scott,&lt;br /&gt;&lt;br /&gt;Our general attitude with respect to fancy algorithms is that they provide incremental value.  However, focusing on data usually provides more scope for improving results.  This is particularly true of neural networks, because stable neural networks should have few inputs.&lt;br /&gt;&lt;br /&gt;Before addressing your question, there are a few things that you should keep in mind when using neural networks:&lt;br /&gt;&lt;br /&gt;(1) Standardize all the inputs (that is, subtract the average and divide by the standard deviation).  This puts all numeric inputs into a particular range.&lt;br /&gt;&lt;br /&gt;(2) Avoid categorical inputs!  These should be replaced by appropriate numeric descriptors.  Neural network tools, such as Clementine, handle categorical inputs using something called n-1 coding, which converts one variable into many flag variables, which, in turn, multiplies the number of weights in the network that need to be optimized.&lt;br /&gt;&lt;br /&gt;(3) Avoid variables that are highly collinear.  These cause "multidimensional ridges" in the space of neural network weights, which can confuse the training algorithms.&lt;br /&gt;&lt;br /&gt;To return to your question in more detail.  Try out lots of the different approaches to determine which is best!  There is no rule that says that you have to decide on one approach initially and stick with it.  
To test the approaches, use a separate partition of the data to see which works best.&lt;br /&gt;&lt;br /&gt;For instance, the Quick method is probably very useful in getting results back in a reasonable amount of time.  Examine the topology, though, to see if it makes sense (no hidden units or too many hidden units).  Most of the others are all about adding or removing units, which can be valuable.  However, always test the methods on a test set that is not used for training.  The topology of the network may depend on the training set, so that provides an opportunity for overfitting.&lt;br /&gt;&lt;br /&gt;These methods are focusing more on the topology than on the input parameters.  If the prune method really does remove inputs, then that would be powerful functionality.  For the methods that are comparing results, ensure that the results are compared on a validation set, separate from the training set used to calculate the weights.  It can be easy to overfit neural networks, particularly as the number of weights increases.&lt;br /&gt;&lt;br /&gt;A comment about the radial basis function approach.  Make sure that Clementine is using normalized radial basis functions.  Standard neural networks use an s-shaped function that starts low and goes high (or vice versa), meaning that the area under the curve is unbounded.  RBFs start low, go high, and then go low again, meaning that the area under the curve is finite.  Normalizing the RBFs ensures that the basis functions do not get too small.&lt;br /&gt;&lt;br /&gt;My personal favorite approach to neural networks these days is to use principal components as inputs into the network.  
To work effectively, this requires some background in principal components to choose the right number as inputs into the network.&lt;br /&gt;&lt;br /&gt;--gordon&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-2034424201184246482?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/2034424201184246482/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/01/neural-network-training-methods.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/2034424201184246482'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/2034424201184246482'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/01/neural-network-training-methods.html' title='Neural Network Training Methods'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-2710186856949191105</id><published>2009-01-09T18:24:00.010-05:00</published><updated>2009-01-09T19:43:43.204-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='SQL'/><title type='text'>Multidimensional Chi-Square, Expected Values, Independence, and All That, Part 3</title><content type='html'>This post is a continuation of my previous &lt;a href="http://www.data-miners.com/blog/2008/12/multidimensional-chi-square-expected_28.html"&gt;post&lt;/a&gt; on extending the chi-square test to more than two dimensions. The standard, two-dimensional chi-square test is explained in Chapter 3 of my book &lt;a style="font-style: italic;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;This post explains how to implement a multidimensional chi-square test using SQL queries by calculating the chi-square value.&lt;br /&gt;&lt;br /&gt;For the purpose of demonstrating this, I will use data derived from the companion web site for &lt;span style="font-style: italic;"&gt;Data Analysis Using SQL and Excel&lt;/span&gt;.  
The following query produces data with three dimensions:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;CREATE TABLE d3 as&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;..&lt;/span&gt;SELECT paymenttype, MONTH(orderdate) as mon,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.........&lt;/span&gt;LEFT(zipcode, 1) as zip1, COUNT(*) as cnt&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;..&lt;/span&gt;FROM orders&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;..&lt;/span&gt;GROUP BY 1, 2, 3&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;The table d3 simply contains three dimensions:  the payment type, the month of the order date, and the first digit of the zip code.  These dimensions are for illustration purposes.&lt;br /&gt;&lt;br /&gt;The formula for the expected values is the ratio of the following quantities:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The product of the sum of the counts along each dimension.&lt;/li&gt;&lt;li&gt;The total sum of the counts to the power of the number of dimensions minus 1.&lt;/li&gt;&lt;/ul&gt;These quantities can be calculated using basic SQL commands.  
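&lt;br /&gt;&lt;br /&gt;As a quick sanity check of the formula outside the database, here is a minimal Python sketch on a made-up toy table. The function name expected_counts and the toy data are assumptions for illustration only; they are not part of the orders data used in this post.&lt;br /&gt;&lt;br /&gt;
&lt;pre&gt;
```python
from collections import defaultdict
from itertools import product
from math import prod


def expected_counts(cells):
    """cells maps a tuple of dimension values to an observed count.

    Returns the expected count for every combination of dimension values:
    product of the marginal sums / total ** (number of dimensions - 1).
    """
    ndim = len(next(iter(cells)))
    total = sum(cells.values())
    # Marginal sums of the counts along each dimension.
    margins = [defaultdict(int) for _ in range(ndim)]
    for key, cnt in cells.items():
        for i, value in enumerate(key):
            margins[i][value] += cnt
    # All combinations, including those that never occur in the data.
    combos = product(*(sorted(m) for m in margins))
    return {
        combo: prod(margins[i][v] for i, v in enumerate(combo)) / total ** (ndim - 1)
        for combo in combos
    }


# Hypothetical toy table with three dimensions and four populated cells.
toy = {("A", 1, "x"): 10, ("A", 2, "y"): 20, ("B", 1, "y"): 30, ("B", 2, "x"): 40}
expected = expected_counts(toy)
```
&lt;/pre&gt;
In this toy table the marginal sums are 30 for A, 40 for month 1 and 50 for x, with a total of 100, so the expected count for (A, 1, x) is 30 * 40 * 50 / 100^2 = 6. The expected values over all eight combinations add back up to the total of 100.&lt;br /&gt;&lt;br /&gt;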
The following query calculates all the expected values:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;SELECT paymenttype, mon, zip1,&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.......&lt;/span&gt;(dim1.cnt * dim2.cnt * dim3.cnt)/(dimall.cnt*dimall.cnt) as expected&lt;br /&gt;FROM (SELECT paymenttype, SUM(cnt) as cnt&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM d3&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;GROUP BY paymenttype) dim1 CROSS JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;(SELECT mon, SUM(cnt) as cnt&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM d3&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;GROUP BY mon) dim2 CROSS JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;(SELECT zip1, SUM(cnt) as cnt&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM d3&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;GROUP BY zip1) dim3 CROSS JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;(SELECT SUM(cnt) as cnt&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM d3) dimall&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;This query 
consists of four subqueries, one for each dimension and one for the total count.  Each subquery calculates the appropriate sums along one (or no) dimensions.  The results themselves are combined using &lt;span style="font-family:courier new;"&gt;CROSS JOIN&lt;/span&gt;, to ensure that the query returns results for all possible combinations of dimensions -- even those combinations that do not appear in the original data.&lt;br /&gt;This latter point is an important point.  Expected values are produced even for combinations not in the original data.&lt;br /&gt;&lt;br /&gt;The previous query calculates the expected values.  However, the chi-square calculation requires a bit more work.  One approach is to join the above query to the original table, using a &lt;span style="font-family:courier new;"&gt;LEFT OUTER JOIN&lt;/span&gt; to ensure that no expected values are missing.  The following approach uses simple &lt;span style="font-family:courier new;"&gt;JOIN&lt;/span&gt;s and assumes that the original table has all combinations of the dimensions.&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;SELECT paymenttype, mon, zip1, expected, dev,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.......&lt;/span&gt;dev*dev/expected as chi_square&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;FROM (SELECT d3.paymenttype, d3.mon, d3.zip1,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.............&lt;/span&gt;(dim1.cnt * dim2.cnt * dim3.cnt)/(dimall.cnt*dimall.cnt) as expected,&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.............&lt;/span&gt;d3.cnt-(dim1.cnt * dim2.cnt * dim3.cnt)/(dimall.cnt*dimall.cnt) as dev&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 
255);"&gt;......&lt;/span&gt;FROM d3 JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;(SELECT paymenttype, SUM(cnt) as cnt&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM d3&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;GROUP BY paymenttype) dim1&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;ON d3.paymenttype = dim1.paymenttype JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;(SELECT mon, SUM(cnt) as cnt&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM d3&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;GROUP BY mon) dim2&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;ON d3.mon = dim2.mon JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;(SELECT zip1, SUM(cnt) as cnt&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM d3&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;GROUP BY zip1) dim3&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;ON d3.zip1 = dim3.zip1 CROSS JOIN&lt;/span&gt;&lt;br /&gt;&lt;span 
style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;(SELECT SUM(cnt) as cnt&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM d3) dimall) a&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;This query joins in each of the subtotals along the dimensions, rather than using the &lt;span style="font-family:courier new;"&gt;CROSS JOIN&lt;/span&gt; to create all combinations.  I suspect that in many databases, this approach has a more efficient execution plan (particularly if there are indexes on the dimensions).  Note that the overall total is included using &lt;span style="font-family:courier new;"&gt;CROSS JOIN&lt;/span&gt;.  I find this a convenient way to include constants in queries.&lt;br /&gt;&lt;br /&gt;This query produces the chi-square value for each cell.  The overall chi-square is the sum of these values.  To interpret this value, we need the number of degrees of freedom.  For a test that all the dimensions are independent of one another, this is the total number of cells minus one, minus the sum over the dimensions of the number of different values minus one.  (In two dimensions, this reduces to the familiar product of the number of different values on each dimension minus one.)&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;SELECT (COUNT(DISTINCT paymenttype) * COUNT(DISTINCT mon) * COUNT(DISTINCT zip1) - 1) -&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.......&lt;/span&gt;(COUNT(DISTINCT paymenttype) - 1) -&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.......&lt;/span&gt;(COUNT(DISTINCT mon) - 1) -&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.......&lt;/span&gt;(COUNT(DISTINCT zip1) - 1) as dof&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;FROM d3&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Interpreting the value itself requires going outside the world of SQL, since there is no function that converts the chi-square value into a p-value within SQL.  
However, Excel does have such a function, &lt;span style="font-family: courier new;"&gt;CHIDIST()&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;It should be obvious how to extend these queries for larger numbers of dimensions.  As discussed earlier, though, the chi-square test becomes less useful in multiple dimensions, especially since there need to be counts for all combinations of dimensions for best results (the heuristic rule is a minimum expected value of 5 in all cells).  Nevertheless, doing the calculation in multiple dimensions is not difficult, and most of the work can be accomplished using basic SQL queries.</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/2710186856949191105/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2009/01/multidimensional-chi-square-expected.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/2710186856949191105'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/2710186856949191105'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2009/01/multidimensional-chi-square-expected.html' title='Multidimensional Chi-Square, Expected Values, Independence, and All That, Part 3'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-5940857116431057898</id><published>2008-12-28T21:33:00.013-05:00</published><updated>2008-12-29T21:59:18.189-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='SQL'/><title type='text'>Multidimensional Chi-Square, Expected Values, Independence, and All That, Part 2</title><content type='html'>This post is a continuation of my previous &lt;a href="http://www.data-miners.com/blog/2008/12/multidimensional-chi-square-expected.html"&gt;post&lt;/a&gt; on extending the chi-square test to more than two dimensions. The standard, two-dimensional chi-square test is explained in Chapter 3 of my book &lt;a style="font-style: italic;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;This post explains what it means to extend chi-square to three dimensions and then to additional dimensions. The key idea in extending the chi-square test is calculating the expected values. 
The next post discusses how to do the calculations using SQL.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:arial;font-size:180%;color:#009900;"&gt;&lt;strong&gt;Expected Values&lt;/strong&gt;&lt;/span&gt;&lt;br /&gt;Assume that we have data that takes on a numeric value (typically a count) and has various dimensions, such as the following with dimensions A, B, and C:&lt;br /&gt;&lt;br /&gt;&lt;table str="" style="border-collapse: collapse; width: 162pt;" border="0" cellpadding="0" cellspacing="0" width="214"&gt;&lt;col style="width: 48pt;" width="64"&gt;  &lt;col style="width: 29pt;" span="3" width="38"&gt;  &lt;col style="width: 27pt;" width="36"&gt;  &lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt; width: 48pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 29pt;" width="38"&gt;A=0&lt;/td&gt;   &lt;td style="width: 29pt;" width="38"&gt;B=0&lt;/td&gt;   &lt;td style="width: 29pt;" width="38"&gt;C=0&lt;/td&gt;   &lt;td style="width: 27pt;" num="" align="right" width="36"&gt;1&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=0&lt;/td&gt;   &lt;td&gt;B=0&lt;/td&gt;   &lt;td&gt;C=1&lt;/td&gt;   &lt;td num="" align="right"&gt;2&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=0&lt;/td&gt;   &lt;td&gt;B=1&lt;/td&gt;   &lt;td&gt;C=0&lt;/td&gt;   &lt;td num="" align="right"&gt;3&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=0&lt;/td&gt;   &lt;td&gt;B=1&lt;/td&gt;   &lt;td&gt;C=1&lt;/td&gt;   &lt;td num="" align="right"&gt;4&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   
&lt;td&gt;A=1&lt;/td&gt;   &lt;td&gt;B=0&lt;/td&gt;   &lt;td&gt;C=0&lt;/td&gt;   &lt;td num="" align="right"&gt;5&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=1&lt;/td&gt;   &lt;td&gt;B=0&lt;/td&gt;   &lt;td&gt;C=1&lt;/td&gt;   &lt;td num="" align="right"&gt;6&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=1&lt;/td&gt;   &lt;td&gt;B=1&lt;/td&gt;   &lt;td&gt;C=0&lt;/td&gt;   &lt;td num="" align="right"&gt;7&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=1&lt;/td&gt;   &lt;td&gt;B=1&lt;/td&gt;   &lt;td&gt;C=1&lt;/td&gt;   &lt;td num="" align="right"&gt;8&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;The question that the chi-square test answers is:  how expected or unexpected is this data?&lt;br /&gt;&lt;br /&gt;What does this question even mean?  Well, it means that we have to make some assumptions about the process generating the data -- some reasonable but simple assumptions -- and then measure how well this data matches the values expected under those assumptions.&lt;br /&gt;&lt;br /&gt;One possible process is that each cell is independent of all the others.  In this case, each cell would, on average, get the same count.  To get a total count of 36, each cell would have, on average, a count of 4.5 = 36/8.  Such a uniform distribution does not seem useful, because it does not take into account the structure of the data.  "Structure" here means that the data has three dimensions.&lt;br /&gt;&lt;br /&gt;The assumption used for chi-square takes this structure into account.  
It assumes that the process generates values independently along each dimension (rather than for each cell or for some arbitrary combination of dimension values).  This assumption has some implications.&lt;br /&gt;&lt;br /&gt;In the original data, there were ten things in the cells where A=0 (10 = 1+2+3+4).  The expected values have the same relationship -- the sum of the expected values where A=0 should also be 10.  This is true for each of the values along each of the dimensions.  Note, though, that it is not true for combinations of dimensions.  So, the sum of the cells where A=0 and B=0 is (in general) different for the expected values and the observed values.&lt;br /&gt;&lt;br /&gt;There is a second implication.  The distribution of values within each layer (or subcube) is the same for all layers along a dimension.  The following picture illustrates this in three dimensions:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.data-miners.com/blog/uploaded_images/cube-layers-710149.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 378px; height: 335px;" src="http://www.data-miners.com/blog/uploaded_images/cube-layers-710147.jpg" alt="" border="0" /&gt;&lt;/a&gt;The three shaded layers each have the property that the sums of the expected values are the same as the sums of the original data.  In addition, the distributions are the same.  This means that the highlighted cell in each layer has the same proportion for all the layers.&lt;br /&gt;&lt;br /&gt;This latter condition is actually quite strong, because it imposes structure among all the cells in different layers.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 153, 0); font-family: arial;font-size:180%;" &gt;Calculating Expected Values&lt;/span&gt;&lt;br /&gt;There is actually a simple formula for calculating the expected values.  
The calculation starts with the sums of the values of the cells in each possible layer.  The above diagram shows three layers, but this is only along one dimension.  There are an additional three layers (or subcubes) along each of the other two dimensions.  (The choice of 3 here is totally arbitrary; there could be any number along each dimension.)&lt;br /&gt;&lt;br /&gt;The expected value for a cell is the ratio of two numbers:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The product of the sums of the values in the layers that contain the cell (one layer along each dimension), divided by&lt;/li&gt;&lt;li&gt;The sum over the entire table, raised to the power of the number of dimensions minus one.&lt;/li&gt;&lt;/ul&gt;Let us return to the initial table, with three dimensions (A, B, and C) and the counts 1 through 8.  What is the expected value for cell A=0, B=0, C=0?&lt;br /&gt;&lt;br /&gt;First, we need to calculate the sums for the three layers:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Asum is the sum of the cells where A=0:  10 = 1+2+3+4&lt;/li&gt;&lt;li&gt;Bsum is the sum of the cells where B=0:  14 = 1+2+5+6&lt;/li&gt;&lt;li&gt;Csum is the sum of the cells where C=0:  16 = 1+3+5+7&lt;/li&gt;&lt;li&gt;The product is 2,240.&lt;/li&gt;&lt;/ul&gt;Second, we need the sum for the whole table, which is 36.  The number of dimensions is 3, so the divisor is 36^2 = 1,296, and the expected value for the cell is 2,240/1,296 = 1.73.&lt;br /&gt;&lt;br /&gt;The other cells have similar calculations.  
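&lt;br /&gt;&lt;br /&gt;As a rough sketch of the same arithmetic in SQL, assume that the eight rows above are stored in a hypothetical table &lt;span style="font-family:courier new;"&gt;abc&lt;/span&gt; with columns &lt;span style="font-family:courier new;"&gt;a&lt;/span&gt;, &lt;span style="font-family:courier new;"&gt;b&lt;/span&gt;, &lt;span style="font-family:courier new;"&gt;c&lt;/span&gt;, and &lt;span style="font-family:courier new;"&gt;cnt&lt;/span&gt;.  These names are illustrative only and are not part of the original data.  The factor of 1.0 is an added assumption, there only to avoid integer division in databases that truncate:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;-- abc(a, b, c, cnt) is a hypothetical table holding the eight rows above&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family:courier new;"&gt;SELECT abc.a, abc.b, abc.c, abc.cnt,&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.......&lt;/span&gt;1.0 * dima.cnt * dimb.cnt * dimc.cnt / (dimall.cnt * dimall.cnt) as expected&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family:courier new;"&gt;FROM abc JOIN&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;(SELECT a, SUM(cnt) as cnt FROM abc GROUP BY a) dima&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;ON abc.a = dima.a JOIN&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;(SELECT b, SUM(cnt) as cnt FROM abc GROUP BY b) dimb&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;ON abc.b = dimb.b JOIN&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;(SELECT c, SUM(cnt) as cnt FROM abc GROUP BY c) dimc&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;ON abc.c = dimc.c CROSS JOIN&lt;/span&gt;&lt;br /&gt;
&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;(SELECT SUM(cnt) as cnt FROM abc) dimall&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;For the cell A=0, B=0, C=0, the three subtotals are 10, 14, and 16 and the total is 36, so this is the same 2,240/36^2 calculation as above.&lt;br /&gt;&lt;br /&gt;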
The following shows the table with the expected values:&lt;br /&gt;&lt;table str="" style="border-collapse: collapse; width: 246pt;" border="0" cellpadding="0" cellspacing="0" width="330"&gt;&lt;col style="width: 48pt;" width="64"&gt;  &lt;col style="width: 34pt;" span="3" width="46"&gt;  &lt;col style="width: 48pt;" span="2" width="64"&gt;  &lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt; width: 48pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl26" style="width: 34pt; font-weight: bold; text-align: center;" width="46"&gt;A&lt;/td&gt;   &lt;td class="xl26" style="width: 34pt; font-weight: bold; text-align: center;" width="46"&gt;B&lt;/td&gt;   &lt;td class="xl26" style="width: 34pt; font-weight: bold; text-align: center;" width="46"&gt;C&lt;/td&gt;   &lt;td class="xl26" style="width: 48pt; font-weight: bold; text-align: center;" width="64"&gt;Value&lt;/td&gt;   &lt;td class="xl26" style="width: 48pt; font-weight: bold; text-align: center;" width="64"&gt;Expected&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl25"&gt;0&lt;/td&gt;   &lt;td class="xl25"&gt;0&lt;/td&gt;   &lt;td class="xl25"&gt;0&lt;/td&gt;   &lt;td num="" align="right"&gt;1&lt;/td&gt;   &lt;td class="xl24" num="1.728395061728395" align="right"&gt;1.73&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl25"&gt;0&lt;/td&gt;   &lt;td class="xl25"&gt;0&lt;/td&gt;   &lt;td class="xl25"&gt;1&lt;/td&gt;   &lt;td num="" align="right"&gt;2&lt;/td&gt;   &lt;td class="xl24" num="2.1604938271604937" align="right"&gt;2.16&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl25"&gt;0&lt;/td&gt;   &lt;td 
class="xl25"&gt;1&lt;/td&gt;   &lt;td class="xl25"&gt;0&lt;/td&gt;   &lt;td num="" align="right"&gt;3&lt;/td&gt;   &lt;td class="xl24" num="2.7160493827160495" align="right"&gt;2.72&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl25"&gt;0&lt;/td&gt;   &lt;td class="xl25"&gt;1&lt;/td&gt;   &lt;td class="xl25"&gt;1&lt;/td&gt;   &lt;td num="" align="right"&gt;4&lt;/td&gt;   &lt;td class="xl24" num="3.3950617283950617" align="right"&gt;3.40&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl25"&gt;1&lt;/td&gt;   &lt;td class="xl25"&gt;0&lt;/td&gt;   &lt;td class="xl25"&gt;0&lt;/td&gt;   &lt;td num="" align="right"&gt;5&lt;/td&gt;   &lt;td class="xl24" num="4.4938271604938276" align="right"&gt;4.49&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl25"&gt;1&lt;/td&gt;   &lt;td class="xl25"&gt;0&lt;/td&gt;   &lt;td class="xl25"&gt;1&lt;/td&gt;   &lt;td num="" align="right"&gt;6&lt;/td&gt;   &lt;td class="xl24" num="5.617283950617284" align="right"&gt;5.62&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl25"&gt;1&lt;/td&gt;   &lt;td class="xl25"&gt;1&lt;/td&gt;   &lt;td class="xl25"&gt;0&lt;/td&gt;   &lt;td num="" align="right"&gt;7&lt;/td&gt;   &lt;td class="xl24" num="7.0617283950617287" align="right"&gt;7.06&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl25"&gt;1&lt;/td&gt;   &lt;td class="xl25"&gt;1&lt;/td&gt;   &lt;td class="xl25"&gt;1&lt;/td&gt;   &lt;td num="" align="right"&gt;8&lt;/td&gt;   &lt;td class="xl24" 
num="8.8271604938271597" align="right"&gt;8.83&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;Here the expected values are pretty close to the original values.  This calculation is available in the accompanying spreadsheet (&lt;a href="http://www.data-miners.com/blog/chi-square-blog.xls"&gt;chi-square-blog.xls&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;The calculation also readily extends to more than three dimensions.  However, the condition that the distributions are the same along parallel subcubes becomes more and more restrictive.  In two dimensions, the expected values make intuitive sense.  However, as the number of dimensions grows, they may not be as intuitive.  Also, by combining values along dimensions, it is possible to reduce a multidimensional case to a two-dimensional case (although some information is lost in the process).&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0);font-size:180%;" &gt;&lt;span style="font-weight: bold; font-family: arial;"&gt;From Expected Values to Chi-Square&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;The chi-square calculation itself follows the same procedure as in the two-dimensional case.  The chi-square for each cell is the square of the difference between the observed and expected values, divided by the expected value.  The chi-square for the whole table is the sum of all the chi-square values.&lt;br /&gt;&lt;br /&gt;The number of degrees of freedom is calculated by counting, as in the two-dimensional case.  It is the number of cells minus one, minus the size of each dimension minus one (summed over the dimensions).  In two dimensions, this works out to the familiar product of the size of each dimension minus one.  So, in the 2X2X2 case, the degrees of freedom is 4 (8 - 1 - 3).  
In the 3X3X3X3 case, it is 72 (81 - 1 - 8).&lt;br /&gt;&lt;br /&gt;The next posting will explain how to calculate the expected value using SQL.&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/5940857116431057898/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2008/12/multidimensional-chi-square-expected_28.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/5940857116431057898'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/5940857116431057898'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2008/12/multidimensional-chi-square-expected_28.html' title='Multidimensional Chi-Square, Expected Values, Independence, and All That, Part 2'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-6464453933421427662</id><published>2008-12-14T18:19:00.020-05:00</published><updated>2008-12-14T21:45:40.011-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='SQL'/><title type='text'>Multidimensional Chi-Square, Expected Values, Independence, and All That, Part 1</title><content type='html'>When I speak about data mining, I often refer to the chi-square test as my favorite statistical test.  I should be more specific, though, because I am really referring to the two-dimensional chi-square test.  This is described in detail in Chapter 3 of &lt;a style="font-weight: bold; font-style: italic;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;, a book that I heartily recommend and that is the starting point for many of the ideas that I write about here.&lt;br /&gt;&lt;br /&gt;The chi-square test can be applied to more than two dimensions.  However, the multi-dimensional chi-square behaves a bit differently from the two-dimensional case.  This posting describes why.  The next posting describes the calculation for the multi-dimensional chi-square.  
And the third posting in this series will describe how to do the calculations using SQL.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:180%;" &gt;&lt;span style="color: rgb(0, 153, 0);font-family:arial;" &gt;Fast Overview of Chi-Square&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The Chi-Square test is used when we have two or more categorical variables and counts of how often each combination appears.  For instance, the following is a simple set of data in two dimensions:&lt;br /&gt;&lt;br /&gt;&lt;table str="" style="border-collapse: collapse; width: 151pt;" border="0" cellpadding="0" cellspacing="0" width="202"&gt;&lt;col style="width: 48pt;" width="64"&gt;  &lt;col style="width: 38pt;" width="51"&gt;  &lt;col style="width: 33pt;" width="44"&gt;  &lt;col style="width: 32pt;" width="43"&gt;  &lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt; width: 48pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 38pt;" width="51"&gt;A=0&lt;/td&gt;   &lt;td style="width: 33pt;" width="44"&gt;B=0&lt;/td&gt;   &lt;td style="width: 32pt;" num="" align="right" width="43"&gt;1&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=0&lt;/td&gt;   &lt;td&gt;B=1&lt;/td&gt;   &lt;td num="" align="right"&gt;2&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=1&lt;/td&gt;   &lt;td&gt;B=0&lt;/td&gt;   &lt;td num="" align="right"&gt;3&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=1&lt;/td&gt;   &lt;td&gt;B=1&lt;/td&gt;   &lt;td num="" align="right"&gt;4&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;This data is summarized from ten observations.  
The first row says that in one data record, both A and B are zero.  The last row says that in four of them, both A and B are 1.  In practice, when using the chi-square test, we would want higher counts -- and we would get them, because these are counts of customers (say, responders and non-responders by gender).&lt;br /&gt;&lt;br /&gt;In two dimensions, a contingency table is perhaps a better way of looking at the counts:&lt;br /&gt;&lt;br /&gt;&lt;table str="" style="border-collapse: collapse; width: 151pt;" border="0" cellpadding="0" cellspacing="0" width="202"&gt;&lt;col style="width: 48pt;" width="64"&gt;  &lt;col style="width: 38pt;" width="51"&gt;  &lt;col style="width: 33pt;" width="44"&gt;  &lt;col style="width: 32pt;" width="43"&gt;  &lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt; width: 48pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 38pt;" width="51"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl24" style="width: 33pt; font-weight: bold;" width="44"&gt;B=0&lt;/td&gt;   &lt;td class="xl24" style="width: 32pt; font-weight: bold;" width="43"&gt;B=1&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="font-weight: bold;" class="xl24"&gt;A=0&lt;/td&gt;   &lt;td style="text-align: center;" num=""&gt;1&lt;/td&gt;   &lt;td style="text-align: center;" num=""&gt;2&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="font-weight: bold;" class="xl24"&gt;A=1&lt;/td&gt;   &lt;td style="text-align: center;" num=""&gt;3&lt;/td&gt;   &lt;td style="text-align: center;" num=""&gt;4&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;The chi-square test then asks the question . . . 
&lt;span style="font-style: italic;"&gt;What is the probability that the counts are produced randomly, assuming that A and B are independent?&lt;/span&gt;  To answer this question, we need the expected values assuming independence between A and B.  The following table shows the expected values:&lt;br /&gt;&lt;br /&gt;&lt;table str="" style="border-collapse: collapse; width: 192pt;" border="0" cellpadding="0" cellspacing="0" width="256"&gt;&lt;col style="width: 48pt;" span="4" width="64"&gt;  &lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt; width: 48pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 48pt;" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl25" style="width: 48pt; font-weight: bold;" width="64"&gt;B=0&lt;/td&gt;   &lt;td class="xl25" style="width: 48pt; font-weight: bold;" width="64"&gt;B=1&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="font-weight: bold;" class="xl25"&gt;A=0&lt;/td&gt;   &lt;td class="xl24" num=""&gt;1.2&lt;/td&gt;   &lt;td class="xl24" num=""&gt;1.8&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="font-weight: bold;" class="xl25"&gt;A=1&lt;/td&gt;   &lt;td class="xl24" num=""&gt;2.8&lt;/td&gt;   &lt;td class="xl24" num=""&gt;4.2&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;The expected values have two important properties.  First, the row sums and column sums are the same as in the original data.  So, 1+2 = 1.2+1.8 = 3, and so on for both rows and both columns.&lt;br /&gt;&lt;br /&gt;The second property is a little more subtle: the ratios of values are the same in every row, and likewise in every column.  So, 1.2/1.8 = 2.8/4.2 = 2/3, and so on.  
Of all possible 2X2 matrices, there is only one that has both these properties.&lt;br /&gt;&lt;br /&gt;Now, the chi-square value for any cell is the square of the difference between the actual value and the expected value, divided by the expected value.  The chi-square for the matrix is the sum of the chi-square values for all the cells.  This sum follows a chi-square distribution with one degree of freedom, which gives us enough information to determine whether the original counts are likely due to chance.&lt;br /&gt;&lt;br /&gt;Calculating expected values is easy.  The expected value for any cell is the product of the row sum and the column sum, divided by the total in the table.  For example, for A=0, B=0, the row sum is 3 and the column sum is 4.  The product is 12, so the expected value is 1.2 = 12/10.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 153, 0);font-size:180%;" &gt; &lt;span style="font-family:arial;"&gt;Treating Three Dimensions As Two Dimensions&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;Now, let's assume that the data has three dimensions rather than two.  
For example:&lt;br /&gt;&lt;table str="" style="border-collapse: collapse; width: 162pt;" border="0" cellpadding="0" cellspacing="0" width="214"&gt;&lt;col style="width: 48pt;" width="64"&gt;  &lt;col style="width: 29pt;" span="3" width="38"&gt;  &lt;col style="width: 27pt;" width="36"&gt;  &lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt; width: 48pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 29pt;" width="38"&gt;A=0&lt;/td&gt;   &lt;td style="width: 29pt;" width="38"&gt;B=0&lt;/td&gt;   &lt;td style="width: 29pt;" width="38"&gt;C=0&lt;/td&gt;   &lt;td style="width: 27pt;" num="" align="right" width="36"&gt;1&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=0&lt;/td&gt;   &lt;td&gt;B=0&lt;/td&gt;   &lt;td&gt;C=1&lt;/td&gt;   &lt;td num="" align="right"&gt;2&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=0&lt;/td&gt;   &lt;td&gt;B=1&lt;/td&gt;   &lt;td&gt;C=0&lt;/td&gt;   &lt;td num="" align="right"&gt;3&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=0&lt;/td&gt;   &lt;td&gt;B=1&lt;/td&gt;   &lt;td&gt;C=1&lt;/td&gt;   &lt;td num="" align="right"&gt;4&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=1&lt;/td&gt;   &lt;td&gt;B=0&lt;/td&gt;   &lt;td&gt;C=0&lt;/td&gt;   &lt;td num="" align="right"&gt;5&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=1&lt;/td&gt;   &lt;td&gt;B=0&lt;/td&gt;   &lt;td&gt;C=1&lt;/td&gt;   &lt;td num="" 
align="right"&gt;6&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=1&lt;/td&gt;   &lt;td&gt;B=1&lt;/td&gt;   &lt;td&gt;C=0&lt;/td&gt;   &lt;td num="" align="right"&gt;7&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=1&lt;/td&gt;   &lt;td&gt;B=1&lt;/td&gt;   &lt;td&gt;C=1&lt;/td&gt;   &lt;td num="" align="right"&gt;8&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;We can treat this as a contingency table in two dimensions:&lt;br /&gt;&lt;br /&gt;&lt;table str="" style="border-collapse: collapse; width: 154pt;" border="0" cellpadding="0" cellspacing="0" width="204"&gt;&lt;col style="width: 48pt;" span="2" width="64"&gt;  &lt;col style="width: 29pt;" span="2" width="38"&gt;  &lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt; width: 48pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 48pt;" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl25" style="width: 29pt;" width="38"&gt;C=0&lt;/td&gt;   &lt;td class="xl25" style="width: 29pt;" width="38"&gt;C=1&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl24"&gt;A=0,B=0&lt;/td&gt;   &lt;td style="text-align: center;" class="xl25" num=""&gt;1&lt;/td&gt;   &lt;td style="text-align: center;" class="xl25" num=""&gt;5&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=0,B=1&lt;/td&gt;   &lt;td style="text-align: center;" class="xl25" num=""&gt;2&lt;/td&gt;   &lt;td style="text-align: center;" class="xl25" num=""&gt;6&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td 
style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl24"&gt;A=1,B=0&lt;/td&gt;   &lt;td style="text-align: center;" class="xl25" num=""&gt;3&lt;/td&gt;   &lt;td style="text-align: center;" class="xl25" num=""&gt;7&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=1,B=1&lt;/td&gt;   &lt;td style="text-align: center;" class="xl25" num=""&gt;4&lt;/td&gt;   &lt;td style="text-align: center;" class="xl25" num=""&gt;8&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;And from this we can readily calculate the expected values:&lt;br /&gt;&lt;table str="" style="border-collapse: collapse; width: 192pt;" border="0" cellpadding="0" cellspacing="0" width="256"&gt;&lt;col style="width: 48pt;" span="4" width="64"&gt;  &lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt; width: 48pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 48pt;" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl26" style="width: 48pt; text-align: center;" width="64"&gt;C=0&lt;/td&gt;   &lt;td class="xl26" style="width: 48pt; text-align: center;" width="64"&gt;C=1&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl24"&gt;A=0,B=0&lt;/td&gt;   &lt;td style="text-align: center;" class="xl25" num="1.6666666666666667"&gt;1.67&lt;/td&gt;   &lt;td style="text-align: center;" class="xl25" num="4.333333333333333"&gt;4.33&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=0,B=1&lt;/td&gt;   &lt;td style="text-align: center;" class="xl25" num="2.2222222222222223"&gt;2.22&lt;/td&gt;   &lt;td style="text-align: center;" class="xl25" num="5.7777777777777777"&gt;5.78&lt;/td&gt;  
&lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl24"&gt;A=1,B=0&lt;/td&gt;   &lt;td style="text-align: center;" class="xl25" num="2.7777777777777777"&gt;2.78&lt;/td&gt;   &lt;td style="text-align: center;" class="xl25" num="7.2222222222222223"&gt;7.22&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=1,B=1&lt;/td&gt;   &lt;td style="text-align: center;" class="xl25" num="3.3333333333333335"&gt;3.33&lt;/td&gt;   &lt;td style="text-align: center;" class="xl25" num="8.6666666666666661"&gt;8.67&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;The chi-square calculation follows as in the earlier case.  The chi-square value for each cell is the square of the difference between the actual count and the expected value, divided by the expected value.  The chi-square value for the entire table is the sum of the chi-square values of all the cells.&lt;br /&gt;&lt;br /&gt;The only difference here is that there are three degrees of freedom.  
This affects how to transform the chi-square value into a probability, but it does not affect the computation.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 153, 0);font-size:180%;" &gt;&lt;span style="font-family:arial;"&gt;Which Are the Right Expected Values?&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;There are actually two other contingency tables that we might produce from the original 2X2X2 data, depending on which dimension we use for the columns:&lt;br /&gt;&lt;br /&gt;&lt;table str="" style="border-collapse: collapse; width: 154pt;" border="0" cellpadding="0" cellspacing="0" width="204"&gt;&lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt; width: 48pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 48pt;" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 29pt;" width="38"&gt;B=0&lt;/td&gt;   &lt;td style="width: 29pt;" width="38"&gt;B=1&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=0,C=0&lt;/td&gt;   &lt;td style="text-align: center;" num=""&gt;1&lt;/td&gt;   &lt;td style="text-align: center;" num=""&gt;2&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=0,C=1&lt;/td&gt;   &lt;td style="text-align: center;" num=""&gt;5&lt;/td&gt;   &lt;td style="text-align: center;" num=""&gt;6&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=1,C=0&lt;/td&gt;   &lt;td style="text-align: center;" num=""&gt;3&lt;/td&gt;   &lt;td style="text-align: center;" num=""&gt;4&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=1,C=1&lt;/td&gt;   &lt;td 
style="text-align: center;" num=""&gt;7&lt;/td&gt;   &lt;td style="text-align: center;" num=""&gt;8&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;and&lt;br /&gt;&lt;table str="" style="border-collapse: collapse; width: 154pt;" border="0" cellpadding="0" cellspacing="0" width="204"&gt;&lt;col style="width: 48pt;" span="2" width="64"&gt;  &lt;col style="width: 29pt;" span="2" width="38"&gt;  &lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt; width: 48pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 48pt;" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 29pt;" width="38"&gt;A=0&lt;/td&gt;   &lt;td style="width: 29pt;" width="38"&gt;A=1&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;B=0,C=0&lt;/td&gt;   &lt;td style="text-align: center;" num=""&gt;1&lt;/td&gt;   &lt;td style="text-align: center;" num=""&gt;3&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;B=0,C=1&lt;/td&gt;   &lt;td style="text-align: center;" num=""&gt;5&lt;/td&gt;   &lt;td style="text-align: center;" num=""&gt;7&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;B=1,C=0&lt;/td&gt;   &lt;td style="text-align: center;" num=""&gt;2&lt;/td&gt;   &lt;td style="text-align: center;" num=""&gt;4&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;B=1,C=1&lt;/td&gt;   &lt;td style="text-align: center;" num=""&gt;6&lt;/td&gt;   &lt;td style="text-align: center;" num=""&gt;8&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;Following the same procedure, we can calculate the 
expected values for each of these.&lt;br /&gt;&lt;table str="" style="border-collapse: collapse; width: 192pt;" border="0" cellpadding="0" cellspacing="0" width="256"&gt;&lt;col style="width: 48pt;" span="4" width="64"&gt;  &lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt; width: 48pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 48pt;" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 48pt; text-align: center;" width="64"&gt;B=0&lt;/td&gt;   &lt;td style="width: 48pt; text-align: center;" width="64"&gt;B=1&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=0,C=0&lt;/td&gt;   &lt;td style="text-align: center;" class="xl24" num="1.3333333333333333"&gt;1.33&lt;/td&gt;   &lt;td style="text-align: center;" class="xl24" num="1.6666666666666667"&gt;1.67&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=0,C=1&lt;/td&gt;   &lt;td style="text-align: center;" class="xl24" num="4.8888888888888893"&gt;4.89&lt;/td&gt;   &lt;td style="text-align: center;" class="xl24" num="6.1111111111111107"&gt;6.11&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=1,C=0&lt;/td&gt;   &lt;td style="text-align: center;" class="xl24" num="3.1111111111111112"&gt;3.11&lt;/td&gt;   &lt;td style="text-align: center;" class="xl24" num="3.8888888888888888"&gt;3.89&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;A=1,C=1&lt;/td&gt;   &lt;td style="text-align: center;" class="xl24" num="6.666666666666667"&gt;6.67&lt;/td&gt;   &lt;td style="text-align: center;" class="xl24" 
num="8.3333333333333339"&gt;8.33&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;and&lt;br /&gt;&lt;br /&gt;&lt;table str="" style="border-collapse: collapse; width: 192pt;" border="0" cellpadding="0" cellspacing="0" width="256"&gt;&lt;col style="width: 48pt;" span="4" width="64"&gt;  &lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt; width: 48pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 48pt;" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 48pt; text-align: center;" width="64"&gt;A=0&lt;/td&gt;   &lt;td style="width: 48pt; text-align: center;" width="64"&gt;A=1&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;B=0,C=0&lt;/td&gt;   &lt;td style="text-align: center;" class="xl24" num="1.5555555555555556"&gt;1.56&lt;/td&gt;   &lt;td style="text-align: center;" class="xl24" num="2.4444444444444446"&gt;2.44&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;B=0,C=1&lt;/td&gt;   &lt;td style="text-align: center;" class="xl24" num="4.666666666666667"&gt;4.67&lt;/td&gt;   &lt;td style="text-align: center;" class="xl24" num="7.333333333333333"&gt;7.33&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;B=1,C=0&lt;/td&gt;   &lt;td style="text-align: center;" class="xl24" num="2.3333333333333335"&gt;2.33&lt;/td&gt;   &lt;td style="text-align: center;" class="xl24" num="3.6666666666666665"&gt;3.67&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;B=1,C=1&lt;/td&gt;   &lt;td style="text-align: center;" class="xl24" num="5.4444444444444446"&gt;5.44&lt;/td&gt;   
&lt;td style="text-align: center;" class="xl24" num="8.5555555555555554"&gt;8.56&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;Oops!  The three sets of expected values are different from each other.  Which do we use for the 2X2X2 chi-square calculation?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 153, 0);font-size:180%;" &gt;&lt;span style="font-family:arial;"&gt;Why Independence is a Strong Condition&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;The answer is none of these.  For three-dimensional data (and higher-dimensional data as well), the three contingency tables are almost always going to be different, because they mean different things.  This is perhaps best viewed geometrically:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.data-miners.com/blog/uploaded_images/chi-square-cube-736394.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 366px;" src="http://www.data-miners.com/blog/uploaded_images/chi-square-cube-736391.jpg" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;In this cube, the front face corresponds to C=0 and the hidden face to C=1.  The A values go horizontally and the B's vertically.   The three different contingency tables are formed by cutting the cube in half and then pasting the halves together.  These tables are different.&lt;br /&gt;&lt;br /&gt;For instance, the front face and the back face are each 2X2 contingency tables.  The expected values for these can be determined just from the information on each face.  We do not need the information along the C dimension for this calculation.  Worse, we cannot even use this information -- so there is no way to ensure that the sums along the "C" dimension add up to the same values in the original data and for the expected values.&lt;br /&gt;&lt;br /&gt;The problem is that the sums along each dimension overspecify the problem.  
A given value has three adjacent values along three dimensions.  However, only two of the dimensions are needed to calculate an expected value, assuming independence along those two dimensions.  The information along the third dimension cannot be incorporated into the calculation.&lt;br /&gt;&lt;br /&gt;The reason?  Independence is a very strong condition.  Remember, it says not only that the sums are the same but also that the ratios within each row (or column or layer) are the same.  Normally, we might think "independent" variables are providing as much flexibility as possible.  However, that is not the case.  In fact, the original counts are the only ones that meet all the conditions of independence at the level of every row, column, and layer.&lt;br /&gt;&lt;br /&gt;When I think of this situation, I think of a paradox related to the random distribution of stars.  We actually perceive a random distribution as more ordered.  Check out this &lt;a href="http://muller.lbl.gov/teaching/Physics10/old%20physics%2010/chapters%20%28old%29/4-Randomness.htm"&gt;site&lt;/a&gt; for an example.  Similarly, our intuition is that independence among variables is a weak condition.  In fact, it can be quite a strong condition.&lt;br /&gt;&lt;br /&gt;The next posting will explain how expected values work in three and more dimensions.  For now, it is worth explaining that converting a three-dimensional problem into two dimensions is often feasible and reasonable.  This is particularly true when one of the dimensions is a "response" characteristic and the rest are input dimensions.  
However, such a 2X2 table is really an approximation.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-6464453933421427662?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/6464453933421427662/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2008/12/multidimensional-chi-square-expected.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/6464453933421427662'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/6464453933421427662'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2008/12/multidimensional-chi-square-expected.html' title='Multidimensional Chi-Square, Expected Values, Independence, and All That, Part 1'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-1691209666239339286</id><published>2008-12-07T21:38:00.017-05:00</published><updated>2008-12-07T23:53:58.585-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='MapReduce'/><category scheme='http://www.blogger.com/atom/ns#' term='SQL'/><title type='text'>MapReduce and SQL Aggregations Using Grouping Sets</title><content type='html'>In an earlier &lt;a href="http://www.data-miners.com/blog/2008/01/mapreduce-and-sql-aggregations.html"&gt;post&lt;/a&gt;, I compared MapReduce functionality and SQL functionality and made the claim that SQL required two passes through the data to calculate the number of customer starts and stops per month.  (The data used for this is on the companion web site for my book &lt;a href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;.)&lt;br /&gt;&lt;br /&gt;Two of the comments on this post explained SQL syntax that achieves the goal more efficiently.  In particular, the GROUPING SETS keyword, which is part of more recent SQL standards, is an efficient solution that allows SQL to do more of the types of processing made possible by MapReduce.  This functionality is available in SQL Server and Oracle.  
However, it is not yet available in MySQL.&lt;br /&gt;&lt;br /&gt;The following SQL query answers the question at the top of this post using &lt;span style="font-family:courier new;"&gt;FULL OUTER JOIN&lt;/span&gt; (an alternative approach is to use &lt;span style="font-family:courier new;"&gt;UNION ALL&lt;/span&gt;):&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;SELECT COALESCE(start.m, stop.m) as m, ISNULL(numstarts, 0) as numstarts, ISNULL(numstops, 0) as numstops&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;FROM (SELECT MONTH(start_date) as m, COUNT(*) as numstarts&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM customer c&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;GROUP BY MONTH(start_date)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;) start FULL OUTER JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;(SELECT MONTH(stop_date) as m, COUNT(*) as numstops&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM customer c&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;GROUP BY MONTH(stop_date)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;) stop&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.....&lt;/span&gt;ON start.m = stop.m&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Although this query is effective, in most databases, it would require two passes through the data.&lt;br 
/&gt;&lt;br /&gt;An alternative approach is to use &lt;span style="font-family:courier new;"&gt;GROUPING SETS&lt;/span&gt;.  This keyword is a generalization and improvement on CUBE functionality.  The generalization is more powerful, because it gives more options for the query optimizer.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;GROUPING SETS&lt;/span&gt; allows a query to return summaries along each grouping dimension, with or without generating the full set of rows that &lt;span style="font-family: courier new;"&gt;GROUP BY &lt;/span&gt;would create. The following query  returns a separate row for each combination of start month and stop month:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;SELECT MONTH(start_date) as start_month, MONTH(stop_date) as stop_month,&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.......&lt;/span&gt;COUNT(*) as cnt&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;FROM customer&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;GROUP BY MONTH(start_date), MONTH(stop_date)&lt;br /&gt;&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;We could imagine each row in the result set as being a cell in a big cross tabulation table, with the start months on the rows and the stop months on the columns.  What we really want are the subtotals along the rows and the columns, not the full table.  
The following query accomplishes this:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;SELECT COALESCE(start_month, stop_month) as month,&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.......&lt;/span&gt;SUM(CASE WHEN stop_month IS NULL THEN cnt ELSE 0 END) as starts,&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.......&lt;/span&gt;SUM(CASE WHEN start_month IS NULL THEN cnt ELSE 0 END) as stops&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;FROM (SELECT MONTH(start_date) as start_month, MONTH(stop_date) as stop_month,&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt; .............&lt;/span&gt;COUNT(*) as cnt&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;FROM customer&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;......&lt;/span&gt;GROUP BY GROUPING SETS (MONTH(start_date), MONTH(stop_date))&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt; .....&lt;/span&gt;) a&lt;br /&gt;WHERE start_month IS NOT NULL OR stop_month IS NOT NULL&lt;br /&gt;GROUP BY COALESCE(start_month, stop_month)&lt;br /&gt;&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;The subquery in this query aggregates the data in a special way.  The outer query simply reformats the results to be similar to the earlier query.&lt;br /&gt;&lt;br /&gt;The &lt;span style="font-family: courier new;"&gt;GROUPING SETS &lt;/span&gt;keyword specifies that summaries of the data should be returned, rather than the individual aggregated rows.  This syntax specifies that groups are created for the start month and stop month.   
So, the inner query returns rows such as the following:&lt;br /&gt;&lt;br /&gt;&lt;table str="" style="border-collapse: collapse; width: 270pt;" border="0" cellpadding="0" cellspacing="0" width="361"&gt;&lt;col style="width: 48pt;" width="64"&gt;  &lt;col style="width: 85pt;" width="114"&gt;  &lt;col style="width: 89pt;" width="119"&gt;  &lt;col style="width: 48pt;" width="64"&gt;  &lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt; width: 48pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl24" style="width: 85pt; font-weight: bold;" width="114"&gt;Start Month&lt;/td&gt;   &lt;td class="xl24" style="width: 89pt; font-weight: bold;" width="119"&gt;Stop Month&lt;/td&gt;   &lt;td class="xl24" style="width: 48pt; font-weight: bold;" width="64"&gt;Count&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;Jan&lt;/td&gt;   &lt;td&gt;NULL&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;Feb&lt;/td&gt;   &lt;td&gt;NULL&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;. . 
.&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;NULL&lt;/td&gt;   &lt;td&gt;Jan&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;NULL&lt;/td&gt;   &lt;td&gt;Feb&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;. . .&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;However, the cross-rows generated by the regular group by are not there.   Half the rows have the subtotals for start months; for these the stop month column is NULL.  The other half have the subtotals for stop months; for these the start month column is NULL.  This syntax does not generate the cross-tabulation data, but it does keep the row and column subtotals.&lt;br /&gt;&lt;br /&gt;The &lt;span style="font-family: courier new;"&gt;GROUPING SETS&lt;/span&gt; keyword generates the subtotals for the start months and stop months.   In general, the query optimizer will generate the various grouping aggregations in one pass over the data.  This makes the syntax and performance much more similar to the MapReduce approach.&lt;br /&gt;&lt;br /&gt;However, MapReduce still has two practical advantages.  The first is the limit on the number of groups in the grouping sets.  
In SQL Server, only &lt;a href="http://msdn.microsoft.com/en-us/library/ms177673.aspx"&gt;32 groups are allowed&lt;/a&gt;.  This example only used two.  But more complex examples might breach this limit.&lt;br /&gt;&lt;br /&gt;The other issue is the flexibility of the SQL language.  One of the major uses of MapReduce is to process text.  In this case, we would be extracting many potential features from the text, and then doing subsequent aggregations.  SQL extensions can be used to create the features.  However, such features quickly exceed the limits on the number of groups, limiting the feasibility of this approach.&lt;br /&gt;&lt;br /&gt;One warning about the syntax.  The parentheses in the &lt;span style="font-family:courier new;"&gt;GROUPING SETS&lt;/span&gt; statement are important.  The following version would actually be the equivalent of the regular &lt;span style="font-family:courier new;"&gt;GROUP BY&lt;/span&gt;:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;GROUP BY GROUPING SETS ((MONTH(start_date), MONTH(stop_date)))&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;This is because the keyword takes a list of things being grouped.  So, in the original version (one set of parentheses), there are two elements in the list -- the totals for start month and stop  month.  In the second version, there is one element in the  list, so the cross-tabulation between start month and stop month is generated instead.  This cross-product is the equivalent of the regular group by.&lt;br /&gt;&lt;br /&gt;And my final comment is about the &lt;span style="font-family: courier new;"&gt;CUBE&lt;/span&gt; keyword, which seems to provide the same functionality.  This keyword generates all the regular aggregation rows in the table, along with additional subtotals for all combinations of dimensions.  
In the above example, it would generate the cross-tab table of start month and end month, as well as the summary rows.&lt;br /&gt;&lt;br /&gt;The problem with the &lt;span style="font-family: courier new;"&gt;CUBE &lt;/span&gt;keyword is that the original query does not need all the aggregation rows, so generating them is a waste of time.  Whether this is faster or slower than the original version of the query with the &lt;span style="font-family: courier new;"&gt;FULL OUTER JOIN&lt;/span&gt; depends on the environment.  However, it could be quite inefficient.  In addition, the query optimizer would have a very difficult time determining that these rows are not needed.&lt;br /&gt;&lt;br /&gt;I do not feel that the &lt;span style="font-family: courier new;"&gt;CUBE &lt;/span&gt;keyword provides functionality similar to MapReduce.   However, the &lt;span style="font-family: courier new;"&gt;GROUPING SETS&lt;/span&gt;  keyword does provide  functionality similar to MapReduce, because it produces summaries along dimensions without requiring multiple passes through the data and without generating large cross-tabulations.  
In addition, the &lt;span style="font-family: courier new;"&gt;GROUPING SETS &lt;/span&gt;keyword allows the query optimizer to choose from a variety of algorithms for executing the query, taking advantage of large computer systems using SQL syntax instead of programming.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-1691209666239339286?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='related' href='http://www.data-miners.com/blog/2008/01/mapreduce-and-sql-aggregations.html' title='MapReduce and SQL Aggregations Using Grouping Sets'/><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/1691209666239339286/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2008/12/mapreduce-and-sql-aggregations-using.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1691209666239339286'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1691209666239339286'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2008/12/mapreduce-and-sql-aggregations-using.html' title='MapReduce and SQL Aggregations Using Grouping Sets'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-1120805410987247950</id><published>2008-11-22T10:00:00.029-05:00</published><updated>2008-11-22T17:23:18.185-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='hierarchical modeling'/><category scheme='http://www.blogger.com/atom/ns#' term='SQL'/><title type='text'>Accounting for Variation in Variables Between- and Within- Groups</title><content type='html'>Recently, I had occasion to learn about fixed effects and random effects models (as well as the larger subject known as hierarchical or multi-level modeling)  in the context of analyzing patient longitudinal data.  This posting is about one particular question that interested me in this work:  For a given variable, how much of the variation in the values is due to within-group effects versus how much is due to between-group effects.&lt;br /&gt;&lt;br /&gt;For the longitudinal patient data, the groups were repeated measurements on the same individual.  For this discussion though, I'll ask questions such as "How much of the variation in zip code population is due to variations within a state versus variations between states?" I leave it to the reader to generalize this to other areas.&lt;br /&gt;&lt;br /&gt;The data used is the census data on the companion web site to my book &lt;a style="font-style: italic;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;.  
The spirit of understanding this problem using SQL and charts also comes from the book.&lt;br /&gt;&lt;br /&gt;This posting starts with what I consider to be a simple approach to answering the question.  It then shows how to calculate the result in SQL.  Finally, I discuss the solution Paul Allison presents in his book, and what I think are its drawbacks.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0);font-family:arial;font-size:130%;"  &gt;&lt;span style="font-weight: bold;"&gt;What Does Within- Versus Between- Group Variation Even Mean?&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I first saw this issue in Paul Allison's book &lt;a style="font-style: italic;" href="http://www.amazon.com/Fixed-Effects-Regression-Methods-Longitudinal/dp/1590475682/ref=si3_rdr_bb_product"&gt;Fixed Effects Regression Methods for Longitudinal Data Analysis Using SAS&lt;/a&gt;, which became something of a bible on the subject while I was trying to do exactly what the title suggested (and I highly, highly recommend the book for people tackling such problems).  On page 40, he has the tantalizing observation "The degree to which the coefficients change under fixed effects estimation as compared with conventional OLS appears to be related to the degree of between- versus within-school variation on the predictor variables."&lt;br /&gt;&lt;br /&gt;This suggests that within-group versus between-group variation can be quite interesting.  And not just for predictor variables.  And not just for schools.&lt;br /&gt;&lt;br /&gt;Let's return to the question of how much variation in a zip code's population is due to the state where the zip code resides, and how much is due to variation within the state.  To answer this question analytically, we need to phrase it in terms of measures.  
Or, for this question, how well does the average population of zip codes in a state do at predicting the population of a zip code in the state?&lt;br /&gt;&lt;br /&gt;In answering this question, we are replacing the values of individual zip codes with the averaged values at the group (i.e., state) level.  By eliminating within-group variation, the answer will tell us about between-group variation.  We can assume that the remaining variation is due to within-group variation.&lt;br /&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:130%;"  &gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;"&gt;Using Variation to Answer the Question&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;Variance quantifies the idea that each point -- say the population of each zip code -- differs from the overall average.  The following chart shows a scatter plot of all the zip codes with the overall average (by the way, the zip codes here are ordered by the average zip code population in each state).&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.data-miners.com/blog/uploaded_images/overall-variation-723404.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 335px;" src="http://www.data-miners.com/blog/uploaded_images/overall-variation-723399.jpg" alt="" border="0" /&gt;&lt;/a&gt;The grey line is the overall average.  We can see that the populations for zip codes are all over the place; there is not much of a pattern.  As for the variance calculation, imagine a bar from each point to the horizontal line.  The total variation is just the sum of the squared distances from each point to the average; dividing that sum by the number of points gives the variance.&lt;br /&gt;&lt;br /&gt;What we want to do is to decompose this total variation into two parts, a within-group part and a between-group part.  I think the second is easier to explain, so let me take that route. 
To eliminate within group variation, we just substitute the average value in the group for the actual value.  This means that we are looking at the following chart instead:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.data-miners.com/blog/uploaded_images/between-groups-775822.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 336px;" src="http://www.data-miners.com/blog/uploaded_images/between-groups-775819.jpg" alt="" border="0" /&gt;&lt;/a&gt;The blue slanted line is the average in each state.  We see visually that much of the variation has gone away, so we would expect most variation to be within a state rather than between states.&lt;br /&gt;&lt;br /&gt;The idea is that we measure the variation using the first approach and we measure the variation using the second approach.  The ratio of these two values tells us how much of the variation is due to between-groups changes.  The remaining variation must be due to within-group variation.  The next section shows the calculation in SQL.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 153, 0);font-family:arial;font-size:130%;"  &gt;Doing the Calculation in SQL&lt;/span&gt;&lt;br /&gt;Expressing this in SQL is simply a matter of calculating the various sums of squared differences.  
The following SQL statement calculates both the within-group and between-group variation:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-family:courier new;"&gt;SELECT (SUM((g.grpval - a.allval)*(g.grpval - a.allval))/&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;........&lt;/span&gt;SUM((d.val - a.allval)*(d.val - a.allval))&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;.......&lt;/span&gt;) as between_grp,&lt;/span&gt;&lt;span style="font-family:courier new,mon;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: rgb(255, 255, 255);font-family:courier new;" &gt;.......&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;(SUM((d.val - g.grpval)*(d.val - g.grpval)) /&lt;/span&gt;&lt;span style="font-family:courier new,mon;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: rgb(255, 255, 255);font-family:courier new;" &gt;........&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;SUM((d.val - a.allval)*(d.val - a.allval))&lt;/span&gt;&lt;span style="font-family:courier new,mon;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: rgb(255, 255, 255);font-family:courier new;" &gt;.......&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;) as within_grp&lt;/span&gt;&lt;span style="font-family:courier new,mon;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;FROM (SELECT state as grp, population as val&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);font-family:courier new;" &gt;......&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;FROM censusfiles.zipcensus zc&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);font-family:courier new;" &gt;.....&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;) d JOIN&lt;/span&gt;&lt;span style="font-family:courier new,mon;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: rgb(255, 255, 
255);font-family:courier new;" &gt;.....&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;(SELECT state as grp, AVG(population) as grpval&lt;/span&gt;&lt;span style="font-family:courier new,mon;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: rgb(255, 255, 255);font-family:courier new;" &gt;......&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;FROM censusfiles.zipcensus zc&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);font-family:courier new;" &gt;......&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;GROUP BY 1&lt;/span&gt;&lt;span style="font-family:courier new,mon;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: rgb(255, 255, 255);font-family:courier new;" &gt;.....&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;) g&lt;/span&gt;&lt;span style="font-family:courier new,mon;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: rgb(255, 255, 255);font-family:courier new;" &gt;.....&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;ON d.grp = g.grp CROSS JOIN&lt;br /&gt;&lt;/span&gt;&lt;span style="color: rgb(255, 255, 255);font-family:courier new;" &gt;.....&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;(SELECT AVG(population) as allval&lt;br /&gt;&lt;/span&gt;&lt;span style="color: rgb(255, 255, 255);font-family:courier new;" &gt;......&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;FROM censusfiles.zipcensus zc&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);font-family:courier new;" &gt;.....&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;) a&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;First note that I snuck in the calculation for both within- and between- group variation, even though I only explained the latter.&lt;br /&gt;&lt;br /&gt;The from clause has three subqueries.  Each of these calculates one level of the summary -- the value for each zip, the value for each state, and the overall value.  
All the queries rename the fields to some canonical name.  This means that we can change the field we are looking at and not have to modify the outer &lt;span style="font-family:courier new;"&gt;SELECT&lt;/span&gt; clause -- a convenience that reduces the chance of error.&lt;br /&gt;&lt;br /&gt;In addition, the structure of the query makes it fairly easy to use a calculated field rather than just a column.  The same calculation would need to be used in each of the three subqueries.&lt;br /&gt;&lt;br /&gt;And finally, if you are using a database that supports window functions -- such as SQL Server or Oracle -- then the query can be much simpler.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 153, 0);font-family:arial;font-size:130%;"  &gt;Discussion of Results&lt;/span&gt;&lt;br /&gt;The results for population say that 12.6% of the variation in zip code population is between states and 87.4% is within states.  This confirms the observation that using the state averages removed much of the variation in the data.  In fact, for most of the census variables, most of the variation is within states.&lt;br /&gt;&lt;br /&gt;There are definitely exceptions to this.  One interesting exception is latitude (which specifies how far north or south something is).  The within-state variation for latitude is 5.5% and the between-state is 94.5% -- quite a reversal.  
The scatter plot for latitude looks quite different from the scatter plot for population:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.data-miners.com/blog/uploaded_images/latitude-scatter-733782.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 336px;" src="http://www.data-miners.com/blog/uploaded_images/latitude-scatter-733749.jpg" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;In this scatter plot, we see that the zip code values in light blue all fall quite close to the average for the state -- and in many cases, quite far from the overall average.  This makes a lot of sense geographically, and we see that fact both in the scatter plot and in the within-group and between-group variation.&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 153, 0);font-family:arial;font-size:130%;"  &gt;&lt;br /&gt;Statistical Approach&lt;/span&gt;&lt;br /&gt;Finally, it is instructive to go back to Paul Allison's book and look at his method for doing the same calculation in SAS.  Although I am going to show SAS code, understanding the idea does not require knowing SAS -- on the other hand, it might require an advanced degree in statistics.&lt;br /&gt;&lt;br /&gt;His proposed method is to run the following statement:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;proc glm data=censusfiles.zipcensus;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;absorb state;&lt;br /&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;model population=;&lt;br /&gt;run;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;And, as he states, "the proportion of variation that is between [states] is just the R-squared from this regression."&lt;br /&gt;&lt;br /&gt;This statement is called a procedure (or proc for short) in SAS.  It is calling the procedure called "glm", which stands for general linear model.  
Okay, now you can see where the advanced statistics might help.&lt;br /&gt;&lt;br /&gt;The "absorb" option creates a separate indicator for each state.  However, for performance reasons, "absorb" does not report their values.  (There are other ways to do a similar calculation that do report the individual values, but they take longer to run.)&lt;br /&gt;&lt;br /&gt;The "model" part of the statement says what model to build.  In this case, the model is predicting population, but not using any input variables.  Actually, it is using input variables -- the indicators for each state created on the "absorb" line.&lt;br /&gt;&lt;br /&gt;Doing the calculation using this method has several shortcomings.  First, the results are put into a text file.  They cannot easily be captured into a database table or into Excel.  You have to search through lots of text to find the right metric.  And you can only run one variable at a time.  In the SQL method, adding more variables is just adding more calculations on the SELECT list.  And the SQL method seems easier to generalize, which I might bring up in another posting.&lt;br /&gt;&lt;br /&gt;However, the biggest shortcoming is conceptual.  Understanding variation between-groups and within-groups is not some fancy statistical procedure that requires in-depth knowledge to use correctly.  Rather, it is a fundamental way of understanding data, and easy to calculate using tools, such as databases, that can readily manipulate data.  
The method in SQL should not only perform better on large data sets (particularly using a parallel database), but it requires much less effort to understand.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-1120805410987247950?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/1120805410987247950/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2008/11/accounting-for-variation-in-variables.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1120805410987247950'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/1120805410987247950'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2008/11/accounting-for-variation-in-variables.html' title='Accounting for Variation in Variables Between- and Within- Groups'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-4189322989736959303</id><published>2008-11-12T15:58:00.017-05:00</published><updated>2008-11-12T19:35:43.977-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='Excel'/><title type='text'>Creating Accurate Venn Diagrams in Excel, Part 2</title><content type='html'>This post is an extension of an earlier &lt;a href="http://www.data-miners.com/blog/2008/10/creating-accurate-venn-diagrams-in.html"&gt;post&lt;/a&gt;.  If you are interested in this, you may be interested in my book &lt;a style="font-style: italic;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;This post is about creating a Venn diagram using two circles.  
A Venn diagram is used to explain data such as:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Group A has 81 members.&lt;/li&gt;&lt;li&gt;Group B has 25 members.&lt;/li&gt;&lt;li&gt;There are 15 members in both groups A and B.&lt;/li&gt;&lt;/ul&gt;The above data is shown as a Venn diagram as:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.data-miners.com/blog/uploaded_images/two-circle-venn-770799.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 314px;" src="http://www.data-miners.com/blog/uploaded_images/two-circle-venn-770795.jpg" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Unfortunately, creating a simple Venn diagram is not built into Excel, so we need to create one manually.  This is another example that shows off the power of Excel charting to do unexpected things.&lt;br /&gt;&lt;br /&gt;Specifically, creating the above diagram requires the following capabilities:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;We need to draw a circle with a given radius and a center at any point.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;We need to fill in the circle with appropriate shading.&lt;/li&gt;&lt;li&gt;We need to calculate the appropriate centers and radii given data.&lt;/li&gt;&lt;li&gt;We need to annotate the chart with text.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;Each of these is explained below.  All of the charts and formulas are available in the accompanying Excel file.&lt;br /&gt;&lt;span style=";font-family:arial;font-size:130%;"  &gt; &lt;span style="color: rgb(0, 153, 0); font-weight: bold;"&gt;Drawing a Circle Using Scatter Plots&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To create the circle, we start with a set of points that, when connected with smoothed lines, look like a circle.  To get the points, we'll create a table with values from 0 to 360 degrees, and borrow some formulas from trigonometry.  
These say:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;X = radius*sin(angle) + X-offset&lt;/li&gt;&lt;li&gt;Y = radius*cos(angle) + Y-offset&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;The only slight complication is that the functions &lt;span style="font-family:courier new;"&gt;SIN()&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;COS()&lt;/span&gt; take their arguments in radians rather than degrees.  This makes the formulas look like:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;X = radius*sin(angle*2*PI/360) + X-offset&lt;/li&gt;&lt;li&gt;Y = radius*cos(angle*2*PI/360) + Y-offset&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt; The following shows the formulas:&lt;br /&gt;&lt;br /&gt;&lt;table str="" style="border-collapse: collapse; width: 419pt;" border="0" cellpadding="0" cellspacing="0" width="557"&gt;&lt;col style="width: 28pt;" width="37"&gt;  &lt;col style="width: 52pt;" width="69"&gt;  &lt;col style="width: 167pt;" width="222"&gt;  &lt;col style="width: 172pt;" width="229"&gt;  &lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt; width: 28pt;" height="18" width="37"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl26" style="width: 52pt;" width="69"&gt;Degrees&lt;/td&gt;   &lt;td class="xl26" style="width: 167pt;" width="222"&gt;X-Value&lt;/td&gt;   &lt;td class="xl27" style="border-left: medium none; width: 172pt;" width="229"&gt;Y-Value&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl28" style="border-top: medium none;" num=""&gt;0&lt;/td&gt;   &lt;td class="xl24" style="border-top: medium none;" num="0"&gt;=$E$4+SIN(2*PI()*B11/360)*$D$4&lt;/td&gt;   &lt;td class="xl25" style="border-top: medium none; border-left: medium none;" num="9"&gt;=$F$4+COS(2*PI()*B11/360)*$D$4&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" 
height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl28" style="border-top: medium none;" num=""&gt;5&lt;/td&gt;   &lt;td class="xl24" style="border-top: medium none;" num="0.78440168472892347"&gt;=$E$4+SIN(2*PI()*B12/360)*$D$4&lt;/td&gt;   &lt;td class="xl25" style="border-top: medium none; border-left: medium none;" num="8.9657522828257097"&gt;=$F$4+COS(2*PI()*B12/360)*$D$4&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl28" style="border-top: medium none;" num=""&gt;10&lt;/td&gt;   &lt;td class="xl24" style="border-top: medium none;" num="1.562833599002373"&gt;=$E$4+SIN(2*PI()*B13/360)*$D$4&lt;/td&gt;   &lt;td class="xl25" style="border-top: medium none; border-left: medium none;" num="8.8632697771098723"&gt;=$F$4+COS(2*PI()*B13/360)*$D$4&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl28" style="border-top: medium none;" num=""&gt;15&lt;/td&gt;   &lt;td class="xl24" style="border-top: medium none;" num="2.3293714059226867"&gt;=$E$4+SIN(2*PI()*B14/360)*$D$4&lt;/td&gt;   &lt;td class="xl25" style="border-top: medium none; border-left: medium none;" num="8.6933324366016151"&gt;=$F$4+COS(2*PI()*B14/360)*$D$4&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl28" style="border-top: medium none;" num=""&gt;20&lt;/td&gt;   &lt;td class="xl24" style="border-top: medium none;" num="3.0781812899310186"&gt;=$E$4+SIN(2*PI()*B15/360)*$D$4&lt;/td&gt;   &lt;td class="xl25" style="border-top: medium none; border-left: medium none;" num="8.4572335870731763"&gt;=$F$4+COS(2*PI()*B15/360)*$D$4&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" 
height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl28" style="border-top: medium none;" num=""&gt;25&lt;/td&gt;   &lt;td class="xl24" style="border-top: medium none;" num="3.8035643556662948"&gt;=$E$4+SIN(2*PI()*B16/360)*$D$4&lt;/td&gt;   &lt;td class="xl25" style="border-top: medium none; border-left: medium none;" num="8.1567700833298495"&gt;=$F$4+COS(2*PI()*B16/360)*$D$4&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl28" style="border-top: medium none;" num=""&gt;30&lt;/td&gt;   &lt;td class="xl24" style="border-top: medium none;" num="4.5"&gt;=$E$4+SIN(2*PI()*B17/360)*$D$4&lt;/td&gt;   &lt;td class="xl25" style="border-top: medium none; border-left: medium none;" num="7.794228634059948"&gt;=$F$4+COS(2*PI()*B17/360)*$D$4&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;Where E4 contains the X-offset; F4 contains the Y-offset; and D4 contains the radius.&lt;br /&gt;&lt;br /&gt;The degree values need to extend all the way to 360 to get a full circle, which can then be plotted as a scatter plot.  
When choosing the variety of scatter plot, pick the option with points connected by smoothed lines.&lt;br /&gt;&lt;br /&gt;The following chart shows the resulting circle with the points highlighted, along with axis labels and grid lines (which should be removed before creating the final version):&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.data-miners.com/blog/uploaded_images/circle-on-grid-733893.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 370px;" src="http://www.data-miners.com/blog/uploaded_images/circle-on-grid-733888.jpg" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Creating a second circle is as easy as creating the first: just add a second series to the chart.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;font-size:130%;" &gt;&lt;span style="font-family:arial;"&gt;Filling in the Circle with Appropriate Shading&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Unfortunately, to Excel, the circle is really just a collection of points, and we cannot fill it with shading.  However, with the clever idea of using error bars, we can put in a pattern, such as:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.data-miners.com/blog/uploaded_images/circle-with-shading-734782.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 370px;" src="http://www.data-miners.com/blog/uploaded_images/circle-with-shading-734779.jpg" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The idea is to create X error bars for horizontal lines and Y error bars for vertical lines.  To do this, right-click on the circle and choose "Format Data Series".  Then go to the "X Error Bars" or "Y Error Bars" tab (whichever is appropriate).  
Put 101 in the "Percent" box.&lt;br /&gt;&lt;br /&gt;This adds the error bars.  To format them, double-click on one of them.  You can set the color for them and also remove the little line at the edge.&lt;br /&gt;&lt;br /&gt;You will notice that these bars are not evenly spaced.  The spacing is related to the degrees.  With the proper choice of degrees, the points would be evenly spaced.  However, I do not mind the uneven spacing, and have not bothered to figure out a better set of points for even spacing.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:130%;"  &gt; &lt;span style="color: rgb(0, 153, 0); font-weight: bold;"&gt;Calculating Where the Circles Should Be&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Given the area of a circle, calculating the radius is a simple matter of reversing the area formula.  So, we have:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;radius = SQRT(area/PI())&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;So, getting the radii for the two circles is easy.  The question is:  where should the second circle be placed to get the right overlap?&lt;br /&gt;&lt;br /&gt;Unfortunately, there is no easy solution.  First, we have to apply some complicated arithmetic to calculate the area of overlap between two circles, given the width of the overlap.  Then we have to find the overlap width that gives the correct area.&lt;br /&gt;&lt;br /&gt;The first part is solved by looking up the formula for the area of overlap between two circles at a site such as &lt;a href="http://mathworld.wolfram.com/Circle-CircleIntersection.html"&gt;Wolfram MathWorld&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The second is solved by using the "Goal Seek" functionality on the Tools menu.  We simply set up a worksheet that calculates the area of the overlap, given the width of the overlap and the two radii.  One of the cells has the difference between this value and the area that we want.  
We then use Goal Seek to set this value to 0.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0);font-family:arial;font-size:130%;"  &gt;&lt;span style="font-weight: bold;"&gt;Annotating the Chart with Text&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;The final step is annotating the chart with text, such as "A Only:  65".  First, we put this string in a cell, using a formula such as:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;="A Only:  "&amp;amp;C4-C6&lt;/li&gt;&lt;/ul&gt;Then, we include this text in the chart by selecting the chart and typing "=" followed by the cell address (or using the mouse).&lt;br /&gt;&lt;br /&gt;In the end, we are able to create an accurate Venn diagram with two circles, of any size and overlap.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0);font-family:arial;font-size:130%;"  &gt;&lt;span style="font-weight: bold;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;a href="http://www.data-miners.com/blog/venn-20080112.xls"&gt;venn-20080112.xls&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-4189322989736959303?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/4189322989736959303/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2008/11/creating-accurate-venn-diagrams-in.html#comment-form' title='5 
Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/4189322989736959303'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/4189322989736959303'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2008/11/creating-accurate-venn-diagrams-in.html' title='Creating Accurate Venn Diagrams in Excel, Part 2'/><author><name>Gordon S. Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-5227070962548293225</id><published>2008-11-01T15:32:00.003-04:00</published><updated>2008-11-01T15:55:33.276-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Ask a data miner'/><category scheme='http://www.blogger.com/atom/ns#' term='Michael'/><title type='text'>Should model scores be rescaled?</title><content type='html'>&lt;blockquote&gt;&lt;p&gt;&lt;span style=";font-family:Arial;font-size:85%;"  &gt;&lt;span style=";font-family:Arial;font-size:10;"  &gt; &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p&gt;&lt;span style=";font-family:Arial;font-size:85%;"  &gt;&lt;span style=";font-family:Arial;font-size:10;"  &gt;Here’s a quick question for your blog;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p&gt;&lt;span style=";font-family:Arial;font-size:85%;"  &gt;&lt;span style=";font-family:Arial;font-size:10;"  &gt; &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p&gt;&lt;span style=";font-family:Arial;font-size:85%;"  &gt;&lt;span style=";font-family:Arial;font-size:10;"  &gt; - background -&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p&gt;&lt;span style=";font-family:Arial;font-size:85%;"  &gt;&lt;span 
style=";font-family:Arial;font-size:10;"  &gt;I work in a small team of data miners for a telecommunications company.  We usually do ‘typical’ customer churn and mobile (cell-phone) related analysis using call detail records (CDR’s)&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p&gt;&lt;span style=";font-family:Arial;font-size:85%;"  &gt;&lt;span style=";font-family:Arial;font-size:10;"  &gt; &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p&gt;&lt;span style=";font-family:Arial;font-size:85%;"  &gt;&lt;span style=";font-family:Arial;font-size:10;"  &gt;We often use neural nets to create a decimal range score between zero and one (0.0 – 1.0), where zero equals no churn and maximum 1.0 equals highest likelihood of churn.  Another dept then simply sorts an output table in descending order and runs the marketing campaigns using the first 5% (or whatever mailing size they want) of ranked customers.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p&gt;&lt;span style=";font-family:Arial;font-size:85%;"  &gt;&lt;span style=";font-family:Arial;font-size:10;"  &gt; &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p&gt;&lt;span style=";font-family:Arial;font-size:85%;"  &gt;&lt;span style=";font-family:Arial;font-size:10;"  &gt;- problem -&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p&gt;&lt;span style=";font-family:Arial;font-size:85%;"  &gt;&lt;span style=";font-family:Arial;font-size:10;"  &gt;We have differing preferences in the distribution of our prediction score for churn.  Churn occurs infrequently, lets say 2% (it is voluntary churn of good fare paying customers) per month. 
So 98% of customers have a score of 0.0 and 2% have a score of 1.0.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p&gt;&lt;span style=";font-family:Arial;font-size:85%;"  &gt;&lt;span style=";font-family:Arial;font-size:10;"  &gt; &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p&gt;&lt;span style=";font-family:Arial;font-size:85%;"  &gt;&lt;span style=";font-family:Arial;font-size:10;"  &gt;When I build my predictive model I try to mimic this distribution.  My view that is most of the churn prediction scores would be skewed toward 0.1 or 0.2, say 95% of all predicted customers, and from 0.3 to 1.0 of the churn score would apply to maybe 5% of the customer base.  &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p&gt;&lt;span style=";font-family:Arial;font-size:85%;"  &gt;&lt;span style=";font-family:Arial;font-size:10;"  &gt; &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p&gt;&lt;span style=";font-family:Arial;font-size:85%;"  &gt;&lt;span style=";font-family:Arial;font-size:10;"  &gt;Some of my colleagues re-scale the prediction score so that there are an equal number of customers spread throughout.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p&gt;&lt;span style=";font-family:Arial;font-size:85%;"  &gt;&lt;span style=";font-family:Arial;font-size:10;"  &gt; &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p&gt;&lt;span style=";font-family:Arial;font-size:85%;"  &gt;&lt;span style=";font-family:Arial;font-size:10;"  &gt; - question - &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p&gt;&lt;span style=";font-family:Arial;font-size:85%;"  &gt;&lt;span style=";font-family:Arial;font-size:10;"  &gt;What are your views/preferences on this?&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style=";font-family:Arial;font-size:85%;"  &gt;&lt;span style=";font-family:Arial;font-size:10;"  &gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;I see no reason to rescale the scores.  
Of course, if the only use of the scores is to mail the top 5% of the list, it makes no difference since the transformation preserves the ordering, but for other applications you want the score to be an estimate of the actual probability of cancellation. &lt;br /&gt;&lt;br /&gt;In general, scores that represent the probability of an event are more useful than scores which only order a list in descending order by probability of the event. For example, in a campaign response model, you can multiply the probability that a particular prospect will respond by the value of that response to get an expected value of making the offer. If the expected value is less than the cost, the offer should not be made.  Gordon and I discuss this and related issues in our book &lt;a href="http://www.amazon.com/exec/obidos/ASIN/0471331236/thedataminers"&gt;Mastering Data Mining&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;This issue often comes up when stratified sampling is used to create a balanced model set of 50% responders and 50% non-responders. For some modeling techniques--notably, decision trees--a balanced model set will produce more and better rules. However, the proportion of responders at each leaf is no longer an estimate of the actual probability of response. 
The solution is simple: apply the model to a test set that has the correct distribution of responders to get correct estimates of the response probability.&lt;br /&gt;&lt;br /&gt;-Michael&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-5227070962548293225?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/5227070962548293225/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2008/11/should-model-scores-be-rescaled.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/5227070962548293225'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/5227070962548293225'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2008/11/should-model-scores-be-rescaled.html' title='Should model scores be rescaled?'/><author><name>Michael J. A. 
Berry</name><uri>http://www.blogger.com/profile/06077102677195066016</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='14679622169454737233'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-917399637258407920</id><published>2008-10-28T21:08:00.003-04:00</published><updated>2008-10-28T21:40:47.253-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='Ask a data miner'/><category scheme='http://www.blogger.com/atom/ns#' term='SQL'/><title type='text'>Random Samples in SQL</title><content type='html'>&lt;span style="font-style: italic;"&gt;Hi,&lt;/span&gt;&lt;br /&gt; &lt;br /&gt;&lt;span style="font-style: italic;"&gt; How would you recommend getting a random sample from a table in SQL?  Thank you!&lt;/span&gt;&lt;br /&gt; &lt;span style="color:#888888;"&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt; Adam&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;This is a good question.  Unfortunately, there is not a good answer, because the concept of a random sample does not really exist in relational algebra (which SQL -- to a greater or lesser extent -- is based on).  There are, however, ways to arrive at a solution.  This discussion is based partly on the Appendix in &lt;a href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;&lt;span style="font-style: italic;"&gt;Data Analysis Using SQL and Excel&lt;/span&gt;&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The basic idea is to assume that there is a function that returns a random number, say uniformly between 0 and 1.  
If such a function exists, the SQL code for a 10% random sample might look like:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;SELECT *&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;FROM table t&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;WHERE rand() &amp;lt; 0.1&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The function &lt;span style="font-family: courier new;"&gt;rand()&lt;/span&gt; does actually exist in many databases, such as IBM UDB, Microsoft SQL Server, and MySQL.&lt;br /&gt;&lt;br /&gt;Does this really work for these databases?  That depends on whether &lt;span style="font-family: courier new;"&gt;rand()&lt;/span&gt; is a deterministic or non-deterministic function.  A deterministic function is essentially evaluated once, when the query is parsed.  If this is the case, then all rows would have the same value, and the query would not return a 10% random sample.  It would return either 0 rows or all of them.&lt;br /&gt;&lt;br /&gt;Fortunately, for these databases, the designers were smart and &lt;span style="font-family: courier new;"&gt;rand()&lt;/span&gt; is non-deterministic, so the above code works as written.&lt;br /&gt;&lt;br /&gt;Oracle has a totally different approach.  It supports the SAMPLE clause.  
Using it, the above query would be written as:&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="color:#888888;"&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;&lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;SELECT *&lt;/span&gt;&lt;br /&gt; &lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;FROM table&lt;/span&gt;&lt;br /&gt; &lt;span style="font-family: courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;SAMPLE (10) t&lt;/span&gt;&lt;br /&gt; &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="color:#888888;"&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;Another approach in Oracle is to use a pseudo-random number generator and &lt;span style="font-family: courier new;"&gt;ROWNUM&lt;/span&gt;.  This approach works in any database that has something similar to &lt;span style="font-family: courier new;"&gt;ROWNUM&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;If you happen to be using SAS PROC SQL, then you can do something similar to the first example.  
The only difference is that the function is &lt;span style="font-family: courier new;"&gt;RAND('UNIFORM')&lt;/span&gt; rather than just &lt;span style="font-family: courier new;"&gt;RAND()&lt;/span&gt;.&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-917399637258407920?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/917399637258407920/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2008/10/random-samples-in-sql.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/917399637258407920'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/917399637258407920'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2008/10/random-samples-in-sql.html' title='Random Samples in SQL'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-5428242073603297866</id><published>2008-10-24T17:25:00.025-04:00</published><updated>2008-10-25T14:02:29.223-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='Excel'/><title type='text'>Creating Accurate Venn Diagrams in Excel, Part 1</title><content type='html'>This post (and the next) are about creating accurate Venn diagrams using Excel charts.  If you are interested in this, you may be interested in my book &lt;a style="font-style: italic;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Recently, I had occasion to analyze prescriber data for a project at a pharmaceutical company. One of the things we wanted to do was to compare visually the prescribing habits of psychiatrists, by placing them into three groups:  those who only prescribe drug A; those who only prescribe drug B; and those who prescribe both.  The resulting chart is:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.data-miners.com/blog/uploaded_images/prescriber-venn-716078.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 348px;" src="http://www.data-miners.com/blog/uploaded_images/prescriber-venn-716073.jpg" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;This chart is an example of a Venn diagram. Unfortunately, Excel does not have a built-in Venn diagram creator.  
And if you do a Google search, you will find many examples where the circles are placed manually.  Perhaps it is my background in data analysis, but I often prefer accuracy to laziness.  So, I developed a method to create simple but accurate Venn diagrams in Excel.&lt;br /&gt;&lt;br /&gt;Creating such diagrams is, fundamentally, rather simple.  However, there is some math involved.  To simplify the math, this post first describes how to create a Venn diagram where the two shapes are squares.  In the next post, I'll extend the ideas to using circles.&lt;br /&gt;&lt;br /&gt;Creating a Venn diagram requires understanding the following:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Creating shapes in Excel.&lt;/li&gt;&lt;li&gt;Calculating the correct overlap of the shapes.&lt;/li&gt;&lt;li&gt;Putting it all together.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;This post discusses each of these.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 153, 0); font-weight: bold;font-size:180%;" &gt;&lt;span style="font-family:arial;"&gt;Creating a Shape in Excel&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;How does one create a shape using Excel charts?  The simple answer is to use a scatter plot.  If we want to make a square, we can simply plot the four corners of the square and connect them using lines, as in the following example:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.data-miners.com/blog/uploaded_images/square-on-grid-726768.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 370px;" src="http://www.data-miners.com/blog/uploaded_images/square-on-grid-726765.jpg" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Here the square has an area of 81, so each side is exactly nine units long.  
It is created using five data points:&lt;br /&gt;&lt;br /&gt;&lt;table str="" style="border-collapse: collapse; width: 144pt;" border="0" cellpadding="0" cellspacing="0" width="192"&gt;&lt;col style="width: 48pt;" span="3" width="64"&gt;  &lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt; width: 48pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl28" style="width: 48pt; font-weight: bold; text-align: center;" width="64"&gt;X-Value&lt;/td&gt;   &lt;td class="xl29" style="border-left: medium none; width: 48pt; text-align: center; font-weight: bold;" width="64"&gt;Y-Value&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl24" style="border-top: medium none;" num="" align="right"&gt;-4.50&lt;/td&gt;   &lt;td class="xl25" style="border-top: medium none; border-left: medium none;" num="" align="right"&gt;-4.50&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl24" style="border-top: medium none;" num="" align="right"&gt;-4.50&lt;/td&gt;   &lt;td class="xl25" style="border-top: medium none; border-left: medium none;" num="" align="right"&gt;4.50&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl24" style="border-top: medium none;" num="" align="right"&gt;4.50&lt;/td&gt;   &lt;td class="xl25" style="border-top: medium none; border-left: medium none;" num="" align="right"&gt;4.50&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl24" style="border-top: medium none;" num="" align="right"&gt;4.50&lt;/td&gt;   &lt;td class="xl25" style="border-top: medium none; border-left: 
medium none;" num="" align="right"&gt;-4.50&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.8pt;" height="18"&gt;   &lt;td style="height: 13.8pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl26" style="border-top: medium none;" num="" align="right"&gt;-4.50&lt;/td&gt;   &lt;td class="xl27" style="border-top: medium none; border-left: medium none;" num="" align="right"&gt;-4.50&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;Notice that the first point is repeated at the end.  Otherwise, there would be four points, but only three sides.&lt;br /&gt;&lt;br /&gt;A small challenge in doing this is making the chart look like a square instead of a rectangle.  Unfortunately, Excel does not make it easy to adjust the size of a chart, say by right clicking and just entering the width and height.&lt;br /&gt;&lt;br /&gt;One way to make the chart square is to place it in a single cell and then adjust the row height and column width to be equal.  My preferred method is just to eye-ball it.  The above chart has a width of six columns and a height of 21 rows.&lt;br /&gt;&lt;br /&gt;In this case, the square is centered on the origin. There is a reason for this. The temptation is to position one corner of the square at the origin and have it pass through the points (0,9), (9,9), and (9,0). However, I find that when Excel draws the square, the axes interfere with the sides of the square, so some are shaded heavier than others. This happens even when I remove the axes.&lt;br /&gt;&lt;br /&gt;As an aside here, you can imagine creating many different types of shapes in Excel besides squares. However, Excel only understands these as lines connecting a scatter plot. 
In particular, this means that you cannot color the interior of the shape.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 153, 0);font-size:180%;" &gt;&lt;span style="font-family:arial;"&gt;Calculating the Overlaps&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;Assume that we have two squares that overlap: one has an area of 100 (side is 10) and the other 25 (side is 5).  What is the overlap between them?&lt;br /&gt;&lt;br /&gt;There is not enough information to answer this question.  It is clearly between 0 (if the squares do not overlap) and 25 (the size of the smaller square).  If the area of the overlap is 10, what does the overlapping region look like?  In the following picture, the area of C is 10.&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.data-miners.com/blog/uploaded_images/squares-and-regions-730921.jpg"&gt;&lt;img style="cursor: pointer; width: 400px; height: 276px;" src="http://www.data-miners.com/blog/uploaded_images/squares-and-regions-730876.jpg" alt="" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;What are the dimensions of C?  The height is the height of the smaller square -- 5.  So the width must be 2 (=10/5).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; color: rgb(0, 153, 0);font-family:arial;font-size:180%;"  &gt;Putting It Together&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To put this together for a Venn diagram using squares, we simply need to position two squares given the following information:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The sizes of the two squares.&lt;/li&gt;&lt;li&gt;The overlap between them.&lt;/li&gt;&lt;/ul&gt;Consider the original diagram at the top of this posting.  The sizes of the two regions are 13,941 and 11,175 respectively.  
The overlap is 9,783.&lt;br /&gt;&lt;br /&gt;The first thing to calculate is the side length for the two squares:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;118.07 for the first square (=sqrt(13,941)).&lt;/li&gt;&lt;li&gt;105.71 for the second (=sqrt(11,175)).&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Then, we need to calculate the width of the overlapping region (we already know its height and area):&lt;br /&gt;&lt;ul&gt;&lt;li&gt;92.54 = 9,783 / 105.71&lt;/li&gt;&lt;/ul&gt;Now we need to calculate the points for the two squares.  The way that I do the calculation is to place a square with one corner at the origin, and then to add X- and Y-offsets to shift it around the plane.  So, the general formulas for the points are:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;(0 + X-offset, 0 + Y-offset)&lt;/li&gt;&lt;li&gt;(side + X-offset, 0 + Y-offset)&lt;/li&gt;&lt;li&gt;(side + X-offset, side + Y-offset)&lt;/li&gt;&lt;li&gt;(0 + X-offset, side + Y-offset)&lt;/li&gt;&lt;li&gt;(0 + X-offset, 0 + Y-offset)&lt;/li&gt;&lt;/ul&gt;Since we know the side lengths of the two squares, I only need to calculate the offset values.  The first square is centered at the origin (rather than starting there), so the offset is - side/2 for both X and Y.&lt;br /&gt;&lt;br /&gt;The second square is centered vertically, so its Y-offset is also - side/2.  The X-offset is the bigger challenge.  In order to get the correct overlap, it is the right edge of the first square minus the width of the overlap:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;(side-first + X-offset-first) - overlap-width&lt;/li&gt;&lt;/ul&gt;The attached spreadsheet has these calculations.  
The data table on the spreadsheet looks like:&lt;br /&gt;&lt;br /&gt;&lt;table str="" style="border-collapse: collapse; width: 288pt;" border="0" cellpadding="0" cellspacing="0" width="384"&gt;&lt;col style="width: 48pt;" span="6" width="64"&gt;  &lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt; width: 48pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 48pt;" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 48pt; text-align: center;" width="64"&gt;Area&lt;/td&gt;   &lt;td style="width: 48pt; text-align: center;" width="64"&gt;Side&lt;/td&gt;   &lt;td style="width: 48pt; text-align: center;" width="64"&gt;X Offset&lt;/td&gt;   &lt;td style="width: 48pt; text-align: center;" width="64"&gt;Y Offset&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;Left&lt;/td&gt;   &lt;td class="xl25" num="13941" fmla="=C4+4158" align="right"&gt;13,941&lt;/td&gt;   &lt;td class="xl24" num="118.07201192492656" fmla="=SQRT(C2)" align="right"&gt;118.07&lt;/td&gt;   &lt;td class="xl24" num="-59.036005962463278" fmla="=-D2/2" align="right"&gt;-59.04&lt;/td&gt;   &lt;td class="xl24" num="-59.036005962463278" fmla="=-D2/2" align="right"&gt;-59.04&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;Right&lt;/td&gt;   &lt;td class="xl25" num="11175" fmla="=C4+1392" align="right"&gt;11,175&lt;/td&gt;   &lt;td class="xl24" num="105.71187255932988" fmla="=SQRT(C3)" align="right"&gt;105.71&lt;/td&gt;   &lt;td class="xl24" num="-33.507998444509795" fmla="=E2+D2-D4" align="right"&gt;-33.51&lt;/td&gt;   &lt;td class="xl24" num="52.855936279664938" fmla="=D3/2" align="right"&gt;52.86&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br 
/&gt;&lt;/td&gt;   &lt;td&gt;Overlap&lt;/td&gt;   &lt;td class="xl25" num="9783" align="right"&gt;9,783&lt;/td&gt;   &lt;td class="xl24" num="92.544004406973073" fmla="=C4/MIN(D2:D3)" align="right"&gt;92.54&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td colspan="2" style=""&gt;big square&lt;/td&gt;   &lt;td colspan="2" style=""&gt;little square&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl24" num="-59.036005962463278" fmla="=E2" align="right"&gt;-59.04&lt;/td&gt;   &lt;td class="xl24" num="-59.036005962463278" fmla="=F2" align="right"&gt;-59.04&lt;/td&gt;   &lt;td class="xl24" num="72.203874114820081" fmla="=D3+E3" align="right"&gt;72.20&lt;/td&gt;   &lt;td class="xl24" num="52.855936279664938" fmla="=F3" align="right"&gt;52.86&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl24" num="-59.036005962463278" fmla="=E2" align="right"&gt;-59.04&lt;/td&gt;   &lt;td class="xl24" 
num="59.036005962463278" fmla="=F2+D2" align="right"&gt;59.04&lt;/td&gt;   &lt;td class="xl24" num="-33.507998444509795" fmla="=E3" align="right"&gt;-33.51&lt;/td&gt;   &lt;td class="xl24" num="52.855936279664938" fmla="=F3" align="right"&gt;52.86&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl24" num="59.036005962463278" fmla="=E2+D2" align="right"&gt;59.04&lt;/td&gt;   &lt;td class="xl24" num="59.036005962463278" fmla="=F2+D2" align="right"&gt;59.04&lt;/td&gt;   &lt;td class="xl24" num="-33.507998444509795" fmla="=E3" align="right"&gt;-33.51&lt;/td&gt;   &lt;td class="xl24" num="-52.855936279664938" fmla="=-F3" align="right"&gt;-52.86&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl24" num="59.036005962463278" fmla="=E2+D2" align="right"&gt;59.04&lt;/td&gt;   &lt;td class="xl24" num="-59.036005962463278" fmla="=E2" align="right"&gt;-59.04&lt;/td&gt;   &lt;td class="xl24" num="72.203874114820081" fmla="=D3+E3" align="right"&gt;72.20&lt;/td&gt;   &lt;td class="xl24" num="-52.855936279664938" fmla="=-F3" align="right"&gt;-52.86&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td class="xl24" num="-59.036005962463278" fmla="=E2" align="right"&gt;-59.04&lt;/td&gt;   &lt;td class="xl24" num="-59.036005962463278" fmla="=E2" align="right"&gt;-59.04&lt;/td&gt;   &lt;td class="xl24" num="72.203874114820081" fmla="=D3+E3" align="right"&gt;72.20&lt;/td&gt;   &lt;td class="xl24" num="52.855936279664938" fmla="=F3" align="right"&gt;52.86&lt;/td&gt;   &lt;td&gt;&lt;br /&gt;&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;br 
/&gt;The points are listed under "big square" and "little square".  The first column is the X value for the big square, the second is the Y value; the third is the X value for the little square and the fourth is the Y value.&lt;br /&gt;&lt;br /&gt;After creating the chart, you need to beautify it.  I remove the axes and axis labels, thicken the lines around the squares, and adjust the height and width to make the shape look like a square.&lt;br /&gt;&lt;br /&gt;The attached .xls file (&lt;a href="http://www.data-miners.com/blog/venn-20081025.xls"&gt;venn-20081025.xls&lt;/a&gt;) contains all the examples in this post.&lt;br /&gt;&lt;br /&gt;The next post extends these ideas to creating Venn diagrams with circles, which are the more typical shape for them.  It also shows one way to put some color in the shapes to highlight the different regions.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-5428242073603297866?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/5428242073603297866/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2008/10/creating-accurate-venn-diagrams-in.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/5428242073603297866'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/5428242073603297866'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2008/10/creating-accurate-venn-diagrams-in.html' title='Creating Accurate Venn Diagrams in Excel, Part 1'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-3821828639416100880</id><published>2008-10-19T20:53:00.012-04:00</published><updated>2008-10-20T09:03:37.766-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='database'/><title type='text'>Rolling and Unrolling Correlated Subqueries in SQL</title><content type='html'>The subject of correlated subqueries arose recently in a data mining class I was teaching. A student inquired about improving the performance of a particular query, which happened to have a correlated subquery. This posting discusses unrolling correlated subqueries to improve performance as well as the rarer need to use correlated subqueries to increase performance.&lt;br /&gt;&lt;br /&gt;Correlated subqueries are SQL queries that contain a nested subquery, where the nested query refers to one or more outside tables. The definition sounds complicated, but an example is worth a thousand words.&lt;br /&gt;&lt;br /&gt;My book &lt;a href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;&lt;em&gt;Data Analysis Using SQL and Excel&lt;/em&gt;&lt;/a&gt; includes a database of customers, orders, and transactions (which can be downloaded). From such data, we might ask a question such as "What products did customer X order on her or his earliest order date?" 
A typical way to answer this is with a correlated subquery.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;SELECT ol.ProductID&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;FROM orders o JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:Courier New;"&gt;&lt;span style="color:#ffffff;"&gt;.....&lt;/span&gt;orderline ol&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:Courier New;"&gt;&lt;span style="color:#ffffff;"&gt;.....&lt;/span&gt;ON o.OrderID = ol.OrderID AND&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:Courier New;"&gt;&lt;span style="color:#ffffff;"&gt;.....&lt;/span&gt;o.CustomerID = X&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;WHERE o.OrderDate = (SELECT MIN(OrderDate)&lt;br /&gt;&lt;span style="color:#ffffff;"&gt;.....................&lt;/span&gt;FROM orders o2&lt;br /&gt;&lt;span style="color:#ffffff;"&gt;.....................&lt;/span&gt;WHERE o2.CustomerID = o.CustomerID)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Since this is standard SQL, all reasonable relational databases should support this syntax. One syntax note: the subquery could optionally contain a "&lt;span style="font-family:courier new;"&gt;GROUP BY o2.CustomerID&lt;/span&gt;" clause.&lt;br /&gt;&lt;br /&gt;What is the query doing? It is joining two tables together (orders and orderline) and then restricting the results to a single customer. However, the query is about the products in a particular order, so the &lt;span style="font-family:courier new;"&gt;WHERE&lt;/span&gt; clause selects the particular order -- as the one with the smallest &lt;span style="font-family:courier new;"&gt;OrderDate&lt;/span&gt;. Voila. The query answers the question.&lt;br /&gt;&lt;br /&gt;The correlation is in the &lt;span style="font-family:courier new;"&gt;WHERE&lt;/span&gt; clause of the subquery, in the line &lt;span style="font-family:courier new;"&gt;o2.CustomerID = o.CustomerID&lt;/span&gt;. 
This is placing a restriction on the values in the subquery based on the results of an outer query. Do note that if the subquery's &lt;span style="font-family:courier new;"&gt;WHERE &lt;/span&gt;clause were instead &lt;span style="font-family:courier new;"&gt;o2.CustomerID = X&lt;/span&gt;, then the subquery would not be correlated, since there would be no connection to the outer tables.&lt;br /&gt;&lt;br /&gt;So far so good. When we think of how the query runs, we think of iterating through every row in the &lt;span style="font-family:courier new;"&gt;o2&lt;/span&gt; table and looking to match it to the current value in the &lt;span style="font-family:courier new;"&gt;o&lt;/span&gt; table. If there is an index, so much the better because the query engine can use the index to access the &lt;span style="font-family:courier new;"&gt;o2&lt;/span&gt; table.&lt;br /&gt;&lt;br /&gt;This conceptual approach is, in fact, how most (if not all) query engines optimize such a query. For now, I'm leaving open the question of whether this is a good thing, in order to present the idea of unrolling the subquery.&lt;br /&gt;&lt;br /&gt;There are other ways to answer the original question ("What products did Customer X order on his or her earliest order date?"). 
The following query shows an alternative approach:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;SELECT ol.ProductID&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;FROM orders o JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color:#ffffff;"&gt;.....&lt;/span&gt;orderline ol&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color:#ffffff;"&gt;.....&lt;/span&gt;ON o.OrderID = ol.OrderID JOIN&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color:#ffffff;"&gt;.....&lt;/span&gt;(SELECT CustomerID, MIN(OrderDate) as minOrderDate&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color:#ffffff;"&gt;......&lt;/span&gt;FROM orders&lt;br /&gt;&lt;span style="color:#ffffff;"&gt;......&lt;/span&gt;GROUP BY CustomerID) omin&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color:#ffffff;"&gt;.....&lt;/span&gt;ON o.OrderDate = omin.minOrderDate AND&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color:#ffffff;"&gt;........&lt;/span&gt;o.CustomerID = omin.CustomerID&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;WHERE o.CustomerID = X&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This version of the query unrolls the subquery, by creating a summary table with the earliest order date for all customers. The link to the other table is made through an explicit join condition between this summary table and the orders table.&lt;br /&gt;&lt;br /&gt;Note that in this particular query, the &lt;span style="font-family:courier new;"&gt;WHERE&lt;/span&gt; clause that chooses the customer could be in the subquery, because the columns in the &lt;span style="font-family:courier new;"&gt;WHERE&lt;/span&gt; clause are in the subquery. 
However, in the general case, the filter could be using columns not available in the subquery -- such as getting all products that start with the letter "A".&lt;br /&gt;&lt;br /&gt;There is a big difference in how this query gets executed versus the earlier version: now the orders need to be grouped to find the earliest order date &lt;em&gt;for all customers&lt;/em&gt;. The correlated subquery could use an index and only look at the handful of rows for a given customer. So, the correlated subquery seems to be more efficient.&lt;br /&gt;&lt;br /&gt;If correlated subqueries are more efficient, why do I personally avoid using them? One reason is the explicitness of the joins. I find it much easier to understand the unrolled version. However, ease of understanding is less important than performance. In many cases, the unrolled version does execute faster.&lt;br /&gt;&lt;br /&gt;Notice that both these queries are looking for data about one particular customer -- a small subset of the overall data. For queries that are looking for such needles in the haystack, correlated subqueries are fine.&lt;br /&gt;&lt;br /&gt;However, decision support queries are usually looking to sift through the whole haystack and not look for just the needle. If we changed the question to "What products are ordered on each customer's earliest order date?" then the queries lose the restrictive clause limiting them to one customer. Now what happens?&lt;br /&gt;&lt;br /&gt;In the case of the correlated subquery, query engines essentially execute the joins in one of two ways: (1) by repeatedly looping through one table (typically the one in the inner join) or (2) using indexes. In terms of join algorithms, these are nested loop joins and index-based joins -- two perfectly good join algorithms. 
But, I might add, two out of many algorithms that could be used.&lt;br /&gt;&lt;br /&gt;On the other hand, doing the explicit join as in the second example allows the query engine to execute the different steps it needs to execute, and then to decide on the best strategies. In particular, when the data is partitioned for simultaneous access on multiple processors, most query engines handed a correlated subquery would forget the parallel possibilities and simply execute it on a single processor.&lt;br /&gt;&lt;br /&gt;By contrast, most parallel query engines would correctly parallelize the second version of the query. The &lt;span style="font-family:courier new;"&gt;GROUP BY&lt;/span&gt; would execute in parallel, as would the rest of the joins. The query optimizer would use table statistics to generate the best query plan.&lt;br /&gt;&lt;br /&gt;Correlated subqueries are a tool used when designing queries. In all cases, though, the subqueries can be unrolled using more traditional aggregation and join operations. However, query optimizers generally do not perform this operation.&lt;br /&gt;&lt;br /&gt;Correlated subqueries are often the most efficient approach when looking for a few rows from a table, particularly when the optimizer can use indexes for the join. On the other hand, unrolling the subqueries is often more efficient when there is a large amount of data, because the optimizer can do full query optimization, making use of parallelism and table statistics.&lt;br /&gt;&lt;br /&gt;Currently, most query optimizers do not know how to unroll correlated subqueries -- or how to roll them back up. 
So, we need to make such decisions when writing the queries ourselves.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-3821828639416100880?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/3821828639416100880/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2008/10/rolling-and-unrolling-correlated.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/3821828639416100880'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/3821828639416100880'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2008/10/rolling-and-unrolling-correlated.html' title='Rolling and Unrolling Correlated Subqueries in SQL'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-5856355693605138748</id><published>2008-10-02T17:01:00.003-04:00</published><updated>2008-10-02T17:28:54.976-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Ask a data miner'/><title type='text'>Decision Trees and Clustering</title><content type='html'>&lt;blockquote&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 102, 255);"&gt;Hi,&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 102, 255);"&gt;I started to write my master thesis and i chose a data mining topic.What I have to do is to analyze the bookings of an airline company and to observe  for which markets,time periods and clients the bookings can be trusted and for which not.(The bookings can anytime be canceled or modified ).&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 102, 255);"&gt;I decided to use the decision trees as a classification method but I somehow wonder if clustering would have been more appropriate in this situation.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 102, 255);"&gt;Thanks and best regards,&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 102, 255);" &gt;Iuliana&lt;/span&gt;&lt;/blockquote&gt;&lt;span style="color: rgb(51, 102, 255);" &gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;When choosing between decision trees and clustering, remember that decision trees are themselves a clustering method. The leaves of a decision tree contain clusters of records that are similar to one another and dissimilar from records in other leaves. 
The difference between the clusters found with a decision tree and the clusters found using other methods such as K-means, agglomerative algorithms, or self-organizing maps is that decision trees are &lt;span style="font-style: italic;"&gt;directed&lt;/span&gt; while the other techniques I mentioned are &lt;span style="font-style: italic;"&gt;undirected&lt;/span&gt;. Decision trees are appropriate when there is a target variable for which all records in a cluster should have a similar value. Records in a cluster will also be similar in other ways since they are all described by the same set of rules, but the target variable drives the process. People often use undirected clustering techniques when a directed technique would be more appropriate. In your case, I think you made the correct choice because you can easily come up with a target variable such as the percentage of cancellations, alterations, and no-shows in a market.&lt;br /&gt;&lt;br /&gt;You can make a model set that has one row per market. One column, the target, will be the percentage of reservations that get changed or canceled. The other columns will contain everything you know about the market--number of flights, number of connections, ratio of business to leisure travelers, number of carriers, ratio of transit passengers to origin or destination passengers, percentage of same day bookings, same week bookings, same month bookings, and whatever else comes to mind.  
A decision tree will produce some leaves with trustworthy bookings and some with untrustworthy bookings and the paths from the root to these leaves will be descriptions of the clusters.&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-5856355693605138748?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/5856355693605138748/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2008/10/decision-trees-and-clustering.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/5856355693605138748'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/5856355693605138748'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2008/10/decision-trees-and-clustering.html' title='Decision Trees and Clustering'/><author><name>Michael J. A. 
Berry</name><uri>http://www.blogger.com/profile/06077102677195066016</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='14679622169454737233'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-3057850507112192444</id><published>2008-09-30T15:25:00.003-04:00</published><updated>2008-09-30T16:25:01.983-04:00</updated><title type='text'>A question about decision trees</title><content type='html'>&lt;div&gt;&lt;/div&gt;&lt;blockquote style="color: rgb(51, 51, 255);"&gt;&lt;div&gt;Hi,&lt;/div&gt; &lt;div&gt; &lt;/div&gt; &lt;div&gt;In your experience with decision trees, do you prefer to use a small set of core variables in order to make the model more elegant and/or understandable?  At what point do you feel a tree has grown too large and complicated?  What are the indicators that typically tell you that you need to do some pruning?&lt;/div&gt; &lt;div&gt; &lt;/div&gt; &lt;div&gt;Thank you!&lt;/div&gt; &lt;div&gt; &lt;/div&gt; &lt;div&gt;-Adam&lt;/div&gt;&lt;/blockquote&gt;&lt;div&gt;&lt;br /&gt;&lt;br /&gt;Elegance and ease of understanding may or may not be important depending on your model's intended purpose. There are certainly times when it is important to come up with a small set of simple rules. In our book &lt;a href="http://www.amazon.com/exec/obidos/ASIN/0471331236/thedataminers"&gt;Mastering Data Mining&lt;/a&gt; we give an example of a decision tree model used to produce rules that were printed on a poster next to a printing press so the press operators could avoid a particular printing defect. When a decision tree is used for customer segmentation, it is unlikely that your marketing department is equipped to handle more than a handful of segments and the segments should be described in terms of a few familiar variables. 
In both of these cases, the decision tree is meant to be descriptive.&lt;br /&gt;&lt;br /&gt;On the other hand, many (I would guess most) decision trees are not intended as descriptions; they are intended to produce scores of some kind. If the point of the model is to give each prospect a probability of response, then I see no reason to be concerned about having hundreds or even thousands of leaves so long as each one receives sufficient training records that the proportion of responders at the leaf is a statistically confident estimate of the response probability. A very nice feature of decision tree models is that one need not grok the entire tree in order to interpret any particular rule it generates. Even in a very complex tree, the path from the root to a particular leaf of interest gives a fairly simple description of records contained in that leaf. &lt;br /&gt;&lt;br /&gt;For trees used to estimate some continuous quantity, an abundance of leaves is very desirable. As estimators, regression trees have the desirable quality of never making truly unreasonable estimates (as a linear regression, for example, might do) because every estimate is an average of a large number of actual observed values. The downside is that a regression tree cannot produce any more distinct values than it has leaves. So, the more leaves the better.&lt;br /&gt;&lt;br /&gt;The need for pruning usually arises when leaves are allowed to become too small. This leads to splits that are not statistically significant.  Apply each split rule to your training set and a validation set drawn from the same population. You should see the same distribution of target classes in both training and validation data. If you do not, your model has overfit the training data. Many software tools have absurdly low default minimum leaf sizes--probably because they were developed on toy datasets such as the ubiquitous irises. 
I routinely set the minimum leaf size to something like 500 so overfitting is not an issue and pruning is unnecessary.&lt;br /&gt;&lt;br /&gt;I have focused on the number of leaves rather than the number of variables since I think that is a better measure of tree complexity. You actually asked about the number of variables. I recommend a two-stage approach. In the first stage,  do not worry about how many variables there are or which variable from each family of related variables gets picked by the model. One of the great uses for a decision tree is to pick a small subset of useful variables out of hundreds or thousands of candidates. At a later stage, look at the variables that were picked and think about what concept each of them is getting at. Then pick a set of variables that express those concepts neatly and perhaps even elegantly. You might find, for example, that the customer ID is a good predictor and appears in many rules because customer IDs were assigned serially and long-time customers behave differently than recent customers. 
Even though this makes perfect sense, it would be hard to explain so you would replace it with a more transparent indication of customer tenure such as "months since first purchase."&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-3057850507112192444?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/3057850507112192444/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2008/09/question-about-decision-trees.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/3057850507112192444'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/3057850507112192444'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2008/09/question-about-decision-trees.html' title='A question about decision trees'/><author><name>Michael J. A. 
Berry</name><uri>http://www.blogger.com/profile/06077102677195066016</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='14679622169454737233'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-7609930347080658499</id><published>2008-09-29T13:01:00.003-04:00</published><updated>2008-09-30T11:26:45.794-04:00</updated><title type='text'>Three Questions</title><content type='html'>&lt;p style="color: rgb(51, 51, 255);"&gt;&lt;span style="font-size:10;"&gt;&lt;/span&gt;&lt;/p&gt;&lt;blockquote&gt;&lt;p style="color: rgb(51, 51, 255);"&gt;&lt;span style="font-size:10;"&gt;Hi Gordon &amp;amp; Michael,&lt;/span&gt;&lt;/p&gt;    &lt;p style="color: rgb(51, 51, 255);"&gt;&lt;span style="font-size:10;"&gt; I have a few questions, hope you can help me!&lt;/span&gt;&lt;/p&gt;    &lt;p style="text-indent: -0.25in; color: rgb(51, 51, 255);"&gt;&lt;span style="font-size:10;"&gt;&lt;span&gt;1.&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size:10;"&gt;While modeling, if we don’t have a very specific client requirement, at what accuracy should we usually stop? Should we stop at 75%, or 80%? Are there standard accuracy requirements based on the industry? &lt;span&gt; &lt;/span&gt;For example, in drug research &amp;amp; development, model accuracy is required to be very high.&lt;span style="color: rgb(0, 0, 0);"&gt;&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p style="text-indent: -0.25in; color: rgb(51, 51, 255);"&gt;&lt;span style="font-size:10;"&gt;&lt;span&gt;2.&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size:10;"&gt;What is the best approach for selecting records/training dataset when the client doesn’t have info on the cut-off/valid ranges for certain numeric columns? If it’s something like Age, there is no problem. 
But when it’s client/business specific columns, it’s not that easy to figure out the valid ranges. What I usually do for such problems is – 1. &lt;span&gt;do&lt;/span&gt; some research on the web to have an understanding on all the values that the specific column can take 2. &lt;span&gt;see&lt;/span&gt; the data distribution of that column and select values based on the percentiles. &lt;span&gt;E.g&lt;/span&gt; if values from 10 to 60 (for that column) represent 80% of all the records, I exclude all records having values outside this range. Is this a good approach? Are there other alternatives?&lt;/span&gt;&lt;/p&gt;  &lt;p style="text-indent: -0.25in; color: rgb(51, 51, 255);"&gt;&lt;span style="font-size:10;"&gt;&lt;span&gt;3.&lt;span&gt;    &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size:10;"&gt;Generally, I see model accuracy (predictive/risk/churn models) getting better when I recode/transform continuous variables into categorical variables through binning/grouping. But this also results in loss of information. How do we strike a balance here? I believe the business/domain should only decide whether I should use continuous or categorical values, and not the statistics. 
Is that correct?&lt;/span&gt;&lt;/p&gt;  &lt;p style="color: rgb(51, 51, 255);"&gt;&lt;span style="font-size:10;"&gt; &lt;/span&gt;&lt;/p&gt;  &lt;p style="color: rgb(51, 51, 255);"&gt;&lt;span style="font-size:10;"&gt;Will check your blog regularly for the answers&lt;/span&gt;&lt;span style=";font-family:Wingdings;font-size:10;"  &gt;&lt;span&gt;J&lt;/span&gt;&lt;/span&gt;&lt;span style="font-size:10;"&gt;&lt;/span&gt;&lt;/p&gt;  &lt;p style="color: rgb(51, 51, 255);"&gt;&lt;span style="font-size:10;"&gt; &lt;/span&gt;&lt;/p&gt;  &lt;p style="color: rgb(51, 51, 255);"&gt;&lt;span style="font-size:10;"&gt;Thanks,&lt;/span&gt;&lt;/p&gt;  &lt;p style="color: rgb(51, 51, 255);"&gt;&lt;b&gt;&lt;span style="font-size:10;"&gt;Romakanta Irungbam&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;span&gt;These three questions have something in common: There is no single right answer since so much depends on the business context (in the first two cases) or the modeling context (in the third case).&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(153, 0, 0);"&gt;First Question&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;My statement about no right answers is especially true of the question regarding accuracy. There are contexts where a 95% error rate is perfectly acceptable. I am thinking of response modeling for direct mail. If a model is used to choose people likely to respond to an offer and only 5% of those chosen actually respond, then the error rate is 95%. How could that be acceptable? Well, if a 4% response rate is required for profitability and the response rate for a randomly selected control group is 3% then the model--despite its apparently terrible error rate--has heroically turned a money-losing campaign into a profitable one. Success is measured in dollars (or rupees or yen, but you know what I mean) not by error rates.&lt;br /&gt;&lt;br /&gt;In other contexts, much better accuracy is required. 
A model for credit-card fraud cannot afford a high false-positive rate because this will result in legitimate transactions not being approved. The result is unhappy card holders canceling their accounts. Even if your client cannot provide an explicit requirement for accuracy, you may be able to derive one from the business context.&lt;br /&gt;&lt;br /&gt;Absent any other constraints, I tend to stop trying to improve a model when I reach the point of diminishing returns. When a large effort on my part yields only a minor improvement, my time will probably be better spent on some other problem.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(153, 0, 0);"&gt;Second Question&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This question is really about when to throw out data. I see no reason to discard data just because it happens to be in the tails of the distribution. To use your example where 80% of the records have values between 10 and 60, it may be that all the best customers have a value of 75 or more. It may make sense to throw out records which contain clearly impossible values, but even in that case, I would want to understand how the impossible values were generated. If all the records with impossibly high ages were generated in the same geographic region or from the same distribution channel, throwing them out will bias your sample.&lt;br /&gt;&lt;br /&gt;Often, unusual values have some fairly simple explanation. When looking at loyalty card data for a supermarket, we found that there were a few cards that had seemingly impossibly large numbers of orders. The explanation was that when people checked out without their card and were therefore in danger of missing out on a discount, the nice checkout lady took pity on them and used her own card to get them the discount. 
Understanding that mechanism meant we could safely ignore data for those cards since they did not represent the actual shopping habits of any real customer.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(153, 0, 0);"&gt;Third Question&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Whether or not binning continuous variables is helpful or harmful will depend very much on the particular modeling algorithm you are using and on how the binning is performed. I do not agree that, as a general rule, models are improved by binning continuous variables. As you note, this process destroys information. As an extreme example, suppose you have a relationship that is completely determined by a continuous (or discrete, but with small increments) relationship--a tax of a constant amount per liter, say. The more accurately you can measure the number of liters sold, the more accurately you can estimate the tax revenue. In such a case, binning could only be harmful.&lt;br /&gt;&lt;br /&gt;When binning tends to be helpful is when the relationship between the explanatory variable and the thing you are trying to explain is more complex than the particular modeling technique you have chosen can handle. For example, you have chosen a linear model and the relationship is non-linear. I once modeled household penetration for my local newspaper, the Boston Globe. One of my explanatory variables was distance from Boston. Clearly, this should have some effect, but there is only a low level of linear correlation. This is because penetration goes up as a function of distance as you travel out to the first ring of suburbs where penetration is highest, but then goes down again as you continue to travel farther from Boston. So a linear model could not make good use of the untransformed variable, but it could make use of three variables in the form within_three, three_to_ten, and beyond_ten (assuming that 3 and 10 are the right bin boundaries).  
Of course, binning is not the only transformation that could help and linear models are not the only choice of model.&lt;/span&gt;&lt;b&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-7609930347080658499?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/7609930347080658499/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2008/09/three-questions.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/7609930347080658499'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/7609930347080658499'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2008/09/three-questions.html' title='Three Questions'/><author><name>Michael J. A. 
Berry</name><uri>http://www.blogger.com/profile/06077102677195066016</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='14679622169454737233'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-4328254584684768008</id><published>2008-09-05T11:59:00.018-04:00</published><updated>2008-09-05T17:04:41.792-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='Excel'/><title type='text'>Sorting Cells in Excel Using Formulas, Part 2</title><content type='html'>In a &lt;a href="http://www.data-miners.com/blog/2008/08/sorting-cells-in-excel-using-formulas.html"&gt;previous post&lt;/a&gt;, I described how to create a new table in Excel from an existing table where the cells in the new table are sorted by some column in the existing table.  
In addition, the new table is automatically updated when the values in the original table are modified.&lt;br /&gt;&lt;br /&gt;The previously described approach, alas, has some shortcomings:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Only one column can be used for the sort key.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The column must be numeric.&lt;/li&gt;&lt;li&gt;The column cannot have any duplicate values.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;This post generalizes on the earlier method by fixing these problems.&lt;br /&gt;&lt;br /&gt;If you are interested in this post, you may be interested in my book &lt;a style="font-weight: bold; font-style: italic;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic; font-weight: bold;font-size:130%;" &gt;&lt;span style="color: rgb(0, 102, 0);font-family:arial;" &gt;Overview of Simpler Method&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The simpler method described in the earlier post recognizes that creating a live sorted table connected to another table consists of the following steps:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Ranking the rows in the table by the column to be sorted.&lt;/li&gt;&lt;li&gt;Using the rank with the &lt;span style="font-family:courier new;"&gt;OFFSET()&lt;/span&gt; function to create the resulting table.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;For Step (1), the method uses the built-in &lt;span style="font-family:courier new;"&gt;RANK()&lt;/span&gt; function provided by Excel.  
This introduces the limitations described above, because &lt;span style="font-family:courier new;"&gt;RANK()&lt;/span&gt; only works on numeric values and produces the same value for duplicates.&lt;br /&gt;&lt;br /&gt;The key to fixing these problems is to replace the &lt;span style="font-family:courier new;"&gt;RANK()&lt;/span&gt; function with more general purpose functions.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt; &lt;span style="font-style: italic; color: rgb(0, 102, 0); font-weight: bold;font-family:arial;" &gt;Instead of RANK()&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;RANK()&lt;/span&gt; determines whether a value is the largest, second largest, third largest, or so on with respect to a list (or smallest, if we are going in the opposite order, which is determined by an optional third argument).  One way to think of what it does is that it sorts the values in the list and determines the position of the original value.&lt;br /&gt;&lt;br /&gt;An alternative but equivalent way of thinking about the calculation is that it tells us how many values are larger than (or smaller than) the given value.   This alternative definition suggests other ways of arriving at the same rankings, such as:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;=COUNTIF(data!B$2:B$55, "&gt;="&amp;amp;data!B2)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This formula can be placed alongside the original table (or anywhere else) and then copied down.  It works by counting the number of values that are less than or equal to each value.  The resulting ranking is from smallest value to largest value.  To reverse the order, simply use "&lt;=" instead.  
This solves one of the original problems, because the &lt;span style="font-family:courier new;"&gt;COUNTIF() &lt;/span&gt;function works with string data as well as numeric data.&lt;br /&gt;&lt;br /&gt;An almost equivalent formulation is to use array functions.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;{=SUM(IF(data!B$2:B$55&gt;=data!B2, 1, 0))}&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;(If you are not familiar with array functions, check out the Excel documentation or &lt;a style="font-weight: bold; font-style: italic;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;.)&lt;br /&gt;&lt;br /&gt;This is very similar to the &lt;span style="font-family:courier new;"&gt;COUNTIF()&lt;/span&gt; method, although the array functions have one advantage.  The conditional logic can be more complicated, so we can do the ranking by multiple columns at the same time.&lt;br /&gt;&lt;br /&gt;Using our own version of the rank function fixes two of the three problems.  At this point, duplicates still get the same rank value.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 102, 0);font-size:130%;" &gt;&lt;span style="font-weight: bold; font-style: italic;font-family:arial;" &gt;Handling Duplicates&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The problem with duplicate values is that all these methods assign the same ranking when two different rows have the same value.  This makes it impossible to distinguish between the two rows, so one will be included in the sorted table multiple times.&lt;br /&gt;&lt;br /&gt;The solution is to adjust the ranking by an offset.  If the highest value is repeated multiple times, then all of those rows will have a ranking equal to the number of duplicates.  
In the following little table, the second column contains the rankings as calculated by either of the above two methods, using "&lt;=" so that the ranking runs from smallest to largest (&lt;span style="font-family:courier new;"&gt;RANK()&lt;/span&gt; does not work because the first column is not numeric):&lt;br /&gt;&lt;br /&gt; &lt;table str="" style="border-collapse: collapse; width: 102pt;" border="0" cellpadding="0" cellspacing="0" width="136"&gt;&lt;col style="width: 48pt;" width="64"&gt;  &lt;col style="width: 27pt;" span="2" width="36"&gt;  &lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt; width: 48pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 27pt;" width="36"&gt;a&lt;/td&gt;   &lt;td style="width: 27pt;" num="" align="right" width="36"&gt;3&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;a&lt;/td&gt;   &lt;td num="" align="right"&gt;3&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;a&lt;/td&gt;   &lt;td num="" align="right"&gt;3&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;b&lt;/td&gt;   &lt;td num="" align="right"&gt;5&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;b&lt;/td&gt;   &lt;td num="" align="right"&gt;5&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;What we want, though, is to have distinct values in the second column:&lt;br /&gt;&lt;br /&gt; &lt;table str="" style="border-collapse: collapse; width: 102pt;" border="0" cellpadding="0" cellspacing="0" width="136"&gt;&lt;col style="width: 48pt;" width="64"&gt;  &lt;col style="width: 27pt;" span="2" width="36"&gt;  &lt;tbody&gt;&lt;tr 
style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt; width: 48pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 27pt;" width="36"&gt;a&lt;/td&gt;   &lt;td style="width: 27pt;" num="" align="right" width="36"&gt;1&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;a&lt;/td&gt;   &lt;td num="" align="right"&gt;2&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;a&lt;/td&gt;   &lt;td num="" align="right"&gt;3&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;b&lt;/td&gt;   &lt;td num="" align="right"&gt;4&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;b&lt;/td&gt;   &lt;td num="" align="right"&gt;5&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;The solution is to subtract a value from the calculated ranking.  This is the number of earlier rows whose value is equal to the value in question.  Once again, this can be accomplished with either &lt;span style="font-family:courier new;"&gt;COUNTIF()&lt;/span&gt; or array functions:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;=COUNTIF(data!B$2:B$55, "&gt;="&amp;amp;data!B2) - COUNTIF(data!B$2:B2, "="&amp;amp;data!B2) + 1&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;or&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="color: rgb(255, 255, 255);"&gt;....&lt;/span&gt;{=SUM(IF(data!B$2:B$55&gt;=data!B2, 1, 0)) - SUM(IF(data!B$2:B2=data!B2, 1, 0)) + 1}&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;These formulations consist of two parts.  
The first part calculates a ranking, where duplicates get the same value.  The second part subtracts the number of duplicates already seen in the list.  For the simple example above, the results are actually:&lt;br /&gt;&lt;br /&gt; &lt;table str="" style="border-collapse: collapse; width: 102pt;" border="0" cellpadding="0" cellspacing="0" width="136"&gt;&lt;col style="width: 48pt;" width="64"&gt;  &lt;col style="width: 27pt;" span="2" width="36"&gt;  &lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt; width: 48pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td style="width: 27pt;" width="36"&gt;a&lt;/td&gt;   &lt;td style="width: 27pt;" num="" align="right" width="36"&gt;3&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;a&lt;/td&gt;   &lt;td num="" align="right"&gt;2&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;a&lt;/td&gt;   &lt;td num="" align="right"&gt;1&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;b&lt;/td&gt;   &lt;td num="" align="right"&gt;5&lt;/td&gt;  &lt;/tr&gt;  &lt;tr style="height: 13.2pt;" height="18"&gt;   &lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;   &lt;td&gt;b&lt;/td&gt;   &lt;td num="" align="right"&gt;4&lt;/td&gt;  &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;This works just as well, although it does not preserve the original ordering.&lt;br /&gt;&lt;br /&gt;Note that these formulas are all structured so they can be copied down cells and continue working.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style=";font-family:arial;font-size:130%;"  &gt;&lt;span style="font-weight: bold; font-style: italic; color: rgb(0, 102, 
0);"&gt;What It All Looks Like Together&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This method is perhaps best explained by seeing an example.  The file &lt;a href="http://www.data-miners.com/blog/sort-in-place.xls"&gt;sort-in-place.xls&lt;/a&gt; contains random information about the fifty states (latitude, longitude, population, and capital, for example) on the "data" worksheet.  The "data-sorted" worksheet shows the state abbreviations by rank order for each of the columns.  For instance, for the size column Alaska is first, followed by Texas, California, and Montana.  For the population column, the ordering is California, Texas, New York, and Florida.  This worksheet uses the rankings on the "ranking-countif()" worksheet.&lt;br /&gt;&lt;br /&gt;The three worksheets called "ranking-&amp;lt;something&amp;gt;" illustrate the three different methods of doing the rankings -- using &lt;span style="font-family:courier new;"&gt;RANK()&lt;/span&gt;, using &lt;span style="font-family:courier new;"&gt;COUNTIF()&lt;/span&gt;, and using array functions.  Note that the &lt;span style="font-family:courier new;"&gt;RANK()&lt;/span&gt; method cannot handle text columns, so it does not work in this case.&lt;br /&gt;&lt;br /&gt;If you like, you can change the data on the "data" tab and see the rankings change on the sorted tab.  Voila!  
A sorted table connected by formulas to the original table!&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-4328254584684768008?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/4328254584684768008/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2008/09/sorting-cells-in-excel-using-formulas.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/4328254584684768008'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/4328254584684768008'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2008/09/sorting-cells-in-excel-using-formulas.html' title='Sorting Cells in Excel Using Formulas, Part 2'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-6425999227432628783</id><published>2008-08-26T16:19:00.005-04:00</published><updated>2008-08-26T16:41:38.838-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='In The News'/><category scheme='http://www.blogger.com/atom/ns#' term='MapReduce'/><category scheme='http://www.blogger.com/atom/ns#' term='database'/><title type='text'>MapReduce Functionality in Commercial Databases</title><content type='html'>If you use &lt;a href="http://www.linkedin.com"&gt;LinkedIn&lt;/a&gt;, then you have probably been impressed by their "People you may know" feature.  I know that I have.  From old friends and colleagues to an occasional person I don't necessarily want to see again, the list often contains quite familiar names.&lt;br /&gt;&lt;br /&gt;LinkedIn is basically a large graph of connections among people, enhanced with information such as company names, date of link, and so on.  We can imagine how they determine whether someone might be in someone's "People you may know" category, by using common names, companies, and even common paths (people who know each other).&lt;br /&gt;&lt;br /&gt;However, trying to imagine how they might determine this information using SQL is more challenging.  SQL provides the ability to store a graph of connections.  However, traversing the graph is rather complicated in standard SQL.  
Furthermore, much of the information that LinkedIn maintains is complicated data -- names of companies, job titles, and dates, for instance.&lt;br /&gt;&lt;br /&gt;It is not surprising, then, that they are using MapReduce to develop this information.  The surprise, though, is that their data is being stored in a relational database, which provides full transactional integrity and SQL querying capabilities.  The commercial database software that supports both is provided by a company called &lt;a href="http://www.greenplum.com"&gt;Greenplum&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Greenplum has distinguished itself from other next-generation database vendors by incorporating MapReduce into its database engine.  Basically, Greenplum developed a parallel framework for managing data, ported Postgres into this framework, and now has ported MapReduce as well.  This is a strong distinction from other database vendors that provide parallel Postgres solutions, and it is particularly well suited to the complex datatypes encountered on the web.&lt;br /&gt;&lt;br /&gt;I do want to point out that the integration of MapReduce is at the programming level.  In other words, they have not changed SQL; they have added a programming layer that allows data in the database to be readily accessed using MapReduce primitives.&lt;br /&gt;&lt;br /&gt;As I've discussed in other posts, MapReduce and SQL are complementary technologies, each with its own strengths.  MapReduce can definitely benefit from SQL functionality, since SQL has proven its ability for data storage and access.  On the other hand, MapReduce has functionality that is not present in SQL databases.&lt;br /&gt;&lt;br /&gt;Now that a database vendor has fully incorporated MapReduce into its database engine, we need to ask:  Should MapReduce remain a programming paradigm or should it be incorporated into the SQL query language?   
What additional keywords and operators and so on are needed to enhance SQL functionality to include MapReduce?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-6425999227432628783?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='related' href='http://www.greenplum.com/news/91/182/Greenplum-Brings-MapReduce-to-the-Enterprise/d,press-releases/' title='MapReduce Functionality in Commercial Databases'/><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/6425999227432628783/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2008/08/mapreduce-functionality-in-commercial.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/6425999227432628783'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/6425999227432628783'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2008/08/mapreduce-functionality-in-commercial.html' title='MapReduce Functionality in Commercial Databases'/><author><name>Gordon S. 
Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3366935554564939610.post-8443643990548651678</id><published>2008-08-11T12:26:00.009-04:00</published><updated>2008-09-13T11:56:28.068-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gordon'/><category scheme='http://www.blogger.com/atom/ns#' term='Excel'/><title type='text'>Sorting Cells in Excel Using Formulas</title><content type='html'>This post describes how to create a new table in Excel from an existing table, where the cells are sorted, and to do this using only formulas. As a result, modifying a value in the original table results in rearranging the sorted table. And, this is accomplished without macros and without using the "sort" menu option.&lt;br /&gt;&lt;br /&gt;The material in this post is generalized &lt;a href="http://www.data-miners.com/blog/2008/09/sorting-cells-in-excel-using-formulas.html"&gt;in another post&lt;/a&gt;. 
Also, if you are interested in this post, you may be interested in my book &lt;a style="font-weight: bold; font-style: italic;" href="http://www.amazon.com/exec/obidos/ASIN/0470099518/thedataminers"&gt;Data Analysis Using SQL and Excel&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Consider a table with two columns:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;table style="width: 144pt; border-collapse: collapse;" str="" border="0" cellpadding="0" cellspacing="0" width="192"&gt;&lt;colgroup&gt;&lt;col style="width: 48pt;" span="3" width="64"&gt;&lt;/colgroup&gt;&lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="width: 48pt; height: 13.2pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" style="width: 48pt;" num="" width="64"&gt;12&lt;/td&gt;&lt;td class="xl24" style="width: 48pt;" width="64"&gt;B&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" num=""&gt;14&lt;/td&gt;&lt;td class="xl24"&gt;D&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" num=""&gt;13&lt;/td&gt;&lt;td class="xl24"&gt;C&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" num=""&gt;11&lt;/td&gt;&lt;td class="xl24"&gt;A&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt;The sorted table would be automatically calculated as:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;table style="width: 144pt; border-collapse: collapse;" str="" border="0" cellpadding="0" cellspacing="0" width="192"&gt;&lt;colgroup&gt;&lt;col style="width: 48pt;" span="3" width="64"&gt;&lt;/colgroup&gt;&lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="width: 48pt; height: 13.2pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" 
style="width: 48pt;" num="" width="64"&gt;11&lt;/td&gt;&lt;td class="xl24" style="width: 48pt;" width="64"&gt;A&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" num=""&gt;12&lt;/td&gt;&lt;td class="xl24"&gt;B&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" num=""&gt;13&lt;/td&gt;&lt;td class="xl24"&gt;C&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" num=""&gt;14&lt;/td&gt;&lt;td class="xl24"&gt;D&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt;If the first value were changed to 15, then the sorted table would automatically recalculate as:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;table style="width: 144pt; border-collapse: collapse;" str="" border="0" cellpadding="0" cellspacing="0" width="192"&gt;&lt;colgroup&gt;&lt;col style="width: 48pt;" span="3" width="64"&gt;&lt;/colgroup&gt;&lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="width: 48pt; height: 13.2pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" style="width: 48pt;" num="" width="64"&gt;11&lt;/td&gt;&lt;td class="xl24" style="width: 48pt;" width="64"&gt;A&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" num=""&gt;13&lt;/td&gt;&lt;td class="xl24"&gt;C&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" num=""&gt;14&lt;/td&gt;&lt;td class="xl24"&gt;D&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" 
num=""&gt;15&lt;/td&gt;&lt;td class="xl24"&gt;B&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt;There are two typical approaches to sorting cells in Excel. The first is to select a region and to sort it using menu options. This does not work when the cells are protected, part of a pivot table, and sometimes when they are calculated. This might also be a bad idea when the data is copied from another location or loaded by accessing a database.&lt;br /&gt;&lt;br /&gt;A common alternative is to resort to writing a macro. However, Visual Basic macros are beyond the capabilities of even many experienced Excel users.&lt;br /&gt;&lt;br /&gt;The approach described here is much simpler, since it only uses formulas. I should mention that the method described in this post only works for numeric data that has no duplicates. I will remedy that in the next post, where the ideas are extended both to data with duplicates and to character data.&lt;br /&gt;&lt;br /&gt;Three Excel functions are the key to the idea: &lt;span style="font-family:courier new;"&gt;RANK()&lt;/span&gt;,&lt;span style="font-family:courier new;"&gt; MATCH()&lt;/span&gt;, and &lt;span style="font-family:courier new;"&gt;OFFSET()&lt;/span&gt;. The first function ranks numbers in a list. 
The other two allow us to use these ranks to build the sorted list.&lt;br /&gt;&lt;br /&gt;The following shows the effect of the &lt;span style="font-family:courier new;"&gt;RANK() &lt;/span&gt;function:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;table style="width: 192pt; border-collapse: collapse;" str="" border="0" cellpadding="0" cellspacing="0" width="256"&gt;&lt;colgroup&gt;&lt;col style="width: 48pt;" span="4" width="64"&gt;&lt;/colgroup&gt;&lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="width: 48pt; height: 13.2pt;" height="18" width="64"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl25" style="font-weight: bold; width: 48pt;" width="64"&gt;Value&lt;/td&gt;&lt;td class="xl25" style="font-weight: bold; width: 48pt;" width="64"&gt;Label&lt;/td&gt;&lt;td class="xl25" style="font-weight: bold; width: 48pt;" width="64"&gt;Rank&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" num=""&gt;15&lt;/td&gt;&lt;td class="xl24"&gt;B&lt;/td&gt;&lt;td num="" fmla="=RANK(B2, $B$2:$B$5, 1)" align="right"&gt;4&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" num=""&gt;14&lt;/td&gt;&lt;td class="xl24"&gt;D&lt;/td&gt;&lt;td num="" fmla="=RANK(B3, $B$2:$B$5, 1)" align="right"&gt;3&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" num=""&gt;13&lt;/td&gt;&lt;td class="xl24"&gt;C&lt;/td&gt;&lt;td num="" fmla="=RANK(B4, $B$2:$B$5, 1)" align="right"&gt;2&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" num=""&gt;11&lt;/td&gt;&lt;td class="xl24"&gt;A&lt;/td&gt;&lt;td num="" fmla="=RANK(B5, $B$2:$B$5, 1)" 
align="right"&gt;1&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt;The function itself looks like:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;table style="width: 299pt; border-collapse: collapse;" str="" border="0" cellpadding="0" cellspacing="0" width="398"&gt;&lt;colgroup&gt;&lt;col style="width: 54pt;" width="72"&gt;&lt;col style="width: 56pt;" width="75"&gt;&lt;col style="width: 46pt;" width="61"&gt;&lt;col style="width: 143pt;" width="190"&gt;&lt;/colgroup&gt;&lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="width: 54pt; height: 13.2pt;" height="18" width="72"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl25" style="font-weight: bold; width: 56pt;" width="75"&gt;Value&lt;/td&gt;&lt;td class="xl25" style="font-weight: bold; width: 46pt;" width="61"&gt;Label&lt;/td&gt;&lt;td class="xl25" style="font-weight: bold; width: 143pt;" width="190"&gt;Rank&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" num=""&gt;15&lt;/td&gt;&lt;td class="xl24"&gt;B&lt;/td&gt;&lt;td num="4" fmla="=RANK(B2, $B$2:$B$5, 1)"&gt;=RANK(C5, $C$5:$C$8, 1)&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" num=""&gt;14&lt;/td&gt;&lt;td class="xl24"&gt;D&lt;/td&gt;&lt;td num="3" fmla="=RANK(B3, $B$2:$B$5, 1)"&gt;=RANK(C6, $C$5:$C$8, 1)&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" num=""&gt;13&lt;/td&gt;&lt;td class="xl24"&gt;C&lt;/td&gt;&lt;td num="2" fmla="=RANK(B4, $B$2:$B$5, 1)"&gt;=RANK(C7, $C$5:$C$8, 1)&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" num=""&gt;11&lt;/td&gt;&lt;td class="xl24"&gt;A&lt;/td&gt;&lt;td num="1" 
fmla="=RANK(B5, $B$2:$B$5, 1)"&gt;=RANK(C8, $C$5:$C$8, 1)&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The first argument is the value to be ranked. The cell C5 contains the value "15". The second argument is the range to use for the ranking -- these are all the values in the column. The third is the direction; the value 1 means that the lowest values get the lowest rankings. In this case, "15" is the largest value of four, so its rank is 4.&lt;br /&gt;&lt;br /&gt;The following shows the formulas for the sorted table:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;table style="width: 583pt; border-collapse: collapse;" str="" border="0" cellpadding="0" cellspacing="0" width="778"&gt;&lt;colgroup&gt;&lt;col style="width: 54pt;" width="72"&gt;&lt;col style="width: 56pt;" width="75"&gt;&lt;col style="width: 232pt;" width="309"&gt;&lt;col style="width: 241pt;" width="322"&gt;&lt;/colgroup&gt;&lt;tbody&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="width: 54pt; height: 13.2pt;" height="18" width="72"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl25" style="font-weight: bold; width: 56pt;" width="75"&gt;Rank&lt;/td&gt;&lt;td class="xl25" style="font-weight: bold; width: 232pt;" width="309"&gt;Value&lt;/td&gt;&lt;td class="xl25" style="font-weight: bold; width: 241pt;" width="322"&gt;Label&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" num=""&gt;1&lt;/td&gt;&lt;td class="xl24" num="11"&gt;=OFFSET(C$4, MATCH(C11, $E$5:$E$8, 0), 0)&lt;/td&gt;&lt;td class="xl24" str="A"&gt;=OFFSET(D$4, MATCH(C11, $E$5:$E$8, 0), 0)&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" num=""&gt;2&lt;/td&gt;&lt;td class="xl24" num="13"&gt;=OFFSET(C$4, MATCH(C12, $E$5:$E$8, 0), 0)&lt;/td&gt;&lt;td class="xl24" str="C"&gt;=OFFSET(D$4, MATCH(C12, 
$E$5:$E$8, 0), 0)&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" num=""&gt;3&lt;/td&gt;&lt;td class="xl24" num="14"&gt;=OFFSET(C$4, MATCH(C13, $E$5:$E$8, 0), 0)&lt;/td&gt;&lt;td class="xl24" str="D"&gt;=OFFSET(D$4, MATCH(C13, $E$5:$E$8, 0), 0)&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 13.2pt;" height="18"&gt;&lt;td style="height: 13.2pt;" height="18"&gt;&lt;br /&gt;&lt;/td&gt;&lt;td class="xl24" num=""&gt;4&lt;/td&gt;&lt;td class="xl24" num="15"&gt;=OFFSET(C$4, MATCH(C14, $E$5:$E$8, 0), 0)&lt;/td&gt;&lt;td class="xl24" str="B"&gt;=OFFSET(D$4, MATCH(C14, $E$5:$E$8, 0), 0)&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt;The first column is the desired ranking column. These are simply the numbers starting at 1 and incrementing by 1. The function &lt;span style="font-family:courier new;"&gt;MATCH(C11, $E$5:$E$8, 0)&lt;/span&gt; looks up that rank in the column of calculated ranks. So, the value in C11 is "1". In the column of calculated ranks (E5:E8), this is the fourth value. The &lt;span style="font-family:courier new;"&gt;OFFSET()&lt;/span&gt; function then finds the fourth value in the C column for the value and the fourth value in the D column for the label.&lt;br /&gt;&lt;br /&gt;The result is that the sorted table is tied to the original table by formulas, so changing values in the original table will result in recalculating the sorted table, automatically.&lt;br /&gt;&lt;br /&gt;The overall approach is simple to describe. First, we need to calculate the ranking of each row in the original table based on the column that we want to sort. This ranking takes on the values 1 to N for the N rows in the table. Then, we create a new sorted table that has the rankings in order. This table looks up the appropriate row for each ranking using the &lt;span style="font-family:courier new;"&gt;MATCH()&lt;/span&gt; function. 
Finally, the &lt;span style="font-family:courier new;"&gt;OFFSET()&lt;/span&gt; function is used to look up the appropriate values from that row. The result is a table that is sorted with a "live" connection to another table.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3366935554564939610-8443643990548651678?l=www.data-miners.com%2Fblog'/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/8443643990548651678/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.data-miners.com/blog/2008/08/sorting-cells-in-excel-using-formulas.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/8443643990548651678'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3366935554564939610/posts/default/8443643990548651678'/><link rel='alternate' type='text/html' href='http://www.data-miners.com/blog/2008/08/sorting-cells-in-excel-using-formulas.html' title='Sorting Cells in Excel Using Formulas'/><author><name>Gordon S. Linoff</name><uri>http://www.blogger.com/profile/02341184075032239786</uri><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='08609345144953141014'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>5</thr:total></entry></feed>