Thursday, February 25, 2010

Agglomerative Variable Clustering

Lately, I've been thinking about the topic of reducing the number of variables, and how this is a lot like clustering variables (rather than clustering rows). This post is about a method that seems intuitive to me, although I haven't found any references to it. Perhaps a reader will point me to references and a formal name. The method uses Pearson correlation and principal components to agglomeratively cluster the variables.

Agglomerative clustering is the process of assigning records to clusters, starting with the records that are closest to each other. The process is repeated until all records are placed into a single cluster. The advantage of agglomerative clustering is that it creates a structure for the records, and the user can see different numbers of clusters. Divisive clustering, such as that implemented by SAS's varclus proc, produces something similar, but from the top down.

Agglomerative variable clustering works the same way. Two variables are put into the same cluster, based on their proximity. The cluster then needs to be defined in some manner, by combining information in the cluster.

The natural measure for proximity is the square of the (Pearson) correlation between the variables. This is a value between 0 and 1, where 0 means totally uncorrelated and 1 means the values are collinear. For those who are more graphically inclined, this statistic has an easy interpretation when there are two variables: it is the R-square value of the first principal component of the scatter plot.

Combining two variables into a cluster requires creating a single variable to represent the cluster. The natural variable for this is the first principal component.

My proposed clustering method repeatedly does the following:
  1. Finds the two variables with the highest correlation.
  2. Calculates the principal component for these variables and adds it into the data.
  3. Maintains the information that the two variables have been combined.
The attached SAS code (available at sas-var-hierarchical-clustering-v01.sas) does exactly this, although not in the most efficient or robust way. The bulk of the code is a macro, called buildcolumns, that appends the new cluster variables to the data set and maintains another table, called columns, which keeps track of the variables and the clusters they belong to. After I run this code, I can select different numbers of variables using the expression:

proc sql;
    select colname
    from columns
    where counter <= [some number];

These variables can then be used for predictive models or visualization purposes.

The inner loop of the code works by doing the following:
  1. Calling proc corr to calculate the correlation of all variables not already in a cluster.
  2. Transposing the correlations into a table with three columns, two for the variables and one for the correlation using proc transpose.
  3. Finding the pair of variables with the largest correlation.
  4. Calculating the first principal component for these variables.
  5. Appending this principal component to the data set.
  6. Updating the columns data set with information about the new cluster.
The data set referred to in the code comes from the companion site for Data Analysis Using SQL and Excel. The code will fail (by running in an infinite loop) if any variables have missing values or if two variables are exactly correlated.
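For readers who would rather see the shape of the loop than read the SAS macro, here is a minimal sketch in Python using numpy. It is not the attached code: the function and variable names are my own, and it skips the bookkeeping and the failure modes (missing values, exactly correlated variables) noted above.

import numpy as np

def cluster_variables(X, names):
    """Greedily merge the two most correlated variables, replacing each
    pair with its first principal component, until one cluster remains."""
    cols = {name: X[:, i].astype(float) for i, name in enumerate(names)}
    merges = []
    while len(cols) > 1:
        active = list(cols)
        # 1. Find the pair of variables with the highest squared correlation.
        best_pair, best_r2 = None, -1.0
        for i in range(len(active)):
            for j in range(i + 1, len(active)):
                r = np.corrcoef(cols[active[i]], cols[active[j]])[0, 1]
                if r * r > best_r2:
                    best_pair, best_r2 = (active[i], active[j]), r * r
        a, b = best_pair
        # 2. Compute the first principal component of the standardized pair.
        pair = np.column_stack([cols[a], cols[b]])
        pair = (pair - pair.mean(axis=0)) / pair.std(axis=0)
        _, _, vt = np.linalg.svd(pair, full_matrices=False)
        pc1 = pair @ vt[0]
        # 3. Replace the two variables with the new cluster variable.
        new_name = f"pc_{a}_{b}"
        del cols[a], cols[b]
        cols[new_name] = pc1
        merges.append((a, b, new_name, best_r2))
    return merges

Each entry of the returned merge history plays the same role as a row in the columns table: it records which variables were combined and at what squared correlation, so the hierarchy can be cut at any number of clusters.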

Wednesday, February 10, 2010

Why there is always a J window open on my desktop

People often ask me what tools I use for data analysis. My usual answer is SQL, and I explain that just as Willie Sutton robbed banks because "that's where the money is," I use SQL because that is where the data is. But sometimes it gets so frustrating trying to figure out how to get SQL to do something as seemingly straightforward as a running total or running maximum, that I let the data escape from the confines of its relational tables and into J where it can be free. I assume that most readers have never heard of J, so I'll give you a little taste of it here. It's a bit like R, only a lot more general and more powerful. It's even more like APL, of which it is a direct descendant, but those of us who remember APL are getting pretty old these days.

The question that sent me to J this time came from a client who had just started collecting sales data from a web site and wanted to know how long they would have to wait before being able to make some statistically valid conclusions about whether spending differences between two groups who had received different marketing treatments were statistically significant. One thing I wanted to look at was how much various measures such as average order size and total revenue fluctuate from day to day, and how many days it takes before the overall measures settle down near their long-term means. For example, I'd like to calculate the average order size with just one day's worth of purchases, then two days' worth, then three days' worth, and so on. This sort of operation, where a function is applied to successively longer and longer prefixes, is called a scan.
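Before getting to the J, here is the idea of a scan sketched in plain Python; the daily order values are made up purely for illustration:

# A scan applies a function to successively longer prefixes of the data.
orders = [120.0, 80.0, 95.0, 210.0, 60.0]   # hypothetical daily order sizes

running_mean = [sum(orders[:i]) / i for i in range(1, len(orders) + 1)]
print(running_mean)   # [120.0, 100.0, 98.33..., 126.25, 113.0]

Each element of the result is the mean of everything seen so far, which is exactly the "how long until the measure settles down" question. J makes this a one-liner.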

A warning: J looks really weird when you first see it. One reason is that many things that are treated as a single token are spelled with two characters. I remember when I first saw Dutch, there were all these impossible looking words with "ij" in them--ijs and rijs, for example. Well, it turns out that in Dutch "ij" is treated like a single letter that makes a sound a bit like the English "eye." So ijs is ice and rijs is rice and the Rijn is a famous big river. In J, the second character of these two-character symbols is usually a '.' or a ':'.

=: is assignment. <. is lesser of. >. is greater of. And so on. You should also know that anything following NB. on a line is comment text.

   x=: ? 100#10                        NB. One hundred random integers between 0 and 9

   +/ x                                      NB. Like putting a + between every pair of x--the sum of x.
424
   <. / x                                    NB. Smallest x
0
   >. / x                                    NB. Largest x
9
   mean x
4.24
   ~. x                                      NB. Nub of x. (Distinct elements.)
3 0 1 4 6 2 8 7 5 9
   # ~. x                                    NB. Number of distinct elements.
10
    x # /. x                                  NB. How many of each distinct element. ( /. is like SQL GROUP BY.)
6 10 15 13 15 9 9 12 6 5
   +/ \ x                                      NB. Running total of x.
3 3 4 8 12 13 19 23 25 33 41 48 54 56 61 67 69 72 73 74 75 . . .
   >./ \ x                                     NB. Running maximum of x.
3 3 3 4 4 4 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 . . .
   mean \ x                                  NB. Running mean of x.
3 1.5 1.33333 2 2.4 2.16667 2.71429 2.875 2.77778 3.3 3.72727 . . .
   plot mean \ x                            NB. Plot running mean of x.
[plot of the running mean of x]

   plot var \ x                               NB. Plot running variance of x.
[plot of the running variance of x]

J is available for free from J software. Other than as a fan, I have no relationship with that organization.

Creating DDL For An Entire Database In SQL Server 2008

Recently, I started a new project which has a database component. I looked around for some visual data modeling tools, and I settled on just using the diagrams capability of SQL Server. Since the client is using SQL Server, it was simple to download SQL Server Express and get started using their diagramming tool.

After creating a bunch of tables, I learned that SQL Server Database Diagrams do not produce the Data Definition Language (DDL) to create the database. Instead, the tables are created in sync with the diagram. Furthermore, SQL Server does not have a command that creates the DDL for an entire database. Right clicking on two dozen tables is cumbersome. But even worse, it would not provide complete DDL, since the table DDL does not include index definitions.

I have seen some debate on the web about the merits of graphical tools versus text DDL. Each has its advantages, and, personally, I believe that a decent database tool should allow users to switch between the two. The graphical environment lets me see the tables and their relationships. The text allows me to make global changes, such as:
  • Changing all the SMALLDATETIME data types to DATE when I go to a commercial version of SQL Server. The Express version does not support DATE, alas.
  • Adding auditing columns -- such as user, creation date, and update date -- to almost all tables.
  • Adding table-specific comments.
Doing these types of actions in a point-and-click environment is cumbersome, inefficient, and prone to error. At the same time, the GUI environment is great for designing the tables and visualizing their relationships.

So, I searched on the web for a program that would allow me to create the DDL for an entire SQL Server database. Because I did not find any, I decided that I had to write something myself. The attached file, script-all-tables.sql, contains my script.

This script uses SQL to generate SQL code -- a trick that I talk about in my book, Data Analysis Using SQL and Excel. The script generates code for the following:
  1. Dropping all tables in the database, if they exist.
  2. Creating new versions of the tables, taking into account primary keys, data types, and identity columns.
  3. Creating foreign key constraints on the table.
  4. Creating indexes on the table.
This is a very common subset of DDL used for databases. And, importantly, it seems to cover almost everything you can do using Database Diagrams. However, the list of what it is missing for fully re-creating any database is very, very long, ranging from user-defined types, functions, and procedures, to the storage architecture, replication, and triggers.

The script uses the views in the sys schema rather than Information_Schema, simply because I found it easier to find the information that I needed to put the SQL together.
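To give a flavor of the approach without reproducing the whole script, here is a rough Python sketch of the same idea -- reading the sys catalog views and printing DROP/CREATE statements. This is an illustration, not the attached T-SQL: the connection string is a placeholder, and the generated DDL ignores column lengths, identity columns, keys, and indexes, all of which the real script handles.

import pyodbc

conn = pyodbc.connect("DRIVER={SQL Server};SERVER=localhost;"
                      "DATABASE=mydb;Trusted_Connection=yes")   # placeholder
cur = conn.cursor()

# Pull table, column, and type information from the sys catalog views.
cur.execute("""
    select t.name, c.name, ty.name, c.is_nullable
    from sys.tables t
    join sys.columns c on c.object_id = t.object_id
    join sys.types ty on ty.user_type_id = c.user_type_id
    order by t.name, c.column_id
""")

tables = {}
for table_name, column_name, type_name, is_nullable in cur:
    nullability = "NULL" if is_nullable else "NOT NULL"
    tables.setdefault(table_name, []).append(
        f"    {column_name} {type_name} {nullability}")

for name, columns in tables.items():
    print(f"IF OBJECT_ID('{name}') IS NOT NULL DROP TABLE {name};")
    print(f"CREATE TABLE {name} (\n" + ",\n".join(columns) + "\n);")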

Tuesday, February 2, 2010

Simpson's Paradox and Marketing

A reader asked the following question:

Hi Michael/Gordon,
In campaign measurements, it's possible to get a larger lift at the overall level compared to all the individual decile level lifts or vice versa, because of the differences in sample size across the deciles, and across Test & Control.
According to wikipedia, it's known as Simpson's paradox (or the Yule-Simpson effect) and is explained as an apparent paradox in which the successes in different groups seem to be reversed when the groups are combined.
In such scenarios, how do you calculate the overall lift? Which methods are commonly used in the industry?
Thanks,
Datalligence
http://datalligence.blogspot.com/

Simpson's Paradox is an interesting phenomenon, where results about subgroups of a population do not generalize to the overall population. I think the simplest version that I've heard is an old joke . . . "I heard you moved from Minnesota to Iowa, raising the IQ of both states."

How could this happen? For the joke to work, the average IQ in Minnesota must be higher than the average IQ in Iowa, and the person who moves must have an IQ between these two values. For example, if Minnesota's average is 105 and Iowa's is 95, then someone with an IQ of 100 who moves pulls both averages up. Voila, you get the paradox that the averages in both states go up, even though they are based on exactly the same overall population.

I didn't realize that this paradox has a name (or, if I did, then I had forgotten). Wikipedia has a very good article on Simpson's Paradox, which includes real world examples from baseball, medical studies, and an interesting discussion of a gender discrimination lawsuit at Berkeley. In the gender discrimination lawsuit, women were accepted at a much lower rate than men overall. However, department by department, women were typically accepted at a higher rate than men. The difference is that women applied to more competitive departments than men. These departments have lower rates of acceptance, lowering the overall rate for women.

Simpson's Paradox arises when we are taking weighted averages of evidence from different groups. Different weightings can produce very different, even counter-intuitive results. The results become much less paradoxical when we see the actual counts rather than just the percentages.
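Here is a small made-up example of how this plays out with campaign counts. The numbers are invented purely to show the reversal: the test group wins in every decile yet loses overall, because most of the test mail went to the weak decile.

# (responders, pieces mailed) for each decile, test and control
test    = {"decile_1": (50, 1000),   "decile_2": (110, 10000)}
control = {"decile_1": (450, 10000), "decile_2": (10, 1000)}

def overall_rate(groups):
    responders = sum(r for r, n in groups.values())
    mailed = sum(n for r, n in groups.values())
    return responders / mailed

for decile in test:
    t_rate = test[decile][0] / test[decile][1]
    c_rate = control[decile][0] / control[decile][1]
    print(decile, "lift:", round(t_rate / c_rate, 2))    # 1.11 and 1.1

print("overall lift:", round(overall_rate(test) / overall_rate(control), 2))   # 0.35

Once the counts are visible, the "paradox" is just arithmetic: the overall numbers weight the deciles very differently in the test and control groups.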

The specific question is how to relate this paradox to lift, and understanding marketing campaigns. Assume there is a marketing campaign, where one group receives a particular treatment and another group does not. The ratio of performance between these two groups is the lift of the marketing campaign.

To avoid Simpson's paradox, you need to ensure that the groups are as similar as possible, except for what's being tested. If the test is for the marketing message, there is no problem: both groups can be pulled from the same population. If, instead, the test is for the marketing group itself (say, high-value customers), then Simpson's Paradox is not an issue, since we care about how the group performs rather than how the entire population performs.

As a final comment, I could imagine finding marketing results where Simpson's Paradox has surfaced, because the original groups were not well chosen. Simpson's Paradox arises because the sizes of the test groups are not proportional to their sizes in the overall population. In this case, I would be tempted to weight the results from each group based on the expected size in the overall population to calculate the overall response and lift.

Tuesday, January 19, 2010

Oracle load scripts now available for Data Analysis Using SQL and Excel

Classes started this week for the spring semester at Boston College where I am teaching a class on marketing analytics to MBA students at the Carroll School of Management.  The class makes heavy use of Gordon's book, Data Analysis Using SQL and Excel and the data that accompanies it. Since the local database is Oracle, I have at long last added Oracle load scripts to the book's companion page.

Due to laziness, my method of creating the Oracle script was to use the existing MySQL script and edit bits that didn't work in Oracle.  As it happens, the MySQL scripts worked pretty much as-is to load the tab-delimited data into Oracle tables using Oracle's sqlldr utility. One case that did not work taught me something about the danger of mixing tab-delimited data with input formats in sqlldr.  Even though it has nothing to do with data mining, as a public service, that will be the topic of my next post.

Preview: Something that works perfectly well when your field delimiter is comma, fails mysteriously when it is tab.

Saturday, January 9, 2010

Hadoop and Parallel Dataflow Programming

Over the past three months, I have been teaching myself enough Hadoop to get comfortable with using the environment for analytic purposes.

There has been a lot of commentary about Hadoop/MapReduce versus relational databases (such as the articles referenced in my previous post on the subject). I actually think this discussion is misplaced because comparing open-source software with commercial software aligns people on "religious" grounds. Some people will like anything that is open-source. Some people will attack anything that is open-source (especially people who work for commercial software vendors). And, the merits of real differences get lost. Both Hadoop and relational databases are powerful systems for analyzing data, and each has its own distinct set of advantages and disadvantages.

Instead, I think that Hadoop should be compared to a parallel dataflow style of programming. What is a dataflow style of programming? It is a style where we watch the data flow through different operations, forking and combining along the way, to achieve the desired goal. Not only is a dataflow a good way to understand relational databases (which is why I introduce it in Chapter 1 of Data Analysis Using SQL and Excel), but the underlying engines that run SQL queries are dataflow engines.
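As a toy illustration of what I mean by a dataflow -- this is plain Python, not any particular engine -- records flow from a source node through a filter node into an aggregation node, and the program is just the wiring between them:

orders = [
    {"customer": "a", "amount": 120.0},
    {"customer": "b", "amount": 80.0},
    {"customer": "a", "amount": 95.0},
]

def read(records):                   # source node
    for r in records:
        yield r

def filter_large(stream, cutoff):    # filter node
    for r in stream:
        if r["amount"] >= cutoff:
            yield r

def sum_by_customer(stream):         # aggregation node
    totals = {}
    for r in stream:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

print(sum_by_customer(filter_large(read(orders), 90.0)))   # {'a': 215.0}

A SQL engine builds essentially this kind of graph from a query; a parallel dataflow engine runs many copies of each node on different slices of the data.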

Parallel dataflows extend dataflow processing to grid computing. To my knowledge, the first commercial tool that implements parallel dataflows was developed by Ab Initio. This company was a spin-off from a bleeding-edge parallel supercomputer vendor called Thinking Machines that went bankrupt in 1994. As a matter of full disclosure: Ab Initio was actually formed from the group that I worked for at Thinking Machines. Although they are very, very, very resistant to sharing information about their technology, I am rather familiar with it. I believe that the only publicly available information about them (including screen shots) is published in our book Mastering Data Mining: The Art and Science of Customer Relationship Management.

I am confident that Apache has at least one dataflow project, since when I google "dataflow apache" I get a pointer to the Dapper project. My wish, however, is that Hadoop were the parallel dataflow project.

Much of what Hadoop does goes unheralded by the typical MapReduce user. On a massively parallel system, Hadoop keeps track of the different parts of an HDFS file and, when the file is being used for processing, Hadoop does its darndest to keep the processing local to each file part being processed. This is great, since data locality is key to achieving good performance.

Hadoop also keeps track of which processors and disk systems are working. When there is a failure, Hadoop tries again, insulating the user from sporadic hardware faults.

Hadoop also does a pretty good job of shuffling data around between the map and reduce operations. The shuffling method -- sort, send, and sort again -- may not be the most efficient, but it is quite general.

Alas, there are several things that Hadoop does not do, at least when accessed through the MapReduce interface. Supporting these features would allow it to move beyond the MapReduce paradigm, giving it the power to support more general parallel dataflow constructs.

The first thing that bothers me about Hadoop is that I cannot easily take a text file and just copy it with the Map/Reduce primitives. Copying a file seems like something that should be easy. The problem is that a key gets generated during the map processing. The original data gets output with a key prepended, unless I do a lot of work to parse out the first field and use it as a key.
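To make the complaint concrete, here is the situation in miniature, sketched as plain Python rather than the Hadoop API. Even an "identity" map must emit key/value pairs, so the copy comes out decorated with keys (here byte offsets, which is what Hadoop's default text input uses):

lines = ["first line", "second line", "third line"]

def identity_map(offset, line):
    yield (offset, line)             # the key tags along whether I want it or not

offset = 0
for line in lines:
    for key, value in identity_map(offset, line):
        print(f"{key}\t{value}")     # "key <tab> value" -- not a copy of the file
    offset += len(line) + 1          # +1 for the newline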

Could the context.write() function be overloaded with a version that does not output a key? Perhaps this would only be possible in the reduce phase, since I understand the importance of the key for going from map to reduce.

A performance issue with Hadoop is the shuffle phase between the map and the reduce. As I mentioned earlier, the sort-send-sort process is quite general. Alas, though, it requires a lot of work. An alternative that often works well is simply hashing. To maintain the semantics of map-reduce, I think this would be hash-send-combine or hash-send-sort. The beauty of using hashing is that the data can be sent to its destination while the map is still processing it. This allows concurrent use of the processing and network during this operation.
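Here is a toy sketch of the hash-send-combine idea in plain Python; the reducer count and the data are made up, and the point is only that a record can be routed the moment the map produces it, with no global sort:

NUM_REDUCERS = 4
reducer_buffers = [[] for _ in range(NUM_REDUCERS)]

def emit(key, value):
    # Route the record immediately by hashing its key.
    reducer_buffers[hash(key) % NUM_REDUCERS].append((key, value))

for word in "the quick brown fox jumps over the lazy dog".split():
    emit(word, 1)

# Each reducer then combines (or sorts) only its own partition.
for i, buffer in enumerate(reducer_buffers):
    counts = {}
    for key, value in buffer:
        counts[key] = counts.get(key, 0) + value
    print(i, counts)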

And, speaking of performance, why does the key have to go before the data? Why can't I just point to a sequence of bytes and use that for the key? This would enable a programming style that doesn't spend so much time parsing keys and duplicating information between values and keys.

Perhaps the most frustrating aspect of Hadoop is the MapReduce framework itself. The current version allows processing like (M+)(R)(M*). What this notation means is that the processing starts with one or more map jobs, goes to a reduce, and continues with zero or more map jobs.

THIS IS NOT GENERAL ENOUGH! I would like to have an arbitrary number of maps and reduces connected however I like. So, one map could feed two different reduces, each having different keys. At the same time, one of the reduces could feed another reduce without having to go through an intermediate map phase.

This would be a big step toward parallel dataflow programming, since Map and Reduce are two very powerful primitives for this purpose.

There are some other primitives that might be useful. One would be broadcast. This would take the output from one processing node during one phase and send it to all the other nodes (in the next phase). Let's just say that using broadcast, it would be much easier to send variables around for processing. No more defining weird variables using "set" in the main program, and then parsing them in setup() functions. No more setting up temporary storage space, shared by all the processors. No more using HDFS to store small serial files, local to only one node. Just send data through a broadcast, and it goes everywhere. (If the broadcast is running on more than one node, then the results would be concatenated together, everywhere.)

And, if I had a broadcast, then my two-pass row number code (here) would only require one pass.

I think Hadoop already supports feeding multiple different input files into one reduce operator. This is quite powerful, and a much superior way of handling join processing.

It would also be nice to have a final sort operator. In the real world, people often do want sorted results.

In conclusion, parallel dataflows are a very powerful, expressive, and efficient way of implementing complex data processing tasks. Relational databases use dataflow engines for their processing. With non-procedural languages such as SQL, the power of dataflows is hidden from the user -- and some relatively simple dataflow constructs can be quite difficult to express in SQL.

Hadoop is a powerful system that emulates parallel dataflow programming. Any step in a dataflow can be implemented using a MapReduce pass -- but this requires reading, writing, sorting, and sending the data multiple times. With a few more features, Hadoop could efficiently implement parallel dataflows. I feel this would be a big boost to both performance and utility, and it would leverage the power already provided by the Hadoop framework.

Tuesday, January 5, 2010

MapReduce versus Relational Databases?

The current issue of Communications of the ACM has articles on MapReduce and relational databases. One, MapReduce: A Flexible Data Processing Tool, explains the utility of MapReduce and is written by two Google Fellows -- appropriate authors, since Google invented the parallel MapReduce paradigm.

The second article, MapReduce and Parallel DBMSs: Friend or Foe, is written by a team of authors, with Michael Stonebraker listed first. I am uncomfortable with this article because it purports to show the superiority of a particular database system, Vertica, without mentioning -- anywhere -- that Michael Stonebraker is listed as CTO and Co-Founder on Vertica's web site. For this reason, I believe the article should be subject to much more scrutiny.

Before starting, let me state that I personally have no major relationships with any of the database vendors or with companies in the Hadoop/MapReduce space. I am an advocate of using relational databases for data analysis and have written a book called Data Analysis Using SQL and Excel. And, over the past three months, I have been learning Hadoop and MapReduce, as attested to by numerous blog postings on the subject. Perhaps because I am a graduate of MIT ('85), I am upset that Michael Stonebraker uses his MIT affiliation for this article, without mentioning his Vertica affiliation.

The first thing I notice about the article is the number of references to Vertica. In the main text, I count nine references to Vertica, as compared to thirteen mentions of other databases:
  • Aster (twice)
  • DataAllegro (once)
  • DB2 (twice)
  • Greenplum (twice)
  • Netezza (once)
  • ParAccel (once)
  • PostgreSQL (once)
  • SQL Server (once)
  • Teradata (once)
The paper describes a study that compares Vertica, another database, and Hadoop on various tasks. The paper never explains how these databases were chosen for this purpose. Configuration issues for the other database and for Hadoop are mentioned; judging by the absence of any mention of problems, one assumes the configuration and installation of Vertica was easy and smooth. I have not (yet) read the paper cited, which describes the work in more detail.

Also, the paper never describes costs for the different systems, which is a primary driver of MapReduce adoption. The software is free and runs on cheap clusters of computers, rather than on expensive servers and hardware. For a given amount of money, MapReduce may provide a much faster solution, since it can support much larger hardware environments.

The paper never describes issues in the loading of data. I assume this is a significant cost for the databases. Loading the data for Hadoop is much simpler . . . since it just reads text files, which is a common format.

From what I can gather, the database systems were optimized specifically for the tasks at hand, although this is not explicitly mentioned anywhere. For instance, the second task is a GROUP BY, and I suspect that the data is hash partitioned by the GROUP BY clause.

There are a few statements that I basically disagree with.

"Lastly, the reshuffle that occurs between the Map and Reduce tasks in MR is equivalent to a GROUP BY operation in SQL." The issue here at first seems like a technicality. In a relational database, an input row can only into one group. MR can output multiple records in the map stage, so a single row can go into multiple "groups". This functionality is important for the word count example, which is the canonical MapReduce example. I find it interesting that this example is not included in the benchmark.

"Given this, parallel DBMSs provide the same computing model as MR, with the added benefit of using a declarative language (SQL)." This is not true in several respects. First, MapReduce does have associated projects for supporting declarative languages. Second, in order for SQL to support the level of functionality that the authors claim, they need to use user defined functions. Is that syntax declarative?

More importantly, though, the computing model really is not exactly the same. Well, with SQL extensions such as GROUPING SETS and window functions, the functionality does come close. But consider the ways that you can add a row number to data (assuming you have no built-in row number function) using MapReduce versus traditional SQL. Using MapReduce, you can follow the two-phase program that I described in an earlier posting. With traditional SQL, you have to do a non-equi self-join. MapReduce has a much richer set of built-in functions and capabilities, simply because it uses Java, an established programming language with many libraries.

On the other hand, MapReduce does not have a concept of "null" built-in (although users can define their own data types and semantics). And, MapReduce handles non-equijoins poorly, because the key is used to direct both tables to the same node. In effect, you have to limit the MapReduce job to one node. SQL can still parallelize such queries.

"[MapReduce] still requires user code to parse the value portion of the record if it contains multiple attributes." Well, parse is the wrong term, since a Writable class supports binary representations of data types. I describe how to create such types here.

I don't actually feel qualified to comment on many of the operational aspects of optimizing Hadoop code. I do note that the authors do not explain the main benefit of Vertica, which is its support of column partitioning. Each column is stored separately, which makes it possible to apply very strong compression algorithms to the data. In many cases, the Vertica data will fit in memory. This is a huge performance boost (and one that another vendor, ParAccel, takes advantage of).

In the end, the benchmark may be comparing the in-memory performance of a database to general performance for MapReduce. The benchmark may not be including the ETL time for loading the data, partitioning data, and building indexes. The benchmark may not have allocated optimal numbers of map and reduce jobs for the purpose. And, it is possible that the benchmark is unbiased and relational databases really are better.

A paper that leaves out the affiliations between its authors and the vendors used for a benchmark is only going to invite suspicion.